Last active
October 5, 2021 18:47
-
-
Save CanOfBees/5472f8520400718122bc14aeeaf551a2 to your computer and use it in GitHub Desktop.
checking tika-generated xhtml output from PDFs
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
(: | |
: for each xh:html document in the database, check for 1 occurence of $str1 | |
: and 2 occurences of $str2, returning the db:path (or name) of the document where | |
: true | |
:) | |
declare namespace xh = "http://www.w3.org/1999/xhtml"; | |
for $html in //xh:html | |
let $str1 := "(original signatures are on file with official student records" | |
let $str2 := "to the graduate council:" | |
let $sc1 := count($html//xh:div[@class='page']//xh:p/text()[contains(lower-case(.),$str1)]) | |
let $sc2 := count($html//xh:div[@class='page']//xh:p/text()[contains(lower-case(.),$str2)]) | |
where $sc1 = 1 and $sc2 = 2 | |
return db:path($html) |
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
not as fast as i would like, but hopefully good enough for govt. work.