@CanOfBees
Last active October 5, 2021 18:47
Checking Tika-generated XHTML output from PDFs
(:
: for each xh:html document in the database, check for exactly 1 occurrence of $str1
: and exactly 2 occurrences of $str2, returning the db:path (or name) of each document
: where both conditions are true
:)
declare namespace xh = "http://www.w3.org/1999/xhtml";
(: search strings are lower-case because each text node is lower-cased before matching :)
let $str1 := "(original signatures are on file with official student records"
let $str2 := "to the graduate council:"
for $html in //xh:html
(: count the matching text nodes inside paragraphs of each page div :)
let $sc1 := count($html//xh:div[@class = 'page']//xh:p/text()[contains(lower-case(.), $str1)])
let $sc2 := count($html//xh:div[@class = 'page']//xh:p/text()[contains(lower-case(.), $str2)])
where $sc1 = 1 and $sc2 = 2
return db:path($html)
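
To sanity-check the counting logic without a populated database, the same predicate can be run against an inline sample. This is only a sketch: the `local:count-hits` helper and the sample document are hypothetical and not part of the original query, and the sample merely assumes that the converted XHTML wraps each page in `div class="page"`, as the query above expects.

```xquery
declare namespace xh = "http://www.w3.org/1999/xhtml";

(: hypothetical helper: counts text nodes in page paragraphs that contain $needle,
   using the same path and lower-case comparison as the query above :)
declare function local:count-hits(
  $html   as element(xh:html),
  $needle as xs:string
) as xs:integer {
  count($html//xh:div[@class = 'page']//xh:p/text()[contains(lower-case(.), $needle)])
};

(: made-up sample document, for illustration only :)
let $sample :=
  <html xmlns="http://www.w3.org/1999/xhtml">
    <body>
      <div class="page">
        <p>To the Graduate Council:</p>
        <p>(Original signatures are on file with official student records.)</p>
      </div>
      <div class="page">
        <p>To the Graduate Council:</p>
      </div>
    </body>
  </html>
return (
  local:count-hits($sample, "to the graduate council:"),
  local:count-hits($sample, "(original signatures are on file with official student records")
)
```

For this sample the two counts should come back as 2 and 1, which is the combination the `where` clause accepts. Note that the match is per text node, so a phrase that is split across two `p` elements or interrupted by inline markup would not be counted.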
@CanOfBees
Author

Not as fast as I would like, but hopefully good enough for govt. work.
