1_pdfliberation_hackathon_activity.md

Start Here

1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity

PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet

2_who.md

Who

Who is working together?

Rostislav Tsiomenko [OpenGov Foundation]

Marjorie Roswell

Travis Korte

Nick Lyell

4_pdfs.md

PDF Samples

How would you categorize the PDFs?

Sample documents

| PDF URL | Document Title |
| --- | --- |
| http://www.domain.org/docs/docurl.pdf | Report of Economic Data 2012 |

Content category

Number of pages

1 page
2 to 9 pages
10+ pages
100+ pages

Other observations

Collection includes PDFs made from scanned documents
PDFs include hand-written text

PDF Generation

Human authored
Machine generated
God only knows

6_tools.md

Tool

What tool(s) are you using to extract the data?

| Tool | How we used it |
| --- | --- |
| ABBYY Cloud SDK | Loaded up PDFs and converted to txt, rtf, or XML |

Notes

ABBYY is commercial :(

7_how.md

How

How did you extract the desired data that produced the best results?

https://github.com/rtsio/financial_disclosure_scraping/tree/master/ABBYY-working-example/README.txt

ABBYY provides the best results for tabular data. Tesseract and similar tools unfortunately do not come close (note that this applies to OCR of scanned PDFs, not text PDFs).

Improvements

What would have to be changed/added to the tool or process to achieve success?

Account for the general document structure and the different Schedule tables (III, IV, V, etc.). Many more improvements remain; this is only the result of 5 hours of research into the PDFs.
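One possible first step toward handling the different Schedule tables is to split the OCR text output into one chunk per Schedule before parsing any rows. The sketch below is not taken from the linked repository. It assumes each table is introduced by a line beginning with "SCHEDULE" and a Roman numeral, which may not hold for every form or OCR result, and the function name `split_by_schedule` is made up for illustration.

```python
import re

# Assumption: each table starts with a line like "SCHEDULE III ...".
# Real OCR output may garble these headings, so this pattern is only a starting point.
SCHEDULE_HEADING = re.compile(r"^\s*SCHEDULE\s+([IVX]+)\b", re.IGNORECASE | re.MULTILINE)


def split_by_schedule(ocr_text):
    """Map each Schedule numeral (e.g. "III") to the text that follows its heading."""
    sections = {}
    matches = list(SCHEDULE_HEADING.finditer(ocr_text))
    for i, match in enumerate(matches):
        start = match.end()
        # A section runs until the next Schedule heading or the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(ocr_text)
        numeral = match.group(1).upper()
        # If a heading repeats (for example a table continued on a later page),
        # append the new text to what was already collected.
        combined = sections.get(numeral, "") + "\n" + ocr_text[start:end]
        sections[numeral] = combined.strip()
    return sections
```

Text that appears before the first heading (such as the filer's name) is ignored here. Each returned chunk could then be passed to a table-specific parser.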

9_code.md

Code

Please list the code, tips, and how-tos for your processing pipeline.

https://github.com/rtsio/financial_disclosure_scraping/tree/master/ABBYY-working-example
