1. Log into GitHub
2. Fork this Gist
3. Edit your version to share your team's activity
PDF Liberation Hackpad
IRC: https://webchat.freenode.net/ Channel: #sunlightlabs
GitHub Markdown-Cheatsheet
Which challenge are you working on?
How would you categorize the PDFs?
| PDF URL | Document Title |
|---|---|
| http://www.domain.org/docs/docurl.pdf | Report of Economic Data 2012 |
How did you extract the desired data that produced the best results?
https://github.com/rtsio/financial_disclosure_scraping/tree/master/ABBYY-working-example/README.txt
ABBYY gives the best results for tabular data. Tesseract and similar tools unfortunately do not come close. (Note that these are scanned PDFs that need OCR, not text PDFs.)
What would have to be changed/added to the tool or process to achieve success?
The process needs to account for the general structure of the forms and for the different Schedule tables (III, IV, V, etc.). Many more improvements remain; this is only the result of 5 hours of research into the PDFs.
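
As a starting point for handling the different Schedule tables, here is a minimal sketch that groups OCR text output by the Schedule heading that precedes each line. It is not part of the linked repository. It assumes the OCR output is plain text and that each table starts with a line beginning with "Schedule" followed by a Roman numeral. The function name `split_by_schedule` and the sample rows are made up for illustration.

```python
import re

# Assumption: each table starts with a line like "SCHEDULE III".
# Real OCR output may garble headings and need a looser pattern.
SCHEDULE_HEADING = re.compile(r"^\s*SCHEDULE\s+([IVX]+)\b", re.IGNORECASE)


def split_by_schedule(ocr_text):
    """Group non-empty OCR lines under the Schedule heading above them."""
    sections = {}
    current = None
    for line in ocr_text.splitlines():
        match = SCHEDULE_HEADING.match(line)
        if match:
            # Start a new section keyed by the Roman numeral, e.g. "III".
            current = match.group(1).upper()
            sections.setdefault(current, [])
            continue
        # Lines before the first heading are ignored.
        if current is not None and line.strip():
            sections[current].append(line.strip())
    return sections


# Made-up sample text, not taken from a real disclosure form.
sample = """\
cover page text
SCHEDULE III
row one
row two
SCHEDULE IV
row three
"""

sections = split_by_schedule(sample)
for numeral, rows in sections.items():
    print(numeral, rows)
```

Each section's rows could then go to a parser written for that Schedule's column layout, which keeps the per-table logic separate from the heading detection.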
Please list code, tips and how-tos for your processing pipeline.
https://github.com/rtsio/financial_disclosure_scraping/tree/master/ABBYY-working-example