PDFs are a hassle for those of us who have to work with them to get at their data. When I was at the Open Data NJ summit last month, the reporters and journalists went on and on about how working with PDFs is the worst thing in the world, and they’re right.
Please check out the follow-up post made on 11/10/2019 for updated information.
Fortunately, there are a few data mining techniques out there that you can use to make this process a lot easier, especially if you are left with only a few options.
Digging for a solution to convert a PDF made up entirely of images into text, I came across pypdfocr. It has a lot of dependencies that you must install through brew and pip. Once you’ve met all of the requirements, you can cd into your folder of choice and run the following command:
pypdfocr filename.pdf
It takes a little while, but this will split the PDF into a PNG file for each page and then generate an additional HTML page for each of those. In the end, all of these intermediate files get cleaned up and you’re left with a properly OCR’d PDF (named something like filename_ocr.pdf).
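If you have a whole folder of scanned PDFs to get through, it’s easy enough to drive pypdfocr from a short Python script. This is just a sketch, assuming pypdfocr is installed and on your PATH; adjust the file pattern to match your own folder:

# Rough sketch: OCR every PDF in the current directory with pypdfocr.
# Assumes pypdfocr is installed and on your PATH; tweak the glob pattern
# to match your own files.
import glob
import subprocess

for pdf in glob.glob("*.pdf"):
    # Skip files that already look like pypdfocr output.
    if pdf.endswith("_ocr.pdf"):
        continue
    subprocess.call(["pypdfocr", pdf])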
You may need to remove the OCR’d text from a PDF if it is corrupt or did not render properly. You can find an excellent guide on how to do that on the Mac here.
Ruby-based Tabula is pretty solid at extracting tables from a PDF, but on larger documents it can be extremely slow or fail outright. The program is still not 100% operational, but for smaller documents it does as good a job locally as ScraperWiki does as a freemium service.
A great Python-based solution for extracting text from a PDF is PDFMiner. After installing it, cd into the directory where your OCR’d PDF is located and run the following command:
pdf2txt.py -o output.html filename_ocr.pdf
The resulting file will be output.html, a single web page combining all of the PDF’s pages. You can now use BeautifulSoup or your favorite text editor to clean up the document and mine the data.
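If you go the BeautifulSoup route, here’s a minimal sketch of the starting point, assuming the output.html produced by the command above; it just pulls out the visible text so you can start mining it:

# Minimal sketch: pull the visible text out of PDFMiner's HTML output
# with BeautifulSoup. Assumes the file is the output.html generated above.
from bs4 import BeautifulSoup

with open("output.html") as f:
    soup = BeautifulSoup(f.read(), "html.parser")

# stripped_strings yields every piece of text with the whitespace trimmed.
lines = list(soup.stripped_strings)
print(len(lines), "lines of text")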
I wrote a quick script that separates each page into its own dictionary entry and inserts each line of HTML as an item in a list. I’ve made it available on GitHub.
If you only want the text and don’t want to mess around with HTML, the following command works best:
pdf2txt.py -o test.txt -t text test_ocr.pdf
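One handy detail about the plain-text output: in the versions of PDFMiner I’ve used, each page ends with a form-feed character (\x0c), so splitting the file back into pages is straightforward. A quick sketch, assuming your output does the same (peek at the raw file if the counts look off):

# Sketch: split PDFMiner's plain-text output back into pages.
# Assumes pages are separated by form-feed characters (\x0c), which is what
# the PDFMiner versions I've used emit -- check your own file to be sure.
with open("test.txt") as f:
    pages = f.read().split("\x0c")

# Drop any empty trailing chunk and take a quick look.
pages = [p for p in pages if p.strip()]
print(len(pages), "pages of text")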
When I try to run pypdfocr filename.pdf I get a SyntaxError: invalid syntax.
Can you give me a hint about where to look for the issue?
Peter,
That error usually means the command was typed at the Python prompt rather than in the terminal. Try running pypdfocr from the command line instead of from inside a Python shell (or call it through subprocess from a script, as in the loop earlier in the post).
I use xpdf for text extraction from OCR’d files.
http://www.foolabs.com/xpdf/download.html
$ pdftotext file.pdf
For doing OCR, I find pypdfocr pretty slow on the Mac compared with Windows. Do you have any other recommendations?