Scraping PDFs with Python

PDFs are a hassle for those of us who have to work with them to get at their data.  When I was at the Open Data NJ summit last month, the reporters and journalists went on and on about how working with PDFs is the worst thing in the world, and they’re right.

Please check out the follow-up post made on 11/10/2019 for updated information.

Fortunately, there are a few data mining techniques you can use to make this a lot easier, especially if you are left with few other options.

While digging for a way to convert a PDF made up entirely of images to text, I came across pypdfocr.  It has a lot of dependencies that you must install through brew and pip.  Once you’ve met all of the requirements, you can cd into your folder of choice and run the following command:

pypdfocr filename.pdf

It takes a little while, but this will split the PDF into a PNG file for each page and then create an additional HTML page for each of these.  In the end, all of these files get cleaned up and you’re left with a properly OCR’d PDF.
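If you have a whole folder of scanned PDFs, the same command can be looped from Python. This is only a rough sketch: the function names are made up for illustration, and the `_ocr.pdf` output name is an assumption based on the `filename_ocr.pdf` file used later in this post, so check what pypdfocr actually writes on your machine.

```python
import shutil
import subprocess
from pathlib import Path


def ocr_command(pdf_path):
    """Build the same command shown above for a single PDF."""
    return ["pypdfocr", str(pdf_path)]


def expected_output(pdf_path):
    """Assumed output name: filename.pdf -> filename_ocr.pdf."""
    pdf_path = Path(pdf_path)
    return pdf_path.with_name(pdf_path.stem + "_ocr.pdf")


def ocr_folder(folder):
    """Run pypdfocr on every PDF in `folder` that has no _ocr copy yet."""
    if shutil.which("pypdfocr") is None:
        raise RuntimeError("pypdfocr is not on your PATH")
    for pdf in sorted(Path(folder).glob("*.pdf")):
        # Skip files that are already OCR output or already processed.
        if pdf.stem.endswith("_ocr") or expected_output(pdf).exists():
            continue
        subprocess.run(ocr_command(pdf), check=True)
```

Calling `ocr_folder(".")` from the folder you would otherwise cd into should behave like running the command by hand on each file.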

You may need to remove the OCR’d text from a PDF because it is corrupt and did not render properly. You can find an excellent guide on how to do that on the Mac here.

Ruby-based Tabula is pretty solid at extracting tables from a PDF, but with a larger document it may be extremely slow or fail.  The program is still not 100% operational, but for smaller documents it does as good a job locally as ScraperWiki does as a freemium service.

A great Python-based solution to extract the text from a PDF is PDFMiner. After installing it, cd into the directory where your OCR’d PDF is located and run the following command:

pdf2txt.py -o output.html filename_ocr.pdf

The resulting file will be output.html, a single webpage of the PDF pages combined. You can now use BeautifulSoup or your favorite text editor to clean up the document and mine the data.
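As a rough idea of what that cleanup can look like, here is a minimal sketch that strips the tags and keeps only the visible text. It uses the standard library's `html.parser` rather than BeautifulSoup so it runs without any extra installs; the class and function names are my own placeholders, not part of PDFMiner.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def html_to_lines(html_text):
    """Return the visible text of an HTML string as a list of strings."""
    parser = TextExtractor()
    parser.feed(html_text)
    parser.close()
    return parser.chunks


# Small stand-in for a fragment of output.html.
sample = '<div><span style="font-size:12px">Total\n<br></span><span>42</span></div>'
lines = html_to_lines(sample)
```

For a real file, you would read `output.html` into a string first and pass that to `html_to_lines`.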

I wrote a quick script that will separate each page into its own dictionary entry, and insert each line of HTML as an item in a list.  I’ve made it available on GitHub.
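The idea looks roughly like the sketch below. This is not the script from GitHub, just an illustration of the same approach, and it assumes each page in the HTML starts with a marker such as `<a name="1">Page 1</a>`. Open your own output.html and adjust the pattern if the markers look different.

```python
import re

# Assumed page-marker pattern; change it to match your output.html.
PAGE_MARKER = re.compile(r'<a name="(\d+)">Page \d+</a>')


def split_pages(html_text):
    """Return {page_number: [html_line, ...]} for an HTML string."""
    pages = {}
    current = None
    for line in html_text.splitlines():
        match = PAGE_MARKER.search(line)
        if match:
            # A new page starts here; the marker line is kept with it.
            current = int(match.group(1))
            pages[current] = []
        if current is not None and line.strip():
            pages[current].append(line)
    return pages


# Small stand-in for the contents of output.html.
sample = (
    '<html><body>\n'
    '<div><a name="1">Page 1</a></div>\n'
    '<span>First page text</span>\n'
    '<div><a name="2">Page 2</a></div>\n'
    '<span>Second page text</span>\n'
)
pages = split_pages(sample)
```

Anything before the first page marker, such as the HTML header, is dropped, and blank lines are skipped.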

If you only want the text and don’t want to mess around with HTML, the following command works best:

pdf2txt.py -o test.txt -t text test_ocr.pdf
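The text file can be split up the same way. The sketch below assumes the pages in the text output are separated by a form feed character (`\f`), so check your own file before relying on it; the function name is just a placeholder.

```python
def split_text_pages(text):
    """Split text into pages on form feeds, then into non-empty lines."""
    pages = [page for page in text.split("\f") if page.strip()]
    return [
        [line.strip() for line in page.splitlines() if line.strip()]
        for page in pages
    ]


# Small stand-in for the contents of test.txt.
sample = "Name  Amount\nSmith  10\n\fName  Amount\n\nJones  20\n\f"
text_pages = split_text_pages(sample)
```

Each entry in `text_pages` is one page, stored as a list of its lines, which is a handy starting point for pulling out rows of a table.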

5 comments

  1. When I try to run pypdfocr filename.pdf I get a SyntaxError: invalid syntax.

    Can you give me a hint where to search for the issue?
