Scraping PDFs with Python – Follow Up

It has been a number of years since I first wrote about scraping PDFs using Python, and that post has been by far the most popular one on this blog.

I am not going to claim to be an expert in this field; my career has shifted, and I no longer need to do much advanced PDF reading. Being in the construction industry, I use software like Bluebeam Revu Xtreme, which has an OCR module built in, so I don’t need a separate solution for this. However, a single license for this program can cost as much as $350.00 USD, and it is mainly construction-focused, so I can understand why most people wouldn’t jump to purchase software when free solutions already exist.

With so many visitors still coming to my site for this topic, I decided that I should write a follow-up with a new and easier method of going about this task. There are a few things that you’ll need to do before you start.

1) Find a Document to Use

There are a number of test PDFs that you can download to try out my workflow, including this one that I found from a quick search. Gather a bunch of PDFs that you want to test things out on.

2) Install Python 3

It is my recommendation that you download the latest Python 3 build before you start the PDF scraping process.
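
If you want to confirm that your setup is ready before moving on, a small check script can help. This is only a sketch: the function name `check_environment` is made up for illustration, and the lookup for a `tesseract` executable assumes you will be using OCRmyPDF in the next step, which relies on Tesseract being installed on your system.

```python
import importlib.util
import shutil
import sys


def check_environment():
    """Report whether the basics for the OCR workflow appear to be in place."""
    return {
        # True when the running interpreter is Python 3 or newer.
        "python3": sys.version_info.major >= 3,
        # True when the ocrmypdf package can be located (it is not imported here).
        "ocrmypdf": importlib.util.find_spec("ocrmypdf") is not None,
        # True when a tesseract executable is on the PATH.
        "tesseract": shutil.which("tesseract") is not None,
    }


# Print one line per check so missing pieces are easy to spot.
for name, ok in check_environment().items():
    print(f"{name}: {'found' if ok else 'missing'}")
```

Anything reported as missing will need to be installed before the OCR step will run.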

3) Use OCRmyPDF Python Library + My Web App

I have tried so many different solutions to OCR PDFs, and all of them have broken over time. OCRmyPDF just got a major update that lets the library be used through a Python API rather than only from the command line. I’ve been seriously impressed with the author and their attention to detail, and have waited patiently for this new release to drop before I finished this post (my first draft says it was 9 months ago!). I’ve been spending a great deal of time on a platform for the code, and I think that it is all finally stable. I am using a popular Flask file uploader template that I also found on GitHub. My code can do a few things:

  1. OCR a PDF
  2. Wipe the text away from a PDF and allow for re-OCR.
  3. Dump a text file from the PDF.
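
For anyone who would rather try these three operations from a script, they can be sketched directly against the OCRmyPDF Python API. This is a rough sketch and not the code behind the web app: the helper names (`build_ocr_options`, `process_pdf`) and the mode names (`"ocr"`, `"redo"`) are hypothetical, and it assumes that `force_ocr=True`, which rasterizes each page and runs OCR again, is an acceptable stand-in for wiping the text layer before re-OCR.

```python
from pathlib import Path


def build_ocr_options(mode, text_path=None):
    """Translate a hypothetical mode name into keyword arguments for ocrmypdf.ocr()."""
    if mode == "ocr":
        # Leave pages that already contain text alone and OCR only the rest.
        options = {"skip_text": True}
    elif mode == "redo":
        # Rasterize every page and OCR it again, discarding the existing text.
        options = {"force_ocr": True}
    else:
        raise ValueError(f"unknown mode: {mode!r}")
    if text_path is not None:
        # Also write the recognized text to a plain-text sidecar file.
        options["sidecar"] = str(text_path)
    return options


def process_pdf(input_pdf, output_pdf, mode="ocr", text_path=None):
    """Run OCRmyPDF on input_pdf and write the result to output_pdf."""
    # Imported here so build_ocr_options() can be used without ocrmypdf installed.
    import ocrmypdf

    options = build_ocr_options(mode, text_path)
    return ocrmypdf.ocr(Path(input_pdf), Path(output_pdf), **options)
```

With placeholder file names, `process_pdf("scan.pdf", "scan_ocr.pdf", text_path="scan.txt")` would OCR the document and dump its text in one pass, while `mode="redo"` would replace whatever text layer is already there. As far as I know, the sidecar file only contains text from pages that were actually OCRed, so `"redo"` is the safer choice when you want a complete text dump.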

So, check out PythonPDFScraper on GitHub. There are a number of other features that I want to build later on, including keyword searching, but I have just not gotten there yet. Please let me know if this helps you at all, or if you are experiencing issues installing.
