cleanco: A Python Module for Checking Business Names

I had a business problem that I ended up solving during after-work hours. I was having trouble comparing two company strings to match them.

For instance, “Merck” should equal “Merck, Inc.”, right? But it doesn’t, which means that I would have to process data by eye, in the tradition of Amazon Mechanical Turk.

I didn’t want to do that, though.

A company name can be written several ways, with many possible suffixes. cleanco, a Python-based module I wrote and released on GitHub, attempts to strip out extraneous pieces and leave behind a clean version of a company name. In the case of “Merck, Inc.”, running it through my module would transform it into “Merck”, making a match. Sounds great, right?
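To make the idea concrete, here is a minimal sketch of suffix stripping. It is not cleanco's actual code or API: the function name `strip_suffixes` and the short `SUFFIXES` list are assumptions made up for illustration, and the real module handles far more cases.

```python
# Illustrative suffix list only (an assumption, not cleanco's real data).
SUFFIXES = ["inc", "incorporated", "corp", "corporation",
            "llc", "ltd", "limited", "gmbh", "co"]


def strip_suffixes(name):
    """Remove trailing business suffixes and punctuation from a company name."""
    cleaned = name.strip()
    while True:
        # Drop trailing periods, commas, and spaces before checking suffixes.
        cleaned = cleaned.rstrip(" .,")
        lowered = cleaned.lower()
        for suffix in SUFFIXES:
            # Only match the suffix as a whole trailing word.
            if lowered.endswith(" " + suffix) or lowered.endswith("," + suffix):
                cleaned = cleaned[: -len(suffix)]
                break
        else:
            # No suffix matched on this pass, so the name is as clean as it gets.
            return cleaned


print(strip_suffixes("Merck, Inc."))  # Merck
print(strip_suffixes("Merck") == strip_suffixes("Merck, Inc."))  # True
```

The loop repeats so that stacked suffixes such as “Co., Ltd.” are removed one at a time.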

The hardest part of this project was accounting for all of the different business suffixes used worldwide. I spent hours on the Types of Business Entity article on Wikipedia to make sure that I was accounting for as many as possible, but I’m sure that there’s much more work to do here. I also built features to figure out a company’s possible types, countries, and industries. I also expand certain abbreviations into their full names.
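The type and country lookup can be pictured as a table keyed by suffix. The sketch below is only an assumption about how such a lookup might work, not the module's real implementation: `SUFFIX_INFO` and `guess_entity` are hypothetical names, and the table lists just a few example suffixes with a few example countries each.

```python
# Hypothetical lookup table: suffix -> (entity type, example countries).
# Deliberately tiny and not exhaustive.
SUFFIX_INFO = {
    "inc": ("Corporation", ["United States of America"]),
    "llc": ("Limited Liability Company", ["United States of America"]),
    "gmbh": ("Limited Liability Company", ["Germany", "Austria", "Switzerland"]),
    "plc": ("Public Limited Company", ["United Kingdom", "Ireland"]),
}


def guess_entity(name):
    """Return (entity type, countries) for the name's last word, or None."""
    # Normalize: lowercase, treat commas as separators, drop periods.
    words = name.lower().replace(",", " ").replace(".", "").split()
    if not words:
        return None
    return SUFFIX_INFO.get(words[-1])


print(guess_entity("Merck, Inc."))
print(guess_entity("Siemens GmbH"))
print(guess_entity("Merck"))  # None, since there is no known suffix
```

Because one suffix can be used in several countries, the lookup returns a list of possibilities rather than a single answer.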

There’s still a lot more work to be done, especially when it comes to optimizing the algorithm, but I’m satisfied with the code so far. More information can be found on the project’s GitHub page.


2 thoughts on “cleanco: A Python Module for Checking Business Names”

  1. That is a good point; the help function can be a ton of help. But that assumes your friend’s or the random guy-on-the-internet’s module contains a docstring. I will be honest and say that I commonly won’t put a docstring in many of my modules unless they are super solid and ready to be distributed. So if help doesn’t return anything helpful, it’s good to have dir() to fall back upon. Thanks for the comment!
