I had a business problem that I ended up solving after work hours: I was having trouble matching two strings that name the same company.
For instance, “Merck” should equal “Merck, Inc.”, right? But it doesn’t, which means that I would have to process data by eye, in the tradition of Amazon Mechanical Turk.
I didn’t want to do that, though.
A company name can be written several ways, with many possible suffixes. cleanco, a Python module I wrote and released on GitHub, attempts to strip out the extraneous pieces and leave behind a clean version of the company name. Running “Merck, Inc.” through my module transforms it into “Merck”, which makes the match. Sounds great, right?
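To make the idea concrete, here is a minimal sketch of suffix stripping. It is not the module's actual code: the function name `strip_suffixes` and the tiny `SUFFIXES` list are made up for illustration, and a real list would be far longer.

```python
import re

# Hypothetical, deliberately short suffix list (an assumption for this sketch).
SUFFIXES = ["incorporated", "corporation", "inc", "corp", "ltd", "llc", "gmbh", "co"]

# Match a separator (spaces or commas), one suffix, an optional period,
# and then the end of the string. Longer suffixes are listed first.
_SUFFIX_RE = re.compile(
    r"[\s,]+(?:"
    + "|".join(sorted(map(re.escape, SUFFIXES), key=len, reverse=True))
    + r")\.?\s*$",
    re.IGNORECASE,
)


def strip_suffixes(name):
    """Remove trailing business suffixes and stray punctuation from a name."""
    cleaned = name.strip()
    # Repeat until nothing changes, so stacked suffixes like "Co., Ltd." all go.
    while True:
        stripped = _SUFFIX_RE.sub("", cleaned)
        if stripped == cleaned:
            break
        cleaned = stripped
    return cleaned.strip(" ,.")


# Two spellings of the same company now compare equal.
print(strip_suffixes("Merck, Inc.") == strip_suffixes("Merck"))
```

Anchoring the pattern at the end of the string is a deliberate choice here: it keeps a word like “Co” in the middle of a name from being removed by accident.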
The hardest part of this project was accounting for all of the different business suffixes used worldwide. I spent hours on the Types of Business Entity article on Wikipedia to make sure that I was accounting for as many as possible, but I’m sure that there’s much more work to do here. I also built features to figure out a company’s possible types, countries, and industries. I replace certain abbreviations with their full names, too.
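A rough sketch of how suffix lookup and abbreviation expansion could work is below. The tables, the function names `describe_suffix` and `expand_abbreviations`, and the handful of entries are all assumptions for illustration, not the module's real data or API. Industry detection is left out.

```python
# Illustrative entries only (hypothetical structure, not the module's data).
SUFFIX_INFO = {
    "gmbh": {"type": "Limited Liability Company",
             "countries": ["Germany", "Austria", "Switzerland"]},
    "llc": {"type": "Limited Liability Company",
            "countries": ["United States"]},
    "plc": {"type": "Public Limited Company",
            "countries": ["United Kingdom", "Ireland"]},
}

# Hypothetical abbreviation table.
ABBREVIATIONS = {"intl": "International", "mfg": "Manufacturing"}


def describe_suffix(name):
    """Return type/country info for the name's last word, or None if unknown."""
    words = name.split()
    if not words:
        return None
    key = words[-1].strip(".,").lower()
    return SUFFIX_INFO.get(key)


def expand_abbreviations(name):
    """Replace known abbreviations with full words; leave other words alone."""
    expanded = []
    for word in name.split():
        key = word.strip(".,").lower()
        expanded.append(ABBREVIATIONS.get(key, word))
    return " ".join(expanded)


info = describe_suffix("Siemens GmbH")
print(info["countries"])
print(expand_abbreviations("Acme Intl. Mfg"))
```

A plain dictionary keyed on the lowercased suffix keeps the lookup simple, though a suffix used in many countries can only suggest candidates, never a definitive answer.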
There’s still a lot more work to be done, especially when it comes to optimizing the algorithm, but I’m satisfied with the code so far. More information can be found on the project’s GitHub page.