If you have trouble matching similar company names, you’re more than likely trying to clean up some sort of database. This is a problem I’ve run into repeatedly and have tried to work through myself, so hopefully my findings can help others who find themselves in a similar situation.
Think about it: if you have two similar company names, how can you get your code (really, a computer) to figure out that they match? There should be a way to say that, for example, “Abbott Labs” and “Abbott Laboratories, Inc.” are the same thing. You need an extensive set of rules, that is, an algorithm, to handle this process.
I wrote a method to match company names in Python, but after some searching, I came across a very helpful Stack Overflow thread that makes what I’ve been doing a lot better.
The first response to the Stack Overflow question is extremely informative. User Michael J. Barber suggests going through this workflow first with each of the company name strings, to make sure they come out as close as possible:
Standardizing lettercase (e.g., all lowercase)
Standardizing punctuation (e.g., commas must be followed by spaces)
Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)
Standardizing legal control terms (e.g., converting “Co.” to “Company”)
Standardizing letter case can be accomplished with the .lower() string method.
Standardizing punctuation can be accomplished by replacing hyphens or commas with spaces.
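As a rough sketch of these first two steps, something like the following would work. The function name is made up for illustration, and the choice of which punctuation marks to replace (here only hyphens and commas, as mentioned above) is an assumption you would adjust for your own data.

```python
def standardize_case_and_punctuation(name):
    """Lowercase a name and turn hyphens and commas into spaces."""
    lowered = name.lower()
    # Assumption: only hyphens and commas are treated as separators here.
    for mark in ("-", ","):
        lowered = lowered.replace(mark, " ")
    return lowered


print(standardize_case_and_punctuation("Abbott Laboratories, Inc."))
```

Replacing a comma with a space can leave two spaces in a row, which is why the whitespace step comes afterwards.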
Standardizing whitespace can be accomplished using several methods:
1. .lstrip() to remove spaces on the left side of a string
2. .rstrip() to remove spaces on the right side of a string (.strip() removes them from both sides at once)
3. " ".join(your_string.split()) to collapse excess spaces in the middle of a string
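Here is a minimal sketch of the whitespace step. The helper name is hypothetical; the trick is that calling .split() with no arguments splits on any run of whitespace and drops leading and trailing whitespace, so joining the pieces back with a single space handles all three cases at once.

```python
def standardize_whitespace(name):
    """Trim both ends and collapse runs of whitespace to single spaces."""
    # split() with no arguments splits on any whitespace run and
    # discards empty strings at the start and end.
    return " ".join(name.split())


print(standardize_whitespace("  abbott   laboratories \t inc. "))
```

Note that joining the raw string without splitting it first would instead put a space between every character, which is not what you want.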
Standardizing accented/special characters: I haven’t come across a solution for this yet, mainly because I need to seriously read up on ASCII and Unicode. I may have actually figured this out in my code, but I don’t have a way to describe it yet.
Standardizing legal control terms: accomplished through my cleanco Python module.
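For what it's worth, one common approach (not necessarily the one used in the code this post describes) is to lean on Python's standard unicodedata module: decompose each accented letter into a base letter plus a combining mark, then drop everything that is not ASCII. The function name below is made up for illustration.

```python
import unicodedata


def strip_accents(name):
    """Convert accented letters to their closest ASCII equivalents."""
    # NFKD splits a character like "é" into "e" plus a combining accent.
    decomposed = unicodedata.normalize("NFKD", name)
    # Encoding to ASCII with errors="ignore" drops the combining marks
    # (and any other character that has no ASCII form).
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(strip_accents("Société Générale"))
```

One caveat: characters that have no decomposition (for example "ø" or "ß") are silently dropped rather than converted, so this is a sketch to build on, not a complete solution.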
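To show the idea without depending on any particular library, here is a small stand-in. This is not cleanco's code or API; the suffix list and the function name are hypothetical and far shorter than what a real cleanup would need.

```python
# Hypothetical, deliberately short list of legal terms (not cleanco's data).
LEGAL_SUFFIXES = ("inc", "incorporated", "corp", "corporation",
                  "co", "company", "ltd", "llc")


def strip_legal_suffix(name):
    """Remove trailing legal terms such as ', Inc.' from a company name."""
    words = name.split()
    # Pop trailing words for as long as they look like a legal term.
    while words and words[-1].strip(".,").lower() in LEGAL_SUFFIXES:
        words.pop()
    # Tidy up a comma left dangling on the new last word.
    if words:
        words[-1] = words[-1].rstrip(",")
    return " ".join(words)


print(strip_legal_suffix("Abbott Laboratories, Inc."))  # Abbott Laboratories
```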
Let’s say that you wrote a program that did all of this for “Abbott Labs” and “Abbott Laboratories, Inc.” — you’d get back “abbott labs” and “abbott laboratories” (my cleanco module strips the “, Inc.” away). These are pretty close together, but still not an exact match.
Barber then suggests using an algorithm to measure how unlike two strings are (edit distances are one family of such measures); he mentions the Jaccard index, which compares how much two sets overlap, as being the best way to do this.
Enter the distance Python module. It gives you several algorithms to choose from to compare strings, including the Jaccard index. For the Jaccard index, the result is a distance between 0 and 1, that is, how unalike the two strings are. I like to invert this number by subtracting it from one, to get a “% match”; I think it makes more sense.
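If you would rather see what is going on under the hood, the Jaccard calculation is short enough to write by hand. The sketch below is plain Python, not the distance module itself, and it assumes each string is compared as a set of characters; both function names are made up for illustration.

```python
def jaccard_distance(a, b):
    """Return how unalike two strings are, from 0.0 (same) to 1.0."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Two empty strings: treat them as identical.
        return 0.0
    return 1 - len(set_a & set_b) / len(union)


def percent_match(a, b):
    """Invert the distance to get a similarity between 0.0 and 1.0."""
    return 1 - jaccard_distance(a, b)


print(percent_match("abbott labs", "abbott laboratories"))
```

Because sets ignore order and repetition, two strings made of the same characters in a different order score as a perfect match, so it is worth pairing this with the cleanup steps rather than relying on it alone.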
If you were to calculate the Jaccard index for “Abbott Labs” and “Abbott Laboratories, Inc.”, you’d find that the % match would be 50%, so if you only accepted matches above 50%, this pair would fail.
However, the % match for “abbott labs” and “abbott laboratories” using the Jaccard index is 70%, which is much more like it! With a threshold of over 50% for a company match, this pair passes.
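The threshold check itself can be sketched in a few lines. This assumes the character-set version of the Jaccard index, which reproduces the 50% and 70% figures above; the function names and the default threshold are illustrative choices, not part of any library.

```python
def char_jaccard_match(a, b):
    """Share of distinct characters the two strings have in common."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def is_same_company(a, b, threshold=0.5):
    """Accept the pair only if the match is strictly above the threshold."""
    return char_jaccard_match(a, b) > threshold


print(is_same_company("Abbott Labs", "Abbott Laboratories, Inc."))  # False
print(is_same_company("abbott labs", "abbott laboratories"))        # True
```

The raw names land exactly on 50% and are rejected, while the cleaned-up names reach 70% and are accepted, which is the whole argument for normalizing first.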
With the right amount of tweaking, you can make this work for your own projects.