If you have trouble matching similar company names, you’re more than likely trying to clean up some sort of database. This is a problem I keep running into and have tried to work through myself, so hopefully my findings can help others who find themselves in a similar situation.
Think about it: if you have two similar company names, how can you get your code (really, a computer) to figure out that they are a match? There should be a way to say that, for example, “Abbott Labs” and “Abbott Laboratories, Inc.” are the same thing. You need a fairly extensive set of rules, that is, an algorithm, to handle this process.
I wrote a method to match company names in Python, but after some searching, I came across a very helpful Stack Overflow thread that makes what I’ve been doing a lot better.
The first response to the Stack Overflow question is extremely informative. User Michael J. Barber suggests going through this workflow first with each of the company name strings, to make sure they come out as close as possible:
Standardizing lettercase (e.g., all lowercase)
Standardizing punctuation (e.g., commas must be followed by spaces)
Standardizing whitespace (e.g., converting all runs of whitespace to single spaces)
Standardizing accented and special characters (e.g., converting accented letters to ASCII equivalents)
Standardizing legal control terms (e.g., converting “Co.” to “Company”)
Standardizing letter case can be accomplished with the .lower() string method.
Standardizing punctuation can be accomplished by replacing hyphens or commas with spaces.
Standardizing whitespace can be accomplished using several methods:
1. .lstrip() to remove whitespace on the left side of a string
2. .rstrip() to remove whitespace on the right side of a string (.strip() removes it from both sides at once)
3. " ".join(your_string.split()) to collapse excess spaces in the middle of a string
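The letter case, punctuation, and whitespace steps above can be sketched in a few lines. This is only an illustration: the function name basic_normalize is made up for the example, and the choice to treat only hyphens and commas as punctuation follows the suggestion above rather than any complete rule set.

```python
def basic_normalize(name):
    """Lowercase a name, turn hyphens and commas into spaces, collapse whitespace."""
    name = name.lower()
    name = name.replace("-", " ").replace(",", " ")
    # split() with no arguments drops leading/trailing whitespace and treats
    # any run of whitespace as a single separator
    return " ".join(name.split())


print(basic_normalize("  Abbott   Laboratories,Inc. "))
print(basic_normalize("Smith-Kline"))
```

Because split() already discards leading and trailing whitespace, no separate strip call is needed in this version.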
Standardizing accented/special characters — I haven’t come across a solution for this yet, mainly because I need to seriously read up on ASCII / UTF. I may have actually figured this out in my code, but I don’t have a way to describe it yet.
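One common approach, offered here as a sketch and not as a description of what my own code does, is to decompose each character with the standard library's unicodedata module and then drop the combining accent marks. The function name strip_accents is hypothetical.

```python
import unicodedata


def strip_accents(text):
    """Replace accented letters with their unaccented base letters where possible."""
    # NFKD splits a character such as "é" into "e" plus a combining accent mark
    decomposed = unicodedata.normalize("NFKD", text)
    # keep everything except the combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Nestlé"))
print(strip_accents("Société Générale"))
```

Letters that have no decomposition, such as “ø”, pass through unchanged, so this does not guarantee pure ASCII output.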
Standardizing legal control terms can be accomplished with my cleanco Python module.
Let’s say that you wrote a program that did all of this for “Abbott Labs” and “Abbott Laboratories, Inc.” — you’d get back “abbott labs” and “abbott laboratories” (my cleanco module strips the “, Inc.” away). These are pretty close together, but still not an exact match.
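To make that result concrete without relying on cleanco itself, here is a small stand-in. The LEGAL_SUFFIXES set and the clean_name function are hypothetical and far less thorough than a real list of legal terms; they exist only to show the shape of the step.

```python
# Hypothetical, deliberately tiny list of legal terms; a real list is much longer.
LEGAL_SUFFIXES = {"inc", "inc.", "llc", "ltd", "ltd.", "co", "co.", "corp", "corp."}


def clean_name(name):
    """Normalize a company name and drop trailing legal terms such as 'Inc.'."""
    # lowercase, treat commas and hyphens as spaces, split on any whitespace
    words = name.lower().replace(",", " ").replace("-", " ").split()
    # remove legal terms from the end of the name only
    while words and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


print(clean_name("Abbott Labs"))                # abbott labs
print(clean_name("Abbott Laboratories, Inc."))  # abbott laboratories
```

Only trailing terms are removed here, so a name like “Co-op Bank” keeps its first word.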
Barber then suggests using an algorithm to measure how unlike two strings are, such as an edit distance; he mentions the Jaccard index, which compares the sets of elements two strings share rather than counting edits, as being the best way to do this.
Enter the distance Python module. It gives you several algorithms to choose from to compare strings, including the Jaccard index. The Jaccard result is a distance between 0 and 1, that is, how unalike the two strings are. I like to invert this number by subtracting it from one, to get a “% match”; I think it makes more sense.
If you were to calculate the Jaccard index for the raw strings “Abbott Labs” and “Abbott Laboratories, Inc.”, you’d find that the % match is 50%, meaning that a test requiring a match of more than 50% would fail.
However, the % match for the cleaned strings “abbott labs” and “abbott laboratories” is 70%, which is much more like it! With a threshold of over 50% for a company match, this pair would pass.
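Both figures can be reproduced in plain Python by treating each string as a set of its distinct characters. That reading is an assumption about how the comparison is made, and jaccard_match is a hypothetical helper written for this sketch, not a function from the distance module.

```python
def jaccard_match(a, b):
    """Share of distinct characters that two strings have in common (0 to 1)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    # size of the overlap divided by size of the combined set
    return len(set_a & set_b) / len(set_a | set_b)


raw = jaccard_match("Abbott Labs", "Abbott Laboratories, Inc.")
clean = jaccard_match("abbott labs", "abbott laboratories")
print(f"{raw:.0%}")    # 50%
print(f"{clean:.0%}")  # 70%
```

Because sets ignore order and repetition, two anagrams such as “listen” and “silent” score 100% under this measure, which is one reason a threshold needs tuning against real data.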
With the right amount of tweaking, you can make this work for your own projects.
I have been trying to do the exact same thing on my end for some time now and I came across your blog here. It’s a dream come true.
I’m NOT a programmer by trade; I’ve taught myself a few things here and there.
But basically, I have a MySQL database filled with company names. And your Abbott Labs/Abbott Laboratories example is a classic case of what I am dealing with.
I have two questions: (remember I’m NOT a programmer, lol)
1) How do I take your code and connect it to my MySQL database?
2) How do I fold in the Jaccard index on top of your cleanco?
I’ve done mine largely in Excel and it’s a monster. It cleans names very well, but the MATCHING is the problem!
I’d love to hear from you soon!
R.J. (Rashod Welch)
You can make a database connection with Python. There are steps online to do this.
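As a rough illustration of that connection step, here is a sketch that uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL. The companies table and its name column are hypothetical. A MySQL driver follows the same connect, execute, and fetch pattern from Python's DB-API, though its connection arguments and placeholder style differ.

```python
import sqlite3

# In-memory SQLite database used as a stand-in; the table and column names
# are made up for this example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
conn.executemany(
    "INSERT INTO companies (name) VALUES (?)",
    [("Abbott Labs",), ("Abbott Laboratories, Inc.",)],
)

# Pull every name into a Python list, ready for cleaning and matching
names = [row[0] for row in conn.execute("SELECT name FROM companies ORDER BY rowid")]
conn.close()

print(names)
```

Once the names are in a list, each one can be cleaned and compared against the others in ordinary Python code.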
OMG… This is amazing!!!
Again, Thank you!
I’m using the Levenshtein Distance and calculating a percentage from it. Have you used this as well?
I personally have, and have found it to be good, except for cases such as “care” vs “case”, which have a very small Levenshtein/edit distance but an entirely different meaning.
I think a combination of normalization, string matching and phonetic match is the best way forward IMHO.
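The “care” vs “case” problem is easy to see with a small pure-Python Levenshtein function. Turning the distance into a percentage by dividing by the longer string's length is an assumption, since there is more than one way to do it, and both function names are hypothetical.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def percent_match(a, b):
    """1 minus the edit distance scaled by the longer string's length."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1 - levenshtein(a, b) / longest


print(levenshtein("care", "case"))    # 1
print(percent_match("care", "case"))  # 0.75
```

A single substitution separates the two words, so they score 75% despite meaning entirely different things.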
This is Varshini from India. I need help with my project, which compares hospital names. I have a dataset containing 4,500 hospital names. If a user enters their own hospital name, the matching name from my dataset should be displayed as output. That is my requirement. So, please help me with some solutions.
Thanks in advance.
Lol, sure, I’ll write the code for you. Anything else, a coffee maybe?
It’s pretty useful. Implementing these rules is fun as well. I love you 3000