CSV-Tool: Similar Strings
Clean your CSV and remove typos etc.
Not every data set is clean and organized. To help users deal with such messy data, we are building a set of tools. Our first tool helps you remove typos and similar mistakes from a set of strings. We have used it in projects where we wanted to group columns by the name of an organization. To do so, the tool offers two methods for calculating the similarity of strings. You can try it directly on this website, use it as a command-line tool via NPM, or include it as a node package inside your application. A short description of how the tool works is provided at the end of the article.
Paste the content of a CSV file into the text field below or drag'n'drop a CSV file onto the text field. The first row needs to contain the column names.
In order to maintain a responsive interaction with the system, don't use the web interface for large CSVs (>2 MB).
What delimiter is used in the CSV?
This should be detected automatically.
Which column should we analyse?
Which algorithm should we use to compare the strings?
An explanation of the methods can be found below.
Additional parameters for the Fingerprint-Algorithm:
Type of Fingerprinting:
Should the words be stemmed:
Additional parameters for the KNN-Algorithm:
Minimum number of characters to be the same in both strings (recommended: 6):
Maximum difference in percentage (recommended: 10):
Based on the information provided above, the following field holds a list of string-comparison results. If a group consists of only one item with "ok":2, no similar item was found. Otherwise, a group contains multiple items, all but one marked "ok":1; in the next step, every item with a string in the "ok":1 group will be replaced by the "ok":2 string. You are free to modify which item should receive the "2".
Depending on the size of your input data set, this might take a while.
The following text field holds your cleaned data.
Depending on input size, this might take a while.
How does this work?
The fundamental idea behind the fingerprinting approach is to reduce the text string down to its essential components, which then become its fingerprint. The fingerprint is then used to compare each string against all other strings via their fingerprints.
There are two methods for fingerprinting implemented in this tool. The first runs the following processes on the string:
- Remove spaces at the beginning and end of a string.
- Convert to lower case characters.
- Remove punctuation and spaces.
- Convert special characters (e.g. ä > ae).
- Remove invisible characters.
This process, for example, converts:
'Grips'-Theater gemeinnützige Gesellschaft > gripstheatergemeinnuetzigegesellschaft
This means that all of the following variations share the same fingerprint:
- 'Grips'-Theater gemeinnützige Gesellschaft
- Grips-Theater gemeinnützige Gesellschaft
- Grips Theater gemeinnützige Gesellschaft
- Grips Theater gemeinnützigeGesellschaft
> gripstheatergemeinnuetzigegesellschaft
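The steps above can be sketched in a few lines of JavaScript. The function name and the transliteration map are illustrative, not the module's actual API:

```javascript
// Sketch of the first fingerprinting method: trim, lower-case, transliterate
// special characters, then drop everything that is not a letter or digit
// (punctuation, spaces, and invisible characters).
function fingerprint(str) {
  const special = { 'ä': 'ae', 'ö': 'oe', 'ü': 'ue', 'ß': 'ss' };
  return str
    .trim()                                // remove leading/trailing spaces
    .toLowerCase()                         // lower case
    .replace(/[äöüß]/g, c => special[c])   // convert special characters
    .replace(/[^\p{L}\p{N}]/gu, '');       // remove punctuation, spaces, invisibles
}

console.log(fingerprint("'Grips'-Theater gemeinnützige Gesellschaft"));
// -> gripstheatergemeinnuetzigegesellschaft
```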
This method is very fast, but in turn is not able to detect more complex typos and problems.
The method of phonetic fingerprinting translates text strings into their corresponding phonetic markers. This way, we can identify some typos and mistakes, for example when people spell names by ear.
The following variations all result in the same phonetic markers:
- Sebastian
- Sebästian
- Sebastien
> 81826
In order to convert strings to phonetic markers, the module uses two libraries: cologne-framework for German and metaphone for other languages.
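For illustration, here is a simplified sketch of the Cologne phonetics idea used for German. The real algorithm (and the cologne-framework library) has additional context-dependent rules, e.g. for c, for p before h, and for d/t before c/s/z, which this reduced mapping omits:

```javascript
// Simplified Cologne phonetics sketch: map each letter to a digit, collapse
// runs of the same digit, then drop all '0' codes except a leading one.
const CODES = {
  a: '0', e: '0', i: '0', o: '0', u: '0', y: '0', j: '0',
  'ä': '0', 'ö': '0', 'ü': '0',
  b: '1', p: '1', d: '2', t: '2', f: '3', v: '3', w: '3',
  g: '4', k: '4', q: '4', l: '5', m: '6', n: '6', r: '7',
  s: '8', z: '8', c: '8', 'ß': '8', x: '48', h: ''
};

function colognePhonetics(word) {
  const digits = [...word.toLowerCase()]
    .map(ch => CODES[ch] ?? '')
    .join('');
  let collapsed = '';
  for (const d of digits) {
    if (d !== collapsed[collapsed.length - 1]) collapsed += d; // collapse runs
  }
  if (collapsed === '') return '';
  return collapsed[0] + collapsed.slice(1).replace(/0/g, ''); // drop inner '0's
}

console.log(colognePhonetics('Sebastian')); // -> 81826
console.log(colognePhonetics('Sebästian')); // -> 81826
console.log(colognePhonetics('Sebastien')); // -> 81826
```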
KNN / Neighbour
The neighbourhood method is slightly more complex; it requires more resources but is, in turn, a lot more precise. Its foundation is a direct comparison of two text strings using the Levenshtein distance, which counts the number of manipulations (delete, add, replace) required to convert string-1 into string-2. Calculating the distance between all items in a set can take quite a while, depending on the size of the data set. Therefore, before calculating any distances, the module first checks which strings actually share corresponding sub-strings. The minimum length of a corresponding sub-string controls the trade-off between detail and speed: longer sub-strings lead to faster processing but less detailed checks (default length: 6).
The following example demonstrates what sub-strings are (length:6):
- Technologie: Techno, echnol, chnolo, hnolog, nologi, ologie
- Technik: Techni, echnik
- Technische: Techni, echnis, chnisc, hnisch, nische
In the example above Technik and Technische both share the substring "Techni".
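Generating such sliding-window sub-strings can be sketched as follows (the function name and signature are illustrative, not the module's API):

```javascript
// All sub-strings of a fixed length (default 6), taken with a sliding window.
// Strings that share at least one such sub-string are candidates for an
// exact distance calculation.
function subStrings(str, len = 6) {
  const out = [];
  for (let i = 0; i + len <= str.length; i++) {
    out.push(str.slice(i, i + len));
  }
  return out;
}

console.log(subStrings('Technik'));
// -> [ 'Techni', 'echnik' ]
```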
For the strings that share a certain sub-string, the module calculates the exact distance. Here are a few examples and their distances:
- Technik <> Technische = 4
- Berlin <> Bern = 2
- Senat <> Stern = 4
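A plain dynamic-programming implementation of the Levenshtein distance reproduces these numbers. This is a sketch of the general algorithm, not the module's actual code:

```javascript
// Levenshtein distance: minimum number of single-character deletes, adds,
// and replacements needed to turn string a into string b.
function levenshtein(a, b) {
  // dp[i][j] = distance between the first i chars of a and first j chars of b
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0))
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1,                                   // delete
        dp[i][j - 1] + 1,                                   // add
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)  // replace (or keep)
      );
    }
  }
  return dp[a.length][b.length];
}

console.log(levenshtein('Technik', 'Technische')); // -> 4
console.log(levenshtein('Berlin', 'Bern'));        // -> 2
console.log(levenshtein('Senat', 'Stern'));        // -> 4
```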
We are interpreting the Levenshtein distance (LD) relative to the length of the text string itself: a long string with an LD of 2 might well refer to the same thing, while a short string with an LD of 2 might not.
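This relative check can be sketched as a small helper. The helper is hypothetical (the module's exact cutoff logic may differ); it takes a precomputed LD and uses the recommended 10% maximum difference as its default:

```javascript
// Accept a pair as similar only if the Levenshtein distance is at most the
// given percentage of the longer string's length.
function withinTolerance(ld, a, b, maxPercent = 10) {
  return (ld / Math.max(a.length, b.length)) * 100 <= maxPercent;
}

// Berlin <> Bern: LD 2 on a 6-character string is ~33% -> too different.
console.log(withinTolerance(2, 'Berlin', 'Bern')); // -> false

// Two apostrophes on a 42-character string: LD 2 is ~4.8% -> similar.
console.log(withinTolerance(2,
  'Grips-Theater gemeinnützige Gesellschaft',
  "'Grips'-Theater gemeinnützige Gesellschaft")); // -> true
```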
We did not come up with the methods implemented in this tool; they are common methods for solving such problems. Our implementation was strongly inspired by the open-source tool OpenRefine, a desktop application that helps users explore and clean their data.