Fuzzy String Matching

November 10, 2016 Kyle Martin

Anyone who works regularly in Revit has seen just how disorganized naming conventions can get. Worksharing among a handful of people, let alone a dozen, can lead to many different forms of abbreviation, capitalization, and numbering. For example, on a recent project the production team decided to add the level name to each view and sheet. Permutations of Level, LEVEL, LVL, and L were combined with a number value X or 0X, the leading zero padding single-digit floors to the same number of characters as double-digit floors. For a 17-floor project this can lead to 102 possible combinations.

Dynamo is well known for its ability to automate tasks in Revit and save significant amounts of time. The key to applying it is being able to target the specific views and elements that need to change. When the naming of items is all over the place, it is very difficult to use Dynamo to its fullest potential. A lack of organization in the model may reduce or eliminate opportunities for automation and optimization later in the project.

Eric Rudisaile has recently released another tool for the Dynamo arsenal that makes this task a whole lot easier. The package is called FuzzyDyno and uses the computer science principle of approximate string matching (also known as fuzzy string matching) to make estimated pairings between two disparate lists of values. In fact, a version of fuzzy string matching is used in the search functions for the Library within Dynamo and the online Forum.
Fuzzy string matching has had useful applications since the earliest days of databases, when records spread across multiple databases needed to be matched to each other. Think, for example, of two sets of medical records that need to be merged. Each patient’s record contains many pieces of information, and each entry could have slight variations in spelling or small errors. To combine these databases, it is quite useful to be able to describe mathematically how similar two strings of characters are.

Like any good computer programmer, Eric didn’t set out to reinvent the wheel; instead he thought to check for open source implementations of the algorithms he wanted to use. The FuzzyDyno package uses the dynamic-link library (dll) FuzzyString[4], which contains a number of methods for comparing two strings or sets of information. The Jaro-Winkler algorithm[2] is the one used in FuzzyDyno, a suitable choice because its result is easy to interpret and it was designed for short strings. Jaro-Winkler is a variation on the Jaro distance, which measures how similar two words are by counting the matching characters that sit within a certain distance of each other in the two strings. Winkler modified the Jaro distance to favor strings that have matching prefixes, which is very useful when you are comparing words that may contain typos but are unlikely to start with totally different letters.
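To make the description above concrete, here is a minimal Python sketch of the textbook Jaro and Jaro-Winkler similarity. It is not the FuzzyString source: the function names are illustrative, the 0.1 prefix scale and four-character prefix cap are the commonly cited defaults and are assumptions here, and the library may differ in details such as case handling.

```python
def jaro(s1: str, s2: str) -> float:
    """Textbook Jaro similarity: 0.0 (nothing in common) to 1.0 (identical)."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters only count as matching if they are this close in position.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len2, i + window + 1)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Matched characters that appear in a different order are transpositions.
    out_of_order = 0
    k = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2

    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3


def jaro_winkler(s1: str, s2: str,
                 prefix_scale: float = 0.1, max_prefix: int = 4) -> float:
    """Jaro similarity boosted for strings that share a common prefix."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("Dynamo", "yDnamo"), 2))  # 0.94
print(round(jaro_winkler("Dynamo", "dynamo"), 2))  # 0.89
```

Because the comparison in this sketch is case-sensitive, a lowercase first letter removes both a matching character and the prefix bonus, which is why “dynamo” scores lower than the transposed “yDnamo”.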

String comparison algorithms come in many flavors, and each has its own quirks. The tables below show the outputs from three different algorithms and how each treats differences between strings. To explore how the algorithms differ, we will use variations of “Dynamo” and see how the algorithms rate our “misspellings”.

Note that the Jaccard Index[1] rates “Dynamo” and “yDnamo” as identical. It compares only the sets of letters the two strings contain, ignoring their order. Jaccard also rates “Dyno” as a better match than “Dinomo”: although “Dyno” is only four letters long, every one of its letters appears in “Dynamo”, whereas “Dinomo” introduces an “i” that does not.
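The behavior described above is consistent with a Jaccard index computed on the set of characters in each string (shared characters divided by all distinct characters). A minimal Python sketch under that assumption follows; the function name is illustrative and is not taken from FuzzyString.

```python
def jaccard_index(a: str, b: str) -> float:
    """Jaccard index of the character sets of two strings (case-sensitive)."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


for candidate in ["yDnamo", "Dyno", "Dinomo"]:
    print(candidate, round(jaccard_index("Dynamo", candidate), 3))
# yDnamo 1.0
# Dyno 0.667
# Dinomo 0.571
```

“Dyno” and “Dinomo” each share four letters with “Dynamo”, but the extra “i” in “Dinomo” enlarges the combined set from six letters to seven, which lowers its score.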

[Table: Jaccard Index results]

The Jaro-Winkler[2] algorithm places special importance on the early letters, and even the “d” not being capitalized in “dynamo” severely hurts its ranking. It does, however, recognize that “yDnamo” is not exactly the same: the score is high at 0.94, but it is not 1 as it is with the Jaccard Index.

[Table: Jaro-Winkler results]

The final table shows the Levenshtein distance, an example of an algorithm that calculates the “edit distance”[3]. The edit distance is the “minimum number of operations required to transform one string into the other”, where an operation is adding, removing, or substituting a single letter.
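As an illustration of the edit distance just described, here is a short Python sketch of the standard dynamic-programming approach to the Levenshtein distance. The function name is illustrative; this is not the FuzzyString implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string as the row to save memory
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


for candidate in ["dynamo", "Dymamo", "yDnamo", "Dyno", "Dinomo"]:
    print(candidate, levenshtein("Dynamo", candidate))
# dynamo 1
# Dymamo 1
# yDnamo 2
# Dyno 2
# Dinomo 2
```

Unlike the two similarity scores, a lower number here means a closer match, and swapping two adjacent letters costs two edits because plain Levenshtein has no transposition operation.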

[Table: Levenshtein distance results]

The node “GetBestMatch” uses these algorithms in series to narrow down its results to what is hopefully the best option. First, the Levenshtein distance keeps only the matches that require the lowest number of edits, which leaves “dynamo” and “Dymamo”. Then the Jaccard Index compares the remaining values, 0.714 for “dynamo” versus 0.833 for “Dymamo”, so the node returns only “Dymamo”.
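The two-stage narrowing described above can be sketched in Python as follows. This is only an approximation of the idea, not the node's actual code: the function names are hypothetical, the Jaccard index is assumed to work on character sets, and the candidate list is simply the set of misspellings mentioned in this post.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]


def jaccard_index(a: str, b: str) -> float:
    """Shared characters divided by all distinct characters."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0
    return len(set_a & set_b) / len(set_a | set_b)


def get_best_match(target: str, candidates: list[str]) -> list[str]:
    """Stage 1: keep candidates with the fewest edits.
    Stage 2: among those, keep the highest Jaccard index.
    Returns a list because ties can survive both stages."""
    if not candidates:
        return []
    distances = {c: levenshtein(target, c) for c in candidates}
    fewest_edits = min(distances.values())
    shortlist = [c for c in candidates if distances[c] == fewest_edits]

    scores = {c: jaccard_index(target, c) for c in shortlist}
    best_score = max(scores.values())
    return [c for c in shortlist if scores[c] == best_score]


candidates = ["dynamo", "yDnamo", "Dyno", "Dinomo", "Dymamo"]
print(get_best_match("Dynamo", candidates))  # ['Dymamo']
```

Running the cheaper, stricter filter first means the second score only has to break ties among candidates that are already equally close.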

Testing out FuzzyDyno on a very large healthcare project proved to be a relative success. Using simple list management techniques, the level designation can be extracted from each sheet name and then matched against a master list of keys. The FuzzyStringComparisons.JaroWinkler node returns a list of values in a range of 0 to 1, with 1 being a perfect match. You can therefore take the maximum value from each sublist and use its index to pair each sheet name with its best-matching key. Once an entire list of sheet names and their associated levels has been established, it can become a tool for standardizing naming conventions. Using string manipulation nodes and list management principles, each sheet name can be stripped of its original level, restructured, and rebuilt in a cohesive format.

In theory, this approach eliminates the need to match ALL potential derivatives of levels. However, the node still occasionally returns multiple values: for example, 1 and 10, or 02 and 20, may score the same because they are extremely close. As with many other Dynamo workflows that parse the BIM model for information, there are bound to be inconsistencies. Using FuzzyDyno gets a user 80% of the way there, followed by a round trip back into Revit for final validation via a Schedule or by navigating elements in the model.
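The matching workflow and the tie caveat described above can be sketched outside Dynamo in a few lines of Python. Everything here is a hypothetical stand-in: the sheet names, the master keys, and the assumption that the level is the last " - " separated field are invented for illustration, and the standard library's difflib.SequenceMatcher ratio is swapped in for Jaro-Winkler purely to keep the example short.

```python
from difflib import SequenceMatcher
from math import isclose

# Hypothetical master list of standardized level names.
MASTER_KEYS = ["Level 01", "Level 02", "Level 10"]


def extract_level(sheet_name: str) -> str:
    """Assume the level designation is the last ' - ' separated field."""
    return sheet_name.rsplit(" - ", 1)[-1].strip()


def similarity(a: str, b: str) -> float:
    """Stand-in score from 0 to 1 (difflib ratio, not Jaro-Winkler)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_keys(token: str, keys: list[str]) -> list[str]:
    """Score the token against every key and return all keys tied for
    the maximum, so ambiguous results stay visible."""
    scores = [similarity(token, key) for key in keys]
    top = max(scores)
    return [key for key, score in zip(keys, scores) if isclose(score, top)]


# Hypothetical sheet names with inconsistent level naming.
sheets = [
    "A-101 - Floor Plan - LVL 1",
    "A-102 - Floor Plan - level 02",
    "A-110 - Floor Plan - L10",
]

for sheet in sheets:
    token = extract_level(sheet)
    matches = best_keys(token, MASTER_KEYS)
    status = "ok" if len(matches) == 1 else "needs manual review"
    print(f"{sheet}: {token!r} -> {matches} ({status})")
```

With this stand-in score, “LVL 1” ties between the keys for levels 1 and 10, which is the kind of ambiguity worth flagging for manual validation back in Revit rather than resolving silently.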

FuzzyDyno is a perfect example of how principles from other industries can positively influence AEC production. Thanks to custom developers like Eric, additional capabilities are being added to Dynamo that allow for more intelligent list management and improved efficiency. Do you have a tricky situation that requires matching slight variations of a key term to a list of words? Perhaps FuzzyDyno can help! Download it from the Package Manager today…

[1] More about the Jaccard Index algorithm: https://en.wikipedia.org/wiki/Jaccard_index
[2] More about the Jaro-Winkler Distance algorithm: https://en.wikipedia.org/wiki/Jaro%E2%80%93Winkler_distance
[3] More about the Levenshtein (Edit) Distance: https://en.wikipedia.org/wiki/Edit_distance
[4] More about FuzzyString can be found here: https://fuzzystring.codeplex.com/

Eric Rudisaile is currently a Solutions Specialist at Microdesk, an Autodesk Platinum Partner. As a consultant for Microdesk, Eric aims to take BIM to the next level by specializing in workflow automation using C#, Python, or, if he’s lucky, Dynamo. Before joining Microdesk, Eric was a mechanical designer at Cosentini, working on a variety of spaces including mixed-use high-rises, labs, and theaters.

Kyle Martin works on healthcare, academic, and residential development projects at Shepley Bulfinch where he specializes in design technology. He is also the founder of the Dynamo-litia Boston user group and an adjunct instructor of visual programming and advanced BIM workflows at the Boston Architectural College.