Fuzzy Matching with the SPEDIS function

Introduction:

We routinely use the SAS Data Step’s  MERGE statement to join SAS datasets that share either:  (a) a common numeric variable such as a student’s ID number or (b) a common character variable such as an student’s name.  The MERGE statement works great for such exact matching applications.   MERGE can join the datasets and also create datasets containing the observations that could not be matched.

But what do you use when you need to do fuzzy matching  i.e. matching on a common character variable, like student name, that may have a slightly different value in the datasets?   The SAS SPEDIS (“spelling distance”) function is ideal for these fuzzy matching applications.

A common use of SPEDIS is when a researcher is working with historical records such as census material, where names were handwritten by the enumerators (who often had handwriting even worse than mine!).  SPEDIS is used to help determine if Angus McDonald in the 1860 census is the same person as Angas Mcconald in the 1870 census.

A use of SPEDIS at Guelph:

In the CCS TestScoring System we use the SPEDIS function to help identify the correct name for students who have made an error when they shaded in their name on theirTestScoring Answer sheet.   These students won’t be found in the course’s Class List file, so their marks won’t be able to be imported into Colleague (Guelph’s Student Information System).

The CCS TestScoring System identifies the correct names for these students with the SPEDIS function.   Basically we use SPEDIS to compare a misspelled name from the TestScoring Answer Sheets with every name in the Class List file.   SPEDIS returns a number indicating how closely the names match in spelling.  The correct name for the student is the one where SPEDIS returns the lowest number.

An explanation of SPEDIS and an example of fuzzy matching from a book by SAS author Ron Cody.

A great explanation of fuzzy matching is found on SAS Institute’s web page at the URL below, where they show how SAS author-extraordinaire Ron Cody explains the SPEDIS function.   What follows below is copied from the URL below, and is acknowledged as completely the work of SAS Institute and SAS author Ron Cody.   We acknowledge and thank Ron Cody for providing such a great example and explanation of the SPEDIS function.

What follows is copied exactly from this URL:

http://blogs.sas.com/content/publishing/2011/08/15/sas-authors-tip-spedis-and-fuzzy-matching/

 

The following excerpt is from SAS Press author Ron Cody’s book Learning SAS by Example: A Programmer’s Guide,  Copyright © 2007, SAS Institute Inc., Cary, North Carolina, USA. ALL RIGHTS RESERVED. (please note that results may vary depending on your version of SAS software)

Performing a Fuzzy Match

The SPEDIS function (stands for spelling distance) is used for fuzzy matching, which is comparing character values that may be spelled differently. The logic is a bit complicated, but using this function is quite easy. As an example, suppose you want to search a list of names to see if the name Friedman is in the list. You want to look for an exact match or names that are similar. Here is such a program:

Program 12-18 Using the SPEDIS function to perform a fuzzy match

data fuzzy;
input Name $20.;
Value = spedis(Name,’Friedman’);
datalines;
Friedman
Freedman
Xriedman
Freidman
Friedmann
Alfred
FRIEDMAN
;

Here is a listing of data set Fuzzy:

data set from Ron Cody's Learning SAS by Example

The SPEDIS function returns a 0 if the two arguments match exactly. The function assigns penalty points for each type of spelling error. For example, getting the first letter wrong is assigned more points than misspelling other letters. Interchanging two letters is a relatively small error, as is adding an extra letter to a word.

Once the total number of penalty points has been computed, the resulting value is computed as a percentage of the length of the first argument. This makes sense because getting one letter wrong in a 3-letter word would be a more serious error than getting one letter wrong in a 10-letter word.

Notice that the two character values evaluated by the SPEDIS function are case-sensitive (look at the last observation in the listing). If case may be a problem, use the UPCASE or LOWCASE function before testing the value with SPEDIS.

To identify any name that is similar to Friedman, you could extract all names where the value returned by the SPEDIS function is less than some predetermined value. In the program here, values less than 15 or 20 would identify some reasonable misspellings of the name.

To learn more Ron Cody and all of his books, visit his author page. There you can read free book chapters, see reviews from other SAS users, listen to interviews, and find out more about his soon-to-be-published book SAS Statistics by Example. To receive notification about the availability of this new book, sign up here.

Screen Shot 2013-11-18 at 7.37.15 PM