String Functions: substr, scan, tranwrd, cat

As much as we’d really like to say that we ONLY work with numbers, there are times when we need to work with string variables.  This post will use phone numbers as an example of a situation where you may need to create and change a string variable.  The following examples will work on any “string” variable.

The sample dataset we will be playing with contains 10 fictitious phone numbers:

Data ssample;
input ID phone $20.;
datalines;
1 512-555-1234
2 412-555-0143
3 612-555-7689
4 512-555-1346
5 812-555-4589
6 512-555-7856
7 912-555-0040
8 412-555-4594
9 512-555-3692
10 512-555-6400
;
Run;

SUBSTRING

substr function allows us to pull out part of a variable.  For instance, we want to create a new variable called areacode that contains the first 3 digits of the phone number.  The syntax for the substr function is:

newvar =  substr(variable, starting position, # of characters we want to retrieve);

For our example:

areacode = substr(phone, 1, 3);  <–  we want to create a new variable called areacode                                                                   which consists of 3 characters starting at the first                                                                   position of the variable phone in our dataset.

Data acode;
set ssample;
areacode = substr(phone,1,3);
Run;

Proc print data=acode;
Run;

 SCAN

scan function allows you to pull pieces of a string variable out by taking advantage of a delimiter that already is present in the variable.  If you have one of the following characters embedded in your string variable the scan function can be very useful.

blank  .  <  (  +  &  !  $  *  )  ;  ^  –  /  ,   %   |

In our dataset our phone numbers are delimited with the “-“.  We want to create 3 new variables from our phone number:

  • areacode=the first 3 numbers of the phone number
  • localcode = the next 3 numbers of the phone number
  • digits = the last 4 numbers of the phone number

The syntax for the scan function is:

newvar = scan(variable, n or position, optional “delimiter”)

For our example:

areacode = scan(phone, 1)  OR areacode = scan(phone, 1, “-“)  <–we are creating a new                                                                                                          variable called areacode                                                                                                          which consists of the                                                                                                                first section of the                                                                                                                      variable phone with the                                                                                                              delimiter “-”

Why will both statements work?  Why don’t I need to specify the delimiter?

When SAS sees the function scan it knows to seek out any of the delimiters listed above.  So it knows when you say scan(phone,1) to seek out the contents of the variable phone that are listed before the first occurrence of a delimiter.

Data all;
set ssample;
areacode = scan(phone,1);
local = scan(phone,2);
digits = scan(phone,-1);                 <– notice the -1 in the position – this denotes the                                                                         contents of the variable after the last delimiter
Run;

Proc print data=all;
Run;

TRANWRD

tranwrd function allows you to replace a part of a string variable with something new.  In our example, we are going to replace all instances of the dash “-” with a period “.”  Think of this as the search and replace function in Excel.  The syntax for tranwrd is:

newvar = tranwrd(variable, “old string”, “new string”);

For our example:

newphone = tranwrd(phone, “-“, “.”);    <– we are creating a new variable called newphone                                                                     where we are replacing all instances of the dash                                                                     “-” in the variable phone with a period “.”

Data tsample;
set all;
newphone = tranwrd(phone,”-“, “.”);
Run;

Proc print data=tsample;
Run;

CAT, CATT, CATS

Concatenation is a popular function that many of us for one reason or another.  The goal of concatenation is to create a new variable which is a combination of two or more other variables.  In SAS there are a number of such functions and I will only discuss cat, catt, and cats, as opposed to concat. Call me old school, but concat was the function I used in SAS for many, many years, but I just discovered cat, catt, and cats, and trust me when I say they will save you time over concat🙂

In our dataset all, we have 3 variables called areacode, local and digits.  What we want to do now is to put them all together and create a new variable called squeezephone.  To do this we will concatenate or add together the contents of the three variables into one, using the cat function.  The syntax for the cat function is:

newvar = cat(var1, var2, var3);

For our example:

squeezephone = (areacode, local, digits);

Data csample;
set tsample;
squeezephone = cat(areacode, local,digits);
Run;

Proc print data=csample;
Run;

When you look at the data, does it look odd?  If so, how?

There are spaces between areacode and local, as well as between local and digits.  These blanks are referred to as trailing blanks because they follow the variable.  To remove these trailing blanks, we simply use the catt function.

Data csample;
set tsample;
squeezephone = catt(areacode, local,digits);
Run;

Proc print data=csample;
Run;

By using the catt function, we now should see in the output the contents of squeezephone variable without any blanks.

If we had leading blanks, so blank spaces before our variable, we would use the catsfunction.

In the past, when I used the concat function, in order to remove the trailing blanks we would have to use the trim function in combination with the concat, so using the catt andcats allows us to make these changes with one function!

Screen Shot 2013-11-18 at 7.33.07 PM