CS1102C Miniature 6

Three ways of string tokenization

In this miniature, I will discuss three ways of tokenizing a string: the C strtok() function, the C sscanf() function, and the C++ istringstream class. Most of the miniature is spent on strtok(), because it seems to be both the most widely used of the three and the most often misused.

The C strtok() function is declared in <cstring> (<string.h> in C) as follows:

char* strtok(char* s1, const char *s2);

The first time you call strtok(), pass the C string to tokenize as s1 and the set of delimiter characters as s2. In subsequent calls, pass NULL as s1 and the delimiter set as s2. Each call returns a pointer to the next token. When there are no more tokens, the function returns NULL. An example of using strtok() is shown below:

char str[] = " 12, 34,5678    90";
char* token;
token = strtok(str, " ,");
while (token) {
    cout << token << endl;
    token = strtok(NULL, " ,");
}

This example code tokenizes the C string str using space and comma as delimiters. The output is:

12
34
5678
90

We shall look at this example more closely using diagrams.

After the first call

The above diagram shows the character array after the first call to strtok(), which is made before the while loop. Note that str[3], originally a comma, has been overwritten with a terminating null, while token points to str[1].

After the second call

The above diagram shows the same character array after the second call to strtok(), which is made inside the while loop. str[7] is now set to the terminating null, and token points to str[5].

After the third call

The above diagram shows the character array after the third call to strtok(). str[12] is now set to the terminating null, and token points to str[8].

After the fourth call

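The above diagram shows the character array after the fourth call to strtok(). This time token points to str[16], and no delimiter follows the token, so it simply ends at the string's original terminating null at str[18]. The code segment then makes a fifth call to strtok(), which returns NULL, and execution breaks out of the while loop.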
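The four calls can also be checked in code. The sketch below is an illustration added to the walk-through, not part of the original example: trace_strtok() and TokenTrace are hypothetical names, and the helper simply records, for each token, the index where it starts and the index of the null character that ends it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <string>
#include <vector>

// One row of the trace: the token, where it starts, and where it ends.
struct TokenTrace {
    std::string text;   // the token itself
    std::size_t start;  // index of the token's first character in the buffer
    std::size_t end;    // index of the '\0' that terminates the token
};

// Tokenizes buf in place with strtok() and records the offsets of each token.
// buf must be a writable, null-terminated array; strtok() modifies it.
std::vector<TokenTrace> trace_strtok(char* buf, const char* delims) {
    assert(buf != NULL && delims != NULL);
    std::vector<TokenTrace> rows;
    for (char* token = std::strtok(buf, delims); token != NULL;
         token = std::strtok(NULL, delims)) {
        std::size_t start = static_cast<std::size_t>(token - buf);
        rows.push_back({token, start, start + std::strlen(token)});
    }
    return rows;
}
```

Calling trace_strtok(str, " ,") on the array from the example should give the start/end pairs 1/3, 5/7, 8/12 and 16/18, matching the positions described for the four calls.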

Here, we can make some important observations on how strtok() should be used:

This wraps up the description of how the C strtok() function works and how to use it.
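Because strtok() overwrites delimiters in the array it is given, one common precaution is to tokenize a copy rather than the original string. The following is a minimal sketch under that assumption; split_copy() is a hypothetical helper name, not something defined in this miniature.

```cpp
#include <cassert>
#include <cstring>
#include <string>
#include <vector>

// Splits text on any character in delims, leaving text itself untouched.
// split_copy is a hypothetical helper name used for illustration.
std::vector<std::string> split_copy(const std::string& text, const char* delims) {
    assert(delims != NULL);
    // strtok() needs a writable, null-terminated buffer, so copy the
    // characters (including the terminating '\0') into a vector first.
    std::vector<char> buf(text.c_str(), text.c_str() + text.size() + 1);
    std::vector<std::string> tokens;
    for (char* token = std::strtok(buf.data(), delims); token != NULL;
         token = std::strtok(NULL, delims)) {
        tokens.push_back(token);  // copies the token out of the scratch buffer
    }
    return tokens;
}
```

The copy costs one extra allocation, but the caller's string stays intact and can even be a const object or a string literal, neither of which may be passed to strtok() directly.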

Next, I will briefly discuss the C sscanf() function. I will provide an example here:

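const char* str = "Freddy  85  A";  // string literals are const in C++
char name[20];
int score;
char grade[4];
// The field widths (19 and 3) keep sscanf() from overrunning name and grade.
sscanf(str, "%19s %d %3s", name, &score, grade);
cout << "Name: " << name << endl;
cout << "Score: " << score << endl;
cout << "Grade: " << grade << endl;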

The output is:

Name: Freddy
Score: 85
Grade: A

We see that the sscanf() function is very similar to the scanf() function in terms of usage and functionality, except that sscanf() reads from the C string in its first argument instead of from the standard input.

However, the format string passed to sscanf() fixes the number and types of the fields, so a single call cannot tokenize strings which have a variable number of tokens. This is a major drawback; strtok() has no such limitation.
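Since the format string expects a fixed set of fields, it helps to check how many of them sscanf() actually converted, which is what its return value reports. The sketch below illustrates this; Record and parse_record() are hypothetical names, and the field widths are chosen to fit the array sizes used in the example above.

```cpp
#include <cassert>
#include <cstdio>

// A parsed "name score grade" record. Record and parse_record are
// hypothetical names used for illustration.
struct Record {
    char name[20];
    int score;
    char grade[4];
};

// Returns the number of fields sscanf() managed to convert (0 to 3).
// The widths 19 and 3 leave room for the terminating '\0' in each array.
int parse_record(const char* line, Record* out) {
    assert(line != NULL && out != NULL);
    out->name[0] = '\0';
    out->score = 0;
    out->grade[0] = '\0';
    int converted = std::sscanf(line, "%19s %d %3s",
                                out->name, &out->score, out->grade);
    // sscanf() returns EOF if the input ends before any conversion.
    return converted < 0 ? 0 : converted;
}
```

A caller would treat any result other than 3 as a malformed line; for instance, "Freddy  85" yields 2 because the grade field is missing.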

The C++ standard library provides a very useful class called istringstream, which can be used to perform string tokenization. To use this class, you have to include <sstream>. I will also provide an example:

string str = "Freddy  85  A";
string name;
int score;
string grade;
istringstream stream;
stream.str(str);
stream >> name >> score >> grade;
cout << "Name: " << name << endl;
cout << "Score: " << score << endl;
cout << "Grade: " << grade << endl;

The output is the same as the previous example:

Name: Freddy
Score: 85
Grade: A

Since istringstream is a subclass of istream, the class cin belongs to, you will find that using the istringstream class is pretty much like using cin. The only difference is that before you tokenize a string, you call the str() method to specify the string you want to tokenize. Then you tokenize the string as though you are reading from standard input.

The istringstream class can be used to tokenize strings having a variable number of tokens. To check whether there are any more tokens, test the result of the extraction itself, for example while (stream >> token). The eof() method only becomes true after a read has already reached the end of the string, so testing it before reading can produce one spurious token when the string ends in whitespace. If you want to use one instance of istringstream inside a loop to tokenize multiple strings, call the clear() method to reset the stream's error state (the eof and fail flags) before calling the str() method for each string that needs to be tokenized.
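Both points can be sketched together. The example below is an illustration only: tokenize_lines() is a hypothetical helper name, and the loop uses the extraction itself as its condition, which is one common way of detecting that no tokens remain.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Splits each line on whitespace, reusing one istringstream for all lines.
// tokenize_lines is a hypothetical helper name used for illustration.
std::vector<std::vector<std::string>> tokenize_lines(
        const std::vector<std::string>& lines) {
    std::vector<std::vector<std::string>> result;
    std::istringstream stream;
    for (const std::string& line : lines) {
        stream.clear();    // reset eofbit/failbit left over from the previous line
        stream.str(line);  // point the stream at the next string
        std::vector<std::string> tokens;
        std::string token;
        // The extraction is the loop condition: it fails, and the loop
        // ends, when no further token can be read.
        while (stream >> token) {
            tokens.push_back(token);
        }
        assert(stream.eof());  // the loop stopped because the input ran out
        result.push_back(tokens);
    }
    return result;
}
```

Without the clear() call, the flags set at the end of the first line would make every later extraction fail immediately, so all lines after the first would appear empty.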


This ends my discussion of the three ways of string tokenization. If you are interested, you may also want to find out how to format strings using the C sprintf() function and the C++ ostringstream class, which are the counterparts to the C sscanf() function and the C++ istringstream class.
