CS1101C Practical Exam

Makeup Session

Document Word Frequency Counting

Your C program file must be named countdoc.c. Files with any other name will not be marked.

Certain words in the English language occur more often than others. Your objective is to write a program to find out which are the most commonly occurring words in the English language.

Your program must read a document text file called doc.txt and record the frequency of every word found in the text file. You can assume that each line in doc.txt will never exceed 100 characters, and that none of the words will exceed 30 characters. Assume doc.txt always exists. The file doc.txt may contain multiple lines, and some of these lines may be blank. Suppose doc.txt contains:

Why Play Bridge?

Why not? It is easy to learn! It is fun! Through duplicate play,
opportunities abound to meet new people, travel to exciting cities, and
experience the psychic rewards of being recognized as a winner!

The ABA, through its local chapters and clubs, offers many levels of
instruction and you may learn at your own pace. Your instructor will
introduce you to one or more of the many easy-to-read books on bridge,
some of which may be available at your local library. And, since bridge
players love to play, you can always find a more experienced player to
help you. Therefore, even if you do not play cards now, you can learn to
play and enjoy bridge. If you are an experienced card player, you
already know how much fun card playing can be.

So, consider joining the world of competitive bridge through a
membership in the ABA. Chances are that at least one of our over 5000
members lives in your city or town. He or she may be young or more
experienced, a college student, a retiree, a government employee or
official, a business person, or the teenage son or daughter of a bridge
player. Our diverse membership adds to the interest of duplicate bridge.
Membership and game fees are nominal.
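
The limits stated for doc.txt (at most 100 characters per line, possibly blank lines) suggest a fixed-size line buffer read with fgets(). The following is only a minimal sketch under those assumptions, not a required approach; the names MAX_LINE_LEN and count_nonblank_lines are hypothetical and chosen for illustration. It reads a file line by line and skips blank lines.

```c
#include <stdio.h>
#include <string.h>

#define MAX_LINE_LEN 100  /* line limit from the problem statement */

/* Count the lines in fp that contain at least one non-whitespace
   character. Blank lines are read but not counted. */
int count_nonblank_lines(FILE *fp)
{
    char line[MAX_LINE_LEN + 2];  /* +1 for '\n', +1 for '\0' */
    int n = 0;

    while (fgets(line, sizeof line, fp) != NULL) {
        /* strspn() returns the length of the leading run of whitespace;
           if that run reaches the terminator, the line is blank. */
        if (line[strspn(line, " \t\r\n")] != '\0')
            n++;
    }
    return n;
}
```

In a full program, the body of the loop is where each non-blank line would be split into words.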

Your program must read in doc.txt, convert every word to lowercase, count the frequency of every word that appears in the document, and write the output into the text file freq.txt. Assume that the list of delimiters is given by ",.?! :;()-" (do not forget the space). Overwrite the output file if it already exists.
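
One possible way to split a line into lowercase words is strtok() with the delimiter list above. This is a sketch only, under stated assumptions: the names DELIMS, to_lower_str and split_line are hypothetical, and "\r\n" is added to the delimiters because fgets() keeps the trailing newline (that addition is an assumption, not part of the given list).

```c
#include <ctype.h>
#include <string.h>

#define MAX_WORD_LEN 30  /* word limit from the problem statement */

/* Delimiters from the problem statement, plus "\r\n" (assumption:
   lines come from fgets(), which keeps the trailing newline). */
static const char DELIMS[] = ",.?! :;()-\r\n";

/* Convert a word to lowercase in place. */
void to_lower_str(char *s)
{
    for (; *s != '\0'; s++)
        *s = (char)tolower((unsigned char)*s);
}

/* Split line (which is modified in place) into lowercase words.
   Stores at most max_words words in out and returns how many
   were stored. */
int split_line(char *line, char out[][MAX_WORD_LEN + 1], int max_words)
{
    int n = 0;
    char *tok = strtok(line, DELIMS);

    while (tok != NULL && n < max_words) {
        strncpy(out[n], tok, MAX_WORD_LEN);
        out[n][MAX_WORD_LEN] = '\0';  /* strncpy may not terminate */
        to_lower_str(out[n]);
        n++;
        tok = strtok(NULL, DELIMS);
    }
    return n;
}
```

Because "-" is a delimiter, a token such as easy-to-read yields the three words easy, to and read.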

The format of the output file freq.txt is:

  1. Each line contains a word followed by a single space, then the word's frequency.
  2. The file must be sorted first in descending order of frequency (the word with the highest frequency comes first). Words with the same frequency must be sorted in alphabetical order (dictionary order), with digits coming before letters (the "word" 5000 comes before the word abound).
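
The ordering rule above can be expressed as a comparison function for the standard qsort(). This is a sketch under assumptions, not the only valid approach: struct entry and compare_entries are hypothetical names, and the tie-break relies on strcmp() in an ASCII-compatible character set, where digits sort before lowercase letters.

```c
#include <stdlib.h>
#include <string.h>

#define MAX_WORD_LEN 30  /* word limit from the problem statement */

/* Hypothetical record: one distinct word and its frequency. */
struct entry {
    char word[MAX_WORD_LEN + 1];
    int count;
};

/* qsort() comparator: higher frequency first; equal frequencies are
   ordered by strcmp(), which places digits before lowercase letters
   in ASCII. */
int compare_entries(const void *a, const void *b)
{
    const struct entry *x = a;
    const struct entry *y = b;

    if (x->count != y->count)
        return (x->count < y->count) ? 1 : -1;
    return strcmp(x->word, y->word);
}
```

With an array table holding n entries, the call would be qsort(table, n, sizeof table[0], compare_entries).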

The first ten lines of the output file freq.txt (based on the example doc.txt given) are shown below:

to 9
a 8
of 8
you 8
bridge 7
or 7
the 7
and 6
play 5
your 4

This is because the word to occurs nine times, the words a, of, and you occur eight times, and so on. Notice that since a, of, and you occur the same number of times, they are sorted into alphabetical order.

The full contents of the above output file can be found in the file freq_correct.txt in your directory.

A sample doc.txt file has been placed in your directory by the pesetup program.

We will test your program with other doc.txt files.

All the best!

Some useful UNIX commands (in case you forgot what you did in Lab 0):

  1. "ls -l": lists all the files in the directory.
  2. "cp a.txt b.txt": copies a.txt to b.txt.
  3. "mv a.txt b.txt": moves / renames a.txt to b.txt.
  4. "cat a.txt": displays the text file a.txt.