Assignment 1: Web Server and CGI

CS2105, School of Computing, NUS, Feb 2007

Objective

In this assignment, you will learn how a Web application works. In particular, you will see how a Web client communicates with a Web-based application using Common Gateway Interface (CGI) through a Web server.

Pre-requisite

You are expected to know the format of HTTP request and response messages (Section 2.2), socket programming with TCP (Section 2.7), and how to write a simple Web server that supports GET (Section 2.9). Before attempting this assignment, you should read and understand the above-mentioned sections of the textbook.

The assignment will be done in a controlled UNIX environment. Familiarity with the UNIX environment (how to copy/move/delete/edit files, how to compile and run programs, etc.) is assumed.

You should use Firefox as your Web client for this assignment.

Administrative Matters

The assignment is due on 5th March, 2007, Monday, 9am. You have three weeks to complete the assignment (excluding the break week).

This is an individual assignment.

An account has been set up for you on host cs2105-z.comp.nus.edu.sg. To access your account, ssh to the host and log in using your SoC UNIX username and password from a SoC host or through SoC VPN.

If you have any questions or encounter any problems with the steps discussed in the assignment, please contact the teaching staff through CS2105's IVLE forum. Any important announcement regarding the assignment will be made by the teaching staff through IVLE forum as well.

Background

The earliest generation of the World Wide Web consisted only of static Web pages. Web clients sent HTTP requests to Web servers, which then read the requested Web objects (HTML files, images) and returned them to the Web clients in an HTTP response message.

Soon, people realized that the Web could be much more powerful: by allowing the Web server to interface with back-end applications, users can use the Web to access and store information on remote applications the same way they have been doing with applications on their local machines. This new way of interacting with remote applications enables web-based applications (eBay, Amazon, IVLE, CORS etc.). The Web server can be viewed as a gateway to these back-end applications. A standard called Common Gateway Interface, or CGI, was established to define how the client, the server, and the back-end applications (also termed CGI scripts) communicate.

The Web server's responsibilities are to (i) receive the HTTP request from the client, (ii) decide which CGI script should handle the client's request based on the URI supplied by the client, (iii) transform the client request into a CGI request, (iv) execute the CGI script, and (v) convert the CGI response into a response for the client.

Scripting languages such as Perl were a popular choice for writing CGI scripts. Several frameworks, such as ASP, JSP, PHP, ColdFusion, and Ruby on Rails, were later developed to provide rapid web application development with better support for templating HTML code and database access. Modern Web servers commonly integrate interpreters for these languages into the Web server itself to improve performance.

Your Tasks

In this assignment, you will modify the simple Web server given in the textbook to interface between Web clients and a given CGI script. To simplify the assignment, you are not expected to fully implement the HTTP and CGI standards as specified in the RFCs, but only enough for the given CGI scripts to function. The details of what your Web server should support, and some Java tips on how you can do it, are given below.

Processing HTTP Requests

The given Web server can already read HTTP requests from the client. Your Web server, however, should distinguish between different types of HTTP requests. For this assignment, we are interested in the GET and POST methods. The given code supports only HTTP GET on static files. Web servers are typically configured to recognize static files through file extensions. Common static file extensions are gif, jpg, htm, html, etc., whereas dynamic web pages can have extensions such as php, cgi, asp. For this assignment, the only file extension for dynamic web pages you need to recognize is ".pl" (for Perl scripts). When the requested file has extension ".pl", your Web server shall execute the Perl script (more on this later) instead of sending the content of the Perl script back to the client.

When the client uses GET to invoke a CGI script, the URL may encode additional arguments for the CGI script. Such a "script URL" typically looks like this:
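As a minimal sketch of the dispatch decision described above, the following Java fragment splits the first line of an HTTP request and checks whether the requested path ends in ".pl". The class name, the method names, and the sample query string action=edit are hypothetical and chosen only for illustration; the textbook server may organize this differently.

```java
public class RequestLineDemo {
    /** Splits a request line such as "GET /todo.pl HTTP/1.0" into method, target and version. */
    static String[] splitRequestLine(String requestLine) {
        return requestLine.trim().split("\\s+");
    }

    /** Returns true when the requested path (ignoring any query string) ends with ".pl". */
    static boolean isCgiScript(String requestTarget) {
        int q = requestTarget.indexOf('?');
        String path = (q >= 0) ? requestTarget.substring(0, q) : requestTarget;
        return path.endsWith(".pl");
    }

    public static void main(String[] args) {
        // Hypothetical request line, used only to exercise the two helpers.
        String[] parts = splitRequestLine("GET /todo.pl?action=edit HTTP/1.0");
        String method = parts[0];
        String target = parts[1];
        if (isCgiScript(target)) {
            System.out.println(method + " request for a CGI script: " + target);
        } else {
            System.out.println(method + " request for a static file: " + target);
        }
    }
}
```

The extension test is applied to the path only, so a query string that happens to end in ".pl" does not turn a static file into a script.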

http://www.google.com.sg/search?hl=en&q=CS2105+Networks&meta=

In the example above, search refers to the script that the server should run. The arguments to the script (e.g., what to search?) are given in the format of key=value. The ampersand & separates the key-value pairs, and the question mark ? separates the name of the script from the arguments.

Note that a value can be empty, and the arguments may need to be encoded due to restrictions on URL format. For instance, a plus "+" in the URL above represents a white space. This encoding is called URL encoding. You are not required to decode URL encoded strings in this assignment.

The section of a script URL after the question mark is called the query string. The query string is one way a Web client can send information to the server.

The other HTTP method we are interested in is the HTTP POST method. There are two forms of POST requests, differentiated by their "content type". To determine the content type of a POST request, you should look at the "Content-Type:" line of the HTTP request header.
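To make the anatomy of a script URL concrete, here is a small Java sketch that separates the script path from the query string and then lists the key-value pairs without decoding them. The class and method names are made up for this example; only the sample target is taken from the URL shown earlier.

```java
public class QueryStringDemo {
    /** Returns the part of the request target before '?', or the whole target if there is no '?'. */
    static String scriptPath(String target) {
        int q = target.indexOf('?');
        return (q >= 0) ? target.substring(0, q) : target;
    }

    /** Returns the part after '?', or the empty string when there is no query string. */
    static String queryString(String target) {
        int q = target.indexOf('?');
        return (q >= 0) ? target.substring(q + 1) : "";
    }

    public static void main(String[] args) {
        String target = "/search?hl=en&q=CS2105+Networks&meta=";
        System.out.println("script: " + scriptPath(target));
        System.out.println("query:  " + queryString(target));

        // '&' separates the pairs; the first '=' separates a key from its value.
        for (String pair : queryString(target).split("&")) {
            String[] kv = pair.split("=", 2);
            String value = (kv.length > 1) ? kv[1] : "";
            System.out.println("  " + kv[0] + " -> \"" + value + "\"");
        }
    }
}
```

Splitting into pairs is shown only to illustrate the format. For the assignment itself, the query string can be handed to the script as one undecoded string.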

The first content type is application/x-www-form-urlencoded. This type of POST request is not much different from GET request. The only difference is that the query string, instead of becoming part of the URL, is stored in the body of the HTTP request.
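The sketch below shows one possible way to pick out the Content-Type and Content-Length headers and then read exactly that many bytes of body. It deliberately reads from the raw InputStream rather than a BufferedReader, on the assumption that a buffered reader could consume part of the body while reading header lines. All class and method names are hypothetical, and the sample request is fabricated for the demonstration.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

public class PostBodyDemo {
    /** Reads one line ending in LF or CRLF; returns null at end of stream. */
    static String readLine(InputStream in) throws IOException {
        StringBuilder sb = new StringBuilder();
        int c;
        while ((c = in.read()) != -1) {
            if (c == '\n') return sb.toString();
            if (c != '\r') sb.append((char) c);
        }
        return sb.length() > 0 ? sb.toString() : null;
    }

    /** Reads header lines up to the blank line; header names are lower-cased. */
    static Map<String, String> readHeaders(InputStream in) throws IOException {
        Map<String, String> headers = new HashMap<>();
        String line;
        while ((line = readLine(in)) != null && !line.isEmpty()) {
            int colon = line.indexOf(':');
            if (colon > 0) {
                headers.put(line.substring(0, colon).trim().toLowerCase(),
                            line.substring(colon + 1).trim());
            }
        }
        return headers;
    }

    /** Reads exactly contentLength bytes of request body. */
    static byte[] readBody(InputStream in, int contentLength) throws IOException {
        byte[] body = new byte[contentLength];
        new DataInputStream(in).readFully(body);
        return body;
    }

    public static void main(String[] args) throws IOException {
        // Fabricated request standing in for what a browser would send over the socket.
        String request = "POST /todo.pl HTTP/1.0\r\n"
                + "Content-Type: application/x-www-form-urlencoded\r\n"
                + "Content-Length: 11\r\n"
                + "\r\n"
                + "q=CS2105&x=";
        InputStream in = new ByteArrayInputStream(request.getBytes(StandardCharsets.ISO_8859_1));

        String requestLine = readLine(in);
        Map<String, String> headers = readHeaders(in);
        byte[] body = readBody(in, Integer.parseInt(headers.get("content-length")));

        System.out.println(requestLine);
        System.out.println("type: " + headers.get("content-type"));
        System.out.println("body: " + new String(body, StandardCharsets.ISO_8859_1));
    }
}
```

Keeping the body as a byte array rather than a String avoids damaging binary uploads, which matters for the second content type.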

Using query strings to pass information from the client to the server has its limitations. An example is when the client needs to upload a file to the Web server (e.g., uploading a file to IVLE Workbins). It is not feasible to encode the whole file inside the query string. For this reason, there exists a second way of encoding information in a POST request.

The second content type is called multipart/form-data. Instead of using key-value pairs to encode information, the body of the HTTP POST request looks something like this:

Content-Type: multipart/form-data; boundary=LKJhl876x

   --LKJhl876x
   Content-Disposition: form-data; name="q"

   CS2105 Networks
   --LKJhl876x
   Content-Disposition: form-data; name="files"; filename="file1.txt"
   Content-Type: text/plain

   ... contents of file1.txt ...
   --LKJhl876x--

Since you are not required to parse the HTTP POST body, I will not go into details about the format above.

CGI scripts and CGI requests

Once your Web server receives an HTTP request (whether POST or GET) for a CGI script, it should execute the script and pass the necessary information (e.g. query strings, uploaded files) to the script for processing. This section explains how your Web server can do that.

There are two ways the Web server communicates with a CGI script. The first method is through environment variables. (See the Wikipedia entry on environment variables if you are not familiar with this term.) The CGI standard calls these environment variables meta-variables. According to the standard, there are many meta-variables that should, or must, be set for CGI scripts to run properly. For the purpose of this assignment, we are only interested in four environment variables: REQUEST_METHOD, QUERY_STRING, CONTENT_TYPE, and CONTENT_LENGTH. The Web server sets the environment variables to proper values before calling the CGI script. When invoked, the CGI script reads the values of the environment variables.

The second method to communicate with a CGI script is by writing directly into its standard input. When the HTTP POST method is used, the CGI script expects to read data from its standard input. By writing the data to the script's standard input, the Web server can pass the data to the script.

Which method to use and which variables to set depend on the HTTP method. The environment variable that you always have to set is REQUEST_METHOD. Your Web server should always set REQUEST_METHOD to either POST or GET, depending on the HTTP request received. If the REQUEST_METHOD is GET, your Web server should set QUERY_STRING to the query string of the HTTP request. Remember to set QUERY_STRING to the empty string when there is no query string in the URL.

When an HTTP POST request is received, your Web server must set CONTENT_TYPE and CONTENT_LENGTH to their appropriate values according to the HTTP request header, and then write the HTTP body of the POST request to the standard input of the CGI script. It is important to set CONTENT_LENGTH correctly and send the corresponding number of bytes to the CGI script.
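The rules above can be summarized in a short Java sketch that builds the list of KEY=value assignments for a request. The class name, method name, and parameter names are assumptions made for this illustration; the sketch follows only what this section states and sets nothing beyond the four variables named here.

```java
import java.util.ArrayList;
import java.util.List;

public class CgiEnvDemo {
    /**
     * Returns the KEY=value assignments for one request.
     * queryString may be null for "no query string"; contentType and
     * contentLength are only used for POST.
     */
    static List<String> cgiEnvironment(String method, String queryString,
                                       String contentType, String contentLength) {
        List<String> env = new ArrayList<>();
        env.add("REQUEST_METHOD=" + method);            // always set
        if (method.equals("GET")) {
            // Empty, not missing, when the URL has no query string.
            env.add("QUERY_STRING=" + (queryString == null ? "" : queryString));
        } else if (method.equals("POST")) {
            env.add("CONTENT_TYPE=" + contentType);     // copied from the request header
            env.add("CONTENT_LENGTH=" + contentLength); // must match the bytes written to stdin
        }
        return env;
    }

    public static void main(String[] args) {
        System.out.println(cgiEnvironment("GET", "hl=en&meta=", null, null));
        System.out.println(cgiEnvironment("GET", null, null, null));
        System.out.println(cgiEnvironment("POST", null, "application/x-www-form-urlencoded", "11"));
    }
}
```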

Output from CGI scripts

The CGI script prints its data to its standard output. The Web server can use a pipe to read this data. Your server should send the data back to the client, prefixed with the correct HTTP response code.
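A minimal sketch of this relay step follows. It assumes that the script prints its own header lines (such as Content-Type) followed by a blank line, as CGI scripts usually do, and that the response is always successful, so the server only needs to prepend a status line. The status line "HTTP/1.0 200 OK", the class name, and the method name are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

public class CgiRelayDemo {
    /** Writes a status line to the client, then copies the script's output unchanged. */
    static void relay(InputStream cgiOutput, OutputStream client) throws IOException {
        client.write("HTTP/1.0 200 OK\r\n".getBytes(StandardCharsets.US_ASCII));
        byte[] buffer = new byte[4096];
        int n;
        while ((n = cgiOutput.read(buffer)) != -1) {
            client.write(buffer, 0, n);
        }
        client.flush();
    }

    public static void main(String[] args) throws IOException {
        // Simulated script output; in the server this would be p.getInputStream().
        InputStream cgiOutput = new ByteArrayInputStream(
                "Content-Type: text/html\r\n\r\n<html>ok</html>".getBytes(StandardCharsets.US_ASCII));
        // Simulated client; in the server this would be the socket's output stream.
        ByteArrayOutputStream client = new ByteArrayOutputStream();
        relay(cgiOutput, client);
        System.out.println(new String(client.toByteArray(), StandardCharsets.US_ASCII));
    }
}
```

Copying bytes rather than lines keeps the script's output intact even if it is not plain text.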

For debugging purposes, it would be useful for your Web server to read from the standard error of the CGI scripts, and print the error messages to its own standard output/error.
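One way to do this, sketched below, is to drain the script's standard error on a separate thread so that reading error messages never blocks the relay of its standard output. The helper name forward, the "[cgi stderr]" prefix, and the sample message are invented for this example; in the server the stream would come from p.getErrorStream().

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PrintStream;

public class StderrForwardDemo {
    /** Starts a thread that copies each line of scriptStderr to log, and returns that thread. */
    static Thread forward(InputStream scriptStderr, PrintStream log) {
        Thread t = new Thread(() -> {
            try (BufferedReader reader = new BufferedReader(new InputStreamReader(scriptStderr))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    log.println("[cgi stderr] " + line);
                }
            } catch (IOException e) {
                log.println("[cgi stderr] read failed: " + e.getMessage());
            }
        });
        t.start();
        return t;
    }

    public static void main(String[] args) throws InterruptedException {
        // Simulated error stream with a made-up message.
        InputStream fake = new ByteArrayInputStream("example error message\n".getBytes());
        forward(fake, System.err).join();
    }
}
```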

Note that we will ignore standard error and standard output from your Web server during grading.

Your Web Application: To-Do List

In this assignment, you will be given two CGI scripts: one is for debugging purposes, and the other implements a simple web-based to-do list. This application allows users to add and delete to-do items. Each item has an associated description. You can add additional notes to each item. The notes can be edited. The rest of this section describes the functionality of this web application.

The figure above shows a screen shot of the to-do list application. The important elements of the user interface are labeled from A to G:

The next figure shows the screen when the edit button (B) is pressed.

The following are the user interface elements on screen.

How HTTP Requests Are Generated

The to-do application is implemented in the script todo.pl. Calling todo.pl directly (using GET without any query string) generates a list of to-do items. Clicking on buttons A and B generates a GET request with the appropriate query strings. Clicking on G generates a POST request with content type application/x-www-form-urlencoded. Finally, clicking on J generates a POST request with content type multipart/form-data.

(It might be easier to implement the features of your Web server incrementally in the order explained in the previous paragraph.)

How to Do It in Java

Calling an external program

You can use the exec() method of the Runtime class in Java to execute an external program. To execute a CGI script written in Perl, e.g. /home/o/ooiwt/a1/todo.pl, you can use the following code:

Runtime.getRuntime().exec("/usr/bin/perl /home/o/ooiwt/a1/todo.pl");

The call returns a Process object. You will need to use the Process object later.

Setting Environment Variables

You can set a series of environment variables in UNIX using the command env. From the UNIX command line, you can use env to modify the environment variables before calling another program as follows.

/usr/bin/env KEY1=value1 KEY2=value2 program

To set the environment variables before calling a CGI script within Java, you can use the exec() method of Runtime class just like before.
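As a sketch of how the two ideas combine, the following Java code builds an env command line as an array and passes it to exec(). The array form of exec() is used here because the single-string form splits its argument on whitespace, which would break values containing spaces. The class and method names are hypothetical, and main runs a small shell command as a stand-in for the Perl script so that the example does not depend on any particular script being present.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class EnvExecDemo {
    /** Builds: /usr/bin/env KEY1=value1 KEY2=value2 program [arguments...] */
    static String[] withEnv(List<String> assignments, String... program) {
        List<String> command = new ArrayList<>();
        command.add("/usr/bin/env");
        command.addAll(assignments);
        command.addAll(Arrays.asList(program));
        return command.toArray(new String[0]);
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        // Stand-in for "/usr/bin/perl <script>": a shell that echoes one variable.
        String[] command = withEnv(
                Arrays.asList("REQUEST_METHOD=GET", "QUERY_STRING="),
                "/bin/sh", "-c", "echo \"method=$REQUEST_METHOD\"");

        Process p = Runtime.getRuntime().exec(command);
        try (BufferedReader reader = new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        p.waitFor();
    }
}
```

Runtime also offers an exec(String[] cmdarray, String[] envp) overload that sets the environment directly. Note that with that overload the child process receives only the variables listed in envp.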

Reading and Writing to a Process

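The exec() method of the Runtime class returns a Process object. To read from and write to the process, you need handles to its input stream and output stream. These are similar to the input stream and output stream of a socket and can be used in exactly the same way (using BufferedReader to read and DataOutputStream to write). Note that the streams are named from your program's point of view: getOutputStream() is connected to the standard input of the process, and getInputStream() is connected to its standard output.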

For example, to send the string "hello" to the standard input of some process p,

DataOutputStream o = new DataOutputStream(p.getOutputStream());
o.writeBytes("hello");
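The following sketch puts writing and reading together in one round trip. It assumes a UNIX host where the cat command is on the PATH and uses it as a stand-in for a CGI script, since cat simply echoes its standard input to its standard output. The class and method names are hypothetical.

```java
import java.io.BufferedReader;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;

public class ProcessIoDemo {
    /** Sends input to the command's standard input and returns everything it prints. */
    static String pipeThrough(String[] command, String input)
            throws IOException, InterruptedException {
        Process p = Runtime.getRuntime().exec(command);

        // Write the input, then close the stream so the process sees end-of-input.
        try (DataOutputStream toProcess = new DataOutputStream(p.getOutputStream())) {
            toProcess.writeBytes(input);
        }

        // Read the standard output of the process line by line until it ends.
        StringBuilder output = new StringBuilder();
        try (BufferedReader fromProcess =
                     new BufferedReader(new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = fromProcess.readLine()) != null) {
                output.append(line).append('\n');
            }
        }
        p.waitFor();
        return output.toString();
    }

    public static void main(String[] args) throws IOException, InterruptedException {
        System.out.print(pipeThrough(new String[] {"cat"}, "hello\n"));
    }
}
```

Writing all input before reading any output is fine for small payloads. For large uploads it may be safer to read the output on a separate thread, because a process that fills its output pipe can stop reading its input.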

Your cs2105-z Account

An account on the server, cs2105-z, has been set up for you. From within SoC (or through SoC-VPN), ssh to cs2105-z using your SoC UNIX id and password.

Copy the files prepared for you to your home directory, by executing:

cp -r ~sadm/a1 .

You will see the following files:

Note that you must put your files inside the directory a1 directly under your home directory. This is the root of your Web directory.

File Permission

You are responsible for the security of your own source code. Please be careful and set the correct permissions for your files. They should not be readable by anyone except the owner (chmod 600 *.java will ensure that).

Port Numbers

Your Web server must listen on a non-standard port number. Once your Web server is up and running, you can connect to it through the Firefox browser on any SoC machine (or SoC-VPN enabled machine) by specifying the port number as part of the URL. For example, if your Web server is running on port 9090, use http://cs2105-z.comp.nus.edu.sg:9090/todo.pl to access the file todo.pl under your $HOME/a1 directory. You should, of course, write your Web server in such a way that it reads from $HOME/a1.

To make it easy to use a different port number, your Web server must take in the port number as a command line argument. For instance, to run your Web server on port 9090,

java WebServer 9090
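A minimal sketch of reading the port from the command line is shown below. The class name PortArgDemo and the helper parsePort are hypothetical, and the accepted range of 1024 to 65535 is an assumption made here (ports below 1024 normally require administrator privileges on UNIX), not a requirement stated in this assignment.

```java
import java.io.IOException;
import java.net.ServerSocket;

public class PortArgDemo {
    /** Parses the single command-line argument as a port number, or throws IllegalArgumentException. */
    static int parsePort(String[] args) {
        if (args.length != 1) {
            throw new IllegalArgumentException("usage: java WebServer <port>");
        }
        // A non-numeric argument raises NumberFormatException, a kind of IllegalArgumentException.
        int port = Integer.parseInt(args[0]);
        if (port < 1024 || port > 65535) {
            throw new IllegalArgumentException("port must be between 1024 and 65535");
        }
        return port;
    }

    public static void main(String[] args) throws IOException {
        int port;
        try {
            port = parsePort(args);
        } catch (IllegalArgumentException e) {
            System.err.println(e.getMessage());
            return;
        }
        // Open and immediately close a listening socket, just to show the port is usable.
        try (ServerSocket server = new ServerSocket(port)) {
            System.out.println("Listening on port " + server.getLocalPort());
        }
    }
}
```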

Note that all of you will be running your Web servers on the same host, and therefore each of you must use a different port number. To prevent collisions, you should avoid "nice" port numbers such as 8000 or 8080.

Submission and Grading

There is no need to submit the program by email or IVLE workbin. We will collect your assignment from your home directory on cs2105-z.comp.nus.edu.sg once the deadline has passed.

We will test your assignment automatically using a grading program. For this to work, you must not modify todo.pl in any way. If you suspect that there is a bug in todo.pl, please contact us by posting on the IVLE forum.

You MUST name your Java program WebServer.java. We will only compile this file when we grade. You MUST NOT implement additional classes in other *.java files.

Using Another Platform

If you would like to work on your assignment on another platform (Windows, Mac) that you are more familiar with, you are free to do so. But when you submit your assignment, you should ensure that your program runs properly on cs2105-z.comp.nus.edu.sg and that your WebServer.java is located under $HOME/a1 on cs2105-z.comp.nus.edu.sg.

Plagiarism Warning

You are free to discuss the assignment with your peers. But, ultimately, you should write your own code. We employ a zero-tolerance policy against plagiarism. If you are caught copying from another student, or letting another student copy your code, you will receive zero for this assignment. Further disciplinary action may be taken by the school.


Assignment designed by Ooi Wei Tsang, Jan 2007. Free icons provided by N.Design Studio