Project 5: Web Server

Important Dates

Questions about the project? Send them to 354-help@cs.wisc.edu.

Due: Friday 12/14 by whenever.

Goals

To build a simple web server.
To understand how to use TCP/IP and sockets.
To become awesome at programming.

Security Warning

Please DO NOT leave your web server running for very long. A web server such as this could be a security vulnerability, allowing remote entities to read your files and perhaps worse. Thus, only run your server when testing, and make sure to kill it when you are finished.

Overview

In this project, you'll be building a basic web server that can be accessed via a normal web browser. In doing so, you will use TCP-based sockets and in general understand a bit better how the web works. Your server will have a few twists, though, beyond a standard one; more information on that below.

The basics of what you will do are pretty simple. Your web server is executed with a single command line argument, the port number on which it accepts connections. Thus, to run your web server and listen on port 10000, the user would type:

prompt> webserver 10000

Web servers are conceptually very simple. Clients communicate with web servers via a protocol known as HTTP, or hypertext transfer protocol; a simple tutorial (which you should read) is available here. Read this first!

A typical web client makes a request by connecting to the web server via a TCP-based socket, sending plain text over the TCP connection that indicates the nature of the request, and then waiting for the requested file in response. There are numerous types of requests, but (for simplicity) we're only going to focus on one type: the GET request. Read about HTTP in more detail to learn about other request types.

The GET Request is in ASCII and looks something like this:

GET /file.html HTTP/1.0
(optional other header stuff)

This request is easy to understand: it means the client (usually a browser) is requesting the file file.html located in the root directory of the web server (i.e., the directory that the web server is running in). Note some key aspects of the request: the name of the request type (GET in this case), the file name itself, starting with a slash, and then the protocol version (which will be either HTTP/1.0 or HTTP/1.1).
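To make the three parts of the request line concrete, here is a minimal sketch of pulling them apart in C. The function name parse_request_line and the buffer sizes are made up for illustration; you are free to structure your parsing differently.

```c
#include <stdio.h>

/* Hypothetical helper (not part of any provided code): split a request
 * line such as "GET /file.html HTTP/1.0" into its three fields.
 * Returns 0 on success, -1 if the line does not contain three fields. */
int parse_request_line(const char *line, char method[16],
                       char path[1024], char version[16])
{
    /* The width limits keep sscanf from overflowing the buffers.
     * %s stops at whitespace, so a trailing "\r\n" is ignored. */
    if (sscanf(line, "%15s %1023s %15s", method, path, version) != 3)
        return -1;
    return 0;
}
```

Given the example request above, this would leave "GET" in method, "/file.html" in path, and "HTTP/1.0" in version.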

The server, upon receiving such a request, replies by writing back a header (describing the file's length and type) as well as the file itself.

A typical reply (assuming a short 65-byte HTML file) thus looks like this:

HTTP/1.0 200 OK
Content-Length: 65
Content-Type: text/html

<html>
<body>
<p> This is a short HTML file.</p>
</body>
</html>

There are a couple of key aspects of the request and reply that you must pay close attention to in order to build a working web server. The first is this: when the server receives a request on a new connection, it must not only read in the first line (which has the GET request details) but also all of the rest of the message, all the way until it sees an empty line.

The second is in the response: note the empty line between the header information (which includes the response, content length, and content type) and the actual contents of the file. This empty line is critical for the client to understand what you are sending back to it.

The third point is that at the end of each line of a message (e.g., after the HTTP/1.0 200 OK line of the response header), a full carriage return and line feed (CRLF) is expected. In C, this is written as the two-character escape sequence \r\n.
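The second and third points can be seen together in a short sketch that formats the success header. The helper name build_ok_header is hypothetical; the header lines themselves are the ones from the sample reply above.

```c
#include <stdio.h>

/* Hypothetical helper: format the 200 OK header into buf.
 * Every header line ends in CRLF, and a bare CRLF (the empty line)
 * separates the header from the file contents that follow.
 * Returns the header length, or -1 if buf is too small. */
int build_ok_header(char *buf, size_t bufsize,
                    const char *content_type, long content_length)
{
    int n = snprintf(buf, bufsize,
                     "HTTP/1.0 200 OK\r\n"
                     "Content-Length: %ld\r\n"
                     "Content-Type: %s\r\n"
                     "\r\n",
                     content_length, content_type);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;
    return n;
}
```

For the 65-byte HTML file above, you would call it with "text/html" and 65, write the resulting header, and then write the file's bytes.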

Amazingly, this is about it for the basic description of what you have to build! But, there are a number of important details, which we now discuss.

TCP and Sockets

The server should start by reading in the port number from the command line, using the socket(), bind(), and listen() system calls to set up a listening socket on that port, and then enter a main loop. In that loop, the server should use the accept() call to accept incoming connections from clients.
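As a rough sketch of the setup step, the three calls might be combined as follows. The helper name open_listen_fd, the backlog of 16, and the use of SO_REUSEADDR are assumptions made for this example, not requirements of the project.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP socket listening on `port`.
 * Returns the listening descriptor, or -1 on failure. */
int open_listen_fd(unsigned short port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Optional: lets you restart the server quickly while testing
     * instead of waiting out "address already in use". */
    int yes = 1;
    setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &yes, sizeof yes);

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);  /* any local interface */
    addr.sin_port = htons(port);               /* network byte order */

    if (bind(fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(fd, 16) < 0) {                  /* 16 is an arbitrary backlog */
        close(fd);
        return -1;
    }
    return fd;
}
```

The main loop would then repeatedly call accept() on the returned descriptor, e.g. accept(listen_fd, NULL, NULL), and handle one client per iteration.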

Upon establishing a connection, the server should read from the connection descriptor (using the read() system call) until it sees the end of the HTTP request (i.e., until it sees an empty line consisting of just a CRLF, \r\n). NOTE: This is important! If you only read in the first line and not the entire header, the communication won't work as expected. The first line of the request is all we're really going to pay attention to here, though; it should be of the form GET /path/to/file.html HTTP/1.0 followed by a CRLF.
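Because a single read() may return only part of the request, the reading step usually ends up as a loop. Here is one possible sketch; the name read_request is hypothetical, and the approach of searching for "\r\n\r\n" (the CRLF that ends the last header line followed by the empty line's CRLF) is just one way to detect the end of the request.

```c
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: read from fd into buf until the end-of-header
 * marker "\r\n\r\n" has arrived. Returns the number of bytes read on
 * success, or -1 on error, early close, or a header too big for buf. */
ssize_t read_request(int fd, char *buf, size_t bufsize)
{
    size_t used = 0;
    if (bufsize == 0)
        return -1;
    while (used < bufsize - 1) {
        ssize_t n = read(fd, buf + used, bufsize - 1 - used);
        if (n <= 0)
            return -1;            /* read error, or the client closed early */
        used += (size_t)n;
        buf[used] = '\0';         /* keep buf a valid C string for strstr */
        if (strstr(buf, "\r\n\r\n") != NULL)
            return (ssize_t)used;
    }
    return -1;                    /* header did not fit in buf */
}
```

This sketch does not retry when read() is interrupted by a signal; whether you need that depends on how your server is set up.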

The server's job is then to parse the file name. In the example above ( /path/to/file.html ), the server should try to open the file path/to/file.html (thus stripping off the leading slash). It should then read that file into an in-memory buffer, in preparation for response to the request.
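One way to read a whole file into memory is sketched below using the standard I/O library; using open(), fstat(), and read() would work just as well. The name load_file is made up for this example.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: read the entire file at `path` into a malloc'd
 * buffer. On success, stores the file size in *len and returns the
 * buffer (the caller must free it). Returns NULL on any failure. */
char *load_file(const char *path, long *len)
{
    FILE *fp = fopen(path, "rb");   /* "rb": images must not be altered */
    if (fp == NULL)
        return NULL;

    char *data = NULL;
    long size;
    if (fseek(fp, 0, SEEK_END) == 0 && (size = ftell(fp)) >= 0) {
        rewind(fp);
        data = malloc(size > 0 ? (size_t)size : 1);
        if (data != NULL && fread(data, 1, (size_t)size, fp) != (size_t)size) {
            free(data);             /* short read: treat as failure */
            data = NULL;
        }
        if (data != NULL)
            *len = size;
    }
    fclose(fp);
    return data;
}
```

With the request path in a string named path (an assumed variable), stripping the leading slash is just load_file(path + 1, &len). The returned length is what goes in the Content-Length header.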

Finally, the server should send a reply by using the write() system call. The first part of the reply is the header which includes the protocol and response code, e.g., HTTP/1.0 200 OK . The 200 and OK parts are used by the client to know the request succeeded. The following two lines indicate the content length and type; the length is the length of the file, and the type (in this case) is text/html as this is an HTML file.

When the server has written the header (each line of which ends with a CRLF), followed by an empty line (with CRLF), and then the contents of the file, it should close the connection by calling close() on the socket connection.
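Keep in mind that write() on a socket is allowed to send fewer bytes than you asked for, which matters for larger files. A small loop, sketched here under the hypothetical name write_all, takes care of that.

```c
#include <unistd.h>

/* Hypothetical helper: keep calling write() until all `len` bytes of
 * buf have been sent. Returns 0 on success, -1 on failure. */
int write_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;
    while (sent < len) {
        ssize_t n = write(fd, buf + sent, len - sent);
        if (n <= 0)
            return -1;          /* error (or no progress): give up */
        sent += (size_t)n;
    }
    return 0;
}
```

A reply would then be two calls, one for the header and one for the file contents, followed by close() on the connection descriptor.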

At this point, the server should loop back up, waiting on the next connection to accept() , and thus continue operation indefinitely.

File Types

In addition to HTML, your server should know that when it returns a file that ends in .gif or .jpg, it should indicate the file's type as image/gif or image/jpg, respectively. Any other file type should be returned as text/plain.
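The mapping from file name to type can be a simple lookup on the extension. The sketch below assumes that HTML files are the ones ending in .html; the function name content_type_for is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: choose the Content-Type value from the file
 * name's extension. Assumes HTML files end in ".html". */
const char *content_type_for(const char *path)
{
    const char *dot = strrchr(path, '.');   /* last '.' in the name */
    if (dot != NULL) {
        if (strcmp(dot, ".html") == 0) return "text/html";
        if (strcmp(dot, ".gif") == 0)  return "image/gif";
        if (strcmp(dot, ".jpg") == 0)  return "image/jpg";
    }
    return "text/plain";                    /* everything else */
}
```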

Errors

Your server should be able to handle a few error conditions as well. For example, if a protocol request is not a GET, a 501 error should be returned. If a file that isn't accessible is requested, the 403 file access forbidden error should be returned. If the file does not exist at all, 404 not found should be returned.
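One possible way to decide among these cases is to try opening the file and look at errno when that fails. The sketch below is an illustration only: the name status_for is made up, and treating every failure other than "no such file" as 403 is an assumption you may want to refine.

```c
#include <errno.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: pick the status code for a request.
 * Returns 200 if `path` can be opened for reading, otherwise
 * 501, 404, or 403 as described above. */
int status_for(const char *method, const char *path)
{
    if (strcmp(method, "GET") != 0)
        return 501;                     /* only GET is implemented */

    int fd = open(path, O_RDONLY);
    if (fd >= 0) {
        close(fd);
        return 200;
    }
    if (errno == ENOENT || errno == ENOTDIR)
        return 404;                     /* the file does not exist */
    return 403;                         /* assumption: anything else is "forbidden" */
}
```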

Replying to the client during such an erroneous request is simple: again, the server should reply with a header such as this:

HTTP/1.0 501 Not implemented
Content-Type: text/html
Content-Length: ...

The header should be followed by an empty line (with CRLF!) and then a short piece of generated HTML that describes the error in more detail, e.g.,

<html>
<title>Error 501</title>
<body>
<p>Error 501: Not implemented</p>
</body>
</html>

The length field in the header should match the length of the generated HTML.

The exact errors are:

403 File access forbidden
404 File not found
501 Not implemented

Please match the error format as above to make grading possible.
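A sketch of generating the whole error reply, with the length field computed from the generated HTML, might look like this. The name build_error_response and the 512-byte body buffer are assumptions; the exact whitespace of the HTML body is also a guess based on the sample above.

```c
#include <stdio.h>

/* Hypothetical helper: format a complete error reply (header, empty
 * line, and HTML body) into buf. `reason` is the text from the list
 * above, e.g. "Not implemented". Returns the total length, or -1 if
 * buf is too small. */
int build_error_response(char *buf, size_t bufsize,
                         int code, const char *reason)
{
    char body[512];
    int body_len = snprintf(body, sizeof body,
                            "<html>\n"
                            "<title>Error %d</title>\n"
                            "<body>\n"
                            "<p>Error %d: %s</p>\n"
                            "</body>\n"
                            "</html>\n",
                            code, code, reason);
    if (body_len < 0 || (size_t)body_len >= sizeof body)
        return -1;

    /* Content-Length is the length of the generated HTML only. */
    int n = snprintf(buf, bufsize,
                     "HTTP/1.0 %d %s\r\n"
                     "Content-Type: text/html\r\n"
                     "Content-Length: %d\r\n"
                     "\r\n"
                     "%s",
                     code, reason, body_len, body);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;
    return n;
}
```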

File Parsing

There is one special case all web servers handle. Specifically, when the file / (just the slash) is requested, the server should instead try to return the file index.html .
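Combined with stripping the leading slash, this special case fits in a few lines. The function name local_name is hypothetical.

```c
#include <string.h>

/* Hypothetical helper: map the path from the request line to the name
 * of the local file to open. Returns NULL if the path does not start
 * with a slash. */
const char *local_name(const char *request_path)
{
    if (request_path[0] != '/')
        return NULL;
    if (strcmp(request_path, "/") == 0)
        return "index.html";        /* special case: bare slash */
    return request_path + 1;        /* strip the leading slash */
}
```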

Special Feature: Proxy

Once you get your basic web server working, congratulations! You are mostly done. However, there is one other neat feature your web server will employ. Specifically, you might receive a GET request in the following format:

GET /web/www.cs.wisc.edu/index.html HTTP/1.0

In this case, instead of trying to return the file web/www.cs.wisc.edu/index.html, your server should understand that this is actually a request for the server itself to perform an HTTP GET request for www.cs.wisc.edu/index.html and return its contents to the client. In this way, your web server can act as a sort of proxy to other web sites.
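Recognizing such a request comes down to checking for the /web/ prefix and splitting what follows at the first slash. The sketch below is one way to do it; the name split_proxy_path is made up, and treating a missing path after the host as a request for "/" is an assumption.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: if `path` starts with "/web/", copy the host
 * name into `host` and the remainder (starting at its slash) into
 * `remote`. Returns 1 for a proxy request, 0 otherwise. */
int split_proxy_path(const char *path, char *host, size_t hostsize,
                     char *remote, size_t remotesize)
{
    const char *prefix = "/web/";
    size_t plen = strlen(prefix);
    if (strncmp(path, prefix, plen) != 0)
        return 0;                       /* an ordinary local file */

    const char *rest = path + plen;     /* "www.cs.wisc.edu/index.html" */
    const char *slash = strchr(rest, '/');
    size_t hostlen = slash ? (size_t)(slash - rest) : strlen(rest);
    if (hostlen == 0 || hostlen >= hostsize)
        return 0;
    memcpy(host, rest, hostlen);
    host[hostlen] = '\0';

    /* Assumption: no slash after the host means the root was requested. */
    int n = snprintf(remote, remotesize, "%s", slash ? slash : "/");
    if (n < 0 || (size_t)n >= remotesize)
        return 0;
    return 1;
}
```

For the example request, this yields the host www.cs.wisc.edu and the remote path /index.html.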

Specifically, when such a file is requested, your server should create a connection to the target server and fetch the desired file. It should then store it in a local file in a directory called cache. The first such downloaded file should be called 1.html, the second 2.html, etc.
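Two small pieces of this step are sketched below: the text your server sends to the target server, and the name of the local cache file. Both function names are hypothetical. The Host line is an addition not mentioned above; HTTP/1.0 does not require it, but servers that host several sites on one address use it to tell which site you want. Making the connection itself (for example, looking up the host and calling connect() on port 80) is not shown.

```c
#include <stdio.h>

/* Hypothetical helper: format the GET request to send to the target
 * server. Returns the request length, or -1 if buf is too small. */
int build_upstream_request(char *buf, size_t bufsize,
                           const char *host, const char *remote_path)
{
    int n = snprintf(buf, bufsize,
                     "GET %s HTTP/1.0\r\n"
                     "Host: %s\r\n"     /* assumption: see note above */
                     "\r\n",
                     remote_path, host);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;
    return n;
}

/* Hypothetical helper: name of the index-th cached file, e.g.
 * "cache/1.html" for index 1. Returns the length, or -1 on overflow. */
int cache_file_name(char *buf, size_t bufsize, int index)
{
    int n = snprintf(buf, bufsize, "cache/%d.html", index);
    if (n < 0 || (size_t)n >= bufsize)
        return -1;
    return n;
}
```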

Then, your server should reply to the client by making the proper header and writing the contents of this file back to the client.

Here is the important part: if the client requests this same file again, your server should notice this and serve the cached local file instead of going to the web again. In this way, your server can speed up web performance via caching!
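To notice a repeated request, the server needs to remember which remote files it has already fetched. A minimal sketch is an in-memory table whose position doubles as the cache file number. The names and the fixed limits (128 entries, 1024-byte keys) are assumptions made for this example.

```c
#include <string.h>

#define MAX_CACHED 128      /* arbitrary limit for this sketch */
#define MAX_URL    1024

static char cached_urls[MAX_CACHED][MAX_URL];
static int cached_count = 0;

/* Hypothetical helper: returns the cache file number (1, 2, ...) for
 * `url`, or 0 if it has not been fetched yet. */
int cache_lookup(const char *url)
{
    for (int i = 0; i < cached_count; i++)
        if (strcmp(cached_urls[i], url) == 0)
            return i + 1;
    return 0;
}

/* Hypothetical helper: remember `url` as the next cached file.
 * Returns its file number, or 0 if the table is full or the URL is
 * too long to store. */
int cache_insert(const char *url)
{
    if (cached_count >= MAX_CACHED || strlen(url) >= MAX_URL)
        return 0;
    strcpy(cached_urls[cached_count], url);
    return ++cached_count;
}
```

On a proxy request, the server would call cache_lookup() first and only go to the web (and then call cache_insert()) when it returns 0. A sensible key is the part of the path after /web/, e.g. www.cs.wisc.edu/index.html.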

One other note: when your server starts up, it should clear the cache directory of any existing files; this should simplify cache management.
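Clearing the directory at startup could be sketched as follows. The name clear_cache_dir is hypothetical, and creating the directory when it is missing is an extra assumption, not something the project requires.

```c
#include <dirent.h>
#include <stdio.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

/* Hypothetical helper: remove every file in `dir`.
 * Returns 0 on success, -1 if the directory cannot be opened. */
int clear_cache_dir(const char *dir)
{
    /* Assumption: create the directory if it does not exist yet.
     * If it already exists, mkdir fails and we simply carry on. */
    mkdir(dir, 0755);

    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    struct dirent *entry;
    char path[1024];
    while ((entry = readdir(d)) != NULL) {
        if (strcmp(entry->d_name, ".") == 0 ||
            strcmp(entry->d_name, "..") == 0)
            continue;                   /* skip the directory itself and its parent */
        int n = snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);
        if (n < 0 || (size_t)n >= sizeof path)
            continue;                   /* name too long for this sketch */
        unlink(path);
    }
    closedir(d);
    return 0;
}
```

Calling clear_cache_dir("cache") once before entering the main loop is enough.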

Notes

This project can be done with a single partner. Copying code (from other groups) is considered cheating. Read this for more info on what is OK and what is not!

Handing It In

The handin directory is ~cs354-3/handin/login/p5 where login is your login.

Please include a makefile to build your web server.

Please also include a README file. In it, briefly describe what you did.