Chapter 4: Understanding the Clickstream as a Data Source
One of the sources of data that will feed our data Webhouse is the HTTP clickstream itself: the log records produced by the Web server each time a request is satisfied. In this chapter we'll discuss the content of the clickstream and ways of handling the enormous volume of data that will be generated by a busy Website. We will introduce a clickstream post-processor that receives raw log data from a Web server and normalizes it into a format in which it can be combined with application-derived data and piped into the data Webhouse. The databases required for log processing at an active Website are comparable to the billing system of a large telephone company, both in volume and in complexity. Part Two of this book presents detailed architectures for databases that are capable of event tracking and content delivery for high-activity Websites.
In this chapter we describe how customers and Websites communicate with each other. We also show you how some important third parties, like banner ad providers and customer profilers, attach to your session and become part of the available data. We study in some detail how much information can be derived from a cookie and what the limitations of even a "good" cookie may be. We describe what is known as "referral" information, which is a potentially amazing source of insight into why the user arrived at your Website. From the referral information we should be able to sort out the customers who arrived for the right reasons, the customers who arrived for the wrong reasons, and perhaps what all of these customers were thinking about when they entered your site. We conclude the chapter by proposing an architecture for processing all of this data in the back room before it can become available in our databases for analysis.
Before we describe the specific data elements in the clickstream, it might be useful to review how a Web browser and Website interact.
WEB CLIENT/SERVER INTERACTIONS: A BRIEF TUTORIAL
Understanding the interactions between a Web client (browser) and a Web server (Website) is essential for understanding the source and meaning of the data in the clickstream. Please refer to Figure 4.1 in this discussion. In the illustration we have shown a browser, designated My Browser. We'll look at what happens in a typical interaction from my perspective as a browser user. The browser and Website interact with each other across the Internet using the Web's communication protocol, HyperText Transfer Protocol (HTTP).
Basic Client/Server Interaction
First, I click a button or hypertext link (URL) to a particular Website, shown as action (1) in Figure 4.1. When this HTTP request reaches the Website, the server returns the requested item (2). In our illustration, this is a document in hypertext markup language (HTML) format, your-page.html. Once the document is entirely retrieved, my browser scans your-page.html and notices several references to other Web documents that it must fulfill before its work is completed; the browser must retrieve the other components of this document in separate requests. Note that the only human action taken here is to click on the original link. All of the actions that follow in this example are computer-to-computer interactions triggered by the click and managed, for the most part, by instructions carried in the initially downloaded HTML document, your-page.html. In order to speed up Web page responsiveness, most browsers will execute these consequential actions in parallel, typically with up to ten or more HTTP requests being serviced concurrently.
The browser finds a reference to an image (a logo, perhaps) which, from its URL, is located at your-site.com, the same place from which it retrieved the initial HTML document. The browser issues another request to the server (3) and the server responds by returning the specified image.
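To make the follow-up requests concrete, the sketch below scans a small HTML document the way a browser might and lists the image requests it would have to issue, together with the host that would serve each one. It is only an illustration: the page content, the image paths, and the class name EmbeddedResourceFinder are assumptions made up for this example, and a real browser handles many more kinds of embedded references than images.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class EmbeddedResourceFinder(HTMLParser):
    """Collects the URLs of images referenced by an HTML document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                # Relative references resolve against the page's own URL.
                self.resources.append(urljoin(self.base_url, src))


# Hypothetical page content; the second image lives on a different host.
PAGE_URL = "http://your-site.com/your-page.html"
PAGE_HTML = """
<html><body>
  <img src="/images/logo.gif">
  <img src="http://banner-ad.com/ad.gif">
</body></html>
"""

finder = EmbeddedResourceFinder(PAGE_URL)
finder.feed(PAGE_HTML)

# One follow-up request per reference: (host that will log it, full URL).
follow_up_requests = [(urlparse(url).netloc, url) for url in finder.resources]
```

Each entry in follow_up_requests corresponds to a separate HTTP request, and therefore to a separate log record on whichever server satisfies it.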
Advertisements
The browser continues to the next reference in your-page.html and finds an instruction to retrieve another image from Website banner-ad.com. The browser makes this request (4), and the server at banner-ad.com interprets a request for the image in a special way. Rather than immediately sending back an image, the banner-ad server first issues a cookie request to my browser, requesting the contents of any cookie that might previously have been placed in my PC by banner-ad.com. The ad Website retrieves this cookie, examines its contents, and uses the contents as a key to determine which banner ad I should receive. This decision is based on my interests or on previous ads that I had been sent by this particular ad server. Once the banner-ad server makes a determination of the optimum ad, it returns the selected image to me. The advertisement server then logs which ad it has placed along with the date and the clickstream data from my request. Had the banner-ad server not found its own cookie, it would have sent a new persistent cookie to my browser for future reference, sent a random banner ad, and started a history in its database of interactions with my browser.
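The decision just described can be sketched in a few lines. This is a minimal illustration, not the logic of any real ad server: the cookie name visitor_id, the ad inventory, and the rule "send the first ad this visitor has not yet seen" are all assumptions standing in for whatever targeting the ad server actually applies. In HTTP, a cookie previously set by a site arrives in the Cookie header of the request, and a new cookie is placed with a Set-Cookie header in the response.

```python
import random
import uuid
from http.cookies import SimpleCookie

# Hypothetical inventory of banner images.
AD_INVENTORY = ["ad-sports.gif", "ad-travel.gif", "ad-books.gif"]


def serve_banner(cookie_header, history, inventory=AD_INVENTORY):
    """Pick a banner for one request and return (ad, response_headers).

    `history` maps a visitor id to the list of ads already sent to it.
    """
    cookie = SimpleCookie(cookie_header or "")
    headers = {}
    if "visitor_id" in cookie:
        # Known browser: use the cookie value as the key into our history.
        visitor_id = cookie["visitor_id"].value
        seen = history.setdefault(visitor_id, [])
        unseen = [ad for ad in inventory if ad not in seen]
        ad = unseen[0] if unseen else inventory[0]
    else:
        # Unknown browser: place a persistent cookie and start a history.
        visitor_id = uuid.uuid4().hex
        new_cookie = SimpleCookie()
        new_cookie["visitor_id"] = visitor_id
        new_cookie["visitor_id"]["max-age"] = 365 * 24 * 3600
        headers["Set-Cookie"] = new_cookie["visitor_id"].OutputString()
        history[visitor_id] = []
        ad = random.choice(inventory)
    history[visitor_id].append(ad)  # log which ad was placed
    return ad, headers
```

A first call with no cookie returns a Set-Cookie header and a random ad; a later call carrying that cookie returns no new cookie and an ad chosen from the visitor's history.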
The Referrer
The HTTP request from my browser to the banner-ad server carried with it a key piece of information known as the referrer. The referrer is the URL of the agent responsible for placing the link on the page. In our example, the referrer is "your-site.com/your-page.html". The referrer is not a browser. Because banner-ad.com now knows who the referrer was, it can credit your-site.com for having placed an advertisement on a browser window. This is a single impression. The advertiser can be billed for this impression, with the revenue being shared by the referrer (your-site.com) and the advertising server (banner-ad.com). If you are sharing Web log information with the referring site, it will be valuable to share page attributes as well. In other words, not only do you want the URL of the referring page, but you would also like to know what the purpose of the page was. Was it a navigation page, was it a partner's page, or was it a general search page?
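As a rough sketch of how an ad server might credit impressions, the function below reads the referrer from the request headers and counts one impression for the referring site. The function name credit_impression and the in-memory counter are assumptions for illustration; note that the HTTP header itself is spelled "Referer".

```python
from collections import Counter
from urllib.parse import urlparse

# Impressions credited to each referring site (hypothetical in-memory store).
impressions = Counter()


def credit_impression(request_headers):
    """Credit one impression to the site named in the Referer header.

    Returns the referring host, or None when no referrer was supplied.
    """
    referer = request_headers.get("Referer")
    if not referer:
        return None
    site = urlparse(referer).netloc
    impressions[site] += 1
    return site
```

For a request carrying the header value http://your-site.com/your-page.html, the function credits your-site.com. A request with no referrer is simply not credited to anyone.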
The Profiler
While the ad server deals primarily in placing appropriate content, the profiler deals in supplying demographic information about Website visitors. In our example, the original HTML document, your-page.html, had a hidden field that contained a request to retrieve a specific document from Website profiler.com (5). When this request reached the profiler server, the profiler.com server immediately tried to find its cookie in my browser. This cookie contained a userID that had been placed previously by the profiler, which is used to identify me and serves as a key to personal information contained in the profiler's database. The profiler might either return its profile data to my browser to be sent back to the initial Website, or send a real-time notification to the referrer, your-site.com, via an alternative path, advising the referrer that I am currently logged onto its site and viewing a specific page (6). This information could also be returned to the HTML document to be returned to the referrer as part of a query string the next time an HTTP request was sent to your-site.com.
Composite Sites
Although Figure 4.1 shows three different sites involved in serving the contents of one document, it is possible, indeed likely, that these functions will be combined into fewer servers. It is likely that advertising and profiling will be done within the same enterprise, so a single request (and cookie) would suffice to retrieve personal information that would more precisely target the ads that are returned. It is equally possible that a Web page contains references to different ad/profile services, providing revenue to the referrer from multiple sources.
PROXY SERVERS AND BROWSER CACHES
When a browser makes an HTTP request, that request is not always served from the server specified in a URL. Many Internet Service Providers (ISPs) make use of proxy servers to reduce Internet traffic. Proxy servers are used to cache frequently requested content at a location between its intended source and an end user. Such proxies are commonly employed by large ISPs like America Online and Earthlink. In some cases, an HTTP request may not even leave the user's PC. It may be satisfied from the browser's local cache of recently accessed objects.
Figure 4.2 illustrates several aspects of the proxy problem. Proxy servers can introduce three problems. First, a proxy may deliver outdated content. Although Web pages can include tags that tell proxy servers whether or not the content may be cached and when content expires, these tags are often omitted by Webmasters or ignored by proxy servers. Second, proxies may satisfy a content request without properly notifying the originating server that the request has been served by the proxy. When a proxy handles a request, convention dictates that it should forward to the intended server a message indicating that a proxy response has been made. This is not reliable. As a consequence, your Webhouse may miss key events that are otherwise required to make sense of the events that comprise a browser/Website session. Third, if the user has come through a proxy, the Website will not know who made the page request unless a cookie is present.
It is important to make liberal use of expiration dates and noproxy tags in the HTML content of your Website. This will help ensure that you are getting as much data as possible for your warehouse.
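One way to express these instructions is in the HTTP response headers sent with each page. The sketch below builds a Cache-Control and an Expires header; the function name cache_headers and its parameters are assumptions for illustration, and this is not presented as the chapter's own mechanism. In HTTP, the "private" directive tells shared caches such as proxies not to store the response, "no-cache" requires revalidation with the origin server before reuse, and "max-age" sets a lifetime in seconds.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def cache_headers(max_age_seconds=0, allow_shared_caches=False, now=None):
    """Build response headers that limit proxy and browser caching."""
    now = now or datetime.now(timezone.utc)
    # "private" keeps the response out of shared (proxy) caches.
    directives = ["public" if allow_shared_caches else "private"]
    if max_age_seconds > 0:
        directives.append(f"max-age={max_age_seconds}")
    else:
        # Force revalidation with the origin server on every use.
        directives.append("no-cache")
    expires = now + timedelta(seconds=max_age_seconds)
    return {
        "Cache-Control": ", ".join(directives),
        # HTTP dates use the RFC 1123 format in GMT.
        "Expires": format_datetime(expires, usegmt=True),
    }
```

With the defaults, every request for the page should reach the origin server and appear in its log, at the cost of extra traffic; a longer max_age_seconds trades log completeness for performance.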
The type of proxy we are referring to in this discussion is called a forward proxy. It is outside of your control because it belongs to a networking company or to an ISP. Another type of proxy server called a reverse proxy can be placed in front of your enterprise's Web servers to help them off-load requests for frequently accessed content. This kind of proxy is entirely within your control and usually presents no impediment to Webhouse data collection. It should be able to supply the same kind of log information as that produced by a Web server and discussed in the following section.
Browser Caches
Browser caches also introduce uncertainties in our attempts to track all of the events that occur during a user session. Most browsers store a copy of recently retrieved objects such as HTML pages and images in a local object cache in the PC's file system. If the user returns to a page already in his local browser cache (for example, by clicking the "back" button), no record of this event will be sent to the server, and the event will not be recorded. This means that we can never be certain that we have a full map of the user's actions. At best we can strive to obtain a tree representation of a session, with each leaf an object fetched from a Website and stamped with the time that the object was first requested by the browser.
As with proxies, we can attempt to force the browser to always obtain objects from a server rather than from cache by including appropriate "no cache" HTML tags, but we may choose not to do this because of performance or other content-related reasons.
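The tree representation can be sketched as follows. We assume, purely for illustration, that each logged hit has been reduced to a (timestamp, url, referrer) triple; the function name build_session_tree and the sample paths in the usage note are hypothetical. Each object is attached beneath the object that referred to it and is stamped with the time of its first request; later requests for the same object are ignored.

```python
def build_session_tree(hits):
    """Arrange one session's hits into a tree keyed by referrer.

    `hits` is an iterable of (timestamp, url, referrer) triples.
    Returns (roots, children, first_seen):
      roots      - objects with no known referrer in this session
      children   - maps each object to the objects it referred to
      first_seen - maps each object to the time it was first requested
    """
    roots, children, first_seen = [], {}, {}
    for timestamp, url, referrer in sorted(hits, key=lambda hit: hit[0]):
        if url in first_seen:
            continue  # keep only the first request for each object
        first_seen[url] = timestamp
        children[url] = []
        if referrer in children:
            children[referrer].append(url)
        else:
            roots.append(url)
    return roots, children, first_seen
```

Given hits for /your-page.html, then /logo.gif and /products.html both referred by it, the tree has one root with two children. A return to /your-page.html that was served from the browser cache would never appear in hits at all, which is exactly the gap described above.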
A similar uncertainty can be introduced when a user opens multiple browser windows to the same Website. The user may have multiple views of different pages of the site available on his PC screen, but there isn't any way for the Web server to know this.
WEB SERVER LOGS
All Web servers have the ability to log client interactions into one or more log files or databases or to pipe the log information to other applications in real time. These data elements are also available to be passed to real time applications using the Web server's Common Gateway Interface (CGI). Table 4.1 lists some of the typical data elements available from most Web servers.
The original standard for Web server logs was the Common Log Format (CLF), sometimes called the CLOG. This standard included the seven data elements checked in the CLF column in Table 4.1. Two additional elements were added in the Extended Common Log Format Standard (ECLF), and these are checked in the ECLF column of Table 4.1. Various Web servers add additional loggable parameters, but these are inevitably limited by the information contained in the basic HTTP protocol. The log data elements are discussed in more detail in the following paragraphs...
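As a rough illustration of what a post-processor has to do with such records, the sketch below parses one log line into its seven CLF elements plus the two ECLF additions. The field names used here (host, ident, authuser, date, request, status, bytes, referrer, user_agent) are the conventional ones and are an assumption on our part; they may not match the labels in Table 4.1. The sample lines in the usage note are invented.

```python
import re

# Seven CLF fields, optionally followed by the two ECLF fields.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<date>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-)'
    r'(?: "(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)")?$'
)


def parse_log_line(line):
    """Parse one CLF or ECLF log line into a dict, or None if malformed."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        return None
    record = match.groupdict()
    record["status"] = int(record["status"])
    # A "-" in the bytes field means no body was returned.
    record["bytes"] = 0 if record["bytes"] == "-" else int(record["bytes"])
    return record
```

For a plain CLF line such as `192.0.2.1 - - [10/Oct/1999:13:55:36 -0700] "GET /your-page.html HTTP/1.0" 200 2326`, the referrer and user_agent entries are None; when the two quoted ECLF fields follow the byte count, they are filled in.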