UW-Madison
Computer Sciences Dept.

Paper Write-ups for Measurement

Question: Describe the methodology employed in this paper. What traffic could be missed with this approach? Are there any statistics that the authors didn't report that you wish they had gathered? Do you think that access to the complete data would lead to different conclusions than those drawn by the authors?

Student

The methodology used in the paper was necessarily imprecise because of the restricted information available about the workings and infrastructure of the P2P systems. The authors use port numbers to identify the kind of data and protocol for P2P systems, and IP addresses for distributed content delivery systems like Akamai. Given the lack of public information about these commercially operated systems, the methods are justified. The only major flaw in the methodology was the use of university resources, which are dominated by poor undergraduate and graduate students who tend to be stingy!

I believe they did leave out a significant portion of the data transferred on the Internet by not breaking out FTP transfers (which use TCP but fall into the unanalyzed non-HTTP category). Also of concern is that the P2P analysis concentrated on Kazaa and Gnutella, while other P2P systems existed, even Usenet (a different kind of system, but nevertheless peer-to-peer in spirit). And port numbers are not static: later generations of P2P clients dynamically change ports periodically. Otherwise their assumptions about P2P systems are correct (the paper is relatively recent, and the P2P systems widely used at the time did not employ those techniques). Other blind spots were tunnelling and VPNs, which could not possibly be monitored because of encryption. The age of the data also makes a difference, and I am unsure of network usage patterns four years back.

I believe the FTP usage could have caused some shift in the data analyzed, and P2P systems, given the legal pressure on them, could have used camouflaging techniques such as IP masquerading, which would have confused the analysis. But I still believe the error margin could not have been large, because they have taken all the major factors into account.

Student

They monitored all incoming and outgoing TCP packets on their two links to ISPs. That means they don't monitor UDP traffic, but they later claim that it is only about 3% of the traffic anyway. It also means that they don't monitor traffic that stays within their local network, so Kazaa may actually generate even more traffic than they think, thereby affecting their conclusions.

They also didn't identify all Akamai hosts: they found only 3,966 of the 13,000+ Akamai servers. That could in theory mean that a higher percentage of the WWW traffic is actually Akamai traffic (although it is probably unlikely that clients are served by more than a small number of Akamai servers, since that would defeat the whole purpose of a network like Akamai). I can't think of any other statistics I would have wanted investigated in 2002, but the current situation is completely different. Kazaa isn't used as much anymore, and P2P traffic has shifted toward other networks, news servers, and torrents. Also, now that broadband is becoming more common, there might be more UDP traffic due to streaming.

Student

The methodology used in this paper to study network data is to capture all outgoing and incoming traffic through the University of Washington's connection to the outside world. The captured data is then categorized by delivery system (WWW, Akamai, P2P, etc.) and content type (pictures, video, audio, etc.). One drawback of this approach is that it does not record network data exchanged between hosts on campus: users accessing university web pages are not represented in the data, nor are P2P users sharing files with other university users. It is also difficult with this approach to categorize content type for data sent over encrypted connections. For example, captured data bound for port 443 is most likely encrypted, so its content type cannot be accurately reported.

One statistic that I would like to see is the time taken to download objects from each content-delivery category. The authors claim that because P2P objects are roughly 1000 times larger than www objects, the transfer times are roughly 1000 times greater for P2P objects on average. It is my guess that transfer times for P2P objects are actually larger than what is claimed by the authors. While web objects are usually served by high-bandwidth web servers, P2P servers generally have low-bandwidth connections, slowing transfer times even further.
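The reviewer's intuition can be illustrated with a back-of-the-envelope calculation; all of the numbers below are invented for illustration, not taken from the paper:

```python
# Back-of-the-envelope: transfer time = object size / link bandwidth.
# All numbers are invented for illustration, not from the paper.
web_object = 10e3                 # 10 KB typical web object (assumed)
p2p_object = web_object * 1000    # ~1000x larger, per the paper
server_bw = 10e6 / 8              # 10 Mbit/s web server, in bytes/s (assumed)
peer_bw = 0.5e6 / 8               # 500 kbit/s P2P peer, in bytes/s (assumed)

t_web = web_object / server_bw    # seconds to fetch the web object
t_p2p = p2p_object / peer_bw      # seconds to fetch the P2P object
print(f"web: {t_web*1000:.0f} ms, P2P: {t_p2p:.0f} s, "
      f"ratio: {t_p2p/t_web:.0f}x")
```

With these made-up numbers the transfer-time ratio is 20,000x rather than 1,000x, which is exactly the reviewer's point: slower peer uplinks amplify the size difference.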

I also think it would be interesting to see statistics when using a Gnutella cache versus a Kazaa cache. The authors explain that Kazaa tries to do a better job with exploiting physical locality than Gnutella. This might mean that a lot of Kazaa traffic is missed because the files are being transferred within the university. Thus, a Gnutella cache might perform even better than the Kazaa cache.

If we had access to the complete data, I don't think the major conclusions drawn would be much different. However, the conclusions about the nature of P2P systems might be. The authors claim that P2P networks do not do a good job of spreading workload for scalability, but they use only Kazaa data to back up that claim. Since Gnutella makes no effort to exploit physical locality, it might do a better job of spreading the workload across different servers.

Student

1. Describe the methodology employed in this paper

The work described in the paper analyzes inbound and outbound traffic at the University of Washington over a nine-day period. All packets passing through four border switches are captured and examined, and TCP connections are inferred (using an approach described in another paper). By examining connection ports and destination servers, traffic is divided into web (non-Akamai), web (Akamai), Kazaa, and Gnutella. The traffic is then analyzed (and the analyses presented in graphs) using metrics such as the total bandwidth used by each type of traffic, the percentage of bandwidth used by the top consumers, etc.
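The port-and-server bucketing described above can be sketched as follows. The function name and example IPs are illustrative, though the port numbers (1214 for Kazaa, 6346/6347 for Gnutella) are the well-known defaults such classifiers relied on:

```python
# Illustrative sketch of port-based traffic classification in the spirit
# of the paper. Ports are the well-known defaults of the era: 80/8080/443
# for web, 1214 for Kazaa, 6346/6347 for Gnutella. Function name and IPs
# are hypothetical, not from the paper.
def classify_flow(server_port, server_ip, akamai_servers):
    """Assign a TCP flow to one of the paper's traffic categories."""
    if server_port == 1214:
        return "kazaa"
    if server_port in (6346, 6347):
        return "gnutella"
    if server_port in (80, 8080, 443):
        # Akamai traffic is web traffic served from a known Akamai host
        if server_ip in akamai_servers:
            return "web-akamai"
        return "web"
    return "other"  # non-HTTP TCP traffic, not analyzed in the paper

akamai = {"203.0.113.7"}  # example server list (placeholder address)
print(classify_flow(1214, "198.51.100.2", akamai))  # kazaa
print(classify_flow(80, "203.0.113.7", akamai))     # web-akamai
```

Note that this scheme silently misclassifies any P2P client configured to use a non-default port, one of the weaknesses several reviewers point out.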

2. Are there any statistics that the authors didn't report that you wish they had gathered?

Section 5.3 and Figure 10 graph the CDF of the percentage of bytes used by the top N (up to 1000) bandwidth consumers. The small number of Gnutella users gives a very skewed curve. It would probably have been more informative to instead plot the CDF against a percentage of users (e.g., how much bandwidth is used by the top 10/20/30/40% of Kazaa, Gnutella, or WWW users?).
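The suggested normalization (bandwidth share of the top X% of users) is straightforward to compute; the per-user byte counts below are invented for illustration:

```python
# Sketch: fraction of total bytes consumed by the top `percent`% of
# users, the normalization suggested above. Byte counts are invented.
def top_percent_share(bytes_per_user, percent):
    ranked = sorted(bytes_per_user, reverse=True)
    k = max(1, round(len(ranked) * percent / 100))
    return sum(ranked[:k]) / sum(ranked)

usage = [500, 300, 100, 50, 30, 10, 5, 3, 1, 1]  # bytes per user (made up)
print(f"Top 10% of users consume {top_percent_share(usage, 10):.0%} of bytes")
```

Because each population is normalized by its own size, the Kazaa, Gnutella, and WWW curves become directly comparable even though the user counts differ by orders of magnitude.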

The authors recommend that the ISP implement an outbound P2P cache, thus saving UW bandwidth. While not in the scope of the paper, I would like to know how much cheaper the overall cost would be (how much the cache would cost vs. the cost of the bandwidth saved). In any case, it would certainly not be in the economic interest of the ISP to pay extra money for a cache that would reduce the bandwidth used by its customer.

3. Do you think that access to the complete data would lead to different conclusions than those drawn by the author?

I do not think that the conclusions drawn would change based on the data requested above.

Student

In this paper, the authors compare four different content delivery systems, namely HTTP web traffic, Akamai CDN traffic, and Kazaa and Gnutella peer-to-peer traffic, with respect to bandwidth consumption, number of transactions, net bytes transferred, etc. The study was undertaken at the University of Washington by monitoring traffic flowing across switches between the university's internal network and the external Internet. The monitoring hosts extract metadata from the incoming and outgoing TCP packets and log this information. The traffic is then categorized as belonging to one of the four systems based on port numbers and server classification (Akamai server or not).

The analysis undertaken is not comprehensive and could have missed some important traffic. First, the study does not present any data on non-TCP traffic, so UDP-based traffic has not been accounted for (and FTP transfers, though TCP-based, are not separately reported). Traffic from P2P systems other than Kazaa and Gnutella has also not been explicitly quantified. The authors do not present any statistics about the non-HTTP TCP traffic (consisting of news, mail, streaming media, other P2P systems, and search traffic on Gnutella and Kazaa), which accounts for 43% of the total TCP traffic.

Another important statistic that has not been brought out is the traffic generated and served within the university network, for instance file sharing among Kazaa users at the university. Since the monitoring is done at the border routers, the traces do not account for internal CDN traffic. Also, the data was collected on a university network, which may not be representative of general Internet usage. Another variable is the period over which the traces were obtained, as the network behavior of university students (who constitute a large part of the user population) may differ greatly during vacation versus exam week! However, I believe that the major conclusion of this paper, the domination of P2P systems in modern-day Internet traffic, will not change with this additional data.

Student

The paper's methodology extracts and classifies network traffic using passive monitoring of traffic flowing in and out of UW. The border routers of the university provide the capability to send copies of incoming and outgoing packets to a monitoring host. TCP packets obtained this way are reassembled, and flows involving HTTP exchanges are identified (these include Kazaa and Gnutella too). Port numbers are used as the basis for classifying traffic into the corresponding buckets. Akamai traffic is filtered using a list of Akamai servers, built by issuing DNS queries for names in Akamai-managed domains. Gnutella traffic was captured effectively by exploiting the fact that Gnutella does not locate hosts based on physical topology and thus connects to peers outside the university. Kazaa traffic was traced while accounting for objects being broken into fragments.
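Building such a server list by resolving candidate names might look like the sketch below; the hostnames passed in are placeholders, not actual Akamai names:

```python
# Sketch: build a set of server IPs by resolving candidate hostnames,
# analogous to the paper's DNS-based Akamai server list. Input hostnames
# are placeholders; gethostbyname_ex returns all IPv4 A records.
import socket

def collect_server_ips(hostnames):
    ips = set()
    for name in hostnames:
        try:
            _canon, _aliases, addrs = socket.gethostbyname_ex(name)
            ips.update(addrs)
        except socket.gaierror:
            pass  # name does not resolve; skip it
    return ips

print(collect_server_ips(["localhost"]))  # typically {'127.0.0.1'}
```

An approach like this is inherently incomplete (it finds only the servers the local DNS happens to return), which is consistent with the partial Akamai coverage other reviewers note.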

The paper considers only TCP traffic and completely ignores UDP traffic in and out of the university. Common applications that use UDP include DNS, streaming media, voice over IP, and online games. Some of these, especially games and streaming media, clearly contribute to bandwidth consumption, though the study mentions that the traffic is predominantly TCP. However, the scope of the paper (content distribution) might limit the relevance of some of these applications. Most of the data presented in the paper relates to the four content delivery systems and classifies the rest of the traffic as "other" non-HTTP traffic. The paper, for the most part, does not discuss this kind of traffic, which could include other file-sharing applications such as BitTorrent or Napster that are also heavily used today.

Different file-sharing applications might break files into chunks of varying sizes, and the distribution of these files over the Internet may differ. However, the principles remaining the same, I would expect caching to provide similar bandwidth savings by making the required files or fragments available. It would be interesting, however, to see the effects of such a scheme quantitatively. (An important observation here is that the paper refers only to Kazaa when reporting certain metrics, such as the effectiveness of caching, and ignores Gnutella.)

Student

This paper examined the traffic flow of content delivery systems, focusing largely on Web vs. P2P traffic flows, and specifically on the WWW, Akamai, Kazaa, and Gnutella delivery systems. The methodology employed passive network monitoring of all traffic passing through the border routers between the University of Washington (UW) and the rest of the Internet. Custom monitoring software reconstructed all TCP flows in and out of UW and categorized each flow as HTTP or non-HTTP traffic by decoding HTTP headers and metadata. HTTP traffic was further categorized into WWW, Kazaa, or Gnutella requests by looking at the destination ports; Akamai traffic was recognized by the source IP address being registered to Akamai in DNS.

Although the authors did a good job of analyzing the data collected, several potential deficiencies in their data collection methodology may have impacted their conclusions. My primary concerns revolve around time and volume. For time, the authors mention they monitored for nine days. Especially considering that a university environment is a relatively homogeneous population with respect to the calendar, I would ideally have preferred a much longer data sample. For instance, did these nine days occur during spring break, when most interactive WWW users on campus are gone? Or during final exams? When the sample was taken could dramatically change the relative amounts of the different kinds of traffic.

As for volume, the data was collected solely at the border routers of one institution, so it is questionable how applicable the authors' conclusions are to other institutions, especially since so much traffic was directed to a relatively small number of Kazaa servers. For instance, the authors conclude that a P2P client creates significant traffic in both directions. Perhaps this is because a small population of students had particularly juicy content in high demand during the monitoring period (e.g., a complete season of the Sopranos right before the new season starts), or because the dorm rooms at UW are connected to very high-bandwidth networks that make Kazaa prefer peers at UW over others on the Internet. Also, the fact that only the border routers were monitored could have skewed conclusions. For example, the authors mention that the P2P systems may benefit from a reverse cache at the border. But without data from routers inside UW, it is unclear how much retrieval the P2P systems were doing inside the UW network -- perhaps they were only going out to the Internet because internal peers were already saturated.

Finally, the authors only decoded HTTP protocols. This ignores P2P systems that may use a protocol other than HTTP, as well as ignoring UDP traffic. How much of the non-HTTP traffic is really a P2P transfer is unknown. Also, relying on port numbers to identify a specific content delivery system is not completely reliable.

Student

The paper seems to be a fairly exhaustive analysis of content delivery systems. The results of the paper are significant and give useful insight into the nature, scaling performance and impact on internet bandwidth of P2P systems.

The methodology for the analysis involves collecting traces of the traffic flowing between the University of Washington and the rest of the Internet. This is done by sending a copy of each incoming and outgoing packet to a monitoring host, which has a packet filter installed to deliver TCP packets to a user-level process. This process derives useful information from the packets, essentially by reverse-engineering the protocols, and the data is then logged for analysis.

The authors classify the traffic into two classes: HTTP-based and non-HTTP-based. The first kind captures WWW, Akamai, Kazaa, and Gnutella transfers. The second kind, which includes all other TCP traffic, is ignored. However, it represents a significant 43% of the total traffic, and hence a large amount of traffic is essentially missed. An analysis of this traffic could be interesting, as it would include TCP requests on port 21 (FTP) and would capture file transfers on non-P2P systems. It would be interesting to know the object size distribution of these files, the bytes transferred in both directions (incoming and outgoing), and the amount of data transferred this way as compared to P2P systems. Also, the combined WWW, FTP, and streaming media traffic (which represented most of the Internet traffic before P2P systems were deployed) could come significantly closer to the P2P bandwidth load, leading to different conclusions than those drawn by the authors. If this is not the case, then the results of the paper would, on the other hand, become even more interesting.

Also, it would be useful to capture the WWW and P2P traffic within the university itself. In the approach used by the authors, only traffic that crosses the border routers is examined. Internal data is essential for analysing Kazaa more accurately (the paper says, "Kazaa appears to direct peers to nearby objects"). It would also be interesting to know the number of internal requests served by UW servers as compared to external requests, the bandwidth consumed by internal P2P transfers themselves, and a comparison of the overall trends in internal and external traffic.

 