Question: Describe the methodology employed in this paper. What
traffic could be missed with this approach? Are there any statistics
that the authors didn't report that you wish they had gathered? Do you
think that access to the complete data would lead to different
conclusions than those drawn by the authors?
Student
The methodologies used in the paper were necessarily imprecise because of
the limited information available about the workings and infrastructure of
the P2P systems. The authors use port numbers to identify the protocol and
kind of data for the P2P systems, and IP addresses to identify distributed
content delivery systems like Akamai. Given the lack of information about
these commercially operated systems at the time, the methods are justified.
The only major flaw in the methodology was the use of university resources,
whose users are dominated by poor undergraduate and graduate students who
tend to be stingy!
I believe they did leave out a significant portion of the data
transferred on the Internet by omitting FTP transfers (which run over
TCP but are not HTTP) and UDP traffic. Also of concern is that the P2P
analysis was concentrated on Kazaa and Gnutella, but other P2P systems
existed, even Usenet (a different kind of system, but nevertheless
peer-to-peer in spirit). Moreover, port numbers are not static: later
P2P systems dynamically change their ports periodically. Otherwise their
assumptions about P2P systems are reasonable (the paper is a few years
old, and the P2P systems widely used at the time did not use those
techniques). Other blind spots were tunnelling and VPNs, which could not
be monitored because of encryption. The year the data was collected also
makes a difference, and I am unsure of network usage four years back.
I believe the missing FTP usage could have caused some shift in the data
analyzed, and that P2P systems, facing legal apprehension, could have
used camouflaging techniques like IP masquerading that would confuse the
analysis. But I still believe the error margin could not have been
large, because they have taken the major factors into account.
Student
They monitored all incoming and outgoing TCP packets on their two links to
ISPs. That means they don't monitor UDP traffic, but later they claim that
it is only about 3% of the traffic anyway. It also means that they don't
monitor traffic that stays inside their local network, so Kazaa may
actually generate even more traffic than they think, which would affect
their conclusions.
They also didn't identify all Akamai hosts: they found only 3,966 of the
13,000+ Akamai servers. In theory that could mean that a higher percentage
of the www traffic is actually Akamai traffic (although it is probably
unlikely that clients are served by more than a small number of Akamai
servers, since that would defeat the whole purpose of a network like
Akamai).
I can't think of any other statistics I would have wanted investigated in
2002, but the current situation is completely different. I think that Kazaa
isn't used as much anymore and P2P traffic has shifted towards other
networks, news servers, and torrents. Also, now that broadband is becoming
more common, there might be more UDP traffic due to streaming?
Student
The methodology used in this paper to study network data is to capture
all outgoing and incoming traffic to the University of Washington
through the university's connection to the outside world. The data
captured going through this connection is then categorized according
to usage (www, Akamai, P2P, etc.) and content type (pictures, video,
audio, etc.). One drawback this approach suffers from is that it will
not record network data between servers on campus. That is, users who
are accessing university web pages will not be represented in the
data. Also, P2P users who are sharing files among other university
users will not be represented. It is also difficult with this approach
to categorize content type for data sent through encrypted
connections. For example, captured data bound for port 443 is
most likely encrypted data and therefore the type of data cannot be
accurately reported.
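The port-based categorization described above can be sketched as follows. This is a simplified illustration: the port assignments are the conventional defaults, and, as noted elsewhere in these responses, traffic on non-standard or encrypted ports would be misclassified.

```python
# Sketch of port-based flow classification (simplified illustration;
# port assignments are conventional defaults, not taken from the paper).

DEFAULT_PORTS = {
    80: "www",         # HTTP
    8080: "www",       # common alternative HTTP port
    1214: "kazaa",     # Kazaa's default port
    6346: "gnutella",  # Gnutella defaults
    6347: "gnutella",
    443: "encrypted",  # HTTPS: payload cannot be inspected
}

def classify_flow(dst_port):
    """Map a TCP destination port to a content-delivery category."""
    return DEFAULT_PORTS.get(dst_port, "other")

print(classify_flow(1214))  # kazaa
print(classify_flow(9999))  # other
```

Anything not matching a known port falls into the "other" bucket, which is exactly why encrypted or port-hopping traffic escapes this kind of analysis.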
One statistic that I would like to see is the time taken to download
objects from each content-delivery category. The authors claim that
because P2P objects are roughly 1000 times larger than www objects,
the transfer times are roughly 1000 times greater for P2P objects on
average. It is my guess that transfer times for P2P objects are
actually larger than what is claimed by the authors. While web objects
are usually served by high-bandwidth web servers, P2P servers
generally have low-bandwidth connections, slowing transfer times even
further.
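The argument above can be made concrete with a back-of-envelope calculation; all numbers are invented for illustration, not taken from the paper.

```python
# Hypothetical back-of-envelope calculation (all numbers invented):
# if P2P objects are ~1000x larger than web objects but are served from
# peers with ~10x less uplink bandwidth, transfers take ~10000x longer.

web_object_kb = 10           # small web object (assumed)
p2p_object_kb = 10 * 1000    # ~1000x larger, per the paper's claim

web_server_kbps = 10_000     # high-bandwidth web server (assumed)
p2p_peer_kbps = 1_000        # low-bandwidth residential peer (assumed)

web_time_s = web_object_kb * 8 / web_server_kbps
p2p_time_s = p2p_object_kb * 8 / p2p_peer_kbps

print(p2p_time_s / web_time_s)  # 10000.0 = size ratio x bandwidth ratio
```

The transfer-time ratio is the size ratio multiplied by the bandwidth ratio, which is why size alone would understate it.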
I also think it would be interesting to see statistics when using a
Gnutella cache versus a Kazaa cache. The authors explain that Kazaa
tries to do a better job with exploiting physical locality than
Gnutella. This might mean that a lot of Kazaa traffic is missed
because the files are being transferred within the university. Thus, a
Gnutella cache might perform even better than the Kazaa cache.
If we could have access to the complete data, I don't think that the
major conclusions drawn would be much different. However, the
conclusions drawn by the authors about the nature of P2P systems might
be different. They claim that P2P networks do not do a good job of
spreading workload for scalability. However, they only use Kazaa data
to back up their claim. Since Gnutella makes no effort to exploit
physical locality, Gnutella might do a better job of spreading the
workload to different servers.
Student
1. Describe the methodology employed in this paper
The work described in the paper analyzes inbound and outbound traffic
at the University of Washington over a nine day period. All packets
passing through 4 boundary switches are captured and examined, and
TCP connections are inferred (using an approach described in another
paper). By examining connection ports and destination servers,
traffic is divided into
- web (non-Akamai)
- web (Akamai)
- Kazaa
- Gnutella
The traffic is then analyzed (and the analyses presented in graphs)
using such metrics as total bandwidth used by each type of traffic, %
bandwidth used by the top consumers, etc.
2. Are there any statistics that the authors didn't report that you
wish they had gathered?
Section 5.3 and Figure 10 graph the CDF of % of bytes used by the top
N (up to 1000) bandwidth consumers. The small number of Gnutella
users gives a very skewed curve. It would probably have been more
accurate to instead plot a CDF against a certain percentage of users
(e.g., how much bandwidth is used by the top 10/20/30/40% of Kazaa/
Gnutella/WWW users?)
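The suggested alternative statistic, bandwidth share of the top percentage of users rather than the top N users, could be computed as in this sketch; the usage numbers are hypothetical.

```python
# Sketch of the suggested statistic: the fraction of total bytes
# consumed by the top X% of users (usage numbers below are hypothetical).

def share_of_top_fraction(bytes_per_user, fraction):
    """Return the share of total bytes used by the top `fraction` of users."""
    ranked = sorted(bytes_per_user, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    return sum(ranked[:k]) / sum(ranked)

usage = [1000, 500, 200, 100, 50, 25, 10, 5, 5, 5]  # skewed, as in P2P
print(share_of_top_fraction(usage, 0.10))  # share of the top 10% of users
print(share_of_top_fraction(usage, 0.40))  # share of the top 40% of users
```

Because the x-axis is a fraction of the population rather than an absolute count, curves for populations of very different sizes (such as Gnutella vs. WWW users) become directly comparable.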
The authors recommend that the ISP implement an outbound P2P cache,
thus saving UW bandwidth. While not in the scope of the paper, I
would like to know how much cheaper the overall cost would be (how
much would the cache cost vs. the cost of the bandwidth saved?). In any
case, it would certainly not be in the ISP's economic interest to spend
extra money on a cache that reduces the bandwidth used by its customer.
3. Do you think that access to the complete data would lead to
different conclusions than those drawn by the author?
I do not think that the conclusions drawn by the authors would change,
even given the additional data requested above.
Student
In this paper, the authors compare four different content delivery
systems, namely HTTP web traffic, Akamai CDN traffic, and Kazaa and
Gnutella peer-to-peer traffic, with respect to bandwidth consumption,
number of transactions, net bytes transferred, etc. The study was
undertaken at the University of Washington by monitoring the traffic
flowing across switches between the university's internal network and
the external Internet. The monitoring hosts extract metadata from the
incoming and outgoing TCP packets and log this information. The traffic
is then categorized as belonging to one of the four content delivery
systems based on port numbers and server classification (Akamai server
or not).
The analysis undertaken is not comprehensive and could have missed some
important traffic. First, the study does not present any data on non-TCP
traffic, so UDP-based traffic is not accounted for. Also, traffic from
P2P systems other than Kazaa and Gnutella has not been explicitly
quantified. The authors do not present any statistics about the non-HTTP
TCP traffic (consisting of news, mail, FTP, streaming media, other P2P
systems, and search traffic on Gnutella and Kazaa), which accounts for
43% of the total TCP traffic.
Another important statistic that has not been brought out is the traffic
that is generated and served within the university network, for instance
file sharing among Kazaa users at the university. Since the monitoring
is done at the border routers, the traces do not account for internal
CDN traffic. In addition, the data was collected on a university
network, which may not be representative of general Internet usage.
Another variable is the period over which the traces were obtained, as
the network behavior of university students (who constitute a large part
of the user population) may vary greatly between vacation and exam week!
However, I believe that the major conclusion of this paper, the
domination of P2P systems in modern-day Internet traffic, will not
change with this additional data.
Student
The paper's methodology extracts and classifies network traffic using
passive monitoring (of traffic flowing in and out of UW). The
university's border routers provide the capability to send copies of
incoming and outgoing packets to a monitoring host. TCP packets obtained
by this monitoring are reassembled, and flows involving HTTP exchanges
are identified (these include Kazaa and Gnutella too). Port numbers are
used as the basis for classifying traffic into the corresponding
buckets. Akamai traffic is filtered using a list of Akamai servers; this
list was built using DNS queries for names in an Akamai-managed domain.
Gnutella traffic was captured effectively by exploiting the fact that
Gnutella does not locate hosts based on physical topology and thus
connects to peers outside the university. Kazaa traffic was traced while
accounting for objects being broken into fragments.
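The server-list construction could be sketched as follows. This is a simplified illustration: the hostname passed in is a placeholder, not a name the authors actually queried, and the paper's exact enumeration procedure may well differ.

```python
import socket

# Sketch of building a CDN server-IP list via DNS, in the spirit of the
# paper's Akamai enumeration. The hostname used below ("localhost") is a
# placeholder for illustration only; the real procedure would resolve
# many names inside the CDN-managed domain.

def resolve_ips(hostnames):
    """Resolve each hostname and collect the unique IPv4 addresses."""
    ips = set()
    for name in hostnames:
        try:
            for info in socket.getaddrinfo(name, 80, socket.AF_INET,
                                           socket.SOCK_STREAM):
                ips.add(info[4][0])
        except socket.gaierror:
            pass  # skip names that do not resolve
    return ips

def is_cdn_flow(server_ip, cdn_ips):
    """Classify a flow as CDN traffic if its server IP is in the list."""
    return server_ip in cdn_ips

cdn_ips = resolve_ips(["localhost"])  # placeholder hostname
print(is_cdn_flow("127.0.0.1", cdn_ips))
```

The incompleteness noted by an earlier response (3,966 of 13,000+ servers found) follows naturally from this design: any server whose name is never resolved during the enumeration simply stays off the list.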
The paper considers only TCP traffic and completely ignores UDP traffic
in and out of the university. Some common applications that use UDP
include DNS, streaming media, voice over IP, and online games. It can
be clearly seen that some of these applications, especially games and
streaming media, would contribute to the bandwidth consumption, though
the study mentions that the traffic is predominantly TCP. However, the
scope of the paper (content distribution) might limit the relevance of
some of these applications. Most of the data presented in the paper
relates to the four content delivery networks and classifies the rest of
the traffic as "other" non-HTTP traffic. The paper, for the most part,
does not discuss this kind of traffic, which could include other
file-sharing applications such as BitTorrent or Napster that are also
widely used today.
Different file-sharing applications might break files into chunks of
varying sizes, and the distribution of these files over the Internet may
differ. However, the principles remaining the same, I would expect
caching to provide similar bandwidth savings by making the required
files/fragments available. It would be interesting, however, to see the
effects of such a scheme quantitatively. (An important observation here
is that the paper refers only to Kazaa when reporting certain metrics,
like the effectiveness of caching, and ignores Gnutella.)
Student
This paper examined the traffic flow of content delivery systems,
focusing largely on Web -vs- P2P traffic flows, and specifically on
WWW, Akamai, Kazaa, and Gnutella delivery systems. The methodology
employed passive network monitoring of all traffic coming in and out
of the border routers between the Univ of Washington (UW) and the
rest of the Internet. Custom monitoring software reconstructed all
TCP flows in and out of UW, and categorized each flow into HTTP or
non-HTTP traffic by decoding HTTP headers and metadata. HTTP traffic
was further categorized into WWW, Kazaa, or Gnutella requests by
looking at the destination ports; Akamai traffic was recognized by
the source IP address being registered to Akamai Corp in DNS.
Although the authors did a good job of analyzing the data collected,
several potential deficiencies in their data collection methodology
may have impacted their conclusions. My primary concerns revolve
around time and volume. For time, the authors mentioned they
monitored for nine days. Especially considering that a university
environment is a relatively homogeneous population with respect to the
calendar, I would ideally have preferred to see a much longer data
calendar, I would have ideally preferred to see a much longer data
sample. For instance, did these nine days occur during spring break
when most interactive WWW users on campus are gone? Or during final
exams? The timing of when the sample was taken certainly could
dramatically change the relative amounts of differing traffic. As
for volume considerations, the data collected was solely at the
border routers at one institution. It can be questioned how
applicable the authors' conclusions would be to other institutions,
especially since so much traffic was directed to a relatively small
number of Kazaa servers. For instance, the authors conclude that a
P2P client creates significant traffic in both directions. Perhaps
this could be explained because a small population of students had
particularly juicy content that was in high demand during the times
monitored (e.g. complete season of the Sopranos right before the new
season starts), or because the dorm rooms at UW are connected to very
high-bandwidth networks that make Kazaa prefer peers at UW over
others on the Internet. Also, the fact that only the border routers
were monitored could have skewed conclusions. For example, the
authors mention that the P2P systems may benefit from a reverse cache
at the border. But without data from routers inside the UW, it is
unclear how much retrieval the P2P systems were doing inside the UW
network -- perhaps they were only going out to the Internet because
internal peers were already saturated.
Finally, the authors only decoded HTTP protocols. This ignores P2P
systems that may use a protocol other than HTTP, as well as ignoring
UDP traffic. How much of the non-HTTP traffic is really a P2P
transfer is unknown. Also, relying on port numbers to identify a
specific content delivery system is not completely reliable.
Student
The paper seems to be a fairly exhaustive analysis of content delivery
systems. The results of the paper are significant and give useful insight
into the nature, scaling performance and impact on internet bandwidth of
P2P systems.
The methodology for the analysis involves collecting traces of traffic
flowing between University of Washington and the rest of the internet.
This is done by sending a copy of each incoming and outgoing packet to a
monitoring host that has a packet filter installed to deliver TCP packets
to a user-level process. This process derives useful information from the
packets, essentially by reverse engineering, and the data is then logged
for analysis.
The authors classify the traffic into two classes: HTTP-based and non-HTTP
based. The first kind of traffic tries to capture WWW, Akamai, Kazaa and
Gnutella transfers. The second kind, which includes other TCP traffic, is
ignored. However, it represents a significant 43% of the total traffic,
and hence a large amount of traffic is essentially missed. An analysis of
this traffic could be interesting, as it would include TCP requests on
port 21 (FTP requests) and would capture file transfers on non-P2P
systems. It would be interesting to know the object size distribution of
these files, the bytes transferred in both directions (incoming and
outgoing) and the amount of bytes transferred in this way as compared to
P2P systems. Also, the combined WWW, FTP and streaming media traffic
(which represents most of the internet traffic before P2P systems were
deployed) could come significantly closer to the P2P bandwidth load,
leading to different conclusions than those drawn by the authors. If this
is not the case, then the results of the paper would, on the other hand,
become even more interesting.
Also, it would be useful to capture the WWW and P2P traffic within the
university itself. In the approach used by the authors, only the traffic
that crosses the border routers is examined. This is essential in the
case of Kazaa to analyse it more accurately (the paper says, "Kazaa
appears to direct peers to nearby objects"). Also, it would be
interesting to know
the number of internal requests served by UW servers as compared to
external requests, the bandwidth consumed in internal P2P transfers
itself, and a comparison of the overall trends in internal and external
traffic.