Current networking implementations do not give the end user the ability to prioritize network usage between applications. Neither do they give a system administrator the ability to prioritize network bandwidth usage between users. CPU preemption provides a mechanism for fair distribution of processor time. No such mechanism exists for networking. It is possible for a single user to tie up the majority of the network bandwidth of a machine.
The PANTS project presents a mechanism to impose prioritized ordering of TCP packet transmissions to control network bandwidth prioritization, without significantly deteriorating the overall throughput of the network. This involved modifying the flow of network packets between the TCP and IP layers, and utilizing information from the data-link layer as an indicator of network behavior. Experimental implementation of PANTS on the Linux 2.1.60 kernel has yielded promising results.
Currently, there is no means by which a user or system administrator can manipulate the network bandwidth priority for an individual user, process, or socket.
Suppose a programmer writes a Java class that will upload data in parallel to multiple sites over multiple connections. If the transfer rates of some connections are more important than others, then the programmer would benefit from the ability to specify the network bandwidth priority of each connection, as in the case of CPU usage priority.
The PANTS project addresses these motivations by imposing a priority
scheme on the transmission of TCP packets, and providing a set of
user-level system calls which provide a way to specify the network
priority for each process and socket. Future implementations may address
network priority based on userID or other criteria.
PANTS effects prioritization of TCP connections by intercepting outgoing TCP packets (including TCP acknowledgements) before they reach the IP layer, placing the packets in a priority table, and having the packets transmitted in prioritized order by a kernel thread. The priority of each packet (identified by a sk_buff struct) is determined from a new variable (pants_priority) added to the sock struct (which characterizes an INET socket). This priority is an unsigned byte between 0 and PANTS_NUM_PRIORITY (a value that can be set at compile-time, currently set to 10), with 0 being the highest priority.
The priority table is an array of queues, with one queue for each distinct priority level. Each queue is implemented as a linked list, with pointers to its head and tail. Variables containing the current length of each queue, as well as the total number of packets in the table, are also maintained. Adding a packet to the table involves simply appending it to the tail of the queue corresponding to its priority, and bumping up the appropriate counts.
The PANTS kernel thread, kpantsd, is spawned at startup by the init process, with a scheduling priority of DEF_PRIORITY (20). A corresponding function (pants_tick()) is added to the scheduler's timer_table, and arranged to be called at every clock tick (every jiffy). This function checks the number of packets currently in the table, and wakes up the kernel thread if the number is greater than zero. When awakened, the kernel thread proceeds to remove the packets from the table in order of priority, and forwards them to the IP layer.
To prevent starvation, the packets in the table are promoted in
priority at regular intervals. This is achieved by shifting
each queue in the table to the queue of next-higher priority, and
appending the priority 1 queue to the tail of the priority 0 queue.
A shift is performed every time the kernel thread wakes up,
and whenever new packets are added to the queue while the kernel
thread is awake. Specifically, when the kernel thread is awakened,
it performs a shift operation, and looks at the number of
packets in the table. If after servicing exactly that many packets
the table is not empty, another shift is performed.
PANTS was implemented by modifying the 2.1.60 release of the Linux kernel source code. It was installed on two Intel machines with the Debian 1.3.1 Linux distribution. The first machine, crash1, runs on a 90 MHz Pentium processor and has a 10 Mbps ethernet connection to the UW Computer Science network, while the other, trinidad, runs on a 100 MHz Pentium processor and is connected to the internet via a PPP dial-in connection on a 28.8 bps modem. Both machines have 32 MB of RAM.
The kernel level thread that transmits outgoing packets from the priority table is called kpantsd. It is created at system boot up time, in the function init() located in the file init/main.c of the kernel source. The thread is modelled after the pageout daemon thread (kswapd), created in a similar manner. The main difference between the setup of kpantsd and kswapd is in the scheduling priority that is assigned to the threads' current task_structs. The pageout daemon is given a higher priority (32), whereas kpantsd is given the default priority of 20.
The kpantsd daemon puts itself to sleep shortly after it is started by init. A watchdog function, pants_tick(), is executed at regular intervals by the scheduler, and is responsible for waking up kpantsd when needed. By adding the function pants_tick() to the timer_table array (declared in kernel/sched.c) and resetting its bit-field among the timer_active flags at every invocation, we cause the function to get called at every system clock tick (or jiffy)--which is 10 ms on Linux--by the function run_old_timers()(also declared in kernel/sched.c). The pants_tick() function performs a simple check to see if the priority table contains any packets. If there are n > 0 packets, the function wakes up the kpantsd daemon, which processes all the packets in the table, shifting the table after every n packets.
Prioritization is realized through a priority table, called pants_table. Since performance of the algorithms is crucial, the table is structured so that all operations can be performed efficiently. The data structure is implemented as an array of linked lists. There is an element in the array for each distinct priority. Each element of the array is a regular queue with a head and a tail pointer. Each element of the queue is a pointer to a Linux socket buffer (a sk_buff struct), associated with a network packet.
Four operations are defined on the table: add, remove, first, and shift (which correspond to the following functions in net/ipv4/pants_queue.c: pants_atomic_table_add(), pants_atomic_table_pop(), pants_table_top(), and pants_atomic_table_shift()). The add operation adds a new packet to be transmitted to the prioritized queue at the priority level indicated by the sock struct associated with the sk_buff. Operations remove and first return the first packet from the highest priority non-empty queue. The remove operation also unlinks the packet from the queue. The shift operation was introduced to prevent potential starvation. When invoked, it raises the priorities of packets of non-maximum priority by one (which, in the code, corresponds to decrementing the value of the priority, since the maximum priority is represented by 0). This is implemented by appending the queue in priority 1 to the tail of the queue in priority 0, and moving all other queues by one position to the left in the array.
These queue operations have the following complexities. The add operation is O(1) since the access to a given priority queue array element is constant in time, and the insertion of packets is constant in time due to having a tail pointer. The remove and first operations are O(PANTS_NUM_PRIORITY) in complexity because they may need to traverse the array of queues to find the first non-empty queue. The extraction, by remove, of a packet from the queue implemented as a linked list is constant in time. The shift operation needs to traverse the array. For all but the first and last queue it needs to reset the head and tail pointers and update the counts of packets in the queues. This operation is also O(PANTS_NUM_PRIORITY) in complexity.
Since the number of priorities is fixed at compile time, the complexity of operations traversing the array could be considered O(1) but, of course, the constant factor is significant, and so it s best to distinguish those operations from the truly constant time of add.
Most of these operations have critical sections. To ensure consistency, access to the priority table obviously needs to be restricted while the data structure is being modified. The add operation happens in the context of user-level processes that make a send request. The remove operation only happens in our kernel thread (kpantsd), and so is always in the context of that thread. This means that we have a multiple-producer, single-consumer queue. Moreover, operations first and shift also happen only in the context of the kernel thread. Thus the amount of synchronization that is necessary can be minimized.
In particular, since the first operation can only be executed
concurrently with (i.e. interrupted in favor of) add, if a
given priority queue is non-empty, its head pointer will be valid,
and no locking is necessary. Similarly, during the remove
operation no locking is necessary to locate and save the pointer
to the packet to be removed. We have just one critical section in
which we reset the pointers and change the count of packets in the
priority table. The shift operation changes the pointers of
queues of every priority. Since it can be executed concurrently
with add, all of its queue operations are contained in one
large critical section. The add operation can be executed
concurrently with remove, first, and shift,
as well as with add operations under the context of other
user-level processes. We, therefore, have a critical section surrounding
updates to the queue's pointers and packet counts. Locking the
concurrently accessed data structures is done by disabling interrupts
before entering a critical section, and re-enabling them after leaving
it, using the cli() and sti() primitives.
Two network benchmarking tools were developed to test the effectiveness of PANTS. The first, called benchclient/benchserver, performs a large number of continuous sends and receives of packets between the two ends. The number of packets, as well as the packet sizes, are configurable at run-time. The benchclient measures the round-trip time for each send/receive cycle. The second utility, called onewayclient/onewayserver, performs a continuous transmission in one direction (either send or receive), measuring the total time of the transmission.
The first benchmark emulates more interactive protocols, like telnet, while the second emulates large message transfer protocols, like FTP. The measured times for both benchmarks are determined from the gettimeofday() system call. The data transmitted with both benchmarks are randomized, to avoid possible compression by the data-link or hardware.
With the priority table mechanism fully implemented and debugged on the crash1 system, testing did not produce the desired results. Under normal CPU and network load, there were seldom more than a few packets in the table at any time, and these were serviced too quickly by the kernel thread for any prioritized behavior to occur. This was particularly true for the onewayclient /onewayserver benchmarks. Running the benchmarks on the slower connection of trinidad also showed no prioritization.
Prioritized behavior was observed, however, when we ran 10 concurrent benchclient/benchserver connections, each with a different PANTS priority. 10,000 1K packets were transmitted by each benchclient.
The results, as seen in Figure 1, show a clearly prioritized behavior. The reason this behavior was observed is that after each send, the benchclient immediately makes a call to read, which is blocking. This allows for a context switch to the other benchclient processes to occur before the next jiffy. As a result, several (possibly all) benchclient processes are given the opportunity to add their own packet to the priority table. Therefore, prioritization does occur. In the case of the onewayclient/onewayserver, however, because the client performs continuous sends, it does not yield the processor until the end of its scheduled time slice. Because the granularity of time slices is one jiffy, the kernel thread will be awakened before any other process is allowed to run, and add their packets to the table. As such, the kernel thread will end up servicing only packets from one socket (hence, of the same priority) at any invocation.
Prioritization fails to occur under normal circumstances because the pants_table is never filled by more than one process' packets. However, our algorithm would still yield desirable results if the servicing of the packets by the kernel thread took long enough for context switches to occur. In such a situation, other processes would be able to add their packets into the priority table before it is fully cleared. A trace of the ip_queue_xmit() function (in net/ipv4/ip_output.c), which is called by the kernel thread to forward the sk_buff to the network layer (from TCP to IP), revealed that it contained no blocking calls. It merely places the sk_buff onto the device queue. As a result, its latency is not reflective of the time it took for the device (data-link and physical layers) to actually transmit the packet. A flaw in this initial design, then, is that the priority table attempts to enforce prioritization at a level whose bandwidth is much higher than at the bottleneck, which is at the device level.
With this observation, then, we proceeded to add a component to PANTS
that probes the device for some indication of its available bandwidth.
Before being successfully transmitted, a TCP packet must pass through a number of buffers. The PANTS priority table as described above is, in fact, a new buffer created between the TCP and IP layers, with a potential bandwidth delay of up to one jiffy (10 ms). The bandwidth of a network connection is constrained by the slowest channel, or the bottleneck, which is usually the physical layer. Since outgoing packets are buffered before being sent by the physical device, the bandwidth above it is very high. As such, placing the PANTS priority table between the TCP and IP layers will only prioritize packets within the high-bandwidth realm above the physical layer. When the packets reach the physical layer, the behavior reverts to FIFO.
Ideally, then, we want the priority
table to be in a position where it can capture the network behavior
of the bottleneck. If we can probe into the data-link or physical layer
and derive some information that is indicative of its bandwidth, we can
delay the servicing of packets from our priority table without adversely
affecting the overall network latency.
In an attempt to link the size of the priority table to the available network bandwidth of a modem connection, we looked for a way to examine the length of the device queue. Unfortunately, we found that the software interface for communication with the device was very limited. We did not find a way to query the hardware for the number of packets or the total amount of data in its queue. The software data-link layer code puts packets in a queue which is periodically checked by the device. Most devices transfer packets from that queue into their own on-board memory using DMA.
Shortly before the hardware is informed of a new packet to be transmitted, the software sets a few variables in the deviceís statistics structure, net_device_stats. Looking specifically at the PPP interface, we attempted to use a number of the fields in the structure to give us an indication of network saturation. We found the numbers tx_packets, ppp_opackets, and others not to be reflective of the network performance.
We then came across last_xmit, a field in the ppp struct that was updated immediately before a packet was moved to the hardware device queue. Comparing the time that has elapsed since the last time a packet was sent to the hardware with the time that has elapsed since the last time the kernel thread has forwarded a packet from TCP to IP, we can see how saturated the network is. If the time since the last PPP transmit is larger than the time since the last ip_queue_xmit(), the thread can go to sleep instead of forwarding a packet, thus allowing the accumulation of packets in the priority queue instead of at the device.
In addition to the potential delay inherent in using the kernel thread
approach, this mechanism introduces another potential delay. Since we
wait for the device to catch up to the kernel thread in sending packets,
a packet may remain in the queue for up to a jiffy after the
device has freed up.
The same benchmarks as above were run on trinidad with the PPP component installed. The results showed prioritization for both benchmarks. figure 2 shows, as before the results of the benchclient/benchserver test, this time transmitting only 200 1K packets (since the connection was only 28.8 bps, this was sufficient to get good data). The decreasing latencies for the later packets of the lower priority benchmarks (priority 6-9) is a result of the freeing up of bandwidth from the completion of the higher priority processes. The prioritized behavior is again apparent.
Figure 3 shows the transfer times of 10 concurrently-running
onewayclients, each transmitting 50K of data. This time, the
prioritized behavior is obvious.
We have created a set of tools to monitor and control the behavior of PANTS. A user-level program, called pants, can be invoked to turn PANTS on and off, to print out statistics regarding the current state of the priority table, and to check or assign a priority to a given process.
Every process and socket has a priority associated with it. By default all processes and sockets have a priority of PANTS_DEF_PRIORITY (which is PANTS_MAX_PRIORITY/2). When a priority for a given process is changed, the processís task_struct is searched for all INET sockets associated with it, and their priority is reset, as well. Priorities of newly created processes are inherited from their parents by copying them in the fork() system call. Priorities of processes that make the execve() system call are set to PANTS_DEF_PRIORITY. As a result, most processes get the default priority, but children of a forking process will inherit the priority of their parent, as desired (e.g., forking of a new Netscape window).
The operations described above are made possible by two new system calls
added to the kernel (using the socketcall 'multiplexor').
To avoid adding a large set of system calls we created a
pants_ctrl() call that accepts a number of commands and arguments
to perform the various desired functions. The user-level program interprets
the command-line invocations into calls to a PANTS library function, which
is an interface to the system calls implemented in the kernel.
The mechanism implemented for PPP can be made more precise by examining the actual hardware queue. The current Linux software interface to some network devices does not give a way to query for that information. We believe that most network devices do have a way to be queried for the length of their queues. We could gain experience with the various network devices, and enhance the software interface with routines to make the necessary hardware queries.
At this point the only control over incoming packets is effected through
the prioritization of the TCP acknowledgement (ACK) packets. These packets
are treated the same as the regular outgoing packets. More intricate
treatment of acknowledgement packets based on the round-trip time of
transmission of their connections can have a more significant impact
on their prioritization. This treatment would have to take into account
all timeout calculations, and make use of the sliding window to achieve
the result without forcing unnecessary retransmissions.