Linda is a system for communication between cooperating processes. DIRT is a Linda implementation in which storage is distributed among a tree of servers connected by an internet. Requests for data are distributed widely through the tree, while new data is kept near where it is created. We find that a central server coordinating multiple clients requires a large granularity to overcome network latency. In a simple DIRT system, we successfully make use of forty hosts in a cooperative computation. In a more complex system, we develop techniques for moving data closer to where it is needed.
A Linda system, as described in , consists of an abstract global storage space called a tuple space, in which ordered tuples of information are stored. Linda clients interact with the tuple space through four commands: in, out, read, and eval. An out command inserts a tuple into tuple space; an in command removes a tuple that matches a given template; a read command reads a tuple matching a given template but does not remove it from tuple space; and an eval command includes code for evaluating values in a tuple. The in and read commands block until a matching tuple is found.
The original implementation of Linda on the S/Net  connected eight nodes on a local bus. By taking advantage of the fixed time it took to broadcast a message to all nodes on the bus, this implementation replicated each tuple to every node in the network, thus sacrificing memory usage for speed.
Modern applications of Linda are more suited towards a collection of workstations connected by an internet. Examples are Piranha , a scheme for locating idle machines, and Lifestreams , a system for organizing and collaborating on documents.
We wish to design a Linda system that will operate effectively over a wide-area internet, where participants may have fast communication links to close neighbors but slower links to distant machines. It should be possible for collaboration software to allow two distant users to work together, but performance should be optimized for users that are geographically close. Unrelated, distant applications should not affect each other's performance.
In this paper, we present an implementation of Linda designed to be scalable over an internet. A network of machines are designated as servers, and are arranged in a tree. Clients may connect to any server, and see a consistent view of tuple space. We examine properties of a single DIRT server, and then develop techniques for moving tuples closer to the clients that use them.
A DIRT Linda network consists of a number of servers connected hierarchically via TCP connections. A client wishing to interact with the tuple space may connect to any participating server, and inject a new tuple or a request for a tuple into that server. It then becomes the responsibility of that server to fulfill the client's requirements.
In general, tuples stay at the node where they are injected, while requests (in and read templates) are copied to many servers. We choose to replicate requests instead of tuples because the number of requests is no greater than the number of blocked client processes in the system, which we assume to be much smaller than the number of tuples. Furthermore, since tuples carry data, they are potentially much larger than requests.
The original Linda syntax uses a function-like notation to indicate tuple space operations:
    in( "bart", int a, float f );
    out( "bark", 16, 'a', 7.5 );
    read( "lark", x, int c );

The syntax for template items in in and read commands requires an extension to the C language (which is why Linda is often referred to as a programming language); supporting it would require the development of a new compiler for all Linda applications (see ). We instead choose a cleaner interface, modeled on the C++ stream abstraction, that takes advantage of operator overloading. The equivalent of the operations above becomes:
    dirt_client client("servermachine");
    client << "bart" >> a >> f << dirt::in;
    client << "bark" << 16 << 'a' << 7.5 << dirt::out;
    client << "lark" << x >> c << dirt::read;

Although the direction of the arrows may initially seem confusing (to Bart), the rule is simple: templates use right arrows, data items use left arrows.
Our implementation supports tuples containing items of one of three basic types (character, integer, real), arrays of these basic types, or a string type. The client object uses TCP to connect to a DIRT server (usually on the same host) on a well-known port. The tuple or template is assembled in local storage and is flushed down the socket upon receiving the in, out, or read command. For input requests, the client blocks until a reply is received from the server.
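The stream-style interface above can be sketched with operator overloading. The following is a minimal illustration, not the actual DIRT client: the names (dirt_client, dirt::op, item, formal) are hypothetical, networking is omitted, and the terminating marker merely records the operation where the real client would flush the assembled tuple down its TCP socket.

```cpp
#include <cassert>
#include <string>
#include <variant>
#include <vector>

namespace dirt { enum op { in, out, read }; }

// The basic item types the paper describes (arrays omitted for brevity).
using item   = std::variant<char, int, double, std::string>;
using formal = std::variant<char*, int*, double*, std::string*>;

struct dirt_client {
    std::vector<item>   actuals;  // concrete values, supplied with <<
    std::vector<formal> formals;  // template slots to fill on a match, supplied with >>
    dirt::op            last_op{};

    dirt_client& operator<<(const char* s) { actuals.emplace_back(std::string(s)); return *this; }
    dirt_client& operator<<(dirt::op o)    { last_op = o; /* real client flushes to socket here */ return *this; }
    template <class T>
    dirt_client& operator<<(T v)           { actuals.emplace_back(v); return *this; }
    template <class T>
    dirt_client& operator>>(T& slot)       { formals.emplace_back(&slot); return *this; }
};
```

Since << and >> share precedence and associate left to right, a statement such as `client << "bark" << 16 << 'a' << 7.5 << dirt::out;` simply accumulates items and ends with the operation marker.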
Servers in a DIRT network are connected hierarchically (see Figure 1), serving as a complete Linda system with distributed storage and many points of connection. We base our design on the following assumptions:
A server in a DIRT network maintains the following:
Our current implementation stores the tuples and requests in hash tables, hashing on the first item of each tuple (the `key'), which we require to be a string. The hash tables provide faster lookup of matching tuples and requests, and are simply an optimization.
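The keyed store can be sketched as follows. This is an illustrative sketch only (tuple_store and its members are hypothetical names): tuples are bucketed by their first item, so a lookup scans only the tuples sharing the template's key rather than the whole store; full matching would then compare the remaining fields of each candidate.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

using item  = std::variant<char, int, double, std::string>;
using tuple = std::vector<item>;

struct tuple_store {
    // Buckets keyed on the first item, which DIRT requires to be a string.
    std::unordered_map<std::string, std::vector<tuple>> buckets;

    void insert(const tuple& t) {
        buckets[std::get<std::string>(t.front())].push_back(t);
    }

    // Candidates for a template with the given key, or nullptr if none.
    const std::vector<tuple>* candidates(const std::string& key) const {
        auto it = buckets.find(key);
        return it == buckets.end() ? nullptr : &it->second;
    }
};
```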
A standalone DIRT server with no parent or children can serve as a complete Linda system. We use this configuration as a reference case in our measurements in Section 3.2.
When a request for a tuple is received by a leaf node, it is assigned a globally unique id consisting of a <host id, serial number> pair. The leaf node then attempts to satisfy the request from its local storage. Should this fail, the node forwards the request to its parent, which is then responsible for replicating the request to its children. If this also fails, the request is forwarded to the grandparent, and so on. In this way, every attempt is made to satisfy a request locally, but an unsatisfied request will eventually be replicated throughout the entire network.
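The id scheme can be sketched briefly. The names below (request_id, node, new_request) are illustrative, not DIRT's: each host draws serial numbers from its own counter, so the <host id, serial number> pair is globally unique without any coordination between nodes; the escalation itself would happen in the networking layer.

```cpp
#include <cassert>

struct request_id {
    int host;    // id of the node that created the request
    int serial;  // per-host counter, making the pair globally unique
    bool operator==(const request_id& o) const {
        return host == o.host && serial == o.serial;
    }
};

struct node {
    int   host_id;
    int   next_serial = 0;

    // Issue a fresh id; no two nodes can ever collide, since the
    // host id differs, and no node reuses a serial number.
    request_id new_request() { return {host_id, next_serial++}; }
};
```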
When a node in the network receives a tuple that satisfies a request from one of its clients, the tuple is dispatched to the requesting client. The server then injects a cancel message, which propagates through the tree to destroy outstanding copies of the request. The request to be destroyed is identified by its unique id.
When a node encounters a tuple matching a non-local request, the tuple is transmitted back towards the requesting node. We explore approaches for doing this in Section 2.3. Figure 2 shows a flowchart describing the actions taken by a server upon receiving a message.
When a node matches a tuple to a remote in request, it dispatches the tuple towards the requesting node. This can be done either by opening a new connection to the requesting node, or by sending the tuple back through the existing connections in the tree.
The first approach, connecting directly to the requesting node, involves only the two nodes in question, hence it does not contribute additional load to other nodes in the tree. A disadvantage, however, is that the time required to establish the new connection is quite high. In our testing environment, establishing a new connection takes ~50 ms while the communication latency across an existing connection is ~10 ms.
The second approach, transmitting the tuple through the tree, involves a possibly large number of nodes on the path between the source and destination nodes. Delivery by this means will be faster if the total latency of the connections in the path is less than the latency for establishing a new connection. This approach requires that each node keep track of the neighbor (parent or child) from which each request was received.
A side effect of this second approach is that tuple matching becomes more dynamic in nature. For example, when node A sends to node B a tuple matching a request from node C, B need not necessarily transmit the tuple towards C. It may choose to match the tuple to another request in its store. This behavior can be advantageous if we add a restriction that each node always matches the request closest to it. To enable this, a "hop count" is maintained for each request on each node that corresponds to the distance to the requesting node.
Our implementation uses both of the approaches described above, choosing between them based on the hop count of the request being satisfied. If the hop count to the requesting node is above a certain limit, we open a new TCP connection to the requesting node. Otherwise, the tuple is transmitted to the node from which the request was received. The value of this limit depends on the latencies of a given network. In our testing environment, we chose a limit of 5 hops.
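The decision reduces to a break-even comparison: with our measured figures of ~10 ms per existing hop and ~50 ms to establish a new connection, the two costs meet at 5 hops, which is the limit we chose. A sketch of the decision (function and enum names are illustrative, not DIRT's):

```cpp
#include <cassert>

enum class route { direct_connection, back_through_tree };

// Beyond the hop limit, retracing the tree path costs more in per-hop
// latency than the one-time cost of opening a direct TCP connection.
route choose_route(int hops_to_requester, int hop_limit = 5) {
    return hops_to_requester > hop_limit ? route::direct_connection
                                         : route::back_through_tree;
}
```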
For in requests, a tuple is deleted from the local store of a node as soon as it is transmitted to another node. For read requests, however, the tuple must remain in tuple space, and we must ensure that exactly one copy of it does.
This was initially implemented by having the tuple stored at the node from which the matching read request originated. That is, when a node transmits a tuple to a client that made a read request, the tuple is not deleted as in the case of in requests.
With this simplistic strategy, however, when a single tuple is repeatedly read by clients on a number of leaf nodes, the tuple ends up moving from one leaf node to the next. This approach is rather inefficient.
We therefore implemented "tuple promotion", where for each read request, the tuple is moved to the highest node (in the tree) in the path to the destination. In this way, further read requests within the subtree rooted at this highest node can be satisfied quickly by delivering a copy of the tuple downwards. This encourages locality, but the worst case scenario would see all the tuples promoted to the root, potentially congesting its tuple storage.
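The promotion target, the highest node on the tree path from the tuple's current holder to the requesting leaf, is the pair's lowest common ancestor. A sketch of finding it on a parent-pointer tree (the names node and promotion_target are illustrative, not DIRT's):

```cpp
#include <cassert>
#include <unordered_set>

struct node {
    int   id;
    node* parent = nullptr;  // the root has no parent
};

// Walk from the holder to the root, recording the path; the first node on
// the requester's upward walk that lies on that path is the promotion target.
node* promotion_target(node* holder, node* requester) {
    std::unordered_set<node*> on_path;
    for (node* n = holder; n; n = n->parent) on_path.insert(n);
    for (node* n = requester; n; n = n->parent)
        if (on_path.count(n)) return n;
    return nullptr;  // disjoint trees; cannot happen in one DIRT network
}
```

Once the tuple is stored at this node, any later read within its subtree is satisfied by sending a copy downward, without the tuple itself moving again.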
After building the server, we first benchmarked the performance of multiple clients connecting to a central server, and compared that to the original Linda implementation, which had very different network characteristics. We next measured the performance of multiple servers connected in a hierarchy. Finally, we measured the effectiveness of tuple promotion in increasing the performance of a read-intensive application.
The S/Net consisted of 68000s connected by an 80 Mbit bus with a latency of about 1 ms. By contrast, our corner of the Internet consists of SPARCstation-20s connected by a 10 Mbit ethernet with a latency of about 10 ms. Our applications of Linda are similar to the original paper, but need some adjustments to take into account the reduced network speed and increased computation speed.
Our single server experiments used SPARCstation-20 workstations on a single 10Mbit ethernet with one machine designated as a server. The cluster was subjected to student use in addition to our measurements.
The ping-pong application is described in . One client transmits a "pong" tuple for every "ping" tuple received, while another client does the opposite. We use this simple application to measure the latency of a tuple space operation, and the limit on the number of clients a single server can handle.
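The benchmark's accounting is worth making explicit: one full cycle consists of four tuple space operations (each player performs one in and one out). The following single-process stand-in is purely illustrative (mock_space and run_cycles are hypothetical names; the real clients talk to a DIRT server over TCP), with a local queue playing the role of tuple space:

```cpp
#include <cassert>
#include <queue>
#include <string>

struct mock_space {
    std::queue<std::string> tuples;
    void out(const std::string& t) { tuples.push(t); }
    bool in(const std::string& want) {  // non-blocking stand-in for in()
        if (tuples.empty() || tuples.front() != want) return false;
        tuples.pop();
        return true;
    }
};

// One cycle = player A's in+out plus player B's in+out: four operations.
int run_cycles(mock_space& space, int n) {
    int cycles = 0;
    space.out("ping");                    // seed the exchange
    for (int i = 0; i < n; ++i) {
        if (space.in("ping")) space.out("pong");                 // player A
        if (space.in("pong")) { space.out("ping"); ++cycles; }   // player B
    }
    return cycles;
}
```

The four-operations-per-cycle count is what relates the measured cycle time to per-operation latency in the results below.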
The results of running a varying number of ping-pong clients connected to a central server are given in Figure 3. With one client pair, the time required for one ping-pong cycle is about 50 ms, or approximately 12 ms per tuple space operation. This latency over TCP is significantly higher than the 1-2 ms achieved by the S/Net, but is comparable to the results from the UDP broadcast implementation presented in . We will have to reconsider the granularity of our applications in this light.
As client pairs are added, latency increases slowly, up to 64 ms per cycle for 11 clients. Above 11 clients, latency increases drastically. We concluded that a single DIRT server can handle ~700 tuple operations per second without significant performance loss.
The ping-pong benchmark measures servers under continuous load from clients. Other applications are likely to be less demanding, allowing a larger number of clients per server.
The first matrix multiplication algorithm used in  assigns each worker process an individual matrix element to compute. Although this was acceptable for the S/Net implementation, it is a performance disaster for our system, because the minimum latency for assigning work (24 ms) is longer than the time for a single machine to compute an entire 24x24 matrix product. We require a rather large granularity to achieve a useful speedup.
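The granularity requirement can be framed as a simple break-even calculation: a work unit only pays off if its compute time exceeds the cost of assigning it. The sketch below is a back-of-the-envelope illustration (min_rows_per_task is a hypothetical name, and the per-row cost in the example is an assumed figure, not a measurement from this paper):

```cpp
#include <cassert>

// Smallest number of matrix rows per task such that the compute time
// exceeds the dispatch latency; anything smaller slows the system down.
int min_rows_per_task(double dispatch_ms, double row_ms) {
    int rows = 1;
    while (rows * row_ms <= dispatch_ms) ++rows;
    return rows;
}
```

With our measured 24 ms dispatch latency and an assumed 1 ms per row, a task must cover at least 25 rows before parallel assignment can win, which is why per-element assignment is hopeless here.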
We applied several optimizations to the matrix multiplication algorithm as presented in .
With these improvements, we give results for matrices of dimension 16, 128, and 512 in Figure 4. In the figure, the horizontal lines indicate the time required to perform the computation sequentially on a single machine.
Although the 16x16 matrix computation never improves upon the speed of a single machine, it demonstrates an important property of the system: the best parallelization factor depends on the granularity of the system. This matrix performs best with 7 workers, and performance decreases as more are added.
The 128x128 matrix is the smallest computation to break even. It performs comparably to a single machine for between 5 and 12 workers.
The 512x512 matrix sees a marked speedup for 2 or more workers. Even with a client on every machine on the local network, we were unable to reach a performance maximum. We attempted to harness more machines on adjacent networks; however, performance was wildly erratic due to long and unpredictable network communication delays.
We repeated the ping-pong benchmark with 20 clients connected to a central server, and compared it to the same benchmark with the clients connected to a simple tree of three servers: one root and two children. The results were somewhat erratic, with the single-server runs averaging 169 seconds and the three-server runs averaging 266 seconds for 2000 pings and pongs. The slowdown can be attributed to traffic between the two leaf nodes. It appears that one of our goals, keeping computations confined to separate subtrees as much as possible, was not met by this benchmark.
Figure 5 presents the results of running matrix multiplication on 16x16 matrices in a DIRT tree with tuple promotion disabled and enabled. This matrix multiplication benchmark is the same as that presented in ; since tuple promotion only affects read requests, we needed a benchmark that was read intensive.
The results show a consistent improvement with tuple promotion in all cases except that of a single worker process. The improvement in the multiple-worker cases is due to the inefficient policy of the non-promoting implementation, where a tuple migrates to the node making the read request. When there are multiple reads of the same tuple from different leaf nodes, the tuple ends up being volleyed back and forth between those nodes. Tuple promotion alleviates this phenomenon, but limits the best case to the distance from the leaf to the common root.
The poor result of tuple promotion in the single worker case is actually a result of the way the experiment was conducted, in which the tuples were initially injected at the root instead of at the leaf where the worker is connected. This added an additional edge in the graph through which every read request and satisfying tuple must travel.
The bottleneck of this system is clearly the latency of requests hopping from node to node. We consider several possible optimizations.
We show effective use of a single DIRT server on a local area network. A single server can support about ten clients under heavy load, and significantly more in a less demanding application. We have successfully run up to forty clients.
To harness more clients on disparate networks requires coordination in groups of ten or twenty. The algorithm we present gives predictable performance when tuple promotion is in effect, but does not successfully restrict traffic to minimal subtrees.