Experience with Grapevine: The growth of a distributed system

Experience with Grapevine: The growth of a distributed system. Michael D. Schroeder, Andrew Birrell, Roger M. Needham. ACM Transactions on Computer Systems, Feb 1984.

Reviews due Thursday, 9/4 at 8:00 am.

Comments

This paper describes the implementation of Grapevine, a distributed mail service. Objectives are covered, along with problems that arose while the system was in use, and solutions are proposed to continue to meet the needs of the system and the users.

Grapevine was built to serve as an asynchronous mail system for the organization. The important idea was to make the system easily scalable, such that they didn't need one large machine, but could continuously add new machines as the system grew. Each machine should have a constant amount of load that does not scale with the size of the system or number of users. In this way, they could compute how many servers they needed based on how many users there were. Indeed, the specification was for 10,000 users.

Another goal was to work transparently and seamlessly, such that it seemed as if all communication was with one machine. This is, in addition to a system design issue, a user experience issue. Finally, the system needed to be reliable, in that single server failure does not result in unavailability of a service for a client.

This paper's major contributions are the lessons learned from deploying this system. The system grew to 17 servers over 2 years, with heavy use by the corporation. The issues they experienced, by and large, fell into two camps. The first were problems that came about because of scale. The scale could be an increased number of users, large distribution lists (which resulted in many of the problems), or just an increase in the amount of hardware (with enough hardware, failure becomes the norm, not the exception). Some of the problems that resulted were messages lost to failures, servers paralyzed by mass sending of messages to distribution lists, and latency in message delivery due to a large number of hops. It is interesting to note that for a relatively simple idea, they still ran into problems with scale, even at levels that today seem tiny. The lesson to take from this is the popular Knuth quote - "premature optimization is the root of all evil". While this may be an exaggeration, it is difficult to pinpoint in advance where a problem is going to occur when a system scales. And while some problems will be prepared for in advance, the fact of the matter is that even Google will run into problems when their software scales. It reinforces the importance, and difficulty, of maintaining such software.

The second camp of issues were user experience issues. These included users being confused about why the system acted in some way (such as sending duplicates of a message to their inbox) or wanting to know about the state of the system. The authors do indeed state that it is difficult to balance the users' desire for visibility into the system against the tools needed to actually record the state of the system, which is an issue that carries over to other distributed systems projects as well. Indeed, "how much does a single node know about the state of the system at all times?" is a deeper question than simply user experience, but the two are related. This also relates to the tools needed for hardware and system engineers to work on the servers, which would occasionally run into issues. The overarching idea - state of the system as a whole - is important for any distributed system.

Overall, this system had great ideas and an effective execution. As stated, it is difficult for system designers to take into account all of the possible factors during design, so it is difficult to fault them with any of the problems that arose.

Grapevine is a distributed messaging system that also provides
authorization and authentication mechanisms. It was intended to be
distributed to better handle the load from the users, as well as to
provide some fault tolerance through redundancy (inboxes are
replicated).

Positive:
- Redundancy: divides services into messaging and registration.
- Tested on an interesting network topology.
- The RNames naming scheme is sensible, and reminds me of DNS -- they
are from about the same time.
- Platform-independent lib for communicating with system. It hides the
complexity from the user, making it appear as a single server.
- Scalability: 1500 to 8500 messages/day. However, the 10,000 design
maximum seems shortsighted, especially considering they started with
1,500!
- They are honest: they are afraid to make any changes to their code.

Negative:
- The system is not dynamically (automatically) reconfigurable to work
with a different load -- it cannot add/remove servers as needed. It
requires human intervention.
- All registry servers know each other, which reduces scalability (but
may be better for performance).
- Inherent limit to the size of the system: receiving message server
is responsible for finding *all* recipients! They suggested solving it
with indirection
- Many things seem hard-coded, like where to send replicas. The
heuristics that they may have seem rather primitive.
- There is some data on the usage, but no measurement of the load of the
system. They may have 8,500 messages a day, but what effect does that
have on performance? It would be nice to have a graph of load vs. #
messages.
- Mention of running out of disk space. But how about some byzantine
failures or other disaster scenarios?
- Maintenance is manual. This seems like the hard part of a Distributed
System, so it's a cop-out to not handle it. They instead have a back
door (the viticulturist's entrance!) for distributed debugging,
reminiscent of SSH.
- Deadlocking bugs! Another distributed systems problem
(synchronization) that doesn't seem to have been tackled.

This system looks like a precursor of SMTP. A big difference is that
SMTP also has distributed ownership of mail servers. The reason that
they couldn't develop something like SMTP is probably because there was
no such thing as DNS, which is what provides routing information (and
redundancy) for current mail servers.

To me, it looks like Grapevine implemented a Distributed System without
the hard parts. They did handle replication, migration and ability to
synchronize/reconfigure the configuration, but not other trickier
features. When there is a problem, their solution is to have a
technician intervene, instead of having the system detect the problem
and automatically adjust to the new situation.

One thing that they did a lot of is predicting human patterns, and
making design decisions accordingly. For example, they thought that
groups would be topographically close, so they stored them on close
nodes. I think that would be applicable today. One thing that seems
dated, is the problem of space. The reason they had to spread the system
over many nodes, is because they didn't have enough space: 5MB of disk
space! Our codebases these days are orders of magnitude bigger than
that!

Grapevine is a distributed messaging system. The system has registries of users, and groups that can contain both groups and users. It has one program to handle registration and another program to handle messages. Users can have multiple inboxes, and the system maintains replicas to handle server failures and unreliable connections. The system is also used for authentication by file servers (file servers originally had separate users, but after the system was deployed they use Grapevine for authentication).

The main point of the system is that instead of one powerful machine, they used multiple small machines, and they handle the workload using replicas (to tolerate system failures) so that the distributed system is invisible to the users and they can interact with it as if it were a single machine (unless some error occurs, for example an inconsistency between replicas because of a delayed update).

Some of the main problems that they mentioned are:
- A delay in applying updates to replicas can cause inconsistency (for example, an administrator adds a user and then sends a message that goes through a server that has not been updated yet).
- If a user is a member of multiple groups, a message is sent to all of those groups, and the registries for some of the groups are inaccessible at some point, the user might receive the message multiple times.
- If a user reads his messages from a terminal, the server has to keep the messages in the inbox, and because of the limited disk capacity this can cause problems. (They decided to limit the number of users who can do this.)

An unrelated point: I remember a couple of years ago I was trying to download the attachment of an old email from my Yahoo account, and I encountered an error saying that the file was not available (probably because the server storing that file wasn't accessible). It is interesting to see that some of the problems that people encountered in early distributed systems can still happen in current working systems.
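
To make the first problem in the list concrete, here is a minimal sketch of replicas that apply updates lazily. It is not Grapevine's actual code or protocol: the class, the function names, the server names, and the RName "newuser.pa" are all made up for illustration.

```python
class RegistryReplica:
    """One replica of a registry (hypothetical model, not Grapevine's code)."""

    def __init__(self, name):
        self.name = name
        self.members = set()   # names this replica currently knows about
        self.pending = []      # updates received but not yet applied

    def apply_pending(self):
        """Apply every queued update, as if the update messages had arrived."""
        for rname in self.pending:
            self.members.add(rname)
        self.pending.clear()


def register(rname, contacted, others):
    """Apply the update at the contacted replica now; queue it for the rest."""
    contacted.members.add(rname)
    for replica in others:
        replica.pending.append(rname)


def can_deliver(rname, replica):
    """A message server consulting this replica only sees what it has applied."""
    return rname in replica.members


a = RegistryReplica("server-a")
b = RegistryReplica("server-b")
register("newuser.pa", contacted=a, others=[b])

before = can_deliver("newuser.pa", b)   # the update has not reached b yet
b.apply_pending()
after = can_deliver("newuser.pa", b)    # the replicas have now converged
```

Between the call to `register` and the call to `apply_pending`, a sender whose request happens to reach `server-b` is told the new user does not exist, which is the window the administrator in the example falls into.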

This paper summarizes the valuable lessons they learnt from building and operating a large scalable, reliable, distributed messaging system, "Grapevine", that presents the abstraction of a single large unitary server to its users. The authors discuss the objectives and assumptions that motivated their design, aspects of their operational workload that violated these assumptions and stretched their design to its limits, and suggestions for future work that might improve upon the shortcomings of Grapevine's design.

A key design principle was to make Grapevine linearly scalable, allowing them to scale out their system by adding more machines as their user-base grew. This motivated them to use namespace partitioning as a way to balance load per server, essentially leading to a shared-nothing design. However, it is difficult to maintain both geographical locality (locating user inboxes on their nearest server) as well as organizational locality (collocating user inboxes or distribution list objects according to organizational hierarchy of the business) while achieving load balancing. I believe modern systems use distributed hash tables for this purpose which can significantly improve load balancing and elastic scale-out. Further, the authors blame the load due to operations like distribution list expansion and update for delays in message delivery. This suggests that the workload needed to have been better partitioned or distributed in the first place. The authors suggest this themselves as part of the "All" registry, but I think the principle of distributing heavier computation such as DL expansion should apply more broadly.
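
As a rough illustration of namespace partitioning, the sketch below routes a two-part name of the form individual.registry to the servers holding that registry. The registry names "pa" and "es" come from the names used in the paper (e.g. Tax.pa); the server names, the table, and the helper functions are assumptions made up for this example.

```python
# Hypothetical placement table: registry name -> servers holding a replica.
REGISTRY_SERVERS = {
    "pa": ["server-1", "server-2", "server-3"],
    "es": ["server-4", "server-5", "server-6"],
}


def split_rname(rname):
    """Split 'individual.registry' at the last dot."""
    individual, sep, registry = rname.rpartition(".")
    if not sep or not individual or not registry:
        raise ValueError(f"not a two-part name: {rname!r}")
    return individual, registry


def servers_for(rname):
    """Only servers holding the name's registry need to know about it."""
    _, registry = split_rname(rname)
    return REGISTRY_SERVERS[registry.lower()]


print(servers_for("Tax.pa"))
```

The point of the design is that adding users grows the number of registries (rows in the table) rather than the size of any one registry. A distributed hash table would replace the hand-maintained table with a hash of the name, trading the administrators' control over locality for automatic balance.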

Another principle in Grapevine's design was that the underlying distribution and replication details should be transparent to the users, giving them the illusion of interacting with a single server. This goal is largely met in Grapevine's design through the use of services that users can connect to from anywhere and that can perform all actions for the user. However, there are violations both during fault recovery (particularly network faults) as well as message/transaction delays. While some of these may be unavoidable without very large-scale distribution (such as the web services of today provide with data centers all over the globe), others might have been avoidable with a better consistency model such as weak, session-based consistency (where the user sees the effects of their changes immediately, even as they propagate slowly to other servers).

An important lesson learned from Grapevine is that innovative system design is sometimes rendered difficult by the difficulty of predicting user behavior and load characteristics of the future. While we see the pervasive use of emails today as "obvious", many aspects of this were quite difficult to anticipate in 1981 when Grapevine was built. The only way to overcome this challenge is to use a modular, flexible software design where it is possible to adapt the system to user behavior over time. The authors admit in the conclusion that the design of Grapevine has grown so complex that it is now difficult to make any changes. This, I believe, remains a key challenge in software design today.

In this paper, the authors describe their experience with Grapevine - a distributed, replicated system that provides a message service and a registration service. Along with the message service, Grapevine provides authentication and access control mechanisms for accessing file servers.
The paper is written more like a commentary on the various design choices made by the authors while designing the original Grapevine system, the consequences of those choices as the system grew in size over time and the modifications to those choices that were necessary to address the new challenges. The new challenges are a result of the scale of the system – at the time of writing this paper Grapevine supported nearly 4400 users and 1500 groups and handled about 8500 new messages per day. Since the original Grapevine system was designed with a specification to support a load generated by 10,000 users, the current scale provided an opportunity for the authors to reflect upon the original design and how it handled scale.
The system has many design choices that helped attain scale and the most notable among them is the division of the registration database into registries so that the number of registries and not the size of a registry increases with the number of users. The concepts of geographic registries and local servers also helped achieve scale in a big way. The presence of 3 replicas and their choice of geographical distribution of the 3 replicas helps provide a good availability of the system. The designers have also given considerable thought into making the system look like a unitary one for all users and registrars though not always successful. Achieving consistent state is another important aspect that the designers considered though not completely successful in achieving it for all kinds of operations.
The main issues the designers faced were with respect to the handling of distribution lists which stressed a lot of the design choices. The presence of nested groups made authentication and access control to file servers very slow and the designers could have dealt with it better than the temporary fix that they provided by keeping a copy with flattened namespace. They also do not consider the underlying network for performing any kinds of optimizations. The system would definitely scale much better if networking limitations/optimizations were considered. Also, the methods used for recovery from failures involving operators, remote system experts and logs all seem insufficient for a large scale distributed system’s operation.
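
A minimal sketch of the flattening idea mentioned above, assuming groups are stored as a plain mapping from group name to member names. The data, the names, and the cache are hypothetical and not taken from the paper.

```python
def flatten(group, groups, seen=None):
    """Return the individuals reachable from `group` through nested groups."""
    if seen is None:
        seen = set()
    if group in seen:            # already expanded; guards against cycles
        return set()
    seen.add(group)
    individuals = set()
    for member in groups[group]:
        if member in groups:     # the member is itself a group: recurse
            individuals |= flatten(member, groups, seen)
        else:
            individuals.add(member)
    return individuals


# Hypothetical registration data: group name -> list of members.
groups = {
    "staff.pa": ["alice.pa", "systems.pa"],
    "systems.pa": ["bob.pa", "kernel.pa"],
    "kernel.pa": ["carol.pa", "staff.pa"],   # deliberately cyclic
}

# Flatten once, then answer access-control checks with a set lookup.
flat_cache = {name: flatten(name, groups) for name in groups}


def is_member(user, group):
    return user in flat_cache[group]
```

The nested walk happens once per group instead of once per access check. The cost is that the flattened copy has to be rebuilt or invalidated whenever any nested group changes, which is presumably why it reads as a temporary fix.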
Overall, the Grapevine system provides valuable insights into how to design a distributed system and how the design choices affect its performance. Many of the principles used in its design are used today as well and so it has stood the test of time.

Summary: The paper by Schroeder et al. discusses the challenges and implementation details of Grapevine, a distributed system used for exchanging (email) messages between users. The paper starts off with a structural overview of Grapevine: how there are multiple delivery paths (primary and secondary mailboxes) for reliability, how Ethernets/(routers) and servers are connected, how registries are stored using RNames, how Grapevine syncs by propagating messages, and how users can create/modify/delete groups and individuals. Next, the paper talks about the scalability of Grapevine, which can meet the load of 10,000 users by adding more layers of distribution so that message delivery is done in parallel (e.g. use Tax.pa and Tax.es instead of just Tax.All), by using editors to filter messages, or by avoiding unreliable links. Then, the paper covers the issues of Grapevine, such as when to add more servers, whether to use a division of function for messages and registration, how to divide inboxes, how to utilize distribution lists to account for people relocating, how to deal with the concurrency issue of registration database changes, how to obtain improved delivery time, how to authenticate faster, how to remotely maintain Grapevine servers, how to update Grapevine, how to make Grapevine reliable, how to add/split servers, and how to maintain disk space by archiving old messages. Finally, the paper concludes by saying that the authors have already implemented those proposed changes that have high rewards and low risk.

Positive Comments: The paper is full of technical contributions that may have solved many early problems of distributed systems. The list of technical contributions is rather long: using backup inboxes (primary and secondary) and splitting the secondary inbox to boost reliability, using locality information to increase message delivery speed, adding layers of distribution to decrease congestion, adding more servers when there is a larger load, separating message and registration servers for more efficiency, changing the remailing policy to avoid congestion, propagating only changes instead of the entire registration database, using caching and hashing for faster authentication, using a dead letter facility to diagnose errors, rejecting connections when disks are near capacity to help avoid deadlocks, and archiving messages to save disk space. Furthermore, some suggestions were excellent (large payoffs and low risk) and were implemented in Grapevine.

Negative Comments: The paper did not analyze the payoffs of any of its suggestions. Therefore, although the paper contains many suggestions to improve Grapevine and distributed systems, there is no data to support any of them. Some suggestions sound counterintuitive for distributed systems. For example, although using an administrative interface program that picks one registration server may handle the concurrency issues with registration database changes, doing so may overload/congest that registration server if multiple users want to modify the registration database at the same time. Therefore, it is difficult to evaluate whether the technical contributions of this paper are actually worth implementing in other distributed systems.

Summary:
This paper first describes some key principles behind the Grapevine system, which provides message delivery, naming, authentication, resource location, and access control services in an internetwork of computers. The system consists of a set of servers, each of which provides two kinds of services: a "message service", which is responsible for message delivery, and a "registration service", which provides naming, authentication, resource location and access control.
The key focus of the paper is on some of the observations from operational experience of the Grapevine system in terms of scalability, reliability, transparency and load balancing as the load on the system increased. It further suggested improvements that can overcome some of the limitations exposed by the operational experience.

Some key ideas from the paper that I like:
1. The ability to scale by adding fixed power servers by partitioning the registration database into registries that span across multiple servers.
2. The idea of indirection or hierarchical distribution lists to get around the problem of scaling distribution lists.
3. The idea of replicating registries across multiple servers (also on both ends of unreliable links) and having primary/secondary mail boxes to provide reliability.
4. Providing diagnostics and monitoring through viticulturist's entrance
5. The idea of having a backup server to archive messages when the primary servers run out of space

Some flaws in the paper:
1. Inconsistent registry replicas during updates seemed like a major drawback in the system.
2. The system will not scale with the addition of new servers beyond a certain point. This is because the number of messages exchanged between the servers will also increase, limiting the scalability.
3. The paper does not consider possible threats/attack vectors on mail servers, such as spam and denial of service.
4. Resource location is based only on proximity, without taking into account other network parameters like congestion, bandwidth etc.

Contributions:
1. This paper provides a structural overview of a distributed and replicated system that can scale using fixed power servers. This work was the first of its kind and laid a foundation for the future of distributed systems.
2. The problems discussed in the paper are generic to any distributed systems. Some of the issues are still applicable to modern distributed systems and are being grappled with.

The paper describes the high-level implementation of Grapevine, a system that
provides a conglomeration of services such as messaging, name services, and
authentication and ACLs for file sharing. It then goes on to describe parts
of the design that worked well, but points out issues with the system design
that arose only after gaining practical operational experience with actual use
and scaling of the system over time.

The problem they are solving is the need to improve the design of future
implementations by advising system developers of these issues so they can be
avoided in newer designs, leading to more scalable, robust, secure, and
responsive systems. I think the advice is quite useful, as most of the issues
are subtle, some are the result of rare conditions, and some are the result of
cascading failures, all of which can be very difficult to identify and design
around on first implementation. Therefore, I believe they are essentially
solving the problem of, "if I could do it all again, I would have done it
better," which is common when developing any system of non-trivial complexity.
One problem they explicitly say they are NOT solving is to improve their
existing system. The paper in some cases suggests a specific solution to a
problem which they have not actually implemented. In the conclusion they give
fairly frank reasons for why.

They point out that scalability may have hidden bottlenecks. For example, the
machinery for sending individual messages may be scalable, but the notion that
users will want to join lists, make the list size proportional to the number
of users, but then also have additional messages from each user means that part
of the system load (messages to send to users) scales with (number of users)^2
instead of linearly. This is more of a social issue than a technical one,
which is why it's easy to miss in the original design requirements. They
propose a top level of indirection to delegate (forward) message delivery to
lower-level registries.
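
The quadratic growth can be checked with a little arithmetic. This is only a
back-of-the-envelope sketch: the function, its parameters, and the numbers
below are made up for illustration and are not measurements from the paper.

```python
def daily_deliveries(num_users, msgs_per_user, users_per_list_member):
    """Deliveries per day when every user sends `msgs_per_user` messages to a
    list whose size is proportional to the user population (one member for
    every `users_per_list_member` users)."""
    list_size = num_users // users_per_list_member
    return num_users * msgs_per_user * list_size


small = daily_deliveries(1000, 1, 10)   # 1000 senders x 100 recipients each
large = daily_deliveries(2000, 1, 10)   # 2000 senders x 200 recipients each
print(small, large, large // small)
```

Doubling the number of users doubles both the number of senders and the size
of the list each of them sends to, so the number of deliveries goes up by a
factor of four rather than two.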

Another area they discuss is that configuring the system is difficult, as it
relies on a combination of technical properties (disk space, bandwidth and
reliability of the various interconnects), geographical, and organizational or
managerial properties. It is vital in a robust system that there be additional
capacity to handle the cases where servers are offline, since otherwise the
additional load on the still-operating servers can cause new classes of failure
and this idea can cascade until the whole system is broken. They discuss how
to determine WHEN to grow the system. However, they do not discuss HOW to grow
the system other than by some manual "good advice". Here, I believe there is
room for improvement by adding some ideas for dynamic load balancing. Another
useful addition to a system like this is push notifications to admins and
registrars when ideal capacity has been exceeded, as detected by the system
itself, rather than relying on passive monitoring by the admins.

At the time this was written, the general user community of this system likely
trusted each other and "hacking" wasn't an issue. And although Grapevine does
provide authentication services, the passwords are sent in the clear, and the
client is not able to authenticate the authentication service itself. The
paper "Grapevine: An Exercise in Distributed Computing" mentions this, but the
newer paper makes no mention of any problems arising from the lack of real
security. It's too bad it wasn't implemented, because cryptography is often
CPU-intensive leading to another host of interesting scalability problems and
Grapevine's experience with it would have been informative.

It's obvious that the contributions from this paper have long since been
incorporated in some form or another, and likely been further obsoleted by
newer problems given the explosive growth of the Internet. The combined
services of Grapevine were precursors for the two core services of the modern
Internet, mailer daemons (messaging) and DNS (name registration), both of which are
highly distributed systems and still work together in a similar way. (NIS/LDAP have taken over for other types of registration)

They also advise that personal computers are more common than anticipated, and
people will want to get their messages from anywhere, not just their
workstation. That's a pretty good insight for a paper written 30 years ago.

This paper reports operational experience with Grapevine, a distributed and replicated system for message delivery. The authors first provide a structural overview of Grapevine, and then discuss several design considerations with proposed strategies. The goal of the paper is to provide suggestions for future system design.

Contributions:
1. To deal with the scalability problem, Grapevine partitions the registration database into more registries, which ensures that the data on each registration server does not exceed its maximum capacity.
2. To simplify database changes, Grapevine propagates only the change in each update message rather than the entire list. This greatly reduces the average update message size.
3. Grapevine provides high reliability through replication, which means other servers can perform the same function when one server is unavailable. They found that having enough spare resources for normal operation and system reconfiguration is extremely important for a reliable distributed system.

Pitfalls:
1. Grapevine requires a direct connection for message delivery, but on the modern world-wide Internet a direct connection is often not possible because of the rapid growth of the Internet, firewalls, etc.
2. Though replication is necessary to increase reliability, it also causes inconsistencies, as it is impossible to update all copies of the data at the same time. This leads to a management problem when deleting names.
3. Grapevine does not really consider the topology of the internet. For example, it does not consider that the topology of Grapevine servers could change over time. Network routing mechanisms could also be taken into account to improve message delivery.
4. As an early-stage distributed system, it only had 17 servers by 1983. However, modern networks may have more servers and run different network protocols, so the network is no longer homogeneous.

Overall, this paper presents detailed operational experience with Grapevine, which is important for future designs. Although it has some pitfalls from today's perspective, the paper was quite meaningful at the time, and many modern designs have learned from Grapevine.

Summary

This paper presents the experience and the lessons learned in using the Grapevine system. Grapevine is a distributed system that provides message delivery, resource identification & location, access control and authentication services. The paper explains how the system performs under substantial load, shows scalability problems encountered, talks about the flaws in the original design and suggests/implements solutions to these problems.

Highlights

One of the key contributions is the realization that one can actually improve scalability by having a large number of "fixed-power" machines rather than a single super-powered computer. Though this has the problem of increasing the number of network messages, the idea itself sounds neat and applicable even today. The one-level indirection solution to the 'general interest distribution list' seems elegant. The idea of message sharing (which is a kind of deduplication at the granularity of a message) seems like a good attempt to save disk space. Increasing the opportunities for sharing through strategic placement of mailboxes is very good. Many small features like archiving, secondary mailboxes, and logs for debugging are well thought out and add more value to the system. To me, it feels like many directory services like Windows Active Directory and OpenLDAP have taken inspiration from the ideas/concepts introduced in this paper.
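
To illustrate what message sharing buys, here is a minimal sketch in which one server stores a message body once and each inbox on that server holds only a reference to it. This is an assumption-laden toy model, not Grapevine's storage format: the class, its methods, and the reference counting are all invented for the example.

```python
import itertools


class MessageServer:
    """Toy model: one stored body per message, shared by all local inboxes."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.bodies = {}     # message id -> text, stored once
        self.refcount = {}   # message id -> number of inboxes still holding it
        self.inboxes = {}    # recipient -> list of message ids

    def deliver(self, body, recipients):
        recipients = list(dict.fromkeys(recipients))  # drop duplicate names
        if not recipients:
            raise ValueError("no recipients on this server")
        msg_id = next(self._ids)
        self.bodies[msg_id] = body
        self.refcount[msg_id] = len(recipients)
        for recipient in recipients:
            self.inboxes.setdefault(recipient, []).append(msg_id)
        return msg_id

    def retrieve_and_delete(self, recipient):
        """Return the recipient's messages and free bodies nobody else needs."""
        texts = []
        for msg_id in self.inboxes.pop(recipient, []):
            texts.append(self.bodies[msg_id])
            self.refcount[msg_id] -= 1
            if self.refcount[msg_id] == 0:
                del self.bodies[msg_id]
                del self.refcount[msg_id]
        return texts


server = MessageServer()
server.deliver("meeting at noon", ["alice.pa", "bob.pa"])
first = server.retrieve_and_delete("alice.pa")
```

In this model a message to many recipients costs one body plus one small reference per inbox, so the more recipients of a list have inboxes on the same server, the more disk space is saved. That is why mailbox placement affects the benefit.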

Drawbacks

1. Though they seem to convey that the system is scalable, it is not truly scalable in the following sense: As we increase the number of machines in the system, the network messages can increase rapidly and still curtail scalability.

2. Consistency semantics for registry entries seem to be really weak, although the update operations seem to be highly available. This can cause really annoying problems like the following: A user is newly added to the registry. This update reaches one replica of the registry database, and some other user immediately tries to send a message to that new user. The send can consult another replica that hasn't received the update yet and hence doesn't know about the new user.

3. The resource location mechanisms don't consider network properties like bandwidth, loss, and congestion; they consider only the number of links/hops, which seems suboptimal.

4. It is quite surprising that in the original design, they were sending the entire list during updates of the registry database. Though they refined their design to propagate only deltas, this seems like something that should have been nailed down in the original design.

5. The paper has limited or no description of the security threat model of the system. It does not talk about adversarial users, attack scenarios, or abnormal operating conditions in the system. These things have become the norm rather than the exception in today's world.

Relevance

Though this paper was written 30 years ago, the ideas discussed in it are still relevant and are still being researched in the distributed systems community. Many successful ideas and directory service implementations like Windows AD and OpenLDAP seem to have taken inspiration from the ideas in this paper. Even today, this paper can serve as a great read for people building distributed systems.

Grapevine is a distributed messaging system, and this paper reviews how it performed over the years, its pitfalls at that time, and some suggestions for future versions. Implemented in the 1980s, Grapevine provided message delivery, naming, authentication, resource location and access control services, and it was the only system to provide that combination of services.

Grapevine was originally designed for a maximum system size of 30 servers and a load of 10000 users and in most cases, it delivered the requirements that were expected of it. The authors suggest that scaling beyond might require keeping partial configuration information in each server and adding another layer to the naming hierarchy. Also dividing the registration database into registries will prevent the scaling issue.

It was noticed that with a growing user community (and growing distribution lists) the time to process some messages was more than 10 minutes, thus delaying delivery. The authors are of the opinion that this can be solved by adding a layer of indirection wherein the entire list is divided into smaller sublists and the requests are ultimately steered to these sublists. Grapevine was not ready to handle user communities beyond 10,000, since at some point the number of messages arriving for a user would become overwhelming and hence would require a filtering mechanism.
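The sublist idea described above can be sketched roughly as follows. This is only an illustration, not Grapevine's code: it assumes recipient names of the form "user.registry", and the registry names and the "list-registry" naming of sublists are hypothetical.

```python
from collections import defaultdict

def split_by_registry(members):
    """Group names of the form 'user.registry' into one sublist per registry."""
    sublists = defaultdict(list)
    for name in members:
        _, registry = name.rsplit(".", 1)
        sublists[registry].append(name)
    return dict(sublists)

def top_level_entries(list_name, members):
    """Names of the sublists that the top-level list would hold after the split.

    The 'list-registry' naming scheme is an assumption made for this sketch.
    """
    return sorted(f"{list_name}-{registry}" for registry in split_by_registry(members))

# Hypothetical members spread over three registries.
members = ["alice.pa", "bob.pa", "carol.wbst", "dave.es"]
sublists = split_by_registry(members)
entries = top_level_entries("Taxes", members)
```

With this split, the server that first accepts the message only has to expand the short top-level list; each sublist can then be expanded by a server that holds the corresponding registry.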

The size of the internet infrastructure affects message delivery delays, since each additional gateway in the path adds to the delay. However, the authors point out that at that time, Grapevine didn't see a major issue with up to 11 gateways.

It was a good idea to share a single stored copy of a message among the inboxes on a server. Sharing benefits most when the size of the server matches the size of the organizational unit, since all the corresponding messages can be stored on that server. It is also smart to add more servers of fixed power rather than increasing the power of a single server.
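Message sharing of this kind can be illustrated with a small reference-counting sketch. The class and method names below are hypothetical and the structure is an assumption for illustration, not the data layout the paper describes.

```python
class MessageServer:
    """Toy server that stores each message body once, however many inboxes hold it."""

    def __init__(self):
        self.bodies = {}    # message id -> body, stored once per server
        self.refcount = {}  # message id -> number of inbox entries pointing at it
        self.inboxes = {}   # inbox name -> list of message ids

    def deliver(self, msg_id, body, recipients):
        """Add one inbox entry per recipient; store the body only the first time."""
        if msg_id not in self.bodies:
            self.bodies[msg_id] = body
            self.refcount[msg_id] = 0
        for recipient in recipients:
            self.inboxes.setdefault(recipient, []).append(msg_id)
            self.refcount[msg_id] += 1

    def retrieve_all(self, recipient):
        """Return and remove a recipient's messages, freeing unreferenced bodies."""
        delivered = []
        for msg_id in self.inboxes.pop(recipient, []):
            delivered.append(self.bodies[msg_id])
            self.refcount[msg_id] -= 1
            if self.refcount[msg_id] == 0:
                del self.bodies[msg_id]
                del self.refcount[msg_id]
        return delivered

server = MessageServer()
server.deliver("m1", "lab meeting at 3", ["alice", "bob", "carol"])
```

The more recipients of a message share a server, the more disk space one stored body saves, which is why placing the inboxes of an organizational unit on the same server helps.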

As a distributed system, Grapevine faced issues such as delays in propagating registration database changes: updates were propagated by message, and the inconsistencies lasted for minutes. It also failed to eliminate duplicate messages from the inbox. Grapevine's replicated message delivery service made poor resource location decisions since it did not consider the bandwidth, reliability, or congestion of the links. Some load assumptions in the original design of Grapevine proved to be considerably inaccurate and affected its performance. Grapevine replicates function among several servers to achieve high reliability, and this approach has proved right.

The paper highlights that over time the designers became reluctant to change or upgrade the software, since the associated bugs disrupt essential services for dependent users. This is still the case, as we keep seeing minor blackouts in many popular services, often caused by upgrades. In my opinion, this is a very good paper since it evaluates one of the early distributed systems and at the same time provides guidelines and suggestions for developing future systems.

The authors of the paper describe in detail their experience in using Grapevine, a distributed message delivery system. They talk about their design choices and their consequences. In addition to giving a first-hand report of their experience, they describe new features and alternate implementations for Grapevine which they have not yet implemented.

One of the main goals of Grapevine is to scale. The designers wanted a message delivery system that scales simply by adding servers. To achieve this they tried to distribute the registration data across the registration servers. They realized that some distribution lists in the registration data (roughly representing interest in a topic) grow in proportion to the number of users. This would imply that the load on a server storing such a distribution list is proportional to the number of users, which is not what the designers intended. To solve this they added another level of naming hierarchy, allowing the initial server to expand the top-level list and then delegate the task of message delivery to other servers. Another interesting observation by the authors is the need for filters in a mailing system.

One decision they took here was to add a naming hierarchy to avoid having all the registration configuration in one server. This is a trade-off between speed and disk space, a decision which is archaic given present disk storage capacities.

Grapevine was designed and implemented with the intent of being used internally at Xerox. Because of this, some of the decisions they took make sense in their organizational structure. One such decision is to tolerate a delay of 12 hours in updating access control checks. Another is to have individuals from the same organizational division share the same message server, as that increases the probability of messages being shared and disk space being saved.

Another role of Grapevine is to decouple the authentication process from the file servers. This, coupled with caching of the access control checks, was expected to reduce the latency of authentication. It was found that, due to the nesting of groups, there was still considerable latency. This was addressed by using an up-arrow character in names as a syntactic way to recognize groups that need further exploration.

The messages in the inbox lists are archived to the file servers after a period of time. The file servers are not replicated and thus provide a bottleneck.

Another thing Grapevine does not consider is malicious attacks on the message delivery system. Messages are not checked for their contents, which makes it possible for malicious programs to propagate.

The authors of the paper began to recognize the limitations of a distributed database. They noticed the inconsistency period for the updates. They also noticed that incremental updates need a mechanism to preserve ordering.

The designers of the system tried to increase the reliability of the system by replication. The authors considered geographical and organizational arguments for their replication configurations.

The authors also recognize their folly in having a resource location algorithm which does not possess any knowledge of the underlying network beyond hop counts. They also recognize the problems in not providing a way for the end user to know the status of the system.
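The difference between choosing a server by hop count and choosing one with some knowledge of the links can be sketched as follows. The cost formula, the per-hop delay, and all link speeds here are hypothetical values chosen for illustration; the paper does not prescribe this metric.

```python
def nearest_by_hops(servers):
    """Pick the server with the fewest hops, ignoring link quality."""
    return min(servers, key=lambda s: s["hops"])["name"]

def estimated_ms(server, message_kbits, per_hop_ms=10.0):
    """Hypothetical cost: fixed delay per hop plus transfer time over the slowest link."""
    transfer_ms = message_kbits / server["bottleneck_kbps"] * 1000.0
    return server["hops"] * per_hop_ms + transfer_ms

def nearest_by_cost(servers, message_kbits):
    """Pick the server with the lowest estimated delivery time."""
    return min(servers, key=lambda s: estimated_ms(s, message_kbits))["name"]

# Two made-up candidates: one hop over a slow leased line,
# or two hops over fast local networks.
servers = [
    {"name": "A", "hops": 1, "bottleneck_kbps": 9.6},
    {"name": "B", "hops": 2, "bottleneck_kbps": 3000.0},
]
by_hops = nearest_by_hops(servers)
by_cost = nearest_by_cost(servers, message_kbits=80)
```

In this made-up case the hop-count rule picks server A behind the slow line, while the cost estimate prefers server B even though it is one hop further away.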

In addition to the above, the designers recognized the importance of monitoring and provided Grapevine with a facility for remote monitoring through the viticulturist's entrance, along with logging backed by a circular buffer and a dead letter facility.

The paper discusses the operational experience with Grapevine, a distributed and replicated system that provides message delivery, naming, authentication, resource location and access control services. The authors have given us an overview of the structure and the major design considerations made for handling scalability, availability and reliability in the system. A Grapevine server is the combination of a message server and a registration server.

Contributions :

1. Partitioning the registration database based on registries (corresponding to location/organization) proved to be a key idea for handling scaling issues.
2. Message delivery to huge distribution lists scaled well by breaking down into sublists based on the registry.
3. Message sharing between inboxes on the same server is a useful idea, as their present configuration and load show that at most 300 inboxes share a message. Also, division of primary inboxes among servers along organizational lines benefits from message sharing.
4. Availability has been given importance by employing a policy of having a primary inbox on the server, secondary inbox nearby and tertiary inbox on the other end of the internet also keeping in mind the reliability of links.
5. Some key decisions to adjust the load balance which have caused significant impact : The distribution lists were sent as incremental updates rather than the entire contents of the list. Caching is used to store the results of authentication and access control checks for file access.
6. Flattening of complex groups and syntactic distinction between group names and individual entries also reduced the number of checks for access control.
7. The log maintained by the server is a useful feature for debugging.

Flaws :

1. The initial specifications in the paper indicate that the maximum system size was 30 Grapevine servers with a total load of 10,000 users. However, it is not clear about the extent to which these specifications could be expanded based on the design considerations proposed.
2. Inconsistency due to the delay in the propagation of registration updates to various replicas looks to be a major issue in the transparency of the replication.
3. Some other limitations in the system include handling of deletion of names and duplicate messages.
4. Assumption that the user would not need to know about the state of availability of the server. However, informing the user about the reason for the failure of a service should be given more emphasis.
5. Assumption that the mail systems have workstations with local disks. A terminal service was developed for this purpose but with a limit on the number of users. This assumption is also not very relevant in today's systems where webmail is used.
6. The recovery process from server failures like disk corruption requires a lot of manual intervention in grapevine.
7. In the concluding remarks, it is said that a larger disk would eliminate archiving but current mail servers still use the concept of archiving and therefore, the argument is not quite convincing.

Most of the key concepts here could be seen in Microsoft Outlook. I think that these basic concepts are relevant even today in an implementation of a distributed mailing service.

The paper evaluates Grapevine, a distributed, replicated system, on various operational characteristics such as scalability, configuration decisions, transparency and distribution challenges, load-related issues, operations and reliability. Grapevine provides two services: a messaging service (message delivery) and a registration service (naming, authentication, access control).

Main contributions of the paper include scaling up the system by introducing a layer of indirection and dividing distribution lists by registry. I find introducing redundancy to achieve high reliability a good idea. The authors found that spare capacity could help in handling peak loads, but that it gets used up quietly as the system grows, leading to a shortage of spare resources during a failure. The idea of splitting secondary inboxes to spread load across multiple servers is good and helps fix the issue. The idea of having a dedicated message server and registration server for heavy local load is novel. However, it might still have some catches, as it was not implemented. Adding organisational registries alongside registries based on geographical location seems more practical. To a large extent, the system was successful in maintaining the impression of a unitary model. But delays in propagating changes to the registration database would 'surprise' end users and reveal the distributed and replicated nature of the system. The system did succeed in becoming a single source of authentication and access control to a large extent.

Flaws:
1. No automatic load-balancing as the large number of factors influence configuration decisions.
2. Grapevine was designed to scale to a maximum system size of 30 which is probably less in today's world.
3. Assumptions were made regarding the nature of the load which led to performance problems.
4. Many solutions were proposed, such as separate mailing and registration servers and moderating the mailing list (like an editor for newspaper content), but never implemented. It would be interesting to know how the system behaves with these changes.
5. I find the system to be NOT robust enough, as the authors admit their reluctance to change the software because bugs could lead to disruptions.
6. The message body is not retained on failure. The authors were concerned about the cost, which may not be a concern today as disk storage is cheaper now. I would treat this more as an enhancement than a flaw.
7. Messages are sent regardless of the availability of the local server (an assumption in the paper). Message delivery might take longer with an immediately available nonlocal server than waiting for the local server to free up. The resource location algorithm should be modified to handle such circumstances.

Relevance:
The paper brings out the various issues faced while designing a distributed system, some known prior to system design and others identified while implementing/testing the system which hold true even today. The solutions and approaches discussed can be a good starting point to tackle those issues while designing a distributed system.

The paper touches upon common issues faced by a distributed system, with Grapevine as the subject of analysis, and proposes potential improvements to performance and reliability.
The experience of operating Grapevine is analyzed, revealing problems that the paper tries to solve. One of these is the effect of scaling on distribution lists. An increase in the size of the user community resulted in message delivery delays, because the message delivery algorithm takes time to process these lists. To mitigate the effect, a level of indirection was added: distribution lists were made up of sublists, spreading the computation among a number of Grapevine servers. To serve a larger user base, the paper suggests adding a third level to the naming hierarchy to reduce scaling issues.
The importance of configuration decisions is also discussed. The paper mentions the need to get secondary inboxes to the other side of a link, or, to be specific, an unreliable link. So a policy was adopted for registration and message servers to have primary inboxes on the server itself, secondary inboxes nearby and tertiary inboxes at the other end of the internet. Another issue concerns the distribution of registry replicas. The decision to use only geographical registries is considered a mistake by the paper, and to improve on this, organizational registries were included for certain distribution lists.
Another topic the paper addresses is the assumption that users would not want to know the state of the system. This turned out to be wrong, and it is still work in progress, as it is very hard to determine the state of a distributed, replicated system. The original design assumptions about the load proved to be wrong, leading to performance issues. Database updates were done using merges, but the cost of merges increased with system expansion, growth of the distribution lists, and update frequency. This led to considerable lag in delivering messages. To mitigate this effect, updates were performed incrementally instead of by merging, reducing the computation cost.
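The saving from incremental updates can be illustrated with a small sketch. The update format below (dictionaries with "full" and "delta" kinds) is hypothetical, and the sketch ignores the ordering and timestamp machinery a real replicated registry needs.

```python
def full_update(members):
    """Update that carries the whole list (the costly original approach)."""
    return {"kind": "full", "members": sorted(members)}

def delta_update(added=(), removed=()):
    """Update that carries only the names that changed."""
    return {"kind": "delta", "added": sorted(added), "removed": sorted(removed)}

def apply_update(replica, update):
    """Return the new member set of a replica after applying an update."""
    if update["kind"] == "full":
        return set(update["members"])
    return (replica | set(update["added"])) - set(update["removed"])

def names_carried(update):
    """Number of names the update message has to transport."""
    if update["kind"] == "full":
        return len(update["members"])
    return len(update["added"]) + len(update["removed"])

# A hypothetical 1000-member list gains one member.
old = {f"user{i}.pa" for i in range(1000)}
new = old | {"newcomer.pa"}
full = full_update(new)
delta = delta_update(added=["newcomer.pa"])
```

Both updates bring a replica to the same state, but the full update transports 1001 names while the delta transports one.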
Another pitfall in the system is that access control checks, even with caching, took too long due to the nesting of group names. To solve this problem, a new membership check was introduced and a parallel flattened version of complex groups was maintained.
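The contrast between a nested membership check and a check against a flattened group can be sketched as follows. The group names are made up, and the use of "^" in a name as the marker for a group is an assumption standing in for whatever syntactic convention the system uses.

```python
def is_group(name):
    """Hypothetical syntactic test: names containing '^' denote groups."""
    return "^" in name

def is_member_nested(groups, group, name, seen=None):
    """Walk nested groups on every check; cost grows with nesting depth."""
    seen = set() if seen is None else seen
    if group in seen:
        return False
    seen.add(group)
    for entry in groups.get(group, ()):
        if entry == name:
            return True
        if is_group(entry) and is_member_nested(groups, entry, name, seen):
            return True
    return False

def flatten(groups, group, seen=None):
    """Precompute the set of individuals reachable from a group."""
    seen = set() if seen is None else seen
    if group in seen:
        return set()
    seen.add(group)
    individuals = set()
    for entry in groups.get(group, ()):
        if is_group(entry):
            individuals |= flatten(groups, entry, seen)
        else:
            individuals.add(entry)
    return individuals

groups = {
    "csl^.pa": ["alice.pa", "systems^.pa"],
    "systems^.pa": ["bob.pa", "carol.pa"],
}
flat = flatten(groups, "csl^.pa")
```

Once the flattened set exists, a check such as `"carol.pa" in flat` is a single lookup, and a denial no longer requires exploring every nested group.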
The paper also talks about the archiving arrangements used by Grapevine. The current setting can cause the disk to fill up, and thus, to improve reliability, a way to automatically adjust the age parameter controlling archiving is required.
The authors describe the system's shortcomings in great detail alongside its successes. Overall the paper is a candid assessment of Grapevine, explaining the various points important for the design of a distributed system.

Grapevine is a distributed, replicated system built by Xerox to provide an electronic messaging facility. In this paper, the authors discuss the challenges they faced when the system was subjected to substantial load (much more than anticipated at design time) and also discuss solutions they found for some specific problems. Their experiences concern scale, configuration, distribution and replication, expected load, monitoring and repair, and reliability.

The following are considered to be the key contributions of Grapevine:
The introduction of hierarchy among distribution lists, and the corresponding changes in the delivery algorithm to expand lists only up to one level, solved the problem of scaling with large distribution lists. I believe this idea is still used in Microsoft's Outlook.
The idea of brief updates (only including the change in the update message) when the load on the system was very high was a great way to avoid sending the entire changed entry in a message.
Their recommendation for increasing the inbox size and random access of messages proved to be very useful as we can see them adopted in modern email providers.

Flaws:
The system had huge delays in propagating the registration database changes since updates were propagated by messages.
A solution wasn't found for the problem of duplicate messages being sent to a user who belongs to two lists to which the message was sent.
The inboxes were designed to be read sequentially, and the design assumed that clients would have some space to store the messages.

I feel that Grapevine laid the foundation for people to start thinking about building distributed systems that are scalable. Most of the design issues introduced in this paper are very relevant even in today's modern distributed systems.

This paper shares discoveries the authors made while operating the Grapevine system. Grapevine is a distributed system that provides naming, message delivery, authentication, and access control services. It is designed to operate on multiple machines. Its design goal is to scale performance by adding machines rather than by using more powerful individual machines. Reliability is also a major concern in the design of Grapevine.

The authors shared their experiences when testing the Grapevine system. They did a lot of experiments with the system and measured the performance. Their results showed that the system met most of the design goals, although surprise existed in some aspects.

The authors discovered that overall the system scaled properly as the number of machines increased. The system was specified for up to 30 Grapevine servers and a total load of 10,000 users.

The authors also found the system to be extremely reliable. Grapevine achieved this level of reliability by adding redundancy. The system hid the details of failures from the user, and when the authors were using the system, they often didn't even know that a machine had failed.

The distribution list was one aspect that did not meet expectations. They discovered that as the scale grew, the size of certain distribution lists grew as a fraction of the community size. They also discovered that under certain conditions, it took more than 10 minutes to process an interest list. The authors believed this was the bottleneck of the system. A possible fix for this issue, as proposed by the authors, was to create more layers.

Another drawback of the system they discovered is the configuration decisions. The system required the configuration decisions to be made before everything is up such as how many servers to have, where to place them, and how to distribute registry replicas. It was sometimes hard to determine such configurations. For example, secondary inboxes. Splitting secondary assignments could impact the effectiveness of message sharing, but there did not exist a simple rationale for this.

In my opinion, this paper is important because it evaluated an early distributed system thoroughly, and provided valuable comments and suggestions. It also pointed out future developments in this relatively new area.

Grapevine was an early distributed messaging, authentication, and service locating system with decentralized administration that in many ways appears like a precursor to modern email systems. The Grapevine system was designed to be reliable, fault-tolerant, and secure even though the system was built on a foundation of individually unreliable servers and workstations. Overall, the authors provide a good case study for modern distributed systems since although they faced problems on a micro-scale in contrast to modern distributed systems, the problems fundamentally remain the same today.

For example, the authors state (in their conclusion) that they became afraid to make changes to the system because they couldn't foresee the repercussions (their original system was 30k lines of code). This foreshadowed the modern version of this problem since our codebases with many millions of lines of code have become difficult to safely modify as well.

Another issue the authors highlight (that also remains a severe problem in modern times) is the problem of load balancing. For example, the system began to sometimes experience 10 minute delays in message delivery (because individual servers were getting overloaded – the workload was not being shared among enough machines). The solution was to add a level of indirection to messages which was called a steering list which allowed the computation to be distributed to a number of servers proportional to the number of users. They admitted this would eventually fail because the system was built on the assumption that any given messaging server could be reached directly (and they suggested that multi-step forwarding could solve this possible future issue).

Many other salient points were made such as: the use of log files to analyze and correct issues / bottlenecks, replication to create reliability, or the importance of maintaining excess capacity across all servers for a system to perform well overall. For example, they modified server behavior so when less than 5% capacity was left it would start to reject messages (to prevent deadlock). We still use log-files today to analyze & improve performance and maintaining excess capacity on our servers is just as important now as it was back then.
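The rejection rule mentioned above can be sketched as a simple admission check. The class name, the page-based accounting, and the exact comparison are assumptions for illustration; only the 5% figure comes from the text above.

```python
class InboxStore:
    """Toy disk accounting that refuses new messages when free space runs low."""

    def __init__(self, capacity_pages, reserve_percent=5):
        self.capacity = capacity_pages
        self.reserve_percent = reserve_percent
        self.used = 0

    def try_accept(self, pages):
        """Accept a message of the given size unless free space is under the reserve."""
        free = self.capacity - self.used
        # Integer arithmetic avoids floating-point surprises at the boundary.
        if free * 100 < self.capacity * self.reserve_percent:
            return False
        if pages > free:
            return False
        self.used += pages
        return True

store = InboxStore(capacity_pages=1000)
first = store.try_accept(940)   # plenty of room
second = store.try_accept(20)   # 60 pages free, still at or above the reserve
third = store.try_accept(1)     # 40 pages free, below the 5% reserve
```

Keeping a reserve means the server still has room for the work needed to drain its queues, instead of filling the disk completely and stalling.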

I was disappointed that the authors didn't address the issue of cleartext passwords which was present in the original Grapevine system; I can only guess they thought the issue either wasn't worth addressing or they didn't fix it. Overall, the authors were unusually candid regarding their own mistakes; the entire body of the paper explains the problems with their original distributed system design and the various solutions they used to overcome the respective problems as both the system and the user-base grew. Most of the problems they faced are still relevant today, and hence, this paper has aged quite well.

This paper is mainly a report on the authors' operational experience with Grapevine, a distributed, replicated system. After several years of experience, they found that most aspects of Grapevine ran as expected, while they also found some surprising bottlenecks. It consists of two parts: an overview of Grapevine and observations on the usage of Grapevine.

Operating on the Xerox research internet, Grapevine divides its services into the message service and the registration service. Each Grapevine server runs both service programs. Each service is a client of the other.

As for scalability, they found that the amount of registration data on one registration server does not grow as the system grows, but that the size of certain distribution lists increases as a fraction of the total user community, and that the size of the underlying internet also grows with the user community.
These two issues can be solved by
* adding an indirection in their interpretation
* multistep forwarding

The section on configuration decisions answers some of the configuration questions. They realized that division along organizational lines reflects communication patterns and is easy to administer. Using only geographical registries was therefore a mistake; a distribution list can be placed in a geographic or organizational registry according to its purpose.

It turns out that Grapevine can be regarded as a single reliable computer, apart from some corner cases caused by:
* delays in propagating registration database changes
* deleted names
* duplicated/replicated messages
* etc.

Adjusting the load is similar to system configuration. They found that even with caching, users still complained that access control checks take a long time, especially when access is denied.

Grapevine provides two types of feedback to the two types of people involved in system operation: operators and system experts.
Although, thanks to redundancy, the reliability of Grapevine turned out to be quite good, they still proposed two ways to improve it:
* automatically adjust the age parameter that controls archiving
* larger inboxes

This paper summarizes operational experience with Grapevine, a distributed and replicated system providing message delivery, naming, authentication, resource location, and access control services.
Grapevine has several exceptions to its scalability. The distribution list design does not scale properly. Messages to large interest lists take a long time to process and thus delay other deliveries. This may be handled by adding an indirection in their interpretation. A large internet also affects scale, since messages have to traverse longer paths of unreliable links.
Grapevine also has configuration requirements. The servers should be large enough to match the unit size of the organization, and new servers should be added when the load on an existing server gets too large. A pair of machines can handle a "hotspot". Assignment of primary inboxes should be divided evenly among local-area servers. The definition of registries and the distribution of registry replicas should also be handled with care.
As an early-stage distributed system, Grapevine inevitably has problems. The registry replicas are inconsistent while registration database changes propagate. Utility tools are difficult to provide. Problems also exist with deleted names and failure to remove duplicates. Resource location is based on the number of internet links instead of bandwidth, reliability and link congestion.
Grapevine is a pioneer in providing a combination of services and exhibiting a homogeneous distributed implementation. It distributes registries to achieve scalability and uses replicas to improve reliability and availability. The problems of Grapevine also reveal the difficulties of distributed systems, including scalability, consistency and reliability.

This paper analyses the operational aspects of running the Grapevine system and concentrates on the problems encountered after the design stage, once experience had been gained in running the system under substantial load.

Grapevine is a distributed system for message delivery and authentication. In this paper, the authors thoroughly describe the challenges faced in scaling the system to handle operational load from more users and in making the system more reliable. They provide several suggestions for improvements which can be incorporated when developing a new distributed system. I think the challenges described in running the Grapevine system are still relevant for current-generation distributed systems when dealing with scale and reliability. The authors designed the original Grapevine system to scale to an operational load of about 10,000 users, but as the load grew, it faced many bottlenecks in its processes. In this context, the authors describe the changes made to overcome some of the issues, which broadly fall into the categories of scale, replication, locality, and reliability.

The important contributions of this paper are the idea of increasing system capacity over a large range by adding more servers of fixed power, rather than more powerful servers, and the idea of locality, that is, deciding on which server to place a user's inbox based on the reliability of the interconnecting links. The idea of improving reliability through optimized replication of data is also good.

Though the paper describes several improvements for future systems, it fails to take network capacity into account when making decisions on scaling and adding new servers, and it does not use load balancing between servers to redirect traffic to another server and reduce latency in delivering messages. The authors also describe the various updates needed to fix the issues encountered, but do not seem to consider running different versions of the software concurrently on different servers, which has an impact on the availability of the distributed system.

Grapevine is a 17-server distributed, replicated email system that was developed more than 30 years ago. It provides mainly two services: message delivery to a recipient list, and registration. While Grapevine is not as sophisticated as modern distributed systems, it contributes a lot by considering the principles and challenges of designing distributed systems.

Takeaways:

1. Making the load on each server constant (and independent of the whole system load), and adding more fixed-power servers when more total capacity is needed.
For example, to deal with the increasing number of users, Grapevine divides the user lists into registries (of limited size) stored on the servers.

2. Grapevine improves on the reliability of a single-server system through redundancy: replicating information (e.g. user inboxes, registries) among multiple servers. First, this ensures that the whole message delivery system still works even if some of the servers are down; second, it reduces the access latency of the system. This idea is widely used in modern distributed systems, e.g. Hadoop.

3. The hierarchical naming scheme for registries borrows the idea of cache locality. It uses the division into registries to balance the load and to take network transport costs (such as geographical distance) into consideration.

Pitfalls:

1. While using multiple fixed-power servers was a cool idea in that period, it is not sophisticated enough to take into account network bandwidth and potential server heterogeneity (different settings on each server).

2. Inconsistency of the registries. When Grapevine replicates registries, it deals with an update to a registry by propagating it to the other replicas of that registry. It may suffer from inconsistent registries and from message loss (due to deletion of a recipient's inbox).

3. Security issues and defenses against potential attacks on the mailing system are never considered.

In this paper the authors describe their experience with, and what they learned from, the distributed system called Grapevine. This system was designed primarily as a distributed system for research purposes which also does useful work. It quickly grew into an essential system inside Xerox. This paper is written for the purpose of educating the software community about the design and challenges of a distributed system. Of particular note is the fact that the paper describes a mature system and hence provides experienced insights.

The system primarily provides mail and authentication services inside the enterprise. Many of the concepts described remain significant even today, 30 years after the paper was published. The concepts of redundancy and high availability, distribution lists as ACLs, partitioned naming schemes with indirection or domains, and a registration or authentication service are used in today’s LDAP or Active Directory environments. The authors describe various design trade-offs in the choice of primary and secondary servers and in geographical vs organisational distribution. Since this is a mature system, they have insightful experience in debugging, remote support and administration. They describe how limited resources, such as the lack of bigger disks and additional failover servers, affected their system. The authors even predict that spam and junk mail will be a problem in the future!

Some later design decisions that try to solve a problem could potentially backfire. Separating the message and registration servers could overload the network, since communication between the two, which until now was handled as local RPCs, would have to go over the network. The decision not to allow permanent storage of messages on the server is evidently a bad one in today’s world; the authors recognised this, though. In two different places the authors describe some component of the algorithm used to resolve distribution lists into individuals, but it is not clear whether this algorithm can detect and avoid circular memberships.
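
To make the circular-membership concern concrete, here is a minimal Python sketch of one way an expansion could guard against cycles. It is not the paper's algorithm; the `groups` mapping and the function `expand` are assumptions made for illustration.

```python
# Hypothetical sketch: flatten a nested distribution list into individuals
# while tolerating circular memberships. Not the algorithm from the paper.

def expand(name: str, groups: dict[str, list[str]]) -> set[str]:
    """Return the set of individuals reachable from `name`.

    `groups` maps a group name to its members, which may be individuals
    or other groups. Any name that is not a key is treated as an individual.
    """
    individuals: set[str] = set()
    visited: set[str] = set()
    stack = [name]
    while stack:
        current = stack.pop()
        if current in visited:
            continue  # already handled: breaks cycles and avoids repeat work
        visited.add(current)
        members = groups.get(current)
        if members is None:
            individuals.add(current)
        else:
            stack.extend(members)
    return individuals
```

With two lists that contain each other, for example `{"A": ["B", "u1"], "B": ["A", "u2"]}`, expanding `"A"` terminates and yields `u1` and `u2`. Returning a set also means a user who appears on several nested lists is delivered to once.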

The most impressive aspect of this work is that almost every concept and idea discussed has stood the test of time and is prevalent in modern systems.

In this paper the authors report on the technical experience they gained using Grapevine. They also report some interesting observations about human behavior and the user experience of the Grapevine system. Grapevine is a multicomputer distributed system on the Xerox research internet, created to provide facilities for the delivery of digital messages such as computer mail, for naming users, machines and services, for authenticating users and machines, and for locating services on the internet. Before detailing the then-current state of Grapevine and the lessons learned, the authors give a detailed description of the set of services provided and of the architecture of the Grapevine system, illustrating its novel facilities and implementation techniques.

The paper gives us great insight into the structure of an early distributed system. Given the limitations of hardware and technology at the time Grapevine was implemented, we are introduced to some clever workarounds and a novel network topology. Even though Grapevine was an early distributed system, we are introduced to the ideas and impacts of scalability, data consistency, availability, load balancing, physical locality constraints and trade-offs in cost. Many of the problems described in this paper, such as storing data in a distributed architecture, data replication, scaling, and user authentication, are very relevant even today with state-of-the-art hardware. The technique described for making the distribution lists and registries scalable, and hence more efficient in routing messages, is novel and interesting. The fact that they observed how humans behaved as part of a distributed system is also intriguing. The dilemma of deciding how much of the internal structure and mechanics end users should be exposed to is still not well resolved today. Given that the “internet” or “intranet” was still in its infancy, Grapevine was a remarkable contribution. Even though Grapevine was primarily used to deliver computer mail, experiments were made to use the distributed architecture for other purposes, such as controlling an integrated circuit manufacturing facility.

Of course, in hindsight we can see many limitations in the paper. Even though the philosophy of “reluctance to change the software” for fear of potential disruption introduced by bugs seems practical, in my opinion such decisions impede the progress of a pioneering scientific project like Grapevine. Given the then young nature of the underlying internet, multistep forwarding was a pretty nifty way to deliver messages, but it was still inefficient. Also, a lot of human intervention was required to “monitor” and maintain the Grapevine servers and connections. The authors realized that they did not have an effective way to scale with the number of messages being sent, as is evident from their prophetic statement “An analogous filtering mechanism will be required in the world of electronic message systems before they can become universal.” Also, there were evident data consistency issues with Grapevine.

Grapevine was a pioneering endeavor in the realm of distributed systems, and I feel the authors have done a wonderful job of painstakingly outlining the problems they faced and the solutions they came up with to circumvent them, so that future endeavors can benefit from the Grapevine experience.

With Grapevine, the authors wanted to create a system that was both rich in useful features (messaging, resource location, registration/authentication) and built on a homogeneous distributed design. They claim that, at the time, available systems were either interconnected heterogeneous systems or not as feature-rich.

Grapevine servers maintain and distribute registration information for users and groups (lists of users and/or groups), perform message delivery, and serve as temporary inboxes. Clients then communicate with servers to send or retrieve a user's messages, access desired files on a file server, etc. The forward-looking designers attempt to address the problems of partial failure with secondary and tertiary inboxes and appropriate physical placement of servers. Load is balanced by having separate replicated registries, distributing inboxes, and in some cases caching at clients.

The system grew to over 4400 users and an average of 35,000 received messages per day, during which time the authors were able to observe the quality of the system at scale. Overall the system seemed to perform close to their goal. One notable design choice that helped in this regard was geographic registries, which allowed replication of smaller data sets and reduced the need to scan through possibly unrelated entries. I also think that the idea proposed in this paper for splitting distribution lists (the example given was Tax.*) into per-registry lists was very good and would allow the small registries to continue serving their purpose.
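
A minimal Python sketch of the per-registry split, under the assumption that members have two-part names of the form individual.registry. The function `split_by_registry` and the sublist naming scheme are hypothetical; the paper's exact naming convention is not reproduced here.

```python
# Hypothetical sketch: split one large distribution list into one sublist
# per registry, so each sublist can live with the registry of its members.
from collections import defaultdict


def split_by_registry(list_name: str, members: list[str]) -> dict[str, list[str]]:
    """Group member names by registry into sublists named '<base>.<registry>'."""
    base = list_name.rpartition(".")[0] or list_name
    sublists: dict[str, list[str]] = defaultdict(list)
    for member in members:
        registry = member.rpartition(".")[2]
        sublists[f"{base}.{registry}"].append(member)
    return dict(sublists)
```

The top-level list then only needs to name the sublists, so a membership change touches one small per-registry list instead of one very large entry.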

One weakness that was mentioned, but somewhat brushed off in the paper, was the poor assumption that registration information would be static. Even relatively static registration or access control information should be expected to change very frequently as a system scales if users (a growing group) are allowed to change it. In fact, this is one of the few problems about which users actually reported a negative experience.

In my opinion, this paper stands today as a very valuable work for thinking about designing for scale. The problems will not be the same (we can and do keep email inboxes on servers now), but we still must consider server failure, unreliable networks, and providing users with a system whose complexity is not overwhelming, if visible at all. The ability of Grapevine to continue to appear as a healthy, productive service after a server failure was vital to its success.

Grapevine is a distributed system developed at the Xerox research center to aid communication between computers on the internet. It is a first step towards the development of modern-day mail systems. It basically has a registration service and a message service, each of which is a client of the other. The registration service helps resolve the location of servers, map names and control access, whereas the message service handles the transfer of messages. The paper mainly focuses on how to develop a scalable distributed system that performs well even as the load increases.

Techniques used:
1. To increase capacity, add more servers instead of using a more powerful machine.
- This is useful when growth in the number of users is handled by adding registries rather than by extending an existing registry.
- When the flaw in the partitioning of names was identified for groups with a large number of users, it was rectified by giving distribution lists a steering list, which scales well under large load.
2. The locations of messages and registries are decided along the lines of communication patterns.
- They also admit that placing registries along geographical lines was a bad decision, as it introduced more complexity.
- It was useful when choosing the location of inboxes on message servers.
3. To make it reliable, maintain replicas of registries.
4. Notify the owner of a distribution list about deleted users instead of finding the lists to which the user subscribed.


Pitfalls:

1. Duplicate messages are sent when a user is registered on more than one list and the mail is sent to both lists.
2. The remailing policy doesn't work as well as expected.
3. Grapevine's delivery algorithm selects an available server, which may be non-local, instead of waiting for a local one to become available.
4. It doesn't know the details of the internet and decides based only on the number of intranet hops, not on other factors such as bandwidth and congestion.
5. It also lacks the smarts to forward messages to the nearest registration server, which would help expand addresses faster.
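
The server-selection pitfalls can be illustrated with a minimal Python sketch. It assumes the choice is made purely on hop count among the servers that currently respond, as described above; the function `pick_server` and the server names are hypothetical.

```python
# Hypothetical sketch: choose the responding server with the fewest hops.
# Bandwidth and congestion are deliberately ignored, which is the weakness
# described in the list above.
from typing import Optional


def pick_server(hops: dict[str, int], available: set[str]) -> Optional[str]:
    """Return the available server with the smallest hop count, or None."""
    candidates = [server for server in hops if server in available]
    if not candidates:
        return None
    # Ties are broken by name so the result is deterministic.
    return min(candidates, key=lambda server: (hops[server], server))
```

If the local server is briefly unavailable, this picks a remote one immediately rather than waiting, even when the remote path is slow.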

Though Grapevine started off with many planned features, it couldn't implement everything due to "engineering problems", as the authors point out in the paper. Still, this makes for a good effort at a distributed, reliable messaging system for its time.

This paper gives an overview of a distributed implementation of a message system - Grapevine - and discusses the strengths of the original system design along with some pitfalls revealed during several years of experience with its operation and evolution.

The objective of the Grapevine design was the ability to increase system capacity over a large range by adding more servers of fixed power, rather than by using more powerful servers. The work needs to be spread across all the servers involved. To achieve scalability, the workload and resource consumption on every single server should stay constant regardless of the scale of the problem (number of users and messages). The message system is composed of two services: the message service and the registration service. To spread the message service functionality across all the servers, complex algorithms are defined on every server so that they collaborate toward the same goal. Every server needs to have a backup in order to achieve reliability. Meanwhile, the registration service is actually a distributed database system, which spreads the information for the whole system across the individual servers and replicates that information to achieve reliability. As the paper points out, hierarchy and indirection are ways to improve scalability, e.g. the solution to the issue with large distribution lists.

On deciding the system configuration, the paper discusses the importance of keeping the servers relatively independent, which is essential to the scalability of a distributed system. The authors also discuss some factors in choosing the configuration, e.g. where to put registry replicas. These are specific to Grapevine, but such examples show the challenges of designing a good distributed system. The authors also explain the difficulty of designing a distributed system with interfaces that are transparent to the user. To achieve efficiency and scalability, Grapevine introduces extra latency and temporary inconsistency into the system; in some scenarios the user may notice duplicated messages. The authors assumed that end users would not need an interface to the internal state of the system; experience proved that this is not true in practice. The authors also admit that the state of a distributed, replicated system is much harder to determine and to describe to a user than the state of a unitary one. The evolution of Grapevine also shows the importance of reducing the amount of information exchanged between servers; caching and optimizing the information exchange are solutions. An interesting section of the paper is about the operation of a dispersed system, which introduces the unique challenges of distributed systems: they are hard to deploy and debug. Logging and an application-specific approach (the dead letter facility) are described as ways to address the problem. To achieve high reliability, a distributed system needs to sacrifice some efficiency, because using redundancy to achieve reliability requires the system to have spare resources in normal operation.

The contribution of this paper is its exhaustive discussion of the challenges in designing a distributed system, backed by first-hand experience. Since most of the solutions and ideas in the paper come from direct experience, they are very useful for real systems.

Grapevine is a distributed system that works as a backend for a mail client, much like a modern-day Gmail. It’s heterogeneous in that it is made up of three services: a message service for delivery, a registration service for user and group controls, and a briefly mentioned file service for archiving. This paper is about the operational experience of designing such a system and ideas for possible future optimizations.

This system is important for its place in history and its use of clever design ideas which show up in current-day systems. User registries were replicated in a hierarchical fashion such as one would see with sharding. The reason for replication was two-fold: for maintaining a backup and for balancing load. The entire system was designed to scale. For example, the registration service only incurred an overhead storage cost that increases at a constant (not linear) rate to the number of servers added (except for the insignificant configuration storage cost). Grapevine was designed to be an abstraction. To the average user, the system could be interacted with like a simple single service. There were also interesting features that show up in later protocols like LDAP, POP, and IMAP.

Several optimizations and features were proposed but not implemented, due to fears of introducing a bug that would result in lost availability. They should have implemented a more robust system with better tests, canary deploys, and more. Their logging feature worked terrifically; however, they had no way of doing simple analytics and statistics on the live system to diagnose its current health. Analytics should have been considered in the original design for this purpose. Replication was used to prevent data loss; however, the data replicated was actually only the metadata about the messages. This seems to be a big oversight, as message content is important and more failures would occur as the service scaled.

The paper is a summary of the authors’ operational experiences with Grapevine, an early distributed, replicated system. The Grapevine system was designed with goals in mind that are still relevant challenges in the design of similar systems today - scalability through the use of large amounts of commodity hardware, tolerance of unreliable networks and node failure, a distributed namespace, and finally replication and propagation of updates to achieve consistency. The implementation of the distributed messaging system is analogous to what we now see as email on a cloud-based system such as Exchange on Office 365, and the registry naming mechanism is like a wireframe of today’s Active Directory services, which gives it very interesting context as a look back at where those modern-day technologies evolved from.

The paper does a really good job of outlining some of the challenges in a deployed engineering system. Their approach to defining a distributed, replicated registry is a pretty neat way of creating a global namespace that the messaging system could make use of. Their approach to scalability by not growing the registries beyond a certain size, and instead growing the number of registries seems to be one that worked pretty well for them given the size of the system that they were dealing with. The authors also seem to have put a lot of thought into placement of registries and their replicas, and their impact on system performance.

Having said this, the one thing that stands out throughout the paper is the lack of numbers. For the size of the system that they were targeting, very little infrastructure seems to have been built into it to perform any real time tracking of the system apart from some rudimentary logging. This is something that we see baked into modern systems in a big way to detect and resolve failures and performance bottlenecks as they happen. At the risk of sounding harsh, most of the flaws being described are very reactionary and seem to be ones detected after something broke.

Also, the authors have acknowledged that Grapevine has limited understanding of its underlying network, which limits its potential in performing better resource location decisions and limits performance. But the biggest flaw that I see in this system is the amount of emphasis that is laid on the administrators of the system in getting the configuration right. A poorly configured system could see performance degrade rapidly, and for an administrator to have to visualize what is going on in a large complex system which doesn’t have any metrics to guide them is a pretty steep ask of even the best engineers. The fragility of the system comes to the fore where even the creators of the system are unwilling to make improvements that they have identified for fear of breaking what’s running.

But most of the negatives that have been stated are in hindsight and in comparison to systems that exist today that have several years of learning that came from designing systems such as Grapevine. As an initial effort, given the limited hardware resources and the small scale at which the internet used to operate, Grapevine is definitely a system that must have played a key role in understanding challenges and defining approaches in building large scale distributed systems.

Grapevine is a distributed, replicated system. This paper focuses on the operational experience gained after using Grapevine, specifically as a mail service, for several years. The authors discuss some performance improvements and hope this kind of experience will be helpful in the design of new systems.

As a distributed system developed in the 1980s, it inevitably has several drawbacks when viewed from today. It nevertheless made great contributions to the design of later systems, and many of its concepts are still used today. For example, replication is a good way to reduce access latency and make the system more fault-tolerant. Grapevine adopted replication in many places: each user has inboxes on at least two message servers, each registry is replicated on several different registration servers, and so on. However, multiple replicas cause a critical consistency problem. When discussing the registration database, the authors mention the mechanism for updating the different replicas, but say nothing about whether differing update latencies could make the registration data inconsistent, or whether there was any safeguard to deal with or prevent this problem. Another good idea is division by location. Most of the time the data cannot fit in memory, or even on a single disk, so division helps solve this problem. The remote debugging tools are also a great idea: they let system experts diagnose the whole system from any remote terminal when big issues arise.

However, some obvious shortcomings did exist in this prototype system. The first is the consistency problem mentioned in the last paragraph. Another is security. As the paper describes, Grapevine had a file access control scheme integrated with the registry services. This provides security to some extent, but not enough. Can the system resist DoS and man-in-the-middle attacks? Could cached credentials be misused? Although it had a lot of security issues, I think this makes sense, since at that time performance was still the first consideration.

In the end, there seems to be an interesting design pattern: like Grapevine, many systems, even modern ones, are designed with a set of assumptions or a specific design purpose, covering user behavior, system capacity and so on. But later, when the system is used in practice, much of the observed behavior is unexpected. Yet while the system is being adjusted to the load, many great ideas emerge.

The paper reports on the distributed system aspects of Xerox’s prototype system called Grapevine. The goal of the paper is to highlight the various factors a distributed systems engineer should consider when building a reliable and scalable system.

Positives:
1. Grapevine offers a variety of services (naming, access control, etc.) beyond the typical messaging service. Distributed systems in those days typically offered only email services.
2. The system is a very apt precursor to today’s email systems. Most of the design decisions made in Grapevine still exist in today’s mail delivery systems.
3. The system provides reliability by replicating information across servers. This ensures that the metadata needed to communicate is always available, even if one of the replicas holding it goes down.
4. This was one of the earliest systems to introduce a hierarchical naming scheme by partitioning the database into logical registries. It is also a sound design decision that aided scalability. When new users were added, a registry was not expanded; rather, new registries were created. This ensured that servers were not overloaded with registration data as the number of users increased.
5. Steering lists improved performance when messages needed to be delivered to a large distribution list. Since a large distribution list would now be made up of multiple steering lists, multiple servers are involved in delivering messages to the end recipients, rather than a single server handling the load of delivering to all the recipients.

Drawbacks:
1. Reliability was compromised to save a minor cost. Message bodies were not replicated, to save disk space; this means that if a message gets corrupted, the receiver does not receive the correct message unless the sender resends it.
2. Update propagation through messages was a drawback. It causes an inconsistent view of the system for some amount of time, depending on network latency.
3. The system has no knowledge of the network topology, leading to inefficient delivery routes. Choosing the first available nearest server does not necessarily result in a shorter delivery time. A local server that becomes available a little later could deliver the message much faster.
4. Crucial configuration requires manual intervention, which can be error-prone [Registrar, Operator, System experts].
5. The system is not smart enough to avoid sending duplicate messages to a user. [A possible scenario is a user who is part of two distribution lists receiving the same mail when it is sent to both lists.]
6. Scalability can still be an issue when more users are added, as this increases the size of distribution lists and the number of registries.

This paper gives a brief summary of the Grapevine distributed system and then gives insight into the various challenges and trade-offs faced during the implementation of the system. It also proposes solutions (some of which were implemented) to those problems. It touches on many of the design issues that one needs to take care of when designing a distributed system.

The Grapevine distributed system consists of Grapevine servers and clients connected to the network. There are two types of Grapevine servers, message servers and registration servers, and together they provide various services such as message delivery, naming, authentication, access control, etc. Clients use the GrapevineUserPackage to transparently connect to the nearest Grapevine server to request services.

The main feature of Grapevine is scalability. More servers can easily be added to handle an increasing number of users. The distribution and replication of Grapevine servers are configured according to the usage pattern and the topology of the network. For example, for the mail service, the primary inbox of a recipient is on the server closest to the recipient's workstation. The registration database is divided into registries for scalability. Each registry entry corresponds to an individual or a distribution list. One nice feature of Grapevine is that it provides monitoring information, logs and other tools (viticulturist) to operators and experts to help in case of failures.

Some major problems with the Grapevine system were:
1. As the number of users grows, the number of users in the distribution lists increases. This makes message delivery to those distribution lists take a long time and also delays other messages in the queue. Adding and deleting users from a distribution list also becomes very slow.
2. Grapevine cannot handle frequent changes to the registration database efficiently. The designers had to change the message format for one specific kind of update, adding or removing a user from a group, because such update messages occurred more frequently.
3. There were many consistency problems due to delays in registration database updates. For example, when a group owner deletes an individual from a group and then sends a message to that group, the message may still be sent to the deleted user because of the delay in propagating the registry update.
4. One major problem was the remailing of messages to a new inbox when an old inbox is deleted. This generates a lot of load on the servers. The authors propose remailing only when the size is below a certain threshold.
5. Nested distribution lists caused many delays in the system. The authors tried to make this efficient by caching flattened versions of the lists.
6. There was no facility in Grapevine to restrict access to server mailboxes. Also, security issues were not handled.
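
The stale-delivery scenario in item 3 can be reproduced with a minimal Python sketch. It assumes each replica applies incoming updates some time after they are issued; the `Replica` class and `remove_member` function are hypothetical and only model the delay, not Grapevine's actual propagation mechanism.

```python
# Hypothetical sketch: a removal is applied at once on one replica but only
# queued on the others, so a stale replica can still expand the old list.
from collections import deque


class Replica:
    def __init__(self, members):
        self.members = set(members)
        self.pending = deque()  # updates received but not yet applied

    def receive(self, op, member):
        self.pending.append((op, member))

    def apply_pending(self):
        while self.pending:
            op, member = self.pending.popleft()
            if op == "remove":
                self.members.discard(member)
            elif op == "add":
                self.members.add(member)


def remove_member(primary, others, member):
    """Apply a removal locally; other replicas only get it queued."""
    primary.members.discard(member)
    for replica in others:
        replica.receive("remove", member)
```

Until `apply_pending` runs on the second replica, a message expanded there would still reach the removed user; once it runs, both replicas agree again.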

Although this is a very old paper, its concepts are still relevant in today's world, and these challenges are still faced in today's distributed systems even though the technology has advanced so much.

Grapevine was an early distributed system that was set up to provide a variety of services (such as message delivery and access control) to computers hooked up to the internet. The main point of the paper (and why it is still relevant) is to show what kind of issues arose when they started scaling up to more users and how they dealt with it. Their goal was to be able to scale the service up gracefully to be able to handle 10k users; while a paltry number compared to today, what they learned was very informative. What they found was that as more and more users got onto the system, problems popped up that they didn’t expect. For instance, distribution lists were becoming a bottleneck because people would tend to congregate on lists that were general and held common interest, such as “taxes”. Another problem they had to deal with was how to ensure reliability of user data. While in some cases, they felt that some instances of failure were too rare to warrant the extra engineering (e.g. if they lost the content of a message they would put an equivalent to “We’re sorry” as the message instead), they did put a lot of thought on how they should handle other facets of reliability and, furthermore, where in the system they ought to place replication. For instance, obviously you should place the primary inbox as geographically close to the users as possible, but what considerations should be made for the secondary inbox? If you put it at the next nearest geographic location, you might overload that server in the event that your primary server fails.
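
The secondary-inbox concern can be made concrete with a minimal Python sketch that counts how many users each server ends up serving after a failure. The function `load_after_failure`, the server names and the user counts are assumptions made for illustration, and the sketch assumes a user's secondary inbox is never on the same server as the primary.

```python
# Hypothetical sketch: compare the load on surviving servers for two
# different secondary-inbox placements when one server fails.
from collections import Counter


def load_after_failure(primary: dict[str, str],
                       secondary: dict[str, str],
                       failed: str) -> Counter:
    """Count users per server, falling back to the secondary inbox
    for users whose primary server is the failed one."""
    load: Counter = Counter()
    for user, server in primary.items():
        load[secondary[user] if server == failed else server] += 1
    return load
```

With four users on server A and two each on B and C, sending all of A's users to B on failure leaves B with six users and C with two, while splitting them between B and C leaves four on each.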

The great aspect of this paper is that it is a very practical discussion of the issues that can arise when building and maintaining a distributed system. Problems with regard to network latency and bandwidth, load balancing, and reliability are all problems that are still hard even as we move to systems that service billions of requests. They really hit the nail on the head with the idea that interacting with the system from the user’s perspective should be like interacting with a standalone system. They also had a nice setup for maintaining the overall system, with automatic log keeping and interfaces for system experts to fix a particular server without having to physically go to a location.

One of the downsides of this paper is that the authors seemed not to address some problems, or made some faulty assumptions. For instance, throughout the paper they talked about the “network” as though there wasn’t much they could do about it. Yes, they described how they shortened updates to database entries so that only the changes are sent rather than the entire entry, and how they distributed servers so that they are close to users, but they seemed to let a lot of network problems stand and left it to the user to deal with them (e.g. by retrying on their own). This kind of assumption would never fly today. They also seemed to let the system stagnate as a result of commercializing it for Xerox. Many of the improvements described in the paper were never implemented because the engineers were becoming unfamiliar with the system, so they only made changes that were low risk and high impact. However, that does bring up a good point: even for a fairly low-key system, it is hard to make changes without intricate knowledge of its inner workings.

Summary:

The paper presents Grapevine, a distributed system mainly aimed at message delivery, which is performed by the message service component. To provide naming, another component called the registration service is used; this component also provides authentication, access control (for the file server) and resource location (to locate the nearest message server). Registration data is maintained in a database containing entries for both individual users and groups (such as distribution lists). Both messages and the registration database are replicated: messages are stored in two inboxes (primary and secondary), and registries are replicated in their entirety. Registrars and users are permitted to create, update and delete registry entries independently of each other. Message delivery for a large distribution list is handled at two levels: sublists form the first level of expansion, and the sublists are later expanded into individual users. The increase in the number of messages due to the large number of users is handled by filtering mechanisms. Increased network congestion is handled by multistep forwarding, which avoids the unreliable links. The significance of inbox placement is discussed for both kinds of server: the primary inbox is placed close to the user in the case of a message server and within the registration server itself; secondary inboxes are placed strategically at the other end of unreliable links. Crashes encountered during the operation of the system are also discussed, along with recovery mechanisms.
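
The two-level expansion of a distribution list can be sketched roughly as follows. This is an illustrative sketch only, not Grapevine's actual code: the group table, the names in it and the `expand` helper are all hypothetical, and the lists are assumed to contain no cycles.

```python
# Hypothetical registration data: group name -> member names.
# Every name here is made up for illustration.
GROUPS = {
    "taxes.lists": ["taxes-west.lists", "taxes-east.lists"],
    "taxes-west.lists": ["alice.west", "bob.west"],
    "taxes-east.lists": ["carol.east", "alice.west"],
}


def expand(name, groups):
    """Expand a distribution list into individual recipients.

    Assumes the lists are acyclic; a list that contained itself
    would recurse forever in this sketch.
    """
    recipients = []
    for member in groups[name]:
        if member in groups:
            # First level: the member is itself a sublist, so expand it.
            recipients.extend(expand(member, groups))
        else:
            # Second level: the member is an individual user.
            recipients.append(member)
    return recipients


print(expand("taxes.lists", GROUPS))
```

Note that the sketch makes no attempt to remove duplicates, so a user who appears in two sublists (`alice.west` here) shows up twice in the result.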

Takeaways:

  • Scaling issues considered are highly relevant to present day systems:
    1. Load on the server being independent of the total system load.
    2. Number of registry replicas being independent of number of servers or users.
  • Sharing of messages among all recipient inboxes on a single message server.
  • Addition of new servers based on network reliability and load of existing servers.
  • Hierarchy used in the RNames enables easier management and supports many users.

Drawbacks:

  • The initial system design does not consider many trivial scalability issues (like a registration update containing just the changed member instead of the entire updated list) or future hardware advancements.
  • Consistency is not handled thoroughly - message delivery loss due to inconsistency in registration data updates is not handled; this might be unacceptable in current day systems.
  • Resource location was done only based on the number of links, which might be an inaccurate estimate as the network capacity is more important than the number of hops.
  • Security issues are not considered/handled making the system vulnerable to attacks.
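
The resource location drawback above (hop count versus network capacity) can be illustrated with a small sketch. The `Route` type, the bandwidth figures and both selection functions are assumptions made for illustration; the paper does not describe such an interface.

```python
from dataclasses import dataclass


@dataclass
class Route:
    server: str
    hops: int             # number of links between client and server
    bottleneck_kbps: int  # slowest link on the path (hypothetical figure)


def nearest_by_hops(routes):
    """Pick the server with the fewest links, ignoring link capacity."""
    return min(routes, key=lambda r: r.hops)


def nearest_by_transfer_time(routes, message_kbits):
    """Pick the server with the lowest estimated transfer time."""
    return min(routes, key=lambda r: message_kbits / r.bottleneck_kbps)


# Hypothetical network: one slow link to a close server, fast links to a far one.
routes = [
    Route("close-but-slow", hops=1, bottleneck_kbps=10),
    Route("far-but-fast", hops=3, bottleneck_kbps=3000),
]

print(nearest_by_hops(routes).server)
print(nearest_by_transfer_time(routes, message_kbits=80).server)
```

In this made-up network the hop-count rule picks the server behind the slow link, while the capacity-aware estimate prefers the server that is three hops away.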

Grapevine is an example of an early distributed system that implemented many ideas that are present today: load balancing, e-mail, naming, authentication, resource location and access control services.

However, it had many shortcomings compared to modern systems. The design of the system was similar to the POP protocol, requiring deletion of messages on the server after being downloaded to the client, and it had been developed keeping sequential access to e-mail in mind. Not much thought was given to accessing mail via terminals, forcing the admins to limit the number of concurrent terminal users. An operator was required at each Grapevine server, so the number of people had to scale with the size of the system. Even system experts were required to manually repair corrupt disk structures. Message bodies weren't replicated, so broken messages were replaced with an apology. The servers could deadlock upon overload instead of gracefully degrading like modern systems. The decision to make geographical registries proved to be unsatisfactory when compared with organizational registries. Having Grapevine store the access control information for files introduced a performance bottleneck (that was partially mitigated) in a previously simpler system. The system also didn't take advantage of the network topology in many cases, where a more efficient delivery mechanism could have been realized. Sometimes duplicate messages were delivered due to the way the distribution lists were expanded. The time taken for the system to converge to a consistent state could also be highly variable depending on the state of the system. Towards the end of the project, the resulting system was complex enough that changes were made minimally and with a lot of scrutiny.

There were many good ideas that were realized early on that are present today. Replication was done. The primary and secondary mailboxes were placed on different systems for load distribution and to guard against single points of failure. The hierarchical registry convention is similar to LDAP. The idea of caching authentication credentials to mitigate performance issues is found in the system. Network hops were taken into account when choosing the closest server. Distribution lists are similar to e-mail groups used today.

This paper discusses the challenges faced when implementing Grapevine, a scalable and reliable distributed system that was used as a messaging service. I think that the paper is a very relevant case study because designers of distributed systems today still face most of these problems, even though we are equipped with much better infrastructure. So, there are a lot of lessons that this paper teaches us.
The designers of Grapevine did not come up with a very scalable model in the beginning. So, when more users and messages entered the system, the flaws in their design started to show, and they had to change things one by one.

The good aspects:
I like the fact that Xerox came up with such a nice, reasonably scalable idea back in those days when resources were very limited.
1. The idea of introducing hierarchical distribution lists in registries was a nice trick. Also, making the registries locality aware was a very good move in improving performance.
2. I like the idea of actually introducing registration and message services as two separate services.
3. I also feel that making the load on a server constant was a nice constraint to impose. This way, when the load became huge, meaning the number of users/messages increased, they just ended up adding more servers which is very simple and an easy solution.

Drawbacks:
1. The idea of constraining the amount of load on a server to a constant and adding more servers when the number of users increased seems simple and easy, but it doesn’t look like the network overload is being handled. This could definitely affect the performance of the system badly especially if the number of users is huge, as is the situation these days.
2. I think their resource location algorithm could have been better in also considering the bandwidth and the network congestion and not just the number of links between servers.
3. The paper does not provide a solution to maintaining consistency of data. I feel that especially in a message service system such as Grapevine, consistency should be a high priority.
4. I also feel that having the message and the registration services on the same servers was a bad idea. The paper does not mention virtualization of any sort, so it could affect performance when compared to having them on separate dedicated servers.
5. It also appears from the paper that security and authentication issues need to be addressed better.

This paper is a summary and discussion of the performance findings of Grapevine, a distributed system used for message delivery, naming, authentication, resource location, and access control. The paper discusses in detail the positive aspects, as well as the shortcomings and some solutions, of Grapevine after a few years of operational experience.
The original problem the authors were attempting to solve was that of building a distributed system that offers a message service and a registration service. They set a goal of building the system in such a way that it would scale to handle the load generated by about 10,000 users. However, as the system grew in operation it hit many bottlenecks, some expected and some not. Therefore, in this paper the authors attempt to address and solve the issues that created these bottlenecks in order to reach their original goal of handling the load generated by 10,000 users. Although the authors address specific issues, the problems they attempt to solve fall within the general categories of scale, system configuration, replication policies, load balancing, locality, and reliability.
The important contributions of the paper can be synthesized generally as well. Although the paper spends the majority of its time discussing issues specific to Grapevine, it offers a unique perspective in that the issues are a reflection on a mature operational system. Further, the authors openly discuss the shortcomings right alongside Grapevine’s successes. This is an invaluable insight that is lacking from most academic literature, as papers generally downplay the negative aspects of their contribution. In this way, the main contribution of the paper is that it provides a road map of issues that anyone designing and implementing a distributed system will inevitably face as the system scales. So while the specific issues discussed, being inherently unique to Grapevine, are not applicable to systems today, the general pitfalls experienced by the authors are helpful as guiding lights.

This paper analyses the experience of running the Grapevine system, a distributed message delivery system, under substantial load and proposes changes where the authors noticed bottlenecks.
The database of users and servers is not centralized but divided into registries, and various registration servers hold this information, thereby balancing the load. These servers are also replicated. The authors admit that consistency between these servers (while changes are propagating) is relaxed but acceptable in practice. The GrapevineUser package running on the client makes resource location (of the registry server) transparent to the user. However, Grapevine isn't completely transparent in some cases: when a recipient's name is present in two distribution lists, the recipient receives duplicate messages.
The authors state that scalability can be achieved by adding servers of fixed power rather than using more powerful servers. The message delivery algorithm that determines the inbox site for every recipient didn't scale well as distribution lists grew. So the authors proposed adding a level of indirection in which distribution lists are divided by registry, and these steering lists are forwarded with a copy of the message to a server whose registration server knows the registry. Another proposed improvement was propagating only the changes to the registry database rather than sending and updating entire distribution lists. The authors admit to making a mistake in dividing registries by geographical area rather than by organization, which would have eased administration (in some cases, such as distribution lists).
The system is definitely applicable to currently deployed real systems, since it combines DNS-like naming, authentication, access control and mail services in one distributed system. However, I believe that making the registration server centralized (like the metadata server in GFS) might make the system much simpler and more consistent. The division into registration server and mail server might be a great idea, but it adds to the delay in the system.
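
The difference between propagating a change and resending a whole distribution list can be sketched as below. The message shapes and function names are hypothetical and are not Grapevine's wire format; the sketch only shows that a delta stays small when a single member changes.

```python
def full_update(members):
    """Update that replaces the whole list: its size grows with the list."""
    return {"kind": "replace", "members": sorted(members)}


def delta_update(old, new):
    """Update that carries only the members added or removed."""
    return {
        "kind": "delta",
        "added": sorted(new - old),
        "removed": sorted(old - new),
    }


def apply_update(members, update):
    """Apply either kind of update to a replica's copy of the list."""
    if update["kind"] == "replace":
        return set(update["members"])
    return (members | set(update["added"])) - set(update["removed"])


# Hypothetical list with one member swapped out.
old = {"alice", "bob", "carol"}
new = {"alice", "bob", "dave"}

update = delta_update(old, new)
print(update)
print(apply_update(old, update) == new)
```

For a list with thousands of members, `full_update` would carry every name on each change, while `delta_update` carries only the one or two names that differ.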


Grapevine is a distributed system developed by Xerox Corporation for internal message delivery, authentication, naming, resource location and access control services. In this paper the authors review the operational experience of the system under substantial load over the last 3 years. Many users presumed Grapevine to be a giant mail server. That said, there were some "surprises" in its operation: for example, the implementation of mailing lists (called distribution lists) was not very efficient and occasionally had very high message delivery times. Breaking up the RNames into registries was a very smart move for scalability and made it independent of the number of users of the system. But the registries are replicated on several servers depending on usage, and updating a registry (adding or deleting users) created some glitches due to inconsistencies among the replicas.

Grapevine introduced the computing world to a new paradigm, distributed systems, and this paper analyses that system and comments on the upsides (good decisions) and the potential improvements needed to make it usable by up to 10,000 users. The authors discuss the scalability, load balancing, configuration and operational overhead of this dispersed system. These remain some of the key goals and challenges for distributed systems today.

The authors did not try to solve the issues with Grapevine in most cases. Also, the paper does not look at security-related problems in the system. For example, how will Grapevine react if a server starts to malfunction, and is the communication between servers secure?
