UW-Madison
Computer Sciences Dept.

Paper Write-ups for Sprite Migration

Student

At the OS level, the Sprite migration mechanism sought to improve transparency at the cost of all three other considerations: residual dependencies, performance, and complexity.

Specifically, the transfer of process state from one machine's kernel to another, the forwarding of syscalls between machines, and the maintenance of some "dual" process state on multiple machines each served to provide excellent transparency, but each incurred a performance cost, created residual dependencies, and required complex (and apparently very difficult to maintain) code in the kernel.

Whether these tradeoffs made sense or not should be primarily a question of the intended use and users. However, like the systems in Tanenbaum's survey, Sprite was clearly designed more as a research project than as a solution to any concrete user's problem. The OS was "distributed" only within a single Computer Science department, and its distributed computations consisted primarily of a single Computer Science application. The biggest benefit of the process migration mechanism was an impressive improvement in the performance of "make". If the goal had been to provide a good parallel "make", a much simpler solution might have done just as well.

Student

The Sprite designers opted for transparency and high performance at the cost of residual dependencies and significant complexity in the kernel. Whether the trade-offs were reasonable or not depends on the targeted environment, users and workloads. In general, the trade-offs seemed less desirable for a wider population of users wanting to run more than one or two applications but seemed okay for the more restricted university environment of Sprite OS and the specific application it targeted, namely pmake.

Generally speaking, in my opinion, they placed too much stress on achieving 100% transparency at the cost of other factors. For example, when a remote process forks in Sprite, the child process has the same home machine as the parent in accordance with transparency. This in turn distributes the process state across both machines, making the system less fault tolerant and also decreasing its performance. This seems an unreasonable trade-off in a general sense, as a process migration mechanism that is not fault tolerant and potentially much slower than single-host execution of programs is likely to be less successful in practice.

Also, it seems that while the authors made the kernel more complex and less maintainable in order to achieve transparency (e.g. changing the file system), they simultaneously sacrificed performance for a simpler implementation of the virtual memory transfer module (where the file server became a bottleneck). Thus, overall, the code broke easily, the file server saturated at 12 hosts doing pmake, and processes had dependencies on the host machine after migration.

On the other hand, the targeted environment of the Sprite system was a collection of workstations on a LAN (instead of a wider network), and its targeted users were, in particular, members of the Sprite project. It strove to give them lower compilation times when using pmake and, later, faster simulations. This was unlike the distributed systems of 1985, which had no specific workloads and users in mind while making design choices.

Student

The Sprite designers chose to make process migration very transparent at a cost of higher complexity, more residual dependencies, slightly worse performance, and less reliability. I believe slightly worse performance is a reasonable tradeoff for providing transparency. However, I think that the tradeoff between high transparency and less reliability because of more residual dependencies is a bit suspect. It seems that most users would be willing to deal with less transparency if it meant higher reliability of their processes. Transparency does nothing for a user if their process cannot complete because the system is not reliable enough.

The Sprite designers decided it was necessary to include some residual dependencies to achieve higher transparency. They did this at a cost of reliability, performance, and complexity. One example of a residual dependency is the forwarding of I/O data from the host machine to the remote machine running the process. It may have been unwise for the designers to decide to migrate processes which interact with I/O devices. Not moving them would remove one residual dependency and increase the performance of processes which use I/O data. However, because of the remote file system in use, a residual dependency on the file server is unavoidable.

The main performance loss occurs during the transfer of processes from machine to machine. In order to achieve high transparency they must copy the virtual memory, open file information, etc. I believe that the performance loss is acceptable because of the clever way they try to mitigate the penalty for migrating processes. They choose to send the process information to a backing store file, from which the newly started process on the remote machine fetches lazily, as the information is needed. There is a residual dependency on only the file server in this case.
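The lazy copy through a backing store file can be sketched as follows. This is a toy model under stated assumptions, not Sprite's code: `FileServer`, `migrate_out`, and `MigratedProcess` are hypothetical names, and pages are plain byte strings keyed by page number.

```python
class FileServer:
    """Toy stand-in for the shared file server holding backing store files."""

    def __init__(self):
        self.backing = {}   # (pid, page_no) -> bytes
        self.reads = 0      # number of pages fetched by target machines

    def write_page(self, pid, page_no, data):
        self.backing[(pid, page_no)] = data

    def read_page(self, pid, page_no):
        self.reads += 1
        return self.backing[(pid, page_no)]


def migrate_out(server, pid, dirty_pages):
    """Source side: flush dirty pages to the backing store, then forget them."""
    for page_no, data in dirty_pages.items():
        server.write_page(pid, page_no, data)
    dirty_pages.clear()   # nothing is retained on the source machine


class MigratedProcess:
    """Target side: pages are faulted in from the server on first touch."""

    def __init__(self, server, pid):
        self.server = server
        self.pid = pid
        self.resident = {}  # pages already fetched to this machine

    def read(self, page_no):
        if page_no not in self.resident:   # simulated page fault
            self.resident[page_no] = self.server.read_page(self.pid, page_no)
        return self.resident[page_no]


server = FileServer()
source_pages = {0: b"code", 1: b"heap", 2: b"stack"}
migrate_out(server, pid=42, dirty_pages=source_pages)

proc = MigratedProcess(server, pid=42)
first = proc.read(1)
second = proc.read(1)   # served from local memory, no second fetch
```

In this model the source keeps no pages after `migrate_out`, and only the pages the process actually touches cross the network a second time, which is why the remaining dependency is on the file server alone.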

In order to make the system highly transparent, a lot of complexity was added to the system that would have been unnecessary in a non-transparent system. Higher complexity seems like a bad tradeoff because of the extra problems that may arise from unforeseen consequences of design decisions, more bugs in the software, and an overall less reliable system as a result. This seems like an unreasonable tradeoff for higher transparency.

Student

The Sprite designers seem to have focused more on transparency and performance than on any of the other issues. To migrate processes transparently, Sprite forwards kernel calls home. Forwarding also occurs from the home machine to the current machine (signals). Transparent migration is consistent with the distributed system paradigm that existed in the late 80s: provide a single system image. Transparent migration increases complexity as well as residual dependencies. Even though it is impossible to eliminate residual dependencies completely, the Sprite designers have also retained them in other cases for the sake of transparency. One example is notifying the parent of process creation and termination. Residual dependencies don't always hurt performance. The authors even give a case where one improves performance: lazy copying over the network. The Sprite designers have made sure that processes don't leave residues on all the machines they ran on. A residual dependency of any form reduces reliability, which I think is more important than transparent migration. I think the focus must instead be on building systems that are less complex and hence more reliable. Transparency shouldn't be as much of an issue as reliability and complexity. Further, I think performance (network cost) isn't as much of an issue as it was then, given the advent of gigabit networks (LAN).

Student

The ultimate goal of the Sprite designers was to achieve complete transparency and high performance and to minimize residual dependencies and complexity of the system. However, in some cases trade-offs were made in the design decisions due to conflicting goals. Achieving transparency was the primary objective in Sprite. To achieve transparency from the point of view of process execution, most of the process state is transparently transferred from the source to target machine so as to recreate the same environment on the target machine for the migrated process to execute. This is true of both the virtual memory and open files of a running process. Additionally, Sprite employs eviction and maintains some replicated information on both the home and host machine (e.g. process control block) so that the user is unaware of migration and can still control migrated processes. Transparency is compromised in a few cases (e.g. certain special purpose kernel calls like gettimeofday) which is an acceptable trade-off. Sprite makes every attempt to keep the migration transparent to the user and the process, which is an important contribution of this system.

By transferring most of the process state to the target machine, instead of forwarding it from the home machine, Sprite gains significantly in terms of performance. Also, much of this state is transferred using lazy copying (e.g. dirty pages and open files are transferred to a file server and then copied to the target machine only when used), which makes migration fast. However, performance is compromised to reduce complexity of the system in some cases. File and access position caching are disabled due to migration if the file is shared, and all the file accesses must be forwarded to a file server. The paper states that this happens infrequently; however, if there are a large number of migrated processes that concurrently access a single file, this can become a significant bottleneck. The Sprite system also limits the complexity of process migration by simply preventing certain processes from migrating (e.g. processes sharing writable virtual memory or with memory mapped I/O).

Sprite minimizes residual dependencies as most process state is completely transferred (e.g. none of the pages accessed by a process have to be retained on a machine after the process has migrated), instead of relying on forwarding. However, Sprite maintains a copy of the PCB at the home machine, thereby creating residual dependencies on the home machine and relying on it for process creation and termination. Although this affects performance, it is a good trade-off to allow a single machine (i.e. the home machine) to control this, as it ensures correct semantics and avoids potential race conditions. On the other hand, this makes the home machine a single point of failure, which may be acceptable in this case since, technically, these processes would not have executed anyway (when the home machine crashes) in the absence of migration (unless migration serves the additional goal of fault-tolerance). Sprite also forwards signals (e.g. kill) to a migrated process from the home machine to the current host. This can definitely be improved (thereby getting faster response from a migrated process) by maintaining information on each machine about the location of different processes.

Student

This paper describes the Sprite process migration mechanism, including both the implementation of the mechanism itself and the conditions under which the Sprite system invokes it to trigger migration. The mechanism is a trade-off between four factors: transparency, residual dependencies, performance, and complexity. In the paper, the Sprite designers themselves claim to have emphasized transparency and performance. In order to maximize transparency and performance, they accepted some residual dependencies and added complexity, although the designers state that they attempted to minimize complexity by choosing the simplest implementation paths whenever possible.

Did the Sprite designers make the appropriate trade-offs? In my opinion, they did not adequately demonstrate in this paper that the appropriate trade-offs were made. The most fundamental flaw in their work was the lack of an application mix. The paper only presented explicit results for a parallel make, LaTeX, and some simple synthetic tests. Perhaps their trade-offs are appropriate considering the particulars of their environment and job mix (i.e. just parallel compilations), but it is unclear if the trade-offs chosen would be appropriate given a wider variety of applications or environments. For example, process migration performance may not be very important if the job mix contained longer-running processes, since the time of migration would be relatively small when amortized over a long runtime.
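The amortization argument can be made concrete with a little arithmetic. The sketch below is illustrative only: `migration_overhead` is a hypothetical helper, and the one-second migration cost and both runtimes are assumed figures, not measurements from the paper.

```python
def migration_overhead(migration_s, runtime_s):
    """Fraction of total elapsed time spent migrating rather than computing."""
    if migration_s < 0 or runtime_s <= 0:
        raise ValueError("migration_s must be >= 0 and runtime_s must be > 0")
    return migration_s / (migration_s + runtime_s)


ASSUMED_MIGRATION_S = 1.0   # assumed cost of one migration, for illustration

# A short job, such as a single compilation step (assumed 4 s of work).
short_job = migration_overhead(ASSUMED_MIGRATION_S, runtime_s=4.0)

# A long job, such as a simulation (assumed just under one hour of work).
long_job = migration_overhead(ASSUMED_MIGRATION_S, runtime_s=3599.0)
```

Under these assumed numbers the migration accounts for a fifth of the short job's elapsed time but well under a tenth of a percent of the long job's, which is the sense in which migration speed matters less for long-running work.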

I also believe they did not emphasize minimization of complexity enough in their quest for transparency. Evidence of over-complexity includes the fact that the authors mentioned the implementation was extremely fragile and, for nearly two years, regularly failed whenever other changes to the kernel were made. Also, the authors described a highly complex system to preserve transparency in the event that the access position for a file is shared between two or more processes. In my opinion, this is an example where transparency is not very important, especially considering the complexity (and maybe performance?) costs. The authors failed to mention a single real-world application that relies on this behavior, and in my own experience, practically every program that forks a new process quickly closes any inherited file descriptors -- or at least stops using those descriptors while the child lives.

In my opinion, the authors were too negative about residual dependencies. Several of the drawbacks of residual dependencies mentioned by the authors still existed in their implementation, even after adding complexity to avoid such dependencies. One advantage of residual dependencies is a natural form of scalability. In Sprite, the central file server quickly becomes a bottleneck as memory pages are being written during migration. On the other hand, the Condor system leaves a process behind on the home workstation to act as a file server for that process during migration. Although Condor adds this residual dependency, one result is a more scalable system, because in essence every home system added is equivalent to another file server. And in some instances, the Sprite authors overstated the performance penalty of residual dependencies, for instance in the discussion about the gettime() system call. With only a tiny hit on complexity, this system call could easily have been optimized to cache the time skew between the host and target system on the target system itself, thereby requiring only one call over the network back to the host system per migration instead of an RPC every time gettime() is invoked.
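The skew-caching idea for gettime() might look roughly like the sketch below. It is an illustration of the suggestion, not anything Sprite implemented: `MigratedClock` and the `rpc_home_time` callable are hypothetical, and the sketch ignores network latency during the one RPC and any clock drift between the two machines afterwards.

```python
import time


class MigratedClock:
    """Answers time queries locally after one round trip to the home machine."""

    def __init__(self, rpc_home_time, local_clock=time.time):
        self._rpc_home_time = rpc_home_time   # hypothetical RPC to the home machine
        self._local_clock = local_clock
        self._skew = None                     # home time minus local time
        self.rpc_count = 0

    def on_migrate(self):
        """Invalidate the cached skew; the process now runs on a new machine."""
        self._skew = None

    def gettime(self):
        if self._skew is None:                # first call since the last migration
            self.rpc_count += 1
            self._skew = self._rpc_home_time() - self._local_clock()
        return self._local_clock() + self._skew


# Deterministic fake clocks: home runs 100 s ahead of the target machine.
local_now = [1000.0]
clock = MigratedClock(
    rpc_home_time=lambda: local_now[0] + 100.0,
    local_clock=lambda: local_now[0],
)

t1 = clock.gettime()
local_now[0] += 5.0       # five seconds pass on the target machine
t2 = clock.gettime()      # answered from the cached skew, no RPC
```

Calling `on_migrate()` after each migration forces exactly one fresh RPC on the next `gettime()`, which matches the "one call per migration" cost described above.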

Finally, I feel some of the authors' other viewpoints were off target. For instance, the authors stated that one important motivation for emphasizing performance of their migration mechanism was so that an owner experiences minimal disruption when a process eviction is triggered upon their return to the workstation. But there are many other ways to minimize interactive disruption of the owner besides just evicting the process quickly; for example, perhaps the process could migrate off the machine at a slow low-priority trickle.

Student

The Sprite process migration mechanism makes appropriate decisions in most cases with respect to the trade-offs between the conflicting goals of transparency, residual dependencies, performance and complexity. The mechanism for virtual memory transfer, for instance, seemed justified since the backing storage for virtual memory was the network file system anyway, and since such a pre-existing mechanism could be used directly, it did make the implementation simpler. Since only dirty pages incurred overhead at migration time, the trade-off between performance and simplicity seems correct. The fact that such a decision also avoids residual dependencies at the remote machine (on an eviction event, since the source need not retain pages or later respond to paging requests) makes the choice a good one. Another decision, viz. to implement special server code for migrating files, seems to be a good choice, since it made caching at the remote machine possible and hence improved performance.

Care has been taken to ensure that migration/eviction of a process from a remote machine does not cause residual dependencies on that remote machine. This has implications related to residual dependencies, performance and complexity. On one hand, it ensures that returning users at the remote machine do not sense a performance loss due to residual dependencies. At the same time, it limits the number of machines involved with the process, a definite plus with respect to performance. The benefits of this policy seem to justify the choice. Transparency was also favoured over residual dependencies in deciding to permit some residual dependencies on the home machine. As seen from the results, the overhead owing to these was minimal and the choice was hence justified.

The migration policies used in Sprite primarily represent the trade-off between implementation complexity and the performance implications of these more complex implementations. Migration was still the exception rather than the rule even in the Sprite system. This kind of system usage also involves a trade-off between implementation complexity and complete transparency. The paper mentions that 'Users do not think of their workstations as shared'. However, this could also be one of the motivations for using a model similar to the processor pool model. The system would need to make decisions dynamically regarding when to migrate a process in order to achieve complete transparency. This has been sacrificed for simplicity. The technique of using a centralized approach for storing the idle-host database is also an example of a performance-complexity tradeoff. Though this has been justified by the scale of the system, increased frequency of load-averaging might cause a collapse of such a system.

Student

"...we emphasized transparency and performance, but accepted residual dependencies in some situations."

"In the case of eviction, there are no residual dependencies on the source after migration." In the case of a user returning to a computer, the transfer of a process also seems to go pretty fast, so the computer is quickly available for that user. This was one of the goals of the project, so I think this design was appropriate.

Some implementation choices added extra complexity (like using an intermediate file server to move the virtual memory), but they justify that by claiming that this adds to the overall performance. And since the system has been running stably for a while, the added complexity was apparently not that much of an issue.

They say that they considered a host idle if there was no mouse or keyboard activity for 5 minutes, but later changed it to 30 seconds. They claim that this has no noticeable impact on the users of those machines, but in my experience a normal user is idle for 30 seconds very often (when reading something, for example). I think the 30-second threshold may lead to too many process migrations (and therefore unnecessary slowdowns for the remote process), since a host thinks it is idle, gets a process, and then soon afterwards the user is done reading and the process has to migrate again.
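The concern about the shorter threshold can be illustrated with a small calculation over an input-event trace. This is a sketch under stated assumptions: `count_idle_periods` is a hypothetical helper, the trace is invented for illustration, and the model assumes every idle window attracts a remote process that is evicted at the next input event.

```python
def count_idle_periods(event_times, threshold_s):
    """Count gaps between consecutive input events that reach the idle threshold.

    Each such gap is a window in which the host would be offered to remote
    processes, and whose end (the next input event) would trigger an eviction.
    """
    return sum(
        1
        for earlier, later in zip(event_times, event_times[1:])
        if later - earlier >= threshold_s
    )


# Assumed trace: a user alternating between typing and reading (seconds).
trace = [0, 5, 50, 55, 120, 125, 500, 505]

evictions_30s = count_idle_periods(trace, threshold_s=30)
evictions_5min = count_idle_periods(trace, threshold_s=300)
```

On this invented trace the 30-second threshold declares the host idle three times, while the 5-minute threshold does so once, so the shorter threshold triples the number of potential evictions for the same user behavior.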

 