Student
At the OS level, the Sprite migration mechanism sought to improve
transparency at the cost of all three other considerations: residual
dependencies, performance, and complexity.
Specifically, the transfer of process state from one machine's kernel
to another, the forwarding of syscalls between machines, and the
maintenance of some "dual" process state on multiple machines each
served to provide excellent transparency, but each incurred a
performance cost, created residual dependencies, and required complex
(and apparently very difficult to maintain) code in the kernel.
Whether these tradeoffs made sense or not should be primarily a
question of the intended use and users. However, like the systems
in Tanenbaum's survey, Sprite was clearly designed more as a
research project than as a solution to any concrete user's problem.
The OS was "distributed" only within a single Computer Science
department, and its distributed computations consisted primarily of a
single Computer Science application. The biggest benefit of the
process migration mechanism was an impressive improvement in the
performance of "make". If the goal had been to provide a good
parallel "make", a much simpler solution might have done just as well.
Student
Sprite designers opted for transparency and high performance at the cost
of residual dependencies and significant complexity in the kernel.
Whether the trade-offs were reasonable or not depends on the targeted
environment, users and workloads. In general, the trade-offs seemed less
desirable for a wider population of users wanting to run more than one or
two applications but seemed okay for the more restricted university
environment of Sprite OS and the specific application it targeted, namely
pmake.
Generally speaking, in my opinion, they put too much emphasis on
achieving 100% transparency at the cost of other factors. For example,
when a remote process forks in Sprite, the child process has the same home
machine as the parent in accordance with transparency. This in turn makes
the process state distributed on both machines, making the system less
fault tolerant and also decreasing the performance of the system. This
seems an unreasonable trade-off in a general sense, as a process migration
mechanism that is not fault tolerant and potentially much slower than
single host-only execution of programs is likely to be less successful in
practice.
Also, it seems that while the authors accepted a more complex and less
maintainable kernel to achieve transparency (e.g., changing the
file system), they simultaneously sacrificed performance for a simpler
implementation of the virtual memory transfer module (where the file server
became a bottleneck).
Thus, overall, the code broke easily, the file server saturated at 12 hosts
doing pmake, and processes had dependencies on the host machine after
migration.
On the other hand, the targeted environment of the Sprite system was a
collection of workstations on a LAN (instead of a wider network) and in
particular, members of the Sprite project. It strove to give them lower
compilation times when using pmake and, later, faster simulations. This was
unlike the distributed systems of 1985 which had no specific workloads
and users in mind while making design choices.
Student
The Sprite designers chose to make process migration very transparent
at a cost of higher complexity, higher residual dependencies, slightly
worse performance, and less reliability. I believe slightly worse
performance is a reasonable tradeoff for providing
transparency. However, I think that the tradeoff between high
transparency and less reliability because of higher residual
dependencies is a bit suspect. It seems that most users would be
willing to deal with less transparency if it meant higher reliability
of their processes. Transparency does nothing for a user if their
process cannot complete because the system is not reliable enough.
The Sprite designers decided it was necessary to include some residual
dependencies to achieve higher transparency. They do this at a cost of
reliability, performance, and complexity. One example of a residual
dependency is that of forwarding I/O data from the host machine to the
remote machine running the process. It may have been unwise for the
designers to decide to migrate processes which interact with I/O
devices. Not moving them saves them one residual dependency and
increases performance of processes which use I/O data. However,
because of the remote file system in use, residual dependencies on the
file server are unavoidable.
The main performance loss occurs during the transfer of processes from
machine to machine. In order to achieve high transparency they must
copy the virtual memory, open file information, etc. I believe that
the performance loss is acceptable because of the clever way they try
to mitigate the penalty for migrating processes. They choose to send
the process information into a backing store file, which the newly
started process on the remote machine will fetch from in a lazy copy
sort of way. There is a residual dependence on only the file server in
this case.
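As a rough illustration of the lazy copy described above, here is a
minimal sketch. The names BackingStore, migrate_out, and
MigratedAddressSpace are hypothetical and are not taken from Sprite; a
plain dictionary stands in for the backing file on the file server.

```python
class BackingStore:
    """Stand-in for a backing file on the file server (an assumption)."""

    def __init__(self):
        self.pages = {}
        self.reads = 0  # counts page fetches, to show how lazy the copy is

    def write(self, page_no, data):
        self.pages[page_no] = data

    def read(self, page_no):
        self.reads += 1
        return self.pages[page_no]


def migrate_out(dirty_pages, store):
    """Source side: flush dirty pages to the file server.

    After this step the source machine holds nothing the process needs.
    """
    for page_no, data in dirty_pages.items():
        store.write(page_no, data)


class MigratedAddressSpace:
    """Target side: starts with no resident pages, fetches on first touch."""

    def __init__(self, store):
        self._store = store
        self._resident = {}

    def load(self, page_no):
        # A "page fault": fetch the page from the file server only when
        # the process actually touches it.
        if page_no not in self._resident:
            self._resident[page_no] = self._store.read(page_no)
        return self._resident[page_no]
```

In this sketch a page that the migrated process never touches is never
fetched, and the only machine the target depends on is the file server.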
In order to make the system highly transparent, a lot of complexity
was added to the system that would have been unnecessary in a
non-transparent system. Higher complexity seems like a bad tradeoff
because of the extra problems that may arise from unforeseen
consequences of design decisions, more bugs in the software, and an
overall less reliable system as a result. This seems like an
unreasonable tradeoff for higher transparency.
Student
Sprite designers seem to have focused more on transparency and
performance than on any of the other issues. To transparently migrate
processes, Sprite forwards kernel calls home. Forwarding also occurs from
the home machine to the current machine (signals).
Transparent migration is consistent with the distributed system paradigm
that existed in the late 80s - provide a single system image. Transparent
migration increases complexity as well as residual dependency. Even though
it's impossible to eliminate residual dependency completely, Sprite
designers have retained it in other cases for transparency. One example is
notifying the parent of process creation and termination. Residual
dependency doesn't always affect performance. The authors even give a case
where it improves performance - lazy copying over the network. Sprite
designers have made sure that processes don't leave residues in all the
machines they ran on. Residual dependency of any form reduces reliability,
which I think is more important than transparent migration.
I think the focus must instead be on building systems that are less complex and
hence more reliable. Transparency shouldn't be as much an issue as
reliability and complexity. Further, I think performance (network cost)
with the advent of gigabit networks (LAN) isn't as much an issue as it was
then.
Student
The ultimate goal of the Sprite designers was to
achieve complete transparency and high performance
and to minimize residual dependencies and
complexity of the system. However, in some cases
trade-offs were made in the design decisions due
to conflicting goals. Achieving transparency was
the primary objective in Sprite. To achieve
transparency from the point of view of process
execution, most of the process state is
transparently transferred from the source to
target machine so as to recreate the same
environment on the target machine for the migrated
process to execute. This is true of both the
virtual memory and open files of a running
process. Additionally, Sprite employs eviction and
maintains some replicated information on both the
home and host machine (e.g. process control block)
so that the user is unaware of migration and can
still control migrated processes. Transparency is
compromised in a few cases (e.g. certain special
purpose kernel calls like gettimeofday) which is
an acceptable trade-off. Sprite makes every
attempt to keep the migration transparent to the
user and the process, which is an important
contribution of this system.
By transferring most of the process state to the
target machine, instead of forwarding it from the
home machine, Sprite gains significantly in terms
of performance. Also much of this state is
transferred using lazy copying (e.g. dirty pages
and open files are transferred to a file server
and then copied to the target machine only when
used), which makes migration fast. However,
performance is compromised to reduce complexity of
the system in some cases. File and access position
caching are disabled due to migration if the file
is shared and all the file accesses must be
forwarded to a file server. The paper states that
this happens infrequently; however, if there are a
large number of migrated processes that
concurrently access a single file, this can become
a significant bottleneck. The Sprite system also
limits complexity of process migration by simply
disallowing certain processes to migrate (e.g.
processes sharing writable virtual memory or with
memory mapped I/O).
Sprite minimizes residual dependencies as most
process state is completely transferred (e.g. none
of the pages accessed by a process have to be
retained on a machine after the process has
migrated), instead of relying on forwarding.
However, Sprite maintains a copy of the PCB at the home
machine, thereby creating residual dependencies on
the home machine and relying on it for process
creation and termination. Although this affects
performance, it is a good trade-off to allow a
single machine (i.e. home machine) to control this
as it ensures correct semantics and avoids
potential race conditions. On the other hand, this
makes the home machine a single point of failure,
which may be acceptable in this case as
technically these processes would not have
executed anyway (when the home machine crashes)
in the absence of migration (unless migration
serves the additional goal of fault-tolerance).
Sprite also forwards signals (e.g. kill) to a
migrated process from the home machine to the
current host. This can definitely be improved
(thereby getting faster response from a migrated
process) by maintaining information on each
machine about the location of different processes.
Student
This paper describes the Sprite process migration mechanism,
including both the implementation of the mechanism itself and also
when the Sprite system invokes the mechanism to trigger
migration. The mechanism is a trade-off between four factors:
transparency, residual dependencies, performance, and complexity. In
the paper, the Sprite designers themselves claim to have emphasized
transparency and performance. In order to maximize transparency and
performance, they accepted some residual dependencies and added
complexity, although the designers state they attempted to minimize
complexity by choosing the simplest implementation paths whenever
possible.
Did the Sprite designers make the appropriate trade-offs? In my
opinion, they did not adequately demonstrate in this paper that the
appropriate trade-offs were made. The most fundamental flaw in their
work was the lack of an application mix. The paper only presented
explicit results for a parallel make, LaTeX, and some simple
synthetic tests. Perhaps their trade-offs are appropriate
considering the particulars of their environment and job mix (i.e.
just parallel compilations), but it is unclear if the trade-offs
chosen would be appropriate given a wider variety of applications or
environments. For example, process migration performance may not be
very important if the job mix contained longer-running processes,
since the time of migration would be relatively small when amortized
over a long runtime.
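To make the amortization argument concrete, here is a small worked
calculation. The function name and all of the numbers below are
hypothetical and are not measurements from the paper; the sketch assumes
a single migration per job.

```python
def migration_overhead_fraction(migration_seconds, runtime_seconds):
    """Fraction of total elapsed time spent on one migration."""
    return migration_seconds / (migration_seconds + runtime_seconds)


# Hypothetical numbers: a 0.5 s migration applied to jobs of varying length.
for runtime in (1.0, 60.0, 3600.0):
    frac = migration_overhead_fraction(0.5, runtime)
    print(f"runtime {runtime:7.1f}s -> migration is {frac:.2%} of elapsed time")
```

Under these assumed numbers the same migration cost is a large share of
a one-second compile step but a negligible share of an hour-long job,
which is the sense in which migration speed matters less for
longer-running processes.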
I also believe they did not emphasize minimization of complexity
enough in their quest for transparency. Evidence for over-complexity
includes the fact that the authors mentioned the implementation was
extremely fragile, and for nearly two years was regularly failing
whenever other changes to the kernel were made. Also, the authors
described a highly complex system to preserve transparency in the
event that the access position for a file is shared between two or
more processes. In my opinion, this is an example where transparency
is not very important, especially considering the complexity (and
maybe performance?) costs. The authors failed to mention one
real-world application that relies on this behavior, and in my own
experience, practically every program that forks a new process
quickly closes any inherited file descriptors -- or at least stops
using those descriptors while the child lives.
In my opinion, the authors were too negative about residual
dependencies. Several of the undesirables of residual dependencies
mentioned by the authors still existed in their implementation, even
after adding complexity to avoid such dependencies. One advantage of
residual dependencies is a natural form of scalability. In Sprite,
the central file server quickly becomes a bottleneck as
memory pages are being written during migration. On the other hand,
the Condor system leaves a process behind on the home workstation to
act as a file server for that process during migration. Although
Condor adds this residual dependency, one result is a more scalable
system because in essence every home system added is equal to adding
another file server. And in some instances, the Sprite authors
overstated the performance penalty of residual dependencies - for
instance, the discussion about the gettime() system call. With only
a tiny hit on complexity, this system call could have been easily
optimized to cache the time skew between the host and target system
on the target system itself, thereby requiring only one call over the
network back to the host system per migration instead of an RPC every
time gettime() is invoked.
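A minimal sketch of the skew-caching idea I have in mind follows. The
class SkewCachedClock and the callback fetch_home_time are hypothetical
names standing in for whatever RPC the kernel would use; the sketch also
assumes that network latency and clock drift after migration are small
enough to ignore, which a real implementation would have to address.

```python
import time


class SkewCachedClock:
    """Answers gettime() locally after one remote call per migration."""

    def __init__(self, fetch_home_time, local_clock=time.time):
        self._fetch_home_time = fetch_home_time  # hypothetical RPC to home
        self._local_clock = local_clock
        self._skew = None       # home time minus local time, once known
        self.remote_calls = 0   # for illustration: how often we go remote

    def on_migrate(self):
        # The process is on a new machine with a different clock, so the
        # cached skew is no longer valid.
        self._skew = None

    def gettime(self):
        if self._skew is None:
            # One call over the network per migration.
            self.remote_calls += 1
            self._skew = self._fetch_home_time() - self._local_clock()
        # Every later call is answered from the local clock plus the skew.
        return self._local_clock() + self._skew
```

The process still sees time as reported by its home machine, so the
transparency goal is kept, while the per-call RPC is replaced by one RPC
per migration.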
Finally, I feel some of the authors' other viewpoints were off
target. For instance, the authors stated that one important
motivation for emphasizing performance of their migration mechanism
was so an owner experiences minimal disruption when a process
eviction is triggered upon their return to the workstation. But
there are many other methodologies to minimize interactive disruption
of the owner besides just evicting the process quickly; for example,
perhaps the process could migrate off the machine at a slow
low-priority trickle.
Student
The Sprite process migration mechanism makes appropriate decisions in
most cases with respect to the trade-offs between the conflicting goals
of transparency, residual dependencies, performance and complexity. The
mechanism for virtual memory transfer, for instance, seemed justified
since the backing storage for virtual memory was the network file system
anyway, and since such a pre-existing mechanism could be used directly,
it did make the implementation simpler. Since only dirty pages incurred
overhead at migration time, the trade-off between performance and
simplicity seems correct. The fact that such a decision also avoids
residual dependencies at the remote machine (on an eviction event,
since the source need not retain pages or later respond to paging
requests) makes the choice a good one. Another decision, viz. to
implement special server code for migrating files, seems to be a good
choice, since it made caching at the remote machine possible,
which benefits performance.
Care has been taken to ensure that migration/eviction of a process from
a remote machine does not cause residual dependencies on that remote
machine. This has implications related to residual dependencies,
performance and complexity. On one hand, it ensures that returning
users at the remote machine do not sense a performance loss due to
residual dependencies. At the same time, it limits the number of
machines involved with the process, a definite plus with respect to
performance. The benefits of this policy seem to justify the choice.
Transparency was also favoured over residual dependencies in deciding
to permit some residual dependencies on the home machine. As seen from
the results, the overhead owing to these was minimal and the choice was
hence justified.
The migration policies used in Sprite primarily represent the trade-off
between implementation complexity and the performance implications of
these more complex implementations. Migration was still the exception
rather than the rule even in the Sprite system. This kind of a system
usage also involves the trade-off between implementation complexity and
complete transparency too. The paper mentions that 'Users do not think
of their workstations as shared'. However, this could also be one of
the motivations for using a model similar to the processor pool model.
The system would need to decide dynamically when to migrate a
process in order to achieve complete transparency. This has been
sacrificed for simplicity. The technique of using centralized
approaches for storing the idle-host database is also an example of
a performance-complexity tradeoff. Though this has been justified by the
scale of the system, increased frequency of load-averaging might cause a
collapse of such a system.
Student
"...we emphasized transparency and performance, but accepted residual
dependencies in some situations."
"In the case of eviction, there are no residual dependencies on the source
after migration." In the case of a user returning to a computer, the transfer of
a process also seems to go pretty fast so the computer is quickly available
for that user. This was one of the goals of the project so I think this
design was appropriate.
Some implementations of the system added extra complexity (like using an
intermediate fileserver to move the virtual memory), but they justify that by
claiming that this adds to the overall performance. And since the system has
been running stably for a while, the added complexity was apparently not that
much of an issue.
They say that they considered a host idle if there was no mouse or keyboard
activity for 5 minutes, but later changed it to 30 seconds. They claim that
this has no noticeable impact on the users of those machines, but in
my experience a normal user is idle for 30 seconds very often (if you're
reading something, for example). I think the 30 seconds may lead to too many
process migrations (and therefore unnecessary slowdowns for the remote
process), since a host thinks it's idle, gets a process, and then soon after
that the user is done reading and the program has to migrate again.
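To illustrate the concern, here is a small sketch that counts how often
a user's return would force an eviction under different idle
thresholds. The function name and the input trace are made up for
illustration, and the count is an upper bound: it assumes a foreign
process is placed on the host during every idle window.

```python
def count_evictions(input_times, idle_threshold):
    """Count gaps between input events long enough to mark the host idle.

    Each such gap ends with the user returning, which would evict any
    foreign process that had been placed on the host in the meantime.
    """
    evictions = 0
    for prev, cur in zip(input_times, input_times[1:]):
        if cur - prev >= idle_threshold:
            evictions += 1
    return evictions


# Hypothetical trace (seconds): typing, reading for 45 s, typing,
# reading for 90 s, typing, then away for 10 minutes.
trace = [0, 5, 50, 55, 145, 150, 750]

print(count_evictions(trace, 30))   # 30-second threshold
print(count_evictions(trace, 300))  # 5-minute threshold
```

With this made-up trace the two short reading pauses count as idle
periods under the 30-second threshold but not under the 5-minute one,
which is exactly the extra churn I am worried about.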