
Distributed Computing in Practice: The Condor Experience

D. Thain, T. Tannenbaum and M. Livny, "Distributed Computing in Practice: The Condor Experience", Concurrency and Computation: Practice and Experience 17, 2-4, February-April 2005, pp. 323-356

Reviews due Tuesday 11/18.

Comments

Summary:

The Condor project is a distributed, scalable, flexible, fault-tolerant batch processing system developed in academia starting in the 1980s; it accepts a job, allocates a resource to it, executes it, and transparently returns the result to the user. The core of the Condor system is the Condor high-throughput computing system, which offers both high-throughput computing and opportunistic computing through a variety of tools such as ClassAds, checkpointing and migration, and remote system calls. In a Condor system, a user submits a job to an agent, which advertises itself to a matchmaker; a resource also advertises itself to the matchmaker, and the matchmaker decides which agents and resources are compatible. In the claiming stage, the agent and resource negotiate: the agent creates a shadow process, which stores the state of the job, and the resource creates a sandbox process, which actually executes the job. Inter-domain communication and communication with other batch processing systems are also supported by Condor through the GRAM protocol, which is supported by almost all batch processing systems.


Contributions:
1. High-throughput computing, which makes the large amount of resources available in the system usable for executing jobs, and opportunistic computing, which uses idle computers to carry out job execution.
2. Preemptive-resume scheduling, one of the scheduling techniques used in the system: a job is scheduled on an idle resource, and if that resource is reclaimed by its owner or claimed for a higher-priority job, the running job is migrated to another resource.
3. Gateway flocking, which was initially used to communicate with Condor pools in other geographic locations, but was eventually abandoned because of the complexity of the implementation and the organization-level agreements it required.
4. Direct flocking, in which the agent itself communicates with the matchmaker of another pool. This was later developed into I/O communities, where a centralized matchmaker is used for a specific geographic location.
5. The matchmaking algorithm, which plays a central role in matching agents with resources and helps make the system more flexible and lightweight.
6. The master-worker problem solver, where all jobs are accepted into a queue from which different worker processes can be created to carry out the tasks, and DAGMan, where a DAG of jobs is created and executed in dependency order, with a rescue DAG used to resume the workflow from the point at which it failed (see the sketch after this list).
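
As a rough illustration of the rescue-DAG idea, here is a minimal Python sketch (the dictionary-of-parents representation and the run_job callback are hypothetical, not DAGMan's actual file format or API): jobs run in dependency order, and when one fails, the unfinished subgraph is written out so the workflow can be resumed later.

```python
# Hypothetical illustration of the rescue-DAG idea; not DAGMan's real file format or API.
from collections import deque

def topological_order(deps):
    """deps maps each job to the set of jobs it depends on (all jobs appear as keys)."""
    indegree = {j: len(parents) for j, parents in deps.items()}
    children = {j: set() for j in deps}
    for j, parents in deps.items():
        for p in parents:
            children[p].add(j)
    ready = deque(j for j, n in indegree.items() if n == 0)
    order = []
    while ready:
        j = ready.popleft()
        order.append(j)
        for c in children[j]:
            indegree[c] -= 1
            if indegree[c] == 0:
                ready.append(c)
    return order

def run_dag(deps, run_job, done=frozenset()):
    """Run jobs in dependency order; return (completed_jobs, rescue_dag_or_None)."""
    completed = set(done)
    for job in topological_order(deps):
        if job in completed:
            continue                                  # finished in an earlier attempt
        if not run_job(job):                          # a node failed: stop here and
            rescue = {j: parents for j, parents in deps.items() if j not in completed}
            return completed, rescue                  # emit the unfinished subgraph
    return completed, None                            # the whole workflow finished

# Resuming later: call run_dag again with the rescue DAG and the completed set.
```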

Confusing:

1. They say that in the Java universe, the JVM talks to the shadow process through the sandbox, because the job cannot talk to the shadow over the same secure communication channel created between the sandbox and the shadow; this is one of the big differences between the standard universe and the Java universe. I didn't understand how it is possible in the standard universe.

Learning:

The matchmaking algorithm, which is so powerful and is one of the important features of the system, was the most interesting part of this project for me. They also talk about decoupling scheduling from planning, so that a separate policy can be used for each of them, which makes the system more flexible.


Summary:
- The paper presents Condor, a system to facilitate distributed computing by distributing resources (processor, RAM) to agents (workload requesters) in a pool. Sharing resources grants supercomputer-class computing power at the low cost of many cheap workstations.

Problems:
- Messages may be lost, corrupted, or delayed.
- Computers are heterogeneous with different hardware, different operating systems, unreliable networks, changing configurations, and many users with different private policies.
- Matchmakers have to map workloads to resources optimally. Oversubscribing to resources will delay other workloads, while undersubscribing to resources will delay the current workload.
- Condor also had to balance the consistency, availability, and partition-tolerance trade-offs of CAP.

Contributions:
- The Condor kernel (Figure 2) forms the basis on which many resources can be shared. Although many distributed computing protocols have changed over time, the Condor kernel has withstood the test of time.
- Gateway flocking allows sharing resources across organizational boundaries. Gateway flocking was quickly replaced with direct flocking, which is more flexible about resource sharing. Flocking allows a larger scale of resource sharing, which is crucial for distributed computing.
- Shadows are used to specify the job at runtime for the best workload-resource match. Sandboxes allow processes to run without being intrusive. Shadows and sandboxes help facilitate resource sharing.

Confusing:
- I did not really understand the concept of the grid as shown in Figure 1. Is the grid used to match users to fabric? If so, then I think the paper could have used a better model/figure, such as the icon on the top of every page.

Learned:
- Condor can be used to apply distributed computing. Furthermore, Condor has undergone many changes in the past to keep up with demand, and many more changes are expected to fully utilize distributed computing.

Summary,
In this paper the authors discuss the design of Condor, a distributed high-throughput computing system. The authors aim to design a high-throughput computing system built from a network of low-cost computers, which consists of idle computers as well as dedicated machines belonging to an organization. The nodes can join or leave the computing cluster at any time, and the owners have complete control over how long their nodes participate in the cluster as well as the resources each node contributes to the cluster.

Problem,
The authors aim to solve the problem of building a large compute cluster, with computing power equal to that of a similarly sized supercomputer, from low-cost heterogeneous compute nodes that can be used by the scientific community throughout the world. One of the important tenets of the Condor design is flexibility, which aims to put the owners of the compute nodes in full control of their contribution to the cluster.

Contributions,
- The authors clearly enumerate the design of Condor and the philosophy behind it, and also explain how the design has evolved over time (decades) to satisfy the needs of the users.
- The flexible design philosophy where the members of the community are encouraged to contribute to the cluster by putting them in full control of their contribution.
- The ability to advertise for jobs and available resources and match them through an independent match maker using ClassAds.
- The ability of the condor system to transparently checkpoint the submitted jobs so that they can be migrated and rescheduled later.
- The mechanisms of gateway flocking and direct flocking to share the resources across organizations without incurring much administrative overhead.

Learned,
How to design a community-driven system flexible enough to encourage the members of the community to contribute to the system and to enable the community to grow naturally.

Confused,
The idea of execution domains was a little unclear to me, particularly how they are formed and how an agent selects one for submitting a job.

SUMMARY: The authors describe practical experience both developing and running a distributed system over a period of time at UW-Madison, including both social and technical issues and their effect on the software.

PROBLEM: A relatively inexpensive way to buy compute power is to buy lots of commodity CPUs as opposed to a single giant supercomputer. Organizations these days typically already have many commodity CPUs in the form of desktop computers connected by a network. However, whether opportunistic or dedicated, compute resources at times are not used to their full potential. Condor aims to resolve that by offering benefits for joining resources together.

CONTRIBUTIONS: The technical mechanisms of collecting and advertising both jobs and resources make for a very flexible system, in which at every point the owner of either a job or a resource is allowed to express policy that includes both requirements and preferences. This naturally leads to a system where the resource owners are in control. By building on that trust, there is a benefit to grouping resources together so that the dynamic workloads put into the system have additional opportunities to execute on temporarily underused machines, and this allows communities and clusters to grow together over time.

Methods for easily sharing compute resources (called "flocking") across administrative boundaries were developed. These again leave the owner in control and individual admins all perceive that there is a benefit to being part of such an arrangement, even in the case of running a "Personal Condor" which is essentially a pool of one node running all on one machine.

Condor also draws a distinction between scheduling and planning, which is crucial in a commodity-hardware world where many variables can change at any time. Entire clusters may come and go (due to the owner's desire to use them, or for technical reasons such as an inter-continental network partition) and Condor handles this as the normal case. Late binding of resource allocation leads to a very flexible system.

Condor focuses on the big picture of "Throughput" rather than more short-term metrics such as latency. Success is typically measured in overall system throughput rather than the turnaround time for individual jobs or workloads. And to this end, Condor provides mechanisms to users such as Remote Execution (to run on a machine that may not have met your data requirements) and checkpointing (to seamlessly migrate from one machine to another) that improve both the user experience and overall system throughput.
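
To make the checkpointing idea concrete, here is a toy Python sketch under stated assumptions (Condor's standard universe actually checkpoints the entire process image transparently; the explicit state dictionary, file name, and machine_available callback below are invented for illustration): work is periodically persisted so that, when a machine is reclaimed, the job can resume elsewhere from the last checkpoint.

```python
# Toy illustration of checkpoint/restart; Condor's standard universe checkpoints the
# whole process image, which is far more involved than this explicit state dictionary.
import os
import pickle

CHECKPOINT = "job.ckpt"

def load_checkpoint():
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT, "rb") as f:
            return pickle.load(f)
    return {"next_item": 0, "partial_sum": 0}

def save_checkpoint(state):
    tmp = CHECKPOINT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CHECKPOINT)                 # atomic rename: never a torn checkpoint

def run(work_items, machine_available):
    state = load_checkpoint()                   # resume wherever the last machine stopped
    while state["next_item"] < len(work_items):
        if not machine_available():             # owner reclaimed the workstation:
            save_checkpoint(state)              # checkpoint so the agent can migrate us
            return "vacated", state
        state["partial_sum"] += work_items[state["next_item"]]
        state["next_item"] += 1
        if state["next_item"] % 100 == 0:
            save_checkpoint(state)              # periodic checkpoint
    return "done", state

# e.g. run(list(range(1000)), machine_available=lambda: True)
```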

LEARNING: I was a little confused about scheduling jobs, in particular "gang-matching," in which multiple resources can be allocated simultaneously, as doing so will be either imperfect or can lead to deadlock, and the paper doesn't go into much detail on how this is done or what trade-offs were made. One thing I learned is that early flocking was done with a server-to-server protocol before being replaced with direct resource advertisement from one pool to another.

Summary: This paper summarizes the experience and philosophy of the distributed computing system Condor (since 1984). Condor is the first distributed computing system that allows all normal users to use computing resources that were previously available only to a few super users.

Problem: In the 1970s people recognized the value of using multiple cheap commodity computers as a distributed system. Such a system can provide the same computing power as a supercomputer at much lower cost. The problem with a centralized distributed system is how to allow for the heterogeneous demands and usage patterns of different users. Condor solves this.

Contribution:

1. This paper reviews the history of Condor, summarizes the experience gained from it, and most importantly shows the philosophy behind it: flexibility. Flexibility allows as many users and computing resource owners as possible to participate in the computing system. The users and owners should be able to submit and withdraw their jobs and resources at any time.

2. Condor uses of ClassAds and MatchMaker to pair up the resources and jobs.

3. For each job, Condor uses a shadow to represent the user to the resource. On each resource, Condor uses a sandbox to build a protected execution environment for the job.

4. Dividing the Condor system, with its heterogeneous computation machines, into different pools.

5. Condor uses a directed acyclic graph (DAG) to formalize job dependencies. This is used to execute multiple dependent jobs. Failure is also handled by creating a rescue DAG from the old one.

Things I learnt: The idea of flexibility, in contrast to a centralized distributed computing system. Condor can utilize idle computing resources and allows every user to utilize them with freedom. It is more about bringing distributed computing to a heterogeneous society.

Confusion: The philosophy of flexibility is really powerful in allowing everyone to access the jobs. However, how do you avoid the possibility of users maliciously sniffing the jobs? Information is even easier to leak than in cloud computing, since the victim's jobs can be running on the attacker's machine...

Summary:
This paper introduces the Condor project, a distributed computing platform.

Problem:
Computing power can be achieved inexpensively with collections of small devices rather than a single expensive supercomputer. Some distributed computing systems use the dominant centralized control model, while Condor insists that every participant should be in control.

Contributions:
(1) The philosophy of the Condor is flexibility.
(2) Introduce the Condor high-throughput computing system, and the Condor-G agent for grid computing. Condor is a high-throughput distributed batch computing system. Condor provides high-throughput computing and opportunistic computing. Condor adopts the flexible ClassAds language, allows job checkpointing and migration, and preserves the local execution environment via remote system calls. With these tools, Condor can do more than effectively manage dedicated compute clusters. Condor-G combines the technology from the Condor (job submission, allocation, error recovery, creation of a friendly execution environment) and Globus (protocols for secure inter-domain communications and standardized access to remote batch systems) projects to form a tool that binds resources spread across many systems into a personal high-throughput computing system.
(3) The kernel of Condor has components including the matchmaker (introduces potentially compatible agents and resources), agent (remembers jobs and finds resources willing to run them), resource, shadow (provides details necessary to execute a job) and sandbox (creates a safe execution environment). As pools sprouted up around the world and users needed to share across organizational boundaries, gateway flocking was used. Gateway nodes pass information about participants between pools: a gateway passes idle agents and resources in its home pool to its peers. Direct flocking solves the problem of one user belonging to multiple communities by reporting an agent to multiple matchmakers. To track large numbers of jobs, users need queuing, prioritization, logging and accounting, which leads Condor to speak GRAM and use gliding in. Resources can group themselves into I/O communities to express "nearby" relationships.
(4) Condor uses matchmaking to bridge the gap between planning and scheduling. Matchmaking includes four steps: advertisement by agents and resources, matchmaking, notification, and claiming (see the sketch after this list). The combination of planning and scheduling strategies includes planning around a schedule and scheduling within a plan.
(5) High-level problem solvers are built on top of the Condor agent, including master-worker (run jobs on a large and unreliable workforce) and the directed acyclic graph manager (execute multiple jobs with dependencies in a declarative form).
(6) Split execution is accomplished by two distinct components: the shadow and the sandbox.
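
A rough Python sketch of the symmetric matchmaking idea (dictionaries and lambdas are stand-ins here; the real ClassAd language is a richer expression language evaluated by the matchmaker): each ad carries arbitrary attributes plus Requirements and Rank expressions evaluated against the other party's ad, and the matchmaker introduces pairs whose Requirements are mutually satisfied, preferring higher ranks.

```python
# Hypothetical stand-in for ClassAd matchmaking; real ClassAds form a richer expression
# language, but the two-sided Requirements/Rank evaluation is the core idea.
job_ad = {
    "Owner": "alice", "ImageSize": 300,            # MB
    "Requirements": lambda my, other: other["OpSys"] == "LINUX"
                                      and other["Memory"] >= my["ImageSize"],
    "Rank": lambda my, other: other["Mips"],       # the job prefers faster machines
}
machine_ad = {
    "OpSys": "LINUX", "Memory": 2048, "Mips": 1000, "KeyboardIdle": 900,
    "Requirements": lambda my, other: my["KeyboardIdle"] > 600,   # owner policy: only when idle
    "Rank": lambda my, other: 0,
}

def mutually_satisfied(a, b):
    return a["Requirements"](a, b) and b["Requirements"](b, a)

def matchmake(jobs, machines):
    """Introduce each job to the compatible machine it ranks highest."""
    pairs = []
    for job in jobs:
        candidates = [m for m in machines if mutually_satisfied(job, m)]
        if candidates:
            best = max(candidates, key=lambda m: job["Rank"](job, m))
            pairs.append((job, best))              # notification step; the claim itself
            machines.remove(best)                  # happens later, agent <-> resource
    return pairs

pairs = matchmake([job_ad], [machine_ad])          # -> one (job, machine) match
```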

Learned:
This paper dissects the Condor system and explains every component in detail. It gives me an overview of how a distributed computing system should work and clarifies the working mechanism of this system and every component.

Confused:
Is the ClassAd language just a simple version of JSON without nested values? The author mentions that a single machine may run either or both an agent and a resource server. How are the matchmaker, resource server, and agent actually distributed among computers? Do current distributed computing systems, such as Hadoop, have a similar idea of a resource manager, matchmaker, sandbox and shadow?

Summary
The paper discusses Condor, a resource-sharing distributed computing system composed of commodity machines. The authors discuss the motivation and evolution of the system.

Problem
Commodity hardware is cheap and abundant. Users do not use commodity machines all the time. If there is any way to use the idle time on commodity hardware, it could become a huge computational resource. Both the owner of the machine and the job that is to be run during idle time need considerable freedom to set flexible policies.

Contributions
The philosophy of "leave the user in control, regardless of cost"
Sandbox is used to provide isolated working environment.
Checkpointing at periodic times helps in migrating the job when failure occurs without restarting it again.
Resource sharing across different condor pools using flocking - gateway flocking and direct flocking.
Gliding in technique to build condor pools from distributed system operating on GRAM.
Remote system calls helps in running jobs without sharing any file system.
ClassAds is used to express policies on both the resource and job side.

Learned
The use of schema free language to communicate the specifications in distributed systems.
The use of checkpointing to safely migrate jobs.

Confused
It is unclear whether the system provides any guaranteed bound on the execution time of a job.

Summary
The paper describes Condor, a distributed high-throughput batch computing system where users submit jobs, and Condor, based on certain policies, chooses to run them on an appropriate machine that has idle resources.

Problem
It is much more economical to build a distributed system for batch processing from a set of commodity machines than to use a single supercomputer, but doing so introduces another set of problems due to their heterogeneity and the possibility of node failures.

Contributions
1. ClassAds Matchmaking - Condor provides a framework allowing jobs to describe their requirements and resources to describe their availability and policies; the matchmaker finds an appropriate match for each job and runs it.

2. Job Checkpointing and Migration - Condor ensures forward progress for jobs, as well as some form of fault tolerance, by allowing jobs to be checkpointed and migrated to a different node if required.

3. Remote System Calls - Condor allows users to run jobs on remote machines and redirects all I/O-related system calls back to the user's machine that submitted the job, eliminating the need for a shared file system.

4. Shadow and Sandboxing (Split Execution) - Condor ensures jobs run correctly in the desired environment by using two components: the shadow, which represents the user and specifies runtime parameters for the job, and the sandbox, which runs the job in an isolated environment on a remote machine.

5. DAGMan - A service to run multiple jobs with dependencies.

6. Gateway Flocking and Direct Flocking - The ability to utilize a remote pool of idle resources.

Thoughts
Condor's ClassAds provide a uniform way of representing jobs and machines and matching them.
Naming - It wasn’t mentioned exactly how machine information is made visible to remote matchmakers and vice versa

Summary
Condor is a distributed system that combines computing resources and provides a unified resource pool. Via opportunistic computing it utilises the idle cycles of the computers of a community.

Problem Statement
Condor has two main goals: unifying commodity hardware into a single huge computing pool that provides a giant computational resource to users, and allowing (but not requiring) users to collaborate and make it a globally managed system. This poses many philosophical questions, such as administrative decentralization, resource sharing limits, etc.

Contribution

  • A good balance between user control and transparency. It allows users to control how they want their jobs to be distributed, while for a naive user who simply submits a job Condor can appear to be just a giant supercomputer.
  • Matchmakers: matchmakers help resources and requirements find each other at the least overhead cost.
  • Shadows and sandboxing provide better internal and external control over the job that is being executed.
  • Cycle stealing: Condor utilises the idle computing resources of the community and unifies them into one pool.
  • Because of checkpointing it can easily move a process between machines.

Confusing
How is it different from Hadoop or MapReduce? I understand they are all more or less designed with a certain set of parameters in mind, and hence their architectures are quite different, but it would be good to see (at least some of) them compared side by side on the same page.

Learning
Higher user control with adequate security can achieve much higher throughput and meet the users' needs.

Summary
The paper talks about the history and evolution of the Condor project. It describes the design philosophy that the project adopted, namely flexibility, which accommodates the heterogeneous nature of distributed systems in terms of users, hardware, unreliable networks and ever-changing configurations.

Problems to solve
Part of providing flexibility is the need to give users the option to form communities and grow naturally. It is the user's choice whether to cooperate with another.
This requires the user to be in control of their resources, which is another goal of Condor.
Resource management without over-dependence, and planning to overcome failure.
Provide high-throughput computing, and bridge the gap between planning and scheduling of jobs even under the heterogeneity caused by the control given to users over their own resources.

Contributions
Describes the various guidelines that help maintain the philosophy of flexibility that Condor provides.
Provides software tools that aid high-throughput computing. ClassAds provide flexibility in allocation policies and the planning approach. The ability to resume a job with the help of checkpointing and the provision of remote system calls aid in improving the compute power.
Gateway flocking, which provided inter-organization resource sharing.
Use of matchmaking, which bridges the gap between planning and scheduling.
Schema-free comparison of resource and agent with the use of ClassAds.

Confusing about the paper
It seems the matchmaker has to do a lot of work; won't it become a bottleneck?

Learnings
The concept of gateway flocking; the transfer of jobs to other resources, effectively restarting the job without loss of state; and the use of DAGs to provide a general ordering for jobs.

Summary:

Condor (legally renamed to HTCondor) is a specialized batch processing system developed by Miron Livny's group at University of Wisconsin - Madison that manages the jobs for a grid computing system. Jobs are prepared and submitted to Condor and it takes care of finding the correct machine type and running the job. In this paper the history and philosophy of the Condor project is provided along with its interactions with other projects and its evolution along with the field of distributed computing. The core components of the system are outlined along with the philosophy that technology of computing must correspond to social structures.

Problem:

To have ready access to large amounts of computing power inexpensively, a collection of small devices was a natural choice over an expensive single supercomputer. But building a coherent, controllable system was difficult, since there is a fundamental tension between consistency, availability and performance in distributed systems. Most of the systems at that time employed a dominant centralized control model, but it was evident that in a grid computing environment such a model was not sustainable, since inconvenienced machine owners would withdraw from the grid computing community. Hence there was a logical need for an internationally distributed heterogeneous grid computing system in which every participant remained free to contribute as much or as little as it cared to. Condor tries to solve this with its motto: "leave the owner in control, regardless of the cost."

Contributions:

  • A unique philosophy of flexibility that aligns with the assumption that the technology of computing must correspond to social structures. The philosophy of flexibility allows communities to grow naturally, leaves the owners in control always regardless of the cost, plans without being picky or over-dependent on the correct operation of any remote node, and created a community for knowledge and expertise sharing.
  • The Condor high-throughput computing system provides high-throughput and opportunistic distributed batch computing in addition to a job mechanism, scheduling policy, priority scheme, resource monitoring and resource management.
  • Condor-G is an agent for grid computing that binds resources spread across many systems into a personal high-throughput computing system using technologies from the Condor and Globus projects.
  • Provides powerful tools such as:
    • ClassAds: a language that provides a framework for matching resource requests with resource offers.
    • Job Checkpoints and Migration: a form of fault tolerance that safeguards the accumulated computation time of a job.
    • Remote System Calls: preserving the local execution environment (see the sketch below).
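
As a toy illustration of the remote-system-call idea (the message protocol, function names and in-memory file table below are invented; Condor actually achieves this by relinking the job against a special library), I/O requests made inside the sandbox are forwarded to the shadow on the submit machine, so no shared file system is needed:

```python
# Toy sketch of Condor-style remote system calls: the job's I/O requests are trapped
# in the sandbox and forwarded to the shadow on the submit machine, so no shared file
# system is needed. The message protocol below is purely illustrative.
from multiprocessing import Pipe, Process

def shadow(conn):
    """Runs on the submit machine; serves I/O requests against the user's files."""
    files = {"input.txt": b"data that lives only on the submit machine"}
    while True:
        call, *args = conn.recv()
        if call == "read":
            conn.send(files.get(args[0], b""))   # would really open the user's file
        elif call == "exit":
            break

def sandboxed_job(conn):
    """Runs on the execute machine; 'system calls' become messages to the shadow."""
    conn.send(("read", "input.txt"))
    data = conn.recv()                           # contents fetched across the network
    print("job read", len(data), "bytes with no shared file system")
    conn.send(("exit",))

if __name__ == "__main__":
    shadow_end, job_end = Pipe()
    p = Process(target=shadow, args=(shadow_end,))
    p.start()
    sandboxed_job(job_end)
    p.join()
```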

Learned:

It is interesting to learn about the social aspect of grid computing driven by the core philosophy of flexibility. The utility of Condor's grid computing is very evident from its usage in the LHC at CERN, the IceCube project and many big-data analytics projects, but it is very interesting to see how individual owners still maintain control.

Confusion:

Since there is parallel processing involved, we could effectively see synchronization problems; I am not sure synchronization in general was discussed in detail in this paper.

Summary:
Thain, et al describe the history and philosophy of the Condor project which they created here at UW Madison. Condor unites dedicated cluster management and scavenged idle resources into a unified grid which can run high throughput parallel processing and other batch tasks.

Problem:
It was recognized that commodity hardware could outperform supercomputers for some types of workloads. However, creating a flexible system that could combine heterogeneous idle resources as well as dedicated systems (both of which are geographically and organizationally dispersed) into a unified single environment that could reliably run distributed computing tasks is a non-trivial problem. The problem is further complicated by the fact that Condor is a community-based system, and it was critical to respect users by permitting but not requiring cooperation.

Contributions:
1. Gliding In: the implementation of Grid Resource And Access Management (GRAM) by “gliding in”. Unlike direct flocking, this prevents the system from under-subscribing to long queues or over-subscribing to too many queues by creating a personal pool of remote resources (i.e. resulting in more efficient resource consumption).

2. Matchmaker & ClassAd: in a heterogenous grid computing environment a single centralizing scheduling system is not practical. Users turn their machines on & off, and cycle through periods of inactivity and usage creating a dynamic resource pool that is constantly shifting. The matchmaker unites ClassAds that bring agents and resources together with matching needs, and allocates them, taking both requirements (that a machine is truly idle) and ranks (how desirable is the match) into consideration. Matchmaking is distinctly separate from claiming which allows stale information (resulting in bad claims) to be rejected.

3. Split Execution: the knowledge required to successfully complete a job is split between the execution machine and the submission machine. Hence, it makes sense to split execution responsibilities respectively into the sandbox and shadow. The sandbox gives the job a safe place to work (with a functional non-hostile environment with correct access and permissions), and the shadow provides information such as input files, arguments, environment, etc which are required for the job to run.

4. Standard / Java Universe: this is the ‘sand’ in the sandbox. It replicates the user’s home environment and allows checkpointing (snapshots). The shadow represents the user and nothing happens without the shadow’s consent & hence both must work together (split execution).
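
A small Python sketch of why claiming is kept separate from matching (the class and attribute names here are invented for illustration, not Condor's protocol): the resource re-evaluates its owner's policy at claim time, so a match made from a stale advertisement is simply rejected and the agent returns to the matchmaker.

```python
# Illustrative only: the resource re-evaluates its owner's policy at claim time, so a
# match made from a stale advertisement is rejected and the agent tries again.
import time

class Resource:
    def __init__(self):
        self.keyboard_idle_since = time.time() - 3600   # advertised as idle an hour ago

    def still_willing(self, job_ad):
        # Re-evaluate the owner's policy *now*, not when the ad was sent.
        return time.time() - self.keyboard_idle_since > 600

    def claim(self, job_ad):
        return "claimed" if self.still_willing(job_ad) else "rejected: ad was stale"

def agent_claim_loop(job_ad, matchmaker_match, max_tries=3):
    for _ in range(max_tries):
        resource = matchmaker_match(job_ad)             # match based on periodic ads
        if resource is not None and resource.claim(job_ad) == "claimed":
            return resource
    return None                                         # keep the job queued, retry later

r = Resource()
r.keyboard_idle_since = time.time()                     # the owner just came back
print(r.claim({"Owner": "alice"}))                      # -> rejected: ad was stale
```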

What I Found Confusing:
Security wasn’t addressed at all; how is this system not vulnerable to a malicious attacker?

What I Learned:
Creating a unified grid computing system across a geographically and organizationally dispersed set of heterogeneous hardware is a monumentally complex problem. Condor has managed to stay flexible and evolve into a system that remains relatively popular by applying a number of novel approaches to difficult problems (which I listed as contributions).

Summary:
Condor provides a co-operative computing system with commodity machines. The paper describes the design of condor as well as its integration with other projects like Globus.
Problem:
High throughput computing can be achieved through expensive supercomputers or through co-operative processing on inexpensive commodity machines. Though co-operative computing seems to be the more preferable choice, there are various challenges requiring attention like communities with different policies, scheduling, failure tolerance etc.
Contributions:
- ClassAds, a schema-free resource allocation language, allows for variability/flexibility in matching resource requests and offers.
- Ability to checkpoint and migrate jobs which is essential considering their motto of “leave the owner in control, regardless of the cost”. When the computation is long running batch jobs, it is very important to checkpoint so as to avoid the cost of having to restart the job from the beginning.
- Sandboxing which provides an isolated and complete working environment on the execution machine for the remote job.
- Remote system calls which redirects all of a job’s I/O related system calls back to the machine that submitted the job. This enables users to run their jobs on remote machines even in the absence of a shared filesystem without having to copy the data over.
- A two-phase approach to matchmaking. First, the matchmaker itself introduces compatible agents to resources. After that, the agent is still responsible for checking by itself whether the resource is a valid match. This provides more flexibility in updating the matchmaker, i.e., as much care need not be taken to make sure the matchmaker is strictly up to date.
- They introduce two different flocking mechanisms, each with their own advantages to help share resources across communities.
- They provide two seemingly very useful problem solvers -- master-worker and DAGMan -- each with its own unique programming model (a master-worker sketch follows this list).
- They provide wrappers around Globus which brings condor functionality to it and also run condor servers as jobs on the batch system to provide a personal pool for users.
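
As a rough sketch of the master-worker model (illustrative Python only, not Condor's MW framework or its API), the master keeps a queue of independent tasks and hands them out to an unreliable pool of workers, re-queuing whatever a vanished worker was holding:

```python
# Toy master-worker sketch (illustrative only): the master keeps a queue of independent
# tasks and hands them to an unreliable pool of workers; a task held by a worker that
# disappears is simply given to someone else.
import queue, random, threading

def master(tasks, n_workers=4):
    todo, results = queue.Queue(), []
    for t in tasks:
        todo.put(t)

    def worker():
        while True:
            try:
                t = todo.get_nowait()
            except queue.Empty:
                return                        # nothing left to do
            if random.random() < 0.1:         # simulate this workstation being reclaimed
                todo.put(t)                   # give the task back to the master...
                return                        # ...and vanish
            results.append((t, t * t))        # the "work": e.g. one point of a parameter sweep

    while not todo.empty():                   # master keeps recruiting workers until done
        workers = [threading.Thread(target=worker) for _ in range(n_workers)]
        for w in workers: w.start()
        for w in workers: w.join()
    return results

print(len(master(range(100))), "tasks completed")
```
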
What was not clear:
The matchmaker was not explained in sufficient detail. Can it become a bottleneck and also is it a single point of failure within a community/pool?
My Key takeaway:
The importance of checkpointing, especially in a co-operative computing environment and especially when you have long running jobs. Remote system calls and running the condor processes as condor jobs themselves are pretty neat techniques I felt.

Summary :
This paper describes Condor, a distributed batch processing/computing system that has high-throughput computing and opportunistic computing as its base paradigms.

Problem :
Users may have jobs that need a lot of compute power and resources which they might not be able to afford. Also, users might not know how to partition their jobs to run efficiently on a distributed framework. In such a scenario, Condor offers a grid computing framework to provide resources for users to run their jobs. The details of which node does what part of the computation are kept transparent to the user, making it very simple to use.

Contributions :
1. ClassAds used to represent each of the compute nodes based on their capabilities help the matchmaker choose the appropriate compute node(s) for a task.
2. Checkpointing jobs at periodic times enables scheduling other jobs (which might have a higher priority) and changing the pool of resources dynamically.
3. Security is taken care of by the sandbox-shadow process pair where the sandbox consults the shadow to check against malicious attackers.
4. The flocking technique is used to migrate jobs from one pool to another where idle resources are available.

What I learned :
I learned about how a match maker might be designed, and how jobs can be migrated across a distributed framework. The paper also had a brilliant problem solver component using DAGs which I found interesting.

What I found confusing :
How does Condor protect against malicious agents trying to bypass the matchmaker?

Summary: The paper presents Condor: how it works, its goals, its development history, and the problems that were overcome along the way. Condor arose from the idea of building an environment for distributed computing by focusing on the needs of an organically growing community of users and workstations.

Problem: Distributed computing that utilizes the unused compute time of commodity workstations is becoming more cost-effective and even faster than specially built supercomputers. The problem that arises when trying to take advantage of this framework is how to make it easy and convenient for the owners of the compute resources to freely donate their idle power, and to make it easy for users to get access to the owners' compute resources.

Contribution: The main contribution of this paper was the outline of the history and basic philosophy behind the Condor system as well as its basic operation.

One of the aspects of the condor system that makes it unique is their design philosophy. The goal was to get as many users and owners as possible to be able to participate in the system. In other words, the system was designed around being flexible. A particular point to help illustrate this was their focus on the social aspects of computing. The designers were cognizant of the fact that this system will be used by real people and thus, social communities will be a natural result. Therefore, in order to enshrine the goal of flexibility, what seems to be an orthogonal concern has to be one of the centerpieces in the design. Another design point is to make sure that the owner has as little pain as possible when donating his/her compute resources. That means that they are able to stop any job in progress at any time and that configuration to participate in Condor has to be as painless as possible.

In order to accomplish this, their basic design is that the user launches an agent, which goes to a matchmaker, which pairs them up with a resource. The agent and the resource then negotiate to solidify the relationship. The resource creates a sandbox that the condor job can run in (this helps solve problems of owner configuration as the sandbox environment makes sure that proper permissions and files are set up). A condor pool is made up of resources and a matchmaker along with at least one agent (users can both be contributing to the resources of the system as well as agents). One initial problem was that an agent might want to utilize resources from another pool. They developed gateway flocking. This solved the problem, but was eventually dropped in favor of direct flocking as gateway flocking required negotiation at the organizational level.

Matchmaking is done via ClassAds, which resources advertise to the matchmaker, and the matchmaker uses this information to pair up resources and agents. Condor allows for two types of programs, master-worker and DAGMan. Finally, Condor allows split execution, which ensures that jobs are checkpointed regularly and that the owner of a resource is isolated from the Condor job; that is where the shadow and sandbox come into play.

Confusion: With regards to the standard universe (figure 15), is all that happening on a single machine (i.e. the resource)? Does the agent run the shadow on the resource? It seems that it does. My confusion arises from what figure 2 seems to say and figure 15.

Learned: That, if getting a lot of users for a system is the goal, then the social aspects of the system being put into use is actually very important instead of just a secondary concern. This seems obvious when you think about it, but few papers seem to really point this out.

Condor

Summary:
The goal of the paper is to highlight the overall architecture and the experience gained while building a supercomputer out of numerous ordinary computers across the world. HTCondor lets normal users make use of computing power which was once available only to privileged users who could afford the high costs of supercomputers.

Problem:
The computer scientists of the 1980s wanted to build a system which could provide ready access to large amounts of computing power. We could achieve this by using a supercomputer, but it cost a lot of money. This problem was solved by harnessing the power of multiple commodity computers and making them perform at the level of a supercomputer. The problems faced by the centralized model of computing were also solved using this distributed system.

Contributions:
1. The most important contribution of HTCondor is their philosophy of flexibility which helped develop Condor to become a famous distributed system.
2. The idea of ClassAds and MatchMaker to establish match between the resources and the jobs formed the basis of the functionality of Condor.
3. The idea of shadow at the agent which has all the necessary information about the job to be executed and the idea of sandbox to provide a protected execution environment at the resource for the execution of the job were pretty good.
4. Separating the planning from scheduling clearly separated the responsibilities of the user and the resources.
5. The idea of gliding in helped form Condor pools from heterogeneous distributed systems which understand the GRAM protocol.
6. Building a problem-solver component on top of the agent, with which the user can communicate, was important from the perspective of the user.
7. Split execution provides a unique model of distributed cooperative computing where the responsibilities are divided among the available resources.

Thing I learnt:
I learnt that people started building massive distributed systems like Condor very early and that their design still works. I was also happy to learn that almost all machines in the CS Department at Wisconsin are used as part of a huge supercomputer.

Thing I found confusing:
I'm not very sure about the difference between distributed systems and grid computing. Also, I'm not sure what will happen if I send a job (e.g. grep on a 10 GB file): will it be automatically parallelized by the problem solver to use multiple machines, or will it use just a single machine?

Summary:
The paper talks about condor, a distributed resource management system; its evolution over the years and the various lessons learnt while developing the system. It is an opportunistic execution framework which combines the idle CPU power, projects it as a resource and uses it for various computations.

Problems:
Some batch execution jobs require a lot of computing resources. Idle CPUs across organizations can be combined to perform jobs which require high computational resources. This setting is cheaper than having a dedicated supercomputer perform the same job. Though cheaper, this type of distributed setting traditionally does not have enough flexibility and gives little control to owners; inconvenienced owners back out, and the system becomes difficult to control as it grows.
Condor is an opportunistic execution framework with the key philosophy of flexibility and providing full control to owners.

Contributions:
1. Different from existing batch execution systems:
- Flexibility in a heterogeneous environment and control to owner.
- High throughput computing and opportunistic computing.
2. Matchmaking and ClassAds: a schema-free, semi-structured data model for matching resource requests and offers.
3. Checkpoints for subsequent resuming and migrating from one machine to another.
4. Flocking (gateway and direct) for cross pool matching is transparent to participants.
5. Gliding-in technique to make personal condor pools out of remote resources using GRAM protocol.
6. Split execution through the shadow (represents the user and provides job execution details) and the sandbox (a safe execution environment).

Learnings:
Co-operative computing, match making process and the concept of classified advertisements (classAds).

Unclear concept:
1. Security concern: can a user sniff on a Condor job being run on their machine, given that an idle resource is detected based on the absence of keyboard strokes, etc.?
2. Can the matchmaker be a bottleneck in the overall system, as it has to talk to many agents and resources?
3. A very remote possibility: is it possible that a job never gets executed fully because whenever it is assigned an idle machine, that machine is taken back by its user? Or do we care about the environment setup cost/time if it has to be repeated multiple times because a previously assigned machine was taken back by its owner?

Summary:
Condor is a system that can be deployed over a large number of workstations and can be used for batch processing of tasks. This paper provides a gist of the Condor system – primarily the design principles of the system, its evolution over time and how user needs shaped it. The key idea of Condor is to use idle computational power and match user tasks to the appropriate available resources.

Problem:
Workstations with huge computational power are inherently very expensive, and they can be replaced by a set of low-cost workstations that offer the same computing power. This can very well be achieved with existing infrastructure where systems are idling. The challenges lie in identifying the appropriate resource, allocating the jobs, checking for failures, and other coordination.

Contributions:
• Condor is based on flexibility to the users – importantly it allows the owners to control when and how much they wish to contribute. This also leads to natural growth of the user community.
• I liked the implementation of matchmaker – it makes the whole process of identifying and allocating the resource transparent.
• Gateway flocking – allows coordination between multiple Condor pools, benefiting all users in the pools. Direct flocking is the other version – interaction between a single user and another pool.
• ClassAds – can be either for a job or as a resource, they allow advertising the jobs and the resources so that the matchmaker can identify the right pair according to requirement and rank.
• Remote system calls help redirect I/O requests back to the user machine that submitted the job and return the results to the resource machine.
• DAGMan helps in resolving the dependencies between the jobs.
• Shadow – provides the user representation to the resource. Sandbox - a daemon at the resource machine that creates a safe environment for the target jobs to be run on the machine.


Confusion:
The paper says that standard universe provides checkpointing to handle failure and also to transfer the process. I didn’t completely understand who coordinates this transfer and what happens if the resource is malicious and falsely resubmits the job to many other nodes?

Learnings:
I learnt how resource is found and allocated via the agent and the matchmaker and how it works across domains. I also learnt how a suitable and safe environment is created on the resource machine using sandbox.

Summary:
In this paper the authors describe the evolution of Condor, a distributed computing system for batch processing of jobs. With a core design philosophy of flexibility, it provides complete control to the owners of the contributing compute nodes to set policies for the jobs executed on their resources. Condor considers the requirements of the jobs as well as the constraints on the machines when matching jobs to processing nodes.

Problem:
With the wide-spread availability of inexpensive computing machines and the interconnecting infrastructure, it is more economical to perform high-throughput computing on a combination of these than on supercomputers. But with the contributing nodes being under different administrative control and policies, it was required that each contributing node be free to contribute as much as it wanted. With this requirement, mechanisms were needed to detect idle cycles on contributing nodes, and to match and schedule jobs as per the requirements specified by the job and the restrictions imposed by the nodes.

Contributions:
-The idea of providing complete control to resource owners in deciding policies for what jobs can be executed on their machines.
-The capability to transparently record job checkpoints and migrate to other machines. This also enables preempting a low-priority job to schedule a high-priority one.
-The approach of using remote system calls between shadow and sandbox to supply run-time information to the jobs. Also, sandboxing protects the host resource from malicious attacks.
-The idea of ClassAds, representing job requirements and machine restrictions in a flexible semi-structured format, which is used by the matchmaker for scheduling jobs. This gives the flexibility of using heterogeneous types of machines in the grid.
-The use of a directed acyclic graph to represent job dependencies and to execute multiple jobs with dependencies, along with a simple approach for handling failures by creating a rescue DAG.

Confusing:
I am a little confused about direct flocking: in this case, how does the agent learn the addresses of matchmakers in external communities?

Learned:
-I learned about how semi-structured data can be used to define requirements and restrictions and used for the purpose of matching.
-Also the use of checkpoints for migrating a job executing on one machine to another.
-Learned the use of DAG to represent job dependencies.

Summary:
This paper provides a history and philosophy of the Condor project and gives a brief introduction to Condor's design philosophy and internal implementation, as well as how it evolved alongside other newer techniques like grid computing.

Problem:
As distributed systems scale to ever larger sizes, they become more and more difficult to control or even to describe. Specifically:
1. Different distributed systems have different underlying hardware, operating systems and applications.
2. Network connections are unstable, especially when the network is a nationwide network.
3. Configurations of the distributed systems keep changing constantly.
4. Owners of different distributed systems do not have uniform requirements and policies, which makes it a hard challenge to get these heterogeneous systems to cooperate.

Contributions:
1. The design philosophy of Condor. The authors present their design philosophy of flexibility in the development of Condor system. This principle is simple but meaningful.
2. The ClassAds and matchmaking system. The authors introduce the concept of advertisement into the Condor system to achieve the goal of crossing the policy and requirement boundaries among different underlying computing systems. Both the machine and the task publish their requirements, and the matchmaker matches the task to some kind of resource. In this way, Condor users have no need to think about how to acquire more resources even when local resources are not enough.
3. Resource sharing among different distributed systems. Three ways are provided to achieve this goal: gateway flocking, direct flocking and the personal Condor pool, providing different granularities of sharing to Condor users.
4. Different data models and execution universes. Various data models and execution universes are provided to fit the different needs of different users; for example, the traditional MW model is well suited to problems which can be split into independent subproblems, while DAGMan is suited to multiple jobs with dependencies expressed as a DAG.

Confusings:
Condor uses ClassAds to publish tasks' requirements and machines' capabilities and to match tasks to resources. So, is it possible for some malicious users to tamper with the ClassAd-related code locally to make the ClassAd publish a bigger task requirement than it actually needs?

Things I learned
The most interesting thing I learned is the DAG manager, which pipelines the work. The DAG model, as an alternative to the MW model, provides a high-level abstraction for dependent jobs. Organizing a workload as a DAG also exposes potential optimizations such as merging or reorganizing pipelines, etc.

Summary: This paper shares its experience with Condor, a high throughput computing system comprised of many individual machines with different policies.

Problem: Building a single supercomputer is more expensive than putting together many less capable computers. However the problem of managing different computers is hard because of node failures, unreliable network connections, and different policies deployed at different nodes.

Contribution:
1. Building a distributed system that is comprised of individuals who are willing to contribute their computing resources to the community, aka Condor pool. This is like the BitTorrent protocol where users volunteer to share their files with peers.

2. Each owner has full control over his machine. That is, he can decide whether or not to contribute his resource to the community, how much resource he is willing to contribute, whom to share and not to share his resources with, and what kind of computing jobs are acceptable, etc. This is achieved by having the agents, matchmakers, and resources have their own policies.

3. Checkpointing to deal with failures. During the execution of a job, checkpoints are stored. In case of a failure, the remaining job can be executed on a different node from the last checkpoint.

4. Sandboxing to protect the host from malicious jobs. When a job is sent to a resource, the job is executed in a sandbox inside the host. The sandbox prevents the job from accessing resources that it shouldn't be able to access.

5. Gateway flocking and direct flocking to facilitate inter-community resource utilization.

Things confused me: Is Condor capable of executing complex parallel programs that make extensive use of synchronization primitives and/or inter-process communications? Both the Master-Worker and the DAGMan problem solver seem too simple to handle such tasks.

Things I learned: Despite various things mentioned in the paper, what surprised me most is Condor is still alive today!

Summary:
The paper presents the design and implementation highlights of the various incarnations of the Condor distributed computing environment developed at Madison. The paper gives an overview of the reasoning behind each design choice and how each subsequently evolved as users utilized the system and the system grew in size both physically and spatially.

Problem:
The Condor environment was built to provide users access to compute resources which existed as idle personal desktop machines. The idea was to provide a grid-style distributed environment where jobs could be submitted in batches and the owners of machines dictated how their idle resources could be exploited, so that people would be willing to participate in the Condor infrastructure.

Contributions:
- The large scale grid style distributed compute infrastructure consisting of machines, resources, agents, and matchmakers.
- The template interfaces (DAGMan, master-worker, standard & Java environments) exposed to users of the condor system to make job creation easier to reason about based on differing needs.
- The ideas of gateway flocking, direct flocking, and gliding in to allow for scalable and transparent acquisition of resources by agents as the Condor grid grew into a world-wide compute system.
- The overall road-map and discussion of the design choices/philosophy of the various components of the system (I imagine a resource like this is invaluable to people designing their own distributed systems with similar design goals) and the experience gained as users stressed the system and requested the use of Condor in different ways.
- The creation of the ClassAds schema free language to describe the resources required by a job and the way a machine could be utilized within condor.

Learned:
I learned about the Condor system at a high level (it was just something I knew took up resources on my machine before). I found the idea of gliding in to be particularly clever for building Condor resource pools on demand.

Confused:
I would be interested to hear in some detail how the Condor system took the ClassAds language and performed matchmaking. Considering the language seems very expressive, it seems like a daunting task (algorithm) to provide accurate matches on all the criteria that can be specified. Or perhaps the majority of users require similar resource use cases, so the common case is manageable.

summary:
- they described their distributed system for high-throughput computing.

problem:
- using many smaller machines instead of a big super computer would be less expensive.
- they want to design a system that can use the available resources as much as possible without putting too much constraint on the owners.

contributions:
- they work under the assumption that there are always some available resources, so they designed their system to be flexible: if one resource becomes unavailable they can find a new resource and move to that.
- this way machine owners know that if they share their machines with condor they will not lose computing power, as condor will only use them when they are idle.
- the system has three components: agents that submit jobs, resources (the computing machines), and matchmakers (that match the agents and resources).
- the agents and resources advertise their requirements to a matchmaker.
- they use ClassAds to describe the requirements and preferences.
- the matchmaker finds a match between agents and resources and informs the agent.
- the agent contacts the resource and sees if it is still available.
- they use checkpointing to save the progress of jobs.
- if a resource becomes unavailable (for example, the owner of the machine starts using it) they migrate the job to another available resource and resume from the checkpoint.
- they had flocking to submit jobs from one pool to another pool (if one pool is busy and the other has idle resources); a sketch of direct flocking follows this list.
- they use execution domains so that they can consider locality: they prefer to run a job in the local network that has a high-speed connection rather than submitting it to a different domain via a slower connection (even though they are in the same pool).
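
A rough sketch of direct flocking under illustrative assumptions (the Matchmaker class and its match() method are stand-ins, not the real Condor protocol): the agent advertises to its home matchmaker first and only turns to matchmakers of other pools when the home pool has nothing to offer.

```python
# Illustrative sketch of direct flocking; Matchmaker and its match() method are
# stand-ins, not the real Condor protocol.
class Matchmaker:
    def __init__(self, name, idle_machines):
        self.name, self.idle = name, list(idle_machines)

    def match(self, job_ad):
        """Return an idle machine compatible with the job, if any."""
        for m in self.idle:
            if job_ad["Requirements"](m):
                self.idle.remove(m)
                return m
        return None

def submit_with_flocking(job_ad, home_pool, remote_pools):
    for pool in [home_pool] + remote_pools:        # the home pool always gets first chance
        machine = pool.match(job_ad)
        if machine is not None:
            return pool.name, machine              # the agent now claims the machine directly
    return None                                    # stay queued and try again later

home = Matchmaker("home-pool", [])                 # home pool is currently busy
remote = Matchmaker("remote-pool", [{"OpSys": "LINUX", "Memory": 1024}])
job = {"Requirements": lambda m: m["OpSys"] == "LINUX"}
print(submit_with_flocking(job, home, [remote]))   # -> ('remote-pool', {...})
```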

learned:
- I had used DAGMan before when submitting jobs to condor, but I didn't know about the master worker setting. it could be very useful in specific scenarios (for example as they mentioned for searching the parameter space).

confusing:
- if I understand correctly, the matchmaker suggests a match and then the agent should contact the resource itself. can the agent contact a different resource? can an agent contact a resource without asking the matchmaker? for example, if there is a malicious agent, can that agent keep sending requests to resources without talking to the matchmaker and go around the balancing policy of the matchmaker?

Summary:
This paper gives the philosophy behind Condor and shows how social factors decided how it grew. It gives in-depth descriptions of changes over time.

Problem:
Condor was designed to facilitate batch job distribution in a cluster. It was made flexible enough that it worked well with a distribution of workstations. This is a focus for Condor, and a difficult problem, since workstation owners have complete control over the machine. Thus, it is challenging to have a system that balances the power and requirements of the owner with the needs of the users.

Contributions:
Condor as a whole has contributed much towards utilizing distributed workstations for batch jobs. This paper is long and describes many different aspects of the system, and more importantly, how the system evolved from what the designers thought would be best to what was the best. There are a few takeaways from this.

Large systems can grow, and evolve, efficiently in coordination with users. Of course Condor is not the only example of this, but the nature of Condor makes it an interesting case study for coordination among people and administrations. Take, for example, the evolution of the architecture. As it became obvious that users and administrations needed an easy way to submit jobs on each other's machines, Condor implemented the flocking mechanism; but even that changed with time.

Another takeaway is the understanding that the owner of the machine is always in charge. They are in charge of how they are advertised in ClassAds. They can cancel jobs at any time and use their machine. They are in charge of deciding how many resources these jobs can take up. The Condor system has shown how important, and powerful, this is in a batch job system like Condor.

Limited Understanding:
I was interested in the trust aspect of the machines. They discuss the sandbox and how it is meant to protect the host machine from any malicious jobs. But what about the other way around? The host machine ends up in possession of the job and all the files it needs; how much protection does the sandbox provide against a malicious host?

Learned:
I learned a lot from this paper. The biggest one is probably some of the general aspects of batch, and grid, computing. These were things like Master-Worker and Globus.

Summary:
This paper describes the Condor distributed computing system: its basic architecture and how it has evolved over time. Users submit their jobs to the system, which then finds the required resources and uses them to execute those jobs.

Problem:
The problem Condor is trying to solve is how to use the idle computing power of workstations and machines spread over different locations to provide large computing resources for batch processing of jobs.

Contribution:
1. The main contribution of the system is the flexibility provided to agents and machine owners to cooperate in the execution of jobs. It uses a matchmaker component that matches a request with the required resources. Both the agent and the resource advertise their requirements to the matchmaker beforehand using the ClassAd language.
2. I liked the idea of "gliding in". GRAM provides an abstraction over the remote batch system's machinery, which hides its details from the agent (and hides Condor's features along with them). To solve this, the agent submits Condor servers as ordinary jobs through GRAM. These servers then contact a personal matchmaker started by the agent. This way the agent has carved a personal Condor pool out of remote resources, over which it can now submit its normal jobs.
3. Execution domains are used for efficient data transfer by grouping a collection of resources around a checkpoint server. Agents use this information to place jobs intelligently.
4. Split execution: for each job there is a shadow component, which represents the user, and a remote sandbox component, which provides a safe environment for the job. The sandbox contacts the shadow for details about the job (a rough sketch follows this list).
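
A toy sketch of the split execution idea from point 4 (plain Python objects standing in for the shadow and sandbox; real Condor forwards actual system calls over a secure channel):

    import io

    class Shadow:
        """Represents the submitting user; keeps the job's files at the home machine."""
        def __init__(self, files):
            self.files = files
        def remote_syscall(self, op, *args):
            # only 'open' is modeled here
            if op == "open":
                return io.StringIO(self.files[args[0]])
            raise NotImplementedError(op)

    class Sandbox:
        """Safe execution environment at the remote resource."""
        def __init__(self, shadow):
            self.shadow = shadow
        def run(self, job):
            # the job gets an 'open' that is served by the shadow, not the local disk
            return job(lambda name: self.shadow.remote_syscall("open", name))

    shadow = Shadow({"input.txt": "3 4 5"})
    result = Sandbox(shadow).run(
        lambda open_: sum(int(x) for x in open_("input.txt").read().split()))
    print(result)  # 12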

Confused: I am confused about Condor-G. The paper mentions that the Condor and Globus projects each provide some functionality. How is Condor-G used with Condor, and what functionality does it provide?

Learned: I learned how a matchmaker can provide flexibility to cooperating machines in a distributed system. The shadow and sandbox components, which provide a safe environment for remote execution of a job in a hostile environment, are also interesting.

Summary:
This paper introduces the evolution of Condor, a distributed batch computing system that leverages the computation resources of participants' servers and desktops. The authors explain the design philosophy behind the system and discuss Condor's unique and powerful tools.

Problem:
With collections of desktops it is feasible to build a high-performance computing system, but it is difficult to build such a distributed system because of many challenges, like lost messages, delays, etc. And for a system in which every participant remains free to contribute as much or as little as it cares to, it is even more complicated to manage the resources and dispatch jobs to the available computing machines.

Contribution:
- Unlike a centralized-control model, every participant in Condor can contribute to the system as desired; that characteristic makes Condor easier for more users to accept.
- This paper introduces the architecture of Condor and explains the responsibilities of every module. Two approaches are discussed to solve cross-community resource sharing - gateway flocking and direct flocking (a rough sketch of direct flocking appears after this list).
- The key primitives of the Condor system are resource acquisition and resource management; the paper summarizes them as planning and scheduling and describes the detailed steps and the interactions between modules in the process.
- After the introduction to resource sharing and resource management, the paper covers job management and execution, which provide the ultimate service to users.
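
A rough sketch of direct flocking, as I understand it (the Matchmaker class and its match() method are made up, not Condor's interface):

    # Toy direct flocking: the agent advertises to its home pool's matchmaker
    # first and falls back to the matchmakers of other pools it knows about.
    class Matchmaker:
        def __init__(self, name, idle_resources):
            self.name, self.idle = name, list(idle_resources)
        def match(self, job_ad):
            return self.idle.pop() if self.idle else None

    def find_resource(job_ad, matchmakers):
        for mm in matchmakers:                       # home pool first
            resource = mm.match(job_ad)
            if resource is not None:
                return mm.name, resource
        return None                                  # everything busy; retry later

    pools = [Matchmaker("home", []), Matchmaker("remote-campus", ["node7"])]
    print(find_resource({"Owner": "alice"}, pools))  # ('remote-campus', 'node7')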

Learned:
Condor is a distributed computing system in which every participant can both share and utilize computation resources. The philosophy behind building such a system is different from that of building a high-throughput computing system with dedicated servers.

Discussion:
Actually, I had some experience with Condor. I tried to use it to run GEM5 to simulate a CPU in a course project. Condor seems to be a good candidate for running this kind of computation-intensive application, but unfortunately I was not able to use it successfully. The major issue was the "sandbox" implementation: code that is not well tested can easily evict other users' jobs and slow the whole system down significantly. For example, if I submit a job that claims to use at most 2 cores but actually uses more than that and risks overwhelming a multi-core machine (8 cores, 16 cores, etc.), Condor will kill it along with many other users' jobs on the same machine. What made things worse is that I was not able to find any way to figure out where the problem in my code was, because it ran perfectly well on a single machine. I am confused about the sandbox implementation of Condor; this paper doesn't cover that topic in detail. My personal feeling is that the isolation between different users' jobs in Condor is not good enough.
Another concern about Condor-like systems is that job interruption or eviction is very frequent, and checkpointing and migration have a high cost. So only jobs that can be split into many small independent pieces are suitable for Condor; long-running jobs are not, I think.

Summary
This paper provides an overview of Condor, a system that combines the high-throughput computing model with opportunistic computing, allowing users to submit jobs that get run on a platform composed of a distributed, heterogeneous compute pool. The unused cycles made available by resource owners are provided through a matchmaking service to potential consumers who wish to run batch jobs on the available compute. Condor provides the platform that allows potential consumers of resources to be matched with providers in a manner that is conscious of the policies and requirements of both sides.

Problems
Providing large-scale compute for jobs is expensive if you choose to go with a standalone machine. As jobs get larger and larger, providing the necessary compute becomes arbitrarily more expensive. To enable scale, the Condor system allows programmers to partition their work into tasks that can run as jobs on the Condor system. Condor abstracts over a heterogeneous compute pool, one that is not necessarily always available, and makes it usable in a manner that works for both the provider and the consumer of the resources.

Contributions

  • Matchmaking: Given the heterogeneous nature of both the job requirements and the resources, ClassAds provide a means to express the expectations of both sides of the transaction and allow Condor to place a request on a resource for potential execution.

  • Construction of Condor pools by allowing requests to be ‘glided in’ using the GRAM protocol understood by multiple batch systems is a neat solution for interoperability.

  • Split execution - a mechanism that unifies the user's view of the system: the shadow lets the user specify the job, provide the arguments, the environment, etc., while the sandbox provides a safe execution environment for the job and protects both the job and the machine from damage they could do to each other.

  • Checkpointing - as a means to record program state to stable storage, and enable re-execution/relocation on failure.
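
Condor's standard universe checkpoints the whole process image transparently; as a toy, application-level illustration of the resume-from-checkpoint idea (not Condor's actual mechanism):

    # Toy checkpoint/restart: progress is periodically written to stable
    # storage so a preempted job can resume elsewhere from the last checkpoint.
    import json, os

    CKPT = "job.ckpt"   # in Condor this would live on a checkpoint server

    def run(total_steps=1_000_000, ckpt_every=100_000):
        state = {"step": 0, "acc": 0}
        if os.path.exists(CKPT):                 # resuming after preemption/migration
            with open(CKPT) as f:
                state = json.load(f)
        for step in range(state["step"], total_steps):
            state["acc"] += step                 # stand-in for the real computation
            if (step + 1) % ckpt_every == 0:
                state["step"] = step + 1
                with open(CKPT, "w") as f:       # record progress to stable storage
                    json.dump(state, f)
        return state["acc"]

    if __name__ == "__main__":
        print(run())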

What’s unclear
How do agents perform resource discovery in direct flocking? How is the correct remote matchmaker identified? Or is the request broadcast to all the matchmakers that the agent knows about? Do remote matchmakers advertise the resources that register with them?

Concept Learned
As with most things that we've seen, the level of indirection provided by the matchmaking process enables both the producers and consumers of resources to identify each other at run time and exercise their own policies without any tight coupling.


Summary:

In this paper, the authors present an overview of
the Condor system, whose goal is to build
distributed systems that scale to world-wide
computational grids. The key technical goal
is to achieve high throughput for computational tasks.

Problem:

There are three technical challenges:
- Feasibility: How to express and support a diverse set of
users and their workload;
- Social Effect: How to make sure people want to join the
grid to share their resources;
- Performance: How to harvest a large amount of computation
time.
We describe the contributions corresponding to each of
these points one by one.

Contributions:

One contribution is a loosely coupled representation
for specifying the requirements of jobs and their matches.
The authors describe "schema-free" job ads (ClassAds)
submitted by the submission machines and matched
by a matchmaker. This allows flexibility in representing
job requirements and decreases the amount of maintenance
work needed when machines change.

Another contribution is the experience of ensuring that
resource owners are not frustrated into no longer wanting
to share their resources. The key solution is to respect
their ownership of their resources. To do this, the system
needs to be ready for a highly dynamic workload. For
example, when the user has their own work to do on their
workstation, the currently running Condor jobs need to
stop. There are techniques, e.g., checkpointing, that make
the impact of evicting jobs smaller.

To get high performance, the system design is loosely
coupled. The job ads are based on a three-valued logic
to avoid over-specific job matching. The matching process
can cross organization boundaries in multiple ways,
and the current implementation avoids a centralized
gateway in order to keep the maintenance workload low.
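
A small sketch of the three-valued idea: an expression
that refers to an attribute the other ad doesn't define
evaluates to UNDEFINED rather than raising an error, and
the match simply fails instead of aborting. (This is a toy
evaluator of my own, not full ClassAd semantics.)

    # Toy three-valued evaluation: a missing attribute yields UNDEFINED,
    # and a requirement only holds if it evaluates strictly to True.
    UNDEFINED = object()

    def attr(ad, name):
        return ad.get(name, UNDEFINED)

    def ge(a, b):
        return UNDEFINED if UNDEFINED in (a, b) else a >= b

    machine_ad = {"Arch": "INTEL"}                  # no "Memory" attribute
    job_requirements = lambda m: ge(attr(m, "Memory"), 256)

    print(job_requirements(machine_ad) is True)     # False: no match, but no crash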

In terms of implementation, the authors describe different
sandboxes for running the users' jobs, e.g., the standard
universe, which communicates with the OS, and the Java
universe, which communicates through the JVM. The authors
also describe a DAG-based mechanism (DAGMan) for
fault-tolerant execution.
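
A rough sketch of the DAGMan idea (run jobs in dependency
order and, on failure, remember what already completed so a
later run can skip it, like a rescue DAG; the structures
below are made up, not DAGMan's file format):

    # Toy DAG execution with a "rescue" set: a job runs once all of its
    # parents are done; on failure, the set of completed jobs is returned so a
    # later run can pass it back in and resume from the point of failure.
    def run_dag(dag, run_job, done=frozenset()):
        done = set(done)
        progress = True
        while progress:
            progress = False
            for job, parents in dag.items():
                if job not in done and set(parents) <= done:
                    if not run_job(job):
                        return done              # what to skip on the next attempt
                    done.add(job)
                    progress = True
        return done

    dag = {"A": [], "B": ["A"], "C": ["A"], "D": ["B", "C"]}
    completed = run_dag(dag, run_job=lambda j: j != "C")   # pretend C fails
    print(completed)   # {'A', 'B'}; rerun later with done=completed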

What I Found Confusing:

One thing that I am confused about is the assumption that
Condor makes about communication across workers. Is this
allowed through interfaces like MPI? If it is allowed, when
workers of different organizations communicate across
organization boundaries, how does Condor guarantee that the
user's code follows the communication policies of both
organizations? Or is it assumed to be the user's
responsibility to make sure the policies are followed?

Summary:
This paper describes Condor, a distributed batch system that leverages the available resources effectively and performs batch processing of the jobs submitted to it.

Problem:
Distributed computing gained a lot of attention after the realization that a connected network of computers can be more powerful and economical than a single supercomputer. The main problems the authors are trying to solve are high-throughput computing and opportunistic computing, which are not addressed by traditional batch processing systems. Fault tolerance over long, sustained periods of time and the ability to utilize resources whenever they become available are the key characteristics they are trying to build into Condor.

Contributions:
1. Removal of centralized control, by letting each resource owner control their own resource. The split between planning and scheduling gives both the users and the owners of resources control over their requirements.

2. Checkpointing the jobs' progress enables preemptive-resume scheduling: a machine can be used as soon as it becomes idle, and a job can be preempted from a machine (after checkpointing) in order to schedule a higher-priority job. This greatly helps with a dynamically changing set of resources.

3. The shadow and sandbox ensure that the run-time information a job needs in order to execute is available, while at the same time preventing malicious attacks by sandboxing the job and redirecting its requests to the shadow for consent.

4. The ClassAd (classified advertisement) concept helps represent restrictions and rankings in a flexible manner, thereby allowing the matchmaker to execute any allocation policy when allocating the grid resources. This is required since there is wide heterogeneity in processor speed, memory, etc. across the resources.

5. The design of the matchmaker decouples the matchmaking functionality from the actual claiming connection between a resource and an agent. The advantage of this is that the matchmaker maintains no hard state, so crash recovery of a matchmaker becomes simpler.

6. It provides reliable job execution by retrying in case of a failure. (The agent retries the job).

One thing I learnt:
When we have multiple heterogeneous architectures, a schema-free representation like ClassAds can be used to exchange information.

One thing I found confusing:
I was not sure how Condor would scale, especially whether the number of message exchanges stays small as the numbers of resources and agents grow.

Summary:

The paper describes Condor, a distributed system that accepts tasks for batch processing and executes them after matching constraints and requirements.

Problem:

Using dedicated systems for HTC is expensive. Idle CPU cycles on inexpensive workstations go unutilized.

Contributions:

- The idea of opportunistic computing can be seen in many projects like SETI @ Home, Folding @ Home where unused CPU cycles are used to perform a useful task.
- The matchmaker system for matching requests to resources seems unique since both the provider and the client provide conditions that must be mutually satisfied.
- The idea of using inexpensive workstations in a distributed manner to perform complicated tasks can be seen in Map-Reduce.
- Migration of tasks using checkpointing mechanism.
- I/O redirection to avoid migration of the entire file to the sandbox site.
- Gliding to create ad-hoc pools via GRAM was ingenious.
- Two separate flocking techniques for pooling the resources across pools.
- Two programming abstractions for problem solvers.

One thing I found confusing:

Didn't the standard universe suffer from the problems that the proxy solved in the Java universe, since it used the direct route?

One thing I learned from paper:

I learned about the matchmaking mechanism which allows both sides to state their requirements.

Summary:

Condor provides a decentralized distributed computing environment aimed at providing high throughput for production level computation. The system combines all the available compute resources as a common resource pool and performs opportunistic computing by leveraging available resources.

Problem:

  • Given the low cost of commodity hardware and the availability of existing infrastructure, performing distributed computing is much more efficient than using supercomputers.
  • In a distributed computing environment that is trying to leverage existing resources, we need mechanisms to detect idle machines and to transfer jobs away if an idle machine needs to be used by its owner.
  • Apart from the above problems, we need to know the location of an available resource and a means to match jobs to resources.

Contributions:

  • Flexible control for users and owners, with a very clear association of users and owners to specific components of the architecture (say, planning and scheduling).
  • Check-pointing can not only help in recovering from failures, but also help in moving a job to a better available resource.
  • Supporting resource matching across communities by providing flocking methods: direct and gateway.
  • Making ClassAds schema-free allows representational convenience for agents and resources; a simple logic for evaluating ads overcomes the problem of differences in representation.
  • Separation of matchmaking from claiming provides flexibility both to the matchmaker and to the agent (which can decide to accept or reject the resource). In addition, keeping the matchmaker stateless aids easy recovery when the matchmaker crashes.
  • Support for sandboxing helps prevent malicious jobs from attacking the resources they run on.
  • I/O operations are redirected back to the local machine, thereby avoiding the need for multiple copies of the data.

Unclear concept:

I am curious to know how exactly the agents and resources advertise themselves to the matchmaker; do they use some DNS resolution for this? Also, how many matchmakers would an agent have to know about?

Learning:

I learned that jobs can be moved mid-execution from one machine to another using preemptive-resume scheduling, and how checkpoints effectively make this possible.

Summary

This paper describes the history and high-level aspects of the Condor system developed here.

Problem

Ready access to large computing resources can be achieved more inexpensively with collections of small machines than with a single supercomputer. The Condor system provides such an environment for cooperative processing. In the Condor system, every participating resource and job has a lot of freedom and can enforce its own appropriate policies.

Contributions

1. ClassAds - ClassAds is a uniform language that resources and agents use to express their constraints and policies. The authors realized that a fixed schematic structure is not going to work for all scenarios, so ClassAds uses a semi-structured data model that the matchmaker uses to find matches between resources and agents.
2. Design of the Matchmaker - The matchmaker has good design elements in it. For example, it maintains only soft state: ads are sent to the matchmaker periodically, so it can reconstruct its state after a crash (see the small sketch after this list). It is also okay for the matchmaker to have stale data, since the resources and agents still have to negotiate with each other after they get an intimation from the matchmaker about the match. Similar design principles can be seen in GFS.
3. Checkpointing and Migration of jobs - Condor can record checkpoints of a job and resume the job from the last checkpoint. This also enables migration of the job to some other resource.
4. Remote operations - The system calls in the job are redirected back to the submitter. This obviates the need to copy all data files to the resources where the job actually executes.
5. Flocking - Techniques like gateway flocking and direct flocking enable agents in one condor pool to find resources on other condor pools. These techniques aggregate an enterprise's compute resources that are spread across locations.
6. The way private Condor pools can be carved out of existing grid computing resources using the Condor-G agent is neat.
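
A toy sketch of the soft-state idea from point 2: each ad carries a lease, the matchmaker's state is just the set of recently advertised ads, and a restarted matchmaker rebuilds its view from the next round of periodic advertisements. (The class, method names, and lease length below are my own, not Condor's.)

    import time

    class Matchmaker:
        """Soft-state collector: ads expire unless they are refreshed."""
        def __init__(self, lease_seconds=300):
            self.lease = lease_seconds
            self.ads = {}                                  # name -> (ad, expiry)

        def advertise(self, name, ad):
            self.ads[name] = (ad, time.time() + self.lease)

        def live_ads(self):
            now = time.time()
            return {n: ad for n, (ad, exp) in self.ads.items() if exp > now}

    mm = Matchmaker(lease_seconds=300)
    mm.advertise("vulture13", {"State": "Idle", "Memory": 512})
    print(mm.live_ads())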

Confusing

I felt I did not get a few details of the system. Is the matchmaker a single point of failure? Can a pool have more than one matchmaker?

What I learned

I learned how one can use semi-structured data to model the advertisements, and how one can build matching or similar systems on top of such a semi-structured data model.

Summary: Condor provides users with many nodes around the world abstracted as a single computing resource.

Problem: big servers dedicated to processing batch jobs are expensive. Many institutions around the world have idle workstations that could be useful to someone with a big batch job. It is difficult to schedule jobs across heterogeneous nodes, ship jobs to where they will run, and ensure that a job's computational requirements align with the machine owner's policies. Furthermore, if there is a failure in the system, it can be difficult to distinguish this failure from a failure in the job itself, and then to recover from it and reschedule the job.
Contributions:


  • the shadow and sandbox provide a way to run a job remotely and redirect I/O only where necessary

  • instead of using a gateway to bridge compute pools, abstract the different pools to the user as one huge pool (personal pool)

  • ClassAds provides a way to match jobs to machines, facilitating the scheduling problem

  • Universes are a convenient way of providing predictable environments on which to compute

Confusing: in what ways does an owner have control over what jobs run on his machine? What parameters can he control? How is this enforced? Also, is there any notion of an SLA?
Learned: Condor has a similar problem to that which we saw recently: authenticating clients from different administrative/security domains. They initially approached it with gateways, but then came up with a way to abstract different administrative domains, or pools, and make it possible to run jobs on multiple pools.

Summary:
The Condor project seeks to enable cooperative computing between commodity resources, instead of expensive supercomputers. It provides owners with full control over the use of their resources, and users with control over the treatment of their jobs at the executing machine.

Problems:
In the realm of high throughput computing, there were both expensive multicomputers and cheaper centralized-control models of distributed workstations. The latter did not allow for much flexibility or control by the individual owners of computing resources.

Contributions:
- ClassAds and Matchmaker: Allows owners to decide how much of their resources they are willing to contribute, what types of jobs they are willing to take on, and what the requirements are for running their own jobs.
+ In addition, ClassAds and Matchmaker can be used outside of the Condor system for general matching needs.
- Checkpoint: Job state can be recorded and thus jobs can be paused and resumed, even after migrating to another machine. Very useful if preemption may occur and we want to avoid throwing away already accumulated computations.
- Sandbox: Provides an insulated (box) environment on the execution machine, and a complete working environment (sand) for the remote job.
- Shadow: Responding to remote system calls, the shadow resolves many data issues inherent in remote execution of a job.
- Clever emulation of Condor system within Grid computing environment by running vanilla Condor server software as jobs within batch systems in the grid environment.

Learned:
I really liked the bootstrapping nature of running pieces of Condor as jobs within Condor (problem solvers), delegating responsibility for robustness mostly to the Agent.

Confusing:
I found the discussion of planning vs scheduling a little bit confusing.
