Distributed Computing in Practice: The Condor Experience
D. Thain, T. Tannenbaum, and M. Livny.
University of Wisconsin
One-line Summary
Condor is a batch system for high-throughput computing on commodity machines: jobs are matched to available resources and executed remotely inside sandboxes, while resource owners retain full control over how and when their machines are used.
Overview/Main Points
- Architecture outline
- User describes jobs and their requirements (optionally through a ProblemSolver) and submits them to an Agent.
- Resource provides the execution environment; jobs run inside its Sandbox.
- Matchmaker introduces Agents to Resources whose advertised requirements mutually match; the two then contact each other directly.
- Agent forks a Shadow, which communicates with the Sandbox to supply everything the job needs at runtime.
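The match cycle above can be sketched in a few lines. This is an illustrative model only, not Condor's implementation; the `matchmake` function and all attribute names are hypothetical. The key idea is symmetry: a match requires that the job's requirements hold for the machine *and* the machine's requirements hold for the job.

```python
# Toy matchmaker: ads are dicts whose "Requirements" entry is a predicate
# over the other party's ad. A match needs both predicates to hold.

def requirements_met(ad_a, ad_b):
    """Evaluate ad_a's Requirements against ad_b's attributes."""
    return ad_a["Requirements"](ad_b)

def matchmake(job_ads, machine_ads):
    """Return (owner, machine) pairs whose requirements mutually match."""
    matches, claimed = [], set()
    for job in job_ads:
        for i, machine in enumerate(machine_ads):
            if i in claimed:
                continue  # this machine is already matched
            if requirements_met(job, machine) and requirements_met(machine, job):
                matches.append((job["Owner"], machine["Name"]))
                claimed.add(i)
                break
    return matches

job = {"Owner": "alice", "Memory": 512,
       "Requirements": lambda m: m["Memory"] >= 512 and m["Arch"] == "INTEL"}
machine = {"Name": "node1", "Arch": "INTEL", "Memory": 1024,
           "Requirements": lambda j: j["Memory"] <= 1024}

print(matchmake([job], [machine]))  # [('alice', 'node1')]
```

In the real system the matched parties then claim each other directly; the Matchmaker only performs the introduction.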
- Architecture evolution
- A Condor pool: a single Matchmaker serving the Agents and Resources of one community.
- Gateway flocking: gateways between pools forward advertisements across pool boundaries, so whole organizations share load transparently.
- Direct flocking: an Agent reports itself to multiple Matchmakers; simpler, and usable by a single user without inter-organization agreement.
- Condor-G(rid)
- initial design: the Agent submits jobs to remote Grid resources through Globus (GRAM) protocols.
- personal Condor pool ("gliding in"): Condor daemons are themselves submitted to Grid resources as jobs; once running, they join a pool private to the user.
- Condor in detail
- Planning & Scheduling
- planning
- scheduling
- ClassAds: Job and Machine ClassAds are symmetric; each side declares its attributes plus Requirements and Rank expressions evaluated against the other side's attributes.
- planning within a scheduler: make choices about where to submit and when to submit.
- scheduling within a plan
- suppose where and when are fixed
- create a schedule for all the tasks
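For reference, a job ClassAd and a machine ClassAd look roughly like the following (first the job ad, then the machine ad; the attribute values are illustrative). Each side's Requirements must evaluate to true against the other's attributes for a match, and Rank expresses preference among acceptable matches:

```
MyType        = "Job"
TargetType    = "Machine"
Owner         = "alice"
Cmd           = "/home/alice/run_sim"
Requirements  = (Arch == "INTEL") && (OpSys == "LINUX") && (Memory >= 512)
Rank          = KFlops

MyType        = "Machine"
TargetType    = "Job"
Name          = "node1.cs.wisc.edu"
Arch          = "INTEL"
OpSys         = "LINUX"
Memory        = 1024
Requirements  = (LoadAvg <= 0.3) && (KeyboardIdle > 15 * 60)
Rank          = (Owner == "alice")
```

Note how the machine ad lets the owner impose policy (only run jobs when the machine is idle) and preference (favor a particular user) without any central coordination.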
- ProblemSolver: a high-level parallel design pattern
- Master-worker
- Worker processes run jobs.
- Master process maintains the work list, hands units to workers, and tracks worker processes.
- DAG: DAGMan executes jobs whose dependencies form a directed acyclic graph.
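The master-worker pattern above can be sketched as follows. This is an illustrative model, not Condor's actual MW library API; a real master would also detect failed workers and re-queue their work units:

```python
from concurrent.futures import ThreadPoolExecutor

def worker(task):
    # Stand-in for one Condor job: process a single work unit.
    return task * task

def master(work_list, n_workers=4):
    # The master steers the computation: it owns the work list,
    # distributes units to workers, and collects results in order.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        return list(pool.map(worker, work_list))

print(master([1, 2, 3, 4]))  # [1, 4, 9, 16]
```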
- Split execution: conduct secure remote execution
- shadow: runs at the submit site; supplies everything needed to specify the job at runtime (executable, arguments, environment, input files).
- sandbox: runs at the execute site; creates a safe execution environment for the job and protects the resource.
- the standard universe: emulates the job's home POSIX environment at the execution site; the job is linked with the Condor library, which traps its system calls.
- two-phase open for handling I/O requests: the job first asks the shadow how a named file should be accessed, then performs the access by that method.
- system calls that cannot be served locally are redirected back to the shadow on the original Agent (submit) node.
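The two-phase open can be sketched as below. The function names and the policy table are hypothetical; the point is the split in responsibility, with policy decided at the submit site and mechanism executed at the execute site:

```python
def shadow_lookup(logical_name):
    # Phase 1 (shadow, at the submit site): map a logical file name
    # to an access method and a physical location.
    policy = {
        "input.dat":   ("remote_syscall", "/home/alice/input.dat"),
        "scratch.tmp": ("local", "/tmp/scratch.tmp"),
    }
    return policy.get(logical_name, ("remote_syscall", logical_name))

def sandbox_open(logical_name):
    # Phase 2 (sandbox, at the execute site): perform the access
    # using whatever method the shadow prescribed.
    method, path = shadow_lookup(logical_name)
    if method == "local":
        return f"open {path} on the execute machine"
    return f"forward I/O on {path} to the shadow via remote system calls"

print(sandbox_open("input.dat"))
```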
- Security issues
- Secure communication: secure RPC
- Secure execution: user-id based management (jobs run under the submitting user's id, a dedicated low-privilege account, or nobody).
- Reliability issues: error handling
- what if a Resource suddenly shuts down? The Agent contacts the Matchmaker for another Resource.
- what if the Condor job itself fails? It is automatically re-queued and re-scheduled.
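The recovery behavior above can be sketched as a retry loop. This is a hypothetical model of the Agent's logic, not Condor code; all class and exception names are illustrative:

```python
class ResourceLost(Exception):
    """The execute machine vanished or evicted the job."""

def run_with_recovery(job, matchmaker, max_attempts=5):
    for _ in range(max_attempts):
        resource = matchmaker.match(job)   # (re)negotiate a match
        try:
            return resource.execute(job)   # shadow <-> sandbox execution
        except ResourceLost:
            continue                       # machine went away: re-match
    raise RuntimeError("job exceeded its retry budget")

# Demo: a resource that fails twice before succeeding.
class FakeResource:
    def __init__(self, failures):
        self.failures = failures
    def execute(self, job):
        if self.failures > 0:
            self.failures -= 1
            raise ResourceLost()
        return f"{job}: done"

class FakeMatchmaker:
    def __init__(self):
        self._r = FakeResource(failures=2)
    def match(self, job):
        return self._r

print(run_with_recovery("sim.exe", FakeMatchmaker()))  # sim.exe: done
```

Because the job's state lives with the Agent rather than the Resource, a crashed or withdrawn machine costs only the lost work, not the job itself.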
Relevance
Flaws