Cluster Resource Management: A Scalable Approach
Ning Li and Jordan Parker
CS 736 Class Project
Outline
Introduction
A Scalable Approach: Hierarchy
Results
Conclusions
Questions
Why Study Resource Management?
Clusters have become increasingly popular for large parallel computing.
Web Servers
Clusters are growing to the order of thousands of nodes.
Clusters are providing multiple services.
Resource management is hard to evaluate
Bad management is easy to identify
Good management is much harder to identify
Resource Management Example
4th node services only B
Poor Management
Ideal
Clustering Goals
Scalability
Reliability
High Performance
Affordability
Related Work
Proportional-Share
Cluster Reserves
Related Work: Approach Differences
Our goal: to provide a scalable solution for resource management.
Other work focused primarily on achieving good management.
This often meant one manager for all the nodes.
Clearly this could become a scalability bottleneck.
Effectiveness: other solutions are probably better for smaller clusters; we hope to be better for large (>1000 nodes) clusters.
Hierarchy: A Scalable Approach
Hierarchical Management
Nodes service jobs
Managers facilitate resource management
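As a rough illustration of the hierarchy described above, the sketch below models nodes as leaves that report per-class usage and managers as interior vertices that sum the reports of their direct children. The names `Node`, `Manager`, `report`, and `build_example`, along with the usage numbers, are hypothetical and chosen only for illustration; the slides do not specify the reporting interface.

```python
from collections import Counter


class Node:
    """Leaf of the hierarchy: services jobs and reports per-class usage."""

    def __init__(self, name, usage):
        self.name = name
        self.usage = Counter(usage)  # service class -> resource units used

    def report(self):
        return Counter(self.usage)


class Manager:
    """Interior vertex: aggregates reports from child nodes or managers."""

    def __init__(self, name, children):
        self.name = name
        self.children = list(children)

    def report(self):
        total = Counter()
        for child in self.children:
            # Each manager talks only to its direct children.
            total.update(child.report())
        return total


def build_example():
    # Hypothetical two-level hierarchy: 4 nodes, 2 low-level managers, 1 root.
    leaves = [Node(f"n{i}", {"A": 6, "B": 3, "C": 1}) for i in range(4)]
    low = [Manager("m0", leaves[:2]), Manager("m1", leaves[2:])]
    return Manager("root", low)


root = build_example()
summary = root.report()  # cluster-wide usage per service class
```

Because each manager exchanges messages only with its direct children, the load on any single manager depends on its fan-out rather than on the total number of nodes.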
Banking Algorithm
Goal
Determine best allocation given previous usage
Primitives
Tickets
Bank accounts
Deposit / withdraw tickets
6 Steps
Banking Algorithm
Step 1: For each service class on each node
Deposit unused tickets
Step 2: For each service class on each node
Reallocate service class
Full utilization: Allocation = usage + k
Under utilization: Allocation = usage - k
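Steps 1 and 2 can be sketched for a single node as follows. The two allocation formulas are taken as written on the slide. Everything else is an assumption made for illustration: that "unused tickets" means allocation minus usage, that "full utilization" means usage has reached the allocation, that there is one bank account per service class, and that an allocation never drops below zero. The function name `deposit_and_reallocate` is hypothetical.

```python
def deposit_and_reallocate(allocation, usage, k, bank):
    """Apply Steps 1 and 2 to one node.

    allocation, usage: dict mapping service class -> tickets
    k: adjustment constant from the slide
    bank: dict mapping service class -> banked tickets (updated in place)
    Returns the node's new allocation.
    """
    new_alloc = {}
    for cls, alloc in allocation.items():
        used = usage.get(cls, 0)

        # Step 1: deposit unused tickets (assumed to be allocation - usage).
        unused = max(alloc - used, 0)
        bank[cls] = bank.get(cls, 0) + unused

        # Step 2: reallocate using the formulas from the slide.
        if used >= alloc:  # assumed test for "full utilization"
            new_alloc[cls] = used + k
        else:  # under utilization; clamping at zero is an assumption
            new_alloc[cls] = max(used - k, 0)
    return new_alloc


bank = {}
new_alloc = deposit_and_reallocate(
    {"A": 60, "B": 30, "C": 10},  # current allocation
    {"A": 60, "B": 10, "C": 10},  # observed usage
    2,
    bank,
)
# bank      -> {"A": 0, "B": 20, "C": 0}
# new_alloc -> {"A": 62, "B": 8, "C": 12}
```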
Banking Algorithm Cont.
Step 3: For each service class
Compare total allocation to desired
Subtract from over-allocated
Add to needy & under-allocated
Step 4: For each service class
Deposit / Withdraw
If still over-allocated withdraw
If still under-allocated deposit
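The comparison in Step 3 can be sketched as below. This covers only the "compare total allocation to desired" part; how tickets are then moved between nodes and the bank in Steps 3 and 4 is not specified in enough detail on the slides to code. The function name `allocation_deltas`, the use of fractional shares, and the example numbers are assumptions.

```python
def allocation_deltas(node_allocs, desired_share):
    """Compare each class's total allocation with its desired total.

    node_allocs: list of dicts, one per node, service class -> tickets
    desired_share: dict, service class -> desired fraction of all tickets
    Returns class -> (total - desired); positive means over-allocated,
    negative means under-allocated.
    """
    totals = {cls: 0 for cls in desired_share}
    for alloc in node_allocs:
        for cls, tickets in alloc.items():
            totals[cls] = totals.get(cls, 0) + tickets

    grand_total = sum(totals.values())
    return {
        cls: totals[cls] - desired_share.get(cls, 0) * grand_total
        for cls in totals
    }


# Hypothetical 4-node cluster with a 60/30/10 target.
nodes = [
    {"A": 60, "B": 40},
    {"A": 60, "B": 40},
    {"A": 60, "B": 40},
    {"A": 20, "B": 10, "C": 70},
]
deltas = allocation_deltas(nodes, {"A": 0.6, "B": 0.3, "C": 0.1})
# A is under-allocated by about 40 tickets; B and C are over-allocated
# by about 10 and 30 tickets respectively.
```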
Banking Algorithm Cont.
Step 5:
Withdraw and allocate
Reward the needy nodes
Step 6:
Done, clear the bank accounts
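One possible reading of Steps 5 and 6 is sketched below. The slides do not say how banked tickets are divided among needy nodes, so the even split used here is purely an assumption, as are the function name `reward_and_clear` and the representation of needy nodes as a per-class list of node indices.

```python
def reward_and_clear(bank, needy, node_allocs):
    """Step 5: withdraw banked tickets and give them to needy nodes.
    Step 6: clear the bank accounts.

    bank: dict, service class -> banked tickets (updated in place)
    needy: dict, service class -> list of node indices needing more
    node_allocs: list of dicts, one per node, service class -> tickets
    """
    for cls, tickets in bank.items():
        nodes = needy.get(cls, [])
        if tickets <= 0 or not nodes:
            continue
        # Assumed policy: split evenly, spreading any remainder over
        # the first few needy nodes.
        share, extra = divmod(tickets, len(nodes))
        for i, node in enumerate(nodes):
            bonus = share + (1 if i < extra else 0)
            node_allocs[node][cls] = node_allocs[node].get(cls, 0) + bonus

    # Step 6: every account is reset, whether or not it was spent.
    for cls in bank:
        bank[cls] = 0
    return node_allocs


bank = {"A": 0, "B": 20, "C": 5}
allocs = [{"B": 10}, {"B": 10}, {"B": 10}, {"C": 70}]
reward_and_clear(bank, {"B": [0, 1, 2]}, allocs)
# allocs -> [{"B": 17}, {"B": 17}, {"B": 16}, {"C": 70}]
# bank   -> {"A": 0, "B": 0, "C": 0}
```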
Reliability
Bottom-up Manager Replacement
Results
Implementation Details
Simulations via NS, the Network Simulator
Low-bandwidth 10 Mb/s communication network
UDP for lower server overhead
Assumptions
Node-level resource management works ideally
Test 1: Overview
4 nodes – 3 services – 60/30/10 Allocation
The 4th node receives all of the 3rd class’s requests
Steady Workload
Test 1: Data
Test 2: Overview
100 nodes – 3 services – 60/30/10 Allocation
Nodes 1-30 receive all of the 3rd class’s requests
Steady Workload
Test 2: Data
Test 3: Overview
100 nodes – 3 services – 60/30/10 Allocation
Nodes 1-30 receive all of the 3rd class’s requests
Dynamic Workload
Test 3: Data
Test 4: Overview
100 nodes – 3 services – 60/30/10 Allocation
Nodes 1-30 receive all of the 3rd class’s requests
Steady Workload
Reporting 1/5
Nodes every 0.3 seconds
Managers every 1.5 seconds
Test 4: Data
Test 5: Overview
900 nodes – 3 services – 60/30/10 Allocation
Nodes 1-300 receive all of the 3rd class’s requests
Steady Workload
Test 5: Data
Conclusions
Benefits of a hierarchy
Scalable
Reliable
Geographic Applications
Implemented a new management scheme: Banking
Comparable Results
Improved Scalability
Conclusions
Clusters are sensitive to small policy changes
Clusters are built for specific workloads
Their performance is important and small changes have significant impact
No scheme is universally applicable
Future Work
Real system implementation
Real workloads
Real node-level resource management
More steady performance
Questions
Related Work: Proportional-Share
Stride Scheduling
Ticket based and similar to lottery
Scale
Randomly query k nodes to find best allocation
Different Application
Condor-like resource allocation/applications
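For readers unfamiliar with stride scheduling, here is a minimal single-resource sketch of the general technique: each client holds tickets, its stride is inversely proportional to its ticket count, and the client with the smallest pass value runs next. This is the textbook form, not the cluster-wide variant referenced above; the constant `STRIDE1`, the function name `stride_schedule`, and the ticket counts are assumptions chosen for illustration.

```python
import heapq

STRIDE1 = 1 << 20  # large constant so integer strides stay precise


def stride_schedule(tickets, quanta):
    """Return the order in which clients run over `quanta` time slices.

    tickets: dict mapping client name -> positive ticket count
    """
    heap = []
    for order, (name, count) in enumerate(tickets.items()):
        stride = STRIDE1 // count
        # Entry layout: (pass value, tie-breaker, client name, stride).
        heapq.heappush(heap, (stride, order, name, stride))

    schedule = []
    for _ in range(quanta):
        # Run the client with the smallest pass, then advance its pass.
        pass_value, order, name, stride = heapq.heappop(heap)
        schedule.append(name)
        heapq.heappush(heap, (pass_value + stride, order, name, stride))
    return schedule


order = stride_schedule({"A": 6, "B": 3, "C": 1}, 10)
# Over these 10 quanta, A runs 6 times, B 3 times, and C once.
```

Unlike lottery scheduling, which draws a winning ticket at random, the selection here is deterministic, so each client's share tracks its ticket proportion closely even over short intervals.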
Related Work: Cluster Reserves
Resource Container Schedulers
Constrained Optimization Algorithm
Scale
Centralized single manager
Hierarchical Cluster Reserves – Version 1
Modify Cluster Reserves optimization algorithm
Use it when a manager manages nodes
and when a level n+1 manager manages level n managers.
Hierarchical Cluster Reserves – Version 2
Cluster Reserves optimization algorithm
Use it when a manager manages nodes
Don’t use it for upper-level managers
Modify the manager-to-manager reporting
Lie to the algorithm