How to use the COW
There are 40 Sun 20/51 workstations, which form our cheap
testbed for parallel computing.
All of the workstations are connected via a 1 Mbyte/sec Ethernet.
4 of the 40 workstations are used as "compilation" hosts, where
Solaris software can be compiled and tested. These machines are
named after cuts of meat (flank, sirloin, strip, filet).
The remaining 36 workstations are used to do "work".
Individual workstations are allocated for groups to use. Individuals
within a group use machines allocated for their group.
One part of the COW is the individual nodes.
The other part of a parallel machine is the high-speed
communications infrastructure which allows data to move from one node
to another.
The original plan was to interconnect the COW nodes
via ATM. That never happened.
Needing some interconnect for their work, the Wind Tunnel
project bought Myrinet interface
cards and switches.
Historically, the following problems
have limited the overall connectivity of the Myrinet network at our site:
- software incompatibility between the Illinois "Fast Message"
system and the Myrinet software
- version 1.0 Myrinet doesn't route IP packets across more than one switch
(we have upgraded to 2.X, which does allow this)
Fortunately, those problems may be eliminated soon, as the
Wind Tunnel is switching to the Berkeley Messaging system,
which can operate alongside normal Myrinet traffic.
The individual "compute" nodes are named 'cowe00' through 'cowe35'. The
'cow' part indicates that the machine is part of the cow. The 'e'
indicates that the name belongs to the Ethernet interface of that
node. The two digits with leading 0s identify individual cows. To
communicate via a particular interconnect, use the form of the
hostname with the 'interface' character changed to the desired
interface: 'e' for Ethernet, 'm' for IP over Myrinet, and 'a' for ATM
(if we ever get it).
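The naming rule above is mechanical enough to express in a few lines. A sketch in Python (the cow_host helper is my name for illustration, not an installed COW utility):

```python
def cow_host(node, iface="e"):
    """Build a COW hostname: 'cow' + interface letter + two-digit node number."""
    assert iface in ("e", "m", "a"), "e=Ethernet, m=IP over Myrinet, a=ATM"
    return "cow%s%02d" % (iface, node)

cow_host(4)        # 'cowe04'
cow_host(4, "m")   # 'cowm04' -- same node, reached via IP over Myrinet
```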
To allow the various groups using the cow to cow-exist without slaughtering
each other, we have partitioned the Myrinet hardware into several independent
partitions:
- Nodes 0-15 are available for anyone to use. The nodes are
managed by DJM (Distributed Job Manager). Users request nodes
from DJM and are granted a reservation when the request doesn't
conflict with other requests. The Myrinet on these nodes is
split across three 8-port switches:
- nodes 0, 1, 2, 3, 4
- nodes 5, 6, 7, 8, 9
- nodes 10, 11, 12, 13, 14, 15
- Nodes 16-23 are bolo's.
- Nodes 24-27 are used for software development, such as
device drivers and new Myrinet software. These
nodes are typically used by Yannis Babek, or Steve.
- Nodes 28-35 are used by the WindTunnel project.
After a long discussion, it was discovered that there are almost
no common software interests among the many groups using the COW.
However, all these groups still need to be able to use the
cow alongside each other.
To facilitate this, a set of fundamental utilities was created.
These utilities provide for selecting groups of nodes,
configuring those nodes, and for reserving nodes.
The Partition Manager
The partition manager allocates
nodes into named units which may be used to refer to those
nodes as a whole.
It is similar to a filesystem in concept, which allocates and
names units of blocks into a file, and files into directories.
For example: create a partition from nodes 1..4,
and call it shylock.
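The filesystem analogy can be made concrete with a toy model in Python (illustrative only; the names and data structure are mine, not those of the actual partition manager):

```python
# A partition namespace: dotted names map to the nodes they own,
# the way directories and files name groups of blocks.
partitions = {"root": list(range(36))}   # root holds all 36 compute nodes

def make_partition(parent, name, node_idxs):
    """Allocate nodes (indices into the parent) into a new named unit."""
    parent_nodes = partitions[parent]
    partitions[parent + "." + name] = [parent_nodes[i] for i in node_idxs]

# "create a partition from nodes 1..4, and call it shylock"
make_partition("root", "shylock", [1, 2, 3, 4])
```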
The Reservation Manager is a facility which allows
users of the COW to schedule use of the nodes among themselves.
Arbitration between requests may be performed either
by algorithmic means, or punted to human intervention.
The reservation manager may reserve more than just nodes;
it may also optimize requests onto the underlying communications
infrastructure, and special hardware installed on a per-node
basis. For example:
- I need 10 nodes tomorrow from 0600 to 1800.
- I need 5 myrinet nodes for the rest of the day.
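The algorithmic side of arbitration is essentially an overlap test. A sketch, assuming a request is a (nodes, start-hour, end-hour) triple (my representation, not the reservation manager's):

```python
def overlaps(a, b):
    """Two requests conflict only if their time windows AND node sets overlap."""
    (a_nodes, a_start, a_end), (b_nodes, b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end and bool(set(a_nodes) & set(b_nodes))

day = (range(10), 6, 18)          # "10 nodes tomorrow from 0600 to 1800"
evening = ([8, 9, 10], 18, 24)    # same nodes, but back-to-back in time
overlaps(day, evening)            # no conflict: the windows merely touch
```

Requests that neither this test nor any other rule can order get punted to human intervention, as the text says.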
The Batch Manager lets users execute jobs on the COW.
Batch jobs will be run in a first-come first-served manner,
give or take available nodes on the COW.
For example: run ray_tracer on 5 nodes.
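A toy model of strict first-come first-served scheduling (illustrative; not how the batch manager is actually implemented):

```python
from collections import deque

def schedule(jobs, free_nodes):
    """Start jobs in arrival order; a job too big for the free
    nodes blocks everything behind it (strict FCFS)."""
    queue, started = deque(jobs), []
    while queue and queue[0][1] <= free_nodes:
        name, nnodes = queue.popleft()
        free_nodes -= nnodes
        started.append(name)
    return started

# ray_tracer (5 nodes) starts; big_job (12) must wait for nodes,
# and small waits behind it in line.
schedule([("ray_tracer", 5), ("big_job", 12), ("small", 1)], 16)
```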
[ This section left not quite blank ]
These software systems implement the facilities described in
the previous section.
The cube mangler is an implementation of
the partition manager functionality.
It consists of two pieces...
- A daemon, named cube_mgr, which
is started from the COW init.d script.
This daemon reads and maintains the current configuration
of partitions from a text database in /usr/adm/cow/config.
- A set of utilities which communicate with the daemon
to examine and change the state of the partition database.
These utilities live in /p/cow/bin, and all
have the tag cube in their names.
In the following commands, if no [partition] is specified, the root
partition is used. The '.' character is used as the path separator in
partition names.
- lscube [partition]
- lists any sub-partitions of the partition.
- showcube [partition]
- display information about the nodes which comprise
a particular partition.
- getcube [-l list,of,nodes | -n number_of_nodes] partition-name
- Create a new partition with the given name.
The options which specify the construction of the new partition are
- -l list,of,nodes
- a comma-separated list of node numbers from
the enclosing partition. For example, if
the enclosing partition has 8 nodes,
the numbers 0-7 are valid node numbers.
- -n number_of_nodes
- is followed by the number of nodes the new partition
should contain.
- -m mode
- A Unix-like protection system is currently used
to control access to partitions and nodes.
This specifies the mode with which the partition should be created.
There is currently no method of changing the mode of
a partition after it is created.
- rmcube partition.name
- removes the named partition, freeing the nodes to be
used by someone else.
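Assuming the partition modes follow the usual Unix octal owner/group/other layout (an assumption on my part, though the three-digit modes like 754 shown by cstat suggest it), the permission check looks like:

```python
def can_access(mode, want, is_owner=False, in_group=False):
    """Check an rwx-style bit (r=4, w=2, x=1) against an octal mode like 0o754."""
    if is_owner:
        bits = (mode >> 6) & 0o7     # owner digit
    elif in_group:
        bits = (mode >> 3) & 0o7     # group digit
    else:
        bits = mode & 0o7            # everyone else
    return bool(bits & want)

can_access(0o754, 2, is_owner=True)   # owner (7 = rwx) may write
can_access(0o754, 2, in_group=True)   # group (5 = r-x) may not
```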
The cow hostler provides for the configuration
of individual nodes and the execution of programs on multiple nodes.
The hostler also has multiple components ...
The various hostler components may be found in the cow pen:
- A server, cud.pl started on demand from inetd.
- A configurator, pconfig which allows the
setup of a pre-arranged configuration on a set of nodes.
- A program starter, prun which
executes programs on multiple nodes in parallel.
The pre-arranged configurations are shell scripts stored
in /p/cow/pen/rbin with support files in
- pconfig -c <configuration> -p <partition>
- Add and configure <configuration> onto the nodes of <partition>.
A configuration will only be added to a node once.
Specifying it a second time will do nothing.
- pconfig -C <configuration> -p <partition>
- Delete all configurations added to nodes in <partition>
until <configuration> is deleted.
The following configurations are available.
- myri-tcp
- Enable IP packets to flow across the Myrinet.
- These configurations are random one-shots which bolo uses
for various things. They really aren't designed to be used
by anyone else.
- Most of these configurations no longer have any use. They
should be avoided.
Distributed Job Manager
The Distributed Job Manager, DJM, is a batch
scheduling system designed for users who don't care to run
jobs interactively, or who want to schedule a large number of
jobs which can be run unattended.
DJM originated as software used on the CM-5.
The person who knew anything about DJM has left the UW;
currently we are clueless as to what it actually does.
DJM is currently used for the following COW functions
- Batch Scheduler
- Program Executor (for batch jobs)
DJM is also structured in multiple pieces
- A daemon, cow_starter, started from the COW init.d script.
I believe this part may execute batch jobs.
- Another daemon, cow_master, started from
the cow_starter daemon.
I think this part is the combination reservation and batch manager.
- A set of utilities to communicate with the daemons.
These are located in /p/cow/bin.
The cryptic documentation
(as if this isn't cryptic already :-) for creserve, cfree,
and cstat covers the following operations:
- creserve
- Request a set of nodes to use.
- cstat
- Display the status of the nodes and reservations for them.
- cfree
- Release a reservation.
It is of no small coincidence that the name of the "hypothetical
user" in the following is markos.
He helped me write the original version of this document.
Finding some nodes to use
In general, the partition commands shouldn't be used to create or
release "top level" partitions (aka those in the "root" directory).
DJM controls this level of the namespace, and all requests for top
level partitions must be granted from DJM. The creserve and
cfree commands are used to perform these actions.
For example, say a hypothetical user, 'markos' wanted to use some
of the 16 nodes. He wants them RIGHT NOW!!!! For the rest of the evening,
and doesn't want DJM mucking around on them ... (ps the creserve and
cfree commands are in /p/cow/bin)
cow% creserve -nodes 4 -mode nosched -until 11pm
Reservation #8, has 4 nodes (4-7), from now until Jun 21 23:00.
This reserves a 4 node partition for markos from NOW until 11pm today.
DJM will not schedule batch jobs in this partition, so markos is
free to use it himself. The partition will be in the root of the partition
namespace, and will default to the user's name, 'markos' in this case.
An explicit partition name may be specified; see the creserve documentation.
The "reservation number" is used to identify the partition when using
DJM commands. If markos finishes his work early (haha! :-) he can
release the nodes he is using with cfree:
cow% cfree 8
Reservation #8 deleted.
If your reservation conflicts with another, you can use the
'cstat' command to view registered reservations.
cow% cstat res
Running scheduled partitions:
PART BNODE NODES LOAD
*** None ***
# PART NODES NP USER GROUP PRM MODE UNTIL
4 -nopart- 16-23 8 bolo bolo 754 nosched Indefinite
6 -nopart- 24-27 4 swartz swartz 755 nosched Indefinite
5 wwt-pen 28-35 8 swartz swartz 777 nosched Indefinite
Currently Free Nodes: 0 - 15
If markos wanted to use a subset of his nodes, he could now use getcube
to partition his nodes into whatever subsets he wanted to use.
The configuration software configures entire partitions, so if markos
wanted to have some of his nodes configured differently from the rest,
he would need to place them in their own partition.
Markos would like to test a shore client and server; he only needs to
use 2 of his 4 nodes to do this. He could configure all of them to
use IP over Myrinet, but I'll use this as a place to demonstrate getcube.
cow% getcube -n2 .markos.tcp
getcube 'markos.tcp' ok
Creates a 2 node partition which we will configure for IP over myrinet.
You can use 'showcube' to find what nodes you actually are using.
cow% showcube markos.tcp
0 cowe04 cowe04 cowa04 cannex1/5005 1
1 cowe05 cowe05 cowa05 cannex1/5006 1
showcube 'markos.tcp' ok
By default, the nodes of the cow are set up as generic Unix systems.
The default "mode" of the Myrinet is to
communicate using the Myrinet API. Other communication options,
such as IP over Myrinet, and Illinois "Fast Message" networking, are
available. Other configuration options exist too; a list of them can
be found by 'ls /p/cow/pen/rbin'. If you don't know what a particular
configuration does, you probably don't want to use it. Some of these
configuration options are mutually exclusive, some aren't; there is no
way to tell a priori.
The configuration or package name for IP over Myrinet is
called "myri-tcp". To configure a partition, use the "pconfig" command
(which is found in /p/cow/pen/bin):
cow% pconfig -c myri-tcp -p markos.tcp
This configures IP over Myrinet on the 2 nodes which markos specified.
To remove a configuration, the '-C' option is used instead of '-c'.
There is a command called 'prun' which allows you to run a command
on all or some of the nodes in a partition. I don't find it
particularily useful for what I do, so someone else can write
documentation about it!
Shore Comm itself
Start an xterm and rlogin to the node(s) you will be using. Markos
is using shore comm, so he will need to set up some environment
variables to make shore comm communicate over the Myrinet instead
of the Ethernet.
cowe04% setenv OCOMM_TCP 'cowm04/any'
cowe05% setenv OCOMM_TCP 'cowm05/any'
cowe0X% cd shore/src/object_comm/ns
cowe04% ./ns .ns
cowe05% ./query -f .ns
ns> enter HI-MOM!
How to undo all of this?
cow% pconfig -C myri-tcp -p markos.tcp
cow% rmcube markos.tcp
cow% cfree 8
and we're done.
Bolo's Home Page
Mon Oct 16 14:10:45 CDT 1995
bolo (Josef Burger)