Chapter 1: Introduction
High availability in computing means keeping a set of business applications available to users for as high a percentage of the time as possible. This simple statement covers a lot of ground. It can be as simple as planning for the loss of electrical power to a piece of equipment, or as drastic as planning for a fire or earthquake that causes the loss of the entire computing site.
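To make "percentage of the time" concrete, the following minimal sketch converts an availability figure into the downtime it allows per year. The function name and the sample percentages are illustrative assumptions, not figures taken from this book.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent: float) -> float:
    """Return the hours per year a system may be down at this availability."""
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability must be between 0 and 100")
    return HOURS_PER_YEAR * (1.0 - availability_percent / 100.0)


# Sample availability targets (assumed values, for illustration only).
for pct in (99.0, 99.9, 99.99):
    hours = allowed_downtime_hours(pct)
    print(f"{pct}% available -> {hours:.2f} hours of downtime per year")
```

Each additional "nine" cuts the permitted downtime by a factor of ten, which is why small changes in the percentage correspond to large changes in design effort.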
Murphy's Law says anything that can go wrong will. The space shuttle is built on this premise. All critical systems have one or more redundant backup systems. In case of a failure, the backup system automatically takes over the function of the failed primary system. Fault tolerance is the concept that a system can survive any failure in a hardware or software component and continue to function. This obviously will help to maintain the high availability we are looking for in our business applications.
This same idea is used in designing high availability into computer systems supporting business-critical applications. For example, a mail order business comes to a screeching halt if the order entry application becomes unavailable to the users for any reason.
1.1 Clustering - a means to an end
Clustering is the use of two or more computers or nodes for a common set of tasks. If one computer fails, the others will take up the slack. This design supports the idea of fault tolerance. A second computer can be used as the redundant backup for the first computer.
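The idea of one node taking over for another can be sketched in a few lines. This is a toy model under stated assumptions: the Node and Cluster classes and the node names are hypothetical, and real clustering software detects failures and moves work in far more elaborate ways.

```python
class Node:
    """A hypothetical cluster member that can serve requests while healthy."""

    def __init__(self, name):
        self.name = name
        self.healthy = True

    def handle(self, request):
        if not self.healthy:
            raise RuntimeError(f"{self.name} is down")
        return f"{self.name} handled {request}"


class Cluster:
    """Sends each request to the first healthy node, in priority order."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def handle(self, request):
        for node in self.nodes:
            if node.healthy:
                return node.handle(request)
        raise RuntimeError("no healthy node left in the cluster")


primary, backup = Node("node-a"), Node("node-b")
cluster = Cluster([primary, backup])

print(cluster.handle("order-1"))  # served by the primary
primary.healthy = False           # simulate a failure of the primary
print(cluster.handle("order-2"))  # the backup takes up the slack
```

The users of the cluster only ever call cluster.handle, so the failure of a single node is hidden from them as long as one node survives.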
Clustering can also be used to increase the computing power of the entire computer installation, which makes the system scalable: adding more computers increases the power, and hence the design can support more users. In our case, we use clustering for application availability, with increased processing power as a fortunate side effect.
1.2 What clustering provides
The current range of clustering solutions allows for a level of fault tolerance. The degree to which failures can be tolerated depends largely on two things:
1. The location of the failure
If the failure is within the cluster (for example, failed hardware in a node or a trapped operating system), then the high availability software will probably be able to recover and continue servicing its users. If the failure is outside the cluster, it is less likely that the high availability software will be able to maintain service. Failures such as power distribution failures, complete network outages, and data corruption due to user error are examples of faults that cannot be contained by products such as Microsoft Cluster Server.
2. The ability of applications to cope with the failure
Microsoft Cluster Server, for example, allows applications and resources that were running on a failed system to be restarted on the surviving server. For "generic" applications, this is simply a matter of restarting the program. For more complicated applications (for example, SAP R/3 servers), there must be a certain sequence to the restart. Certain resources, such as shared disks and TCP/IP addresses, must be transferred and started on the surviving server before the application can be restarted. Beyond that, other applications (for example database servers) must have clustering awareness built into them so that transactions can be rolled back and logs can be parsed to ensure that data integrity is maintained.
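The restart sequence described above amounts to starting resources in dependency order. The sketch below illustrates that idea with Python's standard graphlib module; the resource names and the dependency table are assumptions made for illustration, and this is not how Microsoft Cluster Server itself is implemented or configured.

```python
from graphlib import TopologicalSorter

# Hypothetical resource group: each resource maps to the resources
# that must already be running before it can be started.
DEPENDS_ON = {
    "shared_disk": set(),
    "ip_address": set(),
    "database": {"shared_disk", "ip_address"},
    "application": {"database"},
}


def restart_order(depends_on):
    """Return the resources in an order that respects every dependency."""
    return list(TopologicalSorter(depends_on).static_order())


def fail_over(depends_on, start):
    """Start every resource on the surviving server, dependencies first."""
    started = []
    for resource in restart_order(depends_on):
        start(resource)
        started.append(resource)
    return started


order = fail_over(DEPENDS_ON, lambda resource: print("starting", resource))
```

A "generic" application corresponds to a table with a single entry and no dependencies, while a more complicated application adds entries such as the disk and address it relies on.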
Microsoft Cluster Server provides high availability only. The Microsoft solution does not as yet address scalability, load balancing of processes, or near-100% uptime. These can currently be achieved only through more mature clustering, such as that which is implemented in RS/6000 SPs.
Microsoft also offers its Windows Load Balancing Service, part of Windows NT 4.0 Enterprise Edition. It installs as a standard Windows NT networking driver and runs on an existing LAN. Under normal operations, Windows Load Balancing Service automatically balances the networking traffic between the clustered computers.
1.3 Business data - to replicate or not?
Adding computers and networks, or even cloning entire computer centers for fault tolerance purposes, does not solve all the problems. If the business data is relatively unchanging, the databases can be replicated along with the rest of the system, and any updates can be applied to all the replicated copies on a scheduled basis. An example of this is a Web server offering product information to customers. If one Web server is down, the other servers can serve the customer. In this example, the cluster is used for both performance and availability.
On-line Transaction Processing (OLTP) applications have different data requirements. As the name implies, the data is always changing based on business transactions. A customer places an order for a product. Inventory is allocated to the order and is then shipped from the warehouse to the customer. If this data were replicated, there would be the possibility of promising the same item to two different customers. Somebody would be unhappy.
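The double-promise problem can be shown with a toy model. In this sketch, the InventoryCopy class, the site names, and the customer names are all hypothetical; the point is only that two copies of the same stock record, updated independently, can each agree to allocate the last item.

```python
class InventoryCopy:
    """One copy of the stock record for a single product (hypothetical)."""

    def __init__(self, site, on_hand):
        self.site = site
        self.on_hand = on_hand

    def allocate(self):
        """Promise one item to a customer if this copy shows stock on hand."""
        if self.on_hand > 0:
            self.on_hand -= 1
            return True
        return False


# Replicated data: two copies of the same record, one physical item.
site_a = InventoryCopy("site-a", on_hand=1)
site_b = InventoryCopy("site-b", on_hand=1)
requests = (("customer-1", site_a), ("customer-2", site_b))
promised_replicated = [who for who, copy in requests if copy.allocate()]

# Single copy: both orders are checked against the same record.
shared = InventoryCopy("shared", on_hand=1)
promised_single = [who for who in ("customer-1", "customer-2") if shared.allocate()]

print("replicated:", promised_replicated)
print("single copy:", promised_single)
```

With replicated copies both customers are promised the item, while with a single copy the second order is refused until more stock arrives.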
1.4 Disk sharing
Disk sharing is one of the cluster architectures in use in the industry today. It may be used for scalability as well as for high availability purposes. In a typical two-node high availability cluster, both nodes can access the same storage devices, but only one server at a time controls the storage devices shared by both servers. If one server fails, the remaining server automatically assumes control of the resources that the failed server was using, while still controlling its own resources at the same time. The failed server can then be repaired offline without loss of time or work efficiency, because that server's data and applications remain available through the surviving server.
The key point is that only one server has control of the storage devices at any point in time. There is only one copy of the data, so data accuracy is maintained.
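The single-owner rule can be modeled as follows. This is a minimal sketch, assuming a hypothetical SharedDisk class and node names; it illustrates the ownership principle only and does not reflect how any particular product arbitrates disk access.

```python
class SharedDisk:
    """A storage device visible to both nodes but controlled by one."""

    def __init__(self, owner):
        self.owner = owner
        self.blocks = {}  # the single copy of the data

    def write(self, node, key, value):
        """Accept a write only from the node that currently owns the disk."""
        if node != self.owner:
            raise PermissionError(f"{node} does not control the disk")
        self.blocks[key] = value

    def take_over(self, new_owner):
        """Transfer control, for example after the current owner has failed."""
        self.owner = new_owner


disk = SharedDisk(owner="node-a")
disk.write("node-a", "order-1", "allocated")

disk.take_over("node-b")  # node-a has failed; node-b assumes control
disk.write("node-b", "order-1", "shipped")
print(disk.blocks)
```

Because every write goes through the one current owner, the surviving node continues from the same data the failed node left behind, rather than from a separate copy.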
1.5 MSCS-based solutions
As part of the early adopter's agreement with Microsoft, IBM has announced validated solutions of hardware and software to enable customers to run Microsoft Cluster Server (MSCS) in a shared disk environment.
For managing Microsoft Cluster Server configurations, IBM has developed IBM Cluster Systems Management (ICSM). It provides portable, generic cluster systems management services that integrate into existing systems management tools such as IBM Netfinity Manager, Intel LANDesk, and Microsoft SMS.
ICSM offers enhancements to the manageability of MSCS in three distinct categories:
1. Ease-of-use
2. Productivity
3. Event/Problem notification...