Clustering and the iSeries 400


Clustering is about physically and logically coupling computer systems together to accomplish one or more of three basic things: workload distribution, high availability, and scalability. A cluster can be as simple as a topology designed to accomplish data resilience with little consideration given to the operating system or application, or it can be as sophisticated as an implementation that accomplishes data resilience, application resilience, and workload distribution all within a concept called a single-system image. For years, the iSeries 400 has supported the simple topology of high-availability clustering for data resilience, but things have changed.

With V4R4, OS/400 began to feature constructs that represent the most advanced and current thinking in the science of clustering. OS/400 may be a bit late to the game, but the benefit is that the iSeries 400, standing on the shoulders of the giants that came before it, brings to the market IBM’s most avant-garde R&D thinking. One might also note that, with the iSeries being the world’s most available standalone server, there has not been the hue and cry among the massive iSeries install base for such technology that there has been for UNIX or PC servers, which are not known for their reliability or stability.

The recent surge into the Internet Age, along with the massive globalization of industry and focus on consolidation at both the business level and the IT organizational level, has shifted additional focus of the iSeries laboratory to the science of clustering. The primary motivation is to provide clustering for continuous availability.

Some Definitions and Concepts

Before we go further, let us set up some parameters. There is a fair amount of confusion about what exactly constitutes a high-availability solution. These days, with many storage area network (SAN) vendors promoting their versions of availability, aimed primarily at disaster recovery, it is our obligation to be more precise with the words we use and to know what they mean.

The basic element of availability design is resilience. There are several types of resilience: hardware resilience, operating system resilience, data resilience, and application resilience. Resilience is a measure of robustness. For example, a solution that provides 100 percent data resilience essentially guarantees that the user will experience no data loss or corruption in the event of a failure. It says nothing about the state of the operating system or application.


When building a modern cluster, the ultimate solution is to attain 100 percent resilience. If, at the end of the day, all elements of resilience are addressed, the one thing the user experiences directly is application availability. A complete solution ties together the hardware, operating system, application, and data in such a way that the user experiences a seamless work environment where the complexities of the underlying resilience elements are unseen, save that the user experience continues with minimal disruption if an underlying unavailability event occurs.

For example, as you fill your shopping cart at the Web site of your favorite e-tailer at, say, 1:30 a.m. on Saturday, you are unaware of the operations staff doing a backup operation. The cluster design has masked the disruption from you. Contrast that with going to your online brokerage account on Sunday only to learn that the system is temporarily unavailable because of maintenance. That is unacceptable in the e-world of the 21st century.

The Old Ways, Part Ways, and No Ways

To better understand what modern clustering technology has to offer, let’s review some of the preclustering and nonclustering methods used to address availability. One common element in nonclustering solutions for availability is that these methods address only data resilience and only to varying degrees of robustness.

Replication Services

The iSeries 400 method for data resilience is based on replication services. In real time at the logical or operating system level, data is transferred from one server to another so a duplicate application and database environment are available in the event of a planned or unplanned primary system outage. This method is in wide use around the world and has been very effective, but there are several limitations. One is that the implementation is highly customized to each individual customer environment. Another is that there is no concept of a single-system image for operations; you manage two or more systems individually to maintain the clustering environment. Yet another limitation is that the individual applications are not involved.

The net effect is that switching between nodes in the cluster is a manual process in which it is up to the operations staff and users to get things back on track. Even with IP takeover, when users are attached to another node in the cluster, the users are not logically attached to the operating system or application environment. That is, being IP connected to a second node within a cluster does not mean much other than “Hi! We’re here!” There’s much to be done before those users can do any work. They need access to their application, which, in turn, needs to be activated within the OS. It would be particularly useful if the user knew precisely where he was relative to the application state prior to the switch between nodes. The plus side of this structure is that a resilient copy of the data is useful for real-time processing, such as saving to tape, ad hoc queries, or read-only batch processing.

Storage Area Networks

Another method advertised to address data resilience is a SAN or disk-level approach. This method is very basic; it occurs only at the physical data level (i.e., disk array level) and does not involve the operating system in any way. Thus, only a second physical copy of the data is provided. For data to be used for operations backup, it must be attached to a host environment. In the iSeries world, this means an IPL. The data is not concurrently usable for real-time processing, meaning that the redundancy is purely for standby operations, making it a fairly expensive proposition.

In this environment, it is essential that the two disk arrays used for resilience be connected synchronously via some type of mirroring method. The primary reason is that, without a synchronous connection, the arrival order of data entries in the backup database cannot be ensured. Given that the systems must be connected synchronously, the distance between them is limited to whatever does not impact primary server performance. It is worth pointing out that disk-level, system-to-system mirroring methods for disaster recovery are very different from replication-based approaches. Disk-level mirroring techniques are completely independent of the operating system. They are purely mechanical processes operating at the bits-and-bytes level and cannot offer anything beyond the most primitive type of data resilience. Replication services, by contrast, are part of the operating system environment and form the basis of the data resilience architecture for iSeries clustering technology.

At best, the SAN or disk-level method can address disaster recovery with the aforementioned caveats. For disaster recovery, one might call it a “part way” solution because it offers no real-time services on the backup configuration for the data being mirrored. For planned or unplanned outages, we call it a “no way” solution because it does not involve the operating system or application environment, making the job of bringing the total system and application environment to a known and usable state unpredictable and, of course, totally manual.

Common Building Blocks

Before we get into the operating system constructs that provide state-of-the-art clustering services, you need to be at a more basic level. What prerequisites need to be in place before implementing an advanced clustering environment?

The most basic service required for all high-availability solutions is journaling. Although it goes by other names, such as event logging, the result is the same: copies of changed data systematically and programmatically moved from volatile storage (main memory) to nonvolatile storage (disk). Actual execution of a particular program is tied dependently to the assurance that data has been moved from main memory to disk before the program is allowed to progress. The reason for this is that, if a system outage or failure occurred, all data resident in the main memory would be totally lost.

By journaling the data to disk, you create a level of transactional integrity. In the early days of availability design, this was all that was available. If a failure occurred, the operations and applications teams would work to restore the system, users, and applications to a known state by applying the saved journals to the transaction data.
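The journaling idea described above can be sketched in a few lines. This is only an illustration of the general technique (write the change to disk, confirm it is there, then proceed), not OS/400 journaling itself; the `JournaledStore` class, the file name, and the JSON entry format are all assumptions made for the example.

```python
import json
import os
import tempfile


class JournaledStore:
    """Toy key-value store that journals each change to disk before applying it."""

    def __init__(self, journal_path):
        self.journal_path = journal_path
        self.data = {}  # volatile state (main memory), lost in an outage

    def put(self, key, value):
        # Move a copy of the changed data to nonvolatile storage first ...
        with open(self.journal_path, "a", encoding="utf-8") as journal:
            journal.write(json.dumps({"key": key, "value": value}) + "\n")
            journal.flush()
            os.fsync(journal.fileno())  # do not progress until the entry is on disk
        # ... and only then let the program continue.
        self.data[key] = value

    def recover(self):
        """Rebuild the in-memory state by replaying the journal in order."""
        self.data = {}
        if os.path.exists(self.journal_path):
            with open(self.journal_path, encoding="utf-8") as journal:
                for line in journal:
                    entry = json.loads(line)
                    self.data[entry["key"]] = entry["value"]
        return self.data


def demo_journal_recovery():
    """Write two changes, simulate an outage, and recover from the journal."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "orders.journal")
        store = JournaledStore(path)
        store.put("order-1", "widget")
        store.put("order-2", "gadget")
        # A fresh object stands in for a restarted system with empty memory.
        restarted = JournaledStore(path)
        return restarted.recover()
```

Calling `demo_journal_recovery()` returns both orders even though the "restarted" store never saw them in memory, which is the property the journal exists to provide.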

If a vendor tells you that its solution requires no journaling, it’s time to show that vendor the door. Without journaling enabled, you may as well plan to restore from tape in the event of an outage. When deploying a clustering solution based on replication services from an IBM high-availability business partner, you should select those solutions that use remote journaling. Remote journaling is an operating system service that does the “plumbing and housekeeping” for the replication service provider. That is, rather than depending on third-party replication service provider applications to extract data from OS/400 journal receivers on one system and then send it via some transport to a second system, IBM has provided this service within OS/400.

A second prerequisite for complete high availability is commitment control. Commitment control goes beyond simple journaling in that it defines and connects a series of program elements into a single transaction. In the event of an outage, the recovery ensures the integrity of data by placing all related elements at a completed transaction boundary. This is useful because it shortens the time to recovery and improves the efficiency of I/O operations that can be impacted by continuous journaling.
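To make the idea of a transaction boundary concrete, here is a minimal sketch of recovery under commitment control. The journal layout (`txn`, `op`, `key`, `value`) and the function name are hypothetical and are not the OS/400 journal entry format; the point is only that changes belonging to a transaction with no commit entry are left out on recovery.

```python
def recover_to_commit_boundary(journal):
    """Apply only changes from transactions that reached a commit entry."""
    committed = {entry["txn"] for entry in journal if entry["op"] == "commit"}
    state = {}
    for entry in journal:
        if entry["op"] == "put" and entry["txn"] in committed:
            state[entry["key"]] = entry["value"]
    return state


# Transaction 1 updates stock and records the order, then commits.
# Transaction 2 has changed stock but the outage hits before its commit.
journal = [
    {"txn": 1, "op": "put", "key": "stock:widget", "value": 9},
    {"txn": 1, "op": "put", "key": "order:1001", "value": "widget"},
    {"txn": 1, "op": "commit"},
    {"txn": 2, "op": "put", "key": "stock:widget", "value": 8},
]

recovered = recover_to_commit_boundary(journal)
```

Here `recovered` holds the stock level and order from transaction 1 only, so the data never shows a decremented stock count without its matching order.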

Turning on journaling sometimes causes application slowdowns, especially when commitment control is not used. Without commitment control, every database I/O operation is considered a transaction, forcing a physical main-store-to-disk I/O operation. Batch Journal Caching PRPQ 5799-BJC was designed for applications that have journal overhead problems and do not use commitment control.

An additional prerequisite, the application design, is fundamental to the success of any high-availability solution. Because the application itself must be resilient to outages, IBM introduced the concept of ClusterProven applications. ClusterProven applications are designed to particular standards and criteria that enable the user to experience the highest levels of modern clustering technology and, therefore, application availability. Too often, customers have told us that they want a high-availability solution that doesn’t involve their applications. Ironically, it is typically the owners of the poorly written applications who ask for miracles that don’t impact the application.

The Modern iSeries 400 Clustering Framework

Now that we have explained clustering and some of its underlying requirements, we want to touch on the core elements within the iSeries design that were introduced in V4R4 and explain what they do and how they work. The OS/400 capabilities that provide the updated cluster framework are called cluster resource services. OS/400 cluster resource services provide configuration, activation, and management functions for the cluster and cluster nodes.

Perhaps a hypothetical example can help illustrate these services. In the example shown in Figure 1, there are three critical user applications: one for servicing Web-based catalog orders, one for managing inventory associated with the catalog, and one for electronically interfacing with downstream suppliers. In this environment, four systems are used. SYSTEM1 hosts the Web serving application, SYSTEM2 hosts the inventory and supply chain management (SCM) application, SYSTEM3 hosts the catalog data, and SYSTEM4 is a development system.

Cluster Definitions

An iSeries cluster is a collection of complete iSeries systems that cooperate and interoperate to provide a single, unified computing capability. Each system in the cluster is a cluster node. An iSeries cluster can have between one and 128 cluster nodes; the HYPOCAT cluster in Figure 1 has four (SYSTEM1, SYSTEM2, SYSTEM3, and SYSTEM4).

Cluster definition, configuration, and state information are maintained in a persistent internal object that exists on each node in the cluster. Upon request, cluster control starts clustering on a node and coordinates the joining of that node to the cluster so all nodes are equally aware of the joining action and have the same content in their cluster information object.

Application and Data Resilience

As mentioned earlier, continuous availability implies more than just robust system availability. In a complete solution, critical data and applications are always available.

Resources available or known across multiple nodes within a cluster are cluster resources. A cluster resource can conceivably be any physical or logical entity, such as a database, a file, an application, a system, or a device. When a cluster resource persists across an outage, it is a resilient resource. It is resilient to outages and accessible within the cluster, even if an outage occurs to the node currently serving as the point of access for that resource.

The set of cluster nodes grouped together to provide availability for one or more cluster resources is the recovery domain for that group of cluster resources. A recovery domain can be a subset of the nodes in a cluster, and each cluster node can actually participate in multiple recovery domains.

In our example, the resilient resources are the three applications (Web catalog, Inventory, and SCM) and the three sets of data associated with each application. Resilience is managed through a cluster resource group (CRG) object, a new kind of OS/400 object that defines and controls the behavior of a group of cluster resources across a recovery domain. OS/400 cluster resource services provide object management functions—such as creation, deletion, and modification—for CRGs.


Conceptually, the CRG is a distributed object; it exists on all nodes in the defined recovery domain. A change made on one cluster node is automatically reflected across the recovery domain, that is, across every node in the CRG. Each node in the recovery domain has a defined role of primary, backup, or replicate. The nodes in the recovery domain and their respective roles are defined in the CRG. When any cluster event (such as a node being added to the cluster, a node going offline, or a change being made to the recovery domain) occurs that affects that CRG, a user-specified exit program is called on every active node in the recovery domain.
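The relationship between a CRG, its recovery domain, and its exit program can be modeled roughly as follows. This is a hypothetical in-memory sketch, not the OS/400 object or its API: the class, field names, and event strings are invented for illustration, and only the ACRG1 node assignments come from Figure 2.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple


@dataclass
class ClusterResourceGroup:
    """Hypothetical model of a CRG; not the real OS/400 object layout."""
    name: str
    recovery_domain: Dict[str, str]  # node name -> "primary", "backup", or "replicate"
    exit_program: Callable[[str, str, str, str], None]
    active_nodes: Set[str] = field(default_factory=set)

    def notify(self, event: str) -> None:
        """Call the exit program for every active node in the recovery domain."""
        for node, role in self.recovery_domain.items():
            if node in self.active_nodes:
                self.exit_program(self.name, node, role, event)


calls: List[Tuple[str, str, str, str]] = []


def log_exit_program(crg: str, node: str, role: str, event: str) -> None:
    # A real exit program would do resource-specific processing here.
    calls.append((crg, node, role, event))


acrg1 = ClusterResourceGroup(
    name="ACRG1",
    recovery_domain={"SYSTEM1": "primary", "SYSTEM4": "backup"},
    exit_program=log_exit_program,
    active_nodes={"SYSTEM1", "SYSTEM4"},
)

acrg1.notify("recovery domain changed")  # both nodes are active, both are called
acrg1.active_nodes.discard("SYSTEM1")    # SYSTEM1 goes offline
acrg1.notify("SYSTEM1 offline")          # only the surviving node is called
```

The design point the sketch tries to capture is that the exit program runs wherever the CRG is active, so each surviving node can react to the same event according to its own role.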

The CRG exit program is identified in the CRG object. Since the exit program provides resource-specific processing for the cluster event, it could be considered the resource manager for the group of resources associated with that CRG. There can be multiple CRGs on a node, each with a potentially different recovery domain. The HYPOCAT cluster in Figure 1 contains six CRGs, one for each critical resource (the application CRGs are identified by the acronym ACRGx and the data CRGs are identified by the acronym DCRGx). HYPOCAT shows some of the flexibility possible in the cluster definition.

Recovery domains can be different for different CRGs, and a node can have a role for a given CRG that is different from its role for another CRG. Figure 2 shows the defined recovery domains for different CRGs used in our example.
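One way to see how a node's role varies by CRG is to put the Figure 2 assignments into a small lookup. The data below is taken from Figure 2; the function names are invented, and the rule that the first backup takes over when a primary fails is an assumption of this sketch rather than a description of the actual switchover logic.

```python
# Primary and first-backup nodes per CRG, as listed in Figure 2.
RECOVERY_DOMAINS = {
    "ACRG1": ("SYSTEM1", "SYSTEM4"),
    "ACRG2": ("SYSTEM2", "SYSTEM3"),
    "ACRG3": ("SYSTEM2", "SYSTEM3"),
    "DCRG1": ("SYSTEM3", "SYSTEM2"),
    "DCRG2": ("SYSTEM2", "SYSTEM3"),
    "DCRG3": ("SYSTEM2", "SYSTEM3"),
}


def roles_for(node):
    """Return the role this node plays in each CRG whose recovery domain includes it."""
    roles = {}
    for crg, (primary, first_backup) in RECOVERY_DOMAINS.items():
        if node == primary:
            roles[crg] = "primary"
        elif node == first_backup:
            roles[crg] = "backup"
    return roles


def takeover_plan(failed_node):
    """Map each CRG whose primary is the failed node to its first backup (assumed rule)."""
    return {
        crg: first_backup
        for crg, (primary, first_backup) in RECOVERY_DOMAINS.items()
        if primary == failed_node
    }
```

For instance, `roles_for("SYSTEM3")` shows SYSTEM3 as primary for DCRG1 but backup for four other CRGs, and `takeover_plan("SYSTEM2")` lists the four CRGs that would move to SYSTEM3 under the assumed rule.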

Supporting System Services

iSeries 400 clustering uses a peer relationship architecture among cluster nodes. Each active node has all the information needed to understand the total configurational and operational characteristics of the cluster. Definition of the cluster and all cluster resources is consistently maintained on each active node through a set of distributed group services and reliable messaging services. (For more information on iSeries cluster resource services, refer to the AS/400 V4R4 Technology Journal.)

As a result, a request for a cluster action can be initiated from any node active in the cluster. Furthermore, any node (not necessarily the requesting node) can assume the role of coordinator for a particular protocol. This helps ensure that neither a single outage nor an outage of several cluster nodes ever constitutes a cluster failure.

The Future

You’ve heard about where clustering has been and the state of clustering today, so what about the future? Well, it’s definitely exciting. The current focus is data and application resilience and corresponding CRGs, and, given that OS/400 is a 100 percent object-based architecture, CRGs can include other object types. The possibilities are endless.

For example, it may become possible to use a tape device as a cluster resource, and there is more than one way to provide data resilience. There may be data resilience via a switch disk methodology. (We hesitate to call it “switch disk,” because that evokes the notion of a disk-based solution such as what is done in a SAN structure. The challenge is not the physical aspects of switching at a hardware level; SAN vendors have been doing that for some time. The challenge is switching the application, operating system, and users in concert with the physical switching.) The application environment designed for clustering could choose the type or types of underlying data resilience services to deploy, whether replication or switch disk. However, switch disk methodology is intended for unplanned outages and system upgrade outages, not for disaster recovery or save window impacts.

For clusters deployed within a single Logical Partitioning (LPAR) system, workload balancing and the ability to move users, data, and applications around in a cluster will soon be available. For separate systems within clusters, these features will actually appear in a future release. Once the basics have been implemented, IBM’s effort to expand the OS/400 design to allow these clustering benefits becomes incremental.


The journey has just begun. The iSeries 400 looks to extend its already impressive image as the world’s most available and usable server into the clustering domain. For iSeries customers, this means that the platform will continue to lead the pack as the business machine truly designed for continuous operations and, more importantly, the robustness the Web world demands.

Figure 1: The HYPOCAT cluster solution offers better availability. (Diagram not reproduced: four nodes, SYSTEM 1 through SYSTEM 4, hosting the application CRGs ACRG1 to ACRG3 and the data CRGs DCRG1 to DCRG3 together with their backups.)

CRG    Association                          Primary Node  First Backup Node
ACRG1  Web-serving application              SYSTEM1       SYSTEM4
ACRG2  Inventory-control application        SYSTEM2       SYSTEM3
ACRG3  Supply chain management application  SYSTEM2       SYSTEM3
DCRG1  Catalog and order data               SYSTEM3       SYSTEM2
DCRG2  Inventory data                       SYSTEM2       SYSTEM3
DCRG3  SCM data                             SYSTEM2       SYSTEM3

Figure 2: Different CRGs can have different recovery domains.


