Performance Tuning


I’ve done a lot of consulting for AS/400 shops. Much of it revolved around concern that an AS/400 was out of capacity and had to be replaced. As it turned out, many of those problems transpired because the AS/400 had never been tuned. Tuning will increase the years of service you’ll get from your AS/400 and generally make your computing life run more smoothly.

Too many AS/400 customers ignore tuning or are unaware of it. As a result, the art of performance tuning an AS/400 is almost lost—which is too bad, because tuning is fairly easy.

One reason I suspect AS/400 customers don’t tune their AS/400s is that they associate the task with “fixing” something. That’s only true in really extreme cases—usually when a performance consultant is brought in. You don’t want to get to that point. Tuning an AS/400 is just something you do. Not all the time, but from time to time. Think of tuning as getting to know your system; it’s really not much more than that. If you faithfully review your AS/400 using the techniques in this article, you’ll make the adjustments once, and you’ll have a better mental image of what your AS/400 is doing at any given time. And, because of your knowledge, your AS/400 will be humming along beautifully.

This article also develops a few rules for tuning. Rather than keep you in suspense, I’ll just cut to the chase here. When you tune, always do the following:

• Monitor when the system is at its busiest (for the job mix).
• Modify when the system is at its lightest.
• When reassigning memory among pools, you must first decrease the memory in the pools that require less memory. Then, you can increase memory in other pools.

There aren’t too many specific tasks involved in tuning; the sticking point is determining what’s right for any particular system. Each AS/400 gets tuned to support a job mix. As the mix changes, the tuning requirements change. And the mix is always changing. The challenge, then, is gathering the data (monitoring) to maintain the proper support for your specific mix. After you monitor, you may need to modify some AS/400 parameters.

In this article, I’ll discuss manual monitoring—which can be quite tedious. After you’ve done it once, you may want to write a program to do your monitoring for you.

First, I’ll discuss the AS/400 items you will monitor:
• Wait-to-ineligible ratio
• Nondatabase faults (nonmachine pools)
• Nondatabase faults (machine pool)
• Nondatabase faults (all pools)

Wait-to-ineligible Ratio

The wait-to-ineligible ratio is the primary item to monitor. Jobs can be in one of three run states:

• Active. The job is currently running.
• Wait. The job is waiting for something, such as a disk access (for example, a memory page retrieval) or a user’s response to a screen.
• Ineligible. The job is ready to run but is waiting for the system to allocate resources to it.

From these three, IBM defines three critical transition states, or the act of a job going from one state to another:

• Active-to-wait
• Active-to-ineligible
• Wait-to-ineligible

The AS/400 measures these transition states in a ratio of each against the other as well as against other jobs.

Use transition ratios to define what you want to see in a healthy system. The valid ratios are given in the IBM Redbook AS/400 Performance Management for each AS/400 model and each version of the operating system. Keep in mind that the text and examples in this article use acceptable values for an AS/400 that may be nothing like yours. Follow the discussion, but don’t copy my values for your AS/400. Check the Redbook for your AS/400’s values before you start tuning.

Your primary ratio is wait-to-ineligible to active-to-wait. You’d expect it to be about 10 percent. This ratio means that 10 percent of the jobs that went from active to wait are unable to start right away when they are done waiting. In other words, if 50 jobs were active and went into a wait state, you would like to see five of them (10 percent) go from wait to ineligible.
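The arithmetic behind that ratio can be sketched in a few lines of Python. This is only an illustration: the function name and its arguments are my own, and the two inputs stand for the Act-Wait and Wait-Inel figures you would read off the WRKSYSSTS display.

```python
def wait_to_ineligible_pct(act_wait, wait_inel):
    """Return wait-to-ineligible as a percentage of active-to-wait.

    Both arguments are transition counts (or rates) for one pool.
    Returns 0.0 when no job went from active to wait, to avoid
    dividing by zero on an idle pool.
    """
    if act_wait <= 0:
        return 0.0
    return 100.0 * wait_inel / act_wait


# The example from the text: 50 active-to-wait, 5 wait-to-ineligible.
print(f"{wait_to_ineligible_pct(50, 5):.1f}%")
```

For the 50-and-5 example this prints 10.0%, the value the text treats as healthy.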

To understand why it is good to have some number of jobs ineligible to run, consider having a system in which no jobs wait. Better yet, consider how “great” it would be for drivers if New York City did away with traffic lights and stop signs. You may have an image of zipping from one end of the city to another, but you probably wouldn’t go anywhere. Think of this number of ineligible jobs as jobs that are waiting for their turn to move. An AS/400 that shows no jobs going ineligible is like a city where no one is stopping for traffic control. A low wait-to-ineligible ratio (or no ineligibles at all) doesn’t have to be bad; it can simply indicate that you don’t have much running on the system. If you see either on a busy system, though, the system may be overtuned.

Can a system be overtuned? Yes, although a better term may be overallocated. An AS/400 can have so many system resources allocated that no jobs ever have to wait. However, the effect is like the traffic free-for-all situation. If everyone cooperates for an hour, the system appears to work well. However, if traffic on Fifth Avenue at 5:00 p.m. hogs that street, cars waiting to cross or to join traffic may wait an interminable amount of time. If you just look at the number of cars (jobs on the AS/400) and their individual successes, you would see that some run great and others (on a random distribution) get nowhere. That’s what happens on an overallocated system: random jobs just go in the toilet for no apparent reason. Think of the 10 percent wait-to-ineligible/active-to-wait ratio as an indication that the system resources are fairly distributed.

The monitoring command for that ratio is Work with System Status (WRKSYSSTS). Figure 1 contains a sample.

You can control the ratios by changing two things on this screen: the Max Act and the Pool Sizes. On most systems, you don’t really have a lot of spare memory to adjust into pools, so you’ll be changing the Max Act field much more frequently. I favor a Max Act adjustment over a pool adjustment, anyway, because increasing memory is like increasing the number of streets in the traffic example. It’s a nice effort, and it will delay some performance problems, but it doesn’t solve the problem; the problem only shifts for a while. To change either, position the cursor on the field and type in the new figure.

To monitor for good transition, focus on two columns, Wait-Inel (wait-to-ineligible) and Act-Wait (active-to-wait). You want the wait-to-ineligible value to be between 0 and 10 percent of the active-to-wait value.

Nondatabase Faults (Nonmachine Pool)

Another monitoring point shown on Figure 1 is the nondatabase faults. A fault occurs when the system goes to work on a memory page (which can contain a chunk of program code or some data), and the page isn’t where it’s supposed to be. Usually, another program needed some memory for one of its pages, and the particular memory pool page showed up as old and boring so the page got moved to disk. The system faults when it doesn’t see the page and has to go to disk to reload it into memory.

A database fault refers to a missing data page and is usually an application issue. For example, a program reads records from a file. The file’s records are loaded into memory in blocks. The program happily reads through the blocked records (which are in memory pages) and then faults when it reads the last of the block. That fault causes the program to ask the operating system to bring in another block of records, and that’s called a database fault.

Database faults are pretty application-dependent—if you have a lot of them, you may want to ask a developer to take a look. But nondatabase faults are caused by a program’s inability to execute some code or to read some records that should be there. These are indications of a busy system.

Remember, pools are just fences around chunks of memory. The system pool is where the system does its thing for all programs, and the user pools are where everything else runs.

Refer to the left column in Figure 1; it is a list of the memory pools on the system. Pool 1 is the system pool, and the rest are user pools. This section focuses on nondatabase faults for user pools. (A database fault occurs when an AS/400 customer unintentionally corrupts his database. It’s a database problem, and it’s all his fault!)

Database faults can be an application issue, but nondatabase faults are a way of saying, “There isn’t enough memory to adequately load the program.”

Monitor nondatabase faults with the WRKSYSSTS command, and, if you need to, fix the problem by modifying the pool size value on the same screen. Raising the pool size increases available memory and decreases the nondatabase faulting rate. Conversely, decreasing the pool size increases the nondatabase faulting rate.

These pools should have nondatabase faults of between 10 and 20 pages per minute. Don’t worry too much about the ones that are less than 10. In the pools in Figure 1, Pools 2 to 6 represent batch jobs that are expected to be lightly loaded during the day, so the fault rate would be low. However, Pool 7 in the figure supports interactive jobs and has a fairly high nondatabase fault rate (44.8). You should probably adjust this value.
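A small Python sketch shows how that guideline might be applied to the user pools. The 10 and 20 thresholds and Pool 7's rate of 44.8 come from the text; the function name and the rates for Pools 2 to 6 are made up for illustration, and your own thresholds should come from the Redbook.

```python
def classify_user_pool(non_db_faults, low=10.0, high=20.0):
    """Classify a user pool's nondatabase fault rate (pages per minute)."""
    if non_db_faults > high:
        return "high: consider giving the pool more memory"
    if non_db_faults < low:
        return "low: nothing to worry about"
    return "in range"


# Pool 7's rate is from the text; the other rates are hypothetical.
rates = {2: 1.3, 3: 0.0, 4: 2.1, 5: 0.4, 6: 0.0, 7: 44.8}
for pool, rate in rates.items():
    print(f"Pool {pool}: {rate:5.1f}  {classify_user_pool(rate)}")
```

Only Pool 7 is flagged here, which matches the reading of Figure 1 above.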

Nondatabase Faults (Machine Pool)

Pool 1 on Figure 1 is also called the machine pool. It is where the system does its system tasks. Its nondatabase fault rate should be between three and five pages per minute. In my example, this pool’s rate is 0.2, and that’s pretty low; I could take memory away to get it up to the correct range.

Nondatabase Faults (All Pools)

Finally, as another monitoring activity, total the nondatabase faults for all pools (including the machine pool). This value should be between 180 and 300 (each AS/400 model has a unique value; you can obtain that value from the performance Redbook).

If the value isn’t what’s specified by the Redbook, one or more pools may be seriously out of tune. If all pools are tuned or there are no further steps you’re able to take (like no inactive pools you can draw memory from), then the AS/400 may be out of capacity and in need of an upgrade.

When to Monitor

If the AS/400 is lightly loaded, you’ll get low numbers for these monitoring points. If you make adjustments based on that, your AS/400 will go belly up when any serious workload hits.

Always monitor when the AS/400 is getting hammered. Take several monitor samples, and don’t worry about reacting to all of them. On the best-tuned AS/400s, you’ll get transition periods in which the monitored values are not good. You’re looking for a trend that’s acceptable most of the time.
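One way to look for a trend rather than react to single readings is sketched below in Python. It assumes each sample is an (Act-Wait, Wait-Inel) pair noted from WRKSYSSTS during a busy period. The sample values and function names are hypothetical, and the 10 percent ceiling is the example figure used earlier in this article.

```python
def ratio_pct(act_wait, wait_inel):
    """Wait-to-ineligible as a percentage of active-to-wait."""
    return 100.0 * wait_inel / act_wait if act_wait > 0 else 0.0


def fraction_acceptable(samples, max_pct=10.0):
    """Return the share of samples whose ratio is at or below max_pct."""
    if not samples:
        return 0.0
    good = sum(1 for act_wait, wait_inel in samples
               if ratio_pct(act_wait, wait_inel) <= max_pct)
    return good / len(samples)


# Hypothetical busy-period samples: (Act-Wait, Wait-Inel)
samples = [(50, 5), (40, 2), (60, 12), (30, 0)]
print(f"{fraction_acceptable(samples):.0%} of samples acceptable")
```

How large a share counts as "most of the time" is a judgment call for your shop; the sketch only reports the figure.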

Approach performance monitoring like stock market investing. Establish a goal and check it periodically. If you check it every day (or hourly), you’ll go crazy.

You can tune any AS/400 for any job mix, but job mixes change. A typical AS/400 predominantly runs interactive jobs from 7:00 a.m. to 6:00 p.m., Monday through Friday. The rest of the time, the system runs batch jobs—maybe lots every night, a few on the weekend, and very many at month-end. These are all job-mix situations that would benefit from unique “tunes.”

Make Modifications

As we get into this section, I need to point out that the AS/400 will do automatic tuning for you through the system value QPFRADJ. If this value is on, the system will adjust the very settings you change as you tune. For this reason, I never turn it on. I simply like my tuning efforts better.

Refer to Figure 1 again. Its second and fourth columns represent the things you can change:

• Pool Size
• Max Act

These are bad names for the columns. Read them this way:

• Memory Pool Sizes
• Activity Level

Memory Pool Sizes

This is the amount of memory a set of jobs has to play in. The more memory a set of jobs has, the faster the jobs will run. But they may do so at the expense of other jobs.

You’ll notice from the display in Figure 1 that you cannot modify the memory for Pool 2. That pool is also called the base pool, *BASE, or star-base. It contains all the system memory not used in any other pool.

When you decrease the memory in a pool, the leftover memory goes into Pool 2. When you increase the memory in any pool, the amount you need comes from Pool 2.
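The way memory flows through the base pool can be modeled with a toy Python sketch. It is not how the system is implemented; the pool names BATCH and INTERACTIVE and all sizes are hypothetical. It does show why the rule at the start of this article says to decrease before you increase: a pool can only grow by what *BASE currently holds.

```python
BASE = "*BASE"


def shrink(pools, name, amount):
    """Take memory from a pool; the freed memory lands in *BASE."""
    if name == BASE:
        raise ValueError("*BASE is not sized directly")
    if amount > pools[name]:
        raise ValueError(f"{name} holds only {pools[name]}")
    pools[name] -= amount
    pools[BASE] += amount


def grow(pools, name, amount):
    """Give memory to a pool; it can only come out of *BASE."""
    if name == BASE:
        raise ValueError("*BASE is not sized directly")
    if amount > pools[BASE]:
        raise ValueError("not enough memory in *BASE; shrink another pool first")
    pools[BASE] -= amount
    pools[name] += amount


# Hypothetical pool sizes in KB.
pools = {BASE: 1000, "BATCH": 8000, "INTERACTIVE": 6000}
shrink(pools, "BATCH", 3000)       # *BASE rises to 4000
grow(pools, "INTERACTIVE", 3000)   # *BASE falls back to 1000
print(pools)
```

Calling grow first in this example would fail, because *BASE starts with only 1000 KB to give.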

To take memory from one storage pool that isn’t using it and put it in another that needs it, you can modify the system by either of two methods:

• Decrease the first pool’s memory.
• Increase the second pool’s memory.

Activity Level

Activity level is called MAX ACT on the WRKSYSSTS display, which is easy to confuse with MAXACT jobs on the subsystem description and job queue descriptions. The job queue MAXACTs refer to how many jobs can be running at any one time.

The MAX ACT on the WRKSYSSTS display is a lot different. The activity level doesn’t affect how many jobs can run at one time. It affects how many can be active at one time. Remember the three job states I mentioned?

When a job ends a wait state, it wants to go active. It can return to active, go to ineligible until another job finishes, or go to wait. Whether it goes active or ineligible is determined by the activity level (or MAX ACT) setting for the storage pool and by the number of jobs being run.

You could have a storage pool with MAXACT jobs of six (from the subsystem description) but an activity level of three (on the memory pool). Although six jobs can be running at the same time, three of them will always be in either a wait state or an ineligible state.
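The difference between the two limits can be pictured with a deliberately simplified Python sketch. It ignores priorities and time slices and simply takes jobs in list order, and the function and job names are invented for illustration. Jobs that are ready to run fill the activity-level slots, and the rest go ineligible.

```python
def place_ready_jobs(ready_jobs, activity_level):
    """Split ready-to-run jobs into active and ineligible.

    At most activity_level jobs can be active in the pool at once;
    any other ready jobs are ineligible until a slot frees up.
    """
    active = ready_jobs[:activity_level]
    ineligible = ready_jobs[activity_level:]
    return active, ineligible


# Six jobs allowed by MAXACT, all ready to run; activity level of three.
jobs = ["JOB1", "JOB2", "JOB3", "JOB4", "JOB5", "JOB6"]
active, ineligible = place_ready_jobs(jobs, 3)
print("active:    ", active)
print("ineligible:", ineligible)
```

With fewer ready jobs than the activity level (say, two), nothing goes ineligible, which is what a lightly loaded pool looks like.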

Monitoring and modifying MAX ACT is done differently for interactive and batch pools. Interactive pools go into waits when they display a screen to a user. In computer time, those waits are small eternities. So you could have a storage pool supporting 100 interactive users with an activity level of six, because you expect the others to be waiting on screens anyway. Batch pool waits are for the system, not the user. System processes are very fast, and the wait time is minimal, so your batch storage pool activity level will be around 50 percent of the maximum active jobs.

The Redbook has suggested initial values for activity level settings. The only way you can change the Wait-Inel column is to change the MAX ACT figure. Lower it by an increment of 2 and then wait approximately 15 minutes for the machine to stabilize before you make another modification. When the Wait-Inel number drops to almost 0, increase MAX ACT by 2 and leave it.

For detailed information about memory pools and activity levels, read “SYSOP: Memory Pools” and “SYSOP: Activity Levels” in the May and June issues of MC.

When to Modify

When you change memory or activity level, your system must go through a fairly intensive rearranging of the jobs in those pools. This rearranging can get really intensive when you take memory from one or more of the pools and put it into one or more of the other pools. When you make these changes, you can watch your CPU activity light go steady for anywhere from several seconds to several minutes.

For this reason, don’t make modifications when your system is heavily loaded. I may violate this rule when setting up a new application or a new AS/400. But the rule definitely applies for a stable AS/400. Monitor when it’s busy, and modify some other time. You may even develop a “canned” set of pools and activities to best support different job mixes.

When I’ve been in shops that have a stable, although changing, job mix, I’ve had great success with canned pools. All I do is write a couple of CL programs that change pools and activities. One runs just prior to the start of the nightly batch processing. It tunes the system to favor batch processing by “stealing” memory from interactive pools and giving it to batch pools. It also boosts the batch activity levels slightly. The other does just the opposite; it reallocates memory from batch to interactive. I embed these programs at the start and end of nightly processing, so they run automatically. When there’s no nightly batch process (like on Saturday and Sunday evenings), the machine stays tuned for interactive processing.

With that canned set of pools, you can change memory and activities to support batch processing immediately prior to transitioning from interactive to batch in the evening. In the morning, you can change back to support interactive processing before the first users sign on. If you add some basic monitoring during each mix’s period, you can change your script subtly from time to time.
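The ordering inside such a switch matters, and a short Python sketch can show it. The profile names, pool names, and sizes are hypothetical, and the sketch only plans the changes; on a real system the changes themselves would be made by your CL programs. Pools that shrink are listed first so that their memory is back in *BASE before any pool tries to grow.

```python
# Hypothetical canned profiles: pool name -> size in KB.
DAY = {"INTERACTIVE": 9000, "BATCH": 5000}
NIGHT = {"INTERACTIVE": 4000, "BATCH": 10000}


def plan_switch(current, target):
    """Return (pool, new_size) steps, with all decreases before increases."""
    shrinks = [(p, size) for p, size in target.items() if size < current[p]]
    grows = [(p, size) for p, size in target.items() if size > current[p]]
    return shrinks + grows


for pool, size in plan_switch(DAY, NIGHT):
    print(f"set {pool} to {size}")
```

Switching from DAY to NIGHT shrinks INTERACTIVE before it grows BATCH; the morning switch back does the reverse. Activity levels could be carried in the same profiles.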

Tuning Check List

Performance tuning must be an ongoing activity. It’s not something to start doing when your system is crashing.

Remember that monitoring is also ongoing. And it’s boring. Consider writing a program to monitor your system at known busy times and just collect data that you can scan periodically. Then, make modifications when the time is right.

If you make performance tuning an ongoing activity, you’ll never be backed into a corner in which your system is limping along with a tuning problem and you have to make a change in the middle of a busy period.

Reference

Redbook AS/400 Performance Management (GG24-3723-02)


Figure 1: Sample of the WRKSYSSTS command




