Understanding Tape Data Compression Methods

The MIS executive today is faced with a bewildering array of hardware and software options, and nowhere is this more evident than in the recent explosion of DASD backup solutions. The problem starts at the most fundamental level. There are six different, mutually incompatible hardware technologies from which to select: the old tried-and-true half-inch reel-to-reel tape drives; the quarter-inch cartridge (QIC) products that became the standard for smaller machines during the 1980s; the 1/2-inch cartridge (3480) tape drive that is the mainframe standard and is becoming popular on high-end AS/400s; the 8mm helical scan tape drives that were the domain of third-party vendors until IBM introduced the 7208 on the new "D" Series processors in April; the 4mm helical scan tape drive, which is the latest third-party vendor introduction to the midrange market; and finally optical, which is actually a removable-media disk.

Each of these technologies has its strengths and weaknesses, and each has penetrated the AS/400 market to some degree. The elements that appear to have the greatest influence on the MIS executive's ultimate choice among these technologies are performance, capacity, compatibility and cost, though not necessarily in that order. It is obviously in the decision maker's best interest to understand how these technologies weigh in against a shop's specific needs when selecting a DASD backup solution.

The Importance of Data Compression

There is another element that crosses all these drive technologies, directly affects the four key criteria mentioned above (performance, capacity, compatibility and cost) and is equally important to the decision-making process. This additional variable is data compression. In fact, the data compression question is so critical that, if understood, it could easily change many of the decisions currently being made by otherwise knowledgeable MIS executives. Unfortunately, the effects the various types of data compression have on the end-user's backup and disaster recovery strategies are poorly understood by customer and vendor alike. It is the intent of this article to rectify this condition by providing the information needed to include the data compression issues in the decision-making process.

The purpose of data compression is exactly what the name implies: if a data record of some given length is reduced to some smaller length, without loss of any of the original data, then data compression has occurred. The techniques used to achieve data compression are commonly called data compression algorithms. According to Webster's New Universal Unabridged Dictionary, an algorithm is, "...any special method of solving a special kind of problem." So, when we mention a particular data compression algorithm, we are simply referring to a specific method for achieving data compression. Knowing something about the different algorithms is important for two reasons. First, not all algorithms are created equal; some are considerably better at compressing data than others. Second, data compression algorithms are only compatible with themselves. You cannot record compressed data on one drive and then read it on another drive that uses a different data compression algorithm. If system backups are to be saved in a compressed format, then any other drive that may be called upon to restore that data must not only be the same technology (reel-to-reel, DAT, 8mm, etc.) but must also support exactly the same data compression algorithm.

Another important variable that is often ignored when discussing data compression is the performance of the compression implementation itself. The actual time it takes to perform a backup on a given system is primarily a function of: the physical location of specific data on the disk; the seek time of the disk drives; the number of disk spindles; the processing speed of the AS/400; the speed of the tape channel; and the speed of the tape drive. Any of the above can become the bottleneck in a given system and therefore the determining factor in the actual time required to perform a backup. When data compression is added, it becomes another link in the chain and can--and sometimes does--cause the total backup time to become longer. The method of implementation and the algorithm itself determine the speed with which the data compression logic can process data.

There are two general methods of implementation, commonly known as software data compression and hardware data compression. Software data compression gets its name because the algorithm is implemented by a program that is executed on the host system (the data is read from the disk into memory, compressed by an AS/400 program, and then transferred by the AS/400 from memory to the tape drive). Hardware compression is performed by the electronics in the tape controller or in the tape drive itself. Since the amount of compression achieved is a function of the algorithm, not the method of implementation, software compression is just as effective at compressing data as the equivalent hardware implementation. In addition, assuming the algorithm was properly implemented, software and hardware compression are compatible with each other. The downside of a software algorithm is its performance. The overhead placed on the host system is significant. It can easily take two to three times longer to back up with software data compression than with no compression at all. Because of this heavy performance cost, most AS/400 end-users do not use the software data compression feature that comes standard with the OS/400 backup utility.

The attraction of a well-implemented hardware data compression algorithm is that it can multiply both the capacity and the effective transfer rate of the drive by the compression ratio. The newest 8mm product (Exabyte model 8500) has a basic capacity of 5 gigabytes and a transfer rate of approximately 500 kilobytes per second. This means that an AS/400 with 5 gigabytes of DASD could perform a complete backup on a single 8mm cartridge in less than three hours (the actual time would be somewhat longer, depending on the ability of the AS/400 to keep the tape streaming). With a hardware data compression algorithm capable of providing an average of 3:1 compression, the same Exabyte 8500 would be capable of storing 15 gigabytes on a single tape at an effective rate of 1.5 megabytes per second (three times the capacity at three times the speed). For this reason, almost all vendors of AS/400 tape products, including IBM, are now providing some form of optional hardware data compression.
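The arithmetic behind these figures can be sketched in a few lines of Python. This is only an illustration: the function names are made up for the example, decimal units are assumed (1 gigabyte = 1,000,000,000 bytes), and the host is assumed to keep the tape streaming at all times.

```python
KB = 1000          # assumption: decimal kilobytes
GB = 1000 ** 3     # assumption: decimal gigabytes

def tape_capacity(native_bytes, ratio=1.0):
    """Amount of DASD data that fits on one tape at a given compression ratio."""
    return native_bytes * ratio

def backup_hours(data_bytes, drive_bytes_per_sec, ratio=1.0):
    """Hours needed to back up data_bytes, assuming the drive never waits on the host."""
    effective_rate = drive_bytes_per_sec * ratio
    return data_bytes / effective_rate / 3600

# Exabyte 8500 figures quoted in the text: 5 GB native, about 500 KB/sec.
print(round(backup_hours(5 * GB, 500 * KB), 2), "hours for 5 GB, uncompressed")
print(tape_capacity(5 * GB, 3) / GB, "GB per tape at 3:1")
print(round(backup_hours(15 * GB, 500 * KB, 3), 2), "hours for 15 GB at 3:1")
```

Under these assumptions the uncompressed 5-gigabyte backup and the 3:1 compressed 15-gigabyte backup take the same time, which is the "three times the capacity at three times the speed" point made above.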

The Compression Algorithms

Before discussing how data compression affects an MIS executive's backup strategy decisions, we need to take a closer look at the types of compression available today. There are currently four hardware data compression algorithms in common use for data backup in the AS/400 environment. All four are based on the premise that repetitive data can be identified and then represented with fewer bits on the tape. These four methods are: HDC, which comes standard from IBM on the AS/400 feature code 2608 and 2621 SCSI tape controllers; LZ1, which has achieved popularity as the compression algorithm standard for QIC products; LZ2, which is endorsed by Hewlett-Packard and appears to be the most likely standard for the 4mm helical tape drives; and IDRC, which is endorsed by IBM and used on its high-performance half-inch cartridge (3480/3490) tape products. There is also a rumor that IBM plans to add IDRC to its new 7208 helical scan tape products.

The HDC (Hardware Data Compression) algorithm, which comes standard on all IBM feature code 2608 and 2621 tape controller boards, is the simplest and least effective of the algorithms currently available on AS/400 tape products. HDC examines each data byte as it is transferred from the AS/400 system to the tape drive and identifies it as one of three types: a blank (X'40'), a duplicate character, or a nonduplicate character. The heart of the algorithm is a control byte in which the first two bits are encoded to represent one of the three types (00 = nonduplicate, 10 = blank, and 11 = duplicate) and the remaining 6 bits represent a count from 1 to 63.

If the HDC algorithm encounters a string of blanks up to 63 bytes long, it replaces the entire string with a single control byte designating how many blanks need to be inserted into the string when it is uncompressed at a later time. If an entire database were composed of blanks, the HDC algorithm would provide a 63:1 compression ratio; that is, 63MB of DASD data could be stored on a single megabyte of tape. This, of course, represents the best-case scenario for the HDC algorithm. If the algorithm encounters a string of duplicate characters other than blanks, it replaces the string with two bytes: the appropriate control byte followed by one byte of the character to be duplicated. If a database were composed entirely of a single nonblank character, the HDC algorithm would realize a compression ratio of approximately 32:1 (63 bytes reduced to 2). And finally, if the algorithm encounters a string of nonrepetitive bytes, a control byte is written to identify the nonrepetitive nature of the string. If a database were to consist entirely of nonrepetitive characters, the data would actually be expanded, since an extra control byte would now exist for every 63 bytes of real data. A total of 100MB of DASD data would now occupy approximately 102MB of tape.
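The control-byte scheme described above can be sketched in Python as follows. This is a minimal illustration, not IBM's implementation: the two-bit type codes and the 6-bit count come from the description, while the function names, the rule that literal bytes follow a nonduplicate control byte, and the minimum run lengths that trigger a blank or duplicate code are assumptions made for the example.

```python
BLANK = 0x40                                  # EBCDIC blank, X'40'
NONDUP, BLANK_RUN, DUP_RUN = 0b00, 0b10, 0b11 # type codes from the text
MAX_COUNT = 63                                # largest value of the 6-bit count

def hdc_compress(data: bytes) -> bytes:
    out = bytearray()
    literals = bytearray()                    # pending nonduplicate bytes

    def flush_literals():
        # Assumption: a nonduplicate control byte is followed by its literal bytes.
        for start in range(0, len(literals), MAX_COUNT):
            chunk = literals[start:start + MAX_COUNT]
            out.append((NONDUP << 6) | len(chunk))
            out.extend(chunk)
        literals.clear()

    i = 0
    while i < len(data):
        byte = data[i]
        run = 1
        while i + run < len(data) and data[i + run] == byte and run < MAX_COUNT:
            run += 1
        if byte == BLANK and run >= 2:        # assumed threshold for blank runs
            flush_literals()
            out.append((BLANK_RUN << 6) | run)
        elif run >= 3:                        # assumed threshold for duplicate runs
            flush_literals()
            out.append((DUP_RUN << 6) | run)
            out.append(byte)
        else:
            literals.extend(data[i:i + run])
        i += run
    flush_literals()
    return bytes(out)

def hdc_expand(packed: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(packed):
        kind, count = packed[i] >> 6, packed[i] & 0x3F
        i += 1
        if kind == BLANK_RUN:
            out.extend(bytes([BLANK]) * count)
        elif kind == DUP_RUN:
            out.extend(packed[i:i + 1] * count)
            i += 1
        elif kind == NONDUP:
            out.extend(packed[i:i + count])
            i += count
        else:
            raise ValueError("unknown control byte type")
    return bytes(out)

# The three cases discussed above, each on a 63-byte input.
for label, sample in [("blanks", bytes([BLANK]) * 63),
                      ("one repeated character", b"A" * 63),
                      ("nonrepetitive", bytes(range(63)))]:
    print(label, "->", len(hdc_compress(sample)), "bytes on tape")
```

With these assumptions the sketch reproduces the three cases from the text: 63 blanks become 1 byte, 63 copies of one character become 2 bytes, and 63 nonrepetitive bytes grow to 64.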

Most AS/400 users realize a modest gain (approximately 1.4:1 in both speed and capacity) when using HDC. There have been reports of compression as high as 2:1, while others have actually experienced expansion of data. Like all compression algorithms, HDC's effectiveness is strictly dependent on the type of data being compressed. Although HDC is the least effective of the algorithms, it has the advantage of being inexpensive (actually it's free on the 2608 and 2621 feature codes), fast (because of its simplicity it will never become the bottleneck in the backup process) and compatible (with any other AS/400 equipped with either a 2608 or a 2621 controller).

The LZ (Lempel/Ziv) data compression algorithms (LZ1 and LZ2) were named after Abraham Lempel and Jacob Ziv, who did much of the pioneering work in data compression techniques. The LZ algorithms are similar in operation to the HDC algorithm previously discussed but significantly more efficient. In both the LZ1 and LZ2 algorithms, the incoming data is searched not only for repeating bytes but also for repeated byte patterns. As data is initially received by the compression logic, it is stored in a pattern dictionary as well as on the tape. When a later pattern duplicates a dictionary entry, the dictionary location is recorded on tape instead of the individual data bytes of the recognized pattern.
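The idea of recording a dictionary location in place of a repeated pattern can be illustrated with a small textbook-style dictionary coder in Python. This is a teaching sketch only: it follows the general LZ78 approach, uses an unbounded dictionary, and is not the LZ1 or LZ2 logic built into any tape drive. The function names and the token layout (dictionary index, next byte) are assumptions for the example.

```python
def lz_encode(data: bytes):
    """Return a list of (dictionary index, next byte) tokens."""
    dictionary = {b"": 0}          # index 0 is the empty pattern
    tokens = []
    phrase = b""
    for value in data:
        candidate = phrase + bytes([value])
        if candidate in dictionary:
            phrase = candidate     # keep extending a known pattern
        else:
            # Emit the location of the known pattern plus the new byte,
            # and remember the longer pattern for later reuse.
            tokens.append((dictionary[phrase], value))
            dictionary[candidate] = len(dictionary)
            phrase = b""
    if phrase:
        tokens.append((dictionary[phrase], None))   # leftover known pattern
    return tokens

def lz_decode(tokens) -> bytes:
    phrases = [b""]                # rebuilt in the same order as the encoder
    out = bytearray()
    for index, value in tokens:
        phrase = phrases[index] + (bytes([value]) if value is not None else b"")
        out.extend(phrase)
        phrases.append(phrase)
    return bytes(out)

record = b"ABABABABABAB"
tokens = lz_encode(record)
print(len(record), "input bytes ->", len(tokens), "tokens")
```

The decoder never receives the dictionary itself; it rebuilds the same entries from the tokens, which is why both ends must follow exactly the same algorithm.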

The difference between LZ1 and LZ2 is in the method used to access and maintain the data pattern dictionaries. In LZ1, the dictionary is a FIFO-type system in which the oldest patterns stored are continuously being replaced from the incoming data stream. In LZ2, the dictionary is filled at the beginning of the data stream and then remains static until the data compression ratio drops below a predetermined threshold, at which time the entire dictionary is flushed and refilled. Because the dictionary is always current, LZ1 invariably achieves higher data compression ratios than LZ2 for equivalent data types (an independent study using a mixture of file types showed LZ1 to provide an average compression ratio of 3.2:1 while LZ2's average was only 2.6:1). On the other hand, LZ1's dynamic dictionary results in a more costly design and is inherently slower in performing the compression and decompression functions.

IDRC (Improved Data Recording Capability) differs considerably from the more familiar recurring byte pattern replacement algorithms discussed so far. IDRC is based on an arithmetic coding technique in which the incoming data string is divided into 512-byte blocks which are each arithmetically encoded, bit by bit, into a single binary fraction. The compressed data consists of a series of binary fractions, each of which represents 512 bytes of uncompressed data. The original data is recovered by using magnitude comparison on the binary fraction to determine how the encoder must have subdivided the intervals.

The mathematical basis for a binary arithmetic code (BAC) algorithm is quite simple. There exists an infinite number of fractions between the numbers 0 and 1. Theoretically, everything in the universe could be assigned a different and unique fraction, allowing everything to be referenced by its binary fraction. In IDRC, there exists a unique binary fraction for every possible combination of bits within a 512-byte block. Theoretically, there exists a bit pattern that can be represented by a single coded bit (.1). Likewise, there exist some bit patterns for which the binary fraction is longer than the total number of bits in the original data string. The challenge in establishing the IDRC algorithm was in developing an encoding technique that allowed the shorter binary fractions to be assigned to the more probable bit patterns.
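A toy binary arithmetic coder in Python shows how a block of bits can be mapped to a single binary fraction and recovered by magnitude comparison. This is a sketch under stated assumptions, not the IDRC algorithm: it uses a fixed, made-up probability that any bit is 0 (P_ZERO), exact fractions instead of hardware arithmetic, and a block length that the decoder is told in advance.

```python
from fractions import Fraction
from math import ceil

P_ZERO = Fraction(9, 10)   # assumption: each bit is 0 with probability 0.9

def encode(bits: str) -> str:
    """Return the digits after the binary point of a fraction identifying bits."""
    low, width = Fraction(0), Fraction(1)
    for bit in bits:
        if bit == "0":
            width *= P_ZERO                 # keep the lower part of the interval
        else:
            low += width * P_ZERO           # keep the upper part of the interval
            width *= 1 - P_ZERO
    k = 1
    while True:                             # shortest fraction inside the interval
        numerator = ceil(low * 2 ** k)
        if Fraction(numerator, 2 ** k) < low + width:
            return format(numerator, f"0{k}b")
        k += 1

def decode(code: str, n_bits: int) -> str:
    """Recover n_bits by comparing the fraction against each interval split."""
    x = Fraction(int(code, 2), 2 ** len(code))
    low, width = Fraction(0), Fraction(1)
    out = []
    for _ in range(n_bits):
        split = low + width * P_ZERO
        if x < split:
            out.append("0")
            width *= P_ZERO
        else:
            out.append("1")
            low = split
            width *= 1 - P_ZERO
    return "".join(out)

for block in ("00000000", "00010000", "11111111"):
    code = encode(block)
    print(block, "->", len(code), "coded bits")
```

With this model the most probable block (all zeros) needs a single coded bit, while the least probable block (all ones) needs more coded bits than the original eight, mirroring the two extremes described above.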

The Benefits of Compression

The key element of any data compression algorithm is its compression ratio. This represents a direct measure of the algorithm's ability to reduce the size of the data records. The base line compression ratio of 1:1 represents the condition where the compressed data is exactly the same length as the uncompressed data (obviously not a desirable condition). If the compression ratio were to increase to 2:1, the compressed data would have half as many bytes as the uncompressed data. At 3:1, the compressed data is one-third the size of the uncompressed data, and so on.

An equally important, and often misunderstood, result of compression is the effect it has on the time it takes to perform a backup. A look at what actually happens during a compressed backup should clarify the time issue. For discussion purposes, we will assume that an IBM 7208 tape drive, capable of recording at 250KB/sec, is attached to an AS/400 in which all the other system elements are able to keep up with the tape drive. Without compression (or with a 1:1 compression ratio), it would take four seconds to back up one megabyte of DASD data, and the data would occupy one megabyte of tape. If data compression logic providing a 2:1 compression ratio were now added to the system, it would still take four seconds to record one megabyte on the tape, but this would represent two megabytes of DASD data. In other words, the system would be seeing an effective backup rate of 500KB/sec, or twice the tape speed. Obviously this effect continues as the ratio increases; to support a 4:1 compression ratio, the system would have to feed data into the compression logic at the rate of 1MB/sec in order to keep up with a drive recording data at 250KB/sec. Data compression provides the end-user with a double bonus: not only does it allow a single tape to back up more data, but it also backs the data up faster.
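The relationship between drive speed, compression ratio and the rate the host must sustain can be written as a short Python sketch. The function name and the optional host limit are assumptions added for illustration; the host limit models the earlier point that any other link in the chain can become the bottleneck.

```python
def effective_backup_rate(drive_kb_per_sec, ratio, host_kb_per_sec=float("inf")):
    """KB of DASD data backed up per second.

    The drive records compressed data at its native rate, so it can absorb
    drive_kb_per_sec * ratio of uncompressed data, but never more than the
    host is able to supply.
    """
    return min(drive_kb_per_sec * ratio, host_kb_per_sec)

# IBM 7208 example from the text: 250 KB/sec native recording rate.
for ratio in (1, 2, 4):
    print(f"{ratio}:1 ->", effective_backup_rate(250, ratio), "KB/sec of DASD data")

# Hypothetical host that can only deliver 600 KB/sec: the 4:1 case is capped.
print(effective_backup_rate(250, 4, host_kb_per_sec=600), "KB/sec with a slower host")
```

In the last call the drive could accept 1MB/sec of uncompressed data, but the hypothetical host cannot supply it, so the speed benefit of compression is only partly realized while the capacity benefit remains.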

The variables which determine the actual compression ratio on any given system are the algorithm used and the nature of the data to be compressed. In the AS/400 environment, IDRC has proven to be the most effective data compression algorithm, with typical compression ratios running between 3.5:1 and 4.0:1. Next comes LZ1, with typical compression ratios between 3.0:1 and 3.7:1; followed closely by LZ2, with typical compression ratios between 2.8:1 and 3.4:1. At the back of the pack is HDC with typical compression ratios between 1.2:1 and 1.5:1.

The only practical way to determine the compression ratio on any specific system is through actual measurement on that system, but the following guidelines can be used to develop a fairly accurate estimate. Binary files (program files that contain nontext data such as control characters) are the most difficult to compress and tend to measure out at approximately 50 percent of the typical values presented earlier for the IDRC and LZ algorithms. The HDC algorithm tends to expand binary files, giving a compressed file that is actually larger than the original. Word processor files, text files, spreadsheet files and program source code files tend toward the lower end of the typical range, while desktop publishing files, database files, and CAD files tend toward the upper end of the typical range.
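These guidelines can be turned into a rough estimate for a mixed system, as in the Python sketch below. The file mix is hypothetical, the per-type ratios are one reading of the guidelines applied to the LZ1 range quoted earlier (3.0:1 to 3.7:1, with binary files taken as half of the low end), and the function name is made up for the example. Measurement on the actual system remains the only reliable method.

```python
def estimated_ratio(mix_mb, ratios):
    """Overall compression ratio for a mix of file types.

    Each type shrinks by its own ratio, so the overall figure is total data
    divided by total tape used, not a simple average of the ratios.
    """
    total_mb = sum(mix_mb.values())
    tape_mb = sum(size / ratios[kind] for kind, size in mix_mb.items())
    return total_mb / tape_mb

# Hypothetical 5-gigabyte system (sizes in MB).
mix_mb = {"database": 3000, "text_and_source": 1000, "binary": 1000}

# Assumed LZ1 ratios derived from the guidelines above.
lz1_ratios = {"database": 3.7, "text_and_source": 3.0, "binary": 1.5}

print(round(estimated_ratio(mix_mb, lz1_ratios), 2), ": 1 estimated overall")
```

Because the poorly compressing files consume a disproportionate share of the tape, the overall estimate comes out lower than a size-weighted average of the individual ratios would suggest.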

Backup Strategies Involving Data Compression

In order to better understand how data compression affects the selection of a backup solution, we will evaluate the various options against a typical AS/400 environment. For the purpose of this example, we will define our typical site as having a D50 with 5 gigabytes of DASD. The system is actively being used from 6 a.m. to midnight, Monday through Friday, with the possibility of some light use during off hours and on the weekend. The site currently saves changed objects and certain selected libraries every evening after midnight and performs a full system save every Saturday. These saves are done on an IBM 2440 1/2-inch reel tape drive. Pressure to reduce operating costs has resulted in the decision to eliminate the need for a third-shift operator by investing in an unattended backup solution. The critical requirements that will drive the final decision are: unattended backup within a single shift; compatibility with appropriate disaster recovery sites; room for growth up to 10 gigabytes of DASD; and cost.

New tape technologies, combined with advanced data compression algorithms, now provide a large number of possible solutions where, just four years ago, there were none. The following three solutions, each using a different tape drive technology, are given as examples.

1) A pair of 4mm drives with LZ data compression is one of the lowest-cost solutions and should provide between 10 and 12 gigabytes of unattended backup. On the negative side, the 4mm technology is still new enough that there could be some difficulty in ensuring compatible disaster recovery sites (remember that both the technology and data compression algorithms must be compatible), and 4mm, even with data compression, is relatively slow. One gigabyte per hour is about the best that can be expected with today's 4mm technology (which isn't fast enough to back up 10 gigabytes in an 8-hour shift).

2) A double-density 8mm drive (8500) with LZ data compression is relatively cost-competitive with the dual 4mm system and should provide up to 15 gigabytes of unattended backup. The 8mm solution suffers from the same compatibility issues as the 4mm (IBM does not currently support the double-density 8mm or LZ data compression); however, this solution does not have the speed limitations. A double-density 8mm with LZ compression should be able to sustain a transfer rate in excess of 3 gigabytes per hour.

3) A pair of 1/2-inch cartridge tape drives (3480 style) with stackers and IDRC data compression represents the optimum solution in every respect except cost. The 3480 is the fastest and most reliable of the new backup technologies, and IDRC is the fastest and most effective data compression algorithm. In addition, 3480 and IDRC are both IBM standards and, as such, are fully supported by all reputable disaster recovery facilities. The prohibitive cost from IBM is mitigated somewhat by a number of third-party tape vendors offering low-cost 1/2-inch cartridge drives. Unfortunately, most of these low-cost units come with the LZ data compression algorithm, which reintroduces the compatibility issue. Memorex Telex is the only third-party vendor that currently offers a low-cost 3490 with IBM-compatible IDRC data compression.

These three examples are not, and were not intended to be, an exhaustive list of possible solutions. Instead, these examples were provided to present an overview of how the various data compression algorithms can affect the decision making process.
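The capacity and speed figures quoted for the first two options can be checked against the site's requirements with a few lines of Python. This is a rough sketch: the function name is made up, the 8-hour shift and 10-gigabyte growth target come from the example site, the 4mm capacity uses the low end of the quoted 10 to 12 gigabyte range, and the 1/2-inch cartridge option is left out because no capacity or rate figures are given for it.

```python
def meets_requirements(capacity_gb, gb_per_hour, data_gb=10, shift_hours=8):
    """True if one unattended backup of data_gb fits on the media and in the shift."""
    fits_on_media = capacity_gb >= data_gb
    fits_in_shift = data_gb / gb_per_hour <= shift_hours
    return fits_on_media and fits_in_shift

# (unattended capacity in GB, sustained rate in GB/hour) as quoted above.
options = {
    "dual 4mm with LZ": (10, 1),
    "double-density 8mm with LZ": (15, 3),
}

for name, (capacity_gb, gb_per_hour) in options.items():
    hours = 10 / gb_per_hour
    verdict = "meets" if meets_requirements(capacity_gb, gb_per_hour) else "misses"
    print(f"{name}: {hours:.1f} hours for 10 GB, {verdict} the single-shift target")
```

The check deliberately ignores compatibility and cost, which the text treats as equally important and which cannot be reduced to a simple calculation.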

Summary

Data compression, when combined with the new backup technologies, provides such a comprehensive array of capabilities that a solution should exist for any reasonable AS/400 system requirement. When selecting a specific solution, it is important to understand how the compression algorithm affects performance, capacity, compatibility and cost. Because data compression can provide tremendous added value to a tape product, it is easily subject to misuse and abuse. An understanding of the various algorithms--along with their strengths and weaknesses--is the only insurance an MIS executive has that he will not be inadvertently misled.
