Joining Files

General
Typography
  • Smaller Small Medium Big Bigger
  • Default Helvetica Segoe Georgia Times

Access information from more than one file at a time with this modern database technique

by Ted Holt

If you moved into the System/38-AS/400 world after working with the System/34-36, as I did, you may feel overwhelmed at the difference in the two environments. Many of the differences are simply extensions of what you already know, such as using data areas. Other differences just involve using new techniques, such as treating RPG indicators as data fields.

But the AS/400 is a database machine, built upon new concepts, so the majority of the differences from your old environment proceed from a change in philosophy. If we consider the topic of this article on the join file to simply be a new technique, we fail to understand not only what joining is all about, but also what the AS/400 is all about. Joining files is not a new tool to add to our coder's toolbox - it is part of the philosophy of relational database management systems.

What is Joining?

Joining is combining data from two or more files into one record format, based on common fields. In other words, to the user (a program, or a human using a query tool) each record appears to have come from one file although it really contains data from two or more files.

Joining doesn't retrieve any information that the programmer couldn't get just by reading multiple input files. It is common to read a primary file sequentially, and then use the RPG CHAIN operation or the COBOL random READ to get additional information from other files. Reading multiple input files doesn't increase system overhead by adding another logical file for the system to maintain. So why join?

To help us understand the rationale behind joining, let's review two relational database concepts you've probably heard of before: normalization and view.

Normalization is the process of breaking up some types of records into smaller records, without losing the ability to retrieve the data that was contained in the original record. Normalization will eliminate data redundancies and maximize the flexibility and useability of the data. With normalization, you'll find the number of physical files increasing - one file may become two or three. It may seem like quite a mess, but nothing you can't clean up with views.

The term "view" refers to the fact that a user doesn't have to see the data in exactly the way it is stored in the database. A logical file which includes only certain fields is a view. A logical file which includes only certain records (through select and omit criteria) is also a view. Logicals which concatenate fields into other fields are views. Views allow each user to see the data in the way that he wants to see it.

Your users don't care if you store their data in one big file or a dozen small ones. They do care if you don't give them all the information they need. If you give them too much information, you may lose your job. Views let your database appear differently to different people.

You should join files for two reasons: one, to be able to normalize your database; and two, to present data to users (both programs and humans) in the way that they wish to see it.

Types of Joins

Relational database textbooks are full of pages explaining many types of joins. For our purposes, we need to consider only three: the inner join, the partial outer join, and the exception join. These are distinguished by what happens when a record from the primary join file has no matching records in the secondary files. Figures 1 and 2 contain data from two files, customer master and sales history, which we'll use in our illustrations.

If you have joined files with Query/36, you're already familiar with the inner join. Inner joins create resultant records for records with matching values in both files. Let's join the sales history file to the customer master file with an inner join. The records we get are shown in 3. Notice that order 50013 is not in the resultant file because the customer master file has no records with customer number 106.

If you have joined files with Query/36, you're already familiar with the inner join. Inner joins create resultant records for records with matching values in both files. Let's join the sales history file to the customer master file with an inner join. The records we get are shown in Figure 3. Notice that order 50013 is not in the resultant file because the customer master file has no records with customer number 106.

If we want to see all history records, we'll have to use a partial outer join. Like the inner join, the partial outer join retrieves records with matching values. Unlike the inner join, the records which have no matching value in the secondary file will be joined to a null record, a nonexistent record of default values.

4 illustrates the same join as before, but using a partial outer join instead of an inner join. Order 50013 has been joined to a nonexistent blank customer record, which has a blank customer name and zero sales rep number. Default values for character fields are blanks and for numeric fields, zero. To give a field a different default value, use the DFT keyword in the DDS.

Figure 4 illustrates the same join as before, but using a partial outer join instead of an inner join. Order 50013 has been joined to a nonexistent blank customer record, which has a blank customer name and zero sales rep number. Default values for character fields are blanks and for numeric fields, zero. To give a field a different default value, use the DFT keyword in the DDS.

The e 5 shows the results of an exception join. Since order 50013 is the only one with no matching customer number in the customer master file, it is the only order returned by the join.

Another way of looking at it is thus: inner + exception = partial outer. Those that join successfully, plus those that don't, yield all records.

Join Logical Files

Join logical files are the easiest way to define joins on the S/38 and AS/400. Once you define and compile a join logical, your programs can use it as they do any file, with a few restrictions.

Before we get into the limitations of join logicals, let's study the DDS required to join the sales history file and the customer master file (shown in 6).

Before we get into the limitations of join logicals, let's study the DDS required to join the sales history file and the customer master file (shown in Figure 6).

The first keyword in this file, even before the record format definition, is JDFTVAL (Join Default Value). If this keyword is present in the DDS, the join will be a partial outer join. Otherwise, the join is an inner join.

The JFILE keyword in the record format definition tells which files to join. The first file listed, in this case the SLSHIST file, is the primary file. All other files are secondary.

A join specification, coded with a J in column 17, tells how the files are to be joined. One join specification is required for every pair of files. (Joining three files would require two join specs; four files would require three, and so on.) The JOIN keyword, which is optional if only two files are being joined, tells which pair of files this specification is joining. The JFLD (Join Field) keyword tells which fields to match. Here, the customer number has the same name in both files.

The fields to be included in the join logical must be listed next. If an included field is found in more than one file, you must indicate which file's value is to be used. JREF(1) (Join Reference) means that the value of CUSNBR is to be taken from the first file, SLSHIST.

Defining a key field lets us read it in a sorted sequence. All key fields must be from the primary file.

You may also include select/omit specs, but if you do, you must add the DYNSLT (Dynamic Select) keyword at the file level (before the R spec).

Join logicals have a few limitations which you need to keep in mind. One, you can't do exception joins with join logicals. Two, the key fields must all come from the primary file. Three, join logicals can only join physical files. (Ideally, you shouldn't have to worry about whether a file is a physical or a logical.) Four, you can't update through a join logical file. Instead, you must update through a physical or a non-join logical. You can get around all of these limitations by using Open Query File (OPNQRYF), except for the last one.

Joining files with OPNQRYF

If you want to get around limitations one and three, you'll have to let OPNQRYF join the files. OPNQRYF handles inner, partial outer, and exception joins with the JDFTVAL parameter, which looks like its DDS counterpart, but differs in that it takes a parameter value. JDFTVAL(*NO), the default, is for inner joins. You can get a partial outer join with JDFTVAL(*YES). If you want an exception join, specify JDFTVAL(*ONLYDFT). If any of the files being joined is itself a join logical, it must agree with this parameter.

To overcome the second limitation, when you need to order a join logical by a field from the secondary file, specify the key field on an OPNQRYF command. 7 illustrates how we could sort the CUSHIST file on sales rep number, which comes from the secondary file.

To overcome the second limitation, when you need to order a join logical by a field from the secondary file, specify the key field on an OPNQRYF command. Figure 7 illustrates how we could sort the CUSHIST file on sales rep number, which comes from the secondary file.

You'll also have to work with the FILE, JFLD, FORMAT, and MAPFLD parameters when joining with OPNQRYF. List the files you are joining, primary first, in the FILE parameter. Use the FORMAT parameter to tell OPNQRYF how the high-level language expects the data to look. Put the name of any file here, and if the file is a multi-format file, add the name of the format to use. (If no file has exactly the fields you need, create a physical file with no members in it.) Be sure to code the same file in your HLL program as well. When you override the format file to the primary join file, the program will not read the actual data, but the query.

The JFLD parameter, like the JFLD keyword in DDS, tells which fields to use to join the files. Its syntax is different, in that one JFLD parameter is used in place of several JFLD keywords.

If the same field name is found in more than one file, use the MAPFLD parameter to indicate from which file to retrieve the field value.

8 illustrates these keywords. WORK is a file which has all of the fields needed for the HLL program. The actual data will come from the sales rep master file and the join logical called CUSHIST. Since CUSHIST is a partial outer join logical, OPNQRYF must also indicate a partial outer join with JDFTVAL(*YES). The SLSREP field is used to make the join. The MAPFLD tells OPNQRYF to get the value of SLSREP from the sales rep master file. (If we told it to get the value of SLSREP from the CUSHIST file, it would return a zero sales rep number for sales reps with no sales.)

Figure 8 illustrates these keywords. WORK is a file which has all of the fields needed for the HLL program. The actual data will come from the sales rep master file and the join logical called CUSHIST. Since CUSHIST is a partial outer join logical, OPNQRYF must also indicate a partial outer join with JDFTVAL(*YES). The SLSREP field is used to make the join. The MAPFLD tells OPNQRYF to get the value of SLSREP from the sales rep master file. (If we told it to get the value of SLSREP from the CUSHIST file, it would return a zero sales rep number for sales reps with no sales.)

For more examples of joining with OPNQRYF, see the CL Reference manual and my articles in the May and June issues of Midrange Computing.

Be aware that OPNQRYF does have its performance problems. It does tend to hog resources (files, memory, and CPU cycles). OPNQRYF creates temporary access paths as needed, which slows the job as well as the system.

Use join logicals whenever possible for jobs that run frequently. For those jobs that run infrequently, use OPNQRYF so the system doesn't have to maintain a rarely used logical file. Some frequently run jobs may require features not available with join logicals, so you will have to use OPNQRYF. If you do, build supporting logicals so that OPNQRYF won't have to create access paths every time these jobs run.

Try It, You'll Like It!

If you can code a join with DDS and OPNQRYF, you'll have no problem with joins in other products, such as SQL/400, AS/400 Query, or ASC's SEQUEL. Actually, you'll find that learning how to join files is easy. The hard part is making yourself do it. We RPG programmers love the CHAIN op code, and the idea of writing programs that read one (and only one) input file is, well, different. But that is the shape of things to come.

You'll never be convinced of the importance of views, including joins, until you see what they can do for you, so here is an experiment you can try. Pick a user who bugs you to death with requests for new reports. Build a join logical which includes every piece of information he needs. Give him a query package, or write a generic RPG report program with a prompt screen and OPNQRYF over it, to let him pick the sort sequence and the selection of records. Each time I've used this approach, the users ended up generating more of their reports, leaving me with more time for more important things.

So on top of learning about data queues, subsystems, and job descriptions, you've got a different way to write applications. But you didn't pay all that money for an AS/400 so you could keep doing everything the same way.

A QUICK REVIEW OF THE FIRST THREE NORMAL FORMS

Normalization is a set of guidelines for file design. These guidelines, known as normal forms, help prevent redundant data in your database.

Suppose you must store some employee data for a personnel department. For each employee you must record employee number, name, the number and name of the department in which he works, the state in which he works, the code for the plant where he works, the numbers and names of training courses he has completed, and the dates he completed them.

You might try to store all the data in one file, like this:

File 1: EMP#,EMP_NAME,DEPT#,DEPT_NAME,STATE, PLANT,COURSE#1,COURSE_NAME1,COURSE_ DATE1,COURSE#2,COURSE_NAME2,COURSE_ DATE2,COURSE#3,COURSE_NAME3,COURSE_ DATE3

The underlined field is the key. (This should not be confused with access path. In relational database terminology, the key is the group of one or more fields which distinguishes one record from another. On the S/38 and AS/400, relational keys are implemented as access paths with the UNIQUE keyword.)

You can already see some problems with this approach. For how many courses should you leave room? If management changes the name of Department 4 from Personnel to Human Resources, will all Department 4 records get updated? Where do you store the name of a new course no one has yet completed?

Let's look at how normalization takes care of such problems.

FIRST NORMAL FORM

A file is in first normal form (1NF) if it doesn't have repeating groups. The obvious violation of 1NF in this example is the course information. We can convert this file to first normal form by placing each completed course into a record of another file, whose key will be a composite of employee number and course number.

In 1NF, the data in our example is stored in two files:

File 1: EMP#,EMP_NAME,DEPT#,DEPT_NAME,STATE,PLANT

File 2: #EMP#,COURSE#,COURSE_NAME,COURSE_DATE

This is better than before for two reasons. One, we can store as many or as few courses per employee as we like. Two, to find out which employees have completed course 434, we search one field, not three.

SECOND NORMAL FORM

Our new database design still isn't perfect. If someone enters the wrong course name in a record, there will be more than one course name for the same course number, and the database will be contaminated. There really is no need to store the course name for every employee who has completed it. Because course name is functionally dependent on (can be determined from, is a fact about) only part of a composite key (a key made up of two or more fields), this file is not in second normal form. We need to put the course name in another file, with course number as the key.

In 2NF, the data in our example is stored in three files:

File 1: EMP#,EMP_NAME,DEPT#,DEPT_NAME,STATE,PLANT

File 2: #EMP#,COURSE#,COURSE_DATE

File 3: COURSE#,COURSE_NAME

THIRD NORMAL FORM

The design of our employee database is improving, but still has one problem. The department name in the first file is functionally dependent on a non-key field, department number. We need to move department name to another file to achieve third normal form.

In 3NF, the data in our example is stored in four files:

File 1: EMP#,EMP_NAME,DEPT#,STATE,PLANT

File 2: #DEPT#,DEPT_NAME

File 3: EMP#,COURSE#,COURSE_DATE

File 4: COURSE#,COURSE_NAME

This database design allows us to store each non-key data value one time. Key values are stored redundantly in order to join files. We can retrieve any information in the database from a single file or from a join file of two or more files.

OTHER NORMAL FORMS

There are other normal forms, but these are the most important. If your database is consistent with 3NF, you should be able to store and retrieve anything with no problem. And your database probably will be in fourth or even fifth normal form without any effort on your part.

IT'S COMMON SENSE

Some people wonder why the big fuss over normalization. "It's just common sense," they argue. Well, maybe. I've worked on systems that violated all three of these rules, and you probably have, too. With the technology we had at the time, we often had to violate these rules.

Most of us would have probably designed a course master file and a department master file before writing the first line of code for our fictitious personnel system. Let that encourage you. It means that normalization is not really as new or as difficult as it sounds.

DDS Reference (SC21-9620) Data Base Guide (SC21-9659) Control Language Reference (SC21-9778)


Joining Files

Figure 1 Customer master file CUSMAST

 Figure 1. Customer master file CUSMAST Cust Cust Sales No. Name Rep 100 BAKER 41 101 JONES 85 102 SMITH 7 103 DOE 41 104 BLACK 7 
Joining Files

Figure 2 Sales history file SLSHIST

 Figure 2. Sales history file SLSHIST Order Cust Date Item Unit No. No. Of Sale No. Qty Price 50011 104 90/06/04 T310 4 100.00 50012 100 90/06/04 M099 1 500.00 50013 106 90/06/04 T314 3 250.00 50014 104 90/06/05 T310 8 95.00 
Joining Files

Figure 3 Inner join with sales history and customer master

 Figure 3. Inner join with sales history and customer master Order Cust Date Item Unit Cust Sales No. No. Of Sale No. Qty Price Name Rep 50011 104 90/06/04 T310 4 100.00 BLACK 7 50012 100 90/06/04 M099 1 500.00 BAKER 41 50014 104 90/06/05 T310 8 95.00 BLACK 7 
Joining Files

Figure 4 Partial outer join, sales history and customer mst

 Figure 4. Partial outer join with sales history and customer master Order Cust Date Item Unit Cust Sales No. No. Of Sale No. Qty Price Name Rep 50011 104 90/06/04 T310 4 100.00 BLACK 7 50012 100 90/06/04 M099 1 500.00 BAKER 41 50013 106 90/06/04 T314 3 250.00 0 50014 104 90/06/05 T310 8 95.00 BLACK 7 
Joining Files

Figure 5 Exception join with sales history and customer mst

 Figure 5. Exception join with sales history and customer master Order Cust Date Item Unit Cust Sales No. No. Of Sale No. Qty Price Name Rep 50013 106 90/06/04 T314 3 250.00 0 
Joining Files

Figure 6 DDS for join logical file CUSHIST

 Figure 6. DDS for join logical file CUSHIST A* JOIN SALES HISTORY & CUSTOMER MASTER FILE A A JDFTVAL A R CUSHISTR JFILE(SLSHIST CUSMAST) A J JOIN(SLSHIST CUSMAST) A JFLD(CUSNBR CUSNBR) A ORDNBR A CUSNBR JREF(1) A SLSDAT A ITMNBR A QTY A UPRICE A CUSNAM A SLSREP A K ORDNBR 
Joining Files

Figure 7 Ordering a join LF on field from secondary file

 Figure 7. Ordering a join logical on a field from a secondary file OVRDBF FILE(CUSHIST) SHARE(*YES) OPNQRYF FILE((CUSHIST)) KEYFLD((SLSREP)) CALL HLLPGM CLOF OPNID(CUSHIST) DLTOVR FILE(CUSHISTR) 
Joining Files

Figure 8 Sample join with OPNQRYF

 Figure 8. Sample join with OPNQRYF OVRDBF FILE(WORK) TOFILE(REPMAST) SHARE(*YES) OPNQRYF FILE((REPMAST) (CUSHIST)) + FORMAT(WORK) + JFLD((SLSREP SLSREP)) + JDFTVAL(*YES) + MAPFLD(REPMAST/SLSREP) CALL HLLPGM CLOF OPNID(REPMAST) DLTOVR FILE(WORK) 
BLOG COMMENTS POWERED BY DISQUS

LATEST COMMENTS

Support MC Press Online

$0.00 Raised:
$