Practical SQL: Selecting One of a Set of Duplicates

SQL
Typography
  • Smaller Small Medium Big Bigger
  • Default Helvetica Segoe Georgia Times

Today's article shows you how to take advantage of DB2 for i to fix an all-too-common problem.

 

Moving from traditional DDS environments to a pure SQL configuration requires some rethinking and sometimes a little re-engineering. One case has to do with duplicate keys. We sometimes end up with databases where we have "almost" duplicate records. That is, the same record (typically, some type of master data) gets written multiple times with nothing different except the timestamp or some other inconsequential field. It doesn't affect our application because when we CHAIN we get the first record of the set and since the meaningful data is the same for all records, we never notice. But then we go to SQL and we need to create a primary key over the file. We create our new DDL-defined table and attempt to copy our data, but the copy fails because of duplicate keys. Today I'll show you an easy way to clean that data up.

 

Problems with Selecting the Data

When moving from DDS to DDL, the best scenario is when all we have to do is create a new table with all the fields from the original file and then just copy the data in. So the basic process is pretty simple: we select all the fields from all the records from the first file and insert them into the second file. The problem arises when you have more than one record with the same key fields. And remember, I'm asserting that the rest of the important fields in the duplicate records are the same. If they're notif you have records with the same key but different datathen you have a different problem. But if the case is as I described and the only differences in the data are things like timestamps or change counts, then all you want is one record from each set; you don't really care which one.

 

If you're familiar with SQL, you're probably already saying to yourself, "Joe, all we need to do is use the DISTINCT keyword!" And yes, you would be right if all the fields were duplicated. But because the timestamps are different, you would get every record, including the "almost" duplicates. So DISTINCT doesn't really work very well. The next way to go would be to use GROUP BY. For that solution, you would have a GROUP BY clause on the key fields. However, as anyone who has worked with grouping in SQL knows, you can't then just select the non-key fields. Instead, you have to use some sort of aggregate (like MAX) on the other fields. You'll have to list every field in the file in the main SELECT clause with all non-key fields as aggregates (and then list all the important fields again in the GROUP BY). If it's a typical master record with dozens of fields, the result can get quite unwieldy. Imagine a customer master record with only one primary key field but dozens of data fields:

 

SELECT CMCUST, MAX(CMSTAT), MAX(CMNAME), MAX(CMADR1), (…), MAX(CMTIME)
FROM CUSTMAST GROUP BY CMCUST

 

As you can see, the more fields in the file, the longer and more difficult to maintain the statement becomes. Now, we RPG dinosaurs would like to point out that this is a perfect job for the RPG cycle. We'd simply put a good old L1 indicator on the key field(s) and then only do a write at L1 time, but we know that the cycle makes people's heads explode these days, so we need to figure out a more elegant solution. (And when we say "elegant," we really just mean one that doesn't use I-specs.)

 

The Solution

The genesis of this particular solution came from a discussion on the MIDRANGE-L mailing list. I watched as several people discussed the pros and cons of some of the approaches, and the answers became pretty convoluted. I recognized most of the approaches because I have run across this problem many times in the past and used nearly every technique imaginable, although I do admit that there were a couple of new ones in the list that I hadn't seen before. But generally speaking, most of the approaches used grouping in some way or another. One of the more creative approaches used FETCH FIRST ONLY in a subselect, but even that would eventually have required listing out all the fields.

 

I want to point out that if you're doing any sort of data conversion on some of the fields in the file as they move to the new table, such as converting numeric dates to date fields or other modernization techniques, or if you're dropping some fields, or if you're incorporating foreign data of some kind, then the massive SELECT statement may be the correct solution. Or better yet, an RPG program makes those conversions a bit easier to maintain. I try to avoid huge SQL statements that have bits of business logic strewn about throughout them; it's very hard to debug those things for the original programmer, much less for someone down the road who has to pick it up. But I'm going a bit far afield here; let's get back to the topic at hand.

 

So if you're not doing any conversions and if you're simply selecting all the fields from the old file to insert into the new table, I have a solution that works for every file, every time. Here's the technique:

 

SELECT * FROM CUSTMAST WHERE RRN(CUSTMAST) IN
(SELECT MAX(RRN(CUSTMAST)) FROM CUSTMAST GROUP BY CMCUST)

 

That's it! The only thing that changes is the file name, obviously, and the fields specified in the GROUP BY clause. Everything else stays the same. What you get is the data in the record that was most recently written, while MIN(RRN(CUSTMAST)) would get you the first record written. Well, we have to be a little bit careful with that: technically, you get the record with the highest relative record number. If your file has REUSEDLT specified, the relative record number (RRN) is not reliable as an indicator of position. But since we're declaring that the contents of the non-matching fields don't matter, it shouldn't matter which record you get.

 

What I think is neat about this solution is that it's unique to the IBM i, since other databases don't have a concept of a relative record number. That unique RRN lets us do a lot of interesting things. Take a look at Jim Staton's article on deleting duplicates using RRN and SQL. You'll see that the same concepts apply. I hope this helps you in some of your data cleanup and conversion tasks!

 

BLOG COMMENTS POWERED BY DISQUS

LATEST COMMENTS

Support MC Press Online

$0.00 Raised:
$