JavaJournal: Fun with Parsing

How could parsing possibly be fun? Read on and find out. I don't know about you, but sometimes I get caught up in the day-to-day grind of programming and forget that I didn't go into this field just for the babes and the money. I went into programming because I like to write programs. My father-in-law has always advised me, and everybody else within earshot, that if you take a job doing something you love to do, you will never have to work a day in your life. So I find it's a good idea a couple of times a year to lock myself in the den and write something not to satisfy a customer's request for a new feature or product, but rather just to program for the sheer joy of programming. I find these programming binges rejuvenating, and I usually emerge from my den with a sense of accomplishment and my batteries recharged.

Two prerequisites must be satisfied in order for me to go on one of my programming binges: time and inspiration. Time, which I used to spend freely before I had a family, is now a precious commodity. So, to justify the time spent programming, I write articles like this one. This brings me to the second prerequisite, which is inspiration. The inspiration for my most recent binge and the focus of this article came from a friend of mine (Damien Chavarria) who runs DAP-G, a community-based Web site for amateur photographers. It's a place where photo enthusiasts can upload their portfolios and share them with similarly obsessed individuals. They have theme-based contests, articles on photography, forums, and all manner of other features that people have grown to expect from similar sites. I don't think of myself as a photographer by a long shot, but I do take a lot of pictures with my digital camera, and by "a lot of pictures" I mean thousands per year. If I could devise a way of flipping through them, I would probably have a time-lapse movie of my daughter's first year. I've noticed that the pictures people upload to Damien's Web site always have a certain amount of metadata associated with them, including the camera make and model used, date the image was taken, exposure time, and F number (aperture). Damien informed me that when you upload a JPEG image to his site, this information is parsed out of the file. So what other information might be locked away in these files? I had found my inspiration and decided to go on a binge. Let's start by looking at the JPEG file format together, and then let's take a look at some code (less than 200 lines, I promise) that can get us some basic image information, such as the make and model of the camera the image was taken with, the date and time the image was taken, and finally the x and y resolution of the image.

JPEG File Format

We all see JPEG files every day by the hundreds or even thousands. They are one of the two ubiquitous image file formats on the 'net, the other being GIF. JPEG, which is named for the Joint Photographic Experts Group that initially put forth the standard, is a "lossy" compression technique for storing images. "Lossy" means that when the image is decompressed, it is not an exact duplicate of the original.

However, never fear. Note that the E in JPEG stands for expert, and these experts are snobs about quality. They also know the limitation of human ocular perception and how to exploit it, so even though a computer can tell you that the original image and the JPEG image are not pixel-identical, you will have a very difficult time telling the difference. It should also be noted that JPEG itself actually refers to the compression technique, not the file format. Just because they are experts doesn't mean they agree on much. In fact, even though they did come up with an excellent compression technique, they were unable to agree on a file format until recently.

The official file format is called Still Picture Interchange File Format (SPIFF), and it follows a simple pattern. Each file is composed of a small header followed by a number of segments. Each segment consists of a marker, followed by a segment length, followed by the segment data. This makes for an easy top-down design for our parser. After reading the header, we construct a framework that reads a segment marker, then reads the segment length, and then skips to the next segment marker, based on the length it just read. Repeat this until you reach the end of the file, and you have the general framework for a parser. Now, one by one, we can add in code for parsing each segment type that we are interested in.
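
The marker, length, skip loop described above can be sketched on a tiny in-memory byte array. This is only an illustration under stated assumptions: the class name SegmentWalkSketch, the listSegments helper, and the hand-built sample bytes are made up for this example and are not part of the parser developed later in the article.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class SegmentWalkSketch {
  // Hand-built sample (an assumption for this sketch, not a real image).
  static final byte[] SAMPLE = {
    (byte) 0xFF, (byte) 0xD8,                          // start of image
    (byte) 0xFF, (byte) 0xE1, 0x00, 0x04, 0x41, 0x42,  // APP1, length 4, 2 data bytes
    (byte) 0xFF, (byte) 0xDB, 0x00, 0x03, 0x00,        // another segment, length 3
    (byte) 0xFF, (byte) 0xDA                           // start of scan
  };

  // Returns one "type:length" entry per segment and stops at start of scan.
  static List<String> listSegments(byte[] data) {
    List<String> found = new ArrayList<>();
    ByteBuffer buf = ByteBuffer.wrap(data);            // big endian by default
    if ((buf.getShort() & 0xFFFF) != 0xFFD8) {
      throw new IllegalArgumentException("Missing start-of-image marker");
    }
    while (buf.remaining() >= 2) {
      int prefix = buf.get() & 0xFF;                   // every marker starts with 0xFF
      int type = buf.get() & 0xFF;                     // second byte names the segment
      if (prefix != 0xFF) {
        throw new IllegalArgumentException("Expected marker prefix 0xFF");
      }
      if (type == 0xDA) {                              // compressed image data follows
        found.add("da:scan");
        break;
      }
      int length = buf.getShort() & 0xFFFF;            // includes its own 2 bytes
      found.add(Integer.toHexString(type) + ":" + length);
      buf.position(buf.position() + length - 2);       // jump to the next marker
    }
    return found;
  }

  public static void main(String[] args) {
    System.out.println(listSegments(SAMPLE));          // prints [e1:4, db:3, da:scan]
  }
}
```

Because each segment length counts its own two bytes, the sketch moves forward by length - 2 after reading it.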

Well, it's almost that easy. Because it took a while for the international standards experts to come to an agreement on file format, other experts motivated more by getting products to market than arguing over bits and bytes came up with their own extensions to early drafts of the SPIFF file format. The differences are minor and easily manageable. This group, known as the Japan Electronics and Information Technology Industries Association (JEITA), came up with the Exchangeable Image File Format for digital still cameras, which is simply known as Exif. If you ask me, it should have been called EIFF, but then I don't speak Japanese.

Exif specifies two segments for a file: APP1 and APP2. The data that we are interested in is in APP1 and is stored in what is known as the Tagged Image File Format (TIFF). TIFFs consist of a header followed by one or more Image File Directories (IFDs). Each IFD is composed of a series of attributes. Each attribute has four elements: tag, type, count, and value offset. Tag tells us the name of the attribute being stored. In our case, we are looking for make, model, date, time, x-resolution, and y-resolution. Type tells us the domain of the data. Examples are byte, ASCII, short, long, and rational. Count tells us the number of values stored. Value Offset should probably be called "value or offset." The idea is that if the data can be contained within the 4 bytes allocated to this element, then this element contains the data itself. If the data is larger than 4 bytes, this field contains the offset to the actual data. My guess is that the writers of this spec were C programmers. Anyway, this file format is a lot like a set of Russian matryoshka dolls, with containers containing containers containing containers, but the good thing is that it all follows rather simple rules. The following hierarchy summary should help you visualize the file format:

JPEG file
  Header (also known as start of image)
  App1 Segment
    Length
    TIFF Header
    IFD (0 or more)
      Attributes (0 or more)
        Tag
        Type
        Count
        Value Offset
      Attribute data that didn't fit in Value Offset element
  Image Segment (aka start of scan)
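
To make the value-or-offset rule concrete, here is a rough sketch that decodes a single 12-byte attribute. The class IfdEntrySketch, its describe method, and the two sample entries are hypothetical. The per-type sizes (1 byte for byte and ASCII, 2 for short, 4 for long, 8 for rational) follow the TIFF type definitions, and the sketch ignores every other type.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IfdEntrySketch {
  // Bytes per value for TIFF types 1..5: byte, ASCII, short, long, rational.
  static final int[] TYPE_SIZE = {0, 1, 1, 2, 4, 8};

  // Two hand-built little endian entries (assumptions for this sketch).
  static final byte[] MAKE_ENTRY = {
    0x0F, 0x01, 0x02, 0x00, 0x06, 0x00, 0x00, 0x00, (byte) 0x9E, 0x00, 0x00, 0x00
  };
  static final byte[] X_ENTRY = {
    0x02, (byte) 0xA0, 0x04, 0x00, 0x01, 0x00, 0x00, 0x00, (byte) 0xE0, 0x08, 0x00, 0x00
  };

  static String describe(byte[] entry, ByteOrder order) {
    ByteBuffer buf = ByteBuffer.wrap(entry).order(order);
    int tag = buf.getShort() & 0xFFFF;
    int type = buf.getShort() & 0xFFFF;
    long count = buf.getInt() & 0xFFFFFFFFL;
    // Reading the whole field as one int is a simplification that suits
    // single little endian values; shorter inline values are left-justified.
    long field = buf.getInt() & 0xFFFFFFFFL;
    if (type < 1 || type >= TYPE_SIZE.length) {
      return String.format("tag=0x%04X type=%d (not handled in this sketch)", tag, type);
    }
    // The data lives in the field itself only if it fits in 4 bytes.
    boolean inline = count * TYPE_SIZE[type] <= 4;
    return String.format("tag=0x%04X type=%d count=%d %s=%d",
        tag, type, count, inline ? "value" : "offset", field);
  }

  public static void main(String[] args) {
    // prints tag=0x010F type=2 count=6 offset=158
    System.out.println(describe(MAKE_ENTRY, ByteOrder.LITTLE_ENDIAN));
    // prints tag=0xA002 type=4 count=1 value=2272
    System.out.println(describe(X_ENTRY, ByteOrder.LITTLE_ENDIAN));
  }
}
```

The six-character make string does not fit in four bytes, so its field is an offset; the single long value fits, so its field is the value itself.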

Endians

One messy detail that we do have to deal with is endians. This is basically whether the bytes that store a number begin with the least significant byte or the most significant byte. In our day-to-day lives, we normally deal with big endian numbers. If I write the number 1234, then 1 is the most significant digit and represents the thousands digit, and 4 is the least significant digit and represents the ones digit.

So we have two things to address here. First, all the numbers that we are going to deal with are in hexadecimal. Second, although the file format specs allow for either big or little endian encodings, most often little endian is used. So a 2-byte long number in hexadecimal where each byte is represented by two digits stored as a big endian would be ABCD, but the same number stored as little endian would be CDAB. In other words, the high and low bytes are interchanged. This gets more complicated for a 4-byte number, where the first and fourth byte are interchanged, as are the second and third byte. So 89ABCDEF in big endian becomes EFCDAB89 in little endian. Don't worry. I will provide convenience methods for dealing with this bit twiddling.
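
If you want to see the byte swapping in action, java.nio.ByteBuffer can read the same bytes in either order. This is a small sketch with hypothetical names (ByteOrderSketch, read16, read32), using the two numbers from the paragraph above:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderSketch {
  // Interprets the first two bytes as an unsigned 16-bit number.
  static int read16(byte[] bytes, ByteOrder order) {
    return ByteBuffer.wrap(bytes).order(order).getShort() & 0xFFFF;
  }

  // Interprets the first four bytes as a 32-bit number.
  static int read32(byte[] bytes, ByteOrder order) {
    return ByteBuffer.wrap(bytes).order(order).getInt();
  }

  public static void main(String[] args) {
    byte[] two = {(byte) 0xAB, (byte) 0xCD};
    byte[] four = {(byte) 0x89, (byte) 0xAB, (byte) 0xCD, (byte) 0xEF};

    // prints ABCD CDAB
    System.out.printf("%04X %04X%n",
        read16(two, ByteOrder.BIG_ENDIAN), read16(two, ByteOrder.LITTLE_ENDIAN));
    // prints 89ABCDEF EFCDAB89
    System.out.printf("%08X %08X%n",
        read32(four, ByteOrder.BIG_ENDIAN), read32(four, ByteOrder.LITTLE_ENDIAN));
  }
}
```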

Parsing Example

The code in Figure 1 parses a JPEG file and looks for an Exif segment. Within that segment, it looks for the items we talked about above.

// JPEGParser.java
import java.io.RandomAccessFile;
import java.io.IOException;

public class JPEGParser
{
  private static final String inptFile = "Clyde.jpg";
  private static final int startOfImageToken = 0xFFD8;

  // Segment Tokens
  private static final byte app1Token = (byte)0xE1;
  private static final byte startOfScanToken = (byte)0xDA;

  // TIFF Attribute Tokens
  private static final int manufacturerToken = 0x10F;
  private static final int modelToken = 0x110;
  private static final int dateTimeToken = 0x9003;
  private static final int xDimensionToken = 0xA002;
  private static final int yDimensionToken = 0xA003;
  private RandomAccessFile in;

  public static void main(String[] args)
  {
    System.out.println("Start JPEGParser");
    JPEGParser myParser = new JPEGParser();
    myParser.parseJPEGFile();
    System.out.println("End JPEGParser");
  }

  public void parseJPEGFile()
  {
    try
    {
      in = new RandomAccessFile(inptFile, "r");
      int startOfImage = in.readUnsignedShort();
      if (startOfImage == startOfImageToken)
      {
        while (true)
        {
          // Find each of the segments
          String segmentMarker = this.byteToHexString(in.readByte());
          System.out.println("Segment Marker: " + segmentMarker);

          // Find segment type
          byte segmentType = in.readByte();
          if (segmentType == app1Token)
          {
            System.out.println("App1 Segment");
            this.parseApp1();
          }
          else if (segmentType == startOfScanToken)
          {
            System.out.println("Start of Scan Segment");
            return;
          }
          else
          {
            System.out.println("Unknown Segment");
            this.parseSegment();
          }
          System.out.println("----------------");
        }
      }
    }
    catch (java.io.FileNotFoundException ex)
    {
      System.out.println("File not found");
    }
    catch (java.io.IOException ex)
    {
      System.out.println("Parse exception");
    }
  }

  void parseApp1() throws java.io.IOException
  {
    long startOfSegment = in.getFilePointer();

    // Segment length
    int segmentLength = in.readUnsignedShort();
    System.out.println(" Length: " + segmentLength);
    String indentifierString = this.stringRead(4);
    System.out.println(" Identifier: " + indentifierString);

    //eat the pad
    in.skipBytes(2);

    // start TIFF structure
    long startTIFF = in.getFilePointer();

    // Byte Order 2 bytes
    in.skipBytes(2);

    // 42 - just in case you didn't read the Byte Order
    in.skipBytes(2);
    int nextIFDOffset = in.readByte();
    in.skipBytes(3);
    int IFDNumber = 0;
    while (nextIFDOffset != 0)
    {
      System.out.println("  IFD: " + IFDNumber);
      parseIFD(startTIFF);
      nextIFDOffset = in.readByte();
      in.skipBytes(3);
      in.skipBytes(nextIFDOffset);
      IFDNumber++;
    }

    // Skip to next segment
    in.seek(startOfSegment + segmentLength);
  }

  private void parseIFD(long startTIFF) throws IOException
  {
    int TIFFAttributes = this.unsignedShortRead();
    for (int index = 0; index < TIFFAttributes; index ++)
    {
      int tag = EndianConverter.LEtoBE((short)in.readUnsignedShort());
      int type = 
        EndianConverter.LEtoBE((short)in.readUnsignedShort());
      int count = EndianConverter.LEtoBE(in.readInt());
      int valueOrOffset = EndianConverter.LEtoBE(in.readInt());
      if (tag == this.manufacturerToken)
      {
        long absoluteOffset = valueOrOffset + startTIFF;
        String manufacturer =
          this.indirectStringRead(absoluteOffset, count - 1);
        System.out.println("   Manufacturer: " + manufacturer);
      }
      else if (tag == this.modelToken)
      {
        long absoluteOffset = valueOrOffset + startTIFF;
        String model = 
          this.indirectStringRead(absoluteOffset, count - 1);
        System.out.println("   Model: " + model);
      }
      else if (tag == this.dateTimeToken)
      {
        long absoluteOffset = valueOrOffset + startTIFF;
        String dateTime = 
          this.indirectStringRead(absoluteOffset, count - 1);
        System.out.println("   Date Time: " + dateTime);
      }
      else if (tag == this.xDimensionToken)
      {
        System.out.println("   X: " + valueOrOffset);
      }
      else if (tag == this.yDimensionToken)
      {
        System.out.println("   Y: " + valueOrOffset);
      }
      else
      {
        //System.out.println("   Unknown TIFF Tag");
      }
    }
  }

  void parseSegment() throws java.io.IOException
  {
    long startOfSegment = in.getFilePointer();

    // Segment length
    int segmentLength = in.readUnsignedShort();
    System.out.println(" Segment Length: " + segmentLength);

    // Skip to next segment
    in.seek(startOfSegment + segmentLength);
  }

  public int unsignedShortRead() throws java.io.IOException
  {
    return EndianConverter.LEtoBE((short)in.readUnsignedShort());
  }

  public String stringRead(int length) throws java.io.IOException
  {
    byte[] inputBuffer = new byte[length];
    in.readFully(inputBuffer);
    return new String(inputBuffer);
  }

  public String indirectStringRead(long absoluteOffset, int length)
    throws java.io.IOException
  {
    long saveReadPoint = in.getFilePointer();
    in.seek(absoluteOffset);
    byte[] inputBuffer = new byte[length];
    in.readFully(inputBuffer);
    in.seek(saveReadPoint);
    return new String(inputBuffer);
  }

  public static String byteToHexString(byte myByte)
  {
    // Mask to 0..255 first so that bytes below 0x80 also yield two digits
    return Integer.toHexString((myByte & 0xFF) | 0x100).substring(1);
  }
}

Figure 1: This code parses a JPEG file and looks for an Exif segment.

This is far less complicated than it looks. Let's break it down piece by piece:

import java.io.RandomAccessFile;
import java.io.IOException;


We only need two classes from the JDK: RandomAccessFile, which we will use for reading our file, and IOException, which we will throw only if something really bad happens.

public class JPEGParser
{
  private static final String inptFile = "Clyde.jpg";
  private static final int startOfImageToken = 0xFFD8;

  // Segment Tokens
  private static final byte app1Token = (byte)0xE1;
  private static final byte startOfScanToken = (byte)0xDA;

  // TIFF Attribute Tokens
  private static final int manufacturerToken = 0x10F;
  private static final int modelToken = 0x110;
  private static final int dateTimeToken = 0x9003;
  private static final int xDimensionToken = 0xA002;
  private static final int yDimensionToken = 0xA003;
  private RandomAccessFile in;


Next, we declare our top-level class JPEGParser, followed by some private static final constants, which are the actual hex values of tags and tokens within the files. Generally, I'm a big fan of using symbolic constants in this way, but I have heard it argued that when dealing with parsers, it is better to hard-code your tokens so that it is easier to debug. Here, we also create a class-level attribute for our RandomAccessFile so that we don't have to pass it as a parameter to each method.

  public static void main(String[] args)
  {
    System.out.println("Start JPEGParser");
    JPEGParser myParser = new JPEGParser();
    myParser.parseJPEGFile();
    System.out.println("End JPEGParser");
  }


Here, we have our entry point, complete with a couple of system outs for marking the beginning and ending of our program output as well as the instantiation of our JPEGParser object and a call to the parseJPEGFile() method.

  public void parseJPEGFile()
  {
    try
    {
      in = new RandomAccessFile(inptFile, "r");
      int startOfImage = in.readUnsignedShort();
      if (startOfImage == startOfImageToken)
      {
        while (true)
        {
          // Find each of the segments
          String segmentMarker = this.byteToHexString(in.readByte());
          System.out.println("Segment Marker: " + segmentMarker);

          // Find segment type
          byte segmentType = in.readByte();
          if (segmentType == app1Token)
          {
            System.out.println("App1 Segment");
            this.parseApp1();
          }
          else if (segmentType == startOfScanToken)
          {
            System.out.println("Start of Scan Segment");
            return;
          }
          else
          {
            System.out.println("Unknown Segment");
            this.parseSegment();
          }
          System.out.println("----------------");
        }
      }
    }
    catch (java.io.FileNotFoundException ex)
    {
      System.out.println("File not found");
    }
    catch (java.io.IOException ex)
    {
      System.out.println("Parse exception");
    }
  }


We first open the specified file and consume the Start Of Image header. After that, we begin the outer loop of our parser, where we scan through each of the segments looking for either an App1 or a Start Of Scan segment. When we find an App1 segment, we make a call to the parseApp1() method. If we find the Start Of Scan segment, we terminate parsing. Start Of Scan is always the last segment and contains the actual compressed image. All other segments we skip over by calling the parseSegment() method. Finally, we catch any exceptions that might be thrown.

  void parseApp1() throws java.io.IOException
  {
    long startOfSegment = in.getFilePointer();

    // Segment length
    int segmentLength = in.readUnsignedShort();
    System.out.println(" Length: " + segmentLength);
    String indentifierString = this.stringRead(4);
    System.out.println(" Identifier: " + indentifierString);

    //eat the pad
    in.skipBytes(2);

    // start TIFF structure
    long startTIFF = in.getFilePointer();

    // Byte Order 2 bytes
    in.skipBytes(2);

    // 42 - just in case you didn't read the Byte Order
    in.skipBytes(2);
    int nextIFDOffset = in.readByte();
    in.skipBytes(3);
    int IFDNumber = 0;


This first part of the parseApp1() method is where we consume the App1 and TIFF headers. We pull out the segment length as well as the identifier string, which will be literally "Exif". We then consume a 2-byte pad and are ready to parse the TIFF header. We save the file position of the start of the TIFF header for later use, because offsets inside the TIFF structure are measured from it. The TIFF header begins with a byte order that tells us whether our file uses big endian or little endian notation. Just in case you ignored the byte order or if you want to use simpler logic to determine it, the next two bytes contain the number 42 encoded using the correct endian for the file. In my example, I assume the file is little endian just to keep things as simple as possible. Next, we read the low byte of the 4-byte offset to the first IFD, skip its remaining three bytes, and get ready to perform the IFD parsing loop.
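
If you would rather not assume little endian, the byte order check can be sketched as follows. The two-character marks "II" (little endian) and "MM" (big endian) are what the TIFF header uses; the class TiffHeaderSketch and its method names are invented for this illustration, and the sample headers in main are hand-built.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class TiffHeaderSketch {
  // Maps the 2-byte mark at the start of the TIFF header to a byte order.
  static ByteOrder detectOrder(byte[] header) {
    if (header[0] == 'I' && header[1] == 'I') {
      return ByteOrder.LITTLE_ENDIAN;
    }
    if (header[0] == 'M' && header[1] == 'M') {
      return ByteOrder.BIG_ENDIAN;
    }
    throw new IllegalArgumentException("Not a TIFF byte order mark");
  }

  // Checks the magic number 42 and returns the offset of the first IFD,
  // measured from the start of the TIFF header.
  static long firstIfdOffset(byte[] header) {
    ByteBuffer buf = ByteBuffer.wrap(header).order(detectOrder(header));
    buf.position(2);                                  // step over the byte order mark
    if ((buf.getShort() & 0xFFFF) != 42) {
      throw new IllegalArgumentException("Missing TIFF magic number 42");
    }
    return buf.getInt() & 0xFFFFFFFFL;
  }

  public static void main(String[] args) {
    byte[] little = {'I', 'I', 42, 0, 8, 0, 0, 0};
    byte[] big = {'M', 'M', 0, 42, 0, 0, 0, 8};
    System.out.println(firstIfdOffset(little));       // prints 8
    System.out.println(firstIfdOffset(big));          // prints 8
  }
}
```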

    while (nextIFDOffset != 0)
    {
      System.out.println("  IFD: " + IFDNumber);
      parseIFD(startTIFF);
      nextIFDOffset = in.readByte();
      in.skipBytes(3);
      in.skipBytes(nextIFDOffset);
      IFDNumber++;
    }

    // Skip to next segment
    in.seek(startOfSegment + segmentLength);
  }


Here, we loop through each of the IFDs and, when finished, skip to the end of the segment.
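
The loop above reads a single byte of each next-IFD offset and skips forward by that amount. A stricter alternative, sketched below, reads each offset as a full 4-byte number and treats it as a position measured from the start of the TIFF header. This is not the code from Figure 1: the class IfdChainSketch, its methods, the little endian assumption, and the hand-built 32-byte sample are all assumptions made for the illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.ArrayList;
import java.util.List;

class IfdChainSketch {
  // Builds a tiny TIFF block: header, IFD0 with one entry, empty IFD1.
  static byte[] sample() {
    ByteBuffer b = ByteBuffer.allocate(32).order(ByteOrder.LITTLE_ENDIAN);
    b.put((byte) 'I').put((byte) 'I').putShort((short) 42).putInt(8); // header
    b.putShort((short) 1);                                            // IFD0: 1 entry
    b.putShort((short) 0xA002).putShort((short) 4).putInt(1).putInt(2272);
    b.putInt(26);                                                     // next IFD at 26
    b.putShort((short) 0).putInt(0);                                  // IFD1, end of chain
    return b.array();
  }

  // Returns the offset of every IFD in the chain, relative to the TIFF start.
  static List<Integer> ifdOffsets(byte[] tiff) {
    ByteBuffer buf = ByteBuffer.wrap(tiff).order(ByteOrder.LITTLE_ENDIAN);
    List<Integer> offsets = new ArrayList<>();
    int offset = buf.getInt(4);                       // first IFD offset, bytes 4..7
    while (offset != 0 && offsets.size() < 16) {      // cap guards against cycles
      offsets.add(offset);
      int entries = buf.getShort(offset) & 0xFFFF;    // 2-byte entry count
      // The next-IFD offset sits right after the 12-byte entries.
      offset = buf.getInt(offset + 2 + entries * 12);
    }
    return offsets;
  }

  public static void main(String[] args) {
    System.out.println(ifdOffsets(sample()));         // prints [8, 26]
  }
}
```

With a RandomAccessFile, the same idea amounts to calling seek() with the saved TIFF start plus the offset instead of skipping forward by a relative amount.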

  private void parseIFD(long startTIFF) throws IOException
  {
    int TIFFAttributes = this.unsignedShortRead();
    for (int index = 0; index < TIFFAttributes; index ++)
    {
      int tag = EndianConverter.LEtoBE((short)in.readUnsignedShort());
      int type = 
        EndianConverter.LEtoBE((short)in.readUnsignedShort());
      int count = EndianConverter.LEtoBE(in.readInt());
      int valueOrOffset = EndianConverter.LEtoBE(in.readInt());
      if (tag == this.manufacturerToken)
      {
        long absoluteOffset = valueOrOffset + startTIFF;
        String manufacturer =
          this.indirectStringRead(absoluteOffset, count - 1);
        System.out.println("   Manufacturer: " + manufacturer);
      }
      else if (tag == this.modelToken)
      {
        long absoluteOffset = valueOrOffset + startTIFF;
        String model = 
          this.indirectStringRead(absoluteOffset, count - 1);
        System.out.println("   Model: " + model);
      }
      else if (tag == this.dateTimeToken)
      {
        long absoluteOffset = valueOrOffset + startTIFF;
        String dateTime = 
          this.indirectStringRead(absoluteOffset, count - 1);
        System.out.println("   Date Time: " + dateTime);
      }
      else if (tag == this.xDimensionToken)
      {
        System.out.println("   X: " + valueOrOffset);
      }
      else if (tag == this.yDimensionToken)
      {
        System.out.println("   Y: " + valueOrOffset);
      }
      else
      {
        //System.out.println("   Unknown TIFF Tag");
      }
    }
  }


Here, we have the entire parseIFD() method. For each attribute, we pull out the tag, type, count, and valueOrOffset. We then check each of the tags to see if it is one of the tags that we are interested in; if so, we parse out its value appropriately, using offsets if needed for larger chunks of data. Note again the C programming accent in that all ASCII-encoded fields like manufacturer and model are null terminated, so we read one fewer character than the count field calls for. Also note that the system out for unknown tags is currently commented out. This is because there are many tags that we are not currently interested in, but it is sometimes handy to print these out when debugging, so I left the statement in. Now, you can see how easy it would be to look for a new tag. You would just add in one more else if to this structure and a little bit of logic, and you'd have another piece of metadata. For a complete list of available tags, see the Exif documentation referenced previously.
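
As an example of the "little bit of logic" a new tag might need, consider the rational type, which stores a value as two 4-byte unsigned integers: a numerator followed by a denominator. The sketch below assumes a little endian file and uses made-up names (RationalSketch, readRational) and a hand-built 8-byte sample; in a real file the bytes would sit at the position given by the attribute's value offset field.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class RationalSketch {
  // Reads numerator and denominator at the given position and divides them.
  static double readRational(byte[] data, int offset) {
    ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN);
    long numerator = buf.getInt(offset) & 0xFFFFFFFFL;        // unsigned 32-bit
    long denominator = buf.getInt(offset + 4) & 0xFFFFFFFFL;  // unsigned 32-bit
    return (double) numerator / denominator;
  }

  public static void main(String[] args) {
    // 28 / 10, the way an aperture of f/2.8 could be stored (sample data).
    byte[] sample = {0x1C, 0x00, 0x00, 0x00, 0x0A, 0x00, 0x00, 0x00};
    System.out.println(readRational(sample, 0));              // prints 2.8
  }
}
```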

  void parseSegment() throws java.io.IOException
  {
    long startOfSegment = in.getFilePointer();

    // Segment length
    int segmentLength = in.readUnsignedShort();
    System.out.println(" Segment Length: " + segmentLength);

    // Skip to next segment
    in.seek(startOfSegment + segmentLength);
  }


This method is the default parser for segments of unknown types.

  public int unsignedShortRead() throws java.io.IOException
  {
    return EndianConverter.LEtoBE((short)in.readUnsignedShort());
  }

  public String stringRead(int length) throws java.io.IOException
  {
    byte[] inputBuffer = new byte[length];
    in.readFully(inputBuffer);
    return new String(inputBuffer);
  }

  public String indirectStringRead(long absoluteOffset, int length)
    throws java.io.IOException
  {
    long saveReadPoint = in.getFilePointer();
    in.seek(absoluteOffset);
    byte[] inputBuffer = new byte[length];
    in.readFully(inputBuffer);
    in.seek(saveReadPoint);
    return new String(inputBuffer);
  }


These three methods are convenience methods for doing direct and indirect reads of particular types. Most notably, the indirectStringRead() method is used for reading ASCII-encoded strings that are too big to fit in the value offset field and must therefore be read at an offset from the start of the TIFF structure.

  public static String byteToHexString(byte myByte)
  {
    // Mask to 0..255 first so that bytes below 0x80 also yield two digits
    return Integer.toHexString((myByte & 0xFF) | 0x100).substring(1);
  }


Finally, we have a convenience method for formatting a byte as a two-digit hexadecimal string.

So there you have it in a measly 195 lines of code. If you download the sample file Clyde.jpg from the Web site and run the code above against it, your output should look like this:

Start JPEGParser
Segment Marker: ff
App1 Segment
 Length: 7678
 Identifier: Exif
  IFD: 0
   Manufacturer: Canon
   Model: Canon PowerShot S400
  IFD: 1
   Date Time: 2003:09:10 11:48:46
   X: 2272
   Y: 1704
----------------
Segment Marker: ff
Unknown Segment
 Segment Length: 132
----------------
Segment Marker: ff
Unknown Segment
 Segment Length: 17
----------------
Segment Marker: ff
Unknown Segment
 Segment Length: 418
----------------
Segment Marker: ff
Start of Scan Segment
End JPEGParser

Endian Convenience Methods

OK, I did cheat a little bit. I left out one snippet of code: a class called EndianConverter that has a couple of static methods for converting between little endian and big endian.

// EndianConverter.java

public class EndianConverter
{
  public static int LEtoBE(short le)
  {
    return ((le >> 8) & 0x00FF) | ((le << 8) & 0xFF00);
  }

  public static int LEtoBE(int le)
  {
    return (le >>> 24) | (le << 24) |
           ((le << 8) & 0x00FF0000) | ((le >> 8) & 0x0000FF00);
  }
}
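
On Java 5 and later, the standard library offers the same swaps through Short.reverseBytes and Integer.reverseBytes. The sketch below is an alternative to the hand-written shifts, not something the parser depends on; the class and method names are made up for the example.

```java
class ReverseBytesSketch {
  // Swaps the two low bytes of the argument and returns them as 0..0xFFFF.
  static int swap16(int value) {
    return Short.reverseBytes((short) value) & 0xFFFF;
  }

  // Reverses all four bytes of a 32-bit value.
  static int swap32(int value) {
    return Integer.reverseBytes(value);
  }

  public static void main(String[] args) {
    System.out.printf("%04X%n", swap16(0xABCD));      // prints CDAB
    System.out.printf("%08X%n", swap32(0x89ABCDEF));  // prints EFCDAB89
  }
}
```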

Happy Parsing

So I've put together the framework here, which should be pretty easy to extend. Just copy and paste the code from here, compile it, and try running it against some JPEG files of your own. Then grab the Exif document from the link provided and add in some code to parse out a few more attributes.

Conclusion

The JPEG file format is very regimented, making it easy to parse. Although the focus of this article was purely on parsing JPEG files, it could be generalized to any similar image or video format. Even though it is a "lossy" compression technique, it is still ideal for images that will only be viewed by humans. The Exif extension to the standard SPIFF format contains a number of interesting pieces of metadata about the file. In fact, there are even attributes for GPS coordinates, so in the not-so-distant future when GPS receivers are ubiquitous, your camera will not only record the time you took a picture but also the location. There were really only two tricky parts to writing this parser. The first was dealing with the big and little endians, and the second was tracking down all the documentation on the file formats. Now that those are in place, it is time to refactor this general framework into several smaller, more manageable classes. I plan on continuing work on this project, so if you are interested in keeping tabs on it or would like to help contribute, just drop me an email. Parsing can be fun and so can going on a programming binge from time to time.

The example used in this article was created using Together Control Center 6.0.1 and was compiled using JDK Version 1.3.1. Also, if you are going to do anything with binary files, I strongly suggest you get a good hex editor. I like UltraEdit.

Michael J. Floyd is an extreme programmer and the Software Engineering Manager for DivXNetworks. He is also a consultant for San Diego State University.