How could parsing possibly be fun? Read on and find out.
I don't know about you, but sometimes I get caught up in the day-to-day grind of
programming and forget that I didn't go into this field just for the babes and
the money. I went into programming because I like to write programs. My
father-in-law has always advised me, and everybody else within earshot, that if
you take a job doing something you love to do, you will never have to
work a day in your life. So I find it's a good idea a couple of times a
year to lock myself in the den and write something not to satisfy a customer's
request for a new feature or product, but rather just to program for the sheer
joy of programming. I find these programming binges rejuvenating, and I usually
emerge from my den with a sense of accomplishment and my batteries recharged.
Two prerequisites must be satisfied in order for me to go on one of my
programming binges: time and inspiration. Time, which I used to spend freely
before I had a family, is now a precious commodity. So, to justify the time
spent programming, I write articles like this one. This brings me to the second
prerequisite, which is inspiration. The inspiration for my most recent binge and
the focus of this article came from a friend of mine (Damien Chavarria) who runs
DAP-G, a community-based Web site for
amateur photographers. It's a place where photo enthusiasts can upload their
portfolios and share them with similarly obsessed individuals. They have
theme-based contests, articles on photography, forums, and all manner of other
features that people have grown to expect from similar sites. I don't think of
myself as a photographer by a long shot, but I do take a lot of pictures with my
digital camera, and by "a lot of pictures" I mean thousands per year. If I could
devise a way of flipping through them, I would probably have a time-lapse movie
of my daughter's first year. I've noticed that the pictures people upload to
Damien's Web site always have a certain amount of metadata associated with them,
including the camera make and model used, date the image was taken, exposure
time, and F number (aperture). Damien informed me that when you upload a JPEG
image to his site, this information is parsed out of the file. So what other
information might be locked away in these files? I had found my inspiration and
decided to go on a binge. Let's start by looking at the JPEG file format
together, and then let's take a look at some code (less than 200 lines, I
promise) that can get us some basic image information, such as the make and
model of the camera the image was taken with, the date and time the image was
taken, and finally the x and y resolution (pixel width and height) of the image.
JPEG File Format
We all see JPEG files every day by the hundreds or
even thousands. They are one of the two ubiquitous image file formats on the
'net, the other being GIF. JPEG, which is named for the Joint Photographic
Experts Group that initially put forth the standard, is a "lossy" compression
technique for storing images. What is meant by "lossy" is that when the image is
uncompressed, it is not a duplicate of the original.
However, never
fear. Note that the E in JPEG stands for expert, and these experts
are snobs about quality. They also know the limitation of human ocular
perception and how to exploit it, so even though a computer can tell you that
the original image and the JPEG image are not pixel-identical, you will have a
very difficult time telling the difference. It should also be noted that JPEG
itself actually refers to the compression technique, not the file format. Just
because they are experts doesn't mean they agree on much. In fact, even though
they did come up with an excellent compression technique, they were unable to
agree on a file format until recently.
The official file format is
called Still Picture Interchange
File Format (SPIFF), and it follows a simple pattern. Each file is composed
of a small header followed by a number of segments. Each segment consists of a
marker, followed by a segment length, followed by the segment data. This makes
for an easy top-down design for our parser. After reading the header, we
construct a framework that reads a segment marker, then reads the segment
length, and then skips to the next segment marker, based on the length it just
read. Repeat this until you reach the end of the file, and you have the general
framework for a parser. Now, one by one, we can add in code for parsing each
segment type that we are interested in.
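To make the top-down idea concrete, here is a minimal sketch of that outer loop on its own. It is not the parser from Figure 1: it assumes the whole file is already in a byte array, it uses java.nio.ByteBuffer (so a JDK newer than the 1.3.1 used for this article), and the SegmentWalker class and listSegments() method are names invented for illustration. One detail to keep in mind is that the 2-byte segment length counts its own two bytes but not the marker.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

class SegmentWalker {
    /** Returns "type:length" (type in hex) for each segment, stopping at start of scan. */
    static List<String> listSegments(byte[] jpeg) {
        ByteBuffer in = ByteBuffer.wrap(jpeg); // big endian by default, like JPEG markers
        List<String> found = new ArrayList<>();
        if ((in.getShort() & 0xFFFF) != 0xFFD8) {
            throw new IllegalArgumentException("Missing start-of-image header");
        }
        while (in.remaining() >= 4) {
            int marker = in.get() & 0xFF;        // every marker starts with 0xFF
            int type = in.get() & 0xFF;          // for example 0xE1 for APP1
            int length = in.getShort() & 0xFFFF; // includes the two length bytes
            if (marker != 0xFF || length < 2) {
                throw new IllegalArgumentException("Not positioned on a segment");
            }
            found.add(Integer.toHexString(type) + ":" + length);
            if (type == 0xDA) {
                break;                           // start of scan: image data follows
            }
            in.position(in.position() + length - 2); // skip to the next marker
        }
        return found;
    }

    public static void main(String[] args) {
        // A tiny hand-made byte sequence, not a real image
        byte[] data = {(byte) 0xFF, (byte) 0xD8,
                       (byte) 0xFF, (byte) 0xE0, 0, 4, 1, 2,
                       (byte) 0xFF, (byte) 0xDA, 0, 2};
        System.out.println(listSegments(data));
    }
}
```

Once a loop like this works, handling a particular segment type is just a matter of branching on the type byte before skipping.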
Well, it's almost that easy.
Because it took a while for the international standards experts to come to an
agreement on file format, other experts motivated more by getting products to
market than arguing over bits and bytes came up with their own extensions to
early drafts of the SPIFF file format. The differences are minor and easily
manageable. This group, known as the Japan Electronics and Information Technology
Industries Association (JEITA), came up with the Exchangeable Image File Format
for digital still cameras, which is simply known as Exif. If you ask me, it
should have been called EIFF, but then I don't speak Japanese. Exif specifies
two segments for a file: APP1 and APP2. The data that we are interested in is in
APP1 and is stored in what is known as the Tagged Image File Format (TIFF).
TIFFs consist of a header followed by one or more Image File Directories (
IFDs). Each IFD is composed of a series of attributes. Each attribute has four
elements: tag, type, count, and value offset. Tag tells us the name of the
attribute being stored. In our case, we are looking for make, model, date, time,
x-resolution, and y-resolution. Type tells us the domain of the data. Examples
are byte, ASCII, short, long, and rational. Count tells us the number of values
stored. Value Offset should probably be called "value" or "offset." The idea is
that if the data can be contained within the 4 bytes allocated to this element,
then this element contains the data itself. If the data is larger than 4 bytes,
this field contains the offset to the actual data. My guess is that the writers
of this spec were C programmers. Anyway, this file format is a lot like a set of
Russian matryoshka dolls, with containers containing containers containing
containers, but the good thing is that it all follows rather simple rules.
The following hierarchy summary should help you visualize the file
format:
Header (also known as start of image)
App1 Segment
  Length
  TIFF Header
  IFD (0 or more)
    Attributes (0 or more)
      Tag
      Type
      Count
      Data Offset
  Attribute data that didn't fit in Data Offset element
Image Segment (aka start of scan)
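The attribute layout described above can be sketched in a few lines. Each attribute occupies 12 bytes: a 2-byte tag, a 2-byte type, a 4-byte count, and the 4-byte value-or-offset field. The IfdEntry class below is a hypothetical helper, not part of the article's parser; it assumes java.nio.ByteBuffer is available and only knows the five types named above (byte, ASCII, short, long, and rational).

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class IfdEntry {
    // Bytes per value, indexed by type code:
    // 1 = byte, 2 = ASCII, 3 = short, 4 = long, 5 = rational
    private static final int[] TYPE_SIZES = {0, 1, 1, 2, 4, 8};

    final int tag;
    final int type;
    final int count;
    final int valueOrOffset;

    private IfdEntry(int tag, int type, int count, int valueOrOffset) {
        this.tag = tag;
        this.type = type;
        this.count = count;
        this.valueOrOffset = valueOrOffset;
    }

    /** Reads one 12-byte attribute at the buffer's position, in the buffer's byte order. */
    static IfdEntry read(ByteBuffer buf) {
        int tag = buf.getShort() & 0xFFFF;
        int type = buf.getShort() & 0xFFFF;
        int count = buf.getInt();
        int valueOrOffset = buf.getInt();
        return new IfdEntry(tag, type, count, valueOrOffset);
    }

    /** True when the data fits in the 4-byte field, so valueOrOffset is the value itself. */
    boolean isInline() {
        if (type < 1 || type >= TYPE_SIZES.length) {
            throw new IllegalStateException("Type " + type + " is not covered by this sketch");
        }
        return TYPE_SIZES[type] * (count & 0xFFFFFFFFL) <= 4;
    }

    public static void main(String[] args) {
        // Hand-made little endian entry: tag 0xA002, type 4 (long), count 1, value 2272
        byte[] raw = {0x02, (byte) 0xA0, 0x04, 0x00, 0x01, 0, 0, 0, (byte) 0xE0, 0x08, 0, 0};
        IfdEntry entry = read(ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN));
        System.out.println(Integer.toHexString(entry.tag) + " inline=" + entry.isInline()
                + " value=" + entry.valueOrOffset);
    }
}
```

A six-character ASCII field, by contrast, needs 6 bytes, so isInline() returns false and the field holds an offset measured from the start of the TIFF header.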
Endians
One messy detail that we do have to deal with is
endians, or byte order. This is basically whether the bytes that store a number begin with the
least significant byte or the most significant byte. In our day-to-day lives,
we normally deal with big endian numbers. If I write the number 1234, then 1 is
the most significant digit and represents the thousands digit, and 4 is the
least significant digit and represents the ones digit.
So we have two
things to address here. First, all the numbers that we are going to deal with
are in hexadecimal. Second, although the file format specs allow for either big
or little endian encodings, most often little endian is used. Written in
hexadecimal with two digits per byte, a 2-byte number stored as
big endian would be ABCD, but the same number stored as little endian would be
CDAB. In other words, the high and low bytes are interchanged. This gets more
complicated for a 4-byte number, where the first and fourth byte are
interchanged, as are the second and third byte. So 89ABCDEF in big endian
becomes EFCDAB89 in little endian. Don't worry. I will provide convenience
methods for dealing with this bit twiddling.
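If you want to see the two byte orders side by side before getting into the parser, the following sketch reads the same raw bytes both ways. It leans on java.nio.ByteBuffer and ByteOrder, which assumes a JDK newer than the 1.3.1 used for this article; EndianDemo and its methods are names made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class EndianDemo {
    /** Interprets the first two bytes of raw as an unsigned 16-bit number. */
    static int readUnsignedShort(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getShort(0) & 0xFFFF;
    }

    /** Interprets the first four bytes of raw as a 32-bit number. */
    static int readInt(byte[] raw, ByteOrder order) {
        return ByteBuffer.wrap(raw).order(order).getInt(0);
    }

    public static void main(String[] args) {
        byte[] two = {(byte) 0xCD, (byte) 0xAB};
        // The same two bytes, read both ways
        System.out.println(Integer.toHexString(readUnsignedShort(two, ByteOrder.LITTLE_ENDIAN))); // abcd
        System.out.println(Integer.toHexString(readUnsignedShort(two, ByteOrder.BIG_ENDIAN)));    // cdab

        byte[] four = {(byte) 0xEF, (byte) 0xCD, (byte) 0xAB, (byte) 0x89};
        System.out.println(Integer.toHexString(readInt(four, ByteOrder.LITTLE_ENDIAN)));          // 89abcdef
    }
}
```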
Parsing Example
The code in Figure 1 parses a JPEG file and looks for
an Exif segment. Within that segment, it looks for the items we talked about
above.
Figure 1: This code parses a JPEG file and looks for an Exif
segment.
This is far less complicated than it looks. Let's break it
down line by line:
import java.io.RandomAccessFile;
import java.io.IOException;
We only need two classes from the JDK: RandomAccessFile, which we will
use for reading our file, and IOException, which we will throw only if something
really bad happens.
public class JPEGParser
{
private static final String inptFile = "Clyde.jpg";
private static final int startOfImageToken = 0xFFD8;
// Segment Tokens
private static final byte app1Token = (byte)0xE1;
private static final byte startOfScanToken = (byte)0xDA;
// TIFF Attribute Tokens
private static final int manufacturerToken = 0x10F;
private static final int modelToken = 0x110;
private static final int dateTimeToken = 0x9003;
private static final int xDimensionToken = 0xA002;
private static final int yDimensionToken = 0xA003;
private RandomAccessFile in;
Next, we declare our top-level class JPEGParser, followed by some private
static final constants, which are the actual hex values of tags and tokens
within the files. Generally, I'm a big fan of using symbolic constants in this
way, but I have heard it argued that when dealing with parsers, it is better to
hard code your tokens so that it is easier to debug. Here, we also create a
class-level attribute for our RandomAccessFile so that we don't have to pass it as a
parameter to each method.
public static void main(String[] args)
{
System.out.println("Start JPEGParser");
JPEGParser myParser = new JPEGParser();
myParser.parseJPEGFile();
System.out.println("End JPEGParser");
}
Here, we have our entry point, complete with a couple of system outs for
marking the beginning and ending of our program output as well as the
instantiation of our JPEGParser object and a call to the parseJPEGFile()
method.
public void parseJPEGFile()
{
try
{
in = new RandomAccessFile(inptFile, "r");
int startOfImage = in.readUnsignedShort();
if (startOfImage == startOfImageToken)
{
while (true)
{
// Find each of the segments
String segmentMarker = this.byteToHexString(in.readByte());
System.out.println("Segment Marker: " + segmentMarker);
// Find segment type
byte segmentType = in.readByte();
if (segmentType == app1Token)
{
System.out.println("App1 Segment");
this.parseApp1();
}
else if (segmentType == startOfScanToken)
{
System.out.println("Start of Scan Segment");
return;
}
else
{
System.out.println("Unknown Segment");
this.parseSegment();
}
System.out.println("----------------");
}
}
}
catch (java.io.FileNotFoundException ex)
{
System.out.println("File not found");
}
catch (java.io.IOException ex)
{
System.out.println("Parse exception");
}
}
We first open the specified file and consume the Start Of Image header.
After that, we begin the outer loop of our parser, where we scan through each of
the segments looking for either an App1 or a Start Of Scan segment. When we find
an App1 segment, we make a call to the parseApp1() method. If we find the Start
Of Scan segment, we terminate parsing. Start Of Scan is always the last segment
and contains the actual compressed image. All other segments we skip over by
calling the parseSegment() method. Finally, we catch any exceptions that might
be thrown.
public void parseApp1() throws java.io.IOException
{
long startOfSegment = in.getFilePointer();
// Segment length
int segmentLength = in.readUnsignedShort();
System.out.println(" Length: " + segmentLength);
String identifierString = this.stringRead(4);
System.out.println(" Identifier: " + identifierString);
//eat the pad
in.skipBytes(2);
// start TIFF structure
long startTIFF = in.getFilePointer();
// Byte Order 2 bytes
in.skipBytes(2);
// 42 - just in case you didn't read the Byte Order
in.skipBytes(2);
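// Offset of the first IFD (4 bytes); only the low byte is kept,
// which assumes a little endian file and a small offset
int nextIFDOffset = in.readByte();
in.skipBytes(3);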
int IFDNumber = 0;
This first part of the parseApp1() method is where we consume the App1
and TIFF headers. We pull out the segment length as well as the identifier
string, which will be literally "Exif". We then consume a 2-byte pad and are
ready to parse the TIFF header. We save the value of the offset to the TIFF
header for later use. The TIFF header begins with a byte order that tells us
whether our file uses big endian or little endian notation. Just in case you
ignored the byte order or if you want to use simpler logic to determine it, the
next two bytes contain the number 42 encoded using the correct endian for the
file. In my example, I assume the file is little endian just to keep things as
simple as possible. Next, we read the 4-byte offset of the first IFD, keeping
only its low byte, and get ready to perform the IFD
parsing loop.
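If you would rather not assume little endian, the TIFF header carries everything needed to decide. The first two bytes are the ASCII letters "II" (0x4949) for little endian or "MM" (0x4D4D) for big endian, followed by the number 42 and the offset of the first IFD. The sketch below shows one way to act on that; it assumes the TIFF portion of the App1 segment has been copied into a byte array, and TiffHeader.open() is a hypothetical helper, not part of the parser above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class TiffHeader {
    /** Returns a buffer set to the file's byte order and positioned at the first IFD. */
    static ByteBuffer open(byte[] tiff) {
        ByteBuffer buf = ByteBuffer.wrap(tiff);
        int orderMark = buf.getShort(0) & 0xFFFF; // two equal bytes, so byte order is irrelevant here
        if (orderMark == 0x4949) {
            buf.order(ByteOrder.LITTLE_ENDIAN);   // "II"
        } else if (orderMark == 0x4D4D) {
            buf.order(ByteOrder.BIG_ENDIAN);      // "MM"
        } else {
            throw new IllegalArgumentException("Unknown byte order mark");
        }
        if (buf.getShort(2) != 42) {              // sanity check in the chosen order
            throw new IllegalArgumentException("Missing TIFF magic number 42");
        }
        int firstIfdOffset = buf.getInt(4);       // measured from the start of the TIFF header
        buf.position(firstIfdOffset);
        return buf;
    }

    public static void main(String[] args) {
        // Hand-made little endian header followed by two filler bytes
        byte[] header = {0x49, 0x49, 0x2A, 0x00, 0x08, 0, 0, 0, 0, 0};
        ByteBuffer buf = open(header);
        System.out.println(buf.order() + " first IFD at " + buf.position());
    }
}
```

With the buffer's order set once, every later getShort() and getInt() call comes back in the right byte order without any manual swapping.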
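// Assumed loop condition: in TIFF, a next-IFD offset of 0 marks the last IFD
while (nextIFDOffset != 0)
{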
System.out.println(" IFD: " + IFDNumber);
parseIFD(startTIFF);
nextIFDOffset = in.readByte();
in.skipBytes(3);
in.skipBytes(nextIFDOffset);
IFDNumber++;
}
// Skip to next segment
in.seek(startOfSegment + segmentLength);
}
We loop through each of the IFDs and, when finished, skip to the end of the
segment.
public void parseIFD(long startTIFF) throws java.io.IOException
{
int TIFFAttributes = this.unsignedShortRead();
for (int index = 0; index < TIFFAttributes; index ++)
{
int tag = EndianConverter.LEtoBE((short)in.readUnsignedShort());
int type =
EndianConverter.LEtoBE((short)in.readUnsignedShort());
int count = EndianConverter.LEtoBE(in.readInt());
int valueOrOffset = EndianConverter.LEtoBE(in.readInt());
if (tag == this.manufacturerToken)
{
long absoluteOffset = valueOrOffset + startTIFF;
String manufacturer =
this.indirectStringRead(absoluteOffset, count - 1);
System.out.println(" Manufacturer: " + manufacturer);
}
else if (tag == this.modelToken)
{
long absoluteOffset = valueOrOffset + startTIFF;
String model =
this.indirectStringRead(absoluteOffset, count - 1);
System.out.println(" Model: " + model);
}
else if (tag == this.dateTimeToken)
{
long absoluteOffset = valueOrOffset + startTIFF;
String dateTime =
this.indirectStringRead(absoluteOffset, count - 1);
System.out.println(" Date Time: " + dateTime);
}
else if (tag == this.xDimensionToken)
{
System.out.println(" X: " + valueOrOffset);
}
else if (tag == this.yDimensionToken)
{
System.out.println(" Y: " + valueOrOffset);
}
else
{
//System.out.println(" Unknown TIFF Tag");
}
}
}
Here, we have the entire parseIFD() method. For each attribute, we pull
out the tag, type, count, and valueOrOffset. We then check each of the tags to
see if it is one of the tags that we are interested in; if so, we parse out its
value appropriately, using offsets if needed for larger chunks of data. Note
again the C programming accent in that all ASCII encoded fields like manufacturer
and model are null terminated, so we read one less character than the count
field calls for. Also note that the system out for unknown tags is currently
commented out. This is because there are many tags that we are not currently
interested in, but it is sometimes handy to print these out when debugging, so I
left it in. Now, you can see how easy it would be to look for a new tag. You
would just add in one more else if to this structure and a little bit of
logic, and you'd have another piece of metadata. For a complete list of
available tags, see the Exif documentation referenced previously.
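As an illustration of the "little bit of logic" a new tag might need, consider the exposure time and F number mentioned earlier. In Exif these are tags 0x829A and 0x829D, and both are stored as rationals: two unsigned 4-byte numbers, numerator first and then denominator. That is 8 bytes, so the value always sits behind an offset. The helper below is a sketch only; RationalReader is an invented name, and it assumes a little endian file whose TIFF portion has been loaded into a byte array starting at the TIFF header.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class RationalReader {
    static final int EXPOSURE_TIME_TAG = 0x829A;
    static final int F_NUMBER_TAG = 0x829D;

    /** Reads an unsigned rational stored at the given offset from the start of the TIFF data. */
    static double readRational(byte[] tiff, int offset) {
        ByteBuffer buf = ByteBuffer.wrap(tiff).order(ByteOrder.LITTLE_ENDIAN);
        long numerator = buf.getInt(offset) & 0xFFFFFFFFL;       // treat as unsigned
        long denominator = buf.getInt(offset + 4) & 0xFFFFFFFFL; // treat as unsigned
        return (double) numerator / denominator;
    }

    public static void main(String[] args) {
        // Two filler bytes, then the rational 1/250 in little endian
        byte[] tiff = {0, 0, 1, 0, 0, 0, (byte) 0xFA, 0, 0, 0};
        System.out.println(readRational(tiff, 2)); // an exposure of 1/250 second
    }
}
```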
public void parseSegment() throws java.io.IOException
{
long startOfSegment = in.getFilePointer();
// Segment length
int segmentLength = in.readUnsignedShort();
System.out.println(" Segment Length: " + segmentLength);
// Skip to next segment
in.seek(startOfSegment + segmentLength);
}
This method is the default parser for segments of unknown types.
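public int unsignedShortRead() throws java.io.IOException
{
// readUnsignedShort() assumes big endian, so swap the two bytes
return EndianConverter.LEtoBE((short)in.readUnsignedShort());
}
public String stringRead(int length) throws java.io.IOException
{
byte[] inputBuffer = new byte[length];
// readFully() fills the whole buffer; read() may return early
in.readFully(inputBuffer);
return new String(inputBuffer);
}
public String indirectStringRead(long absoluteOffset, int length)
throws java.io.IOException
{
// Remember where we are, jump to the data, then jump back
long saveReadPoint = in.getFilePointer();
in.seek(absoluteOffset);
byte[] inputBuffer = new byte[length];
in.readFully(inputBuffer);
in.seek(saveReadPoint);
return new String(inputBuffer);
}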
These three methods are used as convenience methods for doing direct and
indirect reads for particular types. Most notably, the indirectStringRead()
method is used for reading ASCII encoded strings that are too big to fit in the
data offset field and must therefore be read as an offset from start of
TIFF.
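public String byteToHexString(byte myByte)
{
// Mask to 0..255 and force a leading digit so that bytes below 0x80
// also yield exactly two hex digits
return Integer.toHexString((myByte & 0xFF) | 0x100).substring(1);
}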
Finally, we have a convenience method for formatting a byte as a
two-digit hexadecimal string.
So there you have it in a measly 195 lines
of code. If you download the sample file Clyde.jpg from the Web site
and run the code above against it, your output should look like this:
Start JPEGParser
Segment Marker: ff
App1 Segment
Length: 7678
Identifier: Exif
IFD: 0
Manufacturer: Canon
Model: Canon PowerShot S400
IFD: 1
Date Time: 2003:09:10 11:48:46
X: 2272
Y: 1704
----------------
Segment Marker: ff
Unknown Segment
Segment Length: 132
----------------
Segment Marker: ff
Unknown Segment
Segment Length: 17
----------------
Segment Marker: ff
Unknown Segment
Segment Length: 418
----------------
Segment Marker: ff
Start of Scan Segment
End JPEGParser
Endian Convenience Methods
OK, I did cheat a little bit. I left out one snippet
of code from a class called EndianConverter that has a couple of static methods
for converting from little endian to big endian.
public class EndianConverter
{
public static int LEtoBE(short le)
{
return ((le >> 8) & 0x00FF) | ((le << 8) & 0xFF00);
}
public static int LEtoBE(int le)
{
return (le >>> 24) | (le << 24) |
((le << 8) & 0x00FF0000) | ((le >> 8) & 0x0000FF00);
}
}
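If you are on Java 5 or later, which is newer than the JDK this article was compiled with, the standard library can do the same swapping through Short.reverseBytes() and Integer.reverseBytes(). The wrapper below is only a sketch of that alternative; the class and method names are invented, and the mask on the short version mirrors the unsigned int that LEtoBE(short) returns.

```java
class ReverseBytesDemo {
    /** Swaps the two bytes of a short and returns the result as an unsigned value. */
    static int swapShort(short value) {
        return Short.reverseBytes(value) & 0xFFFF;
    }

    /** Reverses the order of the four bytes of an int. */
    static int swapInt(int value) {
        return Integer.reverseBytes(value);
    }

    public static void main(String[] args) {
        System.out.println(Integer.toHexString(swapShort((short) 0xABCD))); // cdab
        System.out.println(Integer.toHexString(swapInt(0x89ABCDEF)));       // efcdab89
    }
}
```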
Happy Parsing
So I've put together the framework here, which should be pretty easy to extend. Just cut and paste out the code from here, compile it, and try running it against some JPEG files of your own. Then grab the Exif document from the link provided and add in some code to parse out a few more attributes.
Conclusion
The JPEG file format is very regimented, making it
easy to parse. Although the focus of this article was purely on parsing JPEG
files, the approach could be generalized to any similar image or video format. Even though
JPEG is a "lossy" compression technique, it is still ideal for images that will
only be viewed by humans. The Exif extension to the standard SPIFF format
contains a number of interesting pieces of metadata about the file. In fact,
there are even attributes for GPS coordinates, so in the not-so-distant future
when GPS receivers are ubiquitous, your camera will not only record the time you took a
picture but also the location. There were really only two tricky parts to
writing this parser. The first was dealing with the big and little endians, and
the second was tracking down all the documentation on the file formats. Now that
those are in place, it is time to refactor this general framework into several
smaller, more manageable classes. I plan on continuing work on this project, so
if you are interested in keeping tabs on it or would like to help contribute,
just drop me an email. Parsing can be fun and so can going on a programming
binge from time to time.
The example used in this article was created
using Together Control Center 6.0.1 and was compiled using JDK Version 1.3.1.
Also, if you are going to do anything with binary files, I strongly suggest you
get a good hex editor. I like UltraEdit.
Michael J. Floyd is an extreme programmer and the Software Engineering Manager for
DivXNetworks. He is also a consultant for San Diego State University.