Implement GZIP

Yes, you're really going to implement compression and decompression and it's going to be compatible with the real gzip utility that's standard on any Unix box (and can be easily installed on Windows as part of the Cygwin utilities). Gzip shows up in a variety of places where you might not expect it. Did you know that it's part of the most recent HTTP specification?

Specification due: Tue, Feb 14, 11:59pm
Prototype due: Tue, Feb 21, 11:59pm
Final due: Tue, Feb 28, 11:59pm

Project groups are online and ready to go. Email your graders now to set up your spec and prototype check-off meetings.

Where to begin?

Lucky for you, the Internet Engineering Task Force (IETF) has three incredibly detailed documents (RFCs 1950, 1951, and 1952) that describe how gzip (and its Lempel-Ziv style deflate algorithm) works.

Your goal is really to produce two programs: one that compresses and the other that uncompresses. The real gzip has a large number of other features (e.g., it can uncompress the output of the Unix compress tool, and it likewise knows how to deal with all the different CR/LF/CR-LF text issues between different operating systems). You are not responsible for implementing these features. Likewise, you only ever need to deal with input from System.in and output to System.out. You should correctly implement and verify the checksums.
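For the checksum, the gzip trailer records a CRC-32 of the uncompressed data (along with its length, modulo 2^32). Here's a minimal sketch of the kind of table-driven CRC-32 that RFC 1952 describes; the class name and structure are just one way to package it, not a requirement.

    /** Sketch of a table-driven CRC-32, the variant gzip uses (see RFC 1952).
     *  Class and method names here are only illustrative. */
    public final class Crc32 {
        private static final int[] TABLE = new int[256];
        static {
            for (int n = 0; n < 256; n++) {
                int c = n;
                for (int k = 0; k < 8; k++) {
                    c = (c & 1) != 0 ? 0xEDB88320 ^ (c >>> 1) : c >>> 1;
                }
                TABLE[n] = c;
            }
        }

        private int crc = 0xFFFFFFFF;   // pre-inverted running value

        /** Feed one byte of uncompressed data into the checksum. */
        public void update(int b) {
            crc = TABLE[(crc ^ b) & 0xFF] ^ (crc >>> 8);
        }

        /** The finished CRC-32, as an unsigned 32-bit value stored in an int. */
        public int value() {
            return crc ^ 0xFFFFFFFF;
        }
    }

In your unit tests, you can sanity-check a class like this against java.util.zip.CRC32, which (see the questions below) is fair game for testing.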

The most important requirement of this assignment is that your programs are compatible with the real gzip. Gzip should be able to uncompress your files, and you should be able to uncompress its files. Also, consider that the real gzip can operate on very, very long files. Your implementation should as well.
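On the compatibility front, the fixed parts of the container format are small: a ten-byte header and an eight-byte trailer, with multi-byte fields stored little-endian. A sketch of writing the simplest legal header and the trailer might look something like this; the method names are made up, and you should double-check every field against RFC 1952 yourself.

    import java.io.IOException;
    import java.io.OutputStream;

    // Sketch only: writes the simplest legal gzip header and trailer.
    // Field layout follows RFC 1952; names are illustrative.
    final class GzipFraming {
        static void writeHeader(OutputStream out) throws IOException {
            out.write(0x1f);          // ID1: gzip magic
            out.write(0x8b);          // ID2: gzip magic
            out.write(8);             // CM: 8 = deflate
            out.write(0);             // FLG: no optional fields
            writeLE32(out, 0);        // MTIME: 0 = not available
            out.write(0);             // XFL: no extra flags
            out.write(255);           // OS: 255 = unknown
        }

        static void writeTrailer(OutputStream out, int crc32, long bytesIn) throws IOException {
            writeLE32(out, crc32);            // CRC-32 of the uncompressed data
            writeLE32(out, (int) bytesIn);    // ISIZE: input length mod 2^32
        }

        static void writeLE32(OutputStream out, int v) throws IOException {
            out.write(v & 0xFF);
            out.write((v >>> 8) & 0xFF);
            out.write((v >>> 16) & 0xFF);
            out.write((v >>> 24) & 0xFF);
        }
    }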

The decompression end is fairly straightforward. You implement the spec and it works. On the compression end, you've got choices to make about how you want to implement things. You'll find a suggested algorithm in the RFCs, which is almost certainly a fine place to start. You are strongly encouraged to design a system that allows you to implement other compression strategies as well.
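One low-cost way to leave that door open is to hide the choice of compression algorithm behind a small interface. This is only a sketch of the idea, with made-up names; your actual design is up to you.

    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;

    // Sketch: hiding the compression algorithm behind an interface makes it
    // easy to swap in alternatives to the RFC's suggested approach later.
    interface CompressionStrategy {
        /** Reads uncompressed bytes and writes a deflate-compressed stream. */
        void compress(InputStream uncompressed, OutputStream compressed) throws IOException;
    }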

For the spec deadline, the RFCs will make your life much easier with respect to the high-level aspects of your design, which means your spec should focus on how all the modules of your program will fit together. In addition to the usual Javadoc, your spec should describe what algorithm(s) you intend to pursue for the compression side and why you think they might improve on the one described in the RFC. Your spec should also lay out the order in which you intend to flesh out the various features of your project. You may find it helpful, for example, to get the decompression part working before you spend much time on compression.

Hints

When you're doing I/O based on bytes, life is easy. When you're doing it based on bits, things get more complicated. My advice is that you implement some kind of wrapper around the usual InputStream or OutputStream that allows you to read or write individual bits as well as whole bytes (the character-oriented Reader and Writer classes won't help you with binary data). When you're reading bits, perhaps as part of a Huffman decoder, you may find it helpful to grab enough bits that you can have the longest possible Huffman code at your fingertips. When you figure out that you only needed a few of those bits, you will want a way to push back the bits that you didn't actually consume. Implementing a data structure that allows pushing arbitrary data back is a pain, but it's straightforward to separate the idea of reading bits from the idea of advancing through the file.
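Here's a minimal sketch of that last idea: a reader that buffers bits so you can peek at more of them than you end up consuming. It assumes deflate's convention of filling each byte starting from its least-significant bit (RFC 1951, section 3.1.1); the class and method names are only illustrative.

    import java.io.IOException;
    import java.io.InputStream;

    // Sketch of a bit-level reader over an InputStream. Bits come out
    // LSB-first within each byte, matching how deflate packs its data
    // elements. Names and structure are only a suggestion.
    final class BitReader {
        private final InputStream in;
        private long buffer = 0;   // unread bits; the next bit is the LSB
        private int count = 0;     // how many bits in buffer are valid

        BitReader(InputStream in) { this.in = in; }

        /** Look at the next n bits (n up to ~24) without consuming them. */
        int peek(int n) throws IOException {
            while (count < n) {
                int b = in.read();
                if (b < 0) break;              // at EOF the missing bits show up as zeros
                buffer |= ((long) b) << count;
                count += 8;
            }
            return (int) (buffer & ((1L << n) - 1));
        }

        /** Discard n bits that peek() already showed you. */
        void consume(int n) {
            buffer >>>= n;
            count = Math.max(0, count - n);
        }

        /** Convenience: read and consume n bits at once. */
        int read(int n) throws IOException {
            int bits = peek(n);
            consume(n);
            return bits;
        }
    }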

Initially, do everything one byte at a time. Grabbing multiple bytes at a time (e.g., a 64-bit long integer) may seem attractive from a performance perspective, but it complicates two things. One, you will need to deal with the corner cases, such as when there aren't enough bytes remaining in a file to fill a 64-bit long integer. Two, you need to pay attention to byte ordering (see Wikipedia's article on endianness). We've been stuck with this issue throughout the history of computer science. Intel defines the x86 as a "little endian" architecture. Java defines the JVM as "big endian". For the most part, when gzip has values larger than one byte, they're defined to be little endian. You should be sure to have unit tests that cover this issue. In theory, your Java program should behave exactly the same way regardless of whether it's running on a little-endian or big-endian computer, although you would be wise to test on a Sparc or PowerMac (big endian) as well as a normal x86 PC (little endian).
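Working one byte at a time also makes endianness largely a non-issue, because you decide explicitly which byte is least significant. Here's a sketch of reading one of gzip's little-endian 32-bit fields (the names are illustrative):

    import java.io.EOFException;
    import java.io.IOException;
    import java.io.InputStream;

    // Sketch: read gzip's little-endian fields one byte at a time so the
    // result is the same on any JVM, regardless of the host's byte order.
    final class LittleEndian {
        /** Reads a 4-byte little-endian value, e.g. MTIME, CRC32, or ISIZE. */
        static long readUInt32(InputStream in) throws IOException {
            long value = 0;
            for (int i = 0; i < 4; i++) {
                int b = in.read();
                if (b < 0) throw new EOFException("truncated field");
                value |= ((long) b) << (8 * i);   // byte 0 is least significant
            }
            return value;
        }
    }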

Unit tests will save your skin. It's very difficult to see what's going wrong in a program like this when using a traditional debugger. Good unit tests will let you exercise your code and make you more comfortable that it's really working. Consider how unit tests can be worked into your implementation strategy. If you implement the decompressor first, then you can have unit tests that use inputs generated by the real gzip to drive your decompressor. Later, when you're working on your compressor, you can use your own decompressor to validate that your compressed data uncompresses properly. To make this all go smoothly, you probably want an internal structure that lets you pipe the output of one stage into the next stage (versus hard-coding System.in and System.out).
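To make the piping idea concrete, here's a sketch of an in-memory round trip. It uses java.util.zip on both sides purely as a stand-in; in your real tests you'd substitute your own compressor or decompressor on one side and keep the library (which the questions below permit for testing) on the other as a referee.

    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;
    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.zip.GZIPInputStream;
    import java.util.zip.GZIPOutputStream;

    // Sketch of an in-memory round-trip test. The java.util.zip classes are
    // placeholders for your own compressor / decompressor stages.
    public class RoundTripSketch {
        public static void main(String[] args) throws Exception {
            byte[] original = "a man, a plan, a canal: panama".getBytes(StandardCharsets.US_ASCII);

            // Stage 1: compress into a byte array instead of System.out.
            ByteArrayOutputStream compressed = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(compressed)) {
                gz.write(original);
            }

            // Stage 2: pipe the compressed bytes straight into the decompressor.
            ByteArrayOutputStream restored = new ByteArrayOutputStream();
            try (GZIPInputStream gz =
                     new GZIPInputStream(new ByteArrayInputStream(compressed.toByteArray()))) {
                byte[] buf = new byte[4096];
                int n;
                while ((n = gz.read(buf)) != -1) {
                    restored.write(buf, 0, n);
                }
            }

            if (!Arrays.equals(original, restored.toByteArray())) {
                throw new AssertionError("round trip changed the data");
            }
            System.out.println("round trip ok");
        }
    }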

When things go wrong, a common cause is a loss of synchronization: your decompression tool trips over data it didn't anticipate, or over impossible / broken input. You are strongly encouraged to wire some error checking into the decoding process from the beginning. You could have unit tests that expect your decoder to fail at a specific byte offset into the file (and you could even throw a specialized exception that encodes this information). It's generally much better to fail in a predictable way than to output garbage.
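If you like the idea of failing at a known position, one option (entirely a sketch, with made-up names) is a specialized exception that carries the byte offset where decoding went wrong:

    import java.io.IOException;

    // Sketch: an exception that records how far into the input the decoder
    // got before it saw something impossible. Name and fields are illustrative.
    public class CorruptInputException extends IOException {
        private final long byteOffset;

        public CorruptInputException(String message, long byteOffset) {
            super(message + " (at byte offset " + byteOffset + ")");
            this.byteOffset = byteOffset;
        }

        /** Where in the input stream the decoder detected the problem. */
        public long getByteOffset() {
            return byteOffset;
        }
    }

A unit test can then assert not only that decoding a corrupt file fails, but that it fails exactly where you expected.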

Binary editors may also be helpful. Emacs lovers may want to try out hexl-mode. Or, from the command line, you can use tools like od or xxd. That way, you can look at your binary output and see if it was what you were expecting. This is significantly more painful than cleverly designed unit tests, but it's still quite useful.

Cool points

You can expect your graders to have a variety of test cases that they will throw at your system, both for compression and uncompression, including corrupt data. There will certainly be cool points awarded for the group(s) with the best compression ratios and/or significantly faster implementations for a given compression ratio.

The obvious questions

Can I look at / use / study / consider the code for the real gzip?

You may read the code as a source of inspiration for your own work. If you decide to reimplement some of its algorithms in your own code, you must give credit where credit is due. Most likely, you will find it easier to work from scratch, as the real gzip is much more complex than what you are required to do.

I noticed a java.util.zip package that looks like it has everything I might need for this assignment...

Your compressor / decompressor implementation may not use java.util.zip. However, you may find it useful for your unit tests. Ditto for the GNU Classpath version. Also, as with the real gzip, you're free to read the code and cite it for any ideas you decide to borrow.

Can I beg / borrow / steal some library for dealing with bit-wise rather than byte-wise I/O?

No. You should implement this yourself, although you're certainly allowed to consult other implementations, with credit given where due.

Last modified: Tue Feb 7 10:03:48 CST 2006