Motion Vector Applications

Introduction

Motion vectors are typically used to compress video by storing only the changes to an image from one frame to the next. We use these vectors creatively: to detect and track motion, and to explore phase shifting as an alternative to traditional video decoding.

Background

MPEG-1 is an ISO standard developed by the Moving Picture Experts Group. It was developed for Video CDs. While VCDs never became popular, partly because movies had to be split across several discs, the standard gained acceptance in a variety of other applications. Most of the MPEG files found on the internet are MPEG-1 files, and the output of digital video cameras is often an MPEG-1 stream. In short, MPEG-1 is a digital video compression standard that has gained wide acceptance over the past decade.

Before we can apply the information encoded in an MPEG video, we must first understand what that information is. MPEG seeks to compress digital video as much as possible while still maintaining a high quality picture. The first step in the compression process involves switching from the RGB color space to the YCrCb color space. YCrCb describes a color space where the three components of color are luminance (Y), red chrominance (Cr), and blue chrominance (Cb). This switch is made because the human eye is less sensitive to chrominance than it is to luminance, so chrominance data can be sampled at a quarter the rate of luminance data.
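As an illustration, the color-space step can be sketched in a few lines of Python. The conversion coefficients are the standard ITU-R BT.601 ones used by MPEG-1; the function names and array layout are our own choices for this sketch.

    import numpy as np

    def rgb_to_ycrcb(rgb):
        # rgb: (H, W, 3) array of 0-255 values (ITU-R BT.601 coefficients)
        r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
        y  =  0.299    * r + 0.587    * g + 0.114    * b
        cr =  0.5      * r - 0.418688 * g - 0.081312 * b + 128.0
        cb = -0.168736 * r - 0.331264 * g + 0.5      * b + 128.0
        return y, cr, cb

    def subsample_420(chroma):
        # Keep one chrominance sample per 2 x 2 block of pixels, so each
        # chrominance plane carries a quarter as many samples as luminance.
        return chroma[::2, ::2]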

The next step in compression involves reducing spatial redundancy. This is done using essentially the same methods as JPEG. The image is divided into 16 x 16 pixel macroblocks. Each macroblock contains 16 x 16 luminance pixels and one 8 x 8 block each of red and blue chrominance pixels. The luminance block is then split into four 8 x 8 blocks, leaving six 8 x 8 blocks in total, on each of which a DCT is performed. The DCT coefficients are quantized, filtered, and then stored.
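A minimal sketch of the block DCT and quantization step, assuming SciPy is available. For simplicity we use a single uniform quantizer step; MPEG-1 actually applies a frequency-dependent quantization matrix that treats high frequencies more coarsely.

    import numpy as np
    from scipy.fftpack import dct

    def dct2(block):
        # Separable 2-D DCT of an 8 x 8 block (type-II, orthonormal).
        return dct(dct(block, axis=0, norm='ortho'), axis=1, norm='ortho')

    def quantize(coeffs, step=16.0):
        # Uniform quantizer for illustration only; the standard uses a
        # per-coefficient quantization matrix.
        return np.round(coeffs / step).astype(int)

    block = np.tile(np.arange(8) * 16.0, (8, 1))  # smooth ramp stand-in
    coded = quantize(dct2(block))  # energy concentrates at low frequencies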

The next step in compression is intended to reduce temporal redundancy. The first step in this process is to divide the sequence of frames into groups of pictures (GOPs) and then to classify each frame as either I, P, or B. The usual method is to break a video into GOPs of 15 frames. The first frame is always an I frame. In a 15-frame GOP, it is common to have two B frames after the I frame, followed by a P frame, followed by two more B frames, and so on.
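The display-order frame types for such a GOP can be generated as below; the function name and parameters are our own, and real encoders are free to choose other GOP structures.

    def gop_pattern(length=15, m=3):
        # Display order: I at position 0, a P every m-th frame, B between.
        types = ['I']
        for i in range(1, length):
            types.append('P' if i % m == 0 else 'B')
        return ''.join(types)

    print(gop_pattern())  # IBBPBBPBBPBBPBB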

The classification of a frame as I, P, or B determines the manner in which temporal redundancies are encoded. An I frame is encoded "from scratch", just as described above. A P frame, however, is encoded by breaking the image into macroblocks and then using a block-matching search (conceptually similar to a matched filter) to match each macroblock to the 16 x 16 pixel region of the previous reference frame (the last I or P frame) that it most closely resembles. Once the best match is found, the motion vector (the vector pointing from the center of the macroblock to the center of the best-matching region) is assigned to that macroblock, and the prediction error between the macroblock and that region is encoded as DCT coefficients. A B frame differs from a P frame only in that the matching step is performed twice: once relating a macroblock in B to a region in the previous I or P frame, and once relating the same macroblock to a region in the next I or P frame. The two matched regions are then averaged, and the error between that average and the actual macroblock is encoded.
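A sketch of the block-matching step for a single macroblock of a P frame, assuming an exhaustive search and the sum of absolute differences (SAD) as the match criterion; the function name and the +/- 8 pixel search window are our own choices, not fixed by the standard.

    import numpy as np

    def best_motion_vector(ref, cur, top, left, size=16, search=8):
        # Exhaustively search a +/- `search` pixel window in the reference
        # frame `ref` for the region that minimizes the SAD against the
        # macroblock of `cur` whose top-left corner is (top, left).
        block = cur[top:top + size, left:left + size].astype(float)
        best_sad, best_mv = np.inf, (0, 0)
        for dy in range(-search, search + 1):
            for dx in range(-search, search + 1):
                y, x = top + dy, left + dx
                if y < 0 or x < 0 or y + size > ref.shape[0] \
                        or x + size > ref.shape[1]:
                    continue  # candidate region falls outside the frame
                sad = np.abs(ref[y:y + size, x:x + size].astype(float)
                             - block).sum()
                if sad < best_sad:
                    best_sad, best_mv = sad, (dy, dx)
        return best_mv

The residual between the macroblock and the region the returned vector points to is what then gets encoded as DCT coefficients.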

A final step in the compression of the video stream takes the resulting bitstream and applies a variable-length (Huffman) code to reduce the remaining statistical redundancy.
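For illustration, a generic Huffman-code construction is sketched below. Note that MPEG-1 itself uses fixed variable-length-code tables defined in the standard rather than codes computed per stream, so this shows only the underlying idea.

    import heapq
    from collections import Counter

    def huffman_code(symbols):
        # Build a prefix code: frequent symbols get shorter bit strings.
        heap = [[count, i, {sym: ''}]
                for i, (sym, count) in enumerate(Counter(symbols).items())]
        heapq.heapify(heap)
        tie = len(heap)  # tie-breaker so dicts are never compared
        while len(heap) > 1:
            lo, hi = heapq.heappop(heap), heapq.heappop(heap)
            merged = {s: '0' + c for s, c in lo[2].items()}
            merged.update({s: '1' + c for s, c in hi[2].items()})
            heapq.heappush(heap, [lo[0] + hi[0], tie, merged])
            tie += 1
        return heap[0][2]

    print(huffman_code("aaaabbc"))  # e.g. {'c': '00', 'b': '01', 'a': '1'}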

