First, yes – there is such a thing as next-gen lossless data compression.
Here’s one for starters: cross-file content-aware dedup for images and video.
Take a JPEG image, compress it with your favorite lossless compression algorithm, say bzip2 or LZMA, and you'll get a file that's no smaller (usually a touch larger). That makes sense: JPEG output is already compressed, so a general-purpose compressor finds almost no redundancy left to squeeze out.
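You can check this on any Unix-like machine; photo.jpg is just a stand-in for whatever image you have handy, and the -k flag keeps the original so you can compare sizes afterwards:

bzip2 -k photo.jpg && ls -l photo.jpg photo.jpg.bz2

The .bz2 file will come out about the same size as the original, and usually a bit bigger.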
Now take a group of, say, 100 JPEGs and compress them into a single archive. You'll get no real compression. Even though there are likely redundancies among those photos, JPEG isn't designed for cross-file compression, and LZMA can't exploit it because the redundancy isn't visible in the raw byte patterns; it only emerges once you understand the content.
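If you want to see this for yourself, a solid tar + xz (LZMA) archive is the simplest test; the file names below are just placeholders for a folder of photos:

tar cf photos.tar *.jpg && xz -9 photos.tar && ls -l photos.tar.xz

Even though tar hands LZMA all 100 files as one continuous stream, the resulting photos.tar.xz will be barely smaller than the originals put together.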
But do that with a compression algorithm from the future, and you’ll get a much smaller file. These algorithms will know how to do content-aware cross-file compression.
What would these algorithms look like?
In fact we already have algorithms that know how to de-dup among a set of pictures. They're called video compression algorithms. Try taking those same 100 JPEGs, putting them into a movie file format, and running your favorite video codec over them with the right parameters, and you'll get your smaller compressed archive. Of course this is just a hacked approach, but you get the picture!
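As a rough sketch of that hack, assuming the JPEGs have been renamed into a numbered sequence and you have ffmpeg handy (the file names and quality setting here are just illustrative):

ffmpeg -framerate 1 -i IMG_%03d.jpg -c:v libx264 -preset veryslow -crf 18 photos.mkv

Note that the codec decodes the JPEGs and re-encodes the pixels, so you can't get the original files back byte for byte; but it does exploit the similarity between frames, which is exactly the cross-file redundancy a general-purpose archiver misses.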
There are also various tricks you could do on text, audio, etc.
This technology will become increasingly important as more and more data gets centralized into the cloud.
Someone on the commercial side is already working on this. After thinking about this idea for several hours and getting really excited, I came across a company called Ocarina (chalk talk here) which is commercializing this approach.
That doesn’t mean this technology should be limited to big companies with big budgets, though. I’m hoping someone from academia or the open source community will eventually pick this up and run with it!
Part 2 to come soon! (Edit: link to part 2) In the meantime, let's hear it if you have any further thoughts or ideas!
(There was a good discussion of this blog post on Hacker News here.)
Yeah, I have a thought: why start building "futuristic" means of optimizing/compressing current formats when said current formats will in all likelihood be abandoned soon enough?
I fail to see the point of this endeavour in light of a much smarter idea: keep this approach in mind when developing new formats.
Posted by: hackermom | 05/31/2010 at 02:08 PM
Hi Adam:
Great comments, and thanks for noticing us. Although there's been a lot of R&D in symbol-based deduplication, there's been very little research in data compression in the last couple of decades. Ocarina is one of the few working on it. In fact, we'll be publishing a book soon (draft here: http://www.mattmahoney.net/dc/dce.html).
What you're talking about, inter-file image compression, is something we call 3D compression. This inter-file compression is far more complex than the usual dedupe algorithms. For images, opportunities for further compression are to be found in color-table similarities, delta encoding, etc. You can also think of MPEG coding techniques (B-frames, P-frames, motion compensation) applied to a collection of still images. We would probably focus on lossless to start, but lossy and 'visually lossless' would be options as well.
There are some obvious applications for 3D image compression, and the applicability increases as the personal cost of shooting photos decreases. For example, today I might take 50 images of a meadow in Yosemite, whereas 10 years ago I would have taken only 3. There are tons of similarities to take advantage of in those images. Also think about medical imaging, where one study might consist of 500 image 'slices', each with very subtle deltas from the prior one. The raw storage required for such a study is huge, and 3D compression will deliver real benefits.
Of course image compression is just part of what we do. For example, we're the only game in town combining dedupe and compression for a given data set, which speaks to an overarching logic framework that's very good at picking algorithms and chunk boundaries. With that approach we'll get about 80% reduction on an average home-shares data set, which of course includes a lot of pre-compressed data types.
Posted by: Mike Davis | 06/01/2010 at 10:04 AM
Thanks for the comment, Mike. Very thoughtful.
Posted by: Adam Smith | 06/01/2010 at 12:31 PM
"" Now take a group of, say, 100 JPEGs, and compress them into a single archive. You’ll get no real compression. Even though there are likely redundancies among those photos, JPEG isn't designed to handle cross file compression, and LZMA can't handle it because the redundancies aren't obvious by looking only at the binary patterns. ""
This isn't the future. We've had .tar specifically to group files and let compression algorithms find redundancy across multiple files.
Many compression tools support this through a 'solid compression' mode. Try the following RAR command to see how much the size can be squeezed:
rar a -mc63:128t -m5 -md4096 -r -s -t -tsm0 -tsc0 -tsa0 -en testing.rar
My test:
file count: 1,542
total file size: 67,682,547 bytes
compressed result: 62,294,893 bytes
Of course, for more spectacular results I recommend using one of the PAQ8-based compression tools.
Posted by: Pling Zarooba | 06/01/2010 at 02:40 PM