It's not just new technology that could help us to store digital data in the future. Already, we compress some data so that it takes up a lot less space. To find out how this works, Chris Smith and Dave Ansell were joined by PhD student Christian Steinruecken who is from the Department of Engineering at Cambridge University.
Dave - So Christian, what exactly is data compression?
Christian - Data compression is really the art of communicating using fewer bits that it would usually take.
Dave - Where bits are the 1s and 0s which all digital information is stored in.
Christian - That's right, using fewer units of information to communicate the same message.
Dave - So, how do you reduce the number of bits you're going to use?
Christian - So, the idea is to re-write the message in a different way and that re-writing can be made to exploit properties of the data that include for example, long repetitions or statistical properties in such a way that fewer symbols are needed.
Dave - So, if you have something which goes 101 101 a thousand times, you could say, instead of writing 101 a thousand times, "please write me a 101 a thousand times" which is that one sentence rather than...
Christian - That's one way of doing it.
Dave - Do we use compression a lot at the moment?
Christian - Yes, compression is ubiquitous. Every smartphone has compression algorithms built in and it's something that ships with every operating system at the moment, it's a technique that is vital for certain things to work at all. If you imagine, for example, the idea of a Mars rover on the surface of Mars and you have limited bandwidth to communicate with that rover, you are talking about a very expensive mission. You want to have the ability to get as much data back from Mars as you can. The bandwidth limit is not the camera on rover or the instruments. It's the satellite link. We'd like to compress the data as much as possible so that we can send more data overall.
Dave - So, I guess there's two ways of compressing. You can either attempt to produce exactly the same output or you can just try and get the important parts of the data.
Christian - Yes, both play a role. So, there's a difference between lossy compression and lossless compression. Lossy compression is what's happening in jpeg for example: we can throw away part of the data that we don't so much care about. The other idea (of lossless compression) is to preserve precisely the data that there are, but exploiting statistical properties to make them smaller.
Dave - So, what are you actually looking at in your research?
Christian - In my research, I look at structural compression, which is [based on] the idea that certain mathematical properties of data can be exploited to make them really small. For example, if you have combinatorial objects, multisets or permutations, that sort of thing, these are [examples of] mathematical properties which can be exploited. Most data that we're dealing with are highly structured. For example, the surrounding sequence in a long sequence of text tells us a lot about the bits that we don't yet know. So when communicating a text or a long file, having already seen part of the file helps us to encode the next symbol in a better way. I look at compression algorithms that take advantage of such properties in files, in order to make it easier to transmit larger files and make them smaller.
Dave - So, if you really, really understand what you're sending, you don't actually have to send nearly as much.
Christian - Yeah, the idea is that if we already know which data we're going to send then we don't have to send anything. The problem is making an algorithm that learns very quickly what the data is and gets a good idea of what is going to come, and exploiting those properties that we learn. So, it's really a form of machine learning.
Chris - So, it's not just about a machine where you sort of turn the handle, you put the data in and it comes out in a crunched down way. This is actually about the machine, learning the pattern of the data in order to become better as it goes on at producing a more condensed or compressed output.
Christian - That's right. The problem is that at the time when the Mars rover is sending a file, the technology that's in it to compress doesn't yet know what the data is going to be. So it has to learn as it goes along what the data is, to exploit those properties. At the receiving end, the receiver will learn exactly the same way and as data is decoded, will update its internal model in order to be able to follow along.
Chris - Does one have to send the code to the other so that I've compressed something and learned how I'm doing as I go along? Do I need to send you the code I use so that you know how to unpick what I did and regenerate what I started with?
Christian - The idea is that we both use the same compression programme and when we start out, both compression programmes are in exactly the same state. Now, as I start compressing a file, my compression programme starts learning. It starts learning that certain names appear very often in the text. It may learn all sort of other things. At the time when I decompress the file, the same process repeats. For the copy that it's decompressing, it will also learn gradually as it decompresses what those names are. So, the compressor and the decompressor really have to be in sync. They are learning the same thing separately, on perhaps different planets.
Chris - And is it just on Mars rovers or are there other applications? I mean, we're going to compress the audio for this programme for example and that means we'll throw away some of the frequencies, some of the sounds that we don't think people are going to be able to hear or will not notice if they're gone. So, could you take this similar sort of thing and make even better ways of compressing sound files so that our programmes aren't going to take up as much space in someone's iPod?
Christian - That's right. Whenever we have a better understanding of what properties we care about in the data, we can make a better compressor, and this is partially why there are many different file formats for audio because every now and then there's a small revolution where an incremental change produces a better way of storing information. And so, for audio for example, something that happens in the mp3 format, and in other formats, is that some of the frequency bands are thrown away that humans can't easily hear.
Chris - And that means that then you're saving space overall. You've thrown the data, but you'll never going to get it back though. So that's an example of a lossy compression. Are you saying that your thing can be used in a way that will get back that sound we've thrown away?
Christian - I think it completely depends. Many sensors record things that we don't care about, but when it comes to things like electronic text, we probably want to preserve the exact text without changing it when we decompress it. So, this is a case where we really want lossless compression. Similarly, when we have executable files or computer programmes that we may want to compress, or perhaps medical data, all sorts of things where it's really important to keep the precise file that we had to begin with then we want to use lossless compression.