How do ZIP files work?

How does data get compressed without any getting lost? A data scientist explains...
26 April 2021
Presented by Phil Sansom
Production by Phil Sansom.

Listener Ellie wanted to know: "How do ZIP files work on my computer?" Phil Sansom unzipped the question - with an answer from research data scientist Peter Foster...

In this episode

QotW: How do ZIP files work?

Phil - Those are files that end in the letters .zip, and to do anything with them, you have to first click a button that says ‘extract’ - and somehow, out come a new set of files! What on Earth is going on? Here’s how research data scientist Peter Foster sees it...

Peter - A ZIP file is a convenient way to bundle up one or more files, with the seemingly magical property that its contents are shrunk in size, but no information is lost. In this sense, ZIP files are all about data compression.
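
To make the "shrunk, but nothing lost" idea concrete, here is a minimal sketch using Python's standard zipfile module. The file names and contents are invented purely for illustration, and the archive is built in memory rather than on disk.

```python
import io
import zipfile

# Hypothetical file names and contents, chosen only for illustration.
files = {
    "notes.txt": b"laughing out loud " * 200,
    "todo.txt": b"shovel the driveway\n" * 100,
}

# Bundle the files into a ZIP archive held in memory, using DEFLATE compression.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as archive:
    for name, data in files.items():
        archive.writestr(name, data)

original_size = sum(len(data) for data in files.values())
zipped_size = len(buffer.getvalue())

# "Extract": read every file back out of the archive.
with zipfile.ZipFile(buffer, "r") as archive:
    extracted = {name: archive.read(name) for name in archive.namelist()}

print("original bytes:", original_size)
print("zipped bytes:", zipped_size)
print("identical after extraction:", extracted == files)
```

Because the sample text is highly repetitive, the archive should come out far smaller than the originals, and the extracted bytes should match them exactly.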

Phil - Without lovely compression we could be drowning in data. Watching an uncompressed, high-definition video could easily burn through your whole monthly mobile data plan in a single second - but thanks to compression, watching YouTube on your phone still leaves gigabytes to spare. And this isn’t just for computers.

Peter - We all use our own kind of data compression when we use textspeak (e.g. acronyms and abbreviations) to shorten our messages. This works if the recipient knows the meaning behind the textspeak! If you wanted to be 100% certain the person you’re texting can decode your texts, you’d send them all the textspeak definitions you’re using in advance, in the form of a carefully chosen dictionary.

Phil - This kind of dictionary would tell you that “l o l” is code for “laughing out loud”, for example, but it could also use custom abbreviations for phrases that appear a lot in your specific message. And with the right abbreviations, the overall ‘coded’ message - plus dictionary - could be much, much shorter.

Peter - In a similar way, ZIP files are encoded versions of the files that they contain, interspersed with dictionary entries, which together allow us to decode the files. The abbreviations can represent sequences of data of any length - for example, strings of characters.
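
The textspeak analogy can be sketched in a few lines of Python. This is a toy illustration only, not the algorithm ZIP files actually use (most ZIP files rely on a method called DEFLATE). The DICTIONARY contents, the sample message and the encode and decode helpers are all hypothetical names made up for this example.

```python
# Hypothetical dictionary mapping each abbreviation to the phrase it stands for.
DICTIONARY = {
    "lol": "laughing out loud",
    "brb": "be right back",
}


def decode(coded, dictionary):
    """Expand every abbreviation back into its full phrase."""
    for short, phrase in dictionary.items():
        coded = coded.replace(short, phrase)
    return coded


def encode(message, dictionary):
    """Swap each phrase for its abbreviation, refusing if information would be lost."""
    coded = message
    for short, phrase in dictionary.items():
        coded = coded.replace(phrase, short)
    # Only accept the result if decoding gives back exactly the original message.
    if decode(coded, dictionary) != message:
        raise ValueError("these abbreviations cannot be decoded unambiguously")
    return coded


message = "laughing out loud, be right back. " * 20
coded = encode(message, DICTIONARY)

# The recipient needs the dictionary too, so count its size as part of the cost.
dictionary_size = sum(len(short) + len(phrase) for short, phrase in DICTIONARY.items())

print("original length:", len(message))
print("coded length plus dictionary:", len(coded) + dictionary_size)
print("decodes correctly:", decode(coded, DICTIONARY) == message)
```

The round-trip check inside encode is the toy version of being "100% certain" the recipient can decode: a message that already contains a bare "lol" would be rejected, because decoding would wrongly expand it.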

Phil - In other words, compression is all about finding - and exploiting - patterns in data. No patterns - no compression.

Peter - If you tried to zip up a file which contained only randomly generated data, you would need to be extremely lucky to see any shrinkage in the zipped version.
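
The "no patterns, no compression" point is easy to try out. The sketch below uses Python's standard zlib module, which implements the DEFLATE method that most ZIP files use. The 100,000-byte size and the repeated phrase are arbitrary choices for illustration.

```python
import os
import zlib

size = 100_000  # arbitrary size in bytes, chosen for illustration

# Highly patterned data: the same phrase repeated, trimmed to the target size.
phrase = b"laughing out loud "
patterned = (phrase * (size // len(phrase) + 1))[:size]

# Data with no patterns: bytes from the operating system's random source.
random_data = os.urandom(size)

patterned_zipped = zlib.compress(patterned)
random_zipped = zlib.compress(random_data)

print("patterned:", len(patterned), "->", len(patterned_zipped))
print("random:   ", len(random_data), "->", len(random_zipped))

# In both cases decompression recovers the original bytes exactly.
print("lossless:", zlib.decompress(patterned_zipped) == patterned
      and zlib.decompress(random_zipped) == random_data)
```

The patterned data should shrink to a tiny fraction of its size, while the random data should stay about the same size or grow slightly, since the compressed format adds a little overhead of its own.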

Phil - And so we have today’s ZIP files.

Peter - By the way, the ZIP file is an example of lossless data compression (simply meaning without loss of information), but it is worth mentioning that there are other types of compression, which are lossy, like JPEG for images or MP3 for audio.

Phil - And that last one is probably how you’re hearing me now! Thanks very much to Peter Foster, from the Alan Turing Institute. Next time we’re answering this question - and don’t let it ‘bug’ you - from listener Jeffrey…

Jeffrey - We’ve had a cold and snowy winter, and I’ve had to shovel my driveway every few days. We had a fly in our house, and I was curious if it survived the cold somehow, or recently hatched?
