Naked Science Forum

General Science => Question of the Week => Topic started by: nudephil on 19/04/2021 12:05:08

Title: QotW - 21.04.19 - How do zip files work?
Post by: nudephil on 19/04/2021 12:05:08
This week's question comes from listener Ellie:

How do zip files work on my computer?

Can anyone unzip this one?
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: CliffordK on 19/04/2021 18:25:12
A ZIP file is simply a compressed archive format that uses a specific algorithm to shrink the files inside it.  At one time a separate program was required to zip and unzip files, but the patents have long since expired, and support for the format is now commonly built into operating systems and file managers.

ZIP is considered "lossless" compression, i.e. what you put in is what you get back.  On the other hand, JPG is "lossy" compression, i.e. one generally loses some image detail depending on the settings, but gets good compression.

I'm not sure exactly how ZIP works, but anything predictable in a file, such as runs of spaces or other white space, or a limited number of letters or colors, can be compressed.  However, a previously compressed file (JPG, MPG, etc.) usually compresses very poorly a second time, because the first pass has already removed most of the predictability and left data that looks close to random.
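
To make the "predictable compresses well, random compresses badly" point concrete, here is a minimal sketch using Python's standard zlib module (it implements DEFLATE, the method normally used inside ZIP files). The sample text and the use of random bytes as a stand-in for already-compressed data are my own assumptions for illustration.

```python
import os
import zlib

# Highly predictable input: the same sentence repeated many times.
predictable = b"the quick brown fox jumps over the lazy dog. " * 200

# Unpredictable input of the same length, from the OS random source.
# This stands in for data that has already been compressed once.
random_like = os.urandom(len(predictable))

packed_predictable = zlib.compress(predictable, 9)
packed_random = zlib.compress(random_like, 9)

# Lossless: decompressing gives back exactly the original bytes.
round_trip_ok = zlib.decompress(packed_predictable) == predictable

print("original size:       ", len(predictable))
print("predictable, packed: ", len(packed_predictable))
print("random-like, packed: ", len(packed_random))
print("round trip identical:", round_trip_ok)
```

The repeated sentence should shrink to a small fraction of its size, while the random bytes should come out about the same size as they went in (or slightly larger).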

There are a few reasons why you might have difficulty uncompressing a file:

When I have an unknown file, I like to dump it in a text editor or byte editor, and see if the header is human readable.  I'm not real sophisticated, but many programs put identity information into either a header or footer.
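
The same header check can be done in a few lines of code. This is only a sketch: the `guess_type` helper and its small signature table are hypothetical, although the byte signatures listed are the well-known ones for these formats. A ZIP file that contains at least one entry starts with the letters "PK".

```python
import io
import zipfile

# A few well-known signatures ("magic numbers") found at the start of a file.
SIGNATURES = {
    b"PK\x03\x04": "ZIP archive",
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"\x1f\x8b": "gzip stream",
    b"%PDF": "PDF document",
}

def guess_type(header: bytes) -> str:
    """Hypothetical helper: match the first bytes against known signatures."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"

# Build a tiny ZIP in memory so the example needs no file on disk.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")

header = buf.getvalue()[:8]
print(header, "->", guess_type(header))
```

For a real file you would read the first few bytes with open(path, "rb").read(8) instead of building the archive in memory.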
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: evan_au on 20/04/2021 11:21:24
"Zipping" a file saves space on your disk storage, by removing "redundant"/"not strictly necessary" information.
- Unzipping the file when it is accessed restores the missing information so the file can be read, at the cost of extra execution time.
- If you don't access the files very often, zipping them is a good way to save space, without costing you much time
- If you need to send information to someone via email, radio, podcast, or streaming video, it saves a lot of time and money to compress it first.
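
As a rough illustration of the zip/unzip round trip described above, here is a sketch using Python's standard zipfile module. The file names and contents are made up for the example, and the archive is kept in memory rather than written to disk.

```python
import io
import zipfile

# Hypothetical files with plenty of redundancy.
files = {
    "notes.txt": b"to be or not to be, that is the question. " * 100,
    "blank.csv": b"0,0,0,0,0,0,0,0\n" * 500,
}

# "Zip": write each file into the archive using DEFLATE compression.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# "Unzip": read the archive back and restore every file.
with zipfile.ZipFile(archive) as zf:
    sizes = {info.filename: (info.file_size, info.compress_size)
             for info in zf.infolist()}
    restored = {name: zf.read(name) for name in zf.namelist()}

for name, (original, compressed) in sizes.items():
    print(f"{name}: {original} bytes -> {compressed} bytes")
print("restored exactly:", restored == files)
```

Each file is compressed separately inside the archive, which is why a single file can be pulled out without unpacking the rest.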

When I first received a laptop running Windows 10, I found it was extremely sluggish.
- I eventually traced it to a "feature" in Windows 10 which also compresses data in RAM (Random Access Memory)
- Unfortunately, in a virtual-memory system like Windows 10, the data stored in RAM is there because it is frequently accessed
- So they were continually zipping and unzipping the information which is most frequently accessed
- My computer sped up a lot when I turned off this "feature"
- And sped up even more when I installed twice as much RAM!
See: https://www.howtogeek.com/319933/what-is-memory-compression-in-windows-10/

Specialised Compression Methods
There are some compression techniques that are better suited for certain types of content
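- ZIP files are good at compressing text files, and can get exactly the same file back again, so it is "lossless". It is a good general-purpose compression method. (ZIP normally uses the DEFLATE method, which combines LZ77 with Huffman coding; LZW, linked here, is a closely related dictionary-based technique.)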
        See: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
- MP3 and AAC are specialised for compressing audio files. These intentionally drop some sound content that the average ear can't detect. What you get back "sounds" the same, but it is not exactly the same. This is a "lossy" compression.
       See: https://en.wikipedia.org/wiki/MP3
- JPEG and JPG are specialised for compressing still photographs
       See: https://en.wikipedia.org/wiki/JPEG
- MPEG and MP4 are optimised for compressing video content
       See: https://en.wikipedia.org/wiki/MPEG-4
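
To show the difference between lossless and lossy in the simplest possible way, here is a toy sketch. It is not how MP3 or JPEG work internally (they use transforms and perceptual models); it just throws away the low bits of each sample before compressing, which is my own simplified stand-in for "dropping detail". The sine-wave "audio" signal is also an assumption for illustration.

```python
import math
import zlib

# A hypothetical "audio" signal: a sine wave stored as 8-bit samples.
samples = bytes(int(127.5 + 127.5 * math.sin(i / 10)) for i in range(4000))

# Lossless: compress as-is; decompression restores every sample exactly.
lossless = zlib.compress(samples, 9)
lossless_exact = zlib.decompress(lossless) == samples

# Toy lossy step: round each sample down to one of 16 levels, then compress.
quantized = bytes((s // 16) * 16 for s in samples)
lossy = zlib.compress(quantized, 9)

# The lossy version cannot be restored exactly, but the error is bounded.
max_error = max(abs(a - b) for a, b in zip(samples, quantized))

print("raw size:     ", len(samples))
print("lossless size:", len(lossless), "exact:", lossless_exact)
print("lossy size:   ", len(lossy), "max error per sample:", max_error)
```

The coarser the rounding, the smaller the file and the larger the error, which is the size/quality tradeoff that lossy formats expose through their quality settings.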

Improvements over Time
Most of the lossy compression techniques allow the originator of the content to make a tradeoff between file size and quality of the reconstructed file.
- New video compression techniques are continually being developed which can achieve better compression ratios without degrading perceived image quality, but usually at the cost of greatly increased processing power
- If you want to move from HDTV (about 2k pixels wide) to UHDTV (about 4k pixels wide) without improving the compression technique, you will need 4x the bandwidth on your internet connection and/or 4x the storage on your portable device, because doubling both width and height gives 4x as many pixels.
- But if you install a more powerful image processor on your display device, and use a better image compression standard, you may only need 2x the bandwidth.
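
The arithmetic behind those figures, as a quick sketch. The resolutions are the standard 1920x1080 and 3840x2160 frame sizes; the factor-of-two codec improvement is an assumed round number, not a measured result.

```python
# Standard frame sizes for HD and UHD video.
hd_pixels = 1920 * 1080
uhd_pixels = 3840 * 2160

# Same codec, same frame rate: bandwidth scales with the pixel count.
pixel_ratio = uhd_pixels / hd_pixels

# Assumption: the newer codec needs half the bits for the same perceived quality.
codec_gain = 2.0
bandwidth_ratio = pixel_ratio / codec_gain

print("UHD has", pixel_ratio, "times the pixels of HD")
print("with the better codec you need", bandwidth_ratio, "times the bandwidth")
```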

See: https://en.wikipedia.org/wiki/Data_compression
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: alancalverd on 20/04/2021 12:04:54
Why do I keep reading the question as "zip flies"?
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: Colin2B on 20/04/2021 13:38:46
Quote from: alancalverd
Why do I keep reading the question as "zip flies"?
Becoming forgetful in your old age?
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: CliffordK on 20/04/2021 17:05:45
Quote from: alancalverd
Why do I keep reading the question as "zip flies"?
I was all ready to answer the question "How do Zip Ties Work?"
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: evan_au on 20/04/2021 22:32:28
Quote from: OP
How do zip files work?
An application that has exploded in the past 20 years is unzipping the DNA helix.
- Human DNA has about 3 billion base pairs
- With simple encoding (one byte per base), this would take about 3 Gigabytes to store the genome of one human (by using 2 bits per base, you could reduce this by a factor of 4, to about 750 Megabytes)
- However, humans (or members of any one species) have well over 99% similarity in their DNA
- So by comparing to a "Reference Genome" for the species of interest (and just recording the differences), you can reduce the size of a stored genome to less than 1% of the original size.
- Of course, you still need to store the Reference Genome before you can compress the first subsequent genome.
- There are a number of specialised programs that do this
See: https://en.wikipedia.org/wiki/Compression_of_Genomic_Sequencing_Data
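
Here is a small sketch of both ideas: the storage arithmetic, and storing only the differences from a reference. The two short sequences and the helper names are made up for illustration, and the sketch only handles single-letter substitutions in sequences of equal length; real tools also have to deal with insertions, deletions and quality scores.

```python
# Storage arithmetic for about 3 billion bases.
BASES = 3_000_000_000
bytes_one_per_base = BASES          # one byte per base: about 3 GB
bytes_two_bits = BASES * 2 // 8     # 2 bits per base: about 750 MB

def diff_against_reference(reference, sample):
    """Hypothetical helper: list (position, base) where sample differs."""
    return [(i, b) for i, (a, b) in enumerate(zip(reference, sample)) if a != b]

def rebuild(reference, diffs):
    """Hypothetical helper: reapply the recorded differences to the reference."""
    seq = list(reference)
    for position, base in diffs:
        seq[position] = base
    return "".join(seq)

# Toy sequences (assumed data): the sample differs from the reference in 2 places.
reference = "ACGTACGTACGTACGTACGT"
sample = "ACGTACGAACGTACGTACCT"

diffs = diff_against_reference(reference, sample)
print("differences stored:", diffs)
print("rebuilt correctly: ", rebuild(reference, diffs) == sample)
```

Only the list of differences needs to be stored per person, as long as everyone shares the same reference.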

Of course, these specialised data compression programs are optimised for the particular kind of data they are expecting, in a particular format.
- They will perform really badly if you give them the wrong type of data; if you feed a DNA sequence into a video-compression algorithm, the results will not be pretty!
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: nudephil on 22/04/2021 11:55:18
Quote from: CliffordK
I was all ready to answer the question "How do Zip Ties Work?"

That's next week! And zip lines after that.
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: nudephil on 26/04/2021 09:20:39
Here's the finished Question of the Week, with data scientist Peter Foster's contribution: https://www.thenakedscientists.com/podcasts/question-week/how-do-zip-files-work
Title: Re: QotW - 21.04.19 - How do zip files work?
Post by: Jarek Duda on 28/04/2021 20:36:35
Small update:
Instead of the ancient zip, your files are now often written with the modern zstd: 2-5x faster with better compression, used e.g. in the Linux kernel and in the products of hundreds of companies: https://en.wikipedia.org/wiki/Zstandard
On Apple devices, with LZFSE: https://en.wikipedia.org/wiki/LZFSE
For DNA compression, the near-default is CRAM: https://en.wikipedia.org/wiki/CRAM_(file_format)
In the coming months JPEG (and GIF, PNG) should be replaced by the new JPEG XL, with up to 3x smaller photos and many previously missing features: https://en.wikipedia.org/wiki/JPEG_XL

In place of Huffman coding (which can only spend a whole number of bits per symbol), all of the above use ANS (which can effectively spend fractional bits per symbol): https://en.wikipedia.org/wiki/Asymmetric_numeral_systems
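
To see why whole bits matter, here is a small sketch comparing the average Huffman code length with the theoretical minimum (the Shannon entropy) for a skewed set of symbol probabilities. The probabilities are an assumed example, and the code builds a textbook Huffman tree; it does not implement ANS itself.

```python
import heapq
import math

def entropy_bits(probs):
    """Shannon entropy: the theoretical minimum average bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def huffman_lengths(probs):
    """Return the Huffman code length in bits for each symbol."""
    # Heap entries: (probability, tie_breaker, symbols_in_this_subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        # Merging two subtrees adds one bit to every symbol inside them.
        for symbol in s1 + s2:
            lengths[symbol] += 1
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

# Assumed example: one very common symbol and two rare ones.
probs = [0.9, 0.05, 0.05]
lengths = huffman_lengths(probs)
average_huffman = sum(p * n for p, n in zip(probs, lengths))

print("Huffman code lengths:", lengths)
print("Huffman average bits/symbol:", round(average_huffman, 3))
print("entropy bits/symbol:        ", round(entropy_bits(probs), 3))
```

The common symbol ideally deserves far less than one bit, but Huffman cannot give it fewer than one, so the average ends up well above the entropy. Coders such as ANS or arithmetic coding can get much closer to that limit.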