Naked Science Forum
General Science => Question of the Week => Topic started by: nudephil on 19/04/2021 12:05:08
-
This week's question comes from listener Ellie:
How do zip files work on my computer?
Can anyone unzip this one?
-
ZIP is simply a compressed file format that uses a specific algorithm to shrink the files it holds. At one time a separate program was required to zip and unzip files, but the patents have long since expired, and support for the format is now commonly built into operating systems and their file browsers.
ZIP is considered "lossless" compression, i.e. what you put in is exactly what you get back. On the other hand, JPG is "lossy" compression, i.e. one generally loses some image detail, depending on the settings, in exchange for much smaller files.
I'm not sure exactly how ZIP works, but anything that is predictable in a file (extra spaces, other white space, a limited number of letters or colors, etc.) can be compressed. However, a previously compressed file (JPG, MPG, etc.) usually compresses very poorly a second time, because the first compression has already removed the predictable parts and left significant "randomness" in the file.
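To make the "predictable compresses, random doesn't" point concrete, here is a minimal sketch using Python's built-in zlib module, which implements DEFLATE, the method ZIP files normally use. The sample text and the helper name compressed_size are made up for illustration, and the random bytes are only a stand-in for an already-compressed file such as a JPG.

```python
import os
import zlib

def compressed_size(data: bytes) -> int:
    """Return the size of data after DEFLATE compression at the highest level."""
    return len(zlib.compress(data, 9))

# 24,000 bytes of very predictable text (the same sentence over and over).
repetitive = b"the cat sat on the mat. " * 1000

# The same number of random bytes: a stand-in for already-compressed data.
random_like = os.urandom(len(repetitive))

print("repetitive: ", len(repetitive), "->", compressed_size(repetitive))
print("random-like:", len(random_like), "->", compressed_size(random_like))

# Lossless: decompressing returns exactly the bytes that went in.
assert zlib.decompress(zlib.compress(repetitive, 9)) == repetitive
```

The repetitive text should shrink to a tiny fraction of its size, while the random bytes come out about the same size or slightly larger.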
There are a few reasons why you might have difficulty uncompressing a file:
- The file is not a ZIP file, but improperly named
- A common way to pass viruses in the past was to use multiple file extensions, for example Virus.ZIP.exe or Virus.JPG.EXE. Now that spaces are allowed in file names, it can also be Virus.ZIP.[bunch of spaces].exe. Windows is getting better at trapping these, but hiding file extensions doesn't help.
- Broken ZIP file. Because it is a compressed file, even slight damage to the file can cause significant problems. However, there may be repair programs.
- ZIP allows password protection, which encrypts (further scrambles) the files.
- Lack of free space or swap space to decode. Depending on the system, you will need at least enough free space in the destination for the entire uncompressed file, and may need significant space on a swap partition.
When I have an unknown file, I like to dump it in a text editor or byte editor, and see if the header is human readable. I'm not real sophisticated, but many programs put identity information into either a header or footer.
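The same header check can be done with a few lines of Python instead of a text editor. This is only a rough sketch: the SIGNATURES table is a small, non-exhaustive list of well-known leading "magic" bytes, and the function names sniff and sniff_file are my own.

```python
import io
import zipfile

# Leading "magic" bytes of a few common formats (small, non-exhaustive table).
SIGNATURES = {
    b"PK\x03\x04": "ZIP archive",
    b"PK\x05\x06": "ZIP archive (empty)",
    b"\xff\xd8\xff": "JPEG image",
    b"\x89PNG\r\n\x1a\n": "PNG image",
    b"MZ": "Windows executable",
}

def sniff(header: bytes) -> str:
    """Guess a file type from its first bytes, ignoring the file name."""
    for magic, name in SIGNATURES.items():
        if header.startswith(magic):
            return name
    return "unknown"

def sniff_file(path: str) -> str:
    """Read the first 16 bytes of a file on disk and guess its type."""
    with open(path, "rb") as f:
        return sniff(f.read(16))

# Demo: build a tiny ZIP archive in memory and look at its header.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("hello.txt", "hello")

print(sniff(buf.getvalue()[:16]))   # header of a real ZIP archive
print(sniff(b"MZ\x90\x00"))         # header typical of a Windows .exe
```

A file called Virus.ZIP.exe would show up here as a Windows executable no matter what its name claims.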
-
"Zipping" a file saves space on your disk storage, by removing "redundant"/"not strictly necessary" information.
- Unzipping the file when it is accessed restores the missing information so the file can be read, at the cost of extra execution time.
- If you don't access the files very often, zipping them is a good way to save space, without costing you much time
- If you need to send information to someone via email, radio, podcast, or streaming video, it saves a lot of time and money to compress it first.
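As a small illustration of the space saving, here is a sketch that zips some text in memory with Python's standard zipfile module and reports the sizes recorded in the archive. The file name letter.txt and the sample text are invented for the example.

```python
import io
import zipfile

# Some repetitive text to store (made up for this example).
original = ("Dear listener, thank you for your question. " * 200).encode("utf-8")

# Write a ZIP archive into memory instead of onto disk.
archive = io.BytesIO()
with zipfile.ZipFile(archive, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("letter.txt", original)

# Reopen the archive, check the recorded sizes, and unzip the file again.
with zipfile.ZipFile(archive) as zf:
    info = zf.getinfo("letter.txt")
    restored = zf.read("letter.txt")

print("original size:", info.file_size, "bytes")
print("size inside the archive:", info.compress_size, "bytes")

# Unzipping restores the file exactly.
assert restored == original
```

How much is saved depends entirely on the content; text as repetitive as this shrinks far more than a typical document would.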
When I first received a laptop running Windows 10, I found it was extremely sluggish.
- I eventually traced it to a "feature" in Windows 10 which also compresses data in RAM (Random Access Memory)
- Unfortunately, in a virtual-memory system like Windows 10, the data stored in RAM is there because it is frequently accessed
- So they were continually zipping and unzipping the information which is most frequently accessed
- My computer sped up a lot when I turned off this "feature"
- And sped up even more when I installed twice as much RAM!
See: https://www.howtogeek.com/319933/what-is-memory-compression-in-windows-10/
Specialised Compression Methods
There are some compression techniques that are better suited for certain types of content
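- ZIP files are good at compressing text files, and can get exactly the same file back again, so it is "lossless". It is a good general-purpose compression method. Its usual method, DEFLATE, combines dictionary matching of repeated strings (from the Lempel-Ziv family) with Huffman coding.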
See: https://en.wikipedia.org/wiki/Lempel%E2%80%93Ziv%E2%80%93Welch
- MP3 and AAC are specialised for compressing audio files. These intentionally drop some sound content that the average ear can't detect. What you get back "sounds" the same, but it is not exactly the same. This is a "lossy" compression.
See: https://en.wikipedia.org/wiki/MP3
- JPEG (file extension .jpg or .jpeg) is specialised for compressing still photographs
See: https://en.wikipedia.org/wiki/JPEG
- MPEG and MP4 are optimised for compressing video content
See: https://en.wikipedia.org/wiki/MPEG-4
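The lossy idea of "throw away detail you won't miss, then compress what is left" can be shown with a toy example. To be clear, this is not how MP3 or JPEG actually work (they use perceptual models and transforms); the synthetic signal, the quantise helper and the step size of 16 are all assumptions chosen just to show the effect.

```python
import math
import zlib

# A synthetic "signal", one byte per sample: a slow sine wave plus a little jitter.
samples = bytes(
    int(128 + 100 * math.sin(i / 50) + (i * 7919 % 5) - 2) for i in range(10_000)
)

def quantise(data: bytes, step: int) -> bytes:
    """Lossy step: round each sample down to a multiple of step, discarding fine detail."""
    return bytes((b // step) * step for b in data)

coarse = quantise(samples, 16)

lossless = zlib.compress(samples, 9)   # keeps every detail
lossy = zlib.compress(coarse, 9)       # compresses the simplified signal

print("raw:", len(samples), "lossless:", len(lossless), "lossy:", len(lossy))

# The lossless copy comes back exactly; the lossy one is only close.
assert zlib.decompress(lossless) == samples
assert zlib.decompress(lossy) != samples
print("largest error per sample:", max(a - b for a, b in zip(samples, coarse)))
```

A bigger step gives a smaller file and a rougher signal, which is the same size-versus-quality dial that lossy formats offer.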
Improvements over Time
Most of the lossy compression techniques allow the originator of the content to make a tradeoff between file size and quality of the reconstructed file.
- New video compression techniques are continually being developed which can achieve better compression ratios without degrading perceived image quality, but usually at the cost of greatly increased processing power
- If you want to move from HDTV (about 2k pixels wide) to UHDTV (about 4k pixels wide, so 4x as many pixels per frame) without improving the compression technique, you will need 4x the bandwidth on your internet connection and/or 4x the storage on your portable device.
- But if you install a more powerful image processor on your display device, and use a better image compression standard, you may only need 2x the bandwidth.
See: https://en.wikipedia.org/wiki/Data_compression
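The 4x and 2x figures above are just pixel arithmetic, which can be checked quickly. The resolutions are the standard full HD and UHD frame sizes; the codec_gain of 2 is a hypothetical value standing in for "a better compression standard", not a measured number.

```python
HD = (1920, 1080)    # full HD frame, roughly "2k" wide
UHD = (3840, 2160)   # ultra HD frame, roughly "4k" wide

def pixels(resolution: tuple[int, int]) -> int:
    """Number of pixels in one frame."""
    width, height = resolution
    return width * height

# Twice as wide and twice as tall means four times the pixels.
pixel_ratio = pixels(UHD) / pixels(HD)

# Hypothetical: a newer codec that needs half the bits per pixel for the same quality.
codec_gain = 2.0
bandwidth_ratio = pixel_ratio / codec_gain

print("pixels per frame, UHD vs HD:", pixel_ratio)
print("bandwidth needed with the better codec:", bandwidth_ratio, "x")
```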
-
Why do I keep reading the question as "zip flies"?
-
Why do I keep reading the question as "zip flies"?
Becoming forgetful in your old age?
-
Why do I keep reading the question as "zip flies"?
I was all ready to answer the question "How do Zip Ties Work?"
-
How do zip files work?
An application that has exploded in the past 20 years is unzipping the DNA helix.
- Human DNA has about 3 billion base pairs
- With simple encoding (one byte per base), this would take about 3 Gigabytes to store the genome of one human (by using 2 bits per base, you could reduce this by a factor of 4)
- However, humans (or members of any one species) have well over 99% similarity in their DNA
- So by comparing to a "Reference Genome" for the species of interest (and just recording the differences), you can reduce the size of a stored genome to less than 1% of the original size.
- Of course, you still need to store the Reference Genome before you can compress the first subsequent genome.
- There are a number of specialised programs that do this
See: https://en.wikipedia.org/wiki/Compression_of_Genomic_Sequencing_Data
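Here is a toy sketch of the reference-genome idea, plus the storage arithmetic from the list above. It only handles single-letter substitutions (real tools also cope with insertions and deletions), and the random sequence, the 0.1% difference rate and all the function names are assumptions for illustration.

```python
import random

BASES = "ACGT"

def bytes_needed(base_pairs: int, bits_per_base: int) -> int:
    """Storage needed for a genome at a fixed number of bits per base."""
    return base_pairs * bits_per_base // 8

def diff_against_reference(reference: str, genome: str) -> list[tuple[int, str]]:
    """Record only the positions where genome differs from reference."""
    assert len(reference) == len(genome)
    return [(i, g) for i, (r, g) in enumerate(zip(reference, genome)) if r != g]

def rebuild(reference: str, diffs: list[tuple[int, str]]) -> str:
    """Recreate the full genome from the reference plus the recorded differences."""
    bases = list(reference)
    for position, base in diffs:
        bases[position] = base
    return "".join(bases)

print(bytes_needed(3_000_000_000, 8) / 1e9, "GB at one byte per base")
print(bytes_needed(3_000_000_000, 2) / 1e9, "GB at 2 bits per base")

# A made-up reference of 100,000 bases and an individual differing at 100 positions.
rng = random.Random(0)
reference = "".join(rng.choice(BASES) for _ in range(100_000))
individual = list(reference)
for i in rng.sample(range(len(individual)), 100):
    individual[i] = rng.choice([b for b in BASES if b != individual[i]])
genome = "".join(individual)

diffs = diff_against_reference(reference, genome)
print(len(genome), "bases stored as", len(diffs), "differences")
assert rebuild(reference, diffs) == genome
```

Each stored difference needs a position as well as a base, so the saving is not quite as large as the raw count suggests, but it is still dramatic.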
Of course, these specialised data compression programs are optimised for the particular kind of data they are expecting, in a particular format.
- They will perform really badly if you give them the wrong type of data; if you feed a DNA sequence into a video-compression algorithm, the results will not be pretty!
-
I was all ready to answer the question "How do Zip Ties Work?"
That's next week! And zip lines after that.
-
Here's the finished Question of the Week, with data scientist Peter Foster's contribution: https://www.thenakedscientists.com/podcasts/question-week/how-do-zip-files-work
-
Small update:
Instead of the ancient zip, your files are now often written with the modern zstd: 2-5x faster with better compression, used e.g. in the Linux kernel and in products of hundreds of companies: https://en.wikipedia.org/wiki/Zstandard
On Apple devices, with LZFSE: https://en.wikipedia.org/wiki/LZFSE
For DNA compression, the near-default is CRAM: https://en.wikipedia.org/wiki/CRAM_(file_format)
In the coming months JPEG (and GIF, PNG) should be replaced with the fresh JPEG XL, which promises up to 3x smaller photos and many missing features: https://en.wikipedia.org/wiki/JPEG_XL
In place of Huffman coding (which can only spend whole bits per symbol), all of the above use ANS (which handles fractional bits): https://en.wikipedia.org/wiki/Asymmetric_numeral_systems
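To see why fractional bits matter, here is a small calculation. It does not implement ANS; it only compares the theoretical information content of a very lopsided two-symbol source with the whole-bit cost a Huffman code is stuck with. The probabilities 0.9 and 0.1 are made up for the example.

```python
import math

def information_bits(p: float) -> float:
    """Ideal cost, in bits, of coding a symbol that occurs with probability p."""
    return -math.log2(p)

# A made-up source with one very common symbol and one rare one.
probs = {"common": 0.9, "rare": 0.1}

# Average ideal cost per symbol (the entropy of the source).
entropy = sum(p * information_bits(p) for p in probs.values())

# A Huffman code for two symbols must give each a code of exactly 1 bit.
huffman_avg = sum(p * 1 for p in probs.values())

for name, p in probs.items():
    print(f"{name}: ideal cost {information_bits(p):.3f} bits, Huffman cost 1 bit")
print(f"average: ideal {entropy:.3f} bits/symbol, Huffman {huffman_avg:.3f} bits/symbol")
```

The common symbol is worth only about 0.15 bits, so a coder that can spend fractions of a bit gets close to the 0.47 bits per symbol average, while whole-bit codes pay roughly double on a source like this.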