Naked Science Forum

General Science => General Science => Topic started by: Atomic-S on 03/12/2007 00:30:55

Title: Ultimate telecommunications compression
Post by: Atomic-S on 03/12/2007 00:30:55
Regarding the ideal compression of telecommunications transmissions:

For various types of data there are various ways to compress them. For sound there is MP3; for pictures, JPEG. Presumably they are not interchangeable. To send a stream of assorted types of data efficiently through a telecommunications channel, therefore, one cannot use just one of these schemes, but must invoke the appropriate scheme for each message (file). But if the message traffic is general -- not limited to specific recognized data types, but potentially composed of arbitrary data types, some of which may not be known in advance -- this will not work. You need a generalized compression algorithm that is applicable to arbitrary data streams.

But how? 

The information value of any symbol (e.g., byte) in a message is equal to the logarithm of its unlikelihood (the reciprocal of its probability), and the information content of the whole message equals the sum of those values. Thus, a character in a long string of the same character may be regarded as having a 100% probability of occurrence, so its information value (expressed in bits) is log_2(1) = 0; the message then has an information value of 0, and there is therefore no point in sending it, because its content is already known. At the other extreme, in a string of random bytes, any particular value occurs at any one position with probability 1/256, so it has an unlikelihood of 256 and an information value of log_2(256) = 8 bits, as expected; a file of n random bytes has an information content of 8n bits.
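
To make that concrete, here is a minimal sketch (Python, purely illustrative) that estimates the information content of a byte string from its empirical symbol frequencies alone -- the "zero-order" measure, which ignores the conditional dependencies discussed further down:

Code:
from collections import Counter
from math import log2

def zero_order_information(data: bytes) -> float:
    """Estimate information content in bits, treating each byte as an
    independent symbol drawn with its observed frequency."""
    counts = Counter(data)
    total = len(data)
    bits = 0.0
    for count in counts.values():
        p = count / total          # empirical probability of this byte value
        bits += count * -log2(p)   # each occurrence contributes -log2(p) bits
    return bits

print(zero_order_information(b"aaaaaaaa"))        # 0.0  (a run of one character)
print(zero_order_information(bytes(range(256))))  # 2048.0  (8 bits x 256 bytes)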

All other messages fall somewhere in between these extremes. In standard English text, the letter "e" occurs quite frequently, and thus on average carries less information than, say, "x". The information value of each letter is therefore different. Compression would say: let us arrange all these letters in order of probability, and then assign short bit strings to the most probable letters and longer bit strings to the less probable ones (as in the Morse code). This does indeed achieve compression; however, the process is imperfect in that it assumes the probability of occurrence of any letter is simply the number of occurrences of it in an average string of such data, divided by the total number of characters in the string. It is equivalent to saying that we will create a loaded die having as many faces as there are possible characters, such that upon tossing it, the probability of any character coming up will match its frequency in a long string of text. But this approach is weak in two ways. First, the die has to be constructed from preconceived understandings of what the data will typically be like, which amounts to assuming in advance that it will be of one kind or another (pictures, sound, text, whatever), which is what we want to avoid. Secondly, this scheme does not take into account the effect of conditional probability: namely, that the probability of a certain character occurring at a particular place in a message may depend upon other elements of the message.
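
As an illustration of that frequency-to-codeword idea, here is a small Huffman-style sketch (Python, illustrative only; the frequency table is made up and merely stands in for counts taken from "an average string of such data"):

Code:
import heapq

def huffman_code(freqs: dict[str, int]) -> dict[str, str]:
    """Build a prefix code: frequent symbols get short bit strings."""
    # Each heap entry: (frequency, tie-breaker, {symbol: partial codeword})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)          # two least frequent groups
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical letter counts from a sample of English text:
freqs = {"e": 120, "t": 90, "a": 80, "x": 2, "q": 1}
print(huffman_code(freqs))  # "e" gets a 1-bit code; "x" and "q" get long ones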

To take all this into account, it would appear necessary to seek a general compression algorithm which, without knowing in advance what will be coming through the channel, is able to analyze it, reduce it to its minimum size, and reconstruct it at the other end. In so doing it would have to take into consideration the fact that the probability, at any one position, of any particular character could depend, in general, upon exactly what other characters occur everywhere else in the message (or, more generally, in the entire traffic stream). This means that we do not simply consider how frequently a specific character occurs on average, but examine the probability of each specific occurrence of it in the light of the entire content of the rest of the message, and do so in a situation where the messages may include types not previously known or anticipated. One suspects that such a process would have to be capable of "learning". All this would seem, in general, to be a formidable mathematical problem.
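
One way to picture what such "learning" might look like is an adaptive model that starts with no assumptions about the data and updates its symbol probabilities as the stream goes past, conditioned on the previous symbol. The sketch below (Python, illustrative only) shows just the modelling half -- an order-1 adaptive predictor of the kind an entropy coder could be driven by; the actual encoding step is omitted:

Code:
from collections import defaultdict
from math import log2

class Order1Model:
    """Adaptive order-1 model: P(next byte | previous byte), learned on the fly.
    Starts from uniform counts, so no prior knowledge of the data type is needed."""

    def __init__(self):
        # counts[context][symbol], lazily initialised to 1 (Laplace smoothing)
        self.counts = defaultdict(lambda: defaultdict(lambda: 1))
        self.totals = defaultdict(lambda: 256)

    def cost_in_bits(self, context: int, symbol: int) -> float:
        p = self.counts[context][symbol] / self.totals[context]
        return -log2(p)

    def update(self, context: int, symbol: int) -> None:
        self.counts[context][symbol] += 1
        self.totals[context] += 1

def estimated_size_bits(data: bytes) -> float:
    """Bits an ideal entropy coder driven by this model would need."""
    model, context, bits = Order1Model(), 0, 0.0
    for symbol in data:
        bits += model.cost_in_bits(context, symbol)
        model.update(context, symbol)   # learn from what was just seen
        context = symbol
    return bits

text = b"the theory of the thing" * 50
print(estimated_size_bits(text) / (8 * len(text)))  # well below 1 for repetitive text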

Does anyone know what is involved in solving it, and just how close to solving it practical science has been able to come?
Title: Ultimate telecommunications compression
Post by: another_someone on 03/12/2007 03:26:44
One of the things you have to bear in mind is that both MPEG and JPEG are lossy compression algorithms.  What this means is that in both cases, what comes out of the decompression algorithm is not exactly the same as what went into the compression algorithm.  The theory is that the data that is lost is such that our senses would scarcely notice it (e.g. MPEG will ignore some quiet sounds if they are masked by loud sounds - the theory being that the human ear will probably not notice if a quiet sound goes missing in such a situation).  Clearly, for this kind of lossy compression to work, it must have some understanding of the type of data it is handling (after all, if you were sending down a database containing payments to be made into a bank account, the last thing you want is to lose some of the data).

General purpose compression algorithms are things like Huffman compression (which seems to be roughly what you are describing) and LZW compression (which is contextual in nature - the idea with LZW is essentially that if it sees a 'T' followed by an 'H', it will remember it, and the next time it sees a 'T' followed by an 'H' it will replace them with a single symbol, although that symbol will be wider than a byte).  Both compression algorithms are quite old (LZW used to be encumbered by patents, but no longer).  LZW is one of the standards applied to voice modem data compression, and is also used within the GIF standard to compress image data - but the GIF standard ran foul of the patents on the algorithm, which is why PNG was developed.
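
A minimal sketch of that dictionary-building idea (Python, illustrative only -- a real LZW implementation would also cap the dictionary size and pack the codes into a bit stream):

Code:
def lzw_compress(data: bytes) -> list[int]:
    """Basic LZW: grow a dictionary of previously seen sequences and emit
    one code per longest match, so repeated sequences shrink to single codes."""
    dictionary = {bytes([i]): i for i in range(256)}  # start with all single bytes
    next_code = 256
    current = b""
    output = []
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            current = candidate                    # keep extending the match
        else:
            output.append(dictionary[current])     # emit code for longest match
            dictionary[candidate] = next_code      # remember the new sequence
            next_code += 1
            current = bytes([byte])
    if current:
        output.append(dictionary[current])
    return output

data = b"THE THEORY THEN THE THEME"
codes = lzw_compress(data)
print(len(codes), "codes for", len(data), "input bytes")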

Other standard compression algorithms also exist, but Huffman and LZW are the archetypal variable-symbol-length and variable-string-length encoding methods.

You can modify Huffman encoding so that you change tables based on the context you are in: if the character you have just seen is a 'T' (itself a common character, and so given a short symbol), you can switch to a new table that puts 'H' higher up in the ranking.  The biggest problem with this is that you can easily accumulate an awful lot of tables, and this can eat up a lot of memory.
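
To see why the table count grows, here is a tiny sketch (Python, illustrative) that simply tallies one frequency table per preceding character from a sample; each of those tables would need its own code table, and with a full byte alphabet you could end up with up to 256 of them (far more if the context is longer than one character):

Code:
from collections import defaultdict, Counter

def context_tables(text: str) -> dict[str, Counter]:
    """One frequency table per single-character context (order-1)."""
    tables = defaultdict(Counter)
    for prev, curr in zip(text, text[1:]):
        tables[prev][curr] += 1
    return tables

sample = "the theory of the thing that they thought through"
tables = context_tables(sample)
print(f"{len(tables)} distinct contexts -> {len(tables)} separate code tables")
print(tables["t"].most_common(2))  # after 't', 'h' dominates the ranking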
Title: Ultimate telecommunications compression
Post by: Soul Surfer on 03/12/2007 09:23:17
Atomic  As another_someone has already said, generalised codings of the kind similar to what you describe have been used for years and are far from the most effective means of coding signals. They work well with structured data but not with encrypted data, where the whole idea is to randomise the data as much as possible.  The lossy coding processes used for MPEG and JPEG coding, and many other coding systems for sound and images, compress the data a very great deal, depending on how poor an image or sound you will tolerate.

It is truly amazing what you can get down standard telecoms lines with advanced modems and data compression.  When I first started out in electronics, 50 bit per second mechanical teleprinters were standard, and if you went above 1000 bits per second you needed special lines. 
Title: Ultimate telecommunications compression
Post by: lyner on 06/12/2007 14:05:25
Quote
When I first started out in electronics, 50 bit per second mechanical teleprinters were standard, and if you went above 1000 bits per second you needed special lines.
It was quite a feat to get a mechanical teleprinter to know which symbol to print out. It involved getting enough current to operate a thumping great solenoid, reliably, at the front end six or seven times for each symbol. And, for ASCII, more like eleven times.
A big factor in that was the way the GPO used to charge you! They were making plenty enough money on their old gear not to bother with improving it.
It was only when there was an alternative that commercial pressures allowed the technology to shine through.