Compression

CS240E: Data Structures and Data Management (Enriched), 2019 Winter, Prof. Biedl. Notes by David Duan.

- Encoding Basics: Motivation, Measuring Performance, Fixed-Length Encoding/Decoding, Variable-Length Encoding/Decoding, Prefix-Free, Decoding of Prefix-Free Codes, Encoding From the Trie
- Huffman Encoding: Motivation, Procedure, Analysis (Proof: Huffman returns the "best" trie, Runtime Analysis), Exercises
- Run-Length Encoding: Strategy, Encoding, Decoding, Analysis, Exercises
- Lempel-Ziv-Welch: Strategy, Encoding, Decoding, Analysis, Exercises, Summary
- Move-To-Front Transform: Motivation, Procedures, Analysis (Input: Long Runs, Output: Uneven Distribution, Adaptive Dictionary, Runtime), Exercise
- Burrows-Wheeler Transform: Motivation, Encoding (Cyclic Shift, Strategy, Pseudocode)
- Bzip2: Pipeline, Analysis
Problem: How to store and transmit data?
The main objective of compression is to make the coded text $C$ small.
We look at the compression ratio $\frac{|C| \cdot \lg |\Sigma_C|}{|S| \cdot \lg |\Sigma_S|}$.
In this course, we usually let $\Sigma_C = \{0, 1\}$, and we focus on physical, lossless compression algorithms, as they can safely be used for any application.
Fixed-length encoding makes the decoding process easy, as we can break $C$ into blocks of equal length and translate each block back to the original character. However, the compression ratio is not optimal.
Example: Caesar Shift, ASCII
More frequent characters get shorter codewords.
Example: Morse Code (which can be represented as a trie), UTF-8 encoding (which can handle more than 107000 characters using only 1 to 4 bytes).
Note that Morse code is not uniquely decodable; it uses additional pauses to avoid ambiguity.
The prefix property requires that no code word in the system is a prefix (initial segment) of any other code word in the system. From now on, we only consider prefix-free codes (codes with the prefix property), which correspond to tries with the characters of $\Sigma_S$ only at the leaves.
The codewords need no end-of-string symbol if the code is prefix-free.
Any prefix-free code is uniquely decodable.
Runtime: $O(|C|)$. Note that $i$ is incremented in the inner while loop. Intuitively, each character of $C$ we consume corresponds to one step traversing the trie.
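To make the traversal concrete, here is a minimal Python sketch of prefix-free decoding, assuming the trie is stored as nested dicts with source characters at the leaves (the representation and names are illustrative, not the course's official pseudocode).

```python
def prefix_free_decode(C, trie):
    """Decode bitstring C using a prefix-free code given as a trie.

    The trie is a nested dict: an internal node maps bit '0'/'1' to a
    child; a leaf is just the source character itself.
    """
    S = []
    i = 0
    while i < len(C):
        node = trie
        # inner loop: walk down the trie, consuming one bit of C per step
        while isinstance(node, dict):
            node = node[C[i]]
            i += 1
        S.append(node)              # reached a leaf: output its character
    return "".join(S)

# the code {a: 0, b: 10, c: 11} as a trie
trie = {'0': 'a', '1': {'0': 'b', '1': 'c'}}
print(prefix_free_decode("010110", trie))   # abca
```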
We can also encode directly from the trie.
Runtime: $O(|T| + |C|)$.
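A sketch of encoding from the trie under the same illustrative representation: one traversal of the trie builds a character-to-codeword table, then each character of the source text is looked up.

```python
def build_code_table(trie, prefix="", table=None):
    """One traversal of the trie collects the codeword of every leaf."""
    if table is None:
        table = {}
    if isinstance(trie, dict):                  # internal node
        for bit, child in trie.items():
            build_code_table(child, prefix + bit, table)
    else:                                       # leaf: trie is a character
        table[trie] = prefix
    return table

def encode(S, trie):
    table = build_code_table(trie)              # one pass over the trie: O(|T|)
    return "".join(table[c] for c in S)         # writes |C| bits in total

trie = {'0': 'a', '1': {'0': 'b', '1': 'c'}}
print(encode("abca", trie))                     # 010110
```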
YouTube resources:
- Concise: https://www.youtube.com/watch?v=dM6us854Jk0
- Detailed: https://www.youtube.com/watch?v=ikswC-irwY8 (Stanford)
Consider a source text over $\Sigma_S = \{L, O, E, S\}$ with the prefix-free code given by the following trie:
```
         *
      0/   \1
      *     *
    0/ \1 0/ \1
    L   O E   S
```
Then each character is encoded by its root-to-leaf path: L = 00, O = 01, E = 10, S = 11. Observe that the length of the coded text is $|C| = \sum_{c \in \Sigma_S} \text{freq}(c) \cdot \text{depth}_T(c)$.
To minimize $|C|$, we want to put the infrequent characters "low" (i.e., deep) in the trie.
Huffman encoding is a greedy algorithm which gives us the "best" trie, i.e., one that minimizes $|C| = \sum_{c} \text{freq}(c) \cdot \text{depth}_T(c)$.
Strategy
We build a mini-trie using the two least frequent characters and treat this trie as a single character whose frequency is the sum of its two children's frequencies. Repeat this process until all used characters have been merged into a single trie.
Note that we need to return both encoded text and trie/dictionary, because otherwise we can't decode the text as the algorithm may return many variations of the trie (not unique).
Pseudocode
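A sketch of the greedy construction using Python's heapq, assuming the frequency array is already computed; ties in the heap are broken by an insertion counter, which is one reason the resulting trie is not unique. The frequencies in the demo are illustrative.

```python
import heapq

def huffman_trie(freq):
    """Build a Huffman trie from {character: frequency}.

    A leaf is a character; an internal node is a pair (left, right).
    """
    heap = [(f, i, c) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # two least frequent tries
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))   # mini-trie
        counter += 1
    return heap[0][2]

def codewords(trie, prefix=""):
    """Read the codeword of every character off the trie."""
    if isinstance(trie, tuple):
        yield from codewords(trie[0], prefix + "0")
        yield from codewords(trie[1], prefix + "1")
    else:
        yield trie, prefix

freq = {'a': 5, 'b': 2, 'c': 1, 'd': 1}     # illustrative frequencies
print(dict(codewords(huffman_trie(freq))))
# {'b': '00', 'c': '010', 'd': '011', 'a': '1'}
```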
Huffman encoding uses two passes over the source text: one to build the frequency array (from which the trie is constructed via a min-heap), and one to encode.
Huffman Code is not "optimal" in the sense that it does not always produce the shortest possible coded text; encodings that are not prefix-free, or that encode several characters at once (such as LZW), can do better.
A correct statement is: Huffman encoding finds the prefix-free trie that minimizes $|C| = \sum_{c} \text{freq}(c) \cdot \text{depth}_T(c)$.
We do induction on $n = |\Sigma_S|$. We want to show that the trie $T$ constructed using Huffman satisfies $\text{cost}(T) \le \text{cost}(T')$ for any trie $T'$ for $\Sigma_S$, where $\text{cost}(T) = \sum_{c} \text{freq}(c) \cdot \text{depth}_T(c)$.
Base Case: $|\Sigma_S| = 2$; call the two characters $a$ and $a'$. Then Huffman returns
```
  *
 / \
a   a'
```
which is the best possible trie.
Step: We know the Huffman trie $T$ has the above mini-trie (on the two least frequent characters $a$ and $a'$) at the bottom. Consider a different trie $T'$ in which $b$ and $b'$ are the characters at the lowest level, while $a$ and $a'$ are somewhere else.
Suppose $\{b, b'\} = \{a, a'\}$; if we remove them, then both $T$ and $T'$ give tries for the alphabet $\Sigma' = (\Sigma_S \setminus \{a, a'\}) \cup \{x\}$, where $x$ is a new character with $\text{freq}(x) = \text{freq}(a) + \text{freq}(a')$. Observe that the reduced $T$ is the Huffman trie for $\Sigma'$, so by IH it has minimum cost over $\Sigma'$. Since each of $T$ and $T'$ costs exactly $\text{freq}(a) + \text{freq}(a')$ more than its reduced trie, we get $\text{cost}(T) \le \text{cost}(T')$.
Otherwise, switch $b$ and $b'$ with $a$ and $a'$ in $T'$ (this cannot increase the cost, since $a$ and $a'$ are the least frequent characters) and argue on the resulting tree. Details skipped.
Encoding: $O(|\Sigma_S| \log |\Sigma_S| + |C|)$.
Decoding: $O(|C|)$ (see 1.4.2).
{'A':10, 'E':15, 'I':12, 'S':3, 'T':4, 'P':13, '\n':1}
Solution
Variable-length & multi-character encoding: multiple source-text characters receive one code word; the source and coded alphabets are both binary ($\Sigma_S = \Sigma_C = \{0, 1\}$); the decoding dictionary is uniquely defined and not explicitly stored.
Elias Gamma Code
We encode $k$ with $\lfloor \lg k \rfloor$ copies of 0, followed by the binary representation of $k$ (which always starts with 1):
| $k$ | $\lfloor \lg k \rfloor$ | $k$ in binary | encoding |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 10 | 010 |
| 3 | 1 | 11 | 011 |
| 4 | 2 | 100 | 00100 |
| 5 | 2 | 101 | 00101 |
| 6 | 2 | 110 | 00110 |
| ... | ... | ... | ... |
Example
.
Thus: .
Pseudocode
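A Python sketch of RLE encoding under the scheme described above: output the first bit of $S$, then the Elias gamma code of each run length (variable names are illustrative).

```python
def elias_gamma(k):
    """Elias gamma code of k >= 1: floor(lg k) zeros, then k in binary."""
    b = bin(k)[2:]                      # binary representation, starts with 1
    return "0" * (len(b) - 1) + b

def rle_encode(S):
    """Run-length encode a bitstring: first bit of S, then each run length."""
    C = [S[0]]
    i = 0
    while i < len(S):
        j = i
        while j < len(S) and S[j] == S[i]:
            j += 1                      # run S[i..j-1]
        C.append(elias_gamma(j - i))
        i = j
    return "".join(C)

print(rle_encode("00000111"))           # 000101011 = '0' + gamma(5) + gamma(3)
```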
Example
As a remark, let $\ell$ denote the number of leading zeros we encounter in each iteration; after them, we always read $\ell + 1$ bits to get the length of the current block.
.
Pseudocode
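The matching decoder sketch: each iteration counts $\ell$ leading zeros, reads the $\ell + 1$-bit run length, and relies on the fact that consecutive runs alternate in value.

```python
def rle_decode(C):
    """Decode a run-length encoded bitstring."""
    bit = C[0]                          # value of the first run
    S = []
    i = 1
    while i < len(C):
        ell = 0
        while C[i] == "0":              # count leading zeros
            ell += 1
            i += 1
        k = int(C[i:i + ell + 1], 2)    # next ell + 1 bits: run length in binary
        i += ell + 1
        S.append(bit * k)
        bit = "1" if bit == "0" else "0"    # runs alternate
    return "".join(S)

print(rle_decode("000101011"))          # 00000111
```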
This works very well for long runs. For example, if your string consists of $k$ zeros, then you can use roughly $2 \lg k$ bits to represent the string.
However, RLE works poorly when there are many runs of length 1 or 2, because the encoded text might be even longer than the original text. To see this, observe you need three bits to represent a block of length two and five bits to represent a block of length four. In fact, for runs of length up to 5, you do at most as well as before.
Thus, this is bad for random bitstrings, but good for sending pictures with a lot of white and a few black spots.
Solution
Huffman and RLE mostly take advantage of frequent or repeated single characters, but we can also exploit the fact that certain substrings show up more frequently than others. The LZW algorithm encodes substrings with code numbers; it also finds such substrings automatically. This is used in the `compress` command and the GIF format.
Always encode the longest substring that already has a code word, then assign a new code word to the substring that is one character longer.
Adaptive Encoding
In ASCII, UTF-8, and RLE, the dictionaries are fixed, i.e., each code word represents some fixed pattern (e.g., in ASCII, code 65 always represents 'A'). In Huffman, the dictionary is not fixed, but it is static, i.e., the dictionary stays the same for the entire encoding/decoding process (we compute the frequency array beforehand).
However, in LZW, the dictionary is adaptive:
In short, the dictionary keeps changing after receiving new information. As a result, both encoder and decoder must know how the dictionary changes.
Procedure
Example
Final output: str(bin(65)) + str(bin(78)) + ... + str(bin(129))
where `+` denotes string concatenation.
Pseudocode
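A sketch of LZW encoding in Python. For simplicity the dictionary is a plain dict of substrings; the linear-time claim in the analysis below assumes the dictionary is kept as a trie instead, so this version is only illustrative.

```python
def lzw_encode(S):
    """LZW-encode a string; output a list of code numbers.

    The dictionary starts as ASCII (codes 0..127); newly seen substrings
    get codes 128, 129, ...
    """
    D = {chr(i): i for i in range(128)}
    next_code = 128
    C = []
    i = 0
    while i < len(S):
        j = i + 1
        while j < len(S) and S[i:j + 1] in D:
            j += 1                      # longest substring S[i..j-1] already in D
        C.append(D[S[i:j]])
        if j < len(S):
            D[S[i:j + 1]] = next_code   # add the substring one character longer
            next_code += 1
        i = j
    return C

print(lzw_encode("ANANAS"))             # [65, 78, 128, 65, 83]
```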
Procedure
We build the dictionary which maps numbers to strings while reading strings. To save space, store string as code of prefix plus one character. If we want to decode something but it is not yet in dictionary, we know the "new character" for this one must be the first character in its parent.
Example
Start with being ASCII. Convert to .
Convert to ; add to with .
Convert to ; add to with .
Convert to space; add to with .
Convert to ; add to with .
Convert to ; add to with .
Convert the next code number: it is not in $D$ yet!
However, we know it would be added to $D$ representing $s_{\text{prev}}x$, where $s_{\text{prev}}$ is the previously decoded string and $x$ is the unknown next character. After that, we would use $s_{\text{prev}}x$ as the decoded string, so $x$ is its first character, i.e., the first character of $s_{\text{prev}}$. This tells us $x$! In general, if a code number is not yet in $D$, then it encodes "previous string + first character of previous string".
Pseudocode
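The matching decoder sketch. It stores full strings in the dictionary for readability (the notes suggest storing the code of the prefix plus one character to save space) and handles the special case of a code number that is not yet in the dictionary.

```python
def lzw_decode(C):
    """Decode a list of LZW code numbers back to a string."""
    D = {i: chr(i) for i in range(128)}     # start with ASCII
    next_code = 128
    prev = D[C[0]]
    S = [prev]
    for code in C[1:]:
        if code in D:
            cur = D[code]
        else:
            # code not in D yet: it must be "previous string +
            # first character of previous string"
            cur = prev + prev[0]
        S.append(cur)
        D[next_code] = prev + cur[0]        # prev's follow-up char is now known
        next_code += 1
        prev = cur
    return "".join(S)

print(lzw_decode([65, 78, 128, 65, 83]))    # ANANAS
```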
Encoding: parse each character while going down in the trie, so $O(|S|)$ where $S$ is the input string.
Decoding: parse each code word while going down in the trie, so $O(|C|)$ where $C$ is the coded text (plus the time to write out the decoded string).
Overall linear time.
Solution
Recall the MTF-heuristic when storing a dictionary as an unsorted array. The search is potentially slow, but if we have reasons to believe that we frequently search for some items repeatedly, then the MTF-heuristic means a very fast search.
Suppose the input $S$ has long runs of characters. We can view $S$ as a sequence of requests for characters in a dictionary that contains ASCII. If we store the dictionary as an unsorted array with the MTF-heuristic, we can transform $S$ into a sequence $C$ of integers by writing, for each such "request-character" $c$, the index where $c$ was found. In general, if $S$ has a character repeating $k$ times, then $C$ will have a run of $k - 1$ zeros. You will see this becomes useful when processing output from the Burrows-Wheeler transform.
The output $C$ should also have the property that small indices are much more likely than large indices, presuming we start with ASCII. This holds because used characters are brought to the front by MTF, so if they are ever used again they get a much smaller number. Therefore, the distribution of characters in $C$ ("characters" here meaning the numbers in $C$) is quite uneven, making it suitable for Huffman encoding.
The MTF transform also uses an adaptive dictionary, since the dictionary changes at every step. However, the change happens in a well-defined manner that only depends on the currently encoded character. Therefore, the decoder can easily emulate it, and no special tricks are needed for decoding.
The runtime is proportional to the total time it takes to find the characters in the dictionary, which is $O(|S| \cdot |\Sigma_S|)$ in the worst case, but in practice the performance is much better since frequently used characters are found quickly.
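A sketch of both directions of the MTF transform, using a Python list as the unsorted dictionary (starting with ASCII); the decoder applies exactly the same moves, which is why no extra information needs to be transmitted.

```python
ASCII = [chr(i) for i in range(128)]

def mtf_encode(S):
    """Output the current index of each character, then move it to the front."""
    D = list(ASCII)
    C = []
    for c in S:
        i = D.index(c)              # linear search: fast for recently used chars
        C.append(i)
        D.pop(i)
        D.insert(0, c)              # move to front
    return C

def mtf_decode(C):
    """Replay the same moves to invert the transform."""
    D = list(ASCII)
    S = []
    for i in C:
        c = D[i]
        S.append(c)
        D.pop(i)
        D.insert(0, c)
    return "".join(S)

print(mtf_encode("aaabbb"))                 # [97, 0, 0, 98, 0, 0]: runs become zeros
print(mtf_decode(mtf_encode("aaabbb")))     # aaabbb
```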
Solution.
Given a string $S = S[0..n-1]$, the $i$th cyclic shift of $S$ is the string $S[i..n-1]S[0..i-1]$, i.e., the $i$th suffix of $S$, with the "rest" of $S$ appended behind it.
Take the cyclic shifts of the source text $S$, sort them alphabetically, and then output the last character of each cyclic shift, in sorted order.
To make BWT encoding more efficient in practice, observe that the $j$th character of the $i$th cyclic shift is $S[(i + j) \bmod n]$, so we can read any cyclic shift directly from $S$ without explicitly constructing all the shifts for the MSD radix sort.
We said to sort the cyclic shifts, but actually, for extracting the encoding, all we need is to describe what the sorting would have been, or put differently, to give the sorting permutation. It turns out that it is better to have the inverse sorting permutation $\pi$ that satisfies the following: if we list the cyclic shifts in sorted order, then the $i$th one is the $\pi(i)$-th cyclic shift of $S$. In consequence, the $i$th character of the encoding is $S[\pi(i) - 1]$ (or $S[n - 1]$ if $\pi(i) = 0$).
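A sketch of BWT encoding along these lines: it sorts the starting indices of the cyclic shifts (for simplicity each shift is materialized inside the sort key; an efficient implementation would use a suffix array instead) and then outputs, for each sorted shift, the character just before its starting position.

```python
def bwt_encode(S):
    """Burrows-Wheeler transform of S.

    S should end with a unique end-of-text symbol such as '$' so the
    cyclic shifts sort unambiguously.
    """
    n = len(S)
    # sort the shifts by comparing them; the shift starting at i is S[i:] + S[:i]
    shifts = sorted(range(n), key=lambda i: S[i:] + S[:i])
    # the last character of the shift starting at i is S[i - 1]
    # (which is S[n - 1] when i == 0, handled by Python's index -1)
    return "".join(S[i - 1] for i in shifts)

print(bwt_encode("banana$"))        # annb$aa
```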
TBC
Text $T_0$: Original Text
↓ Burrows-Wheeler Transform
Text $T_1$
↓ Move-To-Front Transform
Text $T_2$
↓ Modified Run-Length Encoding
Text $T_3$
↓ Huffman Encoding
Text $T_4$: Encoded Text
Encoding can be slow: the BWT is the bottleneck (recall it requires sorting all cyclic shifts, e.g., via a suffix array built with modified MSD radix sorting).
Decoding is fast: both MTF and BWT can be decoded in $O(n)$ time (treating the alphabet size as a constant), without needing anything more complicated than counting sort.
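For completeness, a sketch of the standard inverse BWT (the notes' own pseudocode is marked TBC above). A stable sort of the positions of the encoded text by character plays the role of the counting sort; following the resulting links recovers the source text.

```python
def bwt_decode(L):
    """Invert the BWT.  L[i] is the last character of the i-th sorted
    cyclic shift; '$' marks the end of the original text."""
    n = len(L)
    # Stable sort of positions by character: lf[r] is the position in L that
    # holds the r-th character of the sorted first column.  With a counting
    # sort over the alphabet this step takes O(n + |Sigma|) time.
    lf = sorted(range(n), key=lambda i: L[i])
    row = L.index('$')              # the row of the un-rotated text
    out = []
    for _ in range(n):
        row = lf[row]               # move to the row of the next cyclic shift
        out.append(L[row])          # its first character is the next text char
    return "".join(out)

print(bwt_decode("annb$aa"))        # banana$
```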
Drawback: We need the entire source text at once because BWT needs to sort the cyclic shifts. Thus it is not possible to implement bzip2 such that both input and output are streams, i.e., we cannot use it for large texts that do not fit into memory.