Compression

CS240E: Data Structures and Data Management (Enriched), 2019 Winter, Prof. Biedl. Notes by David Duan.

- Encoding Basics: Motivation, Measuring Performance, Fixed-Length Encoding/Decoding, Variable-Length Encoding/Decoding, Prefix-Free, Decoding of Prefix-Free Codes, Encoding From the Trie
- Huffman Encoding: Motivation, Procedure, Analysis (Proof: Huffman returns the "best" trie, Runtime Analysis), Exercises
- Run-Length Encoding: Strategy, Encoding, Decoding, Analysis, Exercises
- Lempel-Ziv-Welch: Strategy, Encoding, Decoding, Analysis, Exercises, Summary
- Move-To-Front Transform: Motivation, Procedures, Analysis (Input: Long Runs, Output: Uneven Distribution, Adaptive Dictionary, Runtime), Exercise
- Burrows-Wheeler Transform: Motivation, Encoding (Cyclic Shift, Strategy, Pseudocode)
- Bzip2: Pipeline, Analysis
Problem: How to store and transmit data?
The main objective of compression is to make the coded text $C$ small.
We look at the compression ratio $\frac{|C| \cdot \lg |\Sigma_C|}{|S| \cdot \lg |\Sigma_S|}$.
In this course, we usually let $\Sigma_C = \{0, 1\}$, and we focus on physical, lossless compression algorithms, as they can safely be used for any application.
Fixed-length encoding makes the decoding process easy, as we can break $C$ into blocks of equal length and translate each block back to the original character. However, the compression ratio is not optimal.
Example: Caesar Shift, ASCII
More frequent characters get shorter codewords.
Example: Morse Code (which can be represented as a trie), UTF-8 encoding (which can handle more than 107000 characters using only 1 to 4 bytes).
Note that Morse code is not uniquely decodable; it uses additional pauses to avoid ambiguity.
The prefix property requires that no code word in the system is a prefix (initial segment) of any other code word in the system. From now on, we only consider prefix-free codes (codes with the prefix property), which correspond to tries with the characters of $\Sigma_S$ only at the leaves.
The codewords need no end-of-string symbol if the code is prefix-free.
Any prefix-free code is uniquely decodable.
Runtime: $O(|C|)$. Note that $i$ is incremented in the inner while loop. Intuitively, each character of $C$ we consume corresponds to one step traversing the trie.
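To make the traversal concrete, here is a minimal Python sketch of prefix-free decoding, assuming the trie is stored as nested dicts with source characters at the leaves (the representation and names are illustrative, not the course's official pseudocode).

```python
def prefix_free_decode(C, trie):
    """Decode bitstring C using a prefix-free code given as a trie.

    The trie is a nested dict: an internal node maps bit '0'/'1' to a
    child; a leaf is just the source character itself.
    """
    S = []
    i = 0
    while i < len(C):
        node = trie
        # inner loop: walk down the trie, consuming one bit of C per step
        while isinstance(node, dict):
            node = node[C[i]]
            i += 1
        S.append(node)              # reached a leaf: output its character
    return "".join(S)

# the code {a: 0, b: 10, c: 11} as a trie
trie = {'0': 'a', '1': {'0': 'b', '1': 'c'}}
print(prefix_free_decode("010110", trie))   # abca
```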
We can also encode directly from the trie.
Runtime: $O(|T| + |C|)$.
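A sketch of encoding from the trie under the same illustrative representation: one traversal of the trie builds a character-to-codeword table, then each character of the source text is looked up.

```python
def build_code_table(trie, prefix="", table=None):
    """One traversal of the trie collects the codeword of every leaf."""
    if table is None:
        table = {}
    if isinstance(trie, dict):                  # internal node
        for bit, child in trie.items():
            build_code_table(child, prefix + bit, table)
    else:                                       # leaf: trie is a character
        table[trie] = prefix
    return table

def encode(S, trie):
    table = build_code_table(trie)              # one pass over the trie: O(|T|)
    return "".join(table[c] for c in S)         # writes |C| bits in total

trie = {'0': 'a', '1': {'0': 'b', '1': 'c'}}
print(encode("abca", trie))                     # 010110
```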
YouTube resources:
- Concise: https://www.youtube.com/watch?v=dM6us854Jk0
- Detailed: https://www.youtube.com/watch?v=ikswC-irwY8 (Stanford)
Consider a source text over $\Sigma_S = \{L, O, E, S\}$ with the prefix-free code given by the following trie:
```
         *
      0/   \1
      *     *
    0/ \1 0/ \1
    L   O E   S
```
Then each character is encoded by its root-to-leaf path: L = 00, O = 01, E = 10, S = 11. Observe that the length of the coded text is $|C| = \sum_{c \in \Sigma_S} \text{freq}(c) \cdot \text{depth}_T(c)$.
To minimize $|C|$, we want to put the infrequent characters "low" (i.e., deep) in the trie.
Huffman encoding is a greedy algorithm which gives us the "best" trie, i.e., one that minimizes $|C| = \sum_{c} \text{freq}(c) \cdot \text{depth}_T(c)$.
Strategy
We build a mini-trie using the two least frequent characters and treat this trie as a single character whose frequency is the sum of its two children's frequencies. Repeat this process until all used characters have been merged into a single trie.
Note that we need to return both encoded text and trie/dictionary, because otherwise we can't decode the text as the algorithm may return many variations of the trie (not unique).
Pseudocode
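A sketch of the greedy construction using Python's heapq, assuming the frequency array is already computed; ties in the heap are broken by an insertion counter, which is one reason the resulting trie is not unique. The frequencies in the demo are illustrative.

```python
import heapq

def huffman_trie(freq):
    """Build a Huffman trie from {character: frequency}.

    A leaf is a character; an internal node is a pair (left, right).
    """
    heap = [(f, i, c) for i, (c, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)     # two least frequent tries
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, counter, (t1, t2)))   # mini-trie
        counter += 1
    return heap[0][2]

def codewords(trie, prefix=""):
    """Read the codeword of every character off the trie."""
    if isinstance(trie, tuple):
        yield from codewords(trie[0], prefix + "0")
        yield from codewords(trie[1], prefix + "1")
    else:
        yield trie, prefix

freq = {'a': 5, 'b': 2, 'c': 1, 'd': 1}     # illustrative frequencies
print(dict(codewords(huffman_trie(freq))))
# {'b': '00', 'c': '010', 'd': '011', 'a': '1'}
```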
Huffman encoding uses two passes over the source text: one to build the frequency array (from which the trie is constructed via a min-heap), and one to encode.
Huffman Code is not "optimal" in the sense that it does not always produce the shortest possible coded text; encodings that are not prefix-free, or that encode several characters at once (such as LZW), can do better.
A correct statement is: Huffman encoding finds the prefix-free trie that minimizes $|C| = \sum_{c} \text{freq}(c) \cdot \text{depth}_T(c)$.
We do induction on $n = |\Sigma_S|$. We want to show that the trie $T$ constructed using Huffman satisfies $\text{cost}(T) \le \text{cost}(T')$ for any trie $T'$ for $\Sigma_S$, where $\text{cost}(T) = \sum_{c} \text{freq}(c) \cdot \text{depth}_T(c)$.
Base Case: $|\Sigma_S| = 2$; call the two characters $a$ and $a'$. Then Huffman returns
```
  *
 / \
a   a'
```
which is the best possible trie.
Step: We know the Huffman trie $T$ has the above mini-trie (on the two least frequent characters $a$ and $a'$) at the bottom. Consider a different trie $T'$ in which $b$ and $b'$ are the characters at the lowest level, while $a$ and $a'$ are somewhere else.
Suppose $\{b, b'\} = \{a, a'\}$; if we remove them, then both $T$ and $T'$ give tries for the alphabet $\Sigma' = (\Sigma_S \setminus \{a, a'\}) \cup \{x\}$, where $x$ is a new character with $\text{freq}(x) = \text{freq}(a) + \text{freq}(a')$. Observe that the reduced $T$ is the Huffman trie for $\Sigma'$, so by IH it has minimum cost over $\Sigma'$. Since each of $T$ and $T'$ costs exactly $\text{freq}(a) + \text{freq}(a')$ more than its reduced trie, we get $\text{cost}(T) \le \text{cost}(T')$.
Otherwise, switch $b$ and $b'$ with $a$ and $a'$ in $T'$ (this cannot increase the cost, since $a$ and $a'$ are the least frequent characters) and argue on the resulting tree. Details skipped.
Encoding: $O(|\Sigma_S| \log |\Sigma_S| + |C|)$.
Decoding: $O(|C|)$ (see 1.4.2).
{'A':10, 'E':15, 'I':12, 'S':3, 'T':4, 'P':13, '\n':1}
Solution
Variable-length & multi-character encoding: multiple source-text characters receive one code word; the source and coded alphabets are both binary ($\Sigma_S = \Sigma_C = \{0, 1\}$); the decoding dictionary is uniquely defined and not explicitly stored.
Elias Gamma Code
We encode $k$ with $\lfloor \lg k \rfloor$ copies of 0, followed by the binary representation of $k$ (which always starts with 1):
| $k$ | $\lfloor \lg k \rfloor$ | $k$ in binary | encoding |
|---|---|---|---|
| 1 | 0 | 1 | 1 |
| 2 | 1 | 10 | 010 |
| 3 | 1 | 11 | 011 |
| 4 | 2 | 100 | 00100 |
| 5 | 2 | 101 | 00101 |
| 6 | 2 | 110 | 00110 |
| ... | ... | ... | ... |
Example
.
Thus: .
Pseudocode
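A Python sketch of RLE encoding under the scheme described above: output the first bit of $S$, then the Elias gamma code of each run length (variable names are illustrative).

```python
def elias_gamma(k):
    """Elias gamma code of k >= 1: floor(lg k) zeros, then k in binary."""
    b = bin(k)[2:]                      # binary representation, starts with 1
    return "0" * (len(b) - 1) + b

def rle_encode(S):
    """Run-length encode a bitstring: first bit of S, then each run length."""
    C = [S[0]]
    i = 0
    while i < len(S):
        j = i
        while j < len(S) and S[j] == S[i]:
            j += 1                      # run S[i..j-1]
        C.append(elias_gamma(j - i))
        i = j
    return "".join(C)

print(rle_encode("00000111"))           # 000101011 = '0' + gamma(5) + gamma(3)
```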
Example
As a remark, let $\ell$ denote the number of leading zeros we encounter in each iteration; after them, we always read $\ell + 1$ bits to get the length of the current block.
.
Pseudocode
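The matching decoder sketch: each iteration counts $\ell$ leading zeros, reads the $\ell + 1$-bit run length, and relies on the fact that consecutive runs alternate in value.

```python
def rle_decode(C):
    """Decode a run-length encoded bitstring."""
    bit = C[0]                          # value of the first run
    S = []
    i = 1
    while i < len(C):
        ell = 0
        while C[i] == "0":              # count leading zeros
            ell += 1
            i += 1
        k = int(C[i:i + ell + 1], 2)    # next ell + 1 bits: run length in binary
        i += ell + 1
        S.append(bit * k)
        bit = "1" if bit == "0" else "0"    # runs alternate
    return "".join(S)

print(rle_decode("000101011"))          # 00000111
```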
This works very well for long runs. For example, if your string consists of $k$ zeros, then you can use roughly $2 \lg k$ bits to represent the string.
However, RLE works poorly when there are many runs of length 1 or 2, because the encoded text might be even longer than the original text. To see this, observe you need three bits to represent a block of length two and five bits to represent a block of length four. In fact, for runs of length up to 5, you do at most as well as before.
Thus, this is bad for random bitstrings, but good for sending pictures with a lot of white and a few black spots.
Solution
Huffman and RLE mostly take advantage of frequent or repeated single characters, but we can also exploit the fact that certain substrings show up more frequently than others. The LZW algorithm encodes substrings with code numbers; it also finds such substrings automatically. This is used in the `compress` command and the GIF format.
Always encode the longest substring that already has a code word, then assign a new code word to the substring that is one character longer.
Adaptive Encoding
In ASCII, UTF-8, and RLE, the dictionaries are fixed, i.e., each code word represents some fixed pattern (e.g., in ASCII, code 65 always represents 'A'). In Huffman, the dictionary is not fixed, but it is static, i.e., the dictionary stays the same for the entire encoding/decoding process (we compute the frequency array beforehand).
However, in LZW, the dictionary is adaptive:
In short, the dictionary keeps changing after receiving new information. As a result, both encoder and decoder must know how the dictionary changes.
Procedure
Example
Final output: str(bin(65)) + str(bin(78)) + ... + str(bin(129))
where `+` denotes string concatenation.
Pseudocode
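A sketch of LZW encoding in Python. For simplicity the dictionary is a plain dict of substrings; the linear-time claim in the analysis below assumes the dictionary is kept as a trie instead, so this version is only illustrative.

```python
def lzw_encode(S):
    """LZW-encode a string; output a list of code numbers.

    The dictionary starts as ASCII (codes 0..127); newly seen substrings
    get codes 128, 129, ...
    """
    D = {chr(i): i for i in range(128)}
    next_code = 128
    C = []
    i = 0
    while i < len(S):
        j = i + 1
        while j < len(S) and S[i:j + 1] in D:
            j += 1                      # longest substring S[i..j-1] already in D
        C.append(D[S[i:j]])
        if j < len(S):
            D[S[i:j + 1]] = next_code   # add the substring one character longer
            next_code += 1
        i = j
    return C

print(lzw_encode("ANANAS"))             # [65, 78, 128, 65, 83]
```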
Procedure
We build the dictionary which maps numbers to strings while reading strings. To save space, store string as code of prefix plus one character. If we want to decode something but it is not yet in dictionary, we know the "new character" for this one must be the first character in its parent.
Example
Start with being ASCII. Convert to .
Convert to ; add to with .
Convert to ; add to with .
Convert to space; add to with .
Convert to ; add to with .
Convert to ; add to with .
Convert the next code number: it is not in $D$ yet!
However, we know it would be added to $D$ representing $s_{\text{prev}}x$, where $s_{\text{prev}}$ is the previously decoded string and $x$ is the unknown next character. After that, we would use $s_{\text{prev}}x$ as the decoded string, so $x$ is its first character, i.e., the first character of $s_{\text{prev}}$. This tells us $x$! In general, if a code number is not yet in $D$, then it encodes "previous string + first character of previous string".
Pseudocode
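The matching decoder sketch. It stores full strings in the dictionary for readability (the notes suggest storing the code of the prefix plus one character to save space) and handles the special case of a code number that is not yet in the dictionary.

```python
def lzw_decode(C):
    """Decode a list of LZW code numbers back to a string."""
    D = {i: chr(i) for i in range(128)}     # start with ASCII
    next_code = 128
    prev = D[C[0]]
    S = [prev]
    for code in C[1:]:
        if code in D:
            cur = D[code]
        else:
            # code not in D yet: it must be "previous string +
            # first character of previous string"
            cur = prev + prev[0]
        S.append(cur)
        D[next_code] = prev + cur[0]        # prev's follow-up char is now known
        next_code += 1
        prev = cur
    return "".join(S)

print(lzw_decode([65, 78, 128, 65, 83]))    # ANANAS
```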
Encoding: parse each character while going down in the trie, so $O(|S|)$ where $S$ is the input string.
Decoding: parse each code word while going down in the trie, so $O(|C|)$ where $C$ is the coded text (plus the time to write out the decoded string).
Overall linear time.
Solution
Recall the MTF-heuristic when storing a dictionary as an unsorted array. The search is potentially slow, but if we have reasons to believe that we frequently search for some items repeatedly, then the MTF-heuristic means a very fast search.
Suppose the input $S$ has long runs of characters. We can view $S$ as a sequence of requests for characters in a dictionary that contains ASCII. If we store the dictionary as an unsorted array with the MTF-heuristic, we can transform $S$ into a sequence $C$ of integers by writing, for each such "request-character" $c$, the index where $c$ was found. In general, if $S$ has a character repeating $k$ times, then $C$ will have a run of $k - 1$ zeros. You will see this becomes useful when processing output from the Burrows-Wheeler transform.
The output $C$ should also have the property that small indices are much more likely than large indices, presuming we start with ASCII. This holds because used characters are brought to the front by MTF, so if they are ever used again they get a much smaller number. Therefore, the distribution of characters in $C$ ("characters" here meaning the numbers in $C$) is quite uneven, making it suitable for Huffman encoding.
The MTF transform also uses an adaptive dictionary, since the dictionary changes at every step. However, the change happens in a well-defined manner that only depends on the currently encoded character. Therefore, the decoder can easily emulate it, and no special tricks are needed for decoding.
The runtime is proportional to the total time it takes to find the characters in the dictionary, which is $O(|S| \cdot |\Sigma_S|)$ in the worst case, but in practice the performance is much better since frequently used characters are found quickly.
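A sketch of both directions of the MTF transform, using a Python list as the unsorted dictionary (starting with ASCII); the decoder applies exactly the same moves, which is why no extra information needs to be transmitted.

```python
ASCII = [chr(i) for i in range(128)]

def mtf_encode(S):
    """Output the current index of each character, then move it to the front."""
    D = list(ASCII)
    C = []
    for c in S:
        i = D.index(c)              # linear search: fast for recently used chars
        C.append(i)
        D.pop(i)
        D.insert(0, c)              # move to front
    return C

def mtf_decode(C):
    """Replay the same moves to invert the transform."""
    D = list(ASCII)
    S = []
    for i in C:
        c = D[i]
        S.append(c)
        D.pop(i)
        D.insert(0, c)
    return "".join(S)

print(mtf_encode("aaabbb"))                 # [97, 0, 0, 98, 0, 0]: runs become zeros
print(mtf_decode(mtf_encode("aaabbb")))     # aaabbb
```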
Solution.
Given a string $S = S[0..n-1]$, the $i$th cyclic shift of $S$ is the string $S[i..n-1]S[0..i-1]$, i.e., the $i$th suffix of $S$, with the "rest" of $S$ appended behind it.
Take the cyclic shifts of the source text $S$, sort them alphabetically, and then output the last character of each cyclic shift, in sorted order.
To make BWT encoding more efficient in practice, observe that the $j$th character of the $i$th cyclic shift is $S[(i + j) \bmod n]$, so we can read any cyclic shift directly from $S$ without explicitly constructing all the shifts for the MSD radix sort.
We said to sort the cyclic shifts, but actually, for extracting the encoding, all we need is to describe what the sorting would have been, or put differently, to give the sorting permutation. It turns out that it is better to have the inverse sorting permutation $\pi$ that satisfies the following: if we list the cyclic shifts in sorted order, then the $i$th one is the $\pi(i)$-th cyclic shift of $S$. In consequence, the $i$th character of the encoding is $S[\pi(i) - 1]$ (or $S[n - 1]$ if $\pi(i) = 0$).
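A sketch of BWT encoding along these lines: it sorts the starting indices of the cyclic shifts (for simplicity each shift is materialized inside the sort key; an efficient implementation would use a suffix array instead) and then outputs, for each sorted shift, the character just before its starting position.

```python
def bwt_encode(S):
    """Burrows-Wheeler transform of S.

    S should end with a unique end-of-text symbol such as '$' so the
    cyclic shifts sort unambiguously.
    """
    n = len(S)
    # sort the shifts by comparing them; the shift starting at i is S[i:] + S[:i]
    shifts = sorted(range(n), key=lambda i: S[i:] + S[:i])
    # the last character of the shift starting at i is S[i - 1]
    # (which is S[n - 1] when i == 0, handled by Python's index -1)
    return "".join(S[i - 1] for i in shifts)

print(bwt_encode("banana$"))        # annb$aa
```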
TBC
Text $T_0$: Original Text
↓ Burrows-Wheeler Transform
Text $T_1$
↓ Move-To-Front Transform
Text $T_2$
↓ Modified Run-Length Encoding
Text $T_3$
↓ Huffman Encoding
Text $T_4$: Encoded Text
Encoding can be slow: the BWT is the bottleneck (recall it requires sorting all cyclic shifts, e.g., via a suffix array built with modified MSD radix sorting).
Decoding is fast: both MTF and BWT can be decoded in $O(n)$ time (treating the alphabet size as a constant), without needing anything more complicated than counting sort.
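For completeness, a sketch of the standard inverse BWT (the notes' own pseudocode is marked TBC above). A stable sort of the positions of the encoded text by character plays the role of the counting sort; following the resulting links recovers the source text.

```python
def bwt_decode(L):
    """Invert the BWT.  L[i] is the last character of the i-th sorted
    cyclic shift; '$' marks the end of the original text."""
    n = len(L)
    # Stable sort of positions by character: lf[r] is the position in L that
    # holds the r-th character of the sorted first column.  With a counting
    # sort over the alphabet this step takes O(n + |Sigma|) time.
    lf = sorted(range(n), key=lambda i: L[i])
    row = L.index('$')              # the row of the un-rotated text
    out = []
    for _ in range(n):
        row = lf[row]               # move to the row of the next cyclic shift
        out.append(L[row])          # its first character is the next text char
    return "".join(out)

print(bwt_decode("annb$aa"))        # banana$
```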
Drawback: We need the entire source text at once because BWT needs to sort the cyclic shifts. Thus it is not possible to implement bzip2 such that both input and output are streams, i.e., we cannot use it for large texts that do not fit into memory.