# Dictionaries

## CS240E: Data Structures and Data Management (Enriched)

### David Duan, 2019 Winter

#### Motivation

A dictionary is a collection of key-value-pairs (KVP) with operations

• search(k), or findElement(k)
• insert(k, v), or insertItem(k,v)
• delete(k), or removeElement(k)

We assume that each key can be stored in $O(1)$ space and compared in $O(1)$ time and all keys in a dictionary are distinct.

##### Naive Implementations
• Unsorted list/array: $O(1)$ insert, $O(n)$ search, $O(n)$ delete.
• Sorted list/array: $O(n)$ insert, $O(\log n)$ search, $O(n)$ delete.
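The sorted-array trade-off can be sketched in Python with the standard `bisect` module; the names `search`/`insert` are ours, not from the course:

```python
import bisect

# Sorted-array dictionary sketch: O(log n) search via binary search,
# O(n) insert because later items must shift to keep the array sorted.
keys, vals = [], []

def search(k):
    i = bisect.bisect_left(keys, k)
    return vals[i] if i < len(keys) and keys[i] == k else None

def insert(k, v):
    i = bisect.bisect_left(keys, k)
    if i < len(keys) and keys[i] == k:
        vals[i] = v          # key exists: renew the value
    else:
        keys.insert(i, k)    # O(n) shift
        vals.insert(i, v)
```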

#### Binary Search Trees

A BST is a binary tree (where each node stores a KVP) that satisfies the BST order property: key(T.left) < key(T) < key(T.right) for all node T.

##### Operations

Search

1. If the node is NIL, return (the key is not present).
2. If the current node's key equals the target key, return the current node.
3. If the current node's key is greater than the target key, search the left subtree.
4. Otherwise, search the right subtree.

Insert

1. If the node is NIL, construct a new node with the given KVP.
2. If the node's key equals the target key, renew its value.
3. If the node's key is greater than the target key, insert the KVP into the left subtree.
4. Otherwise, insert the KVP into the right subtree.

Delete

1. If the target node is a leaf, remove it.
2. If the target node only has one child, promote its child to its position.
3. Otherwise, swap the node with its in-order successor (the left-most node in its right subtree) or its in-order predecessor, then delete the node from its new position, where it has at most one child, so case 1 or 2 applies.
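The search and insert procedures above can be sketched as recursive Python (a minimal illustration with hypothetical names, not the course's reference code; deletion is omitted since the course uses lazy deletion):

```python
class Node:
    """A BST node storing a key-value pair."""
    def __init__(self, key, val):
        self.key, self.val = key, val
        self.left = self.right = None

def bst_search(t, k):
    # NIL: key not present
    if t is None:
        return None
    if k == t.key:
        return t
    # smaller keys live in the left subtree, larger in the right
    return bst_search(t.left if k < t.key else t.right, k)

def bst_insert(t, k, v):
    if t is None:
        return Node(k, v)          # NIL: construct a new node
    if k == t.key:
        t.val = v                  # renew the value
    elif k < t.key:
        t.left = bst_insert(t.left, k, v)
    else:
        t.right = bst_insert(t.right, k, v)
    return t
```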

Remark

In this course, deletion is not studied in depth. Instead, we use lazy deletion: replace the node to be deleted with a "dummy" and keep the overall structure unchanged. We also keep track of the number of dummies, $d$, and rebuild the entire data structure once $d > n$. A rebuild can happen only after $\Theta(n)$ deletions, which makes the amortized overhead for rebuilding $O(1)$ per deletion. Since we rebuild with only $n$ valid items, a rebuild can be done in $O(n)$ time. Note that when a node is marked as a dummy, you might still want to preserve its key for comparisons.

##### Analysis

Let $h$ denote the height of a BST. Define $h(\varnothing) = -1$ and $h(x) = 0$ when $x$ is a singleton.

• Worst case: $h = (n-1) \implies h \in \Theta(n)$.
• Best case: $h \approx \log n\implies h\in \Theta(\log n)$.
• Average case: $h \in \Theta(\log n)$, see below.

Average Case Analysis

Given $\{0, \ldots, n-1\}$, we want to build a BST. Suppose the node inserted first has key $i$.

By the BST order property, $\{0, \ldots, i-1\}$ will be in the left subtree and $\{i+1, \ldots, n-1\}$ will be in the right subtree. Thus, the left subtree has size $i$ and the right subtree has size $n - i - 1$.

Let $c_n$ denote the height of a BST with $n$ nodes; if the first key inserted is $i$, then $c_n = 1 + \max\{c_i, c_{n-i-1}\}$. Since $i$ is unknown, we take the expected value of the expression above:

$$\mathbb{E}[c_n] = \frac{1}{n} \sum_{i=0}^{n-1} \Big(1 + \mathbb{E}\big[\max\{c_i, c_{n-i-1}\}\big]\Big)$$

• Summation: $i$ can range from $0$ to $n-1$.
• $1/n$: each value of $i$ has probability $1/n$, as there are $n$ equally likely values.
• $\max\{c_i, c_{n-i-1}\}$: the height is determined by the taller of the two subtrees.

Let $T(n) = \mathbb{E}[c_n]$ for better notation. If $i$ lies between $n/4$ and $3n/4$ (which happens with probability $1/2$), then both subtree sizes are bounded above by $3n/4$, i.e., $T(n)\leq 1 + T(3n/4)$ in this case. Unrolling the recursion gives $T(n) \in O(\log n)$.

This is similar to our quick-select $\Theta(n)$ proof.

Lower Bound for Search

We can also prove the lower bound $\Omega(\log n)$ for comparison-based search using decision trees.

For a correct comparison-based search, there are $n$ possible outputs, which correspond to $n$ leaves in our decision tree. The minimum height of a binary tree with $n$ leaves is $\lceil \log n \rceil$, so the lower bound for comparison-based search is $\Omega(\log n)$. $\square$

##### Random BST

Again, we could use randomization to improve the theoretical performance of BST. A random binary search tree of size $n$ is obtained in the following way: take a random permutation $x_0, \ldots, x_{n-1}$ of $0, \ldots, n-1$, and add its elements, one by one, into a BST.

A random BST can be constructed in $O(n \log n)$ time (by doing $n$ insertions); in a random BST, search takes $O(\log n)$ expected time.

#### Treaps

A treap is a BST that is balanced with high probability, but it is not guaranteed to have height $O(\log n)$.

The idea is to use randomization and binary heap property to maintain balance with high probability. The expected time complexity for search, insert, and delete is $O(\log n)$.

Every node of a Treap maintains two values:

1. Key which follows standard BST ordering (left smaller and right greater)
2. Priority which is a randomly assigned value that follows max-heap ordering property.

In short, a Treap follows both BST ordering property and max-heap ordering property.

##### Operations

Search(k) - Same as an ordinary BST, with complexity $O(h)$.

Insert(k)

1. Randomly pick a priority in $\{0, 1, \ldots, n-1\}$ for $k$.
2. Perform insertion of $(k,p)$ as an ordinary BST.
3. Fix the priority (max-heap order property) with rotations, i.e., locally rearrange the BST while keeping the BST order property.

Rotation A tree rotation leaves the inorder traversal of the tree invariant. For more details, see the rotation section under AVL trees.

Note that, because of rebalancing rotations, Treaps hide insertion orders.
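The insertion procedure above can be sketched in Python. This is a hedged illustration with our own names; it draws priorities as uniform random reals (rather than a permutation of $\{0,\ldots,n-1\}$), which behaves the same in expectation:

```python
import random

class TreapNode:
    def __init__(self, key):
        self.key = key
        self.priority = random.random()  # random priority, max-heap ordered
        self.left = self.right = None

def rotate_right(t):
    l = t.left
    t.left, l.right = l.right, t
    return l

def rotate_left(t):
    r = t.right
    t.right, r.left = r.left, t
    return r

def treap_insert(t, key):
    # 1. ordinary BST insertion ...
    if t is None:
        return TreapNode(key)
    if key < t.key:
        t.left = treap_insert(t.left, key)
        # 2. ... then fix the max-heap property with a rotation
        if t.left.priority > t.priority:
            t = rotate_right(t)
    else:
        t.right = treap_insert(t.right, key)
        if t.right.priority > t.priority:
            t = rotate_left(t)
    return t
```

Inserting keys in sorted order, which would produce a degenerate ordinary BST, still yields an expected height of $O(\log n)$ here because the shape depends only on the random priorities.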

##### Analysis

Treaps don't guarantee a small height, but if the priorities are chosen uniformly at random, then $\mathbb E(h) \in O(\log n)$.

#### AVL Tree

An AVL Tree is a BST with an additional height-balance property: the heights of the left subtree $L$ and the right subtree $R$ differ by at most $1$. In other words, we require $h(R)- h(L) \in \{-1, 0, 1\}$, where $-1$ means the tree is left-heavy, $0$ means the tree is balanced, and $+1$ means the tree is right-heavy. An AVL tree on $n$ nodes has $\Theta(\log n)$ height and can perform search, insert, and delete in $\Theta(\log n)$ time in the worst case.

Every node of an AVL tree maintains two values:

1. Key which follows standard BST ordering (left smaller and right greater).
2. Height which tracks the height of the current subtree (leaf has height $0$).

It can be shown that it suffices to store $h(R) - h(L)$ at each node, which may save some space but the code gets more complicated.

AVL-search takes $\Theta(h)$ just as ordinary BSTs. We cover AVL-insertion extensively.

##### Insertion
###### Procedure
1. Insert as for a BST.
2. Go back up to the root, updating heights and checking balance. Recall that if the height difference becomes $\pm 2$ at a node $z$, we call $z$ unbalanced.
3. If some node $z$ is unbalanced: let $y$ and $x$ be its child and grandchild on the insertion path, and apply the appropriate rotation at $z, y, x$ to bring the middle value of $\{x,y,z\}$ to the root of the subtree.
###### Pseudocode

```
AVL-insert(r, k, v)
r: root of the tree
(k, v): KVP to insert

1.  z <- BST-insert(r, k, v)
2.  z.height <- 0
3.  while (z is not the root)
4.    z <- parent of z
5.    if (|z.left.height - z.right.height| > 1) then
6.      let y be the taller child of z (break ties arbitrarily)
7.      let x be the taller child of y (break ties to prefer left-left or right-right)
8.      z <- restructure(x)  // see rotation
9.      break
10.   setHeightFromSubtrees(z)
```

```
setHeightFromSubtrees(u)
1.  if u is not an empty subtree
2.    u.height <- 1 + max(u.left.height, u.right.height)
```
###### Rotation

There are many different BSTs with the same keys. Our goal is to change the structure among the three nodes $x, y, z$ without breaking the BST order property, so that the subtree becomes balanced.

```
rotate-right(current)
1.  newRoot <- current.left
2.  current.left <- newRoot.right
3.  newRoot.right <- current
4.  setHeightFromSubtrees(current), setHeightFromSubtrees(newRoot)
5.  return newRoot
```

```
rotate-left(current)
1.  newRoot <- current.right
2.  current.right <- newRoot.left
3.  newRoot.left <- current
4.  setHeightFromSubtrees(current), setHeightFromSubtrees(newRoot)
5.  return newRoot
```

```
rotate-left-right(current)
1.  current.left <- rotate-left(current.left)
2.  newRoot <- rotate-right(current)
3.  return newRoot
```

```
rotate-right-left(current)
1.  current.right <- rotate-right(current.right)
2.  newRoot <- rotate-left(current)
3.  return newRoot
```
```
restructure(x)
x: node of a BST that has a grandparent, i.e., the lowest node in the (x, y, z) triple

0. let y and z be the parent and grandparent of x
1. switch on the shape of the triple:

        z
2.     y        =>  return rotate-right(z)        // left-left
      x

        z
3.   y          =>  return rotate-left-right(z)   // left-right
       x

      z
4.       y      =>  return rotate-right-left(z)   // right-left
       x

      z
5.       y      =>  return rotate-left(z)         // right-right
           x
```

As a remark, the middle key of $(x, y, z)$ becomes the new root.
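The rotations and the insertion fix-up can be mirrored in runnable Python. This is a sketch under our own conventions (leaf height $0$, empty subtree height $-1$, as defined earlier), with restructure folded into the recursive insert, and not the course's reference implementation:

```python
class AVLNode:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None
        self.height = 0                  # a leaf has height 0

def h(t):
    return t.height if t else -1         # empty subtree has height -1

def set_height(u):
    u.height = 1 + max(h(u.left), h(u.right))

def rotate_right(t):
    new_root = t.left
    t.left = new_root.right
    new_root.right = t
    set_height(t); set_height(new_root)
    return new_root

def rotate_left(t):
    new_root = t.right
    t.right = new_root.left
    new_root.left = t
    set_height(t); set_height(new_root)
    return new_root

def avl_insert(t, key):
    if t is None:
        return AVLNode(key)
    if key < t.key:
        t.left = avl_insert(t.left, key)
    else:
        t.right = avl_insert(t.right, key)
    set_height(t)
    if h(t.left) - h(t.right) > 1:            # unbalanced, left-heavy
        if h(t.left.left) < h(t.left.right):
            t.left = rotate_left(t.left)      # left-right case
        t = rotate_right(t)                   # left-left (or reduced) case
    elif h(t.right) - h(t.left) > 1:          # unbalanced, right-heavy
        if h(t.right.right) < h(t.right.left):
            t.right = rotate_right(t.right)   # right-left case
        t = rotate_left(t)                    # right-right (or reduced) case
    return t
```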

##### Analysis
###### Runtime

AVL-search takes $\Theta(\log n)$ as usual.

Claim. If an AVL tree is unbalanced at $z$ after an insertion, then this rotation makes the subtree of $z$ balanced and restores its height to its pre-insertion value.

Thus, AVL-insert takes $\Theta(\log n)$ overall:

• Constant time per rotation.
• At most $\Theta(\log n)$ rotations (one per level).
• In fact, one restructure is all we need: by the claim, the subtree height is restored, so no ancestor becomes unbalanced.
###### Drawbacks
1. Not space efficient: extra space to cache heights.
2. Height roughly $1.44\log n$ in the worst case, but we want the tree to be as close to $\log n$ as possible.
3. Almost all insertions require a rotation, which adds up and takes quite some time.

#### Scapegoat($\alpha$)-Tree

A scapegoat tree is a self-balancing BST that provides $O(\log n)$ amortized insertion time and $O(\log n)$ height, with no rotations involved.

Instead of the small incremental rebalancing operations used by most balanced tree algorithms, scapegoat trees rarely but expensively choose a "scapegoat" and completely rebuild the subtree rooted at the scapegoat into a complete binary tree. As a result, scapegoat trees have $O(n)$ worst-case update performance.

##### Weight $\alpha$
###### $\alpha$-Weight-Balanced

Recall that a BST is said to be weight-balanced if half of the nodes are to the left of the root and half to the right. An $\alpha$-weight-balanced node $v$ is defined as meeting a relaxed weight-balance criterion:

$$n_{\text{left}(v)} \leq \alpha \cdot n_v \quad\text{and}\quad n_{\text{right}(v)} \leq \alpha \cdot n_v.$$

A BST that is $\alpha$-weight-balanced must also be $\alpha$-height-balanced, that is,

$$h \leq \log_{1/\alpha} n.$$

By contraposition, a tree that is not $\alpha$-height-balanced is not $\alpha$-weight-balanced.

Scapegoat trees are not guaranteed to keep $\alpha$-weight-balance at all times, but are always loosely $\alpha$-height-balanced in that

$$h \leq \log_{1/\alpha} n + 1.$$
###### Choosing and Working with $\alpha$
• We choose a constant $\alpha$ with $1/2 < \alpha < 1$ and keep it fixed.
• As $\alpha \to 1/2$, $h \to \log n$.
• To maintain the height-balance property during insertion, every node $v$ stores $n_v$, the number of nodes in the subtree rooted at $v$.
##### Insertion
###### Procedures
1. Insert as for a BST.
2. On each node along the insertion path, update $n_v$ and deposit a "token". Note that the token is for analysis purposes only and is not needed in practice.
3. If height-balance property is violated, i.e., $h > \log_{1 /\alpha} n$, we completely rebuild some subtree.
###### Claim

Claim. If the insertion path has height $h > \log_{1/\alpha} n$, then there exists some node on that path satisfying $n_v > \alpha \cdot n_{\text{parent(v)}}$.

Proof. We show the contrapositive. Assume $n_v \leq \alpha \cdot n_{p(v)}$ for all $v$ on the path. Then

• $n_\text{root} = n$
• $n_\text{child(root)} \leq \alpha \cdot n$
• $n_\text{grandchild(root)} \leq \alpha^2 \cdot n$
• $\cdots$
• $n_\text{leaf} \leq \alpha^h \cdot n$

Thus, $1 = n_\text{leaf} \leq \alpha^h \cdot n \implies h \leq \log_{1/\alpha} n$. $\blacksquare$

###### Rebuild

Let $v$ be the highest node with $n_v > \alpha \cdot n_{\text{parent}(v)}$. Rebuild the tree at $p:=\text{parent}(v)$. Afterwards, $n_z \leq \alpha \cdot n_{p(z)}$ holds for all nodes $z$ on the path. After a complete rebuild, remove the tokens at $p$ and all its descendants.
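The rebuild step can be sketched in Python (hypothetical helper names): flatten the scapegoat's subtree with an inorder traversal, then rebuild it as a perfectly balanced BST, which takes time linear in the subtree size:

```python
class Node:
    def __init__(self, key):
        self.key = key
        self.left = self.right = None

def flatten(t, out):
    # inorder traversal -> nodes in sorted key order
    if t is not None:
        flatten(t.left, out)
        out.append(t)
        flatten(t.right, out)

def rebuild(t):
    nodes = []
    flatten(t, nodes)
    def build(lo, hi):
        # middle node becomes the root: both halves differ in size by <= 1
        if lo > hi:
            return None
        mid = (lo + hi) // 2
        root = nodes[mid]
        root.left = build(lo, mid - 1)
        root.right = build(mid + 1, hi)
        return root
    return build(0, len(nodes) - 1)
```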

##### Analysis

We show the amortized time for insertion is $O(\log n)$ using a potential function.

Choose the constant $c$ large enough (i.e., the largest constant among all the $O$-expressions involved) that

• a rebuild at $p$ takes time $\leq c \cdot n_p$,
• the height is $\leq c \cdot \log n$,
• an insert without rebuild takes time $\leq c \log n$.

Let $k = c/(2\alpha-1)$ and define $\Phi(t) = k \cdot (\text{number of tokens at time } t)$. Then

1. Amortized time of insertion without rebuild: the actual time is $\leq c \log n$, and we deposit at most one token per node on the insertion path, so

$$\underbrace{c \log n}_{\text{actual}} + \underbrace{k \cdot c \log n}_{\Delta\Phi} \in O(\log n).$$

2. Amortized time of insertion with rebuild: it can be shown that the number of tokens at $p$ is at least $(2\alpha - 1)\cdot n_p$ when a rebuild is triggered, so removing them releases potential $k(2\alpha-1) n_p = c \cdot n_p$, which pays for the rebuild:

$$\underbrace{c \log n + c \cdot n_p}_{\text{actual}} + \underbrace{k \cdot c \log n - c \cdot n_p}_{\Delta\Phi} \in O(\log n). \qquad\blacksquare$$

#### Skip Lists

A skip list allows $O(\log n)$ search as well as $O(\log n)$ insertion within an ordered sequence of $n$ elements. Thus it combines the best of a sorted array (fast search) with a linked-list-like structure that allows fast insertion, which is not possible with an array.

Fast search is made possible by maintaining a linked hierarchy of subsequences, with each successive subsequence skipping over fewer elements than the previous one. Searching starts in the sparsest subsequence until two consecutive elements have been found, one smaller and one larger than or equal to the element searched for. Via the linked hierarchy, these two elements link to elements of the next sparsest subsequence, where searching is continued until finally we are searching in the full sequence. The elements that are skipped are usually chosen probabilistically.

##### Operations
###### Search

Remark: This is not really a search, but a find-me-the-path-of-the-predecessors.

```
skip-search(L, k)
L: skip list; k: key to be searched
1.  p <- topmost left node of L
2.  P <- stack of nodes, initially containing p
3.  while below(p) != null do
4.    p <- below(p)
5.    while key(after(p)) < k do
6.      p <- after(p)
7.    push p onto P
8.  return P
```

Keep going right (lines 5–6) until the next key on the current level is no smaller than $k$; then go down a level and repeat.

1. Not only do we return where we stopped, we return the entire search path (the stack $P$), i.e., the predecessors of $k$ at all levels.
2. The stack returned does not contain $k$; to extract $k$ (assume it exists in $L$), we need after(top(P)).
###### Insert

Repeatedly toss a fair coin until you get tails. Let $i$ be the number of times the coin came up heads; this will be the height of the tower of $k$.

Increase height of skip list, if needed, to have $h > i$ levels.

Search for $k$ with skip-search(S, k) to get stack $P$. The top $i+1$ items of $P$ are the predecessors $p_0, p_1, \ldots, p_i$ of where $k$ should be in each list $S_0, S_1, \ldots, S_i$.

Insert $(k, v)$ after $p_0$ in $S_0$, and $k$ after $p_j$ in $S_j$ for $1 \leq j \leq i$.
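The coin-tossing step can be sketched in Python (a hypothetical helper, not a full skip-list implementation):

```python
import random

def tower_height():
    # Toss a fair coin until tails; the number of heads is the
    # height i of the new key's tower.
    i = 0
    while random.random() < 0.5:   # heads with probability 1/2
        i += 1
    return i
```

This geometric distribution is what drives the analysis below: each extra level is half as likely, so towers are short in expectation.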

##### Complexity Analysis
###### Expected height of a skip list

Let $X_k$ be a random variable denoting the height of tower of key $k$.

Recall that the height of the tower of key $k$ equals $i$, the number of coin tosses that came up heads before the first tail; thus

$$\Pr(X_k = i) = \left(\tfrac{1}{2}\right)^{i+1}.$$

Expected value of $X_k$:

$$\mathbb{E}[X_k] = \sum_{i \geq 0} i \cdot \left(\tfrac{1}{2}\right)^{i+1} = 1.$$

Expected value of the height of the skip list:

$$\mathbb{E}[h] = \mathbb{E}\Big[\max_k X_k\Big].$$

The result gets messy. Instead, we'll prove something related to this:

$$\Pr(h \geq i) = \Pr(\exists k : X_k \geq i) \leq n \cdot \Pr(X_k \geq i) = n \cdot \left(\tfrac{1}{2}\right)^{i}.$$

Plug in $i = 3 \log n$ (a convenient choice), and we see that

$$\Pr(h \geq 3 \log n) \leq n \cdot 2^{-3 \log n} = \frac{n}{n^3} = \frac{1}{n^2}.$$

Hence, the height of a skip list is $\leq 3 \log n$ with high probability. This roughly means that the expected height of the skip list is $O(\log n)$. $\blacksquare$

###### Expected space of a skip list

The space of a skip list is proportional to $n + \sum_k X_k$, the size plus all the tower heights. Hence the expected space is proportional to

$$\mathbb{E}\Big[n + \sum_k X_k\Big] = n + \sum_k \mathbb{E}[X_k] = n + n \cdot 1 \in O(n).$$

###### Expected time for operations

The number of drop-downs $\leq \text{expected height} \in O(\log n)$.

The number of forward-steps gets messier, so we use a different approach.

Consider the search path $\pi$ and its reverse $\pi^{-1}$. We want to find $|\pi|$: I stopped here, how long does it take for me to go back to the beginning?

Let $h$ be the top layer, which equals the height of the skip list. Define $C(j)$ to be the expected number of steps to get to some node on $\pi$ on layer $h-j$. (Looking at a node at some height, we want to know the length of $\pi^{-1}$ from this node back to the beginning.)

Note that probability is involved in computing $C(j)$. For example, in the fragment below, reaching the left bottom node along $\pi^{-1}$ takes 1 step (undoing a drop-down from the node above it), but reaching the right one takes 2 steps (undoing a drop-down and then a forward step):

```
[x]
 |
[x] -- [y]
```

Here is the recursive formula:

• $C(0) = 0$: the top layer only has two sentinels; we start at the $-\infty$ and never get to $\infty$.

• Recursion:

$$C(j) = 1 + \tfrac{1}{2}\,C(j-1) + \tfrac{1}{2}\,C(j) \implies C(j) = 2 + C(j-1) = 2j.$$

• The probability that the tower ends at height $h-j$ depends only on a coin toss coming up tails, which has probability $1/2$ for a fair coin; similarly, the other case also has probability $1/2$.
• The $1$ counts the step itself (going up or going left along $\pi^{-1}$).
• Key intuition: we can approach a node from the left only if there is nothing above that node (compare with the conditions for dropping down and moving forward!).
• Thus, time for search is $O(C(h)) = O(2h) = O(2 \cdot 3 \log n) = O(\log n)$. $\blacksquare$

Conclusion: the expected time for search and insert (and delete) inside skip list is $O(\log n)$. $\square$

###### Conclusion

Why do we use skip lists?

• Skip lists perform better for range search.
• Although not much space is saved, skip lists are easier to work with.
• Treaps need priorities to be well randomized, but here the only randomness we rely on is the coin flip.

#### Splay Trees

We now present possibly the best implementation of dictionaries. In practice, requests for certain keys are much more frequent than for others, so we want to tailor the dictionary to complete those requests faster.

##### Array-Based Dictionaries

We study two scenarios:

1. We know access-probabilities of all items (unrealistic, mostly for motivation).
2. We don't (more realistic).
###### First Case

To store items in an array and make accesses efficient, we would want to put the items with high probability of access at the beginning and items with low probability of access at the end. In other words, we "sort" the array with access probability in descending order.

xxxxxxxxxx51                                        item with lowest probability2                                                      |3                        [             array             ]4                          |5             item with highest probability

We define the following terms:

• Access: just search, not inserting.

• Access-probability: the number of accesses to item $x$ divided by the total number of accesses.

• Access-cost: cost to retrieve item $x$ during search, which, in a linear search (starting left), equals $O(\text{idx}(x) + 1)$.

• Thus, the expected access cost of a given ordering is $\sum_x p_x \cdot (\text{idx}(x) + 1)$, where $p_x$ is the access probability of $x$.
• Our goal is simple: we want the expected access cost to be as low as possible.

Claim Among all possible static orderings, the one that sorts items by non-increasing access probability minimizes the expected access cost (easy to prove with an exchange argument).
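The claim can be checked by brute force on a small instance; `expected_access_cost` is our own name for the formula above:

```python
from itertools import permutations

def expected_access_cost(probs):
    # probs[i] is the access probability of the item at index i;
    # linear search from the left costs idx(x) + 1 comparisons.
    return sum(p * (i + 1) for i, p in enumerate(probs))
```

For instance, with probabilities $0.5, 0.3, 0.2$, the descending order costs $0.5\cdot1 + 0.3\cdot2 + 0.2\cdot3 = 1.7$, and no permutation does better.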

###### Second Case

If we do not know the access probabilities at design time, we need an adaptive data structure, which updates itself upon being used.

MTF Array Upon a successful search, move the searched item to the front. This has $\Theta(n)$ worst case (e.g., when we keep searching for the last item).

Transpose Array Instead of moving the item all the way to the front, we swap it with its predecessor. This still has $\Theta(n)$ worst case.
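The MTF heuristic can be sketched in a few lines of Python (`mtf_search` is our own name; it raises `ValueError` on a miss, which a fuller implementation would handle):

```python
def mtf_search(arr, k):
    # Linear search; on a hit, move the found item to the front so
    # frequently accessed items drift toward the start of the list.
    i = arr.index(k)           # O(n) scan; ValueError if k is absent
    arr.insert(0, arr.pop(i))  # move to front
    return 0                   # the item now lives at index 0
```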

MTF does not translate well to BSTs, as repeatedly rotating leaf elements to the root makes the BST degrade into a linked list. We use more careful tree rotations that do not unbalance the tree.

A splay tree is a self-optimizing BST: after each search, the searched item is brought up toward the root, two levels at a time, using rotations as in AVL trees.

###### Operations

Search/Insert: as in BST, then use rotations as in AVLs.

###### Amortized Analysis

We want to show that a search/insertion has amortized cost $O(\log n)$.

Define the potential function $\phi(t) = \sum_v \log(n_v)$, where $v$ is a node in the tree and $n_v$ is the size of the subtree with root at $v$. Define the "rank" of $v$, $r(v) := \log(n_v)$, i.e., $\phi(t) = \sum_v r(v)$.

Actual Time (Overall)

The actual time of search/insert (in term of time units) is one plus the number of rotations.

Amortized Time for One Rotation

We show that the amortized time for one double rotation is at most $3r'(x) - 3r(x)$ (plus $1$ for a single rotation, which we can ignore), where $x$ is the node we inserted or searched for, and $r(x)$, $r'(x)$ are the ranks ($\log n_x$) of the subtree at $x$ before and after the rotation.

The actual time for one rotation is easy: two for zig-zig/zig-zag and one for a single rotation.

For the potential difference, let $p$ be the parent of $x$ and $g$ the grandparent of $x$. Only the subtrees of the three nodes involved in the rotation change, so

$$\Delta\Phi = r'(x) + r'(p) + r'(g) - r(x) - r(p) - r(g).$$

Before the rotation, $r(g)$ measures the entire (sub)tree, whereas after the rotation, $r'(x)$ measures the same thing. Thus, $r(g)$ and $r'(x)$ cancel each other:

$$\Delta\Phi = r'(p) + r'(g) - r(x) - r(p).$$

We can also bound $r(p)$ from below: $r(p) \geq r(x)$ before the rotation, so

$$\Delta\Phi \leq r'(p) + r'(g) - 2r(x).$$

Detour: recall $n'_v$ denotes the size of the subtree rooted at $v$ after the rotation, so $n'_g + n'_p \leq n'_x$, as the RHS measures the size of the entire (sub)tree. By concavity of the logarithm,

$$\frac{r'(g) + r'(p)}{2} \leq \log\frac{n'_g + n'_p}{2} \leq \log\frac{n'_x}{2} = r'(x) - 1,$$

i.e., $r'(p) + r'(g) \leq 2r'(x) - 2$. Thus,

$$\Delta\Phi \leq 2r'(x) - 2r(x) - 2.$$

Back to the main argument: the amortized time of one double rotation is at most

$$2 + \Delta\Phi \leq 2r'(x) - 2r(x) \leq 3r'(x) - 3r(x),$$

since $r'(x) \geq r(x)$. Summing over all rotations along the search path, the ranks telescope, giving amortized time at most $3r(\text{root}) - 3r(x) + 1 \in O(\log n)$. $\blacksquare$