World's most popular travel blog for travel bloggers.

[Solved]: Facts about internal and external path lengths of binary tree

, , No Comments
Problem Detail: 

While learning binary tree's properties, I came across internal path length and external path length, number of comparisons required for successful and unsuccessful search.

My book specifies some facts about these as follows:

  1. $S_n=1+\frac{U_0+U_1+...+U_{n-1}}{n}$

  2. For large $n$, $S_n=2 \log_en$

  3. Internal path length $I_n$ of binary tree with $n$ internal nodes is at least $n \times \lg(n/4)$ and at most $\frac{n(n-1)}{2}$

Where

  • $n \rightarrow$ number of (internal) nodes in binary search tree
  • $S_n \rightarrow$ average number of comparisons required for successful search in a random binary search tree with $n$ internal nodes
  • $U_n \rightarrow$ average number of comparisons required for unsuccessful search in a random binary search tree with $n$ internal nodes
  • $I_n \rightarrow$ Internal path length of binary tree with $n$ internal nodes
  • $E_n \rightarrow$ External path length of binary tree with $n$ internal nodes (not used above)

Out of these I can complete reason out only the maximum internal path length statded in point 3 as follows: It occurs in degenerate binary tree (point 3.6 on this page). In this case, internal path length is $0+1+2+...+(n-1)=\frac{n(n-1)}{2}$.

About the minimum internal path length, the book says following:

The minimum internal path length occurs in case of the best case binary tree i.e. almost complete binary tree (maximum possible number of nodes having both children). Such tree has $(n+1)$ external nodes at height no more than $\lg {n}$. Multiplying these and applying the below property: $$E_n=I_n+2n$$ we get the bound: $$(n+1)\lg n-2n<n \lg (N/4)$$

Q1. However I am still not clear how the above bound is obtained

Q2. Also I did not understood how first two facts out of above 3 are derived and especially what the first fact is meant to imply.

Asked By : Mahesha999

Answered By : Yuval Filmus

I'm assuming the following model for a binary search tree of size $n$. Consider the following infinite process. At step 0, we have the empty tree. At step 1, we have the tree having some arbitrary value at the root. At step $m>1$, if the current (ordered) contents of the tree at the previous step are $x_1<\ldots<x_{m-1}$, we choose an interval uniformly at random among the $m$ intervals $$(-\infty,x_1),(x_1,x_2),\ldots,(x_{m-2},x_{m-1}),(x_{m-1},\infty),$$ choose an arbitrary element from the interval, and insert it to the binary tree. A random binary search tree of size $n$ is obtained by stopping this process at step $n$, and erasing the values at the vertices.

You can show that the arbitrary choices do not affect the distribution. Also, the relative order of the elements of the tree at step $n$ (in order of insertion) is a random permutation on $n$ elements.

You haven't explained what you mean by $S_n$ and $U_n$ exactly, so I assume you are using the following definitions:

  • Let $T_n$ be a random binary search tree of size $n$ before erasing the values. Choose one of the $n$ elements at random, and run binary search on $T_n$ and this element. The average number of comparisons in this process is $S_n$.

  • Let $T_n$ be a random binary search tree of size $n$ before erasing the values, and let $x$ be the random element inserted by the process at the following step. Run binary search on $T_n$ and $x$. The average number of comparisons in this process is $U_n$.

First fact

Given these definitions, the first fact is easy to prove. Suppose that the element we search for was inserted at step $i$. Note that the distribution of $i$ is uniform over $\{1,\ldots,n\}$. Inserting $i$ at step $i$ requires $U_i$ comparisons in expectation. It takes us the same number of steps to reach this element at the current experiment, and one more comparison to verify that we have indeed found the element. The formula follows.

Second fact

Here we need a non-trivial argument, outlined in the Wikipedia article. Let the sorted elements in the tree be $x_1<\cdots<x_n$. Suppose that the element we look for is $x_i$. The search goes through an element $x_j$ if $x_j$ is an ancestor of $x_i$. Thus, given $i$, the average number of comparisons is $$\sum_{j=1}^n \Pr[\text{$x_j$ is an ancestor of $x_i$}].$$ Now $x_j$ is an ancestor of $x_i$ if among all elements in $\{x_i,\ldots,x_j\}$ (or $\{x_j,\ldots,x_i\}$), $x_j$ was the first inserted. Since the true order of insertion is a random permutation of $x_1,\ldots,x_n$, we conclude that $$\Pr[\text{$x_j$ is an ancestor of $x_i$}] = \frac{1}{|i-j|+1}.$$ Choosing a random $i$, this shows that $$ \begin{align*} S_n &= \frac{1}{n} \sum_{i=1}^n \sum_{j=1}^n \frac{1}{|i-j|+1} \\ &= \frac{1}{n} \sum_{i=1}^n \frac{1}{|i-i|+1} + \frac{2}{n}\sum_{i=1}^n\sum_{j=1}^{i-1} \frac{1}{i-j+1} \\ &= 1 + \frac{2}{n}\sum_{i=1}^n \sum_{j=2}^i \frac{1}{j} \\ &= 1 + \frac{2}{n}\sum_{i=1}^n (H_i - 1) \\ &= \frac{2}{n} \sum_{i=1}^n H_i - 1 \\ &= \frac{2(n+1)}{n} (H_{n+1} - 1) - 1. \end{align*} $$ Here $H_i$ is the $i$th harmonic number, and we used the identity $\sum_{i=1}^n H_i = (n+1)(H_{n+1}-1)$. It is well-known that $H_i = \ln i + \gamma + O(1/i)$ and that $\ln (n+1) = \ln n + O(1/n)$, and so $$ \begin{align*} S_n &= 2\left(1+\frac{1}{n}\right) (\ln n + \gamma - 1 + O(1/n)) - 1 \\ &= 2\ln n + 2\gamma - 3 + \frac{2\ln n}{n} + O\left(\frac{1}{n}\right). \end{align*} $$

Third fact

The upper bound is given by a path, in which case $I_n = 0 + 1 + \cdots + (n-1) = n(n-1)/2$. For the lower bound, it is easier to consider first the special case $n = 2^m-1$. In that case, the lower bound is given by a balanced binary tree. Such a tree has $2^k$ nodes at depth $k$, and so $$ I_n = \sum_{d=0}^{m-1} 2^k k = 2^m (m-2) + 2 = (n+1)\log_2 \tfrac{n+1}{4} + 2. $$ More generally, if $n = 2^m-1+x$ for $x < 2^m$, it is best to put the $x$ extra nodes at depth $m$, and so $$ \begin{align*} I_n &= 2^m(m-2) + 2 + mx \\ &= (m-2)(2^m+x) + 2(x+1) \\ &= (m-2)n + m-2 + 2(x+1). \end{align*} $$ Notice that $m \leq \log_2 (n+1) < m+1$, that is, $m = \lfloor \log_2 (n+1) \rfloor$. Therefore $$ I_n = n\lfloor \log_2 (n+1) \rfloor + O(n) = n \log_2 n + O(n). $$ To get an actual lower bound, notice that $$ I_n \geq (m-2)n + m > n \log_2 \frac{n+1}{4} \geq n\log_2 \frac{n}{4}. $$

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/56428

3.2K people like this

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback