I'm studying the worst-case runtime of quicksort under the condition that it never does a very unbalanced partition, for varying definitions of "very".
To do this, I ask what the runtime $T(n, p)$ would be if quicksort always happened to partition at some fraction $0 < p \leq {1\over 2}$, so that $\lfloor{p(n-1)}\rfloor$ elements end up in the left partition and $\lceil(1 - p)(n - 1)\rceil$ in the right partition (leaving $1$ element, the pivot, in the middle).
It shouldn't be hard to see that $T(n, p)$ gives an upper bound on the worst case when $p$ is the most unbalanced partition allowed: any partition with fraction $> p$ is more balanced and has a smaller runtime, and any fraction $< p$ is not allowed.
It's obvious that $T(n, {1 \over 2})$ is the best case and $T(n, 0)$ is the worst case of quicksort. Both have easy recurrence relations that are found in any educational resource. But I have no clue how to study $T(n, p)$ in general. The obvious relation would be:
$$T(n, p) = n + T(\lfloor{p(n-1)}\rfloor, p) + T(\lceil(1 - p)(n - 1)\rceil, p)$$
Here I get stuck. I've tried searching around, but all the literature I could understand about divide-and-conquer algorithms took the "divide" literally and "cheated" the analysis by using the fact that the partitions are always equal in size, merging the two recursive terms into a constant times a single term.
I do not know how to deal with two recursive calls, and I do not know if it is safe to remove the rounding. Is this possible to solve analytically, and if yes, how?
P.S.: I'm not interested in asymptotics (it is easy to show that $T(n, p) = \Theta(n \log n)$ for any constant $p$). I am interested in how much slower quicksort becomes as $p$ gets smaller; for example, I'm interested in the ratio ${T(n, 0.25) \over T(n, 0.5)}$.
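For concreteness, here is a minimal Python sketch (illustrative only, not part of the original question; the base case $T(m, p) = 0$ for $m \leq 1$ is an assumption the recurrence above leaves unspecified) that evaluates the recurrence exactly by memoized recursion and prints this ratio:

    import math
    from functools import lru_cache

    def T(n, p):
        @lru_cache(maxsize=None)
        def go(m):
            if m <= 1:
                return 0                          # assumed base case
            left = math.floor(p * (m - 1))        # size of the left partition
            right = math.ceil((1 - p) * (m - 1))  # size of the right partition
            return m + go(left) + go(right)
        return go(n)

    for n in (10**3, 10**4, 10**5, 10**6):
        print(n, T(n, 0.25) / T(n, 0.5))

(Running it for a few values of $n$ gives a feel for how the ratio behaves, though of course not a closed form.)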
P.P.S.: As an undergraduate student I apologize if I've made obvious things too lengthy or underexplained nontrivialities. And while I don't know whether it's looked down upon here as much as on other SE sites, I'll note that this is out of personal interest, not homework.
Asked By : orlp
Answered By : Yuval Filmus
As you mention, the Akra–Bazzi theorem shows that the solution to the recurrence $T(n,p)$ is $O(n\log n)$ for all $p \in (0,1)$. However, this does not reveal the nature of the dependence on $p$. To determine the latter, we can use a recursion tree approach.
At the root of the recursion tree is the interval $\{1,\ldots,n\}$. Its two children are the intervals $\{1,\ldots,pn\}$ and $\{pn+1,\ldots,n\}$, whose total length is again $n$. Each of these nodes has two children (assuming $n$ is large enough), and so on. For simplicity we ignore rounding errors, that is, we assume that $pn$ is an integer; this is just a technicality, and I wouldn't worry about it. We stop the process whenever a node has length at most $1$. The complexity of the algorithm is proportional to the total length of intervals in the tree. When $p \neq 1/2$, the leaves (nodes at which we stop the process) have different depths, and that makes it more difficult to determine the overall complexity.
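To make this concrete, here is a small sketch (my own illustration of the idealized process just described, not part of the original answer; rounding is ignored exactly as above) that sums the interval lengths in this tree:

    def total_interval_length(n, p):
        # Idealized recursion tree: an interval of length L splits into
        # children of lengths p*L and (1-p)*L, and the process stops at
        # length <= 1.  The running time is proportional to the total
        # length of all intervals in the tree.
        total = 0.0
        stack = [float(n)]
        while stack:
            length = stack.pop()
            total += length
            if length > 1:
                stack.append(p * length)
                stack.append((1 - p) * length)
        return total

    # e.g. total_interval_length(10**5, 0.5) is about 1.8e6,
    # i.e. roughly n * log2(n), as expected for the balanced case.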
We can obtain a simple upper bound by noting that the tree has at most $\log_{1-p} (1/n)$ levels: the length of each node is at most $1-p$ times that of its parent. Just as in the analysis for $p = 1/2$, the total length of intervals at any level is at most $n$, and we obtain an upper bound of $O(n\log_{1-p} (1/n))$ on the running time. Since $\log_{1-p} (1/n) = \log n/\log (1-p)^{-1}$ and $\log (1-p)^{-1} = -\log (1-p) = p \pm O(p^2)$ for small $p$, we can write this as $O(n\log n/p)$.
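For concreteness (a numerical illustration of this bound, not in the original answer): for $p = 1/4$ the length of each node is at most $3/4$ that of its parent, so the tree has at most $$\log_{3/4}(1/n) \;=\; \log_{4/3} n \;=\; \frac{\ln n}{\ln\frac{4}{3}} \;\approx\; 2.41\,\log_2 n$$ levels, giving an upper bound of roughly $2.41\, n\log_2 n$.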
Here is a more accurate calculation. Consider level $t$. Suppose we don't stop the process upon reaching a small interval. We can generate a random vertex by taking $t$ steps, in each of which we go left (say) with probability $p$ and right (say) with probability $1-p$. Each time we take a left step the log of the length of the interval decreases by $-\log p$, and each time we take a right step it decreases by $-\log (1-p)$. A vertex is in the actual tree if the log of the length has decreased by at most $\log n$. The total weight of intervals on level $t$ of the tree (their total length divided by $n$) is exactly the probability that a vertex generated according to this process corresponds to a decrease of at most $\log n$. That is, if $D$ is the distribution which is equal to $-\log p$ with probability $p$ and to $-\log(1-p)$ with probability $1-p$, and $X_1,\ldots,X_t \sim D$ are independent, then the total weight of level $t$ is $\Pr[X_1+\cdots+X_t \leq \log n]$.

For super-constant $t$, the random variable $X_1+\cdots+X_t$ is roughly normally distributed with mean $[-p\log p-(1-p)\log(1-p)]t$ and variance linear in $t$, so for $t$ satisfying $[-p\log p-(1-p)\log(1-p)]t \leq (\log n)/2$, say, the probability will be very close to $1$, while for $t$ satisfying $[-p\log p-(1-p)\log(1-p)]t \geq 2\log n$, say, it will be very close to zero. Defining $h(p) = -p\log p-(1-p)\log(1-p)$ (known as the binary entropy function), we conclude that the running time is $\Theta(n\log n/h(p))$ (uniformly in $p$, as $n\to\infty$). As $p\to 0$ we have $h(p) \approx -p\log p$, and so our earlier estimate of $O(n\log n/p)$ was not tight.
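As a sanity check (again an illustration under the setup above, not from the original answer), one can estimate the level weights $\Pr[X_1+\cdots+X_t \leq \log n]$ by Monte Carlo and watch them drop from near $1$ to near $0$ around $t \approx \log n/h(p)$:

    import math, random

    def level_weight(n, p, t, samples=20000):
        # Estimate Pr[X_1 + ... + X_t <= log n], where each X_i equals
        # -log(p) with probability p and -log(1-p) with probability 1-p.
        # This is the (normalized) total length of intervals at level t.
        logn = math.log(n)
        step_left, step_right = -math.log(p), -math.log(1 - p)
        hits = 0
        for _ in range(samples):
            s = 0.0
            for _ in range(t):
                s += step_left if random.random() < p else step_right
            if s <= logn:
                hits += 1
        return hits / samples

    # With h(p) = -p*log(p) - (1-p)*log(1-p) (natural logs, to match log n),
    # the weight is close to 1 well below t = log(n)/h(p) and close to 0 well
    # above it; e.g. with n = 10**6 and p = 0.3, compare t = 12 (weight near 1)
    # with t = 45 (weight near 0).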
Another way of looking at the same analysis is to take an infinite sequence of independent random variables $X_1,X_2,\ldots$ as before, and define a stopping time $T$ to be the first time $t$ such that $X_1 + \cdots + X_t \geq \log n$. The running time is then proportional to $n\mathbb{E}[T]$. The elementary renewal theorem states that $\lim_{n\to\infty} \mathbb{E}[T]/\log n = 1/\mathbb{E}[D] = 1/h(p)$, implying that the total size of intervals equals $(1+o(1))n\log n/h(p)$. More accurately, for every constant $p$ the total size of intervals is $(1+\alpha_p(n))n\log n/h(p)$, where $\alpha_p(n) = o(1)$. The convergence in the elementary renewal theorem is exponential in the time parameter ($\log n$ in our case), so it should be polynomial in $n$, that is, $\alpha_p(n) = O(n^{-C_p})$ for some $C_p > 0$. The convergence is also probably uniform for $p \in (\delta,1-\delta)$ for any $\delta > 0$.
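The renewal-theorem prediction can be checked numerically in the same spirit (an illustrative sketch, not from the original answer): simulate the stopping time $T$ and compare its mean to $\log n/h(p)$.

    import math, random

    def mean_stopping_time(n, p, samples=20000):
        # T is the first t with X_1 + ... + X_t >= log n; its mean should be
        # close to log(n) / h(p) by the elementary renewal theorem.
        logn = math.log(n)
        step_left, step_right = -math.log(p), -math.log(1 - p)
        total = 0
        for _ in range(samples):
            s, t = 0.0, 0
            while s < logn:
                s += step_left if random.random() < p else step_right
                t += 1
            total += t
        return total / samples

    # e.g. with n = 10**6 and p = 0.3:
    #   h = -0.3 * math.log(0.3) - 0.7 * math.log(0.7)   # about 0.61
    #   mean_stopping_time(10**6, 0.3)   # about math.log(10**6) / h, i.e. ~23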
Summarizing, the total length of intervals in the recursion tree, which is proportional to the running time, is of the following form for every $p$: $$ T(n,p) = (1+o(1)) \frac{n\log n}{h(p)}, $$ where $\log n$ and $h(p) = -p\log p-(1-p)\log(1-p)$ are taken to the same base, and the $o(1)$ is a function depending on $p$ and tending to $0$ with $n$.
Moreover, it is probably true that for any $\delta > 0$ and all $p \in (\delta,1-\delta)$, the total length of intervals is of the form $$T(n,p) = (1+O(n^{-C_\delta})) \frac{n\log n}{h(p)}, $$ where $C_\delta > 0$ and the hidden big O constant depend only on $\delta$. In particular, it should be the case that for all constant $p_1,p_2$, $$\lim_{n\to\infty} \frac{T(n,p_1)}{T(n,p_2)} = \frac{h(p_2)}{h(p_1)}, $$ and the convergence is polynomially fast.
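For the concrete ratio asked about in the question, this limit is easy to evaluate (a quick numerical check of my own, not in the original answer):

    import math

    def h(p):
        # Entropy; the base cancels in the ratio below.
        return -p * math.log(p) - (1 - p) * math.log(1 - p)

    print(h(0.5) / h(0.25))   # about 1.23: with 1/4 - 3/4 splits, quicksort is
                              # only about 23% slower than with perfect splits,
                              # in the limit of large n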
Question Source : http://cs.stackexchange.com/questions/31928