World's most popular travel blog for travel bloggers.

# [Answers] Merge-by-weight to solve reachability problems in trees and DAGs

, ,
Problem Detail:

Let $T = (V, E)$ be a tree with a designated root $r \in V$. The fact that the tree is rooted allows us to speak of "subtrees of $T$ rooted at some node $x \in V$". Let's say we have a (not necessarily injective) function $c: V \rightarrow \mathbb{N}$ that assigns a color to every node.

We want to find for every node $x \in V$ the number of distinct colors in the subtree rooted at $x$. So if $A(x)$ is the set of children of $x$, we want to find for every node $x$, the cardinality of the set

$S(x) = \bigcup_{y\in A(x)} S(y) \cup \{c(x)\}$

There is a rather simple algorithm that solves this problem offline in $O(n \log^2 n)$:

• Process the nodes in depth-first order
• Represent the sets $S(x)$ as a persistent, self-balancing binary search trees
• For every node $x$, let $y \in A(x)$ be the child of $x$ that maximizes $|S(y)|$. Now we just merge for every other child $z \in A(x) \setminus {y}$ the set $S(z)$ into the set $S(y)$ via insertion. We also insert $c(x)$ and now have $S(x)$ as a binary search tree and can give the answer for node $x$

We can even avoid the persistent trees by mutating the subresult for the heaviest child instead of creating a new set.

The typical argument I have seen to justify the bound is the following:

A "move" consist of taking an element from a set and inserting it in another set. Let's prove that there's at most $O(n \log n)$ moves, each move can be done in $O(\log n)$, the total complexity is $O(n \log^2 n)$.

When you are merging two sets you move all elements from smaller set (assume its size is $k$) to the bigger one, so every element in the smaller set now is in a set with at least size $2k$. In other words every element that has been moved $t$ times is in a set with at least size $2^t$, thus every element will be moved at most $O(\log n)$ and we have $n$ elements so in total there will be at most $O(n \log n)$ move operations.

(Quote from a Codeforces comment).

While this is a beautiful argument, it's not quite correct, because usually we remove duplicates, so sets don't in fact grow exponentially (for example if every node has the same color).

I am convinced that the bound holds, because I can prove it using structural induction. What I want to know is the following:

• Is there a way to prove the $O(n \log^2 n)$ bound using an argument similar to the above, by bounding the number of "moves" of the color of a node? In other words, can the proof be "fixed" easily?
• The argument from above would also hold in DAGs, not only in trees. However it seems unlikely that we could achieve a runtime better than $O(n \cdot m)$ to solve this problem in a DAG, so I guess there is no fix for the proof idea in this case. Am I right? Is there a good intuition for that?

#### Answered By : Yuval Filmus

There are two (equivalent) ways to fix the argument:

1. Start by not eliminating duplicates. Then the argument works. Then argue that removing duplicates can only decrease the running time.
2. Count elements with multiplicity, that is, consider each set $S(x)$ as a multiset.

Both of these arguments require a slight modification of the algorithm: for each vertex $x$, you keep track of the size of the subtree, and use this to select the heaviest child rather than the number of distinct colors.

This argument breaks down for a DAG since a vertex can have multiple parents.