World's most popular travel blog for travel bloggers.

[Answers] Any way to optimize this with subset caching?

, , No Comments
Problem Detail: 

I have some algorithm and I wonder whether there is a way to optimize it beyond the naive approach.

Basically I have time steps and the current time is $t_c$. At any given time step, a couple of "ranges" are active from now up to some given time $t_i$ (i.e. $\forall t\leq t_i: r_i\in R(t)$).

   now       -time->    v r1 **** r2 *************** r3 ******* r4 ***************************** r5 ************ 

At each time $t$ I have to calculate some expensive function of all currently active ranges $f(\{r|r\in R(t)\})$ (in the diagram I'd have to calculated $f(r_1, r_2, r_3, r_4, r_5)$ now and given nothing changes I need $f(r_2, r_4, r_5)$ 9 steps later). Afterwards new ranges are generated and the time moves one step further. Therefore the existing ranges slowly change. Of course, past ranges are irrelevant. That's it.

    now      -time->     v r1 **** r2 *************** r3 ******* r4 ***************************** r5 ************ r6 ********** 

My function is associative and cumutative, which means I could make use of any cached function results on subsets of the currently active set. It also time-scales with the number of arguments. Therefore given $f(a,b,c)$ I can calculate $f(d,a,b,c,e)$ in only two calculations rather than five.

Can you see any way to optimize this which is faster than calculating the function at each time for each active range?

Asked By : Gerenuk

Answered By : D.W.

It sounds like you are not entirely clear on the properties and structure of your function $f$, which makes things harder. So, my first advice would be to spend a little more time trying to characterize them, as they may help you select suitable algorithms. That said, I will suggest several methods below, which you might find helpful. I think you might find Method 3 perfect for your needs: it should help a lot.

Method 1: Memoize $g$

Suppose

$$f(a,b,c,\dots) = g(a)+g(b)+g(c)+\cdots,$$

where $+$ is some fast monoid operation (i.e., commutative and associative).

Then a very simple approach is: whenever you see a new range $r$, compute $g(r)$, and remember the results of this computation (use memoization so you never need to compute $g(r)$ again).

Now, at each time step, you have some set of active ranges, say $a,b,c,d$. So, compute $f(a,b,c,d)$ by computing $g(a)+g(b)+g(c)+g(d)$, where you recompute the sum from scratch but you don't need to re-compute $g(\cdot)$ values you've computed before. If $a,b,c$ were previously present and only $d$ is new, then you'll only need to compute $g(d)$. You've previously computed $g(a),g(b),g(c)$, so there is no need to re-compute those values -- you can just look them up in your memoization table. Finally, you apply the $+$ operator three times to sum up the four values $g(a),g(b),g(c),g(d)$.

The running time of this method is $O(n)$ evaluations of the $+$ operator and $O(n_+)$ evaluations of $g(\cdot)$ per time step, where $n$ is the total number of ranges active at that time step and $n_+$ is the number of new ranges introduced at that time step (i.e., which weren't active at the previous time step).

The advantage of this method is that it is extremely simple: it should be very easy to implement. It might be the best choice when the $+$ operator is very fast and $g(\cdot)$ is very slow, because it avoids re-computing $g(\cdot)$ but it makes no attempt to minimize applications of $+$. If $+$ is also very slow, then you might do better using one of the more sophisticated methods below.

Method 2: Prefix sums

Suppose $f$ is as above, but now evaluting $+$ is slow. Then the basic idea of Method 2 is that you can remember prefix sums. If you've computed the sum $g(a)+g(b)+g(c)+g(d)$, then the prefix sums are the sums of the prefixes of this expression: i.e., the values $g(a)$, $g(a)+g(b)$, and $g(a)+g(b)+g(c)$. Using the obvious method to compute the sum, it takes no extra effort to find the prefix sums (you generate them along the way). So, you might as well remember the value of all of those prefix sums, and reuse them whenever possible.

I suggest that initially, you take all of the ranges that are initially active, you sort them by their end date (so that the longest-lived ranges come first in the sequence), and then you compute prefix sums. Say the initial ranges are $a,b,c,d$, in order of decreasing end date. Then you compute $g(a)$, $g(a)+g(b)$, $g(a)+g(b)+g(c)$, and $g(a)+g(b)+g(c)+g(d)$. The latter value is exactly $f(a,b,c,d)$, which you output. However you also remember the prefix sums. At some point in the future, $d$ will expire, so now you can throw away the value of $g(a)+g(b)+g(c)+g(d)$ and remember just $g(a)$, $g(a)+g(b)$, and $g(a)+g(b)+g(c)$. You always use the longest of those prefix sums as the beginning point for adding in new elements.

Unfortunately, it's a bit of a challenge to turn this into a full algorithm that will be really useful, so I'll jump straight to Method 3, which is a generalization of this.

Method 3: binary tree

Here's a better method. Sort the active ranges by decreasing end date. We'll maintain a binary tree, with one leaf per active range.

Annotate each node with the result of evaluating $f(\cdot)$ on all of the ranges underneath that node (e.g., if the leaves that are descendants of that node correspond to ranges $a,b,c,d,e$, then that node will be annotated with the value of $f(a,b,c,d,e)$). Notice that the annotation on the root is just the result of evaluating $f(\cdot)$ on the set of all active ranges, so it is the desired output.

When you move on to the next step, some ranges will expire and some new ranges will be added. This means you'll need to delete some existing leaves and insert some new leaves (in the appropriate place, maintaining the sorted order).

When you delete a leaf, you'll need to update the value on each of its ancestor nodes, but each ancestor node can be updated with $O(1)$ evaluations of $+$. Insertion works similarly. So, inserting or deleting a leaf at depth $d$ takes $O(d)$ work. If you use a balanced binary tree data structure, then the depth will be $O(\lg n)$, where $n$ is the total number of leaves, so each deleted/inserted leaf will take $O(\lg n)$ work. There are a variety of possible implementations, e.g., using AVL trees, red-black trees, or other balanced tree data structures. Or, you can ignore the rebalancing operations and hope that your tree will stay approximately balanced in practice and keep your implementation simple.

Ultimately, this requires $O(n_+)$ evaluations of $g(\cdot)$ and $O((n_+ + n_-) \lg n)$ evaluations of $+$ per time step, where $n$ denotes the total number of ranges active at that time, $n_+$ the number of ranges newly introduced at that time step, and $n_-$ the number of ranges that expired at that time step. This is significantly better than Method 1, and it generalizes Method 2.

Method 4: Group operation

Finally, for completeness, let me mention one specific situation where one can achieve better running time. Suppose the $+$ operation has an inverse. In other words, suppose

$$f(a,b,c,\dots) = g(a)+g(b)+g(c)+\cdots,$$

where $g$ is a slow function and $+$ is some group operation. (In other words, $+$ is a binary operator that is commutative, associative, and has an inverse operator $-$ with the obvious properties: e.g., $x+y-y=x$.) Assume $+$ and $-$ are fast to compute.

In this case, there is a good solution. You can simply update the value of $f(\cdots)$ on the fly. At each time step, some new ranges get added and some get removed, so you compute $g$ of each of those ranges and add the values that are being added and subtract the ones that are being removed.

For example, suppose at time step 0 we have $a,b,c$ active; then we compute $f(a,b,c)$. Now suppose at time step 1 just $b,c$ are active: then we can compute $f(b,c)=f(a,b,c)-f(a)$, using a single computation of $g$ and a single subtraction (since we already know the value of $f(a,b,c)$, no need to recompute it). Let's say that at time step 2 we have $b,c,d$ active: then $d$ is new, so we compute $f(b,c,d) = f(b,c)+f(d)$, which requires just a single computation of $g$ and a single addition.

You can further improve this using memoization: remember the value of $g$ at all the ranges you've computed it on. Then to get from $f(a,b,c)$ to $f(b,c)$ requires just a single subtraction: there's no need to recompute $f(a)=g(a)$, since you've previously computed it.

I know you said that this doesn't apply to you, since you don't have any way to compute the inverse operation -- but I'm mentioning this for completeness, in case anyone else comes across a similar situation but does have inverses.

More reading, background, and tangential stuff

Here's some additional reading on topics that might be helpful or interesting to you.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/19342

3.2K people like this

 Download Related Notes/Documents

0 comments:

Post a Comment

Let us know your responses and feedback