World's most popular travel blog for travel bloggers.

# Determine what is the best order for running filters on a dataset

, ,
Problem Detail:

I'm trying to figure out what is the optimal order for running a sequence of tasks in a pipeline.

Each task filters a percentage of the dataset.

Assuming I got the tasks t1, t2, t3, ..., ti and a dataset. Each task have an execution time per item in the dataset denoted as e1, e2, e3, ..., ei and each task filters a certain percentage of the dataset denoted as f1, f2, f3, ..., fi.

What is the optimal sequence of tasks to achieve the minimum running time ?

I've first tried to examine two tasks - t1 and t2 with their representative execution time and filter percentage - e1, f1 and e2, f2.

I want to find the ordering predicate for getting e1 + f1*e2 < e2 + f2*e1. I'll get the condition e1 / (1 - f1) < e2 / (1 - f2). And I guess that from here I'll need to prove it in induction, but I'm stuck because the equation is some kind of recursion.

###### Answered By : Eyal Schneider

You can prove that the optimal ordering is according to ascending order of `e(i)/f(i)`.

Proof:

Let `r(i) = 1 - f(i)`. The total cost of an ordering is: `e(1) + e(2)*r(1) + e(3)*r(2)*r(1) + ... + e(N)*r(N-1)*...*r(1)`

Now, assume that there's an optimal ordering that doesn't follow the rule above. It follows that there are two adjacent tasks t(i) and t(i+1) such that `e(i)/(1-r(i)) > e(i+1)/(1-r(i+1)) => e(i)*(1-r(i+1)) > e(i+1)*(1-r(i))` swapping these two will change the value of only two expressions in the total cost expression, resulting in a delta of:

``e(i) * r(i-1)*...*r(1) + e(i+1)*r(i)*...*r(1) - e(i+1)*r(i-1)*...*r(1) + e(i)*r(i+1)*r(i-1)*...*r(1) = r(i-1)*...*r(1)*[e(i) + e(i+1)*r(i) - e(i+1) - e(i)*r(i+1)] = r(i-1)*...*r(1)*[e(i)*(1-r(i+1)) - e(i+1)*(1-r(i))] > 0 ``

The delta is positive, meaning that the original configuration is more expensive than the outcome of the swap, therefore we have a contradiction.