I have a set S (so no duplicates) of d-dimensional vectors of non-negative real numbers (or if you would prefer, floats).
I say a vector u "covers" a vector v if, in every dimension 1..d, u[i] >= v[i]. So for d=3, (3, 3, 2) covers (2, 3, 1), but (3, 3, 1) doesn't cover (2, 2, 2).
I am interested in finding a subset T of S, such that for every v in T, there is no u in S with u != v, such that u covers v. Alternatively, I'm interested in removing from S those vectors that are covered by other vectors in S.
What is an efficient algorithm for this? Barring that, what is at least the "real name" of this problem to help me search for it?
An O(n^2) algorithm is obvious: for every vector in S, check it against every other vector in S, and add it to T if nothing covers it. I'm having trouble doing much better than that.
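For concreteness, here is a minimal sketch of that naive check, with vectors represented as tuples (the names `covers` and `naive_front` are illustrative, not from any existing code):

```python
def covers(u, v):
    """True if u[i] >= v[i] in every dimension i."""
    return all(a >= b for a, b in zip(u, v))

def naive_front(S):
    """O(n^2): keep each vector that no other vector in S covers."""
    return [v for v in S
            if not any(covers(u, v) for u in S if u != v)]
```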
I have considered trying to use kd-trees and range trees, but this is for a real world problem, and in practice d is too high and the number of vectors too low, so that e.g. the O(n * log^d(n)) running time is actually worse than the naive O(n^2). (d is approximately 20-40, and the number of vectors is in the millions).
Edit - Particular Problem Details
As requested, more information about the specific problem I am trying to solve.
The implementation problem at hand involves several sets of vectors: approximately 60-100 sets, each with about 10-20 vectors in them. For every set s, every vector v in s has dimension d, and all axes of v are zero save for one, which is positive. A result vector is formed by choosing precisely one vector from each set and adding them together, and a given result vector is strictly worse if it is covered (or, apparently the real word is 'dominated') by another result vector. You can assume the non-zero axis of a base vector varies uniformly in [.9, 1.1]. In practice most intermediate results appear to be dominated, and size(T) << size(S).
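For illustration, a hypothetical generator for inputs of this shape; the parameter values below are picked from the ranges described above, not taken from the actual application:

```python
import random

def make_sets(num_sets=80, vecs_per_set=15, d=30, seed=0):
    """Build sets of one-hot vectors as described: each vector is zero on
    every axis except one, whose value is uniform in [0.9, 1.1]."""
    rng = random.Random(seed)
    sets = []
    for _ in range(num_sets):
        s = []
        for _ in range(vecs_per_set):
            v = [0.0] * d
            v[rng.randrange(d)] = rng.uniform(0.9, 1.1)  # the one positive axis
            s.append(tuple(v))
        sets.append(s)
    return sets
```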
Asked by: James Dowdell
Answered by: D.W.
You're looking for a set of Pareto-optimal alternatives. If you're looking for all such possibilities (i.e., looking to make $T$ as large as possible), you're looking for the Pareto-optimal front. You asked for keywords or names to search for this -- those are standard terms for this concept. See https://en.wikipedia.org/wiki/Multi-objective_optimization.
For specific situations, if the data is structured, it may be possible to take into account knowledge of the structure of the data to find a better algorithm -- but this will depend on the nature of the data.
In your specific case, there are algorithmic techniques that might lead to significant improvements in running time. In particular, in your setting, each vector $v \in S$ is obtained as a sum of some component vectors $v = \sum_i v_i$, where we have a collection of sets $S_1,S_2,\dots,S_k$ and each $v_i$ is chosen from some set $S_i$.
Let me establish some notation and definitions, prove some mathematical results, and then show how to use them to compute the Pareto-optimal front for your specific situation.
Definition. We say that $u$ dominates $v$ if $u[j] \ge v[j]$ for all dimensions $j$. (This is the same as what you call "covers", but "dominates" is the more standard term.)
Definition 1. If $S,T$ are sets, then $S+T$ is defined to be the set $S+T=\{s+t : s \in S, t \in T\}$.
Definition 2. If $S$ is a set of vectors, define $P(S)$ to be the Pareto-optimal front of $S$, i.e., the set of vectors $s \in S$ such that there does not exist any vector $t \in S$ with $t \ne s$ that dominates $s$.
Your problem is the following: given sets $S_1,\dots,S_k$ of vectors, find the Pareto-optimal front of $S_1+\dots + S_k$, i.e., compute $P(S_1 + \dots + S_k)$. I'll show an algorithm to compute this.
The following lemmas will be helpful:
Lemma 1. If $u$ dominates $v$, then $u+w$ dominates $v+w$.
Lemma 2. If $u_i$ dominates $v_i$ for each $i$, then $\sum_i u_i$ dominates $\sum_i v_i$.
Lemma 3. $P(S+T) = P(P(S)+P(T))$.
(Lemma 2 follows from Lemma 1 by induction. For Lemma 3: if $s+t \in S+T$ has $s \notin P(S)$, pick some $s' \in P(S)$ that dominates $s$; then $s'+t$ dominates $s+t$ by Lemma 1, so no Pareto-optimal element of $S+T$ is lost by restricting attention to $P(S)+P(T)$, and filtering that smaller set yields exactly $P(S+T)$ by transitivity of dominance.)
We'll now use Lemma 3 repeatedly to compute $P(S_1 + \dots + S_k)$. We build a binary tree over the $k$ leaves $S_1,\dots,S_k$. At each node, we compute the Pareto-optimal front of the sum of the two sets corresponding to its two children.
In other words, here's how we compute the Pareto-optimal front for several values of $k$:
$$P(S_1+S_2) = P(P(S_1)+P(S_2)).$$
$$P(S_1+S_2+S_3+S_4) = P(P(P(S_1)+P(S_2)) + P(P(S_3)+P(S_4))).$$
(In other words, to compute $P(S_1+S_2+S_3+S_4)$, we first compute $U_1 = P(S_1+S_2)$ efficiently using Lemma 3 and $U_2 = P(S_3 + S_4)$ using Lemma 3, then we compute $P(U_1+U_2)$ and use Lemma 3 to conclude that this is equal to $P(S_1+S_2+S_3+S_4)$.)
You can see how this works recursively. To compute $P(S_1+\dots + S_k)$, we divide the sets in half, recursively compute $X=P(S_1+\dots + S_{k/2})$ and $Y=P(S_{k/2+1} + \dots + S_k)$ using the efficient algorithm, and then compute $P(X+Y)$.
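Here is a compact sketch of this recursive scheme, assuming vectors are represented as tuples; the quadratic simple cull stands in for $P(\cdot)$, and any better Pareto-front routine could be dropped in:

```python
def dominates(u, v):
    """u dominates v if u[j] >= v[j] in every dimension j."""
    return all(a >= b for a, b in zip(u, v))

def pareto(points):
    """P(points): the points not dominated by any other point (simple cull)."""
    pts = list(set(points))  # drop exact duplicates
    return [v for v in pts
            if not any(dominates(u, v) for u in pts if u != v)]

def minkowski_sum(X, Y):
    """X+Y: all pairwise sums of a vector from X and a vector from Y."""
    return [tuple(a + b for a, b in zip(x, y)) for x in X for y in Y]

def pareto_sum(sets):
    """Compute P(S_1 + ... + S_k) by halving and applying Lemma 3."""
    if len(sets) == 1:
        return pareto(sets[0])
    mid = len(sets) // 2
    X = pareto_sum(sets[:mid])          # P(S_1 + ... + S_mid)
    Y = pareto_sum(sets[mid:])          # P(S_{mid+1} + ... + S_k)
    return pareto(minkowski_sum(X, Y))  # equals P of the full sum, by Lemma 3
```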
At each stage of the binary tree, we are filtering out all dominated vectors. If you're lucky, this greatly reduces the number of possibilities that remain at each level of the binary tree.
I suggest you try this on your specific application. If, as you say, most intermediate results are dominated, so that the front at each internal node of the tree stays around the final size $m = |P(S_1+\dots+S_k)|$, then this should be efficient: the time to merge the two children of any node will be $O(m^3)$ [using the trivial quadratic algorithm mentioned in the question], and the total running time will be at most $O(m^3 k)$. In practice, the actual running time might be significantly faster than that.
You could also explore different algorithms for computing $P(\cdot)$, the Pareto-optimal front of a set. There are basically two standard algorithms: the "simple cull" algorithm [which is the same as the quadratic-time algorithm you mentioned in the question], and a recursive divide-and-conquer algorithm.
The simple cull algorithm has $O(n^2)$ worst-case running time, as you say. Actually, if the Pareto-optimal front has size $m$, it does a bit better: the running time is $O(mn)$. (In other words, if we let $n=|S|$ and $m=|P(S)|$, then you can compute $P(S)$ from $S$ in $O(nm)$ time.) The divide-and-conquer algorithm has $O(n (\log n)^{d-1})$ worst-case running time, which, as you say in the question, will likely be too slow for your parameters.
However, it is also possible to combine the two techniques, where you use the recursive divide-and-conquer algorithm but once the problem size becomes small enough you switch to the simple cull algorithm. Whether this will lead to any improvement in your actual setting can probably only be determined through experimentation.
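One possible shape for such a hybrid, as a simplified sketch: it merges the two half-fronts by filtering each against the other, rather than implementing the full $O(n (\log n)^{d-1})$ merge, so it illustrates the structure and the cutoff rather than the worst-case bound. The cutoff value here is an assumption to be tuned experimentally:

```python
def dominates(u, v):
    return all(a >= b for a, b in zip(u, v))

def simple_cull(points):
    return [v for v in points
            if not any(dominates(u, v) for u in points if u != v)]

def hybrid_front(points, cutoff=64):
    """Divide and conquer for P(points), falling back to simple cull
    once the subproblem is small enough."""
    points = list(set(points))
    if len(points) <= cutoff:
        return simple_cull(points)
    mid = len(points) // 2
    A = hybrid_front(points[:mid], cutoff)
    B = hybrid_front(points[mid:], cutoff)
    # Merge: keep members of each half-front not dominated by the other front.
    return ([a for a in A if not any(dominates(b, a) for b in B)] +
            [b for b in B if not any(dominates(a, b) for a in A)])
```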
For an overview of these algorithms, see for example the following paper:
A Calculator for Pareto Points. Marc Geilen and Twan Basten. Design, Automation & Test in Europe Conference & Exhibition 2007 (DATE'07), IEEE.
To compute $P(P(S)+P(T))$, the best algorithm I know is to compute $S'=P(S)$ and $T'=P(T)$ using one of the above methods, then compute $S'+T'$ using the obvious quadratic-time algorithm (combine all pairs), then compute $P(S'+T')$ using one of the above methods.
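As a sanity check, assuming the `pareto`, `pareto_sum`, and `make_sets` sketches above, one can compare the tree-based result against brute-force enumeration of the full product on a small instance (modulo floating-point rounding):

```python
from itertools import product

def brute_force_front(sets):
    # Enumerate every choice of one vector per set, sum coordinates, then cull.
    results = {tuple(map(sum, zip(*choice))) for choice in product(*sets)}
    return pareto(list(results))

def canon(front):
    # Round to absorb floating-point reassociation differences between the
    # tree-shaped sums in pareto_sum and the flat sums taken here.
    return {tuple(round(x, 9) for x in v) for v in front}

small = make_sets(num_sets=6, vecs_per_set=4, d=5, seed=1)
assert canon(pareto_sum(small)) == canon(brute_force_front(small))
```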
Question source: http://cs.stackexchange.com/questions/40142