I have a collection $U$ of sets, where each set has at most 95 elements (one for each printable ASCII character). For example, $\{h,r,l,a\}$ is one set, and $U = \{\{h,r,l,a\}, \{l,e,d\}, \ldots\}$. The number of sets in $U$ is nearly a million, and a typical set in $U$ contains 8-20 elements.
I am looking for a data structure for storing a collection of sets that supports the following operations:
- set matching, e.g. check if the set $\{h,r,l,a\}$ is present in $U$
- subset matching, e.g. check if the set $\{h,r,l\}$ is a subset of any set in $U$
- superset matching, e.g. check if the set $\{h,r,l,a,s\}$ is a superset of any set in $U$
- union matching, e.g. check if the set $\{h,r,l,a,e,d\}$ is a union of sets in $U$
- approximate set matching, e.g. an approximate check of whether $\{h,r,l,e\}$ is present in $U$ should return true (it is close to $\{h,r,l,a\}$, which is in $U$)
In particular, we can assume that once the data structure is built, no modifications are made; only queries of the above types are performed (the structure is static).
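To pin down what these queries mean operationally (union matching in particular), here is a brute-force reference sketch in Python; the function names and the symmetric-difference threshold used for "approximate" matching are illustrative choices, not something fixed by the question:

```python
# Brute-force reference semantics over plain Python sets.
# U: list of frozensets; Q: frozenset of printable ASCII characters.

def exact_match(U, Q):
    return any(S == Q for S in U)

def subset_match(U, Q):
    # is Q a subset of some set in U?
    return any(Q <= S for S in U)

def superset_match(U, Q):
    # is Q a superset of some set in U?
    return any(S <= Q for S in U)

def union_match(U, Q):
    # Q is a union of sets from U  iff  the union of all S in U with S <= Q equals Q.
    covered = set()
    for S in U:
        if S <= Q:
            covered |= S
    return covered == Q

def approx_match(U, Q, max_diff=2):
    # "approximate" taken here as symmetric-difference size at most max_diff (an assumption);
    # {h,r,l,e} vs {h,r,l,a} has symmetric difference {a,e}, i.e. size 2.
    return any(len(S ^ Q) <= max_diff for S in U)
```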
I was thinking of a trie data structure. But a trie demands storing the data in some order, so I would have to store every set as a bit vector, and then the trie degenerates into a binary decision tree. Am I headed in the right direction? Any pointers will be appreciated.
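For reference, the bit-vector encoding described here is a 95-bit mask indexed by printable ASCII, and subset/superset tests against a single stored set become one bitwise operation each; a minimal sketch (plain Python integers standing in for the bit vectors, function names are illustrative):

```python
# Encode a set of printable ASCII characters (codes 32..126) as a 95-bit integer mask.
def to_mask(chars):
    m = 0
    for c in chars:
        m |= 1 << (ord(c) - 32)
    return m

def is_subset(q_mask, s_mask):
    # Q ⊆ S  iff  Q has no bit set outside S
    return q_mask & ~s_mask == 0

def is_superset(q_mask, s_mask):
    # S ⊆ Q
    return s_mask & ~q_mask == 0

print(is_subset(to_mask("hrl"), to_mask("hrla")))      # True
print(is_superset(to_mask("hrlas"), to_mask("hrla")))  # True
```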
Asked By : Curious
Answered By : Fizz
Generically, these are sometimes called subset/containment dictionaries. The fact that you had partial matching in your question (but deleted it) is actually not a coincidence, because subset/containment queries and partial matching are equivalent problems for sets.
You probably want to use a UBTree (unlimited branching tree) for this; it's basically a modified trie. See Hoffmann and Koehler (1998) for more details.
You could also have a look at the more recent set-trie data structure proposed by Savnik (2013) for the same purpose. (If you miss the colors in the graphs of that preprint and don't have access to the official Springer publication [in which the colors aren't missing], there's a precursor to that paper that has almost the same graphs/info and no missing colors.)
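For intuition, here is a minimal sketch in the spirit of the set-trie: each set is stored as the sorted sequence of its elements in a trie, and subset/superset existence queries walk the trie while pruning on element order. It follows the operations described in Savnik's paper, but the class and method names are mine and this is only an illustration, not the paper's reference implementation:

```python
class SetTrieNode:
    def __init__(self):
        self.children = {}   # element -> SetTrieNode
        self.is_end = False  # a stored set ends at this node

class SetTrie:
    def __init__(self):
        self.root = SetTrieNode()

    def insert(self, s):
        node = self.root
        for e in sorted(s):            # sets are stored as sorted element sequences
            node = node.children.setdefault(e, SetTrieNode())
        node.is_end = True

    def contains_superset_of(self, q):
        # "subset matching" of the question: is q a subset of some stored set?
        return self._superset(self.root, sorted(q), 0)

    def _superset(self, node, q, i):
        if i == len(q):
            return True                # every node lies on the path of some stored set (non-empty trie)
        for e, child in node.children.items():
            if e == q[i] and self._superset(child, q, i + 1):
                return True
            if e < q[i] and self._superset(child, q, i):
                return True            # stored set has an extra (smaller) element; keep descending
        return False

    def contains_subset_of(self, q):
        # "superset matching" of the question: is some stored set a subset of q?
        return self._subset(self.root, sorted(q), 0)

    def _subset(self, node, q, i):
        if node.is_end:
            return True                # the set ending here uses only elements of q
        if i == len(q):
            return False
        if q[i] in node.children and self._subset(node.children[q[i]], q, i + 1):
            return True
        return self._subset(node, q, i + 1)   # skip q[i]
```

Exact-set matching is then an ordinary trie lookup on the sorted sequence, and union matching can be answered by collecting every stored subset of the query (a variant of the subset walk) and checking whether their union equals the query.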
Both papers (Hoffmann & Koehler, and Savnik, respectively) include experimental results, so you can get some idea of what they handle in practice. Both seem to cope with data sets where the cardinality of $U$ is around one million.
If you somehow have TCAM hardware (or the money for it), there's a way to leverage it for subset/superset queries. You can actually run both subset and superset queries in parallel, assuming you have enough TCAM words ($2|U|$). Since TCAM words can be configured to be 144 bits wide and you have only 95 bits/letters to test, you wouldn't even need to bother with Bloom filters or hashing; you'd have an exact test using TCAM. This is trivial enough that I'll even say here how: the $\{0,1\}$ bit vector corresponding to each set in your $U$ is simply turned into a $\{0,*\}$ vector for subset queries and into a $\{1,*\}$ vector for superset queries.
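To make that encoding concrete, here is a tiny software analogue of the ternary words (one value/care pair per entry); it only simulates the TCAM matching rule and is not hardware code, and the helper names are my own:

```python
WIDTH = 95                            # one set fits comfortably in a 144-bit TCAM word
FULL = (1 << WIDTH) - 1

def to_mask(chars):                   # set of printable ASCII chars -> 95-bit mask
    m = 0
    for c in chars:
        m |= 1 << (ord(c) - 32)
    return m

def ternary_match(key, value, care):
    # TCAM rule: the key matches an entry if it agrees with 'value' on every cared-for bit.
    return (key ^ value) & care == 0

def entries_for(s_mask):
    subset_entry   = (0, FULL & ~s_mask)   # {0,*}: don't-care on S's bits, 0 elsewhere
    superset_entry = (s_mask, s_mask)      # {1,*}: 1 on S's bits, don't-care elsewhere
    return subset_entry, superset_entry

sub_e, sup_e = entries_for(to_mask("hrla"))
print(ternary_match(to_mask("hrl"), *sub_e))    # True: {h,r,l} is a subset of {h,r,l,a}
print(ternary_match(to_mask("hrlas"), *sup_e))  # True: {h,r,l,a,s} is a superset of {h,r,l,a}
```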
A more general problem, tackled in ETI (no free copy, sorry), is finding a set with a given similarity measure. For example, the similarity measure can be $J(Q,S)=\frac{|Q\cap S|}{|Q\cup S|}$ and the query for a given $Q$ can be to find all $S$ (in the database) with $J(Q,S)\geq 0.3$. Both the constant and the similarity measure are user-definable as a predicate in ETI. (The $J$ given as an example here is the Jaccard index.)
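The similarity query itself is easy to state; a naive linear scan using the Jaccard index looks like the sketch below (the 0.3 threshold is just the example value from above, the function names are mine, and an index like ETI exists precisely to avoid scanning all of $U$):

```python
def jaccard(q, s):
    # J(Q, S) = |Q ∩ S| / |Q ∪ S|
    return len(q & s) / len(q | s)

def similar_sets(U, q, threshold=0.3):
    # naive linear scan over the whole collection
    return [s for s in U if jaccard(q, s) >= threshold]

U = [frozenset("hrla"), frozenset("led")]
print(similar_sets(U, frozenset("hrle")))  # both sets qualify: J = 0.6 and 0.4 respectively
```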
Best Answer from StackOverflow
Question Source : http://cs.stackexchange.com/questions/39976