World's most popular travel blog for travel bloggers.

[Solved]: Returning a random subset with length k of N strings while only storing at most k of them

, , No Comments
Problem Detail: 

Here's the problem. I've written a program that reads strings from stdin, and returns a random subset of those strings. The only other argument provided to the program is the length of the subset, $k$. The subset must contain exactly $k$ strings selected uniformly at random from the entire input set.

It's easy to do this if every single string is stored in memory. (Memory proportional to N). The question is how to only store at most $k$ strings, and still ensure that the output is perfectly random.

I've tried to work it out with the following base case. Say $k = 1$.

> Subset 1 A B C 

A will always be added, since the queue contains less than k items. If I only operate on what I know, which is the current number of items in the queue, the required length $k$, and the encountered strings $n$. So I've tried doing this: ($k$ - items in queue)/$n$.

Using that, the probability of A being replaced by B is $1/2$. Then the probability of B being replaced by C will be $1/3$. The problem there is that there's no way of reducing the probability of A being replaced before I've seen all of the strings.

I'm sure this question must have a clever answer. The problem of writing this Subset program to use memory $<= k$ is a bonus question on an Algs course I'm taking, and I really hope they wouldn't ask something unanswerable.

Asked By : user2845306

Answered By : D.W.

Use reservoir sampling. This is a good description in Wikipedia, or in Knuth.

Let's start with the simple case, where $k=1$. You always have one string in memory. When you read the first string, you store it in memory. Each time you read a new string, you replace it with the one in memory with probability $1/i$, if this is the $i$th string you've read so far. At the end, output whatever is stored in memory. The end result is that each string in the input is equally likely to be output. See also Choosing an element from a set satisfying a predicate uniformly at random in $O(1)$ space for description of this approach (thank you, Juho!).

This extends to arbitrary $k$. See Wikipedia's description for details.

Best Answer from StackOverflow

Question Source : http://cs.stackexchange.com/questions/28123

0 comments:

Post a Comment

Let us know your responses and feedback