best way to pick a random subset from a collection?

By : Tom
Source: Stackoverflow.com
Question!

I have a set of objects in a Vector from which I'd like to select a random subset (e.g. 100 items coming back; pick 5 randomly). In my first (very hasty) pass I did an extremely simple and perhaps overly clever solution:

Vector itemsVector = getItems();

Collections.shuffle(itemsVector);
itemsVector.setSize(5);

While this has the advantage of being nice and simple, I suspect it's not going to scale very well, i.e. Collections.shuffle() must be O(n) at least. My less clever alternative is

Vector itemsVector = getItems();

Random rand = new Random(System.currentTimeMillis()); // would make this static to the class    

List subsetList = new ArrayList(5);
for (int i = 0; i < 5; i++) {
     // be sure to use Vector.remove() or you may get the same item twice
     subsetList.add(itemsVector.remove(rand.nextInt(itemsVector.size())));
}

Any suggestions on better ways to draw out a random subset from a Collection?

By : Tom


Answers

Your second solution of using Random to pick element seems sound, however:



two solutions I don't think appear here - the corresponds is quite long, and contains some links, however, I don't think all of the posts relate to the problem of choosing a subst of K elemetns out of a set of N elements. [By "set", I refer to the mathematical term, i.e. all elements appear once, order is not important].

Sol 1:

//Assume the set is given as an array:
Object[] set ....;
for(int i=0;i<K; i++){
randomNumber = random() % N;
    print set[randomNumber];
    //swap the chosen element with the last place
    temp = set[randomName];
    set[randomName] = set[N-1];
    set[N-1] = temp;
    //decrease N
    N--;
}

This looks similar to the answer daniel gave, but it actually is very different. It is of O(k) run time.

Another solution is to use some math: consider the array indexes as Z_n and so we can choose randomly 2 numbers, x which is co-prime to n, i.e. chhose gcd(x,n)=1, and another, a, which is "starting point" - then the series: a % n,a+x % n, a+2*x % n,...a+(k-1)*x%n is a sequence of distinct numbers (as long as k<=n).



I wrote an efficient implementation of this a few weeks back. It's in C# but the translation to Java is trivial (essentially the same code). The plus side is that it's also completely unbiased (which some of the existing answers aren't) - a way to test that is here.

It's based on a Durstenfeld implementation of the Fisher-Yates shuffle.



This video can help you solving your question :)
By: admin