Random Lookup Methodology

By : McLuvin
Source: Stackoverflow.com
Question!

I have a postgres database with a table that contains rows I want to look up at pseudorandom intervals. Some I want to look up once an hour, some once a day, and some once a week. I would like the lookups to be at pseudorandom intervals inside their time window. So, the look up I want to do once a day should happen at a different time each time that it runs.

I suspect there is an easier way to do this, but here's the rough plan I have: Have a settings column for each lookup item. When the script starts, it randomizes the epoch time for each lookup and sets it in the settings column, identifying the time for the next lookup. I then run a continuous loop with a wait 1 to see if the epoch time matches any of the requested lookups. Upon running the lookup, it recalculate when the next lookup should be.

My questions: Even in the design phase, this looks like it's going to be a duct tape and twine routine. What's the right way to do this?

If by chance, my idea is the right way to do this, is my idea of repeating the loop with a wait 1 the right way to go? If I had 2 lookups back to back, there's a chance I could miss one but I can live with that.

Thanks for your help!

By : McLuvin


Answers
Add a column to the table for NextCheckTime. You could use either a timestamp or just an integer with the raw epoch time. Add a (non-unique) index on NextCheckTime.

When you add a row to the database, populate NextCheckTime by taking the current time, adding the base interval, and adding/subtracting a random factor (maybe 25% of the base interval, or whatever is appropriate for your situation). For example:

my $interval   = 3600; # 1 hour in seconds
my $next_check = time + int($interval * (0.75 + rand 0.5));

Then in your loop, just SELECT * FROM table ORDER BY NextCheckTime LIMIT 1. Then sleep until the NextCheckTime returned by that (assuming it's not already in the past), perform the lookup, and update NextCheckTime as described above.

If you need to handle rows newly added by some other process, you might put a limit on the sleep. If the NextCheckTime is more than 10 minutes in the future, then sleep 10 minutes and repeat the SELECT to see if any new rows have been added. (Again, the exact limit depends on your situation.)

By : cjm


How big is your data set? If it's a few thousand rows than just randomizing the whole list and grabbing the first x rows is ok. As the size of your set grows, this becomes less and less scalable. The performance drops off at a non-linear rate. But if you only need to run this once an hour at most, then it's no big deal if it takes a minute or two as long as it doesn't kill other processes on the same box.

If you have a gapless sequence, whether there from the beginning or added on, then you can use indexes with something like:

$i=random(0,sizeofset-1);
select * From table where seqid=$i;

and get good scalability to millions and millions of rows.



This video can help you solving your question :)
By: admin