get a series of random numbers from a big list

Postby pygene » Thu Mar 21, 2013 8:35 pm

Guys,
I want to generate a series of random numbers without replacement, in ascending order. Basically I have a list of strings (many million lines), and a small percentage of them are repetitive. What I want to do is:
1. Take 10,000 random strings from this list, store them in file 1, and remove these 10,000 picks from the rest of the sampling.
2. Take another 10,000 random strings from the remainder of the list, store them along with the contents of file 1 in file 2, and remove these 10,000 picks.
3. Take another 10,000 random strings from the remainder of the list, store them along with the contents of file 2 in file 3, and remove these new 10,000 picks.
...
until I have sampled all of the available strings.

I have a rough idea of how to implement this, but since I'm fairly new to Python, I'd like to know whether there is an efficient way to do it. I'd love to hear your thoughts on this!
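Here is a sketch of the rough idea I have in mind (the function name, file names, and batch size handling are just placeholders):

```python
import random

def sample_in_batches(lines, batch_size=10000, prefix="file"):
    """Shuffle once, then write cumulative files: file1 holds the first
    batch of picks, file2 holds those plus the next batch, and so on."""
    pool = list(lines)
    random.shuffle(pool)  # a shuffled order gives sampling without replacement
    picked = []
    count = 0
    for start in range(0, len(pool), batch_size):
        count += 1
        picked.extend(pool[start:start + batch_size])  # last slice may be short
        with open("%s%d.txt" % (prefix, count), "w") as out:
            out.write("\n".join(picked) + "\n")
    return count
```

Shuffling the whole list once and slicing it should be equivalent to repeatedly drawing without replacement, if I understand it right.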

Thanks!
pygene
 
Posts: 4
Joined: Fri Mar 15, 2013 2:51 pm

Re: get a series of random numbers from a big list

Postby KevinD » Fri Mar 22, 2013 1:29 am

You might want to look at the "random" module, specifically the "shuffle", "choice" and "sample" functions.

You could use "sample" to select a specific number of random integers, then take the lines corresponding to those positions.
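For example, something along these lines (a sketch that assumes the file fits in memory; `pick_lines` is a made-up name):

```python
import random

def pick_lines(path, k=10000):
    """Read all lines, use random.sample to draw k distinct positions,
    and sort them so the picks come out in ascending file order."""
    with open(path) as f:
        lines = f.readlines()
    positions = sorted(random.sample(range(len(lines)), k))
    return [lines[i] for i in positions]
```

Note that random.sample raises ValueError if k is larger than the population, so the last, short batch needs a check.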
How much wood would a woodchuck chuck if a woodchuck could chuck wood?
KevinD
 
Posts: 30
Joined: Fri Feb 08, 2013 3:15 am

Re: get a series of random numbers from a big list

Postby pygene » Fri Mar 22, 2013 3:39 pm

Yeah, I'm reading that module. Thanks for the hints on those methods.
pygene
 
Posts: 4
Joined: Fri Mar 15, 2013 2:51 pm

Re: get a series of random numbers from a big list

Postby tnknepp » Fri Mar 22, 2013 4:04 pm

Some things to consider:
1. Do you want only unique strings? Making your list unique may reduce your processing time substantially. Even if it only changes by 10%, when the total run time is hours that can be worthwhile.
2. How will you handle the last few list items if the original list length is not a multiple of 10,000? If, on your last iteration, only 5 items remain and you try to pull 10,000, you will hit a problem (random.sample raises ValueError when the sample size exceeds the population).
3. For big, repetitive jobs you will find it advantageous to look for simple ways of optimizing your code. List comprehensions beat "for" loops. I like to convert my data to a numpy array when I can (and when it makes sense), because operations can be done very quickly and logically with the available indexing.
4. Eventually, it may make sense to implement parallel processing.
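Point 2 can be handled by capping each draw at whatever is left; a sketch (the names here are made up):

```python
import random

def draw_batches(pool, batch_size=10000):
    """Yield successive random batches without replacement; the final
    batch is smaller when the pool size is not a multiple of batch_size."""
    remaining = list(pool)
    random.shuffle(remaining)
    while remaining:
        take = min(batch_size, len(remaining))  # never draw more than is left
        batch, remaining = remaining[:take], remaining[take:]
        yield batch
```

The min() guard means the loop simply ends with a short batch instead of raising an error.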

I have taken code that I put together hastily, with a process time of 7-8 hours, and reduced that to 20 minutes by optimizing it and implementing parallel processing. Optimization can be highly important for big, repetitive tasks.

Keep in mind that if you spend three hours optimizing code that will only be run once, and you reduce its run time from 1 hour to 10 minutes, you have still come out behind (though you probably learned a lot). It depends on how much your time is worth.
Python: 2.7 via Anaconda
Numpy: 1.7
Pandas: 0.11
OS: Windows 7
IDE: Spyder/IPython
tnknepp
 
Posts: 123
Joined: Mon Mar 11, 2013 7:41 pm

