Machine learning Questions

A forum for general discussion of the Python programming language.

Postby AEA » Thu Sep 19, 2013 12:16 am

Hi Python forum,

I have no idea whether the task I am undertaking would be considered big data; all I know is that it got too big for Excel, so I have had to switch to a different method. I have written my first few scripts in Python: creating RSS feeds, scraping websites and comparing dicts. But I know this is all just getting me ready for the real reason I started using Python: machine learning on a scale I couldn't achieve with Excel and its Solver engine (I used to leave it running for weeks to optimise my models).

Now I have about 150k results in my data, each of them with 120 different parameters (85% of which require either calculating or some form of standardisation). What I don't know is whether I should be creating the model by queering a database or from standard data files. How would a model like this normally be done in Python?

One other query I have: can a standard workstation (8 GB RAM, i5) handle the solving of such models, or would I need to invest in a dedicated 16/32-core workstation?
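A rough back-of-envelope check on the memory side of this question, assuming every value is stored as a 64-bit float (an assumption; the post does not say how the data is stored):

```python
# Back-of-envelope memory estimate for the dataset described above.
# Assumption: each value is held as an 8-byte (64-bit) float.
rows = 150_000
columns = 120
bytes_per_value = 8

total_bytes = rows * columns * bytes_per_value
total_megabytes = total_bytes / 1_000_000

print(f"{total_megabytes:.0f} MB")  # prints "144 MB"
```

Under that assumption the raw table is small compared with 8 GB of RAM. The working memory of whatever algorithm is run on it can be several times larger, so this is only a lower bound.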

Any help, recommendations or comments would be greatly appreciated, just to set me off on the correct footing.

Best regards

AEA
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: Machine learning Questions

Postby ochichinyezaboombwa » Thu Sep 19, 2013 8:54 pm

What model? What task? Are you doing classification, regression, clustering, or what? I wouldn't worry about GB or a database before having a clear understanding of what you're trying to achieve and by what ML method.
ochichinyezaboombwa
 
Posts: 200
Joined: Tue Jun 04, 2013 7:53 pm

Re: Machine learning Questions

Postby Marbelous » Thu Sep 19, 2013 9:58 pm

Yes, more info please...

Especially about this whole "queering a database" thing. :?:

One comment: Python is great, but it's rather high-level, which makes it slow for number crunching if you don't think in terms of optimization. Look into the numpy module, consider writing your math-intensive code in C and calling it from Python, and/or look into optimizers like Psyco.

http://www.numpy.org/
http://psyco.sourceforge.net/
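To illustrate the point about numpy, here is a minimal sketch comparing a pure-Python loop with the equivalent whole-array numpy expression. The data is random and purely hypothetical, and z-score standardisation is used only as an example calculation:

```python
import numpy as np

# Hypothetical data: 150,000 rows, one numeric column.
rng = np.random.default_rng(0)
values = rng.random(150_000)

def standardise_loop(data):
    """Pure-Python z-score: subtract the mean, divide by the standard deviation."""
    n = len(data)
    mean = sum(data) / n
    variance = sum((x - mean) ** 2 for x in data) / n
    std = variance ** 0.5
    return [(x - mean) / std for x in data]

def standardise_numpy(data):
    """Same calculation expressed as whole-array operations."""
    return (data - data.mean()) / data.std()

loop_result = standardise_loop(values.tolist())
numpy_result = standardise_numpy(values)

# Check that both versions agree to floating-point tolerance.
print(np.allclose(loop_result, numpy_result))
```

The numpy version pushes the per-element work into compiled code, which is usually where the speed-up comes from. Timing both functions with the `timeit` module on your own machine would show how large the difference is.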
Marbelous
 
Posts: 128
Joined: Fri May 31, 2013 8:12 pm

Re: Machine learning Questions

Postby AEA » Tue Sep 24, 2013 8:26 am

Hi guys, first of all many thanks for the replies :)

Background:
I am in a scientific competition to attempt to elucidate the cause or causes of a particular type of cancer. The causes of this type of cancer are already known; however, an efficient algorithm that can find them in this smaller example could enable us to perform the same process on a range of cancers. So the first part of the competition is to try to identify which people in a group have the cancer based on 120 different genetic markers. Of course I also need to consider combinations of different markers, e.g. app1 with kat2. It is generally accepted that the cause of this cancer is multifactorial, and I expect that the model may identify several different biochemical mechanisms by which the cancer develops.
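Considering combinations of markers grows the problem quickly. A small sketch of how many pairs and triples 120 markers produce, and how pairs can be enumerated (the names `gene3` and `gene4` are hypothetical placeholders; only app1 and kat2 come from the post):

```python
from itertools import combinations
from math import comb

n_markers = 120

# Number of distinct marker pairs and triples among 120 markers.
n_pairs = comb(n_markers, 2)    # 7140
n_triples = comb(n_markers, 3)  # 280840

# Hypothetical marker names; only "app1" and "kat2" appear in the post.
markers = ["app1", "kat2", "gene3", "gene4"]
pairs = list(combinations(markers, 2))

print(n_pairs, n_triples, pairs[0])
```

Every pair that is added as an extra column multiplies the size of the table, so it may be worth deciding up front which combinations are biologically plausible.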

My previous model:
My previous model was working extremely well; however, it was created using Excel and Solver, so it couldn't be scaled up. The model calculated the % of people who had cancer against the number of copies of a gene the patient had, e.g. a patient with 5 x app1 gene could have a 4.7% chance of developing said cancer. In this first model I didn't manage to include every gene before the sheet died (I literally couldn't open it).

My method was as follows: perform a 6th-order polynomial regression of the data and calculate the % chance for the patient based on that one gene, then do this for all of the genes. I then simply added all of the % chances together, and this gave a fairly accurate model.

I improved this further using Excel's evolutionary algorithm (part of the Solver package), with the formula (x*n)+(y*n)+(z*n)+(xa*n)+(xb*n), where n is a number between 0 and 1 and x to xb are the % chances of developing the cancer as defined by the polynomial regressions. This is the part that I would leave running for days in order to optimise the model: the evolutionary algorithm adjusts n and therefore the weighting given to each parameter's regression.
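A minimal numpy sketch of the method as described above, on synthetic data. Everything about the data (patient count, copy-number range, the made-up risk rule) is an assumption for illustration, and it reads each n in the formula as a separate weight per gene, which is an interpretation rather than something the post states outright:

```python
import numpy as np

rng = np.random.default_rng(42)

# Synthetic stand-in data (assumption): copy numbers 0..9 for 3 genes, 5,000 patients.
n_patients, n_genes, max_copies = 5_000, 3, 10
copies = rng.integers(0, max_copies, size=(n_patients, n_genes))
# Made-up ground truth: risk rises with the copy number of gene 0 only.
has_cancer = rng.random(n_patients) < 0.02 + 0.01 * copies[:, 0]

def fit_gene_polynomial(gene_copies, outcome, degree=6):
    """Fit a polynomial to the observed cancer rate at each copy number."""
    levels = np.unique(gene_copies)
    rates = np.array([outcome[gene_copies == level].mean() for level in levels])
    return np.polyfit(levels, rates, degree)

# One 6th-order polynomial per gene.
polynomials = [fit_gene_polynomial(copies[:, g], has_cancer) for g in range(n_genes)]

# Per-gene predicted chance for every patient, shape (n_patients, n_genes).
per_gene_chance = np.column_stack(
    [np.polyval(polynomials[g], copies[:, g]) for g in range(n_genes)]
)

# Weighted sum with one weight per gene, each between 0 and 1.
weights = np.array([1.0, 0.5, 0.5])  # placeholder values, not optimised
score = per_gene_chance @ weights
```

The weights here are fixed placeholders. For the step that Excel's evolutionary solver performed, one option in Python is `scipy.optimize.differential_evolution`, which searches for weights within given bounds; whether it behaves like the Excel solver on this problem is not something the sketch establishes.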

With regard to the database querying, I am starting to think that it is much simpler to keep the data in CSV format. I have started looking at numpy and Scikit as two key tools in this process. Any guidance, comments or general advice would be greatly appreciated, since I really, really want this to work.
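As a sketch of the CSV route, the following loads a small table with numpy and standardises each column. The column names and values are hypothetical, and an in-memory string stands in for a real file:

```python
import io
import numpy as np

# A tiny in-memory stand-in for a CSV file (hypothetical column names and values).
csv_text = """app1,kat2,has_cancer
5,1,1
2,0,0
3,4,0
0,2,1
"""

# skiprows=1 skips the header line; with a real file, pass its path
# instead of the StringIO object.
table = np.loadtxt(io.StringIO(csv_text), delimiter=",", skiprows=1)

features = table[:, :-1]  # every column except the last
labels = table[:, -1]     # last column: 1 = has cancer, 0 = does not

# Standardise each feature column to zero mean and unit variance.
standardised = (features - features.mean(axis=0)) / features.std(axis=0)
print(standardised.shape)
```

A column whose values are all identical has a standard deviation of zero and would need special handling before the division.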

Kind regards
AEA
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: Machine learning Questions

Postby Kebap » Tue Sep 24, 2013 10:11 am

Hey AEA, this is quite a task, and I welcome you to try using Python for it. I think it will work fine. I don't have any specific guidance for you right now, as I can't currently grasp your math. However, I am sure the nice people on this forum can help you with creating this, step by step. It always helps when you try it yourself and then show the specific places where you can't continue due to errors or sheer confusion.

I just wanted to add this: I had an Excel script for some other task which involved lots of calculation and cross-referencing. In Excel this took several minutes, during which the computer was totally unusable, obviously pushed to the limit of its multitasking. In Python, that same task finished in 1 second. I really scratched my head when I noticed this, and went over it a couple of times to make sure I had not missed any substantial part of the calculation, but I guess this is really it, and Excel just kind of messed things up. On that same scale, you may hope for a reduction of your calculation time from a couple of weeks to maybe 2-3 hours, if this can be compared at all.
Kebap
 
Posts: 390
Joined: Thu Apr 04, 2013 1:17 pm
Location: Germany, Europe

Re: Machine learning Questions

Postby AEA » Tue Sep 24, 2013 6:55 pm

Yeah, I know, I am just trying to take it one step at a time while learning how to code all of this. But I thought there might be something I should bear in mind now, or something I might not realise until I have completed the coding. When I started the project in Excel I had no idea I would hit its capacity. :geek:
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: Machine learning Questions

Postby ochichinyezaboombwa » Tue Sep 24, 2013 9:11 pm

If at all possible,
1) give a small sample of the input data;
2) give a short simplified procedure (either in Python or in pseudo-code) that
a) takes the given input;
b) calculates result.
ochichinyezaboombwa
 
Posts: 200
Joined: Tue Jun 04, 2013 7:53 pm

Re: Machine learning Questions

Postby AEA » Tue Sep 24, 2013 11:08 pm

ochichinyezaboombwa wrote:If at all possible,
1) give a small sample of the input data;
2) give a short simplified procedure (either in Python or in pseudo-code) that
a) takes the given input;
b) calculates result.


Hi, that's not possible yet; I am still learning how to use many of the features of the language. Also If I could make it work for a small dataset then I would probably be able to make it work for the bigger one. :) Does my description leave anything unclear? :)

Many thanks AEA
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

Re: Machine learning Questions

Postby micseydel » Tue Sep 24, 2013 11:20 pm

AEA wrote:Also If I could make it work for a small dataset then I would probably be able to make it work for the bigger one.

If your computation uses too much time or space because it scales poorly, then it could still work for a smaller dataset, and we could advise you about making it scale better.
micseydel
 
Posts: 1262
Joined: Tue Feb 12, 2013 2:18 am
Location: Mountain View, CA

Re: Machine learning Questions

Postby AEA » Tue Sep 24, 2013 11:50 pm

micseydel wrote:
AEA wrote:Also If I could make it work for a small dataset then I would probably be able to make it work for the bigger one.

If your computation uses too much time or space because it scales poorly, then it could still work for a smaller dataset, and we could advise you about making it scale better.


Ahh, very true. To try to pre-empt this, what sort of things cause a model to scale poorly?
AEA
 
Posts: 32
Joined: Thu Apr 18, 2013 11:37 am

