Sklearn Lessons Learned
Some bullets:
If your data isn’t already 100% numeric, you’ll need to write a lot of glue code for each data type.
I didn’t do this interactively in IPython (although I did use the debugger a lot). I still think IPython encourages bad coding habits.
Use pandas. Without it, you’re tracking columns AND rows by numeric index in a matrix whose dimensions keep changing as you remove junk rows and add/remove feature columns.
Use sklearn-pandas. Without it, after a few transforms, you’re back to numeric row and column identifiers (see the sketch below).
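A minimal sketch of what I mean, assuming sklearn-pandas’ DataFrameMapper (the column names and data here are made up):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler, LabelBinarizer
from sklearn_pandas import DataFrameMapper

df = pd.DataFrame({
    "age": [23, 41, 35],
    "income": [40_000, 85_000, 62_000],
    "city": ["NYC", "SF", "LA"],
})

mapper = DataFrameMapper(
    [
        (["age"], StandardScaler()),     # list selector -> 2-D input for the scaler
        (["income"], StandardScaler()),
        ("city", LabelBinarizer()),      # string selector -> 1-D input for the binarizer
    ],
    df_out=True,  # return a labeled DataFrame instead of a bare ndarray
)

transformed = mapper.fit_transform(df)
print(transformed.head())  # column names survive the transform, no guessing at index 17
```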
Cache as you go. Every time I have a good dataframe, I save it. When I rerun, I check to see if that step is already done and reload from disk.
Use HDF for dataframes, pickle for everything else. If your dataframe has anything except numbers, to_csv is very dodgy: it deals poorly with text containing internal linebreaks, commas, and text delimiters. Importing or exporting to Excel is too slow.
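In that spirit, a small caching helper might look like this (the cache directory and step names are made up, and the HDF path assumes PyTables is installed):

```python
import os
import pickle
import pandas as pd

CACHE_DIR = "cache"

def cached_step(name, build_fn):
    """Return the DataFrame for this step, reloading from disk if it was already built."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{name}.h5")
    if os.path.exists(path):
        return pd.read_hdf(path, key="df")   # step already done: reload
    df = build_fn()                          # otherwise build it the slow way
    df.to_hdf(path, key="df", mode="w")      # and cache it for the next run
    return df

def cached_object(name, build_fn):
    """Same idea for non-DataFrame objects, using pickle."""
    os.makedirs(CACHE_DIR, exist_ok=True)
    path = os.path.join(CACHE_DIR, f"{name}.pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    obj = build_fn()
    with open(path, "wb") as f:
        pickle.dump(obj, f)
    return obj
```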
Random Forests and their close relatives are best unless you have some theoretical understanding of your data indicating something else will do better.
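A quick baseline in that spirit (this assumes a cleaned, all-numeric DataFrame called `df` with a hypothetical "label" column):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X = df.drop(columns=["label"])
y = df["label"]

model = RandomForestClassifier(n_estimators=200, n_jobs=-1, random_state=0)
print(cross_val_score(model, X, y, cv=5).mean())   # rough accuracy

model.fit(X, y)
importances = pd.Series(model.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))  # which features matter
```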
Your code should support these modes (a rough sketch follows the list):
- Exploration. Model should spit out accuracies and feature importances. It should be really fast and run with as few rows and features as you can without the model becoming trash. Memory and drive space management are not important here.
- Final testing. Model should use as much data as possible and try as many feature/hyperparameter variations as you can deal with. At this point, you will need to write code that cleans up memory and drive space as you go along.
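One way to wire up the two modes, as a sketch only (the flag name, sample sizes, and parameter grid are all made up):

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score

EXPLORE = True  # flip to False for the final run

def run(df, label_col="label"):
    if EXPLORE:
        # Exploration: few rows, fast turnaround, don't worry about memory or disk.
        df = df.sample(n=min(len(df), 20_000), random_state=0)
    X, y = df.drop(columns=[label_col]), df[label_col]

    if EXPLORE:
        model = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
        print("cv accuracy:", cross_val_score(model, X, y, cv=3).mean())
        model.fit(X, y)
        print(pd.Series(model.feature_importances_, index=X.columns).nlargest(10))
        return model

    # Final testing: all the data, as many hyperparameter variations as you can stand.
    grid = GridSearchCV(
        RandomForestClassifier(n_jobs=-1, random_state=0),
        param_grid={"n_estimators": [200, 500], "max_depth": [None, 10, 20]},
        cv=5,
    )
    grid.fit(X, y)
    print(grid.best_params_, grid.best_score_)
    return grid.best_estimator_
```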
If you don’t have enough memory, first fiddle with the features (row and column count management), because you might be trying to create a Quintillion Terabyte matrix and that won’t be solved with a larger workstation.
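The back-of-the-envelope check is cheap to do before you build anything (the row and column counts here are made up):

```python
# e.g. a one-hot encoding gone wild
n_rows, n_cols = 2_000_000, 500_000
bytes_needed = n_rows * n_cols * 8        # float64 is 8 bytes per cell
print(bytes_needed / 1e12, "TB")          # ~8 TB: fix the features, not the hardware
```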
If a model isn’t working…well that is where I am now, so this advice isn’t proven yet…
- get more/different features. The current features might be too weak, carrying too little information. This is also where you need to think about what would carry strong information about the thing you are predicting (e.g., Twitter bio text is predictive of all sorts of things, but time zones and time features are probably better predictors of location, since every timestamp carries some information about location and most bios don’t).
- get more samples.
