My First Kaggle

Wrangling, Regression, and Reflections

Nicholas Teague
From the Diaries of John Henry
50 min read · Aug 17, 2017


This post will be a departure from some of my earlier writings. The following represents a coding diary of sorts, a stream of consciousness following the progression of a machine learning coding project which took place over the course of about three days of focus. The project was my first attempt at tackling the data wrangling, algorithmic coding, and processing of machine learning algorithms in response to a public competition for beginners offered through the website Kaggle here. While I'll include all relevant (Python) code and corresponding images of output, if anyone is interested in the full executable form I've uploaded the jupyter notebook to github here, which, as long as you have the correct libraries installed (such as Numpy, Pandas, Tensorflow, Keras, etc.), can be run bug free. Note that Medium doesn't always behave with Python code as far as line breaks and text wrapping, which is obviously a big deal with Python, so bear that in mind if you want to try any of this code yourself. Given that this was my first attempt at this kind of project (or at pretty much any coding project for that matter), it should be no surprise that some of the points dwelled on are fairly basic. While this could prove beneficial to beginners looking for ideas of how to overcome similar obstacles, I would offer to those more knowledgeable that I have tried to balance some of those beginner considerations with a few more advanced insights picked up over the course of literature review (and perhaps a few brief tangents). If your OS allows it I do recommend reading with the soundtrack (via Spotify) playing in the background for ambience; at least that is how it is intended to be read. Anyway, here is my first Kaggle entry in its entirety.

Little Feat — Waiting For Columbus (album)

This notebook will be an attempt to tackle the Kaggle beginners competition for House Prices: Advanced Regression Techniques. Kaggle is a forum for learning and competing with machine learning and even offers cash prizes sponsored by companies looking to incorporate the models. I just got my hands on a tutorial notebook for the Kaggle Titanic problem (prepared by Andrew Conti and available from github here), meant to address another Kaggle introductory competition, which partly inspired this approach of code-blogging (is code-blogging a thing? if not I think we should make it one).

The first step will be to evaluate and preprocess the data to prepare it for machine learning algorithms. Once complete, I will lean on a demonstration for training this type of regression problem available in Chollet's 'Deep Learning With Python' section 3.8 (a book still pending release, but I have reviewed the MEAP pre-sale version and highly recommend this text; the pre-sale version is available here).

This notebook is being run initially on my local Anaconda installation for data exploration and processing, and then once it is time for training will transition to an AWS EC2 AMI instance which comes preloaded with several deep learning frameworks such as TensorFlow, Keras, etc. (My Macbook is a little underpowered but hopefully that won't be an issue.) For budgetary considerations I will mostly work off of the micro (cheapest hardware) AMI instance, but as we venture into deeper models I may spend some training time on one of the versions that includes GPU hardware for speed considerations. Although these GPU instances can be pricier, they are offered on a $/hr pay-as-you-go basis, so as long as you don't leave them running outside of the training period it is manageable. You can view a list of available AWS deep learning AMI's here; Chollet recommended the official Amazon Ubuntu deep learning version, although I already have a bitfusion version running so will use that for now. A startup guide for initiating a bitfusion instance is available here.

  • As an aside, I find it unabashedly cool that this same list of AWS deep learning AMI's includes one from BayesForge which, in addition to deep learning frameworks, is also preloaded with all of the public environments for quantum computing from the likes of Rigetti, IBM, D-Wave, or open source equivalents.
  • As another aside it would stand to reason that AWS wouldn’t be the only platform offering GPU enabled virtual machines pre-loaded with deep learning frameworks, however when I just checked Google Cloud Compute Engine the GPU enabled instances were still in beta, so who knows maybe they’re waiting to offer machines running on that fancy new TPU chip that they started publicizing earlier this year.
comparison of chip performance of some typical machine learning tasks (source)

Although in preparation for this, my first Kaggle competition, I have read several texts ranging from introductory to advanced, sat through a few MOOCs, and toyed around with Python, this will be my first attempt to work with a real live data set. As such I will lean heavily on these tutorials and document assumptions and other sources of inspiration (read: google search, stack overflow, etc.) throughout. I'll probably fall a little short of potential in custom data explorations, but it's a simple dataset so that shouldn't be too big of a deal.

Yeah so anyway here we go.

Part 1: Data Exploration and Preprocessing

To make my life easy from an imports standpoint, I will just copy the list of imports from that titanic tutorial notebook, which should cover what we're trying to address here.

So if you’re trying to follow along and running these imports from a bitfusion virtual machine (as I did initially), you may receive an import error with a few of the apis such as statsmodels (although this one won’t be an issue if you’re running this from Anaconda) — not to worry statsmodels it turns out was just a statistics library used in the titanic notebook which I will just not use here. Actually you know what since am not using will just delete from the import list any not applicable to avoid any bugs that could result from the import error.

Another error from the titanic tutorial imports you'll see independent of which environment you're running your notebook in is associated with a library called KaggleAux, which appears to be a custom tool for preparing Kaggle data in some fashion. I intend to do any of these steps manually, so will delete the KaggleAux import as well to avoid another error.
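For reference, here's roughly what survives of that import list (this is my approximation rather than the exact cell, with the statsmodels and KaggleAux entries dropped as described):

```python
# trimmed import list (approximate), with statsmodels and KaggleAux removed
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import preprocessing
```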

So the current agenda is to read the training data, explore it, decide which features to include in our model and how they should be transformed to improve training efficiency, then begin data processing to prepare for regression analysis. Note that the output of our data processing will be to convert the training data set to corresponding dataframes x_train (training features), y_train (training labels, i.e. house prices), x_val (validation features), and y_val (validation labels). We'll also need to perform comparable transformations on the test data to create x_test (test features). Ok, we're ready to look at the data now.
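The load step is just a pandas read; I'm assuming here that the Kaggle file is named train.csv and saved in the same folder as the notebook:

```python
# read the Kaggle training data and take a first look
df = pd.read_csv('train.csv')
print(df.shape)  # should report 1460 rows by 81 columns
df.head()
```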

Note that missing values are represented as "NaN", which stands for "not a number".

So even though there are 81 columns to this dataset (reference the shape description at the bottom of the output), only ~20 are displayed. Is it skipping columns like it is skipping rows? But I don't see any marking to indicate where such a skipping is taking place. (Oh wait, just found where the skipping is taking place: between Utilities and PoolArea.)
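One way to see every column (a quick fix, since pandas truncates wide dataframes in its preview by default):

```python
# lift the column display limit so all 81 columns show in the preview
pd.set_option('display.max_columns', None)
df.head()
```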

Since this view isn't showing the full list of columns, I will open the training data file in parallel in a visual data mining utility I downloaded for this purpose: a tool called RapidMiner, which is free for datasets below 10,000 entries and can be downloaded here.

RapidMiner appears to be capable of performing data transformations, however for now I will try to limit data processing to this notebook, as based on a google search my impression was that it is more common to manipulate data through Python than with a visual editing platform.

There are some other handy features in RapidMiner beyond its core functionality of transformations. Some I have found so far include the ability to easily view basic attributes of each column, and it is also possible to view simple charts. For instance, for those columns displaying textual categories instead of numeric values, you could view a scatter plot of those categories compared to home price and eyeball whether any predictive power is expected from those distributions. Note that more sophisticated statistical analysis appears unavailable, although if I am interpreting it correctly, one way to quickly rank the distributions of a column of categories (just to pick one, MSZoning) would be to select that column in the Statistics feature, show chart, and then choose chart type of distribution with a comparison against house prices. A screen grab of this type of output is below the next paragraph.

At this stage in the Titanic tutorial, the author suggests dropping a few columns that are populated primarily with missing cells and thus won't have much predictive power. Upon inspection in RapidMiner, the primary columns that have mostly NA values are PoolQC, Fence, and MiscFeature. However, before blindly striking these columns, my thought is that at minimum the inclusion of a pool should have some measurable impact on a home price, and since this is a manageable three columns I will quickly turn back to the RapidMiner tool and view the distributions vs home price as described above. Here for instance is the distribution of categories under the PoolQC column:

Were the pool classification benign on house price, I would have expected a distribution comparable to NA with simply different magnitudes. But clearly Ex (excellent pool quality) increases a home's value. I can't for the life of me imagine why; my father likes to say that a swimming pool is like a hole in the ground you dig in your back yard and then pour money into. And don't even get me started on a certain neighbor with a giant oak tree overhanging that same money pit which somehow drops about three times its weight in leaves, but I digress.

So clearly PoolQC is worth keeping in our model. As for the other two, a quick inspection reveals both Fence and MiscFeature have at least one category that appears to have a measurable effect on the price distribution, thus I will forgo dropping any of these three even though they all have a high prevalence of missing cells. That is not to say we won't drop any columns; I would expect that the more columns we keep in our model the higher the computational expense of training, and I'm assuming this Titanic tutorial will go further into how to evaluate potential inclusions of features as we start looking at data graphically. Speaking of which, that's actually the next step.

So upon reviewing the titanic tutorial data explorations, it appears they've taken some approaches to visualization that are pretty specific to their particular problem (which happens to be a classification problem between Kate and Leo, and don't even get me started about why Leo had to swim; they could have easily fit two people, I mean just look):

Since ours is a regression problem, and especially since our feature set is much more elaborate (81 columns vs the Titanic's 12), it may be worth a more structured approach to visualization than what is demonstrated in the titanic notebook. Of course we always have RapidMiner to fall back on, but I know I have seen some demonstrations of easy Python visualizations in some prior readings, so am going to do a little digging real quick, hold please.

So one worthwhile text I have perused was Géron's Hands-On Machine Learning with Scikit-Learn and TensorFlow, so I will start by attempting some of the data visualization tips suggested in its second chapter, such as histograms and scatter plots (a quick sketch of the histogram call follows below the next aside).

  • As is not unusual for generous authors in this field, the code from this same text is available in the format of jupyter notebooks on github here.
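A quick sketch of the histogram call in the style of that chapter 2 notebook (the bin count and figure size are arbitrary choices on my part):

```python
import matplotlib.pyplot as plt

# histogram for every numeric column; text columns are skipped automatically
df.hist(bins=50, figsize=(20, 15))
plt.show()
```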

So I was a little confused about the output of this histogram plot at first because there are only ~38 charts shown here in comparison to the 81 columns. However, I believe what is going on is that the histogram function only displays sets for those columns with numerical values. It's not clear if missing values / NaN's / or the like might prevent additional displays. Actually that reminds me, at some point I am going to have to verify that we have some approach for cells with missing values; we're either going to have to impute a plug number such as the column median or the like, note to self, etc. Pause… The good news is I just verified that although there are some missing cells in the numerical columns, they did not prevent those columns from showing up in this histogram.

So inspecting these figures, looking for points of interest. Here are some that grab my attention along with their description:

  • Id (as expected this is just a counter)
  • PoolArea (Pool area in square feet appears to all show the same value?): since we only have 7 nonzero points here (looked at the rapidminer scatterplot), with a significant outlier, I worry there aren't enough points to extract anything meaningful; this is the best candidate yet for a dropped column
  • ScreenPorch (Screen porch area in square feet): doesn't appear to be a lot of information from this chart, however when I zoom in using the rapidminer chart there does appear to be a small depth to the range.
  • MiscVal ($ value of miscellaneous feature): this would only be of value when combined with the misc feature categorization, and since almost all of the values here are 1, I am thinking this is another good candidate for a drop

Actually the PoolArea comment above got me wondering if there are any other points with insufficient data for any real feature extraction, another perusal along that lens makes me also question these columns:

  • 3SsnPorch (Three season porch area in square feet) => drop, after inspection in rapidminer not enough correlation with price
  • LowQualFinSF (Low quality finished square feet (all floors)) => drop
  • BsmtHalfBath (Basement half bathrooms) => after rapidminer inspection decided to keep this one

So in summary, after reviewing the histogram plot (in conjunction with a few glances at rapidminer), I have decided we will drop the following columns: Id (actually not positive we can drop this one, it might be necessary for the Kaggle submission), PoolArea, ScreenPorch, MiscVal, 3SsnPorch, LowQualFinSF, BsmtHalfBath.

(I will set this list aside for now so as to perform the dataset transformations all at once, once we settle on the full list of features; for reference, the eventual drop would look something like the sketch below.)
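A sketch of that held-aside list, assuming the training data is in a dataframe named df as in the earlier snippets:

```python
# columns flagged above as candidates to drop, held aside for later
drop_candidates = ['Id', 'PoolArea', 'ScreenPorch', 'MiscVal',
                   '3SsnPorch', 'LowQualFinSF', 'BsmtHalfBath']

# when we settle on the full feature list the drop itself is just:
# df = df.drop(drop_candidates, axis=1)
```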

Another dataset consideration that will be less determined by charts and visualizations and more by common sense applied to understanding your dataset comes from just walking through the description of features one by one and ensuring that they all make sense with respect to any predictive power and that they all work in isolation. Our data set here has 81 columns, which to my untrained imagination strikes me as on the high side. The goal here is always to first get a simple working model established to serve as a baseline and only then attempt to address the problem with more sophisticated resources. I am starting to wonder if maybe instead of starting with histograms and visualizations I should have attempted this first, and you know what, what the hell, it would be fun to get something working before more in depth dataset explorations, so I'll turn my attention here to developing a simple baseline model based on a bare minimum feature set.

The assumption is that of our 81 features there are probably some that are most significant in evaluation. For example, based on my admittedly limited exposure to the housing market in recent years, I would expect that some of the dominant features would be those like neighborhood (location location location after all), square footage, year of construction, quality of construction, and, just to throw one more in since it appeals to my mechanical engineering background, central air. Those five features are as follows:

Neighborhood / GrLivArea / YearBuilt / OverallQual / CentralAir (I'm not positive, but am assuming we won't need the Id feature for submitting input to Kaggle; still need to read a tutorial on the submission process, note to self…)

Oh yeah obviously we’ll need the sales price column too for training duh: SalePrice

  • It occurs to me that there are probably statistical ranking tools in Python to rank a feature set by correlation with a target variable (some kind of p value matrix, right?), that would be a more scientific approach to selecting these bare minimum features perhaps, but I’m too excited for having already selected my intuitive bare minimum features so decided not to climb down that rabbit hole just yet.

So the theory is that this pared down feature set will be much less computationally expensive to train and will allow me to establish a baseline to measure performance against once I start training with a more elaborate feature set.

So some of these features are numeric integers and we can just leave those as is. For those textual classifications (neighborhood and central air) we could a) assign a numeric value to each classification, or b) assign a sparse identity vector via "one hot encoding". I think I have read that it is easier to train with a sparse vector, so will attempt option b. Oh yeah, at some point we will also need to normalize the numeric values to something like a -1 to 1 range centered on the median, etc.

But first I am curious how many missing values we are dealing with and how we should address that (a quick check below).
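Something like the following, assuming the df dataframe from earlier and the five features plus label listed above:

```python
# count missing cells in each of the bare minimum columns
bare_min = ['Neighborhood', 'GrLivArea', 'YearBuilt', 'OverallQual',
            'CentralAir', 'SalePrice']
df[bare_min].isnull().sum()
```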

Cool, so if I am interpreting this correctly, that means we don't have to worry about missing values for this first pass, which makes our lives a little easier.

Oh, and sidetrack alert: I just stumbled across the correlation matrix I was talking about earlier, and am curious about applying it to our pared down feature set, so will give that a quick try:

So the selection of features appears reasonable; note it is only providing values for numerical features. And you know what, just because I like tangents, I am curious how well we did in selecting these features in the first place, so I am going to back up to the full dataset and do a similar quick ranking of correlation:
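The ranking itself is a one-liner in pandas (a sketch; .corr() silently skips the text columns, which is the behavior noted above):

```python
# rank every numeric feature by its correlation with sale price
corr_matrix = df.corr()
corr_matrix['SalePrice'].sort_values(ascending=False)
```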

So yeah, we did reasonably well in selecting features based on common sense. Still, it probably wouldn't be too much trouble to incorporate just a few more of the stronger sets, and looking here two that jump out are GarageCars and TotalBsmtSF.

I am not going to add GarageArea since it is redundant with GarageCars (which has a stronger correlation), and also not going to add 1stFlrSF since that is already encoded in the total square feet above ground, GrLivArea. Speaking of complementary features, it strikes me that full baths and half baths are usually counted together in MLS listings, assigning half baths a value of 0.5, so perhaps it would make sense here to take a similar approach and combine the count of full baths and half baths into a single value; it would be confirmation if we did this and the correlation measure came out higher than for the full bath count alone. Would that exercise be deviating from our goal of a bare minimum model? Probably, but I kind of want to figure out how to do this so am going to tackle it here anyway.

So we will make a new bathroom column using this derivation: TotBathroom = FullBath + (0.5) * HalfBath + BsmtFullBath + (0.5) * BsmtHalfBath

Note that BsmtHalfBath was the only one of these features that was negatively correlated with sales price; perhaps there could be an argument for leaving it out of this summation, but I am a little OCD, so am just going to include it, as anything else wouldn't be true.

And you know, while we're at it, we might as well incorporate basement square feet into above ground square feet; however, since I suspect there will be some cases where basements are unfinished and don't contribute as much value, I will subtract unfinished square feet from total square feet as a way to counter this effect.

So the area derivation will be: TotSF = GrLivArea + TotalBsmtSF - BsmtUnfSF

(Hi, Nick from the future here: in hindsight I am questioning the wisdom of combining above ground and basement square feet from a predictive power standpoint; in the next iteration I may keep these separate.)
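A sketch of the two derived columns in pandas (again assuming the training data is sitting in a dataframe named df):

```python
# combined bathroom count, weighting half baths at 0.5
df['TotBathroom'] = (df['FullBath'] + 0.5 * df['HalfBath']
                     + df['BsmtFullBath'] + 0.5 * df['BsmtHalfBath'])

# combined square footage, netting out unfinished basement area
df['TotSF'] = df['GrLivArea'] + df['TotalBsmtSF'] - df['BsmtUnfSF']
```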

Once we complete these transformations we’ll want our complete feature set to be:

Neighborhood / TotSF / YearBuilt / OverallQual / CentralAir / TotBathroom / SalePrice

(Hi again, Nick from the future here. After performing all of the data transformations and running through the training and testing, I expect that a few additional features could reasonably be expected to have improved our predictive capability; if I were to do this again I would probably include at least some of the following. Granted these would have necessitated a wider hidden layer, but am sure we could have accommodated: Fireplaces / MasVnrType / PoolQC / SaleCondition / RoofMatl)

Great, so now that we have our list of desired features, let's recreate the bare minimum dataframe, this time including the other columns we want to include in our manipulations. We'll also double check again for missing values.

I don't think it matters for training purposes that SalePrice is no longer the final column. I'm also not sure that the deletion of columns as was done here is either necessary or even good practice, but doing so makes for cleaner views, so will stick to that practice until someone tells me otherwise.

Ok, the next step is to convert the textual classifications to vectors. Once we have that done we will normalize all of the features. If any reader needs any motivation to keep going, may I suggest a little background music?

  • Worth noting that when I started this exercise the intent was to lean pretty heavily on that titanic tutorial, however I am not finding it as useful as I had hoped and am ending up spending more time with the "Hands-On Machine Learning with Scikit-Learn and TensorFlow" notebooks, as well as a fair share of google searches and the resulting stack overflow tutorials. The titanic tutorial may come back into play as we transition into training, but we seem to be making progress, so I am not going to sweat deviations from the original agenda as long as we keep getting closer to the end goal. After all, like the laid back tennis ball canning factory that was shipped potatoes by mistake learned, flexibility is a virtue.
Be like Pringles.

For these next steps of data preprocessing I believe the appropriate tools are available through the scikit-learn preprocessing library, with documentation here.

Inspecting the list at this link, a few tools jump out at me. First, even though we didn't find any missing values in our training set, it is still possible that there could be some in the test data, so one way to mitigate that is by inserting the mean of a range into the missing cell, as can be accomplished with preprocessing.Imputer. Another tool that we will use is preprocessing.LabelBinarizer, which will allow us to convert the CentralAir Y/N classifications to a 0/1 numerical designation. With respect to normalization of numerical range columns, there is a good discussion of the different approaches in the scikit-learn documentation here; since I don't expect distributions to deviate significantly from gaussian (which admittedly may be a naive assumption), am going to stick to the basic preprocessing.StandardScaler approach for converting to 0 mean and unit variance. Finally, for the all important neighborhood column, with textual classifications and high evident correlation as one would expect, am going to use the preprocessing.OneHotEncoder method.

Note that for the Imputer function, I expect we will need to be careful when dealing with columns where a majority of cells are missing values, such as is the case in our full dataset for PoolQC, Fence, etc.; fortunately for our bare minimum approach we have scrubbed the columns where this might be an issue. Actually, come to think of it, am not sure how this function will behave while the dataframe still includes either text classifications or their eventual transformations to sparse vectors; in both cases a mean and unit variance would be meaningless. So it may be the case that we will need to explore alternate means to address missing cells in these types of features; will do a little exploration on that point. There appears to be a good blog post here which I will review. One option is to remove rows with missing values, but that's the easy way out, and anyway once we graduate to an expanded dataset it won't serve us well.

Update: upon further exploration it appears it's possible to do some basic transformations without resorting to scikit-learn. The Pandas documentation has some good tutorials here; see that link's 'Cleaning / filling missing data' section for examples.

So I'm going to attempt to address potential issues with missing values for each of the columns in our pared down data set. Note that we already established that this training set doesn't even have any missing values (if I am interpreting In[96] correctly); still, we don't know for sure if our test data will be comparable, and even though it seems unlikely I'm guessing it is good practice to address it just in case.
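Roughly what that looks like for our pared down columns (the exact split between numeric and text handling is my assumption; the Imputer call follows the preprocessing.Imputer approach discussed above, and a pandas fillna covers the text columns with the dummy 'fill' category used later on):

```python
from sklearn.preprocessing import Imputer  # replaced by SimpleImputer in later sklearn releases

num_cols = ['TotSF', 'YearBuilt', 'OverallQual', 'TotBathroom']
txt_cols = ['Neighborhood', 'CentralAir']

# plug the column mean into any missing numeric cell
# (note this also casts integer columns to floats)
imputer = Imputer(strategy='mean')
df[num_cols] = imputer.fit_transform(df[num_cols])

# plug a dummy 'fill' category into any missing text cell
df[txt_cols] = df[txt_cols].fillna('fill')
```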

(Nick from the future here, it turned out that this was a good decision.)

Two comments on this Imputer transformation: 1) note that in the process it transformed integer values to floats; my expectation is that this will not impact the regression. 2) Note that I left the SalePrice column untouched; as the output value I hope no normalization is required, since we would lose some information in the process and would then have to figure out how to switch test outputs back to the original scale. Hoping we don't have to go there.

Ok, now we finally get to transform the textual classifications, and then after that we will normalize. Again we'll use the OneHotEncoder for the neighborhood and the binarizer for the central air.

For the OneHotEncoder, there is a feature to auto-select the number of potential values based on the data, which it defaults to; however, I think it would be prudent to enter this value manually in case we are missing some category in the training set but not in the test set (such as our dummy 'fill' category meant to address missing cells, for which we need to remember to add one count to the number of designated categories). Actually, on second thought, even that won't work on its own, since the OneHotEncoder only knows how to assign vectors based on labels that it sees in the input data. So the way I figure it, in order to properly encode here we would need to ensure that all potential values are represented in the training data, and if they aren't, then I assume we would need to enter additional dummy rows containing those values in the training data prior to encoding. Tell you what, to ensure that the dummy 'fill' value is present I'll just pick one data point and replace its Neighborhood value with our dummy 'fill' value; this will increase the length of our sparse vector from 25 to 26. We'll also need to convert each text category to a numerical id before applying the encoder; I'll be basing this on a demonstration from the Hands-On Machine Learning with Scikit-Learn and TensorFlow text chapter 2 notebook, something like the sketch below.
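A rough sketch of those two steps, following the LabelEncoder-then-OneHotEncoder pattern from that chapter 2 notebook (the variable names match the ones referenced later; the rest is my approximation):

```python
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# step 1: map each neighborhood label to an integer id
nbrhd_cat = df['Neighborhood']
encoder = LabelEncoder()
nbrhd_cat_encoded = encoder.fit_transform(nbrhd_cat)

# step 2: expand the integer ids into a sparse one-hot matrix
# (the encoder's n_values argument could pin the category count
# manually, per the note above about the dummy 'fill' value)
onehot_encoder = OneHotEncoder()
nbrhd_cat_1hot = onehot_encoder.fit_transform(nbrhd_cat_encoded.reshape(-1, 1))
nbrhd_cat_1hot.toarray()
```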

Note: I just saw that it is also possible to accomplish the two steps of transforming text to numerical ids and then converting that list to sparse vectors all in one shot using the sklearn.preprocessing LabelBinarizer class; however, the approach already recorded is working, so not going to backtrack. (NFTF here: upon subsequent inspection this encoding as presented may have a bug associated with consistency of the order of columns between training and test data transformations, masked by the practice of labeling encoded columns with simple range numbering; working on a fix and hope to address in the next iteration.)

Now that I have gone through the trouble to encode, I'm a little confused as to how to combine this sparse matrix back with the dataframe object containing the rest of our "bare minimum" training data for purposes of training… I'm going to set this question aside for now and hopefully will come across an answer as I get deeper into the tutorials. Perhaps it is not necessary that all training data is included in a common dataframe object, to be continued…

Next we will apply the preprocessing.LabelBinarizer, which will allow us to convert the CentralAir Y/N classifications to a 0/1 numerical designation.
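Something like (a sketch):

```python
from sklearn.preprocessing import LabelBinarizer

# map the CentralAir Y/N labels to 1/0
lb = LabelBinarizer()
df['CentralAir'] = lb.fit_transform(df['CentralAir']).ravel()
```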

Great, so the next step of our data preprocessing activities on the bare minimum set is to normalize the remaining columns, again using the preprocessing.StandardScaler approach for converting to 0 mean and unit variance (aka "feature scaling").

As an aside, one point I'm a little unclear on is whether it's best practice to perform data preprocessing transformations directly on the original dataframe object or whether each transformation should create a whole new column which we'll just have to keep track of. I can see benefits of both approaches.

Another consideration for preprocessing: the Hands-On Machine Learning with Scikit-Learn and TensorFlow text suggested using the sklearn.pipeline class to combine sequences of transformations into a single function. That could be one way to improve this process in future iterations. As we get into larger datasets (such as the full 81 columns for this problem as presented), I could definitely see the value of such a pipeline function.

Found a stack overflow tutorial on performing feature scaling in place on select columns in a dataframe object; you can view the source here. Note that this is another case where extra brackets are required; I believe this is because double brackets select a 2D dataframe rather than a 1D series, and the scaler expects a 2D array as input. It's not exactly intuitive to include those brackets, for whatever it's worth.
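A sketch of the in-place scaling (the column list is my assumption, and note SalePrice stays untouched as discussed earlier):

```python
from sklearn.preprocessing import StandardScaler

scale_cols = ['TotSF', 'YearBuilt', 'OverallQual', 'TotBathroom']

# selecting with a list of column names yields a 2D dataframe slice,
# which is the shape the scaler expects (hence the "extra brackets")
scaler = StandardScaler()
df[scale_cols] = scaler.fit_transform(df[scale_cols])
```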

Ok, reviewing the remaining data, note that this set still includes the textual classifications for Neighborhood; the encoded vectors are in a separate object called "nbrhd_cat_1hot". I'm hoping the fact that it is in a separate object won't be an issue when it comes to training; will cross that bridge when we get there. For now I'll go ahead and delete the original Neighborhood column from the set. (Before doing so I will save the data to a separate list just in case we need it again at some point, fearing ominous foreshadowing here. *NFTF here: this fear was unfounded.)

A thought: I'm maybe overcomplicating the onehot encoding by keeping it as a separate set; perhaps it's just as simple as adding these 26 columns to our training dataframe. While I am not sure if this will be a "sparse" representation memory-wise, where the computer only has to store values for cells with non-zero values, for datasets of our scale that would hardly be a concern. So yeah, after talking it through I think it is easier than I had originally supposed. So here goes, going to try and add these 26 columns to our dataframe.

First note that we already have some of the data I think we'll need. The column labels are extracted in the array "nbrhd_cat", which is then encoded into numerical ids in the array "nbrhd_cat_encoded", and then finally transformed into a sparse matrix in the object "nbrhd_cat_1hot" (which can then be viewed as an array via nbrhd_cat_1hot.toarray()). So in order to join the sparse array to the training data, which is a pandas dataframe object, we'll first have to convert from the numpy array to a pandas dataframe, and then we can concatenate to combine the two; it's important to keep in mind that we haven't done any sorting of our data, so the rows of both sets should still be in the same order. I'll just name the columns for the sparse array with the integers 1 through 26, and hopefully it won't be an issue that they are labeled with integers instead of strings; will have to keep that in mind when it comes time to call any of these columns.
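Roughly (a sketch, assuming the sparse matrix from the encoding step above):

```python
# convert the sparse one-hot matrix to a dataframe with integer column
# labels, then bolt it onto the training dataframe
nbrhd_df = pd.DataFrame(nbrhd_cat_1hot.toarray(),
                        columns=range(1, nbrhd_cat_1hot.shape[1] + 1),
                        index=df.index)
df = pd.concat([df, nbrhd_df], axis=1)
```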

One last task for data preprocessing. I believe that when it comes time to actually run the training we’ll want our features and labels (aka sales price) sorted into separate objects, x_train and y_train. So let’s quickly create those two sets.
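Which is just (a sketch):

```python
# separate the label (sale price) from the features
y_train = df['SalePrice']
x_train = df.drop('SalePrice', axis=1)
```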

So while I'm tempted to say we are done with data preprocessing, it occurs to me that there are a few additional steps we have missed: 1) all of this processing that was done to our training set we're going to have to do all over again for our test set, and 2) we're going to have to break the training and label sets into two pieces (one for training, one for validation). Boy, this data preprocessing is just the gift that keeps on giving. I think we'll tackle number 2 first.

Ideally, when you break out the validation set from the training samples you would want a random sampling of rows from the set, in case there was any bias in the order the data was collected (for example, with home sales data one could imagine a scenario where data was collected in chronological order of home sale, although that's not the actual case here). At first I was going to take the lazy route and just assume the data was already randomly distributed in the training set, however I just found that sklearn has a method for breaking complementary x,y sets into corresponding random split sets based on the same seeding for randomness. It's called train_test_split, and I don't see any reason we can't adapt the tool for a train/validation split. More on the approach here.

As for the size of the training vs validation sets, going by memory I think I've seen some different rules of thumb thrown around for the portion that should be applied to validation, ranging from 15% to 30% (again this is by memory so may be off). I'm going to arbitrarily assign 20% to validation, so the training set size will be 1460 * 0.8 = 1168, and validation will be 1460 - 1168 = 292.

Note that before splitting the training data I will first make a copy of the full set, in case we later want to attempt k-fold validation after our initial training run (we will).
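A sketch of the copy plus the 80/20 split (the random_state seed is an arbitrary choice of mine):

```python
from sklearn.model_selection import train_test_split

# keep a copy of the full training set for the k-fold runs later
x_train_full = x_train.copy()
y_train_full = y_train.copy()

# 80/20 split into training and validation partitions
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=0.2, random_state=42)
```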

Ok, this next bit of code is going to be a bit longer than what we've tackled so far, but it's just a repeat and slight reshuffling of what has already been covered. The goal here is to take our test data set (which Kaggle provided in a separate file) and perform all (ok, most) of the same preprocessing transformations so as to derive a set called x_test. Note that this process will be slightly reduced in that we won't bother with all of the explorations, we won't have the sales price column to work with, and finally we won't need to split this into train / validation like we did for the training set. Before we start, let's quickly upload the file and confirm it's set up as we expected (as before we'll need to ensure that our data file is saved in the same folder as this notebook, otherwise we would need to redirect the path). This inspection could be performed in parallel in rapidminer again if so inclined.

As expected, this set has one fewer column since the sale price is not included (this is the feature we're trying to predict after all), and apparently also one fewer row, which doesn't seem important. I brought the following up earlier and am still a little unsure about this point: when we were preparing our training set we scrubbed the Id number column so as not to interfere with the training operation; I'm hoping the Id number isn't used for some type of Kaggle submission validation, will have to find a tutorial. For now will assume it is not. (Update: the Id number is in fact used for the Kaggle submission, however I will for now take the approach of adding the Id number back to the predictions set after applying our machine learning magic; after all, I don't see why we would be shuffling the test set at any point, so I don't expect that will be a problem.)

Ok, here comes the fun part: going to revisit all of our prior transformations for the test set. As mentioned prior, this part would probably have been a little easier to assemble if we had used the sklearn.pipeline class instead of performing the transformations one at a time, but you know, as my prominent lower back tattoo says in an elaborate and poetic font: No Regerts.

Now that we have our test set, just to ensure that we have equivalent structure of features, let’s quickly pull up our x_train and compare.

So unless I'm mistaken, we have completed our data preprocessing journey. We started with a training dataset which we converted to x_train, y_train, x_val, and y_val; we kept a copy of the full training set under x_train_full and y_train_full; and we also started with a test set which we converted in similar fashion to x_test and labels_test. That was a lot of code at once in this test set processing; the hardest part was dealing with a bug in the Imputer call (I had originally used axis=0 instead of axis=1, which to be honest I am not entirely clear on what it is specifying (for the record, in that version of sklearn axis=0 imputes along columns and axis=1 along rows), but the code is working now so no complaints). Now I do believe we are ready to transition to the fun stuff: training. I'm going to celebrate by staring at my phone for a few hours until I get tired, see you in the morning :).

Part 2: Training

So we have our training and validation data, now what do we do with it? This home price prediction question is a regression problem (not to be confused with a logistic regression problem, which is in fact a poorly named type of classification problem, but I digress). While in the first part of this notebook I used a whole slew of different sources, tutorials, books, etc. as reference, I suspect the source material for these training operations will be a bit more condensed. Chollet has a section in Deep Learning With Python devoted to sample Keras training code for not only this type of problem but also this specific application of predicting house prices (section 3.8). The primary differences between his example and what we're trying to accomplish here are a) our 26 sparse columns and b) his example has less training data (which he addresses by turning to k-fold validation to make use of as much of the training data as he can). Although we have a little more data (around 1,400 samples vs what I think he described as around 500), it still seems like it would be good practice to attempt the k-fold approach. So even though one would think that starting our model with a k-fold validation technique would deviate from our stated intent of starting with a minimum viable model, the fact that we have sample code right on hand from the text suggests this might be the easier of the two paths, so yeah, the decision is to just jump straight into k-fold instead of developing a simpler model to start.

Since I am leaning heavily on this text, I'll just point out that I recently submitted a review on Goodreads. Just like Keras pares down neural network modeling to its most essential elements for clarity and simplicity, the prose in this book is comparable, and I highly recommend it.

So at this point I will switch from an Anaconda based jupyter notebook to running this set on my AWS EC2 AMI virtual machine, really for two reasons: 1) my laptop is underpowered and I am not sure how it will take to training operations, and 2) I am sort of embarrassingly unaccomplished in admittedly very simple operations such as installing platforms like Keras (it literally and embarrassingly took me like two weeks to install Tensorflow, although in my defense Google buries the instructions for the Anaconda install). So yeah, these virtual machines have all of those environments preloaded, so yeah, path of least resistance and whatnot.

Ok well logging out and will be right back — in the cloud!

via XKCD

Ok so before constructing our model based on Chollet’s demo, a few notes about some of the assumptions going in:

  • the models.Sequential() class is typical for the types of models a beginner will be working with, so won't go into depth on this point (a researcher may venture into the functional API for customizations beyond standard approaches)
  • our first (and only) hidden layer was demonstrated in the text with 64 hidden units. Since we have a few more features in our training data I'm going to bump it up a notch to 70; this adjustment is very arbitrary though, and the size of this hidden layer will be one of our hyperparameters
  • the hidden layer will be densely connected and use relu activation units; this is not an unusual approach. *However, note that the Hands-On Machine Learning with Scikit-Learn and TensorFlow text has some interesting discussion about potential improvements to the relu activation (spoiler, from memory: I believe they smooth out the transition point to make it differentiable at x=0 and slightly offset from a value of y=0). One more relu tangent: worth noting that although one of the big catalysts for the current deep learning renaissance was the discovery by Geoffrey Hinton et al. that deep networks could be pre-trained using unsupervised learning algorithms, it was largely the advent of the relu activation unit that allowed us to move to exclusively supervised training of deep networks (source)
  • the output layer will have 1 unit densely connected which will output our house price
  • I believe the mse (mean squared error) loss function is usually preferred over cross entropy for regression problems, although in other use cases I tend to come across cross entropy more often; some discussion about the tradeoffs on quora here.
  • mae (mean absolute error) is the validation metric that we will use to gauge the progression of training over the epochs. I'm not sure if it is a problem that this doesn't exactly match the evaluation metric from the Kaggle competition of root mean square error; ideally I'm thinking we would want our validation metric to match our evaluation metric, I suppose, but am sort of struggling at navigating the Keras documentation on this point. Both approaches measure the same error, so hopefully it won't hurt my cause to use a slightly different metric in validation.
  • for the optimizer I'm just going to defer to the text's recommendation of rmsprop; for those wishing to experiment there's a good discussion of some different optimization approaches and their merits here, although treatment of some next generation methods is omitted.
D-Wave tutorial on Quantum Annealing

Ok, next we're going to assemble code for the k-fold approach to validation. For those less initiated, k-fold is a useful approach when you have limited training data, which could cause our validation metric to be highly impacted by the randomness of the selection of data points. The k-fold approach splits the data into k sets, each with a mutually exclusive validation partition; training is performed separately on each of these folds with the validation metric calculated, and then the validation metric across the k sets is averaged to determine the overall validation metric for each epoch. Note that a higher number of folds will increase our training time. Chollet used four folds for his 500 training data points; as we have around 1,500 I'll arbitrarily bump the number of folds up to 5. We could increase this even more (have seen some suggest that even ten folds is not unusual), however that would also increase our training time, so let's start small and see how it goes. (Thus the number of folds k could be another hyperparameter if we so desired.)

The following code is basically verbatim from the Deep Learning With Python text, except I will use my terms for the training data (x_train_full) and targets (y_train_full). Will start with 100 epochs as used in the text. One thing I learned as part of this process is that Keras requires data and labels in the form of numpy arrays; since ours are currently saved as pandas dataframes, will start here by converting.
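A sketch of what that looks like (the fold loop follows the spirit of the text's listing; the batch size of 16 and the float32 casts are my own assumptions):

```python
import numpy as np
from keras import models, layers

# keras wants numpy arrays rather than pandas dataframes
train_data = x_train_full.values.astype('float32')
train_targets = y_train_full.values.astype('float32')

def build_model():
    # one hidden layer of 70 relu units, one linear output unit,
    # mse loss with mae tracked as the validation metric
    model = models.Sequential()
    model.add(layers.Dense(70, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

k = 5
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []

for i in range(k):
    print('processing fold #', i)
    # carve out the i-th partition as this fold's validation data
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]
    # everything else becomes this fold's training data
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]], axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]], axis=0)
    model = build_model()
    model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=16, verbose=0)
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)

print(np.mean(all_scores))
```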

So I got this to run after fixing a few typos and figuring out the issue with dataframes vs numpy arrays. Also, I tried running a single epoch just to get an idea of how long it would take to process on this micro instance; turned out it was through the epoch in less than 10 seconds, so figure 100 epochs will take somewhere around 10 minutes. So I think I am going to remain on this micro instance instead of transitioning to one with a GPU, partly because I'm cheap and partly because AWS makes it ridiculously hard to upgrade from their free tier so you can access GPU instances (turns out you have to initiate a service request and interact with an agent to enable access to AWS GPU instances, and then it takes a day or two to process the request).

So comparing our mean absolute error to the mean of our training labels, you see we have an average error of +/- 15%. Pulling up the Kaggle leaderboard, that's roughly 1100th place out of around 1500 entries. The top performance has an error rate of 6.6% by user DSXL (who is dramatically far ahead of second place BTW; second place has achieved 10.6% with a much denser progression through the rest of the leaderboard. After getting through this write-up I am going to do a little digging to see if first place has published any hints of how they achieved such a dramatic improvement over the rest of the pack; my guess is they must have found some additional training data).

On its own, I think achieving top 1,000 in a popular Kaggle introductory competition on my very first machine learning algorithm is kind of cool; however, this is a validation score and not a test score, so let's not get too far ahead of ourselves. We still have some hyperparameter optimization to perform.

One thing worth noting: the range of error rates between the separate k folds was much more dramatic in the text's example than what we are seeing here under the all_scores array. I'm guessing that's because of the increased number of training samples in our data; it should not surprise that there is a big difference in the impact of randomness in fold distribution between 500 and 1,500 data points. There's a good illustration from Wolfram's A New Kind of Science, of the impact of increasing data points by orders of magnitude on the clean distribution of data sampled from a random walk, that comes to mind.

image via A New Kind of Science by Stephen Wolfram

The next step the text example suggests is charting the progression of the mae metric across epochs while also upping the number of training epochs. To do so I'm going to repeat the k-fold code from above, but with the addition of a variable for the number of epochs and then all of the various mae history materials from the text's example Listing 3.91 (there are a few subtle differences from the last training run; I won't try to list them all), which, once the training run is complete, we will illustrate in a chart.

In the expectation that increasing the epochs to 500 as suggested in the text would go against my proclivity for not sitting on my hands for more than an hour at a time, I'll pare down the suggested epochs to 300 and also lower the number of k-folds to 3; after all, as we demonstrated in the first attempt, the difference in validation metrics between folds has not been as dramatic for this set as what was demonstrated in the text. These two concessions should help reduce training time.

Warning: running this code will take even longer to complete than the last iteration, so make sure you have the time or a GPU before pressing play. Note that this code is likewise primarily derived from an example in Deep Learning With Python.

Don’t forget to save your trained model for later imports — otherwise the last 2–1/2 hours could go to waste!

Keras documentation on exporting and importing trained models here.
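The save and reload are each a one-liner (the filename is just a placeholder of mine):

```python
from keras.models import load_model

# save architecture + weights + optimizer state to a single hdf5 file
model.save('house_prices_model.h5')

# ...and later, to bring it back:
# model = load_model('house_prices_model.h5')
```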

It occurs to me that it might be easier to eyeball the training improvement of later epochs if we change the scale on the y axis to a log plot, so let's give that a try.
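A sketch, assuming the per-epoch averages are sitting in a list named average_mae_history as in the text's example:

```python
import matplotlib.pyplot as plt

# same mae-per-epoch chart, but with a log scale on the y axis so the
# later epochs are easier to eyeball
plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.yscale('log')
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()
```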

So after 300 epochs, the final validation metric (mean absolute error) was 24.3k, which is an improvement over our initial 100 epoch run that came in with a score of about 27k. So increasing the number of epochs helped.

Note that the type of visualization performed here can also be accomplished with Tensorboard, which is a browser based model and training evaluation tool that is a part of Tensorflow. Which reminds me, I probably should have pointed out earlier that even though we are running this model through the Keras framework, that implementation is merely a front end for (our choice of) Tensorflow, Theano, etc. Thus if you wanted to monitor your model with Tensorboard, I do believe it is possible as long as you are directing Keras to use Tensorflow for its backend (which I believe it defaults to).
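For reference, hooking Tensorboard into a Keras fit call is just a callback (a sketch; the log directory is a placeholder, and you'd launch the viewer separately with tensorboard --logdir=./logs):

```python
from keras.callbacks import TensorBoard

# write training logs to ./logs so the run can be monitored in tensorboard
tb = TensorBoard(log_dir='./logs')
model.fit(partial_train_data, partial_train_targets,
          epochs=num_epochs, batch_size=16, verbose=0, callbacks=[tb])
```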

For our next step, we can attempt to tune some hyperparameters, keeping in mind that there must be a balance between improving our model's accuracy and preventing overfit. Most of the more advanced problems out in the wild will have a whole slew of parameters to deal with; one great thing about this simple problem is that the single hidden layer architecture and established industry norms for addressing regression problems mean that we won't need to venture too far from our current model architecture. The text suggests playing with the number of epochs and the size of the hidden layer. If we were feeling really adventurous we could experiment with some other features, such as trying a different optimization algorithm (such as Adadelta), experimenting more with the number of k folds, adding another hidden layer, or, who knows, maybe even incorporating dropout into the training (although truth be told I'm not sure if that's appropriate for such a shallow model). For the more advanced problems there are even some tools out there to automate the hyperparameter exploration process. Chollet mentions a Python library called Hyperopt in the text which can explore feature spaces using random search or something called Tree of Parzen Estimators (you know a concept is on the cutting edge when there doesn't even exist a wikipedia entry for it; would someone mind getting on that? :). A Keras wrapper for this tool can be found on github here. There is even experimentation going on at OpenAI to automate hyperparameter search using reinforcement learning algorithms!

If our last derivation of the Kaggle score is to be believed, this run would have taken us from around 1,000th place on the leaderboard to somewhere around 700th place. Which is nice I suppose, putting us around the midpoint of competitors, all without a GPU no less. A 'C' is a passing grade after all! But I figure I've got two more runs in me before wrapping this up, and will try to use those to experiment with the size of the hidden layer. This last training run was enjoyable; I used the time to watch the movie Contact with Jodie Foster et al. Who knows, maybe lurking somewhere in that cosmic background radiation there is an alien intelligence trying to be heard; I bet it could place very high on the leaderboard! You know (spoiler alert), when that pod dropped through the $300B alien device it only lost communication for a fraction of a second, but for Jodie something like 18 hours transpired. That technology could come in really handy for shortening the training time of deep networks! Anyway, I'll throw together another training run now and let it run overnight.

via Contact

So by reducing the units in the hidden layer from 70 to 55, with 3 k-folds and 300 epochs, the mean absolute error validation metric increased from 24.3k to 24.8k (meaning our model performed worse with the same number of epochs). Since my expectation is that an architecture with fewer weights should be able to reach comparable performance in fewer epochs (not positive on this point though), I'm interpreting this as our architecture not having sufficient complexity to handle our evaluation. I'll try again, but this time moving the hidden layer units in the opposite direction, from 70 to 85. This will probably be our last hyperparameter optimization attempt, and depending on the result we will either finalize the model based on the higher number of units or revert back to the 70 unit size. For the next Kaggle problem set I tackle I will certainly incorporate a GPU and automate the hyperparameter search.

Once we settle on our architecture, we'll transition to the final task of preparing our Kaggle submission data, which hopefully will be relatively straightforward.

Eureka, we have found an improvement to our validation metric from increasing the number of hidden layer units from 70 to 85. A reasonable next step would be to step up our number of units and try again (or even better, pick some new random combination of perturbations within our hyperparameter search sphere, rinse, lather, and repeat); reason was never my strong suit however, so am going to consider this model officially optimized. After all, the goal was always a minimal viable model to serve as a baseline for future iterations, and given the expected leaderboard placement from this validation score, I do consider this viable. Now we need to apply our trained model to generate predictions from our test set, at which point we can prep for and submit to Kaggle to get our score. The finish line is right around the corner!

Herbie Hancock — Feets Don’t Fail Me Now

So here we'll apply our trained model to our test set and generate predictions. Even though our trained final model is already available in memory, I'll go ahead and reload the trained model we saved after the last run, for demonstration purposes. Note that we could reload any of the different saved models at this point; I'll stick to the final one since it gave the best validation metrics. Just because I'm chicken and don't want to lose the trained model currently in memory in case of a bug, I am going to name the reloaded final model something different.

And now for the application of our reloaded model to the test data. Cross your fingers!
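A sketch of that application (the .h5 filename is the placeholder from the save step earlier):

```python
from keras.models import load_model

# reload the saved final model under a new name and run it against
# the preprocessed test features
final_model = load_model('house_prices_model.h5')
predictions = final_model.predict(x_test.values.astype('float32'))
```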

Ok, we are closer than ever. Now, reviewing the Kaggle submission requirements, it appears they want the data in the format of a comma separated csv file, with headers on the columns, the Id numbers in the first column, and test predictions in the second. I think to achieve that I will convert the results array to a dataframe with a header, merge in the Id dataframe, and then export to csv. I'm guessing there are probably scripts developed by other competitors that could simplify this process, but will save that exploration for next time.
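A sketch of that assembly (I'm pulling the Id column straight back out of the raw test file here, which is an assumption on my part about the simplest route):

```python
# assemble the two column csv Kaggle expects: Id, then SalePrice
test_ids = pd.read_csv('test.csv')['Id']
submission = pd.DataFrame({'Id': test_ids,
                           'SalePrice': predictions.flatten()})
submission.to_csv('submission.csv', index=False)
```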

Bad news, I'm afraid. My first submission was a dud. My submission scored 0.26440, placing 1445 out of 1599 competitors. Given that I expected the error rate to fall somewhere around 13% based on comparing my validation score to the mean value in the sale price column, my assumption is that I've entered the dreaded overfit territory, but note I am approaching the borders of my expertise at this point. The good news is we now have a baseline to evaluate future performance. The better news is that we have nowhere to go but up. So yeah, here goes a few more.

If this is overfit, I can think of two directions that should help: 1) decrease the number of units in the hidden layer, 2) decrease the number of epochs. So what the hell, let's try both. In this run we will base off of epochs = 150 and hidden layer units = 70. It's go time.

Well, this final attempt didn't help, so will have to revert to 85 units in the hidden layer and 300 epochs as the baseline. Current high score is 0.24734, 1429th place. This search process is not one where my limited experience will add much value, so instead will simply set this aside until I have a GPU instance in hand and have figured out how to automate the hyperparameter search process. A job is only as easy as the tools you bring to bear; why try to use a hammer when a screwdriver is needed, know what I mean?

There lies a steel driving man.

Conclusion

This has been a fun experiment. Although it still leaves a little to be desired from the leaderboard standpoint, at least a baseline is officially established! Will post this on github in case it has the potential to help any others get started with Kaggle. This is an interesting time for this competition because, in parallel, there is another more advanced version of this home price prediction regression problem going on in Kaggle with cash prizes, so this set could be a warm-up for a real opportunity!

Note that I occasionally blog in a creative capacity on medium but sometimes touch on areas of professional interest — I wrote a bit about the emerging field of deep learning in a post if you’d like to see more — From the Diaries of John Henry. If you enjoyed or got any value from this writeup feel free to say hello, I’m on twitter at @_NicT_ or for professional inquiries can be reached on linkedin. Happy learning!

*For further readings please check out my Table of Contents, Book Recommendations, and Music Recommendations.

Books that were referenced here or otherwise inspired this post:

Deep Learning With Python — Francois Chollet

Hands-On Machine Learning with Scikit-Learn and TensorFlow — Aurélien Géron

Albums that were referenced here or otherwise inspired this post:

Waiting For Columbus — Little Feat

As an Amazon Associate I earn from qualifying purchases.
