Kaggle Competitions and Capstones

Currently there is a Kaggle data science competition entitled “Sberbank Russian Housing Market.” The competition asks data scientists to build predictive models for the volatile housing market of Moscow. While I may need to refine my skills before entering such a competition, I intend to study the problem and replicate it for the housing market here in Washington, DC. This will be my capstone project for General Assembly’s Data Science Immersive. For this reason, I want to survey the etiquette and methods of data science competitions and familiarize myself with the process. Additionally, I will look for advice and tips from experienced data scientists on the overall process.

Luckily, there is a discussion board that helps address multicollinearity, feature engineering, outliers, and the bias/variance tradeoff. That last topic is particularly relevant because, as you will see below, the feature space is massive!

In the Sberbank competition, there are numerous features provided across a few CSVs. The data_dictionary.txt file covers what we would expect to determine rent prices, such as the cost of a unit based on square footage, number of beds and baths, kitchen size, parking, amenities, and neighborhood, but it goes well beyond that minimum (as is to be expected for data science problems). See below.

Population: Migration, natural population growth, subarea (neighborhood) gender, under working age, working age, retired age, childbirth, mortality, infant mortality, life expectancy, students in secondary education, students in university, students in graduate school, number of MBAs (the number of MDs is looped in with the macroeconomic indicators).

Land dynamics: Buildings per subarea, cafe count, shopping malls, industrial zones, green zones, power plant zones, ‘dirty’ zones, subway accessibility. There were also features that measured the distance between a property and some point of interest, which could be a hospital, bank, market, airport, train station, etc.

Macroeconomic Indicators: GDP, GDP growth, Gross Regional Product, Gross Regional Growth, Consumer Price Index Growth, Ruble/Dollar exchange rates, volume of mortgage loans, growth of mortgage lending, weighted average of mortgage lending, average monthly salary, average monthly salary per capita, rate of buildings under contract, growth of nominal wages, retail trade turnover, retail trade turnover per capita, share of profitable entrepreneurs, employment rate, unemployment rate, size of labor force, old house share, relative number of doctors, relative number of nurses, number of visits per physician, hospital bed availability, average occupancy of hospital beds, city residential housing under construction, apartment condition, number of museum visits per 1000 people, number of sports attendance per 1000.

Rental Data: Rent price for luxury studio, rent price for luxury 1 bedroom, rent price for luxury 2 bedroom, luxury 3 bedroom, etc. The same property types were included for ‘economy.’ (For Washington, I will be sure to look at the size and scope of Section 8 housing.)

Again, I’m not entering my capstone into any sort of competition. What I do want to do, though, is treat it as if it were a competition, so that whatever predictive model I build will be valuable. For example, Owen Zhang, CPO at DataRobot, published a slide deck on LinkedIn SlideShare offering tips on data science competitions, which he breaks down into philosophical considerations, strategy, and techniques. His philosophy suggests overfitting can happen in a variety of ways. He also emphasizes that there are costs to peeking at the answers ahead of time and that a data scientist’s initial instinct is usually right. For him, thinking more and trying less is ideal. In the end, it’s a mix of luck, feature engineering, background knowledge, discipline, choosing the appropriate model, selecting the best statistical packages, and being efficient when it comes to data manipulation and coding.

“Good validation is better than a better model.” He argues that a regular train/test split is only the start. If possible, I should aim to have a holdout set that I leave alone throughout the entire data science process, including feature selection. Mr. Zhang says that if your holdout set result is bad, then you should scrap the problem. I may need to take that with a grain of salt.
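
To make the holdout idea concrete for myself, here is a minimal sketch in plain Python. It is my own illustration, not something taken from the slides: the function names, the 20% holdout fraction, the five folds, and the 1,000-row count are all assumptions. In practice, scikit-learn's train_test_split and KFold cover the same ground.

```python
import random


def split_holdout(n_rows, holdout_frac=0.2, seed=42):
    """Shuffle row indices once and reserve a fraction as an untouched holdout.

    Hypothetical helper: the fraction and seed are illustrative choices.
    """
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    n_holdout = int(n_rows * holdout_frac)
    # First slice is the working set, second slice is the holdout set.
    return indices[n_holdout:], indices[:n_holdout]


def k_folds(working, k=5):
    """Yield (train, validation) index lists drawn only from the working set."""
    for i in range(k):
        validation = working[i::k]  # every k-th row, offset by i
        in_validation = set(validation)
        train = [idx for idx in working if idx not in in_validation]
        yield train, validation


working, holdout = split_holdout(n_rows=1000)

fold_sizes = []
for train, validation in k_folds(working, k=5):
    # Feature selection and model tuning would happen here, using only
    # `train` and `validation`. The holdout rows are never touched.
    fold_sizes.append((len(train), len(validation)))

# Only after every modeling decision is final would the chosen model be
# scored once on the rows indexed by `holdout`.
```

The point of the design is that the holdout indices are fixed before any modeling starts, so nothing I learn during cross-validation can leak into the final check.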

In terms of feature engineering, his process was closely tied to the particular competition he was involved in. In terms of model selection, he briefly discussed his affinity for gradient boosted decision trees, random forests, extra trees, gradient boosting regressors, and a few other novel concepts that are beyond my experience in data science. I will have to look at the documentation if I choose to use any of the techniques that we did not cover in my immersive course.
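
As a starting point for my own experiments, a rough sketch of comparing those tree-based models with scikit-learn might look like the following. This is my assumption about how I would set it up, not his method: the data is synthetic (make_regression), and the estimator settings, fold count, and RMSE metric are placeholder choices.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import (
    ExtraTreesRegressor,
    GradientBoostingRegressor,
    RandomForestRegressor,
)
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (an assumption); the real project would use
# housing features for Washington, DC instead.
X, y = make_regression(
    n_samples=300, n_features=20, n_informative=5, noise=10.0, random_state=0
)

# Small, untuned models so the comparison runs quickly.
models = {
    "gradient_boosting": GradientBoostingRegressor(n_estimators=50, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "extra_trees": ExtraTreesRegressor(n_estimators=50, random_state=0),
}

cv_rmse = {}
for name, model in models.items():
    # scikit-learn reports negated MSE so that higher is always better.
    neg_mse = cross_val_score(model, X, y, cv=5, scoring="neg_mean_squared_error")
    cv_rmse[name] = (-neg_mse.mean()) ** 0.5

# Print models from lowest to highest cross-validated RMSE.
for name, rmse in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{name}: {rmse:.2f}")
```

Scores on synthetic data say nothing about which model will win on real housing data; the sketch only shows the mechanics of putting the candidates through the same cross-validation.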

In any case, I’m confident that this competition will provide excellent guidelines on how to approach this problem of predicting rental prices in Washington. The access to differing viewpoints and strategies will be particularly beneficial, so I’m feeling optimistic.