My Group’s Crack At The Tabular Playground Series February Competition

TristanCSC513
Published in CUNY CSI MTH513
May 19, 2022

The beginning of the end

Of all the classes I’ve taken at CSI, this was definitely one of the most interesting. Having an open project that you work on over the course of a month, building and iterating, watching your progress improve as a literal numerical value, was both daunting and engaging to work through.

Watching the pieces come together was a fulfilling accomplishment and I’m happy that we finished with our score of 0.97188.

So, what exactly was this competition all about? Did we win a huge prize, making a name for ourselves across the Kaggle landscape? Sadly…this was not the case. However, we did finish in the top 525 out of more than 1,255 teams, 1,312 competitors, and over 11,766 entries, which left me feeling content. Moving on, though, this project was all about “10 different bacteria species using data from a genomic analysis technique that has some data compression and data loss. In this technique, 10-mer snippets of DNA are sampled and analyzed to give the histogram of base count,” as quoted directly from the Tabular Playground Series - Feb 2022 page. (The original description follows that sentence with an example DNA segment and the base-count histogram it becomes.) That quote summarizes our main goal: predicting bacteria species based on repeated lossy measurements of DNA snippets. Our project was off to a start.

Tackling the competition: Scoring

Our submissions were scored and evaluated based on their categorization accuracy.

Class Slides
Google Machine Learning Crash Course

Pretty self-explanatory: the better our model does at predicting the correct species for each row, the higher the score we get.
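As a rough illustration of what that means in code (the species labels below are placeholders, not actual rows from the competition), accuracy is simply the fraction of predictions that match the true class:

```python
from sklearn.metrics import accuracy_score

# Placeholder labels, purely for illustration; Kaggle computes this on the hidden test set.
y_true = ["species_A", "species_B", "species_A", "species_C"]
y_pred = ["species_A", "species_A", "species_A", "species_C"]

# Categorization accuracy = correct predictions / total predictions.
print(accuracy_score(y_true, y_pred))  # 3 of 4 correct -> 0.75
```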

Tackling the competition: Data/Modeling

As said before, each row of data contains a spectrum of histograms generated by repeated measurements of a sample. Totaling the outputs of a row leaves us with 286 histogram possibilities, which then have a bias spectrum (random ATCGs) subtracted from the results. Our given data was as follows: “train.csv”, the training set containing the spectrum of 10-mer histograms for each sample along with its target species; “test.csv”, the test set for which we predict the bacteria species (target) for each row_id; and “sample_submission.csv”, a sample submission file in the format shown below.

sample_submission.csv

We can take a look at the first 5 rows of our train data set and gather some rudimentary information.

Now let’s take a look at the test set.
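For context, loading the files looked roughly like this; the paths assume the CSVs sit in the working directory, so adjust them to match your setup:

```python
import pandas as pd

# Load the competition files (paths are an assumption; point them at wherever the CSVs live).
train = pd.read_csv("train.csv")
test = pd.read_csv("test.csv")

# Peek at the first 5 rows of each to get a feel for the columns.
print(train.head())
print(test.head())
print(train.shape, test.shape)
```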

The first noteworthy findings were that there are no null values, and that of the 286 features, 278 were continuous while the other 8 were categorical, meaning they have fewer than 25 unique values.

[TPS-FEB-22] 📊EDA + Modelling📈

Looking past the first 5 rows, it was also noted that the 10 different target values all had a somewhat equal distribution, around ~10% for each target.
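A sketch of the checks behind those observations, continuing from the loading snippet above (the row_id and target column names come from the competition files; the fewer-than-25-unique-values threshold is the rule of thumb mentioned earlier):

```python
# Null check: confirm there are no missing values in train or test.
print(train.isnull().sum().sum(), test.isnull().sum().sum())

# Split the 286 feature columns into "continuous" vs "categorical"
# using the fewer-than-25-unique-values rule of thumb.
features = [c for c in train.columns if c not in ("row_id", "target")]
n_unique = train[features].nunique()
print("categorical-ish:", (n_unique < 25).sum(), "continuous:", (n_unique >= 25).sum())

# Target distribution: each of the 10 species sits at roughly 10%.
print(train["target"].value_counts(normalize=True))
```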

Distribution/Shape of Data

After testing various models we achieved varied results from each, but we lacked a core understanding of how the models were using the data and how the shape and distribution of our data worked with each model. Creating different visualizations helped give us an idea of why certain models were working while others were not.

TPS-Feb22, EDA -> Ignore-Important Cols
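Here is a minimal sketch of the kind of plot we leaned on, assuming matplotlib is available and continuing from the loaded train dataframe; it overlays per-species histograms for a handful of feature columns to eyeball how separable they are:

```python
import matplotlib.pyplot as plt

# Feature columns, same rule as before (everything except row_id and target).
features = [c for c in train.columns if c not in ("row_id", "target")]
cols_to_plot = features[:4]  # first few columns, purely for illustration

# One panel per column, one translucent histogram per target class.
fig, axes = plt.subplots(1, len(cols_to_plot), figsize=(16, 3))
for ax, col in zip(axes, cols_to_plot):
    for species, group in train.groupby("target"):
        ax.hist(group[col], bins=30, alpha=0.3, label=species)
    ax.set_title(col)
axes[0].legend(fontsize=6)
plt.tight_layout()
plt.show()
```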

Tackling the Competition: Information Far and Wide

When first starting the competition there were many different routes we could go down when it came to building our model: linear regression, polynomial regression, ridge regression, random forest… the list goes on, and we were told to try anything and everything to see what works. This proved to be a valid strategy for understanding and exploring the competition. Using not only our own findings but also consulting other successful models, we started off by splitting the training data into training and validation sets. We trained each model on X_train, y_train and then scored it on X_val, y_val, testing a multitude of different models starting with DecisionTreeClassifier.
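In code, that setup looked roughly like the sketch below (continuing from the loaded train dataframe; the 80/20 split and random_state are assumptions rather than our exact configuration):

```python
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Separate the features from the target and hold out a validation set.
X = train.drop(columns=["row_id", "target"])
y = train["target"]
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# First model we tried: a plain decision tree, trained on the training split
# and scored (accuracy) on the held-out validation split.
model = DecisionTreeClassifier(random_state=42)
model.fit(X_train, y_train)
print("validation accuracy:", model.score(X_val, y_val))
```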

Something we found immediate success with in increasing our score was dropping the row_id column, as it wasn’t really needed or useful. We made other attempts at optimization, such as filling NaN values in the test data with mean values from the train data, but ultimately found that it was harming our model.
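Concretely, that amounted to something like the following sketch; the commented-out line is the mean-imputation experiment we ended up reverting:

```python
# Drop row_id from the test features as well: it is just an index and adds no signal.
X_test = test.drop(columns=["row_id"])

# Reverted experiment: filling NaNs in the test set with train-set means hurt our score,
# so the line stays commented out and the data is left as-is.
# X_test = X_test.fillna(X.mean())
```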

Our initial models found varied success; using DecisionTreeClassifier first, we were able to achieve a max score of .97198.

Out of curiosity we also tried other models such as PassiveAggressiveClassifier and RandomForestClassifier, but our results were lower than the previous models even with tuning and changing of bootstrap and estimators.
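Those side experiments looked roughly like this sketch (reusing the split from above; the specific n_estimators and bootstrap values shown are illustrative, not the exact settings we tried):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import PassiveAggressiveClassifier

# Random forest with a couple of the knobs we played with (illustrative values).
rf = RandomForestClassifier(n_estimators=300, bootstrap=False, random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
print("random forest:", rf.score(X_val, y_val))

# Passive-aggressive linear classifier, tried mostly out of curiosity.
pa = PassiveAggressiveClassifier(random_state=42)
pa.fit(X_train, y_train)
print("passive aggressive:", pa.score(X_val, y_val))
```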

Conclusions and Observations

Ultimately, the model we finished with our top score was ExtraTreesClassifier. Based on the shape of the data and what we learned from class and some mild research, it seemed like the best option, and it proved the most successful of the models we used. We kept tuning the estimator count until the end of the project, settling on n_estimators=500. I believe we had much more success with ExtraTreesClassifier because of the nature of the model: its ability to improve predictive accuracy and control over-fitting, which I believe was occurring in other models.
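A sketch of that final setup: n_estimators=500 is the value we settled on, while the rest of the submission plumbing below is the usual pattern rather than our exact script:

```python
from sklearn.ensemble import ExtraTreesClassifier
import pandas as pd

# Final model: extremely randomized trees with the estimator count we settled on.
final_model = ExtraTreesClassifier(n_estimators=500, random_state=42, n_jobs=-1)
final_model.fit(X, y)  # fit on the full training data (row_id already dropped)

# Predict the species for each test row and write the submission file.
submission = pd.DataFrame({
    "row_id": test["row_id"],
    "target": final_model.predict(X_test),
})
submission.to_csv("submission.csv", index=False)
```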

Starting the semester with this project, observing the strengths and weaknesses of certain models, and especially learning how to match different models to the type of data and task we are working on was a great learning experience that definitely helped give me a firm grasp on machine learning as we moved on with the semester.
