My Data Science Learning Journey begins…

I already have a blog, but I am curious to see what Medium offers beyond it. Since this coincides with my new effort to teach myself Data Science, it seems a good opportunity to kill two birds with one stone: learn Data Science, and try a new blogging platform to document my personal journey and share what I learn with others.

So the first step turned out to be more of a stumble until recently. Overwhelmed with the terms of this new industry, and not having much mathematics or statistics background other than what I learnt (and forgot most of) from university some decades back, I found myself in the rare position of being a newbie once again in technology. Exciting!

Thankfully, Brisbane has a great pioneer in Data Science, specifically Artificial Intelligence, in Dr. Natalie Rens, who runs Brisbane.AI, a meetup for AI enthusiasts and professionals in Brisbane, and who organized an AI Hackathon that I was able to participate in.

For this hackathon we used Kaggle, a platform that hosts competitions in which companies offer up datasets for competitors to build predictive models on, often with financial rewards, as they seek to solve very real problems they face. For the hackathon, the beginners focused on the House Prices: Advanced Regression Techniques competition.

We had a great mentor for the beginners in Lex Toumbourou, who walked us through several short, well-timed, and very useful introductions to ideas on how to attack the problem. He kept the statistical and mathematical weeds to a minimum where possible and used plain English so as not to intimidate a beginner. In essence, he taught us some of the basics of data wrangling and modelling that real data scientists use, giving us just enough of a sip from the ocean of Data Science to whet our appetites, get our submissions into the competition, and keep us encouraged to learn more on our own afterwards (as I have).

Kris Bock and Georgina Siggins of Microsoft were our hosts. Many thanks to Georgina for making sure the snack plates and lunches were there on time, and for providing moral support and conversation during the breaks. Thanks to Kris as well for persevering in sharing what the Azure platform offers Data Science enthusiasts. Even when the demo gods refused to be appeased by his efforts, he still made available the contents of the Deep Learning VMs he had hoped to have us run on our own.

The peer community that participated in the hackathon was vibrant and supportive, with everyone learning from each other. Though I was a newbie to this field, with several much more experienced people in the room, the enthusiasm with which they answered my many (many) questions, which were probably very basic to most of them, was humbling and encouraging: this is definitely a spirited, positive community of folks. I have to thank Natalie Rens and Lex Toumbourou again for all they did to make the event happen.

In conclusion, several things I learnt over this weekend:

  1. Kaggle is a great learning tool, and not just for the competitions. Lex shared several notebooks that walked us through attacking the problem, and notebooks like these are commonly shared for free by others in the community.
  2. Python is really powerful and has a simplicity and elegance to it. As a side effect of the hackathon, I learnt how to use pyenv for Python version management while setting up my own Jupyter Notebook. Kaggle itself offers a custom interface to a Jupyter Notebook for writing competition code, so you don’t need to know how to set one up, but I wanted the flexibility of using my own. Doing the competition taught me a lot about how to look at data, and how to use basic libraries such as NumPy, pandas and scikit-learn for data wrangling. These are incredibly powerful libraries and are heavily used in predictive modelling.
  3. While we started with a simple Linear Regression model as our learning algorithm, through peer learning I picked up the basics of its variations (and their associated libraries), such as the Lasso and Ridge regression models. I was also introduced to ElasticNet, which I settled on for my hackathon entry, and which I think didn’t do too badly for a first timer.
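For anyone curious what comparing these four models looks like in code, here is a minimal sketch using scikit-learn. It is an illustration only: the data is synthetic (generated with `make_regression`) rather than the competition dataset, the `compare_models` helper is a made-up name, and the `alpha` and `l1_ratio` values are arbitrary assumptions, not the settings from the hackathon entry.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge
from sklearn.model_selection import cross_val_score


def compare_models(X, y, cv=5):
    """Return the mean cross-validated RMSE for each linear model (hypothetical helper)."""
    # Regularisation strengths below are arbitrary illustrative values.
    models = {
        "LinearRegression": LinearRegression(),
        "Ridge": Ridge(alpha=1.0),                           # L2 penalty
        "Lasso": Lasso(alpha=1.0),                           # L1 penalty
        "ElasticNet": ElasticNet(alpha=1.0, l1_ratio=0.5),   # mix of L1 and L2
    }
    scores = {}
    for name, model in models.items():
        # scikit-learn reports errors as negative scores, so flip the sign.
        neg_rmse = cross_val_score(
            model, X, y, cv=cv, scoring="neg_root_mean_squared_error"
        )
        scores[name] = float(-neg_rmse.mean())
    return scores


# Synthetic stand-in for a real dataset: 200 rows, 20 features, 5 of them informative.
X, y = make_regression(
    n_samples=200, n_features=20, n_informative=5, noise=10.0, random_state=0
)

results = compare_models(X, y)
for name, rmse in sorted(results.items(), key=lambda item: item[1]):
    print(f"{name}: RMSE = {rmse:.2f}")
```

On real data, the regularisation strength would normally be tuned (for example with cross-validation) rather than fixed up front as it is here.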

One of my first computer science lecturers at UWI always advocated mastering first principles as the way to grasp any specialised realm in Computer Science. In writing this, I chuckle to myself as I now remember he was also the AI lecturer. His advice has always stuck with me. So, looking back at this weekend’s competition, it started with the first principle of understanding what data we were given, and how to measure it effectively so that it could be applied in the chosen predictive modelling algorithm (in this case Linear Regression).

So the knowledge I am missing around statistics and measurement theory is where I am going to start next, while also exploring the “cool” stuff like Python, Jupyter Notebook, Kaggle and the other libraries, platforms and tools relevant to learning Data Science.