One of the big challenges of doing ‘data science’ is collecting sufficient data. Many models, particularly for complicated problems like machine vision or natural language processing, require large data sets, and while we may live in the era of big data, it’s not always easy to tap into large, high quality data sets at home. Particularly for students of data science or new entrants in the field, the lack of neat data can seem like an impediment to learning — how much more mileage can we get out of the iris dataset, or the Titanic one?
If you’re a data scientist, or a data scientist in training, looking for side projects, I want to suggest that there’s another way to practice highly relevant skills that doesn’t require a lot of web scraping or premium access to curated data. If anything, the focus that much data science instruction (particularly at the increasingly ubiquitous bootcamps) places on high-accuracy predictions, the sort that require models trained on lots of data, does a disservice to one of the fundamental uses of building models: creating conceptual frameworks to better understand how or why things work the way they do. With this in mind, I contend that useful models can be built with no data collection necessary. …
Among the tragic losses to the current coronavirus pandemic is the brilliant mathematician John Conway, who passed away on April 11th. He made major contributions to group theory and game theory, but is probably best known in the wider world for his Game of Life. The Game of Life is a simple yet fascinating cellular automaton that ended up developing something of a life of its own, far beyond what Conway expected.
The Game of Life takes place on a grid, with certain cells being marked ‘alive’ or ‘active’ and others being marked ‘dead’ or ‘inactive’.

The initial condition of the grid is set by the player, but after that, the grid evolves according to set…
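The rules the excerpt trails off on are Conway’s standard ones: a live cell with two or three live neighbors survives, a dead cell with exactly three live neighbors comes alive, and every other cell dies or stays dead. A minimal sketch of one update step, representing the grid as a set of live-cell coordinates rather than a full array:

```python
from collections import Counter

# One step of Conway's Game of Life, standard rules:
# a live cell survives with 2 or 3 live neighbors;
# a dead cell is born with exactly 3.
def step(alive):
    """alive: set of (row, col) tuples marking live cells."""
    # Count live neighbors for every cell adjacent to a live cell.
    counts = Counter(
        (r + dr, c + dc)
        for (r, c) in alive
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )
    return {cell for cell, n in counts.items()
            if n == 3 or (n == 2 and cell in alive)}

# A 'blinker' oscillates between a horizontal and a vertical bar.
blinker = {(1, 0), (1, 1), (1, 2)}
print(sorted(step(blinker)))  # [(0, 1), (1, 1), (2, 1)]
```

Only cells near live cells can change state, so counting neighbors of live cells is enough; the rest of the (conceptually infinite) grid stays dead.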
Handling time-stamped data is one of the most intuitively obvious use cases for data science. After all, any data we collect necessarily comes from, or represents, the past, and we often want to make predictions about unseen cases we’ll encounter in the future, so it makes sense that our models might have an important time component, in some form or another. …
Judging a classification model feels like it should be an easier task than judging a regression. After all, a prediction from a classification model can only be right or wrong, while a prediction from a regression model can be more or less wrong, with any level of error, high or low. Yet judging a classification is not as simple as it may seem. There’s more than one way for a classification to be right or to be wrong, and multiple ways to combine those different kinds of rightness and wrongness into a unified metric. …
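To make the different ways of being right and wrong concrete: a quick sketch, on invented labels, of the four confusion-matrix counts and the standard metrics built from them (precision, recall, and F1 follow their usual definitions; the data here is made up purely for illustration):

```python
# Tally the four outcomes of a binary classification:
# true/false positives and true/false negatives.
def confusion(y_true, y_pred, positive=1):
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn

# Invented labels for illustration.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion(y_true, y_pred)
accuracy  = (tp + tn) / len(y_true)  # fraction right overall -> 0.75
precision = tp / (tp + fp)           # of predicted positives, how many were real
recall    = tp / (tp + fn)           # of real positives, how many we caught
f1 = 2 * precision * recall / (precision + recall)  # one way to unify the two
```

Here the model misses one real positive (a false negative) and raises one false alarm (a false positive); whether that trade-off is acceptable depends on which kind of mistake costs more in your problem.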
It’s a little hard to comprehend how completely the new reality of climate change will alter our relationship with the world. As changes in temperature and precipitation patterns impact agriculture, we will need to rethink where our food comes from, possibly radically. Water stress may impact basically every item we consume or use, from the cotton in our clothes to the waste water from mining for materials used in consumer electronics. Higher temperatures and heatwaves are closely linked to hospitalizations from cardiovascular and respiratory complications, making climate change a looming public health crisis.

Perhaps more fundamentally, climate change is altering where we can even be, with changing coastlines, extreme weather and increased fire risk rendering some places unviable for habitation. In a way, it has been this impact of global warming that has long been front of mind in the popular imagination. The possibility of large swaths of the world’s coastlines simply disappearing under rising seas has been one of the most enduring images of climate change for decades. …
Now that it’s over, we can say for certain that 2019 was the second hottest year on record. High and sustained global temperatures have brought with them, or worsened, a host of freak climate events and catastrophes, including millions of people displaced by extreme weather, a heatwave in Greenland of all places and, of course, the devastating and tragic Australian bushfires (still ongoing as I write this). As if these current disasters were not enough, the future looks even bleaker, with new reports coming out suggesting that a quarter of all people on earth face a water crisis, that more cities will be affected by rising sea levels, and more quickly, than previously thought, and that the Arctic permafrost is thawing at a rate that had previously not been expected until the year 2090. …
About a year ago I had a conversation with an old friend who shocked me by professing he didn’t think climate change was really a big deal. That led me to wonder, how could you convince someone otherwise? That in turn led me to take on climate data visualization as a bit of a side project, the early results of which are up on isclimatechangeabigdeal.com.
The site is now updated for the 2019 data, and if you’ve read the climate related news, you’ll already know that the 2019 data wasn’t great. …
After all the data collection and data cleaning, it’s time to fit a model and make some predictions. Why not try a couple of different models, actually — a plain linear regression or one with a regularization method like Ridge or Lasso, maybe KNN or a decision tree for regression. But now, how do you decide which of these models has actually performed the best? Is it possible to quantify how well any given model works? This is actually something of a tricky question, and it requires that you balance multiple goals that are frequently in opposition to each other. On the one hand, you want to minimize the error in any given prediction your model will make. On the other, you want your model to generalize, to work well on unseen data, which means avoiding overfitting. There are numerous different metrics you can use to evaluate your models along these lines. …
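As a concrete sketch of such metrics, here are mean absolute error, mean squared error, and R² computed by hand on invented numbers. MAE penalizes all errors linearly, MSE punishes large errors disproportionately, and R² compares the model against the baseline of simply predicting the mean:

```python
# Made-up true values and predictions, purely for illustration.
y_true = [3.0, 5.0, 2.5, 7.0]
y_pred = [2.5, 5.0, 3.0, 8.0]

n = len(y_true)
# Mean absolute error: average size of the miss.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n    # -> 0.5
# Mean squared error: large misses dominate.
mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / n  # -> 0.375

# R^2: 1.0 is a perfect fit, 0.0 is no better than guessing the mean.
mean_y = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
ss_tot = sum((t - mean_y) ** 2 for t in y_true)
r2 = 1 - ss_res / ss_tot
```

Notice the single error of 1.0 contributes half of the MAE but two thirds of the MSE: which metric you optimize encodes how much you care about occasional large misses versus overall average error.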
Principal Component Analysis sits somewhere between unsupervised learning and data processing. On the one hand, it’s an unsupervised method, but one that groups features together rather than points, as in a clustering algorithm. But principal component analysis ends up being most useful, perhaps, when used in conjunction with a supervised model, where it can be used for dimensionality reduction — reducing the number of feature variables. PCA is also useful in a few other situations, such as a way to filter random noise out of data, but it’s with an eye towards dimensionality reduction that we’ll consider it here.
What to do when you have too many…
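As a rough sketch of the idea (the two-feature data is invented for illustration, and a real project would reach for something like sklearn.decomposition.PCA), the principal components of two correlated features can be found by hand from their 2x2 covariance matrix:

```python
import math

# Two correlated features; values invented for illustration.
x = [2.5, 0.5, 2.2, 1.9, 3.1, 2.3, 2.0, 1.0, 1.5, 1.1]
y = [2.4, 0.7, 2.9, 2.2, 3.0, 2.7, 1.6, 1.1, 1.6, 0.9]

# Center each feature around its mean.
mx, my = sum(x) / len(x), sum(y) / len(y)
xc = [v - mx for v in x]
yc = [v - my for v in y]

# Sample covariance matrix [[cxx, cxy], [cxy, cyy]].
n = len(x) - 1
cxx = sum(v * v for v in xc) / n
cyy = sum(v * v for v in yc) / n
cxy = sum(a * b for a, b in zip(xc, yc)) / n

# Eigenvalues of the 2x2 matrix via the quadratic formula;
# each eigenvalue is the variance along one principal component.
tr, det = cxx + cyy, cxx * cyy - cxy * cxy
l1 = tr / 2 + math.sqrt(tr * tr / 4 - det)  # largest
l2 = tr / 2 - math.sqrt(tr * tr / 4 - det)  # smallest

# Share of total variance captured by the first component; if it is
# high, dropping the second dimension loses little information.
explained = l1 / (l1 + l2)
```

For this toy data the first component carries well over ninety percent of the variance, which is exactly the situation in which projecting onto fewer dimensions is cheap: the discarded directions held mostly noise.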
Predictive models generally require what is called ‘labeled’ data to train on; that is, data where the target variable has already been filled in. Of course, the goal is to use the model on unseen data where you don’t know the value of the target variable, but without properly labeled training data, you have no way of validating your model. For this reason, producing the data is often the hardest part of a data science project. Say, for instance, you want to teach a computer to read handwriting. It’s not enough to collect hundreds of pages of written words and scan them in. You also need to label this data: to have an accompanying ‘correct reading’ for each word. …
