Finished A Data Science Course? It’s time for projects!

Chris Bruehl
Learning Data
7 min read · Jan 11, 2024

Because courses are only the beginning of your education.

Learning data science is challenging. And as someone who’s been doing this for several years now, I can tell you that teaching data science is almost as challenging.

As to why, there are a few arguments I could make: a constantly evolving tech stack, the vast diversity of skills used by data scientists (despite the fact that at least 50% of your time will be spent in SQL), absurd job descriptions that make students feel inadequate because they aren’t PhD-level experts in multiple domains, and many more.

But the hardest part is helping students cement their skills and build their intuition for solving challenging problems. At Maven Analytics, we do our best to make our courses hands-on, with demos, assignments, and projects that mimic on-the-job tasks.

For courses that focus on tools, this can be enough to give someone the knowledge they need to pivot their data or get their hover tooltips working properly in a dashboard. But like data analytics, and possibly more so, data science is less about tools and more about the intuition, grit, and critical thinking it takes to solve challenging problems.

So, how do we build those skills? On-the-job experience and mentorship are the gold standard, but that doesn’t help folks who haven’t landed their first role. The next best step is projects.

Before I share some project ideas — a few notes:

  • These are intentionally left as rough outlines, rather than step-by-step instructions — we often need to scope and explore open-ended projects, and our goals may change as we get more familiar with the project.
  • In terms of data — I highly recommend building your own dataset for at least one of these. A quick search on Maven’s Data Playground, Kaggle, or Google will often lead to a suitable .csv file in each of the domains; however, getting comfortable building datasets via web scraping and APIs is great practice and will strengthen many of your skills.
  • You can pick and choose which pieces of each project you want to tackle. Don’t want to work with APIs? Find a flat file that contains similar data — you can always try working with APIs later!
  • There are no “right” answers here. There are definitely bad practices, but two data scientists working on the same ideas below could end up taking their projects in very different directions.
  • Be patient, and have fun!

So with all that said, here are a few project ideas you can get started on to put your data science skills to the test.

Regression: Predicting web scraped prices

Price prediction is a great use case for regression modeling. You may choose to focus more on predictive accuracy and minimizing your error, or you may opt for a simpler, more interpretable model that helps you really understand which product attributes most impact the price of a given good or service.

  1. Find a scrapable marketplace website like Craigslist
  2. Scrape apartment rental listings, used car listings, TVs, etc. in your area and predict prices using regression analysis (see the sketch after this list)
  3. Which factors had the most impact on price? Were there any factors that you expected to be important that weren’t?
  4. How accurate is your model? Which cars are overpriced or underpriced according to your model? Would the underpriced cars be a good value to purchase or are there factors that your model isn’t capturing?
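
If you want to see the modeling half before wrestling with a scraper, here’s a minimal sketch. It assumes your scraper has already written listings to a hypothetical car_listings.csv with price, year, mileage, and brand columns — all placeholder names:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Hypothetical file and columns -- swap in whatever your scraper produced
df = pd.read_csv("car_listings.csv").dropna()

X = pd.get_dummies(df[["year", "mileage", "brand"]], drop_first=True)
y = df["price"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearRegression().fit(X_train, y_train)
print("MAE:", mean_absolute_error(y_test, model.predict(X_test)))

# Step 3: coefficients hint at which attributes move price the most
print(pd.Series(model.coef_, index=X.columns).sort_values())

# Step 4: large negative residuals = listings priced below model expectations
results = df.loc[X_test.index, ["price"]].assign(predicted=model.predict(X_test))
print((results["price"] - results["predicted"]).nsmallest(5))
```

A plain linear model is a sensible starting point here because the coefficients directly answer question 3; you can always swap in a tree-based model later if accuracy matters more than interpretability.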

Time Series Forecasting: Temperature & Energy Consumption

Energy consumption and temperature are inextricably linked. As the temperature drops below room temperature, more energy must be used to heat buildings. As temperature rises above room temperature, more energy is used to cool them. These seasonal effects can be seen at both an hourly and an annual level. Can you successfully predict how much energy will be consumed in the next 24 hours? 72 hours? The next week?

  1. Find an energy consumption dataset — many utility companies offer this on their sites, and there are many on Kaggle. Here is one example.
  2. Use seasonal decomposition to understand what types of seasonality exist in your data
  3. Build a model trained only on energy consumption first
  4. Then, try incorporating temperatures to see if that improves your accuracy
  5. Remember, you won’t have actual temperatures when you are forecasting energy consumption, so you will need to use forecasted temperatures as inputs to your energy forecasts — the sketch below mimics this with held-out test temperatures
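
Here’s a minimal sketch of steps 2 through 5 using statsmodels, assuming an hourly dataset with hypothetical consumption and temperature columns (adjust the names and frequency to your data):

```python
import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.statespace.sarimax import SARIMAX

# Hypothetical hourly data with 'consumption' and 'temperature' columns
df = (pd.read_csv("energy.csv", index_col="timestamp", parse_dates=True)
        .asfreq("h")
        .interpolate())

# Step 2: daily seasonality shows up with period=24 on hourly data
seasonal_decompose(df["consumption"], period=24).plot()

# Steps 3-4: consumption-only model, then temperature as an exogenous input
train, test = df.iloc[:-24], df.iloc[-24:]
base = SARIMAX(train["consumption"],
               order=(1, 0, 1), seasonal_order=(1, 1, 1, 24)).fit(disp=False)
with_temp = SARIMAX(train["consumption"], exog=train[["temperature"]],
                    order=(1, 0, 1), seasonal_order=(1, 1, 1, 24)).fit(disp=False)

# Step 5: in production these would be *forecasted* temperatures, not actuals
forecast = with_temp.forecast(steps=24, exog=test[["temperature"]])
```

The SARIMA orders here are placeholders — let the decomposition plot and your error metrics guide what you actually fit.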

Classification: Unbalanced Data

Many classification datasets used in courses have a near equal split among classes. For example, the famous iris dataset has exactly 50 observations for each of the three species of iris flower in the dataset. But many of the classification problems data scientists solve on the job have an unbalanced distribution of classes. Many ecommerce problems (churn, ad clicks, purchase) have event rates of 1% or less. Financial fraud, disease detection, and many other problems also have a large imbalance.

  1. This credit card fraud dataset on Kaggle is a good example of imbalanced data
  2. Try fitting a model without imbalance techniques to get a baseline
  3. Then, try sampling strategies like downsampling & SMOTE
  4. Adjust your model’s threshold to maximize F1 score
  5. You can also try changing the class weights in your model — the sketch below touches on each of these techniques
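
A minimal sketch of steps 2 through 5, assuming the Kaggle file’s schema (a Class column marking fraud), scikit-learn, and the imbalanced-learn package for SMOTE:

```python
import pandas as pd
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("creditcard.csv")  # Kaggle fraud data: far under 1% positive
X, y = df.drop(columns="Class"), df["Class"]
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=42)

# Step 2: baseline with no imbalance handling
baseline = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

# Step 3: SMOTE synthesizes minority samples -- on the training set only
X_sm, y_sm = SMOTE(random_state=42).fit_resample(X_tr, y_tr)
smote_model = LogisticRegression(max_iter=1000).fit(X_sm, y_sm)

# Step 5: class weights as an alternative to resampling
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

# Step 4: sweep probability thresholds to maximize F1
probs = weighted.predict_proba(X_te)[:, 1]
best = max((f1_score(y_te, probs >= t / 100), t / 100) for t in range(5, 96))
print(f"best F1 = {best[0]:.3f} at threshold {best[1]:.2f}")
```

One detail worth internalizing: resample only the training data, never the test set — otherwise your metrics won’t reflect the real-world class balance.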

Unsupervised Learning: Clustering Restaurants

Clustering, and unsupervised learning in general, is a challenging topic — not because the algorithms are tricky to learn, but because there is no “right” answer or clear accuracy metric like we get in supervised learning. This can be frustrating for students (and yours truly) who are used to supervised learning problems. On the other hand, the creativity we’re able to apply to these problems can be immensely fun.

Here, I’m suggesting clustering restaurants, but this can be applied to a wide variety of domains. You should let your own interests drive this — for example, you may want clusters for “trendy and expensive”, “cheap eats”, “Under No Circumstances”, etc., based on your own preferences.

  1. Perform basic EDA to understand your data
  2. Think critically about a sensible limit on the number of clusters — is a solution with 30 different clusters going to be helpful?
  3. Use a scree plot to home in on an optimal number of clusters (see the sketch after this list)
  4. Are there any outliers or small clusters worth investigating?
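
A minimal sketch of the scree-plot workflow with K-Means, assuming a hypothetical restaurants.csv with avg_price, rating, and review_count columns (swap in whatever features you engineer):

```python
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical restaurant features -- scale them so no one feature dominates
df = pd.read_csv("restaurants.csv")
X = StandardScaler().fit_transform(df[["avg_price", "rating", "review_count"]])

# Step 3: scree/elbow plot of inertia for k = 2..10
inertias = [KMeans(n_clusters=k, n_init=10, random_state=42).fit(X).inertia_
            for k in range(2, 11)]
plt.plot(range(2, 11), inertias, marker="o")
plt.xlabel("k"); plt.ylabel("inertia"); plt.show()

# Refit at your chosen k, then profile each cluster so you can name it
df["cluster"] = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
print(df.groupby("cluster")[["avg_price", "rating", "review_count"]].mean())
```

The group-by profile at the end is where the fun is — it’s what lets you decide whether cluster 2 is “cheap eats” or “Under No Circumstances.”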

Natural Language Processing: Product Review Topic Modelling

NLP has entered a new era with the rise of LLMs like ChatGPT and Google Bard. But being able to work with text data is still very important. Topic modeling is a great way to sort many documents into broad categories. This type of exercise can benefit both consumers and sellers.

What aspects of the product do customers keep talking about? Is there a common complaint about service or quality? Is there a feature many customers think is missing? And so on. Topic modeling helps you quickly summarize many documents into topics that are easy to digest.

  1. Find a dataset that has many samples of text. One option is to use a social media API like Twitter (X), Reddit, etc. and pull posts for a given subject. Kaggle also offers similar datasets stored in flat files.
  2. Use a topic modelling algorithm — try tuning the number of topics to reveal more general and more specific topics — both ends of the spectrum can reveal unique insights
  3. Visualize your topics using a tool like pyLDAvis, which helps you understand the frequency of each topic and its similarity to others — this can help you further tune the number of topics (a starter sketch follows this list)
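
Here’s a minimal LDA sketch with scikit-learn, assuming a hypothetical reviews.csv with a review_text column; the topic count (n_components) is the knob to experiment with:

```python
import pandas as pd
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

reviews = pd.read_csv("reviews.csv")["review_text"].dropna()  # hypothetical names

# LDA works on raw term counts, so use CountVectorizer rather than TF-IDF
vec = CountVectorizer(stop_words="english", max_df=0.95, min_df=5)
dtm = vec.fit_transform(reviews)

# Step 2: tune n_components up and down -- broad themes vs. specific ones
lda = LatentDirichletAllocation(n_components=8, random_state=42).fit(dtm)

# Print the top 10 words per topic as a quick sanity check
words = vec.get_feature_names_out()
for i, topic in enumerate(lda.components_):
    top = topic.argsort()[-10:][::-1]
    print(f"Topic {i}:", ", ".join(words[j] for j in top))

# Step 3: pyLDAvis can render an interactive view of these same objects
```

Reading the top words per topic is usually enough to tell whether eight topics is too few, too many, or about right for your corpus.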

Visualization: Interactive Dashboard Web Application

Finally, you should build at least one web application, using a framework like Dash, Shiny, or Streamlit. Ideally this will incorporate a machine learning model, possibly from one of the previous projects listed. I have really come to like the site pythonanywhere.com for deploying personal projects — they will allow you to deploy one project for free, or multiple for a relatively low monthly price. Regardless, deploying a web application with a trained model included will showcase your ability to put together an end-to-end project, demonstrating your visualization and design skills as well as showing you can learn libraries and skills outside of the core Pandas/Scikit-Learn stack.

  1. Pick a topic you are passionate about, and find data in that domain. Ideally this data is conducive to machine learning, but if not, there is still a lot of value in a project like this.
  2. Clean and explore your data — what are the interesting insights?
  3. What questions should your application help answer?
  4. Remember that less is often more. Data visualization projects can easily get overstuffed with a wide variety of visuals that add clutter and don’t help answer the question. A clean, easy-to-use application with a handful of useful visualizations will be much more useful than a 12-tab dashboard that no one will look at (a minimal sketch follows this list)
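
To make this concrete, here’s a minimal Streamlit sketch that assumes the hypothetical car listings data from the regression project and a trained model saved as a full scikit-learn Pipeline (all file and column names are placeholders):

```python
# app.py -- run locally with: streamlit run app.py
import joblib
import pandas as pd
import streamlit as st

st.title("Used Car Price Explorer")  # hypothetical topic

df = pd.read_csv("car_listings.csv")
brand = st.selectbox("Brand", sorted(df["brand"].unique()))
subset = df[df["brand"] == brand]

# A handful of focused visuals beats a 12-tab dashboard
st.bar_chart(subset.groupby("year")["price"].median())

# Assumes price_model.joblib is a full sklearn Pipeline (encoding included)
model = joblib.load("price_model.joblib")
year = st.slider("Year", 2000, 2024, 2018)
mileage = st.slider("Mileage", 0, 200_000, 60_000)
row = pd.DataFrame([{"year": year, "mileage": mileage, "brand": brand}])
st.metric("Predicted price", f"${model.predict(row)[0]:,.0f}")
```

Saving the preprocessing and model together as one Pipeline is the design choice that makes this easy — the app never has to replicate your training-time feature engineering.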

Overall, there are a lot of options for portfolio projects here. These are a good opportunity not only for you to get additional practice and showcase your skills, but also to go deeper in the learning process as you tackle tricky problems that require techniques and functions that weren’t taught in your courses.

Do you have any other ideas for projects data scientists should tackle? Let us know in the comments! Feel free to share any project work you’ve done as well, I love seeing all the amazing things folks in this field do with their skills!

If you enjoyed this article, give me a follow! I write regularly about topics like Python, Pandas, and transitioning from data analytics to data science. I also have courses on these topics available on the Maven Analytics Platform and Udemy — would love to see you there!
