Getting Started with Data Science: Beginner Level Datasets

Nirmalya Misra
Machine Learning India
4 min read · Jul 20, 2021

Introduction

Well, what is the best way to learn something new? Learn the methods and processes by heart? Not really. The best way to learn is through the application of whatever one has learned. This helps a beginner to experience real-life scenarios.

So, what better way to start your Data Science journey than to get your hands dirty with data? In this article, I go through three datasets that you can start working on right away, even if you are a beginner in the field. We will use a platform called Kaggle.

Starter Material for absolute beginners

I would like to make it clear that Machine Learning and Data Science cover such a wide variety of topics that I cannot single out just a few as prerequisites. For this article, I’d like the reader to be comfortable with basic Python, up to conditionals and loops. Some libraries are also required, but we will learn them on the go.

I would suggest you open up Kaggle and create an account if you don’t have one already. Once you have created an account, you will see a lot of options on the left-hand side of the screen, such as HOME, COMPETE, DATA, etc. Click on Courses and find the course “Intro to Machine Learning”. If you are a beginner, I’d like you to complete this course. It will give you a good base when it comes to coding for Data Science projects. It should take at most 2–3 hours (including the exercises). If you already have some experience with Data Science, you can use it to check that you know all the ideas presented in the course.

Now that we have everyone on an equal footing, we can start with the Beginner level Datasets you could practice your newly gained skills on. Let us start with a recommendation given at the end of the Intro to Machine Learning course — Titanic: Machine Learning from Disaster.

Titanic: Machine Learning from Disaster

Who hasn’t heard of the Titanic, the ship that met its fate after hitting an iceberg in the Atlantic Ocean? This dataset contains the list of people aboard the Titanic and some of their details. We have to predict whether a person, given their details, would have survived the disaster or not.

This is in the form of a competition. You train your model on the training set, submit your model’s predictions on the test set, and your score is evaluated accordingly.
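To make the submission step concrete, here is a minimal sketch in plain Python. It assumes the submission is a CSV file with the two columns PassengerId and Survived. The test rows are made up for illustration, and the predict function is only a placeholder for whatever model you train.

```python
import csv
import io

# Hypothetical test-set rows; the values are invented for illustration.
test_rows = [
    {"PassengerId": "892", "Sex": "male"},
    {"PassengerId": "893", "Sex": "female"},
    {"PassengerId": "894", "Sex": "male"},
]


def predict(row):
    """Placeholder model: predict 1 (survived) for women, 0 otherwise."""
    return 1 if row["Sex"] == "female" else 0


def build_submission(rows):
    """Return the contents of a submission file as a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])  # header row
    for row in rows:
        writer.writerow([row["PassengerId"], predict(row)])
    return buffer.getvalue()


submission = build_submission(test_rows)
print(submission)
```

In a real notebook you would write this string to a file (or let Pandas do it for you) and upload that file on the competition page.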

This dataset encompasses almost everything a Data Scientist would have to do except data collection. Right from cleaning the data to making new features to analyzing features and finally making a model and predicting outcomes, this dataset has it all.
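The cleaning, feature-making and predicting steps mentioned above can be sketched with nothing more than loops and conditionals. The column names (Sex, Age, SibSp, Parch, Survived) follow the Titanic dataset, but the five rows below are invented, FamilySize is just one feature you might try, and the "women survive" rule is a simple baseline rather than a trained model.

```python
from statistics import median

# Toy rows using Titanic column names; the values are invented for illustration.
passengers = [
    {"Sex": "male", "Age": 22.0, "SibSp": 1, "Parch": 0, "Survived": 0},
    {"Sex": "female", "Age": 38.0, "SibSp": 1, "Parch": 0, "Survived": 1},
    {"Sex": "female", "Age": None, "SibSp": 0, "Parch": 0, "Survived": 1},
    {"Sex": "male", "Age": 35.0, "SibSp": 0, "Parch": 0, "Survived": 0},
    {"Sex": "male", "Age": None, "SibSp": 0, "Parch": 2, "Survived": 1},
]

# 1. Cleaning: fill missing ages with the median of the known ages.
known_ages = [p["Age"] for p in passengers if p["Age"] is not None]
median_age = median(known_ages)
for p in passengers:
    if p["Age"] is None:
        p["Age"] = median_age

# 2. New feature: family size = siblings/spouses + parents/children + the passenger.
for p in passengers:
    p["FamilySize"] = p["SibSp"] + p["Parch"] + 1


# 3. Baseline "model": predict survival for women only.
def predict(passenger):
    return 1 if passenger["Sex"] == "female" else 0


# 4. Evaluation: fraction of rows where the prediction matches the label.
correct = sum(1 for p in passengers if predict(p) == p["Survived"])
accuracy = correct / len(passengers)
print("Baseline accuracy on the toy rows:", accuracy)
```

A baseline like this gives you a number to beat once you start training real models on the full training set.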

Iris Species

With the Iris dataset, the task is to predict which species a given Iris flower belongs to, given features such as Sepal Length, Petal Length, etc. What is so special about this dataset, I hear you ask. Well, this dataset is, first of all, very simple and is one of the first datasets beginners pick up.

On this dataset, a lot of emphasis can be put on Exploratory Data Analysis, or EDA. The task of EDA is to find relationships between the columns of the dataset. It can be done by looking at summary numbers or by using several kinds of plots.
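As a small illustration of the "numbers" side of EDA, the sketch below computes the average petal length per species and the correlation between two columns. It uses only plain Python. The six rows are typed in by hand for illustration, the column names are assumed to match the Kaggle CSV, and the pearson helper is written out here rather than taken from a library.

```python
from collections import defaultdict
from statistics import mean

# A handful of illustrative rows; load the full CSV for real analysis.
flowers = [
    {"PetalLengthCm": 1.4, "PetalWidthCm": 0.2, "Species": "Iris-setosa"},
    {"PetalLengthCm": 1.4, "PetalWidthCm": 0.2, "Species": "Iris-setosa"},
    {"PetalLengthCm": 4.7, "PetalWidthCm": 1.4, "Species": "Iris-versicolor"},
    {"PetalLengthCm": 4.5, "PetalWidthCm": 1.5, "Species": "Iris-versicolor"},
    {"PetalLengthCm": 6.0, "PetalWidthCm": 2.5, "Species": "Iris-virginica"},
    {"PetalLengthCm": 5.1, "PetalWidthCm": 1.9, "Species": "Iris-virginica"},
]

# Group petal lengths by species, then average each group.
lengths_by_species = defaultdict(list)
for flower in flowers:
    lengths_by_species[flower["Species"]].append(flower["PetalLengthCm"])
mean_petal_length = {s: mean(v) for s, v in lengths_by_species.items()}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    mean_x, mean_y = mean(xs), mean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


petal_lengths = [f["PetalLengthCm"] for f in flowers]
petal_widths = [f["PetalWidthCm"] for f in flowers]
length_width_corr = pearson(petal_lengths, petal_widths)

print(mean_petal_length)
print("Correlation (petal length vs. width):", round(length_width_corr, 3))
```

If the per-species averages are far apart, that column is likely to be useful for telling the species apart, which is exactly the kind of relationship EDA is meant to surface.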

This dataset takes away some of the attention from making models, which is important, but sometimes it makes more sense to spend time exploring the dataset.

Although the task of classification seems simple, having fewer features makes it challenging.

Red Wine Quality

Finally, we have a regression-type problem. For this dataset, the value of wine quality has to be predicted given a set of features such as fixed acidity, volatile acidity, residual sugar, alcohol, etc. Well, for a change, a numeric value has to be predicted here rather than a category. The quality is scored on a scale of 0 to 10.

This dataset will help you understand dependent and independent variables and the relationships between them. So, EDA assumes importance once again. Many micro-decisions will have to be made while solving this problem, which will help you improve as you move further.
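To see the idea of dependent and independent variables in code, here is a minimal sketch of a straight-line fit with one independent variable (alcohol) and one dependent variable (quality). The four data points are invented for illustration, and fit_line is a hand-written least-squares helper, not a library function. A real solution would use all the features and a proper library.

```python
# Toy data: alcohol is the independent variable, quality the dependent one.
# The values are invented for illustration.
alcohol = [9.0, 10.0, 11.0, 12.0]
quality = [5, 5, 6, 7]


def fit_line(xs, ys):
    """Least-squares fit of y = intercept + slope * x."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


slope, intercept = fit_line(alcohol, quality)

# Predict the quality of a hypothetical wine with 11.5% alcohol.
predicted = intercept + slope * 11.5
print("slope:", round(slope, 2), "intercept:", round(intercept, 2))
print("predicted quality:", round(predicted, 2))
```

A positive slope here means that, in the toy data, quality tends to rise with alcohol content. Checking whether that holds on the real dataset is a good first EDA question.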

Bonus Tips

Here are some tips from my side which will help beginners get started:

  1. Don’t be afraid to ask for help. Almost everything is out there on the Internet and you can get whatever you need if you search for the right thing.
  2. These projects should not scare you. If they do, go back to learning Python and how to use libraries such as Pandas, NumPy, and Matplotlib.
  3. There is a plethora of datasets you can start with. These are just a few that I found beginner-friendly.
  4. Finally, whenever errors occur in your code, be sure to get to the bottom of them. Errors can teach you a lot more than anything else ever will.
