Data Science — A Walk-Through of a Tutorial

Aman Jaiman
Published in Analytics Vidhya · 6 min read · Dec 18, 2019

These days it seems like everyone is enthralled by data science. There is data all around us, as I’m sure you’ve heard, and people are taking advantage of public data sources to do analytics and make predictions. So how can someone new survive in this fast-growing (and competitive) field?

Machine Learning in a nutshell

Turns out it’s very easy to get started. If you are interested in learning about data science, there are plenty of resources online. Towards Data Science is a great Medium publication where you can read about and listen to trending topics in the field. You can view other people’s work and follow along with tutorials to get a better understanding of how to get your hands dirty with data.

Kaggle is a great resource for data enthusiasts as well! Kaggle provides users with free datasets (published by other users) and courses in everything you need to master working with data. They have a list of ongoing competitions that you can enter to test out your newly developed skills. Most importantly for new users, they have notebooks built on the Kaggle platform, where you can look at other people’s work and learn how to approach new tasks. Once you feel ready, start your own notebook on Kaggle and get started!

The best way to learn data science is through tutorials. These tutorials are put together by people who want to solve a particular task, and they go step by step through their entire process. Recently, my friends and I made a tutorial as a final project for our data science class. In the tutorial, we work with Airbnb data from New York City and look at different features as predictors of price. In this article, we’ll walk through not only that tutorial, but also what to expect when reading other tutorials.

Follow along with our full tutorial! Check it out on my website: amanjaiman.github.io/nyc-airbnb-data/

The first part of the tutorial deals with importing the right libraries, finding data to work with, and collecting that data. For us, that is fairly straightforward, as we work with data available on Kaggle. Because we are using a Kaggle notebook, we can simply get the data from Kaggle’s file system. If this isn’t the case, you can download the data in one of many formats and then import it using a library.
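With pandas, the loading step can look something like this. This is a minimal sketch, not the tutorial’s exact code, and the file path is an assumption about where Kaggle mounts the dataset:

```python
import pandas as pd

# Hypothetical path: Kaggle mounts attached datasets under ../input/,
# so adjust this to match the dataset you add to your notebook.
df = pd.read_csv("../input/new-york-city-airbnb-open-data/AB_NYC_2019.csv")

# Quick sanity check that the load worked.
print(df.shape)
df.head()
```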

Part of our dataframe, showing the first 15 columns and first 5 rows.

Here we have a CSV file that we load into a dataframe. A dataframe is a two-dimensional, labeled data structure: think of it as a table of data, where each row corresponds to a new entry and each column is a different variable. Most of the data you will work with will come in a similar format because it is easy to work with.

After getting the data, we walk through what this data actually means and what it contains. In our case, we talk about each of the columns and its relevance. We also take a look at what specific data points we’re working with. For example, we deal with five different neighbourhood groups (Brooklyn, Manhattan, Queens, Staten Island, Bronx) and three different room types (Shared, Private, Entire Home).
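In code, this “what are we working with?” step is mostly a handful of quick checks. A sketch, assuming the column names from the public NYC Airbnb dataset (neighbourhood_group, room_type):

```python
# Column names here are assumptions based on the public NYC Airbnb
# Open Data file; adjust if your copy of the data differs.
print(df["neighbourhood_group"].unique())  # Brooklyn, Manhattan, Queens, Staten Island, Bronx
print(df["room_type"].unique())            # Shared room, Private room, Entire home/apt
print(df.isnull().sum())                   # which columns have missing values?
print(df.describe())                       # basic stats for the numeric columns
```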

Why this is important: This is where you’ll preprocess the data and figure out what you’re dealing with. You’ll be able to look at some of the basic features and start to formulate what you want to do with the data set in the next steps.

Next up is my favorite step: exploratory data analysis (EDA). This is the process of digging deeper into the data and visualizing it in easy-to-understand ways.

Left: A heat map based on prices, Manhattan is more red than other boroughs. Right: Price distribution for each borough.

Here we can see the price distribution for each borough. As we expect, Manhattan has a higher mean price, which makes sense because more people visit Manhattan than the other four boroughs.
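A per-borough distribution plot like the one on the right can be produced with seaborn. The plot type and the price cutoff below are illustrative assumptions, not the tutorial’s exact choices:

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Prices have a long tail, so clip the view to keep the plot readable.
subset = df[df["price"] < 500]
sns.violinplot(data=subset, x="neighbourhood_group", y="price")
plt.title("Airbnb price distribution by borough")
plt.show()
```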

Also, just for fun, we decided to use Plotly and make an actual map of New York City, with data points colored by price. Manhattan is clearly more red than the other areas.
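Something along these lines with Plotly Express will draw the listings on a map, colored by price. The styling choices (clipping the color scale, the open-street-map tiles) are assumptions for the sketch:

```python
import plotly.express as px

# Clip the price used for coloring so a few very expensive listings
# don't wash out the color scale.
plot_df = df.assign(clipped_price=df["price"].clip(upper=500))
fig = px.scatter_mapbox(
    plot_df, lat="latitude", lon="longitude", color="clipped_price",
    color_continuous_scale="RdYlBu_r",  # red = expensive
    zoom=9, mapbox_style="open-street-map",
)
fig.show()
```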

We hadn’t touched the text data yet, so we decided to do a little bit more and add some natural language processing. That’s a term you’ll be hearing a lot in this field. NLP is the practice of looking at written or spoken language and figuring out the meaning behind the words and sentences we use. We looked at the top 25 words hosts used when naming their property, as well as the general sentiment behind the names and how that correlates with price.
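A rough sketch of both steps follows. The tutorial’s exact tooling isn’t reproduced here, so the simple word count and NLTK’s VADER sentiment analyzer are assumptions:

```python
from collections import Counter

import nltk
from nltk.sentiment.vader import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")  # one-time download of the sentiment lexicon

# Top 25 words across listing names.
word_counts = Counter()
for name in df["name"].dropna():
    word_counts.update(word.lower() for word in name.split())
print(word_counts.most_common(25))

# Compound sentiment score per listing name, then its correlation with price.
sia = SentimentIntensityAnalyzer()
df["name_sentiment"] = df["name"].fillna("").apply(
    lambda name: sia.polarity_scores(name)["compound"]
)
print(df[["name_sentiment", "price"]].corr())
```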

Why this is important: EDA is key to drawing conclusions about what the data is showing you. You’ll gain key insights about certain variables and start to form expectations about what you will encounter when creating your model. It’s also fun making nice-looking visuals!

If you’ve been following along with our tutorial, you’ll see that we’re now at the Predicting Price section. This leads into the third part of the tutorial: some form of machine learning. This is the most complex section of the tutorial, and it deals with trying to predict one of the variables using the others. For our data set, we thought it would be appropriate to predict price from the numerical variables (we also encode some of the categorical variables with numeric values).
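A sketch of what that setup can look like. The specific feature list and the use of scikit-learn’s LinearRegression are assumptions about, not a copy of, the tutorial’s code:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Numeric features plus one-hot-encoded categoricals (borough and room type).
numeric_cols = ["minimum_nights", "number_of_reviews", "reviews_per_month",
                "calculated_host_listings_count", "availability_365"]
X = df[numeric_cols].fillna(0).join(
    pd.get_dummies(df[["neighbourhood_group", "room_type"]])
)
y = df["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("R^2 on held-out data:", r2_score(y_test, model.predict(X_test)))
```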

Our first model doesn’t seem to be doing too hot.

Our first model didn’t perform the greatest, which we attributed to the outliers in the data and some of the assumptions the model makes. Seeing this, and after doing some statistical analysis, we decided to perform a log transformation on price. We see that this makes our model better! Our residuals are now normally distributed. We then look at the specific impact of each variable on price and remove the ones that are not helping our model. We’re left with the final model.
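Continuing from the previous sketch (reusing df, X, and LinearRegression), the log-price refit might look like this. It is an assumption about the implementation; the important idea is modeling log(price) and checking the residuals and coefficients:

```python
import numpy as np

# Drop free ($0) listings so the log is defined, then refit on log(price).
mask = df["price"] > 0
log_price = np.log(df.loc[mask, "price"])

log_model = LinearRegression().fit(X[mask], log_price)
residuals = log_price - log_model.predict(X[mask])

# Residuals should now look roughly normal; a histogram is a quick check.
# Coefficients per feature show which variables help and which can be dropped.
print(pd.Series(log_model.coef_, index=X.columns).sort_values())
```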

Our model coefficients. Note that Manhattan has the largest positive correlation to price.

Why this is important: The machine learning step is the final piece of what you set out to do. You collected your data and looked at some of the features, but now you are able to actually find a relationship between the features and use its predictive power in the future.

Once we summarize our findings, the tutorial is finished!

We’ve been able to walk through the entire process step by step: finding data and figuring out what we’re working with, exploratory data analysis, and machine learning. Hopefully this helps you understand the process of working with data and serves as a guide for what to expect when you come across a new tutorial. Once you understand why the different steps are so important, it’s much easier to approach a new problem yourself.

The next step is to pick some data you’re interested in on Kaggle and get started!
