The sunset view from the Sky View Observatory at Columbia Center, the tallest building in Seattle.

Do you want to be a Data Scientist?

Let's begin with data analysis using the CRISP-DM methodology

Bengü Banu Birinci
Apr 15, 2020

As a data analyst, I am now on my way to becoming a data scientist. Along this path, while extending my skills, I would like to share some of the main topics I have worked on, and this post is the first one.

Definition of CRISP-DM

CRISP-DM is the Cross-Industry Standard Process for Data Mining. This methodology ensures that the process progresses, and can be understood, step by step.

These steps do not have sharp boundaries: while making sense of the data, new business questions may arise that take you back to the business understanding phase, or while evaluating the outputs, you may return to business understanding because the business needs have changed.

We will go through these steps one by one while analysing the Seattle and Boston Airbnb 2016–2017 data.

CRISP-DM stages (image: https://commons.wikimedia.org/wiki/File:CRISP-DM_Process_Diagram.png)

Business Understanding

First of all, we need to understand the business cases and requirements. In this case, there are a few questions about Airbnbs that need to be answered:

1. How is the availability of Airbnbs in Seattle changing monthly?

2. How is the price of Seattle Airbnbs changing city by city?

3. How does the price compare between Seattle and Boston?

Data Understanding

We need to understand what data we have and how we can answer our questions with it: do we need more data, or can we answer even more questions than we planned?

While trying to understand data, new business questions may arise.

In the data understanding phase, we look at the big picture and plan the work to be done, and we see that the first task is preparing the data to be analysed.
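As a first, minimal look at the data in pandas (a sketch only, assuming the public Kaggle Seattle Airbnb files listings.csv and calendar.csv; file paths and column names may differ in your copy):

```python
import pandas as pd

# Hypothetical paths to the Kaggle Seattle Airbnb files
listings = pd.read_csv("seattle/listings.csv")
calendar = pd.read_csv("seattle/calendar.csv")

# First impression: size, column types and where the nulls are
print(listings.shape)
print(listings.dtypes)
print(listings.isnull().sum().sort_values(ascending=False).head(10))

# Summary statistics for the numeric columns
print(listings.describe())
```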

Data Preparation

I think this is the most challenging phase: you need to spend most of your time preparing the data for modelling and analysis, and you may need to use descriptive or inferential statistics or various data visualization libraries, for example:

1. Checking data types and nulls (making decisions about handling nulls in the data)

2. Using descriptive or inferential statistics

3. Column-based transformations (convert, rename, remove, split)

4. Removing unnecessary data with simple or complex filters

and so on.
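A minimal sketch of some of these steps with pandas (the price column and its "$1,200.00" string format are assumptions based on the public Airbnb files, not a guaranteed layout):

```python
import pandas as pd

listings = pd.read_csv("seattle/listings.csv")

# 1. Check data types and nulls before deciding how to handle them
print(listings.dtypes)
print(listings.isnull().mean().sort_values(ascending=False).head())

# 3. Column-based transformation: convert price from "$1,200.00" to a float
listings["price"] = (
    listings["price"].str.replace("[$,]", "", regex=True).astype(float)
)

# 1. Handle nulls: here, simply drop rows where the price is missing
listings = listings.dropna(subset=["price"])

# 4. Remove unnecessary data with a simple filter, e.g. implausible prices
listings = listings[listings["price"] < 1000]
```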

Data Modelling

If you need to answer some questions with machine learning algorithms, such as predicting the price, you need to model the data and decide which model you are going to use to answer the question.

Also, while modelling, you may need to go back to the data preparation phase.
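For instance, a simple price model could look roughly like the sketch below; this is a plain scikit-learn linear regression on a few assumed numeric columns, not necessarily the model used in this project:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

listings = pd.read_csv("seattle/listings.csv")
listings["price"] = (
    listings["price"].str.replace("[$,]", "", regex=True).astype(float)
)

# Assumed numeric features; use whatever columns exist in your data
features = ["accommodates", "bathrooms", "bedrooms"]
data = listings.dropna(subset=features + ["price"])

X_train, X_test, y_train, y_test = train_test_split(
    data[features], data["price"], test_size=0.3, random_state=42
)

model = LinearRegression().fit(X_train, y_train)
print("Test R^2:", r2_score(y_test, model.predict(X_test)))
```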

Evaluation of the Results

After the data understanding and preparation parts, we can answer the questions without modelling, using some descriptive statistics and data visualization.

In the data preparation phase we use exploratory visualizations, but when evaluating the outputs with an audience we need to use explanatory visualizations.

I found the answers to my questions as below.

How is the availability of Airbnbs changing by month in Seattle?

It seems that availability is above 50% for all months of the year, but we can say that the availability rate gradually decreases due to the summer season.
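A sketch of how this monthly rate can be computed and plotted (assuming the Kaggle calendar.csv layout, where availability is stored as a 't'/'f' flag):

```python
import pandas as pd
import matplotlib.pyplot as plt

calendar = pd.read_csv("seattle/calendar.csv", parse_dates=["date"])

# Turn the 't'/'f' flag into 1/0, then take the monthly share of available days
calendar["available"] = (calendar["available"] == "t").astype(int)
monthly = calendar.groupby(calendar["date"].dt.month)["available"].mean() * 100

# Explanatory version of the chart: a clear title and labelled axes
monthly.plot(kind="bar")
plt.title("Availability of Seattle Airbnbs by month")
plt.xlabel("Month")
plt.ylabel("Available listing-days (%)")
plt.tight_layout()
plt.show()
```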

How is the price of Seattle Airbnbs changing city by city?

When calculating the average monthly prices of Airbnbs in Seattle, it seems that the city center has the highest values.
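One possible way to get these averages (again assuming the price string format and a city column in listings.csv):

```python
import pandas as pd

listings = pd.read_csv("seattle/listings.csv")
listings["price"] = (
    listings["price"].str.replace("[$,]", "", regex=True).astype(float)
)

# Average nightly price per city area, highest first
price_by_city = listings.groupby("city")["price"].mean().sort_values(ascending=False)
print(price_by_city.head(10))
```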

How is the price changing when comparing Seattle and Boston Airbnbs?

It is obvious that Boston prices are higher than Seattle's, especially in September.
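One way to line the two cities up month by month (assuming both calendar.csv files share the same layout and carry the price as a "$…" string):

```python
import pandas as pd

def monthly_price(path, label):
    """Average listed price per month for one city's calendar file."""
    cal = pd.read_csv(path, parse_dates=["date"])
    cal["price"] = cal["price"].str.replace("[$,]", "", regex=True).astype(float)
    return cal.groupby(cal["date"].dt.month)["price"].mean().rename(label)

# Hypothetical file locations for the two data sets
comparison = pd.concat(
    [monthly_price("seattle/calendar.csv", "Seattle"),
     monthly_price("boston/calendar.csv", "Boston")],
    axis=1,
)
print(comparison)
```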

Deployment

This phase is also important because communication is the most important part of a data scientist's role. We can gain more knowledge and motivation for our job by sharing experiences and results.

As I have learned, being a data scientist is not just data analysis or designing a machine learning model; it requires adding business value to the work you are doing, so let's continue to improve our skills constantly.

Thank you for reading to the end :)

I shared the Python code and data set of this project on GitHub.

You can find the GitHub link here.
