[Data Analysis] Data Acquisition: Loading a CSV Dataset in Visual Studio Code (2.2/9)

Sam Taylor
6 min read · Sep 17, 2023

Learn how to load the Iris Flower dataset into VS Code with the help of a Jupyter Notebook, an essential skill for aspiring data analysts.

[This guide is part 2 of a 9-article walkthrough.]

Key concepts:

Visual Studio Code · Data analysis · Data acquisition · Python · Pandas

Photo by Pawel Czerwinski on Unsplash

Introduction:

In this guide, we’ll walk through loading the well-known Iris Flower dataset, stored as a CSV file, into VS Code using a Jupyter Notebook, so that it is ready for analysis.

To remind ourselves where in the data analysis process data acquisition comes into play, here is a general outline of the data analysis process:

  1. Define Objectives: Clearly understand the goals of your analysis.
  2. ➡️ Data Acquisition: Obtain the dataset you’ll be working with. This can involve importing data from sources such as CSV files, Excel spreadsheets, databases, APIs, or web scraping.
  3. Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
  4. Data Cleaning: Preprocess the data to ensure its quality and consistency.
  5. Data Visualization: Create visualizations to gain insights into the data.
  6. Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
  7. Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
  8. Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm, then train the model and evaluate its performance using metrics like accuracy, precision, recall, or F1-score.
  9. Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.

This article covers step 2, data acquisition: specifically, how to load a dataset we have already downloaded into VS Code, ready to be analysed.

Step 1: Install Python, VS Code, and Jupyter Notebook

First, if you don’t already have them installed, install the following programs:

  • Install Python: a programming language known for its simplicity and readability, from python.org.
    Step-by-step guide: here
  • Install VS Code: a code editor used for writing code across various programming languages, from code.visualstudio.com.
    Step-by-step guide: here
  • Install Jupyter Notebook: an application used to create documents with code, visualisations and text, by running the following command in your terminal or command prompt:
    Step-by-step guide for installing Jupyter Notebook: here
    Step-by-step guide for using the command terminal: here
pip install notebook
  • Install Pandas: a Python package used for data manipulation, by running the following command in your terminal or command prompt:
    Step-by-step guide for using the command terminal: here
pip install pandas
Installing a package via the terminal or command prompt on macOS
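Once the installs finish, it can help to confirm that Python can actually find the new packages before opening VS Code. The following is a minimal sketch using only the standard library; the helper name is_installed is made up for this example and is not part of pandas or Jupyter.

```python
from importlib.util import find_spec


def is_installed(package_name):
    """Return True if Python can locate the named top-level package."""
    return find_spec(package_name) is not None


# 'notebook' and 'pandas' are the two packages installed with pip above.
for name in ("notebook", "pandas"):
    if is_installed(name):
        print(f"{name}: installed")
    else:
        print(f"{name}: missing, try running 'pip install {name}'")
```

If a package shows as missing even though pip reported success, pip may have installed it into a different Python installation than the one running this check.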

Step 2: Open VS Code

  • Launch VS Code.

Step 3: Create a Jupyter Notebook

To start working on a dataset, we need to create a new file. To do so:

  • Create a new notebook: Click on “File” in the top-left corner, then select “New Notebook” to create a new Jupyter Notebook.
  • Attach a kernel: Click “Select Kernel” in the top-right corner of the notebook and choose your Python environment (usually Python 3).
VS Code: Opening a new file and adding a Kernel

Step 4: Import Python Libraries

Now that we’ve opened a new file, the first step is to import any libraries or packages that we will need:

  • Import Python libraries: Add the following code to your Jupyter Notebook cell:
# Import the pandas package and name it 'pd'

import pandas as pd
  • Run the cell: To run a cell, you can click on the ▷ icon next to the input cell or use the shortcut `shift + enter`, whilst you are in the cell.

Notes:

Importing Pandas as ‘pd’ is the standard naming convention.

Importing packages in VS Code

Step 5: Load the Dataset

Now that pandas is imported, we can load our dataset. To do so:

  • Add a new cell: Right-click on the last cell and select ‘Insert cell’ > ‘Insert code cell below’.
    ◦ Or, click on the ‘➕ Code’ button at the top of the notebook.
  • Load the CSV file: Type the following code into the new cell, to load the dataset:
    ◦ Here we are loading the iris_dataset.csv using the pd.read_csv() command and we are storing it in a variable named ‘df’.
    ◦ If you don’t already have it, you can download the iris dataset in CSV format from here and save it in your project folder [step-by-step instructions: here].
# Load the iris_dataset.csv and save it as the variable: df

df = pd.read_csv("iris_dataset.csv")
  • Run the cell: Make sure to run the cell, to ensure that the command has been executed.
Loading a dataset in VS Code using pd.read_csv()

Notes:

If you get an error saying the CSV file cannot be found, pandas is probably looking in a different folder, so pass the file’s full path instead. On macOS, right-click on the CSV file in Finder and hold down the `option` key.
◦ The option to copy the file path will appear.
◦ You can then copy and paste this into the pd.read_csv("FILEPATH") command.
Example: pd.read_csv("/Users/samtaylor/Desktop/iris_dataset.csv")

Storing a dataset as ‘df’ is also standard practice: ‘df’ stands for ‘data frame’.
◦ If you are working with many datasets at once, it is good practice to give each variable a short descriptive name instead.
◦ Example: iris = pd.read_csv("iris_dataset.csv").

Copying the filename/pathname of a file on your computer: right-click & hold down the ‘option’ key
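As an alternative to copying the path by hand, the "file not found" case can be handled in code. The sketch below uses Python's built-in pathlib module to look for the CSV in a list of folders and return its full path. The helper name find_csv and the two search folders are assumptions made for illustration; adjust them to wherever you saved the file.

```python
from pathlib import Path


def find_csv(filename, folders):
    """Return the full path to filename from the first folder that contains it."""
    for folder in folders:
        candidate = Path(folder).expanduser() / filename
        if candidate.is_file():
            return candidate.resolve()
    searched = ", ".join(str(folder) for folder in folders)
    raise FileNotFoundError(f"{filename} not found in: {searched}")


# Hypothetical search locations: the notebook's working folder and the Desktop.
search_folders = [Path.cwd(), Path.home() / "Desktop"]

try:
    csv_path = find_csv("iris_dataset.csv", search_folders)
    print(f"Found: {csv_path}")
except FileNotFoundError as error:
    print(error)
```

If the file is found, csv_path can be passed straight to the loading command, as in df = pd.read_csv(csv_path).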

Step 6: Verify the Data

Now that we have loaded iris_dataset.csv into VS Code, it’s good practice to check that it loaded correctly. To do so:

  • Add the following code to a new cell:
    ◦ Here we are using the .head() command on the ‘df’ variable.
    ◦ You can read this as: ‘Show me the head (top 5 rows) of df’.
# Display the top 5 rows of the dataset

df.head()
  • Run the cell.
    ◦ This should show you the first 5 rows of the dataset.
    ◦ Notice that the row numbers in the leftmost column start at 0, because Python counts from 0.
The head of the iris dataset in VS Code
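Beyond .head(), a few other checks can help confirm that a dataset loaded as expected. The sketch below runs them on a tiny in-memory sample so that it works without the real file. The three rows and the column names are illustrative assumptions about an Iris-style layout; the columns in your own CSV may be named differently.

```python
import io

import pandas as pd

# Three illustrative rows in an assumed Iris-style layout; the column names
# in your own CSV file may differ.
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "7.0,3.2,4.7,1.4,versicolor\n"
    "6.3,3.3,6.0,2.5,virginica\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)          # (number of rows, number of columns)
print(list(df.columns))  # column names read from the header row
print(df.dtypes)         # data type pandas inferred for each column
print(df.isna().sum())   # number of missing values in each column
```

With the real dataset, the same four lines can be run on the df created in Step 5. A row or column count that looks wrong, or a numeric column read in as text, usually points to a problem with the file itself.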

Step 7: Save and Run the Jupyter Notebook

  • Save the notebook: Save your Jupyter Notebook by clicking ‘File’ > ‘Save’.
  • Run the notebook: To run the notebook, click ‘Run All’ at the top of the window. This will run all cells in the order they appear in the notebook.
‘Run all’ can be found at the top of the VS Code notebook
Loading a dataset in VS Code: the whole process from start to finish

Congratulations! 🎉 You’ve successfully loaded the Iris Flower dataset in CSV form into VS Code using a Jupyter Notebook.

Now that you can load a dataset into VS Code, you’re ready to begin the next steps of data analysis: 🤓 checking your data, 🧹 cleaning your data, 📊 visualising your data, and more.

Happy analysing!

Sam Taylor

Operations Analyst & Data Enthusiast. Sharing insights to support aspiring data analysts on their journey 🚀. Discover more at: https://samtaylor92.github.io