[Data Analysis] Data Acquisition: Loading a CSV Dataset in Visual Studio Code (2.2/9)
Learn how to load the Iris Flower dataset into VS Code with the help of a Jupyter Notebook, an essential skill for aspiring data analysts.
[This guide is part 2 of a 9-article walkthrough.]
Key concepts:
Visual Studio Code · Data analysis · Data acquisition · Python · Pandas
Introduction:
In this guide, we’ll walk through loading the famous Iris Flower dataset, stored as a CSV file, into VS Code using a Jupyter Notebook.
To remind ourselves where in the data analysis process data acquisition comes into play, here is a general outline of the data analysis process:
- Define Objectives: Clearly understand the goals of your analysis.
- ➡️ Data Acquisition: Obtain the dataset you’ll be working with. This can involve importing data from various sources, such as CSV files, Excel spreadsheets, databases, APIs, or web scraping.
- Data Exploration: Explore the dataset to get an initial understanding of its structure and content.
- Data Cleaning: Preprocess the data to ensure its quality and consistency.
- Data Visualization: Create visualizations to gain insights into the data.
- Feature Engineering: Create new features or transform existing ones to enhance the dataset’s predictive power.
- Statistical Analysis (if applicable): Conduct statistical tests or analyses to answer specific questions or hypotheses.
- Machine Learning (if applicable): Split the data into training and testing sets, select an appropriate algorithm, then train the model and evaluate its performance using metrics like accuracy, precision, recall, or F1-score.
- Present solution: Interpret the findings in the context of your objectives. Document your analysis process and create a report or presentation summarising your analysis.
This article covers step 2, data acquisition: specifically, how to load a dataset we have already downloaded into VS Code, ready to be analysed.
Step 1: Install Python, VS Code, and Jupyter Notebook
First, if you don’t already have them installed, install the following programs:
- Install Python: a programming language known for its simplicity and readability, from python.org.
◦ Step-by-step guide: here
- Install VS Code: a code editor used for writing code across various programming languages, from code.visualstudio.com.
◦ Step-by-step guide: here
- Install Jupyter Notebook: an application used to create documents with code, visualisations and text, by running the following command in your terminal or command prompt:
◦ Step-by-step guide for installing Jupyter Notebook: here
◦ Step-by-step guide for using the command terminal: here
pip install notebook
- Install Pandas: a Python package used for data manipulation, by running the following command in your terminal or command prompt:
◦ Step-by-step guide for using the command terminal: here
pip install pandas
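Once the installs above have finished, a quick check like the following can confirm that Python can find both packages. This is a minimal sketch using only the standard library; the helper name check_installation is our own and is not part of any package.

```python
import importlib.util
import sys

# Import names of the packages installed in this step.
REQUIRED_PACKAGES = ["notebook", "pandas"]


def check_installation(packages):
    """Return a dict mapping each package name to True if Python can find it."""
    return {name: importlib.util.find_spec(name) is not None for name in packages}


# Report the Python version in use, then the status of each package.
version = sys.version_info
print(f"Python {version.major}.{version.minor}.{version.micro}")

status = check_installation(REQUIRED_PACKAGES)
for name, found in status.items():
    message = "installed" if found else "missing (try: pip install " + name + ")"
    print(f"{name}: {message}")
```

If a package is reported as missing, re-run the matching pip install command and check again.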
Step 2: Open VS Code
- Launch VS Code.
Step 3: Create a Jupyter Notebook
To start working on a dataset, we need to create a new file. To do so:
- Create a new notebook: Click on “File” in the top-left corner, then select “New Notebook” to create a new Jupyter Notebook.
- Attach a kernel: In the top-right of the notebook, click ‘Select Kernel’ and choose your Python environment (usually Python 3).
Step 4: Import Python Libraries
Now that we’ve opened a new file, the first step is to import any libraries or packages that we will need:
- Import Python libraries: Add the following code to your Jupyter Notebook cell:
# Import the pandas package and name it 'pd'
import pandas as pd
- Run the cell: To run a cell, click on the ▷ icon next to the input cell, or use the shortcut `shift + enter` while the cell is selected.
Notes:
Importing Pandas as ‘pd’ is the standard naming convention.
Step 5: Load the Dataset
Now that the packages are imported, we can load our dataset. To do so:
- Add a new cell: Right click on the last cell and select ‘Insert cell’ > ‘Insert code cell below’.
◦ Or, click on the ‘➕ Code’ button at the top of the notebook.
- Load the CSV file: Type the following code into the new cell, to load the dataset:
◦ Here we are loading the iris_dataset.csv using the pd.read_csv() command and we are storing it in a variable named ‘df’.
◦ If you don’t already have it, you can download the iris dataset in CSV format from here and save it in your project folder [step-by-step instructions: here].
# Load the iris_dataset.csv and save it as the variable: df
df = pd.read_csv("iris_dataset.csv")
- Run the cell: Make sure to run the cell, to ensure that the command has been executed.
Notes:
If you get an error saying the CSV file cannot be found, pass the file’s full path instead. On macOS, right-click on the CSV file in Finder and hold down the `option` key.
◦ The option to copy the filepath will appear.
◦ You can then copy and paste this into the pd.read_csv("FILEPATH") command.
◦ Example: pd.read_csv("/Users/samtaylor/Desktop/iris_dataset.csv")
Storing a dataset as ‘df’ is also standard practice: ‘df’ stands for ‘DataFrame’.
◦ If you are working with many datasets at once, good practice is to give each one a short descriptive name.
◦ Example: iris = pd.read_csv("iris_dataset.csv")
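A "file not found" error usually means the notebook is looking in a different folder from the one that holds the CSV. The sketch below is one possible way to make that easier to diagnose: it checks that the file exists before loading it and prints the folder the notebook is running in. The function name load_csv is a hypothetical helper of our own, not a pandas function.

```python
from pathlib import Path

import pandas as pd


def load_csv(filepath):
    """Load a CSV into a DataFrame, with a clearer error if the file is missing."""
    path = Path(filepath).expanduser()  # expands a leading "~" to the home folder
    if not path.is_file():
        raise FileNotFoundError(
            f"Could not find '{path}'. The notebook is running in: {Path.cwd()}"
        )
    return pd.read_csv(path)


# Relative paths such as "iris_dataset.csv" are resolved against this folder.
print(Path.cwd())
```

With the helper defined, df = load_csv("iris_dataset.csv") behaves like pd.read_csv("iris_dataset.csv"), but the error message also tells you where the notebook was looking.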
Step 6: Verify the Data
Now that we have loaded the iris_dataset.csv into VS Code, it’s good practice to double check that it loaded correctly. To do so:
- Add the following code to a new cell:
◦ Here we are using the .head() command on the ‘df’ variable.
◦ You can read this as: ‘Show me the head (top 5 rows) of df’.
# Display the top 5 rows of the dataset
df.head()
- Run the cell.
◦ This should show you the first 5 rows of the dataset.
◦ Notice that the row numbers on the left start at 0, because Python starts counting from 0.
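Beyond .head(), a few other DataFrame attributes give a quick sanity check on what was loaded. The sketch below uses a three-row stand-in for the CSV so that it runs on its own. The column names are an assumption and may differ from those in your copy of iris_dataset.csv.

```python
import io

import pandas as pd

# Stand-in for iris_dataset.csv: three rows with assumed column names.
sample_csv = io.StringIO(
    "sepal_length,sepal_width,petal_length,petal_width,species\n"
    "5.1,3.5,1.4,0.2,setosa\n"
    "4.9,3.0,1.4,0.2,setosa\n"
    "4.7,3.2,1.3,0.2,setosa\n"
)
df = pd.read_csv(sample_csv)

print(df.shape)          # (rows, columns) -> (3, 5) for this sample
print(list(df.columns))  # column names taken from the header row
print(df.dtypes)         # data type pandas inferred for each column
print(df.index[0])       # label of the first row -> 0
```

In your own notebook, skip the stand-in and run the same four print lines on the df you loaded in Step 5. The classic Iris dataset has 150 rows, so a much smaller row count would suggest the file did not load as expected.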
Step 7: Save and Run the Jupyter Notebook
- Save the notebook: Save your Jupyter Notebook by clicking ‘File’ > ‘Save’.
- Run the notebook: To run the notebook, click ‘Run All’ at the top of the window. This will run all cells in the order they appear in the notebook.
Congratulations! 🎉 You’ve successfully loaded the Iris Flower dataset in CSV form into VS Code using a Jupyter Notebook.
Now that you can load a dataset into VS Code, you’re ready to begin the next steps of data analysis: 🤓 checking your data, 🧹 cleaning your data, 📊 visualising your data, and more.
Happy analysing!