Exploring Pandas: Reading CSV Files into DataFrames
Before going through this article, please read the following for continuation:
Pandas Library for Data Analysis in Python
This article is part two of a Pandas series and covers how to read data from CSV files into a Pandas DataFrame.
What is a DataFrame?
A DataFrame in Pandas is a two-dimensional, size-mutable, and potentially heterogeneous tabular data structure with labeled axes (rows and columns). Think of it as a spreadsheet or SQL table, or a dictionary of Series objects.
Here’s a simple breakdown:
- Rows and Columns: DataFrames consist of rows and columns, where each column can be of a different data type (integer, float, string, etc.).
- Indexing: Rows and columns are labeled, allowing for easy access and manipulation of data.
- Size-Mutable: DataFrames can be expanded or shrunk, meaning you can add or remove rows and columns as needed.
Key Features of DataFrames:
1. Heterogeneous Data: Different columns can contain different data types.
2. Alignment: DataFrame allows for automatic and explicit data alignment.
3. Data Manipulation: Provides various functions for data manipulation, aggregation, and transformation.
4. Integration: Integrates easily with NumPy, which makes it well suited to numerical operations.
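As a rough illustration of the first two features, the sketch below builds a small DataFrame whose columns hold different data types, then adds two Series whose labels only partly overlap. The sample values and the names df, a, b, and total are made up for this example.

```python
import pandas as pd

# Hypothetical sample data: each column holds a different data type
df = pd.DataFrame({
    "Name": ["Alice", "Bob", "Charlie"],   # strings
    "Age": [25, 30, 28],                   # integers
    "Height": [1.65, 1.80, 1.75],          # floats
})

# Inspect the data type pandas inferred for each column
print(df.dtypes)

# Automatic alignment: values are matched by index label, not by position
a = pd.Series([1, 2, 3], index=["x", "y", "z"])
b = pd.Series([10, 20, 30], index=["y", "z", "w"])
total = a + b  # labels present in only one Series produce NaN
print(total)
```

Only the labels "y" and "z" appear in both Series, so only those two positions in total hold a sum; the other labels hold NaN.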
Creating a DataFrame
A DataFrame can be built directly from data that is already in memory.
One way to create a DataFrame is to pass a dictionary that maps each column name to a list of values; Pandas then infers each column's data type from those values. Here’s an example:
import pandas as pd
# Define column names and their values
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Age': [25, 30, 28]}
# Create DataFrame
df = pd.DataFrame(data)
# Print DataFrame
print(df)
This code creates a DataFrame with two columns, “Name” (string type) and “Age” (integer type), and populates it with three rows of data.
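Pandas infers the data types here. If a column needs a specific type, one option is to convert it after creation with astype(). The sketch below assumes you want “Age” stored as a floating-point number, which is only an illustrative choice.

```python
import pandas as pd

data = {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 28]}
df = pd.DataFrame(data)

# Convert the Age column to a floating-point type explicitly
df = df.astype({"Age": "float64"})

# Check the resulting data type of each column
print(df.dtypes)
```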
Reading CSV Files
The more common way to get data into a DataFrame is by reading it from a CSV file. The pd.read_csv() function is used for this purpose.
import pandas as pd
# Specify the file path
filepath = "data.csv" # Replace with your actual file path
df = pd.read_csv(filepath)
# Print DataFrame
print(df)
This code reads data from a CSV file named “data.csv” (replace with your actual file path) and creates a DataFrame. By default, pd.read_csv() assumes a comma (",") as the delimiter, the character that separates the values in each row, and treats the first line of the file as the column names.
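When a file uses a different delimiter, the sep parameter of pd.read_csv() can be set to match. The sketch below uses a small in-memory string in place of a real file so that it runs on its own; the semicolon-separated contents are made up for illustration.

```python
import io
import pandas as pd

# In-memory stand-in for a semicolon-delimited file (hypothetical contents)
csv_text = "Name;Age\nAlice;25\nBob;30\nCharlie;28\n"

# sep tells read_csv which character separates the values
df = pd.read_csv(io.StringIO(csv_text), sep=";")
print(df)
```

With a real file, pass the file path instead of the io.StringIO object.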
Working with the DataFrame
Once you have a DataFrame, you can access specific data points using indexing and selection. You can retrieve data by column name or by row position.
For example:
# Access data by column name
name = df['Name'][0]  # Access the name in the first row
age = df['Age'][1]  # Access the age in the second row
# Access data by row position (zero-indexed)
first_row = df.iloc[0] # Get all data in the first row
# Print results
print(f"Name: {name}, Age: {age}")
print(first_row)
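Selecting with df['Name'][0] relies on the default integer row labels. When rows carry other labels, loc (label-based) and iloc (position-based) make the intent explicit. The sketch below assumes hypothetical row labels "a", "b", and "c".

```python
import pandas as pd

df = pd.DataFrame(
    {"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 28]},
    index=["a", "b", "c"],  # hypothetical row labels
)

# Label-based: row "b", column "Age"
age_by_label = df.loc["b", "Age"]

# Position-based: second row, second column
age_by_position = df.iloc[1, 1]

# Selecting a list of columns returns a DataFrame
subset = df[["Name", "Age"]]

print(age_by_label, age_by_position)
```

Both lookups point at the same cell, once by label and once by position.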
You can also add new columns to a DataFrame and save the DataFrame back to a CSV file using the to_csv() method.
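A minimal sketch of both steps follows. The column name AgeNextYear is invented for this example, and the output goes to an in-memory buffer so the code runs without touching the disk.

```python
import io
import pandas as pd

df = pd.DataFrame({"Name": ["Alice", "Bob", "Charlie"], "Age": [25, 30, 28]})

# Add a new column computed from an existing one (hypothetical column name)
df["AgeNextYear"] = df["Age"] + 1

# Write to an in-memory buffer; pass a file path instead to write to disk
buffer = io.StringIO()
df.to_csv(buffer, index=False)  # index=False leaves out the row labels
csv_text = buffer.getvalue()
print(csv_text)
```

Passing index=False is a common choice when the row labels are just the default 0, 1, 2, ... and do not need to be stored in the file.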
In summary, this article provides a basic understanding of working with CSV files in Pandas. You learned how to:
- Create a DataFrame from scratch.
- Read data from a CSV file into a DataFrame.
- Access data points using indexing and selection.
- Add new columns to a DataFrame.
- Save a DataFrame to a CSV file.