10 most important Python skills I learned (Part 2)
Hi there, in case I haven’t introduced myself to you before:
My name is Matt and I am a Data Engineering Project Manager. My job requires me to collaborate with various departments within the central government of Taiwan on multiple projects. As part of my role, I oversee the data engineering aspect of each project. I love sharing my learnings from my work on Medium and LinkedIn.
I will be studying Data Science at the University of Sussex this September. I think I will write more about my experience in the data science master's degree in the UK.
Right, let’s get into the topic. If you haven’t checked out Part 1, here is the link.
6. os.path.join / glob.glob
To use os.path.join() and glob.glob() to find files with specific text in their names, you can follow this approach:
- Use os.path.join() to construct a valid file path.
- Use glob.glob() to find files that match the specified pattern.
Here’s an example:
import os
import glob
# Example directory path
directory_path = "/path/to/directory"
# Constructing a valid file path using os.path.join
file_name = "example_text"
file_extension = "*.txt" # Change this to the desired file extension, e.g., "*.csv"
file_path = os.path.join(directory_path, file_name + file_extension)
# Finding files in the directory that match the specified text
matching_files = glob.glob(file_path)
# Print the list of matching files
print(matching_files)
In this example, we start with a given directory_path representing the directory where you want to search for files. You need to specify the file_name you want to find and the file_extension you are looking for (e.g., "*.txt" for text files or "*.csv" for CSV files).
Using os.path.join(), we create a search pattern by combining the directory_path, file_name, and file_extension. The resulting file_path will look something like "/path/to/directory/example_text*.txt", a glob pattern that matches every file in the directory whose name starts with "example_text" and has the given extension.
Next, we use glob.glob() with the file_path to find all files in the specified directory that match the pattern. The glob.glob() function returns a list of the matching file paths, or an empty list if no matches are found.
Finally, we print the list of matching files to see the results.
Make sure to change the directory_path, file_name, and file_extension variables according to your specific use case.
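As a side note, if your files are nested in subdirectories, glob.glob() can also search recursively with the ** wildcard when you pass recursive=True. Here is a minimal sketch (the directory path is a made-up placeholder):
import glob
import os
directory_path = "/path/to/directory"  # hypothetical placeholder path
# "**" matches the directory itself and any depth of subdirectories
pattern = os.path.join(directory_path, "**", "example_text*.txt")
matching_files = glob.glob(pattern, recursive=True)
print(matching_files)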
7. [pd.read_csv(file) for file in file_names]
This method was very useful to me when I first scraped a bunch of data from a government website. After scraping and downloading the data onto your machine, you can read all of those CSV files at once and then concat them into one single big DataFrame.
Here is an example:
import pandas as pd
# List of CSV file names
file_names = ['file1.csv', 'file2.csv', 'file3.csv']
# List comprehension to read CSV files into separate DataFrames
dataframes = [pd.read_csv(file) for file in file_names]
# Concatenate the DataFrames into a single DataFrame
combined_dataframe = pd.concat(dataframes, ignore_index=True)
# Now you have a single DataFrame containing the data from all CSV files
print(combined_dataframe)
In this example, we have a list file_names that contains the names of multiple CSV files. You can use os.path.join / glob.glob from point 6 to get the names of all the files you want to concat.
We then use a list comprehension to read each CSV file into a separate DataFrame, resulting in a list of DataFrames called dataframes. Finally, we use pd.concat() to concatenate all the DataFrames in the dataframes list into a single DataFrame, combined_dataframe.
The ignore_index=True parameter in pd.concat() is used to reset the index of the resulting DataFrame, so the concatenated DataFrame will have a new continuous index rather than inheriting the indices from the original DataFrames.
After executing this code, combined_dataframe will be a single DataFrame that contains the data from all the CSV files specified in the file_names list.
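Putting points 6 and 7 together, here is a minimal sketch that collects every CSV in a folder with glob and concatenates them in one go (the folder path is a made-up placeholder):
import glob
import os
import pandas as pd
directory_path = "/path/to/downloads"  # hypothetical placeholder path
# Collect the CSV file names with glob (point 6)
file_names = glob.glob(os.path.join(directory_path, "*.csv"))
# Read and concatenate them into one DataFrame (point 7)
combined_dataframe = pd.concat(
    [pd.read_csv(file) for file in file_names],
    ignore_index=True,
)
print(combined_dataframe)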
8. pivot
The pivot method is useful when you want to reorganize data and create a more structured representation of your dataset, especially when dealing with tabular data that requires summarization or aggregation. The pivot method reshapes a DataFrame by converting the values from one column into new columns, creating a pivot table.
The syntax for pivot is as follows:
DataFrame.pivot(index=None, columns=None, values=None)
- index: This parameter specifies the column whose unique values will become the new index (rows) of the pivot table. It is optional; if it is not provided, the current DataFrame index is used as the pivot index.
- columns: This parameter specifies the column whose unique values will become the new column headers of the pivot table. Unlike the other two parameters, it is required.
- values: This parameter specifies the column whose values will populate the cells of the pivot table. It is optional; if it is not provided, all the columns not used in index or columns are used as values.
Here’s an example to illustrate the use of pivot:
import pandas as pd
# Sample data
data = {
'Date': ['2023-07-01', '2023-07-02', '2023-07-01', '2023-07-02'],
'City': ['London', 'London', 'Manchester', 'Manchester'],
'Temperature': [24, 27, 22, 26],
'Humidity': [60, 55, 75, 70]
}
# Create a DataFrame
df = pd.DataFrame(data)
# df:
# Date City Temperature Humidity
# 0 2023-07-01 London 24 60
# 1 2023-07-02 London 27 55
# 2 2023-07-01 Manchester 22 75
# 3 2023-07-02 Manchester 26 70
# Pivot the DataFrame to create a pivot table
pivot_table = df.pivot(index='Date', columns='City', values='Temperature')
print(pivot_table)
City London Manchester
Date
2023-07-01 24 22
2023-07-02 27 26
In this example, we have a DataFrame df with columns 'Date', 'City', 'Temperature', and 'Humidity'. We use the pivot method to create a pivot table with the 'Date' column as the index, the 'City' column as the column headers, and the 'Temperature' column as the values. The resulting pivot table shows the temperature for each city on each date.
9. split columns
If you have a dataset like the one below,
data = {
'column_to_split': ['value1A_value1B', 'value2A_value2B', 'value3A_value3B']
}
df = pd.DataFrame(data)
# column_to_split
# 0 value1A_value1B
# 1 value2A_value2B
# 2 value3A_value3B
you will probably want to do some cleaning on this dataset.
Clearly, the DataFrame should contain three rows and two columns (valueA and valueB), but they are squeezed into one single column. (Imagine the dataset is not just three rows, but a massive dataset with all the columns squeezed together!)
Instead of using a for loop to iterate over all the indexes, fix each value, and store the new values in a new column, you can let pandas split the column in one go. This is how:
# Access the column you want to split
column_to_split_series = df['column_to_split']
# Split the values in the column based on the separator '_'
separated_series = column_to_split_series.str.split('_')
# Create new columns with the separated values
df[['column_part1', 'column_part2']] = pd.DataFrame(separated_series.tolist(), index=df.index)
Now, your DataFrame will look like this:
column_to_split column_part1 column_part2
0 value1A_value1B value1A value1B
1 value2A_value2B value2A value2B
2 value3A_value3B value3A value3B
In the code above, we created a new Series separated_series containing the separated parts from the original column. Then, we used pd.DataFrame to expand this Series of lists into two new columns, column_part1 and column_part2, and assigned them back to the original DataFrame df. The trick here is that we specify two new columns, column_part1 and column_part2, filled with the values that we separated (separated_series) earlier. Once you find the right pattern for your dataset, you can simply do the same!
Next, I will introduce a better tool to use when looking for patterns in your data.
10. regular expression
Regular expressions (regex or regexp) are powerful tools used for pattern matching and manipulation of text. They provide a concise and flexible way to search, extract, and replace specific patterns within strings. Regular expressions are supported in various programming languages, including Python, JavaScript, Java, and many others.
Here are some key concepts and common symbols used in regular expressions (a short demo follows the list):
1. Literal Characters: Regular expressions can contain literal characters that match themselves exactly. For example, the pattern “hello” will match the word “hello” in a string.
2. Metacharacters: Metacharacters are special characters with a special meaning in regular expressions. Some common metacharacters include:
- . (dot): Matches any character except a newline.
- ^ (caret): Matches the start of a string.
- $ (dollar sign): Matches the end of a string.
- * (asterisk): Matches zero or more occurrences of the preceding character.
- + (plus): Matches one or more occurrences of the preceding character.
- ? (question mark): Matches zero or one occurrence of the preceding character.
- | (vertical bar): Acts as an OR operator to match one of multiple expressions.
3. Character Classes: Character classes allow you to match any one character from a set of characters. Some common character classes include:
- [0-9]: Matches any digit (0 to 9).
- [a-z]: Matches any lowercase letter (a to z).
- [A-Z]: Matches any uppercase letter (A to Z).
- \d: Matches any digit (equivalent to [0-9]).
- \w: Matches any word character (alphanumeric characters plus underscore).
- \s: Matches any whitespace character (spaces, tabs, newlines, etc.).
4. Quantifiers: Quantifiers are used to specify the number of occurrences of a character or group in the pattern. For example:
- {n}: Matches exactly n occurrences.
- {n,}: Matches at least n occurrences.
- {n,m}: Matches between n and m occurrences (inclusive).
5. Grouping: Parentheses () are used for grouping parts of the pattern together.
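To make these symbols concrete, here is a small demo using Python’s re module (the sample strings are made up for illustration):
import re
# Metacharacters: "." matches any character except a newline, "|" acts as OR
print(re.findall(r'c.t', 'cat cot coat'))     # ['cat', 'cot']
print(re.findall(r'cat|dog', 'cat and dog'))  # ['cat', 'dog']
# Character classes: \d matches digits, \w matches word characters
text = "room 42, floor_3"
print(re.findall(r'\d', text))                # ['4', '2', '3']
print(re.findall(r'\w+', text))               # ['room', '42', 'floor_3']
# Quantifiers: {n} is an exact count, {n,} means at least n
print(re.findall(r'\ba{2}\b', 'a aa aaa'))    # ['aa']
print(re.findall(r'\ba{2,}\b', 'a aa aaa'))   # ['aa', 'aaa']
# Grouping: parentheses capture sub-parts of a match
match = re.search(r'(\d{4})-(\d{2})-(\d{2})', 'Deployed on 2023-07-01.')
if match:
    print(match.groups())                     # ('2023', '07', '01')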
Here’s a simple example of using a regular expression in Python to extract email addresses from a given text:
import re
text = "Please contact support@example.com for assistance or info@company.co.uk for more information."
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
matches = re.findall(pattern, text)
print(matches)
# ['support@example.com', 'info@company.co.uk']
The regular expression pattern is defined as follows:
- \b: A word boundary, ensuring that the pattern matches only whole words (in this case, complete email addresses).
- [A-Za-z0-9._%+-]+: This part matches the local part of the email address. The + after the character class allows one or more occurrences of alphanumeric characters (A-Z, a-z, 0-9), period (.), underscore (_), percent sign (%), or plus sign (+).
- @: This matches the '@' symbol, which separates the local part from the domain part in an email address.
- [A-Za-z0-9.-]+: This part matches the domain part of the email address. Again, it allows one or more occurrences of alphanumeric characters (A-Z, a-z, 0-9), period (.), or hyphen (-).
- \.: This matches a literal dot (.). We need to escape the dot with a backslash because a plain dot has a special meaning in regular expressions (it matches any character).
- [A-Za-z]{2,}: This part matches the top-level domain (TLD) of the email address. It allows two or more occurrences of uppercase (A-Z) or lowercase (a-z) letters.
- \b: Another word boundary to ensure the pattern matches only whole email addresses.
Lastly, the re.findall() function is used to find all occurrences of the pattern in the text string. It returns a list of the matching email addresses found in the text.
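Regular expressions can replace text as well as extract it. As one final sketch, re.sub() swaps every match for a replacement string; here it masks the addresses found by the same pattern:
import re
text = "Please contact support@example.com for assistance or info@company.co.uk for more information."
pattern = r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'
# Replace every matched email address with a placeholder
print(re.sub(pattern, '[redacted]', text))
# Please contact [redacted] for assistance or [redacted] for more information.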
Right, if you like my sharing, follow me for more of what I’ve learned from my data journey.
I will be sharing the next article regarding my English learning journey in the following days. Hope you will find it interesting. 😊
LinkedIn: https://www.linkedin.com/in/matt-chang-5627281a2/
Medium: https://medium.com/@MattYuChang
Cheers, and happy learning! I will see you next time!