Creating a CSV from HTML Titles
Python Tutorial
In this guide, we will explore how to extract titles from the filenames of HTML files within a zip archive using Python. This can be useful for analyzing content, organizing data, or preparing datasets for other applications.
You can also adapt this code for other projects where you need to extract file names or create a CSV file using Python.
In upcoming tutorials, we will use the code developed here to analyze our Medium stories.
Together, we will go through the following steps:
- Imports
- Creating the CSV File
- Extracting Titles from HTML Files
- Creating the Main Function
Step 1: Imports
Before we begin coding, let’s import the necessary packages. You’ll need the following for this project:
import argparse
import csv
import os
from zipfile import ZipFile
- argparse: This library will help us handle command-line arguments.
- csv: We'll use this library to work with CSV files.
- os: It provides various functions for working with the operating system, such as file and directory operations.
- ZipFile: This class allows us to work with zip archives.
Step 2: Creating the CSV File
Now, let’s dive into the code. We’ll start by creating a CSV file that will contain the extracted titles. We’ll use the csv library to write the titles to the CSV file. You can customize this part of the code to match your specific project.
def create_csv_file(titles: list[str], csv_file_path: str) -> None:
    """Create a CSV file with the given titles.
    :param titles: The titles to include in the CSV file.
    :param csv_file_path: Path to CSV output file.
    """
    with open(csv_file_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Views", "Reads"])  # Write the header row
        for title in titles:
            writer.writerow([title, "0", "0"])  # Write each title
This function creates a CSV file, including a header row with columns for “Title,” “Views,” and “Reads.” You can modify the header and default values to match your data.
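As a minimal sketch of that customization, the variant below takes the extra columns and their default values as a parameter. The name `create_csv_with_columns` and its `extra_columns` parameter are hypothetical and not part of the original script.

```python
import csv
from typing import Optional


def create_csv_with_columns(
    titles: list[str],
    csv_file_path: str,
    extra_columns: Optional[dict[str, str]] = None,
) -> None:
    """Write one row per title, followed by configurable default columns.

    Hypothetical variant of create_csv_file; not part of the original script.
    """
    # Fall back to the tutorial's columns when none are given.
    if extra_columns is None:
        extra_columns = {"Views": "0", "Reads": "0"}
    with open(csv_file_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", *extra_columns])  # Header row
        for title in titles:
            writer.writerow([title, *extra_columns.values()])
```

For example, `create_csv_with_columns(titles, "titles.csv", {"Claps": "0"})` would write a file whose header is `Title,Claps`.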
Step 3: Extracting Titles from HTML Files
Next, let’s extract titles from HTML files within a zip archive. This step involves reading the titles from the filenames of the HTML files.
def extract_titles_from_html(zip_file_path: str) -> list[str]:
    """Extract titles from HTML files within the zip archive.
    :param zip_file_path: Path to zip file.
    :return: The titles extracted from the HTML filenames.
    """
    titles = []
    with ZipFile(zip_file_path, "r") as zip_ref:
        # Extract zip to a temporary directory
        temp_dir = os.path.join(
            os.path.dirname(zip_file_path), "temp_extracted"
        )
        zip_ref.extractall(temp_dir)
        # Identify HTML files within the extracted structure
        for root, dirs, files in os.walk(temp_dir):
            for file in files:
                if file.endswith(".html"):
                    # Extract title from the filename
                    title = (
                        file.split("_", 1)[1]
                        .rsplit("-", 1)[0]
                        .replace("-", " ")
                        .replace(".html", "")
                    )
                    titles.append(title)
        # Cleanup extracted files
        for root, dirs, files in os.walk(temp_dir, topdown=False):
            for name in files:
                os.remove(os.path.join(root, name))
            for name in dirs:
                os.rmdir(os.path.join(root, name))
        os.rmdir(temp_dir)
    return titles
This function extracts titles from HTML files within the provided zip archive and returns them as a list.
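Because only the filenames are needed, the archive does not strictly have to be unpacked. The sketch below reads the member names with `ZipFile.namelist()` instead. It assumes filenames shaped like `<prefix>_<title-words>-<id>.html`, which is inferred from the parsing code above rather than stated anywhere, and the helper names `title_from_filename` and `extract_titles_from_names` are hypothetical.

```python
from typing import Optional
from zipfile import ZipFile


def title_from_filename(filename: str) -> Optional[str]:
    """Return the title encoded in an HTML filename, or None if it does not fit."""
    if not filename.endswith(".html") or "_" not in filename:
        return None
    stem = filename[: -len(".html")]
    slug = stem.split("_", 1)[1]   # Drop everything up to the first "_"
    slug = slug.rsplit("-", 1)[0]  # Drop the part after the last "-"
    return slug.replace("-", " ")


def extract_titles_from_names(zip_file_path: str) -> list[str]:
    """Collect titles from the names of HTML members without extracting them."""
    titles = []
    with ZipFile(zip_file_path, "r") as zip_ref:
        for member in zip_ref.namelist():
            # Members may sit in subfolders; zip paths use "/" as separator.
            title = title_from_filename(member.rsplit("/", 1)[-1])
            if title is not None:
                titles.append(title)
    return titles
```

With this variant no temporary directory is created, so there is nothing to clean up. HTML files without an underscore in their name are skipped, whereas the `split("_", 1)[1]` call in the function above would raise an `IndexError` for them.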
Step 4: Creating the Main Function
Finally, let’s implement the main function, which ties everything together:
def main(zip_file_path: str) -> None:
    """Create a CSV file of titles from the given zip archive.
    :param zip_file_path: Path to zip file.
    """
    csv_file_name = (
        os.path.splitext(os.path.basename(zip_file_path))[0] + "_titles.csv"
    )
    csv_file_path = os.path.join(os.path.dirname(zip_file_path), csv_file_name)
    titles = extract_titles_from_html(zip_file_path)
    if titles:
        create_csv_file(titles, csv_file_path)
        print(f"CSV file created successfully at {csv_file_path}")
    else:
        print("No titles were extracted.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=(
            "Extract titles from HTML files in a zip archive and save to a "
            "CSV file."
        )
    )
    parser.add_argument(
        "zip_file_path", help="Path to the zip file containing HTML files."
    )
    args = parser.parse_args()
    main(args.zip_file_path)
This main function takes the path to the zip file containing HTML files as a command-line argument, extracts titles, and saves them to a CSV file. If titles are successfully extracted, it prints a success message along with the CSV file path. Otherwise, it indicates that no titles were extracted.
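To see how the output path is derived and how the argument parser behaves without going through a terminal, the following sketch isolates both pieces. `derive_csv_path` and `build_parser` are hypothetical helper names introduced here for illustration; passing an explicit list to `parse_args` is standard `argparse` behavior.

```python
import argparse
import os


def derive_csv_path(zip_file_path: str) -> str:
    """Place '<zip name>_titles.csv' next to the zip file, as main() does."""
    stem = os.path.splitext(os.path.basename(zip_file_path))[0]
    return os.path.join(os.path.dirname(zip_file_path), stem + "_titles.csv")


def build_parser() -> argparse.ArgumentParser:
    """Build a single-argument parser like the one used by the script."""
    parser = argparse.ArgumentParser(
        description="Extract titles from HTML files in a zip archive."
    )
    parser.add_argument("zip_file_path", help="Path to the zip file.")
    return parser


# parse_args accepts an explicit argument list, which is handy in tests.
args = build_parser().parse_args(["posts.zip"])
print(derive_csv_path(args.zip_file_path))  # posts_titles.csv
```

The CSV file always lands in the same directory as the zip file, so a zip in a read-only location would need a different output path.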
Testing the Script
To test the script, open your terminal or command prompt and run the following command, replacing "your_zip_file.zip" with the path to your zip file:
$ python create_csv.py your_zip_file.zip
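This will execute the script and create a CSV file with the extracted titles. Because the script replaces every hyphen in the filename with a space, a hyphenated word such as “pre-commit” shows up as two words in the output: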
# .csv file
Title,Views,Reads
Upgrading Your Pre commit Linter My Latest Linter Configurations,0,0
Troubleshooting Docstring Table Rendering Issues in VS Code,0,0
Creating a Python Dictionary with Multiple Keys,0,0
...
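One quick way to check the result is to read the file back. The sketch below uses `csv.DictReader` from the standard library; the function name `read_titles` is hypothetical, and the column name `Title` matches the header written by the script.

```python
import csv


def read_titles(csv_file_path: str) -> list[str]:
    """Return the values of the 'Title' column from a CSV file."""
    with open(csv_file_path, newline="", encoding="utf-8") as file:
        # DictReader uses the first row as the field names.
        return [row["Title"] for row in csv.DictReader(file)]
```

Calling `read_titles` with the path printed by the script should return the same titles that were extracted, in the same order.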
Congratulations!
You’ve successfully created a CSV file from HTML titles using Python.
Access all the code: Check out the GitHub repository for all the code in this 30-day challenge!