Creating a CSV from HTML Titles

Python Tutorial

Published in Internet of Technology

4 min read, Feb 4, 2024

In this guide, we will explore how to collect the titles of HTML files stored in a ZIP archive using Python. The titles are read from the filenames, so the HTML itself never has to be parsed. This skill is useful for analyzing content, organizing data, or preparing datasets.

Additionally, you can adapt this code for other projects where you need to extract file names or create a CSV file using Python.

In upcoming tutorials, we will use the code developed here to analyze our Medium stories.

Together, we will go through the following steps:

Imports
Creating the CSV File
Extracting Titles from HTML Files
Creating the Main Function

Step 1: Imports

Before we begin coding, let’s import the modules we need. All of them are part of the Python standard library, so there is nothing to install:

import argparse
import csv
import os
from zipfile import ZipFile

argparse: This module handles command-line arguments.
csv: We'll use this module to write the CSV file.
os: This module provides functions for working with the operating system, such as file and directory operations.
ZipFile: This class from the zipfile module lets us work with ZIP archives.

Step 2: Creating the CSV File

Now, let’s dive into the code. We’ll start with a function that writes the extracted titles to a CSV file using the csv module. You can customize this part of the code to match your specific project.

def create_csv_file(titles: list[str], csv_file_path: str) -> None:
    """Create a CSV file with the given titles.

    :param titles: The titles to include in the CSV file.
    :param csv_file_path: Path to CSV output file.
    """
    with open(csv_file_path, "w", newline="", encoding="utf-8") as file:
        writer = csv.writer(file)
        writer.writerow(["Title", "Views", "Reads"])  # Write the header row
        for title in titles:
            writer.writerow([title, "0", "0"])  # Write each title

This function creates a CSV file, including a header row with columns for “Title,” “Views,” and “Reads.” You can modify the header and default values to match your data.
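To check the result, the file can be read back with csv.DictReader, which uses the header row as dictionary keys. This is a minimal sketch; the helper name read_titles is hypothetical and not part of the original script.

```python
import csv


def read_titles(csv_file_path: str) -> list[dict[str, str]]:
    """Read the CSV back as a list of row dictionaries.

    Hypothetical helper for checking the output; not part of the script.
    """
    # newline="" lets the csv module handle line endings itself,
    # mirroring how the file was written.
    with open(csv_file_path, newline="", encoding="utf-8") as file:
        return list(csv.DictReader(file))
```

Each row comes back as a dictionary keyed by “Title,” “Views,” and “Reads.” Titles that contain commas survive the round trip because csv.writer quotes such fields by default.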

Step 3: Extracting Titles from HTML Files

Next, let’s extract titles from HTML files within a zip archive. This step involves reading the titles from the filenames of the HTML files.

def extract_titles_from_html(zip_file_path: str) -> list[str]:
    """Extract titles from HTML files within the zip archive.

    :param zip_file_path: Path to zip file.
    :return: The titles extracted from the HTML filenames.
    """
    titles = []
    with ZipFile(zip_file_path, "r") as zip_ref:
        # Extract zip to a temporary directory
        temp_dir = os.path.join(
            os.path.dirname(zip_file_path), "temp_extracted"
        )
        zip_ref.extractall(temp_dir)

        # Identify HTML files within the extracted structure
        for root, dirs, files in os.walk(temp_dir):
            for file in files:
                if file.endswith(".html"):
                    # Extract title from the filename
                    title = (
                        file.split("_", 1)[1]
                        .rsplit("-", 1)[0]
                        .replace("-", " ")
                        .replace(".html", "")
                    )
                    titles.append(title)

        # Cleanup extracted files
        for root, dirs, files in os.walk(temp_dir, topdown=False):
            for name in files:
                os.remove(os.path.join(root, name))
            for name in dirs:
                os.rmdir(os.path.join(root, name))
        os.rmdir(temp_dir)

    return titles

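This function extracts the archive to a temporary directory, derives a title from the name of each HTML file, deletes the temporary files, and returns the titles as a list. The parsing expects filenames shaped like prefix_Title-Words-id.html: it drops everything up to the first underscore, drops the part after the last hyphen, and turns the remaining hyphens into spaces. A filename without an underscore raises an IndexError, so adjust the parsing if your files are named differently.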
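Because the titles come from filenames only, the archive does not strictly need to be extracted. The sketch below is an alternative, not the original method: it reads the member names with ZipFile.namelist() and skips names that do not fit the pattern. The function names title_from_filename and titles_from_zip, and the sample filename in the docstring, are hypothetical.

```python
from typing import Optional
from zipfile import ZipFile


def title_from_filename(filename: str) -> Optional[str]:
    """Turn '<prefix>_<Title-Words>-<id>.html' into 'Title Words'.

    Example (hypothetical filename):
    '2024-02-04_Creating-a-CSV-from-HTML-Titles-abc123.html'
    becomes 'Creating a CSV from HTML Titles'.
    Returns None when the name does not fit the assumed pattern.
    """
    if not filename.endswith(".html") or "_" not in filename:
        return None
    stem = filename[: -len(".html")]
    slug = stem.split("_", 1)[1]   # drop the prefix before the first "_"
    slug = slug.rsplit("-", 1)[0]  # drop the trailing "-<id>" part
    return slug.replace("-", " ")


def titles_from_zip(zip_file_path: str) -> list[str]:
    """Collect titles from member names without extracting the archive."""
    titles = []
    with ZipFile(zip_file_path) as zip_ref:
        for name in zip_ref.namelist():
            # Member names always use "/" as the separator inside a zip.
            title = title_from_filename(name.rsplit("/", 1)[-1])
            if title:  # skip names that do not match the pattern
                titles.append(title)
    return titles
```

This version leaves nothing on disk, so there is no temporary directory to clean up. The trade-off is that it silently skips unexpected filenames instead of raising an error.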

Step 4: Creating the Main Function

Finally, let’s implement the main function, which ties everything together:

def main(zip_file_path: str) -> None:
    """Create a CSV file of titles from the HTML files in a zip archive.

    :param zip_file_path: Path to zip file.
    """
    csv_file_name = (
        os.path.splitext(os.path.basename(zip_file_path))[0] + "_titles.csv"
    )
    csv_file_path = os.path.join(os.path.dirname(zip_file_path), csv_file_name)

    titles = extract_titles_from_html(zip_file_path)
    if titles:
        create_csv_file(titles, csv_file_path)
        print(f"CSV file created successfully at {csv_file_path}")
    else:
        print("No titles were extracted.")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description=(
            "Extract titles from HTML files in a zip archive and save to a "
            "CSV file."
        )
    )
    parser.add_argument(
        "zip_file_path", help="Path to the zip file containing HTML files."
    )

    args = parser.parse_args()
    main(args.zip_file_path)

This main function takes the path to the zip file containing HTML files as a command-line argument, extracts titles, and saves them to a CSV file. If titles are successfully extracted, it prints a success message along with the CSV file path. Otherwise, it indicates that no titles were extracted.
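The output path is derived from the zip path: the CSV gets the archive's base name plus "_titles.csv" and is placed in the same directory. The snippet below isolates that step as a small sketch; the helper name csv_path_for and the sample path are hypothetical.

```python
import os


def csv_path_for(zip_file_path: str) -> str:
    """Return '<zip name>_titles.csv' in the same directory as the zip."""
    # splitext removes the ".zip" extension from the file name.
    base = os.path.splitext(os.path.basename(zip_file_path))[0]
    return os.path.join(os.path.dirname(zip_file_path), base + "_titles.csv")


if __name__ == "__main__":
    # Hypothetical path: posts.zip inside an exports directory.
    print(csv_path_for(os.path.join("exports", "posts.zip")))
```

For a zip file given without a directory, such as posts.zip, the result is simply posts_titles.csv in the current working directory.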

Testing the Script

To test the script, open your terminal or command prompt and run the following command, replacing "your_zip_file.zip" with the path to your zip file:

$ python create_csv.py your_zip_file.zip

This will execute the script and create a CSV file with the extracted titles.

# .csv file
Title,Views,Reads
Upgrading Your Pre commit Linter  My Latest Linter Configurations,0,0
Troubleshooting Docstring Table Rendering Issues in VS Code,0,0
Creating a Python Dictionary with Multiple Keys,0,0
...
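If you do not have an archive at hand, a small sample can be generated for a first test run. This is a sketch under assumptions: the file names, the posts folder, and the function name make_sample_zip are all made up, chosen only to match the filename pattern the script expects.

```python
from zipfile import ZipFile


def make_sample_zip(zip_file_path: str) -> None:
    """Write a small zip with two placeholder HTML files.

    The member names are hypothetical; they only imitate the
    '<prefix>_<Title-Words>-<id>.html' pattern the script parses.
    """
    sample_files = {
        "posts/2024-02-01_First-Sample-Post-aaa111.html": "<html></html>",
        "posts/2024-02-02_Second-Sample-Post-bbb222.html": "<html></html>",
    }
    with ZipFile(zip_file_path, "w") as zip_ref:
        for name, content in sample_files.items():
            zip_ref.writestr(name, content)


if __name__ == "__main__":
    make_sample_zip("sample_posts.zip")
```

Running create_csv.py on sample_posts.zip should then produce sample_posts_titles.csv with one row each for "First Sample Post" and "Second Sample Post".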

Congratulations!

You’ve successfully created a CSV file from HTML titles using Python.

Access all the code: Check out the GitHub repository for all the code in this 30-day challenge!

Written by Oliver Lövström