Adventures in Building Custom Datasets via Web Scraping — Little Mermaid Edition

Slaps Lab
Published in The Startup
5 min read · Nov 22, 2020

This is a quick, end-to-end write-up where I parse a movie script out of the web. We start with HTML and end up with an ordered CSV of lines per actor. Parsing and cleaning data does not have to be something we dread.

** warning: this post assumes you have some basic knowledge of Python, text preprocessing and feature generation.

url: http://www.fpx.de/fp/Disney/Scripts/LittleMermaid.html

Project Imports

import os
import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

Setup Pipeline Objects

Lots of libraries already exist to help build efficient pipelines. I, however, went with a simple, custom approach. This gave me complete control and ended up being pretty simple to set up and debug. Basically, each step in the process (Pipe) gets cached after completion. This allowed me to break a large script into single stages so that I could easily debug the final output and quickly determine which area needed more focus.

1. Pipe

class Pipe(object):
    def __init__(self, extension):
        self.extension = extension

    def run(self, content):
        return content

2. Pipeline

class Pipeline(object):
    def __init__(self, directory, prefix, pipes):
        self._directory = directory
        self._pipes = pipes
        self._prefix = prefix
        self._verify_directory_path()

    def run(self, url):
        content = requests.get(url).text
        self._save(f'{self._prefix}1.html', content)
        step = 2
        for pipe in self._pipes:
            content = pipe.run(content)
            self._save(f'{self._prefix}{step}.{pipe.extension}', content)
            step += 1
        return content

    def _verify_directory_path(self):
        parts_in_path = self._directory.split('/')
        directory = parts_in_path[0]
        for part in parts_in_path[1:]:
            directory += f'/{part}'
            if not os.path.exists(directory):
                os.mkdir(directory)

    def _save(self, file_name, content):
        path = os.path.join(self._directory, file_name)
        with open(path, 'w') as output:
            output.write(content)
        print(f'saved: {path}')

Setup Parsing Pipes

HtmlToText

  • A simple Pipe that converts HTML to text using the BeautifulSoup library.

class HtmlToText(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        return bs.get_text()

RemoveTags

  • A simple Pipe that removes a given set of HTML tags using the BeautifulSoup library.

class RemoveTags(Pipe):
    def __init__(self, tags):
        self.tags = tags
        super().__init__('html')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        for tag_name in self.tags:
            for tag in bs.find_all(tag_name):
                tag.decompose()
        return bs.prettify()

FixLittleMermaidBlockquoteText

  • This Pipe runs through each <blockquote> and tries to designate it as either an ‘Action’ or an ‘Act’ (scene) change.

class FixLittleMermaidBlockquoteText(Pipe):
    def __init__(self):
        super().__init__('html')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        for blockquote in bs.find_all('blockquote'):
            text = blockquote.text.strip()
            if text.startswith('Disclaimer:'):
                continue
            change_scene = len(
                re.findall(
                    r'^(Back|Cut|Fade|On|Titles\.|Morning at castle|Big finale)',
                    text
                )
            )
            change_scene += len(re.findall(r'Fade to', text))
            if change_scene > 0:
                blockquote.string = f'Act: {blockquote.text}'
            else:
                blockquote.string = f'Action: {blockquote.text}'
        return bs.prettify()
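The scene-change heuristic is easy to exercise on its own, without BeautifulSoup. The sample strings below are illustrative, not taken from the script:

```python
import re

# Heuristic from the Pipe above: a blockquote marks a scene change when it
# starts with one of these cue words, or mentions 'Fade to' anywhere.
def is_scene_change(text):
    hits = len(re.findall(
        r'^(Back|Cut|Fade|On|Titles\.|Morning at castle|Big finale)',
        text
    ))
    hits += len(re.findall(r'Fade to', text))
    return hits > 0

print(is_scene_change('Cut to the beach.'))                 # True
print(is_scene_change('Ariel swims toward the ship.'))      # False
print(is_scene_change('The music swells. Fade to black.'))  # True
```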

FormatLittleMermaidScript

  • This Pipe does all of our heavy lifting, forcing our Actor / Action / Act labels onto a single line with their corresponding text (see the example output below). Once completed, we can easily extract each actor, line and act in the order it occurred.
class FormatLittleMermaidScript(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, script):
        ## get only the script
        i = script.index('Disclaimer:')
        script = script[i:]
        script = re.sub(r'-------\s+THE END(\s|.)+', '', script)

        ## format lines
        script = re.sub(r'\s+', ' ', script)
        script = re.sub(r'\s:\s', ': ', script)

        ## setup actors lines
        script = re.sub(
            r'((?:Act|Action|All|Disclaimer):)',
            '\n\\g<1>',
            script
        )
        script = re.sub(
            r'((?:Ariel|riel|Eric|Flounder|Scuttle|Sebastian|Triton|Ursula):)',
            '\n\\g<1>',
            script
        )
        script = re.sub(
            r'((?:Andrina|Atina|Carlotta|Flotsam|Grimsby|Jetsam|Louis|Priest|Sailor|Sailor \d|Sailors|Seahorse|Vanessa|Woman|Woman \d|Triton\'s daughters):)',
            '\n\\g<1>',
            script
        )

        ## repair
        script = re.sub(
            r'^riel:',
            'Ariel:',
            script,
            flags=re.MULTILINE
        )

        ## odd period placements
        script = re.sub(r' [.]', '.', script)

        ## condense
        script = re.sub(r'[ ]$', '', script, flags=re.MULTILINE)
        script = re.sub(
            r'^Disclaimer:[^\r\n]+',
            '',
            script,
            flags=re.MULTILINE
        )
        return script.strip()

## Example Output:

Act: Fade to beach.
Ariel: [Singing] Hi!
Eric: Hello!
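The actor-splitting step can be sketched in isolation. The one-line sample below is made up for illustration and uses a shortened actor list:

```python
import re

# Put each 'Name:' label at the start of its own line, mirroring the
# actor regexes in the Pipe above (only three names kept here).
sample = 'Ariel: Look at this stuff. Flounder: Wow! Eric: Hello!'
split = re.sub(
    r'((?:Ariel|Flounder|Eric):)',
    '\n\\g<1>',
    sample
).strip()

lines = [l.strip() for l in split.split('\n')]
print(lines)
# ['Ariel: Look at this stuff.', 'Flounder: Wow!', 'Eric: Hello!']
```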

Run Pipeline / Extract and Format Text

pipeline = Pipeline(
'./data/scripts/mermaid',
'step_',
[
RemoveTags(['head', 'hr', 'br']),
FixLittleMermaidBlockquoteText(),
HtmlToText(),
FormatLittleMermaidScript()
]
)
script = pipeline.run(
'http://www.fpx.de/fp/Disney/Scripts/LittleMermaid.html'
)
Folder Structure after Running Pipeline

Convert Text to CSV — Build Features

I got a bit aggressive here, but I wanted different ways to get at each line in the script. I wanted them grouped by act but searchable by actor. I wanted access to just what the character says, but also to be able to split the line based on the actions that occur during its delivery. The result was a simple data structure — Movie, Act, Line — which allowed me to easily split out the implementation for each feature I wanted in my final CSV.

Movie Objects

1. Line

  • Simple object that handles each line in the script. This object is responsible for creating all of our features.

class Line(object):
    def __init__(self, actor, line):
        self.actor = actor
        self.line = line
        self.metadata = []
        self.compressed_line = self.line
        self._process_line()

    def __str__(self):
        text = re.sub(r'<.+?>', '', self.compressed_line)
        return text.strip()

    def to_object(self):
        return {
            'actor': self.actor,
            'full': self.line,
            'compressed': self.compressed_line,
            'metadata': self.metadata,
            'line': self.__str__()
        }

    def _process_line(self):
        self.metadata = []
        text = self.line
        meta = re.findall(r'[\[\(]([^\]]+?)[\)\]]', self.line)
        for i, data in enumerate(meta):
            text = text.replace(data, f'<{i}>')
            text = re.sub(
                r'[\[\(](<.+?>)[\]\)]',
                '\\g<1>',
                text
            )
            self.metadata.append(
                {
                    'index': i,
                    'action': data.strip(),
                    'placement': text.index(f'<{i}>')
                }
            )
        self.compressed_line = text
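The bracket-handling can also be sketched as a standalone function, which makes it easier to test. The sample line below is illustrative:

```python
import re

# Pull bracketed stage directions out of a line, replacing each with a
# numbered placeholder so its position in the delivery is preserved.
def extract_actions(line):
    metadata = []
    text = line
    for i, action in enumerate(re.findall(r'[\[\(]([^\]\)]+?)[\)\]]', line)):
        text = text.replace(f'[{action}]', f'<{i}>').replace(f'({action})', f'<{i}>')
        metadata.append({
            'index': i,
            'action': action.strip(),
            'placement': text.index(f'<{i}>')
        })
    return text, metadata

compressed, meta = extract_actions('[Singing] Hi there!')
print(compressed)         # <0> Hi there!
print(meta[0]['action'])  # Singing
```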

2. Act

  • Simple object that groups lines by scene.

class Act(object):
    def __init__(self, order):
        self.order = order
        self.lines = []

    def add_line(self, line):
        self.lines.append(line)

    def to_object_array(self):
        return [line.to_object() for line in self.lines]

3. Movie

  • Simple object that contains all of the scenes in the movie. It also contains a method to convert the data structure to a Pandas DataFrame.

class Movie(object):
    def __init__(self):
        self.acts = []

    def add_act(self, act):
        self.acts.append(act)

    def to_pandas(self):
        acts = []
        for i, act in enumerate(self.acts):
            for obj in act.to_object_array():
                obj['act'] = i
                acts.append(obj)
        df = pd.DataFrame(acts)
        columns = [
            'act',
            'actor',
            'line',
            'compressed',
            'full',
            'metadata'
        ]
        return df[columns]

Parse the Little Mermaid Script

movie = Movie()
act = Act(1)
for line in script.split('\n'):
    if len(line) == 0:
        continue
    actor, text = re.findall(r'([^:]+?):(.+)', line)[0]
    if actor == 'Disclaimer' or actor == 'Action':
        ## skip for now
        continue
    if actor == 'Act':
        movie.add_act(act)
        act = Act(act.order + 1)
        continue
    act.add_line(Line(actor, text))

## do not lose the lines collected after the last 'Act:' marker
movie.add_act(act)
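The `Actor: text` split can be checked in isolation; the sample lines below are made up for illustration:

```python
import re

# Split a script line into its actor label and spoken text,
# mirroring the regex used in the parse loop above.
def split_line(line):
    return re.findall(r'([^:]+?):(.+)', line)[0]

print(split_line('Ariel: [Singing] Hi!'))  # ('Ariel', ' [Singing] Hi!')
print(split_line('Act: Fade to beach.'))   # ('Act', ' Fade to beach.')
```

Note that the leading space survives in the text group; `Line.__str__` strips it later.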

Save as CSV

movie.to_pandas().to_csv('./data/scripts/mermaid/features.csv')
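Once the CSV exists, the goals from earlier — grouped by act, searchable by actor — are simple pandas filters. A sketch on a hand-made frame (the rows below are illustrative, not real script data):

```python
import pandas as pd

# Miniature stand-in for the features.csv produced above.
df = pd.DataFrame([
    {'act': 1, 'actor': 'Ariel', 'line': 'Hi!'},
    {'act': 1, 'actor': 'Eric', 'line': 'Hello!'},
    {'act': 2, 'actor': 'Ariel', 'line': 'Look at this stuff.'},
])

# every line Ariel delivers, in order
ariel = df[df['actor'] == 'Ariel']
print(len(ariel))  # 2

# lines per act
print(df.groupby('act').size().to_dict())  # {1: 2, 2: 1}
```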
