Adventures in Building Custom Datasets via Web Scraping — Little Mermaid Edition

Slaps Lab
Published in The Startup
5 min read · Nov 22, 2020

This is a quick, end-to-end write-up where I parse a movie script out of the web. We start with HTML and end up with an ordered CSV of lines per actor. Parsing and cleaning data does not have to be something we dread.

** warning: this post assumes you have some basic knowledge of Python, text preprocessing and feature generation.

url: http://www.fpx.de/fp/Disney/Scripts/LittleMermaid.html

Project Imports

import os
import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

Setup Pipeline Objects

Lots of libraries already exist to help build efficient pipelines. I, however, went with a simple, custom approach. This gave me complete control and ended up being pretty simple to set up and debug. Basically, each step in the process (Pipe) gets cached after completion. This allowed me to break a large script into single stages so that I could easily debug the final output and quickly determine which area needed more focus.

1. Pipe

class Pipe(object):
    def __init__(self, extension):
        self.extension = extension

    def run(self, content):
        return content

2. Pipeline

class Pipeline(object):
    def __init__(self, directory, prefix, pipes):
        self._directory = directory
        self._pipes = pipes
        self._prefix = prefix
        self._verify_directory_path()

    def run(self, url):
        content = requests.get(url).text
        self._save(f'{self._prefix}1.html', content)
        step = 2
        for pipe in self._pipes:
            content = pipe.run(content)
            self._save(f'{self._prefix}{step}.{pipe.extension}', content)
            step += 1
        return content

    def _verify_directory_path(self):
        parts_in_path = self._directory.split('/')
        directory = parts_in_path[0]
        for part in parts_in_path[1:]:
            directory += f'/{part}'
            if not os.path.exists(directory):
                os.mkdir(directory)

    def _save(self, file_name, content):
        path = os.path.join(self._directory, file_name)
        with open(path, 'w') as output:
            output.write(content)
        print(f'saved: {path}')

Setup Parsing Pipes

HtmlToText

  • A simple Pipe that converts HTML to text using the BeautifulSoup library.

class HtmlToText(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        return bs.get_text()

RemoveTags

  • A simple Pipe that removes a given set of HTML tags using the BeautifulSoup library.

class RemoveTags(Pipe):
    def __init__(self, tags):
        self.tags = tags
        super().__init__('html')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        for tag_name in self.tags:
            for tag in bs.find_all(tag_name):
                tag.decompose()
        return bs.prettify()

FixLittleMermaidBlockquoteText

  • This Pipe runs through each <blockquote> and tries to designate it as either an ‘Action’ or an ‘Act’ (scene) change.

class FixLittleMermaidBlockquoteText(Pipe):
    def __init__(self):
        super().__init__('html')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        for blockquote in bs.find_all('blockquote'):
            text = blockquote.text.strip()
            if text.startswith('Disclaimer:'):
                continue
            change_scene = len(
                re.findall(
                    r'^(Back|Cut|Fade|On|Titles\.|Morning at castle|Big finale)',
                    text
                )
            )
            change_scene += len(re.findall(r'Fade to', text))
            if change_scene > 0:
                blockquote.string = f'Act: {blockquote.text}'
            else:
                blockquote.string = f'Action: {blockquote.text}'
        return bs.prettify()
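The scene-change heuristic is easy to exercise on its own, without BeautifulSoup. The sample strings below are illustrative, not taken from the script:

```python
import re

# Heuristic from the Pipe above: a blockquote marks a scene change when it
# starts with one of these cue words, or mentions 'Fade to' anywhere.
def is_scene_change(text):
    hits = len(re.findall(
        r'^(Back|Cut|Fade|On|Titles\.|Morning at castle|Big finale)',
        text
    ))
    hits += len(re.findall(r'Fade to', text))
    return hits > 0

print(is_scene_change('Cut to the beach.'))                 # True
print(is_scene_change('Ariel swims toward the ship.'))      # False
print(is_scene_change('The music swells. Fade to black.'))  # True
```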

FormatLittleMermaidScript

  • This Pipe does all of our heavy lifting, forcing our Actor / Action / Act labels onto a single line with their corresponding text (see the example output below). Once completed, we can easily extract each actor, line and act in the order it occurred.
class FormatLittleMermaidScript(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, script):
        ## get only the script
        i = script.index('Disclaimer:')
        script = script[i:]
        script = re.sub(r'-------\s+THE END(\s|.)+', '', script)

        ## format lines
        script = re.sub(r'\s+', ' ', script)
        script = re.sub(r'\s:\s', ': ', script)

        ## setup actors lines
        script = re.sub(
            r'((?:Act|Action|All|Disclaimer):)',
            '\n\\g<1>',
            script
        )
        script = re.sub(
            r'((?:Ariel|riel|Eric|Flounder|Scuttle|Sebastian|Triton|Ursula):)',
            '\n\\g<1>',
            script
        )
        script = re.sub(
            r'((?:Andrina|Atina|Carlotta|Flotsam|Grimsby|Jetsam|Louis|Priest|Sailor|Sailor \d|Sailors|Seahorse|Vanessa|Woman|Woman \d|Triton\'s daughters):)',
            '\n\\g<1>',
            script
        )

        ## repair
        script = re.sub(
            r'^riel:',
            'Ariel:',
            script,
            flags=re.MULTILINE
        )

        ## odd period placements
        script = re.sub(r' [.]', '.', script)

        ## condense
        script = re.sub(r'[ ]$', '', script, flags=re.MULTILINE)
        script = re.sub(
            r'^Disclaimer:[^\r\n]+',
            '',
            script,
            flags=re.MULTILINE
        )
        return script.strip()

## Example Output:

Act: Fade to beach.
Ariel: [Singing] Hi!
Eric: Hello!
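The actor-splitting step can be sketched in isolation. The one-line sample below is made up for illustration and uses a shortened actor list:

```python
import re

# Put each 'Name:' label at the start of its own line, mirroring the
# actor regexes in the Pipe above (only three names kept here).
sample = 'Ariel: Look at this stuff. Flounder: Wow! Eric: Hello!'
split = re.sub(
    r'((?:Ariel|Flounder|Eric):)',
    '\n\\g<1>',
    sample
).strip()

lines = [l.strip() for l in split.split('\n')]
print(lines)
# ['Ariel: Look at this stuff.', 'Flounder: Wow!', 'Eric: Hello!']
```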

Run Pipeline / Extract and Format Text

pipeline = Pipeline(
'./data/scripts/mermaid',
'step_',
[
RemoveTags(['head', 'hr', 'br']),
FixLittleMermaidBlockquoteText(),
HtmlToText(),
FormatLittleMermaidScript()
]
)
script = pipeline.run(
'http://www.fpx.de/fp/Disney/Scripts/LittleMermaid.html'
)
Folder Structure after Running Pipeline

Convert Text to CSV — Build Features

I got a bit aggressive here, but I wanted different ways to get at each line in the script. I wanted them grouped by act but searchable by actor. I wanted access to just what the character says, but also to be able to split the line based on the actions that occur during its delivery. The result was a simple data structure — Movie, Act, Line — which allowed me to easily split out the implementation for each feature I wanted in my final CSV.

Movie Objects

1. Line

  • Simple object that handles each line in the script. This object is responsible for creating all of our features.

class Line(object):
    def __init__(self, actor, line):
        self.actor = actor
        self.line = line
        self.metadata = []
        self.compressed_line = self.line
        self._process_line()

    def __str__(self):
        text = re.sub(r'<.+?>', '', self.compressed_line)
        return text.strip()

    def to_object(self):
        return {
            'actor': self.actor,
            'full': self.line,
            'compressed': self.compressed_line,
            'metadata': self.metadata,
            'line': self.__str__()
        }

    def _process_line(self):
        self.metadata = []
        text = self.line
        meta = re.findall(r'[\[\(]([^\]]+?)[\)\]]', self.line)
        for i, data in enumerate(meta):
            text = text.replace(data, f'<{i}>')
            text = re.sub(
                r'[\[\(](<.+?>)[\]\)]',
                '\\g<1>',
                text
            )
            self.metadata.append(
                {
                    'index': i,
                    'action': data.strip(),
                    'placement': text.index(f'<{i}>')
                }
            )
        self.compressed_line = text
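The bracket-handling can also be sketched as a standalone function, which makes it easier to test. The sample line below is illustrative:

```python
import re

# Pull bracketed stage directions out of a line, replacing each with a
# numbered placeholder so its position in the delivery is preserved.
def extract_actions(line):
    metadata = []
    text = line
    for i, action in enumerate(re.findall(r'[\[\(]([^\]\)]+?)[\)\]]', line)):
        text = text.replace(f'[{action}]', f'<{i}>').replace(f'({action})', f'<{i}>')
        metadata.append({
            'index': i,
            'action': action.strip(),
            'placement': text.index(f'<{i}>')
        })
    return text, metadata

compressed, meta = extract_actions('[Singing] Hi there!')
print(compressed)         # <0> Hi there!
print(meta[0]['action'])  # Singing
```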

2. Act

  • Simple object that groups lines by scene.

class Act(object):
    def __init__(self, order):
        self.order = order
        self.lines = []

    def add_line(self, line):
        self.lines.append(line)

    def to_object_array(self):
        return [line.to_object() for line in self.lines]

3. Movie

  • Simple object that contains all of the scenes in the movie. It also contains a method to convert the data structure to a Pandas DataFrame.

class Movie(object):
    def __init__(self):
        self.acts = []

    def add_act(self, act):
        self.acts.append(act)

    def to_pandas(self):
        acts = []
        for i, act in enumerate(self.acts):
            for obj in act.to_object_array():
                obj['act'] = i
                acts.append(obj)
        df = pd.DataFrame(acts)
        columns = [
            'act',
            'actor',
            'line',
            'compressed',
            'full',
            'metadata'
        ]
        return df[columns]

Parse the Little Mermaid Script

movie = Movie()
act = Act(1)
for line in script.split('\n'):
    if len(line) == 0:
        continue
    actor, text = re.findall(r'([^:]+?):(.+)', line)[0]
    if actor == 'Disclaimer' or actor == 'Action':
        ## skip for now
        continue
    if actor == 'Act':
        movie.add_act(act)
        act = Act(act.order + 1)
        continue
    act.add_line(Line(actor, text))

## do not lose the lines collected after the last 'Act:' marker
movie.add_act(act)
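The `Actor: text` split can be checked in isolation; the sample lines below are made up for illustration:

```python
import re

# Split a script line into its actor label and spoken text,
# mirroring the regex used in the parse loop above.
def split_line(line):
    return re.findall(r'([^:]+?):(.+)', line)[0]

print(split_line('Ariel: [Singing] Hi!'))  # ('Ariel', ' [Singing] Hi!')
print(split_line('Act: Fade to beach.'))   # ('Act', ' Fade to beach.')
```

Note that the leading space survives in the text group; `Line.__str__` strips it later.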

Save as CSV

movie.to_pandas().to_csv('./data/scripts/mermaid/features.csv')
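Once the CSV exists, the goals from earlier — grouped by act, searchable by actor — are simple pandas filters. A sketch on a hand-made frame (the rows below are illustrative, not real script data):

```python
import pandas as pd

# Miniature stand-in for the features.csv produced above.
df = pd.DataFrame([
    {'act': 1, 'actor': 'Ariel', 'line': 'Hi!'},
    {'act': 1, 'actor': 'Eric', 'line': 'Hello!'},
    {'act': 2, 'actor': 'Ariel', 'line': 'Look at this stuff.'},
])

# every line Ariel delivers, in order
ariel = df[df['actor'] == 'Ariel']
print(len(ariel))  # 2

# lines per act
print(df.groupby('act').size().to_dict())  # {1: 2, 2: 1}
```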
