Adventures in Building Custom Datasets Via Web Scraping — Little Mermaid Edition

Slaps Lab
Nov 22, 2020 · 5 min read

This is a quick, end-to-end write-up where I go through parsing a movie script out of the web. We start with HTML and end up with an ordered CSV of lines per actor. Parsing and cleaning data does not have to be something we dread.

** warning: this post assumes you have some basic knowledge of Python, text preprocessing and feature generation.

url: http://www.fpx.de/fp/Disney/Scripts/LittleMermaid.html

Project Imports

import os
import re
import requests
import numpy as np
import pandas as pd
from bs4 import BeautifulSoup

Setup Pipeline Objects

Lots of libraries already exist to help build efficient pipelines. I, however, went with a simple, custom approach. This gave me complete control and ended up being pretty simple to set up and debug. Basically, each step in the process (Pipe) gets cached after completion. This allowed me to break up a large script into single stages so that I could easily debug the final output and quickly determine the area that needed more focus.

class Pipe(object):
    def __init__(self, extension):
        self.extension = extension

    def run(self, content):
        return content


class Pipeline(object):
    def __init__(self, directory, prefix, pipes):
        self._directory = directory
        self._pipes = pipes
        self._prefix = prefix
        self._verify_directory_path()

    def run(self, url):
        content = requests.get(url).text
        self._save(
            f'{self._prefix}1.html',
            content
        )
        step = 2
        for pipe in self._pipes:
            content = pipe.run(content)
            self._save(
                f'{self._prefix}{step}.{pipe.extension}',
                content
            )
            step += 1
        return content

    def _verify_directory_path(self):
        parts_in_path = self._directory.split('/')
        directory = parts_in_path[0]
        parts_to_check = parts_in_path[1:]
        for part in parts_to_check:
            directory += f'/{part}'
            if not os.path.exists(directory):
                os.mkdir(directory)

    def _save(self, file_name, content):
        path = os.path.join(self._directory, file_name)
        with open(path, 'w') as output:
            output.write(content)
        print(f'saved: {path}')
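To see how the chaining works without touching the network or the file system, here's a tiny sketch using two made-up pipes (NormalizeWhitespace and Lowercase are just for illustration; they are not part of the real pipeline):

```python
import re

class Pipe(object):
    def __init__(self, extension):
        self.extension = extension

    def run(self, content):
        return content

## hypothetical pipe: collapse runs of whitespace into single spaces
class NormalizeWhitespace(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, content):
        return re.sub(r'\s+', ' ', content).strip()

## hypothetical pipe: lowercase everything
class Lowercase(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, content):
        return content.lower()

## each pipe's output feeds the next, same as Pipeline.run
content = '  Ariel:   Hello \n World  '
for pipe in [NormalizeWhitespace(), Lowercase()]:
    content = pipe.run(content)

print(content)  ## ariel: hello world
```

The real Pipeline does exactly this loop, plus saving each intermediate result to disk so you can inspect any stage later.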

Setup Parsing Pipes

  • A Simple Pipe that converts HTML to Text using the BeautifulSoup library.

class HtmlToText(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        return bs.getText()
  • Simple Pipe that removes a given set of HTML tags using the BeautifulSoup library.

class RemoveTags(Pipe):
    def __init__(self, tags):
        self.tags = tags
        super().__init__('html')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        elements_to_remove = [
            bs.find_all(tag)
            for tag in self.tags
        ]
        for element_search in elements_to_remove:
            for tag in element_search:
                tag.decompose()
        return bs.prettify()
  • This Pipe runs through each <blockquote> and tries to designate it as either an ‘Action’ or ‘Act’ (scene) change.

class FixLittleMermaidBlockquoteText(Pipe):
    def __init__(self):
        super().__init__('html')

    def run(self, content):
        bs = BeautifulSoup(content, 'html.parser')
        for blockquote in bs.find_all('blockquote'):
            text = blockquote.text.strip()
            if text.startswith('Disclaimer:'):
                continue
            change_scene = len(
                re.findall(
                    r'^(Back|Cut|Fade|On|Titles\.|Morning at castle|Big finale)',
                    text
                )
            )
            change_scene += len(re.findall(r'Fade to', text))
            if change_scene > 0:
                blockquote.string = f'Act: {blockquote.text}'
            else:
                blockquote.string = f'Action: {blockquote.text}'
        return bs.prettify()
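If you want to poke at the scene-change heuristic on its own, here's the same pair of regexes pulled out into a standalone function (classify_blockquote is just a name I made up for the demo):

```python
import re

def classify_blockquote(text):
    ## certain leading words signal a scene change
    change_scene = len(
        re.findall(
            r'^(Back|Cut|Fade|On|Titles\.|Morning at castle|Big finale)',
            text
        )
    )
    ## 'Fade to' anywhere in the text also counts
    change_scene += len(re.findall(r'Fade to', text))
    return 'Act' if change_scene > 0 else 'Action'

print(classify_blockquote('Fade to beach.'))    ## Act
print(classify_blockquote('Sebastian sighs.'))  ## Action
```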
  • This Pipe does all of our heavy lifting: it forces our Actor / Action / Act labels onto a single line with their corresponding text (see the example output below). Once completed, we can easily extract each actor, line and act in the order it occurred.
class FormatLittleMermaidScript(Pipe):
    def __init__(self):
        super().__init__('txt')

    def run(self, script):
        ## get only the script
        i = script.index('Disclaimer:')
        script = script[i:]
        script = re.sub(r'-------\s+THE END(\s|.)+', '', script)

        ## format lines
        script = re.sub(r'\s+', ' ', script)
        script = re.sub(r'\s:\s', ': ', script)

        ## setup actors lines
        script = re.sub(
            r'((?:Act|Action|All|Disclaimer):)',
            r'\n\g<1>',
            script
        )
        script = re.sub(
            r'((?:Ariel|riel|Eric|Flounder|Scuttle|Sebastian|Triton|Ursula):)',
            r'\n\g<1>',
            script
        )
        script = re.sub(
            r'((?:Andrina|Atina|Carlotta|Flotsam|Grimsby|Jetsam|Louis|Priest|Sailor|Sailor \d|Sailors|Seahorse|Vanessa|Woman|Woman \d|Triton\'s daughters):)',
            r'\n\g<1>',
            script
        )

        ## repair
        script = re.sub(
            r'^riel:',
            'Ariel:',
            script,
            flags=re.MULTILINE
        )

        ## odd period placements
        script = re.sub(r' [.]', '.', script)

        ## condense
        script = re.sub(r'[ ]$', '', script, flags=re.MULTILINE)
        script = re.sub(
            r'^Disclaimer:[^\r\n]+',
            '',
            script,
            flags=re.MULTILINE
        )
        return script.strip()

Example Output:

Act: Fade to beach.
Ariel: [Singing] Hi!
Eric: Hello!
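The actor-label regexes are easier to understand in isolation. Here's a stripped-down version with just a few of the names, run against a made-up flattened one-liner:

```python
import re

flat = 'Act: Fade to beach. Ariel: [Singing] Hi! Eric: Hello!'

## push each known label onto its own line, same idea as the pipe above
flat = re.sub(r'((?:Act|Action|All|Disclaimer):)', r'\n\g<1>', flat)
flat = re.sub(r'((?:Ariel|Eric|Flounder|Scuttle|Sebastian|Triton|Ursula):)', r'\n\g<1>', flat)

## drop the trailing space each insertion leaves behind
flat = re.sub(r'[ ]$', '', flat, flags=re.MULTILINE)

lines = flat.strip().split('\n')
print(lines)  ## ['Act: Fade to beach.', 'Ariel: [Singing] Hi!', 'Eric: Hello!']
```

Each `\n\g<1>` replacement keeps the matched label (group 1) and simply prefixes it with a newline, which is what turns the whitespace-collapsed blob back into one line per speaker.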

Run Pipeline / Extract and Format Text

pipeline = Pipeline(
    './data/scripts/mermaid',
    'step_',
    [
        RemoveTags(['head', 'hr', 'br']),
        FixLittleMermaidBlockquoteText(),
        HtmlToText(),
        FormatLittleMermaidScript()
    ]
)

script = pipeline.run(
    'http://www.fpx.de/fp/Disney/Scripts/LittleMermaid.html'
)
Folder Structure after Running Pipeline

Convert Text to CSV — Build Features

I got a bit aggressive here, but I wanted different ways to get at each line in the script. I wanted them grouped by act but searchable by actor. I wanted access to just what the character says, but also the ability to split the line based on the actions that occur during its delivery. The result was a simple data structure (Movie, Act, Line) which allowed me to easily split out the implementation for each feature I wanted in my final CSV.

Movie Objects

  • Simple object that handles each line in the script. This object is responsible for creating all of our features.
class Line(object):
    def __init__(self, actor, line):
        self.actor = actor
        self.line = line
        self.metadata = []
        self.compressed_line = self.line
        self._process_line()

    def __str__(self):
        text = re.sub(r'<.+?>', '', self.compressed_line)
        return text.strip()

    def to_object(self):
        return {
            'actor': self.actor,
            'full': self.line,
            'compressed': self.compressed_line,
            'metadata': self.metadata,
            'line': self.__str__()
        }

    def _process_line(self):
        self.metadata = []
        text = self.line
        meta = re.findall(r'[\[\(]([^\]]+?)[\)\]]', self.line)
        for i, data in enumerate(meta):
            ## swap the bracketed action out for a numbered placeholder
            text = text.replace(
                data,
                f'<{i}>'
            )
            text = re.sub(
                r'[\[\(](<.+?>)[\]\)]',
                r'\g<1>',
                text
            )
            self.metadata.append(
                {
                    'index': i,
                    'action': data.strip(),
                    'placement': text.index(f'<{i}>')
                }
            )
        self.compressed_line = text
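The bracket-matching regex in _process_line is worth a quick look on its own. Here it is against a single (made-up) sample line:

```python
import re

line = 'Flounder, hurry up! [Giggles] Flounder, you really are a guppy.'

## pull out bracketed stage directions, same pattern the Line object uses
meta = re.findall(r'[\[\(]([^\]]+?)[\)\]]', line)
print(meta)  ## ['Giggles']

## replace the action with a numbered placeholder, then strip its brackets
compressed = line.replace(meta[0], '<0>')
compressed = re.sub(r'[\[\(](<.+?>)[\]\)]', r'\g<1>', compressed)
print(compressed)  ## Flounder, hurry up! <0> Flounder, you really are a guppy.
```

The placeholder keeps the action's position recoverable (via `placement`) while letting `__str__` strip it out for a clean spoken-text-only view.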
  • Simple object that groups lines by scene.

class Act(object):
    def __init__(self, order):
        self.order = order
        self.lines = []

    def add_line(self, line):
        self.lines.append(line)

    def to_object_array(self):
        return [ line.to_object() for line in self.lines ]
  • Simple object that contains all of the scenes in the movie. It also contains a method to convert the data structure to a Pandas DataFrame.

class Movie(object):
    def __init__(self):
        self.acts = []

    def add_act(self, act):
        self.acts.append(act)

    def to_pandas(self):
        acts = []
        for i, act in enumerate(self.acts):
            for obj in act.to_object_array():
                obj['act'] = i
                acts.append(obj)
        df = pd.DataFrame(acts)
        columns = [
            'act',
            'actor',
            'line',
            'compressed',
            'full',
            'metadata'
        ]
        return df[columns]

Parse the Little Mermaid Script

movie = Movie()
act = Act(1)
for line in script.split('\n'):
    if len(line) == 0:
        continue
    actor, text = re.findall(r'([^:]+?):(.+)', line)[0]
    if actor == 'Disclaimer' or actor == 'Action':
        ## skip for now
        continue
    if actor == 'Act':
        movie.add_act(act)
        act = Act(act.order + 1)
        continue
    act.add_line(Line(actor, text))

## don't forget the lines after the final 'Act:' marker
movie.add_act(act)
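The actor / text split above rides on a single regex. Here it is on its own, against a couple of sample lines:

```python
import re

parsed = []
for raw in ['Ariel: [Singing] Hi!', 'Eric: Hello!']:
    ## lazy match up to the first colon gives us the label, the rest is the line
    actor, text = re.findall(r'([^:]+?):(.+)', raw)[0]
    parsed.append((actor, text.strip()))

print(parsed)  ## [('Ariel', '[Singing] Hi!'), ('Eric', 'Hello!')]
```

Because `[^:]+?` is lazy and excludes colons, only the text before the first colon becomes the label, so colons later in the spoken line don't confuse the split.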

Save as CSV

movie.to_pandas().to_csv('./data/scripts/mermaid/features.csv')
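Once the CSV exists you can slice it however you like. Here's a quick sketch of counting lines per actor, using a tiny in-memory stand-in for the real DataFrame (the rows are made up):

```python
import pandas as pd

## stand-in for pd.read_csv('./data/scripts/mermaid/features.csv')
df = pd.DataFrame([
    {'act': 1, 'actor': 'Ariel', 'line': 'Hi!'},
    {'act': 1, 'actor': 'Eric', 'line': 'Hello!'},
    {'act': 2, 'actor': 'Ariel', 'line': 'Look at this stuff.'},
])

## lines per actor across the whole movie
counts = df.groupby('actor')['line'].count()
print(counts.to_dict())  ## {'Ariel': 2, 'Eric': 1}
```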