Just the ‘T’ from ‘ETL’

J Paul Daigle
Perplexinomicon
Published in
5 min readMar 20, 2018

Introduction

This is part 1 of a 2 part series in an N part series about about microservices. Our overall goal is to create three microservices with related data, host them on Heroku, and then look at strategies for coordinating those services.

In the last post, we downloaded three tab separated files from the Internet Movie DataBase and loaded them into a single Postgres Database, in three tables, title_basics, name_basics, and title_principals. We created a Phoenix application called DataMunger with three Schemas, TitlePrincipal, TitleBasic, and NameBasic. We can use these three tables to find information about movies (titles) and actors (names) by way of the title_principals_table. For example, in SQL, we can learn the names of the principal actors in the movie "Thor: The Dark World"

data_imdb=# select primary_name from name_basics
data_imdb-# where nconst in(
data_imdb(# select p.nconst from title_principals p,
data_imdb(# title_basics t
data_imdb(# where t.primary_title ilike('thor%dark%')
data_imdb(# and t.title_type='movie'
data_imdb(# and p.tconst=t.tconst
data_imdb(# and p.category ilike('act%'));
primary_name
-------------------
Natalie Portman
Stellan Skarsgård
Tom Hiddleston
Chris Hemsworth
(4 rows)

Or, we can find the titles of movies where Kirsten Dunst plays a principal role with Ecto:

iex(2)> q = Ecto.Query.from(
...(2)> n in DataMunger.NameBasic,
...(2)> where: ilike(n.primary_name, "Kirsten Dunst"),
...(2)> select: n.nconst)
iex(4)> titles_from_nconst = fn(n) ->
...(4)> Ecto.Query.from(
...(4)> p in DataMunger.TitlePrincipal,
...(4)> where: p.nconst == ^n and
...(4)> ilike(p.category, "act%"),
...(4)> select: p.tconst) |>
...(4)> DataMunger.ImdbRepo.all() end
iex(8)> q |>
...(8)> DataMunger.ImdbRepo.one() |>
...(8)> titles_from_nconst.() |>
...(8)> Enum.map(fn(p) ->
...(8)> Ecto.Query.from(
...(8)> t in DataMunger.TitleBasic,
...(8)> where: t.tconst == ^p,
...(8)> select: t.primary_title) |>
...(8)> DataMunger.ImdbRepo.all() end)
[["The Virgin Suicides"], ["Get Over It"],
["Luckytown"], ["The Devil's Arithmetic"],
["The Animated Adventures of Tom Sawyer"],
["Woodshock"], ["The Beguiled"], [...], ...]
iex(9)> v() |> Enum.count()
71

What we’d like to do in this post is to take the data that we have in the three tables of the imdb_data database and prepare to move a subselection of that data into a new database. In a future post we’ll extract the data we want, load it into the new tables, and export it from the new tables.

In the rest of this post, we’ll be primarily using Ecto to transform data from one table into another table.

Create new tables

So in our final service, we’re going to use a severely truncated subset of the data. Therefore, the database we have has some fields and some data that we know we aren’t interested in.

For example, there are a number of title_types in the title_basics table:

data_imdb=# select distinct(title_type) from title_basics limit 5;
title_type
--------------
short
tvMiniSeries
movie
tvEpisode
video
(5 rows)

But we’re ultimately only interested in one title_type, movie. Similarly, in title_principals, there are a number of categories:

data_imdb=# select distinct(category) from title_principals limit 5;
category
-----------------
writer
archive_footage
composer
archive_sound
cinematographer
(5 rows)

But we’re only interested in actor or actress. So we'll create three new schemas that reflect our use case:

Create the migrations:

$ mix ecto.gen.migration add_movies_table
$ mix ecto.gen.migration add_actors_table
$ mix ecto.gen.migration add_movies_actors_table

Define the migrations for movies:

defmodule DataMunger.Repo.Migrations.AddMoviesTable do
use Ecto.Migration

def change do
create table("movies") do
add :title, :string, size: 480
add :tconst, :string, size: 9
add :year, :integer
end
end
end

Actors:

defmodule DataMunger.Repo.Migrations.AddActorsTable do
use Ecto.Migration

def change do
create table("actors") do
add :name, :string, size: 480
add :nconst, :string, size: 9
add :birth_year, :integer
add :death_year, :integer
end
end
end

MovieActors:

defmodule DataMunger.Repo.Migrations.AddMoviesActorsTable do
use Ecto.Migration

def change do
create table("movie_actors") do
add :nconst, :string, size: 9
add :tconst, :string, size: 9
add :characters, {:array, :string}
end
end
end

Run the migrations:

$ mix ecto.migrate

Transform a single record

We’re going to want to transform a lot of different records from one type to another, TitleBasic to Movie, NameBasic to Actor, and TitlePrincipal to MovieActor, which means we're going to write some custom changesets. We can use tests to help us here, by defining what we want to have happen before we try to make it happen. The first two transformations are fairly straightforward:

defmodule DataMunger.ActorTest do
use ExUnit.Case, async: true
alias DataMunger.Actor
import Ecto.Query

setup do
:ok = Ecto.Adapters.SQL.Sandbox.checkout(DataMunger.Repo)
end

test "can convert a name_basic record to an actor record" do
["Humphrey Bogart", "Natalie Portman"]
|> Enum.each(fn(who) ->
[name_basic] = from(n in DataMunger.NameBasic,
where: ilike(n.primary_name, ^who),
select: n)
|> DataMunger.ImdbRepo.all()
change = Actor.changeset(%Actor{}, name_basic)

{:ok, actor} = DataMunger.Repo.insert(change)
assert actor.name == name_basic.primary_name
assert actor.nconst == name_basic.nconst
assert actor.birth_year == name_basic.birth_year
assert actor.death_year == name_basic.death_year
end)
end
end

Here, we are looking up 2 actors by name to get their NameBasic structs, creating an Actor changeset for each one, inserting that changeset into the database, and comparing the relevant keys between NameBasic and Actor.

In data_munger/lib/data_munger/actor.ex:

def changeset(%Actor{} = actor, 
attrs = %DataMunger.NameBasic{}) do
%{}
|> Map.put(:nconst, attrs.nconst)
|> Map.put(:name, attrs.primary_name)
|> Map.put(:birth_year, attrs.birth_year)
|> Map.put(:death_year, attrs.death_year)
|> (&(changeset(actor, &1))).()
end

def changeset(%Actor{} = actor, attrs) do
actor
|> cast(attrs, [:nconst, :name, :birth_year, :death_year])
|> validate_required([:nconst, :name])
end

We can take a similar approach to movies, but the MovieActortransformation is a little different. When we defined this schema, we made the characters field an array type. Most of our records, however, will probably involve actors who only played one character in a particular movie. To check that we’re handling the array properly, we’ll add a constraint on char_length to attempt to get responses in our test for actors who played multiple characters in a movie.

test "can convert a name_basic record to an actor record" do
from(p in DataMunger.TitlePrincipal,
where: ilike(p.category, "act%") and
fragment("char_length(?)", p.characters) > 100,
select: p, limit: 3)
|> DataMunger.ImdbRepo.all()
|> Enum.each(fn(principal) ->
change = MovieActor.changeset(%MovieActor{}, principal)

{:ok, movie_actor} = DataMunger.Repo.insert(change)
assert movie_actor.nconst == principal.nconst
assert movie_actor.tconst == principal.tconst
assert Enum.count(movie_actor.characters) > 1
end)
end

But when we run this query, we’re going to see a problem in our results:

iex(309)> Ecto.Query.from(p in DataMunger.TitlePrincipal, 
...(309)> select: p.characters,
...(309)> where: ilike(p.category, "act%") and
...(309)> fragment("char_length(?)", p.characters) > 100,
...(309)> limit: 3) |>
...(309)> DataMunger.ImdbRepo.all()
["[\"Housekeeper (segment \"The Ghost of Murder Hill\")\",
\"Susan Castaneda (segment \"The Sailor's Secret\")\"]",
.....

Specifically, the first result contains nested quotes. Now, we could probably parse the results in such a way as to keep that data, but in the interests of time, we’ll just drop any sort of segment or episode information, by removing anything that is between ( and ).

In lib/data_munger/movie_actor.ex:

def changeset(%MovieActor{} = movie_actor, 
%DataMunger.TitlePrincipal{} = attrs) do
attrs = Map.from_struct(attrs)
changeset(movie_actor, attrs)
end

def changeset(%MovieActor{} = movie_actor, attrs) do
movie_actor
|> cast(attrs, [:tconst, :nconst])
|> add_category(attrs)
|> validate_required([:nconst, :tconst])
end

@doc false
def add_category(changeset, attrs) do
case attrs
|> Map.get(:characters) do
nil -> changeset
character_string ->
characters = character_string
|> String.replace(~r/\ \(.*\)/, "")
|> Poison.decode!()
changeset
|> Ecto.Changeset.put_change(:characters, characters)
end
rescue
_ -> changeset
|> Ecto.Changeset.put_change(:characters, [])
end

Now we can convert one single record of any kind to its corresponding record in our new tables. See the github repository for details on how the Movie schema is handled.

In the next post we’ll look at how to extract, load, and export the data.

--

--

J Paul Daigle
Perplexinomicon

Father, husband, code monkey, experimental mathematician and conventional musician.