Just the ‘T’ from ‘ETL’

Published in

Perplexinomicon

5 min readMar 20, 2018

Introduction

This is part 1 of a 2 part series in an N part series about about microservices. Our overall goal is to create three microservices with related data, host them on Heroku, and then look at strategies for coordinating those services.

In the last post, we downloaded three tab separated files from the Internet Movie DataBase and loaded them into a single Postgres Database, in three tables, title_basics, name_basics, and title_principals. We created a Phoenix application called DataMunger with three Schemas, TitlePrincipal, TitleBasic, and NameBasic. We can use these three tables to find information about movies (titles) and actors (names) by way of the title_principals_table. For example, in SQL, we can learn the names of the principal actors in the movie "Thor: The Dark World"

data_imdb=# select primary_name from name_basics
data_imdb-# where nconst in(
data_imdb(# select p.nconst from title_principals p,
data_imdb(# title_basics t
data_imdb(# where t.primary_title ilike('thor%dark%')
data_imdb(# and t.title_type='movie'
data_imdb(# and p.tconst=t.tconst
data_imdb(# and p.category ilike('act%'));
  primary_name    
-------------------
 Natalie Portman
 Stellan Skarsgård
 Tom Hiddleston
 Chris Hemsworth
(4 rows)

Or, we can find the titles of movies where Kirsten Dunst plays a principal role with Ecto:

iex(2)> q = Ecto.Query.from(
...(2)> n in DataMunger.NameBasic,
...(2)> where: ilike(n.primary_name, "Kirsten Dunst"),
...(2)> select: n.nconst)iex(4)> titles_from_nconst = fn(n) ->  
...(4)> Ecto.Query.from(               
...(4)> p in DataMunger.TitlePrincipal,
...(4)> where: p.nconst == ^n and      
...(4)> ilike(p.category, "act%"),     
...(4)> select: p.tconst) |>           
...(4)> DataMunger.ImdbRepo.all() end iex(8)> q |> 
...(8)> DataMunger.ImdbRepo.one() |>
...(8)> titles_from_nconst.() |>         
...(8)> Enum.map(fn(p) ->                
...(8)>  Ecto.Query.from(                 
...(8)>  t in DataMunger.TitleBasic,      
...(8)>  where: t.tconst == ^p,           
...(8)>  select: t.primary_title) |>      
...(8)>  DataMunger.ImdbRepo.all() end)   
[["The Virgin Suicides"], ["Get Over It"],
["Luckytown"], ["The Devil's Arithmetic"],
["The Animated Adventures of Tom Sawyer"],
["Woodshock"], ["The Beguiled"], [...], ...]
iex(9)> v() |> Enum.count()
71

What we’d like to do in this post is to take the data that we have in the three tables of the imdb_data database and prepare to move a subselection of that data into a new database. In a future post we’ll extract the data we want, load it into the new tables, and export it from the new tables.

In the rest of this post, we’ll be primarily using Ecto to transform data from one table into another table.

Create new tables

So in our final service, we’re going to use a severely truncated subset of the data. Therefore, the database we have has some fields and some data that we know we aren’t interested in.

For example, there are a number of title_types in the title_basics table:

data_imdb=# select distinct(title_type) from title_basics limit 5;
 title_type  
--------------
 short
 tvMiniSeries
 movie
 tvEpisode
 video
(5 rows)

But we’re ultimately only interested in one title_type, movie. Similarly, in title_principals, there are a number of categories:

data_imdb=# select distinct(category) from title_principals limit 5;
    category     
-----------------
 writer
 archive_footage
 composer
 archive_sound
 cinematographer
(5 rows)

But we’re only interested in actor or actress. So we'll create three new schemas that reflect our use case:

Create the migrations:

$ mix ecto.gen.migration add_movies_table
$ mix ecto.gen.migration add_actors_table
$ mix ecto.gen.migration add_movies_actors_table

Define the migrations for movies:

defmodule DataMunger.Repo.Migrations.AddMoviesTable do
  use Ecto.Migration
              
  def change do
    create table("movies") do
      add :title, :string, size: 480
      add :tconst, :string, size: 9
      add :year, :integer
    end
  end
end

Actors:

defmodule DataMunger.Repo.Migrations.AddActorsTable do
  use Ecto.Migration
              
  def change do
    create table("actors") do
      add :name, :string, size: 480
      add :nconst, :string, size: 9
      add :birth_year, :integer
      add :death_year, :integer
    end
  end
end

MovieActors:

defmodule DataMunger.Repo.Migrations.AddMoviesActorsTable do
  use Ecto.Migration
              
    def change do
      create table("movie_actors") do
      add :nconst, :string, size: 9
      add :tconst, :string, size: 9
      add :characters, {:array, :string}
    end
  end
end

Run the migrations:

$ mix ecto.migrate

Transform a single record

We’re going to want to transform a lot of different records from one type to another, TitleBasic to Movie, NameBasic to Actor, and TitlePrincipal to MovieActor, which means we're going to write some custom changesets. We can use tests to help us here, by defining what we want to have happen before we try to make it happen. The first two transformations are fairly straightforward:

defmodule DataMunger.ActorTest do
  use ExUnit.Case, async: true
  alias DataMunger.Actor
  import Ecto.Query
              
  setup do
    :ok = Ecto.Adapters.SQL.Sandbox.checkout(DataMunger.Repo)
  end
              
  test "can convert a name_basic record to an actor record" do
    ["Humphrey Bogart", "Natalie Portman"]
    |> Enum.each(fn(who) -> 
      [name_basic] = from(n in DataMunger.NameBasic,
                     where: ilike(n.primary_name, ^who),
                     select: n) 
                     |> DataMunger.ImdbRepo.all() 
      change = Actor.changeset(%Actor{}, name_basic)
              
      {:ok, actor} = DataMunger.Repo.insert(change)
      assert actor.name == name_basic.primary_name
      assert actor.nconst == name_basic.nconst
      assert actor.birth_year == name_basic.birth_year
      assert actor.death_year == name_basic.death_year
    end)
  end
end

Here, we are looking up 2 actors by name to get their NameBasic structs, creating an Actor changeset for each one, inserting that changeset into the database, and comparing the relevant keys between NameBasic and Actor.

In data_munger/lib/data_munger/actor.ex:

def changeset(%Actor{} = actor, 
  attrs = %DataMunger.NameBasic{}) do
  %{}
  |> Map.put(:nconst, attrs.nconst)
  |> Map.put(:name, attrs.primary_name)
  |> Map.put(:birth_year, attrs.birth_year)
  |> Map.put(:death_year, attrs.death_year)
  |> (&(changeset(actor, &1))).()
end
              
def changeset(%Actor{} = actor, attrs) do
  actor
  |> cast(attrs, [:nconst, :name, :birth_year, :death_year])
  |> validate_required([:nconst, :name])
end

We can take a similar approach to movies, but the MovieActortransformation is a little different. When we defined this schema, we made the characters field an array type. Most of our records, however, will probably involve actors who only played one character in a particular movie. To check that we’re handling the array properly, we’ll add a constraint on char_length to attempt to get responses in our test for actors who played multiple characters in a movie.

test "can convert a name_basic record to an actor record" do
  from(p in DataMunger.TitlePrincipal, 
  where: ilike(p.category, "act%") and 
  fragment("char_length(?)", p.characters) > 100, 
  select: p, limit: 3)
  |> DataMunger.ImdbRepo.all()
  |> Enum.each(fn(principal) -> 
    change = MovieActor.changeset(%MovieActor{}, principal)
              
    {:ok, movie_actor} = DataMunger.Repo.insert(change)
    assert movie_actor.nconst == principal.nconst
    assert movie_actor.tconst == principal.tconst
    assert Enum.count(movie_actor.characters) > 1
  end)
end

But when we run this query, we’re going to see a problem in our results:

iex(309)> Ecto.Query.from(p in DataMunger.TitlePrincipal, 
...(309)> select: p.characters,                          
...(309)> where: ilike(p.category, "act%") and           
...(309)> fragment("char_length(?)", p.characters) > 100, 
...(309)> limit: 3) |>                                   
...(309)> DataMunger.ImdbRepo.all()                       
["[\"Housekeeper (segment \"The Ghost of Murder Hill\")\",
\"Susan Castaneda (segment \"The Sailor's Secret\")\"]",
.....

Specifically, the first result contains nested quotes. Now, we could probably parse the results in such a way as to keep that data, but in the interests of time, we’ll just drop any sort of segment or episode information, by removing anything that is between ( and ).

In lib/data_munger/movie_actor.ex:

def changeset(%MovieActor{} = movie_actor, 
              %DataMunger.TitlePrincipal{} = attrs) do
  attrs = Map.from_struct(attrs)
  changeset(movie_actor, attrs)
end
              
def changeset(%MovieActor{} = movie_actor, attrs) do
  movie_actor
  |> cast(attrs, [:tconst, :nconst])
  |> add_category(attrs)
  |> validate_required([:nconst, :tconst])
end
              
@doc false 
def add_category(changeset, attrs) do
  case attrs
       |> Map.get(:characters) do              
    nil -> changeset
    character_string -> 
      characters = character_string
                   |> String.replace(~r/\ \(.*\)/, "")
                   |> Poison.decode!()
      changeset 
      |> Ecto.Changeset.put_change(:characters, characters)
  end
rescue
  _ -> changeset
       |> Ecto.Changeset.put_change(:characters, [])
end

Now we can convert one single record of any kind to its corresponding record in our new tables. See the github repository for details on how the Movie schema is handled.

In the next post we’ll look at how to extract, load, and export the data.

Just the ‘T’ from ‘ETL’

Introduction

Create new tables

Transform a single record

Written by J Paul Daigle