Computing Dissimilarity Scores on Long Time Use Sequences: Distributed Arrays in Julia

Surfing through life as a postdoc
Dec 8, 2021 · 16 min read


In time use research, one of the challenges is that we work with considerably long sequences (96, 144, or 1440 steps), and we usually use optimal matching to calculate dissimilarity scores on those sequences. However, optimal matching is computationally demanding. At the moment, most social scientists use the TraMineR R package, which usually performs well but is still not fast enough.

This tutorial will show you how to compute dissimilarity scores on long sequences with Julia on an Intel Core i9 Mac laptop (16 threads). The Julia code in this demonstration uses distributed computing and can cut calculation time in half compared to TraMineR!

You can follow the tutorial using the code on GitHub (Jupyter notebook).

Computing dissimilarity scores on long sequences

The optimal matching technique for calculating dissimilarity scores on many (and long!) sequences is a computationally demanding process, but there are software packages that can help you do this. One of them is the TraMineR R package, which is currently the leading tool for optimal matching in social science research. A less common option is Julia, which is what I use in this tutorial. Julia is a language designed specifically for scientific computing and has some great features for distributed computing, such as distributed arrays.

What is a dissimilarity score

A dissimilarity score measures how different two objects (or sequences) are. In time use research, we often use this score to compare sequences of daily activities (i.e., the sequential progression of an ‘average’ day). There are many different ways to calculate a dissimilarity score, but the most common one in social sciences and time use research is the optimal matching technique.
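As an illustration, the edit-distance recursion at the heart of optimal matching can be sketched in a few lines of dynamic programming. This is a toy, hypothetical helper (`om_distance`), not the Bio.jl routine used later in the tutorial; the costs mirror the cost model used further down (match = 0, mismatch = 2, insertion = deletion = 1):

```julia
# Toy dynamic-programming sketch of the edit distance behind optimal
# matching (hypothetical helper, not the Bio.jl implementation).
# Costs: match = 0, mismatch = 2, insertion = deletion = 1.
function om_distance(s1, s2; mismatch = 2, indel = 1)
    a, b = collect(s1), collect(s2)
    m, n = length(a), length(b)
    d = Matrix{Int}(undef, m + 1, n + 1)
    for i in 0:m
        d[i + 1, 1] = i * indel    # delete all of s1's first i steps
    end
    for j in 0:n
        d[1, j + 1] = j * indel    # insert all of s2's first j steps
    end
    for i in 1:m, j in 1:n
        sub = d[i, j] + (a[i] == b[j] ? 0 : mismatch)
        d[i + 1, j + 1] = min(sub, d[i, j + 1] + indel, d[i + 1, j] + indel)
    end
    return d[m + 1, n + 1]
end

om_distance("aabb", "abbb")   # 2 (one substitution, or one deletion + one insertion)
```

The O(m·n) table is exactly what makes this expensive: a single comparison of two 1440-step days fills roughly two million cells, and the number of pairs grows quadratically with the sample size.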

Computing dissimilarity scores using Julia: pros and cons

The most evident advantage of doing computations with Julia is its performance. We can reduce the calculation time by half compared to other software packages like TraMineR.

For comparison, TraMineR in R took roughly twice as long to calculate the same matrix as the Julia code in this tutorial.

However, unlike more established languages like Python and R, Julia does not yet have a large community of developers. That means packages are sometimes no longer maintained, do not play well with each other, or simply have not been developed yet. Nevertheless, Julia is definitely worth considering for scientific computing tasks, given its superb performance!

Let’s dive in!

## change directory to the directory where you want to store the project
cd("/Users/kamilakolpashnikova/Documents")
home_path = pwd()
home_path

In the code block above, use your own path to your Documents (or whichever folder you want to work in).

## Load packages
using Bio
using Bio.Align
using DataFrames
using CSV
using ZipFile
using BenchmarkTools
using StatsBase
using Dates
#if any of the packages above do not work then run
#import Pkg; Pkg.add("<package name in double quotes>")
using Distributed
nprocs()

It will output 1 (most likely), meaning that only one process is currently available.

import Hwloc
n = Hwloc.num_physical_cores()

I use a Mac with an Intel Core i9. It has 8 cores but 16 threads, so I will multiply n (the number of physical cores) by 2 to get the number of available threads. Check your own laptop's processor to figure out how many threads are available; sometimes it's 8 cores and 8 threads, in which case 16 will not work for you. If you are not sure, comment out the first line and uncomment the second, and you will use just the number of physical cores.

addprocs(n*2, exeflags=`--project=$(@__DIR__)`)
#addprocs(n, exeflags=`--project=$(@__DIR__)`)
nprocs()

For the code block above, the output in my case is 17 (1 main process plus 16 workers).
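If you would rather not hard-code the multiplier, Julia's built-in `Sys.CPU_THREADS` already reports logical CPUs (hardware threads). A small sketch, with the `addprocs` call commented out so workers are not added twice:

```julia
# Sys.CPU_THREADS reports *logical* CPUs (hardware threads), so it already
# accounts for hyper-threading; no need to multiply physical cores by two.
logical = Sys.CPU_THREADS    # 16 on an 8-core / 16-thread Core i9
# addprocs(logical, exeflags=`--project=$(@__DIR__)`)
```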

Download and Unzip the ATUS files

The function below is a user-defined function that downloads a zip file with ATUS data, stores the .dat file in your working directory, and returns the data as a DataFrame.

function open_file_to_dataframe(link, home_path)
    download(link, home_path * "/" * last(split(link, "/")))
    zarchive = ZipFile.Reader(home_path * "/" * last(split(link, "/")))

    daf = DataFrame()
    for file in zarchive.files
        if last(split(file.name, ".")) == "dat"
            # write the .dat file into the home directory
            write(home_path * "/" * file.name, read(file, String))

            # read the file and transform it into an array of arrays
            f = open(home_path * "/" * file.name)
            lines = readlines(f)
            close(f)
            lines = split.(lines, ',')

            # transform the array of arrays into a dataframe
            namelist = Symbol.(lines[1])

            for (i, name) in enumerate(namelist)
                daf[!, name] = [lines[j][i] for j in 2:length(lines)]
            end
        end
    end
    close(zarchive)
    return daf
end

Let’s download the full 2003–2020 dataset:

link = "https://www.bls.gov/tus/special.requests/atusact-0320.zip"
#download(link)
df_full = open_file_to_dataframe(link, home_path)
first(df_full, 5)

and the eldercare dataset:

df_ec = open_file_to_dataframe("https://www.bls.gov/tus/special.requests/atusrostec-1120.zip", 
home_path)
first(df_ec, 5)
## select only ids
df_ec = select(df_ec, "TUCASEID")
first(df_ec, 5)
  • subset the full dataset by the ids in the eldercare dataset:
## filter caregivers
df = innerjoin(df_full, df_ec, on="TUCASEID")
df = unique(df)
nrow(df)
  • keep only the needed columns:
## select only needed columns
df = select(df, ["TUCASEID", "TUACTIVITY_N", "TRCODEP", "TUACTDUR24"])
last(df, 5)
  • rename columns for readability
colnames = ["caseid", "actline", "activity", "duration"]
rename!(df, Symbol.(colnames))
first(df, 5)
  • create functions to recode original activity codes to 11 broad categories:
## custom functions recoding activity codes (you need to change it 
# if you want to focus on other activities)
# the coding is arbitrary (this is what I usually use in my research--
# you can change it the way you want)
#"1" or "a" = "Sleep",
#"2" or "b" = "Personal Care",
#"3" or "c" = "Housework",
#"4" or "d" = "Child Care",
#"5" or "e" = "Adult Care",
#"6" or "f" = "Work and Education",
#"7" or "g" = "Shopping",
#"8" or "h" = "TV Watching",
#"9" or "i" = "Eating",
#"10" or "j" = "Leisure",
#"11" or "k" = "Travel and Other"
function a_trans(val)
    a_dct = Dict(1 => "a", 2 => "b", 3 => "c", 4 => "d",
                 5 => "e", 6 => "f", 7 => "g", 8 => "h", 9 => "i",
                 10 => "j", 11 => "k")
    return a_dct[val]
end
function activity_trans(val)
act_dct = Dict("010101" => 1, "010102" => 1,
"010199" => 1, "010201" => 2, "010299" => 2,
"010301" => 2, "010399" => 2, "010401" => 2,
"010499" => 2, "010501" => 2, "010599" => 2,
"019999" => 2, "020101" => 3, "020102" => 3,
"020103" => 3, "020104" => 3, "020199" => 3,
"020201" => 3, "020202" => 3, "020203" => 3,
"020299" => 3, "020301" => 3, "020302" => 3,
"020303" => 3, "020399" => 3, "020400" => 3,
"020401" => 3, "020402" => 3, "020499" => 3,
"020500" => 3, "020501" => 3, "020502" => 3,
"020599" => 3, "020600" => 3, "020601" => 3,
"020602" => 3, "020603" => 3, "020681" => 10,
"020699" => 3, "020700" => 3, "020701" => 3,
"020799" => 3, "020800" => 3, "020801" => 3,
"020899" => 3, "020900" => 3, "020901" => 3,
"020902" => 3, "020903" => 3, "020904" => 3,
"020905" => 3, "020999" => 3, "029900" => 3,
"029999" => 3, "030100" => 4, "030101" => 4,
"030102" => 4, "030103" => 4, "030104" => 4,
"030105" => 4, "030106" => 4, "030107" => 4,
"030108" => 4, "030109" => 4, "030110" => 4,
"030111" => 4, "030112" => 4, "030199" => 4,
"030200" => 4, "030201" => 4, "030202" => 4,
"030203" => 4, "030204" => 4, "030299" => 4,
"030300" => 4, "030301" => 4, "030302" => 4,
"030303" => 4, "030399" => 4, "040100" => 4,
"040101" => 4, "040102" => 4, "040103" => 4,
"040104" => 4, "040105" => 4, "040106" => 4,
"040107" => 4, "040108" => 4, "040109" => 4,
"040110" => 4, "040111" => 4, "040112" => 4,
"040199" => 4, "040200" => 4, "040201" => 4,
"040202" => 4, "040203" => 4, "040204" => 4,
"040299" => 4, "040300" => 4, "040301" => 4,
"040302" => 4, "040303" => 4, "040399" => 4,
"030186" => 4, "040186" => 4, "030000" => 5,
"030400" => 5, "030401" => 5, "030402" => 5,
"030403" => 5, "030404" => 5, "030405" => 5,
"030499" => 5, "030500" => 5, "030501" => 5,
"030502" => 5, "030503" => 5, "030504" => 5,
"030599" => 5, "039900" => 5, "039999" => 5,
"040000" => 5, "040400" => 5, "040401" => 5,
"040402" => 5, "040403" => 5, "040404" => 5,
"040405" => 5, "040499" => 5, "040500" => 5,
"040501" => 5, "040502" => 5, "040503" => 5,
"040504" => 5, "040505" => 5, "040506" => 5,
"040507" => 5, "040508" => 5, "040599" => 5,
"049900" => 5, "049999" => 5, "050000" => 6,
"050100" => 6, "050101" => 6, "050102" => 6,
"050103" => 6, "050104" => 6, "050199" => 6,
"050200" => 6, "050201" => 6, "050202" => 6,
"050203" => 6, "050204" => 6, "050205" => 6,
"050299" => 6, "050300" => 6, "050301" => 6,
"050302" => 6, "050303" => 6, "050304" => 6,
"050305" => 6, "050399" => 6, "050400" => 6,
"050401" => 6, "050403" => 6, "050404" => 6,
"050405" => 6, "050499" => 6, "059900" => 6,
"059999" => 6, "060000" => 6, "060100" => 6,
"060101" => 6, "060102" => 6, "060103" => 6,
"060104" => 6, "060199" => 6, "060200" => 6,
"060201" => 6, "060202" => 6, "060203" => 6,
"060204" => 6, "060299" => 6, "060300" => 6,
"060301" => 6, "060302" => 6, "060303" => 6,
"060399" => 6, "060400" => 6, "060401" => 6,
"060402" => 6, "060403" => 6, "060499" => 6,
"069900" => 6, "069999" => 6, "050481" => 6,
"050389" => 6, "050189" => 6, "060289" => 6,
"050289" => 6, "070000" => 7, "070100" => 7,
"070101" => 7, "070102" => 7, "070103" => 7,
"070104" => 7, "070105" => 7, "070199" => 7,
"070200" => 7, "070201" => 7, "070299" => 7,
"070300" => 7, "070301" => 7, "070399" => 7,
"079900" => 7, "079999" => 7, "080000" => 7,
"080100" => 7, "080101" => 7, "080102" => 7,
"080199" => 7, "080200" => 7, "080201" => 7,
"080202" => 7, "080203" => 7, "080299" => 7,
"080300" => 7, "080301" => 7, "080302" => 7,
"080399" => 7, "080400" => 7, "080401" => 7,
"080402" => 7, "080403" => 7, "080499" => 7,
"080500" => 7, "080501" => 7, "080502" => 7,
"080599" => 7, "080600" => 7, "080601" => 7,
"080602" => 7, "080699" => 7, "080700" => 7,
"080701" => 7, "080702" => 7, "080799" => 7,
"080800" => 7, "080801" => 7, "080899" => 7,
"089900" => 7, "089999" => 7, "090000" => 7,
"090100" => 7, "090101" => 7, "090102" => 7,
"090103" => 7, "090104" => 7, "090199" => 7,
"090200" => 7, "090201" => 7, "090202" => 7,
"090299" => 7, "090300" => 7, "090301" => 7,
"090302" => 7, "090399" => 7, "090400" => 7,
"090401" => 7, "090402" => 7, "090499" => 7,
"090500" => 7, "090501" => 7, "090502" => 7,
"090599" => 7, "099900" => 7, "099999" => 7,
"100000" => 7, "100100" => 7, "100101" => 7,
"100102" => 7, "100103" => 7, "100199" => 7,
"100200" => 7, "100201" => 7, "100299" => 7,
"100300" => 7, "100303" => 7, "100304" => 7,
"100399" => 7, "100400" => 7, "100401" => 7,
"100499" => 7, "109900" => 7, "109999" => 7,
"120303" => 8, "120304" => 8, "110000" => 9,
"110100" => 9, "110101" => 9, "110199" => 9,
"110200" => 9, "110201" => 9, "110299" => 9,
"119900" => 9, "110289" => 9, "119999" => 9,
"120000" => 10, "120100" => 10, "120101" => 10,
"120199" => 10, "120200" => 10, "120201" => 10,
"120202" => 10, "120299" => 10, "120300" => 10,
"120301" => 10, "120302" => 10, "120305" => 10,
"120306" => 10, "120307" => 10, "120308" => 10,
"120309" => 10, "120310" => 10, "120311" => 10,
"120312" => 10, "120313" => 10, "120399" => 10,
"120400" => 10, "120401" => 10, "120402" => 10,
"120403" => 10, "120404" => 10, "120405" => 10,
"120499" => 10, "120500" => 10, "120501" => 10,
"120502" => 10, "120503" => 10, "120504" => 10,
"120599" => 10, "129900" => 10, "129999" => 10,
"130000" => 10, "130100" => 10, "130101" => 10,
"130102" => 10, "130103" => 10, "130104" => 10,
"130105" => 10, "130106" => 10, "130107" => 10,
"130108" => 10, "130109" => 10, "130110" => 10,
"130111" => 10, "130112" => 10, "130113" => 10,
"130114" => 10, "130115" => 10, "130116" => 10,
"130117" => 10, "130118" => 10, "130119" => 10,
"130120" => 10, "130121" => 10, "130122" => 10,
"130123" => 10, "130124" => 10, "130125" => 10,
"130126" => 10, "130127" => 10, "130128" => 10,
"130129" => 10, "130130" => 10, "130131" => 10,
"130132" => 10, "130133" => 10, "130134" => 10,
"130135" => 10, "130136" => 10, "130199" => 10,
"130200" => 10, "130201" => 10, "130202" => 10,
"130203" => 10, "130204" => 10, "130205" => 10,
"130206" => 10, "130207" => 10, "130208" => 10,
"130209" => 10, "130210" => 10, "130211" => 10,
"130212" => 10, "130213" => 10, "130214" => 10,
"130215" => 10, "130216" => 10, "130217" => 10,
"130218" => 10, "130219" => 10, "130220" => 10,
"130221" => 10, "130222" => 10, "130223" => 10,
"130224" => 10, "130225" => 10, "130226" => 10,
"130227" => 10, "130228" => 10, "130229" => 10,
"130230" => 10, "130231" => 10, "130232" => 10,
"130299" => 10, "130300" => 10, "130301" => 10,
"130302" => 10, "130399" => 10, "130400" => 10,
"130401" => 10, "130402" => 10, "130499" => 10,
"139900" => 10, "139999" => 10, "140000" => 10,
"140100" => 10, "140101" => 10, "140102" => 10,
"140103" => 10, "140104" => 10, "140105" => 10,
"149900" => 10, "149999" => 10, "150000" => 10,
"150100" => 10, "150101" => 10, "150102" => 10,
"150103" => 10, "150104" => 10, "150105" => 10,
"150106" => 10, "150199" => 10, "150200" => 10,
"150201" => 10, "150202" => 10, "150203" => 10,
"150204" => 10, "150299" => 10, "150300" => 10,
"150301" => 10, "150302" => 10, "150399" => 10,
"150400" => 10, "150401" => 10, "150402" => 10,
"150499" => 10, "150500" => 10, "150501" => 10,
"150599" => 10, "150600" => 10, "150601" => 10,
"150602" => 10, "150699" => 10, "150700" => 10,
"150701" => 10, "150799" => 10, "150800" => 10,
"150801" => 10, "150899" => 10, "159900" => 10,
"159999" => 10, "160000" => 10, "160100" => 10,
"160101" => 10, "160102" => 10, "160103" => 10,
"160104" => 10, "160105" => 10, "160106" => 10,
"160107" => 10, "160108" => 10, "160199" => 10,
"160200" => 10, "160201" => 10, "160299" => 10,
"169900" => 10, "169999" => 10, "159989" => 10,
"169989" => 10, "110281" => 10, "100381" => 10,
"100383" => 10, "180000" => 11, "180100" => 11,
"180101" => 11, "180199" => 11, "180200" => 11,
"180201" => 11, "180202" => 11, "180203" => 11,
"180204" => 11, "180205" => 11, "180206" => 11,
"180207" => 11, "180208" => 11, "180209" => 11,
"180280" => 11, "180299" => 11, "180300" => 11,
"180301" => 11, "180302" => 11, "180303" => 11,
"180304" => 11, "180305" => 11, "180306" => 11,
"180307" => 11, "180399" => 11, "180400" => 11,
"180401" => 11, "180402" => 11, "180403" => 11,
"180404" => 11, "180405" => 11, "180406" => 11,
"180407" => 11, "180482" => 11, "180499" => 11,
"180500" => 11, "180501" => 11, "180502" => 11,
"180503" => 11, "180504" => 11, "180599" => 11,
"180600" => 11, "180601" => 11, "180602" => 11,
"180603" => 11, "180604" => 11, "180605" => 11,
"180699" => 11, "180700" => 11, "180701" => 11,
"180702" => 11, "180703" => 11, "180704" => 11,
"180705" => 11, "180782" => 11, "180799" => 11,
"180800" => 11, "180801" => 11, "180802" => 11,
"180803" => 11, "180804" => 11, "180805" => 11,
"180806" => 11, "180807" => 11, "180899" => 11,
"180900" => 11, "180901" => 11, "180902" => 11,
"180903" => 11, "180904" => 11, "180905" => 11,
"180999" => 11, "181000" => 11, "181001" => 11,
"181002" => 11, "181099" => 11, "181100" => 11,
"181101" => 11, "181199" => 11, "181200" => 11,
"181201" => 11, "181202" => 11, "181203" => 11,
"181204" => 11, "181205" => 11, "181206" => 11,
"181283" => 11, "181299" => 11, "181300" => 11,
"181301" => 11, "181302" => 11, "181399" => 11,
"181400" => 11, "181401" => 11, "181499" => 11,
"181500" => 11, "181501" => 11, "181599" => 11,
"181600" => 11, "181601" => 11, "181699" => 11,
"181800" => 11, "181801" => 11, "181899" => 11,
"189900" => 11, "189999" => 11, "180481" => 11,
"180381" => 11, "180382" => 11, "181081" => 11,
"180589" => 11, "180682" => 11, "500000" => 11,
"500100" => 11, "500101" => 11, "500102" => 11,
"500103" => 11, "500104" => 11, "500105" => 11,
"500106" => 11, "500107" => 11, "509900" => 11,
"509989" => 11, "509999" => 11)

return a_trans(act_dct[val])
end

I avoid using numeric codes for the sequences because of the way the pairalign function in the Bio.jl package works. It takes strings as sequences, and if numeric codes are used and there are more than 9 activities, the activity boundaries become ambiguous. There are ways around this, but I chose to use characters instead.
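A quick, self-contained illustration of that ambiguity (toy strings, not ATUS data):

```julia
# with numeric codes, concatenation loses the activity boundaries once
# codes go above 9; one character per activity stays unambiguous
seq_numeric = string(1) * string(10)   # "110": [1, 10]? [1, 1, 0]? [11, 0]?
seq_chars   = "a" * "j"                # "aj": unambiguously [1, 10]
```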

  • recode the activities using the dictionary function above
df[!, :act] = activity_trans.(df[!, :activity])
first(df, 5)
  • check that you don’t have ‘surprise’ activity codes left uncoded:
## check if all activities were recoded; if numbers appear in the list,
# some of the activities were missed
unique(df[!, :act])
  • change the type of the duration variable to integer
## change the type of the activity duration variable to integers
df[!, :dur] = parse.(Int64, df[!, :duration])
first(df, 5)
  • multiply activity codes by duration, creating subsequences
## create sequences separately for each activity line
df[!, :act_rep] = copy(df[!, :act])   # copy to avoid aliasing the :act column
for i in 1:nrow(df)
    df.act_rep[i] = repeat(df.act[i], df.dur[i])
end
first(df, 3)
  • initiate a dictionary that will contain ids and sequences
## to combine the sequences by caseid,
# I'll create a separate dictionary with caseid's as keys
tempoTable = Dict()
for i in unique(df[!, :caseid])
    tempoTable[i] = []
end
  • function that will chunk 1-minute sequences of 1440 steps into 15-minute ones
## function that will divide an array (sequence) into equal parts of length n
chunk(arr, n) = [arr[i:min(i + n - 1, end)] for i in 1:n:length(arr)]
  • replace each 15-minute chunk with its most common activity code:
## in these lines I loop over the df grouped by caseid,
# collect all act_rep per caseid into one sequence per caseid,
# divide sequences into chunks of 15 (representing 15 minutes),
# and choose the most common activity code to represent each chunk
for d in groupby(df, :caseid)
    temp = join(push!(tempoTable[d.caseid[1]], d.act_rep)[1])
    lst = ""
    for c in chunk(temp, 15)
        # argmax on the countmap dictionary picks the most frequent code
        lst = lst * argmax(countmap(c))
    end
    tempoTable[d.caseid[1]] = lst
end
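To see what this chunking-and-mode step does, here is a toy run on a made-up 30-minute string. `mode_char` is a hypothetical pure-Julia stand-in for the countmap step, so the snippet runs on its own without StatsBase:

```julia
# divide an array (or string) into equal parts of length n, as above
chunk(arr, n) = [arr[i:min(i + n - 1, end)] for i in 1:n:length(arr)]

# most frequent character in a chunk (stand-in for the countmap step)
mode_char(s) = argmax(c -> count(==(c), s), unique(s))

temp = "a"^10 * "f"^20                 # 10 min of sleep, then 20 min of work
reduced = join(mode_char(c) for c in chunk(temp, 15))
# chunk 1 has ten 'a' vs five 'f' -> 'a'; chunk 2 is all 'f' -> "af"
```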
  • check how the dictionary looks:
tempoTable
  • check that all sequences are of length 96
## when we transform sequences of 1440 minutes into 15-minute slots,
# the resulting sequences should be 96 steps in length;
# these lines check that all values in tempoTable are of length 96
# (if nothing is printed, we are OK)
for i in keys(tempoTable)
    if length(tempoTable[i]) != 96
        print(tempoTable[i])
        print(" ")
    end
end
  • create an array of sequences without ids:
## let's create an array that contains only the sequences
lines = collect(values(tempoTable))
  • save the array (as you might need to run on it many times)
## it might be that you will be running OM again, so better save the lines variable
outfile = "lines.txt"
open(outfile, "w") do f
    for i in lines
        println(f, i)
    end
end
  • copy the sequences array to all worker processes
## you can restart from here by just opening the lines file
@everywhere f = open("lines.txt");
@everywhere lines = readlines(f)
  • copy the packages and cost model to all worker processes
## before we can distribute the tasks to our workers,
# we need to give them all the data and packages
# that they need to perform the tasks
@everywhere using Bio.Align
@everywhere using DistributedArrays   # needed for DArray below
@everywhere costmodel = CostModel(match=0, mismatch=2, insertion=1, deletion=1);
  • initiate the dissimilarity matrix with zeros
## let's create a dissimilarity matrix where the results will be stored
## because I will be hashing the line codes
# I use UInt32 for preserving memory
# However, do not run more than 20000 sequences at once
# --> may run into memory problems
# and hashing problems (above 40000)
d_m = zeros(UInt32, length(lines), length(lines))

I’m using UInt32 to save memory space.
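As a back-of-the-envelope check (plain arithmetic; `mem_gib` is a hypothetical helper name):

```julia
# approximate memory for a dense n-by-n matrix with a given element size
mem_gib(n, bytes_per_element) = n^2 * bytes_per_element / 1024^3

mem_gib(20_000, 4)   # UInt32:  ~1.49 GiB
mem_gib(20_000, 8)   # Float64/UInt64: ~2.98 GiB, twice the footprint
```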

  • hash the sequence pairs as an integer
## hash the combinations of sequences that will be matched
# using a simple integer representing first sequence + second sequence
for i in 1:length(lines)
    for j in 1:length(lines)
        d_m[i,j] = i*100000 + j
    end
end
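You can sanity-check this packing with a round trip; `encode` and `decode` are hypothetical helper names, and the decoding mirrors the worker function defined below. The scheme also explains the ~40,000-sequence caveat: `typemax(UInt32)` is 4,294,967,295, so `i*100000 + j` no longer fits once i goes above roughly 42,949.

```julia
# pack an index pair (i, j) into one UInt32, then unpack it again
encode(i, j) = UInt32(i * 100_000 + j)
decode(num)  = divrem(Int(num), 100_000)   # (i, j)

decode(encode(123, 456))                    # (123, 456)
encode(42_000, 99_999) <= typemax(UInt32)   # true; i around 43_000 would overflow
```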
  • transform to a distributed array A
## transform the dissimilarity matrix to a distributed array
# (each worker takes its own block of d_m)
A = DArray(I -> d_m[I[1], I[2]], (length(lines), length(lines)))
  • send a customized pairalign function to all workers
## function that will run optimal matching:
# 1. it reads the value and transforms it into indexes of the lines array
# 2. it runs optimal matching on those two lines (sequences)
# 3. only the lower triangle of the dissimilarity matrix is calculated
# (for optimization purposes)
@everywhere function pairalignEditNum_Opt(num::UInt32,
                                          lines = lines,
                                          editdist = EditDistance(),
                                          cost = costmodel)
    pos1 = convert(UInt32, floor(num/100000))
    pos2 = num - pos1*100000
    if pos1 > pos2
        return distance(pairalign(editdist, lines[pos1], lines[pos2], cost))
    else
        return 0
    end
end
  • log the start of OM
## log the start of OM
Dates.format(now(), "HH:MM")
  • OM
output_mat_opt = pairalignEditNum_Opt.(A)
  • log the end of OM
## log the end of OM
Dates.format(now(), "HH:MM")
  • collect the distributed array back
## let's bring the distributed array back
B = Array{UInt32}(output_mat_opt)
  • calculate the other half of the matrix
## mirror the computed lower triangle into the upper triangle
for i in 1:size(B, 1)
    for j in 1:size(B, 2)
        if i < j && B[i, j] == 0
            B[i, j] = B[j, i]
        end
    end
end
  • check out the output (in UInt32)
## here is the final dissimilarity matrix
B
  • write the dissimilarity matrix to your working directory
using Tables   # for Tables.table
CSV.write("dissimilarity_matrix.csv", Tables.table(B), writeheader=false)
## the file with the dissimilarity scores matrix will be saved in the working directory
## you can use that matrix for further analysis either in Julia or in other languages (Python, R)

To see what you can do with this matrix in R (or how to do the same analysis in R), check out my other tutorial:

  1. Kolpashnikova, K. (2021, November 10). Sequence Analysis: Time Use Data (ATUS) in R. https://doi.org/10.31219/osf.io/ts34v

I hope this tutorial was helpful. Happy computing!
