How to train your trade data, part 1: download

(Code here https://gist.github.com/tweed1e/73d2e84246deefed77757f8c5a30a68d.)

At the end of the 19th century, governments made most of their money from import tariffs instead of income taxes. Fast forward 100 years and trade is one of the most popular fields in economics. Why? Customs agencies had to record the values of all goods coming into their countries in order to levy those tariffs, which means they have lots of trade data for economists to study.

To get some, check out our friends at the NBER (http://www.nber.org/data/, a great source for all kinds of data) and go through some Terms of Use stuff to get to http://cid.econ.ucdavis.edu/ucdstats.html. There’s a new dataset covering 1960–2000, but it’s only available in Stata / SAS formats.

Do the codes

Let’s get some data. There are 28 years with 3 zip files per year, which makes 84 files you don’t want to download and unzip by hand. R is a great way to handle the download, the unzip, the data processing, the analysis, and the graphics that come out the other end. To see the full file, check the gist here (https://gist.github.com/tweed1e/73d2e84246deefed77757f8c5a30a68d).

First, some download and unzip code:

# create a tempfile so we don't have to make up names for each file
tf <- tempfile(pattern="wtdb", fileext=".zip") # they're all zipped.
# download the file at url `f_name` to tempfile `tf`
download.file(url=f_name, destfile=tf, method="curl")
# the file is zipped; unzip it. there are two files inside, take the first one (the second is a readme)
file <- unzip(tf)[[1]]

Since we’ll be downloading a country name/code concordance along with the 84 actual data files, it’s best to wrap this in a function that takes a url and returns the filename of the downloaded and unzipped file. After that, we’ll have to read and process it. Two things to remember:

# it's a fixed-width file; read_table is a decent way to get it all in.
# also make sure value is read as a double.
z <- read_table(x,
                col_names=c("imp_code",
                            "sitc",
                            "exp_code",
                            "value"),
                col_types=cols(.default="c",
                               value="d"))
# make sure we add the `year` observation (`year` is supplied outside this snippet, e.g. as a function argument)
z <- z %>% mutate(year = 1900 + year)
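
For reference, here’s roughly how these pieces might fit together into the `load_wtdb` function used below. This is a sketch rather than the exact gist code, and `build_url` is a hypothetical helper standing in for however the gist turns a year/index pair into a file URL:

# sketch only: the real load_wtdb is in the gist.
# `pair` is a one-row data frame with columns `i` (two-digit year) and `j` (file index),
# constructed in the next step. `build_url` is a hypothetical URL builder.
library(readr)
library(dplyr)

load_wtdb <- function(pair) {
  year <- pair$i
  f_name <- build_url(pair$i, pair$j)
  tf <- tempfile(pattern="wtdb", fileext=".zip")
  download.file(url=f_name, destfile=tf, method="curl")
  x <- unzip(tf)[[1]]
  read_table(x,
             col_names=c("imp_code", "sitc", "exp_code", "value"),
             col_types=cols(.default="c", value="d")) %>%
    mutate(year = 1900 + year)
}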

Next, create the list of year/index pairs needed to download the full dataset:

x <- expand.grid(i=80:97, j=1:3)

Then split the data frame to get a list of argument pairs:

y <- split(x, seq_len(nrow(x)))
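
Each element of `y` is now a one-row data frame holding a single year/index pair, which is exactly what `load_wtdb` will receive:

y[[1]]
#    i j
# 1 80 1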

Apply the `load_wtdb` function (sketched above; full code in the gist: https://gist.github.com/tweed1e/73d2e84246deefed77757f8c5a30a68d) to each pair in the list:

z <- lapply(y, FUN=load_wtdb)

Then use `Reduce` to fold the `rbind` (row bind) function over the list `z`, appending all the data frames together:

w <- Reduce(rbind, z) # 12 million rows.
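
If you’re already using dplyr, `bind_rows` does the same thing in one pass and is usually faster on a long list:

w <- bind_rows(z) # same result as Reduce(rbind, z)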

Merge on the country names using the country code list, starting with the importer:

# make sure we keep track of which name is the importer
w <- w %>%
  left_join(names, by=c("imp_code" = "code")) %>%
  rename(imp_name = name)
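
The exporter side is the same join again, just keyed on `exp_code`:

# and the same for the exporter
w <- w %>%
  left_join(names, by=c("exp_code" = "code")) %>%
  rename(exp_name = name)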

Now we’ve got everything together (SITC is a 4-digit product classification; for more information, see http://unstats.un.org/unsd/cr/registry/regcst.asp?Cl=14).

> w
# A tibble: 12,002,871 x 5
   sitc  value  year imp_name     exp_name
   <chr> <dbl> <dbl> <chr>        <chr>
 1 0011    283  1980 SOUTH AFRICA USA
 2 0011     40  1980 SOUTH AFRICA UNITED KINGDOM
 3 0011     34  1980 SOUTH AFRICA AUSTRIA
 4 0012     71  1980 SOUTH AFRICA GERMANY
 5 0012     18  1980 SOUTH AFRICA UNITED KINGDOM
 6 0015   3158  1980 SOUTH AFRICA USA
 7 0015    274  1980 SOUTH AFRICA ARGENTINA
 8 0015   1368  1980 SOUTH AFRICA UNITED KINGDOM
 9 0015    706  1980 SOUTH AFRICA NEW ZEALAND
10 0019     16  1980 SOUTH AFRICA SENEGAL
# ... with 12,002,861 more rows

Do a graph

Graph total exports by country (I’ll say more about the data processing in part 2). Note that the US and Germany each export almost double what France, the next economy in line, does. And remember that Germany has about 80 million people to the US’s 300 million. If you scroll down to the bottom, most countries export very little on this scale. (For more, check the gist here: https://gist.github.com/tweed1e/73d2e84246deefed77757f8c5a30a68d.)
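
The plotting code itself is in the gist, but a minimal sketch of a chart like this, assuming the tibble `w` from above and ignoring the extra processing I’ll cover in part 2, might look something like:

library(dplyr)
library(ggplot2)

# total exports by country, biggest exporters at the top
w %>%
  group_by(exp_name) %>%
  summarise(total_exports = sum(value, na.rm = TRUE)) %>%
  ggplot(aes(x = reorder(exp_name, total_exports), y = total_exports)) +
  geom_col() +
  coord_flip() +
  labs(x = NULL, y = "Total exports")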

Note some problems we haven’t dealt with yet — some of the country names are pretty bizarre, like “KIRIBATI (includes” or “BR.IND.OC.TR”. We’ll have to deal with this in the future if we want to join this with other data (e.g., to look at trade relative to some covariates like GDP, geography, and so on).