Constructing a Swedish BIP39 wordlist using language engineering in Swift

I’ve been keen on developing a Swedish BIP-39 compliant wordlist for some time now, using my favourite programming language Swift of course! A BIP-39 wordlist is a list of 2048 words that are easy to remember. In the context of BIP-39, each word encodes an integer in the range 0–2047 (11 bits): the integer 3, say, maps to the fourth word in the wordlist. This gives a human-friendly serialization of random bytes, e.g. a randomly generated 32-byte (256-bit) Bitcoin private key (a simplification, since one would typically use HD keys).
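To make the 11-bit mapping concrete, here is a minimal sketch that cuts a byte sequence into 11-bit groups and looks each group up in a wordlist. The function name `wordIndices(from:)` and the placeholder wordlist are assumptions made for illustration. The sketch also leaves out the checksum bits that real BIP-39 appends before splitting (8 bits for 256 bits of entropy, giving 264 bits = 24 words), so it is not a compliant encoder.

```swift
/// Splits `bytes` into consecutive 11-bit groups, most significant bit first,
/// and returns each group as an integer in 0...2047.
/// Trailing bits that do not fill a whole group are dropped in this sketch.
func wordIndices(from bytes: [UInt8]) -> [Int] {
    var indices: [Int] = []
    var accumulator = 0   // bits read so far that have not been emitted yet
    var bitCount = 0      // number of valid bits in `accumulator`

    for byte in bytes {
        accumulator = (accumulator << 8) | Int(byte)
        bitCount += 8
        while bitCount >= 11 {
            bitCount -= 11
            indices.append((accumulator >> bitCount) & 0x7FF)
        }
        accumulator &= (1 << bitCount) - 1  // keep only the leftover bits
    }
    return indices
}

// Placeholder wordlist with 2048 made-up entries, standing in for a real one.
let placeholderWordlist = (0..<2048).map { "word\($0)" }

// 24 bits of input yield two complete 11-bit groups.
let exampleIndices = wordIndices(from: [0x00, 0x60, 0x00])
let exampleWords = exampleIndices.map { placeholderWordlist[$0] }
print(exampleIndices, exampleWords)
```

Note that 256 bits do not divide evenly by 11, which is why the real scheme pads the entropy with checksum bits rather than dropping the remainder as this sketch does.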

At first glance I thought it would not be much work. Boy, was I wrong! 2048 did not sound like many words, and I thought I could do a lot of the work manually by picking words that are easy to remember. It turns out 2048 is, in fact, quite a lot of words: somewhere between 800 and 3,000 lemmas (word families/root words) is a large enough English vocabulary for day-to-day use:

If you learn only 800 of the most frequently-used lemmas in English, you’ll be able to understand 75% of the language as it is spoken in normal life.

Eight hundred lemmas will help you speak a language in a day-to-day setting, but to understand dialogue in film or TV you’ll need to know the 3,000 most common lemmas.

Quite early on I had the idea that homonyms would play a large role in the final wordlist: because a homonym has multiple meanings, you are more likely to form an association with it. So it would be interesting to identify homonyms and give them a higher rank/priority.

I also figured that part of speech (POS) would play a large role, since nouns such as “elephant” are easier to form an association with, and thus easier to remember, than determiners such as “the” or subordinating conjunctions such as “whether”. So I decided to analyze the POS distribution of the English BIP-39 wordlist, using the awesome Python tool NLTK (Natural Language Toolkit) and this small Python script I wrote:

Simple Python script I wrote analyzing POS of English BIP39 wordlist

This is the POS distribution of the English BIP39 wordlist:

Cardinal numbers, foreign words, determiners, pronouns, conjunctions (surprisingly!) and modal verbs are the POS labels not visible in the pie chart.

Not so surprisingly we see a clear noun dominance!

Corpus

Corpus Line Format

är  VB.PRS.AKT  |vara..vb.1|    -   316581  13026.365036

The columns contain this information:

  1. Word
  2. Part of speech, legend here
  3. Lemma(s) — here called “lemgram(s)” since it is the terminology used by Språkbanken in relation to the corpus.
  4. + or -, indicating whether a compound analysis was possible, e.g. 🇸🇪 "stämband" is a compound word consisting of "stäm" and "band".
  5. Raw frequency (total number of occurrences)
  6. Relative frequency (number of occurrences per 1 million words)

Parsing the corpus

Read lines

The goal of this step is to convert the source corpus into Swift ParsedLine models which we can write to a JSON file to allow faster execution of the program next time. For the next run of the program we can thus skip this step.

ParsedLine model

Apart from data parsed from corpus we add one additional property — indexOfWordInCorpus.
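The original embedded model is not reproduced here, so the following is only a sketch of what a ParsedLine and its parser could look like, based on the six columns described above. All property names except indexOfWordInCorpus, and the function `parseLine(_:indexOfWordInCorpus:)`, are assumptions. The sketch also assumes the columns are separated by tabs or spaces and that the word itself contains no whitespace.

```swift
/// One line of the corpus statistics file, plus its position in the file.
/// Codable so that parsed lines can later be written to and read from JSON.
struct ParsedLine: Equatable, Codable {
    let word: String
    let partOfSpeech: String
    let lemgrams: [String]
    let hasCompoundAnalysis: Bool
    let rawFrequency: Int
    let relativeFrequency: Double
    let indexOfWordInCorpus: Int
}

/// Parses one whitespace-separated corpus line.
/// Returns nil if the line does not have exactly six well-formed columns.
func parseLine(_ line: String, indexOfWordInCorpus: Int) -> ParsedLine? {
    let columns = line
        .split(whereSeparator: { $0 == " " || $0 == "\t" })
        .map { String($0) }

    guard columns.count == 6,
          let rawFrequency = Int(columns[4]),
          let relativeFrequency = Double(columns[5])
    else { return nil }

    return ParsedLine(
        word: columns[0],
        partOfSpeech: columns[1],
        // "|vara..vb.1|" becomes ["vara..vb.1"]; empty pieces are dropped.
        lemgrams: columns[2].split(separator: "|").map { String($0) },
        hasCompoundAnalysis: columns[3] == "+",
        rawFrequency: rawFrequency,
        relativeFrequency: relativeFrequency,
        indexOfWordInCorpus: indexOfWordInCorpus
    )
}
```

Returning an optional keeps malformed lines out of the result without aborting the whole read, which seems a reasonable default for a corpus of this size.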

So far our algorithm looks something like this (Swifty-pseudocode):

let corpusFile = openFile("swedish_huge_corpus.txt")
let lines = readLines(count: 100_000, from: corpusFile)
let parsedLines = parseLines(lines)

Reject unfit lines

We read the corpus until we have created a list of L lines. This step does not contain much logic, but there is no point in saving lines we know we will reject, e.g. because the word is too short or because it is a delimiter.

On line #252 in the corpus we find this line:

sa VB.PRT.AKT |säga..vb.1| - 4857774 364.958352

If we were to just look at the word (first component), 🇸🇪: “sa” (🇬🇧: “said”), we would reject this line since it is shorter than the threshold of 3 characters. However, if we look at the base word (lemma), 🇸🇪: “säga” (🇬🇧: “to say”), it is four characters long. Thus, by including this line we might get interesting data for the decision about the base word.

But word length is not the only criterion for rejection. We might also completely exclude words with unwanted POS tags, e.g. “foreign word”, and we probably want to somehow enforce a POS distribution similar to that of the English BIP39 wordlist. So we update the pseudocode of our algorithm to reflect this logic:

let corpusFile = openFile("swedish_huge_corpus.txt")
let lines = readLines(count: 100_000, from: corpusFile)
let parsedLines = parseLines(lines)
let goodLengthLines = goodWordLengthLines(parsedLines)
let posFilteredLines = whitelistedPOSLines(goodLengthLines)
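The two filtering steps in the pseudocode could be sketched roughly as follows. The `Line` type is a trimmed-down stand-in for the parsed line, `baseWord(ofLemgram:)` is a hypothetical helper, and the whitelist contents are purely illustrative. The sketch assumes the base word is the part of the lemgram before the first "." and that the coarse POS tag is the part of the tag before the first ".".

```swift
/// Minimal stand-in for a parsed corpus line: only the fields the filters need.
struct Line: Equatable {
    let word: String
    let partOfSpeech: String
    let lemgrams: [String]
}

/// Extracts the base word from a lemgram, e.g. "säga..vb.1" gives "säga".
/// Assumes the base word itself contains no "." character.
func baseWord(ofLemgram lemgram: String) -> String {
    String(lemgram.prefix(while: { $0 != "." }))
}

/// Keeps a line if the word itself, or any of its base words, is long enough.
func goodWordLengthLines(_ lines: [Line], minimumLength: Int = 3) -> [Line] {
    lines.filter { line in
        let candidates = [line.word] + line.lemgrams.map(baseWord(ofLemgram:))
        return candidates.contains { $0.count >= minimumLength }
    }
}

/// Keeps a line if its coarse POS tag (the part before the first ".") is whitelisted.
func whitelistedPOSLines(_ lines: [Line], whitelist: Set<String>) -> [Line] {
    lines.filter { line in
        let coarseTag = String(line.partOfSpeech.prefix(while: { $0 != "." }))
        return whitelist.contains(coarseTag)
    }
}

// "sa" is only two characters long, but its base word "säga" has four, so it is kept.
let saLine = Line(word: "sa", partOfSpeech: "VB.PRT.AKT", lemgrams: ["säga..vb.1"])
let keptLines = whitelistedPOSLines(goodWordLengthLines([saLine]), whitelist: ["NN", "VB", "JJ"])
print(keptLines.map { $0.word })
```

Enforcing a target POS distribution, as opposed to a plain whitelist, would need an additional step and is not covered by this sketch.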

I don’t want to accidentally pass parsedLines into the function whitelistedPOSLines instead of goodLengthLines, so I create a separate type as output from each step. But that could easily lead to code duplication, since I want to include the original properties of ParsedLine. I could have just let that be a single property in e.g. struct WordLengthLine, like so:

struct WordLengthLine {
    let parsedLine: ParsedLine
}

But then I would need to access the properties of ParsedLine via the parsedLine property all the time, so I decided to create a set of protocols allowing me to write like this:
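The original embedded snippet is not reproduced here. As a rough sketch of the idea, a wrapping protocol can forward every property to the line it wraps through default implementations. Only the names ParsedLine, WordLengthLine and LineFromCorpusFromLine come from the text; the protocol `LineFromCorpus`, the `wrapped` property, the type `WhitelistedPOSLine` and the reduced property set are assumptions.

```swift
/// Properties every corpus line exposes (hypothetical protocol, reduced property set).
protocol LineFromCorpus {
    var word: String { get }
    var partOfSpeech: String { get }
    var indexOfWordInCorpus: Int { get }
}

struct ParsedLine: LineFromCorpus, Hashable {
    let word: String
    let partOfSpeech: String
    let indexOfWordInCorpus: Int
}

/// A line that wraps the output type of an earlier step.
protocol LineFromCorpusFromLine: LineFromCorpus {
    associatedtype Wrapped: LineFromCorpus
    var wrapped: Wrapped { get }
}

/// Default implementations forward every property to the wrapped line,
/// so conforming structs need no forwarding code of their own.
extension LineFromCorpusFromLine {
    var word: String { wrapped.word }
    var partOfSpeech: String { wrapped.partOfSpeech }
    var indexOfWordInCorpus: Int { wrapped.indexOfWordInCorpus }
}

struct WordLengthLine: LineFromCorpusFromLine, Hashable {
    let wrapped: ParsedLine
}

struct WhitelistedPOSLine: LineFromCorpusFromLine, Hashable {
    let wrapped: WordLengthLine
}

/// Accepts only lines that already passed the word length step.
/// The actual filtering is omitted; the point is the parameter type.
func whitelistedPOSLines(_ lines: [WordLengthLine]) -> [WhitelistedPOSLine] {
    lines.map { WhitelistedPOSLine(wrapped: $0) }
}

let parsedSaga = ParsedLine(word: "säga", partOfSpeech: "VB", indexOfWordInCorpus: 251)
let lengthChecked = WordLengthLine(wrapped: parsedSaga)
print(lengthChecked.word)  // read directly, without going through `wrapped`
// whitelistedPOSLines([parsedSaga]) would not compile: [ParsedLine] is not [WordLengthLine].
```

Each step's output stays a plain struct, so the synthesized memberwise initializer and the Equatable/Hashable conformances remain available.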

Here LineFromCorpusFromLine inherits from some other protocols, allowing me to access all the properties of ParsedLine directly from an instance of WordLengthLine, just as if WordLengthLine were a ParsedLine. Now you might ask “why not just use classes and inheritance?”, to which I answer: when L = 500,000 I really want to benefit from the memory and execution efficiency of structs rather than classes. I also love the automagically synthesised initializers and the conformance to Equatable and Hashable that we get from structs.

Our code still looks a bit boilerplate-y:

let corpusFile = openFile("swedish_huge_corpus.txt")
let lines = readLines(count: 100_000, from: corpusFile)
let parsedLines = parseLines(lines)
let goodLengthLines = goodWordLengthLines(parsedLines)
let posFilteredLines = whitelistedPOSLines(goodLengthLines)

Since we only use each variable once, on the line right after it is declared, and just pass it forward to a function call, it would be nicer to skip declaring the variables altogether. What if we treat each function call here as a “step” or a “job” taking us one step closer to our result? We could create a small type for each job and then put them all in one “pipeline”, like so:

let config = Config(
    lineCount: 100_000,
    fileName: "swedish_huge_corpus.txt"
)
let pipeLine = Pipeline(config: config, jobs: [
    OpenFileJob(),
    ReadLinesJob(),
    ParseLinesJob(),
    ...
])

I call the entity above Pipeline because I want to “pipe”, Unix style (|), the output from job N-1 into job N, the output of job N into job N+1, and so on. This requires generics: each job needs to declare its type of input and output. Since a Swift array cannot hold elements of heterogeneous types, the pseudocode above will not work. We need some type erasure here, combined with the possibility to “pipe” jobs together.

Job

A runnable task performing some work.
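The embedded declaration is missing from this copy. Going by the names used in the text (Job, Input, Output and a work method taking an input), the protocol could look something like this. The `throws` and the example job `SplitLinesJob` are assumptions.

```swift
/// A runnable task that turns an `Input` into an `Output`.
protocol Job {
    associatedtype Input
    associatedtype Output
    func work(input: Input) throws -> Output
}

/// Example job (hypothetical): splits raw text into lines.
/// `Input` and `Output` are inferred from the signature of `work(input:)`.
struct SplitLinesJob: Job {
    func work(input: String) throws -> [String] {
        input.split(separator: "\n").map { String($0) }
    }
}
```

Because the associated types are inferred from `work(input:)`, a concrete job only needs that one method.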

Now, for the sake of convenience (it will save lots of code), let’s declare an operator allowing us to pipe jobs together.

Operator allowing for chaining — piping — “jobs” together.
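The operator declaration itself is not reproduced here. One possible shape, sketched under the assumption that the operator composes two jobs into a new job, is shown below. `PipedJob`, `PipePrecedence` and the example jobs are hypothetical names, and the Job protocol is an assumed shape based on the names in the text.

```swift
/// Assumed shape of the Job protocol, included so the snippet compiles on its own.
protocol Job {
    associatedtype Input
    associatedtype Output
    func work(input: Input) throws -> Output
}

/// Two jobs run back to back: the output of `first` becomes the input of `second`.
struct PipedJob<First: Job, Second: Job>: Job where First.Output == Second.Input {
    let first: First
    let second: Second

    func work(input: First.Input) throws -> Second.Output {
        try second.work(input: first.work(input: input))
    }
}

// Left associativity makes `a |> b |> c` mean `(a |> b) |> c`.
precedencegroup PipePrecedence {
    associativity: left
}
infix operator |>: PipePrecedence

/// Pipes `first` into `second`. Only compiles when the types line up.
func |> <First: Job, Second: Job>(first: First, second: Second) -> PipedJob<First, Second>
where First.Output == Second.Input {
    PipedJob(first: first, second: second)
}

// Example jobs, made up for illustration.
struct SplitLinesJob: Job {
    func work(input: String) throws -> [String] {
        input.split(separator: "\n").map { String($0) }
    }
}

struct CountJob: Job {
    func work(input: [String]) throws -> Int { input.count }
}

let countLines = SplitLinesJob() |> CountJob()
```

Swapping the operands, `CountJob() |> SplitLinesJob()`, is rejected at compile time because Int is not String, which is the main benefit of encoding the constraint in the operator's signature.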

Great! But how can we use these? Let’s create a type called Pipeline which makes use of these.
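Since the embedded type is missing here, the following is a hedged sketch of a type-erasing Pipeline. It stores the work as a closure together with a description. The initializer signatures and the example jobs are assumptions, as is the shape of the Job protocol.

```swift
/// Assumed shape of the Job protocol, included so the snippet compiles on its own.
protocol Job {
    associatedtype Input
    associatedtype Output
    func work(input: Input) throws -> Output
}

/// A type-erased job: callers only see `Input` and `Output`,
/// not the concrete jobs that do the work.
struct Pipeline<Input, Output>: Job, CustomStringConvertible {
    let description: String
    private let run: (Input) throws -> Output

    init(description: String, run: @escaping (Input) throws -> Output) {
        self.description = description
        self.run = run
    }

    /// Wraps any concrete job whose input and output types match.
    init<J: Job>(_ job: J) where J.Input == Input, J.Output == Output {
        self.init(description: String(describing: J.self)) { try job.work(input: $0) }
    }

    func work(input: Input) throws -> Output {
        try run(input)
    }
}

// Example jobs, made up for illustration.
struct CountLinesJob: Job {
    func work(input: String) throws -> Int { input.split(separator: "\n").count }
}

struct CountCharactersJob: Job {
    func work(input: String) throws -> Int { input.count }
}

// Two different job types can now live in the same array.
let counters: [Pipeline<String, Int>] = [
    Pipeline(CountLinesJob()),
    Pipeline(CountCharactersJob()),
]
for counter in counters {
    print(counter, try counter.work(input: "sa\nsäga"))
}
```

Naming the generic parameters Input and Output lets them double as the associated types required by Job, so no typealiases are needed.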

We can see that our Pipeline itself conforms to the protocol Job and thus has to declare the types Input and Output as well as the work(input:) method. So how can we initialize the Pipeline with several jobs and pipe them all together?

It turns out there has been language support for this since Swift 5.1: the function builder. The original Swift Evolution proposal dates back to 2019-06-03, but just recently (2020-08-16) a much-updated version was suggested which:

“captures the actual state of the implementation on master (any trunk development snapshot will do), most of which is also in Swift 5.3.”

By using generics together with function builders (@_functionBuilder) we can achieve great syntax at the call site, while keeping code and responsibilities clear at the declaration site. But before I proceed with the implementation, let’s talk a bit about variadic generics (see the Swift “Generics Manifesto”). We want to be able to pipe jobs A, B, C and D together, like so: A | B | C | D, or, using the actual operator declared above: A |> B |> C |> D. Each job conforms to the protocol Job and thus declares an associatedtype Input and an associatedtype Output, which gives us lots of constraints: D.Input == C.Output, C.Input == B.Output, and so on.

We therefore need to accept a list of jobs where the type of each job is declared as a generic type (with constraints) in the function signature. A pipeline with 3 jobs requires 3 generic types (with constraints…), a pipeline with 4 jobs requires 4 generic types, etc. Accepting an arbitrary number of generic types is called variadic generics, which Swift currently (Swift 5.3, Xcode 12 beta 4) does not support. So we have to manually declare an initializer/buildBlock function for each number of jobs we want to support. Because of this lack of variadic generics, Apple went with the solution of declaring a buildBlock for each combination of up to 10 view types in the special-purpose @_functionBuilder called @ViewBuilder, which is the reason why you cannot declare more than 10 child views in any view in SwiftUI:

SwiftUI’s ViewBuilder declares 10 different buildBlock methods for a combo of 0...9 child views — due to lack of variadic generics support in Swift.

Anyhow, my point is that I have to resort to the same solution as Apple did with SwiftUI’s ViewBuilder, namely declaring many similar buildBlock functions. This kind of boilerplate drives me crazy, so I will be incredibly happy the day we have variadic generics in Swift! A possible way to alleviate the repetitive boilerplate is gyb (“generate your boilerplate”), an Apple-developed code generation tool written in Python, used by Apple in e.g. swift-crypto (I wrote that section by the way 😃) and thus probably also in the closed-source CryptoKit. I reckon gyb is also used in SwiftUI. Another amazing alternative is the metaprogramming tool Sourcery. But I think I will make the same design choice as Apple did with SwiftUI and only support up to a maximum of 10 jobs for now, which is just below the pain threshold that would justify complicating things with gyb or Sourcery.

Without much further ado, here is the implementation:
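The embedded implementation is not reproduced in this copy, so what follows is a reduced sketch rather than the project's actual code. It uses @resultBuilder, the official spelling of @_functionBuilder since Swift 5.4, and shows buildBlock for two and three jobs only. `PipedJob`, `PipePrecedence`, `PipelineBuilder`, the exact signature of descriptionOf and the example jobs are assumptions, and Job, Pipeline and |> are assumed shapes based on the names in the text.

```swift
protocol Job {
    associatedtype Input
    associatedtype Output
    func work(input: Input) throws -> Output
}

/// Two jobs run back to back.
struct PipedJob<First: Job, Second: Job>: Job where First.Output == Second.Input {
    let first: First
    let second: Second
    func work(input: First.Input) throws -> Second.Output {
        try second.work(input: first.work(input: input))
    }
}

precedencegroup PipePrecedence { associativity: left }
infix operator |>: PipePrecedence

func |> <First: Job, Second: Job>(first: First, second: Second) -> PipedJob<First, Second>
where First.Output == Second.Input {
    PipedJob(first: first, second: second)
}

/// Joins the type names of the given jobs, e.g. "JobA |> JobB".
func descriptionOf(_ jobs: Any...) -> String {
    jobs.map { String(describing: type(of: $0)) }.joined(separator: " |> ")
}

struct Pipeline<Input, Output>: Job, CustomStringConvertible {
    let description: String
    private let run: (Input) throws -> Output

    init(description: String, run: @escaping (Input) throws -> Output) {
        self.description = description
        self.run = run
    }

    /// Enables the comma-free `Pipeline { JobA(); JobB() }` syntax.
    init(@PipelineBuilder _ build: () -> Pipeline<Input, Output>) {
        self = build()
    }

    func work(input: Input) throws -> Output { try run(input) }
}

@resultBuilder
enum PipelineBuilder {
    // One overload per supported number of jobs; only two and three are shown.
    static func buildBlock<A: Job, B: Job>(_ a: A, _ b: B) -> Pipeline<A.Input, B.Output>
    where A.Output == B.Input {
        Pipeline(description: descriptionOf(a, b)) { try (a |> b).work(input: $0) }
    }

    static func buildBlock<A: Job, B: Job, C: Job>(
        _ a: A, _ b: B, _ c: C
    ) -> Pipeline<A.Input, C.Output>
    where A.Output == B.Input, B.Output == C.Input {
        Pipeline(description: descriptionOf(a, b, c)) { try (a |> b |> c).work(input: $0) }
    }
}

// Example jobs, made up for illustration.
struct SplitLinesJob: Job {
    func work(input: String) throws -> [String] {
        input.split(separator: "\n").map { String($0) }
    }
}
struct DropShortWordsJob: Job {
    func work(input: [String]) throws -> [String] { input.filter { $0.count >= 3 } }
}
struct CountJob: Job {
    func work(input: [String]) throws -> Int { input.count }
}

// The builder turns the three statements into a single buildBlock(_:_:_:) call.
let wordCountPipeline = Pipeline {
    SplitLinesJob()
    DropShortWordsJob()
    CountJob()
}
print(wordCountPipeline)
```

Supporting more jobs means adding one more buildBlock overload per arity, each with one more generic parameter and one more same-type constraint.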

buildBlock functions with 5–9 different jobs omitted for sake of brevity.

Where descriptionOf used in each Pipeline init is just a small function concatenating the name of the job types together. In the trailing closure of each Pipeline init we make use of our custom operator |> making it pretty sweet and easy to read IMO.

Wait, but why?

Creating an instance of Pipeline using function builder syntax. N.B. the omission of commas after each job: short and sweet syntax IMO!

We can create minimal, self-contained, easy to test “jobs” and pipe (chain) them together with this “composition syntax”. If we realize we want to add more logic we can easily add another job or change any of the tiny job structs.

Further improvements

Cacher is just a simple utility I wrote that uses Swift’s Codable and writes the result to disk. We can then easily make all our *Line DTOs conform to Codable, which we get automatically since they are all structs.

I also do not output Array<ScannedLine> but rather a typealias, ScannedLines = Lines<ScannedLine>, where Lines is my custom collection type for bundling lines together.

Now ScanJob is cacheable, and it can validate its cache by checking whether enough lines were loaded from the cache.
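The caching idea can be sketched as follows. This is not the project's actual Cacher: the method names, the JSON file layout, the cache name "scanned" and the free function `scannedLines(count:cacher:scan:)` are all assumptions, and a plain array is used instead of the custom Lines collection.

```swift
import Foundation

/// Hypothetical stand-in for the article's Cacher: stores Codable values as JSON files.
struct Cacher {
    let directory: URL

    func save<Value: Encodable>(_ value: Value, name: String) throws {
        let data = try JSONEncoder().encode(value)
        try data.write(to: fileURL(name: name), options: .atomic)
    }

    /// Returns nil when nothing has been cached under `name` yet.
    func load<Value: Decodable>(_ type: Value.Type, name: String) throws -> Value? {
        let url = fileURL(name: name)
        guard FileManager.default.fileExists(atPath: url.path) else { return nil }
        return try JSONDecoder().decode(type, from: Data(contentsOf: url))
    }

    private func fileURL(name: String) -> URL {
        directory.appendingPathComponent(name).appendingPathExtension("json")
    }
}

/// Reduced stand-in for the scanned line DTO.
struct ScannedLine: Codable, Equatable {
    let word: String
}

/// Uses the cache when it holds at least `count` lines.
/// Otherwise runs `scan`, stores the fresh result, and returns it.
func scannedLines(
    count: Int,
    cacher: Cacher,
    scan: (Int) throws -> [ScannedLine]
) throws -> [ScannedLine] {
    if let cached = try cacher.load([ScannedLine].self, name: "scanned"),
       cached.count >= count {
        return Array(cached.prefix(count))
    }
    let fresh = try scan(count)
    try cacher.save(fresh, name: "scanned")
    return fresh
}
```

The cache check compares only the number of lines, mirroring the validation described above. A cache produced from a different corpus file would go unnoticed by this sketch.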

So what is the Swedish BIP39 wordlist?

Here’s the link to the GitHub repo with this project; it is called Behandla (🇬🇧 “Process”). I also decided to try to refactor out just the Pipeline part as a separate SPM package (it has been over half a year since I worked on it though, so I don’t really know its state; feel free to have a look anyway!).

This is my submission for TopTal’s Swift page — https://www.toptal.com/swift

Written by

Cryptocurrency and DLT evangelist. Freelancing IT consultant and app developer in love with Swift.
