FString: String lengths at compile time

Preventing `data-too-wide’ SQL exceptions in Scala

James Phillips
AnyJunk
6 min readFeb 5, 2018

--

Most strings that you store in a traditional SQL database have a maximum length allowed. This can frequently cause problems, since this maximum length isn’t known by your code.

You might have some checks dotted around, maybe on your API or DB layer, that only sensible length strings get into your system — but you probably don’t have them for every single field. Mutating strings, such as adding two fields together, also invalidates such checks — it will likely just cause a run-time error. One of the nice things about a type checker and functional programming in particular is that it enables and encourages you to remove entire classses of exceptions from your code. It would be great to have these maximum lengths encoded in the compiler, say in some class FString[_] (F stands for ‘finite’)

Then, if your API accepts an FString[50] and your database accepts an FString[50] the compiler will do all the checks for you all the way through your system — at compile time (and at the boundaries to your program). No more run-time data-too-wide exceptions.

As a bonus you can even make the compiler do arithmetic for you, so you know you cannot add two FString[30]s together and store them in an FString[50]field. You will be made aware of this potential problem and be forced to handle it.

How would we do this?

Nat

Getting string lengths into the compiler requires some concept of numbers in the type system. There’s a pretty standard approach to this in libraries such as Shapeless, a type called Nat.

Nat is essentially just an HList where every constituent type is Unit. Or if you don’t know what an HListis, think of Nat as list of ‘+1’ instructions, where the length denotes the value of the number.

Eg,

etc. Succ stands for ‘successor’. In the integers, this is just ‘+1’.
You can see that ‘3’ is just 3 levels of Succ around the base case _0. We created single values as well as the pure types above.

These are all generated by a macro in shapeless.

You can do some pretty cool stuff in the compiler with Nat, moving logic to compile-time:

Less than

Say we wanted a function that accepted two Nat s - one completely free and the second with a constraint that it must be less than the first. How would we do this?

We’d need a ‘less than or equal’ type class to begin with, which would act as evidence that the first type is ‘less than’ the second:

How do we generate the instances of this typeclass? From our definition of Nat, the integer < operation is translated directly into the Succ relation — _1 is less than _3 because _3 is some number of Succs wrapped around _1.

So we can write the following:

Now we can write the following:

Addition

Something else you can do with integers is add them together, so we’d like to be able to do this with Nat too.

This is a bit more complicated since we have a result to deal with — we generate a completely new Nat which we need to somehow record. Our typeclass representing Sum will then necessarily have three types, the two summands and the result.

It will use the ‘Aux’ pattern to record the result type. I’ll give a brief justification so skip this if you’re familiar with the ‘Aux’ pattern already.

We want the compiler to do the arithmetic for us. We don’t want the user (the programmer using this code) to have to do the addition themselves. But if our Sum type class was parameterised by all three types this is exactly what
they would need to do; they can’t say Sum[_1, _2, ???]— they would have to always know the result beforehand and always write Sum[_1, _2, _3].

So that means the result type must be an inner type to Sum (let’s call it Out for ‘output’). However if we do this then the inner type is essentially lost to the compiler, it becomes just sum.Out for some value sum, unable to be usefully compared to any other type.

To get around this problem the Aux pattern is a little trick to only reveal to the user a typeclass with just the two parameters we’re interested in, while internally maintaining the three explicit type parameters necessary to do
useful compile-time type comparisons. It looks like this:

So we have a useful type class. How do we generate the instances? Well, in integer arithmetic there are certain axioms to do with addition:

Identity: A + 0 = A, for all A
Associativity(ish): (A + 1) + B = C => A + (B + 1) = C, for all A, B

These two are sufficient for integer addition, so all we need to do is put them into code:

Note that commutativity (A + B = B + A) is a corollary of the two above and does not need its own function.

Given the above, we can now write the following:

SparseNat

Nat is great, but pretty cumbersome when some of your max DB string lengths are 255 (or higher). Making the compiler handle a 255-nested type won’t be quick either. The limit on my computer seems to be adding _22 to _22 !

To get around this, we standardised around a few round(ish) numbers for lengths in our DB, and rolled our own version of Nat which only handles these numbers and none in between. We called it SparseNat (because it’s sparse).

Here it is:

Note there are a few changes from Nat:

- We introduce _unbound as a way of typing an arbitrary-length string
- Each SparseNat carries an (optional) integer within it, representing its length
- Each type now has a non-empty object to go with it, so we can implicitly access the integer value it represents
- Succ has been renamed Jump, since we’re lurching unpredictably from number to number rather than stepping up by 1 each time
- Only a few numbers are represented. Feel free to implement as many as you like

Less than

Thankfully, Lte is the same as with Nat since it only depends on relative relationships rather than absolute relationships. So we can do this:

and Lte works with SparseNat just as with Nat.

Sum

Sum is more difficult with SparseNat since we’ve lost our associativity axiom. Observe:

Jump[_50] + _100 = _100 + _100 = _255 (minimal SparseNat >= 200)

but

_50 + Jump[_100] = _50 + _255 = _unbound (minimal SparseNat >= 305)

so things are unfortunately more complicated.

In fact, because of the nature of SparseNat being able to implement any subset of integers you want, there is nothing we can say at all about addition, except commutativity and the identity of 0.

This means that we need to implement each addition ourselves, manually. Thankfully though this can be a lazy procedure — if you never add _25 and _50 you never have to write its Sum instance.

SparseNat‘s Sum typeclass therefore looks like this:

Note: in a future blog post a method for speeding up Nat compile times will be proposed and can be used as a substitute for SparseNat with no impact on the rest of your code.

FString

Now we finally have enough to define our FString class. It’s basically just a wrapper around String, parameterised by a SparseNat with a constructor that returns a scalaz Validation, and some
utility methods such as ++.

If you don’t use scalaz validation you can use any similar construct: Either, Try, \/, cats.Validated, etc.

Here it is:

And that’s it. Now we can have models which have accurate length limitations on their string values, and take care of all validation at the extremes of our program: the API and the database layer. An FString field in a JSON case class would fail the decoding if the string were too long, immediately kicking out a user with a 400. Similarly with the database type mappers.

And if we tried to insert an FString[_50] from the API into an FString[_25] field in our database, our program would not compile. We’d be forced to either shrink the API field or widen the database field in order to handle the potential exception.

--

--