FString: String lengths at compile time
Preventing `data-too-wide’ SQL exceptions in Scala
Most strings that you store in a traditional SQL database have a maximum length allowed. This can frequently cause problems, since this maximum length isn’t known by your code.
You might have some checks dotted around, maybe on your API or DB layer, that only sensible length strings get into your system — but you probably don’t have them for every single field. Mutating strings, such as adding two fields together, also invalidates such checks — it will likely just cause a run-time error. One of the nice things about a type checker and functional programming in particular is that it enables and encourages you to remove entire classses of exceptions from your code. It would be great to have these maximum lengths encoded in the compiler, say in some class
F stands for ‘finite’)
Then, if your API accepts an
FString and your database accepts an
FString the compiler will do all the checks for you all the way through your system — at compile time (and at the boundaries to your program). No more run-time data-too-wide exceptions.
As a bonus you can even make the compiler do arithmetic for you, so you know you cannot add two
FStrings together and store them in an
FStringfield. You will be made aware of this potential problem and be forced to handle it.
How would we do this?
Getting string lengths into the compiler requires some concept of numbers in the type system. There’s a pretty standard approach to this in libraries such as Shapeless, a type called
Nat is essentially just an
HList where every constituent type is
Unit. Or if you don’t know what an
HListis, think of Nat as list of ‘+1’ instructions, where the length denotes the value of the number.
Succ stands for ‘successor’. In the integers, this is just ‘+1’.
You can see that ‘3’ is just 3 levels of
Succ around the base case
_0. We created single values as well as the pure types above.
These are all generated by a macro in shapeless.
You can do some pretty cool stuff in the compiler with
Nat, moving logic to compile-time:
Say we wanted a function that accepted two
Nat s - one completely free and the second with a constraint that it must be less than the first. How would we do this?
We’d need a ‘less than or equal’ type class to begin with, which would act as evidence that the first type is ‘less than’ the second:
How do we generate the instances of this typeclass? From our definition of
Nat, the integer
< operation is translated directly into the
Succ relation —
_1 is less than
_3 is some number of
Succs wrapped around
So we can write the following:
Now we can write the following:
Something else you can do with integers is add them together, so we’d like to be able to do this with
This is a bit more complicated since we have a result to deal with — we generate a completely new
Nat which we need to somehow record. Our typeclass representing
Sum will then necessarily have three types, the two summands and the result.
It will use the ‘Aux’ pattern to record the result type. I’ll give a brief justification so skip this if you’re familiar with the ‘Aux’ pattern already.
We want the compiler to do the arithmetic for us. We don’t want the user (the programmer using this code) to have to do the addition themselves. But if our
Sum type class was parameterised by all three types this is exactly what
they would need to do; they can’t say
Sum[_1, _2, ???]— they would have to always know the result beforehand and always write
Sum[_1, _2, _3].
So that means the result type must be an inner type to
Sum (let’s call it
Out for ‘output’). However if we do this then the inner type is essentially lost to the compiler, it becomes just
sum.Out for some value
sum, unable to be usefully compared to any other type.
To get around this problem the Aux pattern is a little trick to only reveal to the user a typeclass with just the two parameters we’re interested in, while internally maintaining the three explicit type parameters necessary to do
useful compile-time type comparisons. It looks like this:
So we have a useful type class. How do we generate the instances? Well, in integer arithmetic there are certain axioms to do with addition:
Identity: A + 0 = A, for all A
Associativity(ish): (A + 1) + B = C => A + (B + 1) = C, for all A, B
These two are sufficient for integer addition, so all we need to do is put them into code:
Note that commutativity (A + B = B + A) is a corollary of the two above and does not need its own function.
Given the above, we can now write the following:
Nat is great, but pretty cumbersome when some of your max DB string lengths are 255 (or higher). Making the compiler handle a 255-nested type won’t be quick either. The limit on my computer seems to be adding
To get around this, we standardised around a few round(ish) numbers for lengths in our DB, and rolled our own version of
Nat which only handles these numbers and none in between. We called it
SparseNat (because it’s sparse).
Here it is:
Note there are a few changes from
- We introduce
_unbound as a way of typing an arbitrary-length string
SparseNat carries an (optional) integer within it, representing its length
- Each type now has a non-empty object to go with it, so we can implicitly access the integer value it represents
Succ has been renamed
Jump, since we’re lurching unpredictably from number to number rather than stepping up by 1 each time
- Only a few numbers are represented. Feel free to implement as many as you like
Lte is the same as with
Nat since it only depends on relative relationships rather than absolute relationships. So we can do this:
Lte works with
SparseNat just as with
Sum is more difficult with
SparseNat since we’ve lost our associativity axiom. Observe:
Jump[_50] + _100 = _100 + _100 = _255 (minimal SparseNat >= 200)
_50 + Jump[_100] = _50 + _255 = _unbound (minimal SparseNat >= 305)
so things are unfortunately more complicated.
In fact, because of the nature of
SparseNat being able to implement any subset of integers you want, there is nothing we can say at all about addition, except commutativity and the identity of 0.
This means that we need to implement each addition ourselves, manually. Thankfully though this can be a lazy procedure — if you never add
_50 you never have to write its
Sum typeclass therefore looks like this:
Note: in a future blog post a method for speeding up
Nat compile times will be proposed and can be used as a substitute for
SparseNat with no impact on the rest of your code.
Now we finally have enough to define our
FString class. It’s basically just a wrapper around
String, parameterised by a
SparseNat with a constructor that returns a scalaz
Validation, and some
utility methods such as
If you don’t use scalaz validation you can use any similar construct:
Here it is:
And that’s it. Now we can have models which have accurate length limitations on their string values, and take care of all validation at the extremes of our program: the API and the database layer. An
FString field in a JSON case class would fail the decoding if the string were too long, immediately kicking out a user with a 400. Similarly with the database type mappers.
And if we tried to insert an
FString[_50] from the API into an
FString[_25] field in our database, our program would not compile. We’d be forced to either shrink the API field or widen the database field in order to handle the potential exception.