# FString: String lengths at compile time

## Preventing `data-too-wide’ SQL exceptions in Scala

Most strings that you store in a traditional SQL database have a maximum length allowed. This can frequently cause problems, since this maximum length isn’t known by your code.

You might have some checks dotted around, maybe on your API or DB layer, that only sensible length strings get into your system — but you probably don’t have them for every single field. Mutating strings, such as adding two fields together, also invalidates such checks — it will likely just cause a run-time error. One of the nice things about a type checker and functional programming in particular is that it enables and encourages you to remove entire classses of exceptions from your code. It would be great to have these maximum lengths encoded in the compiler, say in some class `FString[_]`

(`F`

stands for ‘finite’)

Then, if your API accepts an `FString[50]`

and your database accepts an `FString[50]`

the compiler will do all the checks for you all the way through your system — at compile time (and at the boundaries to your program). No more run-time data-too-wide exceptions.

As a bonus you can even make the compiler do arithmetic for you, so you know you cannot add two `FString[30]`

s together and store them in an `FString[50]`

field. You will be made aware of this potential problem and be forced to handle it.

How would we do this?

### Nat

Getting string lengths into the compiler requires some concept of numbers in the type system. There’s a pretty standard approach to this in libraries such as Shapeless, a type called `Nat`

.

`Nat`

is essentially just an `HList`

where every constituent type is `Unit`

. Or if you don’t know what an `HList`

is, think of Nat as list of ‘+1’ instructions, where the length denotes the value of the number.

Eg,

etc. `Succ`

stands for ‘successor’. In the integers, this is just ‘+1’.

You can see that ‘3’ is just 3 levels of `Succ`

around the base case `_0`

. We created single values as well as the pure types above.

These are all generated by a macro in shapeless.

You can do some pretty cool stuff in the compiler with `Nat`

, moving logic to compile-time:

#### Less than

Say we wanted a function that accepted two `Nat`

s - one completely free and the second with a constraint that it must be less than the first. How would we do this?

We’d need a ‘less than or equal’ type class to begin with, which would act as evidence that the first type is ‘less than’ the second:

How do we generate the instances of this typeclass? From our definition of `Nat`

, the integer `<`

operation is translated directly into the `Succ`

relation — `_1`

is less than `_3`

because `_3`

is some number of `Succ`

s wrapped around `_1`

.

So we can write the following:

Now we can write the following:

#### Addition

Something else you can do with integers is add them together, so we’d like to be able to do this with `Nat`

too.

This is a bit more complicated since we have a result to deal with — we generate a completely new `Nat`

which we need to somehow record. Our typeclass representing `Sum`

will then necessarily have three types, the two summands and the result.

It will use the ‘Aux’ pattern to record the result type. I’ll give a brief justification so skip this if you’re familiar with the ‘Aux’ pattern already.

We want the compiler to do the arithmetic for us. We don’t want the user (the programmer using this code) to have to do the addition themselves. But if our `Sum`

type class was parameterised by all three types this is exactly what

they would need to do; they can’t say `Sum[_1, _2, ???]`

— they would have to always know the result beforehand and always write `Sum[_1, _2, _3]`

.

So that means the result type must be an inner type to `Sum`

(let’s call it `Out`

for ‘output’). However if we do this then the inner type is essentially lost to the compiler, it becomes just `sum.Out`

for some value `sum`

, unable to be usefully compared to any other type.

To get around this problem the Aux pattern is a little trick to only reveal to the user a typeclass with just the two parameters we’re interested in, while internally maintaining the three explicit type parameters necessary to do

useful compile-time type comparisons. It looks like this:

So we have a useful type class. How do we generate the instances? Well, in integer arithmetic there are certain axioms to do with addition:

Identity: A + 0 = A, for all A

Associativity(ish): (A + 1) + B = C => A + (B + 1) = C, for all A, B

These two are sufficient for integer addition, so all we need to do is put them into code:

Note that commutativity (A + B = B + A) is a corollary of the two above and does not need its own function.

Given the above, we can now write the following:

### SparseNat

`Nat`

is great, but pretty cumbersome when some of your max DB string lengths are 255 (or higher). Making the compiler handle a 255-nested type won’t be quick either. The limit on my computer seems to be adding `_22`

to `_22`

!

To get around this, we standardised around a few round(ish) numbers for lengths in our DB, and rolled our own version of `Nat`

which only handles these numbers and none in between. We called it `SparseNat`

(because it’s sparse).

Here it is:

Note there are a few changes from `Nat`

:

- We introduce `_unbound`

as a way of typing an arbitrary-length string

- Each `SparseNat`

carries an (optional) integer within it, representing its length

- Each type now has a non-empty object to go with it, so we can implicitly access the integer value it represents

- `Succ`

has been renamed `Jump`

, since we’re lurching unpredictably from number to number rather than stepping up by 1 each time

- Only a few numbers are represented. Feel free to implement as many as you like

#### Less than

Thankfully, `Lte`

is the same as with `Nat`

since it only depends on relative relationships rather than absolute relationships. So we can do this:

and `Lte`

works with `SparseNat`

just as with `Nat`

.

#### Sum

`Sum`

is more difficult with `SparseNat`

since we’ve lost our associativity axiom. Observe:

Jump[_50] + _100 = _100 + _100 = _255 (minimal SparseNat >= 200)

but

_50 + Jump[_100] = _50 + _255 = _unbound (minimal SparseNat >= 305)

so things are unfortunately more complicated.

In fact, because of the nature of `SparseNat`

being able to implement any subset of integers you want, there is nothing we can say at all about addition, except commutativity and the identity of 0.

This means that we need to implement each addition ourselves, manually. Thankfully though this can be a lazy procedure — if you never add `_25`

and `_50`

you never have to write its `Sum`

instance.

`SparseNat`

‘s `Sum`

typeclass therefore looks like this:

Note: in a future blog post a method for speeding up `Nat`

compile times will be proposed and can be used as a substitute for `SparseNat`

with no impact on the rest of your code.

### FString

Now we finally have enough to define our `FString`

class. It’s basically just a wrapper around `String`

, parameterised by a `SparseNat`

with a constructor that returns a scalaz `Validation`

, and some

utility methods such as `++`

.

If you don’t use scalaz validation you can use any similar construct: `Either`

, `Try`

, `\/`

, `cats.Validated`

, etc.

Here it is:

And that’s it. Now we can have models which have accurate length limitations on their string values, and take care of all validation at the extremes of our program: the API and the database layer. An `FString`

field in a JSON case class would fail the decoding if the string were too long, immediately kicking out a user with a 400. Similarly with the database type mappers.

And if we tried to insert an `FString[_50]`

from the API into an `FString[_25]`

field in our database, our program would not compile. We’d be forced to either shrink the API field or widen the database field in order to handle the potential exception.