Hey F#, load me this CSV file into a table!
So here’s the thing: I’ve got this CSV file, nothing fancy, but still it has got integers, decimals, strings, and dates in it, now I need to load its contents into a relational table, like in MariaDB for example, on Linux if you don’t mind :-)
Let’s begin with the spoilers, we are going to do this :-P
Yep, the bulk of the job will be done by seven lines of code, but let’s go tiny step by tiny step. To perform this feat we are going to use two different type providers:
- CSV Type Provider, which reads a sample .CSV file, creating classes that represent a row and its columns, and functions that load files with similar structure returning a collection of well-typed objects.
- SQLProvider, which let us work against a relational database as illustrated in this previous post.
To begin with we have to create a folder, then a project inside it, and get these three packages: SQLProvider, MySQLData, and FSharp.Data. You can find detailed instructions on how to do this in steps 1 to 3 of this post (if you haven’t done it yet, this is a good time to follow that step-by-step guide). You should end up with a structure similar to this one:
Let me explain the “nice trick”: sometimes Paket, our package manager, is too eager, downloading library flavors for (too) many .NET Framework versions, line 1 framework: auto-detect in the paket.dependencies file, forces Paket to restrict itself to the framework versions actually present in my machine, so this is an optional step that, at least in my case, saves downloading a bunch of extra files.
Now back to our job, and there is one last configuration task to do before getting into the coding fun: as we are going to create a stand-alone executable, we need to be able to compile the project -in previous posts I have only used .fsx script files-, to accomplish this we will use Fake, a project builder á la C make, only functional and written in F# of course. Fake relies on the .fsproj project file to guide its work so we need to add the appropriate libraries to it, in order to do so:
- Press Ctrl-Shift-P, enter F#, and choose the F#: Add Reference option
- Select LoadCsv.fsproj from the list (or whatever the name of your project is)
- Enter FSharp.Data
- Repeat the same steps but now enter SQLProvider
If you want, just as a sanity check, you can open the LoadCsv.fsproj file and somewhere close to the bottom you should see references to the libraries, something like the image below, if they’re there, we’re good, smile, close it, and forget about it :-)
Now, onto data folders and files, as we said when we first mentioned the CSV Type Provider, it needs a sample .CSV file, so we create a data folder inside the LoadCsv folder, and there we create the paymentsSample.csv file (by the way, you can find the files related to this blog post here).
And we are finally ready to code! When we created the LoadCsv project, a LoadCsv.fs file was also created, this is our main (and only :-) ) source code file, it’ll be just a few lines long, and the gist of it is:
As I mentioned before, you can find the source code here, but I really suggest you to type in the code yourself in order to get a feeling for the language and the environment. Some relevant comments:
- Lines 7 to 9 just initialize some constants for the SQL provider, of particular note is ResPath, the path to the folder containing the MySQL assembly.
- Lines 11 and 12 create the type providers, here’s where the magic happens. Payments will know all about the .csv file contents, DbProvider will know all about the sales database.
- The loadCsv function does all the hard work: line 16 reads the whole .csv file found at path, so that payments is bound to a collection of objects with four properties (Id, Name, Amount, Date) each, context is our gateway to the sales database.
- At line 18 we take the lines already read from the .csv file and let the iter() function process each one with a lambda (OK, OK, just a function with no name :-) ), where p is already an object with properties for every field in the line
- At line 20 we create a new database row with the funnily named function ``Create(Amount, Date, Name)`` (yep, that’s its name, automatically created by the SQL provider), in line 21 we also set the primary key, that unfortunately is not a parameter for the Create() function :-( so we have to do it by hand, and in a mutable way also -a sin in the functional world but that’s what happens when you mess with external repositories. Mind you: the new row has not been committed to the database yet.
- Finally in line 22 we submit all the new rows to the database in one big transaction.
Just not to be too lame, we surround everything with some very basic exception control, and that’s pretty much it, 15 lines of code for the whole thing is not that bad, right? Let’s run it!
Beware of micro-benchmarks, but still: eight something seconds to load 10,000 records to a table is not bad, especially in my so-so laptop :-)
I know this has already become a long post, but one last thing may be bugging some of you: what if the .csv file had one million lines? Do we really want a one-million-inserts transaction at line 22? May be a small enhancement is in order:
Just two small changes:
- At line 19 we use the iteri() functions which gives us an additional parameter: a counter of how many lines have been processed already
- Then at line 22 we submit the new rows every 500 lines, a far shorter database transaction
And that’s it, I hope this gave you an idea of how much you can achieve with just a few lines of F#, as usual all feedback is welcomed!