[Ballerina] Converting CSV to JSON

Maryam Ziyad
Ballerina Swan Lake Tech Blog
3 min readSep 25, 2019

--

Photo by Wellington Cunha from Pexels

Today we came across a use case where it was required to convert data in a CSV file to JSON format.

Say I have the following CSV file (data modified a bit from a BBE):

I want to convert this to an array of JSON objects, where an element of the array would be a JSON object representing a single entry in the CSV file.

For example, the first entry (for Tom Cruise) should result in the following object.

{  
"id":"1",
"firstName":"Tom",
"lastName":"Cruise",
"age":"57"
}

So at the end of the conversion for this sample.csv file, we should have the following entries in a JSON object array.

{"id":"1", "firstName":"Tom", "lastName":"Cruise", "age":"57"}
{"id":"2", "firstName":"Maria", "lastName":"Sharapova", "age":"32"}
{"id":"3", "firstName":"James", "lastName":"Bond", "age":"60"}

Let’s break down this problem into two parts.

  1. Read the data from the CSV file and create an array of entries (where a single row would be represented by an array of strings). For example the entry for “Tom Cruise” would be [“1”, “Tom”, “Cruise”, “57”]
  2. For each element in the array, create a JSON object (i.e., a map<json> which is Ballerina’s representation of a JSON object)

Ballerina source code for the conversion is available here.

We’ve introduced two functions to address each sub-problem, and a main function that combines the two steps and prints out the individual elements at the end.

  • The main() function accepts a single string parameter which should be the path to the CSV file.
  • The retrieveRecordArray() function is called with the path as the argument. This function attempts reading the data in the CSV file and returns an array of string[]. Each string[] represents an entry (a row) in the CSV file, and each value in the string[] is a value specified for a particular column.

The sample.csv file would result in the following string[][].

[
["id", "firstName", "lastName", "age"],
["1", "Tom", "Cruise", "57"],
["2", "Maria", "Sharapova", "32"],
["3", "James", "Bond", "60"]
]
  • Once the string[][] is retrieved, the main() function invokes the convertToJsonObjectArray() function which accepts the array and converts it to an array of JSON objects. This conversion function identifies the column names dynamically from the string[][] retrieved — i.e., the first string[] is considered the names of the columns. This allows the conversion of any arbitrary CSV file — even when you may not know the exact structure — to JSON.

For the array retrieved in the previous step, the following array of JSON objects would be the result in this step.

[  
{
"id":"1",
"firstName":"Tom",
"lastName":"Cruise",
"age":"57"
},
{
"id":"2",
"firstName":"Maria",
"lastName":"Sharapova",
"age":"32"
},
{
"id":"3",
"firstName":"James",
"lastName":"Bond",
"age":"60"
}
]
  • Running this program should produce the following output.
$ ballerina run csv_to_json.bal ./files/sample.csv
{"id":"1", "firstName":"Tom", "lastName":"Cruise", "age":"57"}
{"id":"2", "firstName":"Maria", "lastName":"Sharapova", "age":"32"}
{"id":"3", "firstName":"James", "lastName":"Bond", "age":"60"}

And, we are done! :)

But, what if we want to do some additional validation, conversion, etc.?

For example, what if

  • I know the column names to be “id”, “firstName”, “lastName” and “age” beforehand, and I don’t want to look at the first string[] to figure out the column names every time I’m creating an object?
  • What if I want the “id” column and the “age” column to be integers instead of strings?

We can achieve both the above with slight changes to the object creation logic in convertToJsonObjectArray()you can find the changes here.

Just make sure to add an import for the int langlib module (import ballerina/lang.'int;).

Running csv_to_json.bal with the updated convertToJsonObjectArray() function would result in the following output. Note how id and age are numbers now.

$ ballerina run csv_to_json.bal ./files/sample.csv
{"id":1, "firstName":"Tom", "lastName":"Cruise", "age":57}
{"id":2, "firstName":"Maria", "lastName":"Sharapova", "age":32}
{"id":3, "firstName":"James", "lastName":"Bond", "age":60}

Note: to keep the example simple, the errors have not been handled in the best possible manner — error handling could and should be improved to better handle all the error scenarios.

This is what we came up with after some Google-ing and looking into the Balleriona I/O examples. Do share if you can think of other approaches/alternatives!

References

--

--