[Ballerina] Converting CSV to JSON
Today we came across a use case where it was required to convert data in a CSV file to JSON format.
Say I have the following CSV file (data modified a bit from a BBE):
I want to convert this to an array of JSON objects, where an element of the array would be a JSON object representing a single entry in the CSV file.
For example, the first entry (for Tom Cruise) should result in the following object.
{
"id":"1",
"firstName":"Tom",
"lastName":"Cruise",
"age":"57"
}
So at the end of the conversion for this sample.csv
file, we should have the following entries in a JSON object array.
{"id":"1", "firstName":"Tom", "lastName":"Cruise", "age":"57"}
{"id":"2", "firstName":"Maria", "lastName":"Sharapova", "age":"32"}
{"id":"3", "firstName":"James", "lastName":"Bond", "age":"60"}
Let’s break down this problem into two parts.
- Read the data from the CSV file and create an array of entries (where a single row would be represented by an array of strings). For example the entry for “Tom Cruise” would be
[“1”, “Tom”, “Cruise”, “57”]
- For each element in the array, create a JSON object (i.e., a
map<json>
which is Ballerina’s representation of a JSON object)
Ballerina source code for the conversion is available here.
We’ve introduced two functions to address each sub-problem, and a main
function that combines the two steps and prints out the individual elements at the end.
- The
main()
function accepts a singlestring
parameter which should be the path to the CSV file. - The
retrieveRecordArray()
function is called with the path as the argument. This function attempts reading the data in the CSV file and returns an array ofstring[]
. Eachstring[]
represents an entry (a row) in the CSV file, and each value in thestring[]
is a value specified for a particular column.
The sample.csv
file would result in the following string[][]
.
[
["id", "firstName", "lastName", "age"],
["1", "Tom", "Cruise", "57"],
["2", "Maria", "Sharapova", "32"],
["3", "James", "Bond", "60"]
]
- Once the
string[][]
is retrieved, themain()
function invokes theconvertToJsonObjectArray()
function which accepts the array and converts it to an array of JSON objects. This conversion function identifies the column names dynamically from thestring[][]
retrieved — i.e., the firststring[]
is considered the names of the columns. This allows the conversion of any arbitrary CSV file — even when you may not know the exact structure — to JSON.
For the array retrieved in the previous step, the following array of JSON objects would be the result in this step.
[
{
"id":"1",
"firstName":"Tom",
"lastName":"Cruise",
"age":"57"
},
{
"id":"2",
"firstName":"Maria",
"lastName":"Sharapova",
"age":"32"
},
{
"id":"3",
"firstName":"James",
"lastName":"Bond",
"age":"60"
}
]
- Running this program should produce the following output.
$ ballerina run csv_to_json.bal ./files/sample.csv
{"id":"1", "firstName":"Tom", "lastName":"Cruise", "age":"57"}
{"id":"2", "firstName":"Maria", "lastName":"Sharapova", "age":"32"}
{"id":"3", "firstName":"James", "lastName":"Bond", "age":"60"}
And, we are done! :)
But, what if we want to do some additional validation, conversion, etc.?
For example, what if
- I know the column names to be “id”, “firstName”, “lastName” and “age” beforehand, and I don’t want to look at the first
string[]
to figure out the column names every time I’m creating an object? - What if I want the “id” column and the “age” column to be integers instead of strings?
We can achieve both the above with slight changes to the object creation logic in convertToJsonObjectArray()
— you can find the changes here.
Just make sure to add an import for the int
langlib module (import ballerina/lang.'int;
).
Running csv_to_json.bal with the updated convertToJsonObjectArray()
function would result in the following output. Note how id
and age
are numbers now.
$ ballerina run csv_to_json.bal ./files/sample.csv
{"id":1, "firstName":"Tom", "lastName":"Cruise", "age":57}
{"id":2, "firstName":"Maria", "lastName":"Sharapova", "age":32}
{"id":3, "firstName":"James", "lastName":"Bond", "age":60}
Note: to keep the example simple, the errors have not been handled in the best possible manner — error handling could and should be improved to better handle all the error scenarios.
This is what we came up with after some Google-ing and looking into the Balleriona I/O examples. Do share if you can think of other approaches/alternatives!
References