How to: Convert CSV Objects to Parquet from IBM Cloud Object Storage

This is a story that was written a while back and was recently published on the IBM Cloud Blog.

IBM Cloud SQL Query is a serverless solution that allows you to use standard SQL to quickly analyze your data stored in IBM Cloud Object Storage (COS) without ETL or defining schemas. SQL Query lets you query multiple types of data in your COS buckets, including CSV, JSON, and Parquet, and each format has its benefits. However, if you’re looking to lower storage costs by compressing your data and to get better performance when querying it, you might have wondered how to convert your data into Parquet. We have you covered.

If you’re starting out with a bucket of CSV objects and you want to convert those to Parquet, I’ve developed a simple command-line tool written in Python called csvtoparquet that’ll help you do just that. All you need to do is list the names of the CSV objects in your COS bucket that you want to convert and provide new names for those objects (you can add prefixes to the names as well); the tool then uses Apache Arrow to convert the CSV data to Parquet and stores it back in the COS bucket.

In this article, we’ll show you how to set up the tool and get started using it.

How it Converts CSV Objects

csvtoparquet uses Pandas and Apache Arrow to convert your CSV objects to Parquet. First, the tool retrieves the objects without storing them on your system using the Python IBM COS SDK.

def _get_object(self, obj):
    """ get the body of an object """
    cos = self.connection
    cos_object = cos.get_object(Bucket=self.cos_bucket, Key=obj)
    body = cos_object['Body'].read().decode('utf-8')
    return body
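The self.connection attribute is an IBM COS client. As a rough sketch of how such a client can be created with the ibm-cos-sdk (this isn't necessarily how csvtoparquet builds it internally, and the API key, service instance CRN, and endpoint below are placeholders):

import ibm_boto3
from ibm_botocore.client import Config

# Placeholder credentials: substitute your own API key, service instance
# CRN (if your account requires it), and the endpoint for your bucket's region.
cos = ibm_boto3.client(
    "s3",
    ibm_api_key_id="API_KEY",
    ibm_service_instance_id="SERVICE_INSTANCE_CRN",
    config=Config(signature_version="oauth"),
    endpoint_url="https://s3.us-south.objectstorage.softlayer.net",
)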

Those objects are then read with Pandas read_csv and converted to a dataframe.

def _read_csv_to_df(self, obj):
    """ read the object contents and put them into a panda's dataframe """
    get_object = self._get_object(obj)
    buff = StringIO(get_object)
    try:
        df = pd.read_csv(buff)
        # replaces spaces in column names like `sample column` with `sample_column`
        cols = df.columns.str.strip().str.replace(' ', '_')
        df.columns = cols
        return df
    except Exception:
        logging.error("Can't create panda's dataframe from object")

CSV objects with column names that contain spaces need those spaces converted to _; otherwise, Parquet throws an error.
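As a quick illustration (with made-up values), a dataframe whose columns look like the Citi Bike data used later in this article would be renamed like this:

import pandas as pd

# Column names with spaces, as they appear in the raw CSV data.
df = pd.DataFrame({"start station id": [3183], "end station id": [3277]})

# The same renaming step the tool applies before writing Parquet.
df.columns = df.columns.str.strip().str.replace(" ", "_")
print(list(df.columns))  # ['start_station_id', 'end_station_id']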

Next, Pandas has a to_parquet method that will convert the dataframe to Parquet. The engine that we've selected to make the conversion is pyarrow, which is the Python library for Apache Arrow. That's installed as a dependency of the tool.

def _convert_to_parquet(self, obj, new_obj_name):
    """ convert the pandas dataframe to Parquet """
    df = self._read_csv_to_df(obj)
    parquet_obj = BytesIO()
    df.to_parquet(parquet_obj, compression="gzip", engine="pyarrow")
    parquet_obj.seek(0)
    return self._save_object(new_obj_name, parquet_obj)

Once the object is converted to Parquet, we’ll save it back to the COS bucket and add the .parquet extension to the new object.

def _save_object(self, new_obj_name, parquet_data):
    """ save the body of the object and rename with the .parquet file extension """
    cos = self.connection
    new_obj_name_parquet = "{}.parquet".format(new_obj_name)
    return cos.put_object(Body=parquet_data.getvalue(), Bucket=self.cos_bucket, Key=new_obj_name_parquet)
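Putting those pieces together, here's a minimal standalone sketch of the same pipeline outside the tool. It assumes a COS client like the cos object shown earlier; the function name and arguments are just for illustration:

from io import BytesIO, StringIO

import pandas as pd

def convert_csv_object_to_parquet(cos, bucket, key, new_name):
    # Download the CSV object into memory; nothing is written to disk.
    body = cos.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # Read it into a dataframe and replace spaces in column names.
    df = pd.read_csv(StringIO(body))
    df.columns = df.columns.str.strip().str.replace(" ", "_")

    # Write the dataframe to an in-memory Parquet file using pyarrow.
    buf = BytesIO()
    df.to_parquet(buf, compression="gzip", engine="pyarrow")
    buf.seek(0)

    # Upload the result back to the bucket with the .parquet extension.
    return cos.put_object(Body=buf.getvalue(), Bucket=bucket,
                          Key="{}.parquet".format(new_name))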

With an understanding of how the converter works, let’s move on to how to use the tool to start converting objects to Parquet.

Installing and Using csvtoparquet

The only requirement for the tool is Python 3 installed on your system. The tool is available to install using pip:

pip install csvtoparquet

Now that it’s installed, typing csvtoparquet -h will provide you with the options that are available. However, let's examine the required parameters first.

usage: csvtoparquet [-h] -a APIKEY -e COS_ENDPOINT -b COS_BUCKET [-l] [-c]
[-f FILES [FILES ...]] [-n NAMES [NAMES ...]]

Under “usage”, everything in brackets [ ] is optional, while the flags -a, -e, and -b are required. -a is your IBM Cloud API Key that you can get in your IBM Cloud account by clicking Manage > Security > Platform API Keys. Next, for -e you'll need the endpoint for the location of the COS bucket where your CSV objects reside. That's available in the COS management panel by clicking on Endpoint. And for -b, you need the name of the bucket where your objects are located.

Once you have those, you’re able to start using the optional commands that come with the tool.

-l, --list-all        List All Objects in IBM COS Bucket
-c, --csv             List All CSV Objects in IBM COS Bucket
-f FILES [FILES ...], --files FILES [FILES ...]
                      Convert Selected Objects in IBM COS Bucket - must use
                      with -n
-n NAMES [NAMES ...], --names NAMES [NAMES ...]
                      Name(s) of the New Parquet Objects - must use with -f

Options -l and -c only list the objects that are currently stored in a COS bucket: -l lists all the objects in the specified bucket, while -c lists only the CSV objects. Internally, all this does is check whether each object name ends with .csv.
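That filter can be sketched roughly like this with list_objects_v2 on the COS client from earlier (the tool's internal implementation may differ, and pagination is ignored here for brevity):

# List the bucket's contents and keep only the keys that end in .csv.
response = cos.list_objects_v2(Bucket="mydata")
csv_objects = [obj for obj in response.get("Contents", [])
               if obj["Key"].endswith(".csv")]

for obj in csv_objects:
    print(obj["Key"], obj["Size"])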

Let’s look at this using some data in a COS bucket. For this example, I’ve uploaded to my COS bucket three months of CSV data (January to March 2018) from a New York City bike share company called Citi Bike. Once the data has been uploaded, run csvtoparquet with the -c flag to show the CSV objects. If you already have CSV objects in your bucket, all of them will appear in your results.

csvtoparquet -a API_KEY -e s3.us-south.objectstorage.softlayer.net -b mydata -c
[
  {
    "Key": "201801_citibikejc_tripdata.csv",
    "LastModified": "2018-06-23T02:31:31.031000+00:00",
    "ETag": "\"536c4c2cf6f9b5ed9ae0bb090a157c4d\"",
    "Size": 2523746,
    "StorageClass": "STANDARD"
  },
  {
    "Key": "201802_citibikejc_tripdata.csv",
    "LastModified": "2018-06-23T02:31:34.224000+00:00",
    "ETag": "\"d3c8cbe0082d11f5c6d5c005ec04a2d1\"",
    "Size": 3008614,
    "StorageClass": "STANDARD"
  },
  {
    "Key": "201803_citibikejc_tripdata.csv",
    "LastModified": "2018-06-23T02:32:18.667000+00:00",
    "ETag": "\"684925a792d9a2afaa7bc54f3c911762\"",
    "Size": 3402749,
    "StorageClass": "STANDARD"
  }
]

As you can see, the command provides you with the object name (Key), the date it was uploaded or modified, an ETag, the size in bytes, and the COS storage class. When converting these objects to Parquet, you’ll see a significant reduction in object size because the Parquet objects are compressed and thus optimized for SQL Query use.

In order to convert these CSV objects, you’ll need to use the -f and -n flags together. -f takes a list of the objects stored in COS that you want to convert, and -n takes a list of new names for those objects as Parquet (you can add prefixes). Neither list should have commas separating its entries. So, if we wanted to convert the three CSV objects in the example above, we’d need to select them by name and then rename them. In the command below, I’ve added a prefix just to separate the new Parquet objects from those I’ve uploaded.

csvtoparquet -a API_KEY -e s3.us-south.objectstorage.softlayer.net -b mydata -f 201801_citibikejc_tripdata.csv 201802_citibikejc_tripdata.csv 201803_citibikejc_tripdata.csv -n trip/01 trip/02 trip/03

Running this command, you’ll see each CSV object being converted along with the name of the new Parquet object it will be saved as:

Now Converting: 201801_citibikejc_tripdata.csv --> trip/01.parquet
Now Converting: 201802_citibikejc_tripdata.csv --> trip/02.parquet
Now Converting: 201803_citibikejc_tripdata.csv --> trip/03.parquet

Querying Data

Looking in your COS bucket or using the csvtoparquet tool with the -l flag, we can view the new Parquet objects in the bucket along with our original CSV objects. Below are the new Parquet objects we created with csvtoparquet:

[
  {
    "Key": "trip/01.parquet",
    "LastModified": "2018-08-14T00:10:58.173000+00:00",
    "ETag": "\"4a42038cbd6f19365d345dee2d10360d\"",
    "Size": 314153,
    "StorageClass": "STANDARD"
  },
  {
    "Key": "trip/02.parquet",
    "LastModified": "2018-08-14T00:11:11.833000+00:00",
    "ETag": "\"c135c35caee3359320e60b9ec9e1d4d5\"",
    "Size": 372361,
    "StorageClass": "STANDARD"
  },
  {
    "Key": "trip/03.parquet",
    "LastModified": "2018-08-14T00:11:29.132000+00:00",
    "ETag": "\"dd7467dcd209d10f1df35d7c2d23efd2\"",
    "Size": 426614,
    "StorageClass": "STANDARD"
  },
  ...
]

Notice the significant size difference of the Parquet objects. By converting them to Parquet, we’ve reduced their size from 3–4 MB to 300–400 KB. As a result, any SQL queries we run against them in SQL Query will be faster. The following query on the original CSV objects takes about 30 seconds to run:

SELECT
    tripduration,
    `start station id`,
    `end station id`,
    bikeid
FROM cos://us-geo/geodata/2018*
WHERE bikeid = 31929 AND `start station id` != `end station id`
ORDER BY tripduration DESC

We use * as a wildcard in the query so that all three CSV objects are read together as a single table, without having to combine them manually. To run the query over the converted Parquet objects, we only need to change start station id to start_station_id and end station id to end_station_id, since csvtoparquet replaced the spaces in column names with _. Additionally, we can use the wildcard * after the trip/ prefix to read all of our trip Parquet objects, along with STORED AS PARQUET to indicate that the objects use the Parquet file format:

SELECT
    tripduration,
    `start_station_id`,
    `end_station_id`,
    bikeid
FROM cos://us-geo/geodata/trip/* STORED AS PARQUET
WHERE bikeid = 31929 AND `start_station_id` != `end_station_id`
ORDER BY tripduration DESC

Running this query, you will see a significant time reduction: from a 30-second query down to just 8 seconds. That reduction is a good indicator of the advantage that querying Parquet objects has over CSV. Nonetheless, if your objects are much smaller than 2 MB, the difference in query times between CSV and Parquet might be negligible for your use case.

Summing up

The new tool that I’ve developed to get started converting your CSV objects to Parquet is easy to install and try out. One caveat is that it’s meant for smaller objects (under 2 GB) because it uses your system’s resources to make the conversion, not the cloud. At the moment, we’re experimenting with turning the tool into a serverless function with IBM Cloud Functions, and we’ll be writing a blog post about how to do that soon. In the meantime, use the tool and give us some feedback about how you use it and the improvements you’d like to see made.