Overcoming Parquet Schema Issues

Shyamal Akruvala · Nerd For Tech · Aug 20, 2020

A couple of approaches to overcoming Parquet schema-related issues when using Pandas and Spark dataframes.

In my current project we rely solely on Parquet files for all our data processing. Initially, we extracted data from MS SQL Server into .csv (.dat) files and the schema into .sch files. Using Spark and iconv we converted these into Parquet files, ensuring the schema was always applied before saving into our Hadoop data store.

Recently we moved to using Pandas (with pyodbc) for extractions, uploading directly to HDFS as Parquet using Pandas' to_parquet() method. These files need one more round of processing to apply the schemas, which are now stored safely in a central schema repository. This is where we encountered the issue shown below:

TypeError: field ActiveFlag: IntegerType can not accept object 1.0 in type <class 'float'>

Let me walk through my analysis.

Schema of the EMP table showing the column ActiveFlag as tinyint in the database (image by author)

In the screenshot you can see a dummy table (EMP) that I created, with the column ActiveFlag as tinyint. The expected values are 0, 1, or null.

Dummy data populating the EMP table (image by author)

I created some dummy data to populate the table; the highlighted record is the one we are interested in. As we can see, ActiveFlag has the integer values 0 and 1, and null for the last record.

I read the data into a Pandas dataframe, display the records and schema, and write it out to a Parquet file.

Pandas dataframe output showing ActiveFlag stored as float64 (image by author)

As highlighted above, the ActiveFlag column is stored as float64: the values 0, 1, and null are converted to 0.0, 1.0, and NaN. This is due to Pandas' inability to store nulls in a series of int64 dtype.
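For reference, a minimal sketch of this step (the connection string, table name, and file path are placeholders, and pyarrow is assumed to be installed as the Parquet engine):

```python
import pandas as pd
import pyodbc

# Connection details are placeholders for illustration
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=myserver;DATABASE=mydb;Trusted_Connection=yes"
)
df = pd.read_sql("SELECT * FROM dbo.EMP", conn)

print(df)
print(df.dtypes)   # ActiveFlag shows up as float64 because of the null record

# The float64 column is written out as a double in the Parquet file
df.to_parquet("emp.parquet", index=False)
```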

We now read the file into a Spark dataframe, print the records, and review the schema. We see ActiveFlag as double and all columns marked nullable = true.
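Continuing the sketch, reading the same file back in Spark:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("parquet-schema").getOrCreate()

sdf = spark.read.parquet("emp.parquet")
sdf.show()
sdf.printSchema()   # ActiveFlag: double (nullable = true)
```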

Spark dataframe output showing ActiveFlag as double (image by author)

Let’s create a schema and apply it to the Spark dataframe. The schema gets applied successfully, but when we perform a show or a write, Spark throws a TypeError indicating it cannot convert a float into IntegerType.
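The screenshots don’t show the exact call I used to apply the schema; one common pattern that reproduces this behaviour is rebuilding the dataframe from its RDD with an explicit StructType (the column names other than ActiveFlag are assumed for illustration):

```python
from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Assumed schema for illustration; the real one comes from the central schema repository
schema = StructType([
    StructField("EmpID", IntegerType(), True),
    StructField("EmpName", StringType(), True),
    StructField("ActiveFlag", IntegerType(), True),
])

typed_sdf = spark.createDataFrame(sdf.rdd, schema)  # the schema attaches without complaint
typed_sdf.show()                                    # fails lazily with:
# TypeError: field ActiveFlag: IntegerType can not accept object 1.0 in type <class 'float'>
```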

One approach is to create a PyArrow table from the Pandas dataframe while applying the required schema, and then convert that into a Spark dataframe.
As seen below, the PyArrow table shows the schema and data correctly.
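A sketch of that conversion, reusing the Pandas dataframe from above (the Arrow field names and types other than ActiveFlag are assumptions):

```python
import pyarrow as pa

# Arrow schema mirroring the database types (tinyint -> int8)
arrow_schema = pa.schema([
    ("EmpID", pa.int32()),
    ("EmpName", pa.string()),
    ("ActiveFlag", pa.int8()),
])

# NaN in the float64 column becomes a proper null once cast to an integer type
table = pa.Table.from_pandas(df, schema=arrow_schema, preserve_index=False)
print(table.schema)
```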

PyArrow table showing the correct schema and data (image by author)

If you search for ways to convert a PyArrow table into a Spark dataframe, you’ll most commonly see the table’s to_pandas() method being called, followed by Spark’s createDataFrame() on the resulting Pandas dataframe. But converting back to Pandas would lose the schema and mangle the data again. Reviewing the options available, we saw the to_pydict() method on the PyArrow table. This converts the table into an ordered dictionary with the schema and data retained correctly, and from that we can create a Spark dataframe and apply our schema.
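A sketch of this last hop, reusing the PyArrow table and the Spark schema from the earlier snippets (the values in the comment are illustrative):

```python
# to_pydict() keeps integers as ints and nulls as None
pydict = table.to_pydict()
# e.g. {'EmpID': [...], 'EmpName': [...], 'ActiveFlag': [1, 0, None]}

# Turn the column-oriented dict into row tuples, then build the Spark dataframe with our schema
rows = list(zip(*pydict.values()))
typed_sdf = spark.createDataFrame(rows, schema)
typed_sdf.show()
typed_sdf.printSchema()   # ActiveFlag: integer (nullable = true)
```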

Spark dataframe created from the PyArrow table’s dictionary, with the schema applied (image by author)

Another approach I figured out recently is to use the nullable Int64 dtype together with convert_dtypes(), available as of Pandas 1.0.0. We can call convert_dtypes() on the Pandas dataframe before saving the Parquet file, and it stores the data and datatype correctly. Then we simply read it in Spark and apply our schema. Work done :)
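A sketch of this simpler path, again with placeholder file names:

```python
# convert_dtypes() switches ActiveFlag from float64 to the nullable Int64 dtype,
# so nulls are kept as <NA> instead of NaN
df_converted = df.convert_dtypes()
print(df_converted.dtypes)
df_converted.to_parquet("emp_int.parquet", index=False)

# Spark now reads the column back as an integral type and our schema applies cleanly
sdf2 = spark.read.parquet("emp_int.parquet")
sdf2.printSchema()
typed_sdf2 = spark.createDataFrame(sdf2.rdd, schema)
typed_sdf2.show()
```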

Pandas dataframe saved with the nullable Int64 dtype and read correctly in Spark (image by author)

References:
1. https://stackoverflow.com/questions/48578787/convert-ordered-dictionary-to-pyspark-dataframe
2. https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.convert_dtypes.html
