Wide & Tall data formats

Krishna Kumar Tiwari
Published in W2HDS
Mar 5, 2019


Wide & Tall data formats have long been used by data scientists and statisticians, but the terms are less common on the engineering side. Today many developers and engineers are interested in data science, either in a full data scientist role or in an applied science role that involves implementing models with engineering practices to make them production ready. This article covers the Wide & Tall data formats with simple examples, for engineers who are interested in or working in the data science domain.

Wide & Tall data formats are quite easy to understand, so let's start from intuition. In simple terms:

When you think wide, think horizontal.

When you think long (tall), think vertical.

Tall → Wide

Let's take an example where we have a data frame containing a product list with its attributes and values.

Data Frame Product List

Looking at the above data frame (df) with an engineer's eye, you will notice that the product attributes are laid out vertically, which makes the df (table) long. This representation is known as the tall data format, where the data grows mostly in the vertical direction.
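A tall data frame of this kind can be sketched in pandas as follows. The product names, attribute names and values here are made up for illustration and are not the article's actual data.

```python
import pandas as pd

# Hypothetical product list: one row per (product, attribute) pair.
tall_df = pd.DataFrame({
    "product":   ["A", "A", "A", "B", "B", "B"],
    "attribute": ["color", "size", "weight", "color", "size", "weight"],
    "value":     ["red", "M", "1.2", "blue", "L", "2.5"],
})

print(tall_df)
print(tall_df.shape)  # (6, 3): the table grows by rows as attributes are added
```

Adding a new attribute to a product adds a row, not a column, which is why this layout gets taller rather than wider.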

The next thought that comes to mind is: why not flatten the data?

There are multiple ways to flatten data using Pandas; the most preferred choice is the pivot method. It is similar to an MS Excel pivot table: you define the index (rows), the columns and the values.
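A minimal sketch of the pivot call, assuming a tall data frame with hypothetical columns named product, attribute and value (illustrative data, not the article's):

```python
import pandas as pd

# Hypothetical tall data: one row per (product, attribute) pair.
tall_df = pd.DataFrame({
    "product":   ["A", "A", "A", "B", "B", "B"],
    "attribute": ["color", "size", "weight", "color", "size", "weight"],
    "value":     ["red", "M", "1.2", "blue", "L", "2.5"],
})

# index -> rows, columns -> new column headers, values -> cell contents
wide_df = tall_df.pivot(index="product", columns="attribute", values="value")

print(wide_df)
```

Each product now occupies a single row, with one column per attribute. Note that pivot raises an error if the same (index, columns) pair appears more than once; pivot_table with an aggregation function is the usual alternative in that case.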

The above representation of the data is in wide format, as the data is grouped and flattened horizontally.

Similar flattening can be done using the unstack method, which is quite similar to pivot. Below is the sample code to do that.

Unstack example
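One way the unstack approach might look, again on hypothetical data with assumed column names product, attribute and value:

```python
import pandas as pd

# Hypothetical tall data: one row per (product, attribute) pair.
tall_df = pd.DataFrame({
    "product":   ["A", "A", "A", "B", "B", "B"],
    "attribute": ["color", "size", "weight", "color", "size", "weight"],
    "value":     ["red", "M", "1.2", "blue", "L", "2.5"],
})

# Move product and attribute into a two-level index, then unstack the
# innermost level (attribute) so that it becomes the column headers.
wide_df = tall_df.set_index(["product", "attribute"])["value"].unstack()

print(wide_df)
```

For data like this, with exactly one value per (product, attribute) pair, the result has the same shape and contents as the pivot output.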

I strongly recommend trying the pivot, melt, stack and unstack methods, which are commonly used for reshaping data frames. (sample example)
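As a starting point for experimenting, here is a small sketch of the reverse direction (wide to tall) using melt and stack. The wide data frame below is hypothetical and only mirrors the kind of product table discussed in this article.

```python
import pandas as pd

# Hypothetical wide data: one row per product, one column per attribute.
wide_df = pd.DataFrame(
    {"color": ["red", "blue"], "size": ["M", "L"], "weight": ["1.2", "2.5"]},
    index=pd.Index(["A", "B"], name="product"),
)

# melt: keep product as an identifier and turn every other column
# into (attribute, value) rows.
tall_again = wide_df.reset_index().melt(
    id_vars="product", var_name="attribute", value_name="value"
)

# stack: move the column labels into an inner index level,
# giving a Series indexed by (product, attribute).
stacked = wide_df.stack()

print(tall_again)
print(stacked)
```

melt returns a flat data frame with ordinary columns, while stack returns a Series with a two-level index; which one is more convenient depends on what you plan to do next.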

In this article we used Python and Pandas for the wide and tall data transformations. The same is possible in R, where tidyr and dplyr are mostly used for such transformations.

When to use what?

In wide format, categorical data is always grouped. It is easier to read and interpret than the long format, while in long format every row represents an observation corresponding to a particular category/attribute.

Thanks for reading. Please share your thoughts, feedback and ideas in the comments. You can also reach out to me at @simplykk87 on Twitter and LinkedIn.


Krishna Kumar Tiwari
W2HDS

40 Under 40 Data Scientist | Mentor at Atal Mission, Niti Aayog | Founder ML-Ai Community (ml-ai.in) | IIT-G