Analyzing data for automatic data visualization

tienvn3012
Linagora Engineering
5 min read · Oct 20, 2017

In the world of technology, data is the most important resource, so processing data and presenting it in statistical form has become more important than ever. Many tools for this already exist (Excel, Google Sheets, …), but most are closed source, and some of them are not good enough at analyzing a user’s data and producing statistics from it, as you can see in the image below:

MS Excel 2013

With the OpenData module, Linagora aims to develop an efficient open source product capable of processing data and producing statistics from it.

So what is the OpenData module?

OpenData is an OpenPaaS module that processes data and presents it in statistical form through charts.

It is currently at the research and development stage and is already capable of processing data and building charts automatically for users. In the near future it will evolve into an open data portal integrated into OpenPaaS.

Ideas and activities of OpenData

As you know, data comes in many forms and is stored everywhere from large database management systems to small files.

So is there a way for the computer to know what the data is about and show it in a way that every user can understand? That is the problem OpenData needs to solve.

The data can be about time, numbers or anything else, but ultimately every value falls into one of three main categories:

● Time

● Number

● Text

The problem therefore comes down to identifying the data and presenting it.

Features of OpenData

As described above, OpenData has two main features: identifying data and presenting it.

Identification:

Let’s look at a simple data set:

Country,2005,2010,2015

China,1302285,1336681,1367486

India,1090974,1173109,1251696

United States,295517,309348,321369

Indonesia,229245,243423,255994

Brazil,186021,195835,204260

Pakistan,169279,184405,199086

Identifying the structure of data like this and putting it into table form is one part of OpenData’s job:

But real data is rarely that simple. Let’s look at another example:

Country,Year,earnings

China,2005,13102285$

India,2005,1090974$

US,2005,295517$

Indonesia,2005,229245$

Brazil,2005,169279$

Pakistan,2005,169279$

China,2010,1336681$

India,2010,1173109$

US,2010,309348$

Indonesia,2010,243423$

Brazil,2010,195835$

Pakistan,2010,184405$

China,2015,1367486$

India,2015,1251696$

US,2015,321369$

Indonesia,2015,255994$

Brazil,2015,204260$

Pakistan,2015,199086$

The data has three columns, each holding a different data type: the first column is “text”, the second is “datetime” and the third is “number”. For data like this, OpenData has to examine each record in a column, determine the column’s data type, and return the result in table form.
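To make the idea concrete, here is a minimal Python sketch of column type identification. It is not OpenData’s actual implementation: the function names (`classify_value`, `classify_column`, `classify_table`) are hypothetical, and the rules (a bare four-digit value is treated as a year, a trailing “$” is treated as a unit on a number, the most frequent type in a column wins) are assumptions chosen to fit the example above.

```python
import csv
import io
from collections import Counter


def classify_value(value: str) -> str:
    """Classify one cell as 'datetime', 'number' or 'text'."""
    value = value.strip()
    # Assumption: a bare 4-digit value between 1000 and 2999 is a year.
    if len(value) == 4 and value.isdigit() and 1000 <= int(value) <= 2999:
        return "datetime"
    # Assumption: a trailing "$" is a currency unit attached to a number.
    numeric_part = value.rstrip("$").replace(",", "")
    try:
        float(numeric_part)
        return "number"
    except ValueError:
        return "text"


def classify_column(values) -> str:
    """Return the most frequent cell type found in a column."""
    counts = Counter(classify_value(v) for v in values)
    return counts.most_common(1)[0][0]


def classify_table(csv_text: str) -> dict:
    """Map each column name of a CSV document to its detected type."""
    # Skip blank lines so that every remaining row has the same width.
    rows = [row for row in csv.reader(io.StringIO(csv_text)) if row]
    header, records = rows[0], rows[1:]
    columns = zip(*records)  # transpose rows into columns
    return {name: classify_column(col) for name, col in zip(header, columns)}


SAMPLE = """Country,Year,earnings
China,2005,13102285$
India,2005,1090974$
US,2010,309348$
"""

print(classify_table(SAMPLE))
# {'Country': 'text', 'Year': 'datetime', 'earnings': 'number'}
```

Voting per column rather than trusting a single cell means one odd value (for example a number that happens to look like a year) does not change the type of the whole column.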

For “datetime” data, each country and region has its own way of writing dates and times, so we need to identify the format used before the data can be processed.
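One simple way to approach format identification is to try a list of known formats against the whole column and keep the first one that parses every value. The sketch below assumes this strategy; the `CANDIDATE_FORMATS` list and the `detect_format` function are illustrative names, not part of OpenData.

```python
from datetime import datetime

# Assumption: a short, hand-picked list of formats; a real tool would need many more.
CANDIDATE_FORMATS = ["%Y-%m-%d", "%d/%m/%Y", "%m/%d/%Y", "%Y"]


def detect_format(values):
    """Return the first candidate format that parses every value, or None."""
    for fmt in CANDIDATE_FORMATS:
        try:
            for value in values:
                datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue  # at least one value does not fit this format
        return fmt
    return None


print(detect_format(["2005", "2010", "2015"]))      # %Y
print(detect_format(["20/10/2017", "05/01/2018"]))  # %d/%m/%Y
print(detect_format(["10/20/2017", "01/05/2018"]))  # %m/%d/%Y
```

Checking the whole column matters: a value such as “05/01/2018” is ambiguous on its own, but a single “20/10/2017” in the same column rules out the month-first reading.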

Beyond that, the values in the third column include their unit (here the “$” sign), so for data like this OpenData also has to identify the unit.
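As an illustration, a regular expression can separate the numeric part of a value from the unit attached to it. This is only a sketch under simple assumptions (one unit per value, written directly before or after the number, no thousands separators); `split_unit` and `column_unit` are hypothetical helper names.

```python
import re

VALUE_WITH_UNIT = re.compile(
    r"""^\s*
        (?P<prefix>[^\d\s.+-]*)         # optional unit before the number, such as $
        \s*
        (?P<number>[+-]?\d+(?:\.\d+)?)  # the numeric part
        \s*
        (?P<suffix>[^\d\s]*)            # optional unit after the number, such as $ or kg
        \s*$""",
    re.VERBOSE,
)


def split_unit(value):
    """Return (number, unit) for a value like '1090974$', or None if it is not numeric."""
    match = VALUE_WITH_UNIT.match(value)
    if match is None:
        return None
    unit = match.group("prefix") or match.group("suffix") or None
    return float(match.group("number")), unit


def column_unit(values):
    """Return the unit shared by every numeric value in a column, or None if they differ."""
    parsed = [split_unit(v) for v in values]
    units = {p[1] for p in parsed if p is not None}
    return units.pop() if len(units) == 1 else None


print(split_unit("13102285$"))                # (13102285.0, '$')
print(column_unit(["13102285$", "1090974$"])) # $
```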

In addition, this data set has a notable property: the values in its first two columns are repeated. Identifying repeated data is very important. Many real data sets have thousands of records, yet a given column may contain fewer than 10 distinct values repeated across all of those records; once this is recognized, the representation becomes much simpler.
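Detecting this kind of repetition can be as simple as counting distinct values. The sketch below is an assumption about how it could be done, not a description of OpenData’s code: `repetition_profile` is a hypothetical name, and the default threshold of 10 distinct values is borrowed from the figure mentioned above.

```python
from collections import Counter


def repetition_profile(values, max_distinct=10):
    """Summarize how often the values of a column repeat."""
    counts = Counter(values)
    return {
        "total": len(values),
        "distinct": len(counts),
        # Treat the column as repeated when it has few distinct values
        # and at least one of them occurs more than once.
        "is_repeated": len(counts) <= max_distinct and len(counts) < len(values),
        "counts": dict(counts),
    }


countries = ["China", "India", "US", "Indonesia", "Brazil", "Pakistan"] * 3
profile = repetition_profile(countries)
print(profile["total"], profile["distinct"], profile["is_repeated"])  # 18 6 True
```

The per-value counts are also exactly what is needed to draw a chart of how many times each record occurs in a column.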

Three columns of data after being identified

The examples above do not cover every kind of real data, but they show the basics of what has to be identified. That is how OpenData works. There are still many data types and attributes that OpenData must learn to identify; support for them will be added in later versions.

Data representation:

Once the data has been identified, the next task is to represent it:

OpenData will display the identified data as graphs, giving users the most complete view of their data.

The question of representation is to determine which kinds of graph can be drawn from the data and, when a graph is feasible, how to draw it.

In addition, the most important issue is to avoid presenting too many graphs to the user (we cannot display 1000 graphs and make users scroll through 999 of them to reach the one they need). Each graph must also be the kind that represents its data best (e.g. time-related data is better suited to a line chart than to a pie chart).
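A rule table is one possible way to express this kind of choice. The following sketch is purely illustrative: the rules, the priority order and the names `RULES`, `PRIORITY` and `suggest_charts` are assumptions, not OpenData’s actual logic. It maps pairs of column types to a chart type, ranks the candidates, and caps how many are returned so the user is not flooded with graphs.

```python
from itertools import permutations

# Assumed rules: (x-axis type, y-axis type) -> chart type.
RULES = {
    ("datetime", "number"): "line",
    ("text", "number"): "bar",
    ("number", "number"): "scatter",
}

# Assumed ranking: lower values are shown first.
PRIORITY = {"line": 0, "bar": 1, "scatter": 2}


def suggest_charts(column_types, limit=3):
    """Return at most `limit` (chart, x column, y column) suggestions, best first."""
    suggestions = []
    for x, y in permutations(column_types, 2):
        chart = RULES.get((column_types[x], column_types[y]))
        if chart is not None:
            suggestions.append((chart, x, y))
    suggestions.sort(key=lambda s: PRIORITY[s[0]])
    return suggestions[:limit]


types = {"Country": "text", "Year": "datetime", "earnings": "number"}
print(suggest_charts(types))
# [('line', 'Year', 'earnings'), ('bar', 'Country', 'earnings')]
```

Keeping the rules in a plain dictionary makes it easy to add or reorder chart types without touching the selection code.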

Chart of the data from the example above
Chart of a column of repeated data
Chart counting the occurrences of each record in a column
Chart of time data
Graph of time data aggregated by month
Graph of time data aggregated by year
Graphs of the data for each month across the years

That’s how OpenData represents data. In the future, with the help of AI, machine learning, big data and similar technologies, OpenData will be a great application for business.

Conclusion

OpenData focuses entirely on identifying and presenting data, so we expect it to do these tasks better than today’s multipurpose data processing software such as Excel or LibreOffice.

OpenData is still under development, so it still has many drawbacks, such as slow computation and poor handling of large amounts of data. These problems can be addressed in the future with big data processing technologies.

Our development team set out to build an application that automatically identifies data and presents it in the most appropriate way. During development we ran into many problems that we could not have predicted, but after analyzing them and finding solutions, OpenData has taken its first form.

There are still many problems to be solved, but we hope that Linagora in particular, and the development community in general, will contribute to OpenData so that it can soon become a great open source application.

Clip Demo
