Cortilia: a Data-Driven Company

Ivan Tiziano
Cortilia Team Blog

--

Today, following the path toward becoming a "Data-Driven Company" is essential: it means developing skills and making decisions based on data rather than opinions.
This article describes some parts of the "Data-Driven Model" we have adopted.

AWS Data Lake

The Data Lake is the centralized repository of all corporate data, in both structured and unstructured form. Making the data usable allows us to build:

  • Analyses, to discover relationships between components through data discovery tools;
  • Machine Learning models;
  • Integrations with third-party tools.

We built the data lake using the following AWS services:

  • Amazon S3 (with columnar Parquet storage)
  • AWS Glue (an ETL service that runs Apache Spark and Python in a serverless environment)
  • Amazon Athena (with the Presto SQL engine)
  • AWS Database Migration Service (DMS)
  • Amazon Kinesis

Specifically, Kinesis simplifies the collection, processing and analysis of data streams in real time, letting us obtain timely analyses and react quickly to new information. This real-time processing is a key building block of our data-driven culture.
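
As an illustration only (the stream name and event shape are hypothetical, not our production setup), a producer can push JSON events onto a Kinesis data stream with a few lines of boto3:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    def publish_event(event: dict, partition_key: str) -> None:
        # Records sharing a partition key land on the same shard,
        # so their relative order is preserved.
        kinesis.put_record(
            StreamName="cortilia-events",  # hypothetical stream name
            Data=json.dumps(event).encode("utf-8"),
            PartitionKey=partition_key,
        )

    publish_event({"order_id": 42, "status": "confirmed"}, partition_key="42")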

Architecture

At Cortilia we use an architecture that loads data from heterogeneous sources into the Data Lake, making it the single source of truth.

We split the architecture into three layers:

  • Data Source
  • Data Ingestion
  • Data Storage

Data Ingestion

This layer covers all the processes needed to move data from the source to the Data Lake.

The two ingestion methods we use today are:

  • Batch
  • Streaming

Batch

The batch part is implemented with AWS Glue, which loads data (via Python / Spark scripts) from a source in the Data Source layer. Each process is orchestrated through Glue Workflows, which provide scheduling and monitoring. The data is then loaded into the Data Storage layer.
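
For illustration, a minimal sketch of such a Glue job, with hypothetical database, table and bucket names:

    import sys
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext
    from awsglue.context import GlueContext
    from awsglue.job import Job

    # Standard Glue job bootstrap.
    args = getResolvedOptions(sys.argv, ["JOB_NAME"])
    glueContext = GlueContext(SparkContext())
    job = Job(glueContext)
    job.init(args["JOB_NAME"], args)

    # Read a source table registered in the Glue Data Catalog
    # (database and table names are hypothetical).
    source = glueContext.create_dynamic_frame.from_catalog(
        database="source_db", table_name="orders"
    )

    # Write it to the Data Storage layer as Parquet on S3.
    glueContext.write_dynamic_frame.from_options(
        frame=source,
        connection_type="s3",
        connection_options={"path": "s3://cortilia-datalake/orders/"},
        format="parquet",
    )

    job.commit()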

It is also possible to extend this architecture with Apache Iceberg, which adds transaction management, record-level updates and deletes, and many other features that improve the data governance process.
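
As a sketch of what this could look like (catalog name, warehouse path and table are hypothetical, and the Iceberg runtime jars must be available to Spark), record-level operations become plain transactional SQL:

    from pyspark.sql import SparkSession

    # Hypothetical Iceberg catalog backed by the Glue Data Catalog and S3.
    spark = (
        SparkSession.builder
        .config("spark.sql.extensions",
                "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
        .config("spark.sql.catalog.glue_catalog",
                "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.glue_catalog.catalog-impl",
                "org.apache.iceberg.aws.glue.GlueCatalog")
        .config("spark.sql.catalog.glue_catalog.io-impl",
                "org.apache.iceberg.aws.s3.S3FileIO")
        .config("spark.sql.catalog.glue_catalog.warehouse",
                "s3://cortilia-datalake/warehouse/")
        .getOrCreate()
    )

    # Record-level delete and update, executed as ACID transactions.
    spark.sql("DELETE FROM glue_catalog.db.orders WHERE status = 'cancelled'")
    spark.sql("UPDATE glue_catalog.db.orders SET status = 'shipped' WHERE order_id = 42")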

Streaming

The streaming part uses Amazon Kinesis together with AWS DMS: this pair of services loads into the Data Storage layer the change data capture (CDC) events produced on Amazon Aurora. Every change in the Data Source layer is replicated to our Data Lake in real time.
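
A minimal consumer sketch with boto3; the stream name is hypothetical, and the "data" / "metadata" envelope follows DMS's default JSON output for Kinesis targets, so treat the field names as assumptions:

    import json
    import boto3

    kinesis = boto3.client("kinesis")

    # Read one shard from its oldest record (a real consumer would
    # enumerate shards and checkpoint its position).
    shard_it = kinesis.get_shard_iterator(
        StreamName="aurora-cdc-stream",  # hypothetical stream name
        ShardId="shardId-000000000000",
        ShardIteratorType="TRIM_HORIZON",
    )["ShardIterator"]

    while shard_it:
        resp = kinesis.get_records(ShardIterator=shard_it, Limit=100)
        for record in resp["Records"]:
            change = json.loads(record["Data"])
            # DMS wraps each CDC event in "data" + "metadata";
            # the operation is insert, update or delete.
            operation = change["metadata"]["operation"]
            row = change["data"]
            print(operation, row)
        shard_it = resp.get("NextShardIterator")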

Data Storage

We have chosen to store our data in Amazon S3 in Parquet format for performance and cost reasons.
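
For illustration, a small partitioned Parquet write to S3 using the AWS SDK for pandas (awswrangler); the bucket and columns are hypothetical:

    import awswrangler as wr
    import pandas as pd

    df = pd.DataFrame({
        "order_id": [1, 2, 3],
        "amount": [19.9, 35.0, 12.5],
        "order_date": ["2022-01-01", "2022-01-01", "2022-01-02"],
    })

    # Columnar Parquet, partitioned by date on S3.
    wr.s3.to_parquet(
        df=df,
        path="s3://cortilia-datalake/orders/",  # hypothetical bucket
        dataset=True,
        partition_cols=["order_date"],
    )

Partitioning by date means a query engine scans only the partitions and columns a query actually touches, which is where the performance and cost savings come from.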

Amazon Athena

Amazon Athena uses Presto, an open-source distributed SQL query engine that enables low-latency, ad hoc analysis of data. With it you can query large datasets in Amazon S3 using ANSI SQL, with full support for joins, functions and arrays, directly on the Parquet storage format. It also connects to various business intelligence tools via the JDBC driver. In our context we use it with Tableau (https://www.tableau.com/it-it), which, in addition to BI, lets us work efficiently thanks to its self-service reporting capabilities.
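
A hedged example of querying the lake from Python through Athena (database, table and columns are hypothetical; the same SQL works from Tableau over the JDBC driver):

    import awswrangler as wr

    # ANSI SQL over Parquet files in S3, with no cluster to manage.
    df = wr.athena.read_sql_query(
        sql="""
            SELECT order_date, SUM(amount) AS revenue
            FROM orders
            GROUP BY order_date
            ORDER BY order_date
        """,
        database="datalake_db",  # hypothetical Glue catalog database
    )
    print(df.head())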

Conclusion

Data is the beating heart of a data-driven company; its value is essential for business analysis and development, for sustainability and for innovation. The Data Lake opens new scenarios and increases not only the depth of our analyses but also their performance. At Cortilia, thanks to the innovation that distinguishes us and to our choice of AWS, we have been able to increase the value of our data, obtaining ever more precise and strategically useful information (https://aws.amazon.com/it/big-data/datalakes-and-analytics/what-is-a-data-lake/).

--

Ivan Tiziano
Senior Data Engineer @Cortilia Spa Società Benefit