Data Warehouse — need for speed

Florin Paun
eMAG TechLabs
Published in
3 min readAug 27, 2015

The main purpose of a Data Warehouse (DW/DWH) is to provide BI (Business Intelligence) with the entire information set that enables decision factors to make the best business decisions. In other words, DWH requires periodical copying of huge data volumes from different sources (applications, ERP, files, websites etc.), that are stored on one or more servers to be used for statistical reports and analyses (data mining).

At eMag, the necessity for creating a DWH appeared in 2012 as a consequence of a huge increase in reporting demands in QlikView, namely of the problems occurred, such as:

  1. Non-syncronized reports: the same indicator is displayed differently in 3 reports. Causes:
  • different calculation formula, provided by different beneficiaries or created by different developers;
  • different data sources;
  • different time intervals regarding frequency and report updating moments.
  1. Non-updated information / poor reporting system. Cause: every night, for 5 hours, around 80 tables were imported from MySQL into SQL Server. Frequently, there were situations when one or more tables could not be imported as a consequence of the following causes:
  • changes in the structure of the source-table: column deletion, change of date type from smallint in bigint, from int in varchar etc.
  • anomalous values unsupported by the SQL Server: date type, such as 30.02.2012 or 00.00.000.

Therefore, missing data in certain key-tables influenced several essential indicators used in reporting.

At random, the benefits of a DWH are enumerated below:

  1. Integration of multiple sources: data imported from different sources (MySQL , MSSQL, SAP, PostgreSQL, csv/txt files, etc.), can be connected / linked fast and easily;
  2. Time-saving: as data are centralised from other sources, users can interrogate data much faster and in a much easier way, without IT support;
  3. Increase in data interrogation performance: a DWH automatically implies rethinking the entire architecture of its bases and tables. The objective is to create a small number of tables integrating a large volume of information (denormalization of database). For example, table Products from DWH eMAG contains all relevant information imported from 9 operational tables. The less number of tables in a join, the more performant;
  4. Historical data storage: unlike the transactional system, where data can be archived, modified or even lost, the DWH system keeps record of the change history;
  5. Increase in data quality and consistency: different cleaning and recording procedures are used in the data import process;
  6. Precalculation of indicators: a series of indicators / KPI used by BI can be calculated when data are imported into DWH, facilitating the BI colleagues’ work;
  7. Elaboration of prognoses: using special tools (R, SSAS, SPSS), we can apply a series of statistical analyses / algorithms / models, such as: predictions, product associations, classifications, segmentations, profilings, forecast, etc.

To have an overview upon the DataWarehouse operating system in eMag, see below its graphic representation in connection with data final sources and destinations.

DWH development in eMag is an ongoing process. At present, in the first half of 2015, the data volume reached about 6 Tb, covering almost all business areas. In a year, we expect to be up to 15 Tb and in 2 years, to exceed 30 Tb.

--

--