It is no secret that the enterprise world generates petabytes of volatile data every day. Some of the key challenges businesses face in data integration and consolidation are:
- Diverse data sources
- Data isolation / data silos
- No single version of the truth due to multiple data sources
- Time delays and the associated risk
The issue is compounded by growing organizational complexity: multiple systems and locations, the dynamics of mergers and acquisitions, and so on. It is not unusual to see senior executives spend time debating and resolving data issues before dealing with the business decisions at hand.
In the data world, it is well known that the Data Warehouse is the base of Business Intelligence, and ETL is the base of Data Warehousing. ETL stands for Extract-Transform-Load, the process by which data is extracted from source systems and delivered to the Data Warehouse. ETL is a three-step process in database management and data warehousing. In some designs, ETL incorporates cleaning as an additional step, and the sequence becomes Extract, Clean, Transform, and Load.
Keywords: ETL (Extract-Transform-Load), BI (Business Intelligence), DW (Data Warehouse), DB (Database), DBA (Database Administrator), DI (Data Integration).
WHAT IS ETL?
ETL begins with extracting data from various sources such as SAP, ERP, CRM, and other operational systems. Data from the different source systems is converted into one consolidated data warehouse format (to adhere to a particular schema) and made ready for transformation. In managing databases, extract, transform, and load refers to three separate functions combined into a single programming tool.
First, the extract function reads data from a specified source database and pulls out a desired subset of the data. Second, the transform function works on the acquired data, applying rules or lookup tables, or combining it with other data, to convert it into the desired state. Finally, the load function writes the resulting data (either the whole subset or only the changes) to a target database, which may already exist; if not, a new database is created and the data is written to it.
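The three functions can be sketched as a minimal pipeline. This is an illustrative example, not any particular product's API: the table names and the per-region aggregation rule are invented for the sketch, and two in-memory SQLite databases stand in for the source and target systems.

```python
import sqlite3

# Hypothetical source and target: two in-memory SQLite databases.
source = sqlite3.connect(":memory:")
target = sqlite3.connect(":memory:")

source.execute("CREATE TABLE sales (region TEXT, amount REAL)")
source.executemany("INSERT INTO sales VALUES (?, ?)",
                   [("EU", 120.0), ("US", 300.0), ("EU", 80.0)])

def extract(conn):
    # Read the desired subset of data from the source database.
    return conn.execute("SELECT region, amount FROM sales").fetchall()

def transform(rows):
    # Apply a rule: aggregate sale amounts per region.
    totals = {}
    for region, amount in rows:
        totals[region] = totals.get(region, 0.0) + amount
    return list(totals.items())

def load(conn, rows):
    # Write the result to the target, creating the table if it does not exist.
    conn.execute("CREATE TABLE IF NOT EXISTS sales_by_region (region TEXT, total REAL)")
    conn.executemany("INSERT INTO sales_by_region VALUES (?, ?)", rows)
    conn.commit()

load(target, transform(extract(source)))
```

In practice each function would carry its own configuration (connection details, mapping rules, write strategy), but the extract-transform-load hand-off is the same shape.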
WHY ETL IN DATA MANAGEMENT?
ETL is the process of moving and blending data from multiple sources and turning it into a meaningful, consolidated form, often to facilitate downstream analytics. But why is ETL needed in the data market? To answer this, consider a brand X with several retailers and stores in the market. Naturally, not all retailers use the same technology stack or software, so they do not collect and process data in a homogeneous format. Some retailers may use software and a DB schema similar to brand X's, dumping all their sales data (across geographies) into a database that syncs with X's central database or DW. Other retailers may use different DBs (Oracle, SQL Server, NoSQL, PostgreSQL, etc.) or a different technology stack altogether. There could even be retailers using simple forms of data storage such as text files and spreadsheets. The challenge, therefore, is to integrate all this data into a common format (the schema chosen by X) and bring it to a central place, which is ideally X's data warehouse.
This can be achieved using the ETL process. The Extract portion of ETL gets the data from the different data sources into a homogeneous format. The Transform portion handles logic and calculations such as percentage profits and currency conversions across countries. There can also be data cleaning activities, such as settling on a common abbreviation and expansion (Str. / St. -> Street in the address column, M & F -> Male and Female respectively, no. / num -> Number, etc.), which is essential for quality data. Finally, the Load functionality of ETL loads all the data into the DW database and makes it readily available in a suitable format. Load can also populate data cubes and tabular structures for optimized, fast, and aggregated data retrieval.
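The cleaning step above can be sketched as simple lookup tables. This is a minimal illustration of the abbreviation examples in the text; the rule tables are invented for the sketch, and a real transform would cover far more variants.

```python
# Hypothetical cleaning rules mirroring the examples in the text.
ABBREVIATIONS = {
    "str.": "Street", "st.": "Street",
    "no.": "Number", "num": "Number",
}
GENDER = {"M": "Male", "F": "Female"}

def clean_address(address):
    # Expand known abbreviations word by word, leaving other words intact.
    return " ".join(ABBREVIATIONS.get(word.lower(), word)
                    for word in address.split())

def clean_gender(code):
    # Map single-letter codes to full values; pass unknown codes through.
    return GENDER.get(code.strip().upper(), code)

print(clean_address("221B Baker Str."))  # → 221B Baker Street
print(clean_gender("f"))                 # → Female
```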
Nowadays, BI reporting tools work much more effectively on tailored inputs. Millions of datasets are processed by fast, highly optimized, scheduled, and automated ETL engines. All of the tedious work of data integration, consolidation, and availability is taken care of by the process framework. When dealing with heavy datasets, ETL is unavoidable. The dream of a single source housing all of an organization's data has long passed; integrating and leveraging data from a number of source systems has become a strategic skill for most analysis efforts in enterprise businesses.
ETL PRODUCTS: PURPOSE AND PAIN POINTS ADDRESSED
ETL products provide a visual framework for creating and designing ETL jobs (the workflows for data transformation and integration). They provide a clean level of abstraction over the code or scripts previously used to manage data. In other words, ETL tools let users focus on business logic and rules instead of the technical implementation: what to implement rather than how to implement it. ETL products are expected to generate optimized code behind a workflow and to either bring the data to the ETL server or push operations down into the databases, depending on performance requirements. In simple terms, ETL developers need not hand-optimize code; they only configure the required jobs and transformations, and the tool generates optimized code for the workflow.
Most ETL tools ship with pre-built connectors/drivers for most types of data sources. For example, to connect to a DB, all that should be needed are the DB server details and credentials; typically, users do not have to know how to code to connect to a data source. ETL products usually provide built-in mappings for each commonly used function, and the metadata repository in an ETL tool should hold a predefined mapping for every common data transformation on every data source.
For instance, say a business transformation requires deriving the date from a timestamp. In one database, the column could be DATE (a timestamp column); in another, it could be named DT. In the ETL tool, the user only selects a single date-from-timestamp function; the tool determines the type of the data, performs the corresponding mapping so that the underlying data source understands it, and may store the mapping for future use. A good ETL product should also be able to extrapolate this logic to similar transformations with other data sources and targets. It must also be able to communicate with many different relational databases and read the various file formats used throughout the company.
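The idea of one logical function mapping to source-specific implementations can be sketched as a small registry. This is an assumption-laden illustration of how a tool's metadata repository might work, not a real product's internals; the dialect names and the rendered SQL fragments are the only real-world details, and they use standard date-truncation idioms for each database.

```python
# Hypothetical metadata repository entry: one logical function,
# "date from timestamp", with a mapping per source dialect.
DATE_FROM_TS = {
    "oracle":    lambda col: f"TRUNC({col})",
    "sqlserver": lambda col: f"CAST({col} AS DATE)",
    "postgres":  lambda col: f"{col}::date",
}

def render_date_expr(dialect, column):
    # The user picks the logical function; the tool emits the
    # source-specific SQL for whichever column name the source uses.
    return DATE_FROM_TS[dialect](column)

print(render_date_expr("oracle", "DATE"))   # → TRUNC(DATE)
print(render_date_expr("postgres", "DT"))   # → DT::date
```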
Speaking of the advantages of ETL products, it goes without saying that the primary difference between ETL and conventional methods of moving data is ease of use. ETL products typically offer a user-friendly GUI that lets users conveniently transform data and quickly map tables and columns between source and target DBs with a visual data mapper, which is much faster than writing and maintaining conventional programs to parse files and modify data types. This is no longer confined to the developer domain; with ETL's growing popularity in automating most developer tasks, products are now built exclusively for BI purposes, delivering results directly to business executives. ETL products have become convenient tools that businesses can rely on for maximum performance.
APPLICATION OF ETL PRODUCTS
ETL tools are typically used by a broad range of professionals — from students in computer science (looking to quickly import large data sets) to database architects in charge of an enterprise account management system.
Companies may also use ETL to move information from one application to another permanently. For instance, the new application might use a different database vendor and, most likely, a very different database schema; ETL can transform the data into a format the new application can use. An example would be an Expense and Cost Recovery System (ECRS), as used by accountants, consultants, and legal firms. The data usually ends up in the time and billing system, although some verticals may want to use the raw data for employee productivity reports (in Human Resources) or equipment usage reports (in Infra Management) within the organization.
A decade ago, if a company worked with a lot of data or needed to move data around frequently, it most probably had a dedicated department of DBAs. The Database Administrators would take data from one department (or an outside source) and load it into another department's database, so that each department could set its own technical specifications for its projects. Today, an ETL tool can transform the data from other departments into whatever format works best for each. Consider a financial institution that holds information on a customer in several departments, each listing the customer in a different way: the membership department might list the customer by name, whereas the accounting department might list the customer by a number (Customer ID). ETL can bundle all of these data elements and consolidate them into a uniform presentation for storage in a central database or data warehouse.
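The financial-institution scenario can be sketched as a small join. All the records, field names, and the choice of joining on the customer's name are invented for the illustration; a real consolidation would use a proper matching key and handle duplicates and missing matches more carefully.

```python
# Hypothetical department feeds: membership keys records by name,
# accounting keys them by Customer ID.
membership = [{"name": "Ada Lovelace", "tier": "gold"}]
accounting = [{"customer_id": 42, "name": "Ada Lovelace", "balance": 150.0}]

def consolidate(membership, accounting):
    # Join the two departmental views on the customer name to
    # build one uniform record per customer.
    by_name = {m["name"]: m for m in membership}
    merged = []
    for acct in accounting:
        merged.append({
            "customer_id": acct["customer_id"],
            "name": acct["name"],
            "balance": acct["balance"],
            "tier": by_name.get(acct["name"], {}).get("tier"),
        })
    return merged

print(consolidate(membership, accounting))
```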
ETL can also be used to acquire a temporary subset of data for BI reports and views, or a permanent data set may be acquired for other purposes such as the population of a data mart or a data warehouse, conversion from one database type to another, and the migration of data from one database or a platform to another. (A data mart is the access layer of the DW environment that has a subset of the data which focuses on a specific business line or team and is used to get data out to the users.)
With tools based on the ETL framework, businesses gain tremendous connectivity and scalability. ETL products have started to migrate into Enterprise Application Integration systems that now cover much more than the extraction, transformation, and loading of data; many ETL vendors now offer data profiling, data quality, and metadata capabilities. While ETL tools have traditionally served developers and IT staff, the new trend is to put these capabilities in the hands of business users, so they can create connections and data integrations themselves when needed, rather than having a separate department work for them.
WORKING AND DESIGN OF ETL PRODUCTS
Technically speaking, in ETL, data extraction is where data is fetched from homogeneous and heterogeneous data sources. Next, data transformation is where the data is reshaped into the proper format or structure for querying and analysis; an important function of transformation is cleaning, which aims to pass only "proper" data to the target. Finally, data loading is where the data is loaded into the central database or data warehouse. Since extraction takes time, it is common to execute the three phases in parallel: while data is being extracted, the transformation process works on the data already received and prepares it for loading, and loading, too, begins without waiting for the previous phases to complete.
An intrinsic part of extraction is data validation, which confirms whether the data pulled from the sources has the expected values. Data that fails the validation rules may be rejected entirely or in part. The rejected data is reported back to the source system for further analysis, to identify and rectify the incorrect records. In some cases, the extraction process itself must apply validation rules before accepting data and passing it on to the next phase; the hand-off depends on the rule design and exception handling built into the process. In many cases, data is made available at regular intervals, and the time spent consolidating and processing it can make blending and analysis much easier.
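Partial rejection can be sketched as a validator that splits rows into accepted and rejected sets. The rule itself (a non-negative, non-null amount) and the row shapes are invented for the illustration.

```python
# Hypothetical validation rule: a row must carry a non-null,
# non-negative amount; everything else is rejected and reported back.
def validate(rows):
    accepted, rejected = [], []
    for row in rows:
        if row.get("amount") is not None and row["amount"] >= 0:
            accepted.append(row)
        else:
            rejected.append(row)
    return accepted, rejected

rows = [
    {"id": 1, "amount": 9.5},
    {"id": 2, "amount": None},   # missing value → rejected
    {"id": 3, "amount": -4},     # negative value → rejected
]
good, bad = validate(rows)
print([r["id"] for r in good])  # → [1]
print([r["id"] for r in bad])   # → [2, 3]
```

The rejected list is what would be reported back to the source system; the accepted list flows on to transformation.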
A typical ETL-based data warehouse uses a layered architecture, with staging, data integration, and access layers defining its key functions. The staging layer (or staging database) stores the raw data extracted from each of the disparate source systems. The integration layer integrates the disparate data sets by transforming the data from the staging layer and stores the transformed data in an operational data store (ODS) database. The integrated data is then moved to another database, where it is arranged into hierarchical groups, called dimensions, and into facts and aggregate facts; the combination of facts and dimensions is referred to as a star schema. Finally, the access layer helps users retrieve the data.
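The star schema of facts and dimensions can be sketched in miniature. The table names, keys, and figures here are invented for the illustration; the point is only the shape: facts reference a dimension by key, and the access layer joins and aggregates across them.

```python
# A toy star schema: one dimension table (products) keyed by a
# surrogate key, and one fact table (sales) referencing it.
dim_product = {
    1: {"name": "Widget", "category": "Hardware"},
    2: {"name": "Gadget", "category": "Hardware"},
}
fact_sales = [
    {"product_key": 1, "qty": 3, "revenue": 30.0},
    {"product_key": 2, "qty": 1, "revenue": 25.0},
    {"product_key": 1, "qty": 2, "revenue": 20.0},
]

def revenue_by_category():
    # Join each fact to its dimension row and aggregate, the way an
    # access-layer query would roll facts up along a dimension.
    totals = {}
    for fact in fact_sales:
        category = dim_product[fact["product_key"]]["category"]
        totals[category] = totals.get(category, 0.0) + fact["revenue"]
    return totals

print(revenue_by_category())  # → {'Hardware': 75.0}
```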
The design layout suggests reusability, scalability, and robustness of an ETL system across the lifetime of its usage to process data within service level agreements in an organization.
ETL PRODUCTS IN BI ENHANCEMENT
ETL is an intrinsic part of today’s business intelligence. It is the IT process from which data from disparate sources can be put in one place to programmatically analyze and discover actionable business insights. Hence, the need for a stable ETL framework that is robust, scalable and which focuses on utility is indispensable for enhancing BI and Data warehousing services.
Essentially, a product based on the ETL framework works in the background for a BI reporting tool. It acts as the backbone in data integration and consolidation for BI to project visual insights on the screen. If BI is about enabling corporate decision makers to navigate complex business environments, the focus of BI solutions should be more about the users/customers and less about the technology. In order to achieve this, there has to be another level of abstraction that can shoulder the burden of working with technology. This is the gap ETL products aim to fill.
The ETL industry has spent decades perfecting the process of data collection, storage, cleaning, and distribution. Taking BI to the next level means not only transforming data into information but also getting it into the right hands, at the right time and in the correct format to be consumed for timely decision-making.
The CIO of an enterprise needs actionable business insights from data. The primary concern of any business decision maker is answering questions such as:
- Which product or service offerings will best drive our revenue goals this quarter?
- What kind of product mix, cost, and pricing will preserve our gross margin requirements while we grow?
- Which customers are the most profitable? And how do we raise the profitability of those who fall short?
- Should we enter the market with this new product line or focus more on funding R&D?
- Do we put market share or profit margin at the top of the priority list for the next fiscal year?
- How much more can be allotted for additional marketing in the budget this quarter?
Such strategic decisions can impact many divisions and functional areas within the business. ETL products support BI by providing this consolidated, cross-functional data. The analysis needed for strategic decision-making often raises additional questions, which ultimately means additional requests for tailored data from the organization's ETL-processed data warehouse.
Data integration and consolidation have become a vital part of everyday business execution. If getting more consistent and faster results would address some of the DI pain points in an organization, implementing ETL and its products is invariably the best solution.