Accelerate your Snowflake data ingestions

May 20 · 10 min read

Do you prefer to focus on analytics rather than on loading overheads? Learn how you can leverage our solution to quickly ingest your data into Snowflake.

by Ricky Sharma, Soumya Ghosh

Let's start with a typical data ingestion life cycle:

Customers want to quickly analyze their data and use machine learning and predictive analytics to gain insights that fuel revenue and business growth. But none of this can happen until the data is loaded, which introduces several dependencies and delays time to market.

It requires them to retain large Information Technology resource pods for ETL development and operations. These pods are often dedicated to a single project or program, which further increases costs and resource management overhead. This workflow introduces redundancy in development and support efforts, increases cost, and creates inconsistent standards, as depicted below.

Challenges with the above life cycle:

• High data ingestion cost and a long wait before analytics can be performed.

• Varying standards for data ingestion and audits.

• Quality assurance and business analysis of the data are not sustained.

• Changes are costly and slow to implement.

• Spending recurs for each new data load.

• More siloed applications and teams are formed, increasing cost and maintenance effort.

How to overcome this problem?

As shown in the approach below, investing once in a sustainable data ingestion framework brings reusability across your enterprise and provides a consistent standard, eliminating the need to maintain several project/program pods.

But this approach must be supported by an intelligent framework: one that can reduce or even eliminate your ETL coding effort, consolidate management, shorten development cycles, and support complicated data load requirements. All of this must come through a parameterized way of managing the code, so that you don't maintain several code files, along with the pods and resources to support them.

We can help!

Using the above approach, we have designed a Data Load Accelerator using Talend that provides a configuration-managed data ingestion solution. You can load structured and semi-structured datasets from cloud storage supported by Snowflake, achieve complete audit and traceability, and avoid writing individual ETL pipelines for each use case.

Key features of Data Load Accelerator:

1. Modern Data Architecture: Leverage a dynamic, profile-driven architecture that brings together the best of Talend, Snowflake, and Azure/AWS capabilities.

2. One code for all your needs: With a configuration-based ingestion model, all your data load requirements are managed with one code base.

3. Lightning-fast ingestions: Leverage mature design principles, such as Extract-Load-Transform (ELT over ETL) and execution push-down to Snowflake, to achieve fast ingestions and multi-threaded processing.

4. Enterprise-grade standards: Adopt a consistent data loading, metadata management, CI/CD, and DevOps standard for your enterprise.

5. No new language to learn: Have your profilers design and manage their ingestions using simple ANSI SQL.

6. Operations optimized: Leverage automated audit reports, detailed error logs, and load statistics reports for a seamlessly traceable workflow. Manage failure resolution easily with simple restart commands, as the accelerator consistently manages failure and restart operations.

7. Design workflows easily: Completely control your data load orchestration activities, even multi-thread them and manage dependencies between the threads.
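The push-down principle behind feature 3 can be sketched in a few lines: instead of pulling rows through the ETL tool, the orchestrator only generates a set-based SQL statement, and Snowflake executes it. This is an illustrative sketch under assumed names; the function and tables below are hypothetical, not the accelerator's actual code.

```python
# Illustrative sketch of ELT push-down: the orchestrator only builds a
# set-based SQL statement; Snowflake does the heavy lifting.
# All names here are hypothetical.

def pushdown_insert(source: str, target: str, select_list: str,
                    predicate: str = "") -> str:
    """Build one INSERT..SELECT that runs entirely inside Snowflake."""
    sql = f"INSERT INTO {target} SELECT {select_list} FROM {source}"
    if predicate:
        sql += f" WHERE {predicate}"
    return sql

print(pushdown_insert("STAGE.ORDERS_RAW", "DW.ORDERS",
                      "order_id, amount, load_ts",
                      "load_ts >= '2020-05-01'"))
```

Because the statement runs as a single set-based operation inside Snowflake, the warehouse's parallelism does the work, rather than row-by-row processing in the ETL engine.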

Components of Data Load Accelerator:

1. Pre-requisite (Optional):

Data Load Accelerator works with a Cloud Storage layer (e.g.: AWS S3 or Azure Blob) for ingesting data into Snowflake. A separate effort may be needed to bring your data into this layer.

2. Bring together diverse data:

Data Load Accelerator provides two executable components. These components can run with a dependency between them or be fully decoupled.

a. “Extract and Load” component, ‘EL’ of ELT, copies your data into Snowflake, and

b. “Transform” component, ‘T’ of ELT, manages data preparation and transformations for your complex business requirements.

These components provide:

· Complete parameterization: Maintain one code base by leveraging parameters, the properties that drive code execution. Some parameters are:

- Environment Name: Provide values such as ‘Dev’, ‘QA’, ‘PROD’.

- Source Name: Provide a logical name for a group of ingestions, such as ‘Sports’, ‘Finance’, ‘Billing’, or ‘Social Media’.

· Profile your ETLs instead of coding them: As a profiler, you receive complete control of all the parameters you need to perform COPY and Transform operations. These parameters are input tables to the Talend orchestrator/decider, so you don’t have to code any Talend pipelines.

· Workload customization: Keep complete control over how you orchestrate your workloads, set dependencies, and provision Snowflake Compute (Warehouse) for each task.

· Seamless maintenance and support: Use automated reports that collect Talend and Snowflake logs to provide information on data load statistics, errors, and audit findings. Achieve complete source-to-target traceability of your data by using pre-enabled ETL parameters. When a job fails due to data errors, simply fix the data and rerun the job; the accelerator is designed to auto-detect past failure instances and customize the workflow in real time.
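To make the parameterization concrete, here is a minimal sketch of how the Environment Name and Source Name parameters could resolve the physical objects that a single code base works against. The naming convention and function are invented for illustration, not the accelerator's real schema.

```python
# Hypothetical sketch: two profile parameters drive which physical
# objects one code base touches. The naming convention is assumed.

profile = {"environment_name": "Dev", "source_name": "Billing"}

def qualified_table(profile: dict, table: str) -> str:
    """Resolve an environment- and source-qualified table name."""
    env = profile["environment_name"].upper()
    src = profile["source_name"].upper()
    return f"{env}_DB.{src}.{table.upper()}"

print(qualified_table(profile, "invoices"))  # DEV_DB.BILLING.INVOICES
```

Swapping the profile values from ‘Dev’ to ‘PROD’ repoints the same code at production objects, which is the essence of maintaining one code base across environments.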

Overview:

A. EXTRACT AND LOAD: This component ingests raw datasets from Cloud Storage into Snowflake, by using the profile tables that provide the information to dynamically generate the COPY commands.

You can profile historical and incremental (delta) data loads, group datasets to process together, choose target storage areas (Stage: Transient layer and History: Persistent layer), perform wildcard scans, multi-thread copies, and even audit the loaded data.

Key profiling parameters are:

· What: Information about File name, location, wildcard criteria, file type (data or manifest), file format (CSV, JSON, AVRO, ORC, PARQUET, XML) etc.

· Where: Snowflake target table details — including Table name, location, layer (History / Raw), any transformations needed, any data lineage needed etc.

· How: Order of executing the COPY commands (sequential, parallel or both), and Snowflake Warehouse to use per task etc.
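As an illustration of how the what/where/how parameters could drive command generation, here is a sketch that turns one hypothetical profile row into a Snowflake COPY statement. The profile column names and stage are invented; the real accelerator reads these values from profile tables.

```python
# Hypothetical profile row -> Snowflake COPY command. Column names
# and the stage are illustrative, not the accelerator's real schema.

def build_copy(row: dict) -> str:
    return (
        f"COPY INTO {row['target_table']} "
        f"FROM @{row['stage']}/{row['path']} "
        f"PATTERN = '{row['pattern']}' "
        f"FILE_FORMAT = (TYPE = '{row['file_format']}')"
    )

row = {
    "target_table": "STAGE.SALES_RAW",   # 'where': target table
    "stage": "my_cloud_stage",           # 'what': file location
    "path": "sales/2020",
    "pattern": ".*sales_.*[.]csv",       # 'what': wildcard criteria
    "file_format": "CSV",                # 'what': file format
}
print(build_copy(row))
```

The same generator handles historical and incremental loads alike; only the profile rows change, never the code.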

B. TRANSFORM: This component orchestrates and manages the data transformation task executions, by using the profile tables that provide the information to dynamically create and run the data transformation tasks.

You can profile several data transformation and manipulation tasks, process historical or incremental (delta) loads, leverage different schema types — such as Star schema and Snowflake schema, leverage different data modelling approaches — such as Business Event Analysis and Modelling (BEAM) and Data Vault Modelling, maintain Slowly Changing Dimension (SCD) structures, and even multi-thread transformation tasks.

Key profiling parameters are:

· What: Information about source table, SQL code to execute, SQL type (insert, update, alter, delete, CTAS etc.), history/incremental run etc.

· Where: Snowflake target table information.

· How: Order of executing the transformation tasks (sequential, parallel or both), Snowflake Warehouse to use per task etc.
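A sketch of how the SQL-type parameter might shape the statement that gets executed. This dispatch logic is purely illustrative; the table and column names are hypothetical.

```python
# Hypothetical dispatch on the profile's SQL type: insert and CTAS wrap
# the profiled SELECT, while other types arrive as complete statements.

def build_transform(row: dict) -> str:
    if row["sql_type"] == "CTAS":
        return f"CREATE TABLE {row['target_table']} AS {row['sql_code']}"
    if row["sql_type"] == "insert":
        return f"INSERT INTO {row['target_table']} {row['sql_code']}"
    return row["sql_code"]  # update / alter / delete run as-is

row = {
    "sql_type": "CTAS",
    "target_table": "DW.DIM_TEAM",
    "sql_code": "SELECT DISTINCT team_id, team_name FROM STAGE.MATCHES_RAW",
}
print(build_transform(row))
```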

3. Offload and Modernize EDW:

Data Load Accelerator does not impose limitations on your data modelling approach or schema type. It is based on a push-down methodology, so consider it a wrapper that orchestrates and productionizes your data ingestion needs. It supports any SQL command that can run in Snowflake.

4. Running your ingestions:

A. Use Talend Server and TAC:

You can maintain a Talend Server and Talend Administration Center (TAC), a web-based application that centralizes the management and administration of your Studio, to execute and monitor the data ingestion pipelines generated by this accelerator.

TAC can schedule and run jobs in batch or event-based workflows, record logs, support code versioning, and even integrate with an external version control system such as Git.

B. Run Serverless:

Maintaining servers is time consuming and tedious. This motivates serverless models, which provide an engine to run your code virtually and pay per use. All major cloud platforms, including Amazon Web Services (AWS), Microsoft Azure, and Google Cloud Platform (GCP), provide serverless containers.

Data Load Accelerator is designed to run in a serverless model. You can create a Java package of the accelerator from Talend Studio, ship it as a container image, and execute the image in a cloud container service such as Azure Container Instances.

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Sample data ingestion workflows you can create:

Here are some sample data ingestion pipelines that you can configure using this accelerator.

A. Starting with a Copy Workflow:

The example below generates Data Copy pipelines, ingesting datasets from cloud storage into Snowflake using the “Extract and Load” component. Here, we have grouped datasets into Sources, where each Source has multiple datasets to copy and dependencies to manage:

Notice in this example that Source1 is a group of 13 datasets configured to load together following a multi-threaded pattern, where:

i. Level1 multi-threads copy operations for 4 datasets, then

ii. Level2 executes 5 data copies in parallel, then

iii. Level3 copies dataset 10, and

iv. Level4 will copy datasets 11, 12 and 13 in parallel.

Similarly, Source2 and the others define different workflows and dependencies. All these pipelines are dynamically created by the accelerator, based on the profile parameters input by the users.
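The level pattern above can be sketched with a thread pool: datasets within a level copy in parallel, and each level acts as a barrier before the next one starts. This is a simplified illustration; `run_copy` is a stand-in for executing a real COPY command against Snowflake.

```python
from concurrent.futures import ThreadPoolExecutor

# Sketch of the Source1 workflow: four sequential levels, with the
# datasets inside each level copied in parallel. run_copy stands in
# for the real COPY execution.

levels = {
    1: ["ds1", "ds2", "ds3", "ds4"],
    2: ["ds5", "ds6", "ds7", "ds8", "ds9"],
    3: ["ds10"],
    4: ["ds11", "ds12", "ds13"],
}

completed = []

def run_copy(dataset: str) -> None:
    completed.append(dataset)  # placeholder for the COPY INTO call

for level in sorted(levels):
    with ThreadPoolExecutor(max_workers=5) as pool:
        list(pool.map(run_copy, levels[level]))  # pool exit = level barrier

print(len(completed))  # 13
```

Exiting the `with` block waits for every thread in the level, which is what enforces the sequential dependency between levels.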

B. Let’s look at a Transformation workflow now:

The example below generates Data Transformation pipelines, performing various data manipulation actions in Snowflake using the “Transform” component of the accelerator. Here, we have grouped transformations into Sources, where each Source has multiple transformations to execute and dependencies to manage:

Notice in this example that Source1 is a group of 19 SQL transformations configured to execute together following a multi-threaded pattern, where:

i. Level1 multi-threads 10 SQL transformations in parallel, then

ii. Level2 executes 5 SQL transformations in parallel, then

iii. Level3 executes 3 SQL transformations in parallel, and

iv. Level4 will execute the last SQL transformation.

Similarly, Source2 and the others define different workflows and dependencies. All these pipelines are dynamically created by the accelerator, based on the profile parameters input by the users.

Note: The current version of the accelerator expects an entire SQL transformation script to be input within the profile. It provides run-time global parameters to your SQL code that you can use to identify incremental windows or achieve ETL lineage. Examples of such parameters are @ETLLOGID, @FILEID, @FILEDATE, @STARTTIME, and @ENDTIME.
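One way such run-time parameters could be bound is by simple token substitution before the script executes. The sketch below is illustrative only; the substitution mechanics and the sample script are assumptions, not the accelerator's actual implementation.

```python
# Illustrative token substitution of run-time global parameters into a
# profiled SQL script. The substitution mechanism here is assumed.

def bind_params(sql: str, params: dict) -> str:
    for name, value in params.items():
        sql = sql.replace(f"@{name}", str(value))
    return sql

script = ("INSERT INTO DW.FACT_SALES "
          "SELECT s.*, @ETLLOGID AS etl_log_id FROM STAGE.SALES_RAW s "
          "WHERE s.load_ts BETWEEN '@STARTTIME' AND '@ENDTIME'")

print(bind_params(script, {
    "ETLLOGID": 1001,
    "STARTTIME": "2020-05-20 00:00:00",
    "ENDTIME": "2020-05-20 23:59:59",
}))
```

Carrying a value like @ETLLOGID into the target table is what enables the source-to-target lineage described earlier.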

C. Here’s how both Copy and transform workflows can be combined:

The example below depicts end-to-end pipelines created by the accelerator for various projects and programs within an organization. The hierarchy is Program > Project > Source > Task, where a Task is either a copy or a transform.

Notice in this example that Source1 is a group of 13 datasets copied from cloud storage into Snowflake, after which 19 SQL transformations execute on that data. The dependency requirements within the 13 data copy operations and the 19 SQL transformations are described in the previous sections.

Similarly, Source2 and the others can maintain their own workflows and dependencies, all of which are completely configuration driven and created in real time by Talend.

The real beauty is that the accelerator is parameterized to support such models, providing one single code base to manage all your needs. You can use a single set of profile metadata for all your programs/projects, let some programs/projects share profile metadata, or separate them so that each project/program maintains its own profile metadata.

D. Time for a real-world scenario:

Below is a “Sports business area” use case, where 4 incremental files must be ingested into Snowflake, de-normalized, and transformed into a Star Schema structure, refreshing dimension, fact, and aggregated tables for data consumers.

Key requirements include:

· For all 4 datasets, incremental data copies must run in parallel, loading the data into a Stage layer in Snowflake (creating a data lake for this raw data), then

· All dimension tables must be refreshed in parallel with this new incremental data, then

· Incremental data must integrate with the Base Fact table, and finally

· Aggregate tables must be refreshed in parallel.

Requirements like these, and more, can be easily profiled in Data Load Accelerator, providing a user-friendly data ingestion and integration solution.
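The four requirements above map naturally onto the level pattern shown in the earlier workflows. A hypothetical set of profile rows for this Source might look like the following; all task names are invented for illustration.

```python
# Hypothetical profile rows for the Sports use case: level 1 copies run
# in parallel, then dimensions refresh, then the fact merge, then the
# aggregate refresh. Task names are illustrative.

sports_profile = [
    {"task": "copy_players",       "level": 1},
    {"task": "copy_teams",         "level": 1},
    {"task": "copy_matches",       "level": 1},
    {"task": "copy_scores",        "level": 1},
    {"task": "refresh_dim_player", "level": 2},
    {"task": "refresh_dim_team",   "level": 2},
    {"task": "merge_fact_match",   "level": 3},
    {"task": "refresh_aggregates", "level": 4},
]

# Group the rows into sequential batches of parallel tasks.
plan = {}
for row in sports_profile:
    plan.setdefault(row["level"], []).append(row["task"])

print([plan[lvl] for lvl in sorted(plan)])
```

Each batch runs in parallel and the batches run in order, satisfying the copy-then-dimensions-then-fact-then-aggregates dependency chain without any bespoke pipeline code.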

— — — — — — — — — — — — — — — — — — — — — — — — — — — — — — —

Closing notes:

We hope this post gave you an understanding of our solution and how it helps you fast-track your data delivery, achieve consistency, and mature your development and operations.

Please reach out to us if you have questions or want to discuss further.


About Authors:

Ricky Sharma and Soumya Ghosh represent the Data and Analytics Practice in Slalom's NYC office. With 12+ years of experience across a wide variety of data warehousing technology solutions and multiple industry verticals, including Media and Entertainment, Financial Services, Retail, and Life Sciences, we are passionate about implementing best-in-class solutions that solve your challenging data problems by blending innovative ideas with the right attitude and technology. We also co-lead the Slalom-Snowflake alliance for the NYC market.