If Extraction from SAP were easy…

In this article, I cover the extraction piece from Netweaver/ABAP-based systems (ECC, S/4, BW, BW/4), sharing some tips to free your SAP data and land it in Snowflake in a repeatable, resilient way, with full and delta loads. I only talk about solutions I know have been deployed at real customers, and I don't cover IDoc or RFC_read approaches.

Disclaimers apply: I'm a Snowflake employee, but the opinions here are my own and don't necessarily reflect the views of my employer. I'm not endorsing any of these products; I'm only offering one opinion. Check with the suppliers before buying, and enjoy!

Why Pull/Push SAP Data Out to Snowflake?

The goal of moving data out of SAP into Snowflake is that end-users can:

  • truly self-serve (one copy of the data, but isolated compute so everyone can play at the same time: no database crashes, no service windows, and no end-of-month restrictions),
  • mix their data with all that fantastic enterprise, sensor, social, demographic, geographical, supply chain, and statistical data, and
  • avoid embarrassing questions from their boss like, “Where’s my report?!”

Customers get this at a reasonable price point on Snowflake’s Data Cloud. Snowflake scales with your company, and you only pay for what you use.

There are two main ways I've seen to get data out of an SAP ERP system consistently and with good quality: through the ABAP application layer, or through the database layer (directly or via change data capture (CDC) logs).

Netweaver/ABAP Application Layer

The benefit of accessing data through the SAP application layer is that it is supported by SAP, through the use of an SAP user and indirect read access from SAP ECC and S/4. The Operational Data Provisioning (ODP) API abstracts BW extractors, CDS Views, and SLT trigger-based table replication, and manages deltas and resiliency through its queue.

Tip: Don't get confused by the name Netweaver disappearing in S/4HANA. For most intents and purposes, it has simply been renamed ABAP.

Extractors

These are SAP-delivered business content or customer-built extractors. Here's a good article on it. You can reuse the ones you are (or were) using for BW, or build your own. The SAP ODP API V2 exposes these to requesting services (e.g., SAP BW, Azure Data Factory, Google Cloud Data Fusion, AWS Glue, and several traditional ETL tools). Tools such as SNP Glue can reuse pre-built extractors or generate their own. The BW extractors you were using in your ECC will, for the most part, also work for S/4HANA.

ABAP CDS Views

CDS views are explained in this article. There are some 36,000 of them in an S/4HANA system, and not all of them are meant for analytical reporting. You can use CDS views to get data out of S/4 systems through the SAP ODP API V2. You can also expose individual CDS views through web services (but that seems like a lot of work and maintenance). Depending on the CDS view, you can get initial and delta loads.

Trigger-based Change Data Capture

Trigger-based change data capture is managed at the application layer, which creates and destroys shadow logging tables; while active, these add roughly 5% of the original table size to your database.

SAP Landscape Transformation (SLT) provides trigger-based CDC. It can be accessed through ODP/ODQ, through Data Services, or through its API. The upside of using SAP tools is that there are no SAP license issues (because you just paid for the SLT and Data Services licenses). The downside is that it puts load on the application server. SLT can be split out onto a separate machine to reduce the processing load, as can SNP Glue.

Tip: If the database underlying your ERP is almost full, consider archiving to Snowflake. SNP Outboard is a good solution for this. You still have access to the data through the SAP GUI, and it can also be exposed as tables in Snowflake for analytical workloads.

Tip: If you buy SLT, get support from SAP to configure this correctly, so enough underlying processors are available to run SLT.

SAP offers a cloud-hosted solution, SAP Data Intelligence (DI), that you can spin up from the Business Technology Platform (spinning up meaning anywhere from 30 minutes to a couple of hours) and that can orchestrate SLT and SAP Data Services for you.

Tip: Data Intelligence leverages SLT for trigger-based replication from older systems (ECC 6.x); see SAP note 2890171. So if you have ECC 6.x, you don't really need Data Intelligence, just SLT and Data Services (if you go the SAP route).

Database layer

Direct

The benefit of accessing the database layer is lower latency. You hit the database tables directly, and the extraction tool manages the full and delta loading. The downside is that you need the enterprise license from the database provider.

Tip: SAP admins will freak out (not a pretty sight) about non-monitored database connections to production transactional systems, so put on your body armour (or, honestly, don't do it).

Tip: Some clever customers buy the smallest enterprise HANA they can, offload data from the ERP to it with SLT, and then push it through to Snowflake with an ETL tool such as Azure Data Factory.

Log-based Change Data Capture

Any time a database does something, it logs the action in a file. Some extraction tools have you install a monitoring tool on the database server (or next to it) to pick up any changes made to these log files and push them to downstream systems. Well, the unencrypted stuff anyway.

The upside is that this can be an extremely fast and scalable way to capture data: the load on the application server is zero, and the load on the database server is minimal, since these are just read operations.

Another possible upside is that the tool providers will argue you don't need an enterprise license for the database, because you are not connecting to the database itself, just reading its log files.

Tip: Check how the extraction tool gets the initial load of data. Is it using that not-permitted database connection? Or do you instead have to install something on the ERP to trigger the initial load? If the latter, you should be OK, because it's the Netweaver/ABAP stack telling the database what to do.

Warning: SAP note 2971304 warns the reader that using log-based CDC replication against HANA can generate fictional data. How is it generated? Data fairies, a nasty bunch, cousins to Tinker Bell. They smoke and drink, excessively.

Tip: To ward off fictional data (and fairies) use a tool that has data validation checks.
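If the tool doesn't do the checking for you, a simple reconciliation query on the Snowflake side is cheap to run. Here's a minimal sketch, assuming a hypothetical landing table and curated target (SAP_RAW.VBAK_STAGE and SAP_CURATED.VBAK) and a few illustrative columns:

```sql
-- Minimal reconciliation sketch (hypothetical table and column names).
-- Compare row counts and an order-independent hash of a few columns
-- between the landed extract and the curated target.
SELECT
    (SELECT COUNT(*) FROM SAP_RAW.VBAK_STAGE)                       AS stage_rows,
    (SELECT COUNT(*) FROM SAP_CURATED.VBAK)                         AS target_rows,
    (SELECT HASH_AGG(VBELN, ERDAT, NETWR) FROM SAP_RAW.VBAK_STAGE)  AS stage_hash,
    (SELECT HASH_AGG(VBELN, ERDAT, NETWR) FROM SAP_CURATED.VBAK)    AS target_hash;
-- If the counts or hashes differ, the fairies have been at it.
```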

Data Latency and Volume

There is a cost difference between batch-loading once a month and event-based loading. Some tools run 24 hours a day, seven days a week, trickling data into Snowflake; others run once a day. How often you do your merge operations also has an impact.

Tip: Choose a tool that can handle the latency you need and, if there is a large volume, can leverage Snowflake’s COPY INTO command. Avoid tools that want to write to Snowflake row by row using ODBC/JDBC connections. They should use bulk loading, or, currently in Snowflake private preview, streaming ingest.
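To illustrate, here is a minimal bulk-load sketch, assuming a hypothetical internal stage and landing table that the extraction tool writes to (names, path, and file format are made up for the example):

```sql
-- Bulk-load sketch (hypothetical stage, path, table, and file format).
-- The tool drops files onto a stage; one COPY INTO then loads them
-- in bulk instead of inserting row by row over ODBC/JDBC.
CREATE STAGE IF NOT EXISTS SAP_RAW.EXTRACT_STAGE;

COPY INTO SAP_RAW.VBAK_STAGE
  FROM @SAP_RAW.EXTRACT_STAGE/vbak/
  FILE_FORMAT = (TYPE = CSV FIELD_OPTIONALLY_ENCLOSED_BY = '"' SKIP_HEADER = 1)
  ON_ERROR = 'ABORT_STATEMENT';
```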

Append-only Persistent Staging Area (PSA)

Some tools want to manage the merge and delete operations themselves instead of leaving them to Snowflake. This is OK for small and medium deployments, but at hundreds of millions of rows a day it can lead to expensive merge operations, especially for tools that do hard deletes (you know who you are…).

Tip: For large data flows, keep the extraction tool as simple as possible and make sure it can land the data into an append-only persistent staging area. Think carefully about how much Time Travel you want to give the PSA and how often you want to cull it, if at all.
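As a sketch of what that can look like, assuming hypothetical table names and a change-type column delivered by the extraction tool: the tool only ever inserts into the PSA, Time Travel is kept short, and the merge into the curated layer runs on whatever schedule your latency requires.

```sql
-- Append-only PSA sketch (hypothetical names and change-tracking columns).
CREATE TABLE IF NOT EXISTS SAP_PSA.VBAK_CHANGES (
    VBELN        VARCHAR,          -- sales document number
    ERDAT        DATE,
    NETWR        NUMBER(15,2),
    CHANGE_TYPE  VARCHAR,          -- 'I' / 'U' / 'D' as delivered by the tool
    EXTRACTED_AT TIMESTAMP_NTZ
)
DATA_RETENTION_TIME_IN_DAYS = 1;   -- keep Time Travel short on a high-churn PSA

-- Periodic merge into the curated layer, taking only the latest change per key.
MERGE INTO SAP_CURATED.VBAK AS tgt
USING (
    SELECT *
    FROM SAP_PSA.VBAK_CHANGES
    QUALIFY ROW_NUMBER() OVER (PARTITION BY VBELN ORDER BY EXTRACTED_AT DESC) = 1
) AS src
ON tgt.VBELN = src.VBELN
WHEN MATCHED AND src.CHANGE_TYPE = 'D' THEN DELETE
WHEN MATCHED THEN UPDATE SET tgt.ERDAT = src.ERDAT, tgt.NETWR = src.NETWR
WHEN NOT MATCHED AND src.CHANGE_TYPE <> 'D' THEN
    INSERT (VBELN, ERDAT, NETWR) VALUES (src.VBELN, src.ERDAT, src.NETWR);
```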

Deployment and Elasticity

I see three deployment patterns:

ABAP

Installation of an ABAP program in SAP to manage the extraction and load directly into Snowflake (through the JCo library). The forerunner here is SNP Glue (SAP certified, even against data fairies). Customers include CPGs, manufacturers, pharma, and energy companies.

Fivetran HVR and Qlik Replicate use a combination of ABAP (to trigger the initial load) and CDC. As customer examples, Pitney Bowes and Coke One North America use Fivetran HVR, and ABB uses Qlik Replicate. Most customers have over-dimensioned SAP application servers, so installing an ABAP program is acceptable.

Tip: Initial loads must be planned during low-utilization times, such as over a weekend or multiple weekends. Initial loads can take tens of hours depending on how much data history you have, but so what?

ETL or Dumb Adapter

Installation of a virtual or physical machine on the same network as the ERP (usually on-premises) that hosts either an ETL application in the traditional ETL approach (Informatica, Talend, etc.) or an adapter (Fivetran HVR, Qlik Replicate).

CDC Monitors

Installation of software on the database server machine to capture logs (Fivetran HVR, Qlik Replicate).

Adding additional machines to your network takes time and requires maintenance, but it can lower the load on the SAP systems.

Tip: One size doesn't fit all. If you're pulling in a couple of million records a day in a batch with a latency of, well, a day, that is a very different pattern from a billion records a day with a five-minute latency and complex transformations. Choose accordingly.

Monitoring

Does the tool provide monitoring and alerts in case of error? Is it built into and does it leverage SAP's own monitoring, or is it an application on its own? SNP Glue leverages SAP monitoring; others have their own monitoring dashboards.
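Whichever way the tool monitors itself, you can also sanity-check the Snowflake end of the pipeline. A small sketch using Snowflake's COPY_HISTORY table function, assuming the hypothetical PSA table from earlier:

```sql
-- Recent load activity and errors for a landing table (hypothetical name).
SELECT file_name, last_load_time, row_count, error_count, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
         TABLE_NAME => 'SAP_PSA.VBAK_CHANGES',
         START_TIME => DATEADD('hour', -24, CURRENT_TIMESTAMP())))
ORDER BY last_load_time DESC;
```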

Operations

This, in my opinion, is critical. Is the data being pushed out of SAP from the inside? If so, it is logically the SAP team that manages the creation, maintenance, and scheduling of the extractions, or at least the access to them. SNP Glue falls into this camp.

Or is it an outside process capturing changes, looking in from the outside? That's HVR and Qlik Replicate.

Tip: Understand and assign responsibility for this task, and consider the implications of a Data Mesh strategy here.

Support and SAP Certification

You should get support from the tool vendor, but there are also tools on the market that are supported directly by SAP and are certified to work with SAP. Not a requirement, but it does provide some comfort. Read the fine print from providers: did they certify on a different aspect of SAP, and are they blurring the lines here?

Migration to S/4HANA

All customers have to migrate from their SAP ECC to S/4HANA, or get off Das Boot. When are you doing it? If in less than three to five years, I’d check that your tool will work for both.

Tip: Consider that if you are hitting the database table level, or using log-based CDC, the data models of ECC and S/4 are different.

Tip: Consider that if you move to the cloud-hosted S/4HANA Private Edition, your extraction tool may no longer have access to the database layer or the logs.

Ecosystem

Are you also extracting data from other systems? Do you just want to have one tool that can do it all, at the risk that it’s not specialised for SAP? Do you have a dedicated SAP extraction tool for your SAP data, and another tool for everything else?

Tip: If you have multiple sources and just want one extraction tool, consider something like Fivetran HVR.

Tip: If you are just concerned about pushing data from SAP and want something certified, consider a bespoke tool like SNP Glue. They are also getting SAP Rise certification soon.

Tip: If you are not too concerned about real-time replication, consider some of the cloud provider tools like Azure Data Factory, AWS Glue, or Google Cloud Data Fusion.

Tip: If you already have Kafka set up, Init’s ODP to Kafka connector might be interesting for you. It will become particularly interesting when Snowflake streaming ingest goes into general availability.

Business Content

I always know when the SAP representative is on the call. In the first second of the call, he or she asks, in one breath, “Can you leverage the business content in a HANA or BW system and bring it to Snowflake?”

Tip: Take a deep breath when talking to SAP folks; they want what's best for their clients, and they often see that Snowflake can augment that experience.

The short answer is that you can export your customised HANA and CDS views to XML, build an application to analyse their complexity, and then convert the XML into SQL views. BW cubes can likewise be pushed down into views and brought over.
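To give a feel for the end state, here's a minimal sketch of what a converted view might look like on the Snowflake side, assuming hypothetical replicated VBAK/VBAP tables in a SAP_CURATED schema (real converted business content views are far richer):

```sql
-- Minimal sketch of a converted "business content" view (hypothetical names).
-- The logic that lived in a HANA calculation view or CDS view becomes a plain
-- SQL view over the replicated sales header (VBAK) and item (VBAP) tables.
CREATE OR REPLACE VIEW SAP_ANALYTICS.SALES_ORDER_ITEMS AS
SELECT
    h.VBELN  AS sales_document,
    h.ERDAT  AS created_on,
    h.VKORG  AS sales_org,
    i.POSNR  AS item_number,
    i.MATNR  AS material,
    i.NETWR  AS net_value
FROM SAP_CURATED.VBAK h
JOIN SAP_CURATED.VBAP i
  ON i.VBELN = h.VBELN;
```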

Don't want to build the application? Global Systems Integrators such as Accenture, LTI, Deloitte, and Infosys have built analysis and conversion tools to do just that, and have the consulting practices to support the rest. Check out Santosh's blog on this.

Tip: If you're doing a brown-field transformation, get our Professional Services involved to provide a readiness assessment. They can also build performant and reliable data models and flows fit for SAP's high-churn nature.

Otherwise, you can build a solid data model from scratch; see John's article or, a bit more flippantly, Serge's article.

Tip: For business content, don’t reinvent the wheel. Leverage what the community has done for other ANSI SQL solutions — just remember to change the target to Snowflake, understanding how Snowflake is different.

Data Ops

The tool dbt can help a lot with building data flows (a.k.a. process chains), as can Coalesce.io. I also have customers using WhereScape, or just plain SQL. Consider using Common Table Expressions (CTEs) rather than repeated subqueries to improve your code's readability and performance.
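As a small illustration, here is the same (hypothetical) calculation written once with a repeated subquery and once with a CTE; the CTE version states the intermediate result a single time and is easier to maintain:

```sql
-- Repeated-subquery version (hypothetical tables): the aggregation is written twice.
SELECT o.VKORG, o.order_count
FROM (SELECT VKORG, COUNT(*) AS order_count FROM SAP_CURATED.VBAK GROUP BY VKORG) o
WHERE o.order_count > (SELECT AVG(order_count)
                       FROM (SELECT VKORG, COUNT(*) AS order_count
                             FROM SAP_CURATED.VBAK GROUP BY VKORG));

-- CTE version: the intermediate result is defined once and reused.
WITH orders_per_org AS (
    SELECT VKORG, COUNT(*) AS order_count
    FROM SAP_CURATED.VBAK
    GROUP BY VKORG
)
SELECT VKORG, order_count
FROM orders_per_org
WHERE order_count > (SELECT AVG(order_count) FROM orders_per_org);
```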

Summary

Getting data out of an SAP system can be simplified by using a proven data extraction tool. Which one to choose depends on the tooling you already have in place, your volume and desired latency, and your team's and/or partners' ability to translate this data once it has landed in Snowflake. Also licensing, right?

Tip: Watch out for data fairies, but don't let them overwhelm you either. As my fairy grandmother used to say, there's always hash checking. Or let the tool do it for you.

Good luck and until next time!

David Richert
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

Before joining Snowflake I worked for SAP for 18 years in technical sales for their analytics portfolio. Snowflake fills the big data gap.