3 Best Practices for On-Prem to Snowflake

Modernizing, Migrating, and Getting Data to the Cloud and Snowflake

--

by Kelly Kohlleffel

Our consulting and systems integration (SI) firm, Hashmap, has delivered significant business outcomes for organizations across many industries with Snowflake’s multi-cloud (AWS, Azure, and GCP) SaaS data warehousing solution. We’ve helped companies modernize and migrate away from costly, rigid, capital-intensive data warehousing appliances such as Teradata, Netezza, and Exadata.

Similarly, companies that attempted to use Big Data platforms (Cloudera, Hortonworks, MapR) for data warehousing and BI interactivity have been aggressively moving those workloads to Snowflake to take advantage of consumption-based pricing, native SQL (no esoteric tools and skills required), zero infrastructure and management overhead, and unlimited (and instant) compute scalability while still paying by the second.

What both technology approaches (cost-intensive DW appliances and traditional Big Data platforms) have in common is that the vast majority of those implementations were deployed on-premise.

In addition, many of the applications and underlying databases that drive companies’ operational and financial systems of record (and that feed the data warehouses and analytics platforms) are also on-premise (SAP, Oracle EBS, Oracle RDBMS, SQL Server, DB2, et al).

As a result, we are constantly asked questions such as:

  • What do you recommend for getting data to the cloud and into Snowflake?
  • We are really impressed with Snowflake, but can you help us with data movement, data integration, data engineering, or data pipelining from our source systems in our data center?
  • How should I approach a bulk migration to Snowflake that is 20TB+?
  • What is the best way to acquire incremental loads and ensure a reliable change data capture process?

With that as the backdrop, I want to share three best practices discussed in a recent webinar “How to Consolidate and Migrate a Data Warehouse and Data Lake to the Cloud” with Chesapeake Energy and Snowflake.

#1 ELT over ETL

For years we’ve been used to an Extract, Transform, Load (ETL) paradigm in our on-prem environments. In general, our ability to transform was limited by available compute, OR that compute was really expensive, OR any impact to that compute had the potential to adversely affect key users and stakeholders who depended on a certain level of performance for their analytics, reports, and applications.

This ETL experience, being limited by compute, would be like being limited today in your morning commute. For me, I’ll get in my Chevy truck every day unless I can convince my wife or a friend to let me borrow their vehicle. I’m usually not successful, and even when I am, it can take a lot of time-consuming convincing and justification; I’m pretty well locked in with my truck.

However, it’s really easy to adjust to unexpected personal transportation demand when I have some great “pay-by-the-drink” options like Uber, Lyft, a rental car, a train, or a bus. They’re low-cost and instantly available; I use them to get where I’m going and then get out, get off, or turn them in; I don’t have to maintain anything; and there’s no long-term special skill development required.

Similar to having unlimited commute options, utilizing a modern ELT approach when you go from on-prem to the cloud gives you unlimited compute. With ELT you can push the processing (T — transformations) to Snowflake and leverage the low cost, instant-on capability, unlimited scalability, and elasticity for transformations and processing.
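To make the ELT pattern concrete, here’s a minimal sketch (using the snowflake-connector-python package) of landing raw data first and then running the transformation inside Snowflake on Snowflake compute. The connection parameters, stage, and table names are hypothetical placeholders; the tools mentioned below generate and push down equivalent SQL for you.

```python
import snowflake.connector

# Hypothetical connection details; adjust for your own account and roles.
conn = snowflake.connector.connect(
    account="xy12345",
    user="ELT_SVC_USER",
    password="********",
    warehouse="TRANSFORM_WH",
    database="ANALYTICS",
    schema="STAGING",
)

try:
    cur = conn.cursor()

    # E + L: land the raw files as-is in a staging table.
    # (@RAW_ORDERS_STAGE and the table names are placeholders.)
    cur.execute("""
        COPY INTO RAW_ORDERS
        FROM @RAW_ORDERS_STAGE/orders/
        FILE_FORMAT = (TYPE = 'CSV' SKIP_HEADER = 1)
    """)

    # T: the transformation runs inside Snowflake, on Snowflake compute;
    # nothing is pulled back out to an external ETL engine.
    cur.execute("""
        CREATE OR REPLACE TABLE CURATED.DAILY_ORDER_SUMMARY AS
        SELECT order_date,
               region,
               COUNT(*)    AS order_count,
               SUM(amount) AS total_amount
        FROM RAW_ORDERS
        GROUP BY order_date, region
    """)
finally:
    conn.close()
```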

Could you lift-and-shift your ETL environment? Yes. Is it the best approach in the vast majority of cases? No.

If you go the ELT route, expect to avoid having to solve a lot of data movement-based performance issues (you won’t be pulling data out of Snowflake to do transformations in another engine). You gain both cost efficiency (real dollars) and compute efficiency (not moving data back and forth from system to system).

If you are looking to use an ELT-based approach, some solutions to consider that are closely integrated with Snowflake and use Snowflake compute for transformations include Fivetran, Matillion, DBT, and Talend.

#2 Get to Know Your Data Warehouse (Snowflake, of course)

Snowflake has some unique characteristics that set it apart from anything else in the market today (and definitely from on-prem solutions). For us, three aspects really stand out:

  1. It just works! This will be self-evident as you take it for a test drive — $400 free credits for 30 days are linked here — enough said.
  2. Snowflake was developed in the cloud, for the cloud. It is a true cloud-native architecture and importantly a SaaS-based solution (uniquely though, you do get to pick your cloud of choice for your Snowflake service — AWS, Azure, or GCP).
  3. Snowflake was developed specifically for data warehousing and analytics workloads. It’s not a transactional database, it’s not an operational data store, it’s not a historian, it’s not a time-series database, and it’s not a graph database. Keep in mind that Snowflake does make it really simple to store, manage, and process both structured AND semi-structured data, but in the context of a cloud data warehouse and analytics service (a brief example follows this list).
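To illustrate the semi-structured point in item 3, here is a small sketch, again with the Python connector, that loads a JSON payload into a VARIANT column and queries it with SQL path notation. The table, field names, and connection details are made up for the example.

```python
import snowflake.connector

# Hypothetical connection details, as in the earlier ELT sketch.
conn = snowflake.connector.connect(account="xy12345", user="DEMO_USER",
                                    password="********", warehouse="DEMO_WH",
                                    database="ANALYTICS", schema="DEMO")
cur = conn.cursor()

# A VARIANT column stores semi-structured data (JSON here) natively.
cur.execute("CREATE OR REPLACE TABLE DEVICE_EVENTS (raw VARIANT)")
cur.execute("""
    INSERT INTO DEVICE_EVENTS
    SELECT PARSE_JSON('{"device": {"id": "d-100", "site": "well-7"}, "reading": 42.7}')
""")

# Plain SQL plus path notation; no separate document store required.
cur.execute("""
    SELECT raw:device:id::string AS device_id,
           raw:reading::float    AS reading
    FROM DEVICE_EVENTS
""")
print(cur.fetchall())
conn.close()
```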

Dive into Snowflake’s cloud-native architecture and apply it to the way you think about running your data warehouse and analytics workloads in the cloud with Snowflake, rather than simply replicating how you have traditionally done it on-premise.

Caching is King

Snowflake has 3 levels of cache: result cache, local disk cache, and remote disk cache. The Result Cache holds the results of every query executed in the past 24 hours. These are available across Snowflake’s independent virtual warehouses. Query results returned to one user are available to anyone else who executes the same query (provided the underlying data has not changed). From both a performance and cost perspective, this is really powerful.
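One quick way to see the result cache at work is to run the same query twice and compare elapsed times. The sketch below assumes the Python connector and a hypothetical LARGE_FACT_TABLE; the USE_CACHED_RESULT session parameter (on by default) is what controls result reuse.

```python
import time
import snowflake.connector

# Hypothetical connection details.
conn = snowflake.connector.connect(account="xy12345", user="DEMO_USER",
                                    password="********", warehouse="DEMO_WH",
                                    database="ANALYTICS", schema="DEMO")
cur = conn.cursor()

# Result reuse is on by default; it can be toggled per session.
cur.execute("ALTER SESSION SET USE_CACHED_RESULT = TRUE")

# LARGE_FACT_TABLE is a placeholder for any sizable table in your account.
query = "SELECT region, COUNT(*) FROM LARGE_FACT_TABLE GROUP BY region"

for attempt in (1, 2):
    start = time.time()
    cur.execute(query)
    cur.fetchall()
    print(f"run {attempt}: {time.time() - start:.2f}s")

# The second run (same query text, unchanged data) is typically served from
# the result cache and returns almost instantly, without warehouse compute.
conn.close()
```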

On the Fly Capacity and Elasticity (Up and Down)

In the advanced analytics space, we talk about getting answers to questions we didn’t know to ask.

Below are some questions that we get asked, that you can definitely ask of your Snowflake service, and that are difficult or impossible to answer with traditional platforms:

  • Do I want to instantly resize a Snowflake warehouse to improve query performance on large, complex queries, or maybe auto-enable a multi-cluster warehouse to handle more concurrent users? You can do either or both (see the sketch after this list).
  • Should I auto-suspend to take advantage of Snowflake’s per-second billing and not burn through credits when a warehouse is not in use? Which apps would fit this profile so that I avoid a lot of suspend/resume cycles? Easy to select in Snowflake.
  • How can I avoid warehouse compute contention and max-capacity issues for my users and workloads? Hint: Snowflake virtual warehouses are completely independent but operate against the same shared storage.
  • Will I still have to create an elaborate caching structure for my BI tools in order to avoid burdening Snowflake? Nope.
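As a sketch of how lightweight these choices are, the statements below resize a warehouse on the fly, set auto-suspend/auto-resume, and enable multi-cluster scale-out (an Enterprise Edition feature). The warehouse name and connection details are placeholders; the same commands can just as easily be run from a worksheet.

```python
import snowflake.connector

# Hypothetical connection details; warehouse changes need an appropriate role.
conn = snowflake.connector.connect(account="xy12345", user="ADMIN_USER",
                                    password="********", role="SYSADMIN")
cur = conn.cursor()

# Instantly resize for a window of large, complex queries, then size back down.
cur.execute("ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'LARGE'")
# ... run the heavy workload ...
cur.execute("ALTER WAREHOUSE BI_WH SET WAREHOUSE_SIZE = 'SMALL'")

# Per-second billing: suspend after 60 idle seconds, resume on demand.
cur.execute("ALTER WAREHOUSE BI_WH SET AUTO_SUSPEND = 60 AUTO_RESUME = TRUE")

# Multi-cluster warehouse: scale out automatically for more concurrent users.
cur.execute("""
    ALTER WAREHOUSE BI_WH SET
        MIN_CLUSTER_COUNT = 1
        MAX_CLUSTER_COUNT = 3
        SCALING_POLICY = 'STANDARD'
""")

conn.close()
```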

Yes, You Can Monitor Your Snowflake Usage and Costs

Once you understand how to think about ELT and data integration, and also get to know Snowflake, the next questions tend to revolve around cost monitoring, like the ones below.

  • This is my first time moving data and analytics workloads to the cloud and Snowflake. How will I monitor costs to ensure that I don’t get an unwelcome surprise?
  • To what level of detail can I monitor — users, services, type of usage, etc?
  • How many credits am I using on a daily, weekly, monthly basis and what are those costing me in real dollars?
  • What is my burn rate, how is it trending, and what are the prime contributors?
  • Who is consuming the most compute time in the service?
  • What’s my usage over the last 30 days?

There are many options depending on your requirements, ranging from daily or weekly summaries in a BI dashboard that could be Slacked out to a broad team, all the way to near real-time views with specific KPIs and cost metrics. It really depends on the level of detail and visibility you need; the data is available in Snowflake’s Account Usage and is ready to be accessed and reported on. Randy Pitcher with Hashmap provides a quick start to Improving Cloud Cost Monitoring with Snowflake and Tableau, and Scott Smith at Tableau provides another view in Monitor and understand your Snowflake account usage in Tableau.
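As a starting point for the questions above, here is a sketch that pulls daily credit consumption per warehouse for the last 30 days from the Account Usage share. Querying SNOWFLAKE.ACCOUNT_USAGE requires a suitably privileged role, and credit prices vary by contract, so the dollar figure below is illustrative only.

```python
import snowflake.connector

# Hypothetical connection details; ACCOUNT_USAGE views need a privileged role.
conn = snowflake.connector.connect(account="xy12345", user="MONITOR_USER",
                                    password="********", role="ACCOUNTADMIN",
                                    warehouse="MONITOR_WH")
cur = conn.cursor()

CREDIT_PRICE_USD = 3.00  # illustrative only; use your contracted rate

# Daily credits per warehouse over the trailing 30 days.
cur.execute("""
    SELECT TO_DATE(start_time) AS usage_date,
           warehouse_name,
           SUM(credits_used)   AS credits
    FROM SNOWFLAKE.ACCOUNT_USAGE.WAREHOUSE_METERING_HISTORY
    WHERE start_time >= DATEADD('day', -30, CURRENT_TIMESTAMP())
    GROUP BY 1, 2
    ORDER BY 1, 2
""")

for usage_date, warehouse_name, credits in cur.fetchall():
    credits = float(credits)
    print(f"{usage_date} {warehouse_name}: {credits:.2f} credits "
          f"(~${credits * CREDIT_PRICE_USD:.2f})")

conn.close()
```

The same Account Usage data is what feeds the Tableau dashboards referenced above, so a script like this is mainly useful for quick checks or alerting.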

Caching, Elasticity, and Cost Monitoring are just a few examples — get to know Snowflake.

#3 Avoid Being Overwhelmed By Option Fatigue

You don’t want to be like this little wooden figure that is getting destroyed by his Jenga block options, but it does seem like this is how many feel today when it comes to the options associated with data integration for moving from on-premise to cloud.

Eyes kind of roll back in the head, a cold sweat breaks out, and it’s intimidating to feel like you have to be absolutely perfect in your data integration tool selection.

I have to find ONE tool that does it all.

The reality is that whether you are a startup, medium-sized organization, or large global enterprise, with the proliferation of data you are going to need to address a wide range of use case dimensions:

  • Data sources
  • Data targets
  • Data types
  • Data locations
  • Data volumes
  • Transformation requirements
  • Automation needs
  • Monitoring options
  • Incremental change needs
  • Varying bulk load sizes
  • Network considerations
  • Security parameters
  • UI and ease of use
  • OEM network
  • Organizational issues
  • Risk tolerance
  • Range of skill sets
  • Other priorities

All of these and more will contribute to your dilemma of what to use, when to use it, why to use it, how to use it, and when to take a different approach or pick up another tool.

And, oh, by the way, is there a one-size-fits-all tool that will take care of all of your use cases, patterns, and scenarios? Even if you’re 100% comfortable that the technical factors are covered, additional business factors should be considered, such as:

  • Company viability
  • Ease of doing business
  • Number of deployments
  • Size of deployments
  • Pricing model
  • Contracting timeline and process

Those are just examples and, granted, mileage may vary for your organization; what’s most important or critical will be unique to your situation. But how in the world do you find a single tool that meets your individual technical and business criteria?

There is no ONE tool that does it all.

I recommend you avoid focusing on the “perfect” integration solution (the be-all, end-all tool) and focus more on the key patterns you have. More importantly, focus on the business outcomes you are looking to drive with your data and analytics program as a whole, because frankly, you aren’t going to find the perfect solution. You will waste a lot of time looking for the ONE solution when you’d be better off having 3–4 options in your data integration “toolbox” that you can pull out based on the requirement, timing, and available skill sets.

Even if a vendor told you they have the perfect solution that fits all use cases because it was a perfect fit for similar customers, each organization we work with is very different in how it approaches technology stacks, overall company policies toward new tools, security restrictions, business processes, and of course, global reach and localization.

Our firm, Hashmap, is not a reseller; we solely provide consulting services. Since we don’t resell ANYTHING, we get asked to do just about EVERYTHING. As a result, we have worked with a wide range of data integration tools (since we are not locked into a single vendor).

In order to help customers avoid “option fatigue” and get to a decision point, we provide a collaborative data integration workshop. In this workshop we help demystify the range of data integration solutions by grouping them into 10 categories, with solution and tool examples for each (not meant to be an exhaustive list).

In the on-premise world, especially with data warehousing workloads, it may seem easy to gravitate towards a single IT-centric solution, but a modern approach really demands thinking beyond a single tool. You should look to anticipate and stay ahead of the data and analytics demands of your stakeholders.

I find that long, drawn-out IT data and analytics projects aren’t welcome; everyone wants to move quickly, prove it out (or not), and move on to the next high-value area.

So, to avoid the data integration tool “option fatigue” Jenga-block scenario described above:

  1. Don’t get locked into a “unicorn” tool mindset.
  2. Stay outcome-focused and keep it simple until you can’t.
  3. Determine what the correct balance of technical and business fit is for your organization.
  4. Be realistic about your delivery capabilities and get help if you need it.
  5. Think in terms of a 4S model: Simple, Speedy, Sustainable, and Self Serve.

Go Modern

Don’t think about the cloud and Snowflake as a lift-and-shift of your current ETL environment, but instead, use the opportunity to modernize your approach. Starting down the path will require some key decisions, but don’t let that stop you from getting going quickly and making an impact. With a sound strategy, team, approach, and a few tool options, you can make the most out of going from on-prem to the cloud with Snowflake. You can also deliver high-value outcomes and avoid the pitfalls that others have dealt with along the way.

Need Snowflake Cloud Data Warehousing and Migration Assistance?

If you’d like additional assistance in this area, Hashmap offers a range of enablement workshops and consulting service packages, and we would be glad to work through your specifics.

To listen in on a casual conversation about all things data engineering and the cloud, check out Hashmap’s podcast, Hashmap on Tap, on Spotify, Apple, Google, and other popular streaming apps.

Other Tools and Content You Might Like

Feel free to share on other channels, and be sure to keep up with all new content from Hashmap here.

Kelly Kohlleffel is responsible for sales, marketing, and alliances at Hashmap, a vendor-neutral solution provider assisting customers with cloud data platform strategy and assessment, cloud migration and modernization consulting, and cloud and data automation and optimization services. He is also a co-host of the Hashmap on Tap podcast.

He enjoys helping customers focus on data and cloud solutions that are simple, speedy, sustainable, and self serve, and that just work, like Snowflake. Be sure to connect with Kelly on LinkedIn and follow him on Twitter.
