Moving from On-premises to Cloud — The Data Migration Guide

Strategies and considerations to ensure a smooth transition to cloud data analytics.

Photo by SpaceX on Unsplash

Accelerated by the Covid crisis, the cloud adoption phenomenon has gone from “cloud-presence” to “cloud-first.” Cloud data warehousing offers many compelling advantages like cost optimization, minimizing vendor lock-in, and enabling effortless scaling. According to Gartner, even the heavily regulated and conservative sectors like Finance are taking the leap faster than expected.

But industry trends alone should not drive the decision of “if”, and the strategy of “how” to establish a cloud data presence. To do cloud correctly, “it will be essential for CIOs to develop a formal strategy that helps to put individual cloud decisions in the context of the enterprise’s strategic goals,” says Gregor Petri, Vice President Analyst at Gartner.

This article will cover the considerations, strategies, and common pitfalls associated with cloud adoption to help inform such a choice. Since a comparison of the leading cloud database platforms was already done in “Exploring the Modern BI Stack,” this article will not focus on their respective differences (I’ll just assume that you did your homework and chose Snowflake.)

One warehouse to rule them all

Considerations

By high school, most of us learn that doing something just because “everyone else is doing it” is not a wise move. Before an organization commits to a path of cloud adoption, it must have prudent reasons to do so. Without a clear target to steer the journey, it is easy to get derailed or design something ambiguous that doesn't satisfy existing stakeholders.

To put it another way, “if it ain’t broke, don’t fix it.”

Having said that, we will now review five of the top reasons that drive cloud migrations to see which, if any, resonate with your organization.

Simple financials

On-premises vendors have earned a reputation for opaque pricing and predatory legal bindings (SAP, Oracle, I’m looking at you.) Even when upfront software and hardware costs are known, it can be hard to estimate maintenance and project scaling costs over the medium and long term.

In contrast, cloud providers lay out storage and compute costs upfront, require “near-zero” maintenance, and only bill for actual usage. This makes it easy to plan both entry costs as well as growth over time. Our friends at Hashmap have developed an interactive Snowflake pricing tool that takes this into account.

After all, cloud computing vendors are not simply renting out space in data centers; they offer SaaS solutions for data management. This allows them to operate as service providers instead of product salespeople. The difference between the two lies in the incentives.

Someone selling a product has two incentives: to do whatever it takes to lock in a large lump-sum payment upfront (hardware/software), then figure out how to extract smaller recurring revenues (licenses/support) over time.

In contrast, the incentives of a service provider align with the customer’s wants. One, to offer a compelling service to convince first-time subscribers, and two, retain them through continuous improvements.

Latest features and updates

Snowflake is not merely a database hosted in the cloud; it’s a scalable warehousing solution engineered from the ground up for cloud architecture. This allows for huge flexibility in extending its functionality and developing new features.

For instance, compare Snowflake’s monthly release schedule with Oracle’s major database versions, which arrive only every few years with significant features and offer just quarterly updates for fixes and security enhancements in between.

Felipe Hoffa via Reddit

Snowflake, specifically, invests in quality-of-life features — like making it easy to share data and ensuring a smooth user experience. It reacts quickly to customer needs by accommodating the most in-demand features like native support for unstructured and semi-structured data and external function calls. Thanks to cloud architecture, Snowflake can incorporate functionality seamlessly and at no inconvenience to its customers.

Near-zero maintenance

Manual updates, patches, and time-consuming system copies are all things of the past with cloud. Snowflake goes a step further, eliminating indexes entirely and providing automatic query optimization.

This means your developers can focus on development and actually adding value to the company instead of doing periodic system maintenance.

This is convenient for the development team and eliminates the need for DBA and support roles. In most organizations, these are generally senior roles with very specialized technical profiles. Meaning, they don’t come cheap!

When comparing the costs of a cloud warehouse provider, the focus tends to be on storage and compute usage, while the employee hours saved by managed services are often overlooked. Thankfully, Snowflake offers a free calculator that takes this into account.

The cloud supremacy

Many organizations are compelled to move to the cloud when they realize that most of their data already resides there. Most companies are already using one or more cloud-based SaaS solutions. Examples include Salesforce for CRM data, Google Analytics for web metrics, Bloomberg market data, and employee data managed by Workday.

Reporting solutions are following suit, offering either cloud-only products, like Looker, or a mixed approach (for now), as in the case of Tableau and Power BI.

Whether your organization is downloading cloud data to on-premises, uploading local data to the cloud for reporting, or, heaven forbid, both! — an entirely cloud-based landscape would save a lot of time and bandwidth.

But perhaps the biggest and most convincing reason has to be scalable (and separate) compute and storage resources.

Having virtual on-demand warehouses in the cloud means having on-demand concurrency. Users no longer have to compete with system processes for memory or processing threads. In Snowflake, where storage is separate from compute and shared across all virtual warehouses, an organization can have a small multi-cluster warehouse for loading, an extra-large WH for memory-intensive transformations, and a medium warehouse for the analysts running only during business hours — and they can all point to the same centralized set of tables!

Each of these warehouses can then be scaled up or out as the data volume or the user-base expands. With cloud, an organization can balance its budget and data needs while avoiding unnecessary competition for critical resources.
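To make this concrete, here is a minimal sketch of what such a setup might look like in Snowflake SQL. The warehouse names and sizes are purely illustrative, and multi-cluster warehouses require the Enterprise edition.

-- A small multi-cluster warehouse for loading (illustrative settings)
CREATE WAREHOUSE load_wh
  WITH WAREHOUSE_SIZE = 'SMALL'
       MIN_CLUSTER_COUNT = 1
       MAX_CLUSTER_COUNT = 3
       AUTO_SUSPEND = 60
       AUTO_RESUME = TRUE;

-- An extra-large warehouse for memory-intensive transformations
CREATE WAREHOUSE transform_wh
  WITH WAREHOUSE_SIZE = 'XLARGE'
       AUTO_SUSPEND = 60
       AUTO_RESUME = TRUE;

-- A medium warehouse for analysts that suspends itself when idle
CREATE WAREHOUSE analyst_wh
  WITH WAREHOUSE_SIZE = 'MEDIUM'
       AUTO_SUSPEND = 300
       AUTO_RESUME = TRUE;

All three warehouses read from and write to the same underlying tables; only the compute is separate.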

But what about security?

Well, what about security?

It is common to assume that a server sitting under one’s roof is inherently more secure than one hosted virtually in the cloud. But contrary to popular belief, most data breaches occur on-site. That is because control does not equate to security — but rather, to increased responsibility.

While on-premises systems offer many security features, the onus falls on the customer to configure and maintain them correctly. This means keeping up with the latest security patches in a timely fashion and performing periodic audits.

Nope, Snowflake got me covered

In contrast, Snowflake handles most of these concerns as part of its service offering. To start, all data (not just tables, files too) is encrypted via AES-256 and re-keyed periodically. All pricing plans include SOC 1 and SOC 2 certification and can be extended to cover HIPAA, PCI DSS, and HITRUST CSF.

Then, user access is controlled through a range of security protocols like SSO, OAuth, MFA, key-pair authentication (with rotation), and user/group management via SCIM. There is access control at the network level, too, via IP whitelisting and Private Link.

That’s an impressively long list of things not to have to worry about.
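To make one of these concrete, restricting access at the network level comes down to a couple of statements. A rough sketch, with a made-up policy name and IP range:

-- Hypothetical network policy limiting access to a corporate IP range
CREATE NETWORK POLICY corp_only
  ALLOWED_IP_LIST = ('203.0.113.0/24');

-- Apply the policy account-wide
ALTER ACCOUNT SET NETWORK_POLICY = corp_only;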

The Approach

There are two fundamental ways to implement a cloud migration. You can either replicate the existing on-premises design (“lift-and-shift”) or start from scratch, leveraging the full potential of cloud architecture (“greenfield.”) Or you can combine the two into a hybrid mix of your choosing (“green-shift.”)

Neither one is necessarily “better” than the other since all come with important tradeoffs that need to be considered.

Let’s consider them.

(The ol’) lift-and-shift

Sure, the words “lift and shift” (LaS) evoke images of some back-alley swindle, but this should not discourage you from adopting this proven method. The LaS approach is the “safe” option when performing a migration, as it implies copying a proven on-premises solution and “shifting” it to the cloud.

What should discourage you from going the LaS route are two inherent drawbacks.

First, by migrating an existing design, you inherit all the technical debt built up over the life of the existing solution. This includes, but is not limited to:

  • obsolete code
  • incomprehensible code that no one understands but that somehow works anyway
  • clunky fixes that you swore you’d set right one day
  • known issues that haven’t yet gotten the benefit of a clunky fix

It’s like dragging your skeletons from one closet to another.

Second, going the “safe” route forgoes the innovative features afforded by cloud platforms — which, as we’ve seen, is a constantly growing list. On-premises models are often designed around physical limitations like maximum table size or hardware memory limits, which no longer apply in the cloud.

Then there is the extensive list of helper functions that do everything from JSON parsing to column pivots. Relying on this utility bank saves customers from “reinventing the wheel” when such functionality is called for, thereby reducing complexity.
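For example, parsing and flattening a JSON payload takes a few lines of built-in functionality rather than custom code. A quick sketch with invented table and column names:

-- Parse a raw JSON string and flatten a nested array of line items
SELECT
    v.value:sku::string  AS sku,
    v.value:qty::number  AS quantity
FROM raw_orders,
     LATERAL FLATTEN(input => PARSE_JSON(payload):line_items) v;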

It’s a long list of cons against one excellent pro: the low risk of a time-tested, proven model.

Again, if it ain’t broke…

Greenfield

Greenfield: the very word evokes a budding meadow in springtime. It’s a clean-slate approach that offers the opportunity to rethink an existing design with an eye to the future, unconstrained by past decisions.

The greenfield approach is a tempting and exhilarating option for anyone who understands the constraints of the legacy design (and has sworn for years that it could be done better.)

A fresh start means you’re free to leverage all the cloud-native features that your new database has to offer. It also affords an opportunity to rethink current business needs and prepare the model for whatever is coming down the pike.

Of course, greenfield isn’t all sunshine and roses. There are two inherent risks to consider when going with this approach.

One, you will be starting with a completely new design. With any new implementation, there will be unforeseen challenges and surprises. It will need proof of concept, testing, and thorough validation.

Two, you will be building it on unfamiliar technology. Whether you decide to dive in and train your existing team or seek experienced help from a consultant, initially, you will be unable to make truly informed design decisions.

“Known unknowns” will cast doubt on your design choices, and “unknown unknowns” will invariably guarantee missed opportunities or rework.

Green-shift

Torn between lift-and-shift and greenfield? Why not try green-shift, a hybrid approach that lets you advance parallel workstreams in independent environments while sharing information between them?

There are countless ways to partition a hybrid solution. For example, you could LaS your main model and allow a smaller team to PoC a new design. Alternatively, you could get acquainted with the new cloud platform through auxiliary models and master data while lifting-and-shifting the central model as-is.

Whether you LaS everything and replace bits and pieces at your own pace or decide to lift only certain parts of your model, the hybrid approach is a versatile way to go about it.

Yes, to some degree, the hybrid approach inherits all the advantages and disadvantages of both options. But, at the same time, it allows you to decide where the risks are acceptable and where to go with the safer option.

No modeling without modeling

No matter which approach you choose, modeling the target architecture is critical before jumping to the nitty-gritty of development. Having an enterprise data diagram is an important first step when starting your platform migration journey because it delineates the project's end goal.

Even in a LaS approach, where the objects and relationships are already familiar to the team, a target-facing landscape diagram helps separate the phases of the delivery cycle and keep the team synchronized.

In previous articles, I have outlined the many ways that SqlDBM can support a data-driven organization, but nowhere is that more obvious than in a platform migration.

Greenfield modeling

Since we are starting with a blank slate with greenfield, we need to model and validate our database landscape before creating the actual tables and transformations.

SqlDBM facilitates blank-slate design through its logical modeling capacity. Logical modeling allows users to start with high-level dimensional entities and “whiteboard” their possible configurations without supplying exact technical details like names or datatypes.

logical modeling

But unlike a whiteboard, this design can later be expanded with the necessary details as they are defined. With the entire team able to access the diagram in real-time as it evolves, the development effort stays synchronized, and no one is left behind if they missed a meeting or lost an email.

detailed relational modeling

But SqlDBM doesn’t stop there. Once a model design has been finalized, and all the details have been defined, the diagrams can be forward-engineered into neat, deployment-ready, database-specific DDL.

forward engineering DDL for database deployment
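The generated output is plain, database-specific DDL that can be deployed as-is. A hypothetical Snowflake table forward-engineered from a diagram might look something like this (names and columns are invented for illustration):

CREATE TABLE dim_customer (
    customer_key   NUMBER(38,0) NOT NULL,
    customer_name  VARCHAR(200),
    country_code   VARCHAR(2),
    created_at     TIMESTAMP_NTZ,
    CONSTRAINT pk_dim_customer PRIMARY KEY (customer_key)
);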

Knowing that all this can be done from one centralized tool ensures that the entire team collaborates and keeps all the project artifacts in one place instead of working independently in local environments and communicating through static (stale) documents.

Lift-and-shift modeling

Even though the architecture diagram won’t change much in a LaS migration, an updated diagram representing the current state of development is still extremely helpful. I have already written about how externalizing visualization can supercharge cognition, but when it comes to database migration, SqlDBM has another superpower in its arsenal: database conversion.

Perhaps one of SqlDBM’s most overlooked features (after all, how often does one switch database platforms?) is the ability to convert projects between supported database types. This feature takes all the database objects from an existing project and converts the underlying DDL to the target system’s syntax.

It’s hard to overstate how valuable such a feature can be in terms of time (and sanity) saved during a database migration. What could otherwise be a weeks-long manual task becomes a two-step process with SqlDBM (a conversion sketch follows the steps below).

  1. Reverse engineer the existing database — point SqlDBM at your existing database or paste in the corresponding DDL. All the objects are automatically imported and organized in a master diagram.
  2. Convert the existing project — as simple as a “save as” feature. Pick the cloud provider you are migrating to (currently Snowflake, Redshift, and Azure are supported) and click “Ok.”
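To give a sense of the kind of translation involved (a hypothetical example, not SqlDBM’s literal output), here is how a SQL Server table definition and its Snowflake equivalent might differ:

-- Source (SQL Server)
CREATE TABLE dbo.sales (
    sale_id  INT IDENTITY(1,1) PRIMARY KEY,
    sold_at  DATETIME2,
    amount   DECIMAL(18,2),
    notes    NVARCHAR(MAX)
);

-- Converted target (Snowflake)
CREATE TABLE sales (
    sale_id  NUMBER(38,0) IDENTITY(1,1) PRIMARY KEY,
    sold_at  TIMESTAMP_NTZ,
    amount   NUMBER(18,2),
    notes    VARCHAR
);

Datatypes, identity columns, and vendor-specific constructs get mapped to their closest equivalents in the target dialect, which is exactly the tedious, error-prone work you want automated.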

Key considerations

Here are some helpful tips to ensure that the migration does not go off the rails.

Be agile

It has long been established that even on large-scale projects, agile outperforms waterfall-style management. This is doubly true of cloud migrations.

The strength of agile project management lies in planning short work sprints that deliver incremental improvements. The ability to select the most important workstreams and deliverables on a recurring basis makes agile the ideal choice for cloud migration, where change is guaranteed to arise.

Not only can you incorporate new features into your design as they appear, but this flexibility saves you from having to plan, prioritize, and, more often than not, over-promise a list of features that may be unnecessary by the time the project is finished.

Add cloud scalability into the equation and watch how truly agile, agile can be.

Embrace the MVP

While cost reduction and a bevy of cool new technical features might be exciting to the BI department, don’t forget the ultimate beneficiary of applied analytics: the data consumer.

End users tend to be a pragmatic bunch, with little appetite for reinventing familiar procedures or learning new technologies. But you can still rally them to the noble mission of cloud migration by outlining an MVP.

No, not “most valuable player.” Minimum viable product!

be the Most Valuable Player of your cloud migration by delivering Minimum Viable Products

An MVP is a product with enough features to attract early-adopter customers and validate a product idea early in the product development cycle. Note, this does not mean a scaled-down facsimile of legacy reporting but rather something that end users can get excited about.

Recall the pain points of the existing solution. If it’s speed, build responsive dashboards with adequate concurrency. If it’s a lack of data, leverage cloud data sources or near-real-time data streams. Is it a new feature that the business has been asking for, like predictive analytics or self-service?

These are the questions you need to ask before embarking on the migration, and the answers need to be part of your design.

Remember, you don’t have to replicate all the existing functionality right out of the gate. However, you do need to deliver something novel and exciting that users can’t get with the on-premises system.

Experiment

The Pareto Principle suggests that investing just 20% into design and PoC upfront will save 80% in future remodeling. As you’ll be working with new technology, which allows for scalable and independent development, don’t miss the opportunity to test new designs and try out product features to see how they perform under load and with real data volumes.

New functions, new features, even new methodologies, and tools can be PoC’d at low or no cost early in the development cycle. Given that you’ll be using agile sprints to plan upcoming deliverables, you can incorporate these ideas into the main design as soon as they have been proven.

Have you heard good things about a funky transformation tool called dbt? The time to try it out is sooner rather than later, so that, should you find it useful, it can be included in your development framework without future rework. Ditto ELT tools, DevOps, and reporting solutions. All the good ones offer free trials, and there is no sunk cost in moving away from one in favor of another at this stage in the process.
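If you are wondering what a dbt model even looks like, here is a minimal sketch. The model and source names are invented, and the source itself would be declared in a separate sources YAML file:

-- models/stg_orders.sql
{{ config(materialized='view') }}

SELECT
    order_id,
    customer_id,
    order_total
FROM {{ source('erp', 'orders') }}
WHERE order_total > 0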

Be dynamic

For projects where change is the only constant, synchronization becomes a critical factor. Keeping requirement and design documentation current, and keeping the entire team aware of the latest priorities, is crucial.

Traditionally, such information is relegated to static documents or buried in emails. Unfortunately, these become outdated almost as soon as they are written. Versions start to circulate, and updated objectives are announced in reduced forums. Didn’t you get the memo?

As described in the (extended) modern BI stack, this no longer has to be the case. To avoid teams working in parallel from differing specs and in isolation, leverage the sharing and collaboration features of the modern support stack.

The best way to maintain a coordinated effort is to use tools that stay current by nature. Coordinating migration, planning, modeling, tracking, and development through a single tool like SqlDBM can pay dividends in keeping the team synchronized.

On top of visual diagrams, SqlDBM also offers a data dictionary feature that bridges the gap between technical and functional knowledge. All the objects and their corresponding descriptions created through the tool are automatically added to the data dictionary. Conversely, all the functional descriptions added to the data dictionary are visible in the diagrams.

Remember, the best time to document a project is during its creation, while the functional knowledge is fresh. That can mean using SqlDBM’s data dictionary, including inline comments in the DDL, adding descriptive markdown to the code repository, or relying on the auto-generated documentation from dbt deployments.
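For inline comments specifically, Snowflake lets descriptions travel with the DDL itself. A small sketch with made-up names:

-- Object and column descriptions embedded directly in the DDL
CREATE TABLE dim_product (
    product_key   NUMBER(38,0) COMMENT 'Surrogate key generated at load time',
    product_name  VARCHAR(200) COMMENT 'Display name from the source ERP'
)
COMMENT = 'Conformed product dimension, one row per sellable product';

-- Descriptions can also be added or updated after the fact
COMMENT ON COLUMN dim_product.product_name IS 'Display name from the source ERP';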

By keeping documentation adjacent to development, you ensure that everyone on the team is aligned and receiving updates as soon as they are made available. The suggestions in this section are just some examples of how to achieve this goal, but there are infinite ways to go about it and no hard-and-fast rules.

Never close the data tap

As MVP deliverables start to roll out in the cloud platform and new data marts are unveiled, you end up with data sets that exist in two systems at once. This presents its own set of challenges.

For reasons as-yet-unknown to science, having two sets of data infects the users with an irresistible, pathological need to run comparisons. Even though the refresh schedules are different, and business logic may have changed as part of the redesign, users will make it their duty to bring these differences to the attention of the BI team.

Moreover, data that lives in two systems at once also brings the frustration of double maintenance whenever a data incident occurs or a change is called for.

Then there is the cost involved with running and maintaining two systems in parallel.

With this in mind, the urge to cut the cord on the legacy system (either selectively or completely) ASAP is perfectly understandable. But have all the reporting requirements been covered?

A large organization will typically have an equally large BI footprint. Besides the principal data marts and federated dashboards, there are also auxiliary and satellite models that support specialized business needs.

Over time, these auxiliary data sets can spawn entire business teams whose jobs depend on the insights gleaned from such data. What might have been a long-forgotten, one-off effort for the BI team might be the operational cornerstone of an entire department.

This is why it’s so important not to lose sight of the day-to-day needs of existing data consumers and to never, under any circumstances, cut them off from the data that their work depends on.

Side-effects of data withdrawal may include dizziness, nausea, and stomach pains. Please consult your BI team to see if platform migration is right for you.

Before removing access to a report, table, or system, a point-by-point review needs to take place to ensure that an alternative exists in the new cloud environment.

With a LaS solution, it’s easy to stay on top of when models and reports are ready for decommissioning from the legacy system. However, when introducing changes via greenfield or green-shift, the scenario isn’t always so clear-cut.

With these solutions, you cannot expect a one-to-one match with legacy data sources. The new data marts may extend in scope, fusing with satellite models, or conversely, data marts may be broken up into smaller, specialized models. Not every column or KPI needs to exist exactly as it did in the legacy system, but users must have access to an equivalent or be able to calculate it themselves from the source data.

Whatever the case, the BI team must provide a complete transition plan and timeline, specifying when legacy sources are to be decommissioned and where to find their counterparts in the new environment.

Build to scale

Scalable architecture is the bold-text promise on the proverbial tin of cloud data platforms. Virtual warehouses can be instantiated within seconds, pre-configured for on-demand scale-up or scale-out. The architecture will scale, no doubt about it. How about your model?

By keeping storage and compute separate, Snowflake essentially de-couples the data from the transformational models built on top of it. This gives the BI team the freedom to develop over a relatively small (but representative) data set, enjoy performance and cost savings, then instantly swap in the full one for integration testing or deployment.
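One common way to set this up in Snowflake is zero-copy cloning combined with sampling; the database and table names below are hypothetical:

-- Clone production metadata-only (zero-copy), then develop against a sample
CREATE DATABASE dev_db CLONE prod_db;

-- A representative ~1% sample keeps development iterations fast and cheap
CREATE TABLE dev_db.staging.orders_sample AS
SELECT * FROM prod_db.staging.orders SAMPLE (1);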

Depending on the migration strategy, a model might be slated for an MVP rollout containing the last few years of transactional data, with plans for gradual historic backfill. This is where your design will either shine or crumble.

Some design patterns prioritize accuracy over performance and appear perfectly adequate on a small data set. Such designs typically rely on excessive in-memory processing and slow down dramatically as the data grows.

When this happens, you’ll need to either re-assess your budget and increase the warehouse size or go back to the drawing board for the model itself. A good understanding of how your data warehouse works under the hood is essential for avoiding this kind of scenario, as is keeping a close eye on query plans and memory consumption throughout development and testing.
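One way to keep that eye on things in Snowflake is to watch for queries that spill to disk, a telltale sign of memory pressure. A sketch against the account usage views:

-- Recent queries that spilled out of memory: candidates for redesign or a larger warehouse
SELECT query_id,
       warehouse_name,
       total_elapsed_time,
       bytes_spilled_to_local_storage,
       bytes_spilled_to_remote_storage
FROM snowflake.account_usage.query_history
WHERE start_time > DATEADD(day, -7, CURRENT_TIMESTAMP())
  AND (bytes_spilled_to_local_storage > 0
       OR bytes_spilled_to_remote_storage > 0)
ORDER BY bytes_spilled_to_remote_storage DESC;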

Conclusion

In 2021, it’s becoming increasingly clear that warehousing as a service is here to stay and will soon become the industry standard. On-premises hosting cannot compete when it comes to cost, scalability, functionality, and ease of maintenance — in most cases. Each organization will need to make its own assessment as to whether the timing and circumstances are right for a cloud migration.

Although a well-executed move to cloud has the potential to invigorate the entire company through access to timely and accurate data and self-service autonomy (see article on the end-user experience), organizations should keep in mind that cloud is no panacea in and of itself.

However, with careful planning, the right tools, and the right team, cloud migration is nothing to be afraid of. The tips and strategies provided in this article will help you determine if and how to approach the cloud migration journey.

See you up there.
