Talking Data, the Cloud, Matillion, and Snowflake

Perspectives on Getting Work Done in the Cloud

--

by Prakash Bharathi and Kelly Kohlleffel

I recently had a chance to spend some time with Prakash Bharathi, Enterprise Architect at Hashmap, and get his perspective on a number of data and cloud topics including his recent experience using both Matillion and Snowflake and the value they bring when paired together.

Kohlleffel: Prakash, I really appreciate you taking the time today to talk about your experience enabling data-driven solutions using Matillion and Snowflake. Before we jump in, though, can you describe your background, your current role, and what your primary project focus is?

Bharathi: Sure, Kelly, my experience has been a mixed bag. I started at a small company very focused on data integration with CA Technologies applications. After a few years, I moved to CA as they were starting to offer SaaS solutions. I had a very good experience with that, and then I switched to services delivery. I worked on a lot of data integration, business insights, and analytics projects. Also, I’ve had some great opportunities to take an idea from inception all the way to MVP.

I’ve also been a product owner delivering Infrastructure-as-Code: I’ve set up infrastructure, automated infrastructure provisioning, and automated NoSQL databases like MongoDB as well as traditional RDBMS environments.

Kohlleffel: A quick off-topic question: what’s your favorite NoSQL database, Mongo, Cassandra, or HBase?

Bharathi: Out of those three, Mongo.

Kohlleffel: What is your current project focus?

Bharathi: In my current project, I am a solution architect for data modernization. The focus right now is two-fold: one is to help the client migrate from Netezza to Snowflake, and the second is that if a new project comes in, we take it directly onto the Snowflake journey.

Kohlleffel: You know, I was thinking back three years ago when one of our architects was asked to evaluate Snowflake, and I had not even heard of it. At the time he said, “You’ve got to take a look at this.” It couldn’t have been more than a week or two before our internal Slack channel for the client started blowing up, in a good way, about how it worked and helped drive value. Obviously, there are a lot of differences between Snowflake and some of the other data platforms we work with. I’m just curious, when did you first hear about Snowflake, and what were your thoughts the first time you got hands-on with it?

Bharathi: I got introduced to Snowflake a few years ago. A colleague of mine who was working at CA had moved to Snowflake, so I got to know Snowflake and watched its growth as well. The first thought that came to my mind was that this is not a brand new concept; the pieces have always been there, and every cloud provider has all of them. Yet while every cloud provider pretty much has everything, somehow the Snowflake team got together and came up with a service that’s more like an application.

So it was a beautiful concept, just like Dropbox. It still uses AWS features, but it’s an entity all by itself, right? Redshift could have done it a long time ago, but they did not. I was really impressed by how it was built completely from the ground up. At the time I was hearing about this, there was no database that was similar to Snowflake; I loved it. Unlimited scalability, the concept of Snowpipe, which I think was introduced in 2018, and then Time Travel, a name that by itself is so futuristic; that’s what caught my attention.

Kohlleffel: If we look at our clients who are deciding on Snowflake as a cloud data platform, and if they aren’t forced to use Azure or AWS native services, it seems like a pretty easy decision. I’ve got a true SaaS solution, unlimited elastic scalability, support for a wide range of data types, little management overhead, and a compelling price-to-value ratio.

The next question is how am I going to get all of my data sources loaded into Snowflake? Our clients have cloud sources, a lot of on-prem sources, and typically transformations are required on those data sources. There doesn’t seem to be a one-size-fits-all answer to that question; it really varies depending on the organization. So as an enterprise architect working in this space, what do you consider when you’re thinking about approaches to loading and transforming data within Snowflake?

Bharathi: That’s a good question, loading data by itself, right? That’s the first thing: the mindset you should have is ELT, which means extract, load, and then think about transformation. We are all used to the ETL way of doing things, where you open up different tools and then run your transformations through the tool. Even if you take Hadoop, you have a limited number of nodes, and you can only use those to zone the data; you always had a limit, and whatever flexibility you had always came with that limit.

It’s like a drug, right? It comes with side effects. It always had those. Every advantage came with two disadvantages, and that’s what we were all used to with ETL.

So with the concept of ELT, where you don’t have any limit on compute, you can just load all the data into Snowflake and use SQL statements to do all the transformations, which was completely unheard of not too long ago. Everyone was offloading the data, transforming it in memory, and then loading it back into the target database. Today with Snowflake, the one mindset we should all go with is that all transformations can be done in SQL.
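
As a minimal sketch of that ELT pattern (the table and column names here are hypothetical), the data lands raw first and the transformation is just SQL executed on Snowflake’s own compute:

```sql
-- Raw data is loaded as-is into a landing table, then shaped with plain SQL.
-- All object names below are hypothetical.
CREATE OR REPLACE TABLE analytics.orders_clean AS
SELECT
    order_id,
    customer_id,
    TO_DATE(order_ts)     AS order_date,
    ROUND(amount_usd, 2)  AS amount_usd,
    UPPER(TRIM(region))   AS region
FROM landing.orders_raw
WHERE amount_usd IS NOT NULL;
```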

Kohlleffel: What you’re describing sounds like a very modern approach to data pipelines, leveraging the power of the cloud, the elasticity of the cloud, and that instant on-and-off compute. In the early days, that first big step was, let me get my data in the cloud, and life will be good. Migrating data to the cloud is one thing, but I like how you’re describing this more modern ELT way of doing things, where I’m really taking advantage of the power of the cloud. Are there places for both approaches today?

Bharathi: We drive towards modernization if possible because that’s where you get the value-add. Modernization allows a paradigm shift from legacy databases to modern databases like Snowflake, Redshift, or BigQuery. It still requires management buy-in, but it can be done as one single modernization initiative, taking the entire company along with it.

You should also consider the modernization of the data pipeline. Solutions to modernize the database and the data pipeline came to market at the same time, for instance, Snowflake and Matillion. Both companies have been growing in parallel and complementing each other, which makes data modernization easier and the route shorter.

Kohlleffel: If you look at Snowflake’s partner page, they’ve got a ton of options when it comes to the data integration space: ELT, ETL, data movement, data collection, data ingestion, data loading, and data transformation; that ecosystem continues to expand. It seems like every day I run into another company that is jumping in.

You mentioned Matillion just now. For anyone not familiar with Matillion as a solution in the data integration and ELT space with Snowflake, what is Matillion, and what areas does it address for Snowflake customers specifically?

Bharathi: Matillion is a data modernization tool that complements the more modern databases like Snowflake, Redshift, and BigQuery. Matillion is a very useful tool with a simple drag-and-drop UI. You configure it, and it automatically creates SnowSQL for you, uses Snowflake compute to run the SQL, and does it all automatically through an API call. You get a nice visual of how the data transforms, see everything in real time, and then say, yes, this is what I want to do.

Previously with transformation tools, the modeling was done completely outside the tool, and then the model was brought into the tool itself. Most likely you’d just type in the SQL or drag and drop the SQL in, but with Matillion you can do the modeling within the tool. This is powerful because you can bring the data engineer and the data architect together to come up with a single entity, one orchestration that solves all the data pipeline requirements.

I really like Matillion; it does a lot of transformation operations, loading of data, offloading of data, and altering Snowflake dynamically. When you extract data and realize the number of rows is big, Matillion will automatically or dynamically change the Snowflake warehouse size and then start loading the data, making that operation faster, and cycle the warehouse back down when complete. These are very powerful things, and in addition, Matillion has standard if-then-else logic that’s as simple as a drag and drop. It’s been very easy to use even for someone new to the tool. I did a couple of show-and-tells with the broader team; they hadn’t done any formal Matillion training but were able to get up and running with it almost immediately, and that’s powerful. Matillion also has quite a bit of documentation online to help as you’re learning the solution.
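
For a sense of what that dynamic resizing amounts to on the Snowflake side, here is a hedged sketch of the kind of statements involved (the warehouse name is hypothetical, and the exact SQL Matillion generates may differ):

```sql
-- Scale the warehouse up before a heavy load, then back down when it completes.
-- Warehouse name is hypothetical; the SQL Matillion actually issues may differ.
ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'LARGE';

-- ... bulk load runs here ...

ALTER WAREHOUSE load_wh SET WAREHOUSE_SIZE = 'XSMALL';
```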

Kohlleffel: That sounds great. Matillion can address both on-premise data sources and cloud-based data sources. Are there some advantages you’ve seen for using Matillion in both of those areas so far?

Bharathi: Data extraction should be close to the source. There are many tools out there that can be installed on-premise to extract the data, compress it, send it over the wire to S3, and then have Snowflake pull that in.

Matillion extracts the data over the wire, compresses it, puts it into S3, and then imports it into Snowflake, which is okay, but it may not be the optimal way of doing it. Would Matillion be that one tool for on-premise? I don’t know; it depends on the situation. Having said all that, we did put close to 70 million rows into a small data warehouse in Snowflake using Matillion in less than 6 hours. That was amazing, and it really helped the customer make the decision to go with Matillion; they were expecting 12–13 hours. My thought was, hey, if it took 6 hours, you could have done this in 2–3 hours if we had compressed it.
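
On the Snowflake side, the extract-compress-stage-load pattern he describes looks roughly like this (the stage, bucket, table names, and credentials are hypothetical placeholders):

```sql
-- Compressed extracts are landed in S3, then bulk-loaded from an external stage.
-- All names and credentials below are hypothetical placeholders.
CREATE OR REPLACE STAGE landing_stage
  URL = 's3://example-bucket/netezza-extracts/'
  CREDENTIALS = (AWS_KEY_ID = '<key>' AWS_SECRET_KEY = '<secret>')
  FILE_FORMAT = (TYPE = CSV COMPRESSION = GZIP FIELD_OPTIONALLY_ENCLOSED_BY = '"');

COPY INTO landing.orders_raw
  FROM @landing_stage
  PATTERN = '.*orders.*[.]csv[.]gz'
  ON_ERROR = 'CONTINUE';
```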

Kohlleffel: Were you extracting from an on-prem SQL Server or Oracle RDBMS as a source?

Bharathi: Oracle yes, but also bringing data in from Netezza.

Kohlleffel: I think that everyone’s driving towards having a couple of tools in the bag. It’s tough to narrow it down, but the more that these tools expand to deal with a widening range of data sources, the better.

Do you personally have any top two or three considerations when you’re dealing with either on-prem or cloud-based data sources?

Bharathi: Sure, I would look at three things, data movement and replication being the first: where do you store the data, and how do you avoid unnecessary data movement and replication? How is the data synchronized between source and target? Imagine a landing zone, a refinement zone where transformations are done, and a presentation zone. Hopping from one database to another within Snowflake is lightning-fast. So that’s the first thing I would look at: how the data moves and how it stays synchronized between systems.

The second thing is what’s the best environment for data preparation or integration? Where’s the data processed? In this specific scenario, we are using Matillion as the tool, but where the processing happens has to be Snowflake, right? It’s processed closer to where the data lives, and it’s going to be faster.

We tried the same scenario with some other tools and it was significantly slower. Matillion proved to be light years ahead.

The third aspect to consider is data governance and security: how do you protect the data? How do you protect it across environments and databases? How is access control handled while ensuring that the metadata is handled correctly?
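
To make the zone layout from the first consideration concrete, here is a hedged sketch with hypothetical schema names; a hop between zones is just a SQL statement, so no data leaves Snowflake and nothing needs to be replicated:

```sql
-- Hypothetical zone layout; moving data between zones is a query, not a copy job.
CREATE SCHEMA IF NOT EXISTS landing;       -- raw, as-extracted data
CREATE SCHEMA IF NOT EXISTS refined;       -- cleansed and transformed data
CREATE SCHEMA IF NOT EXISTS presentation;  -- consumption-ready models

INSERT INTO refined.customers (customer_id, customer_name, region)
SELECT DISTINCT customer_id, TRIM(customer_name), region
FROM landing.customers_raw;
```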

Kohlleffel: I look back on my Oracle days of Exadata appliances and Hadoop Big Data Appliances and the amount of time that we spent trying to lock down sizing. How many servers am I going to need? What’s my memory configuration going to be? What’s my storage? How am I going to balance between memory and storage? What does my application profile look like so that I can balance that out properly? What about networking, interconnectivity, etc.? You had to be so precise to get this exact number for your budget request, because the last thing you wanted to do was to have to go back to your boss and say, “I blew through my capacity in the first three months and I need more money” or “Hey boss, you know those 100 users that were added last week to our new appliance? Sorry, they have to be cut off.”

So my question is this, when you’re implementing, say a Matillion and Snowflake combo solution, how precise do you have to be in the overall sizing of that solution?

Bharathi: Very good question. With Snowflake by itself, the sizing can be dynamic, right? It can be changed within the service or through Matillion, anytime, anywhere: a small database versus a big database, something that keeps 10 days’ worth of history versus something that keeps 30 days’ worth. Everything can be configured and changed dynamically depending on the governance.
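
Those history settings are just properties you can alter on the fly; a hedged example with hypothetical object names (retention beyond one day depends on your Snowflake edition):

```sql
-- Time Travel retention is a property you can change per database or table.
-- Names are hypothetical; longer retention requires Enterprise Edition or above.
ALTER DATABASE analytics SET DATA_RETENTION_TIME_IN_DAYS = 30;
ALTER TABLE analytics.refined.orders SET DATA_RETENTION_TIME_IN_DAYS = 10;
```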

When it comes to Matillion, it’s a little different, right? It’s not as flexible. The reason is that it has to be installed in an EC2 environment, and the parallelism that we get out of Matillion is tied to that EC2 environment: two times the number of CPUs that we have on the server.

So if you install it on a small server, the parallelism is just going to be two times that server’s configuration, and Matillion also defines how many concurrent users can log in based on licensing. So there are some constraints in Matillion, but the advantage is that we can spin up a new Matillion environment very fast and get the orchestration jobs imported pretty fast. From a Snowflake perspective, this might not be as good as Snowflake, but at the same time, compared with a normal orchestration environment or with other tools out there, Matillion is still ahead.

Kohlleffel: So it takes a little more design and thinking around how you want to architect the environment that’s going to be running Matillion, which makes sense. It sounds like it’s really quick to get up and running with Matillion.

What’s that experience like: a few minutes, a few hours, a few days? Say I’m a new Matillion customer who wants to jump in and get started; what’s that going to look like?

Bharathi: I recently led an evaluation of three different tools, including Matillion, and we gave each tool a week or two of runtime, max, to see what it could do. Matillion was third in line, and I was able to get it up and running in one day; that’s it, just a day, or even hours. Once that was done, I started importing some cloud data sources such as Salesforce and a couple of internal data sources. I was able to get all of this done, with full orchestration and fault tolerance, in less than three days. I was able to show the team the value they could get from Matillion in less than a week, and we were able to prove Matillion as the tool of choice. That’s how fast we got to a decision.

Kohlleffel: I know Matillion’s got 70 or 80 prebuilt data source connectors. Did you have a chance to test drive Matillion’s Salesforce connector, and if so, how was that experience?

Bharathi: That was the way we went. I used Matillion’s Salesforce and Salesforce incremental connectors, and yes, it was pretty simple and straightforward. You just set up connectivity, set the profile, and start integrating. Once you say, this is the object that I want to bring in, you set your filters, and it starts populating the data. It was pretty fast and wasn’t that tough.

The next data source we tried was Bing Ads and the configuration again was pretty straightforward.

Kohlleffel: Why Bing versus Google ads? I’m just curious.

Bharathi: Good question. They do use both. That’s the first question I asked them as well, “Why Bing, how many people even go to Bing?”

Matillion does come with a number of out-of-the-box connectors, but they do one more thing: they have a community of jobs as well, so I can share a job across the community. Say someone else built a job using a connector, or a combination of two connectors, that actually worked for them to get something done.

Let’s take the example of Google and Bing: someone could show how to pull both datasets, compare them, and share the results. If it’s published and made public, then I could download that job and start using it directly. I’d still have to do the configuration, but once that’s done, I can extract the same type of data based on that community job. So yes, there is a community in Matillion as well that you can use to accelerate your deployment.

Kohlleffel: You talked about Salesforce and Bing and some of the other data sources. Can you share a few of the top use cases you’re working on that leverage these data sources? What have you seen recently, and what are some of the desired outcomes you were looking to achieve?

Bharathi: Sure. I can speak to it at a very high level. We are bringing in data from IoT devices, and when you are doing that you need to consider the value of data over time. If the data is days old, it’s historical, right? If the data is hours old, then it’s probably reactive. If the data is minutes old, it’s less predictive than seconds-old data, which is where it becomes truly predictive. We want to be more predictive and reactive than historical with the IoT data. They don’t want to see what happened two months ago; they want to see what happened last week.

Take this Covid-19 problem; we are right now importing data from Johns Hopkins as well as collecting call center data as a combined dataset. Then we are sending the lat-long and showing how a particular area is being affected by Covid-19. How are businesses being affected, and how does that affect how we should respond in a particular region? The ability to bring all the data together to make informed decisions is a really beautiful use case. Our capabilities with the data are moving from historical to reactive, then actionable, and finally predictive, which is where we want to get to.

Kohlleffel: Are you flowing all that IoT data into Snowflake right now?

Bharathi: Yes, but right now everyone is more used to doing things the traditional way, so we have a parallel track going for the IoT data into Snowflake.

Kohlleffel: What file format are you using for persisting the IoT data into Snowflake?

Bharathi: It’s mostly semi-structured JSON right now.
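
For readers newer to Snowflake, that JSON lands in a VARIANT column and can be queried directly with path notation; a minimal hedged sketch with hypothetical names:

```sql
-- Hypothetical IoT landing table: raw JSON goes into a VARIANT column.
CREATE OR REPLACE TABLE landing.iot_events (
    received_at TIMESTAMP_NTZ DEFAULT CURRENT_TIMESTAMP(),
    payload     VARIANT
);

-- Nested attributes are queried with path notation and cast as needed.
SELECT
    payload:device_id::STRING           AS device_id,
    payload:reading.temperature::FLOAT  AS temperature,
    received_at
FROM landing.iot_events
WHERE payload:reading.temperature::FLOAT > 90;
```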

Kohlleffel: You talked about this community approach that Matillion has; have you developed any of your own personal shortcuts or quick tips that you’d want to share for somebody that might be just starting out with Matillion?

Bharathi: Of course. I created an orchestration job that’s self-healing and self-configured; you just send a manifest to it. The manifest, instead of being a file in this specific scenario, is a combination of two tables. What should be imported? You just list it in that table, and Matillion imports the data directly into the database without anyone touching anything.

So it’ll automatically go to the source; it knows where the sources are based on the configuration, and everything is in Python scripts. It automatically creates the zones and the database and gets the permissions as well. Everything’s automated right now, so if you think about the customer’s journey, all they have to figure out is what database and tables to bring in, what the schema names are, and what the prefixes are; that’s all they need to know. They don’t even have to worry about the columns; the metadata is all imported automatically. They find out what the tables are and what the sources are, and if they have a service account, they type it into the table, and the job automatically pulls the data in and does a merge operation, meaning an insert and update.
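
That merge operation, inserting new rows and updating existing ones, is a single statement in Snowflake; here is a hedged sketch with hypothetical table and column names:

```sql
-- Upsert pattern: update rows that already exist, insert the ones that don't.
-- Table and column names are hypothetical.
MERGE INTO refined.customers AS tgt
USING landing.customers_raw AS src
  ON tgt.customer_id = src.customer_id
WHEN MATCHED THEN UPDATE SET
  tgt.customer_name = src.customer_name,
  tgt.region        = src.region
WHEN NOT MATCHED THEN INSERT (customer_id, customer_name, region)
  VALUES (src.customer_id, src.customer_name, src.region);
```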

So we have automation of data ingestion and provisioning. The next thing is, how do we make it presentable? I created a couple of API calls where they can call the API and it’ll automatically retrieve the data and send it back to them. It’s more of a serverless integration using AWS Lambda; that’s where it’s going right now.

Kohlleffel: That sounds like an awesome tip that a lot of people could potentially use to improve their overall approach. If you look at Matillion specifically, for what comes out of the box today, do you have two or three favorite features that you find yourself going back to again and again?

Bharathi: Yeah, sure. Matillion has an S3 load generator, which makes it easy for any data engineer to generate the entire data orchestration very quickly. I also like the Salesforce integration; that’s a pretty powerful tool as well.

Matillion also has integration with AWS SNS (Simple Notification Service), and that’s what I’m using right now. If a job fails, it automatically sends a message that notifies the job scheduler, such as Control-M, to stop the pipeline so the issue can be fixed.

Last but not least is the ability to add Python or Jython code; most of what I mentioned before was done in Python and Jython. So yeah, I like that a lot as well.

Kohlleffel: You’ve got a number of tools in the bag with Matillion, Snowflake, and AWS, but tools aside, in your experience, what do you see holding organizations back even with these types of great tools? Any common threads?

Bharathi: When it comes to cloud migrations there are usually two approaches. One is lift and shift; the other is convert and migrate. I was assisting a banking client, and all three cloud providers came in to talk. Google was the only one who was pretty stern about converting and migrating instead of a lift and shift. You don’t gain a lot by doing lift and shift, but a convert and migrate brings much more benefit. So in my opinion, convert the current logic for use in the cloud and, yes, stay loosely coupled as you migrate.

While I do prefer that approach, it all depends on priorities and costs. What does that roadmap look like if I had to convert everything into a cloud-based approach? Yes, I can maintain it, but how long would it take for me to test and validate it? Also, what is the latency associated with the new approach? There are so many other things to look at that you can start to feel a lift and shift might be better, but that is the wrong choice in my opinion because it’s not future-proof; it may solve today’s problem, but it’s not going to solve the problem five years from now.

Kohlleffel: Every company is different, whether it be limitations, restrictions, or unique guardrails, but if I’m trying to get into more of a proactive, data-driven position, am I looking at a few months or a multi-year plan? Are there any guidelines that can help set my expectations?

Bharathi: What’s needed is senior management involvement and buy-in, along with securing internal leadership for analytics projects. I can bring in a set of tools, but if there is no forum to showcase the value, it will be tough. In general, I feel that 12 to 18 months is a reasonable timeframe to deliver some high-value use cases and move into a more compelling data-driven posture.

Kohlleffel: Sounds like a reasonable timeframe based on what I’ve seen as well. I have thoroughly enjoyed getting a deeper understanding of your perspectives and talking data, the cloud, and use cases, along with Matillion and Snowflake. I certainly appreciate all the time today, and I hope that everybody in the data and cloud community enjoyed the conversation and learned something from your experience that they can apply to their various data and cloud initiatives.

Bharathi: Same here. Thank you!

Need Help with Your Cloud Initiatives?

If you are considering the cloud for migrating or modernizing data and analytics products and applications or if you would like help and guidance and a few best practices in delivering higher value outcomes in your existing cloud program, then please contact us.

Hashmap offers a range of enablement workshops and assessment services, cloud modernization and migration services, and consulting service packages as part of our Cloud (and Snowflake) service offerings.

Prakash Bharathi is an Enterprise Architect at Hashmap, a company of innovative technologists and domain experts accelerating the value of Data, Cloud, IIoT/IoT, and AI/ML for customers across industries and the community while providing outcome-based consulting services, managed services, and accelerator services.

Kelly Kohlleffel is responsible for sales, marketing, and alliances at Hashmap. He enjoys helping customers “keep it simple” while “reducing option fatigue” and delivering the highest value solutions with technology partners like Snowflake. Be sure to connect with Kelly on LinkedIn.
