Spark + Now = (S[Now]Park)
I’m Duncan and I’m a Senior Sales Engineer at Snowflake. Opinions expressed here are solely my own and do not represent the views or opinions of my employer.
The best snowboarders are only as good as their equipment will allow them to be. If they have all the talent, logic, thinking and timing in the world but no ability to execute, then it is not going to end well!
What’s this got to do with Spark, Dunc? I hear you ask… Well, there are parallels between the way we operate and the way businesses operate. For example, if an organisation’s infrastructure is creaking, why would they consider moving it to the cloud as it is? In the same way, if a snowboarder’s board is cracked, why would they try to descend a black run?
So it’s important for us to separate the talent from the ability, in the same way we need to separate the business logic being executed from the infrastructure it runs on. No longer should we just think about moving them together as one. Sometimes we need to throw the board (the ability) away, unless it has sentimental value, which is probably unlikely with hardware clusters!
It’s not news that organisations are moving their workloads to the cloud, away from on-premises environments, in substantial numbers. But for me it’s less about the “what” and more about the “how” and the “why” of the move, and that’s where it gets really interesting. Let’s pick a particular area to drill down on: data science and its complex processing environments. And let’s start with the basics…
What’s Data Science? Well, there is still no consensus on an exact definition, and it is considered by some (not me!) to be a buzzword, with big data a related marketing term. Generally, though, data scientists are responsible for breaking data down into usable information and for creating the software and algorithms that help companies and organisations determine their optimal operations.
What is a complex processing environment? Let’s use Hadoop as an example. Hadoop is a software ecosystem used for data storage and computation. First released in 2006 (reaching version 1.0 in 2011), it has evolved from a distributed storage and processing system into a vast ecosystem which includes, but is not limited to:
- Apache Pig: a high-level scripting platform used with Hadoop as an alternative to writing MapReduce code in Java; it’s used for data analysis tasks.
- Apache Spark: a cluster computing framework that provides distributed data processing for more complex tasks such as machine learning.
- Apache Hive: provides for the use of SQL for data querying, summarisation, analysis, and exploration.
- Apache Flume: data collection software that can handle massive amounts of streaming data; Flume is used for data ingestion.
- YARN: used for resource management and job scheduling.
What’s so bad about this? The shortcomings have recently been articulated in multiple write-ups arguing that Hadoop is simply not a very good scale-out solution for all but the most limited applications and is now seen as an inhibitor to organisations being operationally effective.
With these types of platforms, scaling for performance disrupts operations and causes downtime. Furthermore, solutions built for data science, such as Apache Spark based systems, are not designed to be used as an enterprise-wide platform supporting broad analytics needs. Using Spark to support multiple functional teams requires custom, specialised skill sets to manage the resource contention and data issues. Put simply:
- It’s failing to deliver the future it promised. Hadoop may be a victim of the hype assigned to it by the media. A hangover, perhaps, of it being largely associated with big data: because big data caught the attention of many enterprises whose data wasn’t nearly as big as they expected it would be, Hadoop was at the heart of many implementations that failed to deliver any measurable value.
- It’s too rigid and stubborn. Hadoop allows you to run only a single general computation model and offers precious little flexibility to accomplish even that. Too much of it is fixed, which is a problem for organisations needing a scale-out solution that conforms to their needs, rather than the other way around.
- It’s “Complicated”. An enterprise-class Hadoop implementation will likely require significant expertise, and many organisations have found themselves ill-equipped internally to address that need, and unwilling to invest the necessary sums to outsource it. When polled by Gartner(1), even back in 2015, 54% of respondents had no plans to adopt Hadoop, and 57% of those said the skills gap was the primary reason. System administrators can be lulled into thinking they’ll be able to manage a Hadoop production cluster after running a small problem on a small test cluster. This is seldom the case. There are common and time-consuming problems Hadoop administrators face, from the complex and largely manual task of scaling, to error and failure modes. Then there’s network design to worry about: most corporate networks are not designed to accommodate Hadoop’s need for high-volume data transfers between nodes.
- It’s too much of a heavy lift. Most organisations that aren’t Amazon, Facebook or Google (and, let’s face it, that leaves virtually all other organisations) need a way to scale their computing capability beyond their largest server, but it can be tough to justify the cost of a Hadoop roll-out just for that.
The truth is, until now, if your data set or problem was bigger than your largest server, you had three options: scale up by buying a bigger server, scale out by distributing your problem across clusters, or limit the size of your problem to match the size of your server. The first option costs money; the second costs time and money; the third costs time. None of these options is desirable when answers are important and delaying them threatens productivity.
Now at this point one could argue that I’m focussing on the technology (the ability and snowboard in our analogy) and you’d be right. It’s important to understand what’s going to motivate organisations to move away from what is rapidly being acknowledged as legacy architecture. The technology itself is only half of the discussion. The other half is the business logic and workloads being processed inside of these platforms (the boarder and the talent in our analogy).
So the question is, given these challenges, “why” do organisations move these on-premises solutions, with so many issues and constraints, to the cloud, or even build them in the cloud from scratch?
- Is it because it’s the least risk option that allows them to deliver to schedule and budget?
- Is it skills related?
- Is it confusion from too much choice, or the opposite: a lack of awareness in the market of the capabilities available?
- Or… is it that the way the business logic has been implemented means the execution is so tightly coupled to the platform and infrastructure that there’s no other viable option?
My experience is that it’s a mix of all of the above which leads to the fallback, default approach of “I’ve already got a hammer, therefore everything’s a nail so off we go into the cloud - crash, bang, wallop!”.
Organisations have invested years of effort and domain expertise building and maintaining the “business logic” being executed. The thought of re-platforming the code can be daunting and often drives the technology infrastructure decision. For example, I often hear organisations say “all our code is in Scala (or Python); it’s too much of a risk (time, effort, skills, cost) to rewrite, so let’s lift and shift it to the cloud”.
This is an understandable approach to a certain extent, as it’s a bit like trying to change the tyre on a car whilst it’s moving. The business has to keep operating. But for me this is not transforming the business, simply modernising it: one is just moving the same headline challenges as above to somebody else’s datacentre.
The key phrase here is “until now.” What if there was another way? What if there was an answer that would allow organisations to remove the headline challenges of the technology mentioned above, while allowing them to continue executing their existing business logic? What if we could separate the code from the tech and give organisations more freedom and flexibility to operate?
Well, there is an answer to this! Snowflake, or to be more precise in this discussion, Snowpark!
Providing a single, integrated software-as-a-service (SaaS) environment requiring near-zero maintenance, Snowflake allows organisations to easily dedicate, customise, and control their data resources for analytics, data warehousing, data lake, and other workloads without compromise.
With Snowpark, developers can use their preferred language (Scala, Python, or Java) to accelerate feature engineering efforts using familiar programming concepts such as DataFrames, and then execute these workloads directly within Snowflake. For example, data scientists can use their notebook of choice, such as Jupyter, or those in Dataiku, H2O.ai, and Zepl (now part of DataRobot’s platform), and have the processing execute in Snowflake to benefit from its scalability and performance. “Keep your friends close, keep your data closer”!
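To make that concrete, here’s a minimal sketch of what feature engineering could look like in Snowpark for Python. The connection parameters, table and column names (TRANSACTIONS, CUSTOMER_ID, AMOUNT) are placeholders of my own rather than anything prescriptive; the point is that the DataFrame operations are translated to SQL and pushed down, so the heavy lifting happens inside Snowflake rather than in the notebook.

```python
# Illustrative sketch only: the table, columns and connection details
# below are assumptions, not a reference implementation.
from snowflake.snowpark import Session
from snowflake.snowpark.functions import avg, col, count

# Connection details come from your own Snowflake account, role and warehouse.
connection_parameters = {
    "account": "<account_identifier>",
    "user": "<user>",
    "password": "<password>",
    "role": "<role>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}
session = Session.builder.configs(connection_parameters).create()

# DataFrame operations build a query plan that Snowflake executes;
# no data is pulled back to the notebook until an action runs.
transactions = session.table("TRANSACTIONS")
feature_df = (
    transactions
    .filter(col("AMOUNT") > 0)
    .group_by("CUSTOMER_ID")
    .agg(
        count(col("AMOUNT")).alias("TXN_COUNT"),
        avg(col("AMOUNT")).alias("AVG_AMOUNT"),
    )
)

# Persist the engineered features as a table inside Snowflake.
feature_df.write.mode("overwrite").save_as_table("CUSTOMER_FEATURES")
```

If you’ve written Spark DataFrame code before, this will look very familiar, and that’s largely the point when it comes to repointing existing business logic.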
And here’s the big-ticket takeaway: Snowpark enables organisations to repoint their code base away from costly, rigid and complicated infrastructure to Snowflake, allowing the business to take cost out.
Since its launch earlier this year, Snowflake’s Snowpark developer framework has helped data scientists, data engineers, and application developers collaborate more easily and streamline their data architecture by bringing everyone onto the same platform. Snowpark lets developers collaborate on data in the coding languages and constructs familiar to them, while taking advantage of Snowflake’s security, governance, and performance benefits.
As part of the recently announced Snowpark for Python offering, Snowflake brings enterprise-grade open source innovation to the Snowflake Data Cloud while helping ensure a seamless experience for data scientists and developers to do their work. Snowflake Data Cloud users who already benefit from near-instant and governed access to data will now be able to speed up their Python-based workflows by taking advantage of the seamless dependency management and comprehensive set of curated open source packages the Anaconda partnership provides. The integrated Anaconda package manager is immensely valuable as without the right tool set, resolving dependencies between different packages can land developers in “dependency hell,” which can be a huge time sink.
Further, Snowpark for Python enables data teams to operate with improved trust and security. Users can collaborate against the same data using their preferred languages, without needing to copy or move the data. Not only can this eliminate ungoverned copies of data, but all code is run in a highly secure sandbox directly inside Snowflake for further protection.
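As a rough sketch of how those two points come together (the package choice, function, table and column names are my own illustrative assumptions), you can declare the Anaconda packages your code depends on and register a Python UDF that Snowflake executes in that sandbox, right next to the data:

```python
# Illustrative sketch: builds on the `session` created earlier; the UDF,
# table and column names are assumptions, not a prescription.
from snowflake.snowpark.functions import col, udf

# Ask Snowflake to resolve the dependency from the curated Anaconda channel
# at execution time; other packages (e.g. scikit-learn) work the same way.
session.add_packages("numpy")

@udf(name="scale_amount", replace=True)
def scale_amount(amount: float) -> float:
    # This body runs inside Snowflake's secure Python sandbox, next to the data.
    import numpy as np
    return float(np.log1p(amount))

# Apply the UDF in place, without copying or moving the underlying data.
session.table("TRANSACTIONS").select(
    col("CUSTOMER_ID"),
    scale_amount(col("AMOUNT")).alias("LOG_AMOUNT"),
).show()
```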
When enabled, Snowflake compute resources automatically suspend and resume to operate only when needed, which maximises utility and eliminates unnecessary expenses when resources are idle. Other platforms, whether a Hadoop or a traditional MPP data warehouse, originated as an on-premises environment and lack the agility, flexibility, and scalability of Snowflake.
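For illustration, this behaviour is simply a property of the virtual warehouse doing the Snowpark work and can be set from the same session; the warehouse name and the 60-second threshold below are assumptions on my part:

```python
# Illustrative only: warehouse name and threshold are assumptions.
session.sql("""
    ALTER WAREHOUSE DATA_SCIENCE_WH SET
        AUTO_SUSPEND = 60     -- suspend after 60 seconds of inactivity
        AUTO_RESUME  = TRUE   -- wake automatically when the next query arrives
""").collect()
```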
My colleagues Carlos Carrero and Mats Stellwall have written some excellent articles which go into great detail on the practical application of Snowpark, on the topics of Streamlining Architectures and Feature Engineering.
While these systems, originally built as on-premises technologies, may now be available in a cloud wrapper (a.k.a. a wolf in sheep’s clothing), and while this may improve the speed of acquiring platform infrastructure, many challenges, such as disruptive scaling and physical scaling limits, remain. That is why Snowflake and Snowpark are such a compelling proposition.
And at least we know that when we’re at the top of the black run, we need not fear that our board will disintegrate beneath us!
See you on the Snowpark slopes folks!
Duncan
(1) Gartner Report — https://www.gartner.com/en/newsroom/press-releases/2015-05-13-gartner-survey-highlights-challenges-to-hadoop-adoption