I read somewhere that when dealing with computational problems involving billions of items, it's a good idea to divide the work among multiple processes.
You could also use multiple threads, but I'll stick with multiprocessing since it more closely reflects how Databricks works.
Databricks is a multiprocessing platform. There are problems that at first blush do not appear to be a good fit for Databricks, yet can be a great fit if you think about them differently.
Consider the problem of estimating pi with a Monte Carlo simulation. You can think of the estimation method this way:
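The idea can be sketched in a few lines of Python. This is a minimal serial version (the function name and sample count are my own for illustration); each chunk of samples is independent, so the same work could just as easily be farmed out across multiple processes:

```python
import math
import random


def estimate_pi(samples: int, seed: int = 0) -> float:
    """Estimate pi by sampling random points in the unit square.

    The fraction of points landing inside the quarter circle of
    radius 1 approaches pi/4 as the sample count grows.
    """
    rng = random.Random(seed)
    inside = 0
    for _ in range(samples):
        x, y = rng.random(), rng.random()
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / samples


print(estimate_pi(100_000))  # close to math.pi
```

Because every sample is drawn independently, splitting the loop into N chunks and summing the per-chunk hit counts gives exactly the same estimate, which is what makes the problem a natural fit for a parallel platform.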
Running a Databricks notebook as a job is an easy way to operationalize all the great notebooks you have created. I think the two biggest benefits are:
In this article we will do the following:
I was happy to see that Microsoft held its Build conference despite the COVID pandemic gripping the world. While it’s no substitute for the in-person version, it was still pretty damn good. Machine Learning was an evident focus, which is good because that is on my list of things to learn this year. The session titled Azure Machine Learning in Action by Sarah Guthals and Francesca Lazzeri showcased the automated machine learning feature of Azure Machine Learning (AML). It’s hard for me to be blown away, but this session did just that. …
My previous posts explained how to get a local Spark instance up and running in Docker, load data, and query data. I also mentioned that this is an excellent way to get introduced to Spark because it is easy to set up and run.
When you’re ready to graduate to an enterprise-grade Spark instance, you will want to load and analyze data on a Spark cluster because it can handle massive datasets with ease. However, configuring a Spark cluster is far from trivial.
Enter Azure Databricks.
After reading this article, you will have accomplished the following:
The code for this article is available on GitHub.
In Part 1 of this series, we got a PySpark Docker container up and running. In this article, we'll do a hello-world style data analysis. A popular starting example is counting words in a large body of text. From my high school days, I remember Shakespeare's works containing a lot of words, so let's see which words he used most frequently. This exercise is quite easy thanks to PySpark's SQL functions library. If you're familiar with SQL then you'll feel…
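The PySpark version needs a running Spark session, but the underlying word-count logic is easy to sketch in plain Python first (the helper name and regex are my own choices for illustration):

```python
import re
from collections import Counter


def top_words(text: str, n: int = 3) -> list[tuple[str, int]]:
    """Lowercase the text, split it into words, and return the n most common."""
    words = re.findall(r"[a-z']+", text.lower())
    return Counter(words).most_common(n)


sample = "To be, or not to be, that is the question"
print(top_words(sample))
```

In PySpark the same pipeline would lean on `pyspark.sql.functions` such as `lower`, `split`, and `explode` to turn lines into words, followed by a `groupBy` and count, which is why familiarity with SQL carries over so directly.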
One of the benefits of working with a forward-thinking company like Marel is the opportunity to be challenged in different domains. In the next few months, I will get to deep dive daily into the data engineering domain — uncharted territory for me.
Azure Databricks is a crucial SaaS platform used at Marel for processing large volumes of data. I would summarize Azure Databricks as an easy way to spin up an Apache Spark cluster. That's all I am going to say about Spark and Databricks because these two technologies have been wonderfully explained by those much more knowledgeable than I am. Here…
Browse to nrwl.io and you are greeted with a bold statement — develop like Google, Facebook, and Microsoft. Sounds good to me!
Click the NX logo at the top right of the toolbar to find more clues. Now click the Angular logo on the page and scroll down. Ah, this looks like tooling that helps Angular developers. But it is much more than that.
Marel uses NX with great success. We find the primary benefit of NX is that architecture is a first-class citizen and not an afterthought. Have you noticed each project team has “their” way of developing Angular…
I am a software engineer at Marel, an Icelandic company that makes machines for meat, fish, and poultry processing.