Stop, collaborate and listen — a look ahead to Spark + AI Summit Europe
Nayur Khan — Head of Platform Engineering at QuantumBlack
Once referred to as the Taylor Swift of Big Data software, Apache Spark is an integral part of our tech toolkit here at QuantumBlack. Capable of processing data at a monumental scale, the multipurpose unified analytics engine is used across a number of our projects, from injury prediction in sports to clinical trials in pharmaceuticals, and even for assessing the impact of air quality on a city’s population.
As QuantumBlack’s Head of Platform Engineering, I’ve been running Spark analytic workloads for a number of years and so I’m all too familiar with the software. Aside from the technology, one of Spark’s standout attributes is how it has harnessed its popularity to build a global community of contributors who regularly collaborate and share ideas. This community comes together at regular Spark Summit conferences, which attract thousands of attendees.
Earlier this week, Spark Summit Europe 2018 kicked off in London. Developers, data scientists and tech executives are gathering at the city’s Royal Docks arena from Tuesday 2nd to Thursday 4th October for a unique one-stop shop on how to practically apply the very best data tools and machine learning solutions, with a focus on the event’s namesake.
Spark Summit sessions and training always provide fantastic data engineering and data science content, alongside fresh ideas around best practices for productionising AI. This is a particularly exciting week for me, as I’ll be taking to the stage today to share the lessons QuantumBlack has acquired through years of running analytic workloads in production in the Cloud. Ahead of my appearance later today, I wanted to give a flavour of what I’ll be covering in my session.
Cloud object storage is a crucial part of working with data, and I’ll seek to address the common quirks and misconceptions that businesses encounter when first exploring cloud-powered analytics.
Today’s businesses are increasingly moving a significant proportion of IT workloads to cloud environments. This fundamental shift from organisations building IT to consuming IT means that workloads will be transitioned to a hybrid cloud infrastructure at a significant pace, with a heavy reliance on off-premise environments.
There’s a reason why so many modern businesses are choosing to run analytic workloads in the cloud. The cloud provides opportunities to rapidly bring solutions and products to market at a scale and cost effectiveness that outstrips most traditional in-house infrastructure options.
Any organisation looking to harness data for analytics purposes will require a data lake: a centralised, large-scale repository for all structured and unstructured data, which can then be analysed and put to use. Cloud object storage is usually the technology of choice for housing the early stages of these data lakes, as it provides massively scalable, cost-effective capacity for any type of data in its native format.
Design to operate
I’ll also be borrowing insights from the world of DevOps. Efficient DevOps practices are key to helping organisations approach analytical workloads in a consistent manner and enable different teams to pick up and run with projects they may not have designed, without having to wade through a mountain of documentation or code.
It’s fast becoming the norm for organisations to have many teams focus on building varying analytical use cases in the same analytic environment. It’s often advantageous to structure a workload in this fashion — a mixture of personnel means that a number of experts can input their specialist skill set at different phases of the workload.
That being said, spreading an analytical assignment across teams comes with its own challenges, particularly when the team that initially built the code is not the same team responsible for operating it. If the two teams employ different configurations to run analytic workloads, an operational mismatch arises when the workload is handed over. To put it simply: while the overall goal for the analytical workload may be consistent across a business, the ways of applying the workload to achieve that goal may differ. Organisations should take care to ensure their design and operational configurations are consistent.
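As a rough illustration of that consistency check, here is a minimal Python sketch that compares the settings a design team used with those an operations team runs, and reports any differences. This is not a description of QuantumBlack's tooling: the `config_mismatches` helper and the example values are hypothetical, and the keys are simply standard Spark property names chosen for the example.

```python
from typing import Dict, Optional, Tuple

# (design value, operations value); None means the key is absent on that side.
Mismatch = Tuple[Optional[str], Optional[str]]


def config_mismatches(
    design: Dict[str, str], operations: Dict[str, str]
) -> Dict[str, Mismatch]:
    """Return every key whose value differs between the two configurations."""
    mismatches: Dict[str, Mismatch] = {}
    # Walk the union of keys so a setting missing on one side is also reported.
    for key in sorted(set(design) | set(operations)):
        design_value = design.get(key)
        operations_value = operations.get(key)
        if design_value != operations_value:
            mismatches[key] = (design_value, operations_value)
    return mismatches


# Hypothetical settings, assumed purely for illustration.
design_config = {
    "spark.executor.memory": "8g",
    "spark.sql.shuffle.partitions": "200",
}
operations_config = {
    "spark.executor.memory": "4g",
    "spark.sql.shuffle.partitions": "200",
    "spark.executor.cores": "2",
}

for key, (design_value, operations_value) in config_mismatches(
    design_config, operations_config
).items():
    print(f"{key}: design={design_value!r}, operations={operations_value!r}")
```

A check along these lines could run at handover time, so that any difference is seen before the workload changes hands rather than after it misbehaves in production.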
Robust pipelines are a must-have
Running analysis in the same system that generates data is normally unfeasible, and this means that businesses have come to rely on data pipelines that transfer data from different systems into a cloud environment, where the analysis can take place. Sustaining a reliable data pipeline can be an enormous challenge for any organisation, from data newcomers to AI aficionados. My Spark Summit session will touch on things to consider when building and maintaining your pipeline.
Of course, even the term can cause confusion depending on who you’re talking to. A data engineer will have a different idea of what a pipeline is than a data scientist would. To me, defining a robust data pipeline boils down to how well an end-to-end solution operates, from initial data gathering through to analysis and application, and finally to what the user sees.
If this sounds like an oversimplification, it is. Managing pipelines which utilise data engineering and machine learning analytic workloads requires balancing a vast number of spinning plates, often 24 hours a day, seven days a week, throughout the year. So what are those considerations and how should they be made? Come along to my Spark Summit session to find out.
Spark Summits always provide a fantastic opportunity to explore data best practice, and I’m delighted to be attending as a visitor, let alone a speaker. One of my personal highlights so far has been Wednesday morning’s keynote by Matei Zaharia, who launched the Spark project in Berkeley back in 2009. Matei delivered a fascinating roundup of his work with new open source offering MLflow, which aims to simplify the machine learning lifecycle.
I’ll be taking to the stage in Room 6 of the Spark Summit Europe at 14:00 this Thursday (4th October), and I hope to see many of you there.
For those unable to attend, do stay tuned to QuantumBlack’s Medium channel, as we’ll be touching on many of the topics above in the months to come.