Can big data’s killer app really be reactive?
This post was my first draft of a set of ideas around reactive machine learning. These ideas have since evolved a great deal and are now best represented in the book Reactive Machine Learning Systems. For the latest information on reactive machine learning, such as talks and other writings, check out reactivemachinelearning.com.
Reactive systems are one of the most exciting developments in software development within the past several years. For a brief introduction to the topic, I’d point you to the Reactive Manifesto. It’s a succinct and revelatory distillation of many of the key strategies for fulfilling modern application requirements using sophisticated distributed systems design principles.
These ideas are building into a fairly substantial movement, and you can find books on how to apply the principles of reactive systems to issues in application development, domain modeling, web application development, and design patterns. There’s even a very cool course on reactive programming you can take on Coursera.
This is all exciting and engaging material, but I see one big oversight: machine learning. Though I work at a web startup, I don’t personally build webapps. I use modern big data tools to build massive machine learning pipelines. So my obvious question after digging into this topic was:
How do the principles of reactive systems relate to machine learning applications?
Finding no answer on Google, I decided to give it a crack at answering it myself. What follows is a pretty rough initial analysis that I’ll probably update over time as engage more with the question. But in the spirit of starting before you’re ready, here’s my attempt to answer my own buring question.
Machine Learning Applications
First, when I say machine learning application (ML app), what do I mean? Actually, I mean several things.
Model Learning Pipeline
The model learning pipeline is typically what people think of when they think of an ML app. It is typically, but not necessarily, a batch job that processes a data set consisting of training and possibly validation and/or test sets of instances. The end result is some sort of model.
This is one of the less discussed parts of a machine learning application: whatever actually uses the model to make some sort of decision. As inputs, it consumes features (more on them in a second) and the previously learned model. The output of this sub-application is at least some sort of signal about the action to take. In many real world systems, this signal is often entangled with the portion of the application that actually takes the action (e.g. showing you which books you might want to read on Amazon).
This is the portion of the application that takes raw data from some source and turns it into derived semantically meaningful data known as features. The astute reader will notice that we need feature extractors in both of the above subsystems: once when we learn the model and once when we evaluate the model. Ideally, those two usages will just be two calls to the same feature extractor implementations, but nothing requires that. Implementing feature extractors can be tricky even though it looks simple. For a deeper dive into the difficulties of getting this right, check out my post on machine learning failures. One last point to call out on feature extractors is that they have a similar issue to the model server: they can be entangled with the process of actually gathering the raw data from the outside world.
Names are hard
The nomenclature above probably doesn’t line up with the terms that you heard in school or use in your ML app. I’ve found a fair bit of heterogeneity in usage across academia and industry. So, for the moment, let’s take my terms as a given.
Whose app is it anyway?
In my analysis, I’m going to use a hypothetical application designed to solve one of the most pressing problems in the modern world: dog poop. For us New Yorkers, the prospect of walking out each morning and stepping right into a steaming pile is very real. So, we’re going to imagine a hip startup in SoHo with a shoe-saving app named Dookie. The folks at Dookie are dog lovers who want to build a global social network around spotting and reporting rogue dog poop.
The original version of the app was based around mapping and notification of dog poop sightings. Using Dookie’s mobile app, New Yorkers could mark locations where they found dog poop. This fed a realtime view of all of the recent sightings of dog poop around the city. Eventually, people could plan their routes to and from places based around the presence or absence of dog poop on the street. The app was a huge hit with the sort of affluent, image conscious users that advertisers love to target.
This allowed the team at Dookie to raise a big round of investment to help fund an even greater version of their crap-tracking app.
In the evolved version of the Dookie platform, built with the help of the new data science team, the app actually attempts to predict potential hotspots before they’re reported. The goal was to give users an even more powerful view of the universe of possible locations routes unobstructed by poop.
Predicting where poop might be turned out to be a problem that was not just tricky from the analytical dimension. This new machine learning stack required all sorts of new application code.
In the first version of the ML app, called Poop Predictor, the team used primarily tools from the PyData stack: scikit-learn, pandas, etc. In the second version of the ML app, called Scoop Sense, the team used Spark and Scala.
The specifics of those two tech stacks aren’t really the topic of this post. The goal is just to characterize the differences between an intrinsically single-threaded, single node implementation (Poop Predictor) versus a multithreaded, distributed implementation (Scoop Sense).
I’m aware that there are ways of building big awesome applications using the PyData stack that aren’t purely limited by the capabilities of CPython; I’ve done so. I’m actually a huge fan of Python, especially for machine learning. But there are big architectural implications of choosing one or the other of the two best tech stacks for machine learning, and I want to call those out as a way of highlighting principles, strategies, patterns, etc. that will allow us to build a reactive system.
Finally, in the material below, I’m going to try to characterize behavior and designs that apply to a wide range of applications. To begin with, I’ll be mostly focusing on the model learning pipeline and the feature extractors. I’ll try to put aside any concerns about the model server, treating it as a separate, potentially fully reactive application.
Principles of Reactive Systems
Going through the four principles of reactive systems, I’ll try to evaluate how our example systems will measure up to the ideal of a reactive system.
As the diagram implies, the principles interact and enable each other, so techniques used in one section will clearly play a role in supporting the principle of another section.
The first principle of reactive systems is that they should be responsive. Somewhat surprisingly, this is often the point that drives people like Jay Kreps to question the utility of the reactive systems paradigm. If you’re all about big data infrastructure, responsiveness sounds a little bit too close to that responsive web design thing that front end guys sometimes worry about.
As I understand it, responsiveness is just whether or not the application returns consistent timely responses to user input. What does this mean for an ML pipeline?
Let’s start with user input. What is the user input to our ML pipeline? A hand-wavy treatment in an ML textbook would probably say something like “the features,” but we know better, right? Features don’t fall from the sky; they’re extracted from raw data, and raw data must be collected. Let’s define the user input to be the call to initiate the feature extraction and model learning process which consumes the raw input data, leaving aside concerns of raw data acquisition for the moment.
Then what constitutes a response? In the case of a machine learning pipeline, the obvious candidate is a freshly learned model.
So, do we have reliable upper bounds on our response times giving us a consistent quality of service? In many system designs, I would argue that the answer is no.
For example, when is the pipeline going to finish? Depending on the application, the answer might be anywhere between a few minutes to several hours (or more, in principle). But whatever the number given, that number will almost always be just an estimate based on past historicals. Rarely is there a concept of an SLA for an ML pipeline. Usually, they’re just setup to run on whatever data they’re pointed at, and the resulting latency between hitting run and getting a new model just is what it is. If things are starting to take too long, people use larger hardware or try to improve the implementation in some way. Maybe they’ll even sample down the data, if that’s an option. But none of these options will actually give a user of the pipeline system a consistent level of service through a reliable upper bound on responsiveness.
There are options for reducing the uncertainty of the response time, though. Although it can have a meaningful impact on the quality of a learned model, we can set the number of iterations.
Explicitly setting the number of iterations that the algorithm will use is not quite bounding the response time; it’s just bounding the number of passes over the training data. But it’s definitely progress towards creating a predictable distribution of response times.
If we wanted to have a true upper bound on our model learning pipeline’s response time, wrapping our model learning pipeline in a Scala Future works nicely:
This approach takes advantage of the concept of a default model. A default model can be a key component of allowing your ML app to gracefully degrade to some reasonable state in the face of failure. In this particular example, we are attempting to learn a real Spark linear regression model, but if model learning takes longer than the timeout of one hour, we’ll return the default model instead. Examples of a default model don’t even need to include “true” models. They could simply be the default class (e.g. “no poop here”) or the last model that was learned for this configuration. A lot of this depends on the specifics of your application, the stability of your model learning process, your business requirements, etc. But most teams have the option of gracefully degrading to some sort of default model.
The model learning timeout functionality brings us much closer to the concept of a responsive ML pipeline. Though I enjoy how neat and clean this in the Scala example, futures are available in most languages that support multithreading (often via libraries). Moreover, achieving futures-like semantics in any application capable of asynchronous processing should be achievable. It doesn’t have to be implemented at the language level and could easily live within your application’s code.
The next principle of reactive systems is that they are resilient. In this context, resilience is defined as maintaining responsiveness in the face of failure. Obviously this is a desirable property in a modern application, but how is it to be achieved?
The authors of the Reactive Manifesto describe three specific strategies for ensuring resilience:
- Containment and Isolation
Let’s take each in turn and consider how we could incorporate them into our two implementations of our ML app.
The Reactive Manifesto defines replication as follows:
Executing a component simultaneously in different places…
For Poop Predictor, the Python implementation of our ML app, this is hard. Without native multithreading capabilities we must put the responsibility for the simultaneous execution of multiple invocations of our ML pipeline upon some other external system. Yes, Python has some support for multiprocessing, but we’re going to ignore that for now. It’s a late addition to the language, totally bolted on to get around the global interpreter lock. I think that it’s more accurate to think of Python multiprocessing as a facility provided by the OS and not by Python the language. Beyond that, Poop Predictor is a single node application. So even if we executed our ML pipeline in a multithreaded manner, those executions would not be in different places, dramatically limiting the utility of this sort of replication.
The story is pretty different on the Scoop Sense side. Thanks to our use of Spark and MLlib, our Scala implementation is operating on a cluster. Spark is internally relying on Akka. It’s system architecture ensures that a given component of our pipeline (a task) will be executed on multiple nodes in a cluster, when need be. By default, there may not be that much replication in our job, but effective Spark programming will almost definitely involve broadcast variables. This will mean that portions of our dataset will be replicated around our cluster.
We can go even further than that by taking advantage of Spark’s speculative execution capabilities, we can achieve literal simultaneous execution of portions of our job, exactly fulfilling the replication definition given above.
Now, whenever one task in our job is running slower than desired, perhaps headed towards a failure, we will begin speculative execution of that task in another location, giving our pipeline some nice responsiveness in the face of low level task failures.
This all sounds pretty useful, and broadcast variables and speculative execution can definitely play a role in enabling other principles and strategies for reactive machine learning systems.
Containment and Isolation
The strategy of containment and isolation means keeping components of the application sufficiently separated from each other that a failure in one component will not affect other components. Kuhn and Allen use the metaphor of bulkheading from ship design. Applications should be designed to share nothing. Errors should only be signals that can be responded to, not part of shared state that can lead to cascading failure.
Poop Predictor turns out to not be in good shape here. Since it dates back to the earlier days of Dookie, when they had less scale, it simply loops through all configured models needed and learns each one in turn. This means that a catastrophic failure in the Chicago poop prediction model can prevent the New York poop prediction model from being learned. Poop Predictor is not a poorly implemented app; it has error handling functionality. But it has no ability to detect a timeout and start up a new feature extraction or model learning process in parallel to compensate. So any pipeline failure is likely to have cascading effects downstream.
Scoop Sense looks better here. Spark internally compartmentalizes tasks, preventing the failure of one task from causing the failure of an entire job (in our case the feature extraction and model learning pipelines). Since we’re submitting the New York and Chicago model learning jobs separately, there’s no real way for a rogue job to spill out onto the poop deck of the S. S. Dookie or into another compartment.
But what about if an entire job fails? Does our entire model learning capability fail as well? It depends on the relationship between job invocation and job execution.
Let’s assume that initially, the Dookie developers execute their Scoop Sense Spark jobs on the cloud, specifically AWS EMR. This means that the job invocation application is just making API calls and receiving back statuses. There’s very little risk of a mere status code propagating any sort of failure to other independently executing jobs. Failure is nicely contained because the application components are sufficiently isolated.
As Dookie continues to scale, they feel the need to lower their monthly AWS bills, so they turn to hosting a persistent cluster for all of their data processing jobs. Whether they choose Mesos or YARN to manage their jobs, both cluster managers employ sophisticated strategies around containment and isolation to ensure that failures won’t propagate across jobs.
What about job invocation itself, though? In all of these scenarios, a job must be started by something. Some driver program must be around to start jobs and receive their status signals to know whether or not the job has failed. A good strategy for ensuring more resilience might be to use a distributed, fault-tolerant scheduler such as Chronos. If the Dookie team uses Chronos to manage its Scoop Sense Spark jobs (or any other types of jobs it might have), they will get the ability to setup retry logic for their jobs and dependency management, all sitting on top of the resilient Mesos infrastructure. Using such a system will allow them to architect their system in such a way that the job invocation application doesn’t end up being a single point of failure that cascades down to the rest of the ML app.
Our final strategy for ensuring resilience in our application is delegation. From the Reactive Manifesto again:
Delegating a task asynchronously to another component means that the execution of the task will take place in the context of that other component.
Obviously, we’re in trouble again for Poop Predictor. Asynchronous anything is out the question unless we leave Python land (e.g. using a message queue of some sort). And all of our jobs occur in the exact same context. Poop Predictor runs each job in the same Python process. There is no possibility of some independent component of the application to sit on top of the various jobs observing what has failed in a way that would allow it to truly have separated its context from that of the job execution context. Certainly, there can be hierarchy and error handling but only within the same execution context. So any “delegation” we might have in Poop Predictor is going to be a feature of our application code and not of its architecture, dramatically weakening any guarantees it might offer about execution supervision.
Since Scoop Sense is built on top of Spark, we get some delegation right out of the box. Delegation is a core concept of the design of Akka, which is a key component of Spark, and Erlang, from which much of the design of Akka derives. Beyond the Akka level concepts of delegation, Spark also uses a hierarchical architecture allowing failures to be handled at both the cluster manager and driver program levels. Similar to the last section, we have the option of adding yet more delegation to our design by having the job invocation mechanism live outside of the driver program itself via a fault-tolerant scheduler like Chronos.
Did we succeed in ensuring resilience in at least one of our implementations? Obviously we ran into a lot of real limitations on the Poop Predictor side, mostly due to Python being single-threaded and Poop Predictor only being deployed on a single node. We had options for improving our ML app, but they all seemed to involve pretty fundamental changes to Poop Predictor’s architecture.
Scoop Sense seemed to be setup for a lot more success out of the gate, mostly due to the pervasive use of reactive design principles in the implementation of Spark and other supporting tools like Mesos and Chronos. But what is the definition of success here?
From the Reactive Manifesto:
The client of a component is not burdened with handling its failures.
Based on this, I’d say if we see exceptions that bring down our top level job invocation application, we’ve failed. Scoop Sense has a lot of explicit design choices that make use of the strategies of replication, containment/isolation, and delegation, putting us in a good place in terms of resilience.
Building on the techniques used already, I’m going to suggest one more refinement to our design. Let’s submit our jobs in a manner where our driver programs are supervised by the cluster manager.
Relying primarily on the strategies of delegation and isolation, this invocation strategy ensures that any failure in the driver program will be detected by the supervising cluster manager and cause the driver program to be restarted. This feature combined with all of the others above should leave our Scoop Sense implementation pretty responsive despite any failures that might occur.
The next principle of reactive systems that we’ll consider is elasticity. For a system to be elastic it must maintain responsiveness under varying levels of load. This requirement implies several characteristics of our system. First, it must be capable of splitting its input stream across multiple processing units.
In the case of Poop Predictor, it’s not clear how we would do this. Our Python implementation is running the feature extraction and model learning pipelines for each configured model in serial. Even assuming an external message receiving layer that accepted requests for new models, Poop Predictor can only learn one model at a time. So, while it might be able to eventually catch up to a flood of new inbound requests for models (due to the launch of Dookie in the crap filled streets of Paris), Poop Predictor will not be maintaining a constant level of responsiveness in the face of this traffic spike.
The second characteristic of an elastic system is that it is able vary the number of processing units in response to varying load. In the optimistic version of this scenario, we just want our ML app to spin up a bunch of new processing resources when all of those Parisians start to download the app and want to know where they can step. In the pessimistic scenario, this is about dialing down our total spend on AWS when North Korea shuts off our app for predicting that Kim Jong-un’s Central Luxury Mansion might have some dog poop on its front steps. Given that our Poop Predictor app is only running on a single server, it’s not going to have this useful characteristic.
Both are important components of making an application fit it into the total value system of Dookie the company. Dookie is a startup. The team wants the app to become incredibly popular and still delight users. And if hockey stick growth doesn’t happen this month, then they want to keep their infrastructure costs as small as possible to elongate their runway as much as possible. Can the Scoop Sense implementation do any better here?
The distributed design of our Spark ML app puts us in a better place here. At the task level, within a given instance of the pipeline, we’re definitely splitting up our inputs across multiple processing units. If we presume either an independent execution service like EMR or a cluster manager, then our various calls to start learning new poop prediction models will be matched with yet more resources.
The Spark framework underlying our Scoop Sense app also gives us the ability to automatically dial down our resources when the supreme leader pulls the plug on our promising Asian expansion. If Scoop Sense is executed with YARN as its cluster manager on the latest version of Spark. This functionality is called dynamic resource allocation, which sounds like a more verbose way of saying “elasticity.”
This isn’t a magic trick, though. Spark doesn’t buy and configure our self-hosted servers. If we’re running a Hadoop cluster configured to use YARN, then we’re still paying one way or another and doing much of the hard work. More servers aren’t going to magically appear unless we’re running in the cloud and we make the correct API calls to instantiate them. Elastic scalability in response to variable traffic is an incredibly difficult problem and no single API call can solve it.
Conceptually, what we might want is, if Dookie is hosting a Hadoop cluster, to have the ability to send excess work to the cloud when the on site cluster is over utilized. I don’t have a code example for how to pull this particular trick off, because, frankly, it’s hard to do. Sophisticated cluster managers like YARN and Mesos have the ability to monitor cluster utilization and propagate signals about that state back up to some other supervisory application. But let’s say that this additional level of supervision hasn’t been implemented in Scoop Sense and leave it for a future user story.
It’s worth taking a moment to understand why elasticity was so hard to achieve in the Poop Predictor implementation but so much easier in the Scoop Sense implementation. Certainly part of the reason is that we assumed a clustered deployment of Scoop Sense. Spark and its associated cluster managers are doing much of the heavy lifting for our app. But it’s worth digging a bit deeper into the how and why of Spark, given how valuable many of its design decisions have proven on our quest for reactive machine learning applications.
As in other reactive systems, our machine learning pipeline is built on a bedrock of functional programming principles. Spark is implemented in Scala and makes extensive use of the functional programming facilities that the language provides.
For example, in Spark, we’re using higher order functions to pass a function to our dataset, such as when we load our data and parse it:
This isn’t just a metaphor. A given portion of our data is located on a given node in our Spark cluster. The Spark framework will send the instructions to execute the line splitting and LabeledPoint creation to each of those nodes; the program will go to the data, not the reverse. This is one of the fundamental insights behind the map reduce programming model popularized by Hadoop. As we’ve seen above, splitting up our data is going to have a huge impact on our ability to scale up or down as needed.
Higher order functions aren’t the only technique from functional programming that enable our Scoop Sense app to have such a reactive design. Another crucial strategy in use is immutability. Without diving too deep into the topic, thinking in immutable terms allows us to think in a more mathematical sense, viewing our ML pipeline as a series of transformations that yield new datasets, without mutating the original datasets. This mental model and execution strategy make it much easier to think in terms that make sense on a developer’s laptop but scale up to massive clusters.
I could easily go off on a rant about why functional programming is crucial for modern data engineering and what it means for your career, but I won’t. If you see me at a meetup or at a bar, that’s probably what I’m doing anyway, and the show is much better in person. Instead, I’ll press on to our final principle of reactive systems to see how our ML apps measure up.
In the previous sections we’ve fleshed out a lot of the rationale behind why our ML app needs to be broken up into smaller pieces that are clearly separated so that they can be supervised and responded to in the event of varying load or failure. The final principle, being message driven, is a key element of how you could actually implement a system with all of these properties. In a reactive system, asynchronous message passing is used as the communication mechanism across loosely coupled components.
Message passing plays a key role in allowing a system to be elastic, resilient, and ultimately response. When our ML pipeline increases its runtime over an expectation that information can be conveyed back to our cluster manager to get more resources for things like speculative execution. When the new Paris model explodes on some data that includes an accent aigu, that won’t be a problem for our Chicago model.
We won’t have to step in that flaming pile, scrape that mess off of our shoe, and threaten Billy Madison’s life.
In a message-driven system, failure is just another message.
It can be responded to or not, as your app requires. If we’ve properly designed for resilience, maybe we’re already handling model learning failure with a default model. Or maybe the business rule is to just not show new poop predictions in the case of failure. Paris’ reported levels of poop may be sufficient to maintain your expected user experience, so predicted dookies may simply not be required. Alternatively, the failure can be conveyed back to the supervisory layer of our application for further action as a message not as a runtime exception that must be handled immediately. Then we have a choice on what we do next when failures are just messages.
It sounds like we might be in good shape for our final principle, but let’s consider a somewhat trickier example. Let’s see if we can flesh out a bit more about the role that being message-driven plays in being reactive.
As a user engages with the Dookie mobile app, he or she is sending back data to Dookie’s data collection systems. This data is things like user-entered poop sightings or (more frequently) passive data being submitted about locations the user has passed without reporting unattended poop. Ultimately, this data will be used by the feature extraction part of the pipeline to learn new models. So, how is that data ingested into our ML app?
In the case of Poop Predictor, that data is sent in irregular batches from the mobile app back to a data collection API on a backend server. That data collection application is actually now sort of part of our ML app, and it’s not at all message-driven. The data is actually sent in the form of extracted features for direct writes to the relational feature values database. Worse, it directly writes those feature instances to the database before confirming to the mobile app that the request has been processed.
This means that this communication is not asynchronous. Also, we’re probably also in a bad spot in terms of resilience and elasticity. We’re not handling errors, or we’re forcing the handling of those errors back to our mobile app! There’s certainly no functionality to manage the load on our writes to the database or our utilization of the data collection application itself.
Thankfully, the team at Dookie built on everything that they learned with Poop Predictor when they built out Scoop Sense. In the latest version of the Scoop Sense ML platform, they’ve actually built on top of PredictionIO, a sophisticated ML platform built on top of Spark (and several other cool technologies).
In the PredictionIO-based version of the Scoop Sense platform, data is sent back to the event server via API calls from client libraries. This brings us a lot closer to the message passing ideal. The event server is primarily responsible for just receiving these messages. It doesn’t need to engage in some sort of blocking validation that the event messages fit some expected business logic. They’re just messages.
This completely decouples the data collection phase from the model learning feature extraction pipeline logic. If a new version of the mobile client app starts sending back nonsense events, the event server can still receive them all without having to reject them at receipt because they don’t fit into the relational data model of what a feature should be. As long as they are valid event messages, those messages can be persisted, and nothing forces the feature extraction pipeline to use them, if they don’t pass the data validation logic.
Perhaps most importantly, this completely disconnects the issues of vertical and horizontal scaling for our data collection process from our feature extraction pipeline. As long as we’re just passing messages via some persistence integration layer (e.g. Kafka), we can have dramatically different footprints for data collection and the feature extraction and model learning pipelines. This then enables elastic scaling of those resources, so the Dookie team can spin up a bunch of resources nightly to learn new models while keeping the footprint of the data collection application constant.
Done, Not Done
Now that I’ve tried to answer my own question, I have to say that I find it even more interesting.
How do the principles of reactive systems relate to machine learning applications?
I’ve tried to give this question as thorough of a treatment as will fit into a blog post, but I still have many questions that I’d like to dig into:
- Could we find a better definition for responsiveness?
- What does responsiveness really mean in the context of the collect data-extract features-learn model-repeat loop?
- Could we derive the reactive design principles purely from the principles of functional programming? Or is it possible to build a reactive ML system without ever using higher order functions, immutability, etc.?
- What are the best ways of bringing these techniques to languages without native multithreading capabilities?
- Can you use Spark and still wind up with a non-reactive system?
- Are there design anti-patterns that lead people towards building machine learning systems that they think are reactive but really aren’t?
- Is predicting locations of dog poop really the highest and best use of sophisticated machine learning techniques?
- Where should elasticity logic live when your app is deployed to the cloud?
- How does the implementation of the model server affect the reactiveness of the overall application?
- Who watches the watchmen? Do we need to supervise our distributed, fault-tolerant scheduler?
- How does our reactive machine learning application change when our model learning algorithm is an online learning or reinforcement learning algorithm?
- How does the concept of a lambda architecture relate to a reactive architecture in the context of a machine learning system?
There’s definitely more to consider here, and I promise to get back to these questions. I’d love to hear your perspective on this topic, so check out the contact info below and let me know what you think. In the meantime, watch where you step!
If you want to hear more about reactive machine learning, sign up for my mailing list at reactivemachinelearning.com.
I’m a data engineer and sometimes cartoonist, based in New York. I hack on machine learning systems at x.ai. I blog mostly here on Medium. You can find more of my comics and collaborative art project in the Comics in Beta collection.
Finally, I promise that I always pick up my dog’s poop.
References and Credits
Much of the reactive systems material is obviously based on The Reactive Manifesto. I also heavily referenced Reactive Design Patterns by Kuhn and Allen and highly recommend buying and reading the book! I also found the presentations of Dean Wampler and Henrik Engström very helpful. I highly recommend the Principles of Reactive Systems course on Coursera.
To learn and understand Spark, I recommend starting with Learning Spark. For a more machine learning-focused perspective on the platform, I highly enjoyed Advanced Analytics with Spark. Of course, you can (and should!) dive right into Spark with just the docs.
Finally, for an interesting perspective on a modern machine learning architecture built using Spark, PredictionIO is definitely worth understanding.
Final Python Note
I really do love Python, and I realize that this post could be read as indictment of Python. It’s not. When they put me in the ground, I hope that my tombstone will be enscribed with significant whitespace and not a single semicolon.
But I build and love powerful distributed systems. The best way that I know how to do that from Python is Spark. Seriously, it’s a very real option. The impedance mismatch issues are more significant than if you use Scala or Java to talk to Spark, but the cognitive overhead may be worth it for the capabilities it brings to your app. That’s for you to decide. I would just make the argument that attempting to build a reactive machine learning application is an ambitious goal, well worth shooting for.