Taming Hadoop, one XML tag at a time

The vision

At Linagora, we think that software will increasingly use AI for everything.

Today, what we see are very spectacular uses of AI: autonomous cars, impressive image analysis, conversational agents, speech recognition, etc. These are indeed great showcases for these technologies, and because they are still costly to develop, only very disruptive use cases emerge.

But we really believe that with the commoditisation of Machine Learning and Deep Learning, their usage will spread into more and more software, for a very good reason: where classic development tools can automate processes that humans designed, AI can automate what did not exist or was not known when the software was written.

In short: where traditional software design has to choose between customisability and genericity, AI can adapt to every single situation without the tradeoffs.

That’s why we decided to build a data science team at Linagora: we want to have the best tools to design software that you will love.

The Team

But building a data science team is not easy. Fitting prediction models on a dataset is one thing, but writing software that just works is another thing entirely, and people who are good at both are hard to find (did I say we are hiring?).

So we started with a mix: a data scientist willing to learn software development and a developer willing to learn machine learning. And we chose the trendiest framework to make sure we would not be alone with our problems, so we started with Spark in Scala.

The first predictions

We are learning a lot: we rewrite things every time we learn a better way, and we simplify as much as possible when we discover new parts of the framework.

And we do all that with Agile methods because we want to integrate well with the whole OpenPaaS project.

We are not fast, for sure, but we are getting better every single day and we ship things.

We soon had a first prediction API to plug into the project. This API wrapped an embedded Spark (for latency reasons) and, at first, stored the model in memory. Nothing production-ready, but enough to eat our own dog food and start plugging an instance into our internal systems.

But then, the next step was unexpectedly complex: we needed to run the training process on a cluster to make sure it would be fast enough, and we found no easy way to do that.

The Ops part

We first thought that running a few Docker commands and/or using a well-crafted Hadoop distribution would take a day or two, and that we would then go back to development tasks.

But every solution we tried failed miserably, and we soon discovered that there’s so much complexity in the Hadoop ecosystem that once something goes wrong, you’ll never find the fix unless you were able to configure everything yourself in the first place.

And here we go: we organized a 3-day session with the right people to handle that. Let’s call it the “Hadoop Barcamp”.

Hadoop Barcamp

Staffing

First, we needed an Ops, a real one. Not a dev pretending to automate the entire world with immutable infrastructure, infrastructure as code and all the trendy concepts you could think of (self-blaming ends here): we asked Yann to join us.

Second, we needed a Hadoop expert to get straight to the goal: I found a great freelancer, Fabien.

Finally, Luc and I were there to ask questions and try to understand how everything fits together.

Objectives

Not much, actually: we wanted to build our models (1) on a cluster of servers, whatever the solution.

Architecture

We have 3 servers connected by a 1 Gbit/s private network.

The simplest solution we found to run Spark on this hardware is to configure Hadoop YARN and Hadoop HDFS, and to run every component without security on the private network.

Then, we made sure the firewall blocked everything coming from the internet, and we set up a reverse proxy to access a few HTTP interfaces remotely.

YARN and HDFS can be set up in a way that ensures there is no single point of failure, but one then needs to configure ZooKeeper to handle leader election, and everything becomes much harder.

To make things simpler, YARN and HDFS were installed with fixed roles for each server: the master is always found at the exact same address, but the data is safe because it is replicated anyway. In case of failure, it simply means you have to bring the failing system back up before the service is available again.
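
To give an idea of what fixed roles look like in configuration, here is a minimal sketch that renders the kind of XML found in the Hadoop site files. The property names (fs.defaultFS, dfs.replication, yarn.resourcemanager.hostname) are standard Hadoop settings, but the hostname, the port and the replication factor below are assumptions for illustration, not our actual values, and the hadoop_site_xml helper is hypothetical.

```python
from xml.sax.saxutils import escape


def hadoop_site_xml(properties):
    """Render a dict of settings as a Hadoop *-site.xml document."""
    lines = ["<configuration>"]
    for name, value in properties.items():
        lines.append("  <property>")
        lines.append(f"    <name>{escape(name)}</name>")
        lines.append(f"    <value>{escape(str(value))}</value>")
        lines.append("  </property>")
    lines.append("</configuration>")
    return "\n".join(lines)


# Hypothetical address of the server that always plays the master role.
MASTER = "master.cluster.internal"

# core-site.xml: every client finds the HDFS NameNode at one fixed address.
# Port 8020 is an assumption; use whatever your NameNode listens on.
core_site = hadoop_site_xml({"fs.defaultFS": f"hdfs://{MASTER}:8020"})

# hdfs-site.xml: keep several copies of each block (3 is assumed here).
hdfs_site = hadoop_site_xml({"dfs.replication": 3})

# yarn-site.xml: the ResourceManager also lives on the fixed master.
yarn_site = hadoop_site_xml({"yarn.resourcemanager.hostname": MASTER})

print(core_site)
```

In a setup like this one, the same files would typically be copied to every node, so that each component looks for the master at the same fixed address.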

Once you get these services talking to each other, all you have to do is teach your spark-submit where YARN is, and everything will fall into place.
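
Teaching spark-submit where YARN is mostly comes down to two things: passing --master yarn, and pointing the HADOOP_CONF_DIR environment variable at a directory holding the cluster's client configuration files. The sketch below only assembles the command and the environment. The jar name, the main class, the HDFS path and the configuration directory are hypothetical placeholders, and the cluster deploy mode is an assumption.

```python
import os
import shlex


def spark_submit_command(app_jar, main_class, app_args=()):
    """Build the argument list for submitting a Spark job to YARN."""
    return [
        "spark-submit",
        "--master", "yarn",            # let YARN schedule the job
        "--deploy-mode", "cluster",    # assumption: the driver runs on the cluster
        "--class", main_class,
        app_jar,
        *app_args,
    ]


def yarn_client_env(hadoop_conf_dir, base_env=None):
    """Return a copy of the environment with HADOOP_CONF_DIR set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HADOOP_CONF_DIR"] = hadoop_conf_dir
    return env


# All names below are hypothetical placeholders.
cmd = spark_submit_command(
    "model-training.jar",
    "com.example.TrainModel",
    ["hdfs:///models/latest"],
)
env = yarn_client_env("/etc/hadoop/conf")

# Print the command instead of running it, so the sketch works without a cluster.
print(" ".join(shlex.quote(arg) for arg in cmd))
```

To actually launch the job, the same cmd and env can be handed to subprocess.run(cmd, env=env, check=True) on a machine where spark-submit is installed.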

Results

After 3 days of configuration and discussion, we managed to launch our Spark job manually on our cluster. It means that we are able to train our models 3 times faster than before. We are also able to store the models on HDFS, which is basically a distributed, replicated filesystem that Spark works well with and that makes model storage safe.

Next steps

Every time you take a step forward, you discover new paths to explore, and this barcamp was no exception.

Now that everything is in place, and that we hopefully know how to manage the cluster we set up, we have to work on easing operations: set up a centralized logging system (ELK to the rescue), keep cluster metrics to monitor how well it performs (Grafana is already there), create some alerting, etc.

The second thing to work on is teaching our prediction API to delegate model building to the Spark cluster, so that it becomes a kind of gateway to that cluster.

Conclusion

As usual, we learned a lot of things:
* The Hadoop ecosystem is really complex and modular, and every limitation or design issue has been solved by introducing a new component, feeding the ever-increasing complexity
* Finding Hadoop configuration information is a mess, and Fabien, our Hadoop expert, was a really great resource for that
* The first company to replace Hadoop with something easier to operate will probably become very rich (:
* Data science needs a very broad set of skills, and building the right team is very important (if you think you can help us, we are hiring!)
* Asking Fabien to teach us Hadoop was a very good decision and made this Hadoop Barcamp successful

Now that we have sorted out the infrastructure side of things, we are looking forward to delivering smart features in OpenPaaS for our customers, and we’ll write about that soon.

(1): in machine learning, a model is the output of training on a dataset; it makes it possible to predict or classify new things.