Taming Hadoop one xml tag at a time

Matthieu
Matthieu
May 16, 2017 · 5 min read
Image for post
Image for post

The vision

Today, what we see is very spectacular usages of AI like autonomous cars, very impressive image analysis, conversational agent, speech recognition, etc. It’s indeed really awesome cases to promote these technologies and as they are still costly to develop, it explains why only very disruptives use cases emerge.

But we really believe that with the comoditisation of Machine Learning and Deep Learning, their usage will spread into more and more softwares for a very good reason: where classic development tools are able to automate process that some humans designed, AI can automate what did not exists or was not known when software was written.

In short: where traditional software design has to choose between customisability and genericity, AI can adapt to every single situation without the tradeoffs.

That’s why we decided to build a Datascience team at Linagora: we want to have the best tools to design software that you will love.

The Team

So we started with a mix: a datascientist willing to learn software development and a developer willing to learn machine learning. And we choose the most trendy framework to make sure we would not be alone with our problems, so we started with Spark in Scala.

The first predictions

And we do all that with Agile methods because we want to integrate well with the whole OpenPaaS project.

We are not fast, for sure, but we are getting better every single day and we ship things.

We soon had a first prediction API to plug in the project. This API wrapped an embedded Spark (for latency purpose) and stored model in memory at first. Nothing production ready but enough to eat our own dogfood and start plugin an instance with our internal system.

But then, the next step was unexpectedly complex: we needed to run the training process on a cluster to make sure it would be fast enough, and we found no easy way to do that.

The Ops part

But every solutions we tried failed miserably and we soon discovered that there’s so much complexity in the hadoop ecosystem that once something goes wrong, you’ll never find the solution if you were not able to configure everything by yourself in the first place.

And here we go: we organized a 3-days session with the right people to handle that. Let’s call it the “Hadoop Barcamp”.

Hadoop Barcamp

Staffing

Second, we need an Hadoop expert to go right to the goal : I found a great freelancer, Fabien.

Finally, Luc and I were here to ask question and try to understand how everything fits together.

Objectives

Architecture

The simplest solution we found to run a Spark instance on these pieces of hardware is to configure Hadoop Yarn and Hadoop HDFS and to make every components run without security on the private network.

Then, we made sure the firewall blocked everything from the internet and we put a reverse proxy to access a few HTTP interfaces remotely.

Yarn and HDFS can be set up in a way that ensure no single point of failure but one need to configure a ZooKeeper to handle election. Everything is then much harder.

To make things simpler, Yarn and HDFS were installed with fixed roles for each server : the master is always to be found on the exact same address but data are safe because they are replicated anyway. In case of failture, it simply means you have to bring the failing system back for the service to be available.

Once you get these services talking to each other, all you have to do is to teach your spark-submit where Yarn is and everything will fall in place.

Results

Next steps

Now that everything is in place, and that hopefully we now know how to manage the cluster we set up, we have to work on easing operations: put a centralized log system (ELK to the rescue), keep cluster metrics to monitor how well it performs (grafana is already there), create some alerting, etc.

The second thing to work on is to teach our prediction API to delegate model building to the spark cluster so that it becomes a kind of gateway to our Spark cluster.

Conclusion

Now that we sort out the infrastructure side of things, we are now looking forward to deliver smart features in OpenPaaS for our customers and we’ll write about that soon.

(1) : a model in machine learning is the output of a learning on a dataset that allows to predict or classify new things.

Linagora Engineering

We are Open Source Engineers, Hacking Awesome Stuff

Welcome to a place where words matter. On Medium, smart voices and original ideas take center stage - with no ads in sight. Watch
Follow all the topics you care about, and we’ll deliver the best stories for you to your homepage and inbox. Explore
Get unlimited access to the best stories on Medium — and support writers while you’re at it. Just $5/month. Upgrade

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store