If you haven’t already taken the plunge into machine learning, Loud ML 1.5 is a great place to start. Here’s why.
Devops and reliability engineering are broad roles. They're not just about whether services are running or not; they are data-driven roles, where decisions are made based on observations. We've reached an exciting era in machine learning, and the team behind Loud ML is working on some interesting features for Loud ML 1.5 that will help devops get the most automation from AI-driven systems in 2019.
Unlike software development, which takes strictly defined, validated input and produces expected output, machine learning learns from its own results, continually training on new input to improve gradually over time.
Here, we'll talk about new features and our 2019 roadmap. We've broken it down into the following categories:
- Performance improvements
- Incremental retraining and checkpoints
- Better strategies to filter the noise and spot true abnormal data
- Better strategies to manage business rules
- Dynamic data sources and data sinks
- JWT authentication
- Better forecasts
Performance improvements
Devops and infrastructure analytics providers have thousands, maybe millions, of metrics to monitor: far more than humans can deal with. AI can help, taking care of these metrics in a scalable way. It no longer requires a massive data centre to crunch the numbers: thanks to tools like Loud ML, you can gain valuable insights with a handful of servers crunching terabytes. Release 1.5 will be our first distributed release, running across nodes, CPUs and GPUs. This horizontal scaling means you can distribute the training load and/or the inference load across a cluster. It also means higher availability.
In addition, we all want our machine learning jobs to be cost effective and resource friendly. Loud ML 1.5 leverages the latest hardware technology for speed and efficiency. As a result, Loud ML will run up to eight times faster on Intel and NVIDIA architectures. MKL-DNN (Intel Math Kernel Library for Deep Neural Networks) is our new standard for Intel architecture packages. We're considering implementing ARM support if there's enough demand. If you'd like this, please let us know!
Lastly, we’ve also upgraded to TensorFlow 1.12 to leverage new functionality and to stay up to date.
Incremental retraining and checkpoints
In "business as usual" situations, minor changes in the data will be learned automatically, so the model improves over time and false-positive noise is reduced.
There might be times when your model learns something you don’t want it to. For example, your data centre loses power for a weekend and your model starts retraining on unexpected irregular data. What you’d really like to do is wind back to a point in time when the data was clean (when the data centre was still running), and ignore the unexpected irregular data altogether — without having to retrain your model from scratch. Many of you asked for this functionality, so we built it using checkpoints.
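To make the idea concrete, here is a minimal sketch of rollback-to-checkpoint. It is illustrative only and not Loud ML's actual API: the toy "model" tracks a single running mean, whereas Loud ML's checkpoints operate on real model state.

```python
import bisect

class CheckpointedModel:
    """Toy incremental model that snapshots its state after each
    training batch and can roll back to an earlier clean state."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self._checkpoints = []  # list of (timestamp, count, mean)

    def train(self, timestamp, values):
        """Incrementally update the running mean, then checkpoint."""
        for v in values:
            self.count += 1
            self.mean += (v - self.mean) / self.count
        self._checkpoints.append((timestamp, self.count, self.mean))

    def rollback(self, timestamp):
        """Restore the latest checkpoint taken at or before `timestamp`
        and discard everything learned after it."""
        times = [t for t, _, _ in self._checkpoints]
        idx = bisect.bisect_right(times, timestamp) - 1
        if idx < 0:
            raise ValueError("no checkpoint at or before that timestamp")
        _, self.count, self.mean = self._checkpoints[idx]
        del self._checkpoints[idx + 1:]

model = CheckpointedModel()
model.train(1, [10, 10])     # clean data
model.train(2, [10, 10])     # clean data
model.train(3, [0, 0, 0])    # the power-outage weekend: mean is polluted
model.rollback(2)            # wind back to the last clean state
print(model.mean)            # 10.0
```

The key point is that rolling back is a constant-time state swap, not a retrain from scratch.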
Better strategies to filter the noise and spot true abnormal data
Imagine if all your mobile phone alerts were the same — a single ‘ding’ for phone calls, social media, texts from your partner and all your other app alerts. You’d be looking at your phone so often that, eventually, you’d stop looking altogether, missing those all-important texts and risking the end of a relationship. Devops face this problem when alerts are created too regularly. What if you could filter out that noise and be alerted to just the important information?
Loud ML 1.5 will cut out the noise, providing you with loud but limited, relevant information. It will be able to spot true abnormal data that would otherwise be missed by the average human being.
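One simple way to picture this kind of filtering is a persistence rule: only fire an alert when the anomaly score stays high for several consecutive points. This is a hand-rolled sketch assuming per-point scores in [0, 1], not Loud ML's internal algorithm.

```python
def filter_alerts(scores, threshold=0.9, persistence=3):
    """Yield the index at which an anomaly score has stayed above
    `threshold` for `persistence` consecutive points.

    Single spikes (the 'ding' for every text message) are ignored;
    only sustained excursions produce an alert."""
    run = 0
    for i, score in enumerate(scores):
        run = run + 1 if score > threshold else 0
        if run == persistence:  # fire once per sustained excursion
            yield i

scores = [0.2, 0.95, 0.3, 0.92, 0.97, 0.99, 0.4, 0.91]
print(list(filter_alerts(scores)))  # [5] -- one alert, not four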
Better strategies to manage business rules
Let’s say your data center business has just signed a lucrative deal with a new supplier. Now, your old servers have new drives that spin up and down much faster, saving energy and resources. Result! But your data metrics just changed overnight, showing a big drop in server usage and appearing to be an anomaly. We know it’s not, and a truly smart ML system must understand when norms change in business rules. Loud ML 1.5 will provide the toolset needed to cope with such situations.
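The drive-upgrade scenario can be sketched as a rebaselining rule: a deviation that persists long enough is treated as a changed business rule rather than an endless anomaly. The thresholds and labels below are illustrative assumptions, not Loud ML's mechanism.

```python
def classify(points, baseline, tolerance=0.3, persistence=5):
    """Label each point 'normal', 'anomaly', or 'new-normal'.

    If a deviation from `baseline` persists for `persistence`
    consecutive points, adopt the new operating level instead of
    alarming forever (the norm itself has changed)."""
    labels = []
    run = 0
    for p in points:
        if abs(p - baseline) <= tolerance * baseline:
            labels.append("normal")
            run = 0
        else:
            run += 1
            if run >= persistence:
                baseline = p  # the business rule changed: rebase
                labels.append("new-normal")
                run = 0
            else:
                labels.append("anomaly")
    return labels

# Server usage drops overnight from ~100 to ~40 after the drive upgrade:
print(classify([100, 98, 40, 41, 39, 42, 40, 41], baseline=100))
```

After five sustained low readings the model accepts ~40 as the new normal, so subsequent readings near 40 stop raising alerts.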
Dynamic data sources and data sinks
Devops are busy enough without having to spend time moving data around. Loud ML 1.5 will therefore enable you to choose where, how and whether output data from inference jobs is saved, and will integrate easily with third-party software, such as alerting tools.
For example, your time series data monitors the load balancing on a series of servers. You might want certain alerts to be routed to certain third-party tools, as discussed in our previous post about sending intelligent machine learning-driven alerts to Slack and PagerDuty.
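A routing layer like that can be as simple as mapping anomaly severity to a destination and payload. The rules and sink names below are hypothetical; adapt them to your own Slack and PagerDuty integrations.

```python
def route_alert(anomaly):
    """Choose a destination sink and build its payload for one anomaly.

    Hypothetical policy: critical anomalies page someone via
    PagerDuty; everything else goes to a Slack channel."""
    if anomaly["score"] >= 0.95:
        return {
            "sink": "pagerduty",
            "payload": {"summary": anomaly["metric"], "severity": "critical"},
        }
    return {
        "sink": "slack",
        "payload": {"text": f"Anomaly on {anomaly['metric']} (score {anomaly['score']:.2f})"},
    }

print(route_alert({"metric": "lb-01.load", "score": 0.97})["sink"])  # pagerduty
print(route_alert({"metric": "lb-02.load", "score": 0.80})["sink"])  # slack
```

In a real pipeline the returned payload would be POSTed to the corresponding webhook URL.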
JWT authentication
For applications hosted in the cloud, security is a must. When implementing a machine learning tool such as Loud ML 1.5, you'll probably want to control things like resource allocation per user and authentication levels to ensure your ML jobs are running with the right privileges.
To achieve this, Loud ML 1.5 will include JWT authentication and read/write access to the /datasources API. This means you can define JWT tokens to authenticate all queries sent to the /datasources API and dynamically add or delete data sources, providing end-to-end security in a multi-tenant environment.
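For reference, here is what an HS256 JWT looks like under the hood, built by hand with the standard library. In practice you would use a library such as PyJWT, and the claim names and endpoint below are illustrative assumptions rather than Loud ML's required schema.

```python
import base64
import hashlib
import hmac
import json
import time

def b64url(data: bytes) -> str:
    """Base64url-encode without padding, as JWT requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()

def make_jwt(secret: str, claims: dict) -> str:
    """Build a signed HS256 JWT: header.payload.signature."""
    header = b64url(json.dumps({"alg": "HS256", "typ": "JWT"}).encode())
    payload = b64url(json.dumps(claims).encode())
    signing_input = f"{header}.{payload}".encode()
    signature = hmac.new(secret.encode(), signing_input, hashlib.sha256).digest()
    return f"{header}.{payload}.{b64url(signature)}"

# Hypothetical claims for a devops tenant, expiring in one hour:
token = make_jwt("my-shared-secret", {"sub": "devops-team", "exp": int(time.time()) + 3600})
headers = {"Authorization": f"Bearer {token}"}
# The token would then accompany every query, e.g.:
# requests.get("http://localhost:8077/datasources", headers=headers)
```

The server verifies the signature with the shared secret before honouring the request, which is what makes per-tenant access control to the /datasources API possible.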
Templates are here
New to Loud ML, or ready to set up your first machine learning job? Time to AI is now faster than ever with templates. Loud ML 1.5 will come with a library of predefined templates for common metrics and use-cases, so you can select what you need and tweak it for your environment, without having to configure the entire thing. We want you to get started in minutes for typical use-cases, such as detecting anomalies in your syslog data, your nginx response times, or your disk filling ratio. We’ll include predefined templates for each one of those. Point, click, find your template, hit save, and it’s done.
Better forecasts
Using our data center example, you might need to forecast some capacity planning for your CFO. If you're running, say, 500 applications on 50 servers, when will you need the next five servers? Accurate forecasting means you can give your CFO a date for purchasing more hardware, leading to better planning for capital expenditure investment, and data-based decisions over opinions.
Loud ML already forecasts future events, but forecasting is a tricky thing: the future is uncertain and forecast accuracy tends to drop over time, regardless of future data points. Loud ML 1.5 will be able to simulate millions of possible forecast results and return the most “likely”.
For example, to forecast when you need the next five servers, your machine learning job will provide the most likely future value. Suppose, in addition, you've bounded the forecast with a 65% confidence interval. The machine learning job will then provide two further values:
- a higher value: 65% of the time, the actual value will be lower than this upper bound; and
- a lower value: 65% of the time, the actual value will be higher than this lower bound.
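The bounds above can be read as quantiles of a simulated distribution. Here is a minimal Monte Carlo sketch of that idea, assuming a list of simulated values for one future time step; it is not Loud ML's internal method.

```python
import random
import statistics

def forecast_bounds(simulations, confidence=0.65):
    """Summarize simulated forecast values into a likely value plus
    upper/lower bounds at the given confidence level.

    The upper bound is the `confidence` quantile (actual value falls
    below it `confidence` of the time); the lower bound is the
    symmetric quantile from the other side."""
    s = sorted(simulations)
    def quantile(q):
        return s[min(round(q * len(s)), len(s) - 1)]
    return {
        "likely": statistics.median(s),
        "upper": quantile(confidence),       # actual < upper, 65% of the time
        "lower": quantile(1 - confidence),   # actual > lower, 65% of the time
    }

# Simulate 10,000 possible server-load forecasts around 50:
random.seed(42)
simulations = [random.gauss(50, 5) for _ in range(10_000)]
bounds = forecast_bounds(simulations)
print(bounds["lower"] < bounds["likely"] < bounds["upper"])  # True
```

With a 65% confidence setting, roughly a third of outcomes will still land outside each bound, which is why the confidence level is a planning knob rather than a guarantee.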
Your feedback is driving us in the right direction
There’s all this and more in store for 2019, and we’d love to hear from you — whether you’re a first-time reader or a seasoned Loud MLer, please take a moment to answer five short questions.
We’d like to thank all of those who have supported us this year — OVH, Rockstart AI accelerator program and our partners InfluxData and Tingsense. Most of all, we’d like to thank the tens of thousands of you who have already downloaded Loud ML and tried it out for yourselves.
Finally, we wish you all a merry Christmas, and we promise to bring you lots more exciting news in the New Year!