There is no bigger data than telecommunications and Multiple System Operator (MSO) data. There is no data which affects the operations of these industries more than network telemetry. And there is no data that is less standard.
Telecommunications companies are dealing with one of the most complex data problems you could imagine:
- They are often an amalgam of companies, each of which had their own data operations
- They use different manufacturers of network machines, each with their own data format
- They use many models and firmware versions for each manufacturer
- They often have different geographies, each with their own operations
Getting a whole-of-network view is hard: it is one of the largest data roll-up exercises imaginable.
This article will show how using neural networks can reduce this time significantly—we have seen reductions in Time to Data of up to 99.998% for business as usual data pipelines.
One of Datalogue’s largest customers, a top US telco, has employed these techniques to improve time to data and increase the number of errors which are automatically responded to—leading to savings on the order of tens of millions of dollars per year in reduced operational costs.
A new industrial data workflow
Let’s say, for example, that a network provider wants to access handset telemetry from across their various data sources.
Traditionally, the company would ingest a data store, mapping the data to a schema that makes sense to them.
The company would then transform the data as necessary to feed it into the output format required.
Then, a second source appears. Another mapping process. Another pipeline.
A third source. Same again.
(If you are getting bored, imagine how the engineers feel.)
Then the structure of the first source changes without warning. The company would have to catch that, delete the mis-processed data, remap and repipe, and start again.
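The cycle above can be sketched in a few lines. This is a hypothetical illustration (the column names and mapping dictionaries are invented for the example): every new source demands its own hand-written mapping to the target schema, which is exactly the unit of labor that never gets cheaper.

```python
# Hypothetical per-source mappings: each new telemetry source needs its
# own hand-written translation from source columns to the target schema.
SOURCE_A_MAP = {"dev_id": "handset_id", "sig_dbm": "signal_strength"}
SOURCE_B_MAP = {"DeviceIdentifier": "handset_id", "RSSI": "signal_strength"}

def ingest(row: dict, column_map: dict) -> dict:
    """Rename a source row's columns into the target schema."""
    return {column_map[k]: v for k, v in row.items() if k in column_map}

row_a = {"dev_id": "H-001", "sig_dbm": -71}
row_b = {"DeviceIdentifier": "H-002", "RSSI": -65}

standardized_a = ingest(row_a, SOURCE_A_MAP)
standardized_b = ingest(row_b, SOURCE_B_MAP)
```

If source A's schema changes, `SOURCE_A_MAP` silently stops matching and the data is mis-processed until someone notices.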
A painful cycle, and one that doesn’t benefit from economies of scale. A user might get marginally faster at building a data pipeline, but it is still about one unit of labor for each new source.
… and each change in a destination too. Each iteration in the output requires more of the above as well.
That’s a lot of upfront work. A lot of maintenance work. And a lot of thankless work.
And critically, this data grooming does not support good business process. The people who know, own and produce the data, the people who live with it every day, are not the ones comprehending it here. Instead, the work falls to data engineering, analysis or operations teams, and the producers' domain-specific knowledge is largely left by the wayside. It is lost knowledge.
That means that errors are only caught once the data product loop is complete: once the data org has massaged the data according to its understanding, analyzed the data according to its understanding, and provided insights based on its understanding. Only once that insight is delivered, and is counterintuitive enough to alarm the experts, would errors be caught.
A new approach …
would be worthwhile only if it addressed the flaws in the existing processes:
- it should scale with the number of sources being ingested
- it should scale with the complexity of the problem
- it should be resilient to change in the source structure and content
- it should leverage the domain-specific expertise of the data producers
This is where a neural network based solution really shines.
A neural network based workflow would:
- use data ontologies to capture the domain knowledge of the producers
- train a model to understand those classes, in the full distribution of how they may appear
- create data pipelines that leverage the neural network based classification to be agnostic to source structure and be resilient to change in the schema.
I’ll outline the above by walking through this example.
A real world use case: making disparate handset data usable
We were given three files to work from:
- a ten thousand row training data set of US handset telemetry data (“the training data”) (this will be discussed further below)
- a 340 million row full US handset telemetry dataset (“the US telemetry data”)
- a 700+ million row European handset telemetry dataset (“the EU telemetry data”)
The training data was well known and understood by the data producers, and structured according to their wants. Crucially, this data represented their work in solving a subset of the data problem.
Designing the ontology and training a neural network is like uploading expertise
The US and EU telemetry datasets were unseen—we were to standardize all three files to the same format.
To do so, we created a notebook (see appendix for code snippets) that utilized the Datalogue SDK to:
- create a data ontology,
- attach training data to that ontology,
- train and deploy a model, and
- create a resilient pipeline that standardized data on the back of classifications made by the deployed model.
Creating an Ontology
An ontology is a taxonomy or hierarchy of data classes that fall within a subject matter area. You can think of an ontology as a “master” schema for all the data in a particular domain.
Ontologies are powerful tools for capturing domain knowledge from subject matter experts. That knowledge can then be utilized to unify data sources, build data pipelines, and train machine learning models.
In this example, the producers of the data helped us create an ontology that explained the data that we were seeing: which data is sensitive and which can be freely shared, which was handset data and which was network data, which telemetry field belonged to what.
This allows the data operators to understand and contextualize the data. Both business purpose (here, obfuscation) and context (here, field descriptions and nesting according to the subject matter) are embedded directly into the data ontology.
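To make the idea concrete, here is a toy ontology sketched as a nested Python dict. The class names, nesting, and `sensitive` flags are illustrative assumptions, not the customer's actual ontology; the point is that business purpose (obfuscation flags) and context (nesting by subject matter) live in one structure.

```python
# A toy handset-telemetry ontology. Interior keys are subject areas;
# leaves carry metadata such as whether the field is sensitive.
# All names here are invented for illustration.
ONTOLOGY = {
    "telemetry": {
        "handset": {
            "handset_id": {"sensitive": True},
            "signal_strength": {"sensitive": False},
        },
        "network": {
            "cell_tower_id": {"sensitive": False},
        },
    },
}

def leaf_classes(node, path=()):
    """Yield (path, metadata) for every leaf class in the ontology."""
    for name, child in node.items():
        if "sensitive" in child:          # leaf node: carries metadata
            yield path + (name,), child
        else:                             # interior node: recurse
            yield from leaf_classes(child, path + (name,))

leaves = dict(leaf_classes(ONTOLOGY))
```

Walking the tree gives the leaf classes that training data will later be attached to.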
Attaching training data
The next step is adding training data to the ontology. This data is used to train a neural network to understand and identify each class in the ontology.
This further embeds the domain experts’ knowledge of the data to the process (as their knowledge of the training data set allowed them to perfectly map that data to the ontology).
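A minimal sketch of what "attaching training data" amounts to, assuming invented field names and example values: each leaf class gets labelled example strings, which become (string, label) pairs for training.

```python
# Hypothetical labelled examples attached to ontology leaf classes.
# Values are invented (IMEI-like IDs, dBm readings, tower IDs).
training_data = {
    "handset_id": ["359881030314356", "356938035643809"],
    "signal_strength": ["-71", "-88", "-60"],
    "cell_tower_id": ["310-410-12345", "310-260-67890"],
}

def to_examples(attached: dict) -> list:
    """Flatten class -> strings into (string, label) training pairs."""
    return [(s, label) for label, strings in attached.items() for s in strings]

examples = to_examples(training_data)
```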
Training and deploying a model
Once we have the training data attached to each leaf node in the ontology, the user can automatically train a neural network model to classify these classes of data.
The default option trains a model that takes each string in as a series of character embeddings (a matrix representing the string), and uses a very deep convolutional neural network to learn the character distributions of these classes of data.
This model also heeds the context of the datastore—where the data themselves are ambiguous, other elements are considered, such as neighboring data points and column headers.
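The "string to character-embedding matrix" step can be sketched as follows. This uses one-hot rows for clarity; a trained model would use learned embeddings with a deep CNN on top, and the alphabet and length cap here are assumptions for the example.

```python
# Minimal sketch: encode a string as a fixed-size character matrix,
# the input representation the convolutional classifier consumes.
ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789-. "

def embed(text: str, max_len: int = 16) -> list:
    """Encode a string as a max_len x len(ALPHABET) one-hot matrix."""
    matrix = []
    for ch in text.lower()[:max_len].ljust(max_len):
        row = [0] * len(ALPHABET)
        idx = ALPHABET.find(ch)
        if idx >= 0:          # characters outside the alphabet -> all-zero row
            row[idx] = 1
        matrix.append(row)
    return matrix

m = embed("-71")
```

Strings of any length become same-shaped matrices, which is what lets one model compare a signal reading against a handset ID.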
This “on rails” option will be sufficient for most classification problems, and allows a non-technical user to quickly create performant models.
Where there is a bit more time and effort available, and where more experimentation and better results may be required, a ML engineer can use “science mode” to experiment with hyperparameter tuning, and generally have more control over the training process.
Model Performance Metrics
Once the model has been trained, the user is able to see the performance of the model on the validation and test sets, with stats like:
- model wide statistics (F1 score, precision, recall, etc.)
- confusion matrices
- class specific statistics
- training statistics (loss curve, etc.)
As you can see, this model is able to, with little work, disambiguate the telemetry classes effectively.
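For readers who want the definitions behind those statistics, here is how per-class precision, recall, and F1 are computed, shown on a toy two-class validation set (the labels and predictions are invented for illustration).

```python
# Per-class precision / recall / F1, the class-specific statistics
# reported after training.
def precision_recall_f1(y_true, y_pred, cls):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Toy validation set: one handset_id was misclassified.
y_true = ["handset_id", "handset_id", "signal_strength", "signal_strength"]
y_pred = ["handset_id", "signal_strength", "signal_strength", "signal_strength"]
p, r, f1 = precision_recall_f1(y_true, y_pred, "handset_id")
```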
Now that we have a model, we can use it to create pipelines that work for both the European and US data stores, and are resilient to changes in the incoming schemata of these sources.
This pipeline has some novel concepts:
- the pipeline starts with a classification transformation—the neural network identifying the classes of data (either on a datapoint or column level) with reference to the ontology
- the later transformations (such as the structure transformation) rely on the aforementioned classification—if the schema changes, or if a new source file is used, the pipeline need not be changed. The pipelines don’t rely on the structure or column header for transformation, but rather the classes as determined by the neural network
The marginal cost for adding a new source, or remediating a changed source schema is now only the cost of verifying the results of the model—no new manual mapping or pipelining required.
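The classification-first idea can be sketched end to end. Here the neural classifier is mocked with a simple rule on sample values (a stand-in, not the real model), but the structural point holds: the pipeline keys every downstream transformation off predicted classes, never off headers, so the same code standardizes both the US and EU tables.

```python
# Sketch of a classification-first pipeline. classify_column is a
# stand-in for the deployed neural classifier.
def classify_column(values):
    """Label a column by its content, ignoring its header entirely."""
    if all(v.lstrip("-").isdigit() and int(v) < 0 for v in values):
        return "signal_strength"
    return "handset_id"

def standardize(table: dict) -> dict:
    """Rebuild a table keyed by predicted class, not by source header."""
    return {classify_column(vals): vals for vals in table.values()}

# Two sources with entirely different headers (names invented):
us = {"dev_id": ["359881030314356"], "sig_dbm": ["-71"]}
eu = {"DeviceIdentifier": ["356938035643809"], "RSSI": ["-65"]}

us_std = standardize(us)
eu_std = standardize(eu)
```

Renaming a source column, or adding a fourth source, changes nothing in the pipeline itself: only the classifier's predictions need verifying.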
The resulting data
From completely differently named and structured sources, we now have a standardized output:
Now you have a clean dataset: one datastore to be used for advanced analytics.
Neural network based data pipelines were measured as being 99.998% faster than traditional methods
That faster time to data means:
- less time spent on reporting and modelling
- faster time to getting a single view of your network
- faster time to issue detection
- faster time to issue resolution
And for the aforementioned telco, tens of millions of dollars in savings per year.
The above was a simple example used to highlight the model creation, deployment and pipelining.
It used pipelines from just three sources.
In deployment, for the above telco, more than 100k pipelines are created each month, and that number is growing exponentially.
100,000 pipelines created per month