Using Machine Learning Powered Static Analysis to Identify Logging Privacy and Security Issues

Philips
Philips Technology Blog
11 min read · Mar 7, 2024

Authors: Rafael Medeiros de Farias Vaz and Fernando José Vieira, Software Excellence Team, Philips

Abstract

This study highlights the results obtained from a proof of concept involving the usage of machine learning (ML) models to classify potential privacy and security violations in source code log entries.

Privacy and security risks are present in every software development effort. Because logging can materialize these risks, mitigation processes such as static analysis of the product code and pattern matching in log file samples are common; their low effectiveness overloads engineers with the time-consuming task of auditing code and log samples.

As an alternative, we trained machine learning models with rich datasets from Java and C# codebases. Following a consistent methodology, the results show that ML models can outperform static analysis tools and are potentially applicable to any programming language.

Introduction

Software behavior, due to its nature, can be hard to observe. Things get harder when the development team does not know, and cannot simulate, what led to an unwanted side effect or a system failure. To mitigate the observability issue, most development teams adopt logging, whose efficiency is directly related to the experience of engineers, code standards, and log patterns. A bad logging implementation can exacerbate security, performance, operational, and privacy risks.

In 2018, Twitter (now X) had to request its users to change their access credentials after it was found that 330 million unredacted passwords were accidentally logged internally. This happened only two days after the same thing happened to GitHub. While there was no indication of an actual breach on either account, the episodes highlighted the dangers and concerns regarding logging sensitive information in software systems.

In addition to security risks, data privacy has been an important concern in the healthcare industry since its inception. HIPAA data breach statistics show that breaches have been trending up since the numbers started to be summarized. The publication of data privacy laws (like the EU’s GDPR and Brazil’s LGPD) and the emergence of big data, with companies storing detailed and sensitive data about their customers, turned this concern into a liability for the IT industry as a whole.

To mitigate the risks associated with logging in the software development process, teams have adopted code standards, checks in code review, and traditional static analysis. As code review outcomes vary and traditional static analysis is not able to identify all potential log issues, auditing of source code and log file samples was adopted as a safeguard and as proof of non-functional requirement coverage.

However, auditing log files is complicated, since separating samples and analyzing them is a time-consuming and error-prone activity; it requires computing power and human effort. Traditionally, samples of logs collected from test and production environments are analyzed through exploratory checks. Resourceful engineers build their own tools to facilitate this analysis; it is very common to find scripts that compare log entries with complete strings or regular expressions to find matches. Moreover, this auditing occurs too late in the development process, requires too many resources, and results in low accuracy.

The process of auditing source code is better, but still full of pitfalls. A typical process involves an algorithm that matches all potential log calls in the code and segregates the suspected cases for further analysis by a qualified engineer. Building the algorithm is a thankless task, since a coarse filter can end up with thousands of lines to be analyzed, while a fine filter can let issues slip through.

We noticed that traditional pattern matching used to analyze source code (mainly based on regex) results in accuracies lower than 40%. The alternative approach, presented in this article, trained machine learning models with curated datasets from Java and C# source code and reached accuracies higher than 80%.

Finally, it is important to mention that analyzing source code sometimes requires an understanding of complex context and semantics. Logging a variable `A` can result in a null value or in a list of thousands of patient data records, depending on the runtime context. This issue will not be mitigated by the proposed approach.

In the following sections, we will briefly visit traditional strategies before detailing our machine learning approach.

Problem definition

Logging is essential to detect issues and misbehavior but can itself become a source of unexpected issues. Writing entire entities and configuration data to the log is sometimes necessary when debugging code locally, and even in the test phase. The real problem happens when those entries remain in the released code (due to a failure in the development process), posing a potential risk if a bad actor can read the log files, or violating customer privacy (because logs are not subject to the same privacy scrutiny and processes that databases are).

To mitigate risks, a dedicated process audits the source code or samples of log files, guaranteeing no sensitive information is included. But what is sensitive information?

According to privacy laws, personal sensitive information refers to any data related to an identifiable individual, including name, ID, geolocation, IP address, device ID, etc.

According to security specialists, it includes IP addresses, usernames, passwords, host names, and any other information that can be used to exploit a system.

The expected outcome of the source code auditing process is a backlog of issues to be fixed by the technical debt process, ideally eliminating all entries that leverage risks.

Development practices to mitigate logging risks

The software development lifecycle should include data privacy and security concerns. Here are some practices that can help prevent logging issues:

● Code Standards: to define a log pattern and what information can be logged to guide developers while including log entries in the code.

● Static Analysis: to check the code without executing it. The automation runs tools and scripts that detect and highlight potential issues.

● Code Review: a thorough code review process can ensure that sensitive data in log entries is redacted, tokenized, encrypted, or kept out of the logs entirely.

● Log Audit/Review: regular (either exploratory or automated) review of logs to detect and fix information leakage.

In an ideal world, the first three practices would be enough to mitigate logging risks. However, code standards and code review rely on human focus and expertise, and static analysis, even when using advanced techniques, is prone to false positives and other limitations.

Besides requiring computing and human resources, the real issue with log audits is that they happen too late in the process, and checking log samples doesn’t guarantee the software is free of issues. Log sample checking fails, for example, when software has many different functionalities or user profiles, making it likely that log file samples will only capture the behavior of a specific user or of the subgroup of test cases executed during that period of time.

Limitations of traditional static analysis approaches

Generally, static analysis is performed by tools and scripts. They can run on the developer’s machine as an IDE plugin, as a build step, or as a git hook. They can also run as a step in continuous integration processes, ideally before the code review happens.

In terms of static analysis implementation, the most basic approach is to use pattern matching through regular expressions and rules. Another common approach is the creation and analysis of abstract syntax trees.
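The AST approach can be illustrated with a minimal Python sketch using the standard library’s `ast` module (the study targeted Java and C#, so this is only an analogy, and the sample source below is invented): it walks the tree and flags logging calls that pass anything other than string literals.

```python
import ast

SOURCE = '''
import logging
log = logging.getLogger(__name__)

def create_user(name, password):
    log.info("creating user")
    log.info("user %s pwd %s", name, password)
'''

def find_suspect_log_calls(source):
    """Return line numbers of log calls that pass non-literal arguments."""
    suspects = []
    for node in ast.walk(ast.parse(source)):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Attribute)
                and node.func.attr in {"debug", "info", "warning", "error"}):
            # Any argument that is not a string constant may leak data.
            if any(not isinstance(arg, ast.Constant) for arg in node.args):
                suspects.append(node.lineno)
    return suspects

print(find_suspect_log_calls(SOURCE))  # → [7]: only the call logging variables
```

Unlike a regex, the AST variant knows which tokens are variables, but it still cannot tell whether a flagged variable is actually sensitive.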

The difficulty with traditional static analysis is that the resulting solution is prone to false positives. For example, a rule that captures all log entries referencing a variable called “name” cannot tell a true issue (logging a customer’s name) from a harmless entry (logging a service name): both get flagged.
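A hypothetical illustration of this false-positive behavior (the rule and the two log statements are invented for this sketch):

```python
import re

# Flags any log call whose arguments mention "name" (case-insensitive).
RULE = re.compile(r'log\.\w+\(.*name', re.IGNORECASE)

true_issue = 'log.info("Registered customer: " + customerName)'  # privacy risk
false_alarm = 'log.info("Starting service: " + serviceName)'     # harmless

print(bool(RULE.search(true_issue)), bool(RULE.search(false_alarm)))  # → True True
```

Both lines match, yet only the first one leaks personal data; a human (or a better model) is needed to tell them apart.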

Finding issues with machine learning

At Philips, we treat privacy and security as critical requirements of every commercialized product. To guarantee those requirements are met, the development process includes auditing the code and raising logging issues to be addressed by the developers. As auditing the entire code manually is time-consuming, static analysis tools were developed; in the subsequent step, engineers interpret whether the findings are true or false positives.

This approach would be good enough if static analysis accuracy was higher than 80%, but we found that even the most optimized tool presents accuracy lower than 50%, forcing engineers to walk through big code bases.

To solve the problem, we considered training machine learning models that would substitute the traditional approach without modifying the process: a tool analyzes source code and outputs a list of potential privacy or security issues. It became clear that different programming languages (PLs) demand dedicated effort due to their specific grammars. Therefore, we decided to prove the concept for two of the most popular PLs at Philips: Java and C#. Another complexity considered when collecting data was the existence of different logging libraries and the possibility of wrapping calls with custom code.

The resulting dataset for each PL included 100 repositories, guaranteeing we were covering a good variety of code styles and implementations. The log entries were separated, classified, and preprocessed (more details below). It was also necessary to balance the Java PL dataset.

In the next step, five candidate models were trained with our datasets. Tuning was done to get the best outcome from each model, allowing a clear and objective comparison to conclude which model is the best fit for the problem. The described process is summarized in the picture below.

Finally, the trained model was deployed and successfully used in analysis processes of code bases not included in the datasets. As expected, the results were superior when compared to the traditional static analysis tools.

In the following sections, we will detail the methodology and findings. At the end, we will draw conclusions and give suggestions to further explore the machine learning approach proposed in this article.

Methodology

Data Collection and Preprocessing

The datasets comprise code from 200 git repositories, spanning various business domains to guarantee variability. Two datasets were created: one from 100 git repositories with code written in Java and another from 100 repos with C# code. Observations consist of individual code blocks containing a log entry.

Text Preprocessing

Here are the steps we apply to the data:

Tokenization: To account for code-specific naming conventions, snake case and camel case were split during tokenization.

Filtering: Pure string logs, which lacked value (as they are unlikely to contain a privacy or security issue), were filtered out.

Translation: Not all logs in the dataset were written in English. Non-English logs were translated using the LibreTranslate API.

Cleaning: Special characters and common English stop words, which contribute little to meaning, were removed to improve data clarity and model performance.

Lemmatization: Word inflections were normalized through lemmatization, reducing data dimensionality and facilitating meaningful comparisons.

An example of processed observation:
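Since the original example image is not reproduced here, the steps above can be sketched with a hypothetical observation. This stdlib-only version uses a tiny hand-made stop-word list and a toy lemma table standing in for spacy:

```python
import re

STOP_WORDS = {"the", "for", "a", "an", "of", "to", "with"}     # tiny stand-in list
LEMMAS = {"creating": "create", "users": "user", "ids": "id"}  # toy lemmatizer

def preprocess(log_statement):
    # Split snake_case and camelCase identifiers into separate words.
    text = log_statement.replace("_", " ")
    text = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", text)
    # Keep only alphabetic tokens (drops special characters and punctuation).
    tokens = re.findall(r"[A-Za-z]+", text.lower())
    # Remove stop words and normalize inflections.
    return [LEMMAS.get(t, t) for t in tokens if t not in STOP_WORDS]

print(preprocess('log.info("Creating users: " + patientFullName + user_ids)'))
# → ['log', 'info', 'create', 'user', 'patient', 'full', 'name', 'user', 'id']
```

The real pipeline additionally translated non-English text and filtered pure string logs before this stage.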

Balancing

Due to class imbalance in the Java dataset, synonym-based text generation was applied to positive training set observations, creating 2 alternates with 2 to 3 words substituted with their synonyms. This resulted in approximately a 3% improvement in the models’ accuracy.

The C# dataset did not have a considerable balance issue, so no action was necessary.
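The augmentation idea can be sketched with the standard library alone (the study used nlpaug’s synonym augmenter backed by WordNet; the synonym table and seed below are invented for illustration): each positive observation yields alternates with a few tokens swapped for synonyms.

```python
import random

SYNONYMS = {  # hand-made table; nlpaug draws these from WordNet instead
    "create": ["generate", "build"],
    "user": ["account", "member"],
    "password": ["credential", "secret"],
}

def augment(tokens, n_alternates=2, n_swaps=2, seed=42):
    """Return n_alternates copies with n_swaps tokens replaced by synonyms."""
    rng = random.Random(seed)
    alternates = []
    for _ in range(n_alternates):
        new = list(tokens)
        swappable = [i for i, t in enumerate(new) if t in SYNONYMS]
        for i in rng.sample(swappable, min(n_swaps, len(swappable))):
            new[i] = rng.choice(SYNONYMS[new[i]])
        alternates.append(new)
    return alternates

for alt in augment(["create", "user", "with", "password"]):
    print(alt)
```

Each alternate differs from the original in exactly the swapped positions, which keeps the augmented data close to the real distribution.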

Feature extraction

To have the datasets ready to be used as inputs in candidate models, a traditional TF-IDF feature extraction was executed.
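In scikit-learn this is a `TfidfVectorizer` fit over the token lists; a minimal hand-rolled version of the same computation (using the textbook idf for clarity, whereas scikit-learn’s default adds smoothing terms) looks like this:

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists -> list of {token: tf-idf weight} dicts."""
    n = len(docs)
    # Document frequency: number of docs containing each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    # Textbook idf: rare tokens get high weight, ubiquitous tokens get zero.
    idf = {tok: math.log(n / count) for tok, count in df.items()}
    out = []
    for doc in docs:
        tf = Counter(doc)
        out.append({tok: (c / len(doc)) * idf[tok] for tok, c in tf.items()})
    return out

docs = [["log", "user", "password"], ["log", "service", "start"]]
weights = tfidf(docs)
print(weights[0]["log"], weights[0]["password"])  # → 0.0 and log(2)/3
```

Tokens that appear in every observation (like "log") carry zero weight, so the models focus on the discriminative vocabulary.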

Model Selection and Tuning

Algorithms

Five machine learning algorithms were considered:

● Naive Bayes: A probabilistic model known for its efficiency in handling text data.

● Gaussian Process: A flexible non-parametric model well-suited for handling uncertainty and noise in data, potentially beneficial for dealing with the variability of code and log content.

● Stochastic Gradient Descent (SGD): An optimization algorithm used here to train a logistic regression model, a common choice for binary classification tasks.

● XGBoost: A powerful tree-based ensemble model capable of capturing complex patterns and interactions within textual features, often yielding high performance in text classification.

● Multilayer Perceptron: A neural network architecture capable of learning non-linear relationships and semantic representations of texts. In the current study, models with 2 hidden layers were trained.

Hyperparameter Tuning

To uncover the most effective model configurations for privacy issue detection, Grid Search was employed to systematically explore diverse combinations of hyperparameters, optimizing each model’s performance.
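The mechanics of Grid Search can be sketched with the standard library alone (the study used scikit-learn’s grid search; the parameter grid and the scoring stand-in below are invented for illustration): exhaustively evaluate every combination and keep the best.

```python
from itertools import product

# Hypothetical grid; the real study tuned each of the five models separately.
GRID = {"alpha": [1e-4, 1e-3, 1e-2], "max_iter": [500, 1000]}

def cross_val_score(params):
    """Stand-in for k-fold cross-validation; returns a made-up score."""
    return 0.8 - abs(params["alpha"] - 1e-3) - params["max_iter"] * 1e-5

def grid_search(grid, score_fn):
    best_params, best_score = None, float("-inf")
    for values in product(*grid.values()):
        params = dict(zip(grid.keys(), values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

print(grid_search(GRID, cross_val_score))
```

In practice the score comes from cross-validation rather than a closed-form function, but the exhaustive loop over combinations is the same.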

Technology

As a guideline, here is a list of programming languages/libraries used:

● Python.

● Pandas — for data manipulation.

● nlpaug — for data augmentation (synonymAug function).

● spacy — for NLP (stop word removal and lemmatization).

● libretranslate — to translate text into English.

● sklearn (scikit-learn) — for model selection (Grid Search), feature extraction (TF-IDF), most ML algorithms (Naive Bayes, SGD, Gaussian Process, Multi-layer Perceptron) and metrics (confusion matrix, F1 score, AUC).

● xgboost — for the xgboost classifier algorithm.

We used standard laptops to train the models. The training time for models varied from 1 to 4 hours.

Model Evaluation

To comprehensively assess the performance of different models, the following metrics were considered:

• Accuracy: Overall proportion of correct predictions.

• Sensitivity (Recall): Ability to accurately identify true privacy issues (true positives).

• Specificity: Ability to correctly identify code blocks without privacy issues (true negatives), ensuring minimal false alarms and unnecessary interventions.

• F1 Score: Harmonic mean of precision and recall, balancing false positives against false negatives.

• Area Under the ROC Curve (AUC): Represents the models’ ability to distinguish between the classes.

While a regex approach provided a baseline for the initial metrics, its inherent limitations as a rule-based method precluded its use for AUC (Area Under the ROC Curve) evaluation, which requires a probabilistic approach.
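Computed from a confusion matrix, the first four metrics above reduce to a few lines (the matrix values below are invented for illustration):

```python
def metrics(tp, fp, tn, fn):
    """Derive the evaluation metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    sensitivity = tp / (tp + fn)   # recall on the positive class
    specificity = tn / (tn + fp)   # recall on the negative class
    precision = tp / (tp + fp)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return accuracy, sensitivity, specificity, f1

# Hypothetical confusion matrix: 80 true positives, 10 false positives,
# 95 true negatives, 15 false negatives.
print(metrics(tp=80, fp=10, tn=95, fn=15))
```

AUC, by contrast, needs a ranking of predicted probabilities rather than hard counts, which is why the regex baseline could not be scored on it.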

Findings

The Stochastic Gradient Descent (SGD) model was found to be the most consistent performer, achieving higher accuracy and F1 scores.

As expected, all models significantly outperformed the regex approach with an average reduction of 56% in type I errors and 27.4% in type II errors.

Conclusions & further steps

This study showed that machine learning models can effectively be applied to the problem of identifying privacy and security vulnerabilities related to log entries in the source code. Some benefits of this approach are:

● Superior accuracy: machine learning models consistently outperformed traditional rule-based approaches such as regular expressions, achieving substantially higher accuracy in identifying log issues.

● Early detection: by analyzing code directly, machine learning enables proactive identification of vulnerabilities before they manifest in runtime logs, reducing the risk of costly breaches and reputational damage.

● Multiple programming languages: while trained on Java and C# datasets, the proposed approach can be adapted to a wide range of programming languages, offering a versatile solution for diverse development stacks.

● Consistent performance: once the model is trained, it runs as fast as any static analysis algorithm (running time of tested repositories was about 40 seconds per code base in a standard GitHub runner, depending on number of lines of code).

The trained models were good enough to be used effectively. As further steps, it would be possible to:

● Increase variability: include more repositories in the datasets.

● Deep learning exploration: check whether deep learning models with more than 2 hidden layers can outperform the models trained in the current study.

● Expand language coverage: incorporating additional programming languages will foster a more comprehensive and inclusive security solution, catering to a broader range of development needs.

● Integrate tool in the development environment: while easily integrated into CI workflows, embedding the solution directly into development tools and environments would further shift left vulnerability detection, empowering developers with real-time guidance and mitigating risks at the earliest stages of the development process.

