SWAN_ASSIST: Semi-Automated Detection of Code-Specific, Security-Relevant Methods

Goran Piskachev
Published in ASE Conference
Oct 29, 2019

Tailoring static analyses to detect different security vulnerabilities is a time-consuming task that requires a deep understanding of both the static analysis and the application security domains. Consider taint-style vulnerabilities such as SQL injection: to detect them, we need to feed our taint analysis with codebase-specific methods such as sources, sinks, and sanitizers. We refer to these as security-relevant methods (SRM). In this post, I discuss our tool SWAN_ASSIST, which implements an active machine-learning approach for the semi-automated detection of SRM, categorized into different security vulnerabilities. SWAN_ASSIST builds on SWAN, a fully automated supervised machine-learning approach for the detection of SRM.

This post is based on our demonstration paper “SWAN_ASSIST: Semi-Automated Detection of Code-Specific, Security-Relevant Methods”, written with Lisa Nguyen Quang Do, Oshando Johnson, and Eric Bodden and presented at Automated Software Engineering (ASE) 2019, as well as on our technical paper “Codebase-Adaptive Detection of Security-Relevant Methods”, written with Lisa Nguyen Quang Do and Eric Bodden and presented at the International Symposium on Software Testing and Analysis (ISSTA) 2019.

Why are SRM code- and security-specific?

We can answer this question using a simple code example, as follows.

Java code snippet containing a potential SQL injection and an open redirect
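The snippet is shown as an image in the original post. The following is a minimal reconstruction based on the discussion below; the servlet and JDBC calls are standard, but the ESAPI-based sanitizer and all identifier names are assumptions. The line numbers referenced in the text are marked in comments.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.owasp.esapi.ESAPI;
import org.owasp.esapi.codecs.OracleCodec;

public class UserServlet {
    void doGet(HttpServletRequest req, HttpServletResponse resp, Connection conn) throws Exception {
        String name = req.getParameter("user");                               // line 3: source
        String safe = ESAPI.encoder().encodeForSQL(new OracleCodec(), name);  // line 4: sanitizer (SQL only)
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery(
            "SELECT * FROM users WHERE name = '" + safe + "'");               // line 7: sink (CWE-89, mitigated)
        resp.sendRedirect(req.getParameter("target"));                        // line 9: sink (CWE-601, unmitigated)
    }
}
```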

The code contains a potential SQL injection (CWE-89) spanning lines 3 to 7, and an open redirect (CWE-601) on line 9. The SQL injection is mitigated by the validator call on line 4, which sanitizes the user-controlled input. To find the two potential vulnerabilities, a taint analysis requires specific SRM: the source getParameter() creates the data to be tracked, the sinks executeQuery() and sendRedirect() raise the alarm, and the validator encodeForSQL() marks the data as safe, but only with respect to SQL injection.

Detecting SRM with SWAN

SWAN runs an automated classification twice. In the first iteration, it classifies all methods of the analyzed program and its libraries into general SRM classes: sources (So), sinks (Si), sanitizers (Sa), authentication methods, or none. In the second iteration, it discards the methods marked as none and classifies the remaining SRM, per SRM class, into individual CWEs: CWE-78, CWE-79, CWE-89, CWE-306, CWE-601, CWE-862, and CWE-863. Each class is covered by one binary Support Vector Machine (SVM) classifier; for instance, SWAN has one SVM that decides whether a method is a source or not. The only exception is authentication methods, for which we have four classes: auth-safe-state, auth-unsafe-state, auth-no-change, and none. The first class refers to authentication methods that elevate the privileges of the user (e.g., login methods). The second contains methods that lower those privileges (e.g., logout methods). The third marks methods that do not change the authentication state of the program (e.g., isAuthenticated()).
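Conceptually, the first pass can be sketched as follows. This is not SWAN’s actual code; all types here are hypothetical stand-ins for its internals.

```java
import java.util.*;

// Hypothetical stand-ins for SWAN's internals.
interface MethodSignature { }                                      // a method of the analyzed program
interface BinarySvm { boolean predict(boolean[] featureVector); }  // one SVM per class
interface FeatureMatrix { boolean[] row(MethodSignature m); }      // built from the binary features

enum SrmClass { SOURCE, SINK, SANITIZER, AUTH_SAFE_STATE, AUTH_UNSAFE_STATE, AUTH_NO_CHANGE }

class FirstPass {
    // First pass: label each method with every SRM class whose SVM accepts it;
    // methods with no label ("none") are discarded before the second, CWE-specific pass.
    static Map<MethodSignature, Set<SrmClass>> run(List<MethodSignature> methods,
                                                   Map<SrmClass, BinarySvm> svms,
                                                   FeatureMatrix features) {
        Map<MethodSignature, Set<SrmClass>> labels = new HashMap<>();
        for (MethodSignature m : methods) {
            Set<SrmClass> hits = EnumSet.noneOf(SrmClass.class);
            for (Map.Entry<SrmClass, BinarySvm> e : svms.entrySet())
                if (e.getValue().predict(features.row(m)))
                    hits.add(e.getKey());
            if (!hits.isEmpty())
                labels.put(m, hits);
        }
        return labels;
    }
}
```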

Figure 1 shows an overview of SWAN (below the dotted line).

Figure 1: Overview of SWAN and SWAN_ASSIST

To help the machine-learning algorithm classify the methods into the different classes, SWAN uses a set of binary features that evaluate certain properties of the methods. For example, the feature instance methodClassContainsOAuth is more likely to indicate an authentication method than any other type of SRM. As the first phase of learning, SWAN constructs a feature matrix by computing a true/false result for each feature instance on each method of the training set. This matrix is then used to learn which combination of features best characterizes each class, and SWAN uses this knowledge to classify the methods of the testing set, after creating the feature matrix for that set. SWAN uses 25 feature types, instantiated as 206 concrete features: we call generic features such as methodClassContains feature types, and their concrete instantiations (e.g., methodClassContainsOAuth) feature instances.
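To illustrate, here is a sketch of a feature type, one of its instances, and the matrix construction; the MethodSignature accessor and the class names are assumptions, not SWAN’s implementation.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical accessor for the analyzed method's declaring class name.
interface MethodSignature { String declaringClassName(); }

// A binary feature evaluates one property of a method.
interface Feature { boolean applies(MethodSignature m); }

// Feature type "methodClassContains", instantiated with a concrete token
// (e.g., new MethodClassContains("OAuth") is methodClassContainsOAuth).
class MethodClassContains implements Feature {
    private final String token;
    MethodClassContains(String token) { this.token = token.toLowerCase(Locale.ROOT); }
    @Override public boolean applies(MethodSignature m) {
        return m.declaringClassName().toLowerCase(Locale.ROOT).contains(token);
    }
}

class Features {
    // One row per method, one column per feature instance (206 in SWAN).
    static boolean[][] matrix(List<MethodSignature> methods, List<Feature> instances) {
        boolean[][] matrix = new boolean[methods.size()][instances.size()];
        for (int i = 0; i < methods.size(); i++)
            for (int j = 0; j < instances.size(); j++)
                matrix[i][j] = instances.get(j).applies(methods.get(i));
        return matrix;
    }
}
```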

The training set in SWAN contains 235 Java methods collected from 10 popular Java frameworks. We put particular care into ensuring that the methods were chosen so that each of the 206 feature instances across all classifiers in SWAN had at least one positive and one negative example.

We evaluated the precision of SWAN on 12 open-source Java libraries: two frameworks from the mobile domain (Android and Apache Cordova), eight web frameworks (Apache Lucene, Apache Stratos, Apache Struts, Dropwizard, Eclipse Jetty, GWT, Spark, and Spring), one framework from the home automation domain (Eclipse SmartHome), and one utility framework (Apache Commons). We applied SWAN to the 12 libraries and randomly selected 50 methods for each library/class pair, whose classification we then manually verified. The overall precision is 0.76; SWAN is more precise at detecting SRM types (0.826) than CWEs (0.677). SWAN’s overall precision is consistent across different types of Java applications, but can be improved with SWAN_ASSIST.

How does SWAN compare to other approaches?

We know of three approaches that have open-sourced their SRM: SuSi [1], the approach of Sas et al. [2], and JoanAudit [3]. We compare the lists of sources and sinks from SuSi [1] and its extension by Sas et al. [2] to the lists of sources and sinks generated by SWAN on the Android framework (version 4.2). The number of sources and sinks detected by the three approaches is shown in Figure 2.

Figure 2: Number of sources (left) and sinks (right)

SWAN reports a total of 25,085 sources and 13,798 sinks, SuSi 18,044 sources and 8,278 sinks, and the tool by Sas et al. 3,035 sources and 7,311 sinks. SWAN reports more SRM than the other two approaches, which, after a manual investigation, we attribute to two reasons. First, SWAN’s features target a broader range of vulnerabilities than SuSi’s and Sas et al.’s data-privacy focus. Second, whereas SuSi reports methods from abstract classes and interfaces, SWAN reports their concrete implementations, which allows for better precision. Sas et al. are stricter and only report warnings belonging to certain classes: database, gui, file, web, xml, and io. Unlike SWAN and SuSi, Sas et al. report more sinks than sources, which is due to the larger number of sink features than source features in their approach; both SWAN and SuSi contain enough features and training instances to overcome this. To compare the precision of the three approaches, we randomly selected 50 sources and 50 sinks from the lists produced by the three tools and manually classified them. The selected methods of each tool were labeled by different researchers: two of the authors and one external researcher. SWAN shows a precision of 0.99 for sources and 0.92 for sinks, whereas SuSi yields respective precisions of 0.96 and 0.88, and Sas et al.’s tool 0.88 for both.

What is SWAN_ASSIST?

As shown in Figure 1, SWAN_ASSIST involves the user in the learning process. It builds on SWAN and lets the user trigger SWAN in a loop: the user selects examples from the codebase (i.e., the testing set), labels them, and adds them to the training set, as in the sketch below.
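A minimal sketch of this loop, assuming hypothetical APIs for SWAN, the suggester, and the user interaction (the plug-in’s real interface differs):

```java
import java.util.List;

// All types below are hypothetical stand-ins for the plug-in's internals.
interface MethodSignature { }
interface SrmLabel { }
interface Classification { }
interface TrainingSet { void add(MethodSignature m, SrmLabel label); }
interface Suggester { List<MethodSignature> suggest(List<MethodSignature> candidates, int k); }

abstract class AssistLoop {
    abstract Classification runSwan(TrainingSet train, List<MethodSignature> codebase);
    abstract SrmLabel askUserToLabel(MethodSignature m);
    abstract boolean userIsSatisfied(Classification result);

    void run(TrainingSet train, List<MethodSignature> codebase, Suggester suggester) {
        while (true) {
            Classification result = runSwan(train, codebase);   // re-run SWAN with the current training set
            if (userIsSatisfied(result)) break;
            List<MethodSignature> picks = suggester.suggest(codebase, 2); // two suggested methods
            if (picks.isEmpty()) break;                         // nothing left to label
            for (MethodSignature m : picks) {
                train.add(m, askUserToLabel(m));                // labeled example joins the training set
                codebase.remove(m);
            }
        }
    }
}
```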

SWAN_ASSIST is implemented as a plug-in for the IntelliJ IDEA IDE. A five-minute video gives an overview of the tool’s main features; a ten-minute version is also available.

Which methods should the user label and what is the required manual effort?

The tool can be used by developers at coding or debugging time in combination with another taint analysis tool, or by security teams to configure analysis tools before they are deployed. Users may apply different strategies for selecting methods from the codebase; we evaluated two of them.

A simple strategy is to choose methods randomly. Another is to use the SWAN-Suggester algorithm that we developed. At each run, this algorithm proposes the two most impactful methods to the user (also shown in Figure 1). To do so, it uses the relevance of the features that are not yet covered by the already-added methods; the relevance is calculated with a one-at-a-time analysis. The pseudocode of our algorithm follows.
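The pseudocode appears as a figure in the original post. The following is a hedged reconstruction from the description above; the names and the exact scoring are assumptions.

```java
import java.util.*;

// Hypothetical stand-ins for SWAN_ASSIST's internals.
interface MethodSignature { }
interface Feature { boolean applies(MethodSignature m); }

class SwanSuggester {
    // Relevance of each feature instance, estimated with a one-at-a-time
    // analysis (e.g., retrain with the feature disabled and measure the change).
    private final Map<Feature, Double> relevance;

    SwanSuggester(Map<Feature, Double> relevance) { this.relevance = relevance; }

    // Propose the k methods (k = 2 in SWAN_ASSIST) that best cover the
    // feature instances not yet covered by the current training set.
    List<MethodSignature> suggest(List<MethodSignature> candidates,
                                  Set<Feature> uncovered, int k) {
        List<MethodSignature> ranked = new ArrayList<>(candidates);
        ranked.sort(Comparator.comparingDouble(
                (MethodSignature m) -> impact(m, uncovered)).reversed());
        return ranked.subList(0, Math.min(k, ranked.size()));
    }

    // A method's impact: summed relevance of the uncovered features it triggers.
    private double impact(MethodSignature m, Set<Feature> uncovered) {
        double score = 0.0;
        for (Feature f : uncovered)
            if (f.applies(m)) score += relevance.getOrDefault(f, 0.0);
        return score;
    }
}
```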

For the evaluation, we manually classified and labeled the 1,663 methods of GXA [4] with one or more of the following classes: sources, sinks, CWE-89, or none. 286 methods were identified as sources, 183 as sinks, and 29 as relevant to CWE-89; we consider this our ground truth. To evaluate how the SWAN-Suggester algorithm helps improve the results of SWAN, we compare the resulting SRM lists when feeding SWAN_ASSIST randomly selected method pairs and when using SWAN-Suggester to select those pairs. We first run SWAN with its initial training set and GXA as the testing set. Then, we add a new method pair to the training set and continue until we run out of methods. For each of the 819 iterations, we report the classification’s precision (Figure 3). The precision shown for the random suggester is averaged over 10 runs.

Figure 3: Precision of the sources (top), sinks (middle), and CWE-89 (bottom) over 819 iterations of SWAN, adding methods with SWAN-Suggester and with random selection

We see that for sources and sinks, the evolution of the precision for the random suggester is linear, which shows that random selection does not help the classification: the precision increases naturally as new pairs are added to the training set. With SWAN-Suggester, on the other hand, the precision increases quickly at the beginning, showing that the suggester is effective at selecting the methods with the most impact first. This maximizes the impact of each labeled pair and minimizes the work developers must invest to tune SWAN to their codebase. For sources, the precision reaches 0.8 at iteration 31 (from 0.75 at iteration 1), at which point 60 methods have been labeled (4% of the total number of methods in the application). Afterwards, the growth slows down, reaching a precision of 0.9 after 91 iterations. With random selection, the growth is much slower, reaching a precision of 0.8 at iteration 166 and 0.9 at iteration 414. For sinks, the precision reaches 0.9 at iteration 20 (from 0.61 at iteration 1) with SWAN-Suggester, requiring the developer to label only 1% of the total number of methods in the application; the random suggester only reaches this precision at iteration 358.

Although less visible, we see a similar trend for CWE-89, where the precision in the early iterations is better with SWAN-Suggester. The growth is less pronounced, reaching a precision of 0.8 only at the 130th iteration (from 0.67 at iteration 1) and requiring 16% of the methods to be labeled. We attribute this to the lower number of SWAN feature instances targeting this class compared to the SRM classes, and to the low number of CWE-89-related methods in the test set, which makes the classifier less effective at targeting it.

Implementation

The implementation is available on GitHub: https://github.com/secure-software-engineering/swan

References:

[1] S. Arzt, S. Rasthofer, and E. Bodden. 2013. SuSi: A Tool for the Fully Automated Classification and Categorization of Android Sources and Sinks (NDSS 2013).

[2] D. Sas, M. Bessi, and F. A. Fontana. 2018. Automatic Detection of Sources and Sinks in Arbitrary Java Libraries (SCAM 2018).

[3] J. Thomé, L. K. Shar, D. Bianculli, and L. C. Briand. 2017. JoanAudit: A Tool for Auditing Common Injection Vulnerabilities (ESEC/FSE 2017).

[4] European Bioinformatics Institute. Gene Expression Atlas. https://github.com/gxa/gxa
