Autonomous Source Code Classification using Machine Learning and Natural Language Processing

Jaskirat Singh
7 min read · Aug 15, 2021


Project Overview

In this project I propose an automated way to classify the programming language of source code snippets using Natural Language Processing (NLP) and Machine Learning (ML) algorithms. The dataset was downloaded from the Stack Overflow 2017 data dump and then processed for use with ML and NLP. Using the Beautiful Soup library, I extracted up to 12,000 code snippets for each of 21 popular programming languages from Stack Overflow posts. The languages selected for the study were Bash, C, C#, C++, CSS, Haskell, HTML, Java, JavaScript, Lua, Objective-C, Perl, PHP, Python, R, Ruby, Scala, SQL, Swift, VB.Net and Markdown. In total, 237,804 code snippets were selected for the study. The code snippets extracted from the Stack Overflow questions were converted into feature vectors using the Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer from the Scikit-learn library. Two classifiers, Stochastic Gradient Descent (SGD) and Random Forest Classifier (RFC), were trained on these Stack Overflow snippets. The SGD classifier achieves an accuracy of 74%, whereas the RFC achieves 76%. The performance metrics used in this project are precision, recall, accuracy, F1 score and the confusion matrix.

1. Data Extraction and Pre-Processing

In this project, questions with more than one programming language tag were removed to avoid ambiguous training labels. Each chosen question contained at least one code snippet, and the snippet had at least 2 lines of code. For each programming language, 12,000 random questions were extracted; however, two programming languages had fewer than 12,000 questions: Markdown (1,358) and Lua (8,427). Fig. 1 shows an example of a Stack Overflow post.

Fig. 1 Using Stack Overflow code snippet as input data and programming tag as the label

A Stack Overflow question consists of a title, a body and code snippets. The tags <code> and </code> were used to extract the code snippets from each Stack Overflow question. In some cases, a question contained multiple code snippets; these were combined into one. Machine learning models perform poorly when trained on raw code snippets because of the noise present in the data. Therefore, the stop words were removed from the code snippets during pre-processing. Code snippets were treated as a natural language construct, which means that punctuation was removed. The features of a code snippet are its keywords, identifiers, library names and so on. It should be noted that stemming and lemmatization were not applied to the code snippets. The extracted set of code snippets provides good coverage of different versions of the programming languages. For example, code snippets were extracted for the Python tags Python-3.x, Python-2.7, Python-3.5, Python-2.x, Python-3.6, Python-3.3 and Python-2.6, for the Java tags Java-8 and Java-7, and for the C++ tags C++11, C++03, C++98 and C++14. The extracted snippets varied significantly in the number of lines of code, as shown in Fig. 2.

Fig. 2. The length of code snippets in the Stack Overflow dataset.
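
The extraction step described in this section can be sketched as follows. This is a minimal illustration and not the project's actual code: the project used Beautiful Soup, while this sketch swaps in the standard library's html.parser so that it runs without extra dependencies. The names CodeSnippetParser and extract_snippet are hypothetical, and applying the two-line minimum to the combined snippet is an assumption.

```python
from html.parser import HTMLParser


class CodeSnippetParser(HTMLParser):
    """Collects the text found inside <code>...</code> elements."""

    def __init__(self):
        super().__init__()
        self._depth = 0      # greater than 0 while inside a <code> element
        self._buffer = []    # text pieces of the snippet currently being read
        self.snippets = []   # finished snippets, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "code":
            self._depth += 1

    def handle_endtag(self, tag):
        if tag == "code" and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                self.snippets.append("".join(self._buffer))
                self._buffer = []

    def handle_data(self, data):
        if self._depth > 0:
            self._buffer.append(data)


def extract_snippet(post_html, min_lines=2):
    """Return all code in a post combined into one snippet, or None if too short."""
    parser = CodeSnippetParser()
    parser.feed(post_html)
    parser.close()
    # Multiple snippets in one question are combined into a single snippet.
    combined = "\n".join(s.strip("\n") for s in parser.snippets)
    # Assumption: the two-line minimum is checked on the combined snippet.
    n_lines = sum(1 for line in combined.splitlines() if line.strip())
    return combined if n_lines >= min_lines else None


post = (
    "<p>How do I loop?</p>"
    "<pre><code>for i in range(3):\n    print(i)\n</code></pre>"
    "<p>Use <code>range</code>.</p>"
)
print(extract_snippet(post))
```

A post whose only code is a single inline line, such as "<p>See <code>x = 1</code></p>", is rejected by this sketch and returns None.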

2. Extracting code snippet features from each Programming Language using NLP

Machine learning algorithms cannot learn from raw text, so several processing steps are needed before training. First, the code snippets need to be converted into numerical feature vectors. I used Term Frequency-Inverse Document Frequency (TF-IDF), a text-processing technique in the Bag of Words family. TF-IDF measures how important a word is to a document relative to the entire corpus. The Minimum Document Frequency (Min-DF) was set to 10, which means that only words present in at least ten documents were selected (a document is a single code snippet). This step eliminates infrequent words from the dataset, which helps the machine learning models learn from the most important vocabulary. The Maximum Document Frequency (Max-DF) was left at its default because the stop words had already been removed in the data pre-processing step. All characters were converted to lowercase before tokenizing. The vocabulary was limited to the top 10,000 features (max_features), ordered by term frequency across the corpus. Each feature is a single word, obtained by setting the analyzer to “word”.
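
The vectorizer settings described above map onto scikit-learn's TfidfVectorizer roughly as shown here. This is a sketch under assumptions: the helper name build_vectorizer and the toy corpus are hypothetical, and any parameter not mentioned in the text is left at the library default.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def build_vectorizer(min_df=10, max_features=10_000):
    """TF-IDF vectorizer configured with the settings described in the text."""
    return TfidfVectorizer(
        analyzer="word",            # each feature is a single word
        lowercase=True,             # lowercase before tokenizing
        min_df=min_df,              # keep words found in at least min_df snippets
        max_features=max_features,  # keep the most frequent terms in the corpus
    )


# Toy corpus (hypothetical): two snippets repeated ten times each so that
# their words pass min_df=10, plus one snippet with words that appear only once.
snippets = ["def main(): print(x)", "public static void main(String[] args)"] * 10
snippets.append("zzz_rare only_once")

vectorizer = build_vectorizer()
X = vectorizer.fit_transform(snippets)

print(X.shape)
print(sorted(vectorizer.vocabulary_))
```

The words in the last snippet occur in only one document, so min_df=10 drops them from the vocabulary. The default token pattern of TfidfVectorizer also keeps only tokens of two or more word characters, which discards punctuation and single-character names.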

3. Applying Machine Learning Algorithms

The ML algorithms Random Forest Classifier (RFC) and Stochastic Gradient Descent (SGD) were employed. These algorithms provided higher accuracy than the other algorithms I explored, ExtraTree and MultinomialNB. The machine learning models were tuned using GridSearchCV, a tool for hyperparameter search in the Scikit-learn library. Tuning the models by varying the hyperparameters is important for obtaining a robust fit. The best hyperparameters found for RFC and SGD were as follows:

param_grid_RFC = {'n_estimators': [100], 'max_depth': [None], 'min_samples_leaf': [1], 'bootstrap': [False], 'min_samples_split': [15], 'max_features': ['sqrt']}

param_grid_SGD = {'loss': ['log'], 'max_iter': [100], 'tol': [0.001]}  # 'log' is named 'log_loss' in scikit-learn 1.1 and later
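
A grid search of this kind might look like the sketch below. It is illustrative only: the synthetic data stands in for the TF-IDF matrix and language labels, the extra candidate values (bootstrap=True, min_samples_split=2) are hypothetical alternatives added so that there is something to search over, and cv=3 is used to keep the example fast, whereas the project used ten folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the TF-IDF features and language labels (assumption).
X, y = make_classification(
    n_samples=150, n_features=20, n_informative=5, n_classes=3, random_state=0
)

param_grid = {
    "n_estimators": [100],
    "max_depth": [None],
    "min_samples_leaf": [1],
    "bootstrap": [True, False],     # False is the reported best value
    "min_samples_split": [2, 15],   # 15 is the reported best value
    "max_features": ["sqrt"],
}

# With an integer cv and a classifier, GridSearchCV uses stratified folds.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 3))
```

After fitting, search.best_estimator_ holds a model refit on all the data passed to fit, using the best parameter combination.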

All model parameters were fixed after GridSearchCV tuning on the cross-validation sets (stratified ten-fold cross-validation). For this purpose, the dataset was first split into training and test partitions, with the test data making up 20% of the total. The training data was then split into k = 10 folds. In each round, 9 folds were used for training and the left-out fold was used for validation; the trained model was evaluated on the validation fold and then discarded. This process was repeated until every fold had served as the validation fold once. This technique is called k-fold cross-validation. I used k = 10, and all reported cross-validation accuracies are averages over the folds.
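
The evaluation protocol above (a 20% held-out test set, then stratified ten-fold cross-validation on the remainder) can be sketched as follows. The synthetic data is a stand-in for the real features, stratifying the initial train/test split is an assumption, and the SGDClassifier here uses the library's default loss rather than the log loss reported above, to keep the sketch independent of the scikit-learn version.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic stand-in for the TF-IDF features and language labels (assumption).
X, y = make_classification(
    n_samples=400, n_features=20, n_informative=6, n_classes=4, random_state=0
)

# Hold out 20% of the data as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Stratified 10-fold cross-validation on the training partition: each fold
# serves once as the validation fold while the other nine are used to train.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = SGDClassifier(max_iter=100, tol=1e-3, random_state=0)
fold_scores = cross_val_score(model, X_train, y_train, cv=cv, scoring="accuracy")
mean_cv_accuracy = fold_scores.mean()

# Final check on the held-out test set after refitting on all training data.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)

print(len(fold_scores), round(mean_cv_accuracy, 3), round(test_accuracy, 3))
```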

4.1 Results for SGD Classifier

Table 1. SGD model performance results using metrics like Precision, Recall and F1 score for each of 21 programming languages
Fig. 3. Confusion matrix for the Stochastic Gradient Descent classifier trained on code snippet features. The diagonal represents the percentage of snippets of a programming language correctly predicted.
Table 2. Feature importance for the SGD classifier: the top 10 keywords, with their weights, for each of the 21 programming languages

The analysis of the feature space of the programming languages indicates that these languages have unique code snippet features (keywords/identifiers).
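
One way to obtain per-language keyword rankings like those in Table 2 is to read the weights of a fitted linear model. The sketch below assumes that the reported weights correspond to the classifier's coef_ matrix, which the article does not state; the function top_keywords and the toy corpus are hypothetical, and the approach as written requires three or more classes.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier


def top_keywords(vectorizer, classifier, k=10):
    """Map each class label to its k highest-weighted (term, weight) pairs."""
    # Build an index -> term lookup from the fitted vocabulary.
    terms = np.empty(len(vectorizer.vocabulary_), dtype=object)
    for term, idx in vectorizer.vocabulary_.items():
        terms[idx] = term
    result = {}
    # With three or more classes, coef_ has one row of weights per class.
    for label, weights in zip(classifier.classes_, classifier.coef_):
        best = np.argsort(weights)[::-1][:k]
        result[str(label)] = [(terms[i], float(weights[i])) for i in best]
    return result


# Toy corpus (hypothetical), repeated so the classifier has enough samples.
docs = [
    ("def foo(self): import numpy", "python"),
    ("import os def bar(self): return None", "python"),
    ("public static void main(String[] args)", "java"),
    ("public class Foo { private int x; }", "java"),
    ("select name from users where id = 1", "sql"),
    ("select count from orders group by id", "sql"),
] * 5
texts = [text for text, _ in docs]
labels = [label for _, label in docs]

vectorizer = TfidfVectorizer()
features = vectorizer.fit_transform(texts)
classifier = SGDClassifier(random_state=0).fit(features, labels)

keywords = top_keywords(vectorizer, classifier, k=3)
for language, pairs in keywords.items():
    print(language, pairs)
```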

4.2 Results for Random Forest Classifier

Table 3. RFC model performance results using metrics like Precision, Recall and F1 score for each of 21 programming languages
Fig. 4. Confusion matrix for the Random Forest classifier trained on code snippet features. The diagonal represents the percentage of snippets of a programming language correctly predicted.
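
The per-language precision, recall and F1 scores, and the percentage confusion matrices reported for both classifiers, can be computed with scikit-learn's metrics module. The labels and predictions below are made-up toy values for three languages, not results from the project.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, precision_recall_fscore_support

labels = ["C", "Java", "Python"]

# Toy ground truth and predictions (hypothetical, four snippets per language).
y_true = ["C"] * 4 + ["Java"] * 4 + ["Python"] * 4
y_pred = ["C", "C", "C", "Java",
          "Java", "Java", "C", "C",
          "Python", "Python", "Python", "Python"]

# Rows are true languages, columns are predicted languages.
cm = confusion_matrix(y_true, y_pred, labels=labels)

# Normalize each row so the diagonal gives the percentage of snippets of a
# language that were predicted correctly.
row_pct = 100.0 * cm / cm.sum(axis=1, keepdims=True)

precision, recall, f1, support = precision_recall_fscore_support(
    y_true, y_pred, labels=labels
)

print(cm)
print(np.diag(row_pct))
for name, p, r, f in zip(labels, precision, recall, f1):
    print(name, round(p, 2), round(r, 2), round(f, 2))
```

In a row-normalized confusion matrix the diagonal equals the per-class recall expressed as a percentage, which is why the two views agree.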

The implementation of the project can be found at the GitHub link below.

5. Usage Example

The models were tested using a simple command-line interface. To run and verify the model prediction, the first step is to load the dataset and select the feature set. The next step is to train the machine learning algorithm on the selected features. The user is then asked to enter a code snippet on the command line. Finally, the predicted programming language for the snippet is printed. Fig. 5 demonstrates how the autonomous source code classification works.

Fig. 5. Autonomous Source code classification using Command Line Interface
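
The load, train and predict flow behind this interface can be sketched with a scikit-learn pipeline. This is not the project's actual script: the tiny TRAINING_DATA list and the function names train_model and predict_language are hypothetical, and the snippet is passed in directly instead of being read from the command line.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline

# Tiny hand-written training set (hypothetical stand-in for the real dataset).
TRAINING_DATA = [
    ("def greet(name):\n    print(name)", "Python"),
    ("import os\nfor path in os.listdir():\n    print(path)", "Python"),
    ("public static void main(String[] args) {\n    System.out.println(args);\n}", "Java"),
    ("private final int count;\npublic int getCount() { return count; }", "Java"),
    ("SELECT name FROM users WHERE id = 1;", "SQL"),
    ("INSERT INTO orders (id, total) VALUES (1, 20);", "SQL"),
]


def train_model(samples):
    """Fit a TF-IDF + random forest pipeline on (snippet, language) pairs."""
    snippets = [code for code, _ in samples]
    labels = [label for _, label in samples]
    model = make_pipeline(
        TfidfVectorizer(analyzer="word", lowercase=True),
        RandomForestClassifier(n_estimators=100, bootstrap=False, random_state=0),
    )
    model.fit(snippets, labels)
    return model


def predict_language(model, snippet):
    """Return the predicted language label for one code snippet."""
    return str(model.predict([snippet])[0])


model = train_model(TRAINING_DATA)
print(predict_language(model, "SELECT name FROM users WHERE id = 1;"))
```

In a command-line tool, the snippet passed to predict_language would come from user input, for example from sys.stdin.read().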

6. Future Work and Conclusion

The study of programming language prediction from code snippets is still new, and much remains to be done. Most existing tools rely on file extensions rather than on the code itself. In recent years, there has been tremendous progress in the field of deep learning, especially for time-series or sequence-based models such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks. RNN and LSTM models can be trained on source code one character at a time, but they can have a high computational cost. In the future, our model will be evaluated on public GitHub repositories. This would help us understand how well the model generalizes.

After analysing the performance of the model, I observed that predicting the programming language from code snippets, or from a few lines of source code, is far more challenging than determining it from open-source GitHub repositories, given the complexity of modern programming languages. I believe that our tool could be applied in other scenarios, such as code search engines and snippet management tools. The Random Forest Classifier outperformed the SGD classifier when both were trained on the Stack Overflow dataset. The RFC achieves an accuracy of 76%, and the average precision, recall and F1 score of the proposed model were 0.76, 0.75 and 0.75, respectively. To summarize our results, the task of identifying the programming language of a code snippet seems to be fundamentally different in nature from that of identifying the language of a source code file.
