Analysis of Financial Data of ‘ENRON’

TUSHAR SETHI
Jun 28, 2020


Abstract :

The Enron scandal was a series of events that resulted in the bankruptcy of the Enron Corporation and the dissolution of its auditor, Arthur Andersen LLP, which had been one of the largest auditing and accounting firms in the world. The collapse of Enron, which held more than $60 billion in assets, involved one of the biggest bankruptcy filings in the history of the United States, and it generated much debate as well as legislation designed to improve accounting standards and practices, with long-lasting repercussions in the finance world. In the resulting federal investigation, a significant amount of typically confidential information entered the public record, including detailed financial data for top executives. Here, the main scope is to analyse that financial data and train a machine learning model to classify a person of interest (POI), i.e. a person with a high probability of having taken part in the fraud at the company. Such a trained model could further be used as a POI identifier by other companies to help prevent this type of scam.

Introduction :

Enron Corporation was an energy company based in Houston, Texas, United States of America. It was founded by Kenneth Lay in 1985 through the merger of Houston Natural Gas and InterNorth (a Nebraska pipeline company). Afterwards, Enron was rebranded as an energy supplier and trader, and it also began operating and dealing in electricity, water and broadband services. It was named “The most innovative large company in America” in Fortune’s Most Admired Companies survey for six consecutive years. In 1990, Lay created a new division called Enron Finance Corporation and appointed Jeffrey Skilling as its head. Under Skilling’s leadership it soon started to dominate its markets. Skilling also kept modifying the corporate culture of Enron to build its name in the trading business, and he attained the position of CEO after Kenneth Lay stepped down.

Enron switched its accounting technique to the mark-to-market (MTM) method and used it to overvalue assets, including assets of essentially zero value that were carried at very high figures. With the help of fake profits and sales, Enron showed growth every year. The company used unethical practices to misrepresent earnings and hide its accounting limitations in order to fool the regulators, which resulted in a complex business model that was very confusing to analysts. Non-standard accounting techniques and deal inflation became common practice. Special Purpose Vehicles (SPVs), also known as Special Purpose Entities (SPEs), were used by the company to hide billions in debt and toxic assets from investors and creditors.

Enron took out large bank loans, created offshore companies and reported revenue from undelivered goods. Its shares were worth $90.75 at their peak in 2000, compared with $0.26 when the company filed for bankruptcy in December 2001.

The company donated large amounts of money to political parties to pressure the government into changing laws in its favour. Enron’s bankruptcy was an intentional, well-planned fraud involving several banks and an auditing firm. Kenneth Lay and Jeffrey Skilling were found guilty of fraud and conspiracy. At the time, this was reported to be the largest bankruptcy in American history, and around 4,000 employees lost their jobs.

Data Analysis :

The data belongs to the Enron Corporation. It was provided in the form of a dictionary, which was converted to a DataFrame with the names of employees and other persons linked to the company as rows and various payment and stock values as columns. The columns used are - Salary, Bonus, Long Term Incentive, Deferred Income, Deferral Payments, Loan Advances, Other, Expenses, Director Fees, Total Payments, Exercised Stock Options, Restricted Stock, Restricted Stock Deferred and Total Stock Value.

All the graphs were developed in Python (Anaconda Spyder). The Python libraries used throughout the analysis are pandas, numpy, pickle, seaborn, matplotlib and sklearn.

The first step was to load the data, inspect the values and remove the outliers present in the dataset. In statistics, an outlier is a data point that differs significantly from other observations. An outlier may be due to variability in the measurement or may indicate experimental error, so outliers are sometimes excluded from the dataset to avoid serious problems. After loading the dataset in dictionary format, it was converted to the pandas.DataFrame structure (Figure 1).

Figure 1
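A minimal sketch of this loading step is shown below. The pickle file name is an assumption (not taken from the article's own script), and the conversion simply treats each person's name as the row index.

```python
import pickle
import pandas as pd

# Load the raw dictionary (the file name is assumed here; in the Udacity
# project files it is usually "final_project_dataset.pkl").
with open("final_project_dataset.pkl", "rb") as f:
    data_dict = pickle.load(f)

# Person names become the row index, payment/stock fields become columns.
df = pd.DataFrame.from_dict(data_dict, orient="index")
print(df.shape)
print(df.head())
```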

To analyse the string data, the complete description of the dataset was taken (Figure 2). Next, the string-type objects were converted to numeric types (integer or decimal) where possible. This also includes converting the string ‘NaN’ to numpy.NaN. After this conversion from strings to floats, the new data types of the columns are as shown in figure 3.

Figure 2
Figure 3
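One way this conversion could look is sketched below. It assumes that ‘poi’ and ‘email_address’ are the only non-numeric columns; the rest are cast to float after the string ‘NaN’ is replaced with a real missing value.

```python
import numpy as np
import pandas as pd

# Missing entries are stored as the string 'NaN'; replace them with numpy.NaN
# and cast the numeric columns to float.
df = df.replace("NaN", np.nan)

non_numeric = ["poi", "email_address"]          # assumed non-numeric columns
numeric_cols = [c for c in df.columns if c not in non_numeric]
df[numeric_cols] = df[numeric_cols].astype(float)

print(df.dtypes)   # should correspond to the dtypes shown in figure 3
```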

Figure 3 shows that there are many null values in the dataset. So, to keep only the most useful rows for exploration, the rows in which more than 70 percent of all the features are null (70 percent of 21 ≈ 15 features) were deleted; the most appropriate rows are the ones with the fewest null features. After this, the count of null values in each feature was reduced, as shown in figure 4.

Figure 4
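A short sketch of this filtering step follows. The exact threshold is an assumption derived from the 70 percent figure mentioned above, expressed through pandas’ dropna(thresh=...).

```python
# Drop rows (people) in which more than roughly 70% of the 21 features are
# null, i.e. keep rows with at least ~30% non-null values.
min_non_null = int(0.3 * df.shape[1])
df = df.dropna(thresh=min_non_null)

print(df.isnull().sum())   # reduced null counts, as in figure 4
```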

From figure 3 and figure 4, it can be seen that the features deferral_payments, restricted_stock_deferred and director_fees were removed. The remaining features were then divided into two lists - financial_features, which contains salary, bonus, exercised_stock_options, restricted_stock, shared_receipt_with_poi, total_payments, expenses, total_stock_value, deferred_income and long_term_incentive; and email_features, which contains to_messages, from_messages, from_poi_to_this_person, from_this_person_to_poi and other. All the remaining null values in the dataset, i.e. numpy.NaN, were converted to 0.
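A minimal sketch of these two lists and the fill step, for illustration rather than as the author's original script:

```python
# Feature groups exactly as listed above.
financial_features = ["salary", "bonus", "exercised_stock_options",
                      "restricted_stock", "shared_receipt_with_poi",
                      "total_payments", "expenses", "total_stock_value",
                      "deferred_income", "long_term_incentive"]

email_features = ["to_messages", "from_messages", "from_poi_to_this_person",
                  "from_this_person_to_poi", "other"]

# Replace the remaining numpy.NaN values with 0 before modelling.
df[financial_features + email_features] = (
    df[financial_features + email_features].fillna(0)
)
```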

The next objective is to identify any outliers present in the dataset and remove them. This is achieved by defining a method that constructs box plots using the seaborn library. Box plots visualize how the data is distributed for each feature and make outliers easy to identify by visual inspection.
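One possible form of such a plotting helper is sketched below; the function name and the particular features passed to it are illustrative assumptions.

```python
import seaborn as sns
import matplotlib.pyplot as plt

def plot_boxes(frame, columns):
    """Draw one box plot per feature so outliers stand out visually."""
    fig, axes = plt.subplots(1, len(columns), figsize=(4 * len(columns), 4))
    for ax, col in zip(axes, columns):
        sns.boxplot(y=frame[col], ax=ax)
        ax.set_title(col)
    plt.tight_layout()
    plt.show()

# A subset of the financial features; figure 5 shows plots of this kind.
plot_boxes(df, ["salary", "bonus", "total_payments", "total_stock_value"])
```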

Figure 5a
Figure 5b
Figure 5c
Figure 5d

From figure 5 it is clear that there is at least one strong outlier, which was identified as the ‘TOTAL’ row. After removing ‘TOTAL’, some outliers still remained, but they were not removed because they are relevant to the analysis being done.
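The removal itself is a one-liner; the drop is done by the row's index label.

```python
# 'TOTAL' is an aggregation row rather than a real person, so it is
# dropped from the DataFrame by its index label.
df = df.drop("TOTAL", errors="ignore")
```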

Since highly correlated variables carry largely redundant information for machine learning classification, it is better to use uncorrelated variables as features and to build new features from the data. Uncorrelated variables are close to orthogonal to each other, so each one contributes a different aspect of the information in the dataset to the model.

To check the correlation among features in the dataset, the corr method from the pandas library is used. The result is then visualised as a heatmap (shown in figure 6) using the heatmap method from the seaborn library; the heatmap simply makes the correlations easier to read.
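A compact sketch of this correlation heatmap, with plotting options chosen for readability rather than copied from the article:

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairwise correlations between the numeric features, drawn as a heatmap.
corr = df[financial_features + email_features].corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr, annot=True, fmt=".2f", cmap="coolwarm", center=0)
plt.title("Correlation between features")
plt.tight_layout()
plt.show()
```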

Figure 6

From figure 6, it is observed that the financial features are all highly correlated with one another, whether positively or negatively. So, a single new feature can be derived from them.

PCA (Principal Component Analysis) is applied to the correlated features in order to generate a new feature from them. PCA is applied to the columns in the financial_features list, generating a new column named ‘financial’, which is then added to the dataset. In PCA, the n_components parameter sets the number of new features to generate, and the fit_transform method of the PCA object is called on the financial-feature columns to apply PCA to the data.
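A minimal sketch of this step using scikit-learn's PCA; the variance printout at the end is an extra check, not something shown in the article.

```python
from sklearn.decomposition import PCA

# Compress the correlated financial columns into one principal component
# and store it as the new 'financial' feature.
pca = PCA(n_components=1)
df["financial"] = pca.fit_transform(df[financial_features]).ravel()

# Fraction of the variance captured by the new component.
print(pca.explained_variance_ratio_)
```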

Distribution plots were constructed for the two POI classes - True and False - for the features salary, bonus and the new financial feature (see figure 7). Figure 7a shows the pairplot between salary and bonus, while figure 7b shows the pairplot between salary and financial. The off-diagonal panels are scatter plots giving the relation between salary and bonus, and between salary and financial, respectively.

The diagonal panels, on the other hand, show the density of each univariate feature - salary, bonus and financial - individually. Here the x-axis is divided into bins (ranges of values) and the y-axis represents the count or frequency of values falling in each bin. This forms a histogram, which is then smoothed into a density plot (a kernel density estimate).
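A sketch of how such plots can be produced with seaborn's pairplot; the use of diag_kind="kde" for the diagonal densities is an assumption about the plotting choice.

```python
import seaborn as sns
import matplotlib.pyplot as plt

# Pairplots coloured by the POI class, corresponding to figures 7a and 7b.
# Off-diagonal panels are scatter plots; diagonal panels are kernel
# density estimates of each feature.
sns.pairplot(df, vars=["salary", "bonus"], hue="poi", diag_kind="kde")
sns.pairplot(df, vars=["salary", "financial"], hue="poi", diag_kind="kde")
plt.show()
```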

Figure 7a
Figure 7b

The next step is to apply univariate feature selection to the dataset. For this, the new ‘financial’ feature is first added to the all_features and financial_features lists. Univariate feature selection is done with the SelectPercentile class from sklearn.feature_selection, using its default score function; with percentile = 5, only the 5% most significant features are extracted from all the features.

An instance of SelectPercentile with percentile = 5 is created and fitted on the dataset restricted to the financial_features columns. The p-values in pvalues_ are then converted to scores by taking the negative base-10 logarithm (numpy.log10) of each element, and all the elements are divided by the maximum value in the list so that the wide range of values is mapped into [0, 1]. A bar graph of these scores is then plotted for all the features for better visualisation, as shown in figure 9.
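The sketch below illustrates this scoring step. The -log10(p) sign convention and the exact feature list passed to the selector are assumptions consistent with the [0, 1] scores described above.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_selection import SelectPercentile

all_features = financial_features + email_features + ["financial"]
X = df[all_features]
y = df["poi"].astype(int)

# Default score function (f_classif); keep only the top 5% of features.
selector = SelectPercentile(percentile=5)
selector.fit(X, y)

# Convert p-values to scores in [0, 1]: smaller p-value => larger score.
scores = -np.log10(selector.pvalues_)
scores /= scores.max()

plt.bar(range(len(all_features)), scores)
plt.xticks(range(len(all_features)), all_features, rotation=90)
plt.tight_layout()
plt.show()
```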

Figure 8
Figure 9
Figure 10

Figure 10 shows these logarithmic scores in the range [0, 1]. Only the features whose score is greater than 0.45 are selected as significant; the rest are ignored.

After this, the features with a score greater than 0.45 were found to be ‘salary’, ‘bonus’, ‘exercised_stock_options’, ‘total_stock_value’, ‘deferred_income’, ‘long_term_incentive’ and ‘financial’.

An important step before applying some machine learning algorithms is feature scaling. This is achieved with the MinMaxScaler class from sklearn.preprocessing, which maps the wide range of values in each feature into the range given by its feature_range parameter; here the default range [0, 1] is used.
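A short sketch of the scaling step, applied here to the selected features from the previous section (the variable name selected_features is an illustrative choice):

```python
from sklearn.preprocessing import MinMaxScaler

# Features kept after the univariate selection step above.
selected_features = ["salary", "bonus", "exercised_stock_options",
                     "total_stock_value", "deferred_income",
                     "long_term_incentive", "financial"]

# MinMaxScaler maps each feature to feature_range, which is (0, 1) by default.
scaler = MinMaxScaler()
df[selected_features] = scaler.fit_transform(df[selected_features])
```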

In this dataset, the features vary greatly in magnitude, units and range. Normalisation should be performed when the scale of a feature is irrelevant or misleading, and should not be performed when the scale is meaningful. Algorithms that use a Euclidean distance measure are sensitive to magnitudes, so feature scaling helps to weigh all the features equally; otherwise a feature that is much larger in scale than the others would dominate the distance computation and would need to be normalised.

Now, the aim is to select the best machine learning classification algorithm, among several candidates, to apply to the features extracted from the dataset. The classification results of the algorithms used - Naive Bayes, AdaBoost with decision trees, and Support Vector Machines (SVM) - were evaluated with the precision and recall performance measurements. Since the dataset is imbalanced, i.e. the number of samples in each class is very different, plain accuracy cannot be used.

Precision and recall both describe how good the model’s predictions are. Precision is the percentage of the model’s positive predictions that are actually relevant, while recall is the percentage of all relevant instances that the model correctly identifies.

Image source — https://towardsdatascience.com/precision-vs-recall-386cf9f89488
Figure 11

The F1 score, in statistical analysis of binary classification, is a measure of a test’s accuracy. It is the harmonic mean of precision and recall, and it reaches its best value at 1 (perfect precision and recall).
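A tiny worked example of these three metrics with scikit-learn, using purely hypothetical labels rather than the Enron data:

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Hypothetical true and predicted POI labels, only for illustration.
y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 0, 0, 1, 0, 0]

p = precision_score(y_true, y_pred)    # TP / (TP + FP)
r = recall_score(y_true, y_pred)       # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)          # 2 * p * r / (p + r)
print(p, r, f1)                        # 0.67, 0.67, 0.67 for these labels
```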

Now, the boolean ‘POI’ feature is converted to integer type, i.e. True becomes 1 and False becomes 0.

The dataset is divided into training data and testing data: the model is trained on the training data and evaluated on the testing data. This split is needed in order to detect overfitting. Since the dataset is imbalanced, it is important to split the data in a stratified way, i.e. each subset must preserve the proportion of each POI class (True or False). The split was done with a 10:3 ratio, i.e. for every 10 examples in the training data there are 3 examples in the testing data.

Figure 12

Figure 12 shows the splitting of the data into train and test data in a stratified manner. The train and test data were further divided into train features and train labels, and test features and test labels, respectively. Classification is done on the ‘POI’ column, so ‘POI’ is taken as the label and all other columns as features.
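A possible form of this split is sketched below; the test_size of 3/13 is an assumption derived from the 10:3 ratio, and the random_state is arbitrary.

```python
from sklearn.model_selection import train_test_split

# True/False -> 1/0 for the label column.
df["poi"] = df["poi"].astype(int)

X = df[selected_features]
y = df["poi"]

# Stratified split at roughly a 10:3 train-to-test ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=3 / 13, stratify=y, random_state=42
)
```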

The GridSearchCV class is also imported from sklearn.model_selection to tune the parameters of the various classification algorithms.

Machine learning classification algorithms used -

Naive Bayes - the GaussianNB class is imported from the sklearn.naive_bayes module, and the model is trained using the features and labels of the train dataset. Since GaussianNB has no hyperparameters that need tuning, GridSearchCV is not used for this classifier.
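A minimal sketch of the training and evaluation calls, assuming the train/test variables defined above:

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import classification_report

nb = GaussianNB()
nb.fit(X_train, y_train)

# Per-class precision, recall and F1 on the test data, as in figure 13.
print(classification_report(y_test, nb.predict(X_test)))
```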

The evaluation metrics of this algorithm on the test data are shown below (figure 13).

Figure 13

AdaBoost - the DecisionTreeClassifier and AdaBoostClassifier classes are imported from the sklearn.tree and sklearn.ensemble modules respectively. A dictionary of candidate parameters for the AdaBoost and decision tree classifiers is built so that GridSearchCV can find the combination giving the best score. The decision tree (with its chosen parameters) is used only as the base estimator of AdaBoost, and the AdaBoost parameters themselves are searched with GridSearchCV. A GridSearchCV object is then fitted on the features and labels of the train data to train the model with the best parameters.
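A sketch of this setup follows. The parameter grid values are illustrative assumptions, not the ones from the article, and in older scikit-learn versions the base-estimator keyword is named base_estimator instead of estimator.

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV

# A shallow decision tree as the base estimator of AdaBoost.
base_tree = DecisionTreeClassifier(max_depth=1)
ada = AdaBoostClassifier(estimator=base_tree, random_state=42)

# Example grid; the exact values searched in the article are not listed.
param_grid = {"n_estimators": [50, 100, 200],
              "learning_rate": [0.1, 0.5, 1.0]}

grid = GridSearchCV(ada, param_grid, scoring="f1", cv=3)
grid.fit(X_train, y_train)
print(grid.best_params_)
```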

Figure 14

The evaluation metrics of this algorithm on the test data are shown below (figure 15).

Figure 15

SVM (Support Vector Machine) - the SVC class is imported from the sklearn.svm module. A dictionary of candidate parameters for the support vector classifier is built so that GridSearchCV can find the combination giving the best score. A GridSearchCV object is then fitted on the features and labels of the train data to train the model with the best parameters.
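A comparable sketch for the SVM; again, the grid values below are assumptions for illustration only.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

# Example grid; the exact values searched in the article are not listed.
param_grid = {"kernel": ["rbf", "linear"],
              "C": [0.1, 1, 10, 100],
              "gamma": ["scale", 0.01, 0.1]}

grid = GridSearchCV(SVC(), param_grid, scoring="f1", cv=3)
grid.fit(X_train, y_train)

print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```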

Figure 16

The evaluation metrics of this algorithm on the test data are shown below (figure 17).

Figure 17

Conclusion :

Among the evaluation metrics of all the classifiers (figure 13, figure 15 and figure 17), the Naive Bayes classifier gives the best precision and recall for the POI class on the testing data. The F1-score for the POI class on the testing data is 0.29 for the Naive Bayes classifier, compared with 0.14 for the AdaBoost classifier and 0 for the SVM classifier, so the maximum value is achieved by Naive Bayes.
