# Detecting Possible Earnings Manipulation for a Real-World Company

`Don't forget to follow me on Medium and get notification whenever I publish a new article!`

# Intro

The final episode of earnings manipulation detection is to apply everything I wrote and showcased in the previous articles to check whether a single company has possibly manipulated its earnings. I picked Amgen Inc as my target company. Amgen is a bio-pharmaceutical firm that develops and manufactures medicines. It was founded in 1980 and, through multiple acquisitions, ranked 130th on the 2017 Fortune 500 list. There are two reasons why I picked this company as my research target.

1. The company has existed for an appropriate period, neither too long nor too short, which grants me a suitably sized data set for multiple analyses.

2. I am not very familiar with the bio-pharmaceutical industry, and this is a chance to explore more of it.

# Skills and Tools

*Tools: Python (Spyder), R (RStudio)*

*Skills: Data cleaning and munging with Pandas, function creation, visualization using matplotlib and ggplot2, linear regression with R*

*Feel free to click **my GitHub** to see all the code.*

# Analyses structure

The analysis structure will be very similar to the previous articles: I will walk through those methods and tools and apply them to the target company, Amgen Inc. Furthermore, I will check whether there is any hint that the earnings of Amgen Inc were manipulated.

1. Data Cleaning

2. The overview of revenue, net income, total asset

3. Benford’s Law

4. Accruals Model

5. Operating Cash Flow Model

6. M-Score

# 1. Data Cleaning

As I mentioned in the previous article, Compustat is a gigantic data set, containing 1,000 variables and millions of rows. The data I need here is very straightforward: the data of Amgen and the data related to Amgen. Therefore I load the data into Python and do some cleaning and slicing to create and write out two csv files, which you can find in my GitHub.

The brief process is below:

(1) Pick only the needed variables, including gvkey (company identifier), datadate (reporting period), fyear (fiscal year), revt, rect, ppegt, epspi, ni, at, oancf, sic, and rdq.

(2) For the first data frame, I select only the rows of Amgen Inc; for the second data frame, I select only the companies in the same SIC-defined industry as Amgen's, which is 2836, biological products.

(3) Years are restricted to after 1980 and before 2018, since Amgen was founded in 1980.

(4) Select the rows where receivables are larger than 0.

(5) Drop NA based on several columns and remove duplicates.

(6) Finalize and write out two tables: one with Amgen data solely, the other with the data of the industry Amgen is in.

```python
import pandas as pd
import numpy as np

comp = pd.read_csv("compustat_1950_2018_annual_merged.csv")
comp1 = comp[["gvkey","datadate","fyear","revt","rect","ppegt","epspi","ni","at","oancf","sic","rdq"]]

# Amgen only; for the industry table use: comp2 = comp1[comp1['sic'] == 2836]
comp2 = comp1[comp1['gvkey'] == 1602]

comp3 = comp2[comp2['fyear'] > 1980]
comp3 = comp3[comp3['fyear'] < 2018]
comp4 = comp3[comp3['rect'] > 0]
comp5 = comp4.dropna(subset=['at','revt','ni','epspi','rect','oancf'])
comp6 = comp5.drop_duplicates(comp5.columns.difference(['rdq']))
com_7 = comp6.fillna(0)

com_7.to_csv('amgen_compustate.csv', float_format='%.6f', index=0)
# For the industry table: com_7.to_csv('amgen_compustate_2386.csv', float_format='%.6f', index=0)
```

# 2. The overview of revenue, net income, total asset of Amgen

Before digging deeper into earnings manipulation practice, I'd like to get a basic overview of Amgen's recent financial situation and how Amgen has performed across the years. Therefore I select three of the most important metrics and plot and analyze them across the years.

(1) Year 2017 Financial Performance

# Python code

```python
df = pd.read_csv('amgen_compustate.csv')
y2017 = df[df['fyear'] == 2017]
y17_basic = pd.DataFrame(y2017[['revt','ni','at']])
print(y17_basic)

########## Result ############
#    revt      ni       at
# 22849.0  1979.0  79954.0
```

(2) Yearly trend

```python
import matplotlib.pyplot as plt

df2 = df[['fyear','revt','ni','at']]
df2 = df2.set_index('fyear')

plt.figure(figsize=(10,8))
plt.plot(df2['revt'], linestyle='solid')
plt.plot(df2['ni'], linestyle='dashed')
plt.plot(df2['at'], linestyle='dashdot')
plt.legend(df2.columns.values.tolist())
plt.title('Trend for Revenue, Net Income & Total Asset for Amgen Inc 1988-2017')
plt.show()
```

It is observed that Amgen's total assets rose drastically after the year 2000, which is worth investigating; it might be the result of multiple, frequent acquisitions in that period. Also, net income has dropped since 2015, so its recent profitability could be checked further.

# 3. Benford’s Law

Still remember Benford's Law? It describes the distribution of leading digits in real-life numbers: each digit appears as the first digit with a certain probability.
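To recap the math: under Benford's Law, the expected probability of leading digit d is log10(1 + 1/d), so 1 leads about 30.1% of the time while 9 leads only about 4.6%. A minimal Python sketch of the comparison (the sample values below are made up for illustration, not actual Compustat data):

```python
import math
from collections import Counter

def benford_expected(d: int) -> float:
    """Expected probability of leading digit d (1-9) under Benford's Law."""
    return math.log10(1 + 1 / d)

def first_digit(x: float) -> int:
    """Leading non-zero digit of |x|."""
    for ch in str(abs(x)):
        if ch.isdigit() and ch != '0':
            return int(ch)
    raise ValueError("no non-zero digit")

# Hypothetical sample values, for illustration only
sample = [22849.0, 1979.0, 79954.0, 1093.0, 648.0, 3629.0, 14268.0]
counts = Counter(first_digit(v) for v in sample)
n = len(sample)
for d in range(1, 10):
    print(f"digit {d}: sample {counts.get(d, 0) / n:.3f} vs Benford {benford_expected(d):.3f}")
```

The `benford.analysis` R package used below automates exactly this comparison and adds the formal test statistics.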

In this section, I will again focus on three important financial metrics related to earnings (revenue, operating cash flow, and total assets) and see whether Benford's Law can reveal some hints of earnings manipulation.

(1) First Digit

# R-code

```r
library(benford.analysis)
library(dplyr)
library(ggplot2)

## Use the 1st digit of the revenue column as a code example
revenue_bf <- data.frame(benford(abs(df$revt), number.of.digits = 1,
                                 sign = 'positive', discrete = TRUE,
                                 round = 3)$bfd)[, c("digits", "data.dist", "benford.dist")]
colnames(revenue_bf) <- c('Digit', 'Sample_Revenue', 'Benford_Distribution')

ggplot(data = revenue_bf, aes(x = as.factor(Digit), y = Benford_Distribution)) +
  geom_bar(stat = 'identity') +
  geom_line(aes(Digit, Sample_Revenue, col = 'Sample_Revenue'), linetype = 1) +
  geom_point(aes(Digit, Sample_Revenue), size = 4, col = 'red') +
  ggtitle('Theoretical Distribution v.s. Amgen Revenue Distribution - 1st Digits')
```

From the graph and chi-squared table above, the first-digit distribution of operating cash flow deviates more significantly from Benford's Law, which indicates possible manipulation of the first digit.

Visually speaking, there is a high peak at digit 5 in operating cash flow. We might further refer to the company's performance goals to see whether managers tried to push the numbers to five.

Also for revenue, though it does not differ significantly from the theoretical digit distribution, I observe a higher appearance probability for digit 1 and zero appearances of digit 9. This might hint that the company tends to take numbers with 9 as the first digit and round them up to 1 (i.e., turning 999 into 1,000).

(2) Second digit: After looking into the first-digit distribution, I decided to go further and check the distribution of the second digit.

The chi-squared test shows a significant difference between the theoretical digit distribution and the actual number distribution. From the graphs, I observe a huge distinction at number 9, which suggests possible manipulation of the second digit beyond mere rounding.

(3) One limitation of Benford's Law here is that the sample for each financial metric is not large, around 30 rows. Such a small sample might not represent the overall picture as well as in the previous article, where I used every company in the data set.
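For reference, the chi-squared statistic behind these tests compares observed digit counts with the counts Benford's Law predicts. A self-contained sketch (the observed counts are hypothetical, chosen only to illustrate a roughly 30-row sample):

```python
import math

# Hypothetical observed first-digit counts for digits 1..9 in a 30-row sample
observed = [12, 5, 4, 3, 2, 1, 1, 1, 1]
n = sum(observed)

# Counts Benford's Law predicts for the same sample size
expected = [n * math.log10(1 + 1 / d) for d in range(1, 10)]

# Chi-squared goodness-of-fit statistic (8 degrees of freedom for 9 digits)
chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
print(f"chi2 = {chi2:.2f}")
# Values above the critical value 15.51 (df = 8, alpha = 0.05) would
# indicate a significant departure from Benford's Law
```

With only ~30 observations, each expected count is small, which is exactly why the test loses power here.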

# 4. Yearly Accruals Model

For the yearly accruals model, I used the second data file with Amgen and its peer companies in the same SIC-defined industry.

The model runs an accruals regression within the same industry for every year, and each regression produces a residual for each company. If the residual in a specific year is large, the company's accruals have varied from the industry's collective standard, which further indicates possible manipulation of an independent variable such as cash revenue growth or gross PPE.
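The residual idea can be sketched in Python with numpy, using one toy industry-year (all numbers are made up; the actual analysis uses R and the Compustat columns):

```python
import numpy as np

# Toy industry data for one fiscal year: scaled cash revenue growth and
# scaled gross PPE per company; the first row is the target company
X = np.array([[0.10, 0.50],
              [0.05, 0.40],
              [0.20, 0.60],
              [0.08, 0.45]])
y = np.array([0.30, 0.02, 0.05, 0.03])  # accruals per company

# Fit accruals ~ intercept + cashrev_growth + ppe across the industry-year
A = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

# The target company's residual: how far its accruals sit from the industry fit
residual = y[0] - A[0] @ coef
print(abs(residual))  # a large unsigned residual flags a suspicious year
```

Repeating this fit for every year and keeping the target company's residual gives the series plotted below.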

# R-code

```r
df1 <- select(df, gvkey, fyear, accurals, scale_cashrev_growth, scale_ppe)
df1 <- na.omit(df1)

year <- list()
residual_a <- list()
for (i in 1:29) {
  data1 <- df1[df1['fyear'] == i + 1988, ]
  fit <- lm(accurals ~ scale_cashrev_growth + scale_ppe, data = data1)
  year[i] <- i + 1988
  residual_a[i] <- fit$residuals[1]
}

residuals_list <- do.call(rbind,
                          lapply(1:length(year),
                                 function(i) data.frame(A = unlist(year[i]),
                                                        B = unlist(residual_a[i]))))
residuals_list$B <- abs(residuals_list$B)

plot(residuals_list, type = "l", xlab = 'Year', ylab = 'Residuals',
     main = 'Yearly Unsigned Discretionary Accruals for Amgen')
```

In the code chunk, I select the munged columns accruals, cash revenue growth, and PPE, run a regression for each year in a for loop, and capture the first residual, which belongs to Amgen. I then take the absolute value of the Amgen residuals and plot them across the years.

The peaks in the residuals are the years where Amgen's accruals were off the industry benchmark (2002, 2014, and 2017), which might give a hint of possible earnings manipulation.

# 5. Operating Cash Flow Model

The logic behind the operating cash flow model is very similar to the accruals model: it tests the predictive power of several financial metrics. The higher the R-squared of the model, the more robustly the independent metrics predict the target variable, and the lower the possibility of earnings manipulation in operating cash flows or accruals.

I decided to use five-year bins and see how the predictive power of the model changes in each five-year window.
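The key data-munging step, pairing each fiscal year's accruals and operating cash flow with next year's operating cash flow, can also be sketched in pandas (toy numbers, Compustat-style column names):

```python
import pandas as pd

# Toy frame with Compustat-style columns (values are made up for illustration)
df1 = pd.DataFrame({
    'gvkey': [1602, 1602, 1602],
    'fyear': [1990, 1991, 1992],
    'ni':    [100.0, 120.0, 90.0],
    'oancf': [80.0, 95.0, 110.0],
})
df1['accurals'] = df1['ni'] - df1['oancf']  # accruals = net income - operating cash flow

# Shift oancf back one year so each row carries next year's operating cash flow
nxt = df1[['gvkey', 'fyear', 'oancf']].copy()
nxt['fyear'] -= 1
nxt = nxt.rename(columns={'oancf': 'next_oancf'})
df1 = df1.merge(nxt, on=['gvkey', 'fyear'], how='left')

print(df1[['fyear', 'oancf', 'next_oancf']])
# 1990 now pairs with 1991's oancf, 1991 with 1992's; 1992 has no next year (NaN)
```

The R code that follows does the same thing with `mutate`, `rename`, and `left_join`.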

# R-code

```r
df1$accurals <- df1$ni - df1$oancf

df01 <- df1[, c('gvkey', 'fyear', 'oancf')]
df02 <- mutate(df01, fyear = fyear - 1) %>%
  rename(., next_oancf = oancf)
df1 <- left_join(df1, df02, by = c('gvkey', 'fyear'))

# Basic model
oc_reg <- lm(df1$next_oancf ~ df1$accurals + df1$oancf)

# Run the model for each five-year window (1992 shown as an example)
y1992 <- df1[df1['fyear'] <= 1992, ]
oc_reg_92 <- lm(y1992$next_oancf ~ y1992$accurals + y1992$oancf)
summary(oc_reg_92)
```

The result is sorted below:

From the chart, the green-colored R-squared values in 1993–1997 and 2013–2016 are relatively high, indicating better predictive power and a lower probability of manipulation.

The two red-colored R-squared values from 2003 to 2012 show weak model prediction and a higher possibility of manipulation of the independent metrics (operating cash flow and accruals).

# 6. M-Score

For the M-score, I will use the M-Score Generator I created in my previous article to plot the M-score graph.

# Python code

```python
m_score_trend_graph('Amgen')
```

In the M-score graph, a data point above the red line is a red flag for earnings manipulation, and a data point between the red and green lines is a yellow flag indicating slight manipulation.

Overall, based on the M-score, Amgen might not have been an earnings manipulator over the past 10 years.
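For context, the generator is based on the Beneish (1999) eight-variable model, which weights eight index ratios into a single score; a minimal sketch of the scoring step, assuming the eight indices (DSRI, GMI, AQI, SGI, DEPI, SGAI, TATA, LVGI) have already been computed from the financial statements:

```python
def beneish_m_score(dsri, gmi, aqi, sgi, depi, sgai, tata, lvgi):
    """Beneish (1999) eight-variable M-score. Scores above roughly -1.78 are
    often read as a red flag; between -2.22 and -1.78 as a grey zone."""
    return (-4.84 + 0.920 * dsri + 0.528 * gmi + 0.404 * aqi + 0.892 * sgi
            + 0.115 * depi - 0.172 * sgai + 4.679 * tata - 0.327 * lvgi)

# A company with all indices at their neutral value of 1 and zero total
# accruals to total assets (tata) scores below both thresholds
print(beneish_m_score(1, 1, 1, 1, 1, 1, 0, 1))
```

Note the large weight on TATA (total accruals to total assets), which ties the M-score back to the accruals models in sections 4 and 5.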

# Brief Conclusion

In this article, I briefly walked through different methods to detect possible earnings manipulation for a real-world company. Different methods provide different perspectives. Next time you want to look into a company's profile, maybe they can grant you a fresh eye on the company in a more data-science-based manner.

If you like the article, feel free to give me 5+ claps

If you want to read more articles like this, give me 10+ claps

If you want to read articles on different topics, give me 15+ claps and leave a comment here. Thanks for reading!