Neural Networks, the Universal Approximation Theorem and Option Valuation

Published in Analytics Vidhya · 10 min read · Oct 19, 2020

by Abhisek Das, CQF

A 3-unit Multi-Layer Perceptron using a step function to approximate a continuous function

‘Data Science’ and ‘Machine Learning’ are the buzzwords that seem to dominate the narrative space of technology and computer science in the second decade of the 21st century. Everyone, from national governments to staid corporations, seems eager to jump onto the bandwagon. ‘Data is the new gold’ seems to be the clarion call of this brave new world, and Machine Learning the new frontier. Neural Networks and Deep Learning have a particular cachet with the audience, as they seem to evoke dreams of ‘real artificial intelligence’ and its endless possibilities.

However, as is inevitable in the real world, there are skeptics aplenty. ‘Black box’, ‘unreasonable expectations’ and ‘failure to deliver’ are the rebuttals of these apparent Luddites, who point to the failure of ‘statistics wrapped in glitz and glamour’ to deliver on the high expectations of punters in the past.

Rosenblatt’s Perceptron really riled up the media at the time

So the perhaps-billions-of-dollars question is: who is right, the enthusiasts or the skeptics? In my humble opinion, neither.

Data Science or Machine Learning is basically the application of algorithms based on mathematics and statistics to real-world problems. Now I know there is a lot more to Machine Learning, especially its implementation, but let’s stick to the bare bones of it. As we all know, mathematics and statistics are developed by the gifted few, who sit in their so-called ivory towers and think up amazingly clever solutions to problems which plague the mundane world. But there are some issues with this approach. Mathematics is said to be the ‘Handwriting of God’, and is used to understand and solve problems posed by nature. Usually, we call nature fickle, but when it comes to her laws, she is anything but. The laws of gravity, thermodynamics and electromagnetism are inviolate and do not change for real-world applications. An aircraft designed using these laws will not stop functioning in a particular place because ‘the underlying assumptions no longer held true, or the initial data used to develop the model was inadequate to capture the actual system mechanics’. But when it comes to situations involving human beings, well, nothing is sacred or inviolable.

This poses a rather serious problem to Machine Learning. We are basically using tools developed to understand a world which operates on cardinal, fixed rules, to analyse situations which are anything but. The problem is exacerbated in sectors like Finance, wherein the conditions can change at any time, and the past is truly no guarantee of the future (although most financial models are predicated on this assumption, see Martingales and Markov processes).

Also, when it comes to Finance, not all ML algorithms are created equal. We often talk of the ML tool-box, and there is a good reason for that. Think of libraries like the ubiquitous scikit-learn or TensorFlow as your tool-box into which you reach to fix a particular appliance. Now, different tools have different utilities, and to solve your problem, or ‘fix your appliance’, you must know which tool, or which specific combination, would give you the optimum results. And for that, we need to know how each algorithm actually functions.

Of course, there is a ready rebuttal on this point: we simply try out all the algorithms and stick with those which work best for the particular situation. But the problem with this approach is: what if, once you find your ‘optimal algorithm or combination’ and build and implement your system, your problem space, i.e. your data, changes? Remember the cardinal rule of data science or data analysis: ‘Garbage in, Garbage out’, or GIGO. The entire modus operandi of ML is based on data, so you need to understand a) how your algorithm works and its limitations, and b) how your data may change and its probable future behavior, in order to build an efficient system.

How Neural Networks work (really simplified version):

Neural Networks are universal function approximators: a single layer of neurons can approximate any continuous function to an arbitrary precision level, although that can take an impractical (possibly infinite) number of neurons to achieve, so it is better to have multiple layers and non-linear activation functions (tanh, sigmoid, ReLU etc.).

A function and how networks try to approximate it

In the above figure, we have a function, which is unknown to the network, and the points represent the input data we feed to the network. By using back-propagation (basically taking derivatives of the loss function w.r.t. the weights of the network), the network tries to minimize the error, i.e. E.
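As an illustration (a sketch, not taken from the article), here is a single gradient-descent step for a one-neuron model, where the derivative of the error E with respect to each weight tells us how to nudge it:

# Minimal sketch: one gradient-descent step for y_hat = w*x + b with squared error.
x, y = 2.0, 5.0                 # one training point
w, b = 0.1, 0.0                 # initial weight and bias
lr = 0.01                       # learning rate

y_hat = w * x + b               # forward pass
E = 0.5 * (y_hat - y) ** 2      # the error E (squared loss)

dE_dw = (y_hat - y) * x         # derivative of E w.r.t. w (chain rule)
dE_db = (y_hat - y)             # derivative of E w.r.t. b

w -= lr * dE_dw                 # update the weights to reduce E
b -= lr * dE_db
print(w, b, E)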

The Universal Approximation Theorem states:

  • a feedforward network with 1) a linear output layer, 2) at least one hidden layer containing a finite number of neurons and 3) a suitable activation function can approximate any continuous function on a compact subset of ℝⁿ to arbitrary accuracy.

The theorem was first proved for the sigmoid activation function (Cybenko, 1989). Later it was shown that the universal approximation property is not specific to the choice of activation function (Hornik, 1991) but rather stems from the multilayer feedforward architecture itself.
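As a quick illustration of the theorem (a sketch, not part of the original article; the layer width, learning rate and epoch count are arbitrary choices), a single hidden layer can learn sin(x) on a compact interval:

# Minimal sketch: one hidden layer approximating sin(x) on [-pi, pi]
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

x = np.linspace(-np.pi, np.pi, 1000).reshape(-1, 1)     # a compact subset of the real line
y = np.sin(x)                                            # the continuous target function

uat_model = Sequential([
    Dense(units=64, input_shape=(1,), activation='tanh'),  # one hidden layer, finite neurons
    Dense(units=1, activation=None)                         # linear output layer
])
uat_model.compile(optimizer=Adam(learning_rate=0.01), loss='mean_squared_error')
uat_model.fit(x, y, epochs=200, batch_size=32, verbose=0)
print(uat_model.predict(np.array([[0.5]])))              # should land close to sin(0.5) ≈ 0.479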

However, let’s take a step back and think about the implications of this theorem. What it basically says is that if you have adequate (this is very important) input data which is sourced from a continuous function, and you have a classification or regression problem, you are golden. Just build a network, preferably a multi-layer one, code up a framework to feed the input data to it, and voila, success!

When it comes to problems like voice recognition (it’s a function; think of how your MP4 audio files work), or face recognition (again a function), or games based on rules (Go), Neural Networks are the champions. But then we hear statements like ‘for user recommendation systems or time series analysis or real-life data, ensemble techniques beat networks’, and we wonder if neural networks are really the magic bullet we thought them to be.

And the answer is no. If you want predictions wherein the input data and their outputs are linked through a function, go for neural networks. Otherwise look elsewhere: Random Forests, ensemble techniques, even lazy instance-based techniques like k-Nearest Neighbours. If your inputs are uncertain or non-parametric data, then neural networks are perhaps not the best choice.

In finance, neural networks are really good for predicting values for options, the outputs of models like SABR for volatility, etc. What we have to note here is that SABR stands for Stochastic Alpha, Beta, Rho, and so the outputs (volatility in this case) will be stochastic, i.e. random. So, if neural networks can give very good predictions for ‘random’ processes, why can we not use them for, say, stock market price prediction?

The answer to this lies in the subtle difference between ‘stochastic’ and ‘uncertain’. If we have a fair or unbiased coin and we do a toss, what is the probability that we will obtain a head? It’s 0.5, and the same goes for the probability of obtaining a tail. This is the premise on which ‘stochastic processes’ are built, e.g. the Wiener process, represented by dW, wherein dW is normally distributed with mean 0 and variance dt, i.e. dW = φ√dt with φ drawn from a standard normal distribution.

But what if the coin is unfair? Then we cannot model the process using a Wiener process, nor its outcome.

Now the usual model for stock price change is given as:

dS = μS dt + σS dW

Wherein the first part (μS dt) is the drift and the second part (σS dW) is the diffusion. The drift, or the gradual upward or downward movement of the stock price with time, is what investors are really interested in. If you have 100 dollars and are wondering whether your chosen stocks will double it in 5 years, drift is what you care about.
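A minimal sketch of this model in code (the drift, volatility and horizon values below are assumed purely for illustration):

# Simulate one path of dS = mu*S*dt + sigma*S*dW, with dW = sqrt(dt) * standard normal draw
import numpy as np

mu, sigma = 0.08, 0.2            # assumed drift and volatility
S0, T, steps = 100.0, 5.0, 1250  # assumed starting price, horizon in years, number of steps
dt = T / steps

rng = np.random.default_rng(42)
S = np.empty(steps + 1)
S[0] = S0
for i in range(steps):
    dW = np.sqrt(dt) * rng.standard_normal()     # Wiener increment ~ N(0, dt)
    dS = mu * S[i] * dt + sigma * S[i] * dW      # drift term + diffusion term
    S[i + 1] = S[i] + dS
print(S[-1])                                      # terminal price after 5 years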

But in the Holy Grail of option valuation models, the Black-Scholes Model, using a clever sleight of hand (delta hedging), the drift term is done away with, and we are left with the second part, the diffusion, which contains the volatility or noise (which can be known, or again modelled), the stock price (which is not there in the final BSM equation) and the dW, which scales with the square root of the time step (and so, in a time-dependent process, is not strictly random).

But when we come to real-world stock market prediction, the coin is actually biased. Markets can move up, down or sideways based on a mere rumour, a whiff of a scam or a new regulation; unpredictable news can make them skitter and gallop or suddenly flounder. It’s not stochastic, it’s uncertain and non-parametric. So trying to reduce the stock market to a parametric function is really difficult and maybe even futile.

Coded example of all the above:

Now what I have below is an attempt to predict the value of a digital option using a Neural Network and its variant, the LSTM (Long Short Term Memory) network.

A digital option, in layman’s terms, is an option that either pays all or nothing depending on whether a certain condition (the underlying’s value at maturity minus the strike fixed at contract inception being greater than 0) holds at option maturity or not. So its value is function-based and should be easily approximated using a neural network.
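For illustration, a minimal sketch of such a payoff (the unit payout amount is an assumption):

# Cash-or-nothing digital call payoff: pays `payout` if the underlying finishes above the strike
def digital_call_payoff(S_T, K, payout=1.0):
    return payout if S_T > K else 0.0

print(digital_call_payoff(104.0, 100.0))   # 1.0 — finished above the strike
print(digital_call_payoff(98.0, 100.0))    # 0.0 — finished below the strike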

First I value it using the Black-Scholes Model and a Monte Carlo simulation, and then I feed the inputs of these two valuation approaches to the networks, together with the labelled data, and try to get predictions.

Initial step, load all the libraries:
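The original import cell is not reproduced here; assuming a TensorFlow 2 / Keras setup, the later snippets rely on roughly the following imports:

# Imports the snippets below depend on (assumed; the article does not show its import cell)
from datetime import datetime
import numpy as np
import pandas as pd
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, LSTM, TimeDistributed
from tensorflow.keras.optimizers import Adam
from sklearn.metrics import explained_variance_score, mean_squared_error, r2_score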

In this section, we get the features and target data from an Excel file. The target data is generated using a separate piece of code which uses both the Black-Scholes Model to price digital options and a 50,000-iteration Monte Carlo simulation. But the file used here only has the BSM valuations. Please note that in this valuation the implied volatility, risk-free rate and time to maturity are kept constant. Only the underlying price and the option strikes are the variables.
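That separate target-generation code is not shown in the article. As a rough sketch (with hypothetical parameter values), a cash-or-nothing digital call can be valued both with the Black-Scholes closed form and with a 50,000-path Monte Carlo simulation along these lines:

# Sketch of the two valuation approaches described above (parameter values are hypothetical)
import numpy as np
from scipy.stats import norm

def bsm_digital_call(S0, K, r, sigma, T, payout=1.0):
    # Closed-form Black-Scholes value of a cash-or-nothing digital call: payout * exp(-rT) * N(d2)
    d2 = (np.log(S0 / K) + (r - 0.5 * sigma**2) * T) / (sigma * np.sqrt(T))
    return payout * np.exp(-r * T) * norm.cdf(d2)

def mc_digital_call(S0, K, r, sigma, T, payout=1.0, n_paths=50000, seed=0):
    # Monte Carlo value: simulate terminal prices under risk-neutral GBM, discount the indicator payoff
    rng = np.random.default_rng(seed)
    Z = rng.standard_normal(n_paths)
    S_T = S0 * np.exp((r - 0.5 * sigma**2) * T + sigma * np.sqrt(T) * Z)
    return payout * np.exp(-r * T) * np.mean(S_T > K)

print(bsm_digital_call(100.0, 100.0, 0.05, 0.2, 1.0))
print(mc_digital_call(100.0, 100.0, 0.05, 0.2, 1.0))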

ANN code:

start_time_ANN=datetime.now()
data=pd.read_excel(r'C:\Users\abhis\Desktop\Digital_Option_Val_conventional_100.xlsx', sheet_name="Sheet1", index_col=0, usecols=[0,1,2,3,4,5,6])
print(data.head())
X_Features=pd.DataFrame(data)
Y_Response=X_Features[['BSM_VAL']].copy()
X_Features=X_Features.drop(columns='BSM_VAL')
print(X_Features.head())
print(Y_Response.head())
Y_Response=Y_Response['BSM_VAL'].ravel()

We have the following output (checking the Features and Response arrays to confirm that the Features array does not contain the output, i.e. the labels; a good practice for beginners):

Output:

Initial Data, Features and Response(Labelled Output)

Coding the ANN (Artificial Neural Network) model and getting the output:

model=Sequential([
    Dense(units=80, input_shape=(5,), activation='relu'),
    Dense(units=40, activation='relu'),
    Dense(units=20, activation='relu'),
    Dense(units=1, activation=None)
])
model.summary()
model.compile(optimizer=Adam(learning_rate=0.0001), loss='mean_squared_error', metrics=['accuracy'])
model.fit(x=X_Features, y=Y_Response, validation_split=0.1, batch_size=1, epochs=30, shuffle=True, verbose=2)
predictions=model.predict(x=X_Features, batch_size=1, verbose=0)
dataframe_predictions=pd.DataFrame(predictions)
end_time_ANN=datetime.now()
print('ANN\n', dataframe_predictions.head())

Output:

model.summary()
ANN output(first five)

LSTM (Long Short Term Memory):

Now we create the LSTM model, with the tanh activation function, as is the practice.

start_time_LSTM=datetime.now()
X_Features_LSTM=np.array(X_Features).reshape(802,1,5)  # data is arranged as [batch_size, time steps, dimensionality]
model_LSTM=Sequential([
    LSTM(units=80, input_shape=(None,5), activation='tanh', return_sequences=True),
    LSTM(units=40, activation='tanh', return_sequences=True),
    LSTM(units=20, activation='tanh', return_sequences=True),
    TimeDistributed(Dense(1))
])
model_LSTM.summary()
model_LSTM.compile(optimizer=Adam(learning_rate=0.0001), loss='mean_squared_error', metrics=['accuracy'])
model_LSTM.fit(x=X_Features_LSTM, y=Y_Response, validation_split=0.1, batch_size=1, epochs=30, shuffle=True, verbose=2)
predictions_LSTM=model_LSTM.predict(x=X_Features_LSTM, batch_size=1, verbose=0)
predictions_LSTM=np.array(predictions_LSTM).reshape(802,1)
dataframe_predictions_LSTM=pd.DataFrame(predictions_LSTM)
end_time_LSTM=datetime.now()
print('LSTM\n', dataframe_predictions_LSTM.head())

Output:

model.summary(), note the significant increase in model parameters over an ANN
LSTM output for strikes 100 and 105 (first five)

Now we use an LSTM network to predict option values for strike 110, but we train the network only on the data (feature set) for strikes 100 and 105. This will allow us to find out whether the network is over-fitting to the training strikes or can generalize to an unseen one.

start_time_LSTM_new=datetime.now()
data_new=pd.read_excel(r'C:\Users\abhis\Desktop\Digital_Option_Val_conventional_110.xlsx', sheet_name="Sheet1", index_col=0, usecols=[0,1,2,3,4,5,6])
X_Features_LSTM_New=pd.DataFrame(data_new)
Y_Response_LSTM_New=X_Features_LSTM_New[['BSM_VAL']].copy()
X_Features_LSTM_New=X_Features_LSTM_New.drop(columns='BSM_VAL')
print(X_Features_LSTM_New.head())
print(Y_Response_LSTM_New.head())

Output:

New data
model_LSTM_split=Sequential([
    LSTM(units=80, input_shape=(None,5), activation='tanh', return_sequences=True),
    LSTM(units=40, activation='tanh', return_sequences=True),
    LSTM(units=20, activation='tanh', return_sequences=True),
    TimeDistributed(Dense(1))
])
model_LSTM_split.summary()
model_LSTM_split.compile(optimizer=Adam(learning_rate=0.0001), loss='mean_squared_error', metrics=['accuracy'])
model_LSTM_split.fit(x=X_Features_LSTM, y=Y_Response, validation_split=0.1, batch_size=1, epochs=30, shuffle=True, verbose=2)

Please note that when fitting the model, we use the old features (X_Features_LSTM), i.e. the data for strikes 100 and 105.

X_Features_LSTM_New=np.array(X_Features_LSTM_New).reshape(401,1,5)  # same [batch_size, time steps, dimensionality] layout as before, so the LSTM accepts the input
predictions_LSTM_split=model_LSTM_split.predict(x=X_Features_LSTM_New, batch_size=1, verbose=0)
predictions_LSTM_split=np.array(predictions_LSTM_split).reshape(401,1)
dataframe_predictions_LSTM_split=pd.DataFrame(predictions_LSTM_split)
end_time_LSTM_new=datetime.now()
print(dataframe_predictions_LSTM_split.head())

When using the predict function, we use the new features (X_Features_LSTM_New), i.e. the data for strike 110.

Output:

model.summary()
LSTM output for strike 110 (first five)

Accuracy analysis of the code’s predictions:

Run time analysis:

We first compare the run times of the Monte Carlo simulation process and the neural network methods:

print('This is the ANN run time:{}'.format(end_time_ANN-start_time_ANN))
print('This is the LSTM run time:{}'.format(end_time_LSTM-start_time_LSTM))
print('This is the New LSTM run time:{}'.format(end_time_LSTM_new-start_time_LSTM_new))

Output:

Run times for ANN, LSTM and MCS

Predictions Accuracy Analysis:

Since we are going for regression-based prediction, i.e. a single output for each input vector, we use the metrics Explained Variance, Mean Squared Error and Measure of Fit (R²). For Measure of Fit and Explained Variance, the closer to 1 the better, while Mean Squared Error should be as small as possible for the best quality prediction.

Code:

ann_explained_variance=explained_variance_score(Y_Response, predictions)
ann_mean_squared_error=mean_squared_error(Y_Response, predictions)
ann_r2=r2_score(Y_Response, predictions, multioutput='raw_values')
lstm_explained_variance=explained_variance_score(Y_Response, predictions_LSTM)
lstm_mean_squared_error=mean_squared_error(Y_Response, predictions_LSTM)
lstm_r2=r2_score(Y_Response, predictions_LSTM, multioutput='raw_values')
lstm_new_explained_variance=explained_variance_score(Y_Response_LSTM_New, predictions_LSTM_split)
lstm_new_mean_squared_error=mean_squared_error(Y_Response_LSTM_New, predictions_LSTM_split)
lstm_new_r2=r2_score(Y_Response_LSTM_New, predictions_LSTM_split, multioutput='raw_values')
print('ANN Explained Variance\n', ann_explained_variance)
print('ANN Mean Squared Error\n', ann_mean_squared_error)
print('ANN R2\n', ann_r2)
print('LSTM Explained Variance\n', lstm_explained_variance)
print('LSTM Mean Squared Error\n', lstm_mean_squared_error)
print('LSTM R2\n', lstm_r2)
print('LSTM New Data Explained Variance\n', lstm_new_explained_variance)
print('LSTM New Data Mean Squared Error\n', lstm_new_mean_squared_error)
print('LSTM New Data R2\n', lstm_new_r2)

Output:

ANN metrics for strikes 100 and 105
LSTM metrics for strikes 100 and 105
LSTM metrics for strike 110

So we can see from the Explained Variance, Mean Squared Error and Measure of Fit that both the vanilla neural network and the LSTM give very good results, with the LSTM outperforming the vanilla network due to its memory persistence feature. Even when we try to get out-of-sample predictions, as in the last case, the results are pretty impressive. Also, the shorter run time and the ability to save pre-trained models and run them on demand (Keras: model.save) mean that once we have sample data inputs and their outputs, we can use pre-trained models to give near-exact predictions for any input data drawn from the same approximated function.
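For completeness, a minimal sketch of saving and reloading a trained network (the file name here is hypothetical):

# Persist the trained ANN and run it later, on demand
model.save('digital_option_ann.h5')
from tensorflow.keras.models import load_model
reloaded_model = load_model('digital_option_ann.h5')
print(reloaded_model.predict(x=X_Features, batch_size=1, verbose=0)[:5])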

So the neural network performed very well in approximating the digital option valuation function.

Sources:

W. A. McGhee, SABR paper (https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3288882)
