Generating Realistic Synthetic Financial Time Series
A non-parametric approach using weighted, variable-width kernels to estimate conditional PDFs
Synthetic financial time series are artificially generated sequences of data that mimic the statistical properties of real market data. These synthetic datasets play a pivotal role for traders and strategy designers, offering a controlled environment to test and refine trading strategies, ensuring model adaptability, and providing insights into performance under diverse market scenarios and economic conditions that may be difficult to observe in reality.
But creating ‘realistic’ synthetic financial time series is no easy feat, as historical data often contains specific, complex features that are difficult to replicate. For example, consider Figure 1, which shows the prices, p(t), returns, r(t) = (p(t) − p(t−1)) / p(t−1), and the distribution of returns for Apple (AAPL) over the three-year period from January 1, 2021, to January 1, 2024.
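With daily close prices in a pandas Series, the simple returns defined above can be computed in one line. A minimal sketch (the price values and dates are purely illustrative):

```python
import pandas as pd

# Daily close prices indexed by date (values here are purely illustrative)
prices = pd.Series(
    [182.0, 185.5, 184.2, 188.9, 187.3],
    index=pd.date_range("2024-01-02", periods=5, freq="B"),
)

# Simple returns: r(t) = (p(t) - p(t-1)) / p(t-1)
returns = prices.pct_change().dropna()
print(returns)
```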
The distribution of returns (right) isn’t your typical bell curve — it has fatter tails and a sharper peak than a normal (or ‘Gaussian’) distribution (dashed curve). This is called ‘excess kurtosis’. Looking at the return series (center), we notice that large returns (positive or negative) tend to be followed by more large returns, while small returns tend to follow small returns. This pattern is known as ‘volatility clustering’. In the price series (left), there’s an overall upward trend over the three years, but we also see shorter ups and downs. These shorter trends often correct themselves, with prices returning to more ‘normal’ levels based on fundamentals like price/earnings ratios — a behavior known as ‘mean reversion’. Fat tails, volatility clustering, and mean reversion are examples of ‘stylized features’. To create convincing synthetic financial series, we MUST capture these features.
While parametric models such as Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs) have surged in popularity over recent years, their results in this domain have been hit and miss. These models often struggle to capture long-term dependencies inherent in features like volatility clustering and mean reversion. They also come with many practical challenges including training instability, high resource demands, and limited interpretability.
In this story, I take a completely different approach and describe a non-parametric model based on explicitly estimating the probability distribution of returns. I will begin by exploring how to model a distribution of returns without considering the temporal sequence in which they occur. We’ll see that capturing features like excess kurtosis and skew is fairly straightforward. Then, I’ll demonstrate how to extend this to estimate ‘conditional’ return distributions — distributions that take into account previous returns. This paves the way for more advanced modelling that can generate realistic synthetic series, accurately capturing the stylized features of financial time series and providing a much-needed alternative to GANs and VAEs.
Modeling financial returns using KDE
Kernel density estimation (KDE) is a useful method for modelling non-normal distributions. It works by placing a kernel (often Gaussian, but other forms are possible) at each data point and normalizing so that when all these kernels are summed together, the area under the curve equals one, making the result a valid probability density function (PDF).
The key factor in KDE is the kernel width, or ‘bandwidth’. If this is too narrow, the density estimate becomes overly localized, resulting in a squiggly and noisy density function. Conversely, if the bandwidth is too broad the estimate will over-smooth the data and fail to capture important features of the distribution. Although rules of thumb such as the Silverman bandwidth¹ often provide a good trade-off, a constant bandwidth is unlikely to adequately capture the fat-tailed nature of financial return distributions while simultaneously modelling the higher-density regions with the required accuracy.
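As a baseline, a fixed-bandwidth Gaussian KDE with Silverman's rule is a one-liner with SciPy. A sketch, using fat-tailed random data as a stand-in for historical returns:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
returns = rng.standard_t(df=3, size=750) * 0.01   # fat-tailed stand-in for daily returns

# Fixed-bandwidth Gaussian KDE using Silverman's rule of thumb
kde = gaussian_kde(returns, bw_method="silverman")

# Evaluate the estimated PDF on a grid of return values
grid = np.linspace(returns.min(), returns.max(), 500)
pdf = kde(grid)
```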
Variable-width kernels
An intuitive way to tackle this problem is to use variable-width kernels, where the bandwidth increases as we move away from the center (mean) of the distribution. We use the following to determine the bandwidth bₜ for data point rₜ:

bₜ = A·exp(k·|rₜ − 𝜇̂|)

where 𝜇̂ is the mean of the historical returns, A is the width of the ‘base’ kernel at the center of the distribution, and k controls how quickly the bandwidth increases with distance from the mean. We use an exponentially increasing bandwidth, but a linearly increasing one could work too.
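A direct implementation simply sums one Gaussian per historical return, each with its own bandwidth bₜ. A minimal sketch, assuming the exponential bandwidth scheme above; the values of A and k are illustrative and would in practice be tuned to the data:

```python
import numpy as np
from scipy.stats import norm

def variable_width_kde(returns, grid, A=0.005, k=25.0):
    """PDF estimate with per-point bandwidths b_t = A * exp(k * |r_t - mean|)."""
    mu = returns.mean()
    bandwidths = A * np.exp(k * np.abs(returns - mu))   # wider kernels in the tails
    # One Gaussian kernel per data point; averaging keeps the total area at 1,
    # because each individual kernel integrates to 1.
    kernels = norm.pdf(grid[:, None], loc=returns[None, :], scale=bandwidths[None, :])
    return kernels.mean(axis=1)

rng = np.random.default_rng(0)
returns = rng.standard_t(df=3, size=750) * 0.01         # fat-tailed stand-in data
grid = np.linspace(-0.10, 0.10, 500)
pdf = variable_width_kde(returns, grid)
```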
Let’s apply this to some historical financial series. Figure 2 shows the empirical return distributions and PDF estimates for the daily close prices of Apple (AAPL), Google (GOOG), Facebook (META) and Microsoft (MSFT) over the same three-year period using variable-width kernels. In each case, the estimated PDF captures the sharp peaks and fat tails seen in the actual data. It also captures the asymmetry (or ‘skew’) in the historical distributions, which is perhaps most evident for the case of Google. Note that while KDE can model skew, if the skew is extreme, it might be better to handle the left and right sides separately using different k values.
But what about a ‘series’ of returns?
Drawing a single value from the estimated PDF is straightforward, but our aim is to generate a sequence, or ‘series,’ of returns. A naive approach would be to simply sample returns successively from the PDF, but this overlooks a critical fact:
Returns sampled from the same PDF will be independent and identically distributed (i.i.d.). However, returns from financial series that exhibit features such as volatility clustering and mean reversion are not independent — past returns influence future ones.
To capture these patterns, we need to estimate ‘conditional distributions’ — distributions of returns that take into account previous returns. Only then can our synthetic series accurately reflect the temporal dynamics of the historical data.
Modeling conditional returns using weighted KDE
The kernels we used above were centered at the historical returns rₜ, and we found that they could produce a sufficiently detailed estimate of the PDF of the historical return distribution. By weighting these same kernels, it is possible to generate a multitude of PDFs with diverse characteristics (symmetric versus skewed, unimodal versus multimodal, etc.).
We would like to produce PDFs that represent the return distribution conditional on previous return values. The challenge lies in finding a set of weights that captures the relationship between past returns. In general, we seek a function of the form

wₜ = f(rₜ₋₁, rₜ₋₂, …)

where wₜ is the weight for the kernel centered at historical return rₜ. Let’s consider the simplest case: estimating the return distribution conditional solely on the previous day’s return.
Example: Conditioning on the previous day’s return
Let’s assume the previous day’s return has value c. The essential idea is that for each historical return rₜ, the closer its 1-day lagged return rₜ₋₁ is to c, the stronger the influence of the kernel centered at rₜ on the estimate of the conditional PDF. There are various ways of capturing this relationship. We will use the following Gaussian weighting, normalized so that the weights sum to one:

wₜ ∝ exp(−(rₜ₋₁ − c)² / 2𝜎̂²)

where 𝜎̂ is the standard deviation of the historical returns.
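Putting this weighting together with the variable-width kernels gives a conditional version of the estimator. A sketch, assuming the exponential-bandwidth scheme and Gaussian weighting described above; the parameter values are again illustrative:

```python
import numpy as np
from scipy.stats import norm

def conditional_pdf(returns, grid, c, A=0.005, k=25.0):
    """Weighted variable-width KDE, conditional on previous day's return c."""
    centers = returns[1:]      # r_t: where the kernels sit
    lagged = returns[:-1]      # r_{t-1}: used to weight each kernel
    sigma = returns.std()
    weights = np.exp(-((lagged - c) ** 2) / (2 * sigma**2))
    weights /= weights.sum()   # weights sum to 1, so the mixture is a valid PDF
    bandwidths = A * np.exp(k * np.abs(centers - returns.mean()))
    kernels = norm.pdf(grid[:, None], loc=centers[None, :], scale=bandwidths[None, :])
    return kernels @ weights   # weighted sum of kernels

rng = np.random.default_rng(0)
returns = rng.standard_t(df=3, size=750) * 0.01
grid = np.linspace(-0.10, 0.10, 500)
pdf_given_c = conditional_pdf(returns, grid, c=-0.03)
```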
Figure 3 displays the three-year historical returns of Apple (X-axis) plotted against the 1-day lagged values (Y-axis). It also shows three specific values for the previous day’s return that we will condition on: −0.06, −0.03 and 0.07. Figure 4 shows the corresponding conditional PDFs.
The shape of the estimated PDF varies with the conditioning value. When this value is close to 0, the conditional distribution tends to be unimodal. In contrast, when the conditioning value has a relatively large magnitude (e.g., −0.06 or 0.07), the distribution becomes more complex in shape.
We can generate a series of synthetic returns as follows, where we use primes (as in r’₀) to distinguish synthetic returns from historical returns:

1. Set the conditioning value c to the last historical return, estimate the conditional PDF, and draw the first synthetic return r’₀ from it.
2. Set c to the return just drawn, re-estimate the conditional PDF, and draw the next synthetic return from it.
3. Repeat step 2 until returns r’₁,…,r’ₙ have been drawn.

The sequence of returns r’₀,…,r’ₙ forms the synthetic return series, which can then be converted into a price series.
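Because the conditional estimate is a weighted mixture of Gaussians, sampling from it reduces to picking a kernel in proportion to its weight and then drawing from that Gaussian. A minimal sketch of the generation loop under the procedure above (seeding on the last historical return, with illustrative A and k):

```python
import numpy as np

def sample_synthetic_series(returns, n, p0, A=0.005, k=25.0, seed=42):
    """Generate n synthetic returns, then convert them into a price series."""
    rng = np.random.default_rng(seed)
    centers, lagged = returns[1:], returns[:-1]
    sigma = returns.std()
    bandwidths = A * np.exp(k * np.abs(centers - returns.mean()))
    c = returns[-1]                    # condition the first draw on the last historical return
    synthetic = np.empty(n)
    for j in range(n):
        w = np.exp(-((lagged - c) ** 2) / (2 * sigma**2))
        w /= w.sum()
        i = rng.choice(len(centers), p=w)          # pick a kernel in proportion to its weight
        c = rng.normal(centers[i], bandwidths[i])  # draw from it; becomes the next condition
        synthetic[j] = c
    prices = p0 * np.cumprod(1.0 + synthetic)      # convert returns into prices
    return synthetic, prices
```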
‘Realistic’ synthetic financial time series
Estimating conditional distributions based solely on the previous day’s return is unlikely to capture the richness of features we observe in real financial data. To do so, we need to consider distributions conditional not just on the previous day’s return, but on a sequence of previous returns.
At Skanalytix², we have built on these concepts to produce a synthetic financial time series generator that can effectively model conditional distributions and produce realistic synthetic financial series. The generator estimates conditional PDFs of the form

f(Rₜ | Rₜ₋₁, Rₜ₋₂, …, Rₜ₋ₘ, Xₜ)

where Rₜ and its lags are either scalars representing the return of a single stock or vectors representing the returns of multiple stocks, and Xₜ is a scalar or vector of ‘auxiliary’ features, which can be a mix of numerical and categorical variables and might include fundamentals, economic variables, sentiment, etc.
We have applied the time series generator to generating synthetic datasets from the three-year Apple (AAPL) data. Figure 5 displays the return series, distribution of returns, autocorrelation function (ACF) for returns, ACF for absolute returns, and price series for historical data (first row), together with three synthetic series (second to fourth rows). (The synthetic series were produced using lagged returns only, without any auxiliary features.) The synthetic return series exhibit fat tails and long-term volatility clustering similar to the historical data, and the prices fall within a comparable range to the historical prices, suggesting that the model has captured the mean reversion properties.
Figure 6 shows PCA³ and t-SNE⁴ plots of the original and synthetic returns. These are handy dimensionality reduction techniques that allow us to visualize high-dimensional data in two dimensions. (The returns data were expanded to include lagged values as additional dimensions). Both methods show that points from the historical and synthetic datasets are well interspersed, indicating that the synthetic data captures a broad range of variability similar to the historical data. Moreover, there are no single-colored clusters, indicating that there are no patterns present in one group but absent in the other. From Figures 5 and 6 we can be confident that the synthetic data has mimicked the overall structure of the historical data.
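Comparable plots can be produced with scikit-learn. A sketch, assuming `hist` and `synth` are arrays whose columns are a return plus its lagged values (random stand-ins are used here so the code runs as-is):

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
hist = rng.standard_t(df=3, size=(500, 5)) * 0.01    # stand-in: return + 4 lags per row
synth = rng.standard_t(df=3, size=(500, 5)) * 0.01   # stand-in synthetic equivalent

X = np.vstack([hist, synth])
is_synth = np.array([False] * len(hist) + [True] * len(synth))

embeddings = {
    "PCA": PCA(n_components=2).fit_transform(X),
    "t-SNE": TSNE(n_components=2, perplexity=30).fit_transform(X),
}

fig, axes = plt.subplots(1, 2, figsize=(10, 4))
for ax, (title, emb) in zip(axes, embeddings.items()):
    ax.scatter(*emb[~is_synth].T, s=8, label="historical")
    ax.scatter(*emb[is_synth].T, s=8, label="synthetic")
    ax.set_title(title)
    ax.legend()
plt.show()
```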
Correlated Stocks
In addition to generating synthetic series for individual stocks, we can also generate realistic series for multiple correlated stocks. Figure 7 shows the historical and synthetic price series for four technology stocks — Apple (AAPL), Google (GOOG), Facebook (META), and Microsoft (MSFT) — over the three-year period. The synthetic series exhibit the strong positive correlation observed in the historical values of these stocks, while simultaneously modelling the stylized features of each individual series. Negative correlations can be similarly modeled.
Conclusion
In this story, I have described a non-parametric method for generating synthetic financial time series based on explicitly estimating the distribution of returns. This method not only generates realistic synthetic time series that accurately capture stylized features like volatility clustering and mean reversion but also completely avoids the practical challenges associated with parametric approaches such as GANs and VAEs. Importantly, the model is anything but a black box — the distribution from which a return is drawn is a sum of weighted kernels, making it directly visualizable.
We have been primarily interested in generating synthetic financial series — for each point generated, a conditional distribution was estimated, and a random value was then drawn from that distribution. However, this same distribution can be used for a variety of purposes. If the aim is to perform prediction (or ‘forecasting’), then it is simply a matter of determining the conditional mean (or median or mode) of the estimated distribution. Estimating Value at Risk (VaR) is also straightforward, as is computing the probability that a return falls within a defined interval.
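Since the conditional distribution is an explicit Gaussian mixture, these quantities have closed or near-closed forms: the conditional mean is the weighted sum of the kernel centers, and a VaR quantile can be found by bisection on the mixture CDF. A sketch reusing the centers, bandwidths, and weights computed earlier:

```python
import numpy as np
from scipy.stats import norm

def conditional_mean(centers, weights):
    """Mean of the weighted Gaussian mixture (kernels centered at `centers`)."""
    return float(np.dot(weights, centers))

def value_at_risk(centers, bandwidths, weights, alpha=0.05):
    """alpha-quantile of the mixture, found by bisection on its CDF."""
    lo = centers.min() - 5 * bandwidths.max()
    hi = centers.max() + 5 * bandwidths.max()
    for _ in range(60):   # 60 halvings: precision far below float rounding error
        mid = 0.5 * (lo + hi)
        cdf = np.dot(weights, norm.cdf(mid, loc=centers, scale=bandwidths))
        lo, hi = (mid, hi) if cdf < alpha else (lo, mid)
    return 0.5 * (lo + hi)
```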
The synthetic financial time series generator we have developed at Skanalytix² is based on our Unified Numeric/Categorical Representation and Inference (UNCRi) framework, which combines a graph-based data representation scheme with a flexible inference procedure that can be used to estimate and sample from the conditional distribution of any categorical or numeric variable. The generator is a valuable tool for organizations requiring realistic synthetic data to use for testing and refining trading strategies, model validation, and other tasks that involve decision-making based on financial time series. A free online version of the synthetic financial time series generator can be accessed from the Skanalytix² website. To stay updated with our latest news and insights, follow Skanalytix on LinkedIn (https://www.linkedin.com/company/97182209).
Notes
- The free online version of the Skanalytix synthetic financial time series generator is available from: https://skanalytix.com/synthetic-financial-time-series-generator
- Python code used to produce Figures 1 to 4 is available from https://github.com/a-skabar/SynthFinTimeSeries
References
[1] Silverman, B.W. (1982). Applied Statistics, Royal Statistical Society, Vol. 33.
[2] Skanalytix: https://skanalytix.com
[3] Jolliffe, I.T. (1986). “Principal Component Analysis”, Springer, New York.
[4] van der Maaten, L. and Hinton, G. (2008). “Visualizing Data using t-SNE”, Journal of Machine Learning Research, 9(11).