Handling and Statistically Modeling the Unknowns

How LeaseLock Data Science solves for unknown or unobserved data statistically in AI/ML risk models

Jin Lee
LeaseLock Product & Engineering
6 min read · Jan 5, 2022

--

LeaseLock is the first and only AI-powered lease insurance platform that eliminates deposits from the rental housing industry and offers customized protection across entire portfolios. The LeaseLock Data Science team employs statistical and machine learning models to develop AI actuarial algorithms that enable data-driven decisions. The team has developed a variety of algorithms for risk underwriting, risk monitoring, revenue forecasting, financial analysis, and more. This post will focus on the Risk Platform and how we handle the issue of missing data.

The Challenge: Unknown or Unobserved Data

Missing data is a common challenge in data science and presents various problems. Since our statistical models assume the observed data represents the full population, missing data can introduce unwanted bias. It becomes even trickier when a case we have never observed could still occur in the future. Let's create hypothetical claim data with missing observations:

import numpy as np
import pandas as pd

# simulate 1,000 claim amounts from a normal distribution (mean $1,500, sd $400)
claim = np.random.normal(loc=1500, scale=400, size=1000)

claim_series = pd.Series(claim)

# mask a few ranges to mimic claim amounts we have never observed
claim_series = claim_series.apply(lambda x: np.nan if 1000 <= x < 1025 else x)
claim_series = claim_series.apply(lambda x: np.nan if 1300 <= x < 1325 else x)
claim_series = claim_series.apply(lambda x: np.nan if 1600 <= x < 1700 else x)
claim_series = claim_series.apply(lambda x: np.nan if 2000 <= x < 2100 else x)

df = pd.DataFrame(data=claim_series, columns=['claim amount'])

As shown in the chart below, we may not have seen any leases with claims between $1,600 and $1,700. However, it is most likely that we simply have not observed such data yet, and leases that fall into that range are likely to show up in the future. To handle unknown/missing data, the data science team develops statistical models that predict the probability of events that simply have not been observed before.
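The gap is easy to see if we plot the data. A minimal sketch with Plotly Express, assuming $100 bins aligned at $0:

import plotly.express as px

# histogram of the observed claims; with $100 bins starting at $0,
# the $1,600-$1,700 bucket is empty even though such claims are plausible
fig = px.histogram(df, x="claim amount", histnorm='probability density')
fig.update_traces(xbins=dict(start=0.0, end=2700.0, size=100))
fig.show()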

First Solution: Imputation

The first approach we experimented with was imputation, which is widely used and easy to implement. For categorical values, we can replace missing entries with the mode (the most common value) or encode them with a dedicated category. For continuous, numeric values, we can replace missing entries with the mean, median, or another summary statistic. Implementing this is a relatively light lift computationally. The code for simple imputation with the mean looks something like the following:

df_mean = df.fillna(df['claim amount'].mean())

With mean imputation logic, those missing values shown as NaN are now filled with the mean. However, if you look at the chart below, this approach does not solve our problem of estimating the probability that the missing values or bins occur in the future. Some bins in the histogram below are still empty.
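Median imputation for a numeric column, or mode imputation for a categorical one, works the same way and shares the same limitation (the categorical column below is only hypothetical):

df_median = df.fillna(df['claim amount'].median())

# mode imputation for a hypothetical categorical column
# df['unit_type'] = df['unit_type'].fillna(df['unit_type'].mode()[0])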

Applying machine learning algorithms for imputation could also be a solution. But this approach is not ideal when the dataset is small or sparse, such as new properties with limited financial ledger data, as it introduces unwanted bias. Although there are cases where we use these methods, they were not realistic for this specific goal. Above all, this approach could not solve the unobserved-data issue.
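For illustration only, model-based imputation could look like the sketch below using scikit-learn's KNNImputer on a toy frame with a hypothetical correlated feature (this is not the method we settled on):

from sklearn.impute import KNNImputer

# toy example: the missing claim is estimated from the rows with the
# most similar monthly rent (a made-up correlated feature)
toy = pd.DataFrame({'monthly_rent': [1200, 1500, 1800, 2100],
                    'claim amount': [900, 1100, np.nan, 1600]})
toy_imputed = pd.DataFrame(KNNImputer(n_neighbors=2).fit_transform(toy),
                           columns=toy.columns)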

Second Solution: Bucketing

Another solution we considered was bucketing. Imagine we adjust the bin size of the histogram and use the new histogram as a probability model for the missing events. Although it is a little more complicated than the previous imputation, it is still simple and does not introduce as much bias as the imputation method does. Using this bucketing/binning approach, we would guess there is an 11% chance of a claim between $1,500 and $1,650 and a 7% chance of a claim between $1,650 and $1,800. Compared to a 0% chance of that claim event happening, this is a more realistic estimate.
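One way to back these bucket probabilities out of the observed data, assuming $150-wide buckets aligned at $0:

# empirical bucket probabilities with $150-wide bins aligned at $0
bins = np.arange(0, 2850, 150)
counts, _ = np.histogram(df['claim amount'].dropna(), bins=bins)
probs = counts / counts.sum()
# probs[10] covers $1,500-$1,650 and probs[11] covers $1,650-$1,800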

However, the bins have to be wide enough to cover the unobserved data: if they are too small, we still end up with buckets that have zero occurrences, which leads to predicting zero probability for data points that simply have not been observed yet. Widening the bins, on the other hand, sacrifices granularity and distorts the estimated distribution. In short, this solution works well only when there is an abundance of data around the missing values.

import plotly.express as px

# wider bins (e.g. $100 -> $150) smooth over the empty ranges
fig_wide = px.histogram(df, x="claim amount", histnorm='probability density')
fig_wide.update_traces(xbins=dict(
    start=0.0,
    end=2700.0,
    size=150  # increase bin size (100 -> 150)
))

# narrower bins (e.g. $100 -> $25) leave even more empty buckets
fig_narrow = px.histogram(df, x="claim amount", histnorm='probability density')
fig_narrow.update_traces(xbins=dict(
    start=0.0,
    end=2700.0,
    size=25  # decrease bin size (100 -> 25)
))

LeaseLock Solution: Kernel Density Estimation

After considering the pros and cons of several methods, we decided to use kernel density estimation (KDE) to handle missing data. The method places a kernel function, such as a Gaussian, on each data point and builds a probability density function by summing all of the kernel functions and normalizing the result. Empty buckets are filled in by the kernels, which yields a smooth curve based on the observed data.
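Conceptually, the estimate at any point is just the average of a Gaussian bump centered on every observation, scaled by a bandwidth h. A minimal sketch of the idea (the function name and arguments here are ours for illustration, not our production code):

# density estimate at points x: average of Gaussian kernels centered on
# each observation x_i, scaled by the bandwidth h
def kde_by_hand(x, data, h):
    kernels = np.exp(-0.5 * ((x - data[:, None]) / h) ** 2) / np.sqrt(2 * np.pi)
    return kernels.sum(axis=0) / (len(data) * h)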

As shown in the image above, the KDE function fits a smooth curve to the observed data and can be used to predict the probability of unobserved data occurring in the future. However, depending on the parameters chosen, it can underfit or overfit the data. As is true for all ML and statistical models, it is important to choose the right hyperparameters; for KDE, the key one is the bandwidth. Various Python packages, such as SciPy, Statsmodels, and KDEpy, offer automatic bandwidth selection algorithms, and we can confirm their performance by experiment.

from scipy import stats
import plotly.graph_objects as go

x_no_nan = df['claim amount'].dropna().to_numpy()  # observed claims only
xx = np.linspace(0, 2700, 500)                     # grid to evaluate the densities on

fig = go.Figure()
bandwidth = np.round(np.array([0.05, 0.3, 0.5, 0.7, 1.5]), 2)
for bw in bandwidth:
    kde = stats.gaussian_kde(x_no_nan, bw_method=bw)
    fig.add_trace(go.Scatter(x=xx, y=kde(xx), name=f'KDE with BW {bw}', mode='lines'))
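SciPy can also choose the bandwidth automatically via rule-of-thumb estimators, which gives a baseline to compare the hand-picked values against:

# rule-of-thumb bandwidth selection ('scott' is SciPy's default)
kde_scott = stats.gaussian_kde(x_no_nan)
kde_silverman = stats.gaussian_kde(x_no_nan, bw_method='silverman')
fig.add_trace(go.Scatter(x=xx, y=kde_scott(xx), name='KDE (Scott)', mode='lines'))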

As shown here, the KDE approach efficiently handles missing values and estimates the probability of unobserved data occurring in the future. If we want to know the chance of a claim between $1,600 and $1,700, we can integrate the density function over that range. This is a more realistic and educated guess. The beauty of the KDE approach is that we can calculate the probability of any claim event, regardless of range.

# probability of a claim between $1,600 and $1,700
kde.integrate_box_1d(1600, 1700)
# 0.06275631781374585

# probability of a claim between $1,600 and $1,610
kde.integrate_box_1d(1600, 1610)
# 0.00705282594667713

Conclusion

So, we have shared our secret sauce for handling events that are unobserved but possible in the future. There is no perfect, general solution, and the performance of different methods varies by dataset. Using the KDE approach, we can forecast the chance of events happening even when we have not seen certain parts of the data, thus overcoming the incomplete nature of financial ledger and transaction datasets. For actual risk forecasting and risk assessment, we combine this approach with other Bayesian methods, which can be a topic for another day. In the meantime, please check out our other blog posts and our website if you are curious about our business or interested in joining our team!
