The Significance of Interpolation for Data Scientists

Uri Itai
4 min read · Dec 18, 2023


During an engagement as a data science consultant with a pricing-focused company, I encountered a scenario in which they used interpolation to forecast trade values from points sampled from historical data. The approach worked well with a small number of sampled points, but as the sampling increased, the results became increasingly erratic, contrary to expectations. Recognizing how counterintuitive this was, the company sought my assistance. The scheme had been devised by a skilled data analyst, who was convinced there was an undetected programming bug. To investigate, I requested an overview of the data and the algorithm. The analyst shared three crucial facts: the data was monotonic, meaning more money led to more contracts; it showed diminishing returns, signifying saturation; and the sampled points were nearly equidistant. Furthermore, polynomial interpolation was used in the process.

I then asked him to plot the sampled points, and the plot startled me: it resembled an arctan function. Wow. I had studied exactly this situation in graduate school as Runge’s phenomenon.

Runge’s phenomenon is the appearance of oscillations or other undesirable behavior when interpolating a function with high-degree polynomials, especially on equidistant interpolation points. As the polynomial degree increases, these oscillations intensify near the edges of the interpolation interval. The reason lies in the standard error bound: for interpolation on n + 1 points, the error involves the (n + 1)-th derivative of f. Writing f(n) for the n-th derivative, the sequence max|f(n)| grows without bound for functions such as arctan, so the error bound diverges. The deeper cause is the singularities of arctan at ±i in the complex plane, which limit the region where the interpolants converge.
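To make this concrete, here is a minimal sketch (my own illustration, not the company’s code) using SciPy’s BarycentricInterpolator on the arctan example from the article: with equidistant nodes, the maximum interpolation error grows as more points are added.

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

def max_equidistant_error(num_points):
    # max error of the polynomial through equidistant samples of
    # f(x) = arctan(x - 5) + 2 on [0, 10]
    x = np.linspace(0, 10, num=num_points)
    y = np.arctan(x - 5) + 2
    interp = BarycentricInterpolator(x, y)
    x_fine = np.linspace(0, 10, num=1001)
    return np.max(np.abs(interp(x_fine) - (np.arctan(x_fine - 5) + 2)))

# more equidistant points -> larger error near the interval edges
for n in (11, 21, 41):
    print(n, max_equidistant_error(n))
```

Counterintuitively, refining the sampling makes the polynomial interpolant worse, which is exactly what the analyst observed.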

[Figure: the sampled points from arctan and the interpolant]
[Image: Carl Runge]

To mitigate Runge’s phenomenon, several methods can be employed. The first approach involves sampling at specific points. Utilizing Chebyshev nodes, which correspond to the roots of Chebyshev polynomials, serves as a promising starting point. Chebyshev polynomials form an orthogonal set linked to the weight function:

The Chebyshev weight function: w(x) = 1/√(1 − x²)
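For illustration, a small sketch (my own, with an assumed node count of 41) of how Chebyshev nodes behave on the same arctan example: the nodes cluster near the edges of the interval, and the interpolation error stays small where equidistant nodes blow up.

```python
import numpy as np
from scipy.interpolate import BarycentricInterpolator

def chebyshev_nodes(a, b, num_points):
    # roots of the Chebyshev polynomial T_n on [-1, 1], mapped to [a, b]
    k = np.arange(num_points)
    t = np.cos((2 * k + 1) * np.pi / (2 * num_points))
    return 0.5 * (a + b) + 0.5 * (b - a) * t

def max_error(nodes):
    # max deviation of the polynomial interpolant from the true function
    f = lambda x: np.arctan(x - 5) + 2
    interp = BarycentricInterpolator(nodes, f(nodes))
    x_fine = np.linspace(0, 10, num=1001)
    return np.max(np.abs(interp(x_fine) - f(x_fine)))

err_equi = max_error(np.linspace(0, 10, num=41))
err_cheb = max_error(chebyshev_nodes(0, 10, 41))
print(err_equi, err_cheb)  # the Chebyshev error is orders of magnitude smaller
```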

However, implementing Chebyshev nodes was not feasible given the existing settings. Consequently, the next step involved employing spline interpolation.

Spline interpolation is a mathematical technique for approximating and representing a smooth curve or function between a given set of data points. Unlike simpler methods such as linear interpolation, spline interpolation uses piecewise polynomial functions, usually of low degree, to construct a more adaptable and smoother curve. The term “spline” comes from the flexible strips once used by draftsmen to draw smooth curves by hand. The most common type is the cubic spline, which fits a cubic polynomial between each pair of data points while enforcing continuity of the first and second derivatives. Here the error is bounded by the fourth derivative, which stays bounded for our function, so the process converges as more points are sampled. This resolved the problem, to everyone’s satisfaction.
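A sketch of this fix (my reconstruction with SciPy’s CubicSpline, not the original code): with splines, adding sample points shrinks the error instead of inflating it.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def max_spline_error(num_points):
    # max cubic-spline error for f(x) = arctan(x - 5) + 2 on [0, 10]
    x = np.linspace(0, 10, num=num_points)
    f = lambda t: np.arctan(t - 5) + 2
    spline = CubicSpline(x, f(x))
    x_fine = np.linspace(0, 10, num=1001)
    return np.max(np.abs(spline(x_fine) - f(x_fine)))

# more points -> smaller error, the opposite of the equidistant polynomial case
for n in (11, 21, 41):
    print(n, max_spline_error(n))
```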

Finally, it’s important to note that the data utilized in this analysis is synthetic. Additionally, the identity of the company is intentionally withheld to safeguard business confidentiality.

[Figure: the sampled arctan, the polynomial interpolant, and the cubic spline interpolant]

To conclude, it is valuable to highlight the distinctions between approximation, interpolation, and extrapolation, three distinct concepts within mathematical modeling.

Interpolation is the process of estimating values within the range of known data points. It constructs a function or curve that passes exactly through the given data points, providing insight into the behavior of the function within that interval.

Approximation, on the other hand, involves creating a simplified or generalized representation of a function. This representation may not match the given data points exactly, but it captures the essential characteristics of the phenomenon. Approximation is often used when dealing with complex functions or extensive datasets.

Extrapolation, in contrast, is the prediction of values beyond the known data range. It entails extending the function or curve to estimate values outside the existing dataset. While interpolation and extrapolation focus on different regions of the data, approximation serves as a middle ground, providing a more generalized picture of the function’s overall behavior.
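A toy sketch of the distinction (the degree and evaluation points are my own choices): a low-degree least-squares fit is an approximation, since it does not pass through the samples; the polynomial and spline interpolants do pass through them; and evaluating any fit beyond the sampled range is extrapolation, where the error can explode.

```python
import numpy as np

# samples of the article's function f(x) = arctan(x - 5) + 2 on [0, 10]
f = lambda x: np.arctan(x - 5) + 2
x = np.linspace(0, 10, num=41)
y = f(x)

# approximation: a degree-3 least-squares fit (np.polyfit); it captures the
# overall shape but does not pass through the data points exactly
poly = np.poly1d(np.polyfit(x, y, deg=3))

# inside the sampled range the fit stays close to the function
x_in = np.linspace(0, 10, num=201)
err_inside = np.max(np.abs(poly(x_in) - f(x_in)))

# extrapolation: at x = 20 the cubic runs away from the nearly flat arctan
err_outside = abs(poly(20.0) - f(20.0))
print(err_inside, err_outside)
```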

The code

import numpy as np
import scipy.interpolate
import scipy as sc
import matplotlib.pyplot as plt

n = 15  # number of sampled points (assumed; the original does not set it)

# sample the function on equidistant points
x = np.linspace(0, 10, num=n)
y = [np.arctan(xx - 5) + 2 for xx in x]

# evaluate on a finer grid of new data points
x_new = np.linspace(0, 10, num=10 * n)

# polynomial interpolant through the sampled points
# (the original also built the same polynomial via Newton divided
# differences, with divided_diff / newton_poly helpers that were not shown;
# BarycentricInterpolator yields the identical interpolant)
f_int = sc.interpolate.BarycentricInterpolator(x, y)
y_new = f_int(x_new)

# the true function on the fine grid
y_arc = [np.arctan(xx - 5) + 2 for xx in x_new]

# cubic spline interpolant
f_cubic = sc.interpolate.CubicSpline(x, y)
y_cubic = f_cubic(x_new)

plt.plot(x_new, y_new, color='blue', label='Interpolant')
plt.plot(x_new, y_arc, color='black', label='arctan')
plt.plot(x_new, y_cubic, color='grey', label='cubic')
plt.xlabel('Prices')
plt.ylabel('Winning bids')
plt.title(f'interpolant vs arctan with degree {n}')
plt.legend()
plt.show()



Uri Itai

Mathematician in exile, researching algorithms and machine learning, applying data science, and expanding my ideas.