Python, Functional Programming, and Mathematical Computation for Data Science

You are probably well aware of Python’s soaring demand in almost every discipline of computer science. Python is widely used in embedded systems programming, network engineering, back-end web development, and, of course, Data Science. Its success is owed to its wide range of applications as well as its user-friendliness: a Python program is typically several times shorter than its equivalent Java program and shorter still than its analogous C++ program. With its versatility, maturity, and emphasis on clarity, Python rests comfortably in the tech stacks of titans such as Facebook, Google, Instagram, Spotify, and Netflix.

In this article, I will survey one of Python’s key industry capabilities: computation, specifically in the domain of Data Science. Mathematical calculation and computational statistics are the crust of the data scientist pie; they govern the efficiency and structure of any model you will build and test in your Data Science career. Python offers a comprehensive collection of third-party libraries as well as vanilla tools to help with this. Let us take a look at some of Python’s computational components. We will begin with higher-order functions.

Higher-Order Functions

Higher-order functions are part of Python’s functional programming paradigm and are used when users want to minimize moving parts in their code. They handle other functions by either taking them as input arguments or returning them. What makes this possible is the status of every Python function as a first-class object. First-class objects can be stored in data structures, assigned to variables, passed as arguments to other functions, and returned from other functions.
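Here is a minimal sketch of those four properties in action. The names square, operations, make_multiplier, and apply_twice are hypothetical and chosen only for illustration:

```python
def square(x):
    return x * x

# 1. Assigned to a variable
f = square

# 2. Stored in a data structure (abs is a Python built-in)
operations = {"square": square, "absolute": abs}

# 3. Returned from another function
def make_multiplier(factor):
    def multiply(x):
        return x * factor
    return multiply

triple = make_multiplier(3)

# 4. Passed as an argument to another function
def apply_twice(func, x):
    return func(func(x))

print(f(4))                         # 16
print(operations["absolute"](-5))   # 5
print(triple(7))                    # 21
print(apply_twice(square, 3))       # 81
```

Both make_multiplier and apply_twice are higher-order functions: the first returns a function, and the second takes one as an argument.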

A familiar example of a higher-order function in Python is filter. Consider a problem in which you are given a collection of integers, and you want to extract only the multiples of six. You might create a number_list that contains the integers, and a mult_six list to contain the multiples of six. Imperatively, you might use a for loop to iterate over each element n of number_list, check if it is a multiple of six, and if that is true, append n to mult_six.

This is fine, but we can use filter to take a more declarative approach to this problem. To illustrate the higher-order functionality, we will define a multiple_of_six() function:

def multiple_of_six(num):
    if num % 6 == 0:
        return True
    else:
        return False

The imperative approach with an arbitrary number_list:

number_list = [24, 74, 14, 75, 12, 72]
mult_six = []
for n in number_list:
    if multiple_of_six(n):
        mult_six.append(n)

# Returns [24, 12, 72]
mult_six

Now, let’s try this using filter. filter() takes a function and an iterable, and in Python 3 it returns an iterator over the items for which that function returns True.

f = filter(multiple_of_six, number_list)
for i in f:
    # Prints 24, 12, 72
    print(i)

Higher-order functions in Python enable us to express multiple computations in just one beautiful line of code. Other noteworthy higher-order functions are map and reduce (in Python 3, reduce lives in the functools module). The next type of function we’ll examine works similarly to a mathematical function and reduces bugs in computation.
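To make map and reduce a little more concrete, here is a small sketch that chains them with filter on the same number_list. The helper functions half and add are hypothetical names introduced just for this illustration:

```python
from functools import reduce

number_list = [24, 74, 14, 75, 12, 72]

def multiple_of_six(num):
    return num % 6 == 0

def half(num):
    return num // 2

def add(x, y):
    return x + y

# filter keeps only the items for which the function returns True
mult_six = list(filter(multiple_of_six, number_list))

# map applies the function to every item
halved = list(map(half, mult_six))

# reduce folds the items into a single value, starting from 0
total = reduce(add, halved, 0)

print(mult_six)  # [24, 12, 72]
print(halved)    # [12, 6, 36]
print(total)     # 54
```

Wrapping filter and map in list() is only needed here because both return lazy iterators in Python 3.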

Pure Functions

Pure functions don’t read or modify global variables or any other objects outside the function’s scope. An impure function typically follows the scheme of defining a global variable and a function that reads or modifies it. When we rerun such a function with the same input, we aren’t guaranteed to get the same output. The opposite is true for a pure function: it neither depends on nor changes the state of any outside variable, so the same input always produces the same output. In that sense, a pure function is deterministic.

Consider a function that adds two integers. It takes one argument, b, to add to a pre-defined global variable a.

a = 3
def addition(b):
    return a + b

# Returns 7
addition(4)

You’re probably expecting the value of addition(4) to be different if we change the value of a.

a = 7
# Returns 11
addition(4)

addition is impure because its output depends on the value of a global variable. There is a determinant of the result that is outside the scope of addition. Let’s “purify” this function by making it depend only on its inputs, eliminating the hidden dependency on a:

def addition_pure(a, b):
    return a + b

# Returns 8, the same output each time we run the function with the same inputs
addition_pure(6, 2)
addition_pure(6, 2)
addition_pure(6, 2)

Let’s think about a couple of other examples: a function that reads files and a function that depends on a randomly generated number. Neither of these functions is pure, because both depend on state outside their arguments: a file’s contents can change between calls, and a random number generator is designed to return a different value each time. A benefit of pure functions is that they are isolated and can’t affect any other parts of their program. This reduces the risk of bugs and instability, makes testing easier, and improves clarity.
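As a rough sketch of the difference, the snippet below pairs two impure functions with pure counterparts. All four function names are hypothetical, made up for this illustration:

```python
import random

def noisy_double_impure(x):
    # Impure: the result depends on the hidden state of the global
    # random number generator, not only on x
    return 2 * x + random.random()

def noisy_double_pure(x, noise):
    # Pure: the random value is passed in, so the output depends
    # only on the arguments
    return 2 * x + noise

def append_impure(items, value):
    # Impure: mutates the caller's list (a side effect)
    items.append(value)
    return items

def append_pure(items, value):
    # Pure: builds and returns a new list, leaving the input untouched
    return items + [value]

original = [1, 2]
new_list = append_pure(original, 3)

print(original)                   # [1, 2]
print(new_list)                   # [1, 2, 3]
print(noisy_double_pure(5, 0.25)) # 10.25
```

The pure versions are easy to test because a test only needs to supply inputs and compare outputs; nothing has to be set up or reset in between.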

Anonymous Functions

You already know that we define a function using the def keyword and immediately follow it with a name. However, naming a function can be unnecessary if we only want to use it in one very specific instance. This is where anonymous/lambda functions come in handy. Anonymous functions don’t have names and are created using the lambda keyword. These functions derive from a simple model of computation known as lambda calculus, which is based on function abstraction.

Lambda functions have raised controversy due to their confusing syntax and imposition of a functional way of thinking, but they are a convenient shorthand for applying a function to a collection of arguments. Note that lambda functions can have any number of arguments but only one expression, which is automatically returned without the need for a return statement. Let’s attend to some examples.

A lambda function declaration has the following composition: the keyword (lambda), bound variables (the arguments), and a body (the expression). Here’s a very simple lambda function:

lambda x: x**3

We can see that this is similar to writing:

def cube(x):
    return x**3

We can call a lambda function on an arbitrary bound value like so:

# Returns 64
(lambda x: x**3)(4)

It is also valid to assign a lambda function to a variable name and call from that variable name instead. Multiple arguments can be fed into a lambda function:

# Returns 16
(lambda x, y: x ** y)(4, 2)

Finally, a useful feature of lambda functions is that they are harmonious with higher-order functions:

# Returns 8
(lambda x, addition: x * addition(x))(2, lambda x: x + x)

This doesn’t look attractive at all, so let’s break it down: we write a lambda function with two arguments as before, with the second argument being a function. In the call statement, we define the body of the addition function. This is the exact same approach as in the previous example, where we defined the second argument of our lambda function, y, to be 2. This illustrates why Python functions are a first-class entity, as well as lambda’s parallelism to a regular defined function. The cleverness of the anonymous function obviously comes at the cost of readability and the natural syntax Python is known for. Nonetheless, it’s proven to be valuable for tasks including sorting, concatenating, writing classes, among others.
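Sorting is probably the most common place you will meet a lambda in practice. Here is a short sketch using the built-in sorted() with a key function; the records data is made up for illustration:

```python
# (name, score) pairs; hypothetical sample data
records = [("ada", 91), ("grace", 78), ("alan", 85)]

# The lambda tells sorted() which part of each record to compare
by_score = sorted(records, key=lambda record: record[1])
by_name = sorted(records, key=lambda record: record[0])

# A lambda can also be bound to a name and reused like any function
cube = lambda x: x ** 3

print(by_score)                    # [('grace', 78), ('alan', 85), ('ada', 91)]
print(by_name)                     # [('ada', 91), ('alan', 85), ('grace', 78)]
print(cube(4))                     # 64
print(list(map(cube, [1, 2, 3])))  # [1, 8, 27]
```

Note that PEP 8 recommends a def statement over binding a lambda to a name, so the cube assignment is shown only to demonstrate that it works.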

I just explained the general styles and structures that occur within functional programming. FP’s approach of avoiding mutable data and shared state in favor of pure calculation lends itself well to mathematical and statistical computation. Although Python isn’t a functional-first programming language like Haskell or Clojure, Python FP has repeatedly established itself in Data Science, machine learning, and computational statistics. If you have experience in deep learning, you can think of a chain of higher-order functions applied over a data structure as loosely analogous to a neural network passing a matrix of data through its layers.

Linear Algebra Using NumPy

We will now shift gears and briefly discuss some of Python’s capabilities in computational linear algebra. Linear algebra is arguably one of the most crucial mathematical fields a data scientist must study. A lot of scientific and statistical computing problems can be represented with linear algebra, and it’s hard to identify a common Data Science problem that doesn’t employ matrix operations, vectorization, decomposition, etc.

I will not be diving deeply into the mathematical theory behind linear algebra; instead, I will focus on common operations implemented by NumPy. NumPy is Python’s main library for performing mathematical computations, specifically on vectors and matrices. It is also the basis for plenty of other Python Data Science and machine learning libraries. NumPy’s versatility in computation is what has cajoled data scientists and machine learning engineers into using Python. Let’s introduce and demonstrate various important linear algebra operations in NumPy, starting with creating a simple multidimensional array.

Create a Multidimensional Array

https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.1-Scalars-Vectors-Matrices-and-Tensors/

You are likely familiar with scalars, vectors, and matrices. In most programming languages, a matrix is represented as a 2D array. You’ve also likely come across arrays of more than two dimensions, or n-dimensional arrays.

Remark: Strictly speaking, a ‘tensor’ is a mathematical object (a multilinear map), whereas an n-dimensional array is the data structure used to represent it. Put differently, a tensor can be represented as an n-dimensional array whose entries obey specific transformation rules under a change of basis.

Let us use NumPy to generate a simple 3-dimensional array. Our approach will be to use np.zeros(), with a tuple (i, j, k) as an argument, where the length of the tuple equals the dimension of the array, and the values i, j, k correspond to the number of elements in each dimension. The ordered triple we input is called the shape of our n-dimensional array. np.zeros() will simply create an array filled with 0. We can use numpy.array() to create an n-dimensional array with specified non-zero values.

For example:

>>> import numpy as np
>>> array = np.zeros((1, 2, 3))  # The shape is passed as a single tuple
>>> array
array([[[0., 0., 0.],
        [0., 0., 0.]]])
>>> np.ndim(array) # Get number of dimensions of array
3
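For non-zero values, a sketch with np.array() might look like the following. The values are arbitrary, and the shape matches the (1, 2, 3) array above:

```python
import numpy as np

# One block containing two rows of three elements each
array = np.array([[[1, 2, 3],
                   [4, 5, 6]]])

print(array.shape)     # (1, 2, 3)
print(array.ndim)      # 3
print(array[0, 1, 2])  # 6
```

Indexing takes one index per dimension, so array[0, 1, 2] picks block 0, row 1, column 2.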

Some other matrix operations the NumPy library can quickly perform include, but aren’t limited to:

  • Element-wise addition, subtraction
  • Element-wise multiplication, division
  • Cross product, dot product
  • Transposition
  • Inner and outer product
  • Generation of the identity matrix
  • Matrix multiplication
  • Matrix inversion
  • Determinant, trace

Check the numpy.linalg documentation for more advanced matrix operations and details.
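To give a feel for the operations listed above, here is a brief sketch on two small matrices and two vectors. The values are arbitrary and chosen so the results are easy to check by hand:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0]])
B = np.array([[0.0, 1.0],
              [1.0, 0.0]])

elementwise_sum = A + B       # [[1, 3], [4, 4]]
elementwise_product = A * B   # [[0, 2], [3, 0]]
matrix_product = A @ B        # [[2, 1], [4, 3]]
transpose = A.T               # [[1, 3], [2, 4]]
identity = np.eye(2)          # 2x2 identity matrix
inverse = np.linalg.inv(A)    # [[-2, 1], [1.5, -0.5]]
determinant = np.linalg.det(A)  # approximately -2
trace = np.trace(A)           # 5.0

u = np.array([1, 0, 0])
v = np.array([0, 1, 0])

dot = np.dot(u, v)       # 0
cross = np.cross(u, v)   # [0, 0, 1]
outer = np.outer(u, v)   # 3x3 matrix with a single 1 at row 0, column 1

# A matrix times its inverse should give the identity (up to rounding)
print(np.allclose(A @ inverse, identity))  # True
```

Floating-point results such as the determinant are best compared with np.isclose or np.allclose rather than ==.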

Example: Norms with NumPy

In linear algebra, a norm is a function that assigns a strictly positive length or size to each vector in a vector space, except for the zero vector, which is assigned a length of zero. Consider the two-dimensional Euclidean space ℝ². The ℝ² space supplies the “Euclidean norm,” which assigns to a vector the length of its arrow drawn in the Cartesian coordinate system. This is what we know as the magnitude of a vector!

Formally, a norm on a vector space V is a nonnegative-valued scalar function f: V → [0, ∞) that satisfies the following for all u, v ∈ V (u, v are vectors):

  1. f(u+v) ≤ f(u)+f(v) (see triangle inequality)
  2. f(av)=|a|f(v) (absolutely homogeneous, a∈ℝ)
  3. f(v) = 0 if and only if v = 0 (the zero vector)
https://hadrienj.github.io/posts/Deep-Learning-Book-Series-2.5-Norms/

For example:

>>> u = np.array([5, 1])
>>> v = np.array([6, 2])
>>> np.linalg.norm(u + v)
11.40175425099138
>>> # Geometrically, this is the length of the resultant vector u + v, which runs from the tail of u to the tip of v when the two are placed tip-to-tail.
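We can also check the three norm properties numerically on the same vectors. This is only a sanity check on one example, not a proof, and the scalar a = -3 is an arbitrary choice:

```python
import numpy as np

u = np.array([5, 1])
v = np.array([6, 2])
a = -3

norm = np.linalg.norm  # Euclidean norm by default for vectors

# 1. Triangle inequality: f(u+v) <= f(u) + f(v)
print(norm(u + v) <= norm(u) + norm(v))           # True

# 2. Absolute homogeneity: f(av) = |a| f(v)
print(np.isclose(norm(a * v), abs(a) * norm(v)))  # True

# 3. The zero vector has norm zero
print(norm(np.zeros(2)))                          # 0.0

# The Euclidean norm is the square root of the sum of squared components
print(np.isclose(norm(u), np.sqrt(np.sum(u ** 2))))  # True
```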

Camille Dunning
Data Science Student Society @ UC San Diego

Sophomore at UCSD, Class of 2022. Data science kid and musician, so I’m going for a young StatsQuest kind of character. Director of medium.com/ds3ucsd