Better Python Programming for Data Scientists, Part 2: More Python Fundamentals

Published in The Data Nerd · 10 min read · Jun 28, 2022

An animal emoji printer with conditional logic

In Part 1 of this series, we started our dive into Python fundamentals by looking at the different Python implementations available and the built-in data types. With that foundation, we can now look at some of the fundamentals of coding in Python.

The topics covered in this post are fundamentals of coding in general, but I’ve found that a lot of junior data scientists don’t use them, or don’t use them enough. Here we’ll focus specifically on their implementations in Python.

Functions

Any time you repeat code while programming, there may be an opportunity to convert that code into a function. Functions increase the modularity and reusability of code (and any tool that decreases the amount of work you have to do is a positive). They can also make the logic easier to read and follow.

As a data scientist, you use various functions from various libraries every day, but unless you create them regularly, you may not have needed to think about how functions are put together in a while. A function has four basic components: a name, a parameter list, a body (this is the code that gets reused), and a return value, which is None if the function doesn’t explicitly return anything. Optionally, you can provide default values for the parameters and provide a docstring. You can also include type hints as discussed in Part 1. These components come together as follows:

def function_name(parameter1, parameter2=default_value):
    '''docstring'''
    some code
    return value

If the function doesn’t return anything, you can omit the return statement, you can explicitly return None, or you can have the word return followed by nothing (called an empty return statement) if you find that easier to read.
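As a minimal sketch of these pieces working together, the two functions below are hypothetical examples of my own (the names scale_values and report are not from any library). The first uses a default value, type hints, and a return value; the second omits the return statement and so returns None.

```python
def scale_values(values: list[float], factor: float = 2.0) -> list[float]:
    """Return a new list with every value multiplied by factor."""
    return [v * factor for v in values]


def report(values: list[float]) -> None:
    """Print a one-line summary; there is nothing useful to return."""
    print(f"{len(values)} values, max={max(values)}")
    # No return statement here, so Python returns None implicitly.


doubled = scale_values([1.0, 2.5])              # uses the default factor of 2.0
tripled = scale_values([1.0, 2.5], factor=3.0)  # overrides the default
result = report(doubled)                        # result is None
```

The built-in generic syntax list[float] assumes Python 3.9 or later.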

Docstrings

An important aspect of functions that I want to highlight is docstrings. Though technically optional, providing docstrings for your functions is a very good habit to build. When sharing code with others or revisiting your own code after some time away from it, docstrings can simplify the process of understanding what the code does at a higher level than individual in-line comments.

At a minimum, docstrings provide explanations of what each function is, but they can also provide descriptions of the parameters and what is returned, detailed explanations of how the function works, and example code (which some IDEs can run as doctests). If you ever plan to create documentation for a module or library, there are tools you can use such as Sphinx to automatically generate it from the docstrings. As an example, compare the documentation for the scikit-learn implementation of Ridge Regression to the docstring in the source code.

This Stack Overflow answer provides an overview of the different docstring formats that are commonly used in Python. I’m partial to the NumPy format, which is compatible with Sphinx.
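To give a feel for the NumPy format, here is a sketch of a docstring on a hypothetical helper (min_max_scale is my own illustration, not a library function). The Examples section is written so that it could be run as a doctest.

```python
def min_max_scale(values, lower=0.0, upper=1.0):
    """Rescale values linearly to the range [lower, upper].

    Parameters
    ----------
    values : list of float
        Numbers to rescale. Must contain at least two distinct values.
    lower : float, optional
        Smallest value of the output range (default 0.0).
    upper : float, optional
        Largest value of the output range (default 1.0).

    Returns
    -------
    list of float
        The rescaled values, in the original order.

    Examples
    --------
    >>> min_max_scale([1, 2, 3])
    [0.0, 0.5, 1.0]
    """
    smallest, largest = min(values), max(values)
    span = largest - smallest
    return [lower + (v - smallest) / span * (upper - lower) for v in values]
```

Calling help(min_max_scale) in a Python session prints this docstring, which is the same text a documentation generator would read.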

Iteration

Iteration is the process of repeating an action on successive items in a sequence. Here, sequence is used in the general sense, not referring specifically to sequence data types.

In Python, an iterable is a container that can return its elements one by one, such as a list, string (a sequence of characters), dictionary, or data frame. An iterator is the object that does the actual traversal: it hands back the next element each time it is asked and remembers where it left off. You create an iterator by calling the iter() function on an iterable, and you advance it by calling next(), which raises StopIteration once the elements run out. Python’s for loops and many built-in functions make these calls for you, so you never need to write, for example, iter_object = iter(list_object) before looping over a list. Note that the built-in containers are iterables, not iterators, so calling next() directly on a list raises a TypeError. You may never need to directly use iter() and next(), but they are worth knowing about to understand how iteration is implemented under the hood in Python.
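A short sketch of what happens under the hood, using a made-up list of colors. It shows next() failing on the list itself, then working on the iterator returned by iter().

```python
colors = ["red", "green", "blue"]

# A list is iterable, but it is not itself an iterator.
try:
    next(colors)
    message = ""
except TypeError as error:
    message = str(error)

# iter() builds the iterator; next() advances it one element at a time.
color_iterator = iter(colors)
first = next(color_iterator)
second = next(color_iterator)
third = next(color_iterator)

# Once the elements run out, next() raises StopIteration.
try:
    next(color_iterator)
    exhausted = False
except StopIteration:
    exhausted = True
```

A for loop performs the same steps for you and treats StopIteration as the signal to stop.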

More commonly, you will simply use a for or while loop.

Python’s for loops are an implementation of the more general idea of a foreach loop, in which the entire sequence of items is traversed. They do not require an end condition, though you can have logic within the loop that causes it to end early.
Python’s while loops repeat a block of code for as long as a condition remains true. A while loop can optionally be followed by an else clause, which runs once the condition becomes false but is skipped if the loop exits through break.
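The two loop styles can be sketched side by side. The readings list and variable names here are made up for illustration.

```python
readings = [3, 8, 12, 5]

# for loop: visits each item in turn; break ends the loop early.
first_large = None
for value in readings:
    if value > 10:
        first_large = value
        break

# while loop: repeats for as long as the condition is true.
countdown = 3
steps = []
while countdown > 0:
    steps.append(countdown)
    countdown -= 1
else:
    # Runs because the loop ended when the condition became false,
    # not through a break.
    steps.append("done")
```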

Iteration is such a fundamental and useful process that Python has a few tricks you can use to avoid having to write out common uses:

Zipping: the zip() function aggregates iterables to pair up their elements. The resulting zip object yields tuples of corresponding items from each iterable, with the length determined by the shortest iterable. For example, if you zip the lists [1, 2, 3, 4] and ['a', 'b', 'c', 'd'] and convert the result to a list, you get [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]. If the first list were instead [1, 2, 3, 4, 5, 6], you would get the same result, since the last two elements don’t have corresponding items in the second list.
Enumerate: the enumerate() function is a special type of zipping that pairs a counter with the object being iterated over, so that each tuple in the sequence consists of the count value (starting at 0 by default) and the next element from the iterable object. If you have ever used the pandas iterrows() method, this is the same idea as its (index, row) pairs.
List comprehension: this is a shortcut for creating a new list from an existing iterable by applying a function or condition to each element in a single line of code. The syntax is new_list = [func(x) for x in old_list if condition], with the function call and condition both optional depending on your needs.
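All three shortcuts can be tried on small lists. The sketch below reuses the numbers and letters from the zip description; the variable names are my own.

```python
numbers = [1, 2, 3, 4, 5, 6]
letters = ["a", "b", "c", "d"]

# zip stops at the shortest input, so 5 and 6 are dropped.
pairs = list(zip(numbers, letters))
# [(1, 'a'), (2, 'b'), (3, 'c'), (4, 'd')]

# enumerate pairs a counter with each element; start changes the first count.
indexed = list(enumerate(letters))
# [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd')]
indexed_from_one = list(enumerate(letters, start=1))

# List comprehension with both an expression and a condition.
even_squares = [n ** 2 for n in numbers if n % 2 == 0]
# [4, 16, 36]
```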

At this point, you may be wondering: When do you need a function versus iteration for repeated operations?

A loop can of course call a function within it, and a function can contain a loop, but the usefulness of a function is when you are writing (or copying and pasting) code multiple times. If the code is only written once within a single for loop, a function probably isn’t necessary for most use cases though you can choose to use one for stylistic reasons.

Under-utilization of for loops and custom functions is the most frequent opportunity for improving data science code that I see. Used well, they significantly enhance readability and reduce logic errors and reproducibility problems by eliminating mistakes that are easy to make when copying code, such as forgetting to change a variable name.
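As a rough illustration of the combination, the sketch below wraps a repeated calculation in a function and applies it in a loop instead of copying the same lines once per column. The summarize function and the columns dictionary are hypothetical stand-ins for your own code and data.

```python
def summarize(values):
    """Return the mean and range of a list of numbers."""
    return {"mean": sum(values) / len(values), "range": max(values) - min(values)}


# Hypothetical columns; in practice these might come from a data frame.
columns = {"age": [30, 40, 50], "income": [50, 70, 90]}

# One loop replaces a copied block per column, so there is no
# variable name to forget to change.
summaries = {}
for name, values in columns.items():
    summaries[name] = summarize(values)
```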

Generators and Yield

A generator function is a special type of function that returns an iterator, called a generator, which produces its values one at a time on demand instead of storing them all in memory. File objects behave in a similar way: iterating over an open file reads one line at a time rather than loading the whole file. Generators can be particularly useful for reducing memory requirements when working with large amounts of data.

To create a generator function, the syntax is the same as for a regular function, except that it uses the yield keyword instead of return. The difference between the two is that return ends the function, so any code placed after it is not executed, while yield pauses the function to output a value and then, when the next value is requested, resumes where it left off with the states of all variables retained.
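Here is a minimal sketch of a generator function (running_total is a made-up example). Notice that total keeps its value between calls to next().

```python
def running_total(values):
    """Yield the cumulative sum after each value."""
    total = 0
    for value in values:
        total += value
        yield total  # pause here; resume on the next request


totals = running_total([1, 2, 3])
first = next(totals)    # runs until the first yield
second = next(totals)   # resumes with total still equal to 1
rest = list(totals)     # consumes whatever is left
```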

You can also create generator expressions, which similarly create generator objects but can be written as a single line like a list comprehension. They have the form:

generator_expression = (func(x) for x in iterable_object)

The iterable object can be a file, as with a generator function.
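A quick sketch of generator expressions in use. The io.StringIO object is only a stand-in for a real open file so that the example runs without touching disk.

```python
import io

# Nothing is computed until values are requested.
squares = (n ** 2 for n in range(1_000_000))
first_three = [next(squares) for _ in range(3)]

# A generator expression can be passed straight to a function like sum().
total = sum(n ** 2 for n in range(10))

# Iterating over a file-like object yields one line at a time.
fake_file = io.StringIO("a\nbb\nccc\n")
line_lengths = list(len(line.rstrip("\n")) for line in fake_file)
```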

You may never need to create your own generators, but because of their applications in working with files and big data, knowing about them will help you understand the tools you are working with.

Lambdas

An example lambda function

Lambdas are simple anonymous functions, though they may call other functions with more complex functionality. They are created with the syntax:

lambda arguments: expression

Since they don’t require a full function definition, they can be stored as a variable or used without being stored.

Lambdas can be used inside a list comprehension or passed as an argument to another function, such as pandas apply() for applying them to the rows or columns of a data frame. A few functions that you may not have encountered before but can be useful when working with lambdas are:

Map: returns an iterator that applies the lambda expression to every element of an iterable object.
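Accumulate: itertools.accumulate() returns an iterator of running results produced by repeatedly applying a two-argument function (addition by default, though max or a lambda also work) to the elements of an iterable object.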
Filter: returns an iterator that filters the elements of an iterable object to only those that meet a condition.
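These three functions can be sketched on a small list of made-up values. Note that map() and filter() are built in, while accumulate() has to be imported from itertools.

```python
from itertools import accumulate

values = [3, 1, 4, 1, 5]

# map: apply the lambda to every element.
doubled = list(map(lambda x: x * 2, values))

# accumulate: running results from a two-argument function.
running_sum = list(accumulate(values, lambda a, b: a + b))
running_max = list(accumulate(values, max))

# filter: keep only the elements where the lambda returns True.
odd_values = list(filter(lambda x: x % 2 == 1, values))
```

Each call returns an iterator, which is why the results are wrapped in list() to look at them.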

Alternatively, a lambda can have arguments passed to it directly in the form:

(lambda function)(argument value)

For example, (lambda x: x**2)(4) returns 16. This syntax is useful for things like parallelization, which we will be covering in a later post.

For a more in-depth look at lambdas in Python and additional examples, check out this Geeks for Geeks article.

Conditional Logic

Our animal emoji printer implemented with pattern matching for Python 3.10

The basic conditional logic in Python is the if/elif/else statement. Conditions are standard Boolean expressions, combined with the and/or/not operators for compound expressions. An if statement can be used without an elif or else, in which case nothing happens when the condition is false. Python also supports a ternary operator, formally called a conditional expression:

value_if_true if condition else value_if_false

This is for cases where there is only one condition to evaluate, though you can have a nested ternary in the else expression.
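For instance, with a made-up score and thresholds, the single and nested forms look like this:

```python
score = 72

# One condition, two possible values.
label = "pass" if score >= 60 else "fail"

# A nested ternary in the else expression; readable only in small doses.
grade = "high" if score >= 85 else "medium" if score >= 60 else "low"
```

Once you need more than one level of nesting, a regular if/elif/else block is usually easier to read.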

For switch statements, Python 3.10 added structural pattern matching through the match statement. In older versions of Python you need to use either a block of if/elif/else statements or a dictionary that has lambdas or functions as the values. The new syntax follows the standard switch statement pattern:

match subject:
    case value/pattern:
        code
    case value/pattern:
        code
    case _:
        default code

The underscore case in this example is a wildcard that always matches if reached, and it is optional. It is intended for default behavior such as raising exceptions.

There are three additional statements that can be useful in control logic:

Pass: This is equivalent to “do nothing”. If you have some conditions that don’t require their own special operations performed under an if/elif branch but you want to exclude them from the default behavior of the else statement, you can use pass. This keyword is described in more detail in the next section.
Continue: This is used within a loop and indicates that the rest of the logic in the loop should be skipped for this particular iteration, but the loop should not end. For example, if iterating through a list looking for a particular element, you might have something happen when that element is found, otherwise continue.
Break: This indicates that the loop should end. If the loop occurs in a function, you can instead use return, which exits the function as well as the loop.
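The three statements can be seen together in one loop. The values and labels below are made up for illustration.

```python
values = [4, -1, 0, 7, None, 9]
labels = []

for value in values:
    if value is None:
        break        # stop the loop at the first missing value
    if value < 0:
        continue     # skip negatives but keep looping
    if value == 0:
        pass         # zeros are valid but get no label
    elif value > 5:
        labels.append("large")
    else:
        labels.append("small")
```

Here 4 is labeled small, -1 is skipped, 0 falls through pass, 7 is labeled large, and the loop ends at None, so 9 is never reached.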

The Pass Keyword

When working on a complex project, you may want to plan out the steps of what you need to do with placeholder code before you are ready to implement it. This helps you stay organized and identify what’s left to be done. I often use placeholders as a dynamic To Do List when writing functions or creating case logic. However, you also often want to test portions of the project code as you go to make sure they work correctly, and unimplemented sections of code would raise a syntax error. Similarly, you may encounter cases where some portion of your code needs to be temporarily commented out but doing so would cause an error.

To avoid this, you can use the keyword pass. Pass does nothing but is a complete statement so that you won’t have errors from empty functions, for loops, and branches in control logic. You can use it on its own, or to replace commented out logic.
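A sketch of the placeholder workflow, with hypothetical function names for a small pipeline. Every function is valid Python, so the parts that are finished can already be run and tested.

```python
def load_data(path):
    pass  # TODO: read the file at path


def clean_data(frame):
    pass  # TODO: handle missing values


def run_pipeline(path):
    raw = load_data(path)
    # validate(raw)   # temporarily disabled; the function body is still valid
    return clean_data(raw)


outcome = run_pipeline("data.csv")  # runs without errors and returns None for now
```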

The pass keyword is also useful in production code for certain use cases including in object-oriented programming, which we will cover in a later post in this series, and in APIs where executable code can be passed as an argument, known as a callback. (If you’ve been paying attention, that should sound familiar from our discussion of lambdas. Lambdas can be passed as callbacks, as can named functions and classes.)

Modules

A Python module is a file that contains function and variable definitions that can be imported into other Python files (scripts, modules, notebooks). The module name for importing is the file name without the .py extension. You can make a module independently executable by adding at the end:

if __name__ == "__main__":
    some code that uses the functions and variables
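To make this concrete, imagine the code below saved as a hypothetical file named stats_helpers.py. Running python stats_helpers.py executes main(), while import stats_helpers from another file only makes the functions available.

```python
"""stats_helpers.py: a hypothetical module of reusable helpers."""


def mean(values):
    """Return the arithmetic mean of a list of numbers."""
    return sum(values) / len(values)


def main():
    sample = [2, 4, 6]
    print(f"mean of {sample} is {mean(sample)}")


# True only when the file is run directly, not when it is imported.
if __name__ == "__main__":
    main()
```

In another script or notebook in the same directory, you could then write from stats_helpers import mean.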

A single module can be used as a catch-all for any functions and variables you want to reuse, but it can easily become unwieldy as your code base grows. If you are working on a large, complex project, you may want to organize your code into separate modules and directories.

Putting it All Together

There is a lot of basic functionality available in Python that you can use to make your programming more efficient and scalable. Examples of how I use these capabilities in my everyday work include:

Dynamically generating SQL queries.
Generating tabular features from text and other unstructured or semi-structured data.
Using iteration anywhere that an operation needs to be repeated for multiple objects or values.
Working with config files.
Taking steps that I commonly perform in EDA and wrapping them in functions.
Moving functions to a module so that I don’t have to copy and paste code over and over between notebooks.
Creating pipelines to run experiments rather than using notebooks.

This is by no means an exhaustive list. As you go through the articles in this series, you’ll spot many other opportunities to leverage the fundamentals of Python to improve your programming.

Stay tuned for the rest of the Better Python Programming for Data Scientists series!