Better Python Programming for Data Scientists, Part 1: Python Fundamentals

Emily Strong
The Data Nerd
9 min read · Jun 20, 2022

Python currently dominates as the most popular programming language for data scientists. Its ease of use and popular machine learning libraries like scikit-learn and PyTorch, not to mention the simplicity of using Jupyter Notebook as compared to an IDE, make it an obvious choice to allow you to quickly get your projects up and running.

However, many data scientists never move beyond the basics with Python programming even though there is a wealth of functionality and tools available that can make data science projects better in a variety of ways, from speeding up run times and improving reproducibility to creating your own tools to enhance your projects.

This series will explore some of these ways of using Python and improving your skills.

Part 1: Python Fundamentals
Part 2: More Python Fundamentals
Part 3: Data Structures
Part 4: Introduction to Object-Oriented Programming
Part 5: More Object-Oriented Programming
Part 6: Parallelization
Part 7: Strategies and Tools for Improving Your Skills
Part 8: Books for Improving Your Skills

Photo by Chris Ried on Unsplash

Python Fundamentals: Interpreters and Implementations

Python is a high-level, interpreted language. If you aren’t familiar with interpreters versus compilers, what this effectively means is that Python is evaluated by an interpreter program at run-time to parse and execute the commands. There are multiple interpreters available for Python implemented in different languages. The most popular one is CPython, which is the reference implementation and is written in C. If you have never before had to think about what Python implementation you are using, it’s CPython.

Other implementations include:

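  • PyPy: written in RPython (a restricted subset of Python); its just-in-time compiler often makes it faster than CPython for long-running pure-Python code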
  • Jython: written in Java
  • Pyston: uses just-in-time compilation

Because CPython is the original and most commonly used implementation, most libraries are written for CPython and then have to be ported over to the others. For example, until a few years ago NumPy was not available for PyPy, and at the time of writing the scikit-learn support is still considered experimental.
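If you are unsure which implementation is running your code, the standard library can tell you. A minimal sketch (the variable names are only illustrative):

```python
import platform
import sys

# Human-readable name of the running implementation, e.g. "CPython" or "PyPy"
implementation = platform.python_implementation()
print(implementation)

# Lowercase identifier of the same implementation, e.g. "cpython" or "pypy"
identifier = sys.implementation.name
print(identifier)

# Version of the Python language the interpreter implements, e.g. "3.10.4"
language_version = platform.python_version()
print(language_version)
```

The output depends on your environment, so no expected values are shown here.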

With that out of the way, we can dive into the Python language.

Data Types

We should begin by looking at the built-in data types available in Python. Every programming language has basic data types that all other data types and data structures build on. For some languages like Java, they are called primitives and have fixed memory sizes. In Python, the basic data types are objects rather than primitives, which allows them to have varying sizes. Python’s built-in types include a null type, numeric types, Booleans, strings, sequences, binary sequences, and collections such as sets and maps (or “dicts”).

Python has extensive documentation on the built-in types, and it is worth spending some time reading through it to familiarize yourself with the full range of existing functionality, but for getting started it is worth focusing on the basics of working with them.

None

Python includes a null value, None, whose type is NoneType. When a value is missing or a variable has been created as an empty placeholder, its value is None. It is worth noting that NaN (“not a number”) is not the same as None, though some libraries such as Pandas treat them as equivalent when handling missing data. NaN is a float value representing an undefined result of a floating point calculation, such as subtracting infinity from infinity or dividing 0.0 by 0.0 in NumPy (in plain Python, dividing by zero raises a ZeroDivisionError instead). Although it is undefined, it is still a value, whereas None is the absence of any value.
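The difference between None and NaN can be sketched with the standard library alone. The variable names below are illustrative, not part of any API:

```python
import math

missing = None
not_a_number = float("nan")

# None is a singleton, so the idiomatic check uses identity (`is`)
none_check = missing is None

# NaN is a float value that compares unequal to everything, itself included
nan_equals_itself = not_a_number == not_a_number
nan_check = math.isnan(not_a_number)

# In plain Python, dividing by zero raises an error rather than returning NaN
try:
    1 / 0
    raised = False
except ZeroDivisionError:
    raised = True

print(none_check, nan_equals_itself, nan_check, raised)
# prints True False True True
```

Because NaN never equals itself, use math.isnan (or the equivalent helper in NumPy or Pandas) rather than == when checking for it.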

Numeric Types: Int, Float, Complex

Python comes with three built-in numeric types: integer (int), float, and complex. Integers are positive or negative whole numbers and, because they are objects rather than primitives, they can be arbitrarily large. Floats are floating point numbers (typically double precision) used for positive or negative decimals, so their precision is limited. Reflecting Python’s use in scientific computing, complex numbers are also available and are composed of two floats.
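A short sketch of how the three numeric types behave in practice (the values are arbitrary examples):

```python
import math

# Integers grow as needed instead of overflowing
big = 2 ** 100
print(big)
# prints 1267650600228229401496703205376

# Floats have limited precision, so compare them with a tolerance
approx = 0.1 + 0.2
exactly_equal = approx == 0.3
close_enough = math.isclose(approx, 0.3)
print(exactly_equal, close_enough)
# prints False True

# A complex number stores its real and imaginary parts as floats
z = complex(3, 4)
print(z.real, z.imag, abs(z))
# prints 3.0 4.0 5.0
```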

Boolean (bool)

A Boolean has two values: True and False. They result from evaluating an expression for its truth value, such as 1 == 1 (True) and 1 == 2 (False). Technically, Booleans in Python are a subtype of integers with zero corresponding to False. Because of this, 1 and 0 are often used interchangeably with True and False, and other non-zero integer values both positive and negative evaluate to True.

The other built-in types also have a state or value that corresponds to False. For example, in floats it is 0.0, for strings it is the empty string '', while for sequences and collections it is when they are empty. Take for example:

x = []
if x:
    print(x)
# nothing is printed
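The relationship between Booleans and integers, and the "falsy" values of the other types, can be checked directly. This is a small sketch with illustrative variable names:

```python
# bool is a subclass of int, so True and False behave like 1 and 0
is_subclass = issubclass(bool, int)
total = True + True

# This makes it easy to count how many items satisfy a condition
count_positive = sum(v > 0 for v in [3, -1, 4, 0])

# Each built-in type has a value that evaluates to False
falsy_values = [None, 0, 0.0, "", [], (), {}, set()]
all_falsy = not any(falsy_values)

# Any non-zero number, positive or negative, evaluates to True
negative_is_truthy = bool(-5)

print(is_subclass, total, count_positive, all_falsy, negative_is_truthy)
# prints True 2 2 True True
```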

String (str)

Strings are the text data type. They can be empty (''), or any length of Unicode characters, and are immutable. They can use single or double quotes, with triple quotes for strings that span multiple lines. Because strings in Python are sequences of characters, they can be iterated over like lists.

There are a variety of built-in methods for working with strings, though the ones I use most often as a data scientist are:

  • Format: replaces a brace-delimited field with a value, e.g. 'ab{}d'.format('c') -> 'abcd'; there are a variety of ways to use this as specified in the documentation.
  • Lower, upper: converts all letters to the same case.
  • Replace: replaces all (or the first n if the count argument is used) instances of a substring with a new substring.
  • Join: concatenates a list of strings into a delimited string; it is called on the delimiter, for example ', '.join(['a', 'b', 'c']) -> 'a, b, c'.
  • Split: converts a delimited string to a list, for example 'a, b, c'.split(', ') -> ['a', 'b', 'c'].
  • Zfill: pads a string with 0s, for example '7'.zfill(3) -> '007'.
  • Eval: strictly a built-in function rather than a string method; it evaluates a string containing a Python expression, for example eval('1') returns the integer 1, while eval('print(x)') prints the stored value of the variable x. Because it runs arbitrary code, avoid using it on untrusted input.

Because strings are immutable, these string manipulation functions return new objects and you must point a variable to the new object to store it.
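The string operations above can be sketched as follows. The ast.literal_eval call at the end is not from the list; it is included as a commonly used safer alternative to eval when the string only contains a literal value:

```python
import ast

formatted = "ab{}d".format("c")

name = "Data Science"
lowered = name.lower()                    # returns a new string
replaced = "a-b-c".replace("-", "+", 1)   # only the first match is replaced

joined = ", ".join(["a", "b", "c"])       # join is called on the delimiter
parts = "a, b, c".split(", ")
padded = "7".zfill(3)

# The original string is untouched because strings are immutable
print(name, lowered)
# prints Data Science data science

print(formatted, replaced, joined, parts, padded)
# prints abcd a+b-c a, b, c ['a', 'b', 'c'] 007

# literal_eval only parses literals (numbers, strings, lists, dicts, ...)
parsed = ast.literal_eval("[1, 2, 3]")
print(parsed)
# prints [1, 2, 3]
```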

Sequence Types: List, Tuple, Range

Python has three basic sequence types:

  • Lists are mutable ordered collections of objects, meaning the elements can be changed, as well as elements added or removed, and every element has an indexed position. The objects do not need to have the same data type. You can have a list of integers, of variables, of lists, and so on. The elements can be accessed by index or unpacked in order (e.g. a, b = [1, 2] results in a=1 and b=2), and lists support the mutable sequence operations.
  • Tuples are immutable ordered collections. They have a fixed length and contain a fixed set of elements. The elements can be accessed by index or unpacked in order. When a function returns multiple values, the output is a tuple.
  • Ranges are immutable ordered sequences of integers with a fixed increment. The optional lower bound is included in the range, while the required upper bound is excluded. For example range(6) returns the numbers 0 through 5, while range(10, 20, 2) returns the even numbers between 10 and 19.
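The three sequence types described above can be sketched as follows. The min_max function is a hypothetical helper written for this illustration:

```python
# Lists are mutable and can mix data types
values = [1, "two", [3.0]]
values.append(4)
values[0] = 10
print(values)
# prints [10, 'two', [3.0], 4]

# Tuples can be unpacked, but their elements cannot be reassigned
point = (2, 5)
x, y = point
try:
    point[0] = 1
    tuple_is_mutable = True
except TypeError:
    tuple_is_mutable = False


def min_max(data):
    """Hypothetical helper: returning two values produces a tuple."""
    return min(data), max(data)


result = min_max([4, 1, 9])
print(result, type(result).__name__)
# prints (1, 9) tuple

# Ranges include the lower bound and exclude the upper bound
first_six = list(range(6))
evens = list(range(10, 20, 2))
print(first_six, evens)
# prints [0, 1, 2, 3, 4, 5] [10, 12, 14, 16, 18]
```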

Binary Sequence Types: Bytes, Bytearrays, Memoryview

Bytes literals look similar to strings (with the format b'text', where the literal itself may contain only ASCII characters) and support many of the same methods, but a bytes object is an immutable sequence of single bytes, that is, integers from 0 to 255. Bytearrays are the mutable equivalent of bytes and support the mutable sequence operations such as slice assignment and element replacement.
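A brief sketch of how bytes and bytearrays relate to strings (the example text is arbitrary):

```python
raw = b"text"

# Indexing a bytes object returns an integer, not a character
first_byte = raw[0]
print(first_byte)
# prints 116

# Many string methods are available on bytes as well
shouted = raw.upper()
print(shouted)
# prints b'TEXT'

# Non-ASCII text has to be encoded, and one character may need several bytes
encoded = "café".encode("utf-8")
decoded = encoded.decode("utf-8")
print(len("café"), len(encoded), decoded)
# prints 4 5 café

# A bytearray can be modified in place
buffer = bytearray(b"text")
buffer[0] = ord("n")
print(buffer)
# prints bytearray(b'next')
```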

Another built-in binary sequence type is memoryview. To be frank, I have never needed to use this data type in my data science work, but if you are curious, you can learn more in the Python documentation on built-in types.

Sets and Frozensets

A set in Python is an unordered collection of distinct hashable objects. What does that mean?

  • Every element in a set is unique. If you try to add a duplicate element, it will not be added (but it also won’t throw an error). Because of this, sets are often used to check for uniqueness, identify the unique elements, and compare sets for intersections and differences.
  • Every element in a set must be hashable. Hashing allows objects to be compared to each other quickly, and the hash of an object stays constant for its lifetime. Most of the built-in immutable data types are hashable, though for containers like tuples the individual elements in the container must all be hashable. Objects of user-defined classes are hashable by default; the default hash is derived from the object's identity (the object “id”), so by default an object is only considered equal to itself.
  • Because they are unordered, elements in sets do not have indices and cannot be accessed by position.

As with sequences and binary sequences, there is a mutable data type (set) and an immutable one (frozenset).
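The set properties listed above can be sketched as follows, with arbitrary example values:

```python
# Adding a duplicate is silently ignored
seen = {1, 2, 3}
seen.add(2)
print(len(seen))
# prints 3

# Converting a list to a set keeps only the unique elements
unique = set([1, 2, 2, 3, 3, 3])

# Sets support comparisons such as intersection and difference
a = {1, 2, 3, 4}
b = {3, 4, 5}
common = a & b
only_in_a = a - b
print(sorted(unique), sorted(common), sorted(only_in_a))
# prints [1, 2, 3] [3, 4] [1, 2]

# Unhashable objects such as lists cannot be set elements
try:
    {[1, 2]}
    list_allowed = True
except TypeError:
    list_allowed = False

# A frozenset is immutable and hashable, so it can be a dictionary key
frozen = frozenset(a)
lookup = {frozen: "usable as a key"}
print(list_allowed, lookup[frozenset({1, 2, 3, 4})])
# prints False usable as a key
```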

Dictionaries (dicts)

Python dictionaries (or dicts) are the data type for maps, that is, collections of key-value pairs. They are mutable, and the keys must be hashable. As of Python 3.7 they preserve the order in which keys were inserted, but elements are accessed by key rather than by position. Keys do not need to have the same data type.

Values in a dictionary can be accessed by key just as elements in a list can be accessed by index. Dictionaries also have view objects, accessed through the keys(), values(), and items() methods. These views dynamically update as the dictionary updates. For example:

a = {1: 1, 2: 2, 3: 3, 4: 4}
b = a.keys()
del a[1]
print(b)
# prints dict_keys([2, 3, 4])
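Beyond views, everyday dictionary access can be sketched like this. The keys and values are made up for illustration, and the ordering shown assumes Python 3.7 or later:

```python
# Keys only need to be hashable; they can have different types
ratings = {"alice": 4.5, 42: "int key", ("x", "y"): [1, 2]}

# Access by key; a missing key raises KeyError
by_key = ratings["alice"]

# get() returns a default instead of raising when the key is missing
fallback = ratings.get("bob", 0.0)

# Assigning to a new key adds it to the dictionary
ratings["bob"] = 3.0

# Iterating over a dict yields its keys in insertion order
order = list(ratings)
print(by_key, fallback, order)
# prints 4.5 0.0 ['alice', 42, ('x', 'y'), 'bob']

# items() yields (key, value) pairs that can be unpacked in a loop
for key, value in ratings.items():
    print(key, value)
```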

Type Hints

Objects in Python have dynamic typing. That is, you do not need to explicitly declare the type when initializing a variable, or in defining the signature of a function (what it accepts as input and what it returns as output). Furthermore, a function argument does not need to accept only one data type or class, nor does a function need to return only one type or class.

This approach gives you flexibility in your code, and avoids redundancies from implementing the same functionality for multiple data types or classes that don’t have a shared parent class for which the function could be reasonably implemented (Don’t know what that means? Stay tuned for the object-oriented posts in this series!).

It does, however, mean that you can accidentally use a function for an incompatible data type, resulting in errors or unexpected behavior. This is where type hints come in! They allow you to document what the expected data types are. The syntax is as follows:

import numpy as np
import pandas as pd

# General form (if nothing is returned, the return type is None):
# def function(argument: type = default_value) -> type:
#     ...
#     return value_of_type

# Examples:
def example(a: int) -> float:
    return a / 1.0

def example2(df: pd.DataFrame) -> np.ndarray:
    # np.ndarray is the array type; np.array is the function that creates one
    return df.values

For more complex hinting such as multiple data types and optional arguments, you can use the typing module.
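As a minimal sketch of the typing module, the hypothetical function below (scale is a name invented for this example) accepts either an int or a float and has an optional argument:

```python
from typing import Optional, Union


def scale(value: Union[int, float], factor: Optional[float] = None) -> float:
    """Multiply value by factor, treating a missing factor as 1.0."""
    if factor is None:
        factor = 1.0
    return float(value) * factor


doubled = scale(2, 2.0)
unchanged = scale(2.5)
print(doubled, unchanged)
# prints 4.0 2.5

# Hints are stored as metadata on the function
print(sorted(scale.__annotations__))
# prints ['factor', 'return', 'value']

# Hints are not enforced at run time: a string still gets through here
unchecked = scale("3")
print(unchecked)
# prints 3.0
```

The last call shows why hints are documentation rather than validation; catching that kind of mismatch requires a separate static type checker.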

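Type hints are completely optional in Python and are not enforced at run time (a separate static type checker such as mypy is needed to flag mismatches), but they are a good practice to have for code that will be used by multiple people or that you will need to reuse or reference at a later point.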

Collections

The last thing I want to highlight in this first dip into Python is the collections module. The data types included in this module add some very useful functionality to the built-in collection and sequence data types. The ones I use most often are:

  • Counter: creates a dictionary of hashable objects and their frequency counts; its most_common() method returns the items sorted from most to least frequent.
  • defaultdict: creates a dictionary that has a factory function that assigns a default value for keys. You can use the type name of one of the built-in data types as the factory function. This allows you to access and update the value for a key without needing to check if the key already exists and create it if it doesn’t.
  • namedtuple: allows you to add names to the fields of a tuple as well as a class name for the tuple type. For example, a Person namedtuple might have name and birth date fields. The fields can be set by name when instantiating the object similar to function arguments, and accessed by name.
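  • OrderedDict: remembers the order in which items are added to the dictionary. Since Python 3.7 regular dicts also preserve insertion order, but OrderedDict adds order-aware features such as move_to_end() and order-sensitive equality comparisons.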

As an example of how these can improve your code, we can look at defaultdict:

from collections import defaultdict

some_list = [1, 2, 2, 3, 4, 5, 6, 9, 10, 10, 11]

# With defaultdict:
d0 = defaultdict(int)
for i in some_list:
    d0[i] += 1
print(d0[7])
# prints 0 (and adds the key 7 with the default value 0)

# Without defaultdict:
d1 = dict()
for i in some_list:
    if i in d1:
        d1[i] += 1
    else:
        d1[i] = 1
print(d1[7])
# raises KeyError
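Counter and namedtuple from the list above can be sketched in the same spirit. The Person fields follow the description given earlier, while the specific name and date are made-up placeholder values:

```python
from collections import Counter, namedtuple

# Counter tallies hashable items in one step
counts = Counter(["a", "b", "a", "c", "a", "b"])
print(counts["a"], counts["z"])
# prints 3 0   (a missing key counts as 0 instead of raising KeyError)

# most_common(n) returns the n most frequent items, sorted by count
top_two = counts.most_common(2)
print(top_two)
# prints [('a', 3), ('b', 2)]

# namedtuple creates a tuple subclass with named fields
Person = namedtuple("Person", ["name", "birth_date"])
person = Person(name="Alex", birth_date="1990-01-01")

# Fields can be read by name, by index, or by unpacking
print(person.name, person[1])
# prints Alex 1990-01-01
name, birth_date = person
```

A Counter would also replace the counting loop in the defaultdict example with a single call, Counter(some_list).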

There are other classes available in the collections module and I would encourage you to take a look at them and consider how they might help you make your code more efficient.

Wrapping Up

We’ve covered a lot of material in this post, but this is just the start! There is so much more to explore in becoming a better Python programmer.

Stay tuned for the rest of the Better Python Programming for Data Scientists series!

Emily Strong is a senior data scientist and data science writer, and the creator of the MABWiser open-source bandit library. https://linktr.ee/thedatanerd