Optimizing Memory Usage in Python

With Generators and Arrays

XQ
The Research Nest
6 min read · Oct 2, 2023


Today, we’re delving into the art of optimizing memory in Python, all with the help of our beloved ninja, Naruto. So, grab your shuriken and ninja headband, and let’s get started. :P

Image created by me using DALLE.

Imagine Naruto trying to keep track of all his shadow clones. Instead of summoning them efficiently, he creates a massive list of individual clones, which is, well, not the best approach.

# The bad approach
shadow_clones = ['Naruto_clone' for _ in range(1000000)]

This isn’t just tedious; it also consumes an unnecessary amount of memory! Every item in our list takes up space. A list with a million individual items? That’s quite heavy on the memory!

Measuring Memory Usage: Before we dive into optimization, let’s first see how much memory our bad approach consumes.

import sys

memory_used_bytes = sys.getsizeof(shadow_clones)
memory_used_kb = memory_used_bytes / (1024)
print(f'Memory used by clones: {memory_used_kb:.2f} KB')

Output: ~8,000 KB, i.e., about 8 MB. Note that sys.getsizeof counts only the list object itself (mostly its internal pointer storage), not the strings it refers to.

When you create a list in Python, several things happen:

  1. Overallocation: Lists often reserve more space than needed so items can be added quickly. Over time, especially with large lists, this extra space can use a lot of memory (see the snippet after this list).
  2. Object Overhead: Every item in a list is an object. Objects have a base memory cost. So, a list of a million items has a million times this base cost.
  3. Pointers: Python lists store addresses pointing to the actual items, not the items themselves. This means more memory is used.
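You can watch overallocation happen by checking sys.getsizeof as a list grows. A minimal sketch (the exact byte counts vary by Python version and platform):

import sys

items = []
last_size = sys.getsizeof(items)
print(f'empty list -> {last_size} bytes')

for count in range(1, 21):
    items.append('Naruto_clone')
    size = sys.getsizeof(items)
    if size != last_size:
        # The capacity jumps in chunks: Python reserved room
        # for several future appends, not just this one.
        print(f'{count:>2} items -> {size} bytes')
        last_size = size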

Why Might This Be Bad?

  1. Slows Down Programs: Too much memory usage can make a program run slower.
  2. Wastes Resources: Extra memory usage can be costly, especially in environments where you pay for the resources you use.
  3. Less Room for Growth: Programs that are heavy on memory might struggle with larger tasks or more users.

In essence, using memory efficiently in Python is about balance. It’s good to know how memory works to make informed decisions in our coding projects.

Now, for the exciting part. Let’s explore some techniques to optimize this.

Using Generators

Instead of creating a list, we can use a generator. This essentially creates the shadow clones one by one, as required, instead of all at once.

def shadow_clone_generator(n):
    for _ in range(n):
        yield 'Naruto_clone'

clones = shadow_clone_generator(1000000)

Let’s check the memory usage.

memory_used_bytes = sys.getsizeof(clones)
memory_used_kb = memory_used_bytes / (1024)
print(f'Memory used by clones: {memory_used_kb:.2f} KB')

Output: ~0.1 KB. A generator object occupies only a couple of hundred bytes, no matter how many values it will eventually produce.

Surprised?

Why is this the case?

Generators are iterable, like lists or tuples. However, unlike lists, they don’t store all their values in memory. Instead, they generate each value on the fly and yield it individually.

This is incredibly memory-efficient for large datasets because, at any given time, only one value (or a very small set of values) is in memory.
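To see this laziness in action, here’s a small sketch that consumes a fresh generator (reusing shadow_clone_generator from above) one value at a time:

clones = shadow_clone_generator(1000000)

print(next(clones))  # 'Naruto_clone' -- the first clone appears

# Count the remaining clones without ever building a list
remaining = sum(1 for _ in clones)
print(f'Clones summoned afterwards: {remaining}')  # 999999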

Understanding yield

yield is a keyword in Python that’s used like return, but with a twist. Here’s what it does:

  • Pauses the Function: When a generator function is called, it runs until it encounters yield. At that point, it pauses and “returns” the yielded value.
  • Maintains State: After yielding a value, the function retains its state (including local variables and where it left off). This differs from regular functions, which start fresh every time they’re called.
  • Resumes on Next Call: The next time you ask the generator for a value (typically using next()), it resumes right where it left off, running until it hits the yield keyword again, at which point it provides the next value.

In essence, generators allow you to “generate” a series of values over time rather than computing them upfront and storing them all in memory. The yield keyword is the tool that makes this possible, giving you both the memory benefits and the flexibility to compute values on the go.

Here’s a simple example:

def simple_generator():
    yield 1
    yield 2
    yield 3

gen = simple_generator()
print(next(gen)) # prints 1
print(next(gen)) # prints 2
print(next(gen)) # prints 3

When to Use Generators:

  1. Large Datasets: Process without consuming all memory.
  2. Streaming Data: Handle real-time data streams effectively.
  3. Expensive Computations: Compute values lazily, only when needed.
  4. Infinite Sequences: Produce values indefinitely.
  5. Pipeline Processing: Reduce memory in data processing chains (see the sketch after this list).
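As a sketch of that last point, generators chain naturally into pipelines where each stage handles one item at a time. The file name here is just a hypothetical placeholder for any large data source:

def read_lines(path):
    with open(path) as f:
        for line in f:
            yield line.rstrip('\n')

def non_empty(lines):
    return (line for line in lines if line)

def lengths(lines):
    return (len(line) for line in lines)

# No stage materializes a full list; memory stays flat
# no matter how big the file is.
pipeline = lengths(non_empty(read_lines('big_log.txt')))
# print(sum(pipeline))  # total characters across non-empty lines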

When Generators Might Not Be Ideal:

  1. Random Access: Generators are linear, not for indexed access.
  2. Multiple Passes: Generators exhaust after one use; not for re-traversing (demonstrated after this list).
  3. Short Data Sequences: Overkill for small, in-memory datasets.
  4. Complex State Management: Intricate state inside a generator can lead to confusing code.
  5. Performance-Critical: Might introduce minor speed overheads.
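The one-pass limitation is easy to demonstrate with a generator expression:

gen = (x * x for x in range(3))

print(list(gen))  # [0, 1, 4]
print(list(gen))  # [] -- the generator is already exhausted

# A list, by contrast, can be traversed any number of times
squares = [x * x for x in range(3)]
print(squares)  # [0, 1, 4]
print(squares)  # [0, 1, 4]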

Choose generators for memory efficiency and lazy evaluation, but consider alternatives based on specific task needs.

Using Arrays

Python has a separate array module, distinct from the built-in list.

from array import array

# 'u' stores single Unicode characters (note: this typecode
# has been deprecated since Python 3.3)
shadow_clones_array = array('u', 'N' * 1000000)

Let’s see how it does with memory.

memory_used_bytes = sys.getsizeof(shadow_clones_array)
memory_used_mb = memory_used_bytes / (1024 * 1024)
print(f'Memory used by clones: {memory_used_mb:.2f} MB')

Output: under 4 MB. (The 'u' typecode stores each character in 2 or 4 bytes, depending on the platform.)

Not bad. But note that we are storing just “N”, not the full string, as arrays support only single characters and numbers out of the box. Python arrays are thin wrappers around C arrays and differ from lists as follows:

  1. Homogeneity: Arrays require all items to be of the same type, while lists can hold items of any type.
  2. Storage Efficiency: Arrays are more memory-efficient since there’s no need to store type information or the extra overhead that generic Python objects have (see the comparison after this list).
  3. Less Flexibility: Arrays don’t support all the general-purpose methods that lists do. This is a trade-off for their compact storage.
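A rough sketch of the storage difference with numeric data (exact figures vary by Python version and platform):

import sys
from array import array

# One million integers: the typed array stores raw 4-byte values,
# while the list stores an 8-byte pointer per element.
nums_array = array('i', range(1000000))  # 'i' = signed C int
nums_list = list(range(1000000))

print(f'array: {sys.getsizeof(nums_array) / 1024 ** 2:.2f} MB')  # ~4 MB
print(f'list:  {sys.getsizeof(nums_list) / 1024 ** 2:.2f} MB')   # ~8 MB
# The list figure excludes the int objects themselves, so the
# real gap is even larger than getsizeof suggests.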

When to Use Arrays:

  1. Fixed Data Types: Best when you know your data consists entirely of one type, like float or int.
  2. Memory Concerns: If memory usage is a concern, especially with large data sequences, and the type is consistent.
  3. Interfacing with C: If you’re working with libraries that interface with C and require contiguous memory, arrays might be preferred.

When Not to Use Arrays:

  1. Mixed Data Types: If your dataset has multiple types, like strings and numbers mixed together.
  2. Full Functionality: When you need a rich set of list methods to manipulate data.
  3. Short Sequences: For small datasets, the memory savings might be negligible, and the flexibility of lists would be more beneficial.

Conclusions

  • Generators: Perfect when you’re dealing with large amounts of data and don’t need to access all the items simultaneously. Like Naruto’s shadow clones, you summon them as needed!
  • Array: Efficient for large lists where the data type is consistent. However, the functionality is limited compared to lists.
  • Lists: Best for small amounts of data and when you want the flexibility of built-in methods.

Remember, optimization is all about choosing the right tool for the task at hand. Think about the needs of your program and the type of data you’re working with. Sometimes, like Naruto with his shadow clones, a little strategy goes a long way!

And this is not the end. There are many other approaches relevant in specific contexts. For example, you can use NumPy when dealing with numbers.
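As a rough sketch of that difference (assuming NumPy is installed; exact figures vary):

import sys
import numpy as np

# One million 64-bit floats stored as one raw machine buffer
values = np.arange(1000000, dtype=np.float64)
print(f'NumPy buffer: {values.nbytes / 1024 ** 2:.2f} MB')  # ~7.63 MB

# The equivalent list needs ~8 MB of pointers alone, plus a
# separate 24-byte float object for every element on top of that.
values_list = [float(i) for i in range(1000000)]
print(f'list pointers: {sys.getsizeof(values_list) / 1024 ** 2:.2f} MB')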

Explore and test such possibilities for your specific use case.

Created using DALLE.
