<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:cc="http://cyber.law.harvard.edu/rss/creativeCommonsRssModule.html">
    <channel>
        <title><![CDATA[Stories by Marvin Lanhenke on Medium]]></title>
        <description><![CDATA[Stories by Marvin Lanhenke on Medium]]></description>
        <link>https://medium.com/@marvinlanhenke?source=rss-1ea0548a5421------2</link>
        <image>
            <url>https://cdn-images-1.medium.com/fit/c/150/150/1*iEJ9qn9i9dTw3uyXG2jsTA.jpeg</url>
            <title>Stories by Marvin Lanhenke on Medium</title>
            <link>https://medium.com/@marvinlanhenke?source=rss-1ea0548a5421------2</link>
        </image>
        <generator>Medium</generator>
        <lastBuildDate>Mon, 11 May 2026 16:54:46 GMT</lastBuildDate>
        <atom:link href="https://medium.com/@marvinlanhenke/feed" rel="self" type="application/rss+xml"/>
        <webMaster><![CDATA[yourfriends@medium.com]]></webMaster>
        <atom:link href="http://medium.superfeedr.com" rel="hub"/>
        <item>
            <title><![CDATA[Breaking Through the GIL Barrier With Asyncio and Multithreading]]></title>
            <link>https://python.plainenglish.io/breaking-through-the-gil-barrier-with-asyncio-and-multithreading-d6b44bb6e0e9?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/d6b44bb6e0e9</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[multithreading]]></category>
            <category><![CDATA[software-development]]></category>
            <category><![CDATA[developer]]></category>
            <category><![CDATA[asynchronous]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Tue, 21 Mar 2023 13:16:46 GMT</pubDate>
            <atom:updated>2023-03-21T13:16:46.131Z</atom:updated>
            <content:encoded><![CDATA[<h4>Python Concurrency</h4><h4>How to run blocking libraries concurrently with Asyncio, Multithreading, and Python</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*I-qHg3uk0sVns5Fp" /><figcaption>Photo by <a href="https://unsplash.com/@cavespider?utm_source=medium&amp;utm_medium=referral">Crispin Jones</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Python is extremely popular. It’s widely used. And it offers a plethora of third-party libraries and frameworks to choose from.</p><p>When it comes to developing asynchronous applications, Asyncio and other non-blocking libraries like Aiohttp are natural fits and easy choices to make. Unfortunately <em>(or luckily — that depends)</em>, we don’t always start from scratch.</p><blockquote>Most of the time, we manage existing code. Existing code that uses blocking I/O libraries.</blockquote><p>And with blocking libraries there is no concurrency. There are no sweet performance gains. Asynchrony? Forget about it!</p><p>Fortunately, this is not quite the case.</p><p>In the following sections, we’ll dive into the details of how we can leverage multithreading, thread pools, and Asyncio to run blocking I/O operations concurrently. 
Allowing us to still harness some of the desired performance gains.</p><p>So, if you are looking to unlock the full potential of your Python code, keep reading to find out how to use Asyncio with multithreading.</p><h4>Release the GIL (and not the Kraken)</h4><p>The Global Interpreter Lock (GIL) is a mechanism in Python that ensures only one thread executes Python bytecode at a time.</p><p><a href="https://medium.com/geekculture/unlock-the-power-of-python-a-beginners-guide-to-concurrency-c80b6f2aef3a">Unlock the Power of Python: A Beginner’s Guide to Concurrency</a></p><p>This means that even if we have multiple threads in our program, only one thread can execute Python code at a given time. At first glance, then, multithreading may not seem like an effective approach for improving performance in Python.</p><p>However, when it comes to I/O-bound tasks, such as reading from or writing to a file, network, or database, the GIL is temporarily released, allowing other threads to run Python code. This means that even though Python only allows one thread to run Python code at a time, it is possible to use multiple threads to achieve better performance when performing I/O-bound tasks.</p><p>When working with existing code that uses blocking I/O libraries, using Asyncio with multithreading can be an effective way to improve performance.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/666/1*C3yDNRK8ZDV8_P7uAJF1vw.png" /><figcaption>Multithreading with blocking I/O [Image by Author]</figcaption></figure><p>Python’s threading module allows us to manage threads, so we can take advantage of the temporary release of the GIL to run multiple threads concurrently. This way, multiple operations never block each other and only block the thread they are running in.</p><h4>A Pool Full of Threads</h4><p>While Python’s threading module is a great place to start, we don’t really want to create and keep track of every single thread individually. 
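</p><p>To see what that bookkeeping looks like, here is a rough sketch that manages threads by hand. A time.sleep call stands in for a blocking operation such as requests.get, since it also releases the GIL while it waits:</p>

```python
import threading
import time

def blocking_io(task_id: int, results: list) -> None:
    # time.sleep stands in for a blocking call such as requests.get;
    # the GIL is released while the thread waits, so other threads can run
    time.sleep(0.2)
    results.append(task_id)

results: list = []
# Create, start, and join every single thread by hand
threads = [threading.Thread(target=blocking_io, args=(i, results)) for i in range(10)]

start = time.time()
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - start

# The ten 0.2-second waits overlap, so the total stays close to 0.2s, not 2s
print(f"{len(results)} tasks finished in {elapsed:.2f} seconds")
```

<p>All of this is bookkeeping we would rather hand off.</p><p>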
Luckily for us, we can utilize something called a ThreadPoolExecutor.</p><p>But before doing any of that, let’s set the stage with an easy example. Let’s make a few basic web requests by making use of the library requests <em>(well, that came as a surprise…)</em>.</p><pre>import time<br>import requests<br><br>def get_status_code(url: str) -&gt; int:<br>    response = requests.get(url)<br>    return response.status_code<br><br>def main() -&gt; None:<br>    start = time.time()<br><br>    urls = [&#39;https://www.example.com&#39; for _ in range(100)]<br>    for url in urls:<br>        get_status_code(url)<br><br>    end = time.time()<br>    print(f&quot;Total time: {end-start:.4f} seconds&quot;)<br><br>main()</pre><p>In this example, we make 100 sequential requests, fetching the status code from the specified URL. Running this simple script takes roughly 48 seconds to finish.</p><p>Now, let’s try to improve it by using multiple threads.</p><pre>import time<br>import requests<br><br>from concurrent.futures import ThreadPoolExecutor<br><br>def get_status_code(url: str) -&gt; int:<br>    response = requests.get(url)<br>    return response.status_code<br><br>def main() -&gt; None:<br>    start = time.time()<br><br>    with ThreadPoolExecutor() as pool:<br>        urls = [&#39;https://www.example.com&#39; for _ in range(100)]<br>        pool.map(get_status_code, urls)<br><br>    end = time.time()<br>    print(f&quot;Total time: {end-start:.4f} seconds&quot;)<br><br>main()</pre><p>The ThreadPoolExecutor provides a convenient way to execute tasks concurrently without having to manage threads directly. It automatically creates and manages a pool of threads to which we can submit work.</p><p>This allows us to run the blocking requests in separate threads concurrently. Running the example from above takes about 4.5 seconds. That’s roughly 10x faster than before.</p><blockquote><strong>Note</strong>: Determining how many worker threads to create is quite complicated. 
The formula for the default number of threads is min(32, os.cpu_count() + 4).</blockquote><h4>Pool Skimming with Asyncio</h4><p>Making use of the ThreadPoolExecutor is just fine. It works. However, we can still improve our code by incorporating the use of Asyncio.</p><p>In Python 3.9, the asyncio.to_thread coroutine was introduced. This allows us to further simplify putting work on the default thread pool executor. It takes in a function and a set of arguments to run in a thread.</p><pre>import time<br>import asyncio<br>import requests<br><br>def get_status_code(url: str) -&gt; int:<br>    response = requests.get(url)<br>    return response.status_code<br><br>async def main() -&gt; None:<br>    start = time.time()<br><br>    urls = [&#39;https://www.example.com&#39; for _ in range(100)]<br>    # Create a list of coroutines, putting work on the thread pool executor<br>    coroutines = [asyncio.to_thread(get_status_code, url) for url in urls]<br><br>    await asyncio.gather(*coroutines)<br><br>    end = time.time()<br>    print(f&quot;Total time: {end-start:.4f} seconds&quot;)<br><br><br>asyncio.run(main())</pre><p>And that’s all there is to it.</p><p>By using asyncio.to_thread, we were able to reduce the number of lines of code and simplify our program. Despite this, it still takes around 4.5 seconds to execute, which makes sense: the same thread pool is still doing the work under the hood. Essentially, we have just employed a more straightforward and compact syntax.</p><p>Up to this point, we have compared two distinct methods for making requests: One sequentially and another concurrently using multiple threads. 
However, there exists a third approach that involves making requests concurrently within a single thread using Aiohttp.</p><pre>import time<br>import aiohttp<br>import asyncio<br><br>from aiohttp import ClientSession<br><br>async def fetch_status(session: ClientSession, url: str) -&gt; int:<br>    async with session.get(url) as response:<br>        return response.status<br><br>async def main() -&gt; None:<br>    start = time.time()<br><br>    async with aiohttp.ClientSession() as session:<br>        urls = [&#39;https://www.example.com&#39; for _ in range(100)]<br>        coroutines = [fetch_status(session, url) for url in urls]<br><br>        await asyncio.gather(*coroutines)<br><br>    end = time.time()<br>    print(f&quot;Total time: {end-start:.4f} seconds&quot;)<br><br><br>asyncio.run(main())</pre><p>Executing the single-threaded version of the program takes merely 0.6 seconds, which is 7.5 times faster than the multi-threaded approach.</p><p>The reason behind this difference in speed is that creating threads involves additional overhead. When threads are created, the operating system has to allocate more resources than it does for coroutines. Moreover, threads have to undergo context-switching at the OS level, which can incur an additional cost.</p><p>By using Aiohttp and a single-threaded approach, we can avoid all these extra costs and achieve even greater performance gains. In comparison to the synchronous blocking method, which took 48 seconds to execute, we have boosted our performance by a factor of 80.</p><h4>Conclusion</h4><p>Python offers various techniques to attain concurrency and enhance the performance of your applications. Using Asyncio and multithreading, we can create more efficient programs that can handle multiple tasks concurrently.</p><p>By utilizing thread pools and async/await syntax, we can achieve concurrency, even with blocking libraries, without having to manage threads directly. 
Additionally, we learned that Aiohttp offers a more streamlined and effective way to execute requests concurrently within a single thread, resulting in even better performance gains compared to traditional multithreading techniques.</p><p><a href="https://medium.com/@marvinlanhenke/list/fe3c84a64389">Python Concurrency</a></p><p><em>If you enjoyed the read, make sure to hit ‘follow’ for more on Python concurrency and advanced techniques to take your programming skills to the next level.</em></p><p><em>Consider becoming a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li>Fowler, Matthew. (2022). Python Concurrency with Asyncio. Manning Publications.</li></ul><p><em>More content at </em><a href="https://plainenglish.io/"><strong><em>PlainEnglish.io</em></strong></a><em>.</em></p><p><em>Sign up for our </em><a href="http://newsletter.plainenglish.io/"><strong><em>free weekly newsletter</em></strong></a><em>. Follow us on </em><a href="https://twitter.com/inPlainEngHQ"><strong><em>Twitter</em></strong></a>, <a href="https://www.linkedin.com/company/inplainenglish/"><strong><em>LinkedIn</em></strong></a><em>, </em><a href="https://www.youtube.com/channel/UCtipWUghju290NWcn8jhyAw"><strong><em>YouTube</em></strong></a><em>, and </em><a href="https://discord.gg/GtDtUAvyhW"><strong><em>Discord</em></strong></a><strong><em>.</em></strong></p><p><strong><em>Interested in scaling your software startup</em></strong><em>? 
Check out </em><a href="https://circuit.ooo?utm=publication-post-cta"><strong><em>Circuit</em></strong></a><em>.</em></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=d6b44bb6e0e9" width="1" height="1" alt=""><hr><p><a href="https://python.plainenglish.io/breaking-through-the-gil-barrier-with-asyncio-and-multithreading-d6b44bb6e0e9">Breaking Through the GIL Barrier With Asyncio and Multithreading</a> was originally published in <a href="https://python.plainenglish.io">Python in Plain English</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Concurrent Web Requests with Aiohttp: Get More Done in Less Time]]></title>
            <link>https://levelup.gitconnected.com/concurrent-web-requests-with-aiohttp-get-more-done-in-less-time-9b4adada2a07?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/9b4adada2a07</guid>
            <category><![CDATA[asynchronous]]></category>
            <category><![CDATA[asyncio]]></category>
            <category><![CDATA[python-programming]]></category>
            <category><![CDATA[aiohttp]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Tue, 14 Mar 2023 15:19:32 GMT</pubDate>
            <atom:updated>2023-03-14T15:19:32.088Z</atom:updated>
<content:encoded><![CDATA[<h4>Python Concurrency</h4><h4>Discovering Aiohttp for Faster, Concurrent Web Requests</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*b-_IfcF65eYFpx9e" /><figcaption>Photo by <a href="https://unsplash.com/@aronvisuals?utm_source=medium&amp;utm_medium=referral">Aron Visuals</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Are you tired?</p><p>Tired of waiting for your requests to complete one by one?</p><p>Being stuck. Waiting. Only to be met and overwhelmed with the feeling of frustration and disappointment when the request finally times out? Have you tried using async/await everywhere, only to find out that most libraries are blocking anyway?</p><blockquote>Fear not, as the answer to your problems lies in Aiohttp.</blockquote><p>In the following sections, we will explore the beautifully concurrent world of <a href="http://docs.aiohttp.org/en/stable">Aiohttp</a>. A popular asynchronous HTTP client/server library for Python.</p><p>We will discover how to make non-blocking web requests that run concurrently and improve the application&#39;s performance. By the end of this blog post, you’ll not only know how to use Aiohttp but also how to handle exceptions and how asynchronous context managers work.</p><p>So don’t go anywhere, take a seat, fire up your IDE, and let’s get started.</p><h4>No more Blocking: Introducing Aiohttp</h4><p>It’s all about concurrency. Allowing multiple tasks to be executed simultaneously. 
That’s why asynchronous programming and libraries like Python’s Asyncio exist in the first place.</p><p><a href="https://levelup.gitconnected.com/the-beginners-guide-to-asyncio-in-python-a-deeper-dive-into-coroutines-and-tasks-9a289e061b88">The Beginner’s Guide to Asyncio in Python: A Deeper Dive into Coroutines and Tasks</a></p><p>However, one of the most common mistakes we tend to make <em>(Yep, I did it too)</em> is to apply the async/await syntax to every line of code we can get our hands on and hope for the best.</p><p>Well, most of the time the best is — nothing. Nothing happens at all. No concurrency. No sweet performance gains. But why?</p><p>Unfortunately, most libraries are blocking, meaning that they will block the main thread and event loop, rendering async/await basically ineffective. This is where non-blocking libraries like Aiohttp come into play. By using non-blocking sockets and utilizing asynchronous context managers, Aiohttp allows for efficient acquisition and closure of HTTP sessions, leading to improved performance and a more pythonic way of working [1].</p><p>Before diving deep into the inner workings of Aiohttp, let’s take a small detour and talk about asynchronous context managers first.</p><h4>Managing Asynchrony: The Pythonic Way</h4><p>It’s very common to deal with resources in a way that requires them to be opened and then to be closed. Think of a file for example.</p><p>We open it. We read it. We close it. Nothing fancy so far.</p><p>However, we need to be careful not to leak any resources. If for any reason an exception is raised our resource might never be properly closed. To avoid any leaking resources we have several options to choose from.</p><p>First, we can wrap our code in a try/finally block, making sure the resource will be closed no matter what. Second, we can apply a more pythonic way of dealing with resources. 
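</p><p>For reference, the first option, a plain try/finally block, might look like this rough sketch (the file path here is just a placeholder):</p>

```python
import os
import tempfile

# Option one: guarantee cleanup with try/finally
path = os.path.join(tempfile.gettempdir(), "example.txt")
f = open(path, "w")
try:
    f.write("Hello, World!")
finally:
    # This runs whether or not write() raises, so the handle never leaks
    f.close()

print(f.closed)  # prints: True
```

<p>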
Context managers [2].</p><pre># Use of a synchronous context manager<br>with open(&quot;example.txt&quot;, &quot;r&quot;) as f:<br>    contents = f.read()<br>    print(contents)</pre><p>In Python, context managers are used to ensure that resources are properly closed even if an exception is raised during runtime [3]. However, traditional context managers only work with synchronous code.</p><p>With the introduction of asynchronous context managers [4], we can now manage resources asynchronously by using the async with syntax. Now, we can acquire and close resources like HTTP sessions more cleanly and in a more Pythonic way. This is why asynchronous context managers lie at the core of Aiohttp.</p><p>Let’s take a look at a basic example to illustrate how asynchronous context managers work.</p><pre>import asyncio<br><br># Implement the context manager protocol<br>class AsyncContextManager:<br>    async def __aenter__(self):<br>        print(&quot;Entering async context...&quot;)<br>        return self<br>    <br>    async def __aexit__(self, exc_type, exc_value, traceback):<br>        print(&quot;Exiting async context...&quot;)<br>        return False<br><br># Define the main coroutine<br>async def main():<br>    async with AsyncContextManager():<br>        print(&quot;Inside async context...&quot;)<br><br>asyncio.run(main())</pre><p>In this example, we define an AsyncContextManager class that implements the async context manager protocol by defining the __aenter__ and __aexit__ methods.</p><p>When the async with block is executed, the __aenter__ method is called, which in this case simply prints a message to indicate that the async context has been entered. 
The code inside the async with block is then executed, which in this example just prints another message.</p><p>When the async with block is exited, the __aexit__ method is called, which also prints a message to indicate that the async context has been exited.</p><p>If an exception occurs inside the async with block, the __aexit__ method is called with the details of the exception, allowing the context manager to handle the exception if necessary.</p><h4>Making Concurrent Web Requests with Aiohttp</h4><p>Now that we know about non-blocking libraries and resource handling with asynchronous context managers, it’s finally time to make some requests.</p><p>Non-blocking requests. Concurrently. Of course.</p><p>But, before we do any of that. Let’s do it the old-fashioned way first — synchronously.</p><pre>import time<br>import requests<br><br><br>def fetch_status(url: str) -&gt; int:<br>    response = requests.get(url)<br>    return response.status_code<br><br><br>def main() -&gt; None:<br>    start = time.time()<br><br>    urls = [&#39;http://python.org&#39; for _ in range(10)]<br>    results = [fetch_status(url) for url in urls]<br>    print(results)<br><br>    end = time.time()<br>    print(f&quot;Total time: {end-start:.4f} seconds&quot;)<br><br><br>main()</pre><p>In this example, we make use of the requests library, which is blocking by default. We simply execute 10 requests sequentially and fetch the status code.</p><p>Running this code takes about 1.4 seconds.</p><p>Now, let’s do the same thing again. 
However, this time we make use of Aiohttp.</p><pre>import time<br>import aiohttp<br>import asyncio<br><br>from aiohttp import ClientSession<br><br><br>async def fetch_status(session: ClientSession, url: str) -&gt; int:<br>    # Use ClientSession to make a GET Request<br>    async with session.get(url) as response:<br>        return response.status<br><br><br>async def main() -&gt; None:<br>    start = time.time()<br><br>    # Acquire new ClientSession<br>    async with aiohttp.ClientSession() as session:<br>        urls = [&#39;http://python.org&#39; for _ in range(10)]<br>        requests = [fetch_status(session, url) for url in urls]<br><br>        results = await asyncio.gather(*requests)<br>        print(results)<br><br>    end = time.time()<br>    print(f&quot;Total time: {end-start:.4f} seconds&quot;)<br><br><br>loop = asyncio.get_event_loop()<br>loop.run_until_complete(main())</pre><p>Running the code above takes only about 0.2 seconds.</p><p>7x faster than before. This is the power of concurrency.</p><p>So how does this work?</p><p>In order to make web requests, Aiohttp relies on the concept of sessions, where one session can have multiple connections open. This is known as “connection pooling,” a technique for managing a pool of reusable network connections to a server [5]. This allows us to avoid the overhead of creating a new connection for each request.</p><p>Once we have obtained a session, we can make our GET requests. We utilize our helper coroutine fetch_status to create multiple requests and schedule them on the event loop by using asyncio.gather.</p><h4>Things will fail. They simply do.</h4><p>A lot of things can go wrong when making a network request. Unreliable connections. Bad requests. Data errors. All of these issues can cause our request to run indefinitely. 
Thus, we need a way to time out.</p><p>Luckily for us, we can make use of Aiohttp’s ClientTimeout data structure.</p><pre>import aiohttp<br>import asyncio<br><br>from aiohttp import ClientSession<br><br><br>async def fetch_status(session: ClientSession, url: str) -&gt; int:<br>    # Apply a timeout at request level<br>    request_timeout = aiohttp.ClientTimeout(total=0.2)<br>    async with session.get(url, timeout=request_timeout) as response:<br>        return response.status<br><br><br>async def main() -&gt; None:<br>    # Apply a timeout at session level<br>    session_timeout = aiohttp.ClientTimeout(total=1.0, connect=0.2)<br>    async with aiohttp.ClientSession(timeout=session_timeout) as session:<br>        result = await fetch_status(session, &#39;http://python.org&#39;)<br>        print(result)<br><br><br>loop = asyncio.get_event_loop()<br>loop.run_until_complete(main())</pre><p>In the example above, we simply specify two timeouts. One at the session and the other at the request level. If our request, for example, takes too long an asyncio.TimeoutError will be raised.</p><p>But what if a single request fails? What about exception handling?</p><p>Unfortunately, exception handling when running multiple requests with asyncio.gather is a bit clunky. However, we can make use of the parameter return_exceptions=True which will include all exceptions raised in the result list. 
This allows us to handle the exceptions accordingly.</p><pre>import aiohttp<br>import asyncio<br><br>from aiohttp import ClientSession<br><br><br>async def fetch_status(session: ClientSession, url: str) -&gt; int:<br>    async with session.get(url) as response:<br>        return response.status<br><br><br>async def main() -&gt; None:<br>    async with aiohttp.ClientSession() as session:<br>        urls = [&#39;http://python.org&#39;, &#39;invalid://address.org&#39;]<br>        requests = [fetch_status(session, url) for url in urls]<br>        <br>        # Include raised exceptions in result list<br>        results = await asyncio.gather(*requests, return_exceptions=True)<br>        # Outputs: [200, AssertionError()]        <br>        print(results)<br>        <br><br>loop = asyncio.get_event_loop()<br>loop.run_until_complete(main())</pre><h4>Just slightly more control. Please.</h4><p>Using asyncio.gather is convenient. But it has its drawbacks.</p><p>Exception handling is somewhat clunky and additionally, we have to wait. We have to wait until all requests are completed before we can proceed to work with any of the results. So if there is just one bad request, that takes forever — we’ll most likely end up waiting forever.</p><p>Fortunately, there is another way.</p><p>We can make use of asyncio.wait which takes a list of awaitables and returns two sets. 
A set of tasks that are finished, and a set of tasks that are pending.</p><pre>import aiohttp<br>import asyncio<br><br>from aiohttp import ClientSession<br><br><br>async def fetch_status(session: ClientSession, url: str, delay: int) -&gt; int:<br>    await asyncio.sleep(delay)<br>    async with session.get(url) as response:<br>        return response.status<br><br><br>async def main() -&gt; None:<br>    async with aiohttp.ClientSession() as session:<br>        fetchers = [<br>            asyncio.create_task(fetch_status(session, &#39;http://python.org&#39;, 1)),<br>            asyncio.create_task(fetch_status(session, &#39;http://python.org&#39;, 1)),<br>        ]<br><br>        # Wait for all tasks to be completed<br>        done, pending = await asyncio.wait(fetchers)<br><br>        for done_task in done:<br>            result = await done_task<br>            print(result)<br><br><br>loop = asyncio.get_event_loop()<br>loop.run_until_complete(main())</pre><p>In the example above, we get the same effect as if we had used asyncio.gather. We run our requests concurrently and wait until all tasks are completed.</p><p>However, with asyncio.wait we can specify a return_when parameter.</p><p>Let’s slightly modify the example and include long-running requests. We also want to make sure to set return_when=asyncio.FIRST_COMPLETED to return the result of whatever task finishes first.</p><p>We loop over a set of pending tasks and call asyncio.wait on that set with each iteration. 
Once we have a result, we update done and pending and print out any results as soon as possible.</p><pre>import aiohttp<br>import asyncio<br><br>from aiohttp import ClientSession<br><br><br>async def fetch_status(session: ClientSession, url: str, delay: int) -&gt; int:<br>    await asyncio.sleep(delay)<br>    async with session.get(url) as response:<br>        return response.status<br><br><br>async def main() -&gt; None:<br>    async with aiohttp.ClientSession() as session:<br>        # Create a set of pending tasks with different delays<br>        pending = [<br>            asyncio.create_task(fetch_status(session, &#39;http://python.org&#39;, 3)),<br>            asyncio.create_task(fetch_status(session, &#39;http://python.org&#39;, 1)),<br>            asyncio.create_task(fetch_status(session, &#39;http://python.org&#39;, 2)),<br>        ]<br><br>        # Loop over the set as long as tasks are pending<br>        while pending:<br>            # Update both sets<br>            done, pending = await asyncio.wait(<br>                pending,<br>                return_when=asyncio.FIRST_COMPLETED,<br>            )<br><br>            print(f&quot;Tasks done: {len(done)}&quot;)<br>            print(f&quot;Tasks pending: {len(pending)}&quot;)<br>            <br>            # Print results that are already done<br>            for done_task in done:<br>                result = await done_task<br>                print(result)<br><br><br>loop = asyncio.get_event_loop()<br>loop.run_until_complete(main())<br><br># Output:<br># Tasks done: 1<br># Tasks pending: 2<br># 200<br># Tasks done: 1<br># Tasks pending: 1<br># 200<br># Tasks done: 1<br># Tasks pending: 0<br># 200</pre><p>While this approach is definitely less convenient than the use of asyncio.gather and more verbose it allows for more fine-grained control.</p><p>As soon as one task is completed we can proceed to work with its result. 
Moreover, we get the ability to handle each task individually, which also includes exception handling or the cancellation of a task.</p><h4>Conclusion</h4><p>Aiohttp provides a solution to the issue of blocking libraries and allows for concurrent web requests and efficient acquisition and closure of HTTP sessions. This leads to improved performance and a more pythonic way of working.</p><p>However, it is important to note that while Aiohttp offers a significant improvement in performance, it may not be the best solution for every scenario. Additionally, it’s important to handle exceptions and timeouts appropriately when using Aiohttp and Asyncio in general.</p><p>This blog post only scratches the surface of what can be accomplished with non-blocking libraries, and there are many more to discover.</p><p><em>If you enjoyed the read, make sure to hit ‘follow’ for more on Python concurrency and advanced techniques to take your programming skills to the next level.</em></p><p><em>Consider becoming a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. 
I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li>[1] <a href="https://peps.python.org/pep-0020/">The Zen of Python</a></li><li>[2] <a href="https://towardsdatascience.com/why-you-should-use-context-managers-in-python-4f10fe231206">Why You Should Use Context Managers in Python</a></li><li>[3] <a href="https://www.geeksforgeeks.org/context-manager-in-python/">https://www.geeksforgeeks.org/context-manager-in-python/</a></li><li>[4] <a href="https://peps.python.org/pep-0492/">https://peps.python.org/pep-0492/</a></li><li>[5] <a href="https://www.cockroachlabs.com/blog/what-is-connection-pooling/">https://www.cockroachlabs.com/blog/what-is-connection-pooling/</a></li><li>Fowler, Matthew. (2022). Python Concurrency with Asyncio. Manning Publications.</li></ul><h3>Level Up Coding</h3><p>Thanks for being a part of our community! 
Before you go:</p><ul><li>👏 Clap for the story and follow the author 👉</li><li>📰 View more content in the <a href="https://levelup.gitconnected.com/?utm_source=pub&amp;utm_medium=post">Level Up Coding publication</a></li><li>💰 Free coding interview course ⇒ <a href="https://skilled.dev/?utm_source=luc&amp;utm_medium=article">View Course</a></li><li>🔔 Follow us: <a href="https://twitter.com/gitconnected">Twitter</a> | <a href="https://www.linkedin.com/company/gitconnected">LinkedIn</a> | <a href="https://newsletter.levelup.dev">Newsletter</a></li></ul><p>🚀👉 <a href="https://jobs.levelup.dev/talent/welcome?referral=true"><strong>Join the Level Up talent collective and find an amazing job</strong></a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=9b4adada2a07" width="1" height="1" alt=""><hr><p><a href="https://levelup.gitconnected.com/concurrent-web-requests-with-aiohttp-get-more-done-in-less-time-9b4adada2a07">Concurrent Web Requests with Aiohttp: Get More Done in Less Time</a> was originally published in <a href="https://levelup.gitconnected.com">Level Up Coding</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[The Beginner’s Guide to Asyncio in Python: A Deeper Dive into Coroutines and Tasks]]></title>
            <link>https://levelup.gitconnected.com/the-beginners-guide-to-asyncio-in-python-a-deeper-dive-into-coroutines-and-tasks-9a289e061b88?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/9a289e061b88</guid>
            <category><![CDATA[python-programming]]></category>
            <category><![CDATA[asyncio]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[asynchronous-programming]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Wed, 08 Mar 2023 03:40:58 GMT</pubDate>
            <atom:updated>2023-03-08T03:40:58.388Z</atom:updated>
            <content:encoded><![CDATA[<h4>Python Concurrency</h4><h4>Harness the Power of Coroutines, Tasks, and Futures</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*gMa1T51tsRAO1NgB" /><figcaption>Photo by <a href="https://unsplash.com/@laurieannerobert?utm_source=medium&amp;utm_medium=referral">Laurie-Anne Robert</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Are you tired of writing slow and unresponsive code? Do you want to build applications that are fast, scalable, and responsive?</p><p>If your answer is a resounding yes, then it’s time to dive into the world of asynchronous programming with Python’s Asyncio library!</p><p>Asynchronous programming has become a critical skill for modern developers, particularly in building scalable and responsive applications. Python’s Asyncio library offers a powerful and efficient way to write asynchronous code using coroutines, tasks, and futures.</p><blockquote>However, it can be daunting for beginners to dive into this new paradigm of programming.</blockquote><p>That’s why in the following sections, we’ll take a deeper dive into coroutines and tasks, and explore the power of concurrency through Asyncio.</p><p>So, if you’re ready to unlock the full potential of Asyncio in Python, look no further and fire up your IDE.</p><h4>Functions Unleashed: Coroutines</h4><p>They’re at the heart of Asyncio. The foundation.</p><p><strong>Coroutines</strong> are functions that have the ability to pause their execution [1] at specific points in time and resume later, allowing other code to run in the meantime. 
This “superpower” enables concurrency in our program, making it possible to write highly efficient and responsive code.</p><p>When a coroutine encounters a potentially long-running operation, it can pause its execution and allow other coroutines to run while it waits for the operation to complete, without blocking the main thread of execution. Once the operation is finished, the coroutine can wake up and continue where it left off.</p><p>To <strong>create a coroutine </strong>and utilize its “superpowers”, we simply mark a function with the async keyword. Now, we have a coroutine instead of a simple function.</p><pre>import asyncio<br><br># Create a coroutine with the `async` keyword<br>async def print_message() -&gt; None:<br>  print(&#39;Hello, Asyncio&#39;)<br><br># Calling the coroutine returns a coroutine object<br>print(type(print_message()))<br><br># Output:<br># &lt;class &#39;coroutine&#39;&gt;</pre><p>Note that coroutines aren’t executed when we call them, but rather they create a coroutine object that can be run later. To actually run a coroutine, we need to use an event loop.</p><p><a href="https://medium.com/geekculture/unlock-the-power-of-python-a-beginners-guide-to-concurrency-c80b6f2aef3a">Unlock the Power of Python: A Beginner’s Guide to Concurrency</a></p><p>In order to <strong>run the coroutine</strong> and get the actual result, we have to make use of an event loop. 
Luckily, the Asyncio library provides a convenience function for us: asyncio.run() serves as the main entry point to an Asyncio application.</p><p>It creates a new event loop, runs the coroutine, and returns the result.</p><pre>import asyncio<br><br># Create a coroutine with the `async` keyword<br>async def print_message() -&gt; None:<br>  print(&#39;Hello, Asyncio&#39;)<br><br># Create and run the event-loop<br>asyncio.run(print_message())<br><br># Output:<br># Hello, Asyncio</pre><p>But what if we want to run <strong>multiple coroutines</strong>?</p><p>To run a coroutine, we have to await it using the await keyword. This causes the coroutine to be run while pausing the caller’s execution until it returns a value.</p><pre>import asyncio<br><br># Create a coroutine with the `async` keyword<br>async def print_message(message: str, delay: int) -&gt; None:<br>    # Simulate a long running task by introducing a delay<br>    await asyncio.sleep(delay)<br>    print(message)<br><br># Create a main coroutine that can be called by asyncio.run<br># This enables us to run multiple coroutines inside the `main` coroutine<br>async def main():<br>    await print_message(&#39;Hello, World&#39;, 1)<br>    await print_message(&#39;Hello, Asyncio&#39;, 1)<br><br># Create and run the event-loop<br>asyncio.run(main())</pre><p>If we run the code above, we get Hello, World and Hello, Asyncio as an output. However, the program prints each result one by one, a second apart. It runs in sequence, not concurrently.</p><p>When we use the await keyword with a coroutine, it causes the coroutine to be run while also pausing the execution of the parent coroutine until it returns a value. 
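</p><p>To make the sequential behavior visible, we can time the two awaits. The following sketch reuses the example from above and simply adds time.perf_counter() measurements:</p>

```python
import asyncio
import time

async def print_message(message: str, delay: int) -> None:
    await asyncio.sleep(delay)
    print(message)

async def main() -> float:
    start = time.perf_counter()
    # Each await pauses `main` until the awaited coroutine finishes
    await print_message('Hello, World', 1)
    await print_message('Hello, Asyncio', 1)
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"Elapsed: {elapsed:.1f}s")  # roughly 2 seconds, one per await
```

<p>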
This is known as “blocking behavior.”</p><p>To achieve true concurrency, we need something else.</p><h4>Unlocking true Concurrency with Tasks</h4><p><strong>Tasks </strong>are an essential component of Asyncio in Python.</p><p>They’re wrappers around coroutines that schedule them to run on the event loop as soon as possible, in a non-blocking manner [2]. This allows us to execute multiple coroutines at virtually the same time, which is crucial for achieving concurrency.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/584/1*IGpTgcgpUFk5WYM-ni_dOA.png" /><figcaption>Running tasks concurrently [Image by Author]</figcaption></figure><p>Once we create a task, we can continue running other code. This is possible because tasks run in the background and don’t block the main thread of execution.</p><p>We can create a task using the asyncio.create_task() function, which returns a task object immediately. We can then use the await keyword to get the result of the task.</p><pre>import asyncio<br><br><br>async def print_message(message: str, delay: int) -&gt; None:<br>    await asyncio.sleep(delay)<br>    print(message)<br><br><br>async def main():<br>    # Create task objects<br>    task1 = asyncio.create_task(print_message(&#39;Hello, World&#39;, 1))<br>    task2 = asyncio.create_task(print_message(&#39;Hello, Asyncio&#39;, 1))<br>    <br>    # Await tasks to get the results<br>    await task1<br>    await task2<br><br>asyncio.run(main())</pre><p>If we run this code now, we retrieve both messages concurrently. The program takes roughly a second to complete instead of two as in the previous example.</p><blockquote><strong>Note</strong>: It’s important to note that if we don’t await a task, it is scheduled, but once the event loop shuts down it is cancelled and ‘cleaned up’, leaving us with no result to retrieve.</blockquote><p>Now, we can run multiple tasks concurrently.</p><p>While this is awesome, it is also quite cumbersome to create each task manually and await it one by one. 
This will become verbose and messy rather quickly.</p><p>Luckily for us, we can make use of another convenience function from the Asyncio library: asyncio.gather().</p><p>This function takes in a sequence of awaitables (more on that in the next section) and runs them concurrently. It is important to note that gather automatically wraps each coroutine in a task, so we don’t have to manually create each task by hand.</p><pre>import asyncio<br><br>async def print_message(message: str, delay: int) -&gt; None:<br>    await asyncio.sleep(delay)<br>    print(message)<br><br>async def main():<br>    coroutines = [<br>        print_message(&#39;Hello, Asyncio&#39;, 1),<br>        print_message(&#39;Hello, World&#39;, 1),<br>    ]<br><br>    # Use gather to wrap coroutines in tasks<br>    # and await to run them concurrently<br>    await asyncio.gather(*coroutines)<br><br>asyncio.run(main())</pre><blockquote><strong>Note</strong>: Since Python 3.11 we can also use <a href="https://docs.python.org/3/library/asyncio-task.html#task-groups">Task Groups</a>. Tasks can be added to a group by calling create_task() and by using an asynchronous context manager that awaits all tasks when exited.</blockquote><h4>But I Don’t Want to Wait Forever</h4><p>Asyncio tasks are nice.</p><p>However, they can run indefinitely, which would cause us to wait forever. Thus, we need a way to cancel a task.</p><p>Once again, the Asyncio library has got us covered. We can <strong>cancel a task</strong> using the cancel() method, which causes a CancelledError to be raised when the task is awaited.</p><pre>import asyncio<br>from asyncio import CancelledError<br><br># Simulate a long running task<br>async def delay(seconds: int) -&gt; None:<br>    await asyncio.sleep(seconds)<br><br>async def main() -&gt; None:<br>    long_task = asyncio.create_task(delay(10))<br><br>    seconds_waited = 0<br>    # Check each second if task is done<br>    while not long_task.done():<br>        print(&#39;Task not done. 
Waiting for another second.&#39;)<br>        await asyncio.sleep(1)<br><br>        seconds_waited += 1<br>        # If task takes too long, cancel it<br>        if seconds_waited == 5:<br>            print(&#39;Task takes too long. Cancelling.&#39;)<br>            long_task.cancel()<br><br>    try:<br>        await long_task<br>    except CancelledError:<br>        print(&#39;Task was cancelled&#39;)<br><br>asyncio.run(main())</pre><p>We can also make use of the asyncio.wait_for() function to set a timeout for a task and raise a TimeoutError if the task takes too long to complete.</p><pre>import asyncio<br>from asyncio.exceptions import TimeoutError<br><br># Simulate a long running task<br>async def delay(seconds: int) -&gt; None:<br>    await asyncio.sleep(seconds)<br><br>async def main() -&gt; None:<br>    long_task = asyncio.create_task(delay(5))<br><br>    try:<br>        # Wait for task to complete, raise TimeoutError after 1 second<br>        await asyncio.wait_for(long_task, timeout=1.0)<br>    except TimeoutError:<br>        print(&#39;Task took too long. TimeoutError&#39;)<br>        print(f&quot;Task has been cancelled: {long_task.cancelled()}&quot;)<br><br>asyncio.run(main())</pre><blockquote><strong>Note</strong>: Additionally, you can use the shield() function to prevent a task from being canceled, which can be useful in some situations where you need to guarantee that a task completes, even if other tasks are canceled.</blockquote><h4>Back to the Future</h4><p>So far, we’ve learned about coroutines and tasks.</p><p>This is awesome because those objects allow us to write programs that run concurrently and more efficiently.</p><p>But where is all of this coming from?</p><p>In Asyncio, a <strong>future </strong>is an object that represents the result of an asynchronous operation [3]. It’s a placeholder for a value that hasn’t been computed yet but will be at some point in time. You can think of “Promises” if you are familiar with JavaScript. 
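</p><p>To make this concrete, here is a small sketch (the helper names set_after and fut are my own) that creates an empty future, fills it from another coroutine, and awaits it:</p>

```python
import asyncio

async def set_after(fut: asyncio.Future, delay: float, value: str) -> None:
    # Simulate an asynchronous operation that eventually produces a result
    await asyncio.sleep(delay)
    fut.set_result(value)

async def main() -> str:
    loop = asyncio.get_running_loop()
    fut = loop.create_future()  # an empty placeholder for a future value
    asyncio.create_task(set_after(fut, 0.1, 'done'))
    # Awaiting the future pauses `main` until set_result() is called
    return await fut

result = asyncio.run(main())
print(result)  # done
```

<p>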
Awaiting a future will cause the execution of a program to be paused until the result is set.</p><p>A<strong> task </strong>can be thought of as a combination of a coroutine and a future. When a task is created we also create an empty future. Once the coroutine runs and finishes, the result is set. The future is now no longer empty and we can retrieve the result.</p><p>So what do all of those objects have in common? What is it that connects futures, tasks, and coroutines?</p><p>The common thread between those three classes is the Awaitable abstract base class. And this seems reasonable since coroutines, tasks, and futures can all be used with an await expression. There is an even stronger relationship between tasks and futures. Tasks directly inherit from and extend futures, which can be seen in the inheritance diagram below.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/553/1*Dgu1OQazDBNnez7UmkHq8w.png" /><figcaption>Inheritance diagram [Image by Author]</figcaption></figure><h4>Common Pitfalls</h4><p>Once we know the basics, it’s tempting to use Asyncio everywhere by simply tagging each function with async/await syntax. Unfortunately, it’s not that simple and there are at least two common pitfalls we should be aware of.</p><p>One of the main pitfalls is trying to run <strong>CPU-bound</strong> code in tasks and coroutines <strong>without using multiprocessing</strong>. The GIL (Global Interpreter Lock) in Python will block the concurrent execution of code. So if we try to run CPU-bound tasks in Asyncio without using multiprocessing, we may find that our program doesn’t run as fast as we’d like.</p><p>Another common pitfall is using <strong>blocking I/O libraries</strong> or APIs without multithreading. When we use a blocking I/O library or API, it can block the main thread of execution, which in turn can block the event loop itself. 
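</p><p>We can see this pitfall in action by deliberately calling the blocking time.sleep() inside coroutines (a sketch of what not to do):</p>

```python
import asyncio
import time

async def blocking_work(delay: float) -> None:
    # WRONG: time.sleep() blocks the whole event loop;
    # asyncio.sleep() would yield control instead
    time.sleep(delay)

async def main() -> float:
    start = time.perf_counter()
    await asyncio.gather(blocking_work(0.2), blocking_work(0.2))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"Elapsed: {elapsed:.1f}s")  # ~0.4s: the calls ran one after another
```

<p>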
This means that everything will run in sequence, and our program won’t be able to take advantage of the asynchronous nature of Asyncio.</p><p>It‘s also important to be aware of the limitations of Asyncio and to use it in situations where it makes sense. Asyncio is best suited for I/O-bound tasks that involve waiting for network or disk I/O, rather than CPU-bound tasks that involve intensive computation.</p><h4>Conclusion</h4><p>Asyncio is a powerful Python library that enables us to write asynchronous, non-blocking code for I/O-bound tasks. With its coroutines, tasks, and futures, Asyncio provides a flexible and efficient way to manage concurrent tasks in your programs.</p><p>When working with Asyncio however, there are a few common pitfalls to be aware of, such as running CPU-bound code in tasks without using multiprocessing or using blocking I/O libraries without multithreading. By avoiding these pitfalls and using Asyncio in the right situations, we can write more efficient, non-blocking, and responsive code.</p><p>While this blog post has covered just a few of the basics of Asyncio, there is much more to explore. With advanced features such as semaphores, queues, and streams, we can create even more powerful and sophisticated asynchronous programs.</p><p><em>If you enjoyed the read, make sure to hit ‘follow’ for more on Python concurrency and advanced techniques to take your programming skills to the next level.</em></p><p><em>Become a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. 
I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li>[1] <a href="https://snarky.ca/how-the-heck-does-async-await-work-in-python-3-5/">https://snarky.ca/how-the-heck-does-async-await-work-in-python-3-5/</a></li><li>[2] <a href="https://docs.python.org/3/library/asyncio-task.html#creating-tasks">https://docs.python.org/3/library/asyncio-task.html#creating-tasks</a></li><li>[3] <a href="https://docs.python.org/3/library/asyncio-future.html#future-object">https://docs.python.org/3/library/asyncio-future.html#future-object</a></li><li>Fowler, Matthew. (2022). Python Concurrency with Asyncio. Manning Publications.</li></ul><hr><p><a href="https://levelup.gitconnected.com/the-beginners-guide-to-asyncio-in-python-a-deeper-dive-into-coroutines-and-tasks-9a289e061b88">The Beginner’s Guide to Asyncio in Python: A Deeper Dive into Coroutines and 
Tasks</a> was originally published in <a href="https://levelup.gitconnected.com">Level Up Coding</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Unlock the Power of Python: A Beginner’s Guide to Concurrency]]></title>
            <link>https://medium.com/geekculture/unlock-the-power-of-python-a-beginners-guide-to-concurrency-c80b6f2aef3a?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/c80b6f2aef3a</guid>
            <category><![CDATA[asyncio]]></category>
            <category><![CDATA[python-programming]]></category>
            <category><![CDATA[concurrency]]></category>
            <category><![CDATA[asynchronous-programming]]></category>
            <category><![CDATA[python]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Fri, 03 Mar 2023 18:01:18 GMT</pubDate>
            <atom:updated>2023-03-03T18:01:18.845Z</atom:updated>
            <content:encoded><![CDATA[<h4>Python Concurrency</h4><h4>Learn the Basic Concepts of Concurrency and Upgrade Your Python Skills today!</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*EWbUXbCplXZDbj1u" /><figcaption>Photo by <a href="https://unsplash.com/@chuttersnap?utm_source=medium&amp;utm_medium=referral">CHUTTERSNAP</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Python is popular. It’s versatile. And it’s widely used for everything from web development to scientific computing [1].</p><p>However, many of today&#39;s applications rely heavily on the use of input/output (I/O) operations, which can have a severe and compounding impact on performance. Luckily for us, Python offers a range of tools and techniques to tackle this issue. Most importantly concurrency.</p><blockquote>But with many options to choose from, it can become quite a challenge, especially for beginners, to figure out where to start.</blockquote><p>In the following sections, we will break down the key and basic concepts for Python concurrency. Covering everything from multithreading and multiprocessing to the Global Interpreter Lock and single-threaded concurrency.</p><p>Whether you are looking to optimize your application’s performance or simply want to improve your programming skills, this guide has got you covered.</p><h4>Concurrency, Parallelism, and Multitasking</h4><p>Let’s begin our journey by covering some basic terminology first.</p><p><strong>Concurrency </strong>is the concept of allowing more than one task to be handled at the same time, out of or in partial order [2]. This can be incredibly useful for improving the performance of applications.</p><p>In Python, concurrency can be achieved in several ways, including threading, multiprocessing, and asynchronous programming using libraries such as <a href="https://docs.python.org/3/library/asyncio.html">asyncio</a>. 
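</p><p>As a small first taste (a sketch; the worker names are illustrative), two asyncio tasks can take turns by voluntarily yielding control:</p>

```python
import asyncio

order = []

async def worker(name: str) -> None:
    for i in range(2):
        order.append(f"{name}{i}")
        await asyncio.sleep(0)  # voluntarily yield control to the event loop

async def main() -> None:
    await asyncio.gather(worker("a"), worker("b"))

asyncio.run(main())
print(order)  # ['a0', 'b0', 'a1', 'b1']: the two workers take turns
```

<p>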
One important aspect of concurrency is that it is achieved through switching between tasks, which means that only one task is actively being executed at a time, even if multiple tasks are in progress. Thus, concurrency does not imply running multiple tasks in parallel.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/698/1*cYarZczsnrphzH_b0rCr3A.png" /><figcaption>Example of concurrency [Image by Author]</figcaption></figure><p><strong>Parallelism</strong>, on the other hand, is the concept of actively doing more than one task at the same time. Multiple calculations or processes are carried out simultaneously [3]. This is different from concurrency and multiple CPU cores are required to be able to execute various tasks simultaneously.</p><p>In Python, parallelism can be achieved using the <a href="https://docs.python.org/3/library/multiprocessing.html">multiprocessing module</a>, which allows multiple processes to be created and run in parallel. Unlike concurrency, parallelism can significantly improve the performance of CPU-bound tasks.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/682/1*EUuMQsJyNuQhKIUe4tqukw.png" /><figcaption>Example of parallelism [Image by Author]</figcaption></figure><blockquote><strong>Note</strong>: I/O-bound operations in Python are tasks that spend most of the time waiting for input/output operations to complete, such as network requests, file I/O, or user input. CPU-bound operations are tasks that require heavy computations and do not rely on I/O operations for their executions.</blockquote><p><strong>Multitasking </strong>allows for numerous tasks to be executed concurrently in an interleaved manner by sharing resources between them [4]. It can be achieved in two ways: Preemptive and cooperative multitasking.</p><p><strong>Preemptive multitasking</strong> is managed by the operating system, which decides when to switch between tasks. 
It does so by using an algorithm such as <a href="https://en.wikipedia.org/wiki/Preemption_(computing)#Time_slice">time-slicing</a>. In this approach, each task is allocated a time slice or quantum, during which it can run on the CPU. When the time slice expires, the operating system interrupts and switches to another task. The switching between tasks is resource-intensive and may lead to poor performance if not managed properly.</p><p><strong>Cooperative multitasking</strong>, on the other hand, relies on the developer to provide specific code instructions that allow tasks to yield control to other tasks voluntarily. This approach can be beneficial since it is less resource-intensive and provides more granularity and control.</p><p>In Python, cooperative multitasking is often achieved using coroutines and the async/await syntax. The asyncio module is a good example of cooperative multitasking, where coroutines are run in a single-threaded event loop, and the developer can control when to switch between them using await statements.</p><h4>Processes, Threads, Multithreading, and Multiprocessing</h4><p>Now, that we know and understand some of the basic concepts, we can start to dig a little deeper.</p><p>In Python, a <strong>process</strong> is an instance of a program that is being executed by the operating system. A process is equipped with its own separate memory space. Thus, each process has its own set of resources, such as CPU time and memory, and does not share those with other processes or the parent process that launched it. Additionally, a process can execute multiple threads to perform different tasks concurrently.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/357/1*amBs7sAnuEUE43qsnIuMmw.png" /><figcaption>Example of a process [Image by Author]</figcaption></figure><p><strong>Threads</strong> are a way to achieve multitasking by dividing a program into smaller, independent parts that can be executed concurrently. 
In simpler terms, threads can be thought of as “lightweight processes” that can run alongside other threads in the same process.</p><p>In contrast to processes, threads share the same memory space as the parent process, which allows accessing the same variables and data structures, making communication between threads easier. However, this also introduces another layer of complexity and problems such as “race conditions” that we need to be aware of. There will always be at least one thread in the parent process: the main thread.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/592/1*enAH-Q8QSHA4nxdLqRwXGA.png" /><figcaption>Example of threads in a process [Image by Author]</figcaption></figure><p>Threads are particularly useful for I/O-bound tasks. By running these tasks in separate threads, a Python program can continue executing other tasks while waiting for I/O operations to complete. This allows for a faster, more efficient, and more responsive application.</p><p>In Python, threads can be created using the <a href="https://docs.python.org/3/library/threading.html">threading module</a>, which provides a Thread class that can be used to create and manage new threads.</p><pre>import os<br>import threading<br><br># Get the parent process id<br>process_id = os.getpid()<br># Get the total number of active threads<br>num_active_threads = threading.active_count()<br># Get the current thread&#39;s name<br>curr_thread_name = threading.current_thread().name<br><br>print(f&quot;Python process running with id: 
{process_id} \<br>with {num_active_threads} active threads running. \<br>Current thread: {curr_thread_name}&quot;)<br><br># Output: <br># Python process running with id: 2300 with 1 active threads running. <br># Current thread: MainThread</pre><p>But how can we improve the application’s performance by utilizing the concepts of processes and threads?</p><p><strong>Multithreading </strong>is a way to achieve concurrency by running multiple threads simultaneously in the same process [5]. In other words, we can execute more than one task at the same time within a single program. As stated earlier, communication between threads is simple, since they share the same memory space. Thus, allowing access to the same variables and data structures.</p><p>The concept of multithreading is especially useful for I/O-bound tasks, such as network requests and file I/O, where waiting for external events is a bottleneck in program performance. By running these tasks in separate threads, we can continue executing other tasks concurrently, which can improve the performance and efficiency of the application.</p><p>However, it is important to note that multithreading has its limitations in Python, due to the Global Interpreter Lock (GIL), which ensures that only one thread can execute Python bytecode at a time.</p><pre>import threading<br><br>def print_numbers():<br>    for i in range(1, 11):<br>        print(i)<br><br>def print_letters():<br>    for letter in [&#39;a&#39;, &#39;b&#39;, &#39;c&#39;, &#39;d&#39;, &#39;e&#39;]:<br>        print(letter)<br><br># Create two threads<br>t1 = threading.Thread(target=print_numbers)<br>t2 = threading.Thread(target=print_letters)<br><br># Start the threads<br>t1.start()<br>t2.start()<br><br># Wait for the threads to finish<br>t1.join()<br>t2.join()<br><br># When we run this program, <br># we&#39;ll see output that interleaves the numbers and letters <br># printed by the two threads. 
This is because the two threads <br># are executing concurrently, interleaved by the interpreter</pre><p><strong>Multiprocessing</strong>, on the other hand, is a way to achieve parallelism by running multiple processes simultaneously on different CPU cores. In other words, it allows us to execute more than one task at the same time by distributing them across multiple processes. Each process has its own memory space and system resources.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/549/1*giv-mztffntPRqmXJdKIgA.png" /><figcaption>Example of multiprocessing [Image by Author]</figcaption></figure><p>Multiprocessing and parallelism are particularly useful for CPU-bound tasks, such as heavy computation or data processing. By running these tasks in separate processes, a Python program can execute multiple tasks simultaneously, utilizing all available CPU cores. This can result in a significant speedup in program execution and improve program efficiency.</p><p>In Python, multiprocessing can be achieved using the multiprocessing module, which provides a Process class that can be used to create and manage new processes. 
Unlike multithreading, multiprocessing does not have the same limitations imposed by the Global Interpreter Lock (GIL), as each process has its own interpreter and can execute Python bytecode independently.</p><pre>import os<br>import multiprocessing<br><br><br>def hello_from_process():<br>    print(f&quot;Hello from child process {os.getpid()}&quot;)<br><br>if __name__ == &#39;__main__&#39;:<br>    # Create new process<br>    child_process = multiprocessing.Process(target=hello_from_process)<br>    # Start new process<br>    child_process.start()<br><br>    print(f&quot;Hello from parent process {os.getpid()}&quot;)<br><br>    # Wait for child process to finish<br>    child_process.join()<br><br>    # Output:<br>    # Hello from parent process 3128<br>    # Hello from child process 2248</pre><p>However, it’s important to note that multiprocessing has some overhead in terms of memory usage and interprocess communication, so it’s not always the best approach for every situation.</p><h4>The Global Interpreter Lock</h4><p>In the previous section, we hinted at the fact that multithreading in Python has its limitations due to the Global Interpreter Lock. Now, let’s take a closer look at what the Global Interpreter Lock (GIL) actually is.</p><p><strong>The Global Interpreter Lock (GIL)</strong> is a mechanism in Python that ensures that only one thread can execute Python bytecode at a time, even in a multithreaded program [6]. The purpose of the GIL is to prevent multiple threads from accessing shared data simultaneously and causing data inconsistencies, which can result in hard-to-debug errors.</p><p>The GIL works by locking the interpreter, which prevents other threads from acquiring the lock and executing Python bytecode. This means while one thread is executing bytecode, all other threads are blocked and have to wait. 
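</p><p>In CPython, the running thread is periodically asked to release the GIL so that waiting threads get a turn; the sys module exposes this switch interval (a small sketch; the 5 ms default is CPython-specific):</p>

```python
import sys

# How often (in seconds) the running thread is asked to release the GIL
# so that another thread may run; CPython's default is 0.005 (5 ms)
interval = sys.getswitchinterval()
print(interval)

# The interval is tunable, e.g. to trade latency for less switching overhead
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())
```

<p>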
This can limit the performance of multithreaded programs, especially in situations where the program is CPU-bound and requires heavy computation.</p><p>So why does the GIL even exist in the first place?</p><p>While the GIL can be a limitation for some programs, it’s also a feature that simplifies Python’s <a href="https://docs.python.org/3/c-api/memory.html">memory management</a>. Python uses a reference-counting model for memory management, which is not thread-safe. The GIL ensures that only one thread can modify the reference counts of Python objects at any given time, preventing memory corruption and crashes.</p><p>Probably the most important thing to note is that the GIL only applies to threads that execute Python bytecode.</p><blockquote>I/O-bound tasks, such as network requests and file I/O, can release the GIL and allow other threads to execute Python bytecode while they wait for I/O operations to complete.</blockquote><p>This means that multithreading is still a useful tool for I/O-bound tasks, even in a GIL-constrained environment.</p><h4>Single-threaded Concurrency</h4><p><strong>Single-threaded concurrency</strong> can be achieved by running multiple tasks concurrently within a single thread, without creating additional threads or processes. This approach is especially useful for I/O-bound tasks since those release the GIL and allow other Python bytecode to be executed while waiting for I/O operations to be completed. Moreover, single-threaded concurrency can be more efficient because it avoids the overhead of creating multiple threads or processes.</p><p>In Python, single-threaded concurrency can be achieved using non-blocking sockets and the OS event notification system, such as <a href="https://en.wikipedia.org/wiki/Kqueue">kqueue</a>, <a href="https://en.wikipedia.org/wiki/Epoll">epoll</a>, or <a href="https://en.wikipedia.org/wiki/Input/output_completion_port">IOCP</a>. 
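</p><p>Python’s selectors module wraps these notification systems behind a single interface. Here is a small sketch using a non-blocking socket pair as a stand-in for a real network connection:</p>

```python
import selectors
import socket

sel = selectors.DefaultSelector()
reader, writer = socket.socketpair()
reader.setblocking(False)  # never block on recv()

# Ask the OS event notification system to watch for readability
sel.register(reader, selectors.EVENT_READ)

writer.send(b"ping")

# select() returns once the OS reports that data is ready to read
for key, _ in sel.select(timeout=1):
    data = key.fileobj.recv(1024)
    print(data)  # b'ping'

sel.close()
reader.close()
writer.close()
```

<p>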
When data is ready, the system sends a notification, and the coroutine can return the result. This approach allows a single thread to handle multiple I/O operations concurrently, without blocking and waiting for each operation to complete.</p><p>The asyncio module in Python is a good example of single-threaded concurrency in action. The module provides an event loop that can handle multiple coroutines concurrently, using non-blocking sockets and the event notification system to achieve concurrency.</p><h4>The Event-Loop</h4><p>In Python’s asyncio, an <strong>event loop</strong> is a central part of the asynchronous programming model, allowing coroutines to be scheduled and executed in a non-blocking manner.</p><p>The event loop is responsible for managing tasks and coroutines, and for determining which task or coroutine to run next. It’s essentially a loop that continuously waits for events to occur, and then dispatches tasks or coroutines to handle those events [7].</p><p>The following code example shows an implementation of the most basic event loop.</p><pre>import queue<br><br># Create a queue to store events<br>message_queue = queue.Queue()<br><br># Fill queue with dummy events<br>for i in range(5):<br>    message_queue.put(f&quot;event_{i}&quot;)<br><br>def process_message(message):<br>    print(f&quot;Processing message: {message}&quot;)<br><br><br># Run event-loop forever<br>while True:<br>    try:<br>        # Wait up to one second for the next event<br>        message = message_queue.get(timeout=1)<br>        process_message(message)<br>    except queue.Empty:<br>        # Continue the loop and wait for new events<br>        continue<br>    except KeyboardInterrupt:<br>        # Exiting the program with Ctrl+C<br>        break<br><br># Output:<br># Processing message: event_0<br># Processing message: event_1<br># Processing message: event_2<br># Processing message: event_3<br># Processing message: event_4</pre><p>When an asyncio program starts, it creates an event loop, 
which is used to schedule and execute coroutines. Each coroutine is a task that represents a unit of work to be done, such as making a network request or reading from a file. The event loop manages these tasks and decides which one to run next based on which task is ready to run, such as one that has data available to read or one that has completed an I/O operation.</p><p>The event loop is also responsible for handling exceptions and errors that may occur during the execution of a coroutine. If an exception occurs, the event loop can catch the exception and decide whether to continue running the coroutine or stop it and move on to the next task.</p><h4>Conclusion</h4><p>Understanding concurrency in Python is an essential skill for any developer who wants to create efficient and responsive programs.</p><p>The concept of asynchronous programming helps to achieve concurrency by allowing programs to handle multiple I/O-bound tasks concurrently and free up system resources, resulting in greater efficiency, better performance, and improved resource usage.</p><p>However, there are some drawbacks as well. Asynchronous code can be harder to read and write than synchronous code, especially for beginners. It also requires a different programming mindset. Additionally, some tasks, such as heavy computation or CPU-bound tasks, may not be well-suited for asynchronous programming.</p><p>While asynchronous programming can provide significant performance gains, it is unfortunately not a ‘silver bullet’ for everything.</p><p><em>If you enjoyed the read, make sure to hit ‘follow’ for more on Python concurrency and advanced techniques to take your programming skills to the next level.</em></p><p><em>Consider becoming a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. 
I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li>[1] <a href="https://www.statista.com/statistics/1338409/python-use-cases/">https://www.statista.com/statistics/1338409/python-use-cases/</a></li><li>[2] <a href="https://en.wikipedia.org/wiki/Concurrency_(computer_science)">https://en.wikipedia.org/wiki/Concurrency_(computer_science)</a></li><li>[3] <a href="https://en.wikipedia.org/wiki/Parallel_computing">https://en.wikipedia.org/wiki/Parallel_computing</a></li><li>[4] <a href="https://en.wikipedia.org/wiki/Computer_multitasking">https://en.wikipedia.org/wiki/Computer_multitasking</a></li><li>[5] <a href="https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)">https://en.wikipedia.org/wiki/Multithreading_(computer_architecture)</a></li><li>[6] <a href="https://wiki.python.org/moin/GlobalInterpreterLock">https://wiki.python.org/moin/GlobalInterpreterLock</a></li><li>[7] <a href="https://en.wikipedia.org/wiki/Event_loop">https://en.wikipedia.org/wiki/Event_loop</a></li><li>Fowler, Matthew. (2022). Python Concurrency with Asyncio. Manning Publications.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=c80b6f2aef3a" width="1" height="1" alt=""><hr><p><a href="https://medium.com/geekculture/unlock-the-power-of-python-a-beginners-guide-to-concurrency-c80b6f2aef3a">Unlock the Power of Python: A Beginner’s Guide to Concurrency</a> was originally published in <a href="https://medium.com/geekculture">Geek Culture</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Create a Serverless Authentication Service With AWS CDK, Cognito, and API Gateway]]></title>
            <link>https://medium.com/better-programming/create-a-serverless-authentication-service-with-aws-cdk-cognito-and-api-gateway-ffbd8da6a659?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/ffbd8da6a659</guid>
            <category><![CDATA[aws]]></category>
            <category><![CDATA[typescript]]></category>
            <category><![CDATA[aws-lambda]]></category>
            <category><![CDATA[web-development]]></category>
            <category><![CDATA[programming]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Wed, 12 Oct 2022 15:45:40 GMT</pubDate>
            <atom:updated>2022-10-20T06:45:19.480Z</atom:updated>
            <content:encoded><![CDATA[<h4>AWS Solutions</h4><h3>How to Create a Serverless Authentication Service With AWS CDK, Cognito, and API Gateway</h3><h4>A backend service using TypeScript, JWT, and HttpOnly cookies</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*GOfcH6KpweR3QrGi" /><figcaption>Photo by <a href="https://unsplash.com/@flyd2069?utm_source=medium&amp;utm_medium=referral">FLY:D</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>I used it. You did, too.</p><p>It’s ubiquitous. It’s everywhere. It’s essential.</p><p>You know what I’m talking about: authentication.</p><blockquote>Authentication describes the act of proving an assertion, such as your identity to a computer system [1].</blockquote><p>Or simply put — you tell the system who you are.</p><p>And since authentication is required in nearly every modern application we use, it might be a good idea to build an authentication service and solve this requirement once and for all.</p><p>In the following sections, you will create a serverless backend service using <a href="https://docs.aws.amazon.com/cognito/index.html">Amazon Cognito</a>, <a href="https://docs.aws.amazon.com/apigateway/">API Gateway</a>, and <a href="https://docs.aws.amazon.com/lambda/">AWS Lambda</a>.</p><p>By making use of the AWS Cloud Development Kit (CDK), you will be able to provide Infrastructure as Code (IaC) — making it very easy to spin up or shut down the backend service with just a simple command line statement.</p><p>However, before diving headfirst into the implementation details, let’s take a step back and briefly talk about the high-level design.</p><h3>High-Level Overview</h3><p>Prior to writing any code at all, it’s always useful to envision the complete picture first. 
Making sure we know exactly what it is we’re trying to achieve.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/831/1*7ixeijCGt4OX63ECgVs8IA.png" /><figcaption>AWS serverless authentication flow [Image by Author]</figcaption></figure><p>So, our overall goal is to create a serverless backend system that will handle authentication for us. But what does this actually mean?</p><p>Let’s quickly step through the flow above:</p><ol><li>The user either tries to create a new account or to sign in by providing some form of credentials (e.g., username and password)</li><li>The user receives a response. In case of a successful login, it will be an <a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies#restrict_access_to_cookies">HttpOnly Cookie</a> with a <a href="https://jwt.io/introduction">JSON Web Token</a> inside.</li><li>Equipped with the cookie, the user tries to access a protected resource via another API Gateway. A Lambda authorizer will parse the cookie that is included in the request header and verify the JWT. If the verification is successful, the authorizer returns a policy document to the user, making it possible to access the protected resource.</li></ol><p>And this is already it.</p><p>Now, let’s begin by firing up our favorite IDE and creating a new project.</p><blockquote><strong>Note</strong>: All of the used services are free-tier eligible so no additional costs should occur. However, it’s still advisable to check your AWS Account and shut down any unused services.</blockquote><h3>Implementing a Serverless Authentication Service</h3><p>First of all, make sure you have <a href="https://docs.aws.amazon.com/cdk/v2/guide/getting_started.html">AWS CDK installed and bootstrapped</a>. 
The following code can help:</p><pre># Install AWS CDK<br>npm install -g aws-cdk</pre><pre># Bootstrap AWS CDK<br>cdk bootstrap aws://ACCOUNT-NUMBER/REGION</pre><p>Now that we’re equipped with the right tools, we can start our project by simply creating a new folder and initializing the CDK via the command line interface.</p><pre># Create a new folder<br>mkdir aws-serverless-auth<br>cd aws-serverless-auth</pre><pre># Init CDK<br>cdk init --language typescript</pre><p>Once the installation process has finished, we can finally open our code editor and get to work.</p><h4>Creating an Amazon Cognito user pool</h4><p>First things first.</p><p>In order to build a proper authentication service, we have to create some form of a user database first. For this purpose, we make use of Amazon Cognito, which luckily provides us with all the desired features.</p><p>Inside the lib folder, create a new file called user-pool.ts.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/12041719f45990901560541692e5ab66/href">https://medium.com/media/12041719f45990901560541692e5ab66/href</a></iframe><p>In the code above, we export a class called CognitoUserPool.</p><p>Inside the class constructor, we basically create a new user pool and attach an application client to it. While instantiating a new user pool, we also make sure to pass the required configuration parameters as well.</p><p>Note that we expose two read-only class fields for further reference, namely the userPoolId and userPoolClientId.</p><h4>Building the auth API</h4><p>Now, let’s proceed to the nuts and bolts — the authentication API.</p><p>Once again, inside the lib folder create a new file called auth-api.ts.</p><p>Let’s start nice and easy by constructing a RestAPI first and by making sure our class constructor receives the correct properties.</p><p>Next, we add a new resource and attach several lambda functions to it. 
To make our lives much easier, we will use a private helper method called addRoute().</p><blockquote><strong>Note</strong>: We will create all the necessary Lambda functions in the next section.</blockquote><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2c57e565c3186c787e0a38fe1a1e67e0/href">https://medium.com/media/2c57e565c3186c787e0a38fe1a1e67e0/href</a></iframe><p>Each Lambda function corresponds with a separate route and a specific user action.</p><p>By making use of the helper method, we can not only reduce code duplication but also provide each function with the mandatory environment variables as well as the correct policies (e.g., allowing access to the Cognito user pool).</p><h4>Implementing the auth Lambda functions</h4><p>So far, so good.</p><p>We already created a Cognito user pool and a RestAPI, allowing us to expose our authentication logic to the outside world.</p><p>However, we have yet to implement such logic.</p><p>Inside the root project folder, we create a new directory called lambda. 
We open the new directory and create a subfolder with the name auth inside.</p><blockquote><strong>Note</strong>: In order to install the type definitions for AWS Lambda, enter <em>npm install @types/aws-lambda</em> in your terminal.</blockquote><h4>Signup</h4><p>Let’s begin with the signup function and create a new file, signup.ts.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/7ceea5c3b73e0413a50c1ac85635c5ad/href">https://medium.com/media/7ceea5c3b73e0413a50c1ac85635c5ad/href</a></iframe><p>Once we’ve made sure we received a proper event body (the user’s credentials), we simply call the signUp() method on the instance of the CognitoIdentityServiceProvider and return a response to the user.</p><p>If the signup was successful, the user will face a challenge to verify the provided email address and confirm the signup by entering a verification code.</p><h4>Confirm signup</h4><p>Well, you know the drill.</p><p>Create a new file called confirm-signup.ts.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5fdfc14b4df0d2aad25bb0c6d4cae7c7/href">https://medium.com/media/5fdfc14b4df0d2aad25bb0c6d4cae7c7/href</a></iframe><p>Nothing fancy here.</p><p>We simply receive a username and the confirmation code, which we pass to the confirmSignup() method of the CognitoIdentityServiceProvider. In the end, we return an appropriate response to the user.</p><h4>Sign in</h4><p>Now that we’re able to create a new user, we can start working on the sign-in function. Still inside the auth folder, create a new file with the name signin.ts.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8830222b71f9bfabc34b41532d4026c5/href">https://medium.com/media/8830222b71f9bfabc34b41532d4026c5/href</a></iframe><p>Inside the sign-in function, we collect the username and password in order to invoke the initiateAuth() method. 
If the given credentials are correct, we extract the IdToken from the AuthenticationResult and set a Secure and HttpOnly cookie inside the response header with the token as payload.</p><h4>Sign out</h4><p>The sign-out function is super basic.</p><p>We simply create a new file signout.ts and “delete” the cookie by setting its expiration date to the past.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/a9570c15b25469aa7b4c19eb1a5e2850/href">https://medium.com/media/a9570c15b25469aa7b4c19eb1a5e2850/href</a></iframe><h3>Create a Protected API and Lambda Authorizer</h3><p>We’re approaching the finishing line.</p><p>Now that we’ve implemented the authentication API and all the necessary Lambda functions, we can start working on the final missing pieces: the protected RestAPI and the Lambda authorizer.</p><h4>Build the protected API</h4><p>Let’s get started with the supposedly easy part.</p><p>Inside the folder lib, we create a new file called protected-api.ts.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/cbf85af18f196eaeb2ea1161bb8a57e4/href">https://medium.com/media/cbf85af18f196eaeb2ea1161bb8a57e4/href</a></iframe><p>In the code above, we define a simple RestAPI, two lambda functions, and their integrations.</p><p>The protectedFn returns just a message, allowing us to simulate some protected resources. 
We create a new file inside the lambda folder with the name of protected.ts.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/2592ec5246f80cc9de85a939817b1355/href">https://medium.com/media/2592ec5246f80cc9de85a939817b1355/href</a></iframe><p>By providing a RequestAuthorizer and setting our Lambda authorizer as the handler, we make sure the route is protected.</p><h4>Implementing the Lambda Authorizer</h4><p>The Lambda Authorizer does two things:</p><ol><li>It parses the cookie, provided in the request header, and verifies the JWT.</li><li>It returns a policy document, either denying or allowing access to the resource.</li></ol><p>Simple enough, right?</p><p>Let’s go ahead and create a new file authorizer.ts inside the lambda/auth folder.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/69f8d3f73e861c91c4fb5c8566ba08f8/href">https://medium.com/media/69f8d3f73e861c91c4fb5c8566ba08f8/href</a></iframe><p>Inside the authorizer, we make use of three helper functions: parseCookies(), verifyToken() and createPolicy().</p><p>Let’s cover those next.</p><p>But first, create a new file utils.ts inside the folder lambda which will house all of those three helper functions. Here’s what it looks like:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/b4d6d2c50a3bcc0ffb8d0535f8bee8ee/href">https://medium.com/media/b4d6d2c50a3bcc0ffb8d0535f8bee8ee/href</a></iframe><p>Our first helper function does what the name suggests — it parses the cookies inside the request header. 
It basically loops through the headers.Cookie object, creating a cookieMap of cookie names and values.</p><p>Once we retrieve the cookie with our token, we need to verify it.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/6de840f74c11f03f1a8ec24daa3a7473/href">https://medium.com/media/6de840f74c11f03f1a8ec24daa3a7473/href</a></iframe><p>The next helper function, verifyToken, relies on three external libraries, so make sure to npm install axios jsonwebtoken jwk-to-pem.</p><p>We retrieve the JSON web key for our Cognito user pool by requesting the provided URL. Next, we convert the key with the help of the external library jwk-to-pem. Once we have converted the key, we can verify the token and return the result.</p><p>Based on this result, we create a policy document either allowing or denying access to the protected resource. For this purpose, we create our last helper function, createPolicy().</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/5c170fe1b5fdcae83cac7785448f3421/href">https://medium.com/media/5c170fe1b5fdcae83cac7785448f3421/href</a></iframe><h3>Putting It All Together</h3><p>Phew — that was a lot of work.</p><p>Now, there is only one thing left to do. We have to put it all together in our final stack.</p><p>Inside the lib folder, open the file named aws-serverless-auth-stack.ts.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/8b5f391d0bf44db9fea5c4ecad1b3c50/href">https://medium.com/media/8b5f391d0bf44db9fea5c4ecad1b3c50/href</a></iframe><p>Here, we simply instantiate all of the other classes we created before: our Cognito user pool and both of the RestAPIs. Note that we pass userPoolId and userPoolClientId as properties to both APIs.</p><p>We assembled all the pieces. 
Good job.</p><p>Now, it’s time to deploy the stack by typing cdk deploy inside your terminal.</p><h3>Testing the Flow With Postman</h3><p>Once our stack has been completely deployed, we can finally test the overall communication flow by making use of <a href="https://www.postman.com/">Postman</a>.</p><p>However, before we can start testing, we need to obtain the URLs of both the authentication API and the protected API. Therefore, head over to your AWS console, navigate to API Gateway, select each API, select stages, and copy the URL.</p><p>Let’s get moving by creating a new user and signing up.</p><p>Inside Postman, we create a new POST request with the URL of the authentication API we copied earlier. Our JSON request body simply contains a username, an email, and a password.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/615/1*aDdEyamlgBndImFmM1qwFg.png" /><figcaption>Authentication testing signup [Screenshot by Author]</figcaption></figure><blockquote><strong>Note</strong>: Please make sure to enter a valid email address since we will receive a confirmation code in order to confirm our signup request.</blockquote><p>Next, we have to confirm our signup request by entering a verification code we should have received by mail. 
Create a new POST request inside Postman with a username and code in the body.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/646/1*anEGWlVh6xdAx9kv8rU26A.png" /><figcaption>Authentication testing confirmation [Screenshot by Author]</figcaption></figure><p>The response should state that we have successfully confirmed the user signup.</p><p>Now, we can test the sign-in function.</p><p>We create yet another POST request, providing a username and password inside the body.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/615/1*SUubmklgHe_lk_N40WeSTQ.png" /><figcaption>Authentication testing sign-in [Screenshot by Author]</figcaption></figure><p>Within the response, we should have received a cookie with a JSON web token inside. We can verify this by inspecting our cookies inside Postman.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*_3tVwbomws5RuD-lBZbNGw.png" /><figcaption>HttpOnly cookie set by AWS Lambda [Screenshot by Author]</figcaption></figure><p>We’re successfully logged in. Great.</p><p>Now, let’s try to access our protected resources.</p><p>Create a GET request inside Postman and hit the protected route. Also, make sure to <a href="https://learning.postman.com/docs/sending-requests/cookies/#using-the-cookie-manager">include a cookie</a> token=&lt;replace-with-jwt&gt; within your request.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/594/1*ActFhyJY_modjPYDi8gqAw.png" /><figcaption>Accessing the protected route [Screenshot by Author]</figcaption></figure><p>As we can tell from the response body — it worked. 
We received a super secret.</p><p>And this is finally it.</p><p>We finished our serverless authentication service.</p><blockquote><strong>Note</strong>: After we’re done testing, we can tear down the infrastructure by typing <em>cdk destroy</em> inside our terminal.</blockquote><h3>Conclusion</h3><p>In this article, we created a serverless authentication service by utilizing Amazon Cognito and API Gateway. We also made use of the CDK, creating our Infrastructure as Code, which allowed us to easily spin up and tear down the complete stack.</p><p>Since authentication is used in almost every application, having a backend service right at our fingertips might prove useful in the future.</p><p>However, there is still room left for some improvements. We only have the capability to sign up with a username and password. This could be further enhanced by integrating federated identity providers such as Google, LinkedIn, etc.</p><p>Thanks for reading.</p><p>You can find the full code on my <a href="https://github.com/marvinlanhenke/aws-serverless-auth">GitHub</a>.</p><h3>References</h3><ul><li>[1] <a href="https://en.wikipedia.org/wiki/Authentication">https://en.wikipedia.org/wiki/Authentication</a></li><li><a href="https://dev.to/gkoniaris/how-to-securely-store-jwt-tokens-51cf">https://dev.to/gkoniaris/how-to-securely-store-jwt-tokens-51cf</a></li><li><a href="https://stackoverflow.com/questions/27067251/where-to-store-jwt-in-browser-how-to-protect-against-csrf">https://stackoverflow.com/questions/27067251/where-to-store-jwt-in-browser-how-to-protect-against-csrf</a></li><li><a href="https://docs.aws.amazon.com/cdk/v2/guide/home.html">https://docs.aws.amazon.com/cdk/v2/guide/home.html</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ffbd8da6a659" width="1" height="1" alt=""><hr><p><a 
href="https://medium.com/better-programming/create-a-serverless-authentication-service-with-aws-cdk-cognito-and-api-gateway-ffbd8da6a659">Create a Serverless Authentication Service With AWS CDK, Cognito, and API Gateway</a> was originally published in <a href="https://betterprogramming.pub">Better Programming</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[How to Design Better DAGs in Apache Airflow]]></title>
            <link>https://medium.com/data-science/how-to-design-better-dags-in-apache-airflow-494f5cb0c9ab?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/494f5cb0c9ab</guid>
            <category><![CDATA[python]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[data-engineering]]></category>
            <category><![CDATA[apache-airflow]]></category>
            <category><![CDATA[airflow]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Thu, 09 Jun 2022 15:15:09 GMT</pubDate>
            <atom:updated>2022-06-09T15:15:09.513Z</atom:updated>
            <content:encoded><![CDATA[<h4>Data Engineering</h4><h4>The two most important properties you need to know when designing a workflow</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*Qv9Imt-nkaFpb_Cf" /><figcaption>Photo by <a href="https://unsplash.com/@campaign_creators?utm_source=medium&amp;utm_medium=referral">Campaign Creators</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p><a href="https://towardsdatascience.com/setting-up-apache-airflow-with-docker-compose-in-5-minutes-56a1110f4122"><strong>Last week</strong></a>, we learned how to quickly spin up a development environment for Apache Airflow.</p><p>This is awesome!</p><p>However, we have yet to learn <strong>how to design an efficient workflow</strong>. Simply having a great tool at our fingertips won’t cut the deal alone — unfortunately.</p><p>Although Apache Airflow does a pretty good job at doing most of the heavy lifting for us, we still need to <strong>ensure certain key properties</strong> for each Airflow task, in order to obtain proper and consistent results.</p><p>Luckily, a lot of best practices exist.</p><p>Today, we begin with two of the most important concepts that apply universally to all workflows.</p><p>Today, we learn about <strong>atomicity </strong>and <strong>idempotency</strong>.</p><h3>All or nothing: Atomicity</h3><p>Often used in the context of database systems, atomicity is one of the <a href="https://en.wikipedia.org/wiki/ACID">ACID </a>properties and is considered an indivisible, irreducible series of operations such that <strong>either all occur or nothing at all</strong>¹. 
It is either performed entirely or not performed at all².</p><p>In terms of Apache Airflow, that means a task should be defined in a way that <strong>allows for success </strong>with a proper result <strong>or complete failure</strong> without affecting the state of the system.</p><p>Let’s imagine that we have to extract data from a CSV file, apply some transformations to it, and write the result to a database.</p><p>Simple enough, right?</p><p>A bad, <strong>non-atomic approach</strong> would be the following.</p><p>We extract the data line-by-line, apply the transformation right away, and upload the result immediately to the database. <strong>All within the same task</strong>.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/727/1*hddzFzwBnQqtxJRkwAxGRg.png" /><figcaption>A non-atomic approach [Image by Author]</figcaption></figure><p>Now, if some lines are corrupt and the task fails halfway through, we’re left with only a fragment of the desired results. Some lines are processed and already inserted — some simply non-existent. Debugging and rerunning this task while avoiding duplication would be a nightmare.</p><p>An improved, <strong>atomic workflow</strong> might be defined like this.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/704/1*bh8Sc3Rf5fmdnmWiQpq5ZA.png" /><figcaption>A better approach with atomic tasks [Image by Author]</figcaption></figure><p>So a general rule of thumb to keep in mind is to <strong>split up the operations</strong> into different tasks. One operation equals a single task — think <a href="https://en.wikipedia.org/wiki/Single-responsibility_principle">Single-responsibility principle</a>.</p><blockquote>Unfortunately, this simple rule cannot be applied every time.</blockquote><p>Some operations are so <strong>tightly coupled</strong> that it’s best to <strong>keep them in a single</strong> <strong>coherent unit of work</strong>. 
For example, authenticating to an API before executing the request.</p><p>Luckily for us, most <strong>Airflow operators are designed in an atomic fashion</strong> and can be used straight off the shelf. With the more flexible types of operators like the Bash or Python operator, however, we have to be more cautious and mindful when designing our workflow.</p><p>Creating atomic Airflow tasks allows for the <strong>ability to recover</strong> from failure and rerun only the failed and downstream tasks. Atomicity provides more easily maintainable and <strong>transparent workflows</strong> without hidden dependencies and side effects.</p><h3>Start, Stop, Rewind: Idempotency</h3><p>The concept of idempotency goes hand-in-hand with the idea of atomicity and describes a property of certain operations in mathematics and computer science. Such operations can be <strong>applied multiple times without changing the result</strong> beyond the initial application³.</p><p>Think of pressing the “on-button” on a control panel as an operation. Pressing this button multiple times has the same effect as just pressing it once.</p><blockquote>So what does this mean in the context of Apache Airflow?</blockquote><p>Calling the same task multiple times with the same input has no additional effect. In other words, if <strong>rerunning a task without changing the input</strong> <strong>yields the</strong> <strong>same output</strong>, it can be considered idempotent.</p><p>Idempotency allows for <strong>decreased recovery time</strong> from failure and <strong>reduces data loss</strong>.</p><p>Now, let’s imagine our job is to fetch data from a database for a specific day and write the results to a CSV file. 
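</p><p>Stripped of any Airflow specifics, the idempotent version boils down to opening the file in write mode, so a rerun replaces the previous attempt instead of adding to it. Here is a toy sketch, in which the database fetch is faked with a static list and the helper name export_day is made up for illustration:</p>

```python
import csv
import tempfile
from pathlib import Path


def export_day(day, rows, out_dir):
    """Write one day's rows to a CSV file; mode 'w' means reruns overwrite."""
    out_file = Path(out_dir) / f"{day}.csv"
    with out_file.open("w", newline="") as f:
        csv.writer(f).writerows(rows)
    return out_file


# Faked "database" result for a specific day.
rows = [["id", "amount"], ["1", "9.99"], ["2", "4.50"]]

with tempfile.TemporaryDirectory() as tmp:
    target = export_day("2022-06-09", rows, tmp)
    first = target.read_text()

    # Rerunning with the same input ...
    export_day("2022-06-09", rows, tmp)

    # ... yields exactly the same file: the task is idempotent.
    assert target.read_text() == first
```

<p>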
Rerunning this task for the same day should overwrite the existing file and <strong>produce the same output</strong> every time it is executed.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/763/1*X0Oy0735iOArATNYUo_lAw.png" /><figcaption>An idempotent task producing the same output every time [Image by Author]</figcaption></figure><p>Suppose we design our task differently, so that with each rerun we simply append the records to an existing file.</p><p>Now, we <strong>violate the concept of</strong> <strong>idempotency</strong>. Every rerun of the task produces a different result with duplicate records.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/658/1*l9FS-rUpwxDjDULjb8XA9w.png" /><figcaption>Non-idempotent task producing duplicate results [Image by Author]</figcaption></figure><p>In general, tasks that write should <strong>check for existing records</strong>, <strong>overwrite</strong>, or use <strong>UPSERT</strong> operations to conform to the rules of idempotency.</p><p>For more general applications, however, we have to think carefully about all possible side effects.</p><h3>Conclusion</h3><p>In this article, we covered two of the most important principles when designing DAGs in Apache Airflow: <strong>atomicity </strong>and <strong>idempotency</strong>.</p><p>Committing those concepts to memory enables us to <strong>create better workflows</strong> that are recoverable, rerunnable, fault-tolerant, consistent, maintainable, transparent, and easier to understand.</p><p>However, there are a lot more <a href="https://www.astronomer.io/guides/dag-best-practices/">best practices</a> to adhere to and consider when coding and creating the next workflow.</p><p>But this is a topic for another day …</p><p><a href="https://towardsdatascience.com/setting-up-apache-airflow-with-docker-compose-in-5-minutes-56a1110f4122">Setting Up Apache Airflow with Docker-Compose in 5 Minutes</a></p><p><em>Enjoyed the article? 
Become a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li>[1] <a href="https://en.wikipedia.org/wiki/Atomicity_(database_systems)">https://en.wikipedia.org/wiki/Atomicity_(database_systems)</a></li><li>[2] <a href="https://www.webopedia.com/definitions/atomic-operation/">https://www.webopedia.com/definitions/atomic-operation/</a></li><li>[3] <a href="https://en.wikipedia.org/wiki/Idempotence">https://en.wikipedia.org/wiki/Idempotence</a></li><li><a href="https://en.wikipedia.org/wiki/ACID">https://en.wikipedia.org/wiki/ACID</a></li><li><a href="https://en.wikipedia.org/wiki/Single-responsibility_principle">https://en.wikipedia.org/wiki/Single-responsibility_principle</a></li><li><a href="https://www.astronomer.io/guides/dag-best-practices/">https://www.astronomer.io/guides/dag-best-practices/</a></li><li>Bas Harenslak, Julian de Ruiter. Data Pipelines with Apache Airflow. New York: Manning, 2021.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=494f5cb0c9ab" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/how-to-design-better-dags-in-apache-airflow-494f5cb0c9ab">How to Design Better DAGs in Apache Airflow</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[Setting Up Apache Airflow with Docker-Compose in 5 Minutes]]></title>
            <link>https://medium.com/data-science/setting-up-apache-airflow-with-docker-compose-in-5-minutes-56a1110f4122?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/56a1110f4122</guid>
            <category><![CDATA[docker]]></category>
            <category><![CDATA[airflow]]></category>
            <category><![CDATA[python]]></category>
            <category><![CDATA[docker-compose]]></category>
            <category><![CDATA[data-engineering]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Tue, 31 May 2022 20:11:31 GMT</pubDate>
            <atom:updated>2022-06-21T16:20:49.866Z</atom:updated>
            <content:encoded><![CDATA[<h4>Data Engineering</h4><h4>Create a development environment and start building DAGs</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/0*HEOlsCSDjFWmGbPZ" /><figcaption>Photo by <a href="https://unsplash.com/@fabiolog?utm_source=medium&amp;utm_medium=referral">Fabio Ballasina</a> on <a href="https://unsplash.com?utm_source=medium&amp;utm_medium=referral">Unsplash</a></figcaption></figure><p>Although I’m pretty late to the party (Airflow became an Apache Top-Level Project in 2019), I still had trouble finding an easy-to-understand, up-to-date, and lightweight solution to installing Airflow.</p><p>Today, we’re about to change all that.</p><p>In the following sections, we will create a lightweight, standalone, and easily deployed Apache Airflow development environment in just a few minutes.</p><p>Docker-Compose will be our close companion, allowing us to create a smooth development workflow with quick iteration cycles. Simply spin up a few docker containers and we can start to create our own workflows.</p><blockquote><strong>Note</strong>: The following setup will not be suitable for any production purposes and is intended to be used in a development environment only.</blockquote><h3>Why Airflow?</h3><p>Apache Airflow is a <strong>batch-oriented framework</strong> that allows us to easily build scheduled data pipelines in Python. Think of “workflow as code” capable of executing any operation we can implement in Python.</p><p>Airflow is not a data processing tool itself. It’s <strong>orchestration software</strong>. We can imagine Airflow as a spider in a web: sitting in the middle, pulling all the strings, and coordinating the workload of our data pipelines.</p><p>A data pipeline typically consists of several tasks or actions that need to be executed in a specific order. Apache Airflow models such a <strong>pipeline as a DAG</strong> (directed acyclic graph).
A graph with directed edges connecting tasks, without any loops or cycles.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/669/1*XyVoHEKGgflVhFDuJaRaTQ.png" /><figcaption>A simple example DAG [Image by Author]</figcaption></figure><p>This approach allows us to run independent tasks in parallel, saving time and money. Moreover, we can split a data pipeline into several smaller tasks. If a task fails, we can rerun only the failed and downstream tasks, instead of executing the complete workflow all over again.</p><p><strong>Airflow is composed of three main components</strong>:</p><ol><li>Airflow <strong>Scheduler</strong> — the “heart” of Airflow, which parses the DAGs, checks the scheduled intervals, and passes the tasks over to the workers.</li><li>Airflow <strong>Worker</strong> — picks up the tasks and actually performs the work.</li><li>Airflow <strong>Webserver</strong> — provides the main user interface to visualize and monitor the DAGs and their results.</li></ol><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*z3MNHDV9eTTLGikvucGEKw.png" /><figcaption>A high-level overview of Airflow components [Image by Author]</figcaption></figure><h3>Step-By-Step Installation</h3><p>Now that we’ve briefly introduced Apache Airflow, it’s time to get started.</p><h4>Step 0: Prerequisites</h4><p>Since we will use docker-compose to get Airflow up and running, we have to install Docker first.
Simply head over to the <a href="https://docs.docker.com/get-docker/">official Docker site</a> and download the appropriate installation file for your OS.</p><h4>Step 1: Create a new folder</h4><p>We start nice and slow by simply creating a new folder for Airflow.</p><p>Just navigate via your preferred terminal to a directory, create a new folder, and change into it by running:</p><pre>mkdir airflow<br>cd airflow</pre><h4>Step 2: Create a docker-compose file</h4><p>Next, we need to get our hands on a docker-compose file that specifies the required services or docker containers.</p><p>Via the terminal, we can run the following command inside the newly created Airflow folder:</p><pre>curl <a href="https://raw.githubusercontent.com/marvinlanhenke/Airflow/main/01GettingStarted/docker-compose.yml">https://raw.githubusercontent.com/marvinlanhenke/Airflow/main/01GettingStarted/docker-compose.yml</a> -o docker-compose.yml</pre><p>or simply create a new file named docker-compose.yml and copy the below content.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/73eba75161167381a12e5e3edc491bfd/href">https://medium.com/media/73eba75161167381a12e5e3edc491bfd/href</a></iframe><p>The above docker-compose file simply specifies the required services we need to get Airflow up and running. Most importantly: the scheduler, the webserver, the metadata database (PostgreSQL), and the airflow-init job that initializes the database.</p><p>At the top of the file, we make use of some local variables that are commonly used in every docker container or service.</p><h4>Step 3: Environment variables</h4><p>We successfully created a docker-compose file with the mandatory services inside.
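In case the embedded gist does not render in your feed reader, the file looks roughly like this (an abridged sketch, not the exact file from the linked repository; the image tag and service names are illustrative):

```yaml
# Abridged sketch of a LocalExecutor docker-compose.yml; for the real
# file, use the curl command above.
x-airflow-common: &airflow-common
  image: apache/airflow:2.3.0          # illustrative version tag
  environment:
    AIRFLOW__CORE__EXECUTOR: LocalExecutor
    AIRFLOW__CORE__SQL_ALCHEMY_CONN: postgresql+psycopg2://airflow:airflow@postgres/airflow
  volumes:
    - ./dags:/opt/airflow/dags
  depends_on:
    - postgres

services:
  postgres:              # the metadata database
    image: postgres:13
    environment:
      POSTGRES_USER: airflow
      POSTGRES_PASSWORD: airflow
      POSTGRES_DB: airflow

  airflow-init:          # one-off job that initializes the database
    <<: *airflow-common
    command: db init

  airflow-scheduler:
    <<: *airflow-common
    command: scheduler

  airflow-webserver:
    <<: *airflow-common
    command: webserver
    ports:
      - "8080:8080"
```

The shared `x-airflow-common` anchor is the "local variables" trick mentioned above: every Airflow service reuses the same image, environment, and volume definitions.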
However, to complete the installation process and configure Airflow properly, we need to provide some environment variables.</p><p>Still inside your Airflow folder, create a .env file with the following content:</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/28012d93c64d8386d63db70b46f005ff/href">https://medium.com/media/28012d93c64d8386d63db70b46f005ff/href</a></iframe><p>The above variables set the database credentials, the airflow user, and some further configurations.</p><p>Most importantly, the kind of executor Airflow will utilize. In our case, we make use of the LocalExecutor.</p><blockquote><strong>Note</strong>: More information on the different kinds of executors can be found <a href="https://airflow.apache.org/docs/apache-airflow/stable/executor/index.html">here</a>.</blockquote><h4>Step 4: Run docker-compose</h4><p>And this is already it!</p><p>Just head over to the terminal and spin up all the necessary containers by running</p><pre>docker compose up -d</pre><p>After a short period of time, we can check the results and the Airflow Web UI by visiting http://localhost:8080.
Once we sign in with our credentials (airflow: airflow) we gain access to the user interface.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/1024/1*x8wEdNe3zHIK1iABY_dNwA.png" /><figcaption>Airflow 2 Web UI [Screenshot by Author]</figcaption></figure><h3>A Quick Test</h3><p>With a working Airflow environment, we can now create a simple DAG for testing purposes.</p><p>First of all, make sure to run pip install apache-airflow to install the required Python modules.</p><p>Now, inside your Airflow folder, navigate to dags and create a new file called sample_dag.py.</p><iframe src="" width="0" height="0" frameborder="0" scrolling="no"><a href="https://medium.com/media/78a0718e8a9a2ed0d119c47b123f4bf3/href">https://medium.com/media/78a0718e8a9a2ed0d119c47b123f4bf3/href</a></iframe><p>We define a new DAG and some pretty simple tasks.</p><p>The EmptyOperator serves no real purpose other than to create a mockup task inside the Web UI. By utilizing the BashOperator, we create a somewhat creative output of “HelloWorld!”. This allows us to visually confirm a proper running Airflow setup.</p><p>Save the file and head over to the Web UI. We can now start the DAG by manually triggering it.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/192/1*5fWDEmN-iJyf2k2lNcRIMA.png" /><figcaption>Manually triggering a DAG [Screenshot by Author]</figcaption></figure><blockquote><strong>Note</strong>: It may take a while before your DAG appears in the UI. 
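Since the embedded gist may not render in every feed reader, the sample_dag.py described above can be sketched roughly as follows (a reconstruction, not necessarily the author's exact file; the dag_id, task ids, start date, and scheduling arguments are illustrative assumptions):

```python
# Hedged reconstruction of the sample DAG: an EmptyOperator mock-up task
# followed by a BashOperator echoing "HelloWorld!".
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.empty import EmptyOperator

with DAG(
    dag_id="sample_dag",
    start_date=datetime(2022, 1, 1),
    schedule_interval=None,  # run only when triggered manually
    catchup=False,
) as dag:
    # Mock-up task that does nothing except show up in the Web UI.
    start = EmptyOperator(task_id="start")

    # The BashOperator pushes its stdout to XCom by default,
    # which is what we inspect after the run.
    say_hello = BashOperator(
        task_id="say_hello",
        bash_command="echo 'HelloWorld!'",
    )

    start >> say_hello
```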
We can speed things up by running the following command in our terminal docker exec -it --user airflow airflow-scheduler bash -c &quot;airflow dags list&quot;</blockquote><p>Running the DAG shouldn’t take any longer than a couple of seconds.</p><p>Once finished, we can navigate to XComs and inspect the output.</p><figure><img alt="" src="https://cdn-images-1.medium.com/max/246/1*jPlxUBYRy1ZItVTzsihbcQ.png" /><figcaption>Navigating to Airflow XComs [Screenshot by Author]</figcaption></figure><figure><img alt="" src="https://cdn-images-1.medium.com/max/903/1*qDe8tJGk6izASaK0Csbpeg.png" /><figcaption>Inspecting the output [Screenshot by Author]</figcaption></figure><p>And this is it!</p><p>We successfully installed Airflow with docker-compose and gave it a quick test ride.</p><blockquote><strong>Note</strong>: We can stop the running containers by simply executing docker compose down.</blockquote><p><a href="https://towardsdatascience.com/how-to-design-better-dags-in-apache-airflow-494f5cb0c9ab">How to Design Better DAGs in Apache Airflow</a></p><h3>Conclusion</h3><p>Airflow is a batch-oriented framework that allows us to create complex data pipelines in Python.</p><p>In this article, we created a simple and easy-to-use environment to quickly iterate and develop new workflows in Apache Airflow. By leveraging docker-compose we can get straight to work and code new workflows.</p><p>However, such an environment should only be used for development purposes and is not suitable for any production environment that requires a more sophisticated and distributed setup of Apache Airflow.</p><p>You can find the full code here on my <a href="https://github.com/marvinlanhenke/Airflow/tree/main/01GettingStarted">GitHub</a>.</p><p><em>Enjoyed the article? Become a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. 
I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li><a href="https://airflow.apache.org/docs/apache-airflow/stable/start/docker.html">Airflow Documentation</a></li><li>Bas Harenslak, Julian de Ruiter. Data Pipelines with Apache Airflow. New York: Manning, 2021.</li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=56a1110f4122" width="1" height="1" alt=""><hr><p><a href="https://medium.com/data-science/setting-up-apache-airflow-with-docker-compose-in-5-minutes-56a1110f4122">Setting Up Apache Airflow with Docker-Compose in 5 Minutes</a> was originally published in <a href="https://medium.com/data-science">TDS Archive</a> on Medium, where people are continuing the conversation by highlighting and responding to this story.</p>]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[NLP-Day 30: A Bag-Of-Resources For Your NLP Learning Adventure]]></title>
            <link>https://medium.com/@marvinlanhenke/nlp-day-30-a-bag-of-resources-for-your-nlp-learning-adventure-ebd38f5cb01a?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/ebd38f5cb01a</guid>
            <category><![CDATA[resources]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[ml-so-good]]></category>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[deep-learning]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Fri, 06 May 2022 16:58:30 GMT</pubDate>
            <atom:updated>2022-05-06T18:09:53.391Z</atom:updated>
            <content:encoded><![CDATA[<h4>#30DaysOfNLP</h4><h4>Gathering resources to continue learning and improving</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/491/1*qMx8KrP6WWr--LtJXG5x0A.png" /><figcaption>Wrapping up the series #30DaysOfNLP [Image by Author]</figcaption></figure><p><a href="https://medium.com/mlearning-ai/nlp-day-29-how-to-manipulate-and-preprocess-string-with-regular-expressions-846fd5dac7e2"><strong>In the last episode</strong></a>, we took a small detour and learned the basics of regular expressions, allowing us to match and modify strings in a way that suits our Natural Language Processing needs.</p><p>Looking back, we have covered a variety of topics in this series. Things like Bag-Of-Words, TF-IDF vectors, convolutional neural networks, and Transformers should all sound familiar by now.</p><p>However, there is still much left to learn in the vast field of Natural Language Processing.</p><p>In the following sections, we’re going to wrap up this series. We will create a non-comprehensive list of resources that hopefully enables us to continue our joyful learning adventure in the world of Natural Language Processing.</p><p>So for the last time, take a seat, don’t go anywhere, and make sure to follow <strong>#30DaysOfNLP: </strong>A Bag-Of-Resources For Your NLP Learning Adventure</p><h3>Online Courses</h3><h4>NLP with Deep Learning (Winter 2017) — Stanford</h4><p>Starting off with a free lecture series provided by Stanford University.</p><p>This series provides an introduction to cutting-edge research in deep learning applied to the field of Natural Language Processing. 
It contains 18 one-hour lectures, covering topics like word vectors, recurrent neural networks, attention mechanisms, and transformer-based architectures.</p><p>[<a href="https://www.youtube.com/playlist?list=PL3FW7Lu3i5Jsnh1rnUwq_TcylNr7EkRe6">Link</a>]</p><h4>NLP Specialisation Course — Coursera</h4><p>Provided by DeepLearning.AI, this specialization course prepares you to design various NLP applications. Applications like question-answering, sentiment analysis, tools that translate languages and summarize text, and even chatbots.</p><p>However, keep in mind this course is not free once the 7-day trial has passed.</p><p>[<a href="https://www.coursera.org/specializations/natural-language-processing">Link</a>]</p><h4>Introduction to Natural Language Processing in Python — Datacamp</h4><p>Designed to teach the NLP basics, this course covers topics like identifying and separating words, extracting topics, or how to build a fake news classifier. It also highlights the use of basic libraries such as NLTK as well as deep learning frameworks.</p><p>This course is also not free.</p><p>[<a href="https://www.datacamp.com/courses/introduction-to-natural-language-processing-in-python">Link</a>]</p><h4>Keras.io (Examples)</h4><p>Although not a real online course, Keras.io provides a plethora of examples we can learn from.</p><p>I’d highly recommend going through each example and coding along. One of the best ways to build a skill is still by performing it and working on projects.</p><p>[<a href="https://keras.io/examples/nlp/">Link</a>]</p><h3>Books</h3><h4>Deep Learning — Ian Goodfellow</h4><p>One of the best books for machine and deep learning in general. Especially the first few chapters get us up to speed, covering and revising all the necessary prerequisites in terms of linear algebra and statistics.</p><p>It also covers all the different neural network architectures in great detail.
And the best thing: it’s completely free.</p><p>So get yourself ready for a deep and intense learning experience since this book comes with a high degree of information density.</p><p>[<a href="https://www.deeplearningbook.org/">Link</a>]</p><h4>Deep Learning with Python — F.Chollet</h4><p>Written by Keras creator and Google AI researcher François Chollet, this book introduces the field of deep learning with Python and the Keras library.</p><p>It covers the general principles of deep learning as well as the most common neural network architectures. All done through intuitive explanations and practical examples.</p><p>[<a href="https://www.amazon.de/Deep-Learning-Python-Francois-Chollet/dp/1617294438/ref=sr_1_1?crid=2RMH928R1E1PR&amp;keywords=deep+learning+with+python&amp;qid=1648098383&amp;sprefix=deep+le%2Caps%2C79&amp;sr=8-1">Link</a>]</p><h4>Natural Language Processing in Action: Understanding, analyzing, and generating text with Python</h4><p>Providing a comprehensive overview of the complete field of Natural Language Processing, this book comes with a lot of practical examples and easy-to-understand language.</p><p>Since this book covers a wide range of topics including tokenization, bag-of-words, word vectors, and deep learning, it’s the perfect starting point in the world of NLP.</p><p>[<a href="https://www.amazon.de/Natural-Language-Processing-Action-Understanding/dp/1617294632/ref=sr_1_1?crid=M9ZMCFW5RSOK&amp;keywords=natural+language+processing+in+action&amp;qid=1648098248&amp;sprefix=natural+lang%2Caps%2C118&amp;sr=8-1">Link</a>]</p><h4>Natural Language Processing Projects: Build Next-Generation NLP Applications Using AI Techniques</h4><p>The best way to learn is to build stuff.</p><p>This book starts with a general overview of NLP and artificial intelligence, before diving straight into several Natural Language Processing end-to-end projects.
Applications like sentiment analysis, topic extraction, resume parsing, building a chatbot, or even generating novel text.</p><p>[<a href="https://www.amazon.de/gp/product/1484273850/ref=ppx_yo_dt_b_asin_image_o01_s00?ie=UTF8&amp;psc=1">Link</a>]</p><h3>Papers</h3><h4>Attention Is All You Need</h4><p>Nothing left to say. A must-read covering the Transformer architecture and self-attention.</p><p>[<a href="https://arxiv.org/abs/1706.03762">Link</a>]</p><h4>BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding</h4><p>Explaining the inner workings of BERT and the benefits of pre-training, this paper is also a must-read.</p><p>[<a href="https://arxiv.org/abs/1810.04805">Link</a>]</p><h4>A Neural Conversational Model</h4><p>In this paper, a simple approach for conversational modeling is presented. The model makes use of a sequence-to-sequence framework and converses by predicting the next sentence given the previous sentence or sentences in a conversation.</p><p>[<a href="https://arxiv.org/abs/1506.05869">Link</a>]</p><h4>Improving Language Understanding by Generative Pre-Training</h4><p>Published by OpenAI, this paper addresses the scarcity of labeled data and how to overcome this challenge. By making use of generative pre-training and discriminative fine-tuning on each specific task, the authors demonstrate large gains and improvements.</p><p>[<a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">Link</a>]</p><h3>Conclusion</h3><p>In this article, we wrapped up the series by providing a list of resources, allowing us to continue our learning adventure in the land of Natural Language Processing.</p><blockquote>There is not much more to say but thank you.</blockquote><p>Thank you for reading. Thank you for following this series and thank you for your support.
It’s been an incredibly intensive, interesting, fun, but also exhausting <strong>#30DaysOfNLP</strong>.</p><p>Feel free to link and share the complete series.</p><p><a href="https://medium.com/@marvinlanhenke/list/3974a0c731d6">#30DaysOfNLP</a></p><p><em>Enjoyed the article? Become a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><ul><li><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></li><li><a href="https://medium.com/mlearning-ai/mlearning-ai-submission-suggestions-b51e2b130bfb">Mlearning.ai Submission Suggestions</a></li></ul><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=ebd38f5cb01a" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[NLP-Day 29: How To Manipulate And Preprocess Strings With Regular Expressions]]></title>
            <link>https://medium.com/@marvinlanhenke/nlp-day-29-how-to-manipulate-and-preprocess-string-with-regular-expressions-846fd5dac7e2?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/846fd5dac7e2</guid>
            <category><![CDATA[nlp]]></category>
            <category><![CDATA[text-preprocessing]]></category>
            <category><![CDATA[ml-so-good]]></category>
            <category><![CDATA[regex]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Thu, 05 May 2022 18:14:20 GMT</pubDate>
            <atom:updated>2022-05-05T18:21:44.304Z</atom:updated>
            <content:encoded><![CDATA[<h4>#30DaysOfNLP</h4><h4>Just express yourself with regular expressions</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/491/1*xuW2bol3CmI5YIyrAKXb9A.png" /><figcaption>Express yourself with regular expressions #30DaysOfNLP [Image by Author]</figcaption></figure><p><a href="https://medium.com/mlearning-ai/nlp-day-28-how-to-approach-and-choose-a-deep-learning-architecture-4ef8fff42f45"><strong>In the last episode</strong></a>, we reviewed the key architectures in the field of deep learning and highlighted the importance of a general workflow. We also stated that most of the challenges lie not in the designing or modeling aspect but in the preparation and preprocessing of the data.</p><p>Now, it’s time to take a small detour and learn about regular expressions.</p><p>In the following sections, we’re going to cover the basics of regular expressions, allowing us to preprocess and modify strings in a way that helps us to solve the NLP task at hand.</p><p>So take a seat, don’t go anywhere, and make sure to follow <strong>#30DaysOfNLP: </strong>How To Manipulate And Preprocess Strings With Regular Expressions</p><h3>Introducing regular expressions</h3><p>Regular expressions (regex) can be viewed as a tiny, highly specialized programming language embedded inside Python that is made available through the re module.</p><p>Although embedded, regular expressions are actually compiled into a series of bytecodes and executed by a matching engine written in C.</p><p>Regular expressions allow us to specify rules for a set of possible strings we want to match, e.g. English sentences, e-mail addresses, specific characters, etc. However, we can not only match certain patterns but also modify or even split strings.</p><p>Despite regex being quite powerful, it can also get complicated pretty quickly.
Thus, more sophisticated preprocessing steps should not be done with regex alone but rather in combination with plain Python.</p><h3>Matching operations</h3><p>Let’s begin with probably the most common task: matching characters.</p><pre>import re</pre><pre>text = &quot;Natural Language processing is so awesome, isn&#39;t it?&quot;<br>pattern = re.compile(r&#39;\?&#39;)<br>matches = pattern.findall(text)</pre><pre>&gt;&gt;&gt; [&#39;?&#39;]</pre><p>However, before we can do anything at all, we have to import the re module and specify or compile a pattern. In our simple case, we define a pattern to match a question mark.</p><p>After defining the matching pattern, we can apply several built-in functions.</p><pre>1. match() - determines if the RE matches at the beginning of the string<br>2. search() - scans through a string, looks for any matching location<br>3. findall() - finds all matching substrings, returns a list<br>4. finditer() - finds all matching substrings, returns an iterator</pre><p>We make use of the findall() function that matches all substrings and returns them in a list. In our example, we retrieve the question mark.</p><p>Pretty straightforward so far. But what about more sophisticated patterns? What about metacharacters?</p><h4>Metacharacters</h4><p>Most letters and characters simply match themselves.</p><p>With metacharacters, however, this is a completely different story.
We can use metacharacters to signal that some out-of-the-ordinary thing should be matched.</p><p>Let’s consider the square brackets &#39;[]&#39; for example which can be used to specify a set of characters.</p><pre>import re</pre><pre>text = &quot;Natural Language processing is so awesome, isn&#39;t it?&quot;</pre><pre>pattern = re.compile(r&#39;[a-c]&#39;)<br>matches = pattern.findall(text)</pre><pre>&gt;&gt;&gt;<br>[&#39;a&#39;, &#39;a&#39;, &#39;a&#39;, &#39;a&#39;, &#39;c&#39;, &#39;a&#39;]</pre><p>Picking up our simple example from before, we specify a set of characters [a-c] and try to find all occurrences in our string.</p><p>Other metacharacters to consider are the caret ’^’ and the dollar sign &#39;$&#39; which can be used to either check if a string starts or ends with a certain character.</p><pre>import re</pre><pre>text = &quot;Natural Language processing is so awesome, isn&#39;t it?&quot;</pre><pre>pattern = re.compile(r&#39;\?$&#39;)<br>matches = pattern.search(text)</pre><pre>if matches:<br>  print(True)<br>else:<br>  print(False)</pre><pre>&gt;&gt;&gt;<br>True</pre><p>By making use of the dollar sign, we are able to verify that the string ends with a question mark.</p><p>So far so good.</p><p>However, things start to get more interesting once we account for the number of occurrences. Using metacharacters like &#39;*&#39; &#39;+&#39; &#39;?&#39; we can specify the number of times a character has to appear in a given string.</p><pre>import re</pre><pre>text = &quot;abcd&quot;</pre><pre>pattern = re.compile(r&#39;[e-z]+&#39;)<br>matches = pattern.search(text)</pre><pre>if matches:<br>  print(True)<br>else:<br>  print(False)</pre><pre>&gt;&gt;&gt;<br>False</pre><p>By using the &#39;+&#39; character, we try to match the characters in the range [e-z] that appear at least once. 
Since our string doesn’t contain any of those characters, we’re unable to match the pattern.</p><p>For a complete list of metacharacters, you can refer to the table provided by <a href="https://www.w3schools.com/python/gloss_python_regex_metacharacters.asp">w3schools.com.</a></p><h4>Special Sequences</h4><p>Using the backslash character, we can access several special sequences. For example, \w which matches any alphanumeric character.</p><p>Or imagine we want to extract all digits from a sequence.</p><pre>import re</pre><pre>text = &quot;I am 32 years old.&quot;</pre><pre>pattern = re.compile(r&#39;\d&#39;)<br>matches = pattern.findall(text)</pre><pre>&gt;&gt;&gt;<br>[&#39;3&#39;, &#39;2&#39;]</pre><p>In this example, we make use of the \d sequence to extract all single digits.</p><p>For a list of all special sequences, we can once again refer to <a href="https://www.w3schools.com/python/gloss_python_regex_sequences.asp">w3schools.com</a>.</p><h3>String modifications</h3><p>With regular expressions, we can do more than just matching operations. We can split and modify strings as well.</p><pre>import re</pre><pre>text = &quot;Natural Language Processing is so awesome!&quot;</pre><pre>pattern = re.compile(r&#39;\W+&#39;)<br>result = pattern.split(text)</pre><pre>&gt;&gt;&gt;<br>[&#39;Natural&#39;, &#39;Language&#39;, &#39;Processing&#39;, &#39;is&#39;, &#39;so&#39;, &#39;awesome&#39;, &#39;&#39;]</pre><p>Relatively straightforward. We simply make use of the split() function to split a string, in this example based on all non-alphanumeric characters.</p><p>By making use of the sub() function we can even modify a string by substituting characters based on a certain pattern.
Let’s assume we want to replace all whitespace characters with a hyphen.</p><pre>import re</pre><pre>text = &quot;Natural Language Processing is so awesome!&quot;</pre><pre>pattern = re.compile(r&#39;\s&#39;)<br>result = re.sub(pattern, &#39;-&#39;, text)</pre><pre>&gt;&gt;&gt;<br>&#39;Natural-Language-Processing-is-so-awesome!&#39;</pre><h3>Useful expressions</h3><p>Now that we’ve covered most of the basics, let’s finish this article with some more useful examples.</p><p><strong>Finding e-mail addresses</strong></p><pre>import re</pre><pre>text = &quot;Here are some mail addresses \<br>alice-b@googlemail.com peter@yahoo.com&quot;</pre><pre>pattern = re.compile(r&#39;[-\w]+@[\w.]+&#39;)<br>matches = pattern.findall(text)</pre><pre>&gt;&gt;&gt;<br>[&#39;alice-b@googlemail.com&#39;, &#39;peter@yahoo.com&#39;]</pre><p><strong>Extracting phone numbers</strong></p><pre>import re</pre><pre>text = &quot;Here are my phone numbers (555) 555-1234, (555) 555-5678&quot;</pre><pre>pattern = re.compile(r&#39;\([-)\d\s]+&#39;)<br>matches = pattern.findall(text)</pre><pre>&gt;&gt;&gt;<br>[&#39;(555) 555-1234&#39;, &#39;(555) 555-5678&#39;]</pre><p><strong>Working with numbers (including separators)</strong></p><pre>import re</pre><pre>text = &quot;The numbers are 21.40453, 2,245.43, and 4,506.&quot;</pre><pre>pattern = re.compile(r&#39;\b[\d.,]+\b&#39;)<br>matches = pattern.findall(text)</pre><pre>&gt;&gt;&gt;<br>[&#39;21.40453&#39;, &#39;2,245.43&#39;, &#39;4,506&#39;]</pre><p><strong>Extracting dates</strong></p><pre>import re</pre><pre>text = &quot;Here are some timestamps \<br>2013-02-20T17:24:33Z, 2016-03-23T11:19:33Z&quot;</pre><pre>pattern = re.compile(r&#39;[-\d]+\d{2}&#39;)<br>matches = pattern.findall(text)</pre><pre>&gt;&gt;&gt;<br>[&#39;2013-02-20&#39;, &#39;2016-03-23&#39;]</pre><h3>Conclusion</h3><p>In this article, we took a small detour and learned the basics of regular expressions.
And such basic knowledge might come in handy when we have to preprocess and modify certain strings to fit into our Natural Language Processing pipeline.</p><p>Now, it’s time to wrap up the complete series by looking back, reviewing the work we have done, and providing some useful resources in the last episode.</p><p>So take a seat, don’t go anywhere, make sure to follow, and never miss a single day of the ongoing series <strong>#30DaysOfNLP.</strong></p><p><a href="https://medium.com/@marvinlanhenke/list/3974a0c731d6">#30DaysOfNLP</a></p><p><em>Enjoyed the article? Become a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li><a href="https://docs.python.org/3/howto/regex.html">https://docs.python.org/3/howto/regex.html</a></li><li><a href="https://www.w3schools.com/python/python_regex.asp">https://www.w3schools.com/python/python_regex.asp</a></li><li><a href="https://developers.google.com/edu/python/regular-expressions">https://developers.google.com/edu/python/regular-expressions</a></li></ul><p><a href="https://medium.com/mlearning-ai/mlearning-ai-submission-suggestions-b51e2b130bfb">Mlearning.ai Submission Suggestions</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=846fd5dac7e2" width="1" height="1" alt="">]]></content:encoded>
        </item>
        <item>
            <title><![CDATA[NLP-Day 28: How To Approach And Choose A Deep Learning Architecture]]></title>
            <link>https://medium.com/@marvinlanhenke/nlp-day-28-how-to-approach-and-choose-a-deep-learning-architecture-4ef8fff42f45?source=rss-1ea0548a5421------2</link>
            <guid isPermaLink="false">https://medium.com/p/4ef8fff42f45</guid>
            <category><![CDATA[deep-learning]]></category>
            <category><![CDATA[ml-so-good]]></category>
            <category><![CDATA[workflow]]></category>
            <category><![CDATA[naturallanguageprocessing]]></category>
            <category><![CDATA[nlp]]></category>
            <dc:creator><![CDATA[Marvin Lanhenke]]></dc:creator>
            <pubDate>Wed, 04 May 2022 17:21:59 GMT</pubDate>
            <atom:updated>2022-05-04T20:18:20.401Z</atom:updated>
            <content:encoded><![CDATA[<h4>#30DaysOfNLP</h4><h4>A general workflow and some key network architectures</h4><figure><img alt="" src="https://cdn-images-1.medium.com/max/491/1*h_3pnMi0WF2rd9Ci4GNfGw.png" /><figcaption>General Workflow &amp; key architectures #30DaysOfNLP [Image by Author]</figcaption></figure><p><a href="https://medium.com/mlearning-ai/nlp-day-27-how-to-visualize-word-embeddings-with-tensorboard-e69f39707d64"><strong>In the last episode</strong></a>, we gently introduced TensorBoard, a tool that allows us to gain insights and a deeper understanding of the various models we implemented.</p><p>Considering that this series is slowly approaching the finish line, it’s time to take a step back.</p><p>In the following sections, we’re going to take a look in the rearview mirror, not only highlighting the importance of a general workflow but also revisiting the key network architectures we already encountered and implemented ourselves.</p><p>So take a seat, don’t go anywhere, and make sure to follow <strong>#30DaysOfNLP: </strong>How To Approach And Choose A Deep Learning Architecture</p><h3>A simple tool</h3><p>Within only a few years, deep learning has achieved tremendous breakthroughs, especially in the field of machine perception, which deals with unstructured data like images, videos, sound, or text.</p><p>Given enough training data, neural networks are capable of extracting nearly the same amount and quality of information from the data as a human could.</p><p>However, deep learning is just a tool.</p><p>And simply having a tool at our disposal won’t suffice.</p><p>In order to solve problems, we need a general workflow. We need to understand the key network architectures, enabling us to choose the right tool for a well-defined task.</p><h3>A general workflow</h3><p>The difficult part isn’t building a model.</p><p>It’s everything that lies before designing and training one. 
Understanding the problem domain or knowing how to measure success, unfortunately, isn’t something TensorFlow or Keras can help us with.</p><p>There simply isn’t a plug-and-play function; thus, we need a workflow:</p><ol><li>We need to define the problem and know what data is available. What are we trying to predict? Do we need to collect more or manually label the data?</li><li>How can we reliably measure success toward our goal? Is a simple metric like accuracy sufficient, or do we need to define a custom, domain-specific metric?</li><li>We need to prepare the validation process to evaluate our model. This means we should define a training, validation, and test dataset.</li><li>Vectorize the data. We need to shape and preprocess (e.g. normalization) the data into a form that makes our model happy.</li><li>Create a first baseline that beats a common-sense approach. This way, we ensure that the network can learn anything at all.</li><li>Refine our architecture gradually. We can tune hyperparameters and add regularization to improve the model’s performance and generalization ability.</li><li>We can deploy our final model in production and keep monitoring and refining it.</li></ol><h3>Key network architectures</h3><p>The key architectures can be divided into four categories: densely connected networks, convolutional networks, recurrent networks, and Transformers.</p><p>Each architecture has individual needs in terms of input data and makes different assumptions. Data, underlying assumptions, and the architecture must match in order for the model to be able to learn.</p><p>Image data, for example, can be processed by 2-dimensional convolutional neural networks, whereas sequential data is better handled by recurrent neural networks.</p><h4>Densely connected networks</h4><p>A densely connected network contains stacks of dense layers meant to process vector data. Dense networks assume no specific structure in the input features. 
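</p><p>As a minimal sketch of such a stack (assuming Keras; the layer sizes, the 100-feature input, and the 10-class softmax head are illustrative, not taken from this series):</p><pre>from tensorflow import keras
from tensorflow.keras import layers

# A stack of dense layers for flat vector input; every size here is
# illustrative. The softmax head turns the stack into a classifier.
model = keras.Sequential([
    keras.Input(shape=(100,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(10, activation="softmax"),
])</pre><p>Calling model.summary() would show the stack of fully connected layers.</p><p>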
They’re called densely connected because each unit in one layer is connected to every unit in the adjacent layers, creating a dense net of connections.</p><p>The dense network attempts to map relationships between any two input features and is mostly used for categorical data or as the final layer of a classification or regression task.</p><h4>Convolutional neural networks</h4><p>Convnets look at spatially local patterns by applying the same transformation to different patches of the input tensor. The results are translation invariant, making convnets highly data-efficient and parallelizable.</p><p>Convolutional neural networks can be 1-, 2-, or 3-dimensional. We can process sequences (e.g. words in a sentence) with a 1-dimensional network, whereas 2-dimensional networks are best suited for image data.</p><p>The network consists of several stacks of convolution and max-pooling layers.</p><h4>Recurrent neural networks</h4><p>RNNs process sequences one timestep at a time while maintaining a state throughout. For sequential data, they should be preferred over 1-dimensional convnets, especially if the data has a temporal order (e.g. time series or words in a sentence).</p><p>We can rely on the Keras API to provide us with several implementations: SimpleRNN, GRU, and LSTM.</p><h4>Transformers</h4><p>Transformers leverage an attention mechanism to transform each input vector (e.g. a word) into a representation that is aware of the context. We can also use positional encoding to make the Transformer aware of both the global context and the order.</p><p>Transformers are more effective than RNNs or 1-dimensional convnets, and they especially excel at sequence-to-sequence problems.</p><p>Transformers are made up of two parts: the TransformerEncoder and the TransformerDecoder. 
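</p><p>As a rough sketch of what the encoder half boils down to, here is a toy self-attention block built from stock Keras layers (the embedding size of 64, the 4 heads, and the single feed-forward layer are illustrative assumptions; a full TransformerEncoder wraps this pattern more carefully):</p><pre>import tensorflow as tf
from tensorflow.keras import layers

# Toy encoder block: self-attention plus a feed-forward layer, each
# followed by a residual connection and layer normalization.
inputs = tf.keras.Input(shape=(None, 64))  # (batch, seq_len, embed_dim)
attention = layers.MultiHeadAttention(num_heads=4, key_dim=16)(inputs, inputs)
x = layers.LayerNormalization()(inputs + attention)
feed_forward = layers.Dense(64, activation="relu")(x)
outputs = layers.LayerNormalization()(x + feed_forward)
encoder = tf.keras.Model(inputs, outputs)</pre><p>Every output vector now mixes in information from all positions in the sequence.</p><p>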
The encoder transforms an input into a representation that is aware of the context and the order, whereas the decoder takes the encoder’s output and a target sequence and tries to predict the next element in the target sequence.</p><h3>Conclusion</h3><p>In this article, we took a step back and quickly reviewed the key architectures in deep learning. We also established a general workflow, enabling us to approach a problem in a structured, efficient way.</p><p>In the next article, we take a slight detour before finishing the complete series and learn about the basics of regular expressions in Python.</p><p>So take a seat, don’t go anywhere, make sure to follow, and never miss a single day of the ongoing series <strong>#30DaysOfNLP.</strong></p><p><a href="https://medium.com/@marvinlanhenke/list/3974a0c731d6">#30DaysOfNLP</a></p><p><em>Enjoyed the article? Become a </em><a href="https://medium.com/@marvinlanhenke/membership"><em>Medium member</em></a><em> and continue learning with no limits. I’ll receive a portion of your membership fee if you use the following link, at no extra cost to you.</em></p><p><a href="https://medium.com/@marvinlanhenke/membership">Join Medium with my referral link - Marvin Lanhenke</a></p><p><strong>References / Further Material:</strong></p><ul><li>Francois Chollet: Deep Learning with Python. New York: Manning, 2021.</li></ul><p><a href="https://medium.com/mlearning-ai/mlearning-ai-submission-suggestions-b51e2b130bfb">Mlearning.ai Submission Suggestions</a></p><img src="https://medium.com/_/stat?event=post.clientViewed&referrerSource=full_rss&postId=4ef8fff42f45" width="1" height="1" alt="">]]></content:encoded>
        </item>
    </channel>
</rss>