Easy data processing at scale with Optimus

Argenis Leon
Nov 11, 2021

This blog post is the first of a series about Optimus. Below, we will add a link to every post in the series.

Optimus is an open-source, user-friendly Python library to load, transform, and explore data at any scale. It relies on DataFrame technologies like pandas, Dask, cuDF, Dask-cuDF, Spark, and Vaex, so you can process data using the same code with the engine that best fits the infrastructure at your disposal.

But why so many engines? It turns out that every engine has pros and cons. Let's see:

  • Pandas. Fast for small data on your local PC.
  • Dask. For big data on your PC or on a Dask cluster.
  • cuDF. If you have a GPU and small data, this is the correct approach.
  • Dask-cuDF. If you have big data and a cluster of GPUs, this is the approach.
  • Spark. Slow for small data. If you have big data and a Spark cluster, this is the approach.
  • Vaex. Can efficiently process larger-than-memory data on a single machine.

Now that we know what Optimus is, let's talk about what we consider its three main features:

1. A unified API.

With Optimus, you can use some of the most popular DataFrame technologies. It provides the flexibility to process a sample of the data on your laptop using pandas and then send a job to Dask-cuDF or a Spark cluster to process it at scale. With Optimus, you can handle small and big data efficiently, all using the same Python code.

2. Abstracting the user from technical details.

With Optimus, you only work with columns and rows (no index or mask knowledge required); the purpose is to use familiar concepts from spreadsheets so that you have a soft landing when you start using Optimus. You have 100+ operations to process strings, dates, URLs, and emails, and to apply math, statistical, trigonometric, plotting, NLP, string clustering, and ML functions. We will cover these functions in future posts.

3. Data types should be as rich as possible.

Optimus can infer and detect a richer set of data types, like email, URL, and credit card number, among others. The idea is that you get richer insights into your data and can process it as needed to get it into shape.

Knowing this, let's install Optimus.

Installing Optimus

To install Optimus, you only need to run this in the terminal:

pip install pyoptimus

You are ready to go!

To initialize Optimus, just run:

from optimus import Optimus

Then, to start using pandas in Optimus, just run:

op = Optimus("pandas")

You can use any of the other available engines, like:

  • op = Optimus("dask") to use Dask.
  • op = Optimus("dask-cudf") to use Dask-cuDF.
  • op = Optimus("cudf") to use cuDF.
  • op = Optimus("spark") to use Spark.
  • op = Optimus("vaex") to use Vaex.

Keep in mind that to use cuDF and Dask-cuDF, you need an Nvidia GPU compatible with RAPIDS.

Now let's see how you can load, manipulate, and save a dataframe.

Loading data

Once you have initialized Optimus, you can start loading data. For example, to open a CSV file, just run:

op.load.csv("path/to/file.csv")

You can also load JSON, XML, Excel, Parquet, Avro, ORC, and HDF5 files.

Furthermore, the most helpful function to load a file is op.load.file, which can detect and load any of the supported file formats and infer the data encoding (in the case of a CSV), so you can forget about extra configuration when loading a new dataset.
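For instance, a minimal sketch using a hypothetical local path:

df = op.load.file("path/to/file.csv")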

Now let's see how we can connect and load files from remote sources.

Load files from remote sources

Optimus also provides a way to load data from a remote source, and it enables you to connect to these sources even when credentials are required. For example, to load from remote storage, we can create a connection like:

conn = op.connect.s3(endpoint_url="nyc3.digitaloceanspaces.com", bucket="bucket_name", key="<KEY>", secret="<SECRET>")

Some of the methods available on op.connect and the filesystems they represent are:

  • file: Local or network filesystem (used by default when no connection is set)
  • s3: Amazon S3
  • hdfs: Hadoop File System
  • gds: Google Cloud Storage
  • adl: Microsoft Azure (Data Lake Storage)
  • abfs: Microsoft Azure (Blob Storage)
  • http/https: HTTP(S) filesystem
  • ftp: FTP filesystem

To use the connection from the previous example, we can use the following code:

df = op.load.file("files/my-remote-file.csv", conn=conn)

Now that we can connect to remote sources, let's see how we can connect to databases.

Loading data from a Database

Connecting to databases can be very handy because you do not need to save the data to a file and then load it into memory.

To create a connection to a database using Optimus, we can use any of the available database methods. For example, to connect to a PostgreSQL database using specific credentials:

db = op.connect.postgres(address="localhost", user="root", password="12345678", database="mydatabase")

Some of the available database engines Optimus can handle are:

  • mysql: MySQL
  • sqlite: SQLite
  • microsoftsql: Microsoft SQL Server
  • postgres: PostgreSQL
  • oracle: Oracle
  • bigquery: BigQuery
  • redshift: Redshift
  • cassandra: Cassandra
  • presto: Presto
  • redis: Redis

Each method requires the following arguments to make a proper connection to a database:

  • host: The host where the database is stored.
  • port: The port where the database is available.
  • user: Username.
  • password: The password for that user.
  • database: The name of the database you want to connect to.
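Once connected, you can bring a table straight into a dataframe. The helper name below (table_to_df) and the table name are assumptions for illustration and may differ in your Optimus version:

# hypothetical table name; method name assumed
df = db.table_to_df("mytable")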

Saving Data

After loading and processing your data, you will surely want to save it (we will see more about processing in future posts). With Optimus, you can save data locally, to remote storage, or to databases.

To save data locally, you simply need to write:

df.save.csv("file.csv")

And to save data to a remote location, you first need to create a remote connection (as we saw when loading data from remote data sources) and pass it when saving, like this:
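A minimal sketch, reusing the conn object created earlier; the conn argument on save is assumed to mirror the one on op.load.file:

# conn argument assumed to work like it does for loading
df.save.csv("files/my-remote-file.csv", conn=conn)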

On the other hand, to save the dataframe to a database table, you can use code like the following:
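A sketch assuming a database_table save method and a hypothetical table name (the exact method may differ in your Optimus version), reusing the db connection created above:

# method and table name are assumptions for illustration
df.save.database_table("mytable", db=db)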

Processing data with external services

One fantastic option to scale your processing quickly is Coiled. Coiled is a deployment-as-a-service library for scaling Python in which you can create clusters on demand. To use a Coiled cluster through Optimus, you need to include your Coiled token in the parameters. To get this token, you must create an account at https://cloud.coiled.io and copy it from your dashboard. So, for example, setting the number of workers to 2, we execute the following code:

op = Optimus(coiled_token="<your token here>", n_workers=2)

Once we have done this, Optimus will initialize a Dask cluster and handle the connection created by Coiled; from there, Optimus works as usual. However, it is important to keep the package versions in the remote cluster identical to the ones on your local machine. For this, you can install a Coiled software environment as a local conda environment using Coiled's command-line tool. For Optimus, we will use a specific software environment called optimus/default.

For example, to create the conda environment, we run:

coiled install optimus/default

Then, to start using it, we run:

conda activate coiled-optimus-default

DataFrames

Optimus creates DataFrames out of Python dictionaries, which makes it easier to create and work with them.
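For example, a minimal sketch using op.create.dataframe with illustrative column names and values:

df = op.create.dataframe({
    "even": [2, 4, 6, 8],
    "words": ["two", "four", "six", "eight"]
})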

Given any DataFrame, Optimus can detect a wide range of data types within it, such as:

  • Integer
  • String
  • Email
  • URL
  • Gender
  • Boolean
  • US zip code
  • Credit card number
  • Time and date format
  • Object
  • Array
  • Phone number
  • Social security number
  • HTTP code

Many of these data types have special functions within Optimus:

  • URL (schemas, domains, extensions, and query strings)
  • Date (year, month, and day)
  • Time (hours, minutes, and seconds)
  • Email (domain and domain extensions)

To visualize DataFrames as tables, we use the display function.

For example:
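Using the dataframe created above:

df.display()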

Optimus has two main functions that allow us to filter the columns in a dataset: select and drop. The first keeps only the columns you pass as input, while the second does the opposite: it removes the given columns from the dataset.

On the one hand, you can select columns by their attributes. For example, using the dataframe above, we select a column by its name to get only the column named "even":
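A sketch (chain .display() to render the result as a table):

df.cols.select("even")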

The dataframe shown above is quite simple, so let's create a richer one:
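A sketch with illustrative column names and values, chosen so that two column names start with "n" and so that "words" and "thing" sit in the first and third positions:

df = op.create.dataframe({
    "words": ["one", "two", "three"],
    "num": [1, 2, 3],
    "thing": ["pen", "book", "lamp"],
    "name": ["Alice", "Bob", "Carol"],
    "even": [2, 4, 6],
    "odd": [1, 3, 5]
})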

You can also use regular expressions to select columns. For this example, with the "^n." regular expression, you can get all the columns whose names start with the letter n:
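A sketch assuming cols.select accepts a regex argument:

df.cols.select(regex="^n.")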

Another helpful parameter is invert. For example, if we want to exclude all the columns that start with the letter "n", we apply the following code, which returns the dataframe "df" without the columns whose names begin with "n":
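Again assuming the regex and invert arguments on cols.select:

df.cols.select(regex="^n.", invert=True)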

You can also select columns by their numerical order. So, for example, if you want to display just the first column of the DataFrame, write: yourDF.cols.select(1).

And if you want to display the first and third columns, write: yourDF.cols.select([1, 3]).

For example, the following code will return a table containing the first and third columns of "df":
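Following the 1-based numbering used in the snippets above:

df.cols.select([1, 3])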

On the other hand, you can eliminate columns by "dropping" them from the dataset. For example, if we want to get rid of the column "words" from "df", we write:
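A minimal sketch:

df.cols.drop("words")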

As you can see, the resulting table is the same as the one generated by df, but it doesn't contain the column named "words."

Another useful column-management function is keep, which allows us to delete all the columns in the dataset except the ones we specify in the argument. An example would be:
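A sketch, keeping only an illustrative column:

df.cols.keep("even")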

Optimus also has the option to change the order of columns in the dataset through the move function, which gives a helpful way to organize the columns. To use it, you pass the columns (or a single column) to the function and tell it where to place them relative to a reference column. Let's look at an example; the following code will return the same table generated by "df" but with a new column order:
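A sketch assuming move takes the columns, a position (such as "end", "beginning", "before", or "after"), and an optional reference column; here we send "words" and "thing" to the end, which matches the reordering described below:

# position values are an assumption; check your Optimus version
df.cols.move(["words", "thing"], "end")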

As you can see, the "words" column went from first to fifth, and the "thing" column went from third to sixth.

Now you know almost every essential function in Optimus to get started, from using a Coiled cluster or loading data from an external database to filtering and organizing DataFrames your way.

In a word, Optimus can enhance data wrangling in every way: it makes managing data more accessible with its many DataFrame functions, it enables any user to load data like a pro, it gives a new, intuitive way to analyze big and small data, and most of all, it feels familiar and straightforward so that anyone can use it.

The following articles will explore more specific areas that Optimus covers, such as data profiling and quality, setting data types, string and numeric functions, mathematical and statistical functions, and UDFs (user-defined functions).

See you next week!

Do you want to learn more about Optimus?

If you want deep-dive explanations of the 100+ functions in Optimus, consider reading our book, available on Amazon: https://www.amazon.com/Data-Processing-Optimus-Supercharge-preparation-dp-1801079560/dp/1801079560/ref=mt_other?_encoding=UTF8&me=&qid=1628683187.
