It’s not Python VS SQL for Data Analysis. They are in a love relationship!

Stefan Graf
CodeX
Published in
5 min readApr 4, 2022
Photo by Kelly Sikkema on Unsplash

The ultimate goal is to use the best tool for your job. But often it makes sense to use some tools together like SQL and Python. Both are vastly different in a lot of points (which will be also discussed in the following text), but this doesn’t mean you have always to choose between both of them for your Analysis tasks. Rather the contrary, they can complement each other very well.

But first let’s gain a quick overview of Python and SQL regarding Data Analysis Tasks.

Python

Python is a general purpose programming language. In fact, it pushes this definition to the extreme. Python is literally being used in the real world for almost everything, including Data Analysis. Also, Python can run almost everywhere. On your local machine, on a Server or even on distributed systems.

This gets realized by different packages which can be imported. For Data Analysis purposes the most prominent one is for sure Pandas, as it provides you with a lot of tools to work with data in a table based format. This way is also implemented in various other packages like veax. Also the task of Data Analysis can be distributed to a multitude of systems using something like PySpark or Dask. That makes Python also suitable for Big Data Analysis.

In big contrast to SQL, Python is also capable of consuming APIs, Web Scrapping and other ways of gathering data. Hence it is very flexible on including different data sources.

SQL

SQL is the query language to interact with data in relational databases. This is the only use case of this language, but it implements it in a very compelling way. It’s such a good fit for this purpose, that it is the de facto standard on this for over 40 years, which is a really long time in computer science.

But it also comes with some limitations. SQL can be only used in Databases and a lot of data driven use cases, like machine learning and sophisticated statistics, are better implemented by programming languages like Python.

Another difference is that even SQL runs on various database systems from different vendors, which can have drastically different purposes like OLTP or OLAP, SQL looks almost everywhere the same. Sure, there are some additions to the SQL standard syntax from the different vendors, but it is everywhere very similar. This differs a lot to the fact, that Python code can look very different, depending on which package is being used. This fact makes SQL very easy to use.

Where one of them is enough

So in my title I stated that SQL and Python are in a love relationship besides all their differences. But there are topics where each of them just work fine by them self.

Like in the real world, in any good relationship you should do some stuff for yourself, which your partner doesn’t enjoy.

For SQL this would be for example:

  • Queries inside your Datawarehouse
  • Preparing data for Analysis inside your Datawarehouse

What do the points of this incomplete list have in common? Every task is solely done inside your Database. This is where SQL shines and no additional tool is required.

For Python the list looks like this:

  • Doing one-time analysis on a csv file
  • Analysing data from API without saving the data
  • Implementing Machine Learning models from csv files

Again, what do these list items have in common? There is no contact to a persisting layer (database) and therefore SQL is not required.

Where both need each other

I guess by now you know where this is leading — Python requires help when it comes to get data from a database and save it into a database. SQL needs help when it has to interact with something outside its database.

Even if you do some things on your own, the foundation of each working relationship is doing stuff together you both love.

How do they connect

Python is used as connection between both of them. There are plenty of options in getting access to the data inside in your database. The most common one might be to use the SQLAlchemy Package. This SQLAlachemy connection can be directly used by pandas to ingest all needed data from an SQL query into a dataframe.

Often databases don’t give you the possibility to connect directly to them. It is abstracted behind an API, which still is no Problem for Python to connect to the data source.

Doing your Analysis where it fits best for you inside your ETL/ELT

Gathering data from one System and load it into a target database using an ETL pipeline is probably the most common task every Data Engineer is doing. This is where Python is often used as glue between the source data system and the target system. Getting data from APIs and Webscrapping is done by Python. Transforming and analysing with very complex logic or on a Big Data scale using something like PySpark is also a great fit for Python.

The following architecture schema is built in Azure, but it looks very similar at different cloud providers. Basically, you use Pyspark in Databricks (a managed spark environment) or pandas in an Azure Function (serverless Functions) to transform your data. This transformed data can then be loaded into Synapse (Datawarehouse) or any other database.

Architecture of a modern data environment using the Azure stack

Now this is where the great relationship comes in place. You can decide where your different analysis tasks fit the best. Often with Python, you should do your sophisticated statistics. These results can be generated while doing the rest of your transformations. Afterwards you can load the data, including your statistics, into your Datawarehouse to perform other adhoc analysis.

Machine Learning/Data Science

Training/Inferencing is another task, where both tools can work very well together. The data used in your ML solutions is commonly provided by the database. You could just simply query all the data needed from the database and do all the heavy lifting inside your Python Code.

A better way would be to use well suited SQL queries to get only the data you really need. As already mentioned, SQL is a very efficient way to define your needed data. In fact, if you use good queries, where a index can be leveraged, you increase the performance of your whole solution.

Conclusion

I hope you enjoyed this text. My main goal was to provide you with some ideas how SQL and Python can work together on a beneficial way. Also, to remind you, that every tool has its pros and cons. Different from Lord of the rings, there is not the one tool to rule them all and often it is better to combine the strengths of your tools to cover their weaknesses.

--

--

Stefan Graf
CodeX
Writer for

Data Engineer Consultant @Microsoft — Data and Cloud Enthusiast