Data Engineering (Pandas in Python & SQL)

Andrew Githinji
3 min read · Sep 24, 2022

Pandas is an open-source Python library for data analysis, created in 2008 by Wes McKinney. It is designed primarily for working with relational or labeled data easily and intuitively, and it includes a number of data structures and operations for manipulating numerical data and time series.

Pandas is built on top of the NumPy package, so NumPy must be installed for Pandas to run.

What are the capabilities of using Pandas?

Pandas will simplify many of the time-consuming, repetitive activities connected with data work.

What’s fascinating about Pandas is that it takes data from a source such as a CSV or TSV file or a SQL database and constructs a Python object with rows and columns called a data frame, which looks very similar to a table in statistical software or a spreadsheet (think Excel).

You’ll typically use it in one of three ways:

  • Convert a Python list, dictionary, or NumPy array to a Pandas data frame.
  • Use Pandas to open a local file, commonly a CSV file, but it may also be another delimited text file such as TSV, an Excel workbook, or another format.
  • Use a URL to open a remote file or database, such as a CSV or a JSON, or read from a SQL table/database.
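The three routes above can be sketched roughly as follows. This is a minimal illustration, not a recommended pipeline: the column names, the sample values, and the `people` table are all made up, an in-memory string stands in for a CSV file, and an in-memory SQLite database stands in for a real SQL server.

```python
import io
import sqlite3

import numpy as np
import pandas as pd

# 1. From in-memory Python objects (hypothetical column names and values).
df_from_dict = pd.DataFrame({"name": ["Ann", "Bob"], "age": [31, 27]})
df_from_array = pd.DataFrame(np.arange(6).reshape(3, 2), columns=["a", "b"])

# 2. From a CSV source. read_csv also accepts a local path or a URL;
#    StringIO is used here only so the sketch runs without a file.
csv_text = "name,age\nAnn,31\nBob,27\n"
df_from_csv = pd.read_csv(io.StringIO(csv_text))

# 3. From a SQL table. An in-memory SQLite database is assumed here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, age INTEGER)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [("Ann", 31), ("Bob", 27)])
df_from_sql = pd.read_sql_query("SELECT name, age FROM people", conn)
conn.close()

print(df_from_dict.shape, df_from_array.shape, df_from_csv.shape, df_from_sql.shape)
```

In each case the result is the same kind of object, a data frame, so the rest of the analysis does not need to care where the data came from.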

As another example, suppose you want to analyse a dataset saved on your computer as a CSV file. Pandas will load the data from that CSV into a data frame, a table-like structure, and let you do things like:

· Calculate the average, median, max, or min of each column.

· Correlate columns.

· See what the distribution of the data in a column looks like.
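A short sketch of those operations, assuming a small made-up data frame (the `height_cm` and `weight_kg` columns are hypothetical stand-ins for whatever a real CSV would contain):

```python
import pandas as pd

# Hypothetical data standing in for a CSV loaded with pd.read_csv.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180],
    "weight_kg": [50, 60, 70, 80],
})

col_means = df.mean()      # average of each column
col_medians = df.median()  # median of each column
col_max = df.max()         # maximum of each column
col_min = df.min()         # minimum of each column

corr = df.corr()           # pairwise correlation between columns

# describe() gives a quick view of each column's distribution:
# count, mean, std, min, quartiles, and max.
summary = df.describe()

print(col_means)
print(corr)
print(summary)
```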

The great Library of NumPy

Short for Numerical Python, NumPy is Python’s fundamental library for scientific computing. It is used for working with arrays, and it provides functions and data structures for implementing multidimensional arrays and matrices.

These data structures are used to perform efficient computations on arrays and matrices.

Arrays are collections of values that have one or more dimensions. NumPy makes it easy to manage large amounts of data.

It is also highly useful for matrix multiplication and data reshaping. NumPy is fast, which makes it effective for working with massive amounts of data.
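As a small illustration of reshaping and matrix multiplication (the values are arbitrary):

```python
import numpy as np

a = np.arange(6)        # one-dimensional array: [0 1 2 3 4 5]
m = a.reshape(2, 3)     # same data viewed as 2 rows x 3 columns

# Matrix multiplication with the @ operator: (2x3) @ (3x2) gives a 2x2 result.
product = m @ m.T

print(m)
print(product)          # [[ 5 14]
                        #  [14 50]]
```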

How can you utilise NumPy…

NumPy is a multi-purpose array-processing library. By using the array function, you can create an array from a regular Python list or tuple. The data type (dtype) of the resulting array is inferred from the type of the elements in the sequence.

NumPy also includes a function similar to Python’s range, called arange, which returns an array holding a numerical sequence.
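A brief sketch of both points, using `np.array` and `np.arange` with arbitrary values:

```python
import numpy as np

# The element types of the input sequence decide the array's dtype.
ints = np.array([1, 2, 3])          # all integers -> an integer dtype
floats = np.array((1.0, 2.5, 3))    # mixed int/float tuple -> float64

# arange works like range(start, stop, step) but returns an array.
seq = np.arange(0, 10, 2)

print(ints.dtype, floats.dtype)
print(seq)                          # [0 2 4 6 8]
```

The exact integer dtype (for example int64) can depend on the platform and NumPy version.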

Ndarray

In NumPy, the array object is known as ndarray. Using the array() function, we can generate a NumPy ndarray object.

To create an ndarray, pass a list, tuple, or any array-like object to the array() function, and it will be converted into an ndarray.
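For instance, the following sketch passes a few different array-like inputs to `np.array` and inspects the resulting objects (the values are arbitrary):

```python
import numpy as np

from_tuple = np.array((4, 5, 6))                 # a tuple
from_range = np.array(range(3))                  # a range object is array-like too
from_nested = np.array([[1, 2, 3], [4, 5, 6]])   # nested lists give a 2-D array

print(type(from_tuple))     # <class 'numpy.ndarray'>
print(from_range)           # [0 1 2]
print(from_nested.ndim)     # 2
print(from_nested.shape)    # (2, 3)
```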

Arrays are very frequently used in data engineering and data science, where speed and resources are of high value.

Check out the NumPy documentation to learn more…

A brief on Data Engineering with SQL…

SQL is a standard language for storing, manipulating and retrieving data in databases.

Structured Query Language, or SQL for short, is used to perform actions on a database’s records and structure, such as inserting, updating, and deleting records, and creating and altering database tables and views.
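These actions can be tried out from Python with the standard library's `sqlite3` module. The sketch below is only an illustration: the `employees` table, its columns, the `high_paid` view, and the sample rows are all hypothetical, and it runs against an in-memory SQLite database.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Create a table (hypothetical schema).
cur.execute("CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT, salary INTEGER)")

# Insert records.
cur.executemany(
    "INSERT INTO employees (name, salary) VALUES (?, ?)",
    [("Ann", 5000), ("Bob", 4000), ("Cy", 3000)],
)

# Update a record.
cur.execute("UPDATE employees SET salary = salary + 500 WHERE name = ?", ("Bob",))

# Delete a record.
cur.execute("DELETE FROM employees WHERE name = ?", ("Cy",))

# Alter the table by adding a column.
cur.execute("ALTER TABLE employees ADD COLUMN department TEXT")

# Create a view over the table.
cur.execute("CREATE VIEW high_paid AS SELECT name, salary FROM employees WHERE salary > 4600")

# Retrieve data from the table and from the view.
rows = cur.execute("SELECT name, salary FROM employees ORDER BY id").fetchall()
view_rows = cur.execute("SELECT name FROM high_paid").fetchall()
conn.close()

print(rows)       # [('Ann', 5000), ('Bob', 4500)]
print(view_rows)  # [('Ann',)]
```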

SQL should not be mistaken for a database system; it is a query language. It became an ANSI standard in 1986 and an ISO standard in 1987. It is widely used in data science and analytics, and it is used at the back end of large companies such as Facebook, Instagram, and LinkedIn.

Database systems such as PostgreSQL, SQLite, MySQL, and SQL Server each implement the core features of SQL, but none of them is fully compliant with the ANSI/ISO SQL standard, and each has its own dialect.

More on SQL here
