Automate Exploratory Data Analysis on Databases

How to Extract Insights From Your Data

Davis David
Geek Culture
10 min read · Apr 5, 2023


Photo by RODNAE Productions

The world we live in generates massive amounts of data every day. The vast majority of this data is stored in databases on servers located across the globe. This data is used to influence and shape our daily lives, from government decisions to the products consumers purchase.

Data has a great deal of promise if you can gain insights from it. It has become more important than ever to be able to find patterns in this data, draw conclusions from it and leverage it to create a competitive advantage.

By using analytical techniques to identify patterns and uncover insights, companies are able to gain actionable knowledge from their data that they can use to inform decisions, identify new opportunities, and develop strategies.

How can you begin extracting insights from your data in order to make data-driven decisions? The first step is exploring your data to discover and understand the underlying patterns, information, and facts that can help you make better decisions.

In this article, you will mainly learn:

  • How to manage your data using the HarperDB database.
  • How to access your data from HarperDB using a Custom Function.
  • How to automate EDA on data from the HarperDB database using the Sweetviz Python library.

Automate EDA

So, let’s get started 🚀.

Introduction to Automated Exploratory Data Analysis

Exploratory data analysis (EDA) is an important step in any data science project. It means looking at and understanding the structure, patterns, and properties of a dataset before using it to make machine learning models. EDA helps data scientists find outliers, missing values, correlations, and distributions that can affect how accurate the machine learning model is.

But doing EDA manually can take a lot of time, especially for large datasets with a lot of features. Automated exploratory data analysis can give a quick and complete overview of the dataset, highlighting the most important insights and patterns. AutoEDA can make visualizations and reports that are more consistent and standardized than those made by humans. In general, AutoEDA is a powerful tool that can help data scientists speed up their work, make their models better, and learn more about their data.

Why Analyze Data from Databases?

When compared to other data sources, there are several benefits to analyzing data from databases.

  • First, databases store data in one place and in a structured way, which makes it easier to manage and search through large amounts of data. In today’s fast-paced business world, timely decision-making is critical and data scientists need to be able to quickly analyze data to spot new trends or possible problems.
  • Second, databases offer robust security and access control mechanisms that ensure the privacy and integrity of the data. Data stored in databases are protected from unauthorized access and manipulation, which is critical for sensitive and confidential data.

In this article, I will demonstrate how you can explore data from a database such as HarperDB, which makes it easy to manage and access the data you want to explore.

What is HarperDB?

HarperDB is one of the fastest and most flexible SQL/NoSQL data management platforms. It offers different services, including but not limited to:

  • Rapid application development
  • Distributed computing
  • Software as a Service (SaaS)
  • Edge computing and others

HarperDB has the ability to run on various devices, from the edge to the cloud, without duplicating data. What I love about HarperDB is its compatibility with different programming languages, such as Python, which is the most commonly used language for data analysis. Other supported languages include JavaScript (Node.js), Java, Go, and C.

HarperDB offers different features that you can use for different cases and projects, these features are as follows:

  • Single endpoint API
  • Custom Functions (Lambda-like application development platform with direct access to HarperDB’s core methods).
  • Allows JSON and CSV file insertions.
  • Supports SQL queries for full CRUD operations.
  • Limited database configuration required.

Logo from harperdb.io

Steps to manage data on the HarperDB database

We need to put some sample data into the HarperDB database before we can automate the process of exploring the data. For this tutorial, I will use the loan dataset that is available here.

Step 1: Create a HarperDB Account
The first step is to create an account if you don't have one. Visit https://harperdb.io/ and click the “Start Free” link in the navigation bar.

If you already have an account, visit this link https://studio.harperdb.io/ to log in with your credentials.

Step 2: Create a HarperDB Cloud Instance
The second step is to create a cloud instance to store and fetch your data that will be automatically explored. You just need to click the "Create New HarperDB Cloud Instance" link to add a new instance to your account.

The new page will give you all the important information you need as a guideline to create your cloud instance.

Step 3: Configure the HarperDB Schema and Table
Before you can add the data you want to explore, you must first set up a schema and a table. Open the HarperDB cloud instance you previously created from the dashboard and name the schema (for example, “loans”) to complete the process.

loan schema

Then add a table, such as “customers”, that will contain the loan data. HarperDB will also ask for the hash attribute, which acts like a unique ID for each record.

Step 4: Import Loan data to the table
Import the loan data you downloaded from this link into the table you created (for example, the customers table). Click the file icon on the table interface to start a bulk upload.

Bulk upload icon

The new page will ask you to either paste the URL of your CSV file or drag and drop the CSV file to import. Choose the option that works best for you.

upload data from a CSV file.

In the screenshot below, you can see examples of the loan data that has been added to the database.

loan data

How to access the data from the HarperDB Database

The HarperDB database can be integrated with an exploratory data analysis tool to explore and visualize the data. With a feature called Custom Functions, HarperDB makes it very straightforward to retrieve the data via an API.

This will enable seamless integration of the data analysis process with the database to quickly and efficiently extract insights from the data. Additionally, using HarperDB's Custom Function feature will also save time and make it easier to get the data, which will make the workflow smoother.

So, let’s learn more about Custom Function.

What is a Custom Function?

Custom Functions provide a way to add your own API endpoints to the HarperDB database. The feature is powered by Fastify, a flexible Node.js web framework, so your routes can interact directly with your database. An API request can then automatically deliver data from the database to the application for exploratory data analysis.

Here are the steps you need to follow:

1. Enable Custom Functions
Enable Custom Functions by clicking “functions” in your HarperDB Studio. This feature is not enabled by default.

2. Create your Project
The next step is to create a project by giving it a name, for example loan-api-v1. HarperDB will also generate project configuration files, including:

  • Routes folder
  • File to add helper functions
  • Static folder

Note: You will focus on the routes folder.

3. Define a Route
Create the first route to retrieve the loan data from the customers table in the HarperDB data store. Note that route URLs are resolved as follows:

[Instance URL]:[Custom Functions Port]/[Project Name]/[Route URL]

It will include:

  • Cloud Instance URL
  • Custom Functions Port
  • Project name you have created
  • The route you have defined

In the route file (example.js) from the function page, you can see some template code as an example. You need to replace that code with the following code:

In the code above, the GET method defines the route for /loan-api-v1. The handler function then sends an SQL query to the database to fetch all of the data from the customers table.

The SQL query fetches the following columns from the customers table:

  • Gender
  • Married
  • Dependents
  • Education
  • Self_Employed
  • ApplicantIncome
  • CoapplicantIncome
  • LoanAmount
  • Loan_Amount_Term
  • Credit_History
  • Property_Area
  • Loan_Status

4. Use the API URL to access the data
You can now easily access your data using the API URL. For example, my API URL is https://functions-1-mlproject.harperdbcloud.com/loan-api-v1. Open your web browser and paste your URL to view the data from your database.

As you can see, the API call makes it possible to quickly access the customer data that is stored in the HarperDB database.

Steps to perform Automated EDA with Sweetviz

Sweetviz is an open-source Python library that generates stunning visualizations that are packed with insightful information to get exploratory data analysis off the ground with just two lines of code.

The library can be used to create a visual comparison of datasets and variables. The output is a fully self-contained HTML file that can be opened in a web browser.

Install Sweetviz

This library can be installed using the following command:

pip install sweetviz

Collect data from the API

We use the API endpoint to access the data, and the Pandas library to load it so it can be passed to the Sweetviz library.
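A small helper along these lines can fetch the records (the endpoint URL shown in the comment is the example from this tutorial; yours will differ):

```python
def fetch_loans(api_url, session=None):
    """Fetch all loan records from the Custom Functions endpoint as JSON."""
    if session is None:  # default to the real requests library
        import requests
        session = requests
    response = session.get(api_url)
    response.raise_for_status()  # raise an error if the API call failed
    return response.json()       # the endpoint returns a list of records

# Replace the URL below with your own Custom Functions endpoint:
# data = fetch_loans("https://functions-1-mlproject.harperdbcloud.com/loan-api-v1")
```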

The code above sends a request to the API endpoint using the Python requests package, and the data is extracted in JSON format.

Load data into pandas Dataframe

The extracted data from the API endpoint is then loaded into a DataFrame using the Pandas package.
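A minimal sketch, using a tiny hypothetical sample in place of the real API response:

```python
import pandas as pd

# `data` holds the JSON records returned by the API; a small hypothetical
# sample is shown here in place of the real response.
data = [
    {"Gender": "Male", "Married": "Yes", "LoanAmount": 128, "Loan_Status": "Y"},
    {"Gender": "Female", "Married": "No", "LoanAmount": 66, "Loan_Status": "N"},
]

df = pd.DataFrame(data)  # each JSON record becomes one row
print(df.head())         # inspect the top five rows
```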

Top five rows

As you can see in the screenshot above, we were able to access the data via an API endpoint and load the data using a Pandas DataFrame.

Now, with the help of Sweetviz, we can automatically explore and visualize the data. We use only two methods from Sweetviz as follows:

  • analyze() method: This method analyzes the entire dataset and automatically produces a detailed EDA report with visualizations.
  • show_html() method: This method renders the report to an HTML file that can be opened in a web browser.

The report loan_analysis.html has been generated. Let’s open it in a web browser to view the EDA report.

EDA using Sweetviz library

As you can see, our EDA report is complete and offers a wealth of information, with visualizations for each attribute. It is easy to understand and was produced in just three lines of code.

From the screenshot above, you can see that the largest loan amount requested was 700 (by a single customer) and the smallest was 9 (also by a single customer). Insights like this can help loan officers make data-driven decisions.

Complete code

Below is the complete code. In just nine lines, it will automatically:

  • Access data from your database using an API call.
  • Load the data using the Pandas library.
  • Analyze the data and provide a detailed EDA report.

As new data is added to your HarperDB database, you can simply rerun the code above to generate a fresh EDA report and gather more insights from the data.

Conclusion

Congratulations 🎉, You have made it to the end of this article. You have learned:

  • How to manage your data using a HarperDB cloud instance.
  • How to access your data from HarperDB using a Custom Function.
  • How to automate EDA on data from the HarperDB database using the Sweetviz Python library.

If you learned something new or enjoyed reading this article, please share it so that others can see it. Until then, see you in the next post!

You can also find me on Twitter at @Davis_McDavid.


This article was first published here.
