How to use SingleStore with Spark ML for Fraud Detection — 2/3

Akmal Chaudhri
6 min read · Sep 2, 2021


Load the Credit Card data into SingleStore

Abstract

In the first part of this Fraud Detection series, we created and configured a Databricks CE cluster. We’ll load credit card data into our Spark environment in this second article. We’ll also compare the performance of the SingleStore Spark Connector against JDBC for loading our data into SingleStore from Spark.

The notebook files used in this article series are available on GitHub in DBC, HTML and iPython formats.

Introduction

This is a multi-part article series, and it is structured as follows:

  1. Configure Databricks CE.
  2. Load the Credit Card data into SingleStore.
  3. Create and evaluate a Logistic Regression model.

This second article covers Part 2, Load the Credit Card data into SingleStore. If you are following along with this series, please ensure that you have completed the setup and requirements described in the first article.

In our SingleStore Managed Service account, let’s use the SQL Editor to create a new database. Call this fraud_detection, as follows:

CREATE DATABASE IF NOT EXISTS fraud_detection;

We’ll run the command and check that it completed.

We need some data for our use case. We can find actual credit card data on Kaggle. If you don’t have an account at Kaggle, create one and download the creditcard.csv file. The Kaggle website states that this file is 143.84 MB in size.

The data are anonymised credit card transactions containing genuine and fraudulent cases. The transactions occurred over two days during September 2013, and the dataset includes a total of 284,807 transactions, of which 492 are fraudulent, representing just 0.172% of the total. This dataset, therefore, presents some challenges for analysis as it is highly unbalanced. The dataset consists of the following fields:

  • Time: The number of seconds elapsed between a transaction and the first transaction in the dataset.
  • V1 to V28: Details not available due to confidentiality reasons.
  • Amount: The monetary value of the transaction.
  • Class: The response variable (0 = no fraud, 1 = fraud).
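
For reference, these fields can also be written down as an explicit Spark schema, which avoids relying on schema inference when reading the file. The following is a minimal sketch, assuming doubles for Time, V1 to V28 and Amount, and an integer for Class; later in this article we’ll simply let Spark infer the schema instead:

from pyspark.sql.types import StructType, StructField, DoubleType, IntegerType

# Explicit schema for the credit card dataset (assumed types):
# Time, V1 to V28 and Amount as doubles, Class as an integer.
credit_card_schema = StructType(
    [StructField("Time", DoubleType(), True)]
    + [StructField("V" + str(i), DoubleType(), True) for i in range(1, 29)]
    + [StructField("Amount", DoubleType(), True),
       StructField("Class", IntegerType(), True)]
)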

Upload CSV file

We can upload the CSV file into Databricks CE using the DBFS File Browser. To check that the DBFS File Browser is enabled, we’ll first locate Settings in the left-hand navigation pane and then select the Admin Console. On the Admin Console page, we can see three options:

We’ll select Workspace Settings.

Scrolling down on the Workspace Settings page, we’ll locate the DBFS File Browser, as follows:

In this case, we see that it is disabled. We need to toggle the button on the right-hand side to enable this feature, as follows:

Next, we’ll need to reload this browser page to ensure that the feature is active. Once that is done, we’ll select Data (1) from the left navigation pane, followed by DBFS (2), then FileStore (3) and finally the Upload button (4), as shown in Figure 1.

Figure 1. Upload Button.

Next, we’ll locate the CSV file that we downloaded earlier, drag it into the upload area or use the file browser (1), and then click Done (2), as shown in Figure 2.

Figure 2. Upload the file.

Once the CSV file has successfully been uploaded, there should be a checkmark next to the file (1), and we can click Done (2), as shown in Figure 3.

Figure 3. Data Uploaded.

The CSV file should now be visible in the directory, as shown in Figure 4.

Figure 4. File successfully uploaded.

Fill out the Notebook

Let’s now create a new notebook. We’ll call it Data Loader for Fraud Detection using Logistic Regression. We’ll attach our new notebook to our Spark cluster, as shown in Figure 5.

Figure 5. Attach to Cluster.

Using the pull-down menu options for a code cell (1), we can show the title of the code cell (2), as shown in Figure 6.

Figure 6. Show Title.

We can then edit this title (1), as shown in Figure 7. This will help us organise our code and remind us what we are trying to achieve in each code cell.

Figure 7. First Code Cell.

We’ll add the following code (2):

df = spark.read.csv("/FileStore/creditcard.csv",
                    header = True,
                    inferSchema = True)

This reads the CSV file and creates a Dataframe called df. We also tell Spark that there is a header row and ask it to determine the data types for each column. Finally, we can add a new code cell by hovering over the + sign below an existing code cell (3) and clicking it.

To execute each code cell, we can use Shift + Enter. This first example will create the Dataframe.
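
Before going further, it is worth confirming that the header and the inferred column types look sensible. In a new code cell, we can inspect the schema:

# Print the inferred schema; Time, V1 to V28 and Amount should be doubles,
# and Class should be an integer.
df.printSchema()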

In the next code cell, we’ll count the number of rows, as follows:

df.count()

Executing this, we obtain the value 284807.

In the next code cell, to show the Dataframe, we can use:

display(df)

In the next code cell, to find the number of fraudulent and non-fraudulent transactions, we can use:

display(df.groupBy("Class").count())

This shows us that there are 492 fraudulent (Class = 1) and 284,315 non-fraudulent (Class = 0) transactions.
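
These counts are consistent with the 0.172% fraud rate quoted earlier. As a quick sanity check, we can compute the percentage in a new code cell:

# 492 / 284807 ≈ 0.0017, matching the quoted fraud rate of roughly 0.172%.
fraud_ratio = df.filter(df["Class"] == 1).count() / df.count()
print("Fraudulent transactions: {:.4%}".format(fraud_ratio))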

We are now ready to write the Dataframe to SingleStore. In the next code cell, we can add the following:

%run ./Setup

This will execute the notebook that we created in the previous article. We need to ensure that the server address and password have been added for our SingleStore Managed Service cluster.
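
For reference, a minimal Setup notebook only needs to define the connection details used in the rest of this notebook. The following is a sketch; the variable names cluster, server, port and password are assumptions based on how they are referenced below, and the values are placeholders to be replaced with your own cluster details:

# Hypothetical contents of the Setup notebook (placeholders only).
cluster = "<TO DO>"   # SingleStore Managed Service DDL endpoint
server = "<TO DO>"    # host portion of the endpoint, used for JDBC
port = "3306"         # default SingleStore (MySQL protocol) port
password = "<TO DO>"  # admin password for the cluster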

In the next code cell, we’ll set some parameters for the SingleStore Spark Connector, as follows:

spark.conf.set("spark.datasource.singlestore.ddlEndpoint", cluster)
spark.conf.set("spark.datasource.singlestore.user", "admin")
spark.conf.set("spark.datasource.singlestore.password", password)
spark.conf.set("spark.datasource.singlestore.disablePushdown", "false")

These parameters specify the SingleStore cluster endpoint, the username and password, and whether Pushdown is enabled or disabled. We’ll discuss Pushdown in a separate article.

Finally, we are ready to write the Dataframe to SingleStore using the Spark Connector:

%%time

(df.write
    .format("singlestore")
    .option("loadDataCompression", "LZ4")
    .mode("overwrite")
    .save("fraud_detection.credit_card_tx"))

This will write the Dataframe to a table called credit_card_tx in the fraud_detection database. We can check from SingleStore that this table was created successfully. Using %%time in the code cell allows us to measure the duration of the write operation, which we can then compare with JDBC.
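
One quick way to verify the table from the notebook itself is to read it back through the Spark Connector and count the rows. This is a small check, using the same connector configuration as above:

# Read the table back from SingleStore and confirm the row count matches 284807.
df_check = (spark.read
    .format("singlestore")
    .load("fraud_detection.credit_card_tx"))

df_check.count()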

Visually, Figure 8 shows our Spark Connector setup.

Figure 8. SingleStore and Spark Connector.

In the next code cell, we can set the JDBC variables. Since SingleStore is compatible with the MySQL wire protocol, we can use a standard jdbc:mysql connection URL, as follows:

jdbc_url = "jdbc:mysql://{0}:{1}/{2}".format(server, port, "fraud_detection")

properties = {
    "user": "admin",
    "password": password,
}

In the next code cell, we can save the Dataframe to SingleStore using JDBC:

%%time

(df.write
    .jdbc(url = jdbc_url,
          table = "credit_card_tx2",
          mode = "overwrite",
          properties = properties))

This will write the Dataframe to a table called credit_card_tx2 in the fraud_detection database. We can check from SingleStore that this table was created successfully. Using %%time in the code cell again allows us to measure the duration of the write operation.

Visually, Figure 9 shows our JDBC setup.

Figure 9. SingleStore and Spark using JDBC.

Comparing the two %%time values, it should be clear that, out of the box, the SingleStore Spark Connector is much faster than JDBC. Therefore, we could use Spark to perform ETL operations with SingleStore.

Summary

In this second article in this Fraud Detection series, we have successfully loaded our credit card data into a Spark Dataframe, checked the number of rows and then written the Dataframe to SingleStore using both the SingleStore Spark Connector and JDBC. Comparing the performance of these two approaches, we should see that, out of the box, the SingleStore Spark Connector is much faster than JDBC.

