End-to-End Data Application Using the Modern Data Stack: Snowflake, Snowpark, Streamlit, and dbt (Part 1)

DataMount
Apr 14, 2023 · 12 min read


The modern data stack refers to the latest technologies and tools used for collecting, processing, storing, and analyzing data in a modern, scalable, and efficient manner. It encompasses a wide range of technologies that work together to enable organizations to harness the power of data and derive valuable insights for decision-making.

At the core of the modern data stack is cloud-based data warehousing, which allows for cost-effective and scalable storage of large volumes of data. Cloud data warehouses such as Amazon Redshift, Google BigQuery, and Snowflake provide powerful data processing capabilities and seamless integration with other data tools.

Data ingestion, integration, and orchestration tools like Apache Kafka, Apache NiFi, and Apache Airflow enable organizations to efficiently collect data from various sources and integrate it into their data pipelines. These tools provide capabilities for data cleansing, transformation, and enrichment, ensuring that data is in a consistent format for analysis.

Modern data visualization and analytics tools like Tableau, Power BI, and Looker offer intuitive and interactive ways to explore and analyze data, making it accessible to business users and decision-makers without extensive technical knowledge. These tools provide interactive dashboards, visualizations, and ad-hoc querying capabilities to gain insights and drive data-driven decision-making.


Advanced analytics and machine learning tools like Python, R, and TensorFlow enable organizations to apply advanced statistical techniques and machine learning algorithms to extract actionable insights from data. These tools allow for data exploration, model development, and deployment, enabling organizations to leverage the power of artificial intelligence and machine learning in their data analysis workflows.

Additionally, modern data stack architectures often involve the use of data lake technologies such as Apache Hadoop, Apache Spark, and Delta Lake, which allow for large-scale data storage and processing in a distributed and scalable manner. Data lakes provide flexibility in data storage, allowing organizations to store both structured and unstructured data in raw form and process it as needed for analysis.

Overall, the modern data stack comprises a wide range of cutting-edge technologies that provide organizations with the capabilities to collect, store, process, analyze, and visualize data in a scalable and efficient manner, enabling data-driven decision-making and unlocking the full potential of data for businesses.

We will be working on a UK crime analysis project to get familiar with modern data tools and build an end-to-end application as the outcome of the project.

The UK Crime Analysis Technical Report is a comprehensive, data-driven document that provides insights and analysis on crime trends and patterns in the United Kingdom. The report is based on a wide range of data sources, including official crime statistics, police records, victim surveys, and other relevant datasets. It utilizes advanced data analytics techniques, such as statistical analysis, machine learning, and data visualization, to uncover patterns and trends in crime data, identify emerging issues, and provide evidence-based recommendations for crime prevention and intervention strategies. The report serves as a valuable resource for law enforcement agencies, policymakers, and researchers, giving them insight into the nature and dynamics of crime in the UK and informing evidence-based decision-making to address crime challenges effectively.

Our data source will be https://data.police.uk/data/

The overall architecture of the project will look like this:

To start, install Anaconda, Streamlit, and SnowSQL, and create a Snowflake account.

First, we have to upload the data into our Snowflake account. Our data looks like this:

Step 1: Loading the data into a Snowflake stage

-- Create a database
CREATE DATABASE my_database;

-- Use the database
USE DATABASE my_database;

-- Create a schema
CREATE SCHEMA my_schema;

-- Create a stage
CREATE STAGE my_stage
URL = 's3://my_bucket/my_stage'
CREDENTIALS = (
  AWS_KEY_ID = 'my_access_key'
  AWS_SECRET_KEY = 'my_secret_key'
);

In this example, we first create a database named my_database using the CREATE DATABASE statement. Then, we use the USE DATABASE statement to switch to the newly created database.

Next, we create a schema named my_schema within the my_database using the CREATE SCHEMA statement.

Finally, we create a stage named my_stage using the CREATE STAGE statement. The URL parameter specifies the location of the stage, which in this case is an Amazon S3 bucket (s3://my_bucket/my_stage). The CREDENTIALS parameter specifies the access key and secret access key for authentication with the S3 bucket.

Note: Snowflake stages are used as external storage locations for loading data into and unloading data from Snowflake. They provide a way to ingest data into Snowflake from external sources or export data from Snowflake to external destinations.

For our example, we will be using an internal stage.

In Snowflake, there are three types of stages: internal stages, named stages, and external stages. Each type of stage has different use cases and configurations.

  1. Internal Stages: Internal stages store data files inside Snowflake-managed storage. Every user and every table automatically gets its own internal stage (the user stage and the table stage), and you can also create named internal stages. Internal stages require no additional cloud setup, which makes them the simplest option for staging local files that you want to load into Snowflake tables.
  2. Named Stages: Named stages are user-defined stages that can be created in a Snowflake account. They can be used for data ingestion or data export purposes. Named stages provide a way to load data into Snowflake from external sources or export data from Snowflake to external destinations. Named stages can be created as either “named internal stages” or “named external stages”.
  3. External Stages: External stages are used to read data from or write data to external cloud storage services, such as Amazon S3, Google Cloud Storage, or Microsoft Azure Blob Storage. External stages are created in Snowflake as pointers to external data files stored in cloud storage, allowing Snowflake to read or write data from/to those external files. External stages enable Snowflake to leverage the scalability and durability of cloud storage services for data loading and unloading operations.

Named stages and external stages can be created and configured with various options, such as providing credentials for authentication, defining file formats, specifying encryption settings, and setting up notifications. These stages provide flexibility and versatility in ingesting and exporting data in Snowflake, allowing users to seamlessly integrate with external data sources and destinations.
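
The post relies later on a named internal stage called uk_crime_stage and a file format called UK_CRIME_FILE_FORMAT, but their definitions are not shown. Here is a minimal sketch of how they could be created; the format options (delimiter, header row, quoting, null handling) are assumptions, so adjust them to match your files.

-- Sketch: CSV file format used by the COPY commands later in this post
-- (options are assumptions; tune them to your files)
CREATE OR REPLACE FILE FORMAT UK_CRIME_FILE_FORMAT
TYPE = 'CSV'
FIELD_DELIMITER = ','
SKIP_HEADER = 1
FIELD_OPTIONALLY_ENCLOSED_BY = '"'
EMPTY_FIELD_AS_NULL = TRUE;

-- Sketch: named internal stage that uses this file format by default
CREATE OR REPLACE STAGE uk_crime_stage
FILE_FORMAT = UK_CRIME_FILE_FORMAT;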

Once the stage is ready, use the LIST command to explore it. Then use SnowSQL to connect to your Snowflake instance from your local machine.

To connect Snowflake to a local machine using SnowSQL, you would need to follow these steps:

  1. Install SnowSQL: SnowSQL is a command-line client for Snowflake that allows you to interact with Snowflake from your local machine. You can download and install SnowSQL from the official Snowflake documentation at https://docs.snowflake.com/en/user-guide/snowsql-install-config.html.
  2. Configure SnowSQL: Once SnowSQL is installed, you need to configure it with your Snowflake account information. Open a terminal or command prompt and run the following command:
snowsql -a <account_name> -u <username> -r <role> -w <warehouse> -s <schema>

Replace <account_name> with your Snowflake account name, <username> with your Snowflake username, <role> with your Snowflake role, <warehouse> with your Snowflake warehouse, and <schema> with your desired Snowflake schema.

For example:

snowsql -a my_account_name -u my_username -r my_role -w my_warehouse -s my_schema

3. Enter Password: After running the above command, SnowSQL will prompt you to enter your Snowflake account password. Enter your password and press Enter.

4. Start Querying: Once you are successfully connected, you can start running SQL queries against your Snowflake account using SnowSQL. Simply type your SQL query in the terminal or command prompt and press Enter to execute the query.

SnowSQL provides a convenient way to interact with Snowflake from your local machine, allowing you to run SQL queries, upload/download data, and perform other data-related operations using Snowflake’s features and capabilities.
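
For example, once connected you can run a quick query to confirm the session context before loading anything:

-- Verify which account, user, role, warehouse, and database the session is using
SELECT CURRENT_ACCOUNT(), CURRENT_USER(), CURRENT_ROLE(), CURRENT_WAREHOUSE(), CURRENT_DATABASE();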

Once in SnowSQL, you can perform all the operations that you would perform in a worksheet. Upload the data files we previously kept in our local folder.

In Snowflake, you can use the PUT command to upload data into a stage. The PUT command copies data from a local file on your machine to an internal stage in Snowflake (it does not work with external stages). Here's an example of how you can use the PUT command:

  1. Use the PUT command: Once the stage is created, you can use the PUT command to upload data from a local file on your machine to the stage. Here's an example:
-- Upload data to stage
PUT file:///path/to/local_file.csv @my_stage;

In the above example, we use the PUT command to upload data from a local file (file:///path/to/local_file.csv) to the my_stage stage.

Note: The file path specified in the PUT command can be a local file path on your machine or a network file path accessible from your machine. The @ symbol before the stage name (@my_stage) indicates a reference to a stage.

Once the PUT command is executed, Snowflake will upload the data from the local file to the specified stage, making it available for further processing within Snowflake, such as loading into tables or performing data transformations.

Once the data is uploaded, verify the files in the stage and close the SnowSQL connection.
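
Applied to this project, the upload might look roughly like the sketch below. The local paths are placeholders, and the exact local file names are assumptions based on the file names that appear in the stage later; PUT gzip-compresses files by default (AUTO_COMPRESS = TRUE), which is why they show up in the stage with a .gz suffix.

-- Upload the project files from the local folder to the internal stage (paths are placeholders)
PUT file:///path/to/all_crimes21_hdr.txt @uk_crime_stage;  -- stored as all_crimes21_hdr.txt.gz
PUT file:///path/to/LSOA_pop_v2.csv @uk_crime_stage;       -- stored as LSOA_pop_v2.csv.gz
PUT file:///path/to/postcodes.gz @uk_crime_stage;          -- already compressed; uploaded as-is

-- Verify the uploaded files, then exit SnowSQL
LIST @uk_crime_stage;
!exit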

Now let us create the tables for holding the data.

  1. Create Table: Next, create a table in Snowflake that will hold the data you are uploading. You can use SnowSQL to execute a CREATE TABLE statement to define the schema of the table, including column names, data types, and any other desired options.
  2. Upload Data: Use the COPY INTO command in SnowSQL to load the staged data into the Snowflake table. The COPY command allows you to specify the file format, the file location within the stage, and other options for data loading. Here’s an example:
-- Create table
CREATE TABLE my_schema.my_table (
col1 STRING,
col2 INT,
col3 FLOAT
);

-- Upload data
COPY INTO my_schema.my_table
FROM @my_stage/my_file.csv
FILE_FORMAT = (TYPE = CSV FIELD_DELIMITER = ',' SKIP_HEADER = 1);

In the above example, we create a table named my_table in the my_schema schema. Then, we use the COPY command to load data from a CSV file named my_file.csv located in a named stage called my_stage into the my_table table. We also specify the file format as CSV and provide options such as field delimiter and skip header.

Note: Before using the COPY command, you may need to create an external stage or a named stage in Snowflake to store the data file that you want to upload. You can use SnowSQL to create stages and configure their properties as needed.

Once the COPY command is executed, Snowflake will load the staged data into the table, making it available for querying and analysis within Snowflake.

Here is the code we used to load the project data, along with statements for debugging errors in the load operation:

list @uk_crime_stage;

select
t.$1,t.$2,t.$3,t.$4,t.$5,t.$6,t.$7
from @uk_crime_stage/LSOA_pop_v2.csv.gz as t;


SELECT t.$1,t.$2,t.$3,t.$4,t.$5,t.$6,t.$7,t.$8,t.$9,t.$10,t.$11,t.$12 FROM @uk_crime_stage/all_crimes21_hdr.txt.gz t limit 10;

CREATE OR REPLACE TABLE CRIME_DF (
  Crime_ID VARCHAR(255),
  Month VARCHAR(50),
  Reported_by VARCHAR(255),
  Falls_within VARCHAR(255),
  Longitude FLOAT,
  Latitude FLOAT,
  Location VARCHAR(255),
  LSOA_code VARCHAR(50),
  LSOA_name VARCHAR(255),
  Crime_type VARCHAR(50),
  Last_outcome_category VARCHAR,
  Context VARCHAR
);

select * from crime_df;

COPY INTO CRIME_DF
FROM @uk_crime_stage/all_crimes21_hdr.txt.gz file_format = 'UK_CRIME_FILE_FORMAT' ON_ERROR='CONTINUE';

COPY INTO CRIME_DF
FROM @uk_crime_stage/all_crimes21_hdr.txt.gz ON_ERROR='CONTINUE';



CREATE OR REPLACE TABLE save_copy_errors AS SELECT * FROM TABLE(VALIDATE(crime_df, JOB_ID=>'01ab8d15-3200-b324-0003-c30a0001c0b6'));

select * from save_copy_errors;
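
-- Tip (not part of the original script): instead of pasting the COPY query ID by hand,
-- VALIDATE also accepts the special value '_last', which validates the most recent
-- COPY INTO statement executed in the current session.
CREATE OR REPLACE TABLE save_copy_errors AS
SELECT * FROM TABLE(VALIDATE(crime_df, JOB_ID => '_last'));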


truncate table crime_df;

select * from crime_df limit 100;


CREATE OR REPLACE TABLE LAOS_DF (
  date TIMESTAMP_NTZ(9),
  geography VARCHAR(50),
  geography_code VARCHAR(255),
  Rural_Urban VARCHAR(255),
  usual_residents INTEGER,
  males INTEGER,
  females INTEGER,
  household INTEGER,
  communal_establishment INTEGER,
  non_term_time_address INTEGER,
  Area_Hectares FLOAT,
  density FLOAT
);

COPY INTO LAOS_DF
FROM @uk_crime_stage/LSOA_pop_v2.csv file_format = 'UK_CRIME_FILE_FORMAT' ON_ERROR='CONTINUE';


CREATE OR REPLACE TABLE POSTCODES_DF (
  Postcode VARCHAR(50),
  Inuse VARCHAR(50),
  Latitude FLOAT,
  Longitude FLOAT,
  Easting INTEGER,
  Northing INTEGER,
  GridRef VARCHAR,
  County VARCHAR(50),
  District VARCHAR(50),
  Ward VARCHAR,
  DistrictCode VARCHAR(50),
  WardCode VARCHAR(50),
  Country VARCHAR(50),
  CountyCode VARCHAR(50),
  Constituency VARCHAR(50),
  Introduced TIMESTAMP_NTZ(9),
  Terminated TIMESTAMP_NTZ(9),
  Parish VARCHAR(50),
  NationalPark VARCHAR(50),
  Population VARCHAR(50),
  Households VARCHAR(50),
  BuiltUpArea VARCHAR,
  Built_upsubdivision VARCHAR(50),
  Lowerlayersuperoutputarea VARCHAR,
  Rural_urban VARCHAR,
  Region VARCHAR,
  Altitude FLOAT
);

COPY INTO POSTCODES_DF
FROM @uk_crime_stage/postcodes.gz file_format = 'UK_CRIME_FILE_FORMAT' ON_ERROR='CONTINUE';

truncate table postcodes_df;

CREATE OR REPLACE TABLE save_copy_errors AS SELECT * FROM TABLE(VALIDATE(POSTCODES_DF, JOB_ID=>'01ab6f9c-3200-b1ee-0003-c30a00010052'));

select * from save_copy_errors;
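
-- Optional sanity check (not in the original script): row counts for each loaded table
SELECT 'CRIME_DF' AS table_name, COUNT(*) AS row_count FROM crime_df
UNION ALL
SELECT 'LAOS_DF', COUNT(*) FROM laos_df
UNION ALL
SELECT 'POSTCODES_DF', COUNT(*) FROM postcodes_df;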

Once done, we will move toward building our first dashboard using Snowsight, to get insights about crimes whose outcome was a prison sentence.

Here are some sample queries to explore that data:

-- Use case 1: Find insights about crimes where the offender was sent to prison.

select * from crime_df limit 100;

select * from crime_df where last_outcome_category is not null;

select ((select count(*) from crime_df where last_outcome_category is not null)/(select count(*) from crime_df))*100 as percentage_outcome_results;

select distinct(last_outcome_category),count(*) as number_of_occurence from crime_df where last_outcome_category is not null group by last_outcome_category order by number_of_occurence desc;

select * from crime_df where last_outcome_category like 'Offender sent to prison%' ;

CREATE OR REPLACE TRANSIENT TABLE PRISON_DF AS select * from crime_df where last_outcome_category like 'Offender sent to prison%' ;

select count(*) from prison_df;

Step 2: Building the prison data analysis dashboard

Now go to the Home screen.

Click on the Dashboards option there.

Click on the + icon and give your dashboard a meaningful name: “Prison Data Analysis”.

You will see an empty dashboard, and we will build it into a dashboard like this:

First, let’s create a filter. Click on the + icon to create a filter and fill in the details.

Next, click on the Edit Query option, paste the following query, and click Done.

select distinct(crime_type) from crime_df;

Your filter is ready for the dashboard, with ACCOUNTADMIN as the owner. If you want, you can work with SYSADMIN instead, which is the recommended way of doing things in Snowflake.
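
If you do want to work from SYSADMIN, that role needs access to the warehouse and to the objects created earlier. Here is a minimal sketch using the placeholder names from the examples above (my_database, my_schema, my_warehouse); substitute your actual object names.

-- Grant SYSADMIN access to the example objects (names are placeholders from earlier examples)
USE ROLE ACCOUNTADMIN;
GRANT USAGE ON WAREHOUSE my_warehouse TO ROLE SYSADMIN;
GRANT USAGE ON DATABASE my_database TO ROLE SYSADMIN;
GRANT USAGE ON SCHEMA my_database.my_schema TO ROLE SYSADMIN;
GRANT SELECT ON ALL TABLES IN SCHEMA my_database.my_schema TO ROLE SYSADMIN;
USE ROLE SYSADMIN;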

Click on the + icon and paste the following query:

select distinct(latitude), longitude from PRISON_DF limit 10000;

Select the chart type, adjust the X and Y axes, and your first chart is ready.

Click on Done and repeat the same steps for the other charts with the following code blocks.

select
  distinct(LATITUDE),
  LONGITUDE
from
  PRISON_DF
where
  CRIME_TYPE = :crime_type
limit
  10000;

select
  distinct(month),
  count(*) number_of_occurance
from
  PRISON_DF
where
  crime_type = :crime_type
group by
  month
order by
  month desc;

select
  distinct(crime_type),
  count(*) number_of_occurance
from
  PRISON_DF
group by
  crime_type
order by
  number_of_occurance desc
limit
  15;

select
  LOCATION,
  count(*) location_repitation_number
from
  PRISON_DF
where
  crime_type = :crime_type
group by
  location
order by
  location_repitation_number desc
limit
  10;

select
  LOCATION,
  count(*) OCCURENCE
from
  PRISON_DF
where
  crime_type = :crime_type
group by
  location,
  CRIME_TYPE
order by
  OCCURENCE desc
limit
  10;

select
  reported_by,
  count(*) number_of_reports
from
  PRISON_DF
where
  crime_type = :crime_type
group by
  reported_by
order by
  number_of_reports desc
limit
  10;

select
  distinct(substr(month, 1, 4)) Year,
  count(*) number_of_occurance
from
  PRISON_DF
where
  crime_type = :crime_type
group by
  Year
order by
  Year desc;

Once you are done with all these queries, you will have your first in-house product: the “Prison Data Analysis” dashboard.

Congratulations.

Part-2 is coming.

YouTube link: https://www.youtube.com/watch?v=9YgR-fRXRiY&t=841s

YouTube link: https://youtu.be/DSuOCXKtwsU

Connect with me on LinkedIn: https://www.linkedin.com/in/satyajit-maitra-b51b58150/
