DATA LOADING AUTOMATION USING MICROSOFT SSIS

Babatunde Akeredolu
8 min read · Oct 7, 2022

--

Microsoft SSIS is a powerful ETL tool used to solve complex business problems. It can extract, transform, and load data from a wide variety of sources such as XML data files, flat files, RDBMS sources, etc. In this project, I created an ETL package that automatically populates a SQL Server database from local storage.

Problem Statement

Company X receives live data that needs to be loaded into its database periodically; the interval could be every ten hours, every 24 hours, or weekly. Additionally, the data has to be loaded into four different tables. Loading the data manually would take a lot of time since it must be done at every interval.

Solutions Provided

With the help of Microsoft SQL Server Integration Services (SSIS), I created a project that loads the data automatically into a SQL Server database. Two major problems were solved here:

1. Automation of the whole process, freeing up several hours that can be spent on more productive tasks.

2. Data Modelling — from a single master dataset, the ETL package loads the data into four different tables with appropriate data types, and relationships are defined between those tables.

For this project, I used my computer’s local storage as the data source.

Tools Used

Microsoft SSIS

Microsoft SQL Server

Project steps

1. Package 1

1.1 Create connection

1.2 Create procedure

1.3 Update variable

1.4 Move file

2. Package 2

2.1 Get variable value

2.2 Data flow task

2.2.1 Excel source

2.2.2 Data conversion

2.2.3 Tables

2.2.4 Multicast

2.3 Event handlers

3. Package 3

3.1 Remove duplicates

3.2 Get variable value

3.3 Move file

4. Master controller

5. Automation

1. PACKAGE 1

In the Control Flow tab, I created two tasks: an Execute SQL task and a File System task. The Execute SQL task is named update variable, while the File System task is named move file.

More details are below:

1.1 Create connection

The first thing I did was create two connections: an OLE DB connection and a File connection. The OLE DB connection provides access to the SQL Server database, while the File connection provides access to the local directory.

1.2 Create procedure

I created a procedure in the SQL database and named it update_var_value. Its purpose is to update the var_value column by incrementing it by 1; since the table has only one row and one column, no WHERE clause is needed. Before creating this procedure, I had already created a table named variable with one column, var_value, whose value was set to zero. You will understand the need for this table and procedure as you read along.
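A minimal sketch of that table and procedure, assuming an integer column (the exact data type and procedure body are not shown here, so treat this as illustrative):

CREATE TABLE variable (var_value INT);
INSERT INTO variable (var_value) VALUES (0);
GO

CREATE PROCEDURE update_var_value
AS
BEGIN
    -- The table holds a single row, so no WHERE clause is needed
    UPDATE variable SET var_value = var_value + 1;
END;
GO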

1.3 Update variable

The next step is to update the variable table by calling the procedure created earlier. Within the Execute SQL task, two things happen: first, the procedure is called, and this updates var_value in the database; second, a SELECT statement returns the value of var_value, and the result is stored temporarily in a new SSIS variable called var. With the help of the OLE DB connection, it was possible to call the procedure and run the SELECT statement through the Execute SQL task. I named this task update variable.

Remember, the procedure updates var_value by adding 1 to it, and the SELECT statement returns this value, which is stored in the var variable in SSIS.
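The statement inside the update variable task could look like the sketch below (the original statement isn't reproduced here; the single-row result set is mapped to the SSIS variable var on the task's Result Set page):

EXEC update_var_value;           -- increment the counter in the database
SELECT var_value FROM variable;  -- single-row result mapped to the SSIS variable var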

1.4 Move file

There are three folders in the local directory — files, sales, and archived.

The files folder contains all six datasets used for this project. The plan is to move each file, one after the other, as the project executes. The first file has to be moved to the sales folder so that package 2 can run successfully; after this, the file is moved from the sales folder to the archived folder. There is a challenge here: all six files have different names, as we can see from the image below.

To move the file, a connection to the local directory is needed, and this was created in 1.1. For this connection to work, the path and name of the file to be moved need to be specified. There are six files, and specifying the path to just one of them means we would have to keep changing the file path each time the package is executed; the whole essence of automation would be defeated if we had to do this. This is where the variable var comes into play: remember, it holds the value returned in the previous task.

Now, let’s specify the file path for the connection. For the first file, the path is

C:\Users\asus\Desktop\ANALYTICS\projects\SSIS\files\SalesData1.xls

To make this dynamic, I used three variables to set the connection: filepath, var, and fullpath.

filepath = C:\Users\asus\Desktop\ANALYTICS\projects\SSIS\files\SalesData

var = 0 (before execution) and 1 (after the first execution).

fullpath = filepath + var + “.xls” — I used concatenation here.

The variable var gets updated at each execution, which makes the connection dynamic. For example:

At first run, var updates from 0 to 1,

fullpath = filepath + var + “.xls”

= C:\Users\asus\Desktop\ANALYTICS\projects\SSIS\files\SalesData + 1 + “.xls”

= C:\Users\asus\Desktop\ANALYTICS\projects\SSIS\files\SalesData1.xls.

For the second run, var updates from 1 to 2 and

fullpath = filepath + var + “.xls”

= C:\Users\asus\Desktop\ANALYTICS\projects\SSIS\files\SalesData2.xls.

I believe this explanation is clear.

One more thing: var holds an integer value, and concatenating a string with an integer will throw an error, so I had to cast the integer to a string using the built-in cast in SSIS expressions, for example (DT_WSTR, 10) @[User::var]. Let’s move on to the second package.

2. PACKAGE 2

In the whole project, I created a total of four packages. The first package has been discussed above; this is the second.

2.1 Get variable value

In this section, I used the Execute SQL task to get the var_value from the variable table just like I did in package 1. The result is stored in the var variable created in this package.

2.2 Data flow task

If you are familiar with SSIS, you’ll know the Data Flow task is where the data is actually extracted, converted, and loaded; its components are described below.

2.2.1 Excel source

I created a connection to access the Excel file; in this connection, the Excel file has to be specified. Since there are six files, the connection has to be dynamic, just like in package 1. I used three variables to achieve this: var, path, and fullfilepath. The fullfilepath variable is what I used as the file path in the connection; it is a concatenation of path and var, i.e.

fullfilepath = path + var + “.xls”.

The var variable updates every time the package is executed, which makes the connection path dynamic.

2.2.2 Data conversion

Once the Excel file has been accessed, the next step is to convert the columns to the appropriate data types before they are loaded into the SQL database. The Data Conversion transformation helps to achieve this.

2.2.3 Tables

For the data to be loaded from SSIS into the SQL database, the tables must already exist there. I used CREATE TABLE statements to create the four tables with their appropriate columns: customers, orders, products, and region.
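The original CREATE TABLE statements aren't reproduced here, so the column names and data types below are assumptions based on the columns mentioned in this article (postal code, city, country) and a typical sales dataset; treat it as an illustrative sketch rather than the exact schema:

-- Dimension-style tables
CREATE TABLE customers (customer_id NVARCHAR(20), customer_name NVARCHAR(100));
CREATE TABLE products  (product_id  NVARCHAR(20), product_name  NVARCHAR(200), category NVARCHAR(50));
CREATE TABLE region    (postal_code NVARCHAR(10), city NVARCHAR(50), country NVARCHAR(50));

-- Fact-style table holding one row per order line
CREATE TABLE orders    (order_id NVARCHAR(20), order_date DATE, customer_id NVARCHAR(20),
                        product_id NVARCHAR(20), postal_code NVARCHAR(10), sales DECIMAL(12, 2));

The relationships between the tables can then be defined with primary and foreign keys, for example ALTER TABLE orders ADD CONSTRAINT fk_orders_region FOREIGN KEY (postal_code) REFERENCES region (postal_code), once postal_code is the primary key of region.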

2.2.4 Multicast

The image below shows the columns from the master data; loading this directly into the SQL database as one table would cause a repetition of values. Imagine storing city, postal code, and country every time an order is made; this is not necessary, as all these details can be retrieved from the region table at any time using the postal code in the orders table.

The purpose of the Multicast transformation is to help us load the data into multiple tables; without it, the data could only be loaded into one destination. Here is the result using Multicast.
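To illustrate the point about the region table (using the illustrative column names from the sketch above), the city and country for any order can always be recovered with a join on the postal code:

SELECT o.order_id,
       o.order_date,
       r.city,
       r.country
FROM orders AS o
JOIN region AS r
    ON r.postal_code = o.postal_code;  -- region details are stored once, not on every order row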

2.3 Event handlers

The purpose of the event handler is to log any error encountered during the execution of this project. I created a table called logerror in the database with two columns (package_name and error_description). There is an Execute SQL task within the event handler; if any error occurs, this task picks up the name of the package where the error occurred and the error description, and then inserts these two values into the logerror table in the database.
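A minimal sketch of that logging setup, assuming the SSIS system variables System::PackageName and System::ErrorDescription are mapped to the two parameters of the Execute SQL task (the column types are assumptions):

CREATE TABLE logerror (
    package_name      NVARCHAR(200),
    error_description NVARCHAR(MAX)
);

-- Statement inside the event handler's Execute SQL task (OLE DB connection);
-- the two system variables are mapped to the ? placeholders on the Parameter Mapping page
INSERT INTO logerror (package_name, error_description)
VALUES (?, ?);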

Let’s move on to the third package.

3. PACKAGE 3

3.1 Remove duplicates

Once the data has been loaded successfully, the integrity of the data must be maintained. In this section, I used the Execute SQL task to remove duplicates; within this task, there is a query that removes the duplicates from the four tables. The query is shown in the image below:
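As a rough sketch of what such a query can look like (the project's exact query may differ), a common T-SQL pattern numbers identical rows and deletes all but the first. It is shown here for the orders table with the illustrative columns from earlier; the same idea applies to the other three tables:

WITH numbered AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY order_id, order_date, customer_id, product_id, postal_code, sales
               ORDER BY order_id
           ) AS rn
    FROM orders
)
DELETE FROM numbered
WHERE rn > 1;  -- keep exactly one copy of each duplicated row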

3.2 Get variable value

The second part of this package is to get the var_value from the database and store the result in the var variable in this package. The purpose of this is to make the connection in the next section dynamic.

3.3 Move file

Once the data has been loaded successfully into the database and the duplicates removed, the next step is to move the Excel file from the sales folder into the archived folder.

4. MASTER CONTROLLER

The purpose of this package is to set the flow of the three previous packages. From the image below, you can see the flow: the first package, P1, has to be completed before moving to the second package, P2, and then to the third, P3. To run the three packages, I just need to execute the master controller, and all three packages will run successfully.

5. AUTOMATION

Now let’s see how the automation works. To automate the whole process, the project that contains the packages needs to be deployed to the Integration Services server. After deployment, the schedule for the package was set up in SQL Server Agent. The video below shows the automation of the whole process.
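For reference, the SQL Server Agent job can also be scripted; the sketch below assumes the project was deployed to the SSISDB catalog, and the job, folder, project, and package names are purely illustrative (the same job can be built through the Agent UI, which also generates the fully escaped dtexec arguments):

USE msdb;
GO

-- Job with a single step that runs the deployed master controller package
EXEC dbo.sp_add_job @job_name = N'Load sales data';

EXEC dbo.sp_add_jobstep
     @job_name  = N'Load sales data',
     @step_name = N'Run master controller',
     @subsystem = N'SSIS',
     -- simplified dtexec arguments; folder, project, and package names are illustrative
     @command   = N'/ISSERVER "\SSISDB\SSIS\SalesLoad\MasterController.dtsx" /SERVER "."';

-- Daily schedule (freq_type 4 = daily), starting at 01:00:00; adjust to the required interval
EXEC dbo.sp_add_schedule
     @schedule_name     = N'Daily load',
     @freq_type         = 4,
     @freq_interval     = 1,
     @active_start_time = 010000;

EXEC dbo.sp_attach_schedule
     @job_name      = N'Load sales data',
     @schedule_name = N'Daily load';

-- Register the job on the local server so the Agent will run it
EXEC dbo.sp_add_jobserver @job_name = N'Load sales data';
GO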

THIS IS THE END OF THE PROJECT.

You can also check my other pages:

LinkedIn profile

GitHub repository

Twitter profile

Tableau profile
