How to use the Data Factory Lookup activity to read data from storage during a pipeline run

Ilse Epskamp
Published in Azure Tutorials · 6 min read · Feb 7, 2022

When designing a pattern with Data Factory, you can choose from a wide range of features and activities to handle your data requirements. For example, Data Factory activities enable you to:

  • control your data flow using loops, conditional execution, variables and parameters;
  • transform data;
  • read/write data from/to various sources;
  • schedule and monitor pipeline jobs.

Today’s blog will focus on the Lookup activity, which can read data stored in a database or file system so you can use its content in your pipeline. In this hands-on blog we will demonstrate how to use the Lookup activity to read data from a JSON file located in a storage account, and use a value from the JSON file in your pipeline.

Prerequisites

  • A Linked Service for the storage account is available
  • A JSON file is available in the storage account

The JSON file that will be used in this example has the following structure:

{
  "filename": "jsonfile.json",
  "status": "active"
}

We store this file in ADLS at tst@youradlsgen2account.dfs.core.windows.net/dir/jsonfile.json. The pipeline should access ADLS, open the JSON file and fetch the value for the key status. For this task you can use the Lookup activity and the Set variable activity.

Data Factory pipeline with Lookup and Set variable activity.

Step 1: Create a dataset that represents the JSON file

Create a new dataset that represents the JSON file. Our JSON file is located in ADLS Gen 2, so we select New Dataset > Azure Data Lake Storage Gen2 > JSON.

Select JSON as format for your data.

After clicking OK, you will see the dataset in the editor. On the Connection tab, you need to define the Linked service and the file path to the JSON file. For this purpose we will create three dataset parameters and reference them in the file path.

Go to the Parameters tab and create a Container, Directory and FileName parameter. In a production environment it is recommended to pass the values for these parameters to the dataset dynamically. This enables you to reuse a dataset for multiple pipeline runs, because the path to the reference file is never hardcoded but always configured dynamically.

Let's explain this with an example. Assume you have configured a Blob Event trigger on a storage account directory. Whenever a JSON file is created in that directory, a Data Factory pipeline is triggered to perform some actions depending on values in the JSON file. Within the triggered pipeline run you will have a parameter called triggerFileName available, which contains the filename of the JSON file that triggered the pipeline run. You can access this parameter with @pipeline().parameters.triggerFileName. This enables you to dynamically pass the name of the JSON file that triggered the pipeline run to the dataset of the Lookup activity, so your pipeline can access the file in your storage account.
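
To give an impression of what this looks like in the pipeline JSON, the dataset reference inside the Lookup activity could pass the trigger filename as in the sketch below. This is a minimal sketch, assuming a dataset named JsonFileDataset and a pipeline parameter named triggerFileName that is filled by the trigger; adjust the names to your own setup.

"dataset": {
  "referenceName": "JsonFileDataset",
  "type": "DatasetReference",
  "parameters": {
    "Container": "tst",
    "Directory": "dir",
    "FileName": "@pipeline().parameters.triggerFileName"
  }
}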

For the purpose of this blog we will hardcode the file path details in the parameters: we assign the value tst to Container, dir to Directory and jsonfile.json to FileName.

Define Dataset parameters, pointing to the location of your JSON file.
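
In the dataset JSON, these parameters with their hardcoded defaults look roughly like the sketch below (the property names follow the ADF dataset schema; the default values are the ones used in this blog):

"parameters": {
  "Container": { "type": "string", "defaultValue": "tst" },
  "Directory": { "type": "string", "defaultValue": "dir" },
  "FileName": { "type": "string", "defaultValue": "jsonfile.json" }
}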

Now go back to the Connection tab and reference the dataset parameters in the file path by using

@dataset().ParameterName
Reference the Dataset parameters in the filepath of the Dataset.
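
In the dataset JSON this translates to an ADLS Gen2 file location in which every path element is an expression referencing a dataset parameter. A minimal sketch, assuming the dataset was created as Azure Data Lake Storage Gen2 > JSON:

"typeProperties": {
  "location": {
    "type": "AzureBlobFSLocation",
    "fileSystem": { "value": "@dataset().Container", "type": "Expression" },
    "folderPath": { "value": "@dataset().Directory", "type": "Expression" },
    "fileName": { "value": "@dataset().FileName", "type": "Expression" }
  }
}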

Step 2: Create a pipeline to read the JSON file details

A. Create a new pipeline and add an empty pipeline variable to store a value fetched from the JSON file. Go to the Variables tab, click New and create a String variable. We called it "value".

Create a pipeline variable.
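
In the pipeline JSON the variable appears in the variables section, roughly as in this sketch:

"variables": {
  "value": {
    "type": "String"
  }
}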

B. Next, add and configure the Lookup activity.

Data Factory Lookup activity.

On the Settings tab, select the dataset that you created in Step 1. As you can see, the values that you entered as dataset parameters are displayed in the Dataset properties section.

The Dataset properties show the values that are passed to the Dataset.

We only expect a single row in the JSON file, so select the option "First row only". If you are reading data from a database table or file and expect multiple rows, disable this option. Note: the way to handle the output of the Lookup activity depends on whether you have enabled the First row only option or not. This is discussed in more detail below.
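
Put together, the Lookup activity in the pipeline JSON could look roughly like the sketch below. The dataset name JsonFileDataset is an assumption; the parameter values and the firstRowOnly setting match the example in this blog.

{
  "name": "Lookup1",
  "type": "Lookup",
  "typeProperties": {
    "source": {
      "type": "JsonSource"
    },
    "dataset": {
      "referenceName": "JsonFileDataset",
      "type": "DatasetReference",
      "parameters": {
        "Container": "tst",
        "Directory": "dir",
        "FileName": "jsonfile.json"
      }
    },
    "firstRowOnly": true
  }
}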

C. Now add a "Set variable" activity to store the value for status. You can access the Lookup output with the following expression:

@activity('lookup-activity-name').output.firstRow.keyname

so in our example:

@activity('Lookup1').output.firstRow.status

Add this to the Value box on the Variables tab.

Add an expression to define the value of a pipeline variable.
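
In pipeline JSON, the Set variable activity could look like the sketch below. The activity name Set variable1 and the dependsOn section are assumptions; the variable name and the expression are the ones used above.

{
  "name": "Set variable1",
  "type": "SetVariable",
  "dependsOn": [
    {
      "activity": "Lookup1",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "variableName": "value",
    "value": {
      "value": "@activity('Lookup1').output.firstRow.status",
      "type": "Expression"
    }
  }
}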

Step 3: Run the pipeline and analyse the output

Trigger the pipeline and monitor the activity output. First let’s analyse the output of the Lookup activity. The following output is generated:

{
  "firstRow": {
    "filename": "jsonfile.json",
    "status": "active"
  },
  "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (West Europe)",
  "billingReference": {
    "activityType": "PipelineActivity",
    "billableDuration": [
      {
        "meterType": "AzureIR",
        "duration": 0.016666666666666666,
        "unit": "DIUHours"
      }
    ]
  },
  "durationInQueue": {
    "integrationRuntimeQueue": 0
  }
}

In Step 2 we defined how to access output elements. For example, to fetch the status value, you use:

@activity('Lookup1').output.firstRow.status

Now let's look at the output of the Set variable activity. As you can see, the value of the pipeline variable "value" is set to "active", the value defined in the JSON file.

The value of the pipeline variable is set to the value of the element fetched from the JSON file.

Read multi-row tables

To read the content of the JSON file we enabled the "First row only" option. Let's see what the output of the Lookup activity looks like when we read a multi-row file. Assume the following CSV:

col1,col2
value1,value2
value1,value2
value1,value2

When reading this with a Lookup activity (with the First row only option disabled), the following output is generated:

{
  "count": 3,
  "value": [
    {
      "col1": "value1",
      "col2": "value2"
    },
    {
      "col1": "value1",
      "col2": "value2"
    },
    {
      "col1": "value1",
      "col2": "value2"
    }
  ],
  "effectiveIntegrationRuntime": "AutoResolveIntegrationRuntime (West Europe)",
  "billingReference": {
    "activityType": "PipelineActivity",
    "billableDuration": [
      {
        "meterType": "AzureIR",
        "duration": 0.016666666666666666,
        "unit": "DIUHours"
      }
    ]
  },
  "durationInQueue": {
    "integrationRuntimeQueue": 0
  }
}

As you can see, the result is structured differently, so the way to fetch information from the output differs as well. For example, to fetch the value of the second column from the second row, use:

@activity('Lookup1').output.value[1].col2
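
If you want to process every row instead of picking a single element, a common pattern is to feed the value array into a ForEach activity and access the columns of the current row with @item(). A minimal sketch, assuming a ForEach activity named ForEach1 that simply writes col2 of each row to the pipeline variable:

{
  "name": "ForEach1",
  "type": "ForEach",
  "dependsOn": [
    {
      "activity": "Lookup1",
      "dependencyConditions": [ "Succeeded" ]
    }
  ],
  "typeProperties": {
    "items": {
      "value": "@activity('Lookup1').output.value",
      "type": "Expression"
    },
    "activities": [
      {
        "name": "Set variable2",
        "type": "SetVariable",
        "typeProperties": {
          "variableName": "value",
          "value": {
            "value": "@item().col2",
            "type": "Expression"
          }
        }
      }
    ]
  }
}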

Azure Tutorials frequently publishes tutorials, best practices, insights or updates about Azure Services, to contribute to the Azure Community. Azure Tutorials is driven by two enthusiastic Azure Cloud Engineers, combining over 15 years of IT experience in several domains. Stay tuned for weekly blog updates and follow us if you are interested!

https://www.linkedin.com/company/azure-tutorials
