Custom Data Catalog for Parquet Files using Azure Data Factory
Use Case
- Build metadata-based schema information extraction using ADF
- Parse parquet files and find their columns
- Update a database to store the details
- Provide a means to schedule the process if possible
- Manually provide the parquet folder
Architecture

Prerequisites
- Create an Azure account
- Create a Storage account
- Create an Azure Data Factory
- Create an Azure SQL database
- Create an Azure Functions app
Database
- Create a database
- Create a table to hold the column names
- Create proper access to write to the table
DROP TABLE metacolumn;
CREATE TABLE metacolumn (
    id bigint IDENTITY(1,1),
    filename varchar(200),
    columnname varchar(200),
    columntype varchar(200)
);

Load sample data
- Create a storage account
- Load sample data
- I created a folder called USpopulationInput\fact
- Loaded a few sample parquet files
Azure Data Factory
- Create an Azure Data Factory pipeline
- We are going to use the Get Metadata activity
- Create a ForEach loop
- Overall flow




- Select the folder where the parquet files are available
- Drag a ForEach activity and select the Child Items from the Get Metadata output (see the pipeline sketch below)
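A minimal sketch of what the two outer activities could look like in the pipeline JSON. The activity names Get Metadata1 and ForEach1 and the dataset name ParquetFolderDataset are assumptions, not taken from the original:
{
    "name": "Get Metadata1",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": { "referenceName": "ParquetFolderDataset", "type": "DatasetReference" },
        "fieldList": [ "childItems" ]
    }
},
{
    "name": "ForEach1",
    "type": "ForEach",
    "dependsOn": [ { "activity": "Get Metadata1", "dependencyConditions": [ "Succeeded" ] } ],
    "typeProperties": {
        "items": { "value": "@activity('Get Metadata1').output.childItems", "type": "Expression" },
        "activities": [ ]
    }
}
The ForEach receives the childItems array (one entry per file in the folder) and runs its inner activities once per file.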

- Now go to the activities of the ForEach
- Inside the ForEach, add the components below




- In the Get Metadata activity inside the ForEach, create new fields:
- One for Item Name
- One for Item Type
- One for Last Modified
- One for Structure, which holds the collection of columns
- For the file name, type the dataset parameter expression @dataset().FileName (a sketch of the ForEach contents follows this list)
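Inside the ForEach, one way to wire this up is a per-file Get Metadata activity pointing at a parameterized parquet dataset. A minimal sketch, assuming the inner activity is named Get Metadata2 and the dataset is named ParquetFileDataset with a FileName parameter (both names are assumptions):
{
    "name": "Get Metadata2",
    "type": "GetMetadata",
    "typeProperties": {
        "dataset": {
            "referenceName": "ParquetFileDataset",
            "type": "DatasetReference",
            "parameters": { "FileName": { "value": "@item().name", "type": "Expression" } }
        },
        "fieldList": [ "itemName", "itemType", "lastModified", "structure" ]
    }
}
The parquet dataset itself uses @dataset().FileName as the file name in its file path, so each iteration of the ForEach reads the schema of a different parquet file.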

Azure Functions
- Create an Azure Function in the portal called metafunc1
- Use Visual Studio Code to create and write the code
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
using System.Collections.Generic;
using System.Data.SqlClient;

namespace Company.Function
{
    // One column entry as reported by the Get Metadata "structure" field
    public class metadataItem
    {
        public string name { get; set; }
        public string type { get; set; }
    }

    // Payload sent by the ADF Azure Function activity: the file name plus its columns
    public class metadata
    {
        public string filename { get; set; }
        public List<metadataItem> name { get; set; }
    }

    public static class metafunc1
    {
        [FunctionName("metafunc1")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
            ILogger log)
        {
            log.LogInformation("C# HTTP trigger function processed a request.");

            // Read the request body and deserialize it into the metadata model
            string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
            metadata data = JsonConvert.DeserializeObject<metadata>(requestBody);

            // Connection string for the Azure SQL database that holds the catalog table
            string connectionString = "Server=tcp:sqlservername.database.windows.net,1433;Initial Catalog=databasename;Persist Security Info=False;User ID=username;Password=xxxxxxx;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;";

            using (SqlConnection connection = new SqlConnection(connectionString))
            {
                // Open the connection once and insert one row per column
                connection.Open();

                foreach (var ea in data.name)
                {
                    // Prepare the SQL query
                    // (for production use, prefer a parameterized query to avoid SQL injection)
                    var query = $"INSERT INTO [metacolumn] ([filename], [columnname], [columntype]) VALUES ('{data.filename}', '{ea.name}', '{ea.type}')";

                    // Prepare the SQL command and execute the query
                    using (SqlCommand command = new SqlCommand(query, connection))
                    {
                        command.ExecuteNonQuery();
                    }

                    log.LogInformation("data value. {0} {1}", ea.name, ea.type);
                }

                connection.Close();
            }

            // Return a simple JSON result with the number of columns processed
            string responseMessage = $"Hello, {data.name.Count}. This HTTP triggered function executed successfully.";
            JObject result = new JObject();
            result.Add("result", responseMessage);
            return new JsonResult(result);
        }
    }
}

Add the SqlClient package:
dotnet add package System.Data.SqlClient
- Install any other packages that are necessary
- Deploy the function
- Create a resource group


{ "name": [ { "name": "decennialTime", "type": "String" }, { "name": "stateName", "type": "String" }, { "name": "countyName", "type": "String" }, { "name": "population", "type": "Int32" }, { "name": "race", "type": "String" }, { "name": "sex", "type": "String" }, { "name": "minAge", "type": "Int32" }, { "name": "maxAge", "type": "Int32" } ] }

Azure Data Factory



- Save the pipeline
- Validate all
- Publish the pipeline
- Then go to the pipeline, click Add Trigger, and select Trigger Now (a scheduled trigger sketch follows this list)
- Go to Monitor and view the results
- View the job and select the job run to view the details
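To cover the scheduling requirement from the use case, a schedule trigger can be attached to the pipeline instead of (or in addition to) Trigger Now. A minimal sketch, assuming the pipeline is named pipeline1 and a daily run; the trigger name, start time, and frequency are assumptions:
{
    "name": "DailyMetadataTrigger",
    "properties": {
        "type": "ScheduleTrigger",
        "typeProperties": {
            "recurrence": {
                "frequency": "Day",
                "interval": 1,
                "startTime": "2021-06-01T00:00:00Z",
                "timeZone": "UTC"
            }
        },
        "pipelines": [
            { "pipelineReference": { "referenceName": "pipeline1", "type": "PipelineReference" } }
        ]
    }
}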


Originally published at https://github.com.