Custom Data Catalog Parquet File using Azure Data Factory

Use Case

  • Build metadata-based schema information extraction using ADF
  • Parse parquet files and find their columns
  • Update a database to store the details
  • Provide a means to schedule the process if possible
  • Manually provide the parquet folder

Architecture

Prerequisites

  • Create an Azure account
  • Create a storage account
  • Create an Azure Data Factory
  • Create an Azure SQL database
  • Create an Azure Functions app

Database

  • Create a database
  • Create a table to hold the file name and column details
  • Grant proper access to write to the table
DROP TABLE metacolumn;
CREATE TABLE metacolumn (
    id bigint IDENTITY(1,1),
    filename varchar(200),
    columnname varchar(200),
    columntype varchar(200)
);

Load sample data

  • Create a storage account
  • Load sample data
  • I created a folder called USpopulationInput\fact
  • Loaded a few sample parquet files (a programmatic upload sketch follows this list)
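
If you would rather load the sample files with code than through the portal, here is a minimal sketch using the Azure.Storage.Blobs SDK; the connection string, container name, and local file path are placeholders, not values from this walkthrough.

using System;
using System.IO;
using Azure.Storage.Blobs;

class UploadSample
{
    static void Main()
    {
        // Placeholder connection string for the storage account
        string connectionString = "DefaultEndpointsProtocol=https;AccountName=xxxx;AccountKey=xxxx;EndpointSuffix=core.windows.net";

        // Container backing the USpopulationInput folder (name assumed)
        var container = new BlobContainerClient(connectionString, "uspopulationinput");
        container.CreateIfNotExists();

        // Upload a local parquet file under the fact/ prefix
        string localPath = "us_population.parquet";
        BlobClient blob = container.GetBlobClient($"fact/{Path.GetFileName(localPath)}");
        using FileStream stream = File.OpenRead(localPath);
        blob.Upload(stream, overwrite: true);

        Console.WriteLine($"Uploaded {localPath} to {blob.Uri}");
    }
}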

Azure Data Factory

  • Create an Azure Data Factory pipeline
  • We are going to use the Get Metadata activity
  • Create a ForEach loop
  • Overall flow:
  • Select the folder where the parquet files are available
  • Drag in a ForEach activity and set its Items to the Child Items output of Get Metadata
  • Now go to the activities inside the ForEach
  • Inside the ForEach, bring in the components below
  • In the Get Metadata field list, create new fields:
  • One for Item Name
  • One for Item Type
  • One for Last modified
  • One for Structure, which holds the collection of columns
  • For the dataset's file name parameter, type
@dataset().FileName
  • Add an Azure Function activity inside the loop to pass the file name and the Structure output to the function (a sample payload appears later in this article)

Azure Functions

  • Create an Azure Function in the portal called metafunc1
  • Use Visual Studio Code to create and write the code
using System;
using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Azure.WebJobs;
using Microsoft.Azure.WebJobs.Extensions.Http;
using Microsoft.AspNetCore.Http;
using Microsoft.Extensions.Logging;
using Newtonsoft.Json;
using Newtonsoft.Json.Linq;
using System.Collections.Generic;
using System.Data.SqlClient;

namespace Company.Function
{
    // One column as reported by the Get Metadata Structure field
    public class metadataItem
    {
        public string name { get; set; }
        public string type { get; set; }
    }

    // Request body: the parquet file name plus its list of columns
    public class metadata
    {
        public string filename { get; set; }
        public List<metadataItem> name { get; set; }
    }

    public static class metafunc1
    {
        [FunctionName("metafunc1")]
        public static async Task<IActionResult> Run(
            [HttpTrigger(AuthorizationLevel.Function, "get", "post", Route = null)] HttpRequest req,
            ILogger log)
        {
            log.LogInformation("C# HTTP trigger function processed a request.");

            // Read and deserialize the JSON payload sent by the pipeline
            string requestBody = await new StreamReader(req.Body).ReadToEndAsync();
            metadata data = JsonConvert.DeserializeObject<metadata>(requestBody);

            string connectionString = "Server=tcp:sqlservername.database.windows.net,1433;Initial Catalog=databasename;Persist Security Info=False;User ID=username;Password=xxxxxxx;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;";

            using (SqlConnection connection = new SqlConnection(connectionString))
            {
                // Open the connection once and reuse it for every insert
                connection.Open();

                foreach (var ea in data.name)
                {
                    // Insert one row per column, using parameters to avoid SQL injection
                    var query = "INSERT INTO [metacolumn] ([filename], [columnname], [columntype]) VALUES (@filename, @columnname, @columntype)";

                    using (SqlCommand command = new SqlCommand(query, connection))
                    {
                        command.Parameters.AddWithValue("@filename", data.filename);
                        command.Parameters.AddWithValue("@columnname", ea.name);
                        command.Parameters.AddWithValue("@columntype", ea.type);
                        command.ExecuteNonQuery();
                    }

                    log.LogInformation("data value. {0} {1}", ea.name, ea.type);
                }

                connection.Close();
            }

            string responseMessage = $"Processed {data.name.Count} columns. This HTTP triggered function executed successfully.";
            JObject result = new JObject();
            result.Add("result", responseMessage);
            return new JsonResult(result);
        }
    }
}
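
Note that the insert above uses parameterized values rather than string interpolation, so odd characters in a file or column name cannot break the statement; in a real deployment the connection string would also normally come from the Function App's application settings instead of being hardcoded.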
dotnet add package System.Data.SqlClient
  • Install any other packages that are necessary
  • Deploy the function
  • Create a resource group
{ "name": [ { "name": "decennialTime", "type": "String" }, { "name": "stateName", "type": "String" }, { "name": "countyName", "type": "String" }, { "name": "population", "type": "Int32" }, { "name": "race", "type": "String" }, { "name": "sex", "type": "String" }, { "name": "minAge", "type": "Int32" }, { "name": "maxAge", "type": "Int32" } ] }

Azure Data Factory

  • Save the pipeline
  • Validate all
  • Publish the pipeline
  • Then go to the pipeline, click Add Trigger, and select Trigger now
  • Go to Monitor and view the results
  • View the job and select the job run to view the details; to verify the catalog table itself, see the sketch after this list
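
To confirm the run actually populated the catalog, you can query the metacolumn table directly; a minimal sketch, reusing the placeholder connection details from the function, follows.

using System;
using System.Data.SqlClient;

class CheckCatalog
{
    static void Main()
    {
        // Placeholder connection string, matching the one used in metafunc1
        string connectionString = "Server=tcp:sqlservername.database.windows.net,1433;Initial Catalog=databasename;User ID=username;Password=xxxxxxx;Encrypt=True;";

        using var connection = new SqlConnection(connectionString);
        connection.Open();

        // List every column the pipeline has catalogued so far
        var query = "SELECT filename, columnname, columntype FROM metacolumn ORDER BY filename, id";
        using var command = new SqlCommand(query, connection);
        using SqlDataReader reader = command.ExecuteReader();
        while (reader.Read())
        {
            Console.WriteLine($"{reader.GetString(0)} | {reader.GetString(1)} | {reader.GetString(2)}");
        }
    }
}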

Originally published at https://github.com.
