
I started this series of articles as exam prep for the Databricks certification exam for Apache Spark 2.4 with Python 3. They have since grown into general notes for learning Databricks and a reference for writing code.

This post covers DataFrames, favouring breadth over depth; the rationale is to give a wider perspective of what is possible with DataFrames.
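To set the scene, here is a minimal PySpark sketch of creating a DataFrame and applying a transformation; the column names and data are illustrative only:

```python
from pyspark.sql import SparkSession

# On Databricks a SparkSession named `spark` already exists;
# building one explicitly keeps this sketch self-contained.
spark = SparkSession.builder.appName("dataframe-intro").getOrCreate()

# Illustrative data: a list of tuples plus a list of column names.
df = spark.createDataFrame(
    [("Alice", 34), ("Bob", 45), ("Cara", 29)],
    ["name", "age"],
)

# A transformation (filter) followed by an action (show).
df.filter(df.age > 30).show()
```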

My references were:

Spark: The Definitive Guide

Sample code from Spark: The Definitive Guide

Apache Spark 2.4 documentation

Spark In Action

Databricks website

Databricks Engineering blog

Jacek Laskowski’s amazing Spark SQL online book

Various fantastic developer blogs and Stack Overflow. There are too many to be called out. …



(This is a now page, and if you have your own site, you should make one, too. The idea came from Derek Sivers)

Home routine

I live in a suburb of Melbourne, Victoria, Australia with my better half, Suba, and our dog, Max. Melbourne is home, and we are committing to our wonderful city by building a home here.

Most mornings, around 5:30 or 6 (if I don’t feel like getting up), I do my morning routine, which alternates between a 20-minute workout and 15–30-minute meditation combo, and going for a run. I’m limiting shoulder and chest workouts as I’m recovering from an injury. …



As I walk through the Databricks exam prep for Apache Spark 2.4 with Python 3, I’m collating notes based on the knowledge expectation of the exam. This is a snapshot of my review of materials. It’s a work in progress.

This post includes the following:

DataFrameWriter (only covering this, so it’s going to be a short post)

My references were:

Spark: The Definitive Guide

Apache Spark 2.4 documentation

Spark In Action

Databricks website

Databricks Engineering blog

Learning Spark, 2nd Edition

Munging Data

Various other fantastic developer blogs. There are too many to be called out.

DataFrameWriter — Expectation of Knowledge

• Write data to the “core” data formats (csv, json, jdbc, orc, parquet, text and…
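As a quick preview of the writer API, here is a minimal sketch of writing a DataFrame out to one of the core formats; the path, format, and options are placeholders rather than recommendations:

```python
# Assumes `df` is an existing DataFrame; the output path is a placeholder.
(df.write
    .format("parquet")                # or "csv", "json", "orc", "text"
    .mode("overwrite")                # append | overwrite | error | ignore
    .option("compression", "snappy")  # a parquet-specific option
    .save("/tmp/output/events"))
```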



Introduction

Data Lakes are the heart of big data architecture; as a result, careful planning is required in designing and implementing a Data Lake. I’ll walk through my findings and thoughts on the topic in this post.

On a side note, I do love the metaphor of water as data; it reminds me of the famous Bruce Lee quote.

Empty your mind, be formless, shapeless — like water. Now you put water in a cup, it becomes the cup; You put water into a bottle it becomes the bottle; You put it in a teapot it becomes the teapot. Now water can flow or it can crash. …


Our spring will come…

Here are some thoughts I want to reflect on now and long after Covid-19 is forgotten. We’ll get through this, but we do need to start doing things differently.

  1. Think globally not nationally. Nationality, religion and race are human constructs and nature doesn’t respect them. We are all interconnected by supply chains and the ‘virtue’ of being human.
  2. Medical, Education, Utilities (Energy & Water) and Police & Military are the front-line fields, in that order. Give these fields the required resources and respect.
  3. Choose your career or next steps carefully during the economic recovery. Work on tasks or projects that really matter.


As I walk through the Databricks exam prep for Apache Spark 2.4 with Python 3, I’m collating notes based on the knowledge expectation of the exam. This is a snapshot of my review of materials. It’s a work in progress.

This post includes the following:

DataFrameReader (only covering the reader so that I can go into it in more depth)

My references were:

Spark: The Definitive Guide

Apache Spark 2.4 documentation

Spark In Action

Databricks website

Databricks Engineering blog

Various fantastic developer blogs. There are too many to be called out.

DataFrameReader — Expectation of Knowledge

  • Read data from the “core” data formats (CSV, JSON, JDBC, ORC, Parquet, text and…
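For orientation, this is roughly what reading one of the core formats looks like; the path and options below are placeholders, assuming a SparkSession named `spark`:

```python
# Assumes a SparkSession named `spark`; path and options are placeholders.
df = (spark.read
        .format("csv")                  # or "json", "orc", "parquet", "text"
        .option("header", "true")       # first line holds column names
        .option("inferSchema", "true")  # sample the data to guess column types
        .load("/tmp/input/events.csv"))

# Inspect the schema Spark inferred.
df.printSchema()
```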


If you’re looking to load data from Azure Data Lake Storage Gen2 via Azure Databricks into Azure SQL DW over PolyBase, this article will lay out an approach based on Managed Service Identity (MSI) for the data transfer to Azure SQL DW.

I prefer the MSI method over passing the storage account key, as it’s more secure, i.e. security is more granular. No doubt we could use Databricks secrets backed by Azure Key Vault, or Databricks secrets directly, as a means to protect the key; nonetheless, the key would carry permissions across the whole storage account.

Secondly, key rotation would involve an additional step to change the key in the key store; thus I find the MSI approach better maintenance-wise. The account key method is well documented on the Databricks page for Azure SQL Data Warehouse, so I’m not covering that approach in this article. …
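For orientation, here is a minimal sketch of the write to Azure SQL DW using the Databricks SQL DW connector with its useAzureMSI option. The server, database, table, and storage paths are placeholders, authentication for the JDBC connection itself is omitted, and the supporting setup (database scoped credential, storage permissions for the MSI, and so on) is what the rest of the article covers:

```python
# Placeholders throughout: <server>, <dw>, the table name, and the tempDir
# path must be replaced with your own values.
(df.write
    .format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://<server>.database.windows.net:1433;database=<dw>")
    .option("useAzureMSI", "true")  # back the database scoped credential with
                                    # Managed Service Identity, not an account key
    .option("dbTable", "dbo.StagingTable")
    .option("tempDir", "abfss://<container>@<account>.dfs.core.windows.net/tempdir")
    .mode("append")
    .save())
```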


During winter, this tree is just branches. When spring comes, it blooms beautifully. Reminds me of latent potential.

If you don’t make the choices,
the choices are made for you.
You live with the outcomes,
whether you made them or not.

Looking forward, you know not if it’s…

About

Lackshu Balasubramaniam

I’m a data engineering bloke who’s into books. Currently re-reading Never Split the Difference by C. Voss. Also started on Thinking, Fast and Slow by D. Kahneman.
