Stories by Tharun Kumar Sekar on Medium

Understanding Streaming Query Metrics

Tharun Kumar Sekar — Sun, 10 Dec 2023 16:53:49 GMT

To optimize a Streaming Pipeline, Streaming query metrics is the right place to begin your analysis.

For illustration purposes, I’m picking Kafka Topic as a source and Delta table as destination. Here is a sample Streaming Query Metrics and this can be found in the log4j file of the driver.

INFO MicroBatchExecution: Streaming query made progress: {
  "id" : "8734d7c4-f46d-4e28-a7d6-5b4498ec9fc0",
  "runId" : "848g9a09-8141-4a78-8c05-c2138c1b4e09",
  "name" : "ingest-stream",
  "timestamp" : "2023-11-30T09:03:00.000Z",
  "batchId" : 2,
  "numInputRows" : 43,
  "inputRowsPerSecond" : 725573.7333333333,
  "processedRowsPerSecond" : 5705.508014487084,
  "durationMs" : {
    "addBatch" : 7629498,
    "getBatch" : 0,
    "commitOffsets" : 203,
    "queryPlanning" : 176,
    "triggerExecution" : 7630245,
    "walCommit" : 131
  },
  "stateOperators" : [ ],
  "sources" : [ {
    "description" : "KafkaV2[Subscribe[event-tree]]",
    "startOffset" : {
      "event-tree" : {
        "1" : 143,
        "0" : 149
      }
    },
    "endOffset" : {
      "event-tree" : {
        "1" : 166,
        "0" : 169
      }
    },
    "latestOffset" : {
      "event-tree" : {
        "1" : 199
        "0" : 199
      }
    },
    "numInputRows" : 43,
    "inputRowsPerSecond" : 725573.7333333333,
    "processedRowsPerSecond" : 5705.508014487084,
    "metrics" : {
      "avgOffsetsBehindLatest" : "0.0",
      "estimatedTotalBytesBehindLatest" : "0.0",
      "maxOffsetsBehindLatest" : "0",
      "minOffsetsBehindLatest" : "0"
    }
  } ],
  "sink" : {
    "description" : "DeltaSink[s3://datalake/bronze/event-tree]",
    "numOutputRows" : -1
  }
  }
}

Few of the basic entries in this Metrics are

id — Streaming Pipeline’s id. This will not change across different runs.
runId — Unique ID of that individual run. This is expected to change during every restart.
batchId — Number of the micro-batch which is being processed.
numInputRows — Number of Records that were consumed in this micro-batch
processedRowsPerSecond — Number of Records that were processed per second.

Now let’s get into the metrics which will help us in understanding the processing time.

durationMs — This category contains all the time related information of the micro-batch
getBatch — Time taken to retrieve the metadata about the next micro-batch, like offsets. This doesn’t include reading the actual data. This value would mostly be very minimal.
walCommit — Time taken to commit the offset value to the checkpoint. This value also would be very minimal. This value would also be very minimal
queryPlanning — Time taken to generate the execution plan by spark. This value would also be minimal.
addBatch — Time taken to read, process and sent the data to the sink. This metric would always have a higher value since the processing time is measured here.
commitOffsets — Time taken to commit to the commit log file after processing the micro-batch. This value would also be minimal.
triggerExecution — This metric is the summation of all the above metrics.

The next category is about the Source (Kafka in our case).

description — Name of the topic from which we are reading the data.
startOffset — Category in which we display information related to the start offset from which we are reading the data.
event-tree — Topic Name
0, 1 — partition Ids
The value displays the record number (offset) from which this micro-batch has started to read the data.
endOffset — The values displayed are the record number (offsets) until which the records has been read.
latestOffset — This displays the current latest record of the partition. If this value matches the endOffset, it means we have processed all records in the partition.

If you liked this article, click the 👏 so other people will see it here on Medium.

Understanding Streaming Query Metrics was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Dynamic Partition Upsert — SPARK

Tharun Kumar Sekar — Tue, 17 May 2022 13:18:16 GMT

Dynamic Partition Upsert — SPARK

If you’re using Spark, you probably know what partitioning is, and perhaps you would have even encountered Dynamic Partitions. But even, If you are not familiar with Spark partitioning in general or Dynamic Partition Inserts, don’t worry, we’ve got it covered.

Partitioning in Spark

Partition in simple terms means splitting the data based on a column’s value and storing it in individual partitions/folders.

Let’s look at the usual way, we save the data with Partitions and then see how Dynamic Partition can help us.

val historyDF = Seq(
      (8, "bat", "1", "2022-05-01"),
      (64, "mouse", "1", "2022-05-02"),
      (-27, "horse", "1", "2022-05-03"),
      (-28, "mouse", "1","2022-05-03"),
      (10, "bat", "1", "2022-05-04")
   ).
   toDF("number", "word", "priority","date")

Here, we have created a dataframe “historyDF” with 5 records. Let’s look at the data now.

Now let’s save this dataframe to the folder “dbfs:/dynamicPartitions/” with partitioning based on the column “date”.

historyDF.
   write.
   mode("overwrite").
   partitionBy("date").
   parquet("dbfs:/dynamicPartitions/")

Let’s look at how the data looks after it’s saved in the file system.

We can see 4 subfolders created for the 4 different dates which were used for partitioning. The subfolders were created at the same timestamp. We can confirm this by viewing the “modificationTime” entry in the above image.

Need for Dynamic Partitioning

In most cases, we would need to run a daily load/ETL to load data into HDFS. This would also have updated or additional records belonging to the previous dates. In technical terms, we should do an “UPSERT” — (Update and Insert) to the existing partitions and load the current date’s data into a new partition.

Let’s create a sample data and store it in a dataframe “deltaDF”.

val deltaDF = Seq(
      (64, "mouse", "2", "2022-05-02"),
      (-29, "mouse", "2", "2022-05-03"),
      (10, "cat", "2", "2022-05-05")
   ).
   toDF("number", "word", "priority", "date")

Now let’s try to understand the data in “deltaDF”. We have 3 records in total and 1 record for each day. The record for 2022–05–02is an exact replica of the existing record in “historyDF” with just the priority changed. The record for 2022–05–03has an update to the field “number” with the value being changed from -28 to -29. Finally, we have a new entry 2022-05-05 with one record.

The expectation now is to have one new partition created for 2022-05-05, have the record updated for 2022-05-03 and the partition to be overwritten for 2022-05-02. The rest of the partitions should remain untouched.

How to Dynamic Partition

In order to perform this, we need to first update the spark’s partition override mode to dynamic. This can be done by running the command,

spark.conf.set("spark.sql.sources.partitionOverwriteMode","dynamic")

Since most of the HDFS is object-based storage, we cannot update individual records in HDFS. Instead we can update the partitions we are interested in. In order to achieve this, we need to have the complete data in a partition that’s going to be overwritten in memory so that we don’t lose the data. It’s confusing right. Let’s make it simple with code.

val distinctDates = deltaDF.
   select('date).
   distinct.
   map(_.getString(0)).collect().toList

val filteredHistoryDF = historyDF.
   filter('date.isin(distinctDates:_*))

filteredHistoryDF.
   union(deltaDF).
   withColumn(
      "rank", 
      row_number.over(
         Window.partitionBy(
            'date, 
            'word).
            orderBy('priority.desc)
         )
   ).
   filter('rank === 1).
   write.
   mode("overwrite").
   partitionBy("date").
   parquet("dbfs:/dynamicPartitions/")

Here, we are selecting the different dates for which we are having entries in the deltaDF. Then we are filtering the records related to these dates from the historyDF. Now we union both the delta and the filtered history and select the latest records, so we have the updated ones along with the ones that previously existed. And finally we save the records to the same path.

Let’s see how the data looks in the file system after the update has happened.

We can see that a new partition folder was created for 2022-05-05 and partitions 2022-05-02 and 2022-05-03 were updated. We can confirm this by viewing the timestamp field modificationTime. The other partitions are untouched and have the same timestamp ( the creation timestamp).

By doing this, we save execution time by skipping partitions which we are not interested in and also save a lot of I/O time.

If you liked this article, click the 👏 so other people will see it here on Medium.

Dynamic Partition Upsert — SPARK was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Higher-Order Functions — Python

Tharun Kumar Sekar — Thu, 23 Sep 2021 07:15:47 GMT

Higher-Order Functions — Python

A programming language is said to support First Class Functions if it treats functions as first-class objects. By definition, a “first-class object” in a program entity is an object that can be passed around just like other objects. It has the following characteristics:

can have properties and methods.
can be assigned to a variable.
can be passed as an argument to a function.
can be returned as a result of another function.

Let’s try to understand each property through code.

Properties and Methods

Each Method/Function you create in Python has a set of default properties and methods, which can inspect using the dir() method. In the below example, I have defined a hello_world function that prints the string Hello World!.

def hello_world():
   print("Hello World!")

When we call the dir() function on the hello_world method, we could see all the default methods that are part of it.

print(dir(hello_world))
['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']

You can call any of these methods tied with the function. For example:

hello_world.__name__
# 'hello_world'

You can also use type to understand that the function we create is an instance of the function class.

print(type(hello_world))
#

2. Assigning Function to Variables

We can also assign functions to variables.

def hello_world_function(name):
   print("Hello " + name + "!")

hello_world_variable = hello_world_function

Here we are assigning hello_world_function function to the variable hello_world_variable. Now hello_world_variable is a functions object which means, we can call it just like the hello_world_function.

hello_world_variable("Tharun")
# Hello Tharun!

This assignment does not call the function instead it takes the function object referenced by hello_world_function and creates a second name pointing to it.

hello_world_function
#

hello_world_variable
#

3. Function as an Argument

Since Function is an object, you can pass it as an argument much similar to a variable.

Let’s consider iterating over a list of items and printing them sequentially. We can easily build an iterate function.

def iterate(items):
   for item in items:
      print(item)

This is usual stuff. What if we want to do something different from printing the items? That’s where Higher-Order Functions come in. We can create a function iterate_custom that takes in both the item list and the function that needs to be applied to each item.

def iterate_custom(items, function):
   for item in items:
      function(item)

By doing this, we have created a function that can do anything with the list that involves sequential iteration. This is a higher level of abstraction and this also makes our code reusable.

4. Return Function from Function

This is usually done to have a wrapper function that decides the control flow or to decide which function should be called. For example:

def square(num):
   return num * num

def cube(num):
   return square(num) * num

def power_of_num(power):
   if power == 2:
      return square
   elif power == 3:
      return cube

num_powers = power_of_num(2)
# num_powers is assigned with the square method

num_powers(5)
# 25

We have defined methods square and cube which are pretty usual. The third method power_of_num is a wrapper function that returns either of the first two methods based on the variable value. In this case, power_of_num is called with the variable value 2. Square method would be returned now and assigned in the variable num_powers. Now, if we call the num_powers variable, it would act as the square method.

num_powers = power_of_num(3)
# num_powers is assigned with the cube method

num_powers(5)
# 125

Got questions? Feel free to comment here.

If you liked this article, click the 👏 so other people will see it here on Medium.

Higher-Order Functions — Python was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Slowly Changing Dimension

Tharun Kumar Sekar — Wed, 07 Oct 2020 07:47:09 GMT

I wanted to learn about Slowly Changing Dimension for a long time, but I couldn’t find a clear, concise blog post for anyone not familiar with the topic. I, therefore, give you my own offering, a quick introduction to Slowly Changing Dimensions or SCD in a data warehousing scenario.

Let’s take 2 tables

Users (Dimension)

https://medium.com/media/ebadee9eb4cb176a01b056f4b47492cc/href

Sales (Fact)

https://medium.com/media/80d4a32b8c077a2e42ae8245b2fdc58a/href

When organizing a data warehouse into a star schema, we need to relate fact records to dimension records to get its related attributes. There are scenarios where the information in the dimension might change. For instance, the user Adam might move to the United Kingdom. If he does, do we associate all his fact records with the new country? Or do we want to ignore the change in the country to keep historical accuracy? Or do we treat facts before the change in the country to those after?

It is this decision that determines whether to make the dimension a slowly changing one. There are different types of SCD depending on how you treat incoming change.

Types of SCD

Type 0 — Fixed Dimension

No changes are allowed here. In other words, the dimension never changes. In this case, we don’t change Adam’s country even if he moves to another one.

Type 1 — No History

Update the dimension directly. There is no track of the change in dimensions. We could only see the current state. In our case, Adam’s record would be modified to have United_Kingdom as Country. All his orders, which he placed when was in United_States, will now be pointing to United_Kingdom.

https://medium.com/media/6116ecb190e5c6f5a6009aaa348c98d4/href

Type 2 — Row Versioning

Type 2 is the most common method of tracking change in data warehouses. Here, we track the changes with new records and additional columns such as the current flag and active dates.

https://medium.com/media/50186440175bb97a7f78566e1d1ce285/href

New Columns

ID — We add a new ID column since the existing user id will not be sufficient to identify the specific record we require.
Current_Flag — A quick method of returning only the latest record of each user.
Start_Date — The date from which the specific record is active.
End_Date — The date to which the specific record is active

This method is very powerful. We maintain the history for the entire record and can easily perform change-over-time analysis.

Type 3

In this type, we add a new column instead of a record. In our case, we add a new column “Previous Country” to track the change. In case, if the user changes the country again, we have to add another column.

Type 4

We will simply update the record similar to Type 1 to accommodate the new change. However, we simultaneously maintain a history table which is similar to type2 to track the changes.

The dimension table after update will look like

https://medium.com/media/6116ecb190e5c6f5a6009aaa348c98d4/href

The history table will have the following records.

https://medium.com/media/db05790605f47b3cc792e82d1625a0c3/href

Separating the history from the dimension makes the dimension table smaller and therefore helps in increasing the performance and reduces the complexity if the majority of the users only use the current value.

However, if you require historical values, this type adds complexity and performance overheads.

Type 1 and Type 2 are generally preferred than Type 4.

Got questions? Feel free to comment here.

If you liked this article, click the 👏 so other people will see it here on Medium.

Slowly Changing Dimension was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Data Reconciliation in Spark

Tharun Kumar Sekar — Sun, 20 Sep 2020 15:09:01 GMT

Data Reconciliation is defined as the process of verification of data during data migration. In this process target data is compared against source data to ensure that the migration happens as expected.

Need for Data Reconciliation

You cannot trust your data without data verification.
Comparing record counts and fill rates does not always work.
Untrustworthy data leads to flawed insights.

Data Reconciler is a data reconciliation tool that checks for the accuracy of your data. Before taking you through the technical implementation, I would like to show you the output of the Reconciliation tool. You can run this code by yourself by following the instructions in next section.

The input dataset has 4 fields with a record count of 50 million records sizing about 1 GB in parquet format. After performing reconciliation on this dataset, we get the following output.

https://medium.com/media/ffd99d83f2572bb6d092f4fdbf228584/href

The above output provides the following information.

Matching Record Count (Record 6) — Number of records with matching primary keys in both datasets. In Other words, the record count after performing an inner join between the datasets. This value will be used as the denominator to calculate the percentage of matching records for each column.
Dropped Records (Record 7) — Number of records that exist in the old table but not in the new one. In other words, the output of a left anti join.
New Records (Record 8) — Number of records that exist in the new table but not in the old one.
Old File Path (Record 9) — Actual Number of records in the old table.
New File Path (Record 10) — Actual Number of records in the new table.
Field Name (Column 1) — Contains each column available in the old table.
Matching Record Count (Column 2) — Number of records with same values in both old and new column.
Mismatch Record Count (Column 3) — Number of records with different values in both old and new column.
Matching Record Percentage (Column 4) — Matching Record Count of individual column / Matching Record Count between Datasets (Record 6)

How to run the Data Reconciler?

The source code of the Data Reconciler is available in github. For now, the tool only support data in Parquet format and the data should have a primary key or a combination of primary keys. You can download the code and add in customization if you need and then build it. Once you have the jar, you can load it into your big data environment and trigger the job using the command

/usr/lib/spark/bin/spark-submit --deploy-mode cluster --executor-cores 5 --name Data_Reconciliation --class com.github.tharun.datareconciler.Pipeline {JAR_PATH} --qualityCheckType reconciler --oldTable {PATH_OF_OLD_DATA} --newTable {PATH_OF_NEW_DATA} --outputPath {PATH_OF_OUTPUT_DATA} --primaryKey {COMMA_SEPARATED_PRIMARY_KEYS}

Parameters:

Executor Cores — 5. This is for achieving a balance in parallelism and equal load.
Name — Data_Reconciliation. A name for your spark job.
Class — com.github.tharun.datareconciler.Pipeline. Main class or entry point for the spark job.
JAR_PATH — path where you have placed the JAR.
Quality Check Type — reconciler. For triggering the reconciliation part of the code.
Old Table — Path of Old Data. Path where you have stored the old dataset.
New Table — Path of New Data. Path where you have stored the new dataset.
Primary Key — Comma Separated Primary Key Column Names.

Technical Implementation

Once you have fed in both the datasets, they are joined based on the primary keys. Now, for each record if the value in Column “A” matches with the value in Column “B”, a new column with value as “1” is created and if the values doesn’t match, the new column is filled with value “0”. A sum of this new column, gives us the total matching records for each column. Once we have the matching record count, other values like mismatch record count, matching percentage are calculated.

Runtime Stats

Dataset 1

50 Million Records
6 GB in parquet
170 columns
AWS r5.12x large — 5 nodes
3 minutes runtime

Dataset 2

350 Million Records
30 GB in parquet
170 columns
AWS r5.12x large — 10 nodes
6 minutes runtime

Github URL — https://github.com/tharun026/SparkDataReconciler

Got questions? Feel free to comment here.

If you liked this article, click the 👏 so other people will see it here on Medium.

Data Reconciliation in Spark was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Spark Parallel Job Submission

Tharun Kumar Sekar — Sun, 06 Sep 2020 14:27:13 GMT

Spark Parallel Job Execution

Spark is known for breaking down a big job and running individual tasks in parallel. But, this doesn’t mean it can run two independent jobs in parallel. This article will help you to maximize the parallelization that you can achieve from Spark.

Asynchronous Programming

This is a type of parallel programming in which a unit of work is allowed to run separately from the primary application thread. When the work is complete, it notifies the main thread about the completion or failure of the worker thread. In Scala, you can achieve this using Future.

Scala Futures

Futures are a means of performing asynchronous programming in Scala. A Future gives you a simple way to run a job inside your spark application concurrently.

Let’s look at the usual way we write our Spark code and then see how Future can help us.

val employee = spark.read.parquet("s3://****/employee")
val salary = spark.read.parquet("s3://****/salary")
val ratings = spark.read.parquet("s3://****/ratings")

println("Joining employee with salary")
employee.join(salary, Seq("employee_id"))
  .exportToS3AndJSON("s3://****/employee_salary")

println("Joining employee with ratings")
employee.join(ratings, Seq("employee_id"))
  .exportToS3AndJSON("s3://****/employee_ratings")

In the above code, we read 3 datasets — employee, salary and ratings.

In the first statement, we join tables Employee and Salary based on Employee_ID and we save down the result in parquet and JSON format.
In the second statement, we join tables Employee and Ratings based on Employee_ID and we save down the result again in parquet and JSON format.

The first and the second statement are in no way related to each other and yet Spark will run it sequentially. You would get a better picture of this, if you take a look at the picture of the Spark UI.

Spark UI

Job ID 0 — starts first and runs for 5.5 minutes and once the first job is completed, the second one is picked up and so on. You can deduce the same by looking at the event timeline too. None of the job overlaps and each job is picked up after the previous job is completed.

If the job 0 utilizes 50% of the cluster, the remaining 50% would be un-utilized.

Let’s understand how we can increase the utilization by using scala futures. Below is the same piece of code but with Future incorporated.

import java.util.concurrent.Executors
import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

//Allowing a maximum of 2 threads to run
val executorService = Executors.newFixedThreadPool(2)
implicit val executionContext = ExecutionContext.fromExecutorService(executorService)

val employee = spark.read.parquet("s3://****/employee")
val salary = spark.read.parquet("s3://****/salary")
val ratings = spark.read.parquet("s3://****/ratings")

val futureA = Future {
   println("Joining employee with salary")
   employee.join(salary, Seq("employee_id"))
     .exportToS3AndJSON("s3://****/employee_salary")
   println("Future A Complete")
   }

val futureB = Future {
   println("Joining employee with ratings")
   employee.join(ratings, Seq("employee_id"))
     .exportToS3AndJSON("s3://****/employee_ratings")
   println("Future B Complete")
   }

Await.result(futureA, Duration.inf)
Await.result(futureB, Duration.inf)

The changes include

Importing ExecutionContext to get access to the thread pool.
Defining the number of threads to run.
Enclosing the transformations inside a Future construct.
The Await.result method call declares that it will wait for the Future to get executed.

Let’s take a look at how the job performs now by looking at the Spark UI.

In here you could see that the jobs 0 and 1 have started at almost the same time. You can also see from the Event timeline, that both the jobs are running in parallel.

If you liked this article, click the 👏 so other people will see it here on Medium.

Spark Parallel Job Submission was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Git — Basics

Tharun Kumar Sekar — Fri, 29 May 2020 15:58:08 GMT

Git — Basics

Git is a powerful tool, but it has a reputation of baffling newcomers. With the right knowledge, anyone can master git. Once you start to understand it, the terminology will make more sense and you’ll (eventually) learn to love it.

What is Git?

Git is a type of version control system (VCS) that makes it easier to track changes to files. For example, when you edit a file, git can help you determine exactly what changed, who changed it, and why.

It’s useful for coordinating work among multiple people on a project, and for tracking progress over time by saving “checkpoints”. You could use it while writing an essay, or to track changes to artwork and design files.

Git isn’t the only version control system out there, but it’s by far the most popular. Many software developers use git daily, and understanding how to use it can give a major boost to your resume.

Environments in GIT

Working Directory: The directory where you have cloned the files in your system.

Staging Area: Temporary location to stage your files before you perform a commit. This area helps you to commit only required files and skip the remaining files.

Local Repository: Replica of the remote repository which will carry your commits.

Remote Repository: A common repository that everyone can use to exchange their changes. It is most commonly located on a remote server.

Pull: Once you have your branch created in the remote repository, you have to perform a git pull , which will sync your local repository with the remote server. Right after your perform a pull, the newly created branch will be available in your local repository.

Checkout: After you perform a pull you will have your new branch in your local repository. Now you need to perform git checkout branchname to point your working directory to the new branch.

Add: When you complete your changes in your local machine, you can do a git add to add the changes to the staging area. Some of the different functionalities of the add command are

git add * to add all your changed files to staging area
git add filename to specifically add a file to staging area
git add -p to look at individual changes in each file and then add it to staging area

Commit: Once you have staged the required files, you can do git commit -m "message" to commit your changes to your branch. A commit is the one which will be pushed to the Remote repository once your perform a git push.

Push: Until you do a push, all your changes will still be in your local machine and you can’t exchange it with anyone. git push pushes all your local commits to the remote repository. Now anyone can access your changes by going to your branch.

If you liked this article, click the 👏 so other people will see it here on Medium.

Git — Basics was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Git Squash Commit With Git Rebase

Tharun Kumar Sekar — Wed, 20 May 2020 04:29:29 GMT

When submitting a pull request to merge your code with Master/Develop, it’s better you squash your commits. Some applications that interact with git repos will provide a user interface for squashing. But let’s take the fun route — the terminal way.

There are multiple ways to do a git squash. One - do it locally in your system and then push it to remote. The other way is having a copy of all your changes in remote before doing a rebase so you have a copy of your changes in remote in case if something goes wrong.

Lets look at the safer way first. Make sure your branch is up to date with the remote server. Now do git log --pretty=oneline to understand the commits that happened in your branch.

* c88bc5 Implement search inputs for user
* 8f4917 Enriched plots for better understanding
* 59c01d Add pyplot for better analysis
* ba6f1f Add listing feature to quality checks
* 9f2adb Add feature to pipeline
* f796c1 Initial commit

The last 6 commits would look much better if they were wrapped up together, so let’s do that through interactive rebasing.

To interactively rebase commits, you can follow the below format and enter your command via the command line.

git rebase -i HEAD~ (n is the number of commits you want to squash)

git rebase -i HEAD~6 (This will roll up all 6 commits in the current branch)

git rebase -i  (sha code of the commit until which you want to squash)

git rebase -i f796c1 (sha code of the initial commit)

The -i flag is to indicate that this rebase process will be an interactive session.

Once you enter the above command, this is what you will see.

pick f796c1 Initial commit
pick 9f2adb Add feature to pipeline
pick ba6f1f Add listing feature to quality checks
pick 59c01d Add pyplot for better analysis
pick 8f4917 Enriched plots for better understanding
pick c88bc5 Implement search inputs for user

# Rebase 8db7e8b..fa20af3 onto 8db7e8b 
# 
# Commands: 
#  p, pick = use commit 
#  r, reword = use commit, but edit the commit message 
#  e, edit = use commit, but stop for amending 
#  s, squash = use commit, but meld into previous commit 
#  f, fixup = like "squash", but discard this commit's log message 
#  x, exec = run command (the rest of the line) using shell 
# 
# These lines can be re-ordered; they are executed from top to bottom. 
# 
# If you remove a line here THAT COMMIT WILL BE LOST. 
# 
# However, if you remove everything, the rebase will be aborted. 
# 
# Note that empty commits are commented out

We see the 6 last commits, from older to newer. See the comments below the list of commits? Good job explaining, git! pick is the default action. In this case, it would reapply the commit as is, no changes in the contents or messages. Saving this file would make no changes to the repository.

We are interested only in the below actions.

squash ( s for short), which melds the commit into the previous one (the one in the line before)
fixup ( f for short), which acts like “squash”, but discards the commit message

Let’s say we want to squash all our commits, because they belong to the same logical changeset. We’ll preserve the initial commit and squash all the subsequent commits into the previous one. We have to change pick to squash in all the commits except the first one.

pick f796c1 Initial commit
squash 9f2adb Add feature to pipeline
squash ba6f1f Add listing feature to quality checks
squash 59c01d Add pyplot for better analysis
squash 8f4917 Enriched plots for better understanding
squash c88bc5 Implement search inputs for user

Save the editor and you will land into another editor to decide the commit message of the melded three commits. In this editor, you will be given an option to add/remove the commit messages. Once you save the commit messages and quit the editor, all your commits will be transformed into one.

If you want to skip editing the commit message part, you can use fixup command and this will have your commit messages already commented out.

Once the commit message part is saved, the final thing you have to do is git push to push all your changes to remote. And this push has to be forced since the branch in your local and remote have been diverged after the rebase.

git push --force

P.S. If you have too many commits, to be squashed and you have manually update every pick to squash , vim provides a simple way to achieve it.

:%s/pick/squash/gc

This command will update every pick to squash upon your confirmation.

If you say reword ( r for short) in a commit you want to edit:

pick f796c1 Initial commit
pick 9f2adb Add feature to pipeline
reword ba6f1f Add listing feature to quality checks
pick 59c01d Add pyplot for better analysis
pick 8f4917 Enriched plots for better understanding
pick c88bc5 Implement search inputs for user

# Rebase 8db7e8b..fa20af3 onto 8db7e8b 
# 
# Commands: 
#  p, pick = use commit 
#  r, reword = use commit, but edit the commit message 
#  e, edit = use commit, but stop for amending 
#  s, squash = use commit, but meld into previous commit 
#  f, fixup = like "squash", but discard this commit's log message 
#  x, exec = run command (the rest of the line) using shell 
# 
# These lines can be re-ordered; they are executed from top to bottom. 
# 
# If you remove a line here THAT COMMIT WILL BE LOST. 
# 
# However, if you remove everything, the rebase will be aborted. 
# 
# Note that empty commits are commented out

When you save and quit the editor, git will follow the reword command and will land you in an editor again, as if you had amended commit ba6f1f . Now you can edit the commit message, save and quit the editor.

If you liked this article, click the 👏 so other people will see it here on Medium.

Git Squash Commit With Git Rebase was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Improve Spark Write Performance

Tharun Kumar Sekar — Tue, 14 Apr 2020 12:12:47 GMT

The EMRFS S3-optimized committer is a new output committer available for use with Apache Spark jobs as of Amazon EMR 5.19.0. This committer improves performance when writing Apache Parquet files to S3 using the EMR File System (EMRFS). In this post, we run a performance benchmark to compare this new optimized committer with existing committer algorithms, namely FileOutputCommitter algorithm versions 1 and 2. We close with a discussion on current limitations for the new committer, providing workarounds where possible.

Comparison with FileOutputCommitter

In Amazon EMR version 5.19.0 and earlier, Spark jobs that write Parquet to Amazon S3 use a Hadoop commit algorithm called FileOutputCommitter by default. There are two versions of this algorithm, version 1 and 2. Both versions rely on writing intermediate task output to temporary locations. They subsequently perform rename operations to make the data visible at task or job completion time.

Algorithm version 1 has two phases of rename: one to commit the individual task output, and the other to commit the overall job output from completed/successful tasks. Algorithm version 2 is more efficient because task commits rename files directly to the final output location. This eliminates the second rename phase, but it makes partial data visible before the job completes, which not all workloads can tolerate.

The renames that are performed are fast, metadata-only operations on the Hadoop Distributed File System (HDFS). However, when output is written to object stores such as Amazon S3, renames are implemented by copying data to the target and then deleting the source. This rename “penalty” is exacerbated with directory renames, which can happen in both phases of FileOutputCommitter v1. Whereas these are single metadata-only operations on HDFS, committers must execute N copy-and-delete operations on S3.

To partially mitigate this, Amazon EMR 5.14.0+ defaults to FileOutputCommitter v2 when writing Parquet data to S3 with EMRFS in Spark. The new EMRFS S3-optimized committer improves on that work to avoid rename operations altogether by using the transactional properties of Amazon S3 multipart uploads. Tasks may then write their data directly to the final output location, but defer completion of each output file until task commit time.

Performance test

When evaluated the write performance of the different committers by executing the following INSERT OVERWRITE Spark SQL query. The SELECT * FROM range(…)clause generated data at execution time. This produced ~15 GB of data across exactly 100 Parquet files in Amazon S3.

SET rows=4e9; -- 4 Billion 
SET partitions=100;  
INSERT OVERWRITE DIRECTORY ‘s3://${bucket}/perf-test/${trial_id}’ USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});

Note: The EMR cluster ran in the same AWS Region as the S3 bucket. The trial_id property used a UUID generator to ensure that there was no conflict between test runs.

We executed our test on an EMR cluster created with the emr-5.19.0 release label, with a single m5d.2xlarge instance in the master group, and eight m5d.2xlarge instances in the core group. We used the default Spark configuration properties set by Amazon EMR for this cluster configuration, which include the following:

spark.dynamicAllocation.enabled true 
spark.executor.memory 11168M 
spark.executor.cores 4

After running 10 trials for each committer, we captured and summarized query execution times in the following chart. Whereas FileOutputCommitter v2 averaged 49 seconds, the EMRFS S3-optimized committer averaged only 31 seconds — a 1.6x speedup.

As mentioned earlier, FileOutputCommitter v2 eliminates some, but not all, rename operations that FileOutputCommitter v1 uses. To illustrate the full performance impact of renames against S3, we reran the test using FileOutputCommitter v1. In this scenario, we observed an average runtime of 450 seconds, which is 14.5x slower than the EMRFS S3-optimized committer.

The last scenario we evaluated is the case when EMRFS consistent view is enabled, which addresses issues that can arise due to the Amazon S3 data consistency model. In this mode, the EMRFS S3-optimized committer time was unaffected by this change and still averaged 30 seconds. On the other hand, FileOutputCommitter v2 averaged 53 seconds, which was slower than when the consistent view feature was turned off, widening the overall performance difference to 1.8x.

Enabling the EMRFS S3-optimized committer

Starting with Amazon EMR version 5.20.0, the EMRFS S3-optimized committer is enabled by default. In Amazon EMR version 5.19.0, you can enable the committer by setting the spark.sql.parquet.fs.optimized.committer.optimization-enabled property to true from within Spark or when creating clusters. The committer takes effect when you use Spark’s built-in Parquet support to write Parquet files into Amazon S3 with EMRFS. This includes using the Parquet data source with Spark SQL, DataFrames, or Datasets. However, there are some use cases when the EMRFS S3-optimized committer does not take effect, and some use cases where Spark performs its own renames entirely outside of the committer. For more information about the committer and about these special cases, see Using the EMRFS S3-optimized Committer in the Amazon EMR Release Guide.

Summary

The EMRFS S3-optimized committer improves write performance compared to FileOutputCommitter. Starting with Amazon EMR version 5.19.0, you can use it with Spark’s built-in Parquet support.

This article is a transcript from the Amazon Web Services blogs.

If you liked this article, click the 👏 so other people will see it here on Medium.

Improve Spark Write Performance was originally published in Analytics Vidhya on Medium, where people are continuing the conversation by highlighting and responding to this story.

Case Classes — Scala

Tharun Kumar Sekar — Fri, 10 Apr 2020 12:50:37 GMT

Case Classes — Scala

Representing data is a key part of writing programs, but it’s also mechanical: you need to define your fields, setters, getters, and other details. When coding in languages which are a bit more verbose, such as Java, you often end up using the tools, such as your IDE, to automatically generate some code for you. What if, instead of seeking support from your IDE, the compiler could do this for you?

Case classes are good for modeling immutable data. It can be considered as a class with an arbitrary number of parameters for which the compiler automatically adds ad-hoc code. Case classes are the ideal data containers because of these functionalities that encourage the use of immutability.

Defining a case class

case class Member(id: Integer, name: String, country: String)
val a = Member(10001, "Adam", "UK")

Notice how the new keyword was not used to instantiate the Member case class. This is because case classes have an apply method by default which takes care of object construction. By default all the parameters in a case class are public. You can’t reassign a variable in a case class because it is a val (i.e. immutable). It is possible to use var but this is discouraged. The compiler also adds other functionalities to a case class.

Getters

For each parameter, the compiler adds a getter function with the same name of the parameter it refers to. For example, you can easily access the id value by calling the object name and the parameter. Once instantiated, you cannot modify the value, since it is a val.

println(a.id) //prints 10001
a.id = 10002 //throws a compile time error

Scala does not generate setter functions for the parameters, since they are immutable.

Copying

When you want to modify a value of an existing case class, you can use the copy function to create a new data representation. In simple words, you can create a (shallow) copy of an instance of a case class simply by using the copy method.

val b = a.copy(id = 10002, name = "Bryan")
println(b.id)      //prints 10002
println(b.name)    //prints Bryan
println(b.country) //prints UK which was copied from object a

You can also change all of it’s parameter at the same time.

Apply

One of the topmost benefit of Case Class is that Scala Compiler generates an apply method with the name of the class having identical number of parameters as defined in the class definition, because of that you can create objects of the Case Class even in the absence of the keyword new.

case class Member(id: Integer, name: String, country: String)
val a = Member(10001, "Adam", "UK")

If you liked this article, click the 👏 so other people will see it here on Medium.