Python/Automation/3 → Shell Script & Python join hands for API Performance Test Report

This post shows how to prepare a report, using Python, that captures the end-to-end transaction time during a performance test of APIs.

Automation using Python

(1) Problem:

We have a requirement to prepare a test report for the performance test of an API. The API uses a logger that writes pipe-delimited log entries on a UNIX box, with key identifiers such as the transaction id, the module name and the current system timestamp at that point in the flow. The module name indicates that a specific milestone point has been reached in the flow. Based on the module names, we have to prepare a script that captures the corresponding timestamps. The difference between the times of the first and last milestone points gives us the end-to-end time.
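
To make the log layout concrete, below is a minimal sketch with a hypothetical pipe-delimited line; the field order, module name and payload are assumptions purely for illustration, and your logger's layout will differ.

# A hypothetical pipe-delimited logger line (field positions are assumed for illustration)
sample_line = "2023-12-28T09:10:12.457+07:00|INFO|FLOW_X|Point_A_Mod|TRANS1001|payload-accepted"

fields = sample_line.split("|")
log_time = fields[0]      # current system timestamp at this milestone
module_name = fields[3]   # module name, i.e. the milestone point reached
trans_id = fields[4]      # transaction id

print(module_name, trans_id, log_time)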

(2) Solution:

To implement the solution, we use a shell script that first collects, for all transactions, the timestamp logged when the flow reaches each module/milestone point. It generates module-specific files containing those timestamps. A Python script then correlates the files, matching each transaction id and finding its respective timestamps. Finally, it prepares a CSV file with all the timestamp details and the time duration between the first and last milestone points.

End to end time for flow

Shell script to capture the transaction id and the timestamps at points A, B and C respectively from the pipe-delimited log file:

#!/bin/bash

date=`date +"%Y-%m-%d"`
dateTime=`date +"%Y-%m-%d_%H-%M-%S"`

mkdir -p /<log_details_extract>/${dateTime}

# Point-A
zgrep -h <Point_A_Mod_Name> /<log_path_archive>/*/*$1*.log.gz | grep <identifier> | awk -F'|' '{print $<extract_time_position>" "$<extract_id_position>}' > /<log_details_extract>/${dateTime}/1_Point_A.csv

# Point-B
zgrep -h <Point_B_Mod_Name> /<log_path_archive>/*/*$1*.log.gz | grep <identifier> | awk -F'|' '{print $<extract_time_position>" "$<extract_id_position>}' > /<log_details_extract>/${dateTime}/2_Point_B.csv

# Point-C
zgrep -h <Point_C_Mod_Name> /<log_path_archive>/*/*$1*.log.gz | grep <identifier> | awk -F'|' '{print $<extract_time_position>" "$<extract_id_position>}' > /<log_details_extract>/${dateTime}/3_Point_C.csv

Here <Point_A_Mod_Name> is the module name; <identifier> is an additional filter condition to narrow down to a specific module name; <extract_id_position> is the position of the transaction id field; <extract_time_position> is the position of the timestamp field; and <log_details_extract> is the path where you want to keep the output files with the captured transaction id and timestamp details. The result is three separate files, one each for points A, B and C, in a format like the one below:

A File Content:
2023-12-28T09:10:12.457+07:00 TRANS1001
2023-12-28T09:10:12.501+07:00 TRANS1002

B File Content:
2023-12-28T09:10:12.653+07:00 TRANS1001
2023-12-28T09:10:12.723+07:00 TRANS1002

C File Content:
2023-12-28T09:10:12.792+07:00 TRANS1001
2023-12-28T09:10:12.811+07:00 TRANS1002
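
As a side note, if you prefer to keep the extraction in Python instead of zgrep/awk, roughly the same result can be produced with gzip and glob. This is only a sketch; the paths carry the same placeholders as the shell script and the field positions are assumptions.

import glob
import gzip

LOG_GLOB = "/<log_path_archive>/*/*.log.gz"   # archived, gzipped API log files
MODULE_NAME = "<Point_A_Mod_Name>"            # milestone/module name to extract
IDENTIFIER = "<identifier>"                   # additional filter condition
TIME_POS = 0                                  # assumed (0-based) position of the timestamp field
ID_POS = 4                                    # assumed (0-based) position of the transaction id field

with open("1_Point_A.csv", "w") as out:
    for log_file in glob.glob(LOG_GLOB):
        with gzip.open(log_file, "rt") as fh:
            for raw in fh:
                if MODULE_NAME in raw and IDENTIFIER in raw:
                    fields = raw.rstrip("\n").split("|")
                    out.write(fields[TIME_POS] + " " + fields[ID_POS] + "\n")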

Next, we will write a Python script that correlates the data for all transaction ids: for each transaction id it looks up the entry in each file, gets the timestamp, and writes a result file in the following format:

TRANS1001,<TX's Time in A file>,<TX's Time in B file>,<TX's Time in C file>,<Time Diff C-A>
TRANS1002,<TX's Time in A file>,<TX's Time in B file>,<TX's Time in C file>,<Time Diff C-A>

It also calculates the time duration between points C and A. Finally, it calculates the median, min, max and the 85th, 90th and 95th percentiles of those time-difference values and plots them in a bar chart.
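
As a quick sanity check of the C minus A calculation, here is a minimal sketch using the sample timestamps of TRANS1001 shown above; the format string matches the ISO-8601 timestamps with timezone offset that the logger writes (Python 3.7+ accepts the +07:00 style offset with %z).

from datetime import datetime

time_a = datetime.strptime("2023-12-28T09:10:12.457+07:00", "%Y-%m-%dT%H:%M:%S.%f%z")
time_c = datetime.strptime("2023-12-28T09:10:12.792+07:00", "%Y-%m-%dT%H:%M:%S.%f%z")

diff_millis = (time_c - time_a).total_seconds() * 1000
print(diff_millis)  # 335.0 ms end to end for TRANS1001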

Pre-requisite: config file “extract_report_cfg.properties”

###################################################
#### FLOW_X Logger Details ####
flow_x.logger_file_dir=C:\\<DIR>\\Report_FlowX_Logger_Input
flow_x.logger_master_file_name=1_File_A.txt
flow_x.logger_file_name_pattern=1_*_File_A.txt,2_*_File_B.txt,3_*_File_C.txt
flow_x.workbook_file_name=FlowX_{DATE}.csv
flow_x.workbook_col_name=TRANS_ID,TIME_A,TIME_B,TIME_C, DIFF_Millis
###################################################
#### FLOW_Y Logger Details ####
flow_y.logger_file_dir=C:\\<DIR>\\Report_FlowY_Logger_Input
flow_y.logger_master_file_name=1_File_A.txt
flow_y.logger_file_name_pattern=1_*_File_A.txt,2_*_File_B.txt,3_*_File_C.txt
flow_y.workbook_file_name=FlowY_{DATE}.csv
flow_y.workbook_col_name=TRANS_ID,TIME_A,TIME_B,TIME_C, DIFF_Millis
###################################################
#### FLOW_Z Logger Details ####
flow_z.logger_file_dir=C:\\<DIR>\\Report_FlowZ_Logger_Input
flow_z.logger_master_file_name=1_File_A.txt
flow_z.logger_file_name_pattern=1_*_File_A.txt,2_*_File_B.txt,3_*_File_C.txt
flow_z.workbook_file_name=FlowZ_{DATE}.csv
flow_z.workbook_col_name=TRANS_ID,TIME_A,TIME_B,TIME_C, DIFF_Millis
###################################################
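
Before jumping into the full script, here is a minimal sketch of how jproperties reads the file above; the config path is the same placeholder used later in the main script.

from jproperties import Properties

configs = Properties()
with open("C:\\<Dir>\\Extract_Report_Config\\extract_report_cfg.properties", "rb") as cfg_prop:
    configs.load(cfg_prop)

# jproperties returns (data, meta) tuples, hence the .data access used in the main script
logger_dir = configs.get("flow_x.logger_file_dir").data
col_names = configs.get("flow_x.workbook_col_name").data.split(",")
print(logger_dir, col_names)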

Refer to the Python file below: “Extract_Report_From_Input_Logger.py”

import os
from datetime import datetime
from jproperties import Properties
import pandas as pd
import matplotlib.pyplot as plt

def func_gen_transaction_report(process, start_dt_time, prop_dict, tpsPT, durPT):
    # putting alphabets 'A' to 'Z' in a list ; later we will pick based on number of input logger files
    alpha_list = []
    for i in range(65, 91):
        alpha_list.append(chr(i))

    # formatting sys timestamp for appending in file names
    dt_time = start_dt_time.strftime("%Y-%m-%d_%H:%M:%S")
    file_dt_time = start_dt_time.strftime("%Y%m%d_%H%M%S")

    logger_file_dir = str(prop_dict.get(process + '.logger_file_dir'))
    wb_arch_dir = str(logger_file_dir) + "\\archive"
    logger_file_pattern = str(prop_dict.get(process + '.logger_file_name_pattern'))

    # master logger is the first file / file A which we will refer for transaction ids and find details in other files
    logger_master_file = str(prop_dict.get(process + '.logger_master_file_name'))

    wb_file_name = str(prop_dict.get(process + '.workbook_file_name'))
    wb_file_pattern = wb_file_name.rsplit('{DATE}', 1)[0]
    wb_file_name = wb_file_name.replace('{DATE}', file_dt_time)

    wb_col_name = str(prop_dict.get(process + '.workbook_col_name'))
    wb_col_name_list = wb_col_name.split(',')

    logger_file_list = []
    master_file_found = 'N'
    missing_transID_file_pattern = 'missing_transID'

    for f_name in os.listdir(logger_file_dir):
        if f_name.startswith((wb_file_pattern, missing_transID_file_pattern)):
            os.rename(logger_file_dir + "\\" + f_name, wb_arch_dir + "\\" + f_name)
            print("File archived : " + f_name)

        elif f_name.startswith(tuple(logger_file_pattern)) and f_name != logger_master_file:
            logger_file_list.append(f_name)

        elif f_name == logger_master_file:
            master_file_found = 'Y'

        elif os.path.isfile(logger_file_dir + "\\" + f_name):
            print('Skipping junk file : ' + f_name)

        else:
            print('Skipped reading dir : ' + f_name)

    if master_file_found == 'Y':
        logger_file_list.sort()
        logger_file_list.insert(0, logger_master_file)
    else:
        raise Exception("No master file present in dir : " + logger_file_dir)

    in_file_cnt = len(logger_file_list)
    del alpha_list[in_file_cnt:]
    file_alpha_list = list(map(lambda a: "File_" + a, alpha_list))
    file_line_list = []

    for idx, file in enumerate(logger_file_list):
        File_Name_Id = "File_" + file.rsplit("_", 1)[1].rsplit(".", 1)[0]

        in_file = open(logger_file_dir + "\\" + file, 'r')
        in_file_line_list = in_file.readlines()

        in_file_line_list = list(map(lambda x: file_alpha_list[idx] + " " + x, in_file_line_list))

        for x in in_file_line_list:
            file_line_list.append(x.replace(' ', ',').strip())

        in_file.close()

    print("Gathered contents in master set from all : " + str(in_file_cnt) + " files for processing.")

    transDtlsDict = {}
    transDtlsDictNested = {}
    transDtlslist = []
    missingIdSet = set()

    for file_line in file_line_list:
        '''Initialize variables for transactionId and logging time ; reset and re-fetch in each iteration of lines in master set'''
        transID = ''
        transTime = ''
        startTime = ''
        transStartTime = ''
        transEndTime = ''
        transTimeDiff = ''

        transID = file_line.rsplit(",")[2]
        transTime = file_line.rsplit(",")[1]

        # we are having values as 'File_A,File_B,File_C' based on from which file the logging time is taken from
        col_type = file_line.rsplit(",")[0]
        # we are having keys as 'A,B,C' after splitting ; identifier key as from which file the logging time is taken from
        col_idx = col_type.rsplit('_')[1]

        # processing for first logging point (A) ; logging time is start time of processing ; transId is key to form a collection
        if str(file_line).startswith(file_alpha_list[0]):
            transDtlsDictNested[transID] = {}
            transDtlsDictNested[transID]['ID'] = "'" + transID + "'"
            transDtlsDictNested[transID][col_idx] = transTime
            start_col_idx = col_idx

            if transTime != '':
                transStartTime = datetime.strptime(transTime, '%Y-%m-%dT%H:%M:%S.%f%z')

        # processing for last/final logging point ; logging time is end time of processing ; transId is key to fetch from a collection
        elif col_type == file_alpha_list[in_file_cnt - 1]:
            try:
                transDtlsDictNested[transID][col_idx] = transTime

                startTime = transDtlsDictNested[transID][start_col_idx]
                if startTime != '':
                    transStartTime = datetime.strptime(startTime, '%Y-%m-%dT%H:%M:%S.%f%z')

                if transTime != '':
                    transEndTime = datetime.strptime(transTime, '%Y-%m-%dT%H:%M:%S.%f%z')

                print(transStartTime)
                print(transEndTime)
                print('.....................')

                transTimeDiff = ((transEndTime - transStartTime).total_seconds() * 1000)
                '''print(transID)'''
                '''print(transTimeDiff)'''
                transDtlsDictNested[transID]['DIFF'] = transTimeDiff

            except:
                missingIdSet.add(transID + '|')
                # if logger time not found then add it to missing transaction file and skip to next line for processing
                continue

        # processing for other logging points ; only logging time is point of interest
        else:
            try:
                transDtlsDictNested[transID][col_idx] = transTime
            except:
                missingIdSet.add(transID + '|')
                continue

    lastColType = alpha_list[in_file_cnt - 1]

    for key, val in transDtlsDictNested.items():
        if lastColType not in val:
            missingIdSet.add(key + '|')
        else:
            transDtlslist.append(val)

    '''print(transDtlslist)'''

    df = pd.DataFrame(transDtlslist)
    df.to_csv(logger_file_dir + '\\' + wb_file_name, index=False, header=wb_col_name_list)

    out_file = open(logger_file_dir + "\\missing_transID_" + file_dt_time + ".txt", 'w')
    out_file.writelines(missingIdSet)
    out_file.close()

    end_dt_time = datetime.now()
    print("Elapsed time : " + str((end_dt_time - start_dt_time).total_seconds()) + " secs for preparing " + process + " report.")

    func_gen_stat_report(process, start_dt_time, logger_file_dir + '\\' + wb_file_name, tpsPT, durPT)

def func_gen_stat_report(process, start_dt_time, csv_file, tpsPT, durPT):
    print('Generating Stat Report based on csv file : ' + csv_file)

    df_in = pd.read_csv(csv_file)

    # fetches number of rows ; df_in.shape[1] fetches number of columns
    row_num = df_in.shape[0]

    col_list = df_in.columns
    col_cnt = len(col_list)
    last_col = col_list[col_cnt - 1]

    percentile_list = []
    median_list = []

    p95 = df_in[last_col].quantile(0.95)
    p90 = df_in[last_col].quantile(0.9)
    p85 = df_in[last_col].quantile(0.85)
    p50 = df_in[last_col].quantile(0.5)
    vMin = df_in[last_col].min()
    vMax = df_in[last_col].max()

    percentile_list = [vMin, p85, p90, p95, vMax]
    median_list = [p50, p50, p50, p50, p50]
    '''print(percentile_list)'''

    # creating dataframe
    df_out = pd.DataFrame({
        'Percentile': ['MIN', '85', '90', '95', 'MAX'],
        'E2E Median Time - ms': median_list,
        'E2E Time - ms': percentile_list
    })

    if tpsPT != "" and durPT != "":
        chartTitle = "PT -- " + str(start_dt_time).split(" ")[0] + " : " + str(row_num) + " record : " + tpsPT + " : " + durPT
    else:
        chartTitle = "PT -- " + str(start_dt_time).split(" ")[0]

    # plotting graph
    ax = df_out.plot(x="Percentile", y=["E2E Median Time - ms", "E2E Time - ms"], kind="bar", title=chartTitle, rot=0)

    for container in ax.containers:
        ax.bar_label(container)

    # saving the graph as png image
    imageFile = csv_file.rsplit(".")[0] + ".png"
    plt.savefig(imageFile)
    print('Bar chart for percentile values of end to end time saved as ' + imageFile)

if __name__ == '__main__':
    print('--------------------Script Started--------------------')

    curr_dt_time = datetime.now()

    # TO MODIFY as per path & file name
    cfg_dir = "C:\\<Dir>\\Extract_Report_Config"
    cfg_file_name = "extract_report_cfg.properties"
    cfg_file = cfg_dir + "\\" + cfg_file_name

    registration_prop_dict = {}
    query_prop_dict = {}
    prop_dict = {}
    configs = Properties()

    # loading the contents of the config file using jproperties
    with open(cfg_file, 'rb') as cfg_prop:
        configs.load(cfg_prop)

    prop_items = configs.items()

    in_process = input('Please enter one of the flow names as input : flow_x/flow_y/flow_z : ')
    in_tps_PT = input('TPS for PT : ')
    in_tps_Dur = input('Duration for PT in minutes : ')

    if in_tps_PT != "":
        in_tps_PT = in_tps_PT + " TPS"
    else:
        in_tps_PT = ""

    if in_tps_Dur != "":
        in_tps_Dur = in_tps_Dur + " min"
    else:
        in_tps_Dur = ""

    for prop_key_val in prop_items:
        if str(prop_key_val[0]).startswith('flow_x.'):
            prop_key = prop_key_val[0]
            prop_val = prop_key_val[1].data
            prop_dict[prop_key] = prop_val

        if str(prop_key_val[0]).startswith('flow_y.'):
            prop_key = prop_key_val[0]
            prop_val = prop_key_val[1].data
            prop_dict[prop_key] = prop_val

        if str(prop_key_val[0]).startswith('flow_z.'):
            prop_key = prop_key_val[0]
            prop_val = prop_key_val[1].data
            prop_dict[prop_key] = prop_val

    if in_process.lower() == 'flow_x':
        func_gen_transaction_report(in_process, curr_dt_time, prop_dict, in_tps_PT, in_tps_Dur)

    elif in_process.lower() == 'flow_y':
        func_gen_transaction_report(in_process, curr_dt_time, prop_dict, in_tps_PT, in_tps_Dur)

    elif in_process.lower() == 'flow_z':
        func_gen_transaction_report(in_process, curr_dt_time, prop_dict, in_tps_PT, in_tps_Dur)

    else:
        print('Invalid flow name as input')

    print('Script run completed.')

So, here is what we are doing in the Python script:

[i] We load the properties file and, based on the user's input for the flow name, call the sub-program with the respective properties.

[ii] The sub-program takes the first file (File A) as the master file. It merges the contents of the files (File A, File B, File C) into a master list. For each transaction id it finds the details in the other files and stitches them together in the following way: “<TX_Id>,<TX’s Time in A file>,<TX’s Time in B file>,<TX’s Time in C file>,<Time Diff C-A>”, preparing a list of dictionaries that the pandas library can easily dump in comma-separated file format.
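
To make [ii] concrete, here is a minimal sketch of that data shape, built from the sample transactions shown earlier (the DIFF values follow from those timestamps; the output file name is illustrative):

import pandas as pd

trans_dtls_list = [
    {"ID": "'TRANS1001'", "A": "2023-12-28T09:10:12.457+07:00",
     "B": "2023-12-28T09:10:12.653+07:00", "C": "2023-12-28T09:10:12.792+07:00", "DIFF": 335.0},
    {"ID": "'TRANS1002'", "A": "2023-12-28T09:10:12.501+07:00",
     "B": "2023-12-28T09:10:12.723+07:00", "C": "2023-12-28T09:10:12.811+07:00", "DIFF": 310.0},
]

df = pd.DataFrame(trans_dtls_list)
df.to_csv("FlowX_sample.csv", index=False,
          header=["TRANS_ID", "TIME_A", "TIME_B", "TIME_C", "DIFF_Millis"])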

[iii] Once the CSV file is written, it reads the contents to find the median, min, max and percentile values and, using matplotlib, plots the statistics in a bar chart like the one below.

Bar Chart for PT

When tested with a large number of records, the variance between the metrics will be high enough to produce a better-looking chart.
