Cleaning the Merged Data and Testing with Pytest — 5th Biweekly Blog GSoC’23 [NumFOCUS]

SATYAM SINHA
5 min read · Aug 13, 2023


Hello Everyone!!

This is my 5th biweekly blog post, and it is all about hands-on improvement. We're cleaning up our data, refining our merging technique, and introducing the game-changing Pytest framework for testing. Plus, we'll tackle the sneaky Flake8 and Yapf issues that were failing our workflows before.

By the end of this read, we'll have not only polished datasets and smoother code, but also a solid toolkit to make future projects shine. So, let's get started on this journey to cleaner data, smoother merges, and more reliable code.

[Image: Pytest result]

Introduction

In previous blogs, I explained the extract_training_data function, which returns a merged data frame of the image file (i.e., RGB data) and the CSV file (i.e., vst data), built using the retrieve_vst_data and retrieve_aop_data functions.

In the 9th and 10th weeks of the coding period, I will be explaining three main things:

  • Cleaning the Merged Data Frame
  • Testing with Pytest
  • Formatting Code using Flake8 and Yapf

Cleaning the Merged Data Frame

As discussed in my previous blog, the extract_training_data function is used to generate the merged data. Here, we first add a temporary column named "temp_geo" to the concatenated prediction data frame, which stores a copy of its geometry column.

all_predictions_df['temp_geo'] = all_predictions_df['geometry']

After storing "temp_geo" in the concatenated data frame, we perform a GeoPandas spatial join between "geo_data_frame" and "all_predictions_df". This keeps the geometry of both data frames available, so that we can then drop the join geometry and restore the "Polygon" geometry stored in the "temp_geo" column as the geometry column.

merged_data = gpd.sjoin(geo_data_frame, all_predictions_df, how="inner", op="within")  # newer GeoPandas versions use predicate="within" instead of op
merged_data.drop(columns=['geometry'], inplace=True)  # drop the point geometry from the join
merged_data.rename(columns={'temp_geo': 'geometry'}, inplace=True)  # restore the polygons

After that, we create a dictionary named "canopy_position_mapping" that maps the "canopyPosition" values onto an ordinal scale: np.nan to 0, 'Full shade' to 1, 'Mostly shaded' to 2, 'Partially shaded' to 3, 'Full sun' to 4, and 'Open grown' to 5.

canopy_position_mapping = {
    np.nan: 0,
    'Full shade': 1,
    'Mostly shaded': 2,
    'Partially shaded': 3,
    'Full sun': 4,
    'Open grown': 5
}

Next, we treat merged_data as our predictions, create a copy of it, and replace the canopy position column values using the dictionary.

predictions = merged_data
predictions_copy = predictions.copy()
predictions_copy['canopyPosition'] = predictions_copy['canopyPosition'].replace(canopy_position_mapping)

From the predictions_copy data frame, we flag duplicates based on the bounding-box coordinates (i.e., 'xmin', 'ymin', 'xmax', 'ymax') and create another data frame named "duplicate_entries":

duplicate_mask = predictions_copy.duplicated(subset=['xmin', 'ymin', 'xmax', 'ymax'], keep=False)  # mark every row of a duplicated box
duplicate_entries = predictions[duplicate_mask]  # predictions shares the same index, so the mask lines up

Now, based on the duplicate entries, we sort the mapped predictions by the 'height', 'canopyPosition', and 'stemDiameter' columns in descending order. Keeping only the first row of each duplicated bounding box leaves us with a clean prediction data frame.

# Sort the mapped copy, so canopyPosition orders ordinally rather than alphabetically.
predictions_sorted = predictions_copy.sort_values(by=['height', 'canopyPosition', 'stemDiameter'], ascending=[False, False, False])

duplicates_mask = predictions_sorted.duplicated(subset=['xmin', 'ymin', 'xmax', 'ymax'], keep='first')

clean_predictions = predictions_sorted[~duplicates_mask]

With these steps in place, our extract_training_data function now returns a clean prediction data frame, as the sketch below summarizes.
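Putting the pieces together, the cleaning stage looks roughly like this minimal sketch. The helper name clean_merged_predictions is hypothetical, purely for illustration; inside extract_training_data these steps run inline on the retrieved data.

import geopandas as gpd
import numpy as np

def clean_merged_predictions(geo_data_frame, all_predictions_df):
    """Join vst points to prediction boxes, then deduplicate the boxes."""
    # Stash the prediction polygons before the spatial join.
    all_predictions_df['temp_geo'] = all_predictions_df['geometry']
    merged = gpd.sjoin(geo_data_frame, all_predictions_df, how="inner", predicate="within")
    merged.drop(columns=['geometry'], inplace=True)
    merged.rename(columns={'temp_geo': 'geometry'}, inplace=True)
    merged = merged.set_geometry('geometry')  # make the polygons the active geometry

    # Encode canopy position on an ordinal scale so it sorts sensibly.
    mapping = {np.nan: 0, 'Full shade': 1, 'Mostly shaded': 2,
               'Partially shaded': 3, 'Full sun': 4, 'Open grown': 5}
    predictions = merged.copy()
    predictions['canopyPosition'] = predictions['canopyPosition'].replace(mapping)

    # Per bounding box, keep the tallest, most exposed, thickest-stemmed record.
    predictions = predictions.sort_values(by=['height', 'canopyPosition', 'stemDiameter'],
                                          ascending=[False, False, False])
    duplicates = predictions.duplicated(subset=['xmin', 'ymin', 'xmax', 'ymax'], keep='first')
    return predictions[~duplicates]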

Testing with Pytest

The extract_training_data function takes in a subset of vst_data, a geo_data_frame, and a handful of parameters that are the building blocks of our data analysis.

Our test consists of loading some sample data, creating a GeoDataFrame, and then running extract_training_data on a slice of that data.

Before testing, we import the required packages: pandas as pd for reading the vegetation structure tree (vst) data, geopandas as gpd for creating a geo data frame out of it, shapely.geometry for creating geometry points, and lastly the function under test, extract_training_data.

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
from neonwranglerpy.lib.extract_training_data import extract_training_data

After the imports, we create a test function named "test_extract_training_data". Inside it, we define a savepath variable for saving the required files, load vst_data, and convert vst_data into a geo data frame named geo_data_frame.

savepath = 'tests/raw_data'
vst_data = pd.read_csv('tests/raw_data/vst_data.csv')

# Build point geometries from the UTM easting/northing columns.
geometry = [Point(easting, northing) for easting, northing in
            zip(vst_data['itcEasting'], vst_data['itcNorthing'])]
# Derive the EPSG code from the UTM zone, e.g. zone '16N' -> 32616 (WGS 84 / UTM zone 16N).
epsg_codes = (vst_data['utmZone'].map(lambda x: (326 * 100) + int(x[:-1]))).astype(str)
geo_data_frame = gpd.GeoDataFrame(vst_data, geometry=geometry, crs=epsg_codes.iloc[0])

In the end, we create a variable named result that stores the output of extract_training_data.

result = extract_training_data(vst_data.iloc[1:10, :], geo_data_frame, year='2018',
                               dpID='DP3.30010.001', savepath=savepath, site='DELA')

Now we will perform testing on the result (the fully assembled test appears after this list):

  • Data Validation: We start by checking if the sample data we’re using for testing has a non-empty shape. If we can’t trust our input data, the rest of our analysis could be compromised.
assert (vst_data.shape[0] > 0) & (vst_data.shape[1] > 0)
  • Result Length: Next, we confirm that the result from extract_training_data has some length. After all, an empty result wouldn’t be of much use to us.
assert len(result) > 0
  • Column Check: We make sure the result has a ‘geometry’ column. This is at the heart of spatial analysis, and we want to ensure it’s present and accounted for.
assert "geometry" in result.columns
  • GeoDataFrame Guarantee: We validate that the result is indeed a GeoDataFrame. This reassures us that the spatial operations we plan to perform will go off without a hitch.
assert isinstance(result, gpd.GeoDataFrame)
  • Bounding Boxes: Our function deals with bounding boxes, and we’re careful to verify that there are no duplicates among the bounding box coordinates. Duplicates could throw our calculations off, so it’s important to catch them early.
assert ~result[['xmin', 'ymin', 'xmax', 'ymax']].duplicated().any()
  • Geometry Type: Lastly, we check that the geometry type in the result is ‘Polygon’. Since our function is intended to handle polygons, this is a crucial check.
assert 'Polygon' in result['geometry'].geom_type.values
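Assembled, the whole test looks roughly like the sketch below; the test file path and layout are assumptions based on the snippets above, so the repository's actual test file may differ slightly.

import geopandas as gpd
import pandas as pd
from shapely.geometry import Point
from neonwranglerpy.lib.extract_training_data import extract_training_data

def test_extract_training_data():
    savepath = 'tests/raw_data'
    vst_data = pd.read_csv('tests/raw_data/vst_data.csv')

    # Build the GeoDataFrame exactly as described above.
    geometry = [Point(easting, northing) for easting, northing in
                zip(vst_data['itcEasting'], vst_data['itcNorthing'])]
    epsg_codes = (vst_data['utmZone'].map(lambda x: (326 * 100) + int(x[:-1]))).astype(str)
    geo_data_frame = gpd.GeoDataFrame(vst_data, geometry=geometry, crs=epsg_codes.iloc[0])

    result = extract_training_data(vst_data.iloc[1:10, :], geo_data_frame, year='2018',
                                   dpID='DP3.30010.001', savepath=savepath, site='DELA')

    assert (vst_data.shape[0] > 0) & (vst_data.shape[1] > 0)
    assert len(result) > 0
    assert "geometry" in result.columns
    assert isinstance(result, gpd.GeoDataFrame)
    assert ~result[['xmin', 'ymin', 'xmax', 'ymax']].duplicated().any()
    assert 'Polygon' in result['geometry'].geom_type.values

Running pytest from the project root picks the test up automatically and produces the passing output shown at the top of this post.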

Formatting Code using Flake8 and Yapf

Previously, we were failing the workflow tests in GitHub Actions. To fix that, I had a close look at the logs and identified that code-formatting issues were causing the failures, so I resolved them using the Flake8 extension for VS Code and the console output log.

Ensuring code readability and consistency is a cornerstone of good programming practices. In this quest for well-structured code, tools like Flake8 and Yapf play crucial roles. Flake8 acts as a vigilant code watchdog, tirelessly scanning through your codebase for style and syntax blunders, ensuring your code aligns with the principles outlined in PEP 8 guidelines. It meticulously examines indentation, variable naming, and adherence to coding conventions, providing valuable feedback that promotes clean, uniform code.

On the other hand, Yapf assumes the role of a code beautician, streamlining the process of code formatting. Just as a stylist transforms an outfit into a polished ensemble, Yapf automatically arranges your code into a standardized layout, enhancing its visual appeal. It adjusts whitespace, reflows long lines, and aligns code elements, creating a harmonious structure that fosters readability and collaboration.

Together, Flake8 and Yapf offer a dynamic duo for code enhancement. While Flake8 ensures that your code is syntactically sound and adheres to coding standards, Yapf takes care of the finer details, turning your code into a cohesive and visually pleasing creation. With these tools in your arsenal, your codebase becomes a testament to precision and professionalism, empowering developers to focus on crafting robust functionality while maintaining a consistent and attractive code style.
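For reference, both tools can be run locally before pushing; a typical invocation from the project root might look like this (the exact paths, and any style options set in the repo's configuration, are assumptions):

flake8 neonwranglerpy tests                       # report style and syntax issues
yapf --in-place --recursive neonwranglerpy tests  # rewrite files to the configured style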

Link to the project: Neonwranglerpy

Check out my other GSoC blogs here

Reach me on LinkedIn

Check out my repositories on GitHub
