Unit Testing in Python and Continuous Integration with GitHub Actions

Humphery
Published in Data Epic
15 min read · Dec 5, 2023

There’s a motivation behind tests, and we shouldn’t start this journey together without knowing what it is. Tests are ways of checking whether software works as expected. Software can be tested at different levels; here are a few:

  • Unit testing: Individual units or components are tested.
  • Integration testing: Individual units are combined and tested as a group.
  • System testing: The complete, integrated system is tested. The purpose of this test is to evaluate the system’s compliance with specified requirements.
  • Acceptance testing: The purpose of this test is to evaluate the system’s compliance with business requirements and assess whether it is acceptable for delivery.

Unit testing sits at the first of these levels (the base of the testing pyramid) and serves as a building block for learning the other types of tests, which is why this article focuses on unit testing.

According to AWS, unit testing is the process of testing the smallest functional unit of your code. A unit test is a small program written to check that one part of the code works as expected. Because it checks individual pieces, or units, of the code, it is called unit testing.
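As a minimal sketch of what that looks like in practice, the example below tests one small function in isolation. The function word_count and its test are hypothetical and are not part of the program discussed later in this article.

```python
def word_count(text):
    """Return the number of whitespace-separated words in text (hypothetical helper)."""
    return len(text.split())


def test_word_count():
    # A unit test checks one small unit against known inputs and expected outputs.
    assert word_count("unit tests are small") == 4
    assert word_count("") == 0


# pytest would discover and run test_word_count automatically;
# calling it directly works too and raises AssertionError on failure.
test_word_count()
```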

Why unit tests matter to a computer program

  • To ascertain a section of the code is working as expected.
  • To test every function and procedure.
  • To ensure reusability.

Note: Testing is a broad topic. This article covers the common concepts you need to start writing basic tests for your programs, but there is a whole lot more to learn.

Continuous Integration with GitHub Actions

Imagine you and your team are working on a project. Instead of everyone independently developing their features and waiting until the end to put everything together, Continuous Integration (CI) encourages developers to regularly merge their changes into a shared code repository. Much of this process is automated, meaning there’s a system (a CI server) that automatically checks (builds and tests) the code whenever someone makes changes.

GitHub Actions serves as a Continuous Integration (CI) server or platform. It allows you to define, customize, and automate your CI workflows directly within your GitHub repository. GitHub Actions can be used to build, test, and deploy your code automatically whenever changes are pushed to the repository.

Use Case: Writing test for a Python Program

Here is a simple Python program.

This Python program uses the selenium library to scrape information from a GitHub repository (mine in this use case). Understanding every line of the program is not essential to following this article, so it is okay if you are new to scraping or don’t know much about selenium. Here is a brief explanation of what the various parts of the code do:

import os
import logging
import gspread
import pandas as pd
from dotenv import load_dotenv
from selenium import webdriver
from selenium.webdriver.common.by import By

  • This section imports the libraries needed to run the program. The os package helps us interact with the operating system.
  • The gspread library lets us send data to an online Google spreadsheet (the data being sent in this case is the information scraped from the GitHub repo), and pandas lets us work with dataframes and series objects.
  • The selenium library scrapes the data from the GitHub repository.

class UpdateSpreadSheet():

    def __init__(self):
        """initializes instance variables of class"""
        # dictionary to hold features and values of github data
        self.github_repo = {}
        # list variable holding repository names
        self.RepoName = []
        # list variable holding language used
        self.Language = []
        # list variable holding description of each repo
        self.Description = []
        # list variable holding datetime when repo was last updated/posted
        self.Datetime_Posted = []

The main program is enclosed in a class and separated into multiple methods. This is another benefit of unit testing: it forces you to separate your code into units (functions) that can be tested easily. That has positive ripple effects, because the code becomes easier to understand and its sections easier to debug.

This section of the code creates the class UpdateSpreadSheet (as stated before, the script updates an online Google spreadsheet with the help of gspread) and initializes the instance variables of the class. The variables RepoName, Language, Description and Datetime_Posted hold the pieces of information being scraped.

    def getting_data(self):
        """gets data from the github_repo url"""
        # Getting repository names from GitHub repo with selenium
        self.RepoName += [name.text if isinstance(name, webdriver.remote.webelement.WebElement) else name
                          for name in driver.find_elements(by=By.CSS_SELECTOR, value='a[itemprop="name codeRepository"]')]
        # Getting language used for each repo from GitHub repo with selenium
        self.Language += [language.text if isinstance(language, webdriver.remote.webelement.WebElement) else language
                          for language in driver.find_elements(by=By.CSS_SELECTOR, value='span[itemprop="programmingLanguage"]')]
        # Getting description of each repo from GitHub repo with selenium
        self.Description += [desc.text if isinstance(desc, webdriver.remote.webelement.WebElement) else desc
                             for desc in driver.find_elements(by=By.CSS_SELECTOR, value='p[itemprop="description"]')]
        # Getting datetime when repo was posted or last updated from GitHub repo with selenium
        self.Datetime_Posted += [datetime.text if isinstance(datetime, webdriver.remote.webelement.WebElement) else datetime
                                 for datetime in driver.find_elements(by=By.CSS_SELECTOR, value='relative-time[class="no-wrap"]')]

This method scrapes the data from the github repository and updates the various instance variables.

    def populating_dictionary(self):
        """updates dictionary variables
        with data obtained from github"""
        # populating dictionary github_repo which will be used to create pandas dataframe
        self.github_repo["RepoName"] = self.RepoName
        self.github_repo["Language"] = self.Language
        self.github_repo["Description"] = self.Description
        self.github_repo['Datetime_posted'] = self.Datetime_Posted

This populates the dictionary github_repo with the scraped data. The dictionary is needed to create the dataframe, which is done next.

    def create_dataframe(self):
        """creating dataframe with dictionary
        which contains data gotten from github"""
        # attempting to create pandas dataframe and assigning to data variable
        try:
            self.data = pd.DataFrame(self.github_repo).reset_index(drop=True)
        except Exception as e:
            # chain the original exception so its traceback is preserved
            raise ValueError("All arrays must be of the same length", e) from e

This section attempts to create the dataframe. If the scraped data is missing some information, so that the lists differ in length, pandas cannot build the dataframe and the method raises a ValueError.
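The failure this method guards against can be reproduced in isolation. This is a minimal sketch, assuming pandas is installed; the column values are made up for illustration.

```python
import pandas as pd

# Columns of equal length build a dataframe without complaint.
good = pd.DataFrame({"RepoName": ["a", "b"], "Language": ["Python", "Go"]})

# Columns of unequal length make pandas raise a ValueError.
try:
    pd.DataFrame({"RepoName": ["a", "b"], "Language": ["Python"]})
except ValueError as error:
    caught = error
else:
    caught = None

print(good.shape)             # (2, 2)
print(type(caught).__name__)  # ValueError
```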

    def worksheet_update(self):
        """updating worksheet with data"""
        # updating the spreadsheet: header row first, then the data rows
        self.worksheet.update([self.data.columns.tolist()] + self.data.values.tolist())

Finally, this section updates a particular worksheet of the chosen google spreadsheet with the information gotten.

Now let’s write a test program

One thing to note is that a unit test checks the code itself (whether its logic works as expected) and not servers or other external elements. Take this program, which connects to the GitHub website. If GitHub suddenly goes down, the code obviously can no longer do its job, but that is not the fault of the programmer who wrote the code; it is the fault of the website that crashed.

Hence, as the programmer who wrote the code, you want to test that your program works as expected (syntax and algorithm wise), not whether the GitHub server is running smoothly. That is why we will use the concept of mocking when writing the tests. Mocking isolates the code being tested from external services or systems, and this isolation ensures that the test results reflect the behavior of the code itself. We will see more on mocking later on.
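Here is a minimal sketch of the idea, using only the standard library. The PageClient class and count_titles function are hypothetical stand-ins for code that talks to an external service; they are not part of the scraper above.

```python
from unittest.mock import patch


class PageClient:
    """Hypothetical client that would normally make a network request."""

    def fetch_titles(self):
        raise ConnectionError("no network access in tests")


def count_titles(client):
    """The code under test: it only counts whatever the client returns."""
    return len(client.fetch_titles())


def test_count_titles():
    client = PageClient()
    # Replace the network-bound method with a mock for the duration of the block.
    with patch.object(PageClient, "fetch_titles", return_value=["repo-a", "repo-b"]) as fake:
        assert count_titles(client) == 2
        fake.assert_called_once()


test_count_titles()
```

Once the with block ends, the original method is restored, so the mock cannot leak into other tests.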

Step 1: Install Required Packages

To write tests in Python, two libraries are commonly used. The first is unittest, a testing framework that ships with Python and therefore needs no installation. It is a simpler, more straightforward option that suits smaller projects.

The second is pytest, which is more powerful and flexible and a great choice for more complex projects. Here we will use pytest, together with part of unittest for mocking purposes.

On the command line, run pip install -U pytest, then run pytest --version to confirm that pytest has installed.

Step 2: Import Libraries and Packages

import pytest
from gspread_code import UpdateSpreadSheet
from unittest.mock import patch
import pandas as pd

The pytest library is imported first. Then the UpdateSpreadSheet class is imported into our test script from gspread_code, the main script we want to test. From the unittest.mock module we import patch (how to use it is explained as we go on), and finally the pandas library is imported.

Step 3: Create Test class

class TestUpdateSpreadSheet:

    @pytest.fixture
    def instantiate_class(self):
        """setup function to create
        instance of class"""
        self.inst = UpdateSpreadSheet()
        return self.inst

Some points to note:

  • Pytest relies on naming conventions to detect tests automatically: all your test functions or methods should start with the prefix test.
  • If they are methods of a class, as is done here, the class name should start with the prefix Test, and the class should not contain an __init__ method.
  • Finally, the name of your test script should start with test_ or end with _test (e.g. test_gspread_code.py or gspread_code_test.py). Here we will save ours as test_gspread_code.py.

So, here we create our class TestUpdateSpreadSheet and define our first method, instantiate_class. This method is not a test but a setup method, and that is the purpose of the @pytest.fixture decorator, which is used for setup and teardown operations. In our setup method an instance of the UpdateSpreadSheet class is created, and any test that lists instantiate_class as a parameter receives that instance.
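To illustrate the naming rules and the fixture mechanism together, here is a minimal sketch, assuming pytest is installed. The Basket class and the fixture name basket are hypothetical.

```python
import pytest


class Basket:
    """Hypothetical class under test."""

    def __init__(self):
        self.items = []

    def add(self, item):
        self.items.append(item)


class TestBasket:  # class name starts with "Test" and defines no __init__
    @pytest.fixture
    def basket(self):
        # Setup: pytest calls this before each test that names `basket` as a parameter.
        return Basket()

    def test_starts_empty(self, basket):
        assert basket.items == []

    def test_add(self, basket):
        # Each test receives its own fresh Basket, so state never leaks between tests.
        basket.add("apple")
        assert basket.items == ["apple"]
```

Saved in a file such as test_basket.py (a hypothetical name), both tests would be collected when you run pytest in that directory.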

Step 4: Write test for getting_data function

    def test_getting_data(self, instantiate_class):
        '''test to check that the data is being added to the list
        variables successfully'''

        # mocking selenium driver attribute to check that the code logic is accurate
        with patch("gspread_code.driver.find_elements") as mocked_get:
            # mocked attribute driver.find_elements is assigned a list to check
            mocked_get.return_value = ['textID', 'yes']
            # calling getting_data function which begins to get the data from selenium
            instantiate_class.getting_data()
            # asserting all list variables were updated with first round of data gotten
            assert instantiate_class.RepoName == ['textID', 'yes']
            assert instantiate_class.Language == ['textID', 'yes']
            assert instantiate_class.Description == ['textID', 'yes']
            assert instantiate_class.Datetime_Posted == ['textID', 'yes']

            # assigning new data
            mocked_get.return_value = [1, 2]
            instantiate_class.getting_data()
            # checking all list variables retained previous data and have added new data
            assert instantiate_class.RepoName == ['textID', 'yes', 1, 2]
            assert instantiate_class.Language == ['textID', 'yes', 1, 2]
            assert instantiate_class.Description == ['textID', 'yes', 1, 2]
            assert instantiate_class.Datetime_Posted == ['textID', 'yes', 1, 2]

So, we are on to writing our first test. The first thing to note is the naming convention: test_getting_data. Aside from the fact that pytest requires the test prefix to locate it, the name accurately describes the function being tested.

The purpose of this test function is outlined in the docstring. For this test we adopt the concept of mocking. The simplest way to understand mocking is to see it as putting stand-ins or placeholder values in place of the real thing: instead of connecting to the server and scraping our data, we replace the function that gets the data.

Here that function is driver.find_elements. The gspread_code prefix in the patch target is there because the driver object lives in that module, which is where our code looks it up. We replace it with a mock object, mocked_get, and set its return value to ['textID', 'yes']. This acts as our placeholder value: we pretend it is what driver.find_elements scraped.

Next, we call getting_data() on the instance we created. Since the scraping function is mocked, every call to driver.find_elements inside getting_data() returns the value we assigned, ['textID', 'yes'].

Here comes the test. Using the keyword assert, we verify that the scraped data ends up in each of our instance variables: RepoName, Language, Description and Datetime_Posted. If this test doesn’t pass, something is wrong with our program, because it fails to update the instance variables with the data as we expect. This is what we mean by testing our program and not the server or other external elements.

Next, we assign another value to mocked_get, [1, 2]. We now expect each instance variable to hold four items: the two from the first call and the two new ones. This is important because we are most likely scraping more than one row of data from the site. If the new data is not appended to the list, information is lost and the program again won’t perform as expected.
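Reassigning return_value between calls is one way to simulate successive scrapes. A possible alternative, sketched below with the standard library only, is side_effect, which hands back a different value on each call. The Collector class and fetch_page function are hypothetical stand-ins, not the article's code.

```python
from unittest.mock import Mock


class Collector:
    """Hypothetical class that accumulates results across several fetches."""

    def __init__(self, fetch_page):
        self.fetch_page = fetch_page
        self.names = []

    def collect(self):
        # += extends the existing list instead of replacing it.
        self.names += self.fetch_page()


def test_collect_accumulates():
    # side_effect with a list returns the next item on every call.
    fetch_page = Mock(side_effect=[["textID", "yes"], [1, 2]])
    collector = Collector(fetch_page)

    collector.collect()
    assert collector.names == ["textID", "yes"]

    collector.collect()
    assert collector.names == ["textID", "yes", 1, 2]
    assert fetch_page.call_count == 2


test_collect_accumulates()
```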

Step 5: Write test for populating dictionary function.

    def test_populating_dictionary(self, instantiate_class):
        """test to check if dictionary variable
        github_repo is successfully being populated"""

        # populating list variables with values
        instantiate_class.RepoName = [1]
        instantiate_class.Language = [1, 2]
        instantiate_class.Description = [1, 2]
        instantiate_class.Datetime_Posted = [1, 2]
        # calling function to populate dictionary
        instantiate_class.populating_dictionary()
        # asserting the number of dictionary keys equals the number of list variables
        assert len(instantiate_class.github_repo.keys()) == 4
        # asserting values of dictionary are identical to the values contained in the list variables
        assert list(instantiate_class.github_repo.values()) == [[1], [1, 2], [1, 2], [1, 2]]

Here we write another test, this time for our populating_dictionary() function. Something to note: you don’t have to write tests for every function in your code. That choice is left to you and probably your organization; you write tests for the functions where it is beneficial to do so.

For this test we assign values to the instance variables as shown. We don’t mock here, because we are not interfacing with any external element. We just supply our placeholder values and test that the dictionary github_repo is indeed populated with the column names and data when we call populating_dictionary().

Hence, we assert that the dictionary has 4 keys. The keys represent our columns, and we have 4 columns corresponding to:

  • RepoName
  • Language
  • Description
  • Datetime_Posted.

We also check that the values of the dictionary contain the same data that was in the rows under those columns.

Step 6: Write test for create_dataframe function.

    def test_creating_dataframe(self, instantiate_class):
        """test to check if dataframe is created successfully
        when given appropriate data and if the ValueError
        is raised when the data is not appropriate"""

        # populating dictionary variable github_repo with appropriate data
        instantiate_class.github_repo = {'test1': [2, '3'], 'test2': ['5', '4']}
        # calling create_dataframe function which creates a dataframe with dictionary github_repo
        instantiate_class.create_dataframe()
        # asserting data variable which holds the dataframe created from github_repo equals the expected dataframe
        assert instantiate_class.data.equals(pd.DataFrame({'test1': [2, '3'], 'test2': ['5', '4']}))

        # populating github_repo variable with inappropriate data
        instantiate_class.github_repo = {'test1': [2, '3'], 'test2': [5]}
        # asserting that ValueError is raised when an attempt to create a dataframe with this data is made
        with pytest.raises(ValueError):
            instantiate_class.create_dataframe()

Now we write another test, for our create_dataframe function. Again we populate the github_repo dictionary directly. It exists within our code and is not an external element, so we don’t need to mock it; we can simply assign values to it. Then we call create_dataframe(), which we expect to create our dataframe if the data in the github_repo dictionary is appropriate (by appropriate we mean no missing values in any row, so every column has the same length).

In case there are missing values, we expect our function to raise a ValueError, and we test that too by populating the dictionary with inappropriate data: the test1 column has two rows of data, [2, '3'], while the test2 column has only one, [5]. If the ValueError is not raised when we call create_dataframe(), the test fails. Note how the test for errors is written: it uses pytest.raises as a context manager rather than the plain assert statement used in the other tests.
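The same pattern in isolation, as a minimal sketch assuming pytest is installed. The build_table function is a hypothetical, pandas-free stand-in for create_dataframe.

```python
import pytest


def build_table(columns):
    """Hypothetical helper: reject columns whose lengths differ."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("All arrays must be of the same length")
    return columns


def test_build_table_rejects_uneven_columns():
    # The test passes only if the block raises ValueError;
    # if nothing is raised, pytest.raises fails the test.
    with pytest.raises(ValueError, match="same length"):
        build_table({"test1": [2, "3"], "test2": [5]})


def test_build_table_accepts_even_columns():
    data = {"test1": [2, "3"], "test2": ["5", "4"]}
    assert build_table(data) == data


test_build_table_rejects_uneven_columns()
test_build_table_accepts_even_columns()
```

The optional match argument also checks the error message, which helps confirm that the error raised is the one you intended.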

Step 7: Write test for worksheet_update function.

    def test_worksheet_update(self, instantiate_class):
        """test to check if the values passed in to
        update worksheet is accurate"""

        # populating github_repo variable
        instantiate_class.github_repo = {'test1': [2, '3'], 'test2': ['5', '4']}
        # creating dataframe from github_repo
        instantiate_class.create_dataframe()

        # asserting dataframe created contains expected columns
        assert instantiate_class.data.columns.tolist() == ['test1', 'test2']
        # asserting dataframe created contains expected values
        assert instantiate_class.data.values.tolist() == [[2, '5'], ['3', '4']]

Finally comes our last test. It checks the two values we pass to the worksheet update in worksheet_update(): data.columns.tolist() and data.values.tolist(). We assert that both return what we expect, in the correct format. If they did not, then adding them together as we do would either raise an error or, if the update still went through, fill the worksheet with data in the wrong format.

And breathe! We have written four tests for our Python script, and this section should give you insight into how to go about writing your own. Now let’s run our test script; by now you should have pytest installed.
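Note that the test above checks the arguments but never calls worksheet_update() itself. One possible way to exercise the call as well is to replace the worksheet with a MagicMock and assert on how it was called. The sketch below uses a hypothetical, pandas-free SheetUpdater class rather than the article's class, whose worksheet attribute is set up in code not shown here.

```python
from unittest.mock import MagicMock


class SheetUpdater:
    """Hypothetical stand-in: sends a header row plus data rows to a worksheet."""

    def __init__(self, worksheet):
        self.worksheet = worksheet
        self.columns = []
        self.rows = []

    def worksheet_update(self):
        self.worksheet.update([self.columns] + self.rows)


def test_worksheet_update_sends_header_and_rows():
    fake_worksheet = MagicMock()  # stands in for the real worksheet object
    updater = SheetUpdater(fake_worksheet)
    updater.columns = ["test1", "test2"]
    updater.rows = [[2, "5"], ["3", "4"]]

    updater.worksheet_update()

    # The mock records the call, so no spreadsheet is contacted.
    fake_worksheet.update.assert_called_once_with(
        [["test1", "test2"], [2, "5"], ["3", "4"]]
    )


test_worksheet_update_sends_header_and_rows()
```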

  • Navigate to the working directory on your command prompt/anaconda prompt.
  • Run the command pytest on your command prompt/anaconda prompt and if you named the file correctly then you should see the outputs of your tests.

Output:

If all your tests passed, you should see an image like the one below, letting you know how many of the tests you wrote passed. In this case we wrote 4 tests.

All tests Passed.

If any of your tests fail, you should see an output that looks like the one below.

Not all tests passed.

When a test fails, pytest not only tells you how many tests passed and failed, as shown in the image, it also tells you which test failed and why. In this case the test that failed was test_creating_dataframe(), because create_dataframe() did not raise a ValueError when we expected it to. The data passed in was appropriate, while the test was checking the case where inappropriate data is passed, so pytest.raises reported that no error was raised.

And that’s a wrap on tests. Next we move to the other section: how do we automate (Continuous Integration) the tests we have written with GitHub Actions?

Section 2: Continuous Integration

Again, there is a lot to learn on this concept. What we cover here is what you need to start automating your test scripts and working with GitHub Actions; you can learn more in the GitHub Actions documentation. We need two things to achieve this:

  1. GitHub Actions looks for workflows in a specific path. To set up your workflow (the automated process, in our case an automated test run) you need a .yml file inside the .github/workflows/ directory, for example .github/workflows/main.yml. So you create this directory and the .yml file (you can give the file any name; here we name it main.yml).
  2. The .yml file is where you write the instructions that make up the workflow. YAML is a language of its own, so its syntax differs from the Python you are probably used to, but it is simple to learn.

Let’s write our workflow process.

The workflow file is laid out as follows:

  1. We give our workflow a name; here we call it continuous integration.
  2. The on key is where you specify what triggers your workflow. Here the workflow is triggered whenever we push to the repository. In essence, whenever we push an update of our code, the workflow starts and the tests it runs are carried out on the updated code. This is one of the uses of GitHub Actions: you no longer need to check manually whether your updated code has broken something. Just push, and the tests run automatically as specified by your workflow. If they fail, you know your updated code has some issues.
  3. jobs: this is where you specify the actual work you want your workflow to do. Here we have two jobs, build and test.
  4. For our first job, build, we specify the operating system it runs on with the runs-on key. The common choice is ubuntu-latest; you can check the docs for other options, although this should do just fine.
  5. Next, we specify the steps of this job. First it checks out our repository to the runner, which lets us run scripts and other actions against our code. Then we install the dependencies listed in your requirements.txt file with the run key. Note: pytest must be among these dependencies for us to run the tests. That’s it for our first job, build.
  6. Then we go on to our next job, test. It follows a similar procedure to the one above, with one extra step named test_script that runs the command pytest. This runs the tests we have written in our test_gspread_code.py.
  7. The needs keyword in the test job specifies that the test job should only run after the build job has run. In essence we are saying that test needs the build job to have completed before it can run itself.
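Put together, a workflow matching the steps above might look like the sketch below. This is a reconstruction from the description, not the author's exact file. The action versions, the step names other than test_script, and the exact pip command are assumptions, and the actions/setup-python step is an addition the article does not mention, included so that pip installs into a Python version managed by the workflow.

```yaml
name: continuous integration

on: push

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - name: checkout repository
        uses: actions/checkout@v4
      - name: set up python          # assumption: not part of the described steps
        uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - name: install dependencies
        run: pip install -r requirements.txt

  test:
    needs: build                     # run only after the build job succeeds
    runs-on: ubuntu-latest
    steps:
      - name: checkout repository
        uses: actions/checkout@v4
      - name: set up python          # assumption: not part of the described steps
        uses: actions/setup-python@v5
        with:
          python-version: "3.x"
      - name: install dependencies
        run: pip install -r requirements.txt
      - name: test_script
        run: pytest
```

Each job runs on a fresh runner, which is why the test job repeats the checkout and install steps instead of reusing what build installed.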

Now when you push updated code to GitHub, your workflow is triggered! You should see a small yellow dot close to the green Code button, showing that your workflow has started running.

Next, click on Actions in the top bar of the GitHub page. It takes you to where you can see all your workflows and whether they passed or failed, as shown below:

Workflows

Here it shows the workflows that have run: one has failed and is marked with a red ‘x’, and one has passed, which is the latest one we just ran. Finally, you can click on a workflow run to view the step-by-step process it went through:

If we want to view more details, we click on the build or test job, and it shows us more of the workflow process.

And that’s a wrap! What a journey! Thanks for sticking with me till the end. I hope you found this article informative and that you are able to start writing your very own tests and automating your workflows with GitHub Actions!

Humphery
Data Epic

Machine Learning Enthusiast, Control Engineer