Jupyter Notebook meets Marsha.ai! 🤝

aguillenv
4 min read · Aug 8, 2023


It is incredible to see how people from different backgrounds and experts in diverse domains are genuinely intrigued and amazed by the potential of LLMs (Large Language Models). And how could they not be? Since LLMs became widely available to everyone, they have opened up a world of possibilities. Some people have started using these tools to get answers about their businesses and ways to improve them. Others are seeking ways to integrate these tools into their current workflows to be more productive.

In the tech industry, countless articles discuss how LLMs have helped improve productivity. This holds true when these tools, such as ChatGPT, GPT Engineer, GitHub Copilot, etc., are used wisely. They can accelerate workflows by addressing challenges like the so-called “Cold start” problem, offering boilerplate code to kickstart projects, or assisting with tedious or repetitive tasks.

All of these articles, especially the ones discussing how to use code generation tools, converge on one crucial point: the generated response cannot be trusted. While it may appear almost perfect, more often than not, it requires some level of human interaction to address bugs or typos. The more complex the task, the higher the probability of errors occurring.

Another consideration to take into account is the ease of integration with these new tools. Initially, when using tools like ChatGPT, a significant amount of context switching was required. However, numerous wrappers and plugins have been developed over time to simplify the integration process and meet the users where they are.

David Ellis, Luis Fernando De Pombo and I recently introduced Marsha, our proposal for one of those tools, aimed at improving the reliability of LLM-generated code.

What (or who? 👀) is Marsha?

As the GitHub page explains, Marsha is an LLM-based programming language. Based on an English description of what you want to do and some examples of usage following a minimal syntax, the Marsha compiler will guide an LLM to produce tested Python code.

So, what can Marsha actually do?

Marsha can be used to generate more reliable Python code following a syntax designed to be minimal. You just need to create a `.mrsh` file specifying the functions and types you want, and Marsha will generate the Python code for those functions, including a test file based on the examples provided and the necessary requirements.
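For a taste of the syntax, here is a minimal, hypothetical spec (the patterns follow the example later in this post; the full grammar lives in the GitHub repo). A function is declared with its signature, an English description, and usage examples:

```
# func fibonacci(integer): integer

Returns the nth number in the Fibonacci sequence.

* fibonacci(1) = 1
* fibonacci(2) = 1
* fibonacci(5) = 5
```

From this, the Marsha compiler guides the LLM to produce a Python implementation plus a test file derived from the starred examples.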

In the GitHub repository, you can find different use cases for Marsha. In this post, let's look at an example of how Marsha can be easily integrated into a Jupyter Notebook workflow.

The full notebook example can be found here

Assuming you have your Jupyter notebook up and running:

  • First, you need to install marsha via pip
%pip install git+https://github.com/alantech/marsha
  • Set the OpenAI API key
%env OPENAI_SECRET_KEY=sk-...
  • Create the `.mrsh` file with the desired definition. In this case, we show how it can be defined inline and then saved to a file.
marsha_filename = 'employee_skills.mrsh'
marsha_content = '''
# type EmployeesByDepartment ./employees_by_department.csv


# type DepartmentSkills ./department_skills.csv


# type EmployeeSkills
name, skill
Bob, math
Jake, spreadsheets
Lisa, coding
Sue, spreadsheets


# func get_employee_skills(list of EmployeesByDepartment, list of DepartmentSkills): list of EmployeeSkills

This function receives a list of EmployeesByDepartment and a list of DepartmentSkills.
The function should be able to create a response of EmployeeSkills merging the 2 lists by department.
Use the pandas library.

* get_employee_skills() = throws an error
* get_employee_skills([EmployeesByDepartment('Joe', 'Accounting')]) = throws an error
* get_employee_skills([], []) = []
* get_employee_skills([EmployeesByDepartment('Joe', 'Accounting')], []) = []
* get_employee_skills([], [DepartmentSkills('Accounting', 'math')]) = []
* get_employee_skills([EmployeesByDepartment('Joe', 'Accounting')], [DepartmentSkills('Accounting', 'math')]) = [EmployeeSkills('Joe', 'math')]
* get_employee_skills([EmployeesByDepartment('Joe', 'Accounting'), EmployeesByDepartment('Jake', 'Engineering')], [DepartmentSkills('Accounting', 'math')]) = [EmployeeSkills('Joe', 'math')]
* get_employee_skills([EmployeesByDepartment('Joe', 'Accounting'), EmployeesByDepartment('Jake', 'Engineering')], [DepartmentSkills('Accounting', 'math'), DepartmentSkills('Engineering', 'coding')]) = [EmployeeSkills('Joe', 'math'), EmployeeSkills('Jake', 'coding')]


# func read_csv_file(path to file): file data without header

This function reads a CSV file and returns the csv content without the header.


# func process_data(path to file with EmployeesByDepartment, path to file with DepartmentSkills): list of EmployeeSkills

This function uses `read_csv_file` to read the 2 csv files received and create the respective lists. Make sure to strip and lower each string property coming from the csv. Then, call and return the result from `get_employee_skills`.

* process_data('/pathA', '') = throws an error
* process_data('/pathA', '/pathB') = [EmployeeSkills('Joe', 'math')]
* process_data('/pathA', 'pathC') = [EmployeeSkills('Joe', 'math'), EmployeeSkills('Jake', 'coding')]
'''
with open(marsha_filename, 'w') as f:
    f.write(marsha_content)

This file defines a couple of types from data coming from CSV files, and another type defined inline following the CSV format. These types will be translated into classes in the generated Python code. The CSV files can be found in the repository.

It then defines a set of functions that try to follow the single responsibility principle: they read a couple of CSV files, do some data manipulation, and return the final result.
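The generated module varies from run to run, so here is only a hand-written sketch of what it might plausibly look like. The class names, the pandas merge, and the strip/lower normalization come from the `.mrsh` spec above; everything else (constructor shapes, CSV headers, join details) is an assumption:

```python
import csv
import pandas as pd

# Hypothetical sketch of the module Marsha might generate;
# the real output differs between runs.

class EmployeesByDepartment:
    def __init__(self, name, department):
        self.name = name
        self.department = department

class DepartmentSkills:
    def __init__(self, department, skill):
        self.department = department
        self.skill = skill

class EmployeeSkills:
    def __init__(self, name, skill):
        self.name = name
        self.skill = skill

def get_employee_skills(employees_by_department, department_skills):
    # Merge the two lists by department using pandas, as the spec requests.
    employees_df = pd.DataFrame(
        [(e.name, e.department) for e in employees_by_department],
        columns=['name', 'department'])
    skills_df = pd.DataFrame(
        [(s.department, s.skill) for s in department_skills],
        columns=['department', 'skill'])
    # Inner join: employees in departments without skills drop out, and
    # vice versa, matching the starred examples in the spec.
    merged = employees_df.merge(skills_df, on='department')
    return [EmployeeSkills(row['name'], row['skill'])
            for _, row in merged.iterrows()]

def read_csv_file(path):
    # Return the CSV rows, skipping the header row.
    with open(path, newline='') as f:
        return list(csv.reader(f))[1:]

def process_data(employees_path, skills_path):
    # Read both files, strip/lower each string, then merge.
    employees = [EmployeesByDepartment(n.strip().lower(), d.strip().lower())
                 for n, d in read_csv_file(employees_path)]
    skills = [DepartmentSkills(d.strip().lower(), s.strip().lower())
              for d, s in read_csv_file(skills_path)]
    return get_employee_skills(employees, skills)
```

Calling `get_employee_skills()` with missing arguments raises a `TypeError`, which satisfies the "throws an error" examples without any explicit check.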

  • Execute marsha
!python -m marsha ./"$marsha_filename"

Marsha generates a Python script, a corresponding test file, and a requirements file listing the needed dependencies.

  • To integrate the generated code with the notebook workflow, install the dependencies generated by Marsha
%pip install -r requirements.txt
  • Install matplotlib, since we will be creating a visualisation with the data coming from the generated code
%pip install matplotlib
  • Now just import the function and use it to generate the visualisation of the obtained results
import pandas as pd
import matplotlib.pyplot as plt

from employee_skills import process_data

employee_skills_list = process_data('./employees_by_department.csv', './department_skills.csv')
employee_skills_df = pd.DataFrame([(e.name, e.skill) for e in employee_skills_list], columns=["Name", "Skill"])
skill_counts = employee_skills_df["Skill"].value_counts()

plt.figure(figsize=(8, 4))
plt.pie(skill_counts, labels=skill_counts.index, autopct="%1.1f%%")
plt.title("Employee Skills")
plt.show()

