Write better Jupyter Markdown using Google’s Gemini
Data scientists and machine learning engineers often struggle with maintaining well-documented Jupyter notebooks. As projects evolve, documentation can become outdated or inconsistent, making it challenging for team members to understand and collaborate effectively. In this article, I’ll share how I built an automated documentation generator that uses Google’s Vertex AI to analyze and enhance Jupyter notebooks with structured documentation.
The Challenge of Notebook Documentation
Jupyter notebooks are fantastic tools for exploratory data analysis and model development, but they come with their own set of challenges:
- Documentation often becomes an afterthought during rapid development
- Code sections lack clear boundaries and explanations
- Notebooks can become lengthy and difficult to navigate
- Maintaining consistent documentation across multiple notebooks is time-consuming
To address these issues, I developed a tool that automatically analyzes notebook content and generates comprehensive documentation using Google’s Vertex AI and Gemini model.
Core Features of the Documentation Generator
The tool provides several key features:
- Automatic Structure Analysis: Analyzes code cells to identify logical sections and their relationships
- Smart Title Generation: Creates descriptive titles based on notebook content
- Section Documentation: Generates detailed explanations for each code section
- Table of Contents: Creates a navigable table of contents for easy reference
- Contextual Understanding: Uses both code and existing markdown content to generate relevant documentation
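Before diving into the implementation, it helps to pin down how notebook cells are represented. The article's code later references a `NotebookCell` type with `cell_type` and `source` attributes; the exact definition isn't shown, so here is a minimal sketch of what it might look like, along with a loader built on the fact that a `.ipynb` file is plain JSON (the `index` field and `load_cells` helper are my assumptions, not the author's code):

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class NotebookCell:
    cell_type: str  # "code" or "markdown"
    source: str     # cell content flattened into a single string
    index: int      # position in the full notebook (hypothetical field)


def load_cells(notebook_path: str) -> List[NotebookCell]:
    """Read a .ipynb file (plain JSON) and flatten each cell's source."""
    with open(notebook_path, encoding="utf-8") as f:
        nb = json.load(f)
    return [
        NotebookCell(
            cell_type=cell["cell_type"],
            # .ipynb stores source as a list of lines; join into one string
            source="".join(cell["source"]),
            index=i,
        )
        for i, cell in enumerate(nb.get("cells", []))
    ]
```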
Building the Documentation Generator
Setting Up the Foundation
The tool is built on top of Google’s Vertex AI platform, utilizing the Gemini 1.5 Pro model. Here’s how we initialize the core components:
import vertexai
from vertexai.generative_models import GenerativeModel, Tool, grounding

class NotebookProcessor:
    def __init__(self, project_id: str, location: str) -> None:
        vertexai.init(project=project_id, location=location)
        self.model = GenerativeModel(
            MODEL_NAME,
            tools=[
                Tool.from_google_search_retrieval(
                    google_search_retrieval=grounding.GoogleSearchRetrieval()
                )
            ],
        )
Notice that Gemini uses Google Search grounding to reduce hallucinations around recent terms, tools, libraries, and concepts that may not be part of the model's training data.
Smart Structure Analysis
One of the key features is the ability to automatically identify logical sections within the notebook. The tool analyzes code cells and their relationships to create a coherent structure:
def analyze_notebook_structure(
    self, cells: List[NotebookCell], full_content: Dict[str, str]
) -> NotebookStructure:
    """
    Analyzes only the code cells to determine structure (sections), title, and introduction.
    The LLM assigns sections using the code_cell indices, ignoring markdown cells.
    """
    code_cells = [cell for cell in cells if cell.cell_type == "code"]
    if not code_cells:
        default_structure = NotebookStructure()
        default_structure.title = "Python Notebook"
        return default_structure

    cells_json_str = self._prepare_code_cells_for_prompt(code_cells)
    prompt = f"""
You are given a list of code cells from a Jupyter Notebook. Each item has:
- 'code_index': the sequential index among code-only cells,
- 'original_index': the cell's position in the full notebook (for reference),
- 'content': the code content.

**Goal**:
1. Suggest a descriptive, technical title for the notebook.
2. Generate an introduction paragraph that will be added just after the title. This introduction should briefly summarize the notebook's purpose and key topics.
3. Divide these code cells into 2-10 logical sections based on functionality, context, and flow.
4. For each section, provide:
   - A 'title' (subtitle),
   - A short 'description',
   - 'start_cell' and 'end_cell' (both inclusive), which refer to 'code_index' in the list below.

**Your output** must follow this JSON schema exactly:
{json.dumps(NOTEBOOK_STRUCTURE_SCHEMA, indent=2)}

Here are the code cells:
{cells_json_str}

Additionally, here's the entire notebook's code and markdown for background context:
- Markdown content:
\"\"\"markdown
{full_content['markdown']}
\"\"\"
- Code content:
\"\"\"python
{full_content['code']}
\"\"\"

Important:
- The first code_index is 0, the last code_index is {len(code_cells) - 1}.
- Use code_index in 'start_cell' and 'end_cell'; do not reference 'original_index'.
- The final structure should remain consistent (start_cell <= end_cell).
"""
    try:
        response = self.model.generate_content(
            prompt,
            generation_config={
                **GENERATION_CONFIG,
                "response_mime_type": "application/json",
                "response_schema": NOTEBOOK_STRUCTURE_SCHEMA,
            },
            safety_settings=SAFETY_OFF_SETTINGS,
        )
        structure_data: Dict = json.loads(response.text)
        return self._validate_and_create_structure(structure_data, len(code_cells))
    except json.JSONDecodeError as e:
        raise ValueError(f"Failed to decode JSON response from model: {e}")
    except Exception as e:
        print(f"Error analyzing notebook structure: {e}")
        return self._create_default_structure(len(code_cells))

The analysis generates a structured representation of the notebook, including:
- A descriptive title
- An introduction summarizing the notebook’s purpose
- Logical sections with start and end points
- Section descriptions and relationships
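The `NotebookStructure` returned above isn't defined in the article; based on the JSON schema that follows, a plausible sketch (the `Section` class and all field defaults are my assumptions) could look like this:

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    title: str
    description: str = ""
    start_cell: int = 0  # inclusive code_index where the section begins
    end_cell: int = 0    # inclusive code_index where the section ends


@dataclass
class NotebookStructure:
    title: str = "Python Notebook"
    introduction: str = ""
    sections: List[Section] = field(default_factory=list)
```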
Gemini's controlled generation comes in handy here: it guarantees that the output conforms to a fixed schema and is therefore reliably parseable. Here is the JSON schema of the generated notebook structure:
NOTEBOOK_STRUCTURE_SCHEMA = {
"type": "OBJECT",
"properties": {
"title": {"type": "STRING"},
"introduction": {"type": "STRING"},
"sections": {
"type": "ARRAY",
"items": {
"type": "OBJECT",
"properties": {
"title": {"type": "STRING"},
"description": {"type": "STRING"},
"start_cell": {"type": "INTEGER"},
"end_cell": {"type": "INTEGER"},
},
"required": ["title", "start_cell", "end_cell"],
},
},
},
"required": ["title", "sections", "introduction"],
}
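Even with a schema-constrained response, the model can still emit out-of-range or inverted cell indices, which is why the code above calls `_validate_and_create_structure`. Its implementation isn't shown in the article; a minimal standalone sketch of that kind of validation (function name and exact rules are my assumptions) might be:

```python
def validate_structure(data: dict, num_code_cells: int) -> dict:
    """Clamp section boundaries into range and drop inconsistent sections."""
    last = num_code_cells - 1
    valid = []
    for sec in data.get("sections", []):
        # Clamp both boundaries into [0, last]
        start = max(0, min(sec["start_cell"], last))
        end = max(0, min(sec["end_cell"], last))
        # Drop sections that are still inverted after clamping
        if start <= end:
            valid.append({**sec, "start_cell": start, "end_cell": end})
    # Keep sections in notebook order
    return {**data, "sections": sorted(valid, key=lambda s: s["start_cell"])}
```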
Generating Smart Documentation
For each identified section, the tool generates contextual documentation that explains the purpose and approach rather than just describing the code:
def generate_section_markdown(
    self, cells: List[NotebookCell], section_info: Dict, full_content: Dict[str, str]
) -> str:
    """Generates markdown documentation for a section of the notebook."""
    code_cells = [cell for cell in cells if cell.cell_type == "code"]
    section_code_cells = code_cells[section_info["start_cell"] : section_info["end_cell"] + 1]
    prompt = f"""
Generate documentation for this section of the notebook.

Context:
1. Section: {section_info['title']}
   Description: {section_info['description']}
2. Code in this section:
```python
{chr(10).join(cell.source for cell in section_code_cells)}
```
3. Complete notebook context:
   Markdown: ```markdown
{full_content['markdown']}
```
   Code: ```python
{full_content['code']}
```

Requirements:
1. Your text should exclusively discuss the code of this section
2. The text you write for this section should fit the overall code and the overall initial markdown, while remaining exclusive to this section's code
3. Focus on the purpose and outcomes rather than line-by-line explanation
4. Explain concepts and approaches rather than just code functionality
5. Maintain a narrative flow that connects to the overall notebook purpose
6. Use appropriate markdown formatting
7. Don't reference cell numbers or positions
8. Group related operations together in the explanation
9. Be concise; use structure and bullet points when necessary
10. Focus more on functional rather than code and syntax details
11. Keep it short and concise, very concise

Return formatted markdown content that creates a clear narrative for this section.
"""
    try:
        response = self.model.generate_content(prompt)
        return response.text.strip()
    except Exception as e:
        print(f"Error generating section markdown for '{section_info['title']}': {e}")
        return f"<!-- Error generating markdown: {e} -->\n\n### {section_info['title']}"

The documentation focuses on:
- Purpose and outcomes of the code section
- Conceptual explanations rather than line-by-line details
- Integration with the overall notebook narrative
- Concise and structured presentation
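The article doesn't show how the generated markdown is woven back into the notebook. Since a `.ipynb` file is JSON, one straightforward approach (entirely my sketch, not the author's code) is to insert a markdown cell before each section's first code cell:

```python
def insert_section_headers(nb: dict, sections: list) -> dict:
    """Insert a generated markdown cell before each section's first code cell.

    `sections` items carry 'start_cell' (a code_index, counting code cells
    only) plus the generated 'title' and 'markdown' text.
    """
    starts = {s["start_cell"]: s for s in sections}
    new_cells = []
    code_seen = 0  # running code_index, matching the structure analysis
    for cell in nb["cells"]:
        if cell["cell_type"] == "code":
            if code_seen in starts:
                s = starts[code_seen]
                md = f"## {s['title']}\n\n{s['markdown']}"
                new_cells.append({
                    "cell_type": "markdown",
                    "metadata": {},
                    "source": md.splitlines(keepends=True),
                })
            code_seen += 1
        new_cells.append(cell)
    return {**nb, "cells": new_cells}
```

The same pattern extends to prepending a title cell, the introduction, and a table of contents built from the section titles.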
Best Practices and Lessons Learned
While developing this tool, I discovered several key practices that enhance the quality of generated documentation:
- Context is King: Providing the full notebook context to the LLM results in more coherent and relevant documentation
- Structured Output: Using JSON schemas for LLM responses ensures consistent and parseable output
- Robust Error Handling: Implementing fallback mechanisms ensures the tool remains useful even when faced with unexpected content
- Progressive Enhancement: The tool takes existing, less complete documentation into consideration when generating the new version
- Google Search grounding: The tool leverages Google Search grounding to avoid hallucinations about recent tools, concepts, and libraries
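On the error-handling point: LLM API calls can fail transiently, so wrapping them in a retry with exponential backoff keeps the tool usable. This generic helper is my illustration of the practice, not code from the project:

```python
import time


def with_retries(fn, attempts: int = 3, base_delay: float = 1.0):
    """Call fn(); on failure, back off exponentially and retry.

    Re-raises the last exception if every attempt fails.
    """
    for i in range(attempts):
        try:
            return fn()
        except Exception:
            if i == attempts - 1:
                raise
            time.sleep(base_delay * (2 ** i))  # 1s, 2s, 4s, ...
```

In practice you would wrap the `self.model.generate_content(...)` calls, e.g. `with_retries(lambda: self.model.generate_content(prompt))`, and fall back to a default structure only after the retries are exhausted.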
Future Improvements
There are several exciting possibilities for enhancing the tool:
- Support for custom documentation templates and styles
- Enhanced code analysis for identifying dependencies and data flows
- Architecture diagram generation when it makes sense to do so
Conclusion
Automated documentation generation for Jupyter notebooks is more than a convenience: it's a crucial tool for maintaining high-quality, maintainable code in data science projects. By leveraging Vertex AI and the Gemini model, we can create documentation that is both comprehensive and contextually aware.
The tool demonstrates how AI can be effectively used to solve real-world development challenges while maintaining high standards of code documentation.
The complete code for this project is available on GitHub here. Feel free to contribute or adapt it for your own use cases.
About me
I’m Chouaieb Nemri, a Generative AI BlackBelt Specialist at Google with over a decade of experience in data, cloud computing, AI, and electrical engineering. My passion lies in helping executives and tech leaders turbocharge their cloud-based AI, ML, and Generative AI initiatives. Before Google, I worked at AWS as a GenAI Lab Solutions Architect and served as an AI and Data Science consultant at Capgemini and Devoteam. I also led cloud data engineering training at the French startup DataScientest, directly collaborating with its CTO. Outside of work, I’m dedicated to mentoring aspiring tech professionals — especially people with disabilities — and I hold a 5-star mentor rating across platforms like MentorCruise, IGotAnOffer and ADPList.
If you like the article and would like to support me make sure to:
- 👏 Clap for the story (50 claps) and follow me 👉
- 📰 View more content on my medium profile
- 💪 Have me as a Mentor on iGotAnoffer or MentorCruise
- 🔔 Follow Me: LinkedIn | TikTok | Instagram | Medium | GitHub | Twitter
- 🚀👉 Join the Medium membership program to continue learning without limits. I’ll receive a small portion of your membership fee if you use the following link, at no extra cost to you.