Efficiently Loading Hierarchical Raw Content from GitHub

Yatharth Gupta · VLEAD-Tech · Jul 26, 2023

Introduction

In modern web development, efficiently retrieving hierarchical content from remote repositories is a common challenge. Whether it’s FAQs, blog posts, or any other type of content, developers often face the task of fetching and organizing data from nested sub-directories in a repository. In this blog post, we will explore a generalized approach to tackle this problem and provide context through a specific use case.

Understanding the Challenge

Consider a scenario where we have a repository on GitHub containing various sub-directories, each housing different content files. Our objective is to load this hierarchical content and display it in a structured manner on a web application. To achieve this, we need to optimize our approach to avoid unnecessary API calls and efficiently retrieve the data.

TL;DR

In this blog post, we explore efficient ways to retrieve hierarchical content from GitHub repositories using React.js. We present a generalized approach that involves utilizing the Git tree API, filtering and sorting content, and making a single API call using raw content URLs. By optimizing our approach, we overcome the GitHub API rate limit without embedding personal access tokens. We also share a specific use case during an internship at Virtual Labs, where we successfully fetched and sorted FAQs and built a collapsible FAQ component. Mastering these techniques helps developers handle diverse content and deliver smooth user experiences in web applications. Happy coding!

The Generalized Approach

Step 1: Using GitHub API for Tree Structure

To retrieve hierarchical content from the GitHub repository, we utilize the Git tree API. This API allows us to fetch metadata of the tree structure of the repository, including all directories and files within it. By obtaining the tree structure, we gain insight into the organization of the content, enabling us to navigate through the sub-directories.
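A minimal sketch of this first step might look like the following; the `owner`, `repo`, and `ref` values are placeholders you would substitute for your own repository:

```javascript
// Build the Git tree API URL for a repository; ?recursive=1 returns the
// full tree (every directory and file) in a single response.
const treeUrl = (owner, repo, ref) =>
  `https://api.github.com/repos/${owner}/${repo}/git/trees/${ref}?recursive=1`;

// One call returns metadata for the whole repository tree.
async function fetchTree(owner, repo, ref = "main") {
  const response = await fetch(treeUrl(owner, repo, ref));
  if (!response.ok) {
    throw new Error(`Tree request failed: ${response.status}`);
  }
  const { tree, truncated } = await response.json();
  if (truncated) {
    console.warn("Tree response was truncated; the repository is very large");
  }
  return tree; // array of entries like { path, type, sha, ... }
}
```

Note the `truncated` flag in the response: for very large repositories GitHub may not return the full tree, which is worth checking before relying on the result.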

Step 2: Filtering and Sorting

With the tree structure in hand, we can filter the content files based on specific criteria. For instance, if we are interested in FAQs, we can filter the tree to extract only the .mdx files containing FAQ information. Additionally, we can sort the filtered files to ensure a proper sequence when displaying the content.
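As a sketch, filtering and sorting can be done with plain array methods; the sample tree entries and the `Q<number>` naming convention below are illustrative assumptions:

```javascript
// Sample entries in the shape returned by the Git tree API (paths invented).
const tree = [
  { path: "faq/Q2-accounts.mdx", type: "blob" },
  { path: "faq/Q1-getting-started.mdx", type: "blob" },
  { path: "README.md", type: "blob" },
];

// Keep only .mdx files, then sort by the question number in the filename.
const faqFiles = tree
  .filter((item) => item.path.endsWith(".mdx"))
  .sort((a, b) => {
    const numA = Number(a.path.match(/Q(\d+)/)?.[1] ?? 0);
    const numB = Number(b.path.match(/Q(\d+)/)?.[1] ?? 0);
    return numA - numB;
  });
// faqFiles is now ordered Q1, Q2, ...
```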

Step 3: Single API Call for Data Fetching

To optimize our approach, we keep requests to the GitHub REST API to a minimum: the tree call is the only one. The content files themselves are fetched from their raw content URLs (raw.githubusercontent.com), which are served outside the REST API and therefore do not count against its rate limit. This keeps the number of rate-limited requests constant no matter how many files the repository holds, ensuring smooth and efficient retrieval of data.
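Constructing a raw content URL from a tree entry is straightforward; the helper below is a sketch, with `owner`, `repo`, and `ref` as placeholders:

```javascript
// A tree entry's path maps directly onto a raw.githubusercontent.com URL,
// which serves the file contents without touching the REST API.
const rawUrl = (owner, repo, ref, path) =>
  `https://raw.githubusercontent.com/${owner}/${repo}/${ref}/${path}`;
```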

Step 4: Parsing and Organizing Data

Once the content is fetched, we may need to parse it to extract the relevant information. For instance, if the content is stored in Markdown format, we can use regular expressions to extract titles, content, and other metadata. We then organize this data to suit the needs of our web application.
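For example, if each file starts with a simple frontmatter block (`title` and an optional `excerpt` between `---` fences, an assumed layout), a regular expression can pull the pieces apart:

```javascript
// Extract title, optional excerpt, and body from an .mdx file's frontmatter.
// The /s flag lets "." match newlines so the body can span multiple lines.
const contentRegex =
  /^---\s*title:\s*(.*?)\s*(?:excerpt:\s*(.*?))?\s*---\s*(.*)$/s;

function parseMdx(contents) {
  const match = contents.match(contentRegex);
  if (!match) return null; // file without the expected frontmatter
  const [, title, excerpt, content] = match;
  return {
    title: title.trim(),
    excerpt: excerpt ? excerpt.trim() : "",
    content: content.trim(),
  };
}
```

Guarding against a failed match (the `null` return) avoids a crash on files that do not follow the expected layout.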

Failed Approaches & Secure Solution: Overcoming GitHub API Rate Limit

During development, we ran into GitHub's API rate limit, which for unauthenticated requests is only 60 per hour per IP address. Fetching every file through the regular unauthenticated REST route generated so many calls that we hit the limit and the application could not scale. We then considered raising the limit by embedding a personal access token directly in the website's code, which would have extended it to 5,000 requests per hour, but we quickly recognized the security risk: a personal access token should never be exposed in client-side code, as it could grant unauthorized access to sensitive information and compromise the security of the GitHub repository.

To address this issue, we sought a more secure and efficient solution. By restructuring the flow so that only the tree request goes through the REST API, with the individual files fetched from their raw content URLs, we drastically reduced the number of rate-limited requests without compromising security. This final solution let us load hierarchical content efficiently while staying comfortably within GitHub's API rate limit.

A Specific Use-Case: Virtual Labs Web App

As an example, let’s consider a specific use case I encountered during a Virtual Labs web app development. In this context, we needed to pull FAQs from the GitHub directory, where the FAQs were organized in a tree structure. By applying the generalized approach and avoiding unnecessary API calls, we successfully fetched the FAQs efficiently, sorted them based on their question numbers, and displayed them as collapsible sections on the web app.

const fetchFaqs = async () => {
  try {
    const owner = "repo-owner";
    const repo = "repo-name";
    const ref = "main";
    const folderPath = "faq"; // sub-directory that holds the FAQ .mdx files
    const url = `https://api.github.com/repos/${owner}/${repo}/git/trees/${ref}?recursive=1`; // git tree API

    const response = await fetch(url);

    if (!response.ok) {
      throw new Error("Failed to fetch folders");
    }

    const contents = await response.json();

    // filter for .mdx files
    const mdxFiles = contents.tree.filter((item) =>
      item.path.endsWith(".mdx")
    );

    // optional - only if you require the FAQs of a specific folder in the repo
    const mdxFilesInSubfolder = mdxFiles.filter((item) =>
      item.path.startsWith(folderPath + "/")
    );

    // sort the files by question number (Q1-....mdx, Q2-....mdx, ...)
    mdxFilesInSubfolder.sort((a, b) => {
      const [, numberA] = a.path.match(/Q(\d+)/) ?? [];
      const [, numberB] = b.path.match(/Q(\d+)/) ?? [];
      return Number(numberA) - Number(numberB);
    });

    // fetch one .mdx file from its raw content URL and parse its frontmatter
    const fetchMdxFileContent = async (downloadUrl) => {
      const response = await fetch(downloadUrl);

      if (!response.ok) {
        throw new Error("Failed to fetch file content");
      }

      const contents = await response.text();

      // regex to extract title, excerpt, and content from the mdx frontmatter
      const contentRegex =
        /^---\s*title:\s*(.*?)\s*(?:excerpt:\s*(.*?))?\s*---\s*(.*)$/s;

      const match = contents.match(contentRegex);
      if (!match) {
        throw new Error(`Unexpected frontmatter in ${downloadUrl}`);
      }
      const title = match[1].trim();
      const excerpt = match[2] ? match[2].trim() : "";
      const content = match[3].trim();

      return { title, content }; // excerpt is available here if needed
    };

    // fetch the content of every .mdx file in the subfolder in parallel
    const mdxFileContents = await Promise.all(
      mdxFilesInSubfolder.map(async (mdxFile) => {
        const downloadUrl = `https://raw.githubusercontent.com/${owner}/${repo}/${ref}/${mdxFile.path}`; // raw content URL
        const { title, content } = await fetchMdxFileContent(downloadUrl);

        return { title, content };
      })
    );

    // the sorted FAQs are now in "mdxFileContents"
  } catch (error) {
    console.error(error);
  }
};

fetchFaqs();

Conclusion

Efficiently loading hierarchical content from GitHub repositories is a crucial aspect of modern web development. By following a generalized approach, and by learning from the failed attempts to extend the API limit, we can minimize API calls, filter and sort data, and organize content in a structured manner. Through a specific use case during an internship at Virtual Labs, we demonstrated how this approach can be applied to fetch FAQs efficiently and build a collapsible FAQ component for the web app.

As developers, mastering these techniques enables us to efficiently handle various types of content and deliver smooth user experiences, making web applications more dynamic and user-friendly.

Happy coding and content management!
