Stargazers Reloaded: LLM-Powered Analyses of your GitHub Community

Gaurav Tarlok Kakkar · Published in EvaDB Blog · 8 min read · Sep 30, 2023

GitHub stars symbolize a repository’s popularity in the developer community. What if we could delve into the minds of these star-givers, extracting insights from their profiles to understand their interests? That’s where the Stargazers Reloaded app comes in.

In this post, we show how EvaDB makes it easy to get insights about your GitHub community using large language models (LLMs). We analyze three popular GitHub communities: GPT4All, Langchain, and CockroachDB. We have open-sourced the code so that you can extract insights about your GitHub community.

This app is inspired by the “original” Stargazers app written by Spencer Kimball from CockroachDB. While the original app pioneered the art of analyzing the community exclusively using the GitHub API, our LLM-powered Stargazers app powered by EvaDB additionally extracts insights from unstructured data obtained from the stargazers’ webpages.

What does the App do?

Workflow of LLM-Powered Analyses of GitHub Community
  • The app first scrapes the GitHub profiles of every user who’s starred a repository, capturing key details like their usernames, locations, and topics of interest.
  • It then takes screenshots of each user’s GitHub webpage and extracts text using an OCR model.
  • Lastly, it extracts structured information from the extracted text using LLM functions.

In this app, we could directly use GPT-4 to generate the “golden” table; however, it is prohibitively expensive, costing $60 to process the information of 1,000 users. To maintain accuracy while significantly reducing costs, we employ an LLM model cascade, running GPT-3.5 before GPT-4, leading to an 11x cost reduction ($5.5).
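As a back-of-the-envelope check, the cascade's savings follow directly from the numbers above (a minimal sketch; the dollar figures are the ones reported in this post, not recomputed from token prices):

```python
# Back-of-the-envelope cost comparison using the figures from this post.
def cost_ratio(direct_cost: float, cascade_cost: float) -> float:
    """How many times cheaper the cascade is than calling GPT-4 directly."""
    return direct_cost / cascade_cost

direct = 60.0   # $ to run GPT-4 on the raw text of 1,000 users
cascade = 5.5   # $ for GPT-3.5 pre-processing + batched GPT-4 refinement
print(f"~{cost_ratio(direct, cascade):.0f}x cheaper")  # ~11x cheaper
```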

Why Does This Matter?

Now, why does all of this matter? Developers understand the value of community and collaboration. This open-sourced app helps extract insights about your community to facilitate further collaboration. Here, we show some insights we obtained by analyzing 1000 stargazers from three communities: GPT4All, Langchain, and CockroachDB.

  1. Exploring Common Interests: By extracting the topics of interest from the profiles of stargazers, you can pinpoint developers with similar passions. Our analysis of the GPT4All community shows that the majority of the stargazers are proficient in Python and JavaScript, and 43% of them are interested in Web Development. In the Langchain community, 37% of the stargazers are interested in Machine Learning.
Web developers ❤️ open-source LLMs 🤩
Langchain is most popular 📈 in the Machine Learning 🤖🧠 interest groups
Web Developers and Database Enthusiasts ❤️ CockroachDB 🤩

2. Identify your core audience: Understanding your stargazers’ demographics helps identify engaged user segments based on profession and location. These insights allow you to tailor your project roadmap, documentation, and outreach to resonate with your target users. For example, if data shows many stargazers are students, you may want to highlight educational use cases. If many users come from China, translating documentation into Chinese could help grow the community.

Stargazers distributed by their country of origin.

3. Exploring New Repositories: Examining the repositories your stargazers are enthusiastic about helps pinpoint emerging topics and promising ideas that resonate with your developer community. By tracking your stargazers’ engagement on GitHub, you can stay informed about the topics that matter most to your community. Here, we analyze the top 10 repositories co-starred by GPT4All and CockroachDB users. Agent frameworks top the list for GPT4All 🤩.

Lessons Learned

  1. GPT-4 is Super Expensive: OpenAI’s gpt-3.5 does not yield satisfactory results when extracting key attributes from the text obtained via OCR. It often produces numerous ‘n/a’ values. In contrast, gpt-4 excels at extracting occupation, programming languages, and topics of interest from the messy OCR output. In theory, we could directly use gpt-4 to generate the structured output from the raw text blobs.

However, gpt-4’s API calls are 40 times more expensive than gpt-3.5 on average.

2. Model Cascades to the Rescue: To reduce the cost, the EvaDB query first calls gpt-3.5 to process the extracted text into broader and more general topics. Then, we feed this semi-organized output from gpt-3.5 to gpt-4 to refine the outcomes.

3. Batching Optimization: Another advantage of the model cascade is that the input token lengths for the gpt-4 request are much smaller than the entire scraped text blob. For example, the semi-organized topics only comprise 20 tokens per user on average, compared to 300 tokens for the entire text blob. This allows EvaDB to process multiple batches of user data using a single prompt request, greatly reducing the number of prompt tokens used.
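The batching idea can be sketched in a few lines: greedily pack per-user summaries into one prompt until an assumed token budget is hit (a toy illustration; tokens are approximated by word counts, and the budget value is made up for the example):

```python
def pack_batches(summaries: list[str], budget: int = 200) -> list[list[str]]:
    """Greedily pack per-user topic summaries into prompts under a token budget.
    Token counts are approximated by word counts for illustration."""
    batches, current, used = [], [], 0
    for s in summaries:
        n = len(s.split())
        if current and used + n > budget:
            batches.append(current)   # flush the full batch
            current, used = [], 0
        current.append(s)
        used += n
    if current:
        batches.append(current)
    return batches

# Ten 20-token summaries fit in a single 200-token prompt;
# ten 300-token raw text blobs would each need their own request.
summaries = [" ".join(["topic"] * 20) for _ in range(10)]
print(len(pack_batches(summaries)))  # 1
```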

Cascading and Batching optimizations in EvaDB lead to 11 times lower dollar cost compared to running GPT-4 on the entire dataset.

4. Vision + Language Model: EasyOCR outperforms Pytesseract in extracting meaningful content from user profile snapshots. It is super important to use a good vision model before analyzing the extracted text using a language model.

5. GitHub API: The API is notably slow due to GitHub’s rate limits and offers limited user information. OCR followed by LLM is a faster and more flexible alternative to extracting insights from GitHub profiles. For our analysis, the GitHub API calls were free of charge. Thanks to the cascades and batching optimizations, we could keep the cost of our analysis low.

Let’s Dive Deep into the App!

Here are the key steps in the app:

  1. Scraping using Github API: The app begins by scraping the GitHub profiles of users who’ve starred a particular repository. This initial step is crucial as it forms the foundation for the subsequent data extraction.
--- List of Stargazers
CREATE TABLE gpt4all_StargazerList AS
SELECT GithubStargazers("https://github.com/nomic-ai/gpt4all", "GITHUB_KEY");

--- Details of Stargazers extracted using the Github API
CREATE TABLE gpt4all_StargazerDetails AS
SELECT GithubUserdetails(github_username, "GITHUB_KEY")
FROM gpt4all_StargazerList;

2. OCR Magic: Once the profiles are collected, the app takes screenshots of each user’s GitHub profile page. The screenshots are processed through an OCR model. This model extracts the textual content from the images, preparing it for analysis.

Using OCR to extract text from user profiles
--- Text in webpages of Stargazers extracted using WebPageTextExtractor
CREATE TABLE gpt4all_StargazerScrapedDetails AS
SELECT github_username, WebPageTextExtractor(github_username)
FROM gpt4all_StargazerList;
Sample text extracted by running OCR on user profile screenshots.

3. LLM Unleashed: Here’s where the real magic happens. The extracted text is fed into an LLM, for example, GPT-3.5 from OpenAI. This AI model extracts key attributes that shed light on the stargazers’ profiles, such as the user’s occupation, social media presence, and topics of interest.

--- Prompt to GPT-3.5
You are given a block of disorganized text extracted from the GitHub user profile of a user using an automated web scraper. The goal is to get structured results from this data.
Extract the following fields from the text: name, country, city, email, occupation, programming_languages, topics_of_interest, social_media.
If some field is not found, just output fieldname: N/A. Always return all the 8 field names. DO NOT add any additional text to your output.
The topics_of_interest field must list a broad range of technical topics that are mentioned in any portion of the text. This field is the most important, so add as much information as you can. Do not add non-technical interests.
The programming_languages field can contain one or more programming languages out of only the following 4 programming languages - Python, C++, JavaScript, Java. Do not include any other language outside these 4 languages in the output. If the user is not interested in any of these 4 programming languages, output N/A.
If the country is not available, use the city field to fill the country. For example, if the city is New York, fill the country as United States.
If there are social media links, including personal websites, add them to the social media section. Do NOT add social media links that are not present.
Here is an example (use it only for the output format, not for the content):

name: Pramod Chundhuri
country: United States
city: Atlanta
email: pramodc@gatech.edu
occupation: PhD student at Georgia Tech
programming_languages: Python, C++
topics_of_interest: PyTorch, Carla, Deep Reinforcement Learning, Query Optimization
social_media: https://pchunduri6.github.io
--- Using LLMs to extract insights from text
CREATE TABLE gpt4all_StargazerInsights AS
SELECT StringToDataframe(GPT35({LLM_prompt}, extracted_text))
FROM gpt4all_StargazerScrapedDetails;
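Under the hood, `StringToDataframe` turns the LLM's `field: value` lines into a table row. A simplified pure-Python stand-in (this parser is an illustration, not EvaDB's actual implementation):

```python
FIELDS = ["name", "country", "city", "email", "occupation",
          "programming_languages", "topics_of_interest", "social_media"]

def parse_llm_output(text: str) -> dict:
    """Turn the LLM's 'field: value' lines into one structured record,
    defaulting every missing field to N/A as the prompt requires."""
    row = {f: "N/A" for f in FIELDS}
    for line in text.strip().splitlines():
        key, _, value = line.partition(":")  # split on the first colon only
        key = key.strip().lower()
        if key in row and value.strip():
            row[key] = value.strip()
    return row

sample = "name: Pramod Chundhuri\ncountry: United States\nemail: pramodc@gatech.edu"
print(parse_llm_output(sample)["country"])  # United States
```

Splitting on the first colon only keeps values like `social_media: https://...` intact.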

4. Going Deeper with LLM: To gain deeper insights from user profiles, we can feed the outputs from the initial analysis into a more powerful large language model like GPT-4. This advanced AI model can categorize users’ topics of interest into more granular categories like Machine Learning, Databases, and Web Development.

--- Prompt to GPT-4
You are given 10 rows of input, each row is separated by two new line characters.
Categorize the topics listed in each row into one or more of the following 3 technical areas - Machine Learning, Databases, and Web development. If the topics listed are not related to any of these 3 areas, output a single N/A. Do not miss any input row. Do not add any additional text or numbers to your output.
The output rows must be separated by two new line characters. Each input row must generate exactly one output row. For example, the input row [Recommendation systems, Deep neural networks, Postgres] must generate only the output row [Machine Learning, Databases].
The input row [entrepreneurship, startups, venture capital] must generate the output row N/A.
--- Deeper insights using an expensive LLM prompt
CREATE TABLE IF NOT EXISTS gpt4all_StargazerInsightsGPT4 AS
SELECT name,
       country,
       city,
       email,
       occupation,
       programming_languages,
       social_media,
       GPT4(prompt, topics_of_interest)
FROM gpt4all_StargazerInsights;
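To make the categorization step concrete, here is a toy keyword-based stand-in for what the GPT-4 prompt asks for (the keyword lists are illustrative assumptions; the real app relies on GPT-4's judgment, which handles topics no keyword list could anticipate):

```python
# Illustrative keyword lists; the actual app delegates this mapping to GPT-4.
AREAS = {
    "Machine Learning": {"pytorch", "neural", "deep", "recommendation", "reinforcement"},
    "Databases": {"postgres", "sql", "query", "database"},
    "Web development": {"react", "javascript", "frontend", "web"},
}

def categorize(topics: list[str]) -> str:
    """Map raw topics to the three broad areas, or N/A if none match."""
    hits = [area for area, kws in AREAS.items()
            if any(kw in t.lower() for t in topics for kw in kws)]
    return ", ".join(hits) if hits else "N/A"

print(categorize(["Recommendation systems", "Deep neural networks", "Postgres"]))
# Machine Learning, Databases
print(categorize(["entrepreneurship", "startups"]))  # N/A
```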

Conclusion

EvaDB’s vision is to make it easy to build scalable AI apps. With the power of LLM- and AI-driven data extraction, we’re entering an era of unprecedented insights into the developer community. This app demonstrates technology’s ability to bridge gaps, foster connections, and empower developers to explore new horizons. As the app evolves, expect additional features and capabilities tailored to the developer community’s needs.

Analyze your favorite GitHub community using the open-sourced Stargazers Reloaded!

If you’re excited about our vision of building a database system for AI apps, show some ❤️ by giving a ⭐ for EvaDB on Github.
