A Sentiment Analysis example with Snowflake Marketplace Cybersyn Github Archive Events Data

Date: December 2022
Opinions expressed in this post are solely my own and do not represent the views or opinions of my employer.

Snowflake announced a strategic partnership with Cybersyn back in August 2022. As a partner, Cybersyn offers data products in the Snowflake Marketplace (SMP) that are distributed exclusively via Snowflake. These data products combine multiple public data sources and focus on industry use cases, providing sample queries as well as Streamlit applications to make data discovery easier. And of course, just like any other dataset in the SMP, you can access these datasets in your own Snowflake account with one click.

A few of the Cybersyn datasets that have become available recently include the following:

  1. Data Commons: Public Data: Datasets that aggregate data from government data sources on Demographics, Economics, Emissions, and Environment topics at national, state, county, and municipal levels. Here is a nice Streamlit dashboard for discovery: https://cybersyn-datacommons.streamlit.app/. I spent quite a bit of time in this dashboard looking at country trends and feature correlations. Fun!
  2. FHFA: Single Family Home Appraisals & Values: The Uniform Appraisal Dataset (UAD) Aggregate Statistics dataset that provides trends in single family appraisals collected by Fannie Mae and Freddie Mac.
  3. Github Archive Events: GH Archive datasets that provide GitHub activity data from public GitHub repositories. All public GitHub events (along with the associated user, repository, etc.) are available in a single schema in this dataset. This is a particularly fun dataset to explore for interesting insights. Let’s play with it.

I got the Github Archive Events dataset in my Snowflake account with one click and then decided to apply NLP Sentiment Analysis in a Hex notebook using nltk. The Hex notebook with the code is provided here.

Also, here is useful documentation about this dataset from Cybersyn: https://dbdocs.io/Cybersyn/Github

In the Hex notebook, I first ran the following query against Cybersyn’s GH_EVENTS table to get the comments for the ‘Snowflake-Labs/sfquickstarts’ public GitHub repo, and created a table with all the comments from the pull request review and commit events:

CREATE OR REPLACE TABLE GITHUB_COMMENTS (id STRING, text STRING) AS
(
  SELECT id, PARSE_JSON(payload_comment):body AS text
  FROM GITHUB_ARCHIVE_EVENTS.CYBERSYN.GH_EVENTS
  WHERE type IN ('PullRequestReviewCommentEvent', 'PullRequestReviewEvent', 'CommitCommentEvent')
    AND repo_name = 'Snowflake-Labs/sfquickstarts'
    AND payload_comment IS NOT NULL
    AND text IS NOT NULL
);
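The PARSE_JSON(payload_comment):body expression pulls the comment text out of the event’s JSON payload. Locally, the same extraction looks like this in plain Python (the payload below is illustrative, not a real GH Archive record):

```python
import json

# Illustrative payload_comment value; real GH_EVENTS payloads carry many more fields.
payload_comment = '{"id": 101, "body": "LGTM, nice quickstart!"}'

comment = json.loads(payload_comment)  # PARSE_JSON equivalent
text = comment["body"]                 # :body path equivalent
print(text)                            # LGTM, nice quickstart!
```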

Here are the results with all the comments for the repository:

After loading the nltk packages and dependencies (nltk_data.zip) into a stage, a Python UDF that uses the nltk packages is created as follows:

# Create a permanent Python UDF wrapping the sentiment_analysis() helper
sentiment_analysis_udf = hex_snowpark_session.udf.register(
    lambda text: sentiment_analysis(text),
    name="sentiment_analysis",
    stage_location="@model_files",
    return_type=FloatType(),
    input_types=[StringType()],
    is_permanent=True,
    replace=True,
    packages=["nltk"],
    imports=["@python_load/nltk_data.zip"],
)
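The registration above assumes a sentiment_analysis() helper is already defined in the notebook. A minimal sketch of what that helper might look like, assuming nltk_data.zip unpacks to a directory containing the VADER lexicon (the extraction path and the unzip-on-first-call logic here are my assumptions, not from the original notebook):

```python
import os
import sys
import zipfile

def sentiment_analysis(text: str) -> float:
    """Return the VADER compound sentiment score (-1.0 to 1.0) for a comment."""
    import nltk
    # Inside a Snowflake Python UDF, staged imports are copied into a sandbox
    # directory whose path is exposed via sys._xoptions.
    import_dir = sys._xoptions.get("snowflake_import_directory", ".")
    extract_dir = "/tmp/nltk_data"
    if not os.path.isdir(extract_dir):
        with zipfile.ZipFile(os.path.join(import_dir, "nltk_data.zip")) as z:
            z.extractall(extract_dir)
    nltk.data.path.append(extract_dir)

    from nltk.sentiment import SentimentIntensityAnalyzer
    return SentimentIntensityAnalyzer().polarity_scores(text)["compound"]
```

Because the UDF runs server-side, the nltk import and the zip extraction happen lazily inside the function body rather than at module load.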

And finally, we can call this Python UDF in a simple SQL statement to run sentiment analysis on each comment and write the results into a new table called SENTIMENT_RESULTS:

sentiments = hex_snowpark_session.sql("SELECT text, sentiment_analysis(text) AS sentiment FROM GITHUB_COMMENTS")
sentiments.write.save_as_table('SENTIMENT_RESULTS', mode='overwrite')
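The compound score returned by VADER ranges from -1.0 (most negative) to 1.0 (most positive). A common downstream step, sketched here as a local Python helper using VADER’s conventional +/-0.05 thresholds (this bucketing is my addition, not part of the original notebook), is mapping scores to coarse labels before reporting:

```python
def sentiment_label(compound: float) -> str:
    """Map a VADER compound score to a coarse label using the usual thresholds."""
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

# Example scores a comment might receive:
print(sentiment_label(0.72))   # positive
print(sentiment_label(-0.40))  # negative
print(sentiment_label(0.0))    # neutral
```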

Here are the results of the sentiment analysis:

This is just a simple, fun example with a Cybersyn dataset. There are many other use cases to explore with the Cybersyn datasets in the Snowflake Marketplace!

--


Eda Johnson
Snowflake Builders Blog: Data Engineers, App Developers, AI/ML, & Data Science

NVIDIA | AWS Machine Learning Specialty | Azure | Databricks | GCP | Snowflake Advanced Architect | Terraform certified Principal Product Architect