Building dictionaries for Word Encodings using BigQuery SQL

I am working on the Reddit Community Engagement analysis, and one of my data sources is the GDELT BigQuery dataset. I love the richness of themes available in that data, but sometimes it can be too … chatty.

Here is an example of how an LA Times article about a warehouse fire is tagged in GDELT across three fields (V2Themes, AllNames, and V2Locations) of the GDELT Global Knowledge Graph table gdelt-bq.gdeltv2.gkg:

MANMADE_DISASTER_POWER_OUTAGES Alameda_County__California__United_States WB_1174_WAREHOUSING_AND_STORAGE TAX_FNCACT_FIRE_CHIEF DISASTER_FIRE Oakland__California__United_States CRISISLEX_C07_SAFETY CRISISLEX_T03_DEAD TRIAL Alameda_County TAX_FNCACT_CHIEF EPU_POLICY_AUTHORITIES TAX_FNCACT_VICTIMS WB_793_TRANSPORT_AND_LOGISTICS_SERVICES TAX_FNCACT_CREATIVE_DIRECTOR WB_135_TRANSPORT TAX_FNCACT_ARTISTS TAX_FNCACT_AUTHORITIES MANMADE_DISASTER_IMPLIED Max_Harris TAX_FNCACT_CHILDREN SECURITY_SERVICES TRANSPARENCY CRISISLEX_T02_INJURED TAX_FNCACT_DOORMAN GOV_LOCALGOV TAX_WORLDLANGUAGES_SERRA EPU_CATS_MIGRATION_FEAR_FEAR TAX_FNCACT_INSPECTOR TAX_FNCACT_ATTORNEYS TAX_FNCACT_OFFICIALS WB_2445_NON_STATE_SECURITY_ACTORS Ghost_Ship CRISISLEX_CRISISLEXREC KILL TAX_FNCACT_POLICE WB_2451_REBELS_GUERRILLAS_AND_INSURGENTS Jeffrey_Krasnoff TAX_FNCACT_CORONER WB_2433_CONFLICT_AND_VIOLENCE CRISISLEX_T01_CAUTION_ADVICE SHORTAGE WB_2432_FRAGILITY_CONFLICT_AND_VIOLENCE ARREST TAX_FNCACT_DIRECTOR California__United_States POWER_OUTAGE SOC_GENERALCRIME CRISISLEX_T08_MISSINGFOUNDTRAPPEDPEOPLE TAX_FNCACT_PIRATE

I want to put these tags to good use in TensorFlow models, and I need to embed them into multi-dimensional vectors, but their sheer number causes all kinds of memory, data size, and model accuracy issues. So, to have better control over which tags eventually make it into my sequences, I developed a SQL snippet that does the same thing the Keras Tokenizer or the TensorFlow VocabularyProcessor do: build a dictionary of entities ranked by frequency, then replace each entity with its integer index. Having it available right where my source data lives lets me filter and order as I like (review the final query in this repo). I can even create Bag-of-Words or sequence encodings, depending on my use case.
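For reference, this is roughly the behavior being replicated: rank tokens by frequency, then map each one to its rank. A minimal pure-Python sketch of that idea (the tag lists below are a made-up excerpt for illustration, not real GDELT documents):

```python
from collections import Counter

def build_vocab(documents):
    """Rank tokens by total frequency; index 1 = most frequent.
    Mirrors the RANK() OVER (ORDER BY cnt DESC, Entity ASC) step in the SQL below."""
    counts = Counter(tok for doc in documents for tok in doc)
    # Sort by descending count, breaking ties alphabetically
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {tok: i + 1 for i, (tok, _) in enumerate(ranked)}

def encode(doc, vocab):
    """Replace each token with its integer index (sequence encoding)."""
    return [vocab[tok] for tok in doc if tok in vocab]

# Illustrative documents built from GDELT-style tags
docs = [
    ['DISASTER_FIRE', 'KILL', 'ARREST', 'DISASTER_FIRE'],
    ['KILL', 'DISASTER_FIRE', 'SHORTAGE'],
]
vocab = build_vocab(docs)
print(vocab)                   # {'DISASTER_FIRE': 1, 'KILL': 2, 'ARREST': 3, 'SHORTAGE': 4}
print(encode(docs[0], vocab))  # [1, 2, 3, 1]
```

The SQL below does exactly this, only inside BigQuery and with per-entity-type frequency filters applied first.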

WITH
b AS ( -- Sample of GKG records; drop the LIMIT for a full run
  SELECT gkg.DocumentIdentifier, gkg.V2Themes, gkg.AllNames, gkg.V2Locations
  FROM `gdelt-bq.gdeltv2.gkg` gkg
  LIMIT 1000
)
,c AS ( -- Mentions of Themes: each entry is "THEME,charoffset"
  SELECT
    b.DocumentIdentifier,
    SPLIT(theme_mentions,',')[SAFE_OFFSET(0)] AS Entity,
    SAFE_CAST(SPLIT(theme_mentions,',')[SAFE_OFFSET(1)] AS INT64) AS Offset
  FROM b, UNNEST(SPLIT(b.V2Themes,';')) AS theme_mentions
)
,d AS ( -- Mentions of Names: each entry is "Name,charoffset"
  SELECT
    b.DocumentIdentifier,
    REPLACE(SPLIT(name_mentions,',')[SAFE_OFFSET(0)],' ','_') AS Name,
    SAFE_CAST(SPLIT(name_mentions,',')[SAFE_OFFSET(1)] AS INT64) AS Offset
  FROM b, UNNEST(SPLIT(b.AllNames,';')) AS name_mentions
)
,e AS ( -- Frequency stats for Name mentions
  SELECT Name, COUNT(DISTINCT DocumentIdentifier) AS DocCount
  FROM d
  GROUP BY 1
  HAVING COUNT(DISTINCT DocumentIdentifier) >= 10
)
,f AS ( -- Keep only Names that appear in at least 10 documents
  SELECT d.DocumentIdentifier, d.Name AS Entity, d.Offset
  FROM d INNER JOIN e ON e.Name = d.Name
)
,g AS ( -- Mentions of Locations: '#'-delimited fields per mention
  SELECT b.DocumentIdentifier, SPLIT(loc_mentions,'#') AS LocFieldArray
  FROM b, UNNEST(SPLIT(b.V2Locations,';')) AS loc_mentions
)
,h AS ( -- Field 1 is the full location name, field 8 the char offset
  SELECT
    g.DocumentIdentifier,
    REPLACE(REPLACE(LocFieldArray[SAFE_OFFSET(1)],' ','_'),',','_') AS Loc,
    SAFE_CAST(LocFieldArray[SAFE_OFFSET(8)] AS INT64) AS Offset
  FROM g
)
,i AS ( -- Frequency stats for Location mentions
  SELECT Loc, COUNT(DISTINCT DocumentIdentifier) AS DocCount
  FROM h
  GROUP BY 1
  HAVING COUNT(DISTINCT DocumentIdentifier) >= 10
)
,j AS ( -- Keep only Locations that appear in at least 10 documents
  SELECT h.DocumentIdentifier, h.Loc AS Entity, h.Offset
  FROM h INNER JOIN i ON i.Loc = h.Loc
)
,k AS ( -- Union all Themes, Names, and Locations
  SELECT DocumentIdentifier, Entity, Offset FROM c
  UNION ALL
  SELECT DocumentIdentifier, Entity, Offset FROM f
  UNION ALL
  SELECT DocumentIdentifier, Entity, Offset FROM j
)
,n AS ( -- Create the Encoding: rank Entities by frequency, most frequent first
  SELECT Entity, cnt,
    CAST(RANK() OVER (ORDER BY cnt DESC, Entity ASC) AS STRING) AS EntityIdx
  FROM (SELECT Entity, COUNT(*) AS cnt FROM k GROUP BY 1)
)
SELECT
  DocumentIdentifier,
  STRING_AGG(DISTINCT EntityIdx, ' ') AS BOWEntitiesEncoded, -- For Bag-of-Words encoding, order is not important
  COUNT(DISTINCT k.Entity) AS BOWEncodingLength,
  STRING_AGG(DISTINCT k.Entity, ' ') AS EntitiesBOW,
  STRING_AGG(EntityIdx, ' ' ORDER BY Offset ASC) AS EntitiesSeqEncoded, -- For CNN and RNN analysis, use the Entity sequence
  COUNT(*) AS EntitiesSeqLength,
  STRING_AGG(k.Entity, ' ' ORDER BY Offset ASC) AS EntitiesSeq
FROM k
INNER JOIN n ON k.Entity = n.Entity
WHERE k.Entity <> ''
GROUP BY 1

The result? Instead of the longish field of tags, I get a sequence of integers that works just as well for embedding purposes as the tags themselves.

72 153 693 27 68 443 5050 646 42 5333 345 447 15 1350 75 6509 6901 237 49 2073 1935 224 81 38 66 5 10 4 8819 55 14 451 150 1162 27041 942 252 16 409 669 5172 149 28 468 79 128 3478 3980 10288 70