Improving Sex Education by Exploring Sexually-related Topics Using NLP

Using Natural Language Processing to analyze frequently discussed sexually-related topics in Polish online forums.

Julia Jakubczak

Published in Omdena

Apr 16, 2020

The following work is part of an Omdena Project with #sexedPL and more than 30 other AI Collaborators to explore which sexually-related topics are frequently discussed among young people.

Omdena is an innovation platform for building AI solutions to real-world problems through the power of bottom-up collaboration.

The goal of this project was to better understand which topics should be covered more broadly in sex education books, the media, and similar resources.

For that purpose, we gathered and analyzed data from two online forums dedicated to teenagers.

Introduction

Poland’s current government has recently criticized sexual education and attempted to criminalize its provision to minors¹.

When thinking about sex education in Poland, we focused on teenagers, who directly experience the so-called “education for family life” classes at school. Two questions guided us: What do teenagers want to know? Which topics do they ask about most often?

Two Polish online forums, zapytaj.onet.pl and dojrzewamy.pl, appeared to be the most visited websites where teenagers ask questions about sexuality in general, including contraception, LGBTQIA, religion, STDs, aesthetic preference, reproductive health, and other topics. We therefore chose these two forums as our data sources.

The main focus was to distinguish topics that young people ask questions about and discuss with each other.

Our approach

After gathering the data, we started working on the corpus using supervised, unsupervised and semi-supervised machine learning algorithms.

Word frequency analysis (NLTK packages)

Word frequency analysis with NLTK showed that the questions asked on dojrzewamy.pl often relate to masturbation, first time, age, and males (boyfriend, guy). Words like “want” and “know” also appeared frequently in the corpus, which suggests a high number of normative questions.
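To illustrate the idea, here is a minimal sketch of a word frequency count. It is not the project's code: the sample questions and the stop-word list are made up, and the standard library's collections.Counter is used as a stand-in for NLTK's FreqDist, which offers the same most_common() method.

```python
import re
from collections import Counter

# Hypothetical sample questions (already in English); not taken from the real corpus.
questions = [
    "I want to know if the first time hurts",
    "My boyfriend wants to know my age",
    "Is it normal that I want to wait for my first time",
]

# A tiny hand-picked stop-word list, purely illustrative.
STOP_WORDS = {"i", "to", "if", "the", "my", "is", "it", "that", "for"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]


# Count every remaining token across all questions.
word_counts = Counter(token for q in questions for token in tokenize(q))

# The most frequent words hint at what people ask about.
top_words = word_counts.most_common(5)
print(top_words)
```

On a real corpus, a proper tokenizer, a full stop-word list, and stemming or lemmatization would likely be needed so that forms such as “want” and “wants” are counted together.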

Topic Modeling with Gensim LDA (TF-IDF representation)

Topic Modeling is a technique to extract the hidden topics from large volumes of text. Latent Dirichlet Allocation (LDA) is a popular algorithm for topic modeling with excellent implementations in Python’s Gensim package.

Using LDA along with TF-IDF enabled us to distinguish topics like Pregnancy & Period, First time & Relations, Virginity & Hurt and Masturbation as the key subjects that are asked about on dojrzewamy.pl.

[Figure: Topic modeling with Gensim LDA, TF-IDF representation (dojrzewamy.pl)]

Topic Modeling with NMF

For another unsupervised approach, we worked with NMF (Nonnegative Matrix Factorization), which is commonly used for topic modeling.

The results showed that the number of questions related to pregnancy and reproductive health is significantly higher than for other topics.
According to the Contraception Atlas 2019², created by the Contraception Info initiative and powered by the European Parliamentary Forum for Sexual and Reproductive Rights (EPF), Poland is the European country with the lowest access to contraceptive supplies, family planning counseling, and online information on contraception.

Topic Modeling with Gensim LDA (BOW representation)

Applying the LDA model to the corpus from zapytaj.onet.pl showed that questions about relationships are at the center of young people's interest.

[Figure: Topic modeling with Gensim LDA, BOW representation (zapytaj.onet.pl)]
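The two LDA runs in this article differ in how each question is turned into numbers before it reaches the model. The sketch below contrasts raw bag-of-words counts with one common TF-IDF variant (count times log2 of the inverse document frequency, without normalization). The toy documents and the exact formula are assumptions for illustration; library implementations may weight and normalize differently.

```python
import math
from collections import Counter

# Hypothetical tokenized questions; not taken from the real corpus.
docs = [
    ["boyfriend", "love", "relationship"],
    ["boyfriend", "breakup", "relationship"],
    ["boyfriend", "jealous", "jealous"],
]


def bag_of_words(doc):
    """Raw counts: every occurrence of a word counts the same."""
    return dict(Counter(doc))


def tf_idf(doc, all_docs):
    """Count times log2(N / document frequency); words found everywhere get 0."""
    n_docs = len(all_docs)
    weights = {}
    for word, count in Counter(doc).items():
        doc_freq = sum(1 for d in all_docs if word in d)
        weights[word] = count * math.log2(n_docs / doc_freq)
    return weights


bow = bag_of_words(docs[2])
weighted = tf_idf(docs[2], docs)
print(bow)
print(weighted)
```

In this toy example “boyfriend” appears in every question, so TF-IDF gives it no weight, while bag-of-words keeps its count. Whether that down-weighting helps depends on the corpus.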

Multi-label classification using supervised & semi-supervised learning

The first step in working with supervised models was to manually label questions using a previously selected set of labels. Our team defined the following tags: contraception, stds, consent, reproductive_health, violence_abuse, orientation_dilemma, religion, awkward_encounter, aesthetic_preference, trolling, normative, legal, homophobia, romance, sexual_frustration, lgbtqia, sex_boundaries, porn, and coming_out.

We manually labeled over 1,000 rows. However, the combined corpus from zapytaj.onet.pl and dojrzewamy.pl contained over 40 thousand questions, many of which included colloquialisms and uncommon sentence structures. For these reasons, we decided to create synthetic samples by augmenting the data.
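The article does not say which augmentation method was used. As one possible illustration, the sketch below creates synthetic samples by randomly swapping and deleting words while keeping the original labels. The function names, parameters, and sample question are all hypothetical.

```python
import random


def random_swap(tokens, rng):
    """Return a copy of tokens with two randomly chosen positions swapped."""
    tokens = list(tokens)
    if len(tokens) >= 2:
        i, j = rng.sample(range(len(tokens)), 2)
        tokens[i], tokens[j] = tokens[j], tokens[i]
    return tokens


def random_deletion(tokens, rng, p=0.2):
    """Drop each token with probability p, keeping at least one token."""
    if not tokens:
        return []
    kept = [t for t in tokens if rng.random() > p]
    return kept if kept else [rng.choice(tokens)]


def augment(question, labels, n_copies=3, seed=0):
    """Create n_copies noisy variants of a question that share its labels."""
    rng = random.Random(seed)  # seeded so the output is repeatable
    tokens = question.split()
    samples = []
    for _ in range(n_copies):
        new_tokens = random_deletion(random_swap(tokens, rng), rng)
        samples.append((" ".join(new_tokens), labels))
    return samples


# Hypothetical labeled question using one tag from the list above.
synthetic = augment("can i get pregnant during my period", ["reproductive_health"])
for text, labels in synthetic:
    print(text, labels)
```

Simple perturbations like these add variety but can also change the meaning of a short question, so augmented samples are worth spot-checking by hand.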

After artificially expanding the corpus, we tried various supervised machine learning approaches to find the best-performing model. One of the first ideas was to build models with the Scikit-learn library and its One-vs-Rest classifier.
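A minimal sketch of a One-vs-Rest setup in Scikit-learn is shown below. One-vs-Rest trains one binary classifier per tag, which makes it a natural fit for multi-label data. The questions, the TF-IDF features, and the logistic regression base classifier are assumptions for illustration; the article does not state which features or base estimator were used.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical hand-labeled questions using two tags from the article's tag list.
questions = [
    "which contraception is safest for teenagers",
    "can condoms protect against stds",
    "is the pill a reliable contraception",
    "what are the symptoms of stds",
    "do condoms count as contraception and protect against stds",
    "how do i get tested for stds",
]
tags = [
    ["contraception"],
    ["contraception", "stds"],
    ["contraception"],
    ["stds"],
    ["contraception", "stds"],
    ["stds"],
]

# Convert the tag lists into a binary indicator matrix, one column per tag.
binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(tags)

# One binary logistic regression is trained per tag column.
model = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(max_iter=1000)),
)
model.fit(questions, Y)

predicted = model.predict(["are condoms a good contraception"])
print(binarizer.inverse_transform(predicted))
```

Because every tag gets its own classifier, a question can receive several tags at once or none at all.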

Another approach used a Random Forest classifier.
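A comparable sketch with a Random Forest is shown below. Scikit-learn's RandomForestClassifier accepts a binary indicator matrix as the target, so no One-vs-Rest wrapper is needed. The questions, the bag-of-words features, and the hyperparameters are assumptions, not the project's actual configuration.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

# Hypothetical hand-labeled questions using two tags from the article's tag list.
rf_questions = [
    "how do i tell my parents i am gay",
    "is it a sin to date before marriage",
    "my church says being gay is a sin",
    "i am afraid of coming out at school",
    "coming out to my religious family scares me",
    "does religion forbid contraception",
]
rf_tags = [
    ["coming_out"],
    ["religion"],
    ["religion"],
    ["coming_out"],
    ["coming_out", "religion"],
    ["religion"],
]

# One indicator column per tag; the forest predicts all columns jointly.
rf_binarizer = MultiLabelBinarizer()
rf_Y = rf_binarizer.fit_transform(rf_tags)

rf_model = make_pipeline(
    CountVectorizer(),
    RandomForestClassifier(n_estimators=100, random_state=0),  # assumed settings
)
rf_model.fit(rf_questions, rf_Y)

rf_predicted = rf_model.predict(["i want to come out to my family"])
print(rf_binarizer.inverse_transform(rf_predicted))
```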

Aside from Scikit-learn, we also tried other libraries such as Flair. This framework also supports Polish, which sounded perfect for our case.

We also tried ELMo (Embeddings from Language Models). ELMo gains its language understanding from language modeling: it is trained to predict the next word in a sequence of words and, running in the opposite direction, the previous word. This is convenient because vast amounts of text are available that such a model can learn from without needing labels.

Our last approach was to use BERT (Bidirectional Encoder Representations from Transformers). BERT is a method of pre-training language representations: a general-purpose “language understanding” model is trained on a large text corpus and then reused for downstream NLP tasks (such as question answering or, in our case, text classification). When it was released, BERT outperformed previous methods, which its authors attribute to it being the first unsupervised, deeply bidirectional system for pre-training NLP.

Due to limited time and the specifics of our text corpus, the models trained with the aforementioned libraries did not produce useful results. We therefore decided to focus on the other algorithms discussed earlier.

Challenges

Many of the challenges we encountered were related to the quality of the data gathered from zapytaj.onet.pl and dojrzewamy.pl. The original text was in Polish; we used the GOOGLETRANSLATE() formula in Google Sheets to translate the questions into English and worked with the translated version. Translation changed the context of some questions, and the many colloquialisms in the original were often translated incorrectly. Both factors contributed to the mediocre quality of the text corpus our work was based on.

Summary

Many teenagers (64%)³ build their sexual knowledge around information they find on the internet. To understand which subjects they most often ask about, it seemed appropriate to analyze data from online platforms. After diving into the data from zapytaj.onet.pl and dojrzewamy.pl, we found that questions about masturbation, first time, pregnancy, reproductive health, and relationships are at the center of young people's interest.
