Cecilia Chen Liu · Published in techatwattpad · Jun 9, 2020 · 3 min read


At Wattpad, we do interesting work at the intersection of machine learning and storytelling. One important pillar of our data work is Natural Language Processing and Understanding (NLP/NLU). We work on projects like NER, representation learning, quality assessment, sentiment analysis, summarization, and much more, all of which push us to work with methods that are both innovative and able to scale to millions of users and stories. Here is some of the latest ML work from Wattpad:

Graph Representation Learning Network via Adaptive Sampling [Arxiv]

Graph representation learning is an important topic in academic research as well as in industry. Better graph representations can help improve user social-interaction modelling and recommendations, as well as bring domain-specific knowledge to content understanding. Here we present recent research that achieves state-of-the-art performance on link prediction, as well as on node classification in an inductive setting.

Abstract:

Graph Attention Network (GAT) and GraphSAGE are neural network architectures that operate on graph-structured data and have been widely studied for link prediction and node classification. One challenge raised by GraphSAGE is how to smartly combine neighbour features based on graph structure. GAT handles this problem through attention; however, the challenge with GAT is its scalability over large and dense graphs. In this work, we propose a new architecture that addresses these issues, is more efficient, and is capable of incorporating different edge-type information. It generates node representations by attending to neighbours sampled from weighted multi-step transition probabilities. We conduct experiments in both transductive and inductive settings. Our method achieves comparable or better results on several graph benchmarks, including the Cora, Citeseer, Pubmed, PPI, Twitter, and YouTube datasets.
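
To make the core idea concrete, here is a minimal sketch of that mechanism, not the paper's implementation: row-normalize the adjacency matrix, average its powers into a multi-step transition distribution, sample a fixed number of neighbours per node from it, and aggregate their features with dot-product attention. All function names, projection matrices, and hyperparameters below are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def multi_step_transition(adj, k=2):
    """Row-normalize the adjacency matrix and average its first k powers,
    giving each node a weighted multi-step neighbourhood distribution.
    Assumes every node has at least one edge."""
    p = adj / adj.sum(dim=1, keepdim=True).clamp(min=1e-9)
    mix, p_step = torch.zeros_like(p), torch.eye(adj.size(0))
    for _ in range(k):
        p_step = p_step @ p
        mix = mix + p_step / k
    return mix

def sample_and_attend(x, adj, w_q, w_k, w_v, n_samples=5, k=2):
    """For each node, sample neighbours from the multi-step distribution
    and aggregate their projected features with dot-product attention."""
    probs = multi_step_transition(adj, k)
    idx = torch.multinomial(probs, n_samples, replacement=True)  # (N, S)
    neigh = x[idx]                                   # (N, S, d)
    q = x @ w_q                                      # (N, d)
    keys, vals = neigh @ w_k, neigh @ w_v            # (N, S, d)
    scores = torch.einsum('nd,nsd->ns', q, keys) / q.size(-1) ** 0.5
    alpha = F.softmax(scores, dim=-1)                # attention over samples
    return torch.einsum('ns,nsd->nd', alpha, vals)   # new node representations
```

Because each node only attends to a fixed number of sampled neighbours rather than its full (possibly dense) neighbourhood, the cost per node stays constant regardless of degree, which is the scalability point raised above.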

Exploring Multilingual Syntactic Sentence Representations [Arxiv]

Attention-based language models like the Transformer and BERT have gained a lot of popularity, and their representations have served as a panacea for downstream NLP tasks. These models capture rich semantics; however, their ability to capture syntax is less well understood. In industry, for very specific applications, we may want representations that capture only syntax. Having representations with rich semantic information means that subsequent models could be biased toward certain topics or vocabulary.

Furthermore, many industrial projects are constrained by time and resources. Modern language models are notorious for their parameter counts and training times, so training on multiple GPUs for a couple of weeks may not be a realistic approach for an experimental project.

Perhaps we want simpler, smaller models that can capture syntactic structure in dense representations and still do the job.

Abstract:

We study methods for learning sentence embeddings with syntactic structure. We focus on methods of learning syntactic sentence embeddings by using a multilingual parallel corpus augmented with Universal Parts-of-Speech tags. We evaluate the quality of the learned embeddings by examining sentence-level nearest neighbours and functional dissimilarity in the embedding space. We also evaluate the ability of the method to learn syntactic sentence embeddings for low-resource languages and demonstrate strong evidence for transfer learning. Our results show that syntactic sentence embeddings can be learned with less training data and fewer model parameters, while achieving better evaluation metrics than state-of-the-art language models.
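
As a rough illustration of the general recipe, not the paper's exact model: one way to get a purely syntactic sentence embedding is to discard the words entirely and encode only their Universal POS (UPOS) tag sequence with a small recurrent encoder. The tag indexing, dimensions, and pooling choice below are assumptions for this sketch.

```python
import torch
import torch.nn as nn

# The 17 Universal POS tags; the index assignment here is an arbitrary choice.
UPOS = ['ADJ', 'ADP', 'ADV', 'AUX', 'CCONJ', 'DET', 'INTJ', 'NOUN', 'NUM',
        'PART', 'PRON', 'PROPN', 'PUNCT', 'SCONJ', 'SYM', 'VERB', 'X']
TAG2ID = {t: i for i, t in enumerate(UPOS)}

class SyntacticEncoder(nn.Module):
    """Embed a sentence from its UPOS tag sequence only, so the resulting
    vector reflects syntactic shape rather than topical vocabulary."""
    def __init__(self, dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(len(UPOS), dim)
        self.rnn = nn.GRU(dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, tag_ids):                  # (batch, seq_len)
        h, _ = self.rnn(self.embed(tag_ids))     # (batch, seq_len, 2*hidden)
        return h.mean(dim=1)                     # mean-pool -> sentence vector

# Example: "The cat sat ." -> DET NOUN VERB PUNCT
tags = torch.tensor([[TAG2ID[t] for t in ['DET', 'NOUN', 'VERB', 'PUNCT']]])
emb = SyntacticEncoder()(tags)                   # (1, 256) syntactic embedding
```

A model like this has a tiny vocabulary (17 tags instead of tens of thousands of wordpieces), which is why the parameter and training-data budgets can stay so small.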

DENS: A Dataset for Multi-class Emotion Analysis [Arxiv]

As a storytelling platform, we believe humans connect with stories through emotions. Surprisingly, little research has focused on understanding narrative emotions, especially in modern writing. Here, we introduce a new dataset for multi-class emotion analysis of long-form narratives in English as a first step toward understanding emotions within stories.

Abstract:

We introduce a new dataset for multi-class emotion analysis from long-form narratives in English. The Dataset for Emotions of Narrative Sequences (DENS) was collected from both classic literature available on Project Gutenberg and modern online narratives available on Wattpad, and annotated using Amazon Mechanical Turk. A number of statistics and baseline benchmarks are provided for the dataset. Of the tested techniques, we find that fine-tuning a pre-trained BERT model achieves the best results, with an average micro-F1 score of 60.4%. Our results show that the dataset provides a novel opportunity in emotion analysis that requires moving beyond existing sentence-level techniques.
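
For flavour, here is a hedged sketch of the kind of baseline the abstract describes: classifying a passage with a pre-trained BERT via the Hugging Face transformers API (our choice for this illustration; the label list and example passage are assumptions, not necessarily the exact DENS label set).

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Illustrative emotion labels; consult the DENS paper for the exact classes.
LABELS = ['joy', 'sadness', 'anger', 'fear', 'anticipation',
          'surprise', 'love', 'disgust', 'neutral']

tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
model = AutoModelForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=len(LABELS))

passage = "She clutched the letter, unable to stop smiling."
inputs = tokenizer(passage, truncation=True, return_tensors='pt')
logits = model(**inputs).logits                  # (1, num_labels)
pred = LABELS[logits.argmax(dim=-1).item()]      # predicted emotion class

# After fine-tuning, evaluate with micro-F1; for single-label multi-class
# predictions this reduces to accuracy. With scikit-learn:
# from sklearn.metrics import f1_score
# f1_score(y_true, y_pred, average='micro')
```

The classification head on top of this checkpoint starts out randomly initialized, so meaningful predictions require fine-tuning on the labelled passages first.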

Want to talk more with us about data and ML? Feel free to drop us an email at ml@wattpad.com.

Also, we are hiring, so check out our job listings page.
