Teaching “Paddling Lingo” Part 1

At PaddleSoft we constantly seek to incorporate new technologies that we believe paddlers will find useful. One of our central goals is to provide paddlers with information quickly and easily. To do this, however, computers need to “understand” the language that paddlers use. The field of Natural Language Processing (NLP) has made significant strides over the past few years, with advanced systems such as IBM Watson answering Jeopardy questions, online chat bots, and sophisticated Q&A modules. One technique computers use is to infer the meaning of a word from its context. Boaters, like other groups, use words in contexts that differ from the dictionary definition, or from what people commonly associate with the word. We therefore thought it would be interesting to see whether, using this technique, we could get a computer to “recognize” or “understand” certain words that paddlers use frequently and what they mean by them.

Some of the words we will look at are:

(definitions from http://www.keelhauler.org/khcc/Paddling_Dictionary.htm)

eddy — “A place in the river, often behind an obstruction or inside a sharp turn, where the water reverses and flows upstream. Eddies are a good place to pause, rest, or boat scout. They are also the place where your gear is likely to collect after your bowman misses the draw stroke, your boat broaches and you forget to lean downstream. See yard sale.”

hole — “is a river feature where water drops over a obstruction (rock ledge or a rock) into deeper water on the downstream side. This causes water on the surface to be drawn back toward the rock or ledge. This can be a potentially hazardous feature but it could also be a feature used for playboating. Low head dam’s are the most dangerous example of a hydraulic.”

ferry — “Angling the boat to move sideways or upstream against a current, a properly executed ferry uses the current to help move the boat sideways. A Hairy Ferry is a ferry with dire consequences if you screw up.”

creek — “Paddling (or simply bouncing down) small, high gradient streams. Also known as steep creeking.”

Let’s see how well a computer can pick up these words based on context.

Word2Vec

“Given enough data, usage and contexts, Word2Vec can make highly accurate guesses about a word’s meaning based on past appearances. Those guesses can be used to establish a word’s association with other words (e.g. “man” is to “boy” what “woman” is to “girl”), or cluster documents and classify them by topic. Those clusters can form the basis of search, sentiment analysis and recommendations in such diverse fields as scientific research, legal discovery, e-commerce and customer relationship management.” -http://deeplearning4j.org/word2vec.html

Word2Vec is an implementation (using neural networks) of the Skip-Gram Model that converts words into vectors based on their context in sentences (if you want more of the technical mathematics behind it, you can find it at http://deeplearning4j.org/word2vec.html). There are Word2Vec implementations in many languages; we chose the Java implementation in Deeplearning4j (Dl4j). Once trained, you can use the model to determine the closeness, i.e., the cosine similarity, between words. The Dl4j website, for instance, shows that the cosine similarity between Sweden and Norway is 0.76. For Word2Vec to function properly, it requires a large, and by large I mean massive, corpus of plain text as training data. For example, people frequently use the entire Wikipedia corpus (that is, every English-language article) as training data. However, we decided against this because paddling terminology often means something different in a paddling context than in a general context (for example hole, eddy, ferry…). We needed a large corpus of paddling-related text in order to get Word2Vec working.
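To make that notion of closeness concrete, here is a minimal sketch of how a trained Dl4j model is queried; the vec variable stands for any already-trained model, and the word pair is just the Dl4j example rather than anything from our data.

// 'vec' is an already-trained org.deeplearning4j.models.word2vec.Word2Vec model
// (requires java.util.Collection and the Dl4j Word2Vec imports)
double sim = vec.similarity("Sweden", "Norway");             // cosine similarity, ~0.76 in the Dl4j example
Collection<String> nearest = vec.wordsNearest("Sweden", 10); // the ten words closest to "Sweden"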

Initial Data Mining

Anyone in the field of data science will tell you that getting the data is often the most difficult part of the process. To build a plain-text file to run Word2Vec on, we initially went to Facebook. There are hundreds of paddling-related Facebook groups with thousands of posts on whitewater paddling. However, extracting the plain-text posts to form a corpus manually would obviously be far too time consuming. So we used Facebook’s Graph API Explorer to issue bulk requests for whitewater-related group and page feeds. These requests returned JSON data, but we still needed just the text of the posts, so we built a JSON parser in Java (using a JSON library) to extract the message attribute. This produced plain text, which we then compiled into a single file. Unfortunately, Facebook limits page feed requests to 250 posts and group requests to 2,000 posts. While this may seem like a lot, remember that the majority of posts are only one- or two-line messages. After five hours of searching for and extracting data we still had only a 410KB text file. Nevertheless, we decided to at least attempt to run Word2Vec on the dataset and see what it produced.
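For readers curious what that parsing step looks like, here is a rough sketch using the org.json library (not necessarily the exact library we used); the field names follow the standard Graph API feed shape, a data array of posts each with an optional message, and the file handling is omitted.

import org.json.JSONArray;
import org.json.JSONObject;

// Minimal sketch: pull the "message" text out of a Graph API feed response
public class FeedMessageExtractor {
    public static String extractMessages(String feedJson) {
        StringBuilder corpus = new StringBuilder();
        JSONArray posts = new JSONObject(feedJson).getJSONArray("data");
        for (int i = 0; i < posts.length(); i++) {
            JSONObject post = posts.getJSONObject(i);
            if (post.has("message")) {                        // not every post has text
                corpus.append(post.getString("message")).append("\n");
            }
        }
        return corpus.toString();                             // appended to the corpus file
    }
}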

Initial Results

When we first ran Word2Vec on the dataset, it produced almost complete garbage. This was mainly because we forgot to remove the “stop words” from the text. Stop words are words like “a,” “I,” “you,” and “of” that contribute very little meaning but are very common in English; they caused other words to become disproportionately related to them. We then removed the stop words and reran Word2Vec. With the stop words removed, our dataset contained approximately 70,635 words (much smaller than the 1.9 billion words of the Wikipedia corpus).
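Removing them is straightforward; a sketch along these lines (the stop word list shown is only a tiny illustrative subset of a real list) is all it takes.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.stream.Collectors;

public class StopWordFilter {
    // A tiny illustrative subset; real stop word lists run to a few hundred entries
    private static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("a", "i", "you", "of", "the", "is", "to", "and"));

    public static String removeStopWords(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .filter(word -> !STOP_WORDS.contains(word))
                .collect(Collectors.joining(" "));
    }
}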

Closest words on the FB dataset (we removed about a hundred excess words):

eddy:

[@, est, dryway, pm, begin, mens, team, west, bulls, bridge]

creek:
[the, this, sunday, -, weekend, saturday, am, release, river, is]

kayak: 
[technology, adventure, paddles, oakley, aunz, yakima, equipment, est, iceland, anton]

wave: 
[technology, @, est, mens, begin, paddles, iceland, anton, pm, oakley]

hole:
[dryway, saturday, sunday, deerfield, pm, river, west, @, am, release]

cosine similarities:

river to creek 
0.54892897605896

hole to sticky 
0.3262569010257721

Here we can clearly see the limitations of a small dataset. Certain words are overrepresented, and Word2Vec even treated non-words like @ and est as valid tokens. Additionally, it might be useful to remove days of the week, along with punctuation, as they add no meaning to the text.

More Data Mining

Clearly we needed a bigger dataset to effectively teach paddling terminology. So we decided to look into other data sources, such as online paddling message boards and other paddling websites. However, without access to these websites’ databases we had to build a text extraction engine. Information extraction is a science in and of itself, as webpages differ widely in formatting. For this reason we generally tried to pick online message boards that generated <div> elements with consistent class ids and whose links to posts were easy to extract. We built a text extraction engine using Jsoup, an HTML parsing library for Java. We also looked at various paddling clubs’ trip reports. Trip reports provided useful data, as they gave several paragraphs of uninterrupted text with few URLs or other odd formatting arrangements. Using this approach we gathered an additional 18MB of data, or 3.2 million additional words.
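To give a sense of the extraction step, here is a rough sketch of how Jsoup can pull post text out of pages whose posts live in consistently named <div> elements; the URL handling is simplified and the post-body class name is a placeholder, since each board needed its own selector.

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class BoardScraper {
    public static String extractPosts(String url) throws java.io.IOException {
        Document doc = Jsoup.connect(url).get();        // fetch and parse the page
        StringBuilder text = new StringBuilder();
        // "div.post-body" is a placeholder selector; each board used its own class id
        for (Element post : doc.select("div.post-body")) {
            text.append(post.text()).append("\n");      // strip tags, keep plain text
        }
        return text.toString();
    }
}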

Cleaning Text and Fine-Tuning the Training

Before retraining the neural network we did extensive cleaning of the text. We removed stop words and punctuation marks, and we set the minimum word frequency to five in the training parameters in order to weed out URLs and misspelled words. Our final Word2Vec configuration ended up looking like this:


// layer, table, cache, iter and t (layer size, lookup table, vocab cache,
// sentence iterator and tokenizer factory) are defined earlier in the program
Word2Vec vec = new Word2Vec.Builder()
    .minWordFrequency(5).iterations(1)    // ignore words that appear fewer than 5 times
    .layerSize(layer).lookupTable(table)  // dimensionality of the word vectors
    .stopWords(new ArrayList<String>())
    .vocabCache(cache).seed(42)
    .windowSize(5).iterate(iter)          // context window of 5 words
    .tokenizerFactory(t).build();
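Once built, training and querying the model looks roughly like this (a sketch rather than our exact code): fit() builds the vocabulary and trains the vectors, and the same calls produce the nearest-word lists and cosine similarities shown below.

vec.fit();                                                  // build the vocabulary and train the vectors
Collection<String> nearEddy = vec.wordsNearest("eddy", 10); // ten closest words to "eddy"
double riverCreek = vec.similarity("river", "creek");       // cosine similarity between two words
double holeSticky = vec.similarity("hole", "sticky");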

Results

Initial results on the new dataset were still not great despite our best efforts; useless words managed to creep into the results. Abbreviations for the days of the week remained, along with odd punctuation marks. This pointed to problems with our text extraction and cleaning methods, as a lot of noise was still getting into the text. We also did not remove all the stop words, since some still showed up. Part of the problem, however, may also stem from the informal nature of message boards themselves: whereas the Wikipedia corpus has only minor grammatical mistakes and minimal use of abbreviations, message boards feature tons of impromptu and poorly written posts.

eddy

[triple, am-, -, aka, along, haven, video, lifetime, rt, race]

creek 
[decided, flipped, times, wait, managed, thru, walked, perfect, deal, realized]

kayak 
[learned, teach, look, next, gotten, learning, canoe, taught, possible, loved]

wave 
[hull, alot, can, available, tons, comfortable, are, glad, an, <span]

hole
[curious, <span>, have, ledge, tue, <span, or, except, sep, thu]

cosine similarity

river and creek 
0.048014018684625626

hole and sticky 
0.32787105441093445

Discussion, Code and Further Experiments

This experiment provided interesting insight into the use of natural language processing techniques with regard to paddling and whitewater. We believe we could achieve much better results by continuing to mine paddling message boards and further cleaning the text. Over the next few months we will build and refine our text corpus. While we at PaddleSoft are still unsure of the final role Word2Vec will play in our information retrieval systems, we found this test a fun way to show the utility of emerging NLP technologies. We also plan on testing newer versions of the Skip-Gram Model, such as the Multi-Sense Skip-Gram and the Non-Parametric Skip-Gram.

You can find the code for our Word2Vec runs at

https://github.com/isaacmg/word2vec-paddlelingo

as well as our Facebook JSON parser at https://github.com/isaacmg/FBJSONParser

Remember, we at PaddleSoft are constantly investigating the newest technologies in order to make it easier to find whitewater-related information.