2020 is the year you should stop using Ctrl-F, here’s why:

Published in auquan · 4 min read · Jun 29, 2020

Have you ever found yourself depending on Ctrl-F as you crawled through news, a long report or series of academic papers? Currently, it is not uncommon to see analysts the world over using this simple tool to find what they are looking for in a mountain of text. This is clearly a suboptimal strategy. Much of the time spent on searching through these documents should be allocated to more valuable tasks.

Before discussing some solutions for these problems, it is important to understand why manually searching using Ctrl-F is ineffective in a lot of cases.

The Problem with Ctrl-F

Take the role of an investment manager, who must continuously monitor the latest news to stay ahead of any calamities. If they are responsible for a large number of stocks, critical information can easily be lost in the overwhelming quantity of reports coming out, which could end up being very costly. Furthermore, when time is so limited, knowing the major topics, companies, names, locations and so on in each report would allow an analyst to prioritise more efficiently, instead of having to sift through every report in full.

Today, most people will just search each report for a particular word using Ctrl-F. This ignores the context in which the word appears and the multiple meanings the word may have. In the end, they have to tediously sift through many false positives. This is time-consuming even for short reports, and frustrating for long reports when hundreds of instances of the same word are found. Ctrl-F is more like a microscope and shouldn’t be used to understand the landscape.
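To make the false-positive problem concrete, here is a minimal sketch of what a Ctrl-F style search does. The three sentences are invented for illustration, and `ctrl_f` is a hypothetical helper name, not a real tool's API.

```python
import re

# Invented snippets standing in for report text (not taken from any real report).
reports = [
    "Apple reported record revenue for the quarter.",
    "The orchard sold every apple it grew this season.",
    "Pineapple exports rose sharply in June.",
]

def ctrl_f(term, text):
    """Case-insensitive substring search, roughly what Ctrl-F does.

    Returns the start position of every match, with no notion of context.
    """
    return [m.start() for m in re.finditer(re.escape(term), text, re.IGNORECASE)]

for text in reports:
    positions = ctrl_f("apple", text)
    print(len(positions), "hit(s):", text)
```

All three sentences produce a hit, yet only the first is about the company. The search has no way to tell the corporation from the fruit, or from part of a longer word.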

Sometimes the report may not contain the original term but a synonym for it. For example, the most common term used to report revenue is ‘Revenues’, however some companies report their revenue as ‘SalesRevenueNet’ or ‘SalesRevenueGoodsNet’. Other times, the report may contain both the word and the synonym. For example, current news will use coronavirus and COVID-19 interchangeably. Because a single search term cannot cover all of these variants, it is inevitable that many key pieces of information, again, will be missed.
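One manual workaround is to search for a whole group of synonyms at once. The sketch below assumes a hand-maintained synonym table; the terms come from the examples above, but the grouping, the sample sentence and the function names are assumptions made for illustration.

```python
import re

# Hand-maintained synonym groups (an assumption for this sketch).
SYNONYMS = {
    "revenue": ["Revenues", "SalesRevenueNet", "SalesRevenueGoodsNet"],
    "coronavirus": ["coronavirus", "COVID-19", "COVID19"],
}

def build_pattern(concept):
    """Compile one case-insensitive pattern that matches any synonym of a concept."""
    # Try longer alternatives first so a short term never shadows a longer one.
    alternatives = sorted(SYNONYMS[concept], key=len, reverse=True)
    return re.compile("|".join(re.escape(a) for a in alternatives), re.IGNORECASE)

def find_concept(concept, text):
    """Return every synonym of the concept found in the text, in order."""
    return [m.group(0) for m in build_pattern(concept).finditer(text)]

# Invented sentence, for illustration only.
text = "SalesRevenueNet fell as the coronavirus spread; COVID-19 cut Revenues further."
print(find_concept("revenue", text))
print(find_concept("coronavirus", text))
```

This catches the variants you thought of in advance, but it still matches strings rather than meaning, and someone has to keep the table up to date.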

A lack of contextual understanding and the absence of semantic awareness mean that relying solely on Ctrl-F is an unnecessary waste of time for you and your team. But it doesn’t have to be so. Modern Natural Language Processing (NLP) techniques provide us with tools that enable an analyst to quickly skim through reports and ensure nothing is missed. We discuss two of them below.

Solutions

Named-Entity Recognition

Named-Entity Recognition (NER) is a particular subtask of text information extraction, which aims to identify the named “entities” (which might include places, names, companies, figures, currencies, percentages) that lie in the text and to classify them. An NER model is able to pick up appropriate phrases, and not just a single word, providing additional context. For instance, in the extract below, the model discovers the phrase “fiscal 2020 first quarter”, rather than just “2020” or “first quarter”, because it is clear from the context that these words ought to be grouped together. This contextual understanding is something Ctrl-F does not have, and it gives NER an immediate advantage.

In NER, the term “entities” is rather broad and can refer to many different predefined categories that have a physical or abstract existence. For example, below is the first paragraph from Apple’s press release earlier this year. The paragraph has been analysed using Amazon’s Standard NER model, and some of the relevant words or phrases have been highlighted.

Extracting these entities is not as easy as it might first appear. For instance, Apple the corporation must be differentiated from the fruit. NER is able to leverage the context that a word lies in to make accurate classifications, which allows it to disambiguate a word’s meaning and then label the word appropriately.
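The two ideas above, grouping words into phrases and using context to disambiguate, can be sketched with a toy rule-based tagger. This is emphatically not how a trained NER model works internally: the patterns, cue words, labels and sample sentences below are all hand-written assumptions, whereas a real model learns such cues from data.

```python
import re

# Hand-written phrase patterns, assumed purely for illustration.
PATTERNS = [
    ("DATE", re.compile(r"fiscal \d{4} (?:first|second|third|fourth) quarter", re.I)),
    ("MONEY", re.compile(r"\$\d+(?:\.\d+)? (?:million|billion)")),
    ("PERCENT", re.compile(r"\d+(?:\.\d+)? percent")),
]

# Context words hinting that "Apple" is a company, not a fruit (an assumption).
COMPANY_CUES = {"announced", "reported", "revenue", "shares", "quarter"}

def tag_entities(text):
    """Return (label, phrase) pairs found in the text, in order of appearance."""
    found = [(m.start(), label, m.group(0))
             for label, pattern in PATTERNS
             for m in pattern.finditer(text)]
    # Look at the surrounding words to decide how to label "Apple".
    words = set(re.findall(r"[a-z]+", text.lower()))
    for m in re.finditer(r"\bApple\b", text):
        label = "ORGANIZATION" if words & COMPANY_CUES else "OTHER"
        found.append((m.start(), label, m.group(0)))
    return [(label, phrase) for _, label, phrase in sorted(found)]

# Invented sentences with made-up figures, for illustration only.
sentence = ("Apple announced results for its fiscal 2020 first quarter, "
            "with revenue of $10.5 billion, up 4 percent.")
print(tag_entities(sentence))
print(tag_entities("Apple pie is on the menu."))
```

In the first sentence “fiscal 2020 first quarter” comes back as a single DATE phrase and “Apple” is labelled as an organisation; in the second, with no company cues nearby, the same word is not. Hand-written rules like these break down quickly on real text, which is exactly why trained models are used instead.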

NER also leads naturally on to other methods of extracting and organising information from unstructured text. For instance, you can combine NER with sentiment analysis (where text is given a sentiment rating by a model to determine how negative, positive, or neutral it is). Knowing at a glance the sentiment of the report, as well as which companies are mentioned, is a powerful combination. Ctrl-F offers no comparable route to more intelligent ways of analysing text.
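As a rough sketch of that combination, the snippet below pairs a list of company names with a crude word-counting sentiment score. Everything here is an assumption for illustration: the word lists are hand-picked, the watchlist stands in for the output of an NER model, “ExampleCorp” is a made-up company, and the reports are invented. Real sentiment models are trained on labelled text rather than built from word lists.

```python
import re

# Tiny hand-picked word lists, assumed for illustration only.
POSITIVE = {"record", "growth", "strong", "beat"}
NEGATIVE = {"loss", "decline", "weak", "missed"}

# Hypothetical watchlist standing in for entities an NER model would return.
WATCHLIST = {"Apple", "ExampleCorp"}

def summarise(report):
    """Return the watched companies mentioned and a crude sentiment label."""
    words = re.findall(r"[A-Za-z]+", report)
    companies = sorted(WATCHLIST.intersection(words))
    lowered = [w.lower() for w in words]
    # Score = positive word count minus negative word count.
    score = sum(w in POSITIVE for w in lowered) - sum(w in NEGATIVE for w in lowered)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"companies": companies, "sentiment": label, "score": score}

# Invented reports, for illustration only.
reports = [
    "Apple posted record revenue on strong iPhone growth.",
    "ExampleCorp shares fell after a weak quarter and a wider loss.",
]
for report in reports:
    print(summarise(report))
```

Even this crude version gives an analyst a one-line summary per report (who is mentioned, and roughly in what tone), which is enough to decide what to read first.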

Unstructured text is rich with information, and with a vast number of reports to read through, NER can be an invaluable tool to quickly extract and categorise the key names, companies, figures and locations so that an informed decision can be made about which reports to read. NER is able to use context, something Ctrl-F lacks, to find these entities. NER can also be used as a stepping stone to other forms of text processing, such as sentiment analysis.

Full article with images at: https://blog.auquan.com/page/ctrl-f

Written by David Ardagh