LLMs for innovation and technology intelligence: news categorization and trend signal detection

Matthias Plaue
Published in MAPEGY Tech
Jun 6, 2023

A neon sign that spells “HYPE”. Photo by Verena Yunita Yapi on Unsplash.

This blog post documents part of MAPEGY’s contribution to the research project KI4BoardNET funded by the Federal Ministry of Education and Research (Germany).

This text has been written by a human. The AI assistant Claude has been used to aid in the process.

In order not only to maintain competitive vitality in this highly competitive environment, but also to find new opportunities for competitive advantage, firms have a strong incentive to detect relevant emerging topics and trends at an early stage to develop adequate response strategies for the future. [Muhlroth & Grottke 2022]

Introduction: NLP and news categorization

In applications of business intelligence, news articles are an important source for relevant and timely information. Methods from natural language processing (NLP) and text mining can be used to analyze these data and extract relevant insights. For example, news articles can be used to gauge public sentiment — see my previous blog post.

NLP methods can also help the analyst explore a large collection of news articles more efficiently by detecting events and trends [Panagiotou et al. 2022], or summarizing key points [Ma et al. 2022].

Removing irrelevant results such as fake news [Capuano et al. 2023] can also reduce the data deluge. Conversely, the analyst greatly benefits from a system that helps them focus on the news most relevant to their domain.

One approach to identifying the articles that are most likely to contain relevant information is automated news categorization. Technology and trend scouts, who wish to collect data that informs innovation strategy, are particularly interested in news that belongs to one or more of the following categories, which we can also refer to as genres:

  • market research reports,
  • legal news (law, litigation, and policy),
  • startup news,
  • business relations (mergers, acquisitions, and partnerships),
  • product news.

A simple technique for automated news categorization is the search for keywords. For example, we can expect a news article that contains one of the keywords “startup”, “venture capital”, or “angel investor” to be an article that belongs to the genre of startup news.

The goal of this report is to compare the performance, in terms of accuracy and runtime, of traditional keyword search for news categorization with state-of-the-art methods from machine learning.

Trend signal detection

In addition to the news categories listed in the previous section, we want to detect trend signals which we understand as news articles that describe events, claim facts, or reflect on opinions that point to the potential development of significant change in the landscape of innovation and technology. In other words, trend signals can be understood as precursors to emerging trends.

The news genre of trend signals is very broad, and may refer to any of the following sub-categories. Some of those sub-categories may have a large overlap with one or more of the news genres defined in the previous section.

1. Science and Technology

1a. Novel materials or methods. News articles discussing the development and launch of new, innovative manufacturing techniques, as well as newly-created materials that can improve products, services, or technologies. Example: ‘Smart plastic’ material is step forward toward soft, flexible robotics and electronics

1b. Advancements in efficiency or effectiveness. Articles covering successful improvements to existing products or technologies regarding functionality, adaptability, performance, or usability. Example: New battery tech boosts EV range by 20%

1c. Innovative applications of existing technologies. Articles reporting creative new uses of current technologies or products, giving them alternative purposes. Can relate to recycling or repurposing byproducts, materials, applications or systems. Example: 3 Surprising Uses for Depleted EV Batteries

1d. Scientific discoveries and breakthroughs. Articles covering major discoveries in scientific research, new inventions, advancements or discoveries that solve problems, reduce costs, or enable new applications. Example: Scientists break world record for solar power window material

2. Economics and Politics

2a. Startups. Articles profiling innovative new startups with proprietary technologies, techniques, designs or materials that address industry challenges, reshape manufacturing, or promote sustainability. May detail a startup’s work, partnerships, or funding.

2b. Mergers, acquisitions, and partnerships. Articles covering companies investing in startups, collaborating strategically, merging, being acquired, going public, or raising capital to enable product development or launches.

2c. Policy changes, new legislation, and funding opportunities. Articles announcing government decisions, policies, public contracts, funding programs, laws, regulations, or economic policies affecting specific industries or markets. Policies may relate to changes in political leadership.

2d. New market entrants. Articles covering the emergence of new competitors in existing markets, including large firms expanding into new sectors or new firms entering established markets. Example: Tesla opens its EV charging network to the masses

3. Society and Markets

3a. “Hype” or “buzz” surrounding technologies or high-tech products. Articles discussing temporary surges of attention for contemporary emerging technologies or products among researchers, industry players, policymakers, or users. Example: Five technology trends that will define the future of EVs

3b. Events or claims influencing public opinion of technologies or players. Articles reporting unforeseen news or events that negatively or positively impact public perception of specific technologies, companies, or industry leaders. Could cover accidents, lawsuits, misconduct allegations, product defects, or reputation-building announcements. Example: 10 Dirty Truths Of Electric Cars Nobody Is Talking About

3c. Launch of new high-tech products. Articles announcing the release of new technology products, systems, materials, techniques, features or designs that provide additional functionality, applications or benefits. Example: The New Abarth 500e: The Scorpion Stings Again, Now In Full Electric Mode

Trend signals can be strong signals, i.e., about events that are widely reported on. As a result, many trend signals may point to the same event, and cluster analysis can help identify those events.

What makes trend signal detection a powerful concept, however, is that trend signals are not defined by signal strength, and can therefore also be early signals or weak signals. More traditional methods for trend detection based on time series analysis have a difficult time detecting early signals because there is no emerging trend yet that could be identified in a robust manner. Similarly, weak signals are often drowned out by noise, which makes them difficult to detect by unsupervised analysis of the data stream alone.

Setup and scope of evaluation

For news categorization and trend signal detection, both rule-based approaches (i.e., keyword search) and machine learning can be used.

Any given article may belong to any number of categories, including the case where the article is not assigned to any of the categories. This is known as a multi-label classification task. We will treat this task as a sequence of binary classification tasks aimed at determining whether the article belongs to a specific category, which corresponds to assigning a positive class label. If the classifier assigns a negative class label, it has determined that the article does not belong to the respective category.
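
To make this decomposition concrete, here is a minimal illustration; the genre names and classifier functions are purely illustrative:

GENRES = ["market research", "legal news", "startup news",
          "business relations", "product news", "trend signal"]

def categorize(article: str, binary_classifiers: dict) -> list[str]:
    # One independent binary decision per category; an article may end up
    # with zero, one, or several genre labels (classifiers are hypothetical).
    return [g for g in GENRES if binary_classifiers[g](article)]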

The goal of this report is to compare the performance of the following methods:

  • rule-based categorization based on the identification of keywords in the title and short summary of the article, and in short descriptors of the news source/feed,
  • zero-shot classification applied to the title and short summary,
  • a large language model (LLM) finetuned on a custom dataset labeled by MAPEGY.

In practice, a keyword search can be implemented most effectively through regular expressions. The following code is an example of a regular expression that, if it matched the title, would result in the document being categorized as “startup news”:

(\yseed|\ycrowd) funding|\yventure capital\y|
\yangel invest|\yentrepr?eneur|silicon valley|
\ystart-?ups?\y|
((\ynew\y|\yinnovative\y|
\ynewly\y \y[[:alpha:]]*\y) (\ycompan(y|ies)\y|
\ycorporations?\y|\yorgani(z|s)ations?\y|
\yendeavou?r\y|\yventure\y|entrants?))
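
Note that the \y word-boundary escape is PostgreSQL syntax; Python’s re module uses \b instead. The following is a minimal sketch of how such a rule can be applied in Python, using a shortened, hypothetical variant of the pattern above:

import re

# Simplified, hypothetical variant of the "startup news" pattern;
# \b is Python's equivalent of PostgreSQL's \y word boundary.
STARTUP_PATTERN = re.compile(
    r"\b(?:seed|crowd) funding\b"
    r"|\bventure capital\b"
    r"|\bangel invest"
    r"|\bstart-?ups?\b",
    re.IGNORECASE,
)

def is_startup_news(title: str) -> bool:
    # Positive label if the pattern matches anywhere in the title.
    return STARTUP_PATTERN.search(title) is not None

print(is_startup_news("Start-up raises seed funding for battery tech"))  # True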

Preparation of training and test data

As of June 2023, the MAPEGY Innovation Graph includes more than 70 million news articles, collected since 2016 from more than 10,000 news feeds. MAPEGY’s data team curates these feeds, ensuring that they include sources that are highly relevant to the domain of innovation and technology:

  • technology news like MIT Technology Review, The Verge, TechCrunch,
  • science news from outlets like Nature or aggregators like Phys.org, ScienceDaily,
  • general business news like Forbes, Financial Times,
  • specific news alerts set up to target and monitor high-tech companies as well as the latest technological, economic, and societal trends,
  • key information sources that focus on a wide array of industries, innovative services, and manufactured products, from 3D printing to white goods.

These sources are complemented by feeds set up by international news agencies and national news outlets, such as Reuters, BBC, CNN, Al Jazeera, and South China Morning Post.

One particular challenge in preparing a manually labeled dataset for training and testing the news categorization algorithms is that the label distribution for most of our news categories is severely imbalanced. For example, only about 1% of all news in MAPEGY’s database are startup news. Given limited resources and only a few human data labelers, this makes it impractical to simply retrieve a random sample of news articles and label them manually: we would have to label far too many documents to create training and test datasets with a sizable number of positive examples.

Instead, the datasets to be presented to the data labelers were extracted via heuristics that select documents based on the regular expressions used for rule-based categorization. The rules were relaxed so that they produce a sizable number of positive examples while still being representative of the corpus. For example, searching for the term “acquire” in the short summary of an article instead of just the title retrieves documents that may be about business relations, but also articles about other events or facts unrelated to that genre.

The documents extracted in such a way were aggregated and presented for labeling in pairs, i.e., given one of three datasets, the data labeler had to decide whether each article belongs to the genre of:

  • market research reports, legal news, both, or neither,
  • startup news, business relations, both, or neither,
  • product news, trend signals, both, or neither.

The following table shows the datasets available after labeling, with percentage of positive class labels, and estimated percentage of positive class labels in the corpus:

Table of news genres: number of labeled articles (typically 1,000–1,200), percentage of positive class labels in the test dataset (typically 30–40%), and estimated percentage of positive class labels in the corpus (typically 1–5%)

The percentage of positive class labels in the corpus is an estimate based on the rule-based labels of a large number of examples.

The regular expressions could also be used as labeling functions for weakly supervised learning, a notable option that we do not explore further in this blog post; see [Lison 2021] for more information.

From each dataset, 300 articles were selected at random to serve as the test dataset.
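
With the labeled data in a pandas dataframe (the variable df below is hypothetical), such a split can be produced as follows:

from sklearn.model_selection import train_test_split

# Hold out 300 randomly selected articles as the test set.
train_df, test_df = train_test_split(df, test_size=300, random_state=42)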

Implementation and used technologies

The evaluation was run in Google Colab, with the training and test data stored in Google Sheets. The Python packages used have already been listed in a previous blog post.

For rule-based categorization, regular expressions were used that had been created manually by trial and error.

For zero-shot classification, the popular bart-large-mnli model available through the Hugging Face model repository was used: a BART model [Lewis et al. 2019] trained on the MultiNLI dataset [Williams et al. 2018].

The zero-shot model was tasked to determine whether the news would either be a better fit to the generic prompt “general news”, or one of the following: “market research”, “startup news”, “business relations”, “consumer news, products”, “law, litigation, and policy”, “buzz, hype, trends”.
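
As a rough sketch, this comparison can be implemented with the Hugging Face transformers pipeline; the pairwise decision rule below mirrors the setup described above, but the exact implementation details are assumptions:

from transformers import pipeline

# Load the zero-shot classifier (the model is downloaded on first use).
classifier = pipeline("zero-shot-classification",
                      model="facebook/bart-large-mnli")

def zero_shot_is_genre(text: str, genre_prompt: str) -> bool:
    # Pit the genre prompt against the generic "general news" prompt;
    # assign a positive label if the genre prompt scores higher.
    result = classifier(text, candidate_labels=[genre_prompt, "general news"])
    return result["labels"][0] == genre_prompt

print(zero_shot_is_genre(
    "Start-up raises seed funding for battery tech", "startup news"))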

It should be noted that binary classification can be considered an out-of-scope use of zero-shot classification. Secondly, no serious attempt was made to optimize the prompts. Consequently, the evaluation results with respect to zero-shot classification only serve as an additional baseline, and must not be considered representative for the technique in general.

The basis for finetuning was the distilroberta-base model.
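
The following is a minimal sketch of how such finetuning might look with the Hugging Face Trainer API; the hyperparameters and toy examples are illustrative assumptions, not our production setup:

from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilroberta-base", num_labels=2)  # one binary classifier per genre

# Toy stand-ins for the labeled MAPEGY dataset (hypothetical).
texts = ["Start-up raises seed funding for battery tech",
         "Heavy rain expected tomorrow"]
labels = [1, 0]

dataset = Dataset.from_dict({"text": texts, "label": labels})
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length"),
    batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="genre-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset,
)
trainer.train()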

Evaluation results

Since the dataset in production is significantly more imbalanced than the test dataset used for evaluation, we need to use performance metrics that are invariant to a change in the proportion of positive vs. negative labels. One option is the balanced accuracy, given as follows:

balanced accuracy = 0.5 × (specificity + sensitivity)

The sensitivity is also known as recall. The Youden index [Youden 1950] is a simple rescaling of the balanced accuracy which ranges between minus one and plus one:

Youden index = 2 × balanced accuracy - 1

A Youden index of one would indicate that every prediction the classifier made was correct, while an index equal to minus one would mean that the classifier assigns the opposite of the true label for every test example. A vanishing Youden index corresponds to the baseline of assigning the same label, positive or negative, to every test example. Another performance metric with similar properties is the Matthews correlation coefficient [Baldi et al. 2000].
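
Both metrics are readily computed with scikit-learn; a minimal sketch with made-up labels:

from sklearn.metrics import balanced_accuracy_score, matthews_corrcoef

y_true = [1, 1, 1, 0, 0, 0, 0, 0]  # hypothetical ground-truth labels
y_pred = [1, 1, 0, 0, 0, 0, 1, 0]  # hypothetical predictions

balanced_acc = balanced_accuracy_score(y_true, y_pred)
youden = 2 * balanced_acc - 1  # rescale from [0, 1] to [-1, 1]
mcc = matthews_corrcoef(y_true, y_pred)
print(f"balanced accuracy: {balanced_acc:.2f}, "
      f"Youden index: {youden:.2f}, MCC: {mcc:.2f}")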

Comparing the Youden index for the different methods across news genres yields the following results:

Table of Youden index computation: finetuned LLM always wins out

The rule-based method determines whether a news article belongs to a particular category at a rate of about 6,000 documents per second, on average. The runtime depends on the complexity and length of the regular expression, typically ranging from 2,000 to 10,000 documents per second.

On a “standard” GPU in Google Colab, the zero-shot classifier has a runtime performance of merely 4 examples per second; the finetuned model’s performance is 36 examples per second.

The following confusion matrix shows that the finetuned LLM exhibits almost perfect prediction performance on the task of identifying market reports:

Confusion matrix: 96% specificity, 100% recall
Confusion matrix for market research report identification with a finetuned LLM

The matrix is normalized so that the rows sum to 100%, i.e., the diagonal entries show specificity and recall.

The regular expression for identifying market reports, on the other hand, is geared towards specificity, so as not to tag news articles as market research reports when they are, in fact, not:

Confusion matrix: 99% specificity, 19% recall
Confusion matrix for market research report identification with regular expressions

Other designs for regular expressions may lead to very different characteristics in performance.

Startup news can be identified quite well with the rule-based approach:

Confusion matrix: 83% specificity, 81% recall
Confusion matrix for startup news identification with regular expressions

We may assume that this is because the startup genre can be described well by a limited set of characteristic keywords. However, an LLM still performs significantly better:

Confusion matrix: 92% specificity, 96% recall
Confusion matrix for startup news identification with a finetuned LLM

Detecting trend signals is a hard task even for the finetuned LLM. We can speculate that this may be because of the broad and less well-defined nature of that category. Additional training data might improve performance. Still, the algorithm is able to correctly identify 85% of the trend signals in the data stream:

Confusion matrix: 71% specificity, 85% recall
Confusion matrix for trend signal detection with a finetuned LLM

Conclusion

  • An LLM, finetuned on a small dataset consisting of about 1000 news articles per class label, delivers results that are clearly superior in quality compared to a rule-based news categorization based on matching keywords.
  • Determining the categories for 70 million news articles can be expected to take about 135 days of “standard” GPU computation time (six binary classifications per article at 36 examples per second).
  • Rule-based news categorization is faster by orders of magnitude.

References

[Muhlroth & Grottke 2022] Christian Muhlroth and Michael Grottke. “Artificial Intelligence in Innovation: How to Spot Emerging Trends and Technologies.” IEEE Transactions on Engineering Management. 69, no. 2 (April 2022): 493–510.

[Panagiotou et al. 2022] Nikolaos Panagiotou, Antonia Saravanou, and Dimitrios Gunopulos. “News Monitor: A Framework for Exploring News in Real-Time.” Data 7, no. 1 (2022): 3.

[Capuano et al. 2023] Nicola Capuano, Giuseppe Fenza, Vincenzo Loia, and Francesco David Nota. “Content-Based Fake News Detection With Machine and Deep Learning: a Systematic Review.” Neurocomputing 530 (2023): 91–103.

[Ma et al. 2022] Congbo Ma, Wei Emma Zhang, Mingyu Guo, Hu Wang, and Quan Z. Sheng. “Multi-document Summarization via Deep Learning Techniques: A Survey.” ACM Comput. Surv. 55, no. 5 (2022).

[Lewis et al. 2019] Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov and Luke Zettlemoyer. “BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension.” Annual Meeting of the Association for Computational Linguistics (2019).

[Williams et al. 2018] Adina Williams, Nikita Nangia and Samuel Bowman. “A Broad-Coverage Challenge Corpus for Sentence Understanding through Inference.” Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, vol. 1 (2018): 1112–1122.

[Lison 2021] Pierre Lison, Jeremy Barnes and Aliaksandr Hubin. “skweak: Weak Supervision Made Easy for NLP.” Annual Meeting of the Association for Computational Linguistics (2021): 337–346.

[Youden 1950] W. J. Youden. “Index for rating diagnostic tests.” Cancer 3, no. 1 (1950): 32–35.

[Baldi et al. 2000] Pierre Baldi, Søren Brunak, Yves Chauvin, Claus A. F. Andersen and Henrik Nielsen. “Assessing the accuracy of prediction algorithms for classification: an overview.” Bioinformatics 16, issue 5 (May 2000): 412–424.
