Planning an automated horizon scanning process

Tom Willcocks
Published in Discovery at Nesta · May 7, 2024
Image: Blueprint of a space telescope (NASA)

This article was co-written with Karlis Kanders.

In this blog, we introduce our upcoming work on an automated, data-driven horizon scanning approach.

It is a new project to boost Nesta’s foresight capacity and provide up-to-date insights on innovation related to sustainability, health and education. This should be interesting to anyone doing futures and foresight work.

We cover the following:

  • The advantages of automating horizon scanning
  • An MVP pipeline on investment signals
  • Emerging tech stack for automated scanning
  • Conclusion and lessons learned (so far)

The advantages of automating horizon scanning

The Discovery Hub helps Nesta stay on top of emerging trends, technologies and interventions, both through established horizon scanning approaches such as desk research and through data-driven projects using data science and machine learning. Over the past few years, we have been developing data-driven approaches for analysing innovation trends in various domains: from low-carbon heating tech for reducing household carbon emissions, to exploring food technology trends and their impact on food environments, to innovations for supporting child development.

However, innovation is, by definition, not static: new research projects are started, new startups get founded, and new policies are constantly being published. Therefore, it’s critical for Discovery to continuously update its view on the innovation landscape.

We are thus entering the next phase of our data-driven foresight work, transitioning from one-off ‘snapshots’ to a more automated, continuously updating process.

This will enable frequent updates to our trends analysis with reduced manual effort, as well as surfacing instances of new innovations, research projects and startups on an ongoing basis — ensuring we are prepared to provide the most up-to-date insights as and when they are needed.

An MVP pipeline on investment signals

Building an automated analysis pipeline is markedly different from doing a one-off research project and requires a different — data engineering — skill set. As this is new territory for the Discovery Hub, we are taking an iterative approach and starting with a minimum viable product that will be initially tailored to one of our internal stakeholders, Nesta’s Impact Investments team.

We’re building on previous collaborations which have helped to prototype the data-driven analysis process and demonstrate its value for understanding investment trends and originating investment leads. To further understand the user needs, we drew from agile practices and ran a requirements prioritisation workshop, which produced several user stories that are now guiding the development of our minimum viable product.

The general architecture of our initial prototype pipeline is shown in the figure below. It consists of three main stages: (1) collection, where we update our datasets; (2) processing, where we classify and enrich the data and synthesise it into information for our stakeholders; and (3) dissemination via various channels to the wider organisation.

Automated horizon scanning pipeline prototype
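
To make the shape of the pipeline more concrete, here is a minimal Python sketch of the three stages chained together. The Signal structure and function names are illustrative placeholders, not our actual code.

```python
# A minimal, illustrative sketch of the three pipeline stages. The Signal
# structure and function bodies are placeholders, not the real pipeline.
from dataclasses import dataclass, field


@dataclass
class Signal:
    """A new data point surfaced by the pipeline (e.g. a new funding round)."""
    source: str                                          # e.g. "venture_capital"
    description: str
    missions: list[str] = field(default_factory=list)    # filled in by classification
    metadata: dict = field(default_factory=dict)         # filled in by enrichment


def collect() -> list[Signal]:
    """Stage 1: fetch the latest data and return anything new since the last run."""
    return []  # placeholder: real collection queries the investment database


def process(signals: list[Signal]) -> list[Signal]:
    """Stage 2: classify, enrich and synthesise the new signals."""
    return signals


def disseminate(signals: list[Signal]) -> None:
    """Stage 3: share the synthesised information, e.g. as a daily Slack message."""
    for signal in signals:
        print(signal.description)


if __name__ == "__main__":
    disseminate(process(collect()))
```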

Collection

Our minimum viable product begins with an automated daily collection of the most recent venture capital investments data. We determine whether there are any new companies or new venture funding rounds that have been added to the database since the previous data collection. We call these new data points “signals”.
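
As a rough illustration, detecting these signals can be as simple as comparing the latest data pull against the previous one and keeping the rows with previously unseen identifiers. The column and file names below are hypothetical.

```python
# Illustrative signal detection: diff the latest snapshot against the previous one.
# Column and file names are hypothetical, not the actual dataset schema.
import pandas as pd


def find_new_signals(latest: pd.DataFrame, previous: pd.DataFrame,
                     id_col: str = "deal_id") -> pd.DataFrame:
    """Return rows present in the latest pull but not in the previous one."""
    new_ids = set(latest[id_col]) - set(previous[id_col])
    return latest[latest[id_col].isin(new_ids)]


# Example usage against two daily snapshots
previous = pd.read_parquet("funding_rounds_2024-05-06.parquet")
latest = pd.read_parquet("funding_rounds_2024-05-07.parquet")
signals = find_new_signals(latest, previous)
```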

Going forward, we plan to improve the minimum viable product by adding other valuable innovation data, such as research funding in the UK (e.g., from the UKRI’s Gateway to Research database or NIHR’s Open Data site), research publications (OpenAlex), patents (Google Patents) and public and policy discourse (e.g., from various news media and public sector websites).

By approaching this work iteratively, the automated pipeline can start bringing value early on, while we are adding more datasets in parallel.

Each new dataset, however, adds to the technical complexity and maintenance needs of the pipeline. So it will be important to prioritise them in terms of the value they bring to the organisation.

Processing

Classification of new data

Before we share the new signals with colleagues in our organisation or incorporate them into our analyses, we need to determine whether they are relevant to Nesta’s three mission areas.

For this purpose, we have a classification step, where we’ll use methods ranging from simple keyword search to supervised machine learning to classify incoming signals in terms of our three missions and more granular mission area topics. For example, for new companies related to the sustainable future mission, we would also like to determine whether their descriptions mention specific low-carbon heating technologies, renewable energy sources or energy efficiency measures.
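
A minimal keyword-matching version of this classification step might look like the sketch below; the mission names and keyword lists are illustrative examples and far smaller than a real taxonomy.

```python
# Simple keyword-matching classifier. The mission names and keyword lists are
# illustrative examples only, not the actual taxonomy.
MISSION_KEYWORDS = {
    "sustainable_future": ["heat pump", "solar", "insulation", "low-carbon heating"],
    "healthy_life": ["food environment", "nutrition", "obesity"],
    "fairer_start": ["early years", "childcare", "child development"],
}


def classify_signal(text: str) -> list[str]:
    """Tag a signal's description with any mission whose keywords it mentions."""
    text = text.lower()
    return [
        mission
        for mission, keywords in MISSION_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    ]


print(classify_signal("Startup raises seed round for ground-source heat pump installs"))
# -> ['sustainable_future']
```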

Note that while the datasets we use might often have existing classifications and tags, it has been important for us to establish our own taxonomy that is aligned with Nesta’s mission areas — and that can be applied across different datasets to join them up.

Enrichment: going beyond the raw data

In addition to classification, we can also enrich the data with useful information that can help highlight particularly important signals. For example, by using a company’s financial data and its number of employees, we can determine an initial indication of whether this company is a viable investment opportunity for our investments team.
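
As a deliberately crude sketch of such an enrichment rule (the field names and thresholds are hypothetical):

```python
# Crude first-pass enrichment rule. Field names and thresholds are hypothetical
# and purely for illustration.
def flag_investment_candidate(company: dict) -> bool:
    """Give a rough indication of whether a company merits a closer look."""
    early_stage = company.get("employees", 0) <= 50
    recently_funded = company.get("last_round_gbp", 0) >= 500_000
    return early_stage and recently_funded


company = {"name": "ExampleCo", "employees": 12, "last_round_gbp": 750_000}
print(flag_investment_candidate(company))  # True
```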

By analysing the text content of the incoming signals in greater detail, we could get an indication of whether a company, research project or patent seems particularly novel and innovative — and thus deserves special attention to evaluate its potential impact.

In the future, we could also link different datasets, and, for example, keep track of a company’s venture funding, patenting activity, and news media profile.

Synthesis: from data to information

This step takes the classified and enriched signals as input and outputs user-friendly information to be shared via various channels. This might be a daily message following a specific template detailing new venture funding rounds, a weekly summary of developments across all three of Nesta’s mission areas, or an updated chart for a live trends report.
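
As a simple illustration, a templated daily message could be assembled along these lines; the field names and wording are hypothetical.

```python
# Templated daily update built from classified and enriched signals.
# Field names and wording are hypothetical.
from datetime import date

TEMPLATE = "• {name} ({mission}) raised £{amount:,}: {description}"


def daily_message(signals: list[dict]) -> str:
    lines = [f"New investment signals for {date.today():%d %B %Y}:"]
    lines += [TEMPLATE.format(**signal) for signal in signals]
    return "\n".join(lines)


print(daily_message([{
    "name": "ExampleCo",
    "mission": "sustainable future",
    "amount": 750_000,
    "description": "ground-source heat pump installer",
}]))
```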

As we increase the number of datasets we track, we expect the volume of daily or weekly incoming signals to become overwhelming to keep up with (anyone who’s subscribed to more than a few newsletters can attest to that!). Here, we expect to use generative AI and large language models (LLMs) to generate summaries and help us curate the new information more efficiently.

Dissemination

Finally, the new information needs to be shared with colleagues in our organisation — this is where the new information can be leveraged to create impact.

For the minimum viable product, we will primarily use Slack for delivering daily messages, as it is our main channel of communication internally. Going forward, we might also produce semi-automated trends reports (in the spirit of our Innovation Sweet Spots work) that are curated further by our foresight analysts.

Prototype Slack message alert of new investment signals
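
One common way to automate such alerts is Slack’s incoming webhooks. The snippet below is a generic illustration rather than our exact setup, and the webhook URL is a placeholder.

```python
# Post a message to a Slack channel via an incoming webhook.
# The webhook URL is a placeholder; a real one is issued by a Slack app.
import requests

SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder


def post_to_slack(message: str) -> None:
    response = requests.post(SLACK_WEBHOOK_URL, json={"text": message})
    response.raise_for_status()


post_to_slack("3 new investment signals found today - see the daily summary for details.")
```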

More generally, the most appropriate communication channel, format and frequency of updates will depend on the specific stakeholder and their use case — and we expect that getting this right will require some iteration.

Emerging tech stack for automated scanning

We are still at the beginning of building this pipeline, but some technologies have emerged as potentially useful.

For the orchestration of the entire pipeline we use Airflow. It helps to set up the scheduling of data collection, processing and dissemination, and ensures that each analysis step is executed in the correct order. Airflow’s visual interface aids in monitoring the workflows, making it easier to check for completion of analysis runs, inspect logs and identify bottlenecks or failures.
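
For illustration, a minimal Airflow DAG wiring the three stages together might look like this; the task callables and schedule are placeholders rather than our production setup.

```python
# Minimal Airflow DAG sketch: three tasks executed daily, in order.
# The callables and schedule are placeholders, not the production pipeline.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def collect(): ...
def process(): ...
def disseminate(): ...


with DAG(
    dag_id="horizon_scanning",
    start_date=datetime(2024, 5, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    t_collect = PythonOperator(task_id="collect", python_callable=collect)
    t_process = PythonOperator(task_id="process", python_callable=process)
    t_disseminate = PythonOperator(task_id="disseminate", python_callable=disseminate)

    t_collect >> t_process >> t_disseminate
```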

For the actual data ingestion and processing, we utilise Metaflow (where Metaflow scripts are triggered by the Airflow scheduler). Metaflow allows us to break down complex processing into smaller steps, easily manage dependencies for each step, and scale up the required compute resources using AWS Batch for more intensive steps.
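
A bare-bones Metaflow flow illustrating this step structure might look like the following; the step names and contents are illustrative only.

```python
# Bare-bones Metaflow flow: processing broken into discrete steps.
# Step names and contents are illustrative only.
from metaflow import FlowSpec, step


class ProcessSignalsFlow(FlowSpec):

    @step
    def start(self):
        # in practice, load the newly collected signals here
        self.signals = ["Example startup raises seed round for heat pumps"]
        self.next(self.classify)

    @step
    def classify(self):
        # tag each signal with mission areas (keyword matching or an ML model)
        self.labelled = [(s, ["sustainable_future"]) for s in self.signals]
        self.next(self.end)

    @step
    def end(self):
        print(f"Processed {len(self.labelled)} signals")


if __name__ == "__main__":
    ProcessSignalsFlow()
```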

When it comes to classification of the data, we’ll be experimenting with various methods, starting with training simple text classification models based on pre-trained text embeddings or fine-tuning transformer models. We have also found it useful to log our training experiments using the Weights & Biases platform.
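
As an example of the first approach, a simple embeddings-based classifier could be put together along these lines, assuming sentence-transformers and scikit-learn; the embedding model, example texts and labels are illustrative.

```python
# Simple text classifier on top of pre-trained embeddings.
# The embedding model, example texts and labels are illustrative.
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LogisticRegression

texts = ["Heat pump installer raises seed round", "New maths tutoring app for schools"]
labels = ["sustainable_future", "fairer_start"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = encoder.encode(texts)                    # pre-trained text embeddings
classifier = LogisticRegression().fit(embeddings, labels)

# Metrics from training runs like this can be tracked on Weights & Biases
# (wandb.init / wandb.log).
print(classifier.predict(encoder.encode(["Solar panel startup closes Series A"])))
```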

We’re also experimenting with using LLMs to label some of our training data automatically, and to synthesise incoming signals into summaries. Presently, we’re using OpenAI, but we are open to trying out other models.
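
As an illustration of the summarisation idea, a call to the OpenAI API might look like the following; the prompt, model name and example signal text are placeholders.

```python
# Summarise a batch of incoming signals with the OpenAI API (openai>=1.0 client).
# The prompt, model name and signal text are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

signals_text = (
    "- ExampleCo raised £750k for ground-source heat pump installation\n"
    "- AnotherCo raised £2m for school nutrition software"
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {"role": "system", "content": "Summarise these investment signals in two sentences."},
        {"role": "user", "content": signals_text},
    ],
)
print(response.choices[0].message.content)
```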

We will write more about the technical solutions in the future. We aim to work as much as possible in the open, and most of our data processing code will be available on GitHub.

Conclusion and lessons learned (so far)

We hope this has been a helpful summary of our automated horizon scanning efforts so far.

It’s important to carry out the development iteratively, with an agile mindset. Perfecting and expanding your automated data processing pipeline with more data sources will take time, but ideally you can start generating value with your minimum viable product as early as possible. For example, for the minimum viable product, it will be enough to tag incoming signals using keyword matching even if this approach is somewhat noisy. This will provide a baseline from which we can build more sophisticated signal classification algorithms using machine learning. Similarly, before producing detailed dashboards of signals and trends, we can start by providing simple daily alerts to our stakeholders about the new signals.

Through ongoing feedback from Nesta’s investments team, we will be able to continually assess whether the companies we identify are both mission-relevant and at a suitable investment stage, ensuring the team’s investment origination efforts are well targeted. Additionally, as new technologies and markets emerge within our mission areas, the investments team can draw on market feedback to help us expand the search, so that the outputs keep pace with technological development.

While automated collection, processing and synthesis of data opens the door to a more up-to-date and dynamic foresight process, it’s essential to recognise that human input remains crucial. Determining the most valuable information for our decision-makers and converting horizon scanning findings into actionable insights requires human expertise. The automated horizon scanning work therefore needs to be tightly integrated with the foresight team’s ways of working. We’ll be doing this, for example, by combining the signals from the automated process with manually sourced signals in one repository.

By automating labour-intensive tasks, we will be able to focus more manual effort on analytical and interpretive aspects rather than data collection and processing. This strategic approach will hopefully allow us to focus our time on deeper analysis of evolving trends and their implications for strategy.

We will post further updates as we go, so stay tuned and get in touch with the project team if you’re interested to know more.

We thank Solomon Yu, Will Woodward, Celia Hannon, Leo Chandler and Faizal Farook for their feedback on the article.
