Data-driven innovation and technology intelligence

Matthias Plaue
MAPEGY Tech
17 min read · Sep 7, 2022
woman looking at a laptop in her hands standing next to a glass-walled room full of what appears to be computer racks
Photo by Christina @ wocintechchat.com on Unsplash

This article is an excerpt from the MAPEGY Whitepaper: “A Guide to Data-Driven Innovation and Technology Intelligence”.

Innovating successfully and making money from emerging technology is a complex and difficult feat. This is exemplified by the fact that only about 7–10% of all technology startups succeed [1]. Moreover, 88% of the Fortune 500 companies from 1955 no longer exist, many due to a failure to innovate [2]. Consequently, the ability to make informed strategic decisions in innovation and technology is mission-critical.

This is where MAPEGY comes into play. MAPEGY is a Berlin-based data intelligence company founded in 2012. Guided by our purpose to drive growth and sustainable development, our mission is to help organizations improve tomorrow with today’s data.

MAPEGY collects and provisions innovation and trend insights for decision-makers in the business, academic, and non-profit sectors. These insights are provided through MAPEGY.SCOUT, a powerful end-to-end data intelligence platform that taps into MAPEGY’s global innovation warehouse, the MAPEGY Innovation Graph.

Innovation and technology intelligence

“I’m an inventor. I became interested in long-term trends because an invention has to make sense in the world in which it is finished, not the world in which it is started.”

— Ray Kurzweil

Innovation and technology intelligence is about the collection and dissemination of information and knowledge that serves subject-matter experts and decision-makers. MAPEGY’s customers may be tasked with a variety of functions and responsibilities in research and development (R&D), engineering, design, open innovation, market intelligence, business development, business strategy, innovation management, or technology scouting and technology management.

What they all have in common: the need for up-to-date and relevant insights that cover the world of innovation and technology. We at MAPEGY strongly believe that gathering intelligence can be nothing else but data-driven. Trusted insights can only be based on hard facts and figures extracted from reliable data.

Thus, we can only succeed by making extensive use of the vast and powerful set of tools commonly known as data science: “the structured study of data for the purpose of producing knowledge” [3].

Every data science process starts with stating the problem: What are the questions that we need answered? The following set of questions already covers many information needs in technology management.

Technology

  • What are the components and sub-types of a technology, and what are the related technologies?
  • What are the technologies behind a given application or product?
  • What are the applications of a technology? Is there a high demand for such applications?
  • What is the public sentiment towards the technology?

Players

  • Who are the relevant organizations and experts that drive innovation in a given technology field?
  • How do they network and collaborate?
  • How strong, competent, and new are they? How are they positioned within the competitive landscape?

Trends

  • What emerging and disruptive topics or new frontiers should be invested in or closely monitored?
  • What is the potential of a given technology? What scenarios will happen in the future?

The data value chain

“Computers are useless. They can only give you answers.”

— Pablo Picasso (1968)

Answering key questions about technology, players (like organizations and experts), and trends during all stages of the life cycle of innovation and technology provides the foundation for successful innovation and technology intelligence.

To answer these questions in a data-driven manner, we need to extract information and knowledge from a variety of content sources, such as news articles, scientific publications, patents, and company data, to name a few. Innovation professionals ultimately combine this information and knowledge with the context at hand and their own personal experience to gain the wisdom needed to make the right decisions. One popular way to illustrate this refinement of raw data into actionable insight is the DIKW pyramid [4]:

pyramidal schema, showing from bottom to top: data, information, knowledge, wisdom
Longlivetheux, CC BY-SA 4.0 <https://creativecommons.org/licenses/by-sa/4.0>, via Wikimedia Commons

On a more detailed and technical level, the DIKW pyramid is realized as the data value chain. The data value chain consists of a series of processes and steps that capture raw data and turn them into information and knowledge that can be retrieved and displayed within the web application MAPEGY.SCOUT:

flow diagram with the following steps: data capture & acquisition, data integration & cleaning, data enrichment & representation, information retrieval & visualization
Image by author

The Innovation Graph

“It is a capital mistake to theorize before one has data.”

— Sherlock Holmes (Arthur Conan Doyle)

MAPEGY’s data warehouse stores raw data and information on various types of entities, such as organizations, experts, topics, and documents. These entities are connected via various kinds of relations, such as authorship or ownership of publications, collaboration between organizations and experts, or semantically related topics. Together, they span a vast network of information and knowledge — MAPEGY’s Innovation Graph.

As of 2022, MAPEGY’s data warehouse stores document data of the following volume:

  • News articles: 65,000,000, adding 230,000 per week from 6,300 curated feeds
  • Research publications: 74,000,000, adding 150,000 per week from 170,000 journals and 64,000 publishers
  • Patents: 47,000,000, adding 150,000 per week from 150 patent offices

Add to that 350,000 technical standards, 4,000,000 descriptions of organizations and experts, plus more than 100,000 descriptions of research projects funded by the European Commission or the United States government.

News articles

MAPEGY’s data team curates and maintains a selection of more than 6,000 news feeds highly relevant to the world of innovation and technology:

  • Technology and startup news like MIT Technology Review, The Verge, Techcrunch
  • Science news from outlets like Nature or aggregators like Phys.org, ScienceDaily
  • General business news like Forbes, Financial Times
  • Specific news alerts set up to target and monitor high-tech companies as well as the latest technological, economical, and societal trends
  • Key information sources that focus on a wide array of industries, innovative services, and manufactured products, from 3D printing to white goods

These sources are complemented by feeds set up by international news agencies and national news outlets, such as Reuters, BBC, CNN, Al Jazeera, and South China Morning Post.

Our system is engineered to retrieve the latest news from these feeds every day and to apply techniques from natural language processing to extract further information, like sentiment or mentions of high-tech companies, research institutions, or persons of public interest.

News articles provide relevant information during most of the innovation/technology life cycle. Early reports on promising technological developments point to innovation triggers. In high volume, technology forecasts and startup news as well as shifts in public sentiment point to the peak of public “hype” or “buzz”. Later stages can be dominated, for example, by market reports, product reviews, and general industry news.

Research publications and projects

Research publications provide the basis for MAPEGY’s scientific content. The scientific content that we use primarily comes from journal articles, conference proceedings, and text books from more than 60,000 publishers, like Elsevier and IEEE. We also integrate scientific content from preprint servers, like arXiv and medRxiv.

In addition to these scientific publications, research projects are another important source of information. Most often, these projects are collaborative efforts between private companies and academic institutions. Therefore, they provide insight into activities that result in knowledge transfer or technology transfer, which are strong driving forces of innovation. The volume with which those projects are funded also provides tangible evidence for a technology push driven by research policy.

Patents and technical standards

MAPEGY’s main source for patent data is a well-established, high-quality data product tailored to support statistical analyses. The dataset is provided by the European Patent Office and covers patent filings from all over the world. Patent data analysis and patent landscaping are established tools in technology intelligence research. IP strategies of organizations all over the world are reflected in patent data, offering insights into global technology trends and R&D activity.

Technical standards form an additional relevant body of work that describes requirements, procedures, methods, specifications, terminology, design, etc. that represent established practices.

Topics

As of 2022, more than 520,000 thesaurus topics, topic categories, and industrial sectors relevant to global technology, science, innovation, economics, sustainability, and societal change are stored in the MAPEGY Innovation Graph.

These topics and categories are interconnected as well as connected to all other types of information entities, like organizations, experts, and documents. They contribute to a vast knowledge graph that structures the data and imbues them with meaning, elevating them to information, insight, and knowledge.

The MAPEGY Thesaurus consists of more than 530,000 English-language topics relevant to the world of innovation and technology.

Around 1,200 of those topics have been carefully selected by our content curation team as trend topics that are currently of particular interest, impactful, and relevant for global innovation and R&D activity.

Organizations

While the vast majority of mission-critical information is buried in unstructured text data waiting to be extracted, at the core of global innovation activity lie the major players driving it. When it comes to organizations, we distinguish between companies and public institutions. Public institutions are universities and other research institutions, hospitals, and government bodies. Non-profit organizations are also included. As of 2022, MAPEGY’s data warehouse contains about 3 million companies and 2 million entities identified as public institutions.

Most of these organizations can be extracted from the “paper trail” that they leave as a consequence of their innovation activity: they appear publicly as applicants of patent documents or as bodies that authors of research publications are affiliated with.

Other sources for obtaining lists of organizations include public encyclopedias, like Wikipedia, and published listings, like university rankings. From these sources, we can extract additional relevant information, such as the estimated number of employees or the year that the organization was founded in.

About 800,000 of the companies in MAPEGY’s Innovation Graph have been tagged as startups. Startups are important drivers of global innovation, as entrepreneurs take significant risks to innovate and stay under high pressure to bring their ideas to market quickly. They form the spearhead of exploring which business models work and which do not.

To develop and establish their business, many startups rely on venture capital and other types of third-party investment. These investment data, which MAPEGY procures from a global leader in private-company data, provide yet another important source of information on innovation trends. Investment data allow us to identify the kinds of business models and ideas that investors find promising and are willing to support in order to make a profit.

One particular challenge that we face when integrating organizations is duplicate data. Merging multiple data sources inherently leads to data duplication issues; hence, record linkage is a mission-critical part of MAPEGY’s data engineering operations.

But even integrating just a single data source may already call for applying data deduplication algorithms. For example, the same organization may be represented differently with each patent document that we capture. More concretely, a given company’s name might contain typographic errors on some of their filed patents. The following table illustrates the challenge; all of the following data records have been identified by our system as representing the single entity Volkswagen.

table showing different instances of the entity “Volkswagen” that are characterized by varying metadata: typographic errors in the name, missing address, different company website etc.
Image by author
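To make the idea of name-based deduplication concrete, here is a minimal sketch that uses only the Python standard library. It is an illustration under simplifying assumptions, not the record linkage method used in production: the list of legal-form suffixes, the similarity threshold of 0.85, and the sample records are all made up for this example, and a real system would also compare addresses, websites, and other metadata.

```python
import re
from difflib import SequenceMatcher

# Hypothetical legal-form suffixes to ignore when comparing names.
LEGAL_FORMS = {"ag", "gmbh", "inc", "ltd", "corp", "co", "se"}

def normalize(name: str) -> str:
    """Lowercase, strip punctuation, and drop legal-form tokens."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    return " ".join(t for t in tokens if t not in LEGAL_FORMS)

def same_entity(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two names as one entity if their normalized forms are similar."""
    ratio = SequenceMatcher(None, normalize(a), normalize(b)).ratio()
    return ratio >= threshold

# Made-up records, including a typographic error and a formatting variant.
records = ["Volkswagen AG", "VOLKSWAGEN AG.", "Volkswagon AG", "Siemens AG"]
canonical = records[0]
matches = [r for r in records if same_entity(canonical, r)]
print(matches)
```

The threshold trades false merges against missed duplicates, so in practice it would have to be tuned on labeled examples.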

Analytics

“The question of whether machines can think is about as relevant as the question of whether a submarine can swim.”

— Edsger W. Dijkstra

While the collection and cleaning of data provides the foundation of data-driven intelligence, analytics lies at its core: computation of indicators, graph mining, ontology engineering, and natural language processing. All of these techniques and methodologies are leveraged to represent and enrich the data, turning them into actionable information and knowledge.

Once the right data sources have been selected and the data is acquired, cleaned, and integrated into MAPEGY’s data warehouse, it is time to generate insight from the data. One of the most important steps of that process is the creation of new connections in our Innovation Graph from information that is implicit in the data. For example, how do we know that two organizations collaborate? This information may not be explicitly included with the raw data but can be deduced, for example, from common patent applications or research projects.

Document co-occurrence

One way to add new connections to the Innovation Graph lies in determining and measuring document co-occurrence. For example, many patent filings by the company Volkswagen or news articles mentioning that company will also contain terms from our thesaurus, like motor vehicle or motor engine, or be assigned to a topic category, like Automotive engineering. Such a frequent co-occurrence of Volkswagen documents with said topics or topic categories signifies a strong relationship, and thus another edge within the Innovation Graph network is added.

One particularly powerful and widely used indicator to measure the strength of such a relationship is given by the so-called pointwise mutual information, or PMI for short. The PMI measures how much more frequently a term, for example motor vehicle, appears among documents associated with Volkswagen than it does across all documents.
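In its standard form, the PMI of two events x and y is the logarithm of p(x, y) / (p(x) · p(y)), where the probabilities can be estimated from document counts. The following sketch computes it that way; the function name, the choice of the base-2 logarithm, and all counts are assumptions made for illustration only.

```python
import math

def pmi(n_xy: int, n_x: int, n_y: int, n_total: int) -> float:
    """Pointwise mutual information estimated from document counts.

    n_xy: documents associated with both x and y
    n_x, n_y: documents associated with x, respectively y
    n_total: all documents in the collection
    """
    p_xy = n_xy / n_total
    p_x = n_x / n_total
    p_y = n_y / n_total
    return math.log2(p_xy / (p_x * p_y))

# Made-up counts: 1,000,000 documents overall, 2,000 of them associated with
# a company, 10,000 mentioning a term, and 400 associated with both.
score = pmi(n_xy=400, n_x=2_000, n_y=10_000, n_total=1_000_000)
print(round(score, 2))
```

A positive value means that the term and the company co-occur more often than would be expected if they were independent; a value near zero indicates no particular association.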

Document co-occurrence is also instrumental in the identification of collaborations. For example, companies with common patent applications or individuals with commonly authored research publications are identified as partners.
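A simple way to picture this is to count, for every pair of organizations, the number of documents on which both appear. The sketch below does exactly that; the organization names and documents are hypothetical, and it deliberately ignores the entity resolution that has to happen before names can be compared.

```python
from collections import Counter
from itertools import combinations

# Hypothetical documents, each listing the organizations named as applicants.
patents = [
    {"Org A", "Org B"},
    {"Org A", "Org B", "Org C"},
    {"Org C"},
    {"Org B", "Org C"},
]

def collaboration_counts(documents):
    """Count, for every pair of organizations, the documents they share."""
    counts = Counter()
    for orgs in documents:
        # Sorting gives each pair a canonical order, so (A, B) == (B, A).
        for pair in combinations(sorted(orgs), 2):
            counts[pair] += 1
    return counts

edges = collaboration_counts(patents)
print(edges.most_common())
```

Each pair with a nonzero count becomes a candidate collaboration edge, and the count itself can serve as a rough edge weight.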

Personalized PageRank

The PageRank link analysis algorithm is one of the earliest established ingredients of Google’s search engine [5] and presumably one of the reasons for its early and ongoing success. The algorithm can be used to measure the centrality of vertices in a directed graph, such as a hyperlink network. At MAPEGY, we use a variant called personalized PageRank in order to supply the Innovation Graph with additional connections that represent an indirect relationship or proximity between topics/vertices.

The idea is as follows: Two topics, for example, wind turbine and solar fuel, might be related but not necessarily directly related. It might be that they are “siblings” by way of having a common umbrella term — renewable energy, say.

You can imagine these connections as a network where the nodes/vertices represent the topics, and connections/edges represent a direct semantic relationship. Since the MAPEGY Thesaurus has many connections (about 850,000 overall), the actual situation is much more complex than the simple example above illustrates. In fact, any two topics/nodes can be linked by many more alternative paths through the network.

The personalized PageRank takes this complexity into account and produces a robust measure for the proximity between two nodes in the network. Accordingly, two nodes that are in very close proximity of each other within the network lead to a high value for the personalized PageRank, while two nodes that can only be joined by long paths through the network will share a low value.

The following synthetic example serves as an illustration of the concept. The term electric vehicle is the query node, and darker colors correspond to higher values for the personalized PageRank.

network of technology topics colored in red, topics further away from “electric car” have lower color saturation
Image by author
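For readers who want to experiment, the following is a minimal power-iteration sketch of personalized PageRank in plain Python. It is a textbook formulation rather than our production code: the damping factor of 0.85, the number of iterations, and the toy graph (including the nodes energy storage and battery) are assumptions for illustration, and the code assumes that every node has at least one neighbor.

```python
def personalized_pagerank(graph, query, damping=0.85, iterations=100):
    """Power iteration for PageRank with all restarts going to `query`.

    graph: dict mapping each node to a non-empty list of its neighbors
    """
    rank = {node: 0.0 for node in graph}
    rank[query] = 1.0
    for _ in range(iterations):
        new_rank = {node: 0.0 for node in graph}
        for node, neighbors in graph.items():
            # Each node passes a damped, equal share of its rank to its neighbors.
            share = damping * rank[node] / len(neighbors)
            for neighbor in neighbors:
                new_rank[neighbor] += share
        # The remaining probability mass "teleports" back to the query node.
        new_rank[query] += 1.0 - damping
        rank = new_rank
    return rank

# Toy undirected topic graph, stored as adjacency lists.
graph = {
    "wind turbine": ["renewable energy"],
    "solar fuel": ["renewable energy"],
    "renewable energy": ["wind turbine", "solar fuel", "energy storage"],
    "energy storage": ["renewable energy", "battery"],
    "battery": ["energy storage"],
}

ranks = personalized_pagerank(graph, query="wind turbine")
for topic, value in sorted(ranks.items(), key=lambda item: -item[1]):
    print(f"{topic}: {value:.3f}")
```

In this toy graph, the sibling topic solar fuel receives a higher score than the more distant topic battery, which matches the intuition described above.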

Data visualization

“If only we could pull out our brain and use only our eyes.”

— Pablo Picasso

A long section of the data value chain covers the implementation of algorithms and deployment of information technology for the efficient and effective processing of data by computers. However, results of data analyses are ultimately inspected and interpreted by human subject-matter experts and decision makers.

Therefore, at MAPEGY, we find it to be an equally important and challenging task to present the insights extracted from the Innovation Graph in ways that are not only efficient and effective but also intuitive, aesthetically pleasing, and simply enjoyable. We want our data product, MAPEGY.SCOUT, to satisfy strategic information needs by literally painting the picture and spurring the users’ curiosity for exploration of the world of innovation and technology.

Topic maps

Probably everybody is familiar with how a geographical map works: the farther away two points are from each other on the map, the farther away the two points are geographically in real life. A topic map represents not organizations but topics as points in the plane and also uses the distance between those points as the main visual cue for inference. However, in this case, distance on the map signifies semantic (dis-)similarity instead of geographical relation: the closer two topics are on the map, the more they have in common in terms of underlying technologies, potential applications, etc.

For illustration, the following figure shows an example of a topic map that displays various fields of emerging technologies. Additionally, topics are grouped into topic clusters, indicated by colored highlights. Topics colored gray were not assigned to any particular cluster by the system; they represent “in-betweens”.

data visualization of emerging technologies, represented as points in the plane; similar technologies are grouped together
Image by author (cropped)

For the proper interpretation and visual analytics of the topic map, some effort has to be made by the user, injecting domain knowledge to identify the clusters’ meaning. In the case of the emerging technology map, the user can quite easily see that the system divides those technologies into the following clusters:

  • Aerospace: micro air vehicle, interstellar travel, asteroid mining, etc.
  • Transport, smart devices: Hyperloop, autonomous cars, internet of things, artificial intelligence, etc.
  • Displays: head-mounted display, and others
  • Human-machine interaction, robotics: virtual reality, brain-computer interface, etc.
  • Medical technology, biotech: tissue engineering, cryonics, gene editing, etc.
  • Nanotech: nanomaterials, nanoelectronics, etc.
  • Materials: space elevator, and others
  • Quantum tech, optics: quantum computing, holography, etc.

The computation of a semantic map for a selection of topics consists of the application of the following three unsupervised machine learning techniques:

  • Word embedding. During pre-processing, every topic has been assigned a high-dimensional vector.
  • Cluster analysis. At query time, Hierarchical Density-Based Spatial Clustering of Applications with Noise, or HDBSCAN [6] for short, is used to identify groups of similar topics based on the embedded vectors and their cosine similarity.
  • Dimensionality reduction. Finally, the Uniform Manifold Approximation and Projection algorithm, or UMAP [7] for short, is applied to determine the position of the topics on the map. UMAP is a non-linear dimensionality reduction technique: if we imagine the embedded vectors as a cloud of abstract points in a high-dimensional space, UMAP attempts to map those points onto a two-dimensional space while deforming the point cloud as little as possible, preserving features such as local distance.
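The clustering step compares topics through the cosine similarity of their embedded vectors. The sketch below shows only that similarity measure, not HDBSCAN or UMAP themselves, for which dedicated libraries exist. The three-dimensional vectors are made up for illustration; real embeddings have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up embeddings for three topics.
embeddings = {
    "quantum computing": [0.9, 0.1, 0.0],
    "holography": [0.8, 0.3, 0.1],
    "gene editing": [0.0, 0.2, 0.9],
}

def most_similar(topic):
    """Return the other topic whose vector points in the most similar direction."""
    others = (t for t in embeddings if t != topic)
    return max(
        others,
        key=lambda t: cosine_similarity(embeddings[topic], embeddings[t]),
    )

print(most_similar("quantum computing"))
```

Because cosine similarity depends only on the direction of the vectors, not on their length, two topics count as similar when their embeddings point the same way, regardless of magnitude.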

Collaboration networks

The following figure shows part of a hierarchical edge bundling chart [8] for the network of companies in the quantum computing field. Every path drawn across the circle that connects two organizations indicates a partnership, or collaboration. This collaboration is inferred from shared patents, joint research projects, and news articles that imply a partnership.

hierarchical edge bundling visualization, showing companies cooperating in the field of quantum computing grouped by region and country
Image by author (cropped)

One goal of this representation is to identify well-networked players within the ecosystem. Since collaboration networks can easily reach a complexity that makes visual analytics difficult, this task is simplified within MAPEGY.SCOUT by way of interaction. For example, hovering over a specific organization will highlight its partners.
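The data structure behind such an interaction can be pictured as an index from each organization to its partners. The following sketch builds that index from a list of collaboration edges and ranks organizations by their number of partners; the organization names and edges are hypothetical, and the code says nothing about how MAPEGY.SCOUT implements the feature.

```python
from collections import defaultdict

# Hypothetical collaboration edges between organizations.
edges = [
    ("Org A", "Org B"),
    ("Org A", "Org C"),
    ("Org A", "Org D"),
    ("Org B", "Org C"),
]

def build_partner_index(edge_list):
    """Map every organization to the set of its direct partners."""
    partners = defaultdict(set)
    for a, b in edge_list:
        partners[a].add(b)
        partners[b].add(a)
    return partners

partners = build_partner_index(edges)

# What a hover interaction would highlight for one organization.
print(sorted(partners["Org B"]))

# Rank organizations by number of partners to find well-networked players.
ranking = sorted(partners, key=lambda org: len(partners[org]), reverse=True)
print(ranking[0])
```

Counting partners is the simplest notion of being well-networked; weighting edges by the number of shared documents would be a natural refinement.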

Although the circular layout combined with edge bundling might lack some features compared to other layouts (e.g., force-directed layouts [9]), such as a clearer depiction of communities or hubs, the technique enables us to visualize arbitrarily dense networks without node or label clutter.

Additionally, the chart allows for grouping the displayed organizations, either by industrial sector or region/country. This allows for a more global view, identifying cross-sector and international relations.

Case studies

“Everybody wants to be the next Apple, Google or Netflix, nobody wants to be Kodak, Blockbuster or US Steel.”

— Greg Satell, Forbes

MAPEGY.SCOUT makes it easier for R&D and innovation teams to scan their environment and identify opportunities (white spots for innovation) and risks. MAPEGY provides powerful and easy-to-use data-driven solutions that empower organizations to be proactive and make better decisions faster.

Startup scouting at REWE

More firms today are dealing with constantly unstable and disrupted markets as a result of massive changes in consumer behavior, technology, regulation, and demographics. These companies are looking for entrepreneurs and startups to help them find new ideas and opportunities in order to spur innovation. More than ever, innovation is a vital component in driving corporate growth and retaining market shares. By investing, merging, or partnering with other organizations, businesses can quickly expand their innovation capabilities.

Every month, hundreds of companies contact REWE through a variety of access channels to collaborate on innovation projects. REWE employs a vetting process to determine whether to work with such organizations. Early in this process, MAPEGY’s startup identification functionalities are used to gain a better understanding of early-stage startup signals, correlate them to trend landscapes, and, more often than not, pilot them.

Using MAPEGY.SCOUT, REWE is able to undertake a deep dive and analyze the current developments in surrounding technologies, but also identify the digital footprint of startups. Organizations utilize SCOUT’s startup identification capabilities to acquire a complete list of startups from across the globe in seconds, discover how they are linked to particular trends, and gain a holistic view beyond startup perspectives by linking the patterns between them. MAPEGY has become a critical aspect in the early stages of the vetting cycle, which simplifies the innovation process of many companies like REWE by lowering the time required to qualify startups for further consideration regarding innovation projects and investments.

Startup ranking, from top to bottom: Door Dash, Product Hunt, Caper, Techwisely, Mashgin
Image by author

Portfolio analysis at in-manas

In-manas is a company that specializes in building and selling intelligent software solutions that help companies, clusters, and consultants to automate their strategy and innovation work. They use the SCOUT Matrix to perform trend analysis that helps them forecast future opportunities and risks. By using this feature, in-manas is able to get insights from a vast amount of data that they could never analyze on their own, enabling them to broaden the scope of their information gathering, obtain key indicators for any topic of interest, and increase their ability to assist customers in making informed decisions.

Using the custom trend analysis provided by MAPEGY, in-manas is developing an excellent innovation management toolbox. It enables them to assess the environment, do research on important key players, identify white or blind spots, and anticipate future plans. Data-based systematic observations are enormously valuable for decision making. They augment human intelligence and are a great tool for workshops and brainstorming sessions.

By providing a bird’s-eye view of diverse company ecosystems, our matrix aids in the development of an effective top-down approach. Hands down, it is a great starting tool for any innovation project looking to analyze numerous external elements and develop superior decision-making solutions.

heatmap/matrix showing association of European energy companies with energy sources
Image by author

Acknowledgement

The Whitepaper that this article is based on was co-authored by MAPEGY’s product owner Ainhoa, in particular the section “Case studies”. Special thanks go out to data quality engineer Michael for proofreading.

References

[0] MAPEGY GmbH — Innovative Solutions for Decision-Making. www.mapegy.com

[1] E. Kodra, O. Zik, and C. Hartshorn. Measuring and quantifying success in innovation: Lessons learned from a decade of profiling emerging technology start-ups. 2015.

[2] Frances Goh. 10 Companies that Failed to Innovate, Resulting in Business Failure.

[3] Lourdes S. Martinez. “Data Science”. In: Encyclopedia of Big Data. Ed. by Laurie A. Schintler and Connie L. McNeely. Cham: Springer International Publishing, 2022, pp. 328–331. ISBN: 978-3-319-32010-6.

[4] Jennifer Rowley. “The wisdom hierarchy: representations of the DIKW hierarchy”. In: Journal of Information Science 33.2 (Feb. 2007), pp. 163–180.

[5] Sergey Brin and Lawrence Page. “The anatomy of a large-scale hypertextual Web search engine”. In: Computer Networks and ISDN Systems 30.1–7 (Apr. 1998), pp. 107–117.

[6] Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander. “Density-Based Clustering Based on Hierarchical Density Estimates”. In: Advances in Knowledge Discovery and Data Mining. Springer Berlin Heidelberg, 2013, pp. 160–172.

[7] Leland McInnes, John Healy, and James Melville. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. Sept. 2020

[8] D. Holten. “Hierarchical Edge Bundles: Visualization of Adjacency Relations in Hierarchical Data”. In: IEEE Transactions on Visualization and Computer Graphics 12.5 (Sept. 2006), pp. 741–748.

[9] Daniel Tunkelang. “A Numerical Optimization Approach to General Graph Drawing”. PhD thesis. Carnegie Mellon University, Jan. 1999.

Matthias Plaue
MAPEGY Tech

Math professor, data scientist. Author of text books on applied math and data science.