ThousandEyes BGP Monitors

by Lefteris Manassakis, Software Engineering Technical Leader at Cisco ThousandEyes

At Cisco ThousandEyes, we are currently deploying a real-time BGP monitoring infrastructure developed originally by Code BGP, a company Cisco acquired in August 2023. In this post, we discuss the motivations for developing our own BGP monitoring infrastructure, outline the challenges encountered during the process, and explain the solutions implemented to address those challenges.

Why develop our own BGP monitoring infrastructure?

The reasons can be classified into three main categories: Availability, Quality/Integrity, and Security/Visibility.

Availability: RIPE RIS and RouteViews are the leading public BGP monitoring infrastructure projects. Network operators and researchers around the world depend on these resources to gain insights into the inter-domain routing system. Both projects offer invaluable datasets from hundreds of BGP peers, providing a window into the Internet’s evolution over the past two decades.

Relying exclusively on third-party sources for BGP data presents a risk, as these infrastructures may experience outages or periods of unavailability.

Quality/Integrity: In this context, ‘data quality and integrity’ refers to the accurate maintenance of routing state. This means correctly reporting whether each of the nearly 1.2 million IPv4 and IPv6 prefixes — currently announced by more than 75,000 Autonomous Systems — is present (or not) in the routing tables of each of our BGP monitors.

To collect as much data as possible, public infrastructures have historically maintained an open peering policy. This allowed any network to peer with them and contribute BGP data. As a result of this policy, RIPE RIS has more than 1,500 peers, and RouteViews has over 1,000.

‘Hobbynets’ — amateur networks connected to public sources due to their open peering policy — are more prone to reporting incorrect routing state due to misconfigurations. We reported on such an instance last year. These errors are particularly challenging to debug without access to the entire data pipeline.

Notably, RIPE has acknowledged challenges associated with its RIS open peering policy and over the last few years has switched to a selective approach for new peers.

Security/Visibility: Research has demonstrated that sophisticated hijackers can craft hijacks to avoid detection by public monitoring infrastructures. These attackers are capable of altering the routes they announce — employing well-known techniques such as AS-path prepending — to prevent their propagation to public route collectors. This issue extends beyond theoretical discussions and research. In fact, we detected such an event last year and reported on it in our blog. By maintaining our own monitoring infrastructure, we are able to cover “dark corners” of the Internet that are not visible to public sources, thus enabling the detection of such events of low visibility.

The ThousandEyes platform has utilized BGP datasets from both RIPE RIS and RouteViews for several years. To address the associated challenges, the team has developed an automated mechanism that tracks the health of public BGP monitors and Route Collectors. This mechanism employs a penalty algorithm that triggers corrective action when specific conditions are detected, such as missing updates. Once these conditions are identified, the automation disables the malfunctioning monitor, that is, it stops the monitor from providing data to the pipeline. We will continue maintaining and enhancing this logic for the ThousandEyes BGP monitors to ensure the quality and integrity of our data.
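To make the idea of a penalty algorithm concrete, here is a minimal sketch in Python. It is not the ThousandEyes implementation: the class name, the function name, and all three tuning constants are hypothetical, and the only condition it models is a missing update.

```python
from dataclasses import dataclass


@dataclass
class MonitorHealth:
    """Penalty state for one BGP monitor (hypothetical structure)."""

    penalty: int = 0
    enabled: bool = True


# Hypothetical tuning values, chosen only for illustration.
MISSED_UPDATE_PENALTY = 10  # added when an expected update is missing
HEALTHY_DECAY = 5           # subtracted after a healthy check
DISABLE_THRESHOLD = 30      # at or above this value the monitor is disabled


def record_check(state: MonitorHealth, updates_seen: bool) -> MonitorHealth:
    """Adjust the penalty after one health check and disable the monitor if needed."""
    if updates_seen:
        # Healthy checks let the penalty decay, but never below zero.
        state.penalty = max(0, state.penalty - HEALTHY_DECAY)
    else:
        state.penalty += MISSED_UPDATE_PENALTY
    if state.penalty >= DISABLE_THRESHOLD:
        # Disabling means the monitor stops providing data to the pipeline.
        state.enabled = False
    return state
```

In this sketch a monitor that misses three consecutive checks is disabled, while occasional misses decay away. Re-enabling a disabled monitor is deliberately left out, since the conditions for that are not described here.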

Desired characteristics of our BGP Monitoring System

The desired characteristics of our BGP monitoring infrastructure are:

1. Access to BGP Data from Hundreds of Sources
  • Geographical distribution of sources
  • Large number of upstream providers/upstream diversity

2. Near real-time functionality
  • Acceptable delay is within minutes

3. Ability to identify the source of each update
  • For each BGP update, we should be able to identify the router we got it from

Each of these requirements comes with its own set of challenges. In this post, we will discuss how we addressed these challenges and built the “ThousandEyes BGP Monitor” service.

Requirement #1: Access to BGP Data from Hundreds of Sources

One of the primary challenges at Code BGP involved establishing peering relationships with networks to acquire BGP data. Identifying peering providers, deploying routers in locations such as colocation facilities, and engaging in peering negotiations proved to be logistically impractical. To overcome this, we turned to Virtual Machine (VM) providers who not only offer VMs but also BGP sessions with full tables. I humorously refer to this service as “The Democratization of Peering.” Currently, there are over 180 providers offering such services, making the selection process a significant challenge in itself.

Criteria for Selecting BGP Providers:

Dual Stack: Support for both IPv4 and IPv6 is mandatory.

Full BGP Tables: Providers must supply full BGP tables for both IPv4 and IPv6.

Global Footprint: A wide selection of locations across various cities and countries is preferable.

Automation: The capability for users to subscribe, deploy VMs, and access services independently through a web interface is crucial, minimizing the need for manual intervention.

Upstream Diversity: It is important to assess the range of upstream providers. We give priority to providers with a diverse set of upstream connections.
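The criteria above can be read as a filter followed by a ranking. The sketch below is one possible way to express that in Python. The `Provider` fields and the `shortlist` function are hypothetical names, and treating automation as a hard requirement (rather than a strong preference) is an assumption made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    """Hypothetical description of a VM provider that offers BGP sessions."""

    name: str
    dual_stack: bool
    full_tables_v4: bool
    full_tables_v6: bool
    self_service: bool
    locations: frozenset
    upstream_asns: frozenset


def shortlist(providers):
    """Keep providers that meet the mandatory criteria, most diverse upstreams first."""
    eligible = [
        p
        for p in providers
        if p.dual_stack and p.full_tables_v4 and p.full_tables_v6 and p.self_service
    ]
    # Rank by number of distinct upstreams, then by number of locations.
    return sorted(
        eligible,
        key=lambda p: (len(p.upstream_asns), len(p.locations)),
        reverse=True,
    )
```

Counting distinct upstream ASNs is only a rough proxy for upstream diversity. A fuller assessment would also look at which upstreams they are and how they overlap with providers already selected.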

Requirement #2: Near real-time functionality

To achieve our objective of near real-time functionality, we decided to utilize BGP directly. The reader might wonder how BGP can meet this requirement, given its inherent limitations, such as those imposed by the Minimum Route Advertisement Interval (MRAI).

The MRAI timer, as defined in the main BGP specification (RFC 4271), is designed to rate-limit the frequency of routing updates. This timer is applied to both route announcements and withdrawals on a per-session basis. In Cisco implementations (e.g. IOS or IOS-XR), the default eBGP MRAI timer is set to 30 seconds, which is in line with the recommendation in RFC 4271. However, this default setting can significantly affect network convergence times. Its impact is particularly noticeable during route withdrawals, where the phenomenon known as “path hunting” occurs. During path hunting, BGP tries all available alternative paths one by one before concluding that a route is non-existent, which can result in convergence times of several minutes. Therefore, in this context, the opposite of the saying “bad news travels fast” is true. In BGP, good news (announcements) travels faster than bad news (withdrawals).

Last year, we presented at various network operator conferences (such as DKNOG, UKNOF and others), demonstrating that detecting hijacks is feasible not only within minutes but also within seconds. Given the MRAI, however, people asked us how this was even possible.

When a new prefix is announced in BGP, it propagates instantly via the best AS path and becomes visible across the globe in seconds. The MRAI timer is triggered after the first update for this prefix is received. This fact explains how we were able to achieve such fast detection. In most cases, though, detection within minutes is the norm and what we should expect.
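A small simulation can illustrate why the first announcement is not delayed. The Python sketch below models a single prefix on a single session with a 30-second timer. It is a simplified model under stated assumptions, not a description of any router's implementation: the function name is hypothetical, and changes that arrive while the timer is running are assumed to be held and sent as one update when it expires.

```python
def advertisement_times(change_times, mrai=30.0):
    """Return the times at which updates for one prefix are sent to one peer.

    change_times: times (in seconds) at which the best route changes.
    mrai: minimum interval between two updates for the prefix.
    """
    sent = []
    timer_expiry = float("-inf")  # no timer is running before the first update
    pending = False               # True if a change is waiting for the timer
    for t in sorted(change_times):
        if pending and t >= timer_expiry:
            # The timer expired earlier: the held change went out at expiry.
            sent.append(timer_expiry)
            timer_expiry += mrai
            pending = False
        if t >= timer_expiry:
            # No timer running: send immediately and start the timer.
            sent.append(t)
            timer_expiry = t + mrai
        else:
            # Timer running: hold the change until it expires.
            pending = True
    if pending:
        sent.append(timer_expiry)
    return sent
```

In this model a change at time 0 is sent at time 0, whereas follow-up changes at 5 and 10 seconds are both held until the timer expires at 30 seconds. That mirrors the observation above: a brand-new announcement is fast, and it is the subsequent churn that gets rate-limited.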

Requirement #3: Ability to identify the source of each update

Diagram 1: ThousandEyes Route Collectors

This requirement is critically important, as it allows us to answer questions such as, “Which monitors detected this hijack?” or “On which monitors is this prefix visible?”

The ThousandEyes platform currently operates across three regions: US1, EU1, and US2. Customers are distributed among these regions, each with its own unique BGP monitoring needs. To address this, we have decided to deploy sets of Route Collectors in each AWS region. These Route Collectors act as demarcation points for the monitors. The term “Route Collectors” essentially refers to routers that “speak” BGP and collect routes from the BGP monitors.

In an eBGP session, a router typically sets the ‘next hop’ attribute of a BGP route to its own address when advertising the route, as described in RFC 4271. Because we establish eBGP sessions between our monitors and Route Collectors, the routes we collect carry the IP address of the monitor that advertised them to the collector. The Adj-RIB-In of the Route Collectors is then transmitted to the rest of the pipeline using BMP (RFC 7854), which is considered the state of the art in BGP monitoring. This process provides us with a continuous stream of BGP updates.
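Once every update can be tied to a monitor, the question “On which monitors is this prefix visible?” becomes a simple lookup. The Python sketch below shows the idea. The record layout, the `MONITOR_BY_ADDRESS` table, the monitor names, and the function name are all hypothetical, and the addresses come from the documentation ranges. Withdrawals carry no next-hop attribute, so the sketch assumes the source address has already been attached to every record earlier in the pipeline.

```python
from collections import defaultdict

# Hypothetical mapping from a monitor's peering address to a monitor name.
MONITOR_BY_ADDRESS = {
    "192.0.2.1": "monitor-ams",
    "198.51.100.1": "monitor-nyc",
}


def build_visibility(records):
    """Build a prefix -> set of monitor names map from announce/withdraw records.

    Each record is a dict with the keys "source_ip", "action" and "prefix"
    (an assumed layout, not the actual message format of the pipeline).
    """
    visibility = defaultdict(set)
    for rec in records:
        monitor = MONITOR_BY_ADDRESS.get(rec["source_ip"])
        if monitor is None:
            continue  # unknown source; a real pipeline would log or alert here
        if rec["action"] == "announce":
            visibility[rec["prefix"]].add(monitor)
        elif rec["action"] == "withdraw":
            visibility[rec["prefix"]].discard(monitor)
    return visibility
```

Calling `build_visibility(records)["203.0.113.0/24"]` then returns the set of monitors that currently see that prefix, which is the kind of answer this requirement is meant to enable.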

Leveraging Open Source software

Cisco is actively engaged in supporting and sponsoring open source software, and we also utilize OSS throughout our data pipeline. As we already mentioned, our BGP monitors are Virtual Machines, and similarly, our Route Collectors are AWS EC2 instances running open source operating systems such as Ubuntu or Amazon Linux. Moreover, we employ GoBMP, a BMP collector implemented in Go, and publish the collected BGP data to Kafka topics.

As the open-source routing software for both our monitors and route collectors, we have chosen Bird 2 for several reasons:

  • Minimal Memory Footprint and Resource Consumption: We aim to ensure each VM operates efficiently with limited resources. Despite being configured to use only 1 CPU and 1 GB of RAM, Bird is capable of maintaining the entire Default-Free Zone and propagating BGP updates to dozens of route collectors without strain.
  • Mature and Stable Implementation: Bird has a solid reputation for reliability, having been in use for over a decade. It is the preferred route server software for most Internet Exchange Points (IXPs) worldwide.
  • BMP support: Since version 2.14, released in October 2023, Bird has supported both pre- and post-policy BMP monitoring and can run multiple BMP protocol instances. This has allowed us to use Bird as a route collector as well, since BMP is a core component of our architecture.

We have successfully deployed a large fleet of BGP monitors across the globe and we are currently connecting them in production.

Want to be a part of our team? ThousandEyes is hiring! Please see our Careers page for open roles.
