Large Scale Data Ingestion — Challenges and Opportunities

Praveen Jain · Published in Compass True North · Jul 14, 2022

Authors: Jiazhi Ou, Praveen Jain, Chang Hu, Zichong Luo, Asmita Kulkarni, Abhijit Sadhu

Compass is working on revolutionizing the real estate industry with pioneering innovations. Our strategy is to replace today’s complex, paper-driven, antiquated workflows with a seamless, all-digital, end-to-end platform that empowers real estate agents to deliver an exceptional experience to every buyer and seller. Real estate agents are the primary users of our platform and our goal is to provide efficient solutions that enable them to complete their transactions end-to-end using the Compass platform.

Real estate transactions are based on information about a listing with associated details about its location, attributes, photographs, connected agents, etc. Agents, being professional users of this data for their day-to-day business operations, lose trust if they cannot reliably find the most up-to-date and accurate information on the Compass platform.

An MLS (Multiple Listing Service) is a database of records established by a group of agents operating in a region. An MLS allows agents to share listing data and create a marketplace, which increases the overall reach of their listings. In the United States alone, there are more than 600 small and large MLS systems operating in different regions.

Ingestion is the component that feeds all the data into the Compass platform in near real time and enables all the downstream systems, such as Agent Search, Listing Search, and Marketing Center. Being the first process in the listing data pipeline, Ingestion plays a critical role in building and maintaining trust in listing data on the Compass platform.

Challenges with MLS Systems:

  1. Different Protocols: MLS systems operate on different protocols: some are based on the older RETS standard (Real Estate Transaction Standard), while others have moved to the newer RESO Web API standard (Real Estate Standards Organization). Interfacing with these different technology stacks, each with its own intricacies, is a significant challenge. Some popular MLS data distributors are MLS Grid, CoreLogic, FBS, and Bridge Interactive.
  2. Only Pull Model Supported: MLS data providers do not send any notification when data is added or changed. Ingestion needs to poll regularly for changes and download them in a timely manner, so the freshness of the data is largely limited by the polling frequency the MLS data providers can support. Data quality is also limited by how accurately changes can be detected on the provider side. A minimal polling sketch appears after this list.
  3. Naming Conventions: Different MLS feeds have different field naming conventions (schemas), which have been in use for a long time. Onboarding a new feed requires us to understand and map its various attributes and ensure the mapping is error-free.
  4. Availability and Scale Challenges: Most MLS feeds are unable to support large-scale data pull operations. At times, some MLS systems face downtime and do not provide any guaranteed SLA. Problems such as downtime, API throttling, and connection limits are observed from time to time.
  5. Data Consistency: In multiple cases, we have observed that an MLS does not notify its clients when data is removed from the feed, which causes stale data problems. Ingestion has to compare the whole data set against the MLS feed to detect such changes.
  6. Compliance: MLSs govern how their data can be used. For example, certain listing types cannot be consumer-searchable, and certain listings cannot be syndicated to clients depending on the listing status or point of origin (e.g., Coming Soon listings, Data Share listings).
  7. Unplanned Changes by MLS: Some MLSs occasionally change their schema or data format without any advance communication. The ingestion pipeline needs to be intelligent enough to detect such changes and raise alerts. Additionally, an MLS can change credentials, session limits, or IP whitelisting rules; these changes need to be handled gracefully to minimize any data flow disruption.
  8. Bulk Updates: MLSs conduct regular data maintenance, causing a large volume of records to be updated within a small timespan. The pipeline needs to absorb these bursts without choking downstream systems, while ensuring we don't sacrifice uptime or latency.
  9. Timestamp Challenges: Some MLS tables don't carry update timestamps. In these cases, instead of pulling incremental data, we are required to pull the complete data set every few minutes, putting potentially unnecessary load on the ingestion pipeline (see the fallback in the sketch below).
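
To make the pull model (challenge 2) and the missing-timestamp fallback (challenge 9) concrete, here is a minimal polling sketch in Python. Everything in it is an illustrative assumption rather than Compass internals: the `FeedClient` interface, the `enqueue` hand-off, and field names such as `ListingKey` are hypothetical stand-ins for whatever a real RETS/RESO client exposes.

```python
import hashlib
import json
import time
from datetime import datetime, timedelta, timezone

# Hypothetical client interface: any RETS/RESO wrapper exposing
# these two calls would fit; this is not a real Compass API.
class FeedClient:
    def fetch_updated_since(self, resource: str, since: datetime) -> list[dict]:
        """Return records modified at or after `since`."""
        raise NotImplementedError

    def fetch_all(self, resource: str) -> list[dict]:
        """Full table pull, for resources lacking an update timestamp."""
        raise NotImplementedError

def enqueue(record: dict) -> None:
    # Stand-in for publishing the record onto the ingestion queue.
    print("queued", record.get("ListingKey"))

def poll_incremental(client: FeedClient, resource: str,
                     interval_s: int = 300, overlap_s: int = 60) -> None:
    """Poll for changes, overlapping windows slightly so clock skew
    between us and the MLS cannot drop records at window edges."""
    watermark = datetime.now(timezone.utc) - timedelta(seconds=interval_s)
    while True:
        since = watermark - timedelta(seconds=overlap_s)
        for record in client.fetch_updated_since(resource, since):
            enqueue(record)
        watermark = datetime.now(timezone.utc)
        time.sleep(interval_s)

def poll_full_resync(client: FeedClient, resource: str,
                     interval_s: int = 300) -> None:
    """Fallback for tables with no update timestamp (challenge 9):
    pull everything, but forward only records whose content changed."""
    seen: dict[str, str] = {}
    while True:
        for record in client.fetch_all(resource):
            digest = hashlib.sha256(
                json.dumps(record, sort_keys=True, default=str).encode()
            ).hexdigest()
            if seen.get(record["ListingKey"]) != digest:
                seen[record["ListingKey"]] = digest
                enqueue(record)
        time.sleep(interval_s)
```

The overlap window in `poll_incremental` is one simple way to tolerate clock skew between the scheduler and the MLS; the content digest in `poll_full_resync` keeps a full re-pull from flooding downstream systems with unchanged records.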

Requirements for the Ingestion pipeline:

  • Data Parity: Compass currently works with hundreds of MLS systems across many regions of the United States. The basic requirement of the ingestion pipeline is to ensure that the data present in Compass systems (about Listings, Open Houses, Agents, Media, etc.) matches what is present in the MLS.
  • Minimize Latency: The ingestion pipeline needs to ensure that every update in MLS related to a listing or agent is available in near real time to the downstream systems.
  • Feed Onboarding: The ingestion pipeline needs to provide a reliable and fault-tolerant mechanism to onboard a new feed quickly and easily, reducing the overall time and effort required. This allows us to rapidly expand our regional coverage.
  • Data Backfill: The ingestion pipeline needs to provide an ability to backfill old data in a given time range for any specific use cases such as bootstrapping a new feed.
  • Bulk Updates: The ingestion pipeline should be resilient to a sudden inflow of a large volume of listings changed by bulk updates, and it needs to isolate such updates so they do not impact normal pipeline operations on other feeds.
  • Auditing and Analytics: The ingestion pipeline needs to provide a capability to audit the data and ensure that data integrity is maintained across the systems. Also, metrics should be available to analyze the pipeline operations to identify and alert on trends.
  • Special Listings: The ingestion pipeline needs the ability to receive in-house listings, such as Compass Coming Soon and Private/Office Exclusive listings.
  • Detecting Stale/Deleted Listings: Data quality is the primary goal, and we need to ensure that listings are reflected in their correct state. The pipeline needs to detect listings that are stale or have been deleted from the source system and update their status accordingly (a key-set comparison sketch follows this list).
  • Detection of Data Quality Issues: We need the ability to detect and fix data quality issues such as missing listings, attribute mismatches between source and destination, and fields that are available on the MLS but missing from our feeds.
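
As a hedged illustration of the parity and stale-detection requirements above, the sketch below compares the set of active listing keys reported by an MLS against the keys held in our mirror. The function name `audit_feed` and the key values are invented for the example; the production audit works over real feed snapshots.

```python
def audit_feed(mls_keys: set[str],
               mirror_keys: set[str]) -> tuple[set[str], set[str]]:
    """Return (missing, stale): listings to re-ingest and listings to retire."""
    missing = mls_keys - mirror_keys   # present at the source, absent in our mirror
    stale = mirror_keys - mls_keys     # in our mirror, silently removed at the source
    return missing, stale

missing, stale = audit_feed(
    mls_keys={"A100", "A101", "A102"},
    mirror_keys={"A100", "A102", "A099"},
)
print(missing)  # {'A101'} -> schedule a re-ingest
print(stale)    # {'A099'} -> mark inactive in downstream systems
```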

Key Customer Metrics

  1. Latency: Any updated listing should ideally be available to the end user within ‘x’ minutes at the 99th percentile. Alerts are emitted for any feed that violates this SLA (a percentile-check sketch follows this list).
  2. Data Coverage: The ingestion pipeline should periodically check its own coverage. Alerts are emitted for any coverage issues, and the system should try to re-ingest any listings that are missing from the mirrored dataset.
  3. Completeness: Every field of a listing needs to be present and understood by the Compass system. Alerts are raised when a listing does not adhere to the contract expected by the downstream systems.
  4. Accuracy: This metric ensures that all the pieces of information about a Listing and associated agent are correct.
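
A small sketch of how the latency SLA could be checked, under assumed inputs: per-listing propagation delays for one feed, a placeholder 15-minute threshold, and a `print` standing in for a real alerting hook.

```python
import math

def p99(samples: list[float]) -> float:
    """Nearest-rank 99th percentile of the latency samples."""
    ranked = sorted(samples)
    rank = math.ceil(0.99 * len(ranked)) - 1
    return ranked[rank]

def check_latency_sla(feed: str, delays_min: list[float],
                      sla_min: float = 15.0) -> None:
    observed = p99(delays_min)
    if observed > sla_min:
        # Placeholder for a real alerting hook (pager, Slack, etc.).
        print(f"ALERT: {feed} P99 latency {observed:.1f}m exceeds {sla_min}m SLA")

check_latency_sla("example_mls", [2.0, 3.5, 4.1, 30.0, 2.2])
```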

Data Ingestion Pipeline Architecture:


Ingestion at Compass is a complex distributed system built on messaging and multiple sub-components. Each stage of the pipeline consumes a message from a queue, augments and processes it, and publishes it back into the pipeline for subsequent processing (as sketched below). This loosely coupled architecture gives us a cleaner separation of responsibilities across components, which aids debugging and observability of the ingestion system.
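
The pattern reduces to a tiny sketch: each stage consumes a message, enriches it, and republishes it for the next stage. The code below uses Python's in-process `queue.Queue` and an invented `attach_agent` enrichment purely for illustration; the production pipeline sits on a durable message broker.

```python
import queue
from typing import Callable, Optional

def run_stage(inbound: queue.Queue, outbound: queue.Queue,
              augment: Callable[[dict], dict]) -> None:
    """Generic pipeline stage: consume, enrich, republish."""
    while True:
        message: Optional[dict] = inbound.get()
        if message is None:          # sentinel: propagate shutdown downstream
            outbound.put(None)
            break
        outbound.put(augment(message))

# Invented joiner-like enrichment: attach agent data to a raw listing.
def attach_agent(msg: dict) -> dict:
    msg["agent"] = {"id": msg.get("ListAgentKey")}
    return msg

raw: queue.Queue = queue.Queue()
joined: queue.Queue = queue.Queue()
raw.put({"ListingKey": "A100", "ListAgentKey": "AG7"})
raw.put(None)                        # end of stream
run_stage(raw, joined, attach_agent)
print(joined.get())                  # enriched message for the next stage
```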

Core Pipeline Components:

  1. Gateway Service: The Gateway service is the primary component that interacts directly with the MLS to download all required data. It encapsulates the MLS-specific API details, connection mechanisms, and the limits imposed by each MLS.
  2. Mirror: This component queries the Gateway based on time range requirements and is agnostic to MLS-specific attributes. The Mirror downloads and stores the raw data in a dedicated data store and has separate pipelines for processing Property, Media, and Agent data.
  3. Dynamic Config Store: This is a generic configuration store holding all the settings for a specific feed, such as the query interval window, underlying protocol, mode of operation, timeouts, and retry info (an illustrative config record follows this list). It allows an on-call engineer to make config changes on the fly and have them applied to the running pipeline without the overhead of a deployment.
  4. Scheduler: A component that manages the query frequency and query time range used to download updated data from an MLS feed. Effective scheduling is achieved by downloading incremental data and passing it to downstream components through async communication mechanisms. The Scheduler also runs regular backfill jobs to cover any data loss caused by time zone discrepancies or temporary intra-day outages.
  5. Joiner: Joiner is driven by changes detected from different data pipelines (e.g. Property, Agent or Open houses). It reads all the associated raw data from local databases, joins them into a single raw listing object and transfers it to the downstream Converter Service.
  6. Converter: The conversion component transforms a raw listing, with the associated agent and office information, into a format that is well understood by Compass's internal downstream systems (a field-mapping sketch follows this list).
  7. Metadata Store: The metadata store holds associated information about the various jobs that regularly fetch data from the feeds, both incremental and backfill. It effectively keeps track of the jobs running in the system and ensures they run reliably.
  8. Stale Checker: The data ingestion pipeline has a robust stale-listing detection system that uses MLS-specific logic to detect entries that have been removed or rented at the source and thus need to be marked appropriately in all downstream systems.
  9. Media Pipeline: Ingesting media such as images, videos, and documents related to a listing is an essential part of the data pipeline; it happens asynchronously relative to the core listing data to avoid impacting the main pipeline.
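
To ground the Dynamic Config Store (component 3), here is one way a per-feed configuration record could look. The field names are assumptions chosen to match the settings mentioned above, not the actual production schema.

```python
from dataclasses import dataclass

@dataclass
class FeedConfig:
    feed_id: str
    protocol: str            # "RETS" or "RESO"
    mode: str                # e.g. "incremental" or "full_resync"
    query_interval_s: int    # how often the Scheduler polls this feed
    query_window_s: int      # time range covered by each incremental query
    request_timeout_s: int
    max_retries: int

# An on-call engineer could flip these values at runtime; the running
# pipeline re-reads the store instead of waiting for a redeployment.
example = FeedConfig(
    feed_id="example_mls",
    protocol="RESO",
    mode="incremental",
    query_interval_s=300,
    query_window_s=360,
    request_timeout_s=30,
    max_retries=3,
)
```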
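
And to make the Converter (component 6) concrete: conversion is essentially a per-feed mapping from MLS-native field names (challenge 3) to one canonical schema. The mappings and field names below are invented examples of the kind of table produced during feed onboarding.

```python
# Hypothetical per-feed field mappings: MLS-native name -> canonical name.
FIELD_MAPS: dict[str, dict[str, str]] = {
    "feed_a": {"ListPrice": "price", "BedroomsTotal": "bedrooms"},
    "feed_b": {"LP_Amount": "price", "No_Bedrooms": "bedrooms"},
}

def convert(feed_id: str, raw: dict) -> dict:
    """Translate a raw MLS record into the canonical listing shape,
    keeping track of source fields we do not yet understand."""
    mapping = FIELD_MAPS[feed_id]
    canonical, unmapped = {}, []
    for key, value in raw.items():
        if key in mapping:
            canonical[mapping[key]] = value
        else:
            unmapped.append(key)     # surface these for schema review
    if unmapped:
        canonical["_unmapped"] = unmapped
    return canonical

print(convert("feed_b", {"LP_Amount": 550000, "No_Bedrooms": 3, "Pool": "Y"}))
# {'price': 550000, 'bedrooms': 3, '_unmapped': ['Pool']}
```

Tracking the `_unmapped` fields is one way to feed the Completeness metric described earlier.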

Future Ingestion Roadmap:

Ingestion as a system needs to be continuously enhanced, iteratively, as the needs of end users and the behavior of source MLS systems evolve. The system should provide near real-time data that is both accurate and complete so agents can make sound business decisions. Some forward-looking enhancements we are working on include optimizing the load on source systems and improving end-to-end latency by decoupling resource ingestion and adjusting the polling frequency of different MLS resources (Open Houses vs. Listings vs. Agents) to match their respective rates of change. We also need to identify and prioritize important updates ahead of others to improve the overall agent experience. Batch downloading and parallel processing are further optimizations that can be applied in the pipeline architecture to improve data timeliness and shorten latency toward a near real-time end-user experience. Finally, data quality and integrity checks need to be built at multiple stages of the pipeline to detect and prevent data leaks in a timely manner.

Contributors:

Alok Sen, Amit Jain, Prashant Sahu, Shobhit Shah, Sumesh Agarwal
