Tricks to make your search 10X faster using Solr and Cassandra

A mini-guide to speed-layer data warehouse design

Jiazhen Zhu
Walmart Global Tech Blog
5 min read · Sep 11, 2020


written by Jiazhen Zhu

designed by Global Governance Team

Photo credit: Pixabay

Motivation

The concept of big data has been popular around the world for a while, going back as far as a paper by Charles Tilly in the 1980s. I first got to know it during my undergraduate studies in 2008. At the beginning, I thought that the larger the data we have, the greater the value and power we can get from it. However, after a few years of experience as a data engineer, I have become convinced that speed is the most critical factor in the big data space, because data changes so rapidly that a result delivered late has already lost much of its value.

At Walmart, there are many huge datasets available to different use cases. But as I mentioned above, the critical question is how to use them quickly to support decision making.

Normally, with huge data in the terabytes (TB) or petabytes (PB), there are two types of cases that most companies need to deal with:

  1. Data provided at the operational level. This data contains a lot of detailed information, which makes fast search difficult.
  2. Data provided at the management level. Here, the operational-level data is aggregated into various metrics, which reduces the size of the original data and results in faster search.

In the next sections, I will introduce a traditional design and a new paradigm designed by the Global Governance team at Walmart Global Tech. We will see that, for case 1, the new paradigm improves fetch speed by roughly 10X over the old one.

Traditional Design

1. Tools

  • RDBMS: PostgreSQL, MySQL, SQL Server, Teradata
  • Key/value Database: Redis
  • Columnar Database: HBase, BigQuery, Redshift

2. Data Modeling

Most of the time, data architects choose a Star or Snowflake schema to design and implement the data model. For example, we can choose Teradata or BigQuery as our data warehouse and design a fact table and dimension tables. The data is then joined across multiple tables through a service API.

A simple Star schema: both the fact table and the dimension tables are stored in an all-in-one database.

The Star schema is a good choice because the data is denormalized into two levels (fact and dimensions). However, fetching data that has to be joined, even at just one level, is still time-consuming, which leads to a poor user experience (a sketch of such a join follows).
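To make the join cost concrete, below is a minimal sketch of a one-level star-schema lookup. The table and column names (sales_fact, item_dim, store_dim) are hypothetical, and an in-memory SQLite database stands in for the all-in-one warehouse; a real Teradata or BigQuery query would have the same shape but run over far more rows.

    import sqlite3

    # In-memory stand-in for the all-in-one warehouse (hypothetical schema).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE item_dim   (item_id INTEGER PRIMARY KEY, item_name TEXT);
        CREATE TABLE store_dim  (store_id INTEGER PRIMARY KEY, store_city TEXT);
        CREATE TABLE sales_fact (item_id INTEGER, store_id INTEGER, qty INTEGER);

        INSERT INTO item_dim   VALUES (1, 'widget');
        INSERT INTO store_dim  VALUES (10, 'Dallas');
        INSERT INTO sales_fact VALUES (1, 10, 5);
    """)

    # Even a "simple" detail lookup fans out into joins across the fact table
    # and its dimension tables.
    rows = conn.execute("""
        SELECT f.qty, i.item_name, s.store_city
        FROM sales_fact f
        JOIN item_dim  i ON f.item_id  = i.item_id
        JOIN store_dim s ON f.store_id = s.store_id
        WHERE i.item_name = 'widget'
    """).fetchall()
    print(rows)  # [(5, 'widget', 'Dallas')]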

On the other hand, most of these databases do not support an upsert operation, and all-in-one databases like Teradata are relatively costly.

3. Workflow

Traditional Workflow

User Side

  • The user searches by the provided searchable key information on the UI
  • The backend sends an API call to the all-in-one database
  • The database processes the required join statement
  • The all-in-one database returns all detailed information back to the UI

Engineer Side

  • Manually upsert the delta data into the all-in-one database

A New Paradigm

To avoid the high cost, we moved to open source. To avoid time-consuming joins when fetching data, we use two different types of databases instead of a single database.

1. Tools

To balance search speed, linear scalability, high availability, and a flexible data model, we considered columnar, NoSQL, and search engine databases as candidates, from the pool below.

  • NoSQL Columnar Database: Cassandra
  • Search Engine Database: Solr, Elasticsearch

Given that speed is our first-class concern here, together with the data size, we decided on NoSQL plus a search engine working together.

  • NoSQL: We chose Cassandra because it is not only a NoSQL database but also column-oriented, which works well for OLAP (data warehouse) use.
  • Search Engine: We chose Solr because it is a mature project with a strong user community behind it.

2. Data Modeling

We still use the Star schema design concept, but we redesign the fact table (renaming it the info table). Instead of storing quantitative information for analysis, the info table stores the key information, and that data is stored in the Solr search engine. All the other dimension tables are stored in Cassandra. Because Cassandra is query-driven, we design a Global Key to link the dimension tables instead of using many dimension keys, which also avoids creating many different Cassandra tables for query-driven usage.

The Global Key is a universal key across the data model, generated by some basic logic rather than simply concatenating several natural keys into one. For merchandise, the key can be generated using the Item Id, Shipping Id, etc. For a person, the key can be generated using the Last Name, First Name, Phone, etc.
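The exact key-generation logic is internal, so the snippet below is only a plausible sketch: normalize the natural attributes, then hash them into one stable identifier. The function name, field choices, and use of SHA-256 are assumptions for illustration, not the team's actual implementation.

    import hashlib

    def make_global_key(*parts: str) -> str:
        # Hypothetical scheme: lowercase and strip each attribute, join with a
        # separator, then hash so every key has the same fixed shape.
        normalized = "|".join(p.strip().lower() for p in parts if p)
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    # Merchandise key from Item Id and Shipping Id (example values).
    merch_key = make_global_key("ITEM-12345", "SHIP-98765")

    # Person key from Last Name, First Name, and Phone (example values).
    person_key = make_global_key("Doe", "Jane", "555-0100")

    print(merch_key, person_key)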

Redesigned Star schema: the info / fact table is stored in Solr; all other dimension tables are stored in Cassandra.
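As an illustration of how the two stores could be laid out, here is a minimal sketch: an info document indexed in Solr that carries only the searchable keys plus the Global Key, and a Cassandra dimension table partitioned by that Global Key so a single lookup returns the detail row without any join. The keyspace, table, and field names are all hypothetical.

    # Hypothetical info document for Solr: searchable keys plus the Global Key.
    info_doc = {
        "id": "a1b2c3",              # Global Key, used as the Solr unique key
        "item_id": "ITEM-12345",
        "shipping_id": "SHIP-98765",
        "item_name": "widget",
    }

    # Hypothetical Cassandra dimension table, partitioned by the Global Key.
    ITEM_DIM_DDL = """
    CREATE TABLE IF NOT EXISTS retail.item_dim (
        global_key text,
        item_id    text,
        item_desc  text,
        supplier   text,
        PRIMARY KEY (global_key)
    );
    """

    print(info_doc)
    print(ITEM_DIM_DDL)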

3. Workflow

Redesigned Workflow

User Side

  • The user searches by the provided searchable key information on the UI
  • The backend sends an API call to Solr based on the key info
  • Solr returns the Global Key to the backend
  • The backend sends an API call to Cassandra with the Global Key
  • Cassandra returns all detailed information back to the UI (see the sketch below)
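A minimal sketch of this two-hop lookup is below. It assumes the requests and cassandra-driver packages, a running Solr core (here called info) holding the info documents, and a Cassandra keyspace/table like the one sketched earlier; the hostnames, core, keyspace, and field names are placeholders.

    import requests
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    SOLR_SELECT = "http://localhost:8983/solr/info/select"  # placeholder core name

    def lookup(item_id: str):
        # Step 1: ask Solr for the Global Key that matches the searchable key info.
        resp = requests.get(
            SOLR_SELECT,
            params={"q": f"item_id:{item_id}", "rows": 1, "wt": "json"},
        )
        docs = resp.json()["response"]["docs"]
        if not docs:
            return None
        global_key = docs[0]["id"]  # Global Key stored as the Solr unique key

        # Step 2: fetch the full detail row from Cassandra by that Global Key.
        cluster = Cluster(["127.0.0.1"])     # placeholder contact point
        session = cluster.connect("retail")  # placeholder keyspace
        stmt = session.prepare("SELECT * FROM item_dim WHERE global_key = ?")
        row = session.execute(stmt, [global_key]).one()
        cluster.shutdown()
        return row

    print(lookup("ITEM-12345"))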

Engineer Side

  • Easily upsert the delta data to Solr
  • Easily upsert the delta data to Cassandra (see the sketch below)
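Both loads can be sketched in a few lines. Re-posting a document to Solr with an existing unique key replaces it, and a Cassandra INSERT is an upsert by design, so delta loads need no separate update path. As above, the core, keyspace, table, and column names are placeholders.

    import requests
    from cassandra.cluster import Cluster  # pip install cassandra-driver

    SOLR_UPDATE = "http://localhost:8983/solr/info/update?commit=true"  # placeholder core

    def upsert_delta(delta_docs, delta_rows):
        # Solr: re-posting a doc with an existing unique key overwrites it.
        requests.post(SOLR_UPDATE, json=delta_docs)

        # Cassandra: INSERT is an upsert, so the same statement covers both
        # new and changed rows.
        cluster = Cluster(["127.0.0.1"])     # placeholder contact point
        session = cluster.connect("retail")  # placeholder keyspace
        stmt = session.prepare(
            "INSERT INTO item_dim (global_key, item_id, item_desc, supplier) "
            "VALUES (?, ?, ?, ?)"
        )
        for row in delta_rows:
            session.execute(stmt, row)
        cluster.shutdown()

    # Example delta batch (hypothetical values).
    upsert_delta(
        delta_docs=[{"id": "a1b2c3", "item_id": "ITEM-12345", "item_name": "widget"}],
        delta_rows=[("a1b2c3", "ITEM-12345", "blue widget, 2-pack", "Acme")],
    )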

Benefit

I measured both the traditional model and the new paradigm on filter and filter-plus-join queries. The new paradigm improves fetching time from seconds to milliseconds (a sketch of how such timings can be taken follows the table).

+---------------+------------------+-------------------+
| Method        | Traditional (ms) | New Paradigm (ms) |
+---------------+------------------+-------------------+
| FILTER        |             2536 |               307 |
| FILTER & JOIN |             6900 |               601 |
+---------------+------------------+-------------------+
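The numbers above are from our internal comparison; a simple way to take this kind of timing is shown in the sketch below, with the two query functions left as placeholders.

    import time

    def time_ms(fn, *args, repeats=10):
        # Average wall-clock time of fn(*args) in milliseconds.
        start = time.perf_counter()
        for _ in range(repeats):
            fn(*args)
        return (time.perf_counter() - start) * 1000 / repeats

    # Placeholders for the two query paths being compared:
    #   traditional_filter_join(key) -> the all-in-one SQL join
    #   new_lookup(key)              -> the Solr -> Cassandra two-hop lookup
    # print(time_ms(traditional_filter_join, "ITEM-12345"))
    # print(time_ms(new_lookup, "ITEM-12345"))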

Besides improving fetching time, we also gain the following benefits:

  • Saving development and loading time with the upsert method.
  • Avoiding Cassandra's query-driven limitation by using the Global Key design.
  • Saving significant cost by using open source.

Conclusion

The combination of Solr and Cassandra is a good choice for serving a UI when the data runs to millions or billions of records. Using the Global Key design across them also gives us a huge benefit.

Thanks

Thanks to everyone who helped review this post and gave me a lot of valuable feedback.

Reference

  1. Solr: https://lucene.apache.org/solr/
  2. Cassandra: https://cassandra.apache.org/
