Tricks to make your search 10X faster using Solr and Cassandra

A mini-guide about speed layer data warehouse design

Jiazhen Zhu
Sep 11, 2020 · 5 min read

written by Jiazhen Zhu

designed by Global Governance Team

Image for post
Image for post
Photo credit: Pixabay

Motivation

The concept of big data has been popular around the world for a while, even back to a paper from Charles Tilly in the 1980s. To me, I started to get to know it during my undergraduate study in 2008. At the beginning, I thought the larger the data we have, the greater value and power we can get from it. However, after a few years experience as a data engineer, I have been convinced that speed is the most critical feature in the big data area, because data changes so rapidly that we will miss the value from the result we get later.

At Walmart, there are various huge datasets which are available to different use cases. But as I mentioned before, it is critical how to quickly use them for the decision usage.

Normally, with huge data in terabytes (TBs) or pebibytes (PBs), there are two type of cases which most companies need to deal with:

  1. The data provided to Operation Level. In this case, the data contain a lot of detail information which makes it hard to do quick search.
  2. The data provided to Management Level. In this case, we need to aggregate the Operation Level data into various metric which can reduce the size of original data resulting in faster search.

In the next section, I will introduce a traditional design and a new paradigm which is designed by Global Governance team at Walmart Global Tech. We will see that the new paradigm will significantly improve the fetch speed by 10X faster than the old one for case 1.

Traditional Design

1. Tools

  • RDMS: PostgreSQL, MySQL, SQL Server, Teradata
  • Key/value Database: Redis
  • Columnar Database: HBase, BigQuery, Redshift

2. Data Modeling

Most of time, data architect will choose the Star or Snowflake schema to design and implement the data modeling. For example, we can choose Teradata or BigQuery as our data warehouse and design the Fact table and Dimensional tables. The data can be joined from multiple tables by using service API.

Image for post
Image for post
A Simple Star Schema: Both Fact Table and Dimension Tables are stored in all-in-one database.

The Star scheme is a good idea because the data is denormalized into two levels (fact and dimensions). However, we still face the time-consuming issue when fetching data which needs to be joined even just at one level. This causes a very bad user experience.

On the other hand, most of databases don’t support upsert method. And all-in-one databases like Teradata are restively costly.

3. Workflow

Image for post
Image for post
Traditional Workflow

User Side

  • User can search the provided searchable key information on UI
  • Backend send an API call to all-in-one database
  • Database process the required join statement
  • All-in-one database response all detail information back to UI

Engineer Side

  • Manually upsert the delta data to all-in-one database

A New Paradigm

In order to avoid the high cost, we begin to use open source. In order to avoid time consuming (join) when fetching the data, we use two different type of databases instead of single database.

1. Tools

To balance the speed (Search), linear scalability, high availability and flexible data model, we chose Columnar Database, NoSQL and Search Engine as our candidates from the list of below pool.

  • NoSQL Columnar Database: Cassandra
  • Search Engine Database: Solr, Elasticsearch

After considering the speed is our first citizen at here and the data size, we really consider NoSQL plus Search Engine together.

  • NoSQL: We chose the Cassandra because it is not only NoSQL but also is columnar database which is good for OLAP (DataWarehouse).
  • Search Engine: The reason we chose Solr is because it is a mature project and has good user community behind it.

2. Data Modeling

We still use Star schema design concept, but redesigning the fact table (we rename it as info table). Instead of storing quantitative information for analysis, we store the key information under fact table or info table. Those data will be stored in the Solr search engine. All others’ dimensional tables will be stored in the Cassandra. Because Cassandra is query driven, we design a Global Key to link the dimension tables instead of using many dimensional keys and also avoid to create many different Cassandra tables for query driven usage.

Global Key is a universal key among data modeling which is generated on some basic logic, not just combine several natural keys into one. For merchandise, the key can be generated by using Item Id, Shipping Id etc. For person, the key can be generated by using Last Name, First Name, Phone etc.

Image for post
Image for post
Redesigned Star Schema: Info / Fact Table is stored in Solr, All others’ Dimension Tables are stored in Cassandra

3. Workflow

Image for post
Image for post
Redesigned Workflow

User Side

  • User can search the provided searchable key information on UI
  • Backend send an API call to Solr based on key info
  • Solr response the Global Key to Backend
  • Backend send an API call to Cassandra with Global Key
  • Cassandra response all detail information back to UI

Engineer Side

  • Easily upsert the delta data to Solr
  • Easily upsert the delta data to Cassandra

Benefit

I created a metric between traditional model and new paradigm for filter and join. We can find the new paradigm improve fetching time from seconds to milli seconds.

+---------------+--------------------+---------------+
| Method | Traditional | New Paradigm |
+---------------+--------------------+---------------+
| FILTER | 2536 (ms) | 307 (ms) |
| FILTER & JOIN | 6900 (ms) | 601 (ms) |
+---------------+--------------------+---------------+

Beside for Improving fetching time, we can gain following benefits also:

  • Saving development and loading time with upsert method.
  • Avoiding query driven limitation for Cassandra using Global Key design.
  • Saving huge cost using open source.

Conclusion

Combination of Solr and Cassandra is a good choice for UI when we have million or billion size data. Using global design key among them will give us huge benefit also.

Thanks

Thanks to everyone who helped me to review and gave me many valuable feedbacks.

Reference

  1. Solr: https://lucene.apache.org/solr/
  2. Cassandra: https://cassandra.apache.org/

Walmart Global Tech Blog

We’re powering the next great retail disruption.

Medium is an open platform where 170 million readers come to find insightful and dynamic thinking. Here, expert and undiscovered voices alike dive into the heart of any topic and bring new ideas to the surface. Learn more

Follow the writers, publications, and topics that matter to you, and you’ll see them on your homepage and in your inbox. Explore

If you have a story to tell, knowledge to share, or a perspective to offer — welcome home. It’s easy and free to post your thinking on any topic. Write on Medium

Get the Medium app

A button that says 'Download on the App Store', and if clicked it will lead you to the iOS App store
A button that says 'Get it on, Google Play', and if clicked it will lead you to the Google Play store