Scaling Solr Collections: Using Collection Aliases
SolrCloud helps us build a highly scalable, fault-tolerant, distributed indexing and search platform. With the help of ZooKeeper, it offers index replication, failover, load balancing and distributed queries out of the box. To scale Solr, we usually tune the following configurations:
Collection configurations:
- Changing hard and soft commit time
- Merge factor
- Tuning ramBufferSize
- Term index settings
- Using DocValues instead of field cache
- Avoiding optimize
Solr server configurations:
Client side tuning:
- Using batch updates
- Parallel indexing
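
As a rough illustration of client-side batching, the sketch below groups documents into fixed-size batches and serializes each batch as one JSON request body. The batch size and the helper names (batched, update_payloads) are assumptions made for this example; each body would be POSTed to the collection's /update handler with a JSON content type.

```python
import json
from itertools import islice


def batched(docs, batch_size):
    """Yield successive lists of at most batch_size documents."""
    it = iter(docs)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def update_payloads(docs, batch_size=1000):
    """Yield one JSON array body per batch (one HTTP request per batch)."""
    for batch in batched(docs, batch_size):
        yield json.dumps(batch)


if __name__ == "__main__":
    docs = ({"id": str(i)} for i in range(2500))
    for body in update_payloads(docs):
        # Each body would be sent in a single update request.
        print(len(json.loads(body)))
```

Parallel indexing can build on the same helper by handing the batches to a pool of worker threads, each sending its own requests.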
Even after tuning all these parameters, there is a limit to how far we can scale a single collection. Beyond that point, both indexing and query performance start to degrade. We can try to scale horizontally by maintaining multiple collections, but then both the indexing tools and the querying system have to change to handle multiple collections. Solr's collection aliasing feature addresses this scenario.
Collection Aliasing
A collection alias is a virtual collection that can be mapped to one or more real collections. We can create multiple collections in the background and maintain two aliases: a write alias for indexing, mapped to the latest collection, and a read alias for querying, mapped to all collections. Since the alias configuration is stored in ZooKeeper, we can change the alias mapping at any time and the change is reflected in both the indexing and the query systems. Collection aliasing helps scale both indexing and query performance.
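
A minimal sketch of how the two aliases could be set up through the Collections API CREATEALIAS action. The base URL and the collection names used here are assumptions for illustration; the code only builds the request URLs and does not contact a server.

```python
from urllib.parse import urlencode

# Assumption: a local SolrCloud node on the default port.
SOLR_BASE = "http://localhost:8983/solr"


def create_alias_url(alias, collections, base=SOLR_BASE):
    """Build a Collections API URL that maps `alias` to `collections`."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        # The API expects a comma-separated list of collection names.
        "collections": ",".join(collections),
    }
    # Keep commas unescaped so the URL stays readable.
    return f"{base}/admin/collections?{urlencode(params, safe=',')}"


if __name__ == "__main__":
    # Write alias points at the latest collection only.
    print(create_alias_url("write_alias", ["day2_ingestion"]))
    # Read alias spans every collection that should be searchable.
    print(create_alias_url("read_alias", ["day1_ingestion", "day2_ingestion"]))
```

Calling CREATEALIAS again with an existing alias name replaces the mapping, which is what lets us re-point an alias without touching the clients.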
Scaling index performance:
We can scale indexing by using a large number of shards and a high merge factor to avoid frequent merging (which leaves a large number of segments), but this hurts query performance.
Handling both indexing and queries on the same servers hurts the performance of both because of GC pressure and cache usage. We can improve performance by using separate sets of nodes for ingestion and search.
With collection aliases, the collection currently being indexed can live on the ingestion servers with a large number of shards. Once indexing into that collection stops, we copy it to another collection with fewer shards on the search servers.
Scaling query performance:
Searching a single big collection hurts query performance. Instead, we can create small collections, maintain multiple read aliases that are each mapped to a few collections depending on the use case, and select a read alias based on the query parameters. For example, with time-series data we can create a day alias, a week alias and a month alias, and choose among them based on the time filter.
When we copy a collection from the ingestion servers to the search servers, we can also optimize it to merge all segments into one, which improves search performance.
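
One way the client-side alias selection could look, as a sketch. The alias names (day_alias, week_alias, month_alias, read_alias) and the assumption that each alias covers the most recent N days are made up for this example.

```python
from datetime import datetime, timedelta

# Hypothetical aliases, ordered from the narrowest to the widest time span.
ALIASES = [
    (timedelta(days=1), "day_alias"),
    (timedelta(days=7), "week_alias"),
    (timedelta(days=31), "month_alias"),
]
# Hypothetical alias that covers every collection.
FALLBACK_ALIAS = "read_alias"


def pick_read_alias(filter_start, now):
    """Return the narrowest alias whose span still covers filter_start."""
    age = now - filter_start
    for span, alias in ALIASES:
        if age <= span:
            return alias
    return FALLBACK_ALIAS


if __name__ == "__main__":
    now = datetime(2018, 6, 30, 12, 0)
    print(pick_read_alias(now - timedelta(hours=3), now))
    print(pick_read_alias(now - timedelta(days=20), now))
```

The narrower the alias, the fewer collections a query fans out to.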
Collection Timeline
For a time-series use case, the steps look like this:
- On day 1, create a collection day1_ingestion on the ingestion servers with a large number of shards to scale ingestion. Map write_alias and read_alias to the day1_ingestion collection.
- On day 2, create a collection day2_ingestion with a large number of shards. Map write_alias to day2_ingestion, and read_alias to both the day1_ingestion and day2_ingestion collections.
- Create a collection day1_search on the search servers with few shards.
- Using the MERGEINDEXES action of the CoreAdmin API, merge the shards of the day1_ingestion collection into the day1_search collection. Merging indexes also optimizes the collection, so we don't have to call the Optimize API separately.
- Once day1_search has been merged properly, change read_alias to the day1_search and day2_ingestion collections.
- Delete the day1_ingestion collection.
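
The steps above can be sketched as an ordered plan for any day from day 2 onward. This is only an illustration: the shard counts are hypothetical, and generalizing the read alias beyond day 2 (earlier days already served from their search collections) is an assumption. The MERGEINDEXES entry is a placeholder, because the real CoreAdmin call works per core and needs deployment-specific core names or index directories.

```python
INGEST_SHARDS = 16  # hypothetical: many shards for fast ingestion
SEARCH_SHARDS = 2   # hypothetical: few shards for fast search


def rollover_steps(day):
    """Return the ordered (action, params) steps to run on `day` (day >= 2)."""
    if day < 2:
        raise ValueError("rollover starts on day 2")
    prev = day - 1
    # Days before `prev` are assumed to be served from search collections.
    searched = [f"day{d}_search" for d in range(1, prev)]
    new_ingest = f"day{day}_ingestion"
    old_ingest = f"day{prev}_ingestion"
    old_search = f"day{prev}_search"
    return [
        ("CREATE", {"name": new_ingest, "numShards": INGEST_SHARDS}),
        ("CREATEALIAS", {"name": "write_alias", "collections": new_ingest}),
        ("CREATEALIAS", {"name": "read_alias",
                         "collections": ",".join(searched + [old_ingest, new_ingest])}),
        ("CREATE", {"name": old_search, "numShards": SEARCH_SHARDS}),
        # Placeholder keys, not real API parameters: see the note above.
        ("MERGEINDEXES", {"source": old_ingest, "target": old_search}),
        ("CREATEALIAS", {"name": "read_alias",
                         "collections": ",".join(searched + [old_search, new_ingest])}),
        ("DELETE", {"name": old_ingest}),
    ]


if __name__ == "__main__":
    for action, params in rollover_steps(2):
        print(action, params)
```

Keeping the plan as plain data makes it easy to log, dry-run, or hand to a scheduler before any request is sent.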
Pain Points
- Creating collections and updating aliases on a schedule
- Handling late-arriving data: a single write alias cannot handle late-arriving data. The ingestion tool has to check the event time and route each document to the correct collection
- Multiple read aliases: even though multiple read aliases improve query performance, the client has to choose the correct read alias based on the query parameters
- Deleting old collections: old collections and aliases need to be deleted periodically, depending on the use case
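
To make the late-arrival point concrete, here is a sketch of the routing an ingestion tool would have to do itself. The date-based collection naming scheme, the events prefix and the event_time field name are all assumptions for this example.

```python
from collections import defaultdict
from datetime import datetime


def collection_for(event_time, prefix="events"):
    """Name of the (hypothetical) daily collection that owns event_time."""
    return f"{prefix}_{event_time:%Y_%m_%d}"


def group_by_collection(docs, time_field="event_time"):
    """Group documents by the collection their event time belongs to."""
    groups = defaultdict(list)
    for doc in docs:
        ts = datetime.fromisoformat(doc[time_field])
        groups[collection_for(ts)].append(doc)
    return dict(groups)


if __name__ == "__main__":
    docs = [
        {"id": "a", "event_time": "2018-06-30T08:15:00"},
        {"id": "b", "event_time": "2018-06-29T23:59:00"},  # arrived late
    ]
    for name, batch in group_by_collection(docs).items():
        print(name, [d["id"] for d in batch])
```
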
These pain points can be partially solved using Time Routed Aliases.
Time Routed Aliases
Time Routed Aliases are a feature newly released in Solr 7. Once a routed alias is created, Solr automatically creates the time-based collections and routes each document to the correct collection based on its time field. Time routed aliases help with creating collections, handling late-arriving data and deleting old collections, but they have a few drawbacks: only time-based use cases are supported, time filters in queries are not used to narrow down the collections during search, and old collections are not optimized. Most of these improvements are already in progress. Hopefully they will be released soon.
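
As a sketch, a time routed alias is created with the same CREATEALIAS action plus router.* and create-collection.* parameters. The alias name, field name, config name and parameter values below are assumptions for illustration, and the set of supported parameters depends on the Solr 7.x version, so check the reference guide for your release.

```python
from urllib.parse import urlencode

# Assumption: a local SolrCloud node on the default port.
SOLR_BASE = "http://localhost:8983/solr"


def time_routed_alias_url(alias, time_field, config_name, base=SOLR_BASE):
    """Build a CREATEALIAS URL for a daily time routed alias."""
    params = {
        "action": "CREATEALIAS",
        "name": alias,
        "router.name": "time",
        "router.field": time_field,      # date field used for routing
        "router.start": "NOW/DAY",       # start of the first collection
        "router.interval": "+1DAY",      # one collection per day
        "router.maxFutureMs": 3600000,   # reject documents more than 1h ahead
        "create-collection.collection.configName": config_name,
        "create-collection.numShards": 2,
    }
    return f"{base}/admin/collections?{urlencode(params)}"


if __name__ == "__main__":
    print(time_routed_alias_url("timedata", "event_time", "myConfig"))
```

Documents are then indexed against the alias name, and Solr picks the target collection from the value of the routing field.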