Making the Internet Archive’s full text search faster.

  • Our corpus is very heterogeneous: it contains many documents that differ in content, context, and language.
  • The size of our documents varies considerably, from a few kilobytes to several megabytes of text. We have documents consisting of only a few pages and books with thousands of pages.
  • The index should also be agnostic about the kind, content, context, and language of a document. For simplicity, and for an easier connection with our document ingestion pipeline, we need to keep everything in a single index.
  • At the Internet Archive we have thousands of new documents to index every day, so the search must keep working properly under continuous indexing.
  • The search has to be as fast as possible, serving each request without slowing down the others.

:: The Elasticsearch cluster

╔══════════════╦═════╦═══════╦════════════╗
║ Node         ║ CPU ║ RAM   ║ Storage    ║
╠══════════════╬═════╬═══════╬════════════╣
║ 10 datanodes ║ 14  ║ 45 GB ║ 6.4 TB SSD ║
║ 3 masters    ║ 2   ║ 2 GB  ║ -          ║
║ 4 clients    ║ 22  ║ 34 GB ║ -          ║
╚══════════════╩═════╩═══════╩════════════╝
For text analysis we chose the ICU components:
╔═══════════════╦════════════════╦═════════════╗
║ tokenizer     ║ normalizer     ║ folding     ║
╠═══════════════╬════════════════╬═════════════╣
║ icu_tokenizer ║ icu_normalizer ║ icu_folding ║
╚═══════════════╩════════════════╩═════════════╝
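These three components come from the Elasticsearch ICU analysis plugin (analysis-icu). As a minimal sketch of how they fit together, the index can be created with an analyzer like the one below; the index name is just a placeholder, and the full analyzer we ended up using (including the common_grams filter) is shown later in this post:

curl -XPUT 'http://localhost:9200/books-index' -d '
{
  "settings": {
    "analysis": {
      "analyzer": {
        "textIcu": {
          "type": "custom",
          "tokenizer": "icu_tokenizer",
          "char_filter": [ "icu_normalizer" ],
          "filter": [ "icu_folding" ]
        }
      }
    }
  }
}'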

:: First results and first slow queries

:: The analysis

  • Lots of hot threads were spent in the highlighting phase. We need this feature because we don't only want to know which documents contain the best matches; we also want to show each result in context with a snippet (see the query sketch after this list). This operation can be expensive, especially since we want to display snippets for every document in the results list.
  • The search for queries with high-frequency words was particularly expensive.
  • The CPU usage was fine.
  • The ratio of JVM heap to operating system RAM was not well balanced, so the filesystem cache was unable to handle our big shards and documents.
  • The Garbage Collector (GC) paused often and for too long (sometimes more than a minute), probably because some of our documents contain very large values in the body field.
  • We discovered that the Solid State Disk (SSD) bandwidth was saturated during these expensive queries.
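To give an idea of the highlighting mentioned in the first point, here is a minimal sketch of a search request asking for snippets on the body field; the index name and query text are placeholders, while the field name doc_body matches the mapping shown later in this post:

curl -XGET 'http://localhost:9200/books-index/_search' -d '
{
  "query": {
    "match": { "doc_body": "internet archive" }
  },
  "highlight": {
    "fields": {
      "doc_body": {}
    }
  }
}'

Every hit in the response then carries a highlight section with the snippets, and producing those snippets is exactly the extra work that made these queries expensive.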

:: Modeling the problem

To model the problem we measured the cluster throughput, in transactions per second, with and without highlighting:
╔═══════════════════╦═══════════════╗
║ Without Highlight ║ 1.3 trans/sec ║
╠═══════════════════╬═══════════════╣
║ With Highlight    ║ 0.8 trans/sec ║
╚═══════════════════╩═══════════════╝

:: Exploring the solutions

The first change was to the hardware: we increased the RAM of the data nodes so the operating system had more room for the filesystem cache.
╔══════════════════════╦═════╦═══════╦════════════╗
║ Node number and type ║ CPU ║ RAM   ║ Storage    ║
╠══════════════════════╬═════╬═══════╬════════════╣
║ 10 datanodes         ║ 14  ║ 88 GB ║ 6.4 TB SSD ║
╚══════════════════════╩═════╩═══════╩════════════╝
  • Optimizing writes: On a Linux system, for performance reasons, written data goes into a cache (the “dirty cache”) before being flushed to disk. Write caching lets us write to memory very quickly, but eventually we have to pay the cost of writing all that data out to the disks, SSDs in our case (3). To reduce the pressure on the SSD bandwidth we decided to shrink this dirty cache, making the flushes smaller but more frequent. We lowered vm.dirty_bytes from the default value of 33,554,432 bytes (32 MB) to 1,048,576 bytes (1 MB) with:
    sudo sysctl -w vm.dirty_bytes=1048576
    By doing this we got better performance from the disks, drastically reducing SSD saturation during writes. A small sketch of checking and persisting this setting follows below.
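    A value set with sysctl -w does not survive a reboot; assuming a standard Linux setup, it can be checked and persisted roughly like this (the configuration file name is just an example):

    # check the value currently in effect
    cat /proc/sys/vm/dirty_bytes
    # apply the new value at runtime
    sudo sysctl -w vm.dirty_bytes=1048576
    # persist it across reboots
    echo 'vm.dirty_bytes = 1048576' | sudo tee /etc/sysctl.d/99-dirty-bytes.conf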
Another change targeted the high-frequency words that made some queries so expensive: the common_grams token filter, which combines very common words with their neighbours into bigrams at index time. For example, the phrase “the quick and brown fox” is normally tokenized as:
the — quick — and — brown — fox
With the common_grams filter the output also contains the bigrams:
the — the_quick — quick — quick_and — and_brown — brown — fox
As the list of common words we used the 50,000 most frequent terms in our corpus:
╔═══════╦═══════════╦════════╗
║ Rank  ║ Frequency ║ Word   ║
╠═══════╬═══════════╬════════╣
║ 1     ║ 817,695   ║ and    ║
║ 2     ║ 810,060   ║ of     ║
║ 3     ║ 773,365   ║ in     ║
║ 4     ║ 753,855   ║ a      ║
║ 5     ║ 735,855   ║ to     ║
║ ...   ║ ...       ║ ...    ║
║ 49999 ║ 44,417    ║ 391    ║
║ 50000 ║ 44,416    ║ remedy ║
╚═══════╩═══════════╩════════╝
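To check what an analyzer with the common_grams filter actually emits for a phrase like the one above, the _analyze API can be used. A minimal sketch, assuming the index and the textIcu analyzer defined later in this post (on newer Elasticsearch versions the analyzer and text are passed in a JSON body instead of the query string):

curl -XGET 'http://localhost:9200/books-index/_analyze?analyzer=textIcu' -d 'the quick and brown fox'

The response lists the emitted tokens, including bigrams such as the_quick and quick_and.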
With these changes in place the throughput improved:
╔═══════════════════╦═══════════════╗
║ Without Highlight ║ 5 trans/sec   ║
╠═══════════════════╬═══════════════╣
║ With Highlight    ║ 2.4 trans/sec ║
╚═══════════════════╩═══════════════╝

:: The new index

:: The final results

╔═══════════╦════════════╦═══════════╦════════════════╗
║           ║ hits found ║ max score ║ execution time ║
╠═══════════╬════════════╬═══════════╬════════════════╣
║ Old Index ║ 6,479,232  ║ 0.4427    ║ 20s 80ms       ║
╠═══════════╬════════════╬═══════════╬════════════════╣
║ New Index ║ 501,476    ║ 0.7738    ║ 2s 605ms       ║
╚═══════════╩════════════╩═══════════╩════════════════╝
...
{
  "doc_title": {
    "type": "string",
    "analyzer": "textIcu",
    "search_analyzer": "textIcuSearch"
  },
  "doc_author": {
    "type": "string",
    "index": "analyzed",
    "analyzer": "textIcu",
    "search_analyzer": "textIcuSearch"
  },
  "doc_body": {
    "type": "string",
    "index": "analyzed",
    "analyzer": "textIcu",
    "search_analyzer": "textIcuSearch",
    "term_vector": "with_positions_offsets_payloads"
  },
...
{
  "analyzer": {
    "textIcu": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "char_filter": [ "icu_normalizer" ],
      "filter": [ "icu_folding", "common_grams" ]
    },
    "textIcuSearch": {
      "type": "custom",
      "tokenizer": "icu_tokenizer",
      "char_filter": [ "icu_normalizer" ],
      "filter": [ "icu_folding", "common_grams_query" ]
    }
  },
  "filter": {
    "common_grams": {
      "type": "common_grams",
      "common_words_path": "/etc/elasticsearch/word_list_common_grams"
    },
    "common_grams_query": {
      "type": "common_grams",
      "query_mode": true,
      "common_words_path": "/etc/elasticsearch/word_list_common_grams"
    }
  }
}
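The common_words_path above points to a plain text file with one common word per line, which has to be present and readable by Elasticsearch on every data node. A small sketch of what it looks like, using the top entries from the frequency table above:

# one common word per line
head -5 /etc/elasticsearch/word_list_common_grams
and
of
in
a
to

Since analyzer changes only take effect on newly indexed documents, these settings and mappings were applied to a fresh index and all documents were reindexed into it; that is the “New Index” in the comparison above.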
