Indexing File System and File Contents with Elasticsearch

Ahmet Kaftan
Published in cloudnesil
4 min read · Aug 31, 2019

I want to share my experience of using Elasticsearch to search over thousands of files and index gigabytes of content. The approach in this article is both vertically and horizontally scalable, so, thanks to Elasticsearch, you can apply it to billions of files and petabytes of content.

In my setup I used fscrawler (https://github.com/dadoonet/fscrawler) to collect data from the file system on Windows 10. I also run Elasticsearch on Windows 10.

Elasticsearch runs as a two-node cluster. The following configuration is for a single node; if you want to run more than one node, repeat the configuration on each node. Elasticsearch uses ports 9200 (HTTP) and 9300 (transport), so make sure these ports are allowed through the firewall.

Running Elasticsearch on Windows as system service

  1. Download elasticsearch: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.3.0-windows-x86_64.zip
  2. Unzip it with your favourite unzip tool. This will create a folder called “elasticsearch-7.3.0” which we will refer to as %ES_HOME%.
  3. Uncomment and edit the following variables in the Elasticsearch configuration file “elasticsearch-7.3.0/config/elasticsearch.yml”:
- node.name: name of the node
- network.host: IP address the node binds to and serves on
- discovery.seed_hosts: list of hosts to contact for cluster discovery
- cluster.initial_master_nodes: names of the master-eligible nodes (these must match the node.name values)
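
As an illustration, a minimal configuration for the first node of a two-node cluster might look like the fragment below. The node names, IP addresses, and cluster name are placeholders; substitute your own values.

```yaml
# elasticsearch-7.3.0/config/elasticsearch.yml -- example values only
cluster.name: my-cluster            # optional, but should match on both nodes
node.name: node-1
network.host: 192.168.58.251
discovery.seed_hosts: ["192.168.58.251", "192.168.58.252"]
cluster.initial_master_nodes: ["node-1", "node-2"]
```

The second node would use its own node.name and network.host, with the same discovery.seed_hosts and cluster.initial_master_nodes lists.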

4. Edit the default Java heap size, which is -Xms1g and -Xmx1g, in “elasticsearch-7.3.0/config/jvm.options”. You should use the same value for the minimum and maximum.
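
For example, to give the node a 4 GB heap (the exact size depends on your hardware; keep min and max equal), the relevant lines in jvm.options would be:

```
# elasticsearch-7.3.0/config/jvm.options -- 4g is an example value
-Xms4g
-Xmx4g
```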

5. To create a Windows service for Elasticsearch, use the “elasticsearch-service.bat” script in the folder elasticsearch-7.3.0/bin. Run the command: “elasticsearch-service.bat install”. (The same script also supports the start, stop, manager, and remove commands.)

My Elasticsearch setup

Running File System Crawler

I used fscrawler to collect data about the file system and its contents.

  1. Download fscrawler from: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/fscrawler-es7-2.7-20190724.065618-41.zip
  2. Unzip it with your favourite unzip tool. This will create a folder called “fscrawler-es7-2.7-SNAPSHOT”.
  3. Create a directory for fscrawler data; you will use this directory in the following steps. If you also want to store fscrawler logs, create a directory for them as well.
  4. Create a batch file that contains the following:
set JAVA_HOME=c:\Program Files\Java\jdk1.8.0_161
set FS_JAVA_OPTS=-Xmx6g -Xms6g
c:\fscrawler-es7-2.7-SNAPSHOT\bin\fscrawler.bat --config_dir c:\fscrawler-data myjob --loop -1 >> c:\fscrawler-log\fscrawler.log 2>&1

You may need to change the directories above to match your setup. The fscrawler job in this example is named “myjob”, and the “--loop -1” option makes fscrawler crawl continuously instead of exiting after one pass. The configuration directory must contain a subdirectory named after the job, in our case “myjob”.

5. Create the configuration for the job “myjob” in the “myjob” directory inside “config_dir”. The configuration file must be named “_settings.yaml”. Its content is as follows:

---
name: "myjob"
fs:
  url: "c:\\tmp\\es"
  update_rate: "5s"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.58.251:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
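
Once fscrawler is running against the cluster, you can check that documents are arriving. By default fscrawler writes file documents to an index named after the job, and (with index_folders enabled) folder documents to a second index with a “_folder” suffix. From Kibana Dev Tools, for example:

```
GET myjob/_count
GET myjob_folder/_count
```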

6. I used “nssm” to create a Windows service for fscrawler. Download it from: https://nssm.cc/release/nssm-2.24.zip

7. Unzip it with your favourite unzip tool. This will create a folder called “nssm-2.24”.

8. Go to the directory “nssm-2.24\win64” and run “nssm.exe install”

9. There will be a window for service parameters, just fill the boxes:

Nssm service create window
- Service name: name of your service.
- Application Tab>Application>Path: Path to your batch file to run fscrawler

If you have followed the steps above, you are ready to go. Since Elasticsearch and the file system crawler run as Windows services, you can start them from Computer Management > Services and Applications > Services.

In my setup I indexed a directory containing 150 GB of files of various types: doc, xls, txt, pdf, html. It took 3 hours to index 12 thousand files; if you do not index file contents, it is obviously much quicker.

Properties of indexed directory

After all files were successfully indexed, the result in Elasticsearch is as follows:

I used Kibana to display the results:
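
For example, a full-text search over the extracted file contents can be run from Kibana Dev Tools. fscrawler stores the extracted text in the “content” field; the search term below is just an illustration:

```
GET myjob/_search
{
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}
```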
