Indexing File System and File Contents with Elasticsearch

Ahmet Kaftan
Published in cloudnesil
4 min read · Aug 31, 2019

I want to share my experience of using Elasticsearch to search over thousands of files and index gigabytes of content. The approach in this article is both vertically and horizontally scalable, so, thanks to Elasticsearch, you can apply it to billions of files and petabytes of content.

In my setup I used fscrawler (https://github.com/dadoonet/fscrawler) to collect data from the file system on Windows 10. I also run Elasticsearch on Windows 10.

Elasticsearch runs as a two-node cluster. The following configuration is for a single node; if you want to run more than one node, repeat the configuration on each node. Elasticsearch uses ports 9200 (HTTP) and 9300 (transport), so make sure these ports are allowed through the firewall.

Running Elasticsearch on Windows as system service

  1. Download elasticsearch: https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.3.0-windows-x86_64.zip
  2. Unzip it with your favourite unzip tool. This will create a folder called “elasticsearch-7.3.0” which we will refer to as %ES_HOME%.
  3. Uncomment and edit the following variables in the Elasticsearch configuration file “elasticsearch-7.3.0/config/elasticsearch.yml”:
- node.name: name of the node
- network.host: IP address the node binds to and serves on
- discovery.seed_hosts: list of hosts to contact for cluster discovery
- cluster.initial_master_nodes: names of the master-eligible nodes (these must match the node.name values)
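
As an illustration, a minimal configuration for the first node of a two-node cluster might look like the fragment below. The node names, IP addresses, and cluster name are placeholders; substitute your own values.

```yaml
# elasticsearch-7.3.0/config/elasticsearch.yml -- example values only
cluster.name: my-cluster            # optional, but should match on both nodes
node.name: node-1
network.host: 192.168.58.251
discovery.seed_hosts: ["192.168.58.251", "192.168.58.252"]
cluster.initial_master_nodes: ["node-1", "node-2"]
```

The second node would use its own node.name and network.host, with the same discovery.seed_hosts and cluster.initial_master_nodes lists.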

4. Edit the default Java heap size, which is -Xms1g and -Xmx1g, in “elasticsearch-7.3.0/config/jvm.options”. You should use the same value for the minimum and maximum.
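
For example, to give the node a 4 GB heap (the exact size depends on your hardware; keep min and max equal), the relevant lines in jvm.options would be:

```
# elasticsearch-7.3.0/config/jvm.options -- 4g is an example value
-Xms4g
-Xmx4g
```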

5. To create a Windows service for Elasticsearch, use the “elasticsearch-service.bat” script in the folder elasticsearch-7.3.0/bin. Run the command: “elasticsearch-service.bat install”. (The same script also supports the start, stop, manager, and remove commands.)

My Elasticsearch setup

Running File System Crawler

I used fscrawler to collect data about the file system and its contents.

  1. Download fscrawler from: https://oss.sonatype.org/content/repositories/snapshots/fr/pilato/elasticsearch/crawler/fscrawler-es7/2.7-SNAPSHOT/fscrawler-es7-2.7-20190724.065618-41.zip
  2. Unzip it with your favourite unzip tool. This will create a folder called “fscrawler-es7-2.7-SNAPSHOT”.
  3. Create a directory for fscrawler data; you will use this directory in the following steps. If you also want to store fscrawler logs, create a directory for them as well.
  4. Create a batch file that contains the following:
set JAVA_HOME=c:\Program Files\Java\jdk1.8.0_161
set FS_JAVA_OPTS=-Xmx6g -Xms6g
c:\fscrawler-es7-2.7-SNAPSHOT\bin\fscrawler.bat --config_dir c:\fscrawler-data myjob --loop -1 >> c:\fscrawler-log\fscrawler.log 2>&1

You may need to change the directories above to match your setup. The fscrawler job in this example is named “myjob”, and the “--loop -1” option makes fscrawler crawl continuously instead of exiting after one pass. The configuration directory must contain a subdirectory named after the job, in our case “myjob”.

5. Create the configuration for the job “myjob” in the “myjob” directory inside “config_dir”. The configuration file must be named “_settings.yaml”. Its content is as follows:

---
name: "myjob"
fs:
  url: "c:\\tmp\\es"
  update_rate: "5s"
  excludes:
  - "*/~*"
  json_support: false
  filename_as_id: false
  add_filesize: true
  remove_deleted: true
  add_as_inner_object: false
  store_source: false
  index_content: true
  attributes_support: false
  raw_metadata: false
  xml_support: false
  index_folders: true
  lang_detect: false
  continue_on_error: true
  ocr:
    language: "eng"
    enabled: true
    pdf_strategy: "ocr_and_text"
  follow_symlinks: false
elasticsearch:
  nodes:
  - url: "http://192.168.58.251:9200"
  bulk_size: 100
  flush_interval: "5s"
  byte_size: "10mb"
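
Once fscrawler is running against the cluster, you can check that documents are arriving. By default fscrawler writes file documents to an index named after the job, and (with index_folders enabled) folder documents to a second index with a “_folder” suffix. From Kibana Dev Tools, for example:

```
GET myjob/_count
GET myjob_folder/_count
```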

6. I used “nssm” to create a Windows service for fscrawler. Download it from: https://nssm.cc/release/nssm-2.24.zip

7. Unzip it with your favourite unzip tool. This will create a folder called “nssm-2.24”.

8. Go to the directory “nssm-2.24\win64” and run “nssm.exe install”

9. There will be a window for service parameters, just fill the boxes:

Nssm service create window
- Service name: name of your service.
- Application Tab>Application>Path: Path to your batch file to run fscrawler

If you have followed the steps above, you are ready to go. Since Elasticsearch and the file system crawler run as Windows services, you can start them from Computer Management > Services and Applications > Services.

In my setup I indexed a directory containing 150 GB of files of various types: doc, xls, txt, pdf, html. It took 3 hours to index 12 thousand files; if you do not index file contents, it is obviously much quicker.

Properties of indexed directory

After all files were successfully indexed, the result in Elasticsearch is as follows:

I used Kibana to display the results:
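
For example, a full-text search over the extracted file contents can be run from Kibana Dev Tools. fscrawler stores the extracted text in the “content” field; the search term below is just an illustration:

```
GET myjob/_search
{
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  }
}
```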
