Provisioning large search server farms in 30 minutes (or so…) with Ansible

Steph vanSchalkwyk
6 min read · Oct 10, 2018


I’ve standardized on Ansible for deployment scripting, after using a myriad of other tools. With the momentum behind Ansible growing, it is becoming ever simpler to deploy many software packages with it.

In this article, I will show some Ansible snippets I use to deploy Elasticsearch, Solr, Lucidworks Fusion and ManifoldCF (my document crawler of choice) to VMs on-premises, and in various cloud providers.

Ansible does have somewhat of a learning curve, and introduces some new concepts, but it is well within the grasp of technologists.

Installing an Elasticsearch Cluster (commercial with X-Pack, Logstash and Kibana):

The very first step is to install Ansible. I prefer to download the latest Ansible source from GitHub (https://github.com/ansible/ansible) and execute it from the command line. If you’re using Windows, download and install Cygwin (a UNIX-flavoured environment) from https://www.cygwin.com/. Create a folder for your Ansible playbook and open a terminal. Then source the Ansible settings with

$source /home/steph/Ansible/ansible/hacking/env-setup

Next, use “ssh-keygen” to create a new SSH key, as described at https://www.ssh.com/ssh/keygen/. This really helps in accessing the created servers, as it allows password-less server access. Saving keystrokes is what it’s all about.

Create your server VMs, bare-metal servers or cloud servers:

Use your platform provider of choice. Ansible works very well on Google Cloud Platform, AWS and many, many others.

Copy your key to the VMs: using $ssh-copy-id path-to-key username@server-ip, copy your public key to every server. Also make sure you can ssh to each server, and that you have sudo rights. This is important if you’re going to install to, say, “/opt”, “/etc” and “/var/log”. If you don’t have sudo rights, you can always install to your /home folder.
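With the keys copied, tell Ansible where those servers live. A minimal inventory sketch follows — the group names, hostnames and IPs here are placeholders, so adjust them to your own farm:

```yaml
# inventory/hosts.yml — hypothetical hosts; substitute your own names and IPs
all:
  children:
    es_nodes:
      hosts:
        es-node-1:
          ansible_host: 10.0.0.11
        es-node-2:
          ansible_host: 10.0.0.12
    index_nodes:
      hosts:
        mcf-node-1:
          ansible_host: 10.0.0.21
  vars:
    ansible_user: steph
```

A quick `ansible all -i inventory/hosts.yml -m ping` confirms the password-less access is working before you go any further.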

Now to Ansible.

Using Ansible is as simple as creating a folder and a couple of subfolders.

My particular Ansible folder structure is for an Elasticsearch-X-Pack-ManifoldCF-Kibana-Logstash install. Yours can and will most likely vary.

Notice how I numbered every Ansible yml playbook file. It does help to order things.
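A layout along these lines works well (the directory and file names below are illustrative, not my exact tree):

```
search-farm/
├── inventory/
├── group_vars/
├── roles/
├── playbooks/
│   ├── 01-download.yml
│   ├── 02-hosts.yml
│   ├── 03-firewall.yml
│   ├── 04-elasticsearch.yml
│   └── 05-manifoldcf.yml
└── site.yml
```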

Now, in the playbook directory, create your Ansible playbook files. These are YAML (*.yml) files. Find a good YAML editor — I use Visual Studio Code on both Linux (preferred) and Windows (not so much). Either way, it has some excellent linting tools for YAML.

When writing your playbook YAML files, you’ll find yourself accessing Ansible’s examples a number of times. I prefer to use Google to find items on Ansible’s website, if only to save time. So you’ll find yourself typing “Ansible keyword” into the Google search bar quite a lot.

When you’re first debugging your Ansible scripts, you’ll find a throwaway VM to be essential. Test, test, and when you’ve messed it up past the point of no return, just recreate it. I use ssh to create VMware, Oracle VirtualBox, GCP, AWS and other instances. That’s the topic of another article, though.

If your VMs don’t have internet access, things are somewhat more complex, but still entirely doable. In that case, download the repositories to your Ansible host and copy them over to the guests (VMs) from there.

A short snippet may help:

---
- hosts: localhost
  # Download installation files to the host where VMs do not have www access
  tasks:
    - name: Download x-pack from url
      get_url: url={{ es_xpack_custom_url }} dest=/tmp/x-pack-{{ es_version }}.zip
      when: (x_pack_installed.rc == 1 or es_version_changed) and (es_enable_xpack and es_xpack_custom_url is defined)
etc.

This downloads everything you need from the internet and places it on the localhost — the deployment server/desktop.
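The companion play then pushes the cached archive out to the guests. A sketch, assuming the same `/tmp` paths and an `es_nodes` inventory group:

```yaml
- hosts: es_nodes
  become: true
  tasks:
    # The archive was fetched to the deployment host above; copy runs host -> guest
    - name: Copy x-pack archive to guests without www access
      copy:
        src: "/tmp/x-pack-{{ es_version }}.zip"
        dest: "/tmp/x-pack-{{ es_version }}.zip"
```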

Then set the IP address and hostnames, and add the other server names/IPs to the /etc/hosts file. This is not necessary if you have working DNS, and is also frowned upon, as it hard-codes a server name to an IP. If you don’t have DNS in place, though, hosts is the way to go. (Unless your DevOps team says otherwise. And their word is from up high. Or very high, depending upon your organisation.)
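The /etc/hosts step can be done with Ansible’s lineinfile module. A sketch, assuming an `es_nodes` group with `ansible_host` set per host:

```yaml
- hosts: all
  become: true
  tasks:
    # Skip this play entirely if you have working DNS
    - name: Add cluster nodes to /etc/hosts
      lineinfile:
        path: /etc/hosts
        line: "{{ hostvars[item].ansible_host }} {{ item }}"
        state: present
      loop: "{{ groups['es_nodes'] }}"
```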

Set up your firewall. Make sure you allow port 22 (ssh) and 9200/9300 (or whichever ports you’ve configured for Elasticsearch to be accessed and to talk among themselves). I prefer to have two virtual NICs, if only to keep 9300 (the background, chatty port) away from anything remotely public (9200, 443, 80). Better safe, as the saying has it.
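On a firewalld-based distro, those port openings might be expressed like this (adjust the port list to your own configuration):

```yaml
- hosts: es_nodes
  become: true
  tasks:
    # permanent + immediate: survives reboots and applies right away
    - name: Open ssh and Elasticsearch ports
      firewalld:
        port: "{{ item }}"
        permanent: true
        immediate: true
        state: enabled
      loop:
        - 22/tcp
        - 9200/tcp
        - 9300/tcp
```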

Then add the Elasticsearch user. I prefer a user “elastic” or “elasticsearch”, or anything obfuscated if your organisation dictates it. Ansible’s Elasticsearch role will add the required user rights, so just create a quick-and-dirty basic user. You may or may not create a /home folder for it. It is probably better not to — ask your DevSec people. If you have to deploy to the /home folder, then of course it has to be there.
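Creating that quick-and-dirty user is a one-task affair. A sketch — whether to give it a home folder is your DevSec call, as above:

```yaml
- hosts: es_nodes
  become: true
  tasks:
    # System account, no home folder, no interactive login
    - name: Add a basic elasticsearch user
      user:
        name: elasticsearch
        system: true
        create_home: false
        shell: /usr/sbin/nologin
```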

Install Elasticsearch. For this I use Elastic’s very own Ansible role (https://github.com/elastic/ansible-elasticsearch) and configure my playbook accordingly. I also download and build Elasticsearch from scratch, making sure all X-Pack plugins are built. More about that later.
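Wiring Elastic’s role into a playbook looks roughly like this — the variable names follow the role’s README, but the version and config values here are illustrative only:

```yaml
- hosts: es_nodes
  become: true
  roles:
    - role: elastic.elasticsearch
  vars:
    es_version: "6.4.2"          # example version; pin to whatever you test against
    es_config:
      cluster.name: "search-farm"
      network.host: "0.0.0.0"
      http.port: 9200            # client traffic
      transport.tcp.port: 9300   # inter-node traffic
```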

Install NGINX — or whatever you’re using to proxy/load-balance your cluster. I have found NGINX robust and very capable, and many internet sites swear by it. Again, ask your DevOps/DevSec people; they may have other ideas, such as F5.
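The NGINX side is a plain reverse-proxy/load-balancer block. A minimal sketch — server names and certificate paths here are hypothetical:

```nginx
upstream elasticsearch {
    server es-node-1:9200;
    server es-node-2:9200;
}

server {
    listen 443 ssl;
    # Hypothetical Let's Encrypt paths; substitute your own certificates
    ssl_certificate     /etc/letsencrypt/live/example.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/example.com/privkey.pem;

    location / {
        proxy_pass http://elasticsearch;
    }
}
```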

Install your SSL/TLS certificates. That is something your DevSec people will definitely have to be involved in. This allows NGINX to make port 443 (SSL) available to the outside world. I usually use SSL internally as well, but if your firewall is locked down very tight, and your DevSec people will sign off on it, by all means skip that.

Certificates: I’ve taken to using certificates from Let’s Encrypt (https://letsencrypt.org/). These have to be renewed every 90 days, but it’s easy enough to script the renewal process. If you’re part of a business, you’ll most likely use the business’ certificates.
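Scripting the renewal can be as simple as dropping a cron entry via Ansible. A sketch, assuming certbot is installed on the proxy host:

```yaml
- hosts: proxy_nodes
  become: true
  tasks:
    # certbot only renews certificates that are close to expiry, so a
    # monthly run comfortably covers the 90-day window
    - name: Renew Let's Encrypt certificates monthly
      cron:
        name: "certbot renew"
        minute: "0"
        hour: "3"
        day: "1"
        job: "certbot renew --quiet && systemctl reload nginx"
```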

If you’re using ManifoldCF as a web/file/jdbc crawler (as well you should), you have to install PostgreSQL. I have had excellent results with version 10, the latest at the time of writing. Make sure you follow all the Postgres configuration changes in the ManifoldCF setup. You’ll be glad you did. The changes from default settings are:

postgresql_global_config_options:
  - option: unix_socket_directories
    value: '{{ postgresql_unix_socket_directories | join(",") }}'
  - option: standard_conforming_strings
    value: 'on'
  - option: shared_buffers
    value: '1024MB'
  # max_wal_size = (3 * checkpoint_segments) * 16MB
  # checkpoint_segments=300
  - option: max_wal_size
    value: '14400MB'
  - option: min_wal_size
    value: '80MB'
  - option: maintenance_work_mem
    value: '2MB'
  - option: listen_addresses
    value: '*'
  - option: max_connections
    value: '400'
  - option: checkpoint_timeout
    value: '900'
  - option: datestyle
    value: "iso, mdy"
  - option: autovacuum
    value: 'off'
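These options slot straight into a PostgreSQL role. I’m assuming geerlingguy.postgresql from Galaxy here, which exposes exactly this `postgresql_global_config_options` variable; any role with an equivalent hook works the same way:

```yaml
- hosts: index_nodes
  become: true
  roles:
    - role: geerlingguy.postgresql
  vars:
    postgresql_version: 10
    postgresql_global_config_options:
      - option: shared_buffers
        value: '1024MB'
      # ...plus the remaining ManifoldCF options from the list above
```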

You’re at ManifoldCF. Thank you for sticking with me so long. Quick blurb for ManifoldCF: It is a very powerful web/file/jdbc/sharepoint/younameit document/record ingest engine. It is an Apache project and has a very active community. So you know it’s good.

A short extract from my ManifoldCF Ansible script follows. There isn’t an Ansible MCF role in Galaxy yet, so I’ve written my own.

---
# Copyright © 2018 by S. van Schalkwyk/remcam llc
# All rights reserved. No part of this publication may be reproduced,
# distributed, or transmitted in any form or by any means,
# including photocopying, recording, or other electronic or
# mechanical methods, without the prior written permission of the publisher.
# This code is not shared with clients in any form or format.
- hosts: index_nodes
  gather_facts: no
  become: true
  tasks:
    - name: Create MCF directory
      file:
        path: /opt/manifoldcf
        state: directory
    - name: Create MCF logging directory
      file:
        path: /var/log/manifoldcf
        state: directory
    - name: Copy MCF files to guest
      copy:
        src: "{{ inventory_dir }}/roles/manifoldcf_single"
        dest: "/opt/manifoldcf"
        remote_src: false
        directory_mode: true
etc.

Set up MCF to run as a service under systemd/upstart, or whatever service startup your servers use.
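On a systemd host, a minimal unit file might look like this — the user, paths and start script are assumptions based on the /opt/manifoldcf layout above, so adjust them to your install:

```ini
# /etc/systemd/system/manifoldcf.service — sketch only
[Unit]
Description=Apache ManifoldCF
After=network.target postgresql.service

[Service]
User=manifoldcf
WorkingDirectory=/opt/manifoldcf/example
ExecStart=/opt/manifoldcf/example/start.sh
Restart=on-failure

[Install]
WantedBy=multi-user.target
```

Then `systemctl daemon-reload && systemctl enable --now manifoldcf` brings it up and keeps it up across reboots.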

Everything should now be running. Point your browser at either http://your_elasticsearch:9200 (or https:// if you’ve installed certificates), or at the NGINX port that’s proxying to the Elasticsearch servers.
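That final smoke test can live in the playbook itself, using Ansible’s uri module (host and port as configured above):

```yaml
- hosts: localhost
  tasks:
    # _cluster/health returns green/yellow/red for the whole cluster
    - name: Check Elasticsearch cluster health
      uri:
        url: "http://your_elasticsearch:9200/_cluster/health"
        return_content: true
      register: health

    - name: Show cluster status
      debug:
        msg: "{{ (health.content | from_json).status }}"
```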

Good searching!

PS. I’ll expand this later. Feel free to contact me for commercial installations, short questions, sales leads. Especially sales leads. I love those.
