Log Analytics With Amazon Elasticsearch

Learn how to configure a secure, petabyte-scale Elasticsearch Service cluster and build Kibana dashboards to analyze your data.

Here is my breakdown of a really informative deep dive from Nathan Peck, Developer Advocate at AWS, presented last week at the AWS San Francisco Summit. His talk highlighted how Expedia, MirrorWeb, and the Financial Times use Elasticsearch clusters to analyze data.

Amazon Elasticsearch Service lets users deploy, secure, operate, and scale Elasticsearch for log analytics, full-text search, and application monitoring. It is interesting to see where Elasticsearch fits into the databases and analytics portfolio from AWS.

Business Intelligence & Machine Learning

  • QuickSight
  • SageMaker
  • Comprehend

Relational Databases

  • Aurora
  • RDS

Non-Relational Databases

  • DynamoDB
  • ElastiCache (Redis, Memcached)
  • Neptune (Graph)

Analytics

  • Redshift
  • EMR
  • Athena

Real-Time

  • Elasticsearch Service
  • Kinesis Data Analytics

Data Lake

  • S3
  • Amazon Glacier
  • AWS Glue (ETL & Data Catalog)

Data Movement

  • Database Migration Service
  • Snowball
  • Snowmobile
  • Kinesis Data Firehose
  • Kinesis Data Streams

Database Characteristics:

Relational — Referential integrity with strong consistency, transactions, and hardened scale (Complex query support via SQL)

Key-value — Low-latency key based queries with high throughput and fast ingestion of data (Simple query with filters)

Document — Indexing and storing documents with support for queries on any property (Simple query with filters, projections, and aggregates)

Graph — Creating and navigating relations between data easily and quickly (Easily express queries in terms of relations)

Time Series — Time-stamped data with large range scans for summarization and processing (Simple query with computational support for summarized results)

Elasticsearch’s Purpose:

  • Text Search: Natural language, boolean queries, relevance
  • Streaming Data: High-volume ingest, near real time, distributed storage
  • Analysis: Time-based visualizations, nestable statistics, time series tools


Amazon Elasticsearch Service’s Storage Layer
Each log line or other event constitutes a search document.
Key Idea: Log lines contain fields

Send JSON to Elasticsearch, with fields and values
{
  "host": "199.72.81.55",
  "verb": "GET",
  "request": "GET /history/apollo/ HTTP/1.0",
  "@timestamp": "1995-07-01T00:00:01",
  "timezone": "-0400",
  "ident": "-",
  "authuser": "-",
  "response": 200,
  "bytes": 6245
}
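As a quick illustration, here is a minimal Python sketch that parses a log line in the shape of the sample above into such a document. The log format and field mapping are assumptions based on the sample, not part of the talk (real Apache logs use timestamps like `[01/Jul/1995:00:00:01 -0400]`):

```python
import re

# Pattern for the ISO-timestamped access-log shape used in the sample above
# (a hypothetical format chosen to match the example document)
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<authuser>\S+) '
    r'\[(?P<timestamp>[^\]]+) (?P<timezone>[+-]\d{4})\] '
    r'"(?P<request>[^"]*)" (?P<response>\d{3}) (?P<bytes>\d+)'
)

def log_line_to_document(line):
    """Turn one access-log line into a JSON-ready document for Elasticsearch."""
    fields = LOG_PATTERN.match(line).groupdict()
    return {
        "host": fields["host"],
        "verb": fields["request"].split(" ")[0],
        "request": fields["request"],
        "@timestamp": fields["timestamp"],
        "timezone": fields["timezone"],
        "ident": fields["ident"],
        "authuser": fields["authuser"],
        "response": int(fields["response"]),
        "bytes": int(fields["bytes"]),
    }

line = '199.72.81.55 - - [1995-07-01T00:00:01 -0400] "GET /history/apollo/ HTTP/1.0" 200 6245'
doc = log_line_to_document(line)
```

The resulting dictionary can then be serialized with json.dumps and sent to the cluster's document API.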

Lucene creates and stores an index for each field. During analysis, each document's field values are broken into terms; the field index maps each term to a posting list of the document IDs that contain it (for example: 1, 4, 8, 12, 30, 42, 58, 100, ...).

Field indices are managed by shards, which are organized into API-level indices such as a logs index. Each index has primary shards and replica shards, and Elasticsearch assigns the shards to the instances in your cluster.

Storage required:
• On disk, indices are ~10% larger than the source data
• Each replica adds an additional 1x storage requirement
• You choose the per-instance storage
Example: with one replica and 10% inflation, a 1 TB corpus needs 2.2 TB of storage. Choose 1.5 TB of EBS per instance, and you need 2 instances.

Shards as units of storage
• Set primary shard count based on storage: ~40 GB per primary (~90 GB for I3 instances)
• Always use at least 1 replica in production
• Keep shard sizes as equal as possible
Example: set shard count = 50 for a 2 TB corpus (2 TB / 40 GB = 50 shards)
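The two sizing examples above can be sketched as a small back-of-the-envelope calculator. The defaults mirror the rules of thumb in the talk; they are illustrative, not official AWS guidance:

```python
import math

def cluster_sizing(corpus_tb, replicas=1, inflation=0.10,
                   per_instance_tb=1.5, gb_per_primary=40):
    """Estimate storage, instance count, and primary shard count."""
    # Source data + ~10% index inflation, multiplied for each replica copy
    storage_tb = corpus_tb * (1 + inflation) * (1 + replicas)
    instances = math.ceil(storage_tb / per_instance_tb)
    # ~40 GB per primary shard (use ~90 for I3 instances)
    primary_shards = math.ceil(corpus_tb * 1000 / gb_per_primary)
    return storage_tb, instances, primary_shards

storage, instances, _ = cluster_sizing(1)   # 1 TB corpus -> 2.2 TB, 2 instances
_, _, shards = cluster_sizing(2)            # 2 TB corpus -> 50 primary shards
```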

Users can build an ingest pipeline with stages such as: data source, collect, transform, buffer, and deliver.

Organize data in daily indexes
• On ingest, create indexes with a root string, e.g., logs_
• Depending on volume, rotate at regular intervals — normally daily
• Daily indexes simplify index management. Delete the oldest index to create more space on your cluster.

logs_01.23.2018 through logs_01.29.2018
Use templates to set shard count
PUT <endpoint>/_template/template1
{
  "index_patterns": ["movies*"],
  "settings": {
    "number_of_shards": 50,
    "number_of_replicas": 1
  }
}
All new indexes that match the index pattern receive the settings (note: ES 6.0+ syntax).
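The daily rotation described above can be sketched in a few lines of Python; the retention policy and helper names are mine, for illustration:

```python
from datetime import date, timedelta

def daily_index_name(day, root="logs_"):
    """Build a daily index name in the logs_01.23.2018 style used above."""
    return root + day.strftime("%m.%d.%Y")

def indices_to_delete(existing, today, retention_days=7):
    """Return the daily indices that fall outside the retention window."""
    keep = {daily_index_name(today - timedelta(days=n)) for n in range(retention_days)}
    return sorted(name for name in existing if name not in keep)

week = [daily_index_name(date(2018, 1, 23) + timedelta(days=n)) for n in range(7)]
stale = indices_to_delete(week, today=date(2018, 1, 31), retention_days=7)
```

Each name in `stale` would then be removed through the cluster's delete-index API to free space.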

Amazon Elasticsearch Service’s Query Engine

Query Distribution
A query (for example, status:500) against the logs index arrives at a coordinator instance, which fans it out to the shards on the cluster's instances. Each shard computes and returns a result to the coordinator, which aggregates the final result.

Query Processing
Each term in a field index points to a posting list of document IDs (for example, 1, 4, 8, 12, 30, 42, 58, 100). To resolve a query, the posting lists of the matching terms are intersected into the set of matching documents (1, 12, 58, 100), which is then scored and ordered into the result (58, 12, 1, 100).

Aggregations analyze field values to compute statistics and build visualizations: the field data of the matching documents (GET, GET, POST, GET, PUT, GET, GET, POST) is grouped into buckets (GET, POST, PUT) with counts (5, 2, 1).
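A toy Python version of this flow, just to make the mechanics concrete (the second posting list is made up for illustration): intersect posting lists, then count buckets the way a terms aggregation would.

```python
from collections import Counter

def intersect(*posting_lists):
    """Intersect posting lists of document IDs into the sorted set of matches."""
    matches = set(posting_lists[0])
    for plist in posting_lists[1:]:
        matches &= set(plist)
    return sorted(matches)

# Posting lists for two query terms
matches = intersect([1, 4, 8, 12, 30, 42, 58, 100], [1, 12, 58, 100, 115])

# A terms aggregation is essentially bucketing field values and counting them
buckets = Counter(["GET", "GET", "POST", "GET", "PUT", "GET", "GET", "POST"])
```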


Visualize Your Data

Case study: MirrorWeb
Full-text search — PROBLEM: make the UK Government and UK Parliament's web archives searchable. A large-scale ingestion scenario: 120 TB of data (1.2 MM 100-MB files) in WARC format, with duplicates and bad data.

SOLUTION: S3 (Storage)
BENEFITS

  • Scalability: started on a 9-node r4.4xlarge cluster for fast ingest, then reduced to 6 r4.xlarge instances for search. Able to reconfigure the cluster with no downtime
  • Cost effective: Indexed 1.4 billion documents for $337
  • Fast: 146 MM docs per hour indexed. 14x faster than the previous best for this data set (using Hadoop)

Services Used: EC2 (Filtering), EC2 (Extraction), ES (Search)
For more on this case, see http://tinyurl.com/ybqwbolq

Case study: Financial Times (Business and Clickstream Analytics)

PROBLEM:
What stories do our readers care about? What's hot?
FT required a custom clickstream analytics solution that delivers analytics in real time, but did not have a team to manage analytics infrastructure.

SOLUTION: stream user data to Amazon ES for analysis. FT created custom dashboards using "Lantern", which "shines a light" on reader activity for the editors and journalists at the FT. It is a critical tool for making editorial decisions; daily editorial meetings start by looking at the Lantern dashboard.

BENEFITS

  • Reliability: Lantern is used throughout the day by journalists and editors, and FT relies on Amazon to manage the underlying systems for maximum uptime.
  • Cost savings: Able to easily tune their cluster to meet their needs with minimal management overhead

Benefits of Amazon Elasticsearch Service:

  • Supports open-source APIs and tools: a drop-in replacement with no need to learn new APIs or skills
  • Easy to use: deploy a production-ready Elasticsearch cluster in minutes
  • Scalable: resize your cluster with a few clicks or a single API call
  • Secure: deploy into your VPC and restrict access using security groups and IAM policies
  • Highly available: replicate across Availability Zones, with monitoring and automated self-healing
  • Tightly integrated with other services: seamless data ingestion, security, auditing, and orchestration

Service Architecture:
SDK, CLI, CloudFormation, IAM, Elastic Load Balancing, Elasticsearch data nodes and master nodes, CloudTrail, and CloudWatch

Security
  • Public endpoints: IAM
  • Private endpoints: IAM and security groups
  • Encryption

  • Use three dedicated master instances in production; master instances orchestrate the cluster and make it more stable
  • Use zone awareness in production; 100% data redundancy across two zones makes your cluster more highly available

Set CloudWatch metrics and alarms:

  Metric                                Statistic   Threshold                   Periods
  ClusterStatus.red                     Maximum     >= 1                        1
  ClusterIndexWritesBlocked             Maximum     >= 1                        1
  CPUUtilization/MasterCPUUtilization   Average     >= 80%                      3
  JVMMemoryPressure/Master…             Maximum     >= 80%                      3
  FreeStorageSpace                      Minimum     <= 25% of available space   1
  AutomatedSnapshotFailure              Maximum     >= 1                        1
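As one hedged example of wiring these up, the first alarm above could be expressed as `put_metric_alarm` parameters for boto3. The domain name, account ID, and SNS topic are placeholders I made up; the `AWS/ES` namespace and `DomainName`/`ClientId` dimensions are how Amazon ES publishes metrics:

```python
def red_cluster_alarm_params(domain, account_id, sns_topic_arn):
    """Parameters for a ClusterStatus.red >= 1 (1 period) CloudWatch alarm."""
    return {
        "AlarmName": domain + "-cluster-status-red",
        "Namespace": "AWS/ES",
        "MetricName": "ClusterStatus.red",
        "Dimensions": [
            {"Name": "DomainName", "Value": domain},
            {"Name": "ClientId", "Value": account_id},
        ],
        "Statistic": "Maximum",
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": 1,
        "Period": 60,
        "EvaluationPeriods": 1,
        "AlarmActions": [sns_topic_arn],
    }

params = red_cluster_alarm_params("my-domain", "123456789012",
                                  "arn:aws:sns:us-west-2:123456789012:es-alerts")
# boto3.client("cloudwatch").put_metric_alarm(**params)  # requires boto3 + credentials
```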

Monitor Elasticsearch slow logs
• Easy console setup
• Integrated with CloudWatch Logs
• Set thresholds to receive log events corresponding to slow queries and slow indexing

Queries and updates hit the Amazon ES domain, which emits slow query logs and slow indexing logs to CloudWatch Logs.

• index.search.slowlog.threshold.query.warn
• index.search.slowlog.threshold.query.info
• index.search.slowlog.threshold.query.debug
• index.search.slowlog.threshold.query.trace
• index.search.slowlog.threshold.fetch.warn
• index.search.slowlog.threshold.fetch.info
• index.search.slowlog.threshold.fetch.debug
• index.search.slowlog.threshold.fetch.trace
• index.indexing.slowlog.threshold.index.warn
• index.indexing.slowlog.threshold.index.info
• index.indexing.slowlog.threshold.index.debug
• index.indexing.slowlog.threshold.index.trace
• index.indexing.slowlog.level: trace
• index.indexing.slowlog.source: 255
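These thresholds are set per index through the `_settings` API. Here is a sketch of a settings body; the threshold values themselves are illustrative choices of mine, not recommendations from the talk:

```python
import json

# Illustrative slow-log thresholds; applied with
# PUT <endpoint>/<index>/_settings and this JSON body
slowlog_settings = {
    "index.search.slowlog.threshold.query.warn": "10s",
    "index.search.slowlog.threshold.query.info": "5s",
    "index.search.slowlog.threshold.fetch.warn": "1s",
    "index.indexing.slowlog.threshold.index.warn": "10s",
    "index.indexing.slowlog.level": "info",
    "index.indexing.slowlog.source": "255",
}
body = json.dumps(slowlog_settings)
```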

With Amazon Elasticsearch Service you pay only for what you use:

  • Instance hours: for data and master instances
  • EBS GB/month: for volumes deployed
  • AWS data transfer: for transfer out

Amazon Elasticsearch Service usage at Expedia

  • 175+ Amazon ES clusters
  • 500+ EC2 instances
  • 40B+ documents
  • 35+ TB of data

Why did Expedia choose Amazon Elasticsearch Service?

  • Easy to set up
  • Set up for high availability
  • Security

Elasticsearch access policy example
{
  "Effect": "Allow",
  "Principal": { "AWS": "arn:aws:iam::xxxxx:root" },
  "Action": "es:*",
  "Resource": "arn:aws:es:us-west-2:xxxxx:domain/xxxxx/*"
},
{
  "Effect": "Allow",
  "Principal": { "AWS": "*" },
  "Action": "es:Http*",
  "Resource": "arn:aws:es:us-west-2:xxxx:domain/xxxxx/*",
  "Condition": {
    "IpAddress": {
      "aws:SourceIp": [ "0.0.0.0/28" ]
    }
  }
}

Monitoring & backups

Different log analytics architectures using Elasticsearch Service at Expedia:
• Docker startup logs to Elasticsearch
• CloudTrail log analytics using Elasticsearch Service
• Distributed tracing platform using Elasticsearch Service

Docker startup logs to Elasticsearch
Amazon ECS → the ECS agent configures the Docker fluentd log driver → Docker sends container startup logs to fluentd → fluentd forwards the logs to Elasticsearch → Kibana visualizes the logs.

Docker fluentd log_driver configuration
{
  "log_driver": "fluentd",
  "options": {
    "fluentd-address": "<fluentd>:24224",
    "tag": "#{ImageName}"
  }
}

fluentd configuration to receive Docker logs
<source>
  @type forward
  port 24224
  bind 0.0.0.0
</source>

fluentd to ES configuration
<match *.**>
  @type copy
  <store>
    @type elasticsearch
    host <elasticsearch domain>
    include_tag_key true
    tag_key @log_name
    flush_interval 1s
  </store>
</match>

CloudTrail log analytics using Elasticsearch
AWS CloudTrail delivers logs to Amazon S3; each objectCreate event sends a message to SNS, which triggers a Lambda function. The function reads the log from S3 and stores it in Elasticsearch, where Kibana is used to visualize the logs and create dashboards.

CloudTrail logs from S3 to Elasticsearch
# s3 (boto3 S3 client), es (Elasticsearch client), logger, s3Bucket, and
# s3ObjectKey are initialized elsewhere in the Lambda handler
try:
    response = s3.get_object(Bucket=s3Bucket, Key=s3ObjectKey)
    content = gzip.GzipFile(fileobj=StringIO(response['Body'].read())).read()
    for record in json.loads(content)['Records']:
        recordJson = json.dumps(record)
        logger.info(recordJson)
        indexName = 'ct-' + datetime.datetime.now().strftime("%Y-%m-%d")
        res = es.index(index=indexName, doc_type='record',
                       id=record['eventID'], body=recordJson)
        logger.info(res)
    return True
except Exception as e:
    logger.error(e)
    return False

How did Expedia use this CloudTrail log data that is in Elasticsearch?
Top 10 API calls (per 10 mins) dashboard

This solution is open-sourced as a serverless application at https://github.com/ExpediaDotCom/cloudtrail-log-analytics

Distributed tracing platform using Elasticsearch
Microservices instrumented with a Java library emit telemetry: trace spans are pushed to Kinesis. A span collector (a Java app) reads from Kinesis, transforms the data, and pushes it to Kafka. A second span collector (also a Java app) reads from Kafka and pushes transactional logs and metadata to Elasticsearch, where the Node/Black Box UI retrieves span and meta data.

Haystack document in Elasticsearch
{
  "_index": "transactions-2017-10-26-04",
  "_type": "transactions_logs",
  "_id": "AV9W9lCq5Jdrc3Uj0vCg",
  "_score": null,
  "_source": {
    "transactionid": "5e66cad8-d7ea-49e8-94c8-24d2298d4cdc"
  },
  "fields": {
    "startTime": [
      1508992503395
    ]
  },
  "sort": [
    1508992503395
  ]
}

How is Elasticsearch used in distributed tracing?

  • Time-based queries for traces
  • Filtering traces by services
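A sketch of the kind of query body those two access patterns imply. The `startTime` field follows the Haystack document above; the `service` field name is my assumption for illustration:

```python
def trace_query(service, start_ms, end_ms, size=50):
    """Time-bounded trace search filtered by service name."""
    return {
        "size": size,
        "sort": [{"startTime": "desc"}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"service": service}},
                    {"range": {"startTime": {"gte": start_ms, "lte": end_ms}}},
                ]
            }
        },
    }

q = trace_query("checkout-service", 1508992500000, 1508992600000)
```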

Things to keep in mind
• Scaling the cluster results in a new cluster being provisioned, with the data synchronized over to it
• You must monitor and optimize the cluster yourself
• No upgrade button between Elasticsearch versions (wish list)
• Monitoring doesn't show how much disk space is in use (wish list)


copyright 2018 — Amazon Web Services
