Elastic Search

Characteristics

EL is built upon the Lucene search engine. Everything is stored in an inversed index. It features:

HA
automatic index creation by generating a "mapping"

Terminology

index

All documents live in an index. An index is roughtly the same as a database which mean it is just a namespace.

type

Before version 6, a index could have on or more type. A type was like a table, a collection of similar thing. This notion of type is totally deprecated. In version 6, you still have to indicate one type (usually called _doc by convention. From version 7, the type would be optional and it will ultimately disappear from EL jargon.

document

A document is like a row. It is composed of field/value (a field is alike a column in RDB).

version

ES only keeps one version of a document. The version number is kept by ES for engineering purpose but should not be used in the applicative/business layer.

mappings

Map fields with data types.

analysis

process of converting full text into terms for the inverted index

node

An instance of EL. Usually one per machine.

cluster

A set of nodes. You might separate nodes into cluster because:

the usage/ownership/… of the data are different
the nodes are located in two different datacenter

shards

By default each index is divided in 5 pieces called shards. This number is defined at index creation. A document will live on a single shard. EL tries to evenly distribute documents within an index among all the shards.

A shard is a single instance of Lucene and would roughly reach for a size of about 10G.

replica

Shards are replicated usually by 2 (number of replica = 1)

segments

A shard is written on disks in multiple segment files.

Data types

Simple
- text: full text analyzed strings
- keyword: sorting/aggregation of exact values (not analyzed).
- byte/short/integer/float/double: numeric value
- date
- boolean
- ip
Hierarchical: object, nested
Range
Specialized: geo_point, percolator

APIs

Index search

GET blogs/_search
{
  "query": {
    "match": {
      "content": "ingest nodes"
    }
  }
}

Create your own mapping

PUT blogs
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text"
        },
        ...
      }
    }
  }
}

Test the analyzer

GET _analyze

Cluster

GET _cluster/state

Update index settings

PUT blogs/_settings
{
  "settings": {
    "number_of_replicas": 0 (1)
  }
}

1	you can dynamically change the number of replicas but not the number_of_shards

Reindex

POST _reindex
{
  "source": {
    "index": "blogs",
      "query": {
        ...
      }
  },
  "dest": {
    "index": "blogs_fixed"
  }
}

Ingest

PUT _ingest/pipeline/fix_locales
{
  "processors": [
    {
      "script": {
        "source": """
if("".equals(ctx.locales)) {
  ctx.locales = "en-en";
}
ctx.reindexBatch = 3;
"""
      }
    },
    {
      "split": {
        "field": "locales",
        "separator": ","
      }
    }
  ]
}

Node roles

master eligible

Only one master node per cluster. It is the sole capable of changing the cluster state. You need an odd number of eligible master nodes (quorum) to avoid split brains.
data

Hold the shards and execute CRUD operations.
ingest

Use to perform common data transformation and enrichments. Each task is represented by a processor. By default, the ingest functionality is enabled on any node.

The ingest node is a push-based system. Beats can directly push data into it but ingest nodes will not read from a message queue. For the latter case (and for more complex transformation), it’s necessary to use Logstash.
coordinator

Receive client request. Every node is implicitly a coordinating node. Act as smart load balancers.

Elastic Search

Characteristics

Terminology

Data types

APIs

Node roles

Cluster management

shard filtering

shard allocation awareness

Application Performance Monitoring (APM)