Elasticsearch
Characteristics
ES is built upon the Lucene search engine. Everything is stored in an inverted index. It features:
- high availability (HA)
- automatic index creation by generating a "mapping" (see the example below)
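As a minimal sketch (using the blogs index from these notes, with invented field values), indexing a first document into a non-existing index lets ES create the index and infer a mapping dynamically:
PUT blogs/_doc/1
{
  "content": "Ingest nodes preprocess documents",
  "published": "2018-01-01"
}
GET blogs/_mapping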
Terminology
- index: All documents live in an index. An index is roughly the same as a database, which means it is just a namespace.
- type: Before version 6, an index could have one or more types. A type was like a table, a collection of similar things. This notion of type is deprecated: in version 6 you still have to indicate one type (usually called _doc by convention), from version 7 the type becomes optional, and it will ultimately disappear from the ES jargon.
- document: A document is like a row. It is composed of field/value pairs (a field is akin to a column in an RDBMS).
- version: ES only keeps one version of a document. The version number is kept by ES for engineering purposes but should not be used in the application/business layer.
- mappings: Map fields to data types.
- analysis: The process of converting full text into terms for the inverted index.
- node: An instance of ES, usually one per machine.
- cluster: A set of nodes. You might separate nodes into distinct clusters because:
  - the usage/ownership/… of the data are different
  - the nodes are located in two different datacenters
- shards: By default each index is divided into 5 pieces called shards. This number is defined at index creation. A document lives on a single shard; ES tries to evenly distribute the documents of an index among all its shards. A shard is a single Lucene instance and should roughly be kept around 10 GB in size (see the sketch after this list).
- replica: Shards are replicated, usually kept in 2 copies (number_of_replicas = 1).
- segments: A shard is written on disk in multiple segment files.
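A sketch of how shard and replica counts are fixed at index creation (counts chosen for illustration), and how shard allocation can then be inspected:
PUT blogs
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}
GET _cat/shards/blogs?v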
Data types
- Simple:
  - text: full-text analyzed strings
  - keyword: exact values, not analyzed, used for sorting/aggregations
  - byte/short/integer/float/double: numeric values
  - date
  - boolean
  - ip
- Hierarchical: object, nested
- Range
- Specialized: geo_point, percolator (see the mapping sketch after this list)
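As a sketch (the products index and its field names are invented for illustration), a version-6 mapping combining several of these types could look like:
PUT products
{
  "mappings": {
    "_doc": {
      "properties": {
        "description": { "type": "text" },
        "sku": { "type": "keyword" },
        "price": { "type": "double" },
        "created": { "type": "date" },
        "client_ip": { "type": "ip" },
        "location": { "type": "geo_point" }
      }
    }
  }
}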
APIs
GET blogs/_search
{
"query": {
"match": {
"content": "ingest nodes"
}
}
}
PUT blogs
{
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text"
},
...
}
}
}
}
GET _analyze
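For example (analyzer and text chosen arbitrarily), a request body shows which terms the analysis produces:
GET _analyze
{
  "analyzer": "standard",
  "text": "Ingest nodes"
}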
GET _cluster/state
PUT blogs/_settings
{
"settings": {
"number_of_replicas": 0 (1)
}
}
(1) You can dynamically change the number of replicas but not number_of_shards.
POST _reindex
{
"source": {
"index": "blogs",
"query": {
...
}
},
"dest": {
"index": "blogs_fixed"
}
}
PUT _ingest/pipeline/fix_locales
{
"processors": [
{
"script": {
"source": """
if("".equals(ctx.locales)) {
ctx.locales = "en-en";
}
ctx.reindexBatch = 3;
"""
}
},
{
"split": {
"field": "locales",
"separator": ","
}
}
]
}
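Before using it, the pipeline can be tried on ad-hoc documents (the sample documents below are invented) with the simulate endpoint; the same pipeline can then be referenced from _reindex via "dest": { "pipeline": "fix_locales" }:
POST _ingest/pipeline/fix_locales/_simulate
{
  "docs": [
    { "_source": { "locales": "" } },
    { "_source": { "locales": "fr-fr,en-en" } }
  ]
}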
Node roles
- master-eligible: Only one node is the elected master of a cluster; it is the only one able to change the cluster state. You need an odd number of master-eligible nodes (quorum) to avoid split-brain situations (a node configuration sketch follows this list).
- data: Holds the shards and executes CRUD operations.
- ingest: Used to perform common data transformations and enrichments. Each task is represented by a processor. By default, the ingest functionality is enabled on every node.
  The ingest node is a push-based system: Beats can directly push data into it, but ingest nodes will not read from a message queue. For the latter case (and for more complex transformations), it is necessary to use Logstash.
- coordinating: Receives client requests. Every node is implicitly a coordinating node; it acts as a smart load balancer.
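As a sketch in the version-6 configuration style (elasticsearch.yml, values illustrative), a node holding data and running ingest pipelines but never eligible as master could be declared as:
node.master: false
node.data: true
node.ingest: true
A coordinating-only node is obtained by setting all three of these flags to false.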