Elasticsearch
Characteristics
ES is built upon the Lucene search engine. Everything is stored in an inverted index. It features:
- high availability (HA)
- automatic index creation by generating a "mapping"
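For example, dynamic mapping in action (a minimal sketch, assuming a blogs index that does not exist yet and illustrative field values): indexing a first document auto-creates the index, and the generated mapping can then be inspected.
PUT blogs/_doc/1
{
  "content": "Ingest nodes are useful",
  "published": "2018-05-01"
}

GET blogs/_mapping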
Terminology
- index: All documents live in an index. An index is roughly the same as a database, which means it is just a namespace.
- type: Before version 6, an index could have one or more types. A type was like a table, a collection of similar things. This notion of type is now deprecated. In version 6, you still have to indicate one type (usually called _doc by convention). From version 7, the type is optional and it will ultimately disappear from ES jargon.
- document: A document is like a row. It is composed of field/value pairs (a field is similar to a column in a relational database).
- version: ES only keeps one version of a document. The version number is kept by ES for engineering purposes but should not be used in the application/business layer.
- mappings: Map fields to data types.
- analysis: The process of converting full text into terms for the inverted index.
- node: An instance of ES. Usually one per machine.
- cluster: A set of nodes. You might separate nodes into clusters because:
  - the usage/ownership/… of the data are different
  - the nodes are located in two different datacenters
- shards: By default each index is divided into 5 pieces called shards. This number is defined at index creation (see the index-creation sketch after this list). A document lives on a single shard. ES tries to evenly distribute the documents of an index among all its shards. A shard is a single instance of Lucene and typically targets a size of about 10 GB.
- replica: Shards usually exist in two copies (number_of_replicas = 1).
- segments: A shard is written to disk as multiple segment files.
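A minimal index-creation sketch (the values below are illustrative): the number of primary shards is fixed when the index is created, while the number of replicas can be changed afterwards (see the _settings call in the APIs section).
PUT blogs
{
  "settings": {
    "number_of_shards": 5,
    "number_of_replicas": 1
  }
}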
Data types
- Simple (see the mapping sketch after this list)
  - text: full-text, analyzed strings
  - keyword: exact values for sorting/aggregation (not analyzed)
  - byte/short/integer/float/double: numeric values
  - date
  - boolean
  - ip
- Hierarchical: object, nested
- Range
- Specialized: geo_point, percolator
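As an illustration of text vs keyword (the field names here are assumed for the example), a string field is commonly mapped as text for full-text search, with a keyword sub-field for exact sorting and aggregations:
PUT blogs
{
  "mappings": {
    "_doc": {
      "properties": {
        "category": {
          "type": "text",
          "fields": {
            "raw": { "type": "keyword" }
          }
        },
        "published": { "type": "date" }
      }
    }
  }
}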
APIs
GET blogs/_search
{
  "query": {
    "match": {
      "content": "ingest nodes"
    }
  }
}
PUT blogs
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text"
        },
        ...
      }
    }
  }
}
GET _analyze
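For instance, _analyze can be given an analyzer and some sample text (the text below is illustrative) and returns the resulting terms:
GET _analyze
{
  "analyzer": "standard",
  "text": "Ingest nodes are useful"
}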
GET _cluster/state
PUT blogs/_settings
{
  "settings": {
    "number_of_replicas": 0 (1)
  }
}
(1) You can dynamically change the number of replicas but not number_of_shards.
POST _reindex
{
  "source": {
    "index": "blogs",
    "query": {
      ...
    }
  },
  "dest": {
    "index": "blogs_fixed"
  }
}
PUT _ingest/pipeline/fix_locales
{
  "processors": [
    {
      "script": {
        "source": """
          if ("".equals(ctx.locales)) {
            ctx.locales = "en-en";
          }
          ctx.reindexBatch = 3;
        """
      }
    },
    {
      "split": {
        "field": "locales",
        "separator": ","
      }
    }
  ]
}
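As a possible follow-up (assuming the pipeline is meant to run during the reindex above), _reindex can apply an ingest pipeline on the destination side:
POST _reindex
{
  "source": {
    "index": "blogs"
  },
  "dest": {
    "index": "blogs_fixed",
    "pipeline": "fix_locales"
  }
}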
Node roles
- master eligible: Only one node is elected master per cluster; it is the only node capable of changing the cluster state. You need an odd number of master-eligible nodes (quorum) to avoid split brain (see the configuration sketch after this list).
- data: Holds the shards and executes CRUD operations.
- ingest: Used to perform common data transformations and enrichments. Each task is represented by a processor. By default, the ingest functionality is enabled on every node.
  The ingest node is a push-based system: Beats can push data directly into it, but ingest nodes will not read from a message queue. For the latter case (and for more complex transformations), it is necessary to use Logstash.
- coordinator: Receives client requests. Every node is implicitly a coordinating node and acts as a smart load balancer.
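A minimal configuration sketch of how roles are assigned (assuming pre-7.9 style settings in elasticsearch.yml; the flags and values below are illustrative, and newer versions use node.roles instead):
# elasticsearch.yml of a dedicated ingest node (illustrative)
node.master: false
node.data: false
node.ingest: true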