Getting up to speed with the Elastic Stack (formerly the ELK stack), which consists of Elasticsearch, Logstash, and Kibana.
And now Beats.
The contents are as follows:
- Motivation
- Elasticsearch overview
- Use Cases
- Creating a cluster
- Version
- Elasticsearch and Semver
- Health
- PUT
- Indices
- Aliases
- POSTs
- GET
- Kibana
- Searches
- Leaf query clauses
- DELETE
- Bulk loading
- Index/Alias problems
- Aggregates
- Beats and Logstash
- X-Pack
- Podcasts
- Reference
- To Do
- Credits
Having looked at Cassandra with Python, Couchbase and DynamoDB, this time Elasticsearch is in the cross-hairs.
Under the covers Elasticsearch uses Apache Lucene.
Elasticsearch is very similar to Apache Solr. Both Elasticsearch and Solr are built on top of Lucene, and seem to offer roughly similar features. It's worth remembering that Elasticsearch, while open-source, is backed by a commercial (for-profit) company.
Amazon offers both as services: Elasticsearch is marketed as Amazon Elasticsearch Service while Solr is marketed as Amazon CloudSearch. Amazon Elasticsearch Service is based on a cluster of managed servers (where scaling needs to be managed) while Amazon CloudSearch is a managed service which autoscales.
There are a number of options for Elasticsearch (on premise, etc), but I will be looking at Amazon Elasticsearch Service.
Amazon describes it as follows:
Amazon Elasticsearch Service is a fully managed service that makes it easy for you to deploy, secure, and operate Elasticsearch at scale with zero down time.
[at scale, zero down time]
The service offers open-source Elasticsearch APIs, managed Kibana, and integrations with Logstash and other AWS Services, enabling you to securely ingest data from any source and search, analyze, and visualize it in real time.
[So a full ELK stack. This stack is popular with developers.]
Amazon Elasticsearch Service lets you pay only for what you use – there are no upfront costs or usage requirements.
[So a good match for serverless computing.]
All of the above quotes are from: http://aws.amazon.com/elasticsearch-service/
Elasticsearch is a distributed, sharded database that does not require indices to be defined up front.
It is a NoSQL database, so theoretically schema-less (although the recommended practice is to define a schema).
It is primarily oriented towards full-text search.
In terms of NoSQL categories, it seems to usually be defined as a real-time search and analytics engine first, and a Document store second. Of course, many relational databases also offer full-text search capabilities, however it is probably its real-time streaming characteristics that make Elasticsearch attractive.
[As a document-store, it seems to offer a strong challenge to MongoDB. And it can also handle search.]
Elasticsearch seems to be very similar to Apache Cassandra.
Cassandra has its own CQL while Couchbase has N1QL; on the other hand Elasticsearch is RESTful.
Scales good for reads, ok’ish for writes
From: http://vlkan.com/blog/post/2018/11/14/elasticsearch-primary-data-store/cncml-vienna-2019.pdf
Writes are handled via shards while reads are handled by replicas. Replicas may be scaled up dynamically (for instance, for Black Friday).
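The replica count is a dynamic index setting, so it can be changed on a live index. A minimal sketch (some_index is a placeholder name):
PUT /some_index/_settings
{
  "index" : {
    "number_of_replicas" : 2
  }
}
[The number of primary shards, by contrast, is fixed at index creation time - changing it means reindexing.]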
Its terminology is a little weird - for instance it refers to what might normally be called databases as indices.
[It seems these actually were just indices - just with the data included (this enables re-indexing, which is a big plus).]
Integrity is maintained with a CAS (compare and swap) versioning mechanism.
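In Elasticsearch 6.7+ this is exposed via the if_seq_no and if_primary_term query parameters (earlier versions used version). A sketch with placeholder values - the write is rejected with a 409 conflict if the document has changed in the meantime:
PUT /some_index/_doc/1?if_seq_no=5&if_primary_term=1
{
  "price" : 42
}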
It's a little unusual (for a database anyway) in that it offers fuzzy search options.
As noted above, Elasticsearch is optimized for search, and more specifically - full-text search.
Even so, it can be put to a surprisingly varied number of uses, some of which are listed below.
As a result of its full-text search capabilities, Elasticsearch is ideal for Product Catalogues.
[Although probably not for all languages (Latin-based languages should be okay). And as noted below, ES does not handle frequent updates particularly well, so may not be up-to-the-minute accurate for pricing data.]
To quickly add search to an app or website, there is Search UI.
At the enterprise level, it's very useful for the types of data visualization and dashboarding offered by say, Tableau (among others).
For log ingestion and analysis, offerings from DataDog and Splunk may be more feature-rich, still this is a very common use case for Elasticsearch.
There are quick and easy ingestion options for many common log formats such as:
- Apache logs
- Cloudwatch logs
- Elasticsearch logs
- Kafka logs
- Logstash logs
- MySQL logs
- Nats logs
- Nginx logs
- PostgreSQL logs
- Syslogs
- Traefik logs
In addition, there are numerous other ingestion options available via Logstash plugins.
[Elastic appear to have taken note of the value of standard log formats, and with Elastic 7.0 have introduced the Elastic Common Schema, to which all Beats now conform.]
For assembling data from disparate sources, standardization and enrichment (if needed) are critical. Elastic has recognized this - and offer Beats to cover most eventualities.
There are specialized Beats for monitoring Network, System and Performance data.
For these purposes it can be used for the type of dashboarding provided by Grafana (among others).
Due to its ability to ingest and scan documents, it is becoming very useful for data science.
Elasticsearch offers easy integration for Machine Learning.
There are a number of options for creating an Elasticsearch cluster.
It is possible to create an AWS test cluster from http://www.elastic.co/ but it is also possible to do this from AWS itself.
Follow the link to see the steps involved with the AWS option.
Of course, Elasticsearch is also available on GCP, which may in fact be a stronger option than the AWS offering (which tends to lag behind the ES releases).
Follow the link to see the steps involved with the Docker option.
And of course, if ES is available on Docker, it's no surprise to see it available on Kubernetes - where it can also be used as an alternative to the default Stackdriver logging.
Let's see what version of Elasticsearch we are dealing with first.
Click the wrench icon (LHS) to navigate to the Kibana Dev Tools console.
From the Kibana Dev Tools console:
GET /
[Click the green Play button to execute]
Response:
{
"name" : "b7f9cb7ad431",
"cluster_name" : "docker-cluster",
"cluster_uuid" : "uXuvtXXnQpSJ3I-F9VqxBg",
"version" : {
"number" : "7.3.1",
"build_flavor" : "default",
"build_type" : "docker",
"build_hash" : "4749ba6",
"build_date" : "2019-08-19T20:19:25.651794Z",
"build_snapshot" : false,
"lucene_version" : "8.1.0",
"minimum_wire_compatibility_version" : "6.8.0",
"minimum_index_compatibility_version" : "6.0.0-beta1"
},
"tagline" : "You Know, for Search"
}
[This should exactly match what we got from our ES cluster via our http://localhost:9200 URL.]
Note the version number (affects API calls, etc) which is 7.3.1.
THIS POINT CANNOT BE STRESSED ENOUGH. YOU HAVE BEEN WARNED!
Elastic does not follow the standard practice (usually referred to as Semver or Semantic Versioning) of only releasing breaking changes with major releases.
[This is based upon my own personal experience, but feel free to disregard my comments.]
While it should be expected there might be breaking changes going from a 6.y.z version to a 7.y.z version, it seems there are often breaking changes going from a x.1.z version to an x.2.z version. As an example, for creating aggregate histograms in Elasticsearch 7.1.1 versus 7.2.1, the interval keyword was deprecated and replaced with calendar_interval and/or fixed_interval.
Compare:
https://www.elastic.co/guide/en/elasticsearch/reference/7.1/search-aggregations-bucket-datehistogram-aggregation.html
And:
https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-aggregations-bucket-datehistogram-aggregation.html
While it may be that the new terms have clearer semantics than the old term, and while I applaud Elastic for continuing to support and expand their software, I think a good argument could be made that this type of change requires a major release and not simply a minor release.
[I have not tested this so it may well be that interval will not actually be removed until the next major release, however no developer really likes breaking changes and there seem to have been far too many of these during the brief time that I have been working with Elasticsearch. The problem with breaking changes is not so much that code needs to be changed - every developer expects and understands this - but that it then needs to be tested afterwards. Testing code that may have been running happily for some time can often turn out to be a non-trivial problem.]
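To illustrate the change, a minimal sketch (assuming an index with a @timestamp field). The 7.1 form:
GET some_index/_search
{
  "size": 0,
  "aggregations": {
    "per_day": {
      "date_histogram": {
        "field": "@timestamp",
        "interval": "day"
      }
    }
  }
}
becomes, in 7.2+:
GET some_index/_search
{
  "size": 0,
  "aggregations": {
    "per_day": {
      "date_histogram": {
        "field": "@timestamp",
        "calendar_interval": "day"
      }
    }
  }
}
[Use fixed_interval instead for fixed elapsed-time buckets such as "30s".]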
Let's check the health of our cluster:
From the Kibana Dev Tools console:
GET /_cat/health?v
Which should look something like the following:
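An illustrative response (your values will differ) - with ?v the first line is a header row:
epoch      timestamp cluster        status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1566930221 18:23:41  docker-cluster green           1         1      6   6    0    0        0             0                  -                100.0%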
Let's create an index (database instance) first.
[This step is actually optional as Elasticsearch will auto-magically create indices and types that do not exist, but it is a good step to try in terms of getting familiar with Elasticsearch and Kibana.]
From the Kibana Dev Tools console:
PUT school
[Creates index - Click the green Play button to execute]
Response:
{
"acknowledged" : true,
"shards_acknowledged" : true,
"index" : "school"
}
To delete the index (database instance), there is DELETE.
Having created an index, let's see what indices we have:
From the Kibana Dev Tools console:
GET /_cat/indices
[Click the green Play button to execute]
Response:
green open .kibana_task_manager Bw7bdH5oTBOPIwB7JrzXaQ 1 0 2 0 45.5kb 45.5kb
yellow open school HOGuhBS5TKCjpP84f-CnjQ 1 1 0 0 230b 230b
green open .kibana_1 DuPECNdXSJu0vaUeOnmwxA 1 0 3 0 11.7kb 11.7kb
[The yellow status for our school index indicates we do not have the recommended number of replicas, which is fine for a testing cluster. This is a warning status, where red indicates an error condition and green signifies all systems go. So a standard traffic light.]
Note that /_cat/indices will not show aliases.
A definition:
aliases are like soft links or shortcuts to actual indexes
the advantage is to be able to have an alias pointing to index1a while building or re-indexing on index2b and the moment of swapping them is atomic thanks to the alias
As:
Renaming an alias is a simple remove then add operation within the same API. This operation is atomic, no need to worry about a short period of time where the alias does not point to an index
Also:
Multiple indices can be specified for an action
And:
We will talk more about the other uses for aliases later in the book. For now we will explain how to use them to switch from an old index to a new index with zero downtime.
All from: http://stackoverflow.com/questions/48907041/what-are-aliases-in-elasticsearch-for
So let's create an alias.
From the Kibana Dev Tools console:
POST /_aliases
{
"actions" : [
{ "add" : { "index" : "school", "alias" : "academia" } }
]
}
[Click the green Play button to execute]
Response:
{
"acknowledged" : true
}
And check it exists (200 for yes, 404 for no):
HEAD academia
[Click the green Play button to execute]
Response:
200 - OK
And check out the statistics for this alias:
GET academia/_stats
And just for fun, Ctrl-Enter (Command-Enter on macOS) to execute:
{
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_all" : {
"primaries" : {
"docs" : {
...
}
},
"total" : {
"docs" : {
...
}
}
},
"indices" : {
"school" : {
"uuid" : "JN_oqm2fTfCyjSMzWskbfA",
"primaries" : {
"docs" : {
...
}
},
"total" : {
"docs" : {
...
},
"fielddata" : {
"memory_size_in_bytes" : 0,
"evictions" : 0
...
}
}
}
}
}
[In general, use of fielddata is probably to be avoided.]
Now let's create some entries.
From the Kibana Dev Tools console:
POST school/_doc/10
{
"name":"Saint Paul School", "description":"ICSE Afiliation",
"street":"Dawarka", "city":"Delhi", "state":"Delhi", "zip":"110075",
"location":[28.5733056, 77.0122136], "fees":5000,
"tags":["Good Faculty", "Great Sports"], "rating":"4.5"
}
[Click the green Play button to execute]
Response:
{
"_index" : "school",
"_type" : "_doc",
"_id" : "10",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
[Looks like anything starting with an underscore is metadata.]
And:
POST school/_doc/16
{
"name":"Crescent School", "description":"State Board Affiliation",
"street":"Tonk Road",
"city":"Jaipur", "state":"RJ", "zip":"176114","location":[26.8535922,75.7923988],
"fees":2500, "tags":["Well equipped labs"], "rating":"4.5"
}
[Click the green Play button to execute]
Response:
{
"_index" : "school",
"_type" : "_doc",
"_id" : "16",
"_version" : 1,
"result" : "created",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 0,
"_primary_term" : 1
}
Let's get a description (definition) of our index.
From the Kibana Dev Tools console:
GET school
[Click the green Play button to execute]
Response:
{
"school" : {
"aliases" : { },
"mappings" : {
"_doc" : {
"properties" : {
"city" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"description" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"fees" : {
"type" : "long"
},
"location" : {
"type" : "float"
},
"name" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"rating" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"state" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"street" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"tags" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
},
"zip" : {
"type" : "text",
"fields" : {
"keyword" : {
"type" : "keyword",
"ignore_above" : 256
}
}
}
}
}
},
"settings" : {
"index" : {
"creation_date" : "1566930221735",
"number_of_shards" : "5",
"number_of_replicas" : "1",
"uuid" : "xxxxxxxxxxxxxxxxxxxxxx",
"version" : {
"created" : "6050499"
},
"provided_name" : "school"
}
}
}
}
[fees is a long, location is a float; the others are text fields with keyword sub-fields.]
Now let's try retrieving a specific entry.
From the Kibana Dev Tools console:
GET /school/_search?q=name:"Saint Paul School"
[Click the green Play button to execute]
Response:
{
"took" : 46,
"timed_out" : false,
"_shards" : {
"total" : 5,
"successful" : 5,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.8630463,
"hits" : [
{
"_index" : "school",
"_type" : "_doc",
"_id" : "10",
"_score" : 0.8630463,
"_source" : {
"name" : "Saint Paul School",
"description" : "ICSE Afiliation",
"street" : "Dawarka",
"city" : "Delhi",
"state" : "Delhi",
"zip" : "110075",
"location" : [
28.5733056,
77.0122136
],
"fees" : 5000,
"tags" : [
"Good Faculty",
"Great Sports"
],
"rating" : "4.5"
}
}
]
}
}
Kibana is an open-source data visualization plugin for Elasticsearch.
Click the wrench icon (LHS) to navigate to the Kibana Dev Tools console.
Your last few Kibana Dev Tools sessions will be cached, which is a very useful feature.
Check cluster is running:
GET /
Check cluster health:
GET /_cat/health?v
List all indices:
GET /_cat/indices?v
List all aliases:
GET /_cat/aliases?v
List an index's aliases, mappings and settings:
GET /some_index
List only an index's mappings:
GET /some_index/_mapping
GET /some_alias/_mapping
How to check a field's mapping:
GET /some_index/_mapping/field/some_field
GET /some_alias/_mapping/field/some_field
In general leading slashes may be omitted. However, as they are sometimes needed, perhaps a better practice is to get into the habit of always using a leading slash.
For human-readable output add ?v to the end of the initial line:
GET /_cat/indices
Gives:
yellow open some_index JDyFX-XjQMu95f3KXZnPpQ 1 1 1 0 5kb 5kb
green open .kibana_task_manager 4wLH-SPCRTKreXpzmjmhoA 1 0 2 0 79.1kb 79.1kb
green open .kibana_1 Un0M5PpqTlWDlib_qmsXGw 1 0 4 1 23.9kb 23.9kb
While:
GET /_cat/indices?v
Gives:
health status index uuid pri rep docs.count docs.deleted store.size pri.store.size
yellow open some_index JDyFX-XjQMu95f3KXZnPpQ 1 1 1 0 5kb 5kb
green open .kibana_task_manager 4wLH-SPCRTKreXpzmjmhoA 1 0 2 0 79.1kb 79.1kb
green open .kibana_1 Un0M5PpqTlWDlib_qmsXGw 1 0 4 1 23.9kb 23.9kb
Kibana has a really useful IntelliSense-style auto-complete feature.
In addition to executing Elasticsearch queries (green Play button), the Dev Tools console in Kibana also has some nice tools (diagonal wrench icon).
cURL (from the wrench/spanner icon - 'Copy as cURL'):
curl -XGET "http://localhost:9200/school/_search?q=rating:4.5"
The 'Auto indent' feature is also useful - it toggles between expanded and compact formats. It will only work for correctly-formatted JSON, which can be a quick and convenient way to check for correctly-formed JSON.
The workspaces in Kibana can get pretty crowded very quickly, so being able to find things via searching is a very nice feature. It is possible to search in both the query and the results panels, as the following screencap shows:
Although the Kibana workspace can include dozens of queries, the results panel will always refer to the current Elasticsearch command as if it had been executed in isolation.
Spacing out your query so that it starts on a line number ending in 1 should make debugging problems slightly easier, as the following annotated screencap shows:
It is possible to search without any criteria (a 'match all' query, the equivalent of head or tail in *nix) to explore data.
GET /shakespeare/_search
There is a soft limit of 10,000 items on queries. This is to prevent runaway queries from killing the cluster.
[The 10,000 items soft limit for queries was introduced with Elasticsearch 7.0+]
To report on more than 10,000 items, add the following line:
"track_total_hits": true,
As in:
GET some_index/_search
{
"track_total_hits": true,
"query": {
"bool": {
"must_not": {
"exists": {
"field": "some_field"
}
}
}
}
}
[This is a query for all documents that do not contain the field some_field.]
Case is important - in some cases only lower-case will match (presumably text fields are lowercased when analyzed, so "USD" will only match if "usd" is requested), whereas in other situations only the exact text will match:
GET /_search
{
"query": {
"query_string" : {
"default_field" : "city",
"query" : "jaipur"
}
}
}
[Click the green Play button to execute]
Response:
{
"took" : 53,
"timed_out" : false,
"_shards" : {
"total" : 324,
"successful" : 324,
"skipped" : 0,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 0.2876821,
"hits" : [
{
"_index" : "school",
"_type" : "_doc",
"_id" : "16",
"_score" : 0.2876821,
"_source" : {
"name" : "Crescent School",
"description" : "State Board Affiliation",
"street" : "Tonk Road",
"city" : "Jaipur",
"state" : "RJ",
"zip" : "176114",
"location" : [
26.8535922,
75.7923988
],
"fees" : 2500,
"tags" : [
"Well equipped labs"
],
"rating" : "4.5"
}
}
]
}
}
http://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html
term - seems to be used for exact matches (EQ).
match - seems to be used for text matching - contains the specified term (LIKE, etc.).
range - used for numeric matching (LT, LTE, GT, GTE).
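For illustration, hedged sketches of a term query (note the keyword sub-field, since term queries are not analyzed) and a match query, both against the school index from earlier; a range example follows below:
GET /school/_search
{
  "query": {
    "term": {
      "city.keyword": "Jaipur"
    }
  }
}
And:
GET /school/_search
{
  "query": {
    "match": {
      "description": "affiliation"
    }
  }
}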
GET /_search
{
"query": {
"range": {
"fees": {
"gt": "2500"
}
}
}
}
[Click the green Play button to execute]
Response:
{
"took" : 9,
"timed_out" : false,
"_shards" : {
"total" : 324,
"successful" : 324,
"skipped" : 319,
"failed" : 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "school",
"_type" : "_doc",
"_id" : "10",
"_score" : 1.0,
"_source" : {
"name" : "Saint Paul School",
"description" : "ICSE Afiliation",
"street" : "Dawarka",
"city" : "Delhi",
"state" : "Delhi",
"zip" : "110075",
"location" : [
28.5733056,
77.0122136
],
"fees" : 5000,
"tags" : [
"Good Faculty",
"Great Sports"
],
"rating" : "4.5"
}
}
]
}
}
It's pretty error-tolerant - even the following worked:
GET /_search
{
"query": {
"range": {
"eventAt": {
"gte": "2500"
}
}
}
}
Deletes the specified index (and all of the indexed data as well):
DELETE school
[Click the green Play button to execute]
Response:
{
"acknowledged" : true
}
To delete a single document:
DELETE school/_doc/10
[Click the green Play button to execute]
Response:
{
"_index" : "school",
"_type" : "_doc",
"_id" : "10",
"_version" : 2,
"result" : "deleted",
"_shards" : {
"total" : 2,
"successful" : 1,
"failed" : 0
},
"_seq_no" : 1,
"_primary_term" : 1
}
[Note that result is deleted; also _version is 2, since deleting a document increments its version (the document was at version 1 after being re-created following the index delete).]
If you are brave, things like this usually work (via the Delete By Query API):
POST some_index/_delete_by_query
{
"query": {
"exists": {
"field": "some_field"
}
}
}
[The risk of deleting more documents than you intended hardly needs to be stated.]
Bulk loading can be accessed at /_bulk.
Using curl, more schools can be loaded as follows:
$ curl -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/school/_bulk?pretty --data-binary "@more_schools"; echo
[The ?pretty option means pretty-print the output. This can be omitted. Can optionally add the --silent option to curl. The ; echo part at the end is simply to get a newline so that our console output is readable.]
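The more_schools file itself is newline-delimited JSON (ndjson): an action line followed by a source line, ending with a final newline. The actual file contents are not shown here, but a hypothetical pair of entries might look like:
{ "index" : { "_id" : "17" } }
{ "name":"Model School", "city":"Pune", "state":"MH", "zip":"411001", "fees":3000, "rating":"4.0" }
{ "index" : { "_id" : "18" } }
{ "name":"Hilltop School", "city":"Shimla", "state":"HP", "zip":"171001", "fees":4200, "rating":"4.2" }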
It is quite easy to get into a chicken & egg situation with indices and aliases. Apparently this is a common enough problem that there is a pretty easy fix:
From: http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
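The usual fix is the atomic alias swap covered earlier - both actions happen in a single _aliases call, so there is no moment where the alias points nowhere. A sketch with hypothetical index names:
POST /_aliases
{
  "actions" : [
    { "remove" : { "index" : "school_v1", "alias" : "academia" } },
    { "add" : { "index" : "school_v2", "alias" : "academia" } }
  ]
}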
Aggregates are kind of a fancy way to say summaries.
It's possible to specify that you only want the big picture (no details) by specifying "size": 0:
GET some_index/_search
{
"query": {...},
"size": 0,
"aggregations": {
"dateHistogram": {
"date_histogram": {
"field": "@timestamp",
"fixed_interval": "10s"
}
}
}
}
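Aggregations can also be nested. A sketch against the school index from earlier, bucketing by city and averaging fees per bucket (note the keyword sub-field - terms aggregations cannot run on analyzed text fields without fielddata):
GET school/_search
{
  "size": 0,
  "aggregations": {
    "by_city": {
      "terms": { "field": "city.keyword" },
      "aggregations": {
        "avg_fees": { "avg": { "field": "fees" } }
      }
    }
  }
}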
Both Beats and Logstash are aggregation agents. Logstash can ship to Elasticsearch while Beats can ship to either Logstash or to Elasticsearch directly.
I have yet to determine how they differ, but at first glance it seems that Beats are a "push" arrangement while Logstash is more of a "pull" arrangement. However this is merely a guess on my part. Presumably Logstash itself can either poll or receive, as it can be used to aggregate/enrich data from Beats.
For more details about Beats and Logstash, refer to Logstash plugins.
These seem to be small modules that are designed for specific uses, such as network and/or packet monitors and the like.
For instance, there is Functionbeat which can monitor FaaS modules, such as serverless components. [The Elastic diagram shows Cloudwatch Logs, Kinesis, and SQS. It is not entirely clear whether Functionbeat is meant to be an adjunct or an alternative to AWS X-Ray.]
There is also Heartbeat which can be used to monitor system uptime data.
In addition to the Beats provided by Elastic, there are also Beats supported by the Elastic community.
As far as I can tell, Logstash is more of a traditional log ingestion tool.
For additional requirements such as security, access logging and alerting (to comply with the GDPR, perhaps) there is X-Pack - which has now been opened up (for more details, refer to The Changelog podcast).
Some of the following podcast episodes may be helpful for getting an overview of Elasticsearch.
Software Engineering Radio is a podcast targeted at the professional software developer.
A good overview of the Elastic Stack:
http://www.se-radio.net/2017/05/se-radio-episode-292-philipp-krenn-on-elasticsearch/
Jeff Meyerson, the interviewer, went on to found Software Engineering Daily.
As the stack has evolved, Logstash seems to have split into Beats (which are aggregation agents that can ship to either Elasticsearch or Logstash) and Logstash (which is used for data ingestion and parsing - and also enrichment [for instance, geolocating IP addresses]).
[Apparently geolocation has evolved to become a strength of Elasticsearch. And types are on their way out.]
Presumably well-structured data can be shipped to Elasticsearch directly by Beats, but less well-structured data (that may need to be parsed, processed, aggregated, transformed or enriched) must be passed to Logstash instead.
Software Engineering Daily features daily interviews about technical software topics.
An interesting chat about using Elasticsearch at Scale:
http://softwareengineeringdaily.com/2019/03/20/elasticsearch-at-scale-with-volkan-yazici/
Jeff Meyerson interviews Volkan Yazici about his blog post "Using Elasticsearch as the Primary Data Store":
http://vlkan.com/blog/post/2018/11/14/elasticsearch-primary-data-store/
Summary points:
- Elasticsearch does not handle frequent updates particularly well
- Dynamic fields such as prices are not normally completely up-to-date but need to be indexed so as to be searchable
- Nested fields (or objects) can be problematic
- Great for real-time streaming (which probably correlates with eventual consistency)
And a quote:
Cassandra is really difficult to operate if you want to have it in your own premises. But compared with that Elasticsearch, you can get it right without breaking a sweat.
[Interesting and well worth a listen.]
News and podcasts for developers
The Changelog is a podcast in its own right, as well as an umbrella organization for various other interesting podcasts.
Jerod and Adam talk with Philipp Krenn about Elasticsearch and doubling down on "open"
http://changelog.com/podcast/292
Some interesting discussions about software licenses and Elasticsearch as a replacement for PostgreSQL's full-text search
[Some good background about search in general and Elasticsearch in particular.]
Some useful references follow.
Probably the place to start:
http://www.elastic.co/guide/en/elasticsearch/reference/current/glossary.html
For instance, about shards:
A shard is a single Lucene instance
Amazon layers their branding on top of Elastic's, so in AWS a cluster is known as an Amazon ES Domain. Probably other terms are similarly translated.
It is impossible to do much in ES without understanding mappings. Read this page:
http://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html
Individual mappings (fees, in this case) may be checked with GET /school/_mapping/field/fees.
If using Amazon Elasticsearch Service, their console can be used to view index mappings.
There are plugins to cover many eventualities and use cases: http://www.elastic.co/guide/en/logstash/current/input-plugins.html
[This page lists an extensive list of plugins that may be used for such things as ingesting logs, metrics and events. And once they are in Elasticsearch, Kibana can then provide many templated monitoring and dashboarding options.]
Elastic also offers a pretty useful Support Matrix: http://www.elastic.co/support/matrix#show_logstash_plugins
Careful consideration needs to be paid to dates, as most queries will involve time-related processing or date ranges:
http://www.elastic.co/guide/en/elasticsearch/reference/current/date.html
Timestamps in Elasticsearch are generally processed in either seconds or milliseconds; however, Elasticsearch 7.0 introduced nanosecond precision as an option (internally, Elasticsearch 7.0+ processing uses nanoseconds).
Interestingly, Elasticsearch can support multiple date formats at the same time:
http://www.elastic.co/guide/en/elasticsearch/reference/current/date.html#multiple-date-formats
For date processing with Javascript, there is Luxon.
[Luxon operates in milliseconds.]
It is possible to format numbers:
http://numeraljs.com/
The index APIs:
http://www.elastic.co/guide/en/elasticsearch/reference/current/indices.html
A little long, but worth reading:
http://www.elastic.co/guide/en/elasticsearch/reference/current/indices-aliases.html
The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.
http://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html
Note that a successful create returns a 201 HTTP status code while an unsuccessful delete returns a 404 HTTP status code.
The update (usually an upsert I think) should normally return a 200 HTTP status code.
When using the update action, retry_on_conflict can be used as a field in the action itself (not in the extra payload line), to specify how many times an update should be retried in the case of a version conflict.
And:
The update action payload supports the following options: doc (partial document), upsert, doc_as_upsert, script, params (for script), lang (for script), and _source. See update documentation for details on the options.
[Both quotes are from the article linked above.]
The update documentation is available here:
http://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html
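Putting the two quotes together, a hedged sketch of a bulk update against the school index from earlier - retry_on_conflict sits in the action line, while doc and doc_as_upsert sit in the payload line:
POST /school/_bulk
{ "update" : { "_id" : "10", "retry_on_conflict" : 3 } }
{ "doc" : { "fees" : 5500 }, "doc_as_upsert" : true }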
Indices (database instances) are - for all practical purposes - immutable.
[It's possible to add fields, and change the display formats of existing fields, but that's about it.]
In order to make any changes to the field mappings, it is pretty much required to reindex.
This can actually be pretty fast: with a medium-sized cluster, millions of documents can be reindexed in a matter of minutes.
Note that the destination index must be set up before the reindex:
http://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
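A minimal reindex sketch (school_v2 is a hypothetical destination index, created beforehand with the new mappings):
POST /_reindex
{
  "source": { "index": "school" },
  "dest": { "index": "school_v2" }
}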
Deprecation logging should be enabled, especially if you are planning on upgrading Elasticsearch in the near future.
In addition to regular logging, Elasticsearch allows you to enable logging of deprecated actions.
from: http://www.elastic.co/guide/en/elasticsearch/reference/7.0/logging.html#deprecation-logging
AWS has a pretty good synopsis of their process for upgrading Elasticsearch:
http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/es-version-migration.html
Elastic has a short summary also:
http://www.elastic.co/guide/en/cloud/current/ec-migrate-data.html
[It's not particularly helpful, although it does recommend taking a snapshot first.]
Probably definitive when working with Amazon Elasticsearch Service:
http://docs.aws.amazon.com/elasticsearch-service/latest/developerguide/what-is-amazon-elasticsearch-service.html
http://www.elastic.co/guide/en/elasticsearch/reference/current/docs-index_.html
Versioning is complicated; timeouts can also be specified.
While most of this is from my painful experiences with wrangling Elasticsearch, I started from:
http://www.tutorialspoint.com/elasticsearch/elasticsearch_populate.htm
All of my testing was from the Kibana Dev Tools console.