Table of Contents

Using Elasticsearch for Similarity and Search
Backup and restore with elasticdump
Install Logstash

Using Elasticsearch for Similarity and Search

We use Elasticsearch (≥ v7.9.2) to index bills. With the index, we can calculate similarity between documents or parts of documents using built-in or custom similarity metrics, and provide for full-text search in the app.

Install Elasticsearch

See https://www.elastic.co/guide/en/elasticsearch/reference/current/getting-started-install.html

For MacOS, for example:

Install

$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-darwin-x86_64.tar.gz
$ tar -xvf elasticsearch-7.9.2-darwin-x86_64.tar.gz

Run

$ cd elasticsearch-7.9.2/bin
$ ./elasticsearch

Update memory settings

Elasticsearch may be `kill`ed by the jvm process when running bill similarity calculations. To try to avoid this, we set vm parameters:

See https://www.elastic.co/guide/en/elasticsearch/reference/5.5/setup-configuration-memory.html and https://discuss.elastic.co/t/elasticsearch-process-getting-killed/205691/6

flatgov$ sudo sh -c 'echo 0 > /proc/sys/vm/swappiness'
flatgov$ sudo sh -c 'echo 1 > /proc/sys/vm/overcommit_memory'

Logs

Elasticsearch logs may be found in /var/log/elasticsearch. To see recent activity: journalctl -u elasticsearch.service -xe

Indexing with Python

See, e.g., https://kb.objectrocket.com/elasticsearch/use-python-to-index-files-into-elasticsearch-index-all-files-in-a-directory-part-2-852

The functions in scripts/elastic_load.py load a bill file into the Elasticsearch index and provide a sample query. XML files are loaded with nested indexing and nested query, to account for the hierarchical levels. Initially, we are indexing only at the section level, as follows:

{
    'headers': list(OrderedDict.fromkeys(headers_text)),
     'sections': [{
     'section_number': section.find('enum').text,
     'section_text': etree.tostring(section, method="text", encoding="unicode"),
     'section_xml': etree.tostring(section, method="xml", encoding="unicode"),
     'section_header':  section.find('header').text
     ]
     }

Once a bill is converted into the form above, it is indexed. Both as a whole document, and with the sections indexed separatetly (as 'nested' documents).

Note	When an inner mapping field is set with a 'similarity' key (e.g. `"section_text": {"type": "text", "similarity": "classic"}), it appears to break the nesting; the nested query no longer works and an Exception is thrown, indicating that 'sections' is not a nested field.

Backup and restore with elasticdump

Install elasticdump command-line application with npm

npm install -g elasticdump

Store billsections index to a .gz file

elasticdump --input=http://localhost:9200/billsections --output=$ | gzip > ./elasticdump.billsections.json.gz

Import data from .json
- Unzip the .json.gz

gzip -d elasticdump.billsections.json.gz

Restore data to Elasticsearch

# Import data from .json into ES
elasticdump \
  --input "${file_name}.json" \
  --output=http://localhost:9200/billsections

Or import from S3

# Import data from S3 into ES (using s3urls)
elasticdump \
  --s3AccessKeyId "${access_key_id}" \
  --s3SecretAccessKey "${access_key_secret}" \
  --input "s3://${bucket_name}/${file_name}.json" \
  --output=http://localhost:9200/billsections

Install Logstash

Note	we are not using Logstash to index; the logstash-filter.conf we built in the `elasticsearch` directory does not (yet) work.

For MacOs, for example:

Install

$ curl -L -O https://artifacts.elastic.co/downloads/logstash/logstash-7.9.2.tar.gz
$ tar -xvf logstash-7.9.2.tar.gz

Set up a logstash config file

See https://www.elastic.co/guide/en/logstash/current/configuration.html

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ES_SIMILARITY.adoc

ES_SIMILARITY.adoc

Using Elasticsearch for Similarity and Search

Install Elasticsearch

Update memory settings

Logs

Indexing with Python

Backup and restore with elasticdump

Install Logstash

Files

ES_SIMILARITY.adoc

Latest commit

History

ES_SIMILARITY.adoc

File metadata and controls

Using Elasticsearch for Similarity and Search

Install Elasticsearch

Update memory settings

Logs

Indexing with Python

Backup and restore with elasticdump

Install Logstash