ES_SIMILARITY.adoc

We use Elasticsearch (≥ v7.9.2) to index bills. With the index, we can calculate similarity between documents or parts of documents using built-in or custom similarity metrics, and provide for full-text search in the app.

Install Elasticsearch

For macOS, for example:

  • Install

$ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-darwin-x86_64.tar.gz
$ tar -xvf elasticsearch-7.9.2-darwin-x86_64.tar.gz
  • Run

$ cd elasticsearch-7.9.2/bin
$ ./elasticsearch
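
Once the node has started, a quick way to confirm it is reachable is to request the root endpoint, which returns a small JSON document that includes the version number. The sketch below assumes the default address `http://localhost:9200`; the function name `elasticsearch_version` is ours, for illustration only.

```python
import json
import urllib.error
import urllib.request


def elasticsearch_version(base_url="http://localhost:9200", timeout=2.0):
    """Return the version string reported by the node, or None if unreachable."""
    try:
        with urllib.request.urlopen(base_url, timeout=timeout) as response:
            info = json.load(response)
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not running".
        return None
    if not isinstance(info, dict):
        return None
    return info.get("version", {}).get("number")
```

Calling `elasticsearch_version()` should return a string such as the installed version when the node is up, and `None` otherwise.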

Update memory settings

The Elasticsearch JVM process may be killed while running bill similarity calculations. To try to avoid this, we set the following kernel vm parameters (the /proc/sys paths are Linux-specific):

flatgov$ sudo sh -c 'echo 0 > /proc/sys/vm/swappiness'
flatgov$ sudo sh -c 'echo 1 > /proc/sys/vm/overcommit_memory'
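
To check which values are currently in effect, the same files can be read back. This is a minimal sketch for a Linux host; the helper name `read_vm_setting` and its `root` parameter are hypothetical and exist only to make the example testable.

```python
from pathlib import Path


def read_vm_setting(name, root="/proc/sys/vm"):
    """Return the integer value of a vm setting, or None if it cannot be read."""
    path = Path(root) / name
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Missing file (e.g. not on Linux) or unexpected contents.
        return None
```

For example, `read_vm_setting("swappiness")` and `read_vm_setting("overcommit_memory")` should return 0 and 1 after the commands above have been run.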

Logs

Elasticsearch logs may be found in /var/log/elasticsearch. To see recent activity: journalctl -u elasticsearch.service -xe

Indexing with Python

The functions in scripts/elastic_load.py load a bill file into the Elasticsearch index and provide a sample query. XML files are indexed with a nested mapping and searched with a nested query, to account for the hierarchical levels of a bill. Initially, we are indexing only at the section level, as follows:

{
    'headers': list(OrderedDict.fromkeys(headers_text)),
    'sections': [{
        'section_number': section.find('enum').text,
        'section_text': etree.tostring(section, method="text", encoding="unicode"),
        'section_xml': etree.tostring(section, method="xml", encoding="unicode"),
        'section_header': section.find('header').text
    }]  # one such dict per section of the bill
}

Once a bill is converted into the form above, it is indexed both as a whole document and with its sections indexed separately (as 'nested' documents).
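
The conversion described above can be sketched end to end on a toy bill. This is an illustration under assumptions, not the code in scripts/elastic_load.py: it uses the standard library's `xml.etree.ElementTree` (the script may use a different `etree`), the sample XML and the function name `bill_to_document` are made up, and `headers_text` is assumed to be the text of every `header` element.

```python
import xml.etree.ElementTree as etree
from collections import OrderedDict

# Hypothetical, heavily simplified bill used only for this example.
SAMPLE_BILL = """<bill>
  <section><enum>1.</enum><header>Short title</header><text>This Act may be cited as the Sample Act.</text></section>
  <section><enum>2.</enum><header>Definitions</header><text>In this Act, the term bill means a sample.</text></section>
</bill>"""


def bill_to_document(xml_text):
    """Build the section-level document structure from bill XML."""
    root = etree.fromstring(xml_text)
    sections = root.findall(".//section")
    # Assumption: headers are collected from every <header> element.
    headers_text = [h.text for h in root.iter("header") if h.text]
    return {
        # OrderedDict.fromkeys removes duplicates while keeping order.
        "headers": list(OrderedDict.fromkeys(headers_text)),
        "sections": [
            {
                # findtext returns None instead of raising if a child is missing.
                "section_number": section.findtext("enum"),
                "section_header": section.findtext("header"),
                "section_text": etree.tostring(section, method="text", encoding="unicode"),
                "section_xml": etree.tostring(section, method="xml", encoding="unicode"),
            }
            for section in sections
        ],
    }
```

Using `findtext` rather than `find(...).text` is a deliberate choice here: a section without an `enum` or `header` child yields `None` for that field instead of raising an `AttributeError`.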

Note
When an inner mapping field is set with a 'similarity' key (e.g. `"section_text": {"type": "text", "similarity": "classic"}`), it appears to break the nesting: the nested query no longer works and an exception is thrown, indicating that 'sections' is not a nested field.
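
For reference, a nested mapping and a matching nested query can be written as plain Python dictionaries. The sketch below is an assumption about what such bodies might look like for the fields listed earlier, not a copy of the project's mapping; the function names are hypothetical, and the inner fields deliberately carry no 'similarity' key, in line with the note above.

```python
def nested_mapping():
    """Index body that declares 'sections' as a nested field."""
    return {
        "mappings": {
            "properties": {
                "headers": {"type": "text"},
                "sections": {
                    "type": "nested",
                    "properties": {
                        "section_number": {"type": "text"},
                        "section_header": {"type": "text"},
                        "section_text": {"type": "text"},
                        "section_xml": {"type": "text"},
                    },
                },
            }
        }
    }


def nested_section_query(text, size=10):
    """Search body that matches text against individual sections."""
    return {
        "size": size,
        "query": {
            "nested": {
                "path": "sections",
                "query": {"match": {"sections.section_text": text}},
                # inner_hits reports which sections matched within each bill.
                "inner_hits": {},
            }
        },
    }
```

With the 7.x Python client, bodies like these would typically be passed as `body=` to `indices.create` and `search` for the billsections index.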

Backup and restore with elasticdump

  • Install elasticdump command-line application with npm

npm install -g elasticdump

  • Store billsections index to a .gz file

elasticdump --input=http://localhost:9200/billsections --output=$ | gzip > ./elasticdump.billsections.json.gz

  • Import data from .json

    • Unzip the .json.gz

gzip -d elasticdump.billsections.json.gz

  • Restore data to Elasticsearch

# Import data from .json into ES (set file_name first, e.g. file_name=elasticdump.billsections)
elasticdump \
  --input "${file_name}.json" \
  --output=http://localhost:9200/billsections

Or import from S3

# Import data from S3 into ES (using s3urls)
elasticdump \
  --s3AccessKeyId "${access_key_id}" \
  --s3SecretAccessKey "${access_key_secret}" \
  --input "s3://${bucket_name}/${file_name}.json" \
  --output=http://localhost:9200/billsections
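
Before restoring, it can be useful to sanity-check a dump file. The sketch below assumes the dump is newline-delimited JSON with one document per line, which is what elasticdump normally writes; the function names are hypothetical. It reads either the gzipped or the unzipped file.

```python
import gzip
import json


def iter_dump(path):
    """Yield one parsed document per non-empty line of a dump file."""
    # Pick the gzip reader for .gz files, the plain reader otherwise.
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_documents(path):
    """Return the number of documents in a dump file."""
    return sum(1 for _ in iter_dump(path))
```

For example, `count_documents("elasticdump.billsections.json.gz")` gives the number of documents that a restore would attempt to load.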

Install Logstash

Note
We are not using Logstash to index; the logstash-filter.conf we built in the elasticsearch directory does not (yet) work.

For macOS, for example:

  • Install

$ curl -L -O https://artifacts.elastic.co/downloads/logstash/logstash-7.9.2.tar.gz
$ tar -xvf logstash-7.9.2.tar.gz
  • Set up a logstash config file