We use Elasticsearch (≥ v7.9.2) to index bills. With the index, we can calculate similarity between documents or parts of documents using built-in or custom similarity metrics, and provide for full-text search in the app.
For macOS, for example:

- Install:

  ```bash
  $ curl -L -O https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.9.2-darwin-x86_64.tar.gz
  $ tar -xvf elasticsearch-7.9.2-darwin-x86_64.tar.gz
  ```

- Run:

  ```bash
  $ cd elasticsearch-7.9.2/bin
  $ ./elasticsearch
  ```
The Elasticsearch (JVM) process may be killed by the operating system (e.g., the out-of-memory killer) when running bill similarity calculations. To try to avoid this, we set vm parameters (see https://www.elastic.co/guide/en/elasticsearch/reference/5.5/setup-configuration-memory.html and https://discuss.elastic.co/t/elasticsearch-process-getting-killed/205691/6):

```bash
flatgov$ sudo sh -c 'echo 0 > /proc/sys/vm/swappiness'
flatgov$ sudo sh -c 'echo 1 > /proc/sys/vm/overcommit_memory'
```
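Values written to `/proc/sys/vm` do not survive a reboot. To persist them, the equivalent keys can be added to `/etc/sysctl.conf` (a sketch; the exact file, or a drop-in under `/etc/sysctl.d/`, may vary by distribution) and applied with `sudo sysctl -p`:

```
vm.swappiness = 0
vm.overcommit_memory = 1
```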
Elasticsearch logs may be found in `/var/log/elasticsearch`. To see recent activity:

```bash
journalctl -u elasticsearch.service -xe
```
The functions in `scripts/elastic_load.py` load a bill file into the Elasticsearch index and provide a sample query. XML files are loaded with nested indexing and a nested query, to account for the hierarchical levels. Initially, we are indexing only at the section level, as follows:
```python
{
    'headers': list(OrderedDict.fromkeys(headers_text)),
    'sections': [{
        'section_number': section.find('enum').text,
        'section_text': etree.tostring(section, method="text", encoding="unicode"),
        'section_xml': etree.tostring(section, method="xml", encoding="unicode"),
        'section_header': section.find('header').text,
    }]
}
```
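As a self-contained sketch of how such a structure might be built (using the standard-library `xml.etree.ElementTree` here for illustration; the project scripts use `lxml.etree`, and the XML fragment below is a simplified, hypothetical bill):

```python
from collections import OrderedDict
from xml.etree import ElementTree as etree

# Simplified, hypothetical bill XML with a single section.
BILL_XML = """
<bill>
  <section>
    <enum>1.</enum>
    <header>Short title</header>
    <text>This Act may be cited as the Example Act.</text>
  </section>
</bill>
"""

root = etree.fromstring(BILL_XML)
sections = root.findall('.//section')
headers_text = [s.find('header').text for s in sections]

doc = {
    'headers': list(OrderedDict.fromkeys(headers_text)),
    'sections': [{
        'section_number': section.find('enum').text,
        'section_text': etree.tostring(section, method="text", encoding="unicode"),
        'section_xml': etree.tostring(section, method="xml", encoding="unicode"),
        'section_header': section.find('header').text,
    } for section in sections],
}
```

`OrderedDict.fromkeys` deduplicates headers while preserving their document order.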
Once a bill is converted into the form above, it is indexed both as a whole document and with the sections indexed separately (as 'nested' documents).
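For example, a nested query over the indexed sections might look like the following (a sketch: the field names follow the structure above, the index name `billsections` matches the elasticdump commands below, and the search phrase is hypothetical; the exact query in `scripts/elastic_load.py` may differ):

```python
# Sketch of a nested query body; with the elasticsearch-py client this
# would be passed as, e.g., es.search(index="billsections", body=query).
query = {
    "query": {
        "nested": {
            "path": "sections",
            "query": {
                "match": {"sections.section_text": "internal revenue code"}
            },
            # Return the matching inner section documents, not just the bill.
            "inner_hits": {},
        }
    }
}
```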
> **Note:** When an inner mapping field is set with a 'similarity' key (e.g. `"section_text": {"type": "text", "similarity": "classic"}`), it appears to break the nesting; the nested query no longer works and an Exception is thrown, indicating that 'sections' is not a nested field.
- Install the `elasticdump` command-line application with npm:

  ```bash
  npm install -g elasticdump
  ```

- Store the `billsections` index to a .gz file:

  ```bash
  elasticdump --input=http://localhost:9200/billsections --output=$ | gzip > ./elasticdump.billsections.json.gz
  ```
- Import data from `.json`:

  - Unzip the `.json.gz`:

    ```bash
    gzip -d elasticdump.billsections.json.gz
    ```

  - Restore data to Elasticsearch:

    ```bash
    # Import data from .json into ES
    elasticdump \
      --input "${file_name}.json" \
      --output=http://localhost:9200/billsections
    ```
Or import from S3:

```bash
# Import data from S3 into ES (using s3urls)
elasticdump \
  --s3AccessKeyId "${access_key_id}" \
  --s3SecretAccessKey "${access_key_secret}" \
  --input "s3://${bucket_name}/${file_name}.json" \
  --output=http://localhost:9200/billsections
```
> **Note:** We are not using Logstash to index; the logstash-filter.conf we built in the elasticsearch directory does not (yet) work.
For macOS, for example:

- Install:

  ```bash
  $ curl -L -O https://artifacts.elastic.co/downloads/logstash/logstash-7.9.2.tar.gz
  $ tar -xvf logstash-7.9.2.tar.gz
  ```

- Set up a logstash config file
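A minimal Logstash pipeline config might look like the following (a sketch only, with hypothetical file paths; as noted above, the project's own logstash-filter.conf does not yet work):

```
# Hypothetical pipeline: read JSON bill files and send them to Elasticsearch.
input {
  file {
    path => "/path/to/bills/*.json"
    start_position => "beginning"
    sincedb_path => "/dev/null"
    codec => "json"
  }
}
output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "billsections"
  }
}
```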