Rummager

Rummager is now primarily based on elasticsearch.

Get started

Install elasticsearch 0.90. Rummager doesn't yet work with 1.0 or later.

Install Redis 2.x

Install GNU Aspell

Run the application with ./startup.sh, which uses shotgun/thin.

To create indices, or to update them to the latest index settings, run:

RUMMAGER_INDEX=all bundle exec rake rummager:migrate_index

If you have indices created by a Rummager instance that predates aliased indices, run:

RUMMAGER_INDEX=all bundle exec rake rummager:migrate_from_unaliased_index

If you don't know which of these you need to run, try running the first one; it will fail safely with an error if you have an unmigrated index.
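The "try the first one" advice above can be sketched as a small Ruby wrapper. This is only an illustration: the `migrate_index` helper, its `runner` parameter, and its return values are hypothetical and not part of Rummager. It runs the first rake task and, if that fails, prints the second command as a suggestion rather than running it, since the failure may have another cause.

```ruby
# Hypothetical helper, not part of Rummager.
# Runs rummager:migrate_index; on failure, suggests the unaliased migration.
# `runner` defaults to Kernel#system and is injectable so the flow can be dry-run.
def migrate_index(index = "all", runner: ->(env, *cmd) { system(env, *cmd) })
  env = { "RUMMAGER_INDEX" => index }
  rake = %w[bundle exec rake]

  # Kernel#system returns true only when the command exits successfully.
  return :migrated if runner.call(env, *rake, "rummager:migrate_index")

  warn "migrate_index failed; if the error mentions an unmigrated index, run:"
  warn "  RUMMAGER_INDEX=#{index} bundle exec rake rummager:migrate_from_unaliased_index"
  :needs_manual_check
end

# Example (run from the Rummager checkout):
#   migrate_index("all")
```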

Rummager has an asynchronous mode, disabled in development by default, that posts documents to a queue to be indexed later by a worker. To run this in development, you need to run both of these commands:

ENABLE_QUEUE=1 ./startup.sh
bundle exec rake jobs:work
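Conceptually, asynchronous mode separates posting a document from indexing it. The sketch below illustrates that split with Ruby's standard `Queue` and a single worker thread. It is an in-process stand-in only: the `index_asynchronously` name, the `"link"` field, and the `:stop` sentinel are assumptions for illustration and say nothing about how Rummager's real queue or worker is implemented.

```ruby
require "thread" # Queue is built in on modern Rubies; the require is harmless

# Illustrative only: posts documents to a queue and lets a worker "index" them later.
def index_asynchronously(documents)
  queue = Queue.new
  indexed = []

  # The worker plays the role of `rake jobs:work`: it drains the queue.
  worker = Thread.new do
    while (doc = queue.pop) != :stop
      indexed << doc["link"] # stand-in for writing the document to the index
    end
  end

  # The producer plays the role of the web app: it only enqueues and returns.
  documents.each { |doc| queue << doc }
  queue << :stop # sentinel telling the worker to finish
  worker.join
  indexed
end

puts index_asynchronously([{ "link" => "/vat-rates" }, { "link" => "/bank-holidays" }]).inspect
```

This also shows why both commands are needed in development: without a worker draining the queue, posted documents never reach the index.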

Indexing GOV.UK content

Memory requirements

To build the search index on a VM, make sure the VM has sufficient memory: 4GB is probably a good amount; with 2GB, the indexing process tends to get killed by the out-of-memory killer. Do this by adding a Vagrantfile.localconfig to the same directory as your Vagrantfile:

$ cat ./Vagrantfile.localconfig
config.vm.provider :virtualbox do |vm|
  vm.customize [ "modifyvm", :id, "--memory", "4096", "--cpus", "2" ]
end

It's probably a good idea to give elasticsearch more memory too, since that makes indexing faster and also avoids the risk of elasticsearch running out of memory and killing itself. Do this by editing /etc/init/elasticsearch-govuk-development.conf to include the line:

env ES_HEAP_SIZE="1024m"

Restart the VM (e.g. with vagrant reload) after making these changes.

Popularity information

GOV.UK search uses page popularity as one of the factors in weighting search results. The search-analytics project extracts this information from Google Analytics, but on dev machines you should be able to obtain a copy of the page traffic index from preview when you run the standard replication of search indexes from preview to dev.

If you do need to fetch the analytics data directly yourself, the search-analytics project README describes how to set up and run the extraction of page traffic information from Google Analytics. It will produce a dump file suitable for loading into an elasticsearch index using rummager's bulk_load tool.

Once you have the popularity data in a file named, say, page-traffic.dump, load it into elasticsearch using:

bundle exec bin/bulk_load page-traffic < page-traffic.dump

The popularity information won't affect search results until an index migration is run after populating the page-traffic index. As part of the migration, the popularity for each document will be computed from the page-traffic index and merged into the documents. To do this, run:

RUMMAGER_INDEX=all bundle exec rake rummager:migrate_index
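To see why a migration is needed after loading traffic data, it may help to picture popularity as a value copied into each document at migration time. The sketch below is a simplified illustration under stated assumptions: the `merge_popularity` helper, the `"link"` and `"popularity"` field names, and the "views divided by the highest view count" formula are all hypothetical and are not taken from Rummager's code.

```ruby
# Hypothetical sketch: field names and scoring formula are assumptions,
# not Rummager's actual implementation.
# `page_views` maps a page path to its view count.
def merge_popularity(documents, page_views)
  max_views = page_views.values.max.to_f # nil.to_f is 0.0 when there is no traffic data

  documents.map do |doc|
    views = page_views.fetch(doc["link"], 0)
    popularity = max_views.zero? ? 0.0 : views / max_views
    doc.merge("popularity" => popularity) # returns a copy; the input is untouched
  end
end

docs = [{ "link" => "/vat-rates" }, { "link" => "/bank-holidays" }]
traffic = { "/vat-rates" => 200, "/bank-holidays" => 50 }
merge_popularity(docs, traffic).each { |doc| puts doc.inspect }
```

Because each document carries a snapshot of its popularity, documents indexed before the page-traffic data changed keep their old value until the migration runs again.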

Indexing panopticon content

Since search indexing happens through Panopticon's single registration API, you'll need to have both Panopticon and Rummager running. By default, Panopticon will not try to index search content in development mode, so you'll need to pass an extra environment variable to it.

If you have Bowler installed, you can set these both running with a single command from the development repository:

UPDATE_SEARCH=1 bowl panopticon rummager

The next stage is to register content from the applications you want. For example:

  • Business Support Finder
  • Calendars
  • Licence Finder
  • Publisher
  • Smart Answers
  • Trade Tariff

To re-register content for a single application, go to its directory and run:

bundle exec rake panopticon:register

To register content for all the applications, go to the replication directory in the development project and run:

./rebuild-search-local.sh

To rebuild from the Whitehall application, follow the instructions in the app.

Adding a new index

To add a new index to Rummager, you'll first need to add it to the list of index names Rummager knows about in elasticsearch.yml. For instance, you might change it to:

index_names: ["mainstream", "detailed", "government", "my_new_index"]

To create the index, you'll need to run:

RUMMAGER_INDEX=my_new_index bundle exec rake rummager:migrate_index

This task will fail if you've already created an index with this name, as Rummager can't add an alias that is the name of an existing index. In this case, you'll either need to delete your existing index or, if you want to keep its contents, run:

RUMMAGER_INDEX=my_new_index bundle exec rake rummager:migrate_from_unaliased_index
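Before running either task, it can be worth confirming that the new name really is listed in elasticsearch.yml. The following is a minimal sketch using Ruby's standard YAML library. The `index_listed?` helper is hypothetical, and it assumes `index_names` is a top-level key, as in the one-line example above; adjust the lookup if your file nests it under another key.

```ruby
require "yaml"

# Hypothetical helper: returns true when `name` appears under the top-level
# `index_names` key of the given YAML text.
def index_listed?(yaml_text, name)
  config = YAML.safe_load(yaml_text) || {} # empty input parses to nil/false
  Array(config["index_names"]).include?(name)
end

sample = 'index_names: ["mainstream", "detailed", "government", "my_new_index"]'
puts index_listed?(sample, "my_new_index")

# Against the real file (the path is an assumption about your checkout):
#   index_listed?(File.read("elasticsearch.yml"), "my_new_index")
```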

Health check

As we work on Rummager, we want some objective metrics of search performance. That's what the health check is for.

To run it, first download the health check data:

$ ./bin/health_check -d

Then run against your chosen indices:

$ ./bin/health_check government mainstream

By default it will run against the local search instance. You can run against a remote search service using the --json or --html options.
