Skip to content
This repository has been archived by the owner on Oct 25, 2022. It is now read-only.

Latest commit

 

History

History
366 lines (253 loc) · 18.1 KB

importing-data.md

File metadata and controls

366 lines (253 loc) · 18.1 KB

MapIt - import postcode and boundary data

Introduction

MapIt is a product of MySociety which we use within GOV.UK to allow people to enter their postcode and be pointed to their local services.

It is composed of broadly 2 things:

  1. A database of postcodes and the location of their geographical centre
  2. A database of administrative entities (e.g. local councils) and their boundaries on a map

When a user enters a postcode on GOV.UK, MapIt will look it up to find the geographical centre and then return information about the administrative entities that postcode belongs to. We then use this information to refer the user to local services provided by those entities.

Updating Datasets

Datasets for boundary lines are published twice a year (in May and October) and postcodes are released four times a year (in February, May, August and November). This means over time our database will become more and more out of date and users will start to complain. When they do (or before if possible) we should update the data.

To update a live mapit server we:

  1. Generate a new database locally
  2. Export the new database to Amazon S3
  3. Update servers with the new database

Checkout the mapit, mapit-scripts and govuk-docker repos.

Prepare your Mapit installation in Docker by:

  1. Running make mapit in the govuk-docker repo - this will build the image and install all the dependencies.
  2. Reset the database by running govuk-docker run mapit-app ./reset-db.sh in the mapit repo (you can skip this step if you haven't run Mapit locally before) - this will drop any existing database, create a new empty one and migrate it to the empty schema. See the Troubleshooting section if you have issues.
  3. Start your Docker container by running govuk-docker up mapit-app, and check you are able to access mapit.dev.gov.uk - it is expected that the frontend looks somewhat "broken", that's okay - we only need to worry about the database for importing data.

This consists of the Office for National Statistics Postcode Directory (ONSPD), Ordnance Survey Boundary Line (BL), and Ordnance Survey of Northern Ireland (OSNI) datasets.

MySociety may have mirrored the latest datasets on their cache server: http://parlvid.mysociety.org/os/ so check there first.

  1. ONS Postcode Directory - ONSPD releases can be found via the Office for National Statistics (ONS) by going to the ONS Open Geography Portal then selecting the most recent ONS Postcode Directory.

  2. Boundary Line data - BL releases can be downloaded in ESRI format from the Ordnance Survey (OS).

  3. ONSI data - For the ONSI data we now point to Mysociety's URL in the check-onsi-downloads script, as there have not been any changes since December 2015. It's still worth checking if there are any updates. See about datasets for more information.

Save the files you have downloaded to your ~/govuk/mapit directory. This will make them available in Docker at /govuk/mapit. Ensure the rules in .gitignore cover the files you have downloaded, to stop them being erroneously commited to Git.

Update the import-uk-onspd script in mapit-scripts to refer to the paths of the new releases you have downloaded to ~/govuk/mapit.

Note: If the ONSI data has been updated and uploaded to S3 update the check-onsi-downloads script to refer to the new S3 URL.

In your Mapit directory run the import-uk-onspd script to import the data using Docker:

$ govuk-docker run mapit-app ../mapit-scripts/import-uk-onspd

This is a long process, it's importing 1000s of boundary objects and ~2.6 million postcodes, and takes at least 4.5 hours to run. The first hour is particularly important to monitor as this is when it has typically failed in the past.

If the scripts fail you'll have to investigate why and possibly update it to change how we import the data.

Some suggestions of places to look for help:

Also see the Troubleshooting section for past issues.

If you do have to fix the import scripts, or create new ones, consider talking with Mysociety developers to see if they're aware and if you can push those changes back upstream.

Note: If the script fails, you'll need to drop and recreate the database by running the reset-db.sh script and then run import-uk-onspd again. If you have database issues see the troubleshooting section.

The ONS used to identify areas with SNAC codes (called ONS in mapit). They stopped doing this in 2011 and started using GSS codes instead. New areas will not receive SNAC codes, and (for the moment at least) much of GOV.UK relies on SNAC codes to link things up, for example Frontend's AuthorityLookup.

The import-uk-onspd script's last action is to run a script to show missing codes. It's the line that reads

$MANAGE mapit_UK_show_missing_codes

This iterates over all area types we care about and lists those that are missing a GSS code (hopefully none) and how many are missing an ONS/SNAC code. If it lists any areas that are missing codes and you don't expect them (run the script on production or integration if you're not sure) you'll need to investigate.

SSH into one of the machines and run:

gds govuk c ssh -e integration mapit
$ cd /var/apps/mapit
$ sudo -u deploy govuk_setenv mapit venv3/bin/python manage.py mapit_UK_show_missing_codes

Then compare the output with that generated in your local Docker container.

An example of the output (May 2019 data) - import looks good:

    Show missing codes

    11874 areas in current generation (1)

    Checking ['EUR', 'CTY', 'DIS', 'LBO', 'LGD', 'MTD', 'UTA', 'COI'] for missing ['ons', 'gss', 'govuk_slug']
    12 EUR areas in current generation
      2 EUR areas have no ons code:
        Northern Ireland 11874 (gen 1-1) Northern Ireland
        Scotland 9199 (gen 1-1) Scotland
    26 CTY areas in current generation
    192 DIS areas in current generation
    33 LBO areas in current generation
    11 LGD areas in current generation
    36 MTD areas in current generation
    109 UTA areas in current generation
    1 COI areas in current generation

An example of the output (November 2020 data) - import has some flagged issues:

    11870 areas in current generation (1)

    Checking ['EUR', 'CTY', 'DIS', 'LBO', 'LGD', 'MTD', 'UTA', 'COI'] for missing ['ons', 'gss', 'govuk_slug']
    12 EUR areas in current generation
      2 EUR areas have no ons code:
        Northern Ireland 11870 (gen 1-1) Northern Ireland
        Scotland 9195 (gen 1-1) Scotland
    25 CTY areas in current generation
    188 DIS areas in current generation
    33 LBO areas in current generation
    11 LGD areas in current generation
    36 MTD areas in current generation
    110 UTA areas in current generation
      1 UTA areas have no ons code:
        Buckinghamshire Council 1767 (gen 1-1) England
      1 UTA areas have no govuk_slug code:
        Buckinghamshire Council 1767 (gen 1-1) England
    1 COI areas in current generation

Note the last few lines where it says Buckinghamshire Council has no ons code or govuk_slug code. You may have output similar to this if these councils have had some updates, and you will have to make some additional updates in the code:

You will have to reset the db and re-import the data again. Once these have been updated, the API will return the new GSS code, albeit mislabelled as an ONS code.

Note Licensify also depends on knowledge of SNAC codes to build it's own API paths. It will be necessary to update this file with the new GSS codes and corresponding area.

If you've had users complaining that their postcode isn't recognised, then try those postcodes and any other ones you know. You can get latest postcodes from https://checkmypostcode.uk/date/ and test them on your local database:

$ curl http://mapit.dev.gov.uk/postcode/ME206QZ

You should expect a 200 response with data present in the areas field of the response. See this example output for an idea of what to expect.

You can also compare the response to existing data we have in one of our environments and on Mysociety.

$ curl https://mapit.integration.govuk-internal.digital/postcode/ME206QZ
$ curl https://mapit.mysociety.org/postcode/ME206QZ

Ensure you test postcodes from all parts of the UK, since Northern Ireland data has been loaded separately.

You will have changed the import-uk-onspd and check-onsi-downloads scripts to refer to new datasets. If anything failed you may have had to change other things in the mapit repo too.

Export the database you just built in your Docker container:

$ govuk-docker run mapit-app pg_dump -U postgres mapit | gzip > mapit.sql.gz

It should be ~500Mb in size. You'll want to give it a name that refers to what data it contains. Perhaps mapit-<%b%Y>.sql.gz (using strftime parlance) for a standard release, or mapit-<%b%Y>-<a-description-of-change>.sql.gz if you've had to change the data outside the normal dataset releases.

Create a new publicly readable directory in S3 for this import and upload the file:

DATE=`date '+%Y-%m'`
gds aws govuk-production-poweruser aws s3api put-object --acl public-read --bucket govuk-custom-formats-mapit-storage-production --key source-data/${DATE}/
gds aws govuk-production-poweruser aws s3 cp mapit-<%b%Y>-<a-description-of-change>.sql.gz s3://govuk-custom-formats-mapit-storage-production/source-data/${DATE}/ --acl public-read

NB: THIS REQUIRES ACCESS TO GOV.UK PRODUCTION

  1. Update import-db-from-s3.sh in govuk-puppet to refer to your new file.

  2. Deploy your Puppet changes to Staging.

  3. Reprovision one instance of Mapit (keep note of the Mapit machine index) so that a new Mapit instance is created with the updated database.

  4. SSH into the replacement Mapit instance after it has been provisioned (the instance is successfully provisioned if it is added to the Mapit target group in the AWS EC2 console) and follow the instructions in Testing a server with an updated Mapit database.

Now that you are happy with the changes in Staging, you can reprovision other instances of Mapit in Staging before deploying your puppet changes in Production and reprovision the instances there.

Note: Only deploy this change to production once the new data has been tested in staging. If a new Mapit machine gets created in AWS, it will automatically try importing the data from the new database.

Remember to clear the shared cache otherwise old data may still be served. Refer to these docs on how to clear the cache.

If there are new local authorities in this release, you will need to add these to Local Links Manager. This step can be skipped if there are no new authorities.

  1. Run a Rake task to import the local authority slugs.

    rake import:local_authorities:import_all
    
  2. Run a Rake task to add the homepage URL for each local authority that has been added.

    rake local_authority:update_homepage[govuk-slug,https://www.website.gov.uk/]
    
  3. Run a Rake task to add the service links for the new local authorities.

    rake import:missing_links
    

Troubleshooting

Useful database queries

  • Find area by name

      $ mapit=> select * from mapit_area where name = 'Abbey';
    
  • Find area by id

      $ mapit=> select * from mapit_area where id=1767;
    
  • Find authority by name

      $ mapit=> select * from mapit_name where name like 'Buckinghamshire%';
    

Unable to drop the database when running the ./reset-db.sh script

  1. Log into the database

      $ python ./manage.py dbshell
    
  2. Prevent future connections by running

     $ REVOKE CONNECT ON DATABASE mapit FROM public;
    
  3. Terminate all connections to the database except your own:

     $ SELECT pid, pg_terminate_backend(pid) FROM pg_stat_activity WHERE datname = current_database() AND pid <> pg_backend_pid();
    

Failing that, stop collectd, it probably connects again the moment the old connection gets terminated:

$ sudo service collectd stop

Hitting exceptions where an area has a missing parent or spelt incorrectly

If you see something similar to:

get() returned more than one area -- it returned 2!
/var/govuk/mapit/mapit/management/find_parents.py:49
Exception: Area Moray [9325] (SPC) does not have a parent?

or:

Exception: ONS code S17000011 is used for Highland and Islands PER and Highlands and Islands PER

Compare the entry with the database in integration or staging to identify what information is missing or needs to be corrected. Searching for the data on https://mapit.mysociety.org and checking the logged issues on Mysociety's repo might also be helpful. Also see useful database queries.

You can also search for the area by its ONS code on the ONS website e.g. http://statistics.data.gov.uk/atlas/resource?uri=http://statistics.data.gov.uk/id/statistical-geography/S17000011

You can manually fix it by adding a correction in mapit/management/find_parents.py. See this example and this example for more information.