CVD Prevent Tool curated data pipeline
Repository owner: NHS England Analytical Services
Email: datascience@nhs.net
To contact us, raise an issue on GitHub or send us an email and we will respond promptly.
Warning - this repository is a snapshot of a repository internal to NHS England. This means that some links may not work for external readers.
This repository contains a suite of Spark notebooks that make up a new data pipeline. The pipeline builds a data asset that links and curates CVDPREVENT audit data with existing administrative data tables.
This codebase can only be run on NHS England's Data Access Environment (Apache Spark v3.2.1). It is shared for transparency and to invite feedback on the algorithms used.
No sensitive data is stored within this repository.
- The pipeline is structured using an object-oriented approach.
- The pipeline is designed to be configured through params notebooks, without altering the codebase.
- Outputs can be restricted for particular cohort populations or to include a subset of data sources.
- Bespoke outcomes and patient characteristics can be included in the output tables.
The pipeline takes information from a range of data sources and summarises it in a set of standardised tables. It produces the following outputs:
- Events table (row per event from each data source used)
- Patient table (row per patient – patient only recorded if they satisfy inclusion criteria for either Cohort 1 or 2)
- Report table (output of results check, error catching)
The Prevent Tool Pipeline is run from the run_pipeline notebook.
This notebook runs the full pipeline and uses several hardcoded parameters to determine how the pipeline is run:
PARAMS_PATH: Path to the parameters notebook that controls the pipeline. The default value is default. Set a custom path only when using a non-standard parameters file.
VERSION: Git commit hash from the current master branch in GitLab. When testing pipeline code, it can instead be set to dev_XX, where XX are the initials of the user running the pipeline.
RUN_LOGGER: Boolean (True or False) controlling whether the logger stage of the pipeline is run. If True, the stage produces metadata about the pipeline's written assets and writes it to a separate logger_table asset.
RUN_ARCHIVE: Boolean (True or False) controlling whether the archive stage of the pipeline is run. If True, the stage copies the current pipeline assets, labelled with the date and git hash, before the assets are overwritten with the new versions.
DEV_MODE: Boolean (True or False) controlling whether the pipeline is run in development mode. If True, the pipeline assets are written with the prefix _dev.
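As a rough illustration, the settings above could be grouped and sanity-checked as in the sketch below. This is not the code in run_pipeline: the RunSettings class, the asset_name helper, the accepted commit hash format, and the exact way the _dev prefix is joined to an asset name are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Assumption: VERSION is either a git commit hash (7 to 40 hex characters)
# or "dev_" followed by the initials of the user running the pipeline.
_VERSION_PATTERN = re.compile(r"^(?:[0-9a-f]{7,40}|dev_[A-Za-z]{2,3})$")


@dataclass(frozen=True)
class RunSettings:
    """Hypothetical container mirroring the hardcoded run_pipeline parameters."""

    params_path: str = "default"
    version: str = "dev_XX"
    run_logger: bool = True
    run_archive: bool = True
    dev_mode: bool = False

    def __post_init__(self):
        # Reject anything that is neither a commit hash nor a dev_XX label.
        if not _VERSION_PATTERN.match(self.version):
            raise ValueError(f"Unrecognised VERSION: {self.version!r}")

    def asset_name(self, base_name: str) -> str:
        # Assumption: the "_dev" prefix is joined to the name with an underscore.
        return f"_dev_{base_name}" if self.dev_mode else base_name
```

A development run would then be described by something like `RunSettings(version="dev_AB", dev_mode=True)`, where the table name passed to `asset_name` is whatever the pipeline stage would normally write.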
The run_pipeline() function outputs a verbose progress log showing each pipeline stage as it runs and how long it takes.
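The kind of per-stage progress log described here can be sketched as follows. This is an assumption-laden illustration, not the pipeline's own logging: the run_stages function, the log format, and the stage names in the usage example are hypothetical.

```python
import time
from typing import Callable, Iterable, List, Tuple


def run_stages(stages: Iterable[Tuple[str, Callable[[], None]]]) -> List[Tuple[str, float]]:
    """Run each stage in order, print progress, and return (name, seconds) pairs."""
    timings = []
    for name, stage in stages:
        print(f"[START] {name}")
        started = time.perf_counter()
        stage()  # a real stage would read, transform and write Spark tables
        elapsed = time.perf_counter() - started
        print(f"[DONE]  {name} ({elapsed:.2f}s)")
        timings.append((name, elapsed))
    return timings


# Hypothetical stage names; the callables here do nothing.
timings = run_stages([("create_events_table", lambda: None),
                      ("create_patient_table", lambda: None)])
```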
Once completed, assets will be available in the prevent_tool_collab database.
The pipeline's behaviour is controlled by the pipeline parameters (found in the params folder). The parameter notebooks and their purposes are summarised below.
params
The main notebook for creating the params object. This notebook checks for the parameters path (default is default) and loads the specified params_util notebook.
params_util
This notebook contains the main parameter definitions and creates the params dataclass. Input and output data fields (columns) are specified here, alongside any intermediate fields used during pipeline processing. This notebook loads the params_diagnostic_codes notebook to obtain the relevant SNOMED and ICD-10 codes that form part of the inclusion criteria.
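For readers unfamiliar with the pattern, a parameters dataclass of this kind might look like the heavily reduced sketch below. Only the database name comes from this README; the Params class as written here, its field names, and the placeholder code value are assumptions, and no real clinical codes are shown.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Params:
    """Hypothetical, reduced stand-in for the pipeline's params dataclass."""

    database_name: str = "prevent_tool_collab"  # database named in this README
    person_id_field: str = "person_id"          # assumed column name
    event_date_field: str = "event_date"        # assumed column name
    # Placeholders only: the real lists are built in params_diagnostic_codes.
    snomed_codes: List[str] = field(default_factory=list)
    icd10_codes: List[str] = field(default_factory=list)


params = Params(icd10_codes=["EXAMPLE_CODE"])
```

Freezing the dataclass means the fields cannot be reassigned on the params object once it has been created, which suits a configuration that is set once per run.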
params_diagnostic_codes
This notebook is used to specify any clinical coding variables (ICD-10, SNOMED) that are used to create the pipeline parameters.
params_pipeline_assets
This notebook is used to specify the inputs (pipeline parameters) and outputs (table names) used in the pipeline stages.
params_table_schemas
This notebook is used to specify the expected final schemas for the events and patient table assets.
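A check of a table against an expected schema can be sketched in plain Python as below. The column names and types are illustrative assumptions, not the real schemas from params_table_schemas, and schema_mismatches is a hypothetical helper.

```python
from typing import Dict, List

# Assumption: illustrative columns only; the real schemas live in params_table_schemas.
EXPECTED_EVENTS_SCHEMA: Dict[str, str] = {
    "person_id": "string",
    "dataset": "string",
    "record_start_date": "date",
}


def schema_mismatches(actual: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return human-readable differences between an actual and an expected schema."""
    problems = []
    # Every expected column must be present with the expected type.
    for column, dtype in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != dtype:
            problems.append(
                f"wrong type for {column}: expected {dtype}, got {actual[column]}"
            )
    # Columns that are present but not expected are also reported.
    for column in actual:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems
```

With a Spark DataFrame `df`, the `actual` mapping could be built with `dict(df.dtypes)`, since `df.dtypes` is a list of (column name, type string) pairs.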
The homepage of the pipeline's documentation is here.
Unless stated otherwise, the codebase is released under the MIT Licence. This covers both the codebase and any sample code in the documentation.
Documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.