This repo includes the pipeline used to link and curate CVD PREVENT audit data to HES and death registration data into two tables. These are subsequently sent to OHID for analysis and publication.

NHSDigital/cvd-prevent-tool

DS_234: README

CVD Prevent Tool curated data pipeline

Repository owner: NHS England Analytical Services

Email: datascience@nhs.net

To contact us, raise an issue on GitHub or send an email, and we will respond promptly.

Warning - this repository is a snapshot of a repository internal to NHS England. This means that some links may not work for external readers.

This repository includes a suite of Spark notebooks that build a new data pipeline. The pipeline creates a data asset that links and curates CVDPREVENT audit data with existing administrative data tables.

This codebase can only be run in NHS England's Data Access Environment (Apache Spark v3.2.1). It is shared for transparency and for feedback on the algorithms used.

No sensitive data is stored within this repository.

Key features

  • The pipeline is structured using an object-oriented approach.
  • The pipeline is designed to be configured through params notebooks, without altering the codebase.
  • Outputs can be restricted to particular cohort populations or to a subset of data sources.
  • Bespoke outcomes and patient characteristics can be included in the output tables.

What does the pipeline do

It takes information from a range of data sources, and summarises them in a number of standardised tables. The pipeline produces the following outputs:

  • Events table (row per event from each data source used)
  • Patient table (row per patient – patient only recorded if they satisfy inclusion criteria for either Cohort 1 or 2)
  • Report table (output of results check, error catching)

Quick Start Guide

The Prevent Tool Pipeline is run from the run_pipeline notebook.

This notebook runs the full pipeline and uses several hardcoded parameters to determine how the pipeline is run:

PARAMS_PATH: Path to the parameters notebook that controls the pipeline. Default is default. A custom path should only be used when using a non-standard parameters file.

VERSION: Git commit hash from the current master branch in GitLab. Can also be set to dev_XX, where XX are the initials of the user running the pipeline; this form is used when testing pipeline code.

RUN_LOGGER: Boolean (True or False) controlling whether the logger stage of the pipeline runs. If True, the stage produces metadata about the pipeline's written assets and writes it to a separate logger_table asset.

RUN_ARCHIVE: Boolean (True or False) controlling whether the archive stage of the pipeline runs. If True, the stage copies the current pipeline assets, labelled with the date and git hash, before overwriting the assets with the new versions.

DEV_MODE: Boolean (True or False) controlling whether the pipeline runs in development mode. If True, the pipeline assets are written with the prefix _dev.
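
The five parameters above could be grouped and checked before a run along the following lines. This is only an illustrative sketch, not the code in the run_pipeline notebook: the RunSettings class, the regular expressions for VERSION, and the default values of the Boolean flags are all assumptions made for the example.

```python
import re
from dataclasses import dataclass

# Hypothetical patterns: the README only states that VERSION is either a
# git commit hash or "dev_XX", where XX are the user's initials.
_COMMIT_HASH = re.compile(r"[0-9a-f]{7,40}")
_DEV_VERSION = re.compile(r"dev_[A-Za-z]{2,3}")


@dataclass(frozen=True)
class RunSettings:
    """Illustrative container for the hardcoded run parameters."""

    version: str                   # VERSION
    params_path: str = "default"   # PARAMS_PATH
    run_logger: bool = True        # RUN_LOGGER (default value is an assumption)
    run_archive: bool = True       # RUN_ARCHIVE (default value is an assumption)
    dev_mode: bool = False         # DEV_MODE (default value is an assumption)

    def __post_init__(self):
        # Reject versions that are neither a commit hash nor dev_XX.
        if not (
            _COMMIT_HASH.fullmatch(self.version)
            or _DEV_VERSION.fullmatch(self.version)
        ):
            raise ValueError(
                f"VERSION must be a commit hash or dev_XX, got {self.version!r}"
            )
        # The three flags are documented as Booleans, so reject anything else.
        for name in ("run_logger", "run_archive", "dev_mode"):
            if not isinstance(getattr(self, name), bool):
                raise TypeError(f"{name.upper()} must be True or False")


# Example: a development run by a user with the initials AB.
settings = RunSettings(version="dev_AB", dev_mode=True)
```

Validating the settings up front means that a mistyped VERSION or a non-Boolean flag fails immediately rather than part-way through a long pipeline run.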

The pipeline run function run_pipeline() outputs a verbose progress log of the running stages and times of the pipeline.

Once completed, assets will be available in the prevent_tool_collab database.

Configuration

The pipeline functionality and running can be controlled using the pipeline parameters (found in the params folder). Below is a brief summary of the different parameter notebooks and their purpose.

params

The main notebook for creating the params object. This notebook checks for the parameters path (default is default) and loads the specified params_util notebook.

params_util

This notebook contains the main parameter definitions and creates the params dataclass. Input and output data fields (columns) are specified here, alongside any intermediate fields used during pipeline processing. It also runs the params_diagnostic_codes notebook to load the SNOMED and ICD-10 codes that form part of the inclusion criteria.
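
As a rough illustration of what a params dataclass of this kind might look like, the sketch below defines a small immutable parameter object. The class name Params, every field name, and the placeholder code values are hypothetical; only the database name prevent_tool_collab comes from this README.

```python
from dataclasses import dataclass, replace
from typing import Tuple

# Placeholder code lists. In the pipeline these are loaded from the
# params_diagnostic_codes notebook; the values below are NOT real codes.
EXAMPLE_SNOMED_CODES: Tuple[str, ...] = ("SNOMED_PLACEHOLDER_1", "SNOMED_PLACEHOLDER_2")
EXAMPLE_ICD10_CODES: Tuple[str, ...] = ("ICD10_PLACEHOLDER_1",)


@dataclass(frozen=True)
class Params:
    """Hypothetical parameter object; field names are illustrative only."""

    params_path: str = "default"
    database_name: str = "prevent_tool_collab"
    patient_id_field: str = "person_id"      # assumed column name
    event_date_field: str = "record_date"    # assumed column name
    snomed_codes: Tuple[str, ...] = EXAMPLE_SNOMED_CODES
    icd10_codes: Tuple[str, ...] = EXAMPLE_ICD10_CODES


params = Params()

# A non-standard configuration is derived as a new object, so the
# default definitions themselves are never edited.
custom_params = replace(params, params_path="custom", icd10_codes=())
```

Freezing the dataclass is one way to keep parameters from being changed accidentally during a run; whether the pipeline's own params object is frozen is not stated here.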

params_diagnostic_codes

This notebook is used to specify any clinical coding variables (ICD-10, SNOMED) that are used to create the pipeline parameters.

params_pipeline_assets

This notebook is used to specify the input (pipeline parameters) and output (table names) used in the pipeline stages.

params_table_schemas

This notebook is used to specify the expected final schemas for the events and patient table assets.
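
One way an expected schema can be used is to compare it against the columns of a produced table and report any differences. The sketch below shows that idea in plain Python; the column names, type names, and the schema_mismatches function are assumptions for illustration and are not taken from the params_table_schemas notebook.

```python
from typing import Dict, List

# Hypothetical expected schema: column name -> type name.
EXPECTED_EVENTS_SCHEMA: Dict[str, str] = {
    "person_id": "string",
    "record_date": "date",
    "dataset": "string",
}


def schema_mismatches(actual: Dict[str, str], expected: Dict[str, str]) -> List[str]:
    """Return a description of every difference between two schemas."""
    problems = []
    # Check each expected column is present with the expected type.
    for column, expected_type in expected.items():
        if column not in actual:
            problems.append(f"missing column: {column}")
        elif actual[column] != expected_type:
            problems.append(
                f"type mismatch for {column}: "
                f"expected {expected_type}, got {actual[column]}"
            )
    # Flag columns that the expected schema does not mention.
    for column in actual:
        if column not in expected:
            problems.append(f"unexpected column: {column}")
    return problems


# Example: one wrong type, one missing column, one extra column.
actual_schema = {"person_id": "string", "record_date": "string", "extra": "int"}
for problem in schema_mismatches(actual_schema, EXPECTED_EVENTS_SCHEMA):
    print(problem)
```

With a Spark DataFrame `df`, the `actual` mapping could be built with `dict(df.dtypes)`, since `DataFrame.dtypes` returns a list of (column name, type string) pairs.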

Documentation

The homepage of the pipeline's documentation is here.

Further documentation

Configuring the pipeline

Output data specification

Licence

Unless stated otherwise, the codebase is released under the MIT Licence. This covers both the codebase and any sample code in the documentation.

Documentation is © Crown copyright and available under the terms of the Open Government Licence v3.0.
