First draft for text description of the data #69

Merged
merged 16 commits into from
May 16, 2024
174 changes: 174 additions & 0 deletions vignettes/algorithm_logic.Rmd
@@ -0,0 +1,174 @@
---
title: "Description of algorithm contents & logic"
output: rmarkdown::html_vignette
bibliography: references.bib
csl: vancouver.csl
vignette: >
  %\VignetteIndexEntry{Design}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(dplyr)
```

## Contents

This document describes the data components involved in the algorithm.
It also describes the implemented algorithm logic, changes compared to
the originally validated algorithm, and a roadmap for potential changes
in future revisions. Refer to the other vignettes for background
information and a more general description of the algorithm.
Contributor

You could add links to specific vignettes with relevant information here?


## Data components

The algorithm uses five different types of data, contained in five
register sources:

1. Hospital diagnoses
   - The National Patient Register [Landspatientregisteret]
2. Prescription drugs purchased
   - The Register of Pharmaceutical Sales
     [Lægemiddelstatistikregisteret]
3. Hemoglobin-A1c tests
   - The Register of Laboratory Results for Research
     [Laboratoriedatabasens Forskertabel]
4. Diabetes-specific podiatrist services
   - The National Health Insurance Service Register
     [Sygesikringsregisteret]
5. Sex & date of birth
   - The Danish Civil Registration System [CPR-registeret]

In a future revision, the algorithm may also use the Danish Medical
Birth Register to extend the period of valid inclusions further back in
time than is possible using obstetric codes from the National Patient
Register.

## Pre-processing steps

This section describes the steps required to convert raw data into a
format that can be used as input to the algorithm. The description
assumes that the raw data are structured in the format that, in our
experience, is most common for raw data provided on Statistics
Denmark's servers.

Based on this most common scenario, the following sections list the
common register abbreviations/raw file names, their structure
(year-on-year files vs. a single large file, plus changes/breaks over
time), and the raw variable names and relevant values. Variable names
are presented in lower case here, but in real data the case may vary
between data sources (and even between years within the same data
source).
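
Because the case of variable names can vary, one option is to lower-case
all column names as a first pre-processing step. The chunk below is a
minimal sketch of that idea, not the package's implementation: the
object `raw_adm` and its upper-case column names are made up for
illustration.

```{r}
library(dplyr)

# Hypothetical raw extract whose column-name case differs from the
# lower-case names used in this document.
raw_adm <- tibble(
  PNR = c("01", "02"),
  RECNUM = c("001", "002"),
  D_INDDTO = as.Date(c("2003-01-31", "2003-02-01"))
)

# Lower-case every column name so later steps can rely on one spelling.
adm <- raw_adm |>
  rename_with(tolower)

names(adm)
```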

Depending on the contents and format of your specific raw data, you may
need to adapt the pre-processing pipeline accordingly.

## Structure of raw data

### National Patient Register

The National Patient Register contains several tables and types of data.
The algorithm uses only hospital diagnosis data, which is contained in
two tables:

1. A table containing administrative information, e.g. personal ID,
   `pnr`/`cpr`, and the first date of the contact,
   `d_inddto`/`dato_start`.

   - Named `lpr_adm` in the LPR2-formatted data prior to 2019, and
     `kontakter` in contact-based LPR3-formatted data from 2019
     onward.

2. A table containing all information on diagnoses recorded at each
   contact, `c_diag`, and the type of diagnosis (e.g. primary or
   secondary to the contact), `c_diagtype`.

   - Named `lpr_diag` in the LPR2-formatted data prior to 2019, and
     `diagnoser` in contact-based LPR3-formatted data from 2019
     onward.

On Statistics Denmark's servers, these tables are provided as a mix of separate
files for each calendar year prior to 2019 (in LPR2 format) and a single
file containing all the data from 2019 onward (LPR3 format). The two
tables can be joined with either the `recnum` variable (LPR2 data) or
the `dw_ek_kontakt` variable (LPR3 data).
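
Since the LPR2 data arrive as one file per calendar year, they typically
need to be stacked into a single table before further processing. The
chunk below sketches one way to do this with `dplyr::bind_rows()`. The
objects `lpr_adm_2003` and `lpr_adm_2004`, the 2004 row and the
`file_year` column are hypothetical stand-ins; in practice each yearly
table would be read from disk first.

```{r}
library(dplyr)

# Hypothetical stand-ins for two year-on-year LPR2 files.
lpr_adm_2003 <- tibble(
  pnr = c("01", "02"),
  recnum = c("001", "002"),
  d_inddto = as.Date(c("2003-01-31", "2003-02-01"))
)
lpr_adm_2004 <- tibble(
  pnr = "03",
  recnum = "004",
  d_inddto = as.Date("2004-05-17")
)

# Stack the yearly tables; the list names are stored in `file_year`
# so the source file of each row stays traceable.
lpr_adm <- bind_rows(
  list(`2003` = lpr_adm_2003, `2004` = lpr_adm_2004),
  .id = "file_year"
)

lpr_adm
```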

Examples of these data are shown below:

| pnr | recnum | d_inddto |
|-----|--------|------------|
| 01 | 001 | 2003-01-31 |
| 02 | 002 | 2003-02-01 |
| 02 | 003 | 2003-02-01 |

: Raw structure of `lpr_adm`: administrative data in the National Patient
Register before 2019. Corresponding variable names from 2019 onward:
`pnr` = `cpr`, `recnum` = `dw_ek_kontakt`, `d_inddto` = `dato_start`.

| recnum | c_diag | c_diagtype |
|--------|--------|------------|
| 001 | DE101 | A |
| 002 | DI21 | A |
| 003 | DE115 | B |

: Raw structure of `lpr_diag`: diagnosis data in the National Patient
Register before 2019. Corresponding variable names from 2019 onward:
`recnum` = `dw_ek_kontakt`, `c_diag` = `diagnosekode`, `c_diagtype` =
`diagnosetype`.
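
As an illustration of how the two tables fit together, the chunk below
rebuilds the example LPR2 rows shown above, joins them on `recnum`, and
then does the same for LPR3-style data joined on `dw_ek_kontakt`. It is
a sketch under stated assumptions rather than the package's
pre-processing code. The LPR3 rows are invented for the example, and the
result name `hospital_diagnoses` is hypothetical. Renaming the LPR3
variables to their LPR2 counterparts is just one possible convention,
and the sketch assumes that diagnosis-type values are coded comparably
in both formats.

```{r}
library(dplyr)

# The example LPR2 rows from the two tables above.
lpr_adm <- tibble(
  pnr = c("01", "02", "02"),
  recnum = c("001", "002", "003"),
  d_inddto = as.Date(c("2003-01-31", "2003-02-01", "2003-02-01"))
)
lpr_diag <- tibble(
  recnum = c("001", "002", "003"),
  c_diag = c("DE101", "DI21", "DE115"),
  c_diagtype = c("A", "A", "B")
)

# LPR2: link diagnoses to contacts through `recnum`.
lpr2 <- inner_join(lpr_adm, lpr_diag, by = "recnum")

# Made-up LPR3 rows, used only to show the renaming step.
kontakter <- tibble(
  cpr = "03",
  dw_ek_kontakt = "900",
  dato_start = as.Date("2019-03-01")
)
diagnoser <- tibble(
  dw_ek_kontakt = "900",
  diagnosekode = "DE110",
  diagnosetype = "A"
)

# LPR3: join on `dw_ek_kontakt`, then rename to the LPR2 names listed
# in the table captions.
lpr3 <- inner_join(kontakter, diagnoser, by = "dw_ek_kontakt") |>
  rename(
    pnr = cpr,
    recnum = dw_ek_kontakt,
    d_inddto = dato_start,
    c_diag = diagnosekode,
    c_diagtype = diagnosetype
  )

# One long table with a row per recorded diagnosis.
hospital_diagnoses <- bind_rows(lpr2, lpr3)

hospital_diagnoses
```

If the two formats are stacked like this, it may be worth checking that
`recnum` and `dw_ek_kontakt` values cannot collide, or keeping a column
that records the source format.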

### Register of Pharmaceutical Sales

To-do

### National Health Insurance Service Register

To-do

Content: SSSY and SYSI (overlap in 2005)

### Register of Laboratory Results for Research

To-do

### Civil Registration System

To-do

## Expected input

This section describes the required structure of the data objects that
can be used as input parameters to the OSDC algorithm (preferably
presented as table examples, maybe based on the synthetic data objects).

## Algorithm logic

This section describes what operations are performed on the input data.

## Expected output

This section describes the output object.

## Changes since original validation

1. Purchases of semaglutide, dapagliflozin or empagliflozin are no
   longer used for inclusion events or classification of diabetes type
   (due to their increasing use in the treatment of conditions other
   than diabetes).
2. Diabetes type reclassification based on insulin purchases in the
   previous year is no longer used.

## Roadmap for potential changes

1. Add support for using the Danish Medical Birth Register to define
   pregnancies when censoring gestational diabetes mellitus (GDM). This
   would allow censoring of glucose-lowering drug (GLD) purchases all
   the way back to 1995 (rather than from 1997 onward, the limit
   imposed by the obstetric codes) and would extend the window of valid
   dates of diagnosis to 1996 onward.
2. Simplify the logic defining pregnancy index dates to remove the
   dependency on maternal care visits (if performance in validation
   allows).
3. Limit the scope of primary diagnoses used to evaluate the majority
   of diabetes-specific diagnoses in type classification.
