Skip to content

Commit

Permalink
Added high level authoring guide (#35)
Browse files Browse the repository at this point in the history
* Added high level authoring guide

* Minor style updates

* capitalize ETL
  • Loading branch information
dogversioning authored Aug 20, 2024
1 parent e7ee513 commit 5e24b71
Show file tree
Hide file tree
Showing 5 changed files with 155 additions and 4 deletions.
2 changes: 1 addition & 1 deletion .gitmodules
Original file line number Diff line number Diff line change
Expand Up @@ -9,4 +9,4 @@
url = https://github.com/smart-on-fhir/chart-review.git
[submodule "sources/cumulus-aggregator"]
path = sources/cumulus-aggregator
url = https://github.com/smart-on-fhir/cumulus-aggregator
url = https://github.com/smart-on-fhir/cumulus-aggregator.git
151 changes: 151 additions & 0 deletions docs/authoring_studies.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,151 @@
---
title: Authoring studies in Cumulus
_parent: Overview
nav_order: 1
# audience: clinical researcher, unfamiliar with the project
# type: Explanation
---
# Introduction to studies in Cumulus

Cumulus studies aim to help enable address the following kinds of research
questions:

- How can I compute a phenotype from the data in my EHR? And how would I apply
that phenotype so I could conduct a clinical study?
- How do I sample a cohort from my institution's patient population?
- How do I make my work repeatable at another institution?

By providing a set of tools to help answer these questions, we aim to
accelerate the time from clinical question to actionable answer based on data
you already have. After the initial configuration to get the Cumulus framework
up and running, you can reuse much of that work for future research questions.

We've published several papers using Cumulus, to give you the idea of the work
the project enables. You can see the full list on the
[study list](https://docs.smarthealthit.org/cumulus/library/study-list.html)
\- each study contains a link to the associated paper.

## What you'll need

- You'll need the
[Cumulus ETL](https://docs.smarthealthit.org/cumulus/etl/)
set up to get data from your EHR if you haven't already. This usually requires
an IT project to initially configure. This handles extracting and de-identifying
the patient data.
- You'll need the
[Cumulus Library](https://docs.smarthealthit.org/cumulus/library/)
to help you analyze the data.
- You may need someone from your EHR support team to help you define a
patient cohort in your EHR system.
- You may need a data analyst who knows SQL and Python.

## Designing a study

Here's a workflow that most Cumulus studies will follow some version of.

### Define the question you're asking

Here are some examples of questions that have used Cumulus as a tool for analysis:

- [Can I detect patient symptoms from physician notes?](https://github.com/smart-on-fhir/cumulus-library-covid)
- [What was the effect of the Covid-19 pandemic on suicidality rates?](https://github.com/smart-on-fhir/cumulus-library-suicidality-los)
- [How much does my EHR comply with the requirements of the US Core Profile?](https://github.com/smart-on-fhir/cumulus-library-data-metrics/)

Generally, a good question for cumulus has one or more of the following qualities:

- Leverages the kind of data that would be found inside an EHR (i.e. encounter/patient
focused summary data)
- Targets a population larger than the footprint of a particular institution, requiring
data sharing between hospitals
- Uses clinical notes for identification of features, which requires NLP/LLM

Additionally, after initial setup, Cumulus offers good workflow repeatability for
studies. With a large extract from an EHR, you can ask questions, and get answers,
very quickly, so it becomes a more convenient source to use in general biostatistics
analysis.


### Create a patient cohort

There are two strategies for creating patient cohorts:

#### Inside Cumulus

Cumulus is designed to have a large export of hospital EHR data in it. If you
have such an extract, you'll need to filter relevant patients from the rest of
the population. Common strategies include:
- Encounters within a specific date range
- Encounters where a patient exhibited a symptom/set of symptoms
- Patients of a certain age range at time of encounter

You (or your data analyst) can create a table of relevant patients/encounters
implementing these constraints, and use this as a basis for additional downstream
queries.

#### Inside your EHR

If you are only loading targeted data into Cumulus, you can use the cohort creation
tools provided by your EHR to accomplish the same goals. You'll need to work with someone
on your EHR team to get the cohort specified and available for export.

### Consider if you need NLP/LLM analysis

Using natural language tools requires either leveraging an existing, or adding a new,
step to the Cumulus ETL pipeline. These are a little more complex, and will require some
more support from IT staff. The general approach is as follows:

#### Sample patients and perform chart review

For validation, you'll need some source of ground truth to evaluate your natural
language approach. We recommend getting an instance of
[Label Studio](https://labelstud.io/)
deployed inside an environment suitable for PHI. You can then ID a human manageable
number of patients (i.e. in the low hundreds). If you want, you can leverage
[propensity score matching](https://docs.smarthealthit.org/cumulus/library/statistics/propensity-score-matching.html)
to generate a set of positive/negative groups around a specific feature.
Once you've got a cohort, you can use
[Cumulus ETL](https://docs.smarthealthit.org/cumulus/etl/chart-review.html)
to load notes into this environment.

#### Decide on approach with your chart review data

You can use Label Studio to annotate the appropriate features (i.e. symptoms,
clinical procedures, etc) in your patient cohort. With this labeled data, you
can try NLP/LLM approaches, and use this dataset for validating how they
performed. The exact form of this will depend on the particular technology used -
for an example, see the
[Infectious Respiratory Disease Symptoms](https://github.com/smart-on-fhir/infectious-symptoms-llm-study)
project for an approach based on prompt iteration for LLMs.

#### Configure ETL to use your approach and rerun extract

Once you've got a LLM approach, your IT staff will need to
[add it to the ETL](https://docs.smarthealthit.org/cumulus/etl/nlp.html).
This is fairly simple for tasks leveraging a previously configured LLM.
After this, you simply rerun the ETL, and a table containing the outputs
of the LLM should be available for use in your study.

### Build your study

A
[Cumulus study](https://docs.smarthealthit.org/cumulus/library/creating-studies.html)
is a collection of intersections between different
[FHIR resources](https://docs.smarthealthit.org/cumulus/resources.html)
based on some properties among those resources. They can be expressed
via sentences, like:
- Give me every patient from the cohort with an encounter betweeen two dates
- Give me every patient that had a medication request for a specific medication
- Give me every patient that had a symptom that was IDed by my LLM module

You, or your data analyst, can convert statements like these into SQL queries.
The result of these can then be counted and binned.

### Analyzing results

Cumulus counts are outputted as power sets - you can analyze these in your
statistical package of choice, or you can upload them to the
dashboard
to use our chart building tools to review your data. If you are collaborating
with other institutions, you can upload your data to the
[aggregator](https://docs.smarthealthit.org/cumulus/aggregator/integration.html)
to automatically merge together data from the study participants.
2 changes: 1 addition & 1 deletion docs/contact.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: Got Questions?
_parent: Overview
nav_order: 1
nav_order: 2
# audience: non-programmers vaguely familiar with Cumulus, potentially annoyed
# type: explanation
---
Expand Down
2 changes: 1 addition & 1 deletion docs/resources.md
Original file line number Diff line number Diff line change
@@ -1,7 +1,7 @@
---
title: Supported Resources
_parent: Overview
nav_order: 2
nav_order: 3
# audience: non-programmers vaguely familiar with Cumulus
# type: explanation
---
Expand Down
2 changes: 1 addition & 1 deletion sources/cumulus-aggregator
Submodule cumulus-aggregator updated 36 files
+15 −9 .github/workflows/ci.yaml
+15 −8 .pre-commit-config.yaml
+21 −25 pyproject.toml
+1 −1 scripts/cumulus_upload_data.py
+0 −86 scripts/migrations/migration.002.column_types.py
+4 −1 src/handlers/dashboard/filter_config.py
+7 −8 src/handlers/dashboard/get_chart_data.py
+0 −122 src/handlers/dashboard/get_csv.py
+3 −3 src/handlers/dashboard/get_data_packages.py
+2 −4 src/handlers/dashboard/get_metadata.py
+3 −3 src/handlers/dashboard/get_study_periods.py
+1 −1 src/handlers/shared/awswrangler_functions.py
+2 −2 src/handlers/shared/decorators.py
+5 −14 src/handlers/shared/enums.py
+40 −107 src/handlers/shared/functions.py
+9 −8 src/handlers/site_upload/api_gateway_authorizer.py
+3 −3 src/handlers/site_upload/cache_api.py
+3 −3 src/handlers/site_upload/fetch_upload_url.py
+50 −103 src/handlers/site_upload/powerset_merge.py
+45 −41 src/handlers/site_upload/process_upload.py
+26 −29 src/handlers/site_upload/study_period.py
+10 −87 template.yaml
+33 −49 tests/conftest.py
+3 −3 tests/dashboard/test_filter_config.py
+9 −3 tests/dashboard/test_get_chart_data.py
+0 −150 tests/dashboard/test_get_csv.py
+29 −28 tests/dashboard/test_get_metadata.py
+12 −1 tests/dashboard/test_get_study_periods.py
+5 −1 tests/dashboard/test_get_subscriptions.py
+10 −7 tests/site_upload/test_api_gateway_authorizer.py
+3 −1 tests/site_upload/test_cache_api.py
+6 −1 tests/site_upload/test_fetch_upload_url.py
+42 −54 tests/site_upload/test_powerset_merge.py
+6 −3 tests/site_upload/test_process_upload.py
+8 −3 tests/site_upload/test_study_period.py
+1 −21 tests/utils.py

0 comments on commit 5e24b71

Please sign in to comment.