FHIR Validation on Apache Spark

This project provides tools for validating large sets of FHIR resources in parallel using Apache Spark. It supports validation of resources in NDJSON format with the HL7 FHIR Validator.

Prerequisites

Java 11
Spark 3.4.x
Maven
Python 3.x
Pip

Installing the Project from the distribution

The binary distribution of the project is available as a tar.gz file in the GitHub releases.

To install the project from the distribution tar.gz, follow these steps:

Download the distribution tar.gz file:
- Navigate to the Releases page of the project on GitHub.
- Download the latest release tar.gz file (e.g., fhir-validator-spark-VERSION-dist.tar.gz).
Extract the tar.gz file:
- Use the following command to extract the tar.gz file:
```
tar -xzf fhir-validator-spark-VERSION-dist.tar.gz
```
Navigate to the extracted directory:
- Change to the directory created by extracting the tar.gz file:
```
cd fhir-validator-spark-VERSION/
```

Building the Project

To build the project from sources, run the following command:

  mvn clean install

Validating FHIR Resources

The validation of FHIR resources is a two-step process:

1. Validate the FHIR resources using the command line application (e.g. validate-fhir) to produce a Parquet dataset with the validation results.
2. Generate a validation report from the Parquet validation results dataset using one of the provided Python scripts (e.g. validation-report-issues.py).

Setting up the environment

To set up the environment, install the required Python packages defined in the env/requirements.txt file into your Python environment (conda, virtualenv, etc.).

  pip install -r env/requirements.txt

Validating a single FHIR resource

data/mimic-iv-demo-10 contains a small sample of mimic-fhir-demo resources in NDJSON format.

To validate a single FHIR resource (MimicPatient) against the MIMIC-FHIR IG, run the following command:

  bin/validate-fhir data/mimic-iv-demo-10/MimicPatient.ndjson target/validation-patient --ig data/packages/kindlab.fhir.mimic/package.tgz

The validation results are stored in Parquet format in the target/validation-patient directory.

To generate a validation report that includes all levels of issues, run the following command:

  bin/validation-report-issues.py target/validation-patient target/report-patient.html --min-level 0

Then open the target/report-patient.html file in a web browser to see the results.

Note that the value of the "File name" column in the report corresponds to the input file name.
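The two commands in this section can also be chained from a Python script. The following is a minimal sketch and not part of the project: the helper names (validate_command, report_command, run_pipeline) are hypothetical, the paths and flags are copied from the commands in this section, and it assumes the script is run from the project root.

```python
import subprocess
from typing import List, Optional


def validate_command(source: str, output: str, ig: Optional[str] = None) -> List[str]:
    """Build the argument list for step 1 (validation to a Parquet dataset)."""
    cmd = ["bin/validate-fhir", source, output]
    if ig is not None:
        # --ig is used as in the example command of this section.
        cmd += ["--ig", ig]
    return cmd


def report_command(results: str, report: str, min_level: int = 0) -> List[str]:
    """Build the argument list for step 2 (HTML report from the Parquet dataset)."""
    return ["bin/validation-report-issues.py", results, report,
            "--min-level", str(min_level)]


def run_pipeline() -> None:
    """Run both steps in order; check=True stops at the first failing step."""
    steps = [
        validate_command(
            "data/mimic-iv-demo-10/MimicPatient.ndjson",
            "target/validation-patient",
            ig="data/packages/kindlab.fhir.mimic/package.tgz",
        ),
        report_command("target/validation-patient", "target/report-patient.html"),
    ]
    for cmd in steps:
        subprocess.run(cmd, check=True)
```

Calling run_pipeline() should leave the report in target/report-patient.html, provided both command line tools are available under bin/.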

Validating multiple FHIR resources

data/mimic-iv-demo-10_partitioned contains a small sample of the mimic-fhir-demo dataset, partitioned by filename using the Hive partitioning scheme.

To validate all the resources in the data/mimic-iv-demo-10_partitioned directory, run the following command:

  bin/validate-fhir data/mimic-iv-demo-10_partitioned target/validation-result-partitioned

The validation results are stored in Parquet format in the target/validation-result-partitioned directory.

To generate a validation report that includes all levels of issues, run the following command:

  bin/validation-report-issues.py target/validation-result-partitioned target/report-partitioned.html --min-level 0

Then open the target/report-partitioned.html file in a web browser to see the results.

Note that the value of the "File name" column in the report corresponds to the filename partition in the dataset.

Command line applications

validate-fhir

A command line application that validates large numbers of FHIR resources in NDJSON format, using Apache Spark for parallel processing and the HL7 FHIR Validator for validation.

The input is either a text file in NDJSON format, with each line containing one FHIR resource, or a directory containing such NDJSON files. Additionally, the input directory can be partitioned using Hive-style partitioning on the filename column, e.g.:

dataset/
    filename=MimicPatient/
        part-00000.ndjson
        part-00001.ndjson
        ...
    filename=MimicObservation/
        part-00000.ndjson
        part-00001.ndjson
        ...
    ...

For non-partitioned data the filename column is added to the dataset with the value of the input file.
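To illustrate the input layout, here is a minimal plain-Python sketch (no Spark) of how each NDJSON line could be paired with a filename value. It is not the project's implementation: the helper names partition_value and read_ndjson are hypothetical, and it assumes that for non-partitioned input the filename column holds the bare name of the input file.

```python
import json
from pathlib import Path
from typing import Iterator, Tuple

PARTITION_KEY = "filename"  # partition column named in this README


def partition_value(path: Path) -> str:
    """Return the Hive partition value found in `path`, else the file name."""
    for part in path.parts:
        key, sep, value = part.partition("=")
        if sep and key == PARTITION_KEY:
            return value
    # Assumption: non-partitioned input uses the bare input file name.
    return path.name


def read_ndjson(root: Path) -> Iterator[Tuple[str, dict]]:
    """Yield (filename, resource) pairs from one NDJSON file or a directory of them."""
    files = [root] if root.is_file() else sorted(root.rglob("*.ndjson"))
    for file in files:
        with file.open(encoding="utf-8") as handle:
            for line in handle:
                if line.strip():  # skip blank lines
                    yield partition_value(file), json.loads(line)
```

With the directory layout above, every resource read from dataset/filename=MimicPatient/ would be paired with the value MimicPatient.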

The output is a parquet dataset with the following schema:

    root
     |-- resource: string (nullable = false)  // FHIR resource
     |-- filename: string (nullable = false)  // filename of the resource
     |-- issues: array (nullable = true)
     |    |-- element: struct (containsNull = false)
     |    |    |-- level: string (nullable = false)  // Issue severity (information, warning, error, fatal)
     |    |    |-- type: string (nullable = false)  // Issue type (according to the validator classification)
     |    |    |-- message: string (nullable = false)  // Issue message
     |    |    |-- messageId: string (nullable = true)  // Issue message id
     |    |    |-- location: string (nullable = true)  // Issue location, e.g. the fhirpath expression
     |    |    |-- line: integer (nullable = true)  // Issue line number
     |    |    |-- col: integer (nullable = true)  // Issue column number
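For readers who prefer code to a schema listing, the structure above can be mirrored with plain Python dataclasses. This is only an illustrative sketch: the class names Issue and ValidationRow and the helper explode_issues are hypothetical and do not exist in the project. Nullable fields are modelled as Optional, and a missing issues array is treated as "no issues".

```python
from dataclasses import asdict, dataclass
from typing import Iterator, List, Optional


@dataclass
class Issue:
    level: str                       # information, warning, error or fatal
    type: str                        # validator issue classification
    message: str
    messageId: Optional[str] = None
    location: Optional[str] = None   # e.g. a fhirpath expression
    line: Optional[int] = None
    col: Optional[int] = None


@dataclass
class ValidationRow:
    resource: str                    # the FHIR resource as a JSON string
    filename: str
    issues: Optional[List[Issue]] = None


def explode_issues(row: ValidationRow) -> Iterator[dict]:
    """Yield one flat record per issue, carrying the row's filename along."""
    for issue in row.issues or []:
        yield {"filename": row.filename, **asdict(issue)}
```

Flattening to one record per issue is the shape that a per-issue report would typically start from.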

To see the available options, run the following command:

  bin/validate-fhir --help

The application is implemented in the au.csiro.fhir.validation.cli.ValidateApp class.

Report generation

validation-report-issues.py

Generates a single report file with the issues aggregated by:

level
type
message_id
filename

For each group an example is provided (including the actual message, the location of the issue and the json representation of the resource) as well as the count of the issues.
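The grouping described above can be sketched in plain Python as follows. This is an illustration of the idea, not the script's actual code: the function name aggregate is hypothetical, issues are assumed to be flat dictionaries keyed as in the output schema (so the message id is read from messageId), and the first issue seen in each group is kept as the example.

```python
from typing import Dict, Iterable, Optional, Tuple

GroupKey = Tuple[str, str, Optional[str], str]


def aggregate(issues: Iterable[dict]) -> Dict[GroupKey, dict]:
    """Group issues by (level, type, message id, filename) with a count and one example."""
    groups: Dict[GroupKey, dict] = {}
    for issue in issues:
        key = (issue["level"], issue["type"], issue.get("messageId"), issue["filename"])
        # Assumption: the first issue encountered serves as the group's example.
        group = groups.setdefault(key, {"count": 0, "example": issue})
        group["count"] += 1
    return groups
```

Each value in the returned dictionary holds the number of issues in the group and one representative issue, from which the message and location can be displayed.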

Command line options can be used to:

filter the issues by level (--min-level)
limit the total number of issues in the report (--limit)
exclude messages matching the given SQL LIKE pattern(s) (--exclude-message)
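The effect of these three options can be approximated in plain Python. The sketch below is not the script's implementation and rests on stated assumptions: the numeric ranking of levels (information = 0 up to fatal = 3) is a guess consistent with --min-level 0 including all levels, the limit is applied to issues in input order, and the LIKE translation covers only the % and _ wildcards (no escape character). The names LEVEL_RANK, like_to_regex and filter_issues are hypothetical.

```python
import re
from typing import Iterable, List, Optional, Sequence

# Assumed ordering of severities; the real mapping may differ.
LEVEL_RANK = {"information": 0, "warning": 1, "error": 2, "fatal": 3}


def like_to_regex(pattern: str) -> re.Pattern:
    """Translate a SQL LIKE pattern (% = any run, _ = any one char) to a regex."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts), re.DOTALL)


def filter_issues(
    issues: Iterable[dict],
    min_level: int = 0,
    exclude_message: Sequence[str] = (),
    limit: Optional[int] = None,
) -> List[dict]:
    """Keep issues at or above min_level whose message matches no excluded pattern."""
    excluded = [like_to_regex(p) for p in exclude_message]
    kept: List[dict] = []
    for issue in issues:
        if limit is not None and len(kept) >= limit:
            break
        if LEVEL_RANK[issue["level"]] < min_level:
            continue
        # LIKE must match the whole message, hence fullmatch.
        if any(rx.fullmatch(issue["message"]) for rx in excluded):
            continue
        kept.append(issue)
    return kept
```

Under these assumptions, a pattern such as %not found% would drop every issue whose message contains the text "not found".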

To see the available options, run the following command:

  bin/validation-report-issues.py --help

Important note

This software is currently in alpha. It is not yet ready for production use.
