- Fixes for issue #19 (thanks to @nsandau for the help):
- Where OPCS searches were not always performed correctly if only OPCS3/4 codes were provided.
- When using "group_by" in `get_df()` some diagnoses were incorrectly carried over between groups when different vocabs were provided for each group (condition).
- Additional checking of `get_diagnoses()` input to abort if "blank" codes are provided to the grep.
- When getting date first from self-reported illness data exclude "year" if < 1936 (earliest birth year for any participant)
- Baseline dates TSV is now correctly located even if user changes working directory
- HES operations dates were sometimes parsed as character - this is now fixed to parse as dates
- Warnings relating to parsing issues during grepping that are safe to ignore are now suppressed
- Updates to documentation / examples / pkgdown site
- New website articles for `ascertain_diagnoses`, `label_fields` and `spark_functions`
- New function `label_ukb_field()` allows user to add titles and labels to UK Biobank fields that are provided as integers but are categorical.
- New function `label_ukb_fields()` is a wrapper for the above. User just provides a data frame containing UK Biobank fields, and they all get formatted with titles (and labels if categorical).
- Data from the UK Biobank schema (https://biobank.ctsu.ox.ac.uk/crystal/schema.cgi) are stored internally in `ukbrapR:::ukb_schema`
- {haven} dependency added for labelling
- Exported `baseline_dates.tsv` now also includes the assessment centres for completeness (but keeps the same filename to avoid any issues for current projects relying on already-exported files)
- Fix for issue #10. Grep issues if user provided only Read2 or CTV3 codes, if Read2 or CTV3 were <5 characters, or if Read2/CTV3 codes contained a hyphen. Thanks to @Simon-Leyss for highlighting.
- Fix for issue #11. When getting self-reported illness codes there was a problem joining the tables if user only provided cancer codes. Thanks to @LauricF for highlighting.
- Fix for when both types of self-reported illness codes were provided. (Incorrect subsetting to just those codes provided after pivoting the long object.)
- When getting the date first cancer registry diagnosis, some rows were duplicated. This is now fixed so only one row per participant (the date first for any matched cancer ICD10) is returned.
- Updated internal paths for my servers `indy` and `snow` (for ongoing projects whilst we can still use local files...)
- Updated how `get_diagnoses()` and `get_df()` handle a user-provided `file_paths` object
- Fix for issue #8. In moving the HES ICD10 code block below the cancer registry code I accidentally put it within the `if (get_canreg) { }` condition. Thanks to @LauricF for highlighting.
- Fix bullet points in pkgdown version of docs
- The HESIN diagnosis search can now also include ICD9 codes in the provided codes data frame. These use fuzzy matching (similar to the ICD10s) so that searching for "280" also returns "2809" etc
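
The matching behaviour described above can be sketched in base R. This is only an illustration, assuming the "fuzzy" match is a match on the start of the code; it is not the package's internal implementation, and `hes_icd9` and `search_codes` are hypothetical names with made-up values.

```r
# Hypothetical ICD9 codes as they might appear in a HESIN diagnosis table
hes_icd9 <- c("2809", "2800", "281", "4280", "280")

# Codes supplied by the user (assumption: one short ICD9 code)
search_codes <- c("280")

# Anchor each code at the start of the string, so "280" matches "2809"
# and "2800" but not "4280" (where "280" only appears mid-string)
pattern <- paste0("^(", paste(search_codes, collapse = "|"), ")")

matched <- hes_icd9[grepl(pattern, hes_icd9)]
print(matched)
```

Anchoring with `^` is the design choice that matters here: an unanchored pattern would also return unrelated codes that merely contain the digits.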
- Fix for issue #5. The file paths for exported tables were not correctly specified in later calls of `get_diagnoses()` when the working directory is not the home directory. Thanks to @LauricF for highlighting.
This is a major update as I move away from using Spark as the default environment, mostly due to the cost implications; it is significantly cheaper (and quicker!) to store and search exported raw text files in the RAP persistent storage than do everything in a Spark environment (plus the added benefit that the RStudio interface is available in "normal" instances).
The Spark functions are available as before but all updates are to improve functionality in "normal" instances using RStudio, as we move to the new era of RAP-only UK Biobank analysis.
- Added internal data frame containing default paths for exported files in a RAP project (view with `ukbrapR:::ukbrapr_paths`)
- Added function `export_tables()` which only needs to be run once when a new project is created. This submits the required table exporter commands to extract each of the tables in `ukbrapR:::ukbrapr_paths`. This can take ~15 minutes to export all the tables. ~10Gb of text files are created. This will cost ~£0.15 per month to store in the RAP standard storage.
- `get_emr()` is split into two primary underlying functions: `get_emr_spark()` which has not changed, and `get_emr()` which is the "new way" (i.e., `get_emr_local()` is entirely removed)
- Added functionality for `hesin_oper` (HES OPCS operations) searching for ICD10 codes in `get_emr()`
- New/updated internal functions
  - `get_cancer_registry()` ascertains cases using ICD10s in the `cancer_registry` data, and works much the same as `get_selfrep_illness()`
- New function `get_diagnoses()` is a wrapper to get HES diagnosis, operations, cause of death, GP, cancer registry, and self-reported illness data -- i.e., one function to provide all codes to, and return all health-related data
- `get_df()` takes all output from `get_diagnoses()` i.e., now also identifies date of first in matched `cancer_registry` and `hesin_oper` entries, in addition to `hes_diag`, `gp_clinical`, `death_cause` and `selfrep_illness` as before.
- When getting "date first" using `get_df()` the baseline data is used to create binary case/control variables (for ever and prevalent), and for controls the censoring date is included in the overall `_df` variable (default is 30-10-2022).
To make it absolutely clear: the Spark function `get_emr_spark()` has not been updated but I am no longer focussed on doing things this way. If you want to submit Pull Requests to improve functions please do. The below changes are to substantially improve the experience of using exported tables in the RAP environment only (if you have all the data on a local system already it will work, assuming you format correctly and provide the paths, but the RAP is the future).
- Fix Spark database error when >1 dataset file is available. Fixes issue #3
- Fix `get_df()` error when ascertaining GP diagnoses if 7-character codes were provided rather than 5
- `get_emr()` now accepts option "file_paths" - if not provided, attempts to get from Spark
- Improve documentation and examples
- Fix `get_df()` error occurring when not all sources are desired
- `get_emr_local()` option "local_paths" is now "file_paths"
- Improve documentation and examples
- Fix problem identifying ICD10 column name in RAP HESIN
- Fix problem getting date first for GP data (excluding missing dates before summarizing)
- When ascertaining multiple conditions it is quicker/easier to supply `get_emr()` with all the codes at once (as before), but you can now use `get_df()` with option "group_by" to indicate the condition names in the `codes_df` object provided. See documentation.
- It is no longer possible to provide custom column names for the `codes_df` to `get_emr()` -- these now must be `vocab_id` and `code` -- makes things much simpler.
- Remove ICD9 code from `codes_df_hh` example as these are not currently used
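
As a rough sketch of the expected input shape, the base R snippet below builds a codes data frame with the two required columns plus a grouping column, and checks it before use. The `condition` column name, the `vocab_id` values and the codes themselves are illustrative assumptions, not taken from the package; no ukbrapR functions are called.

```r
# Hypothetical codes data frame: `vocab_id` and `code` are the required
# column names; `condition` is an assumed name for the grouping column
codes_df <- data.frame(
  condition = c("hh", "ckd", "ckd"),
  vocab_id  = c("ICD10", "ICD10", "ICD10"),
  code      = c("E831", "N18", "N19"),
  stringsAsFactors = FALSE
)

# Check the required columns are present
missing_cols <- setdiff(c("vocab_id", "code"), names(codes_df))
stopifnot(length(missing_cols) == 0)

# Check for blank codes, which would otherwise match every record in a grep
has_blank <- any(is.na(codes_df$code) | trimws(codes_df$code) == "")
stopifnot(!has_blank)

# List the codes belonging to each condition (what "group_by" would separate)
codes_by_condition <- split(codes_df$code, codes_df$condition)
print(codes_by_condition)
```

Keeping every condition in one data frame means the records only need to be searched once, with the grouping applied afterwards.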
- New function `get_emr_local()`. If the user has text files for `hesin_diag` and `gp_clinical` etc. these can be searched (rather than Apache Spark queries). This therefore can work on "normal" DNAnexus nodes, or local servers. Most downstream functions also do not rely on Spark clusters if data extracts are available.
- Change URL to reflect my GitHub username change from `lukepilling` to `lcpilling` to be more consistent between different logins, websites, and social media
  - https://lcpilling.github.io/ukbrapR
  - https://github.com/lcpilling/ukbrapR
- Added dependency {cli} for improved alert/error reporting
- New argument "prefix" for `get_df()` - user can provide a string to prefix to the output variable names
- `get_selfrep_illness()` - gets illness information from self-report fields. Derives a "date first" from the age/year reported, incorporating all visits for the participant
- Two example code lists are included: `codes_df_ckd` (GEMINI CKD), and `codes_df_hh` (haemochromatosis, with self-report)
- `get_emr_df()` is re-named `get_df()` to reflect it can now include information from self-reported illness
- `get_emr_diagnoses()` is re-named `get_emr()` to reflect it actually retrieves any record in `gp_clinical` not just diagnoses (e.g., BMI if appropriate codes provided)
- So many
- `get_emr_diagnoses()` - function to get electronic medical records diagnoses from Spark-based death records, hospital episode statistics, and primary care (GP) databases.
- `get_emr_df()` - function to get date first diagnosed with any provided code from any above Electronic Medical Record source.
- Extra input checking in `get_rap_phenos()` and output more consistent for direct use with `get_emr_*()` functions
- Updated URL for example CKD clinical codes
Initial release containing two functions:
- `get_rap_phenos()`
- `upload_to_rap()`