Project 2: Assessing quality and privacy metrics of synthetic health data for benchmarking: the variant callers use case
This project aims to generate synthetic genomic data while guaranteeing the privacy and safety of sensitive private data. This is a hot topic: new, more sophisticated generators are published every day, and infrastructure was already provided to the community during last year's BH22, where we focused on clinical data. The current challenge is the quality and privacy of the generated synthetic data, and we need metrics to assess both in a transparent and systematic way.
However, there is currently no consensus in the community on which evaluation metrics to use to check the quality and privacy of synthetic datasets generated for health research, such as genomics or phenomics analyses. Such consensus is key for benchmarking (like the OpenEBench effort) and for downstream applications such as variant calling, where ground truth datasets are needed against which results can be compared and assessed.
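As an illustration of the ground-truth comparison described above, the following is a minimal sketch, not a proposed standard, of how a variant caller's output could be scored against a known truth set. It assumes that variants are matched exactly on (chromosome, position, REF, ALT) and that both inputs are already normalized the same way; real benchmarks usually also handle differing indel representations and genotype concordance. The function names (`parse_vcf_lines`, `score_calls`) and the toy records are hypothetical and chosen only for this example.

```python
from typing import Iterable, NamedTuple, Set, Tuple

# A variant is identified here by (chrom, pos, ref, alt); this is an assumption
# of the sketch, not a community-agreed definition of a match.
Variant = Tuple[str, int, str, str]


def parse_vcf_lines(lines: Iterable[str]) -> Set[Variant]:
    """Collect (chrom, pos, ref, alt) keys from tab-separated VCF text lines."""
    variants: Set[Variant] = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and VCF header lines
        fields = line.split("\t")
        chrom, pos, ref, alts = fields[0], int(fields[1]), fields[3], fields[4]
        for alt in alts.split(","):  # one key per alternate allele
            variants.add((chrom, pos, ref, alt))
    return variants


class Scores(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    precision: float
    recall: float
    f1: float


def score_calls(truth: Set[Variant], calls: Set[Variant]) -> Scores:
    """Compare called variants with the ground truth by exact key matching."""
    tp = len(truth & calls)   # called and present in the truth set
    fp = len(calls - truth)   # called but absent from the truth set
    fn = len(truth - calls)   # in the truth set but not called
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return Scores(tp, fp, fn, precision, recall, f1)


# Hypothetical toy records, invented purely for illustration.
TOY_TRUTH = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t75\t.\tG\tA,C",
]
TOY_CALLS = [
    "#CHROM\tPOS\tID\tREF\tALT",
    "chr1\t100\t.\tA\tG",
    "chr1\t250\t.\tC\tT",
    "chr2\t75\t.\tG\tA",
    "chr3\t10\t.\tT\tC",
]

if __name__ == "__main__":
    scores = score_calls(parse_vcf_lines(TOY_TRUTH), parse_vcf_lines(TOY_CALLS))
    print(scores)
```

With synthetic data, the truth set would come from the variants inserted by the generator, so the same comparison can be repeated for each combination of generator and variant caller.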
Participants will form two groups, the generation group and the benchmarking group, which will work in parallel. In this way they will learn about the generation of synthetic data and the importance of accurately capturing the characteristics of real-world data. Participants will also learn about the limitations and biases of different generative models, the importance of evaluation indicators, and the benchmarking of variant calling algorithms, including how these factors can affect the algorithms' performance.
Overall, the proposed project will provide an exciting opportunity to explore the use of synthetic data in life sciences and also connect with previous ELIXIR efforts.
Focus: to promote the use of synthetic data, set evaluation best practices, and connect with the efforts of the ELIXIR ML Focus Group.
- Identify a list of genomics and phenomics metrics (short term)
- Suggest evaluation best practices (short term)
- BioHackrXiv paper (short term)
- Introduce practices and metrics visualization into the infrastructure developed at last year's BH-EU and/or into OpenEBench, to be used by the ELIXIR community (long term)
- Expand the project scope to include additional variant callers and synthetic data generators (long term)
- Paper (long term)
We plan to use the 5-day BH to deliver the first three outcomes, start designing and implementing the fourth, and discuss the fifth.
Required expertise: Python/R developer(s) with experience in data science, synthetic data experts and users, and researcher(s) with experience in variant calling and/or statistics/ML.
Synthetic data generation (NEAT, ART) and variant calling (GATK, VarScan) algorithms.
Styliani-Christina Fragkouli, Alberto Labarga, Núria Queralt Rosinach