Merge pull request #8 from monarch-initiative/develop
documentation
pnrobinson authored Apr 29, 2024
2 parents 3c157fd + 28f1c42 commit 50e31d5
Showing 10 changed files with 116 additions and 98 deletions.
3 changes: 3 additions & 0 deletions .gitignore
@@ -22,3 +22,6 @@
# virtual machine crash logs, see http://www.java.com/en/download/help/error_hotspot.xml
hs_err_pid*
replay_pid*
/.idea/
/data/
/prompts/
34 changes: 34 additions & 0 deletions docs/batch.md
@@ -0,0 +1,34 @@
# batch

This command creates prompts from all phenopackets in the input directory.

## Getting the input data

Go to the [Releases](https://github.com/monarch-initiative/phenopacket-store/releases) section of
[phenopacket-store](https://github.com/monarch-initiative/phenopacket-store){:target="_blank"}, and download the
latest release (currently 0.1.5 on April 29, 2024, but evolving rapidly). Currently, this resource contains over 4300 phenopackets.


Download one of the archives (e.g., ``all_phenopackets.zip``) and unpack it in a location of your choice.


Then run the following command.

```shell title="batch"
java -jar phenopacket2prompt.jar batch -d <all_phenopackets>
```
where ``<all_phenopackets>`` is the complete relative or absolute path to the unpacked directory.

phenopacket2prompt will create a new subdirectory called ``prompts`` in the current directory. It will contain
one folder for each language (currently English, ``en``, and Spanish, ``es``), as well as a file called ``correct_results.tsv``
with the following structure:


| Disease name | OMIM identifier | Prompt file name |
|--------------------------------------------|:---------------:|-------------------------------------------------:|
| Birt-Hogg-Dube syndrome 2 | OMIM:620459 | PMID_36440963_IIIPMID_36440963_III-33-prompt.txt |
| Immunodeficiency 115 with autoinflammation | OMIM:620632 | PMID_26008899_patient-prompt.txt |
| Jacobsen syndrome | OMIM:147791 | PMID_15266616_148-prompt.txt |


Note that the prompt file name is the same for every language.
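
As a rough illustration of how this file might be consumed downstream, the following sketch reads ``correct_results.tsv`` and pairs each OMIM identifier with the paths of its English and Spanish prompts. The class ``CorrectResultsReader`` and its methods are hypothetical and are not part of phenopacket2prompt. The sketch assumes three tab-separated columns in the order shown in the table above, no header row, and prompt folders named ``en`` and ``es`` inside ``prompts``.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of phenopacket2prompt.
public class CorrectResultsReader {

    /** One row of correct_results.tsv: disease name, OMIM identifier, prompt file name. */
    public record Row(String diseaseName, String omimId, String promptFileName) {}

    /** Parses tab-separated lines, keyed by prompt file name; lines without exactly three columns are skipped. */
    public static Map<String, Row> parse(List<String> lines) {
        Map<String, Row> byPromptFile = new LinkedHashMap<>();
        for (String line : lines) {
            String[] cols = line.split("\t", -1);
            if (cols.length != 3) {
                continue; // assumption: anything else is a blank or malformed line
            }
            Row row = new Row(cols[0].trim(), cols[1].trim(), cols[2].trim());
            byPromptFile.put(row.promptFileName(), row);
        }
        return byPromptFile;
    }

    /** Builds the path of a prompt inside a language folder such as "en" or "es". */
    public static Path promptPath(Path promptsDir, String language, String promptFileName) {
        return promptsDir.resolve(language).resolve(promptFileName);
    }

    public static void main(String[] args) throws IOException {
        Path promptsDir = Path.of(args.length > 0 ? args[0] : "prompts");
        Path tsv = promptsDir.resolve("correct_results.tsv");
        if (!Files.exists(tsv)) {
            System.out.println("No correct_results.tsv found in " + promptsDir);
            return;
        }
        // The prompt file name is shared by all languages, so one row yields both paths.
        for (Row row : parse(Files.readAllLines(tsv)).values()) {
            System.out.printf("%s\t%s\t%s%n", row.omimId(),
                    promptPath(promptsDir, "en", row.promptFileName()),
                    promptPath(promptsDir, "es", row.promptFileName()));
        }
    }
}
```

Running ``java CorrectResultsReader.java prompts`` from the directory that contains ``prompts`` would print one line per prompt with the expected OMIM identifier and both language-specific paths.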
3 changes: 3 additions & 0 deletions docs/english.md
@@ -0,0 +1,3 @@
# English

Todo -- let's write a summary of the translations in each language.
30 changes: 0 additions & 30 deletions docs/index.md
@@ -8,36 +8,6 @@ GA4GH phenopackets.



## Installation


Most users should download the prebuilt executable file from the
[Releases](https://github.com/monarch-initiative/phenopacket2prompt/releases) page of the GitHub repository.

It is also possible to build the application from source using standard Maven and Java tools.

```shell title="building the app"
git clone https://github.com/monarch-initiative/phenopacket2prompt.git
cd phenopacket2prompt
mvn package
java -jar target/phenopacket2prompt.jar
```

## Setup


First download the latest copy of the [Human Phenotype Ontology](https://hpo.jax.org/app/) hp.json file. This file is
used for text mining of clinical signs and symptoms. For more information about the HPO, see
[Koehler et al. (2021)](https://pubmed.ncbi.nlm.nih.gov/33264411/). Adjust the path to the `phenopacket2prompt.jar`
file as necessary.



```shell title="download"
java -jar phenopacket2prompt.jar download
```




## Running phenopacket2prompt
41 changes: 28 additions & 13 deletions docs/setup.md
@@ -1,6 +1,6 @@
# Set-up

TODO -- how to setup Java etc.
phenopacket2prompt requires Java 17 or later. To build it from source, Maven is also required.

## Download command
Before running the batch command, run the download command to get the necessary files
@@ -9,19 +9,34 @@ Before running the batch command, run the download command to get the necessary
java -jar target/phenopacket2prompt.jar download
```

## Batch command
To run the batch command, first download the latest release from the
[releases](https://github.com/monarch-initiative/phenopacket-store/releases) section of the phenopacket-store
repository. Unpack either all_phenopackets.tgz or all_phenopackets.zip (the files are identical except for the
method of compression).


## Installation


Most users should download the prebuilt executable file from the
[Releases](https://github.com/monarch-initiative/phenopacket2prompt/releases) page of the GitHub repository.

It is also possible to build the application from source using standard Maven and Java tools.

```shell title="building the app"
git clone https://github.com/monarch-initiative/phenopacket2prompt.git
cd phenopacket2prompt
mvn package
java -jar target/phenopacket2prompt.jar
```
java -jar target/phenopacket2prompt.jar batch -d <all_phenopackets>
```
Replace `<all_phenopackets>` with the actual path on your system.

The app should create a folder "prompts", with two subdirectories, "en" and "es" with English and Spanish prompts.
There are some errors that still need to be fixed, but several thousand prompts should appear.
## Setup


First download the latest copy of the [Human Phenotype Ontology](https://hpo.jax.org/app/) hp.json file. This file is
used for text mining of clinical signs and symptoms. For more information about the HPO, see
[Koehler et al. (2021)](https://pubmed.ncbi.nlm.nih.gov/33264411/). Adjust the path to the `phenopacket2prompt.jar`
file as necessary.



```shell title="download"
java -jar phenopacket2prompt.jar download
```

## Todo
also output a file with expected diagnosis
Original file line number Diff line number Diff line change
@@ -8,6 +8,7 @@
import org.monarchinitiative.phenopacket2prompt.international.HpInternationalOboParser;
import org.monarchinitiative.phenopacket2prompt.model.PhenopacketDisease;
import org.monarchinitiative.phenopacket2prompt.model.PpktIndividual;
import org.monarchinitiative.phenopacket2prompt.output.CorrectResult;
import org.monarchinitiative.phenopacket2prompt.output.PromptGenerator;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -60,12 +61,26 @@ public Integer call() throws Exception {
        LOGGER.info("Got {} translations", internationalMap.size());
        List<File> ppktFiles = getAllPhenopacketJsonFiles();
        createDir("prompts");
        outputPromptsEnglish(ppktFiles, hpo);
        List<CorrectResult> correctResultList = outputPromptsEnglish(ppktFiles, hpo);
        // output all non-English languages here
        PromptGenerator spanish = PromptGenerator.spanish(hpo, internationalMap.get("es"));
        outputPromptsInternational(ppktFiles, hpo, "es", spanish);
        // output file with correct diagnosis list
        outputCorrectResults(correctResultList);
        return 0;
    }

    private void outputCorrectResults(List<CorrectResult> correctResultList) {
        File outfile = new File("prompts" + File.separator + "correct_results.tsv");
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(outfile))) {
            for (var cres : correctResultList) {
                bw.write(String.format("%s\t%s\t%s\n", cres.diseaseLabel(), cres.diseaseId().getValue(), cres.promptFileName()));
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        System.out.printf("[INFO] Output a total of %d prompts in en and es.\n", correctResultList.size());
    }


private String getFileName(String phenopacketID) {
@@ -99,10 +114,11 @@ private void outputPromptsInternational(List<File> ppktFiles, Ontology hpo, Stri
}


    private void outputPromptsEnglish(List<File> ppktFiles, Ontology hpo) {
    private List<CorrectResult> outputPromptsEnglish(List<File> ppktFiles, Ontology hpo) {
        createDir("prompts/en");
        List<CorrectResult> correctResultList = new ArrayList<>();
        PromptGenerator generator = PromptGenerator.english(hpo);
        List<String> diagnosisList = new ArrayList<>();

        for (var f: ppktFiles) {
            PpktIndividual individual = new PpktIndividual(f);
            List<PhenopacketDisease> diseaseList = individual.getDiseases();
@@ -114,13 +130,15 @@ private void outputPromptsEnglish(List<File> ppktFiles, Ontology hpo) {
            String promptFileName = getFileName( individual.getPhenopacketId());
            String diagnosisLine = String.format("%s\t%s\t%s\t%s", pdisease.getDiseaseId(), pdisease.getLabel(), promptFileName, f.getAbsolutePath());
            try {
                diagnosisList.add(diagnosisLine);
                String prompt = generator.createPrompt(individual);
                outputPrompt(prompt, promptFileName, "prompts/en");
                var cres = new CorrectResult(promptFileName, pdisease.getDiseaseId(), pdisease.getLabel());
                correctResultList.add(cres);
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
        return correctResultList;
    }


Original file line number Diff line number Diff line change
@@ -0,0 +1,6 @@
package org.monarchinitiative.phenopacket2prompt.output;

import org.monarchinitiative.phenol.ontology.data.TermId;

public record CorrectResult(String promptFileName, TermId diseaseId, String diseaseLabel) {
}
Original file line number Diff line number Diff line change
@@ -6,30 +6,14 @@ public class PpktTextEnglish implements PhenopacketTextGenerator {
@Override
public String QUERY_HEADER() {
return """
I am running an experiment on a clinicopathological case conference to see how your diagnoses
compare with those of human experts. I am going to give you part of a medical case. These have
all been published in the New England Journal of Medicine. You are not trying to treat any patients.
As you read the case, you will notice that there are expert discussants giving their thoughts.
In this case, you are “Dr. GPT-4,” an Al language model who is discussing the case along with
human experts. A clinicopathological case conference has several unspoken rules. The first is
that there is most often a single definitive diagnosis (though rarely there may be more than one),
and it is a diagnosis that is known today to exist in humans. The diagnosis is almost always
confirmed by some sort of clinical pathology test or anatomic pathology test, though in
rare cases when such a test does not exist for a diagnosis the diagnosis can instead be
made using validated clinical criteria or very rarely just confirmed by expert opinion.
You will be told at the end of the case description whether a diagnostic test/tests are
being ordered, which you can assume will make the diagnosis/diagnoses. After you read the case,
I want you to give two pieces of information. The first piece of information is your most likely
diagnosis/diagnoses. You need to be as specific as possible -- the goal is to get the correct
answer, not a broad category of answers. You do not need to explain your reasoning, just give
the diagnosis/diagnoses. The second piece of information is to give a robust differential diagnosis,
ranked by their probability so that the most likely diagnosis is at the top, and the least likely
is at the bottom. There is no limit to the number of diagnoses on your differential. You can give
as many diagnoses as you think are reasonable. You do not need to explain your reasoning,
just list the diagnoses. Again, the goal is to be as specific as possible with each of the
diagnoses.
Do you have any questions, Dr. GPT-4?
I am running an experiment on a clinical case report to see how your diagnoses compare with those of human experts. I am going to give you part of a medical case. You are not trying to treat any patients. In this case, you are “Dr. GPT-4,” an AI language model who is providing a diagnosis. Here are some guidelines. First, there is a single definitive diagnosis, and it is a diagnosis that is known today to exist in humans. The diagnosis is almost always confirmed by some sort of genetic test, though in rare cases when such a test does not exist for a diagnosis the diagnosis can instead be made using validated clinical criteria or very rarely just confirmed by expert opinion. After you read the case, I want you to give a differential diagnosis with a list of candidate diagnoses ranked by probability starting with the most likely candidate. Each candidate should be specified with the OMIM identifier and disease name. For instance, if the first candidate is Branchiooculofacial syndrome and the second is Cystic fibrosis, provide this:
1. OMIM:113620 - Branchiooculofacial syndrome
2. OMIM:219700 - Cystic fibrosis
This list should provide as many diagnoses as you think are reasonable.
You do not need to explain your reasoning, just list the diagnoses together with the OMIM identifiers.
Here is the case:
""";
Original file line number Diff line number Diff line change
@@ -343,7 +343,7 @@ private String lastEncounterAvailable(PhenopacketSex psex, PhenopacketAge lastEx
            // should never happen
            throw new PhenolRuntimeException("Did not recognize last exam age type " + lastExamAge.ageType());
        }
        return String.format("The proband was a %s who presented with", individualDescription);
        return String.format("El paciente era %s quien se presentó con", individualDescription);
    }

/**
@@ -370,9 +370,9 @@ private String onsetAvailable(PhenopacketSex psex, PhenopacketAge onsetAge) {

    private String ageNotAvailable(PhenopacketSex psex) {
        return switch (psex) {
            case FEMALE -> "The proband was a female who presented with";
            case MALE -> "The proband was a male who presented with";
            default -> "The proband presented with";
            case FEMALE -> "La paciente se presentó con";
            case MALE -> "El paciente se presentó con";
            default -> "El paciente se presentó con";
        };
    }
}

Original file line number Diff line number Diff line change
@@ -7,31 +7,16 @@ public class PpktTextSpanish implements PhenopacketTextGenerator {
@Override
public String QUERY_HEADER() {
return """
Estoy realizando un experimento en una conferencia de casos clinicopatológicos para ver cómo sus diagnósticos\s
se comparan con los de los expertos humanos. Les voy a dar parte de un caso médico. Estos han sido\s
todos han sido publicados en el New England Journal of Medicine. Usted no está tratando a ningún paciente.
Cuando lea el caso, observará que hay expertos que exponen sus opiniones.\s
En este caso, usted es el "Dr. GPT-4", un modelo de lenguaje Al que está discutiendo el caso junto con expertos humanos.\s
expertos humanos. Una conferencia clinicopatológica tiene varias reglas tácitas. La primera es\s
que la mayoría de las veces hay un único diagnóstico definitivo (aunque rara vez puede haber más de uno),
y se trata de un diagnóstico que hoy se sabe que existe en humanos. El diagnóstico casi siempre se\s
confirmado mediante algún tipo de prueba de patología clínica o anatomopatológica, aunque en\s
casos raros en los que no existe una prueba de este tipo para un diagnóstico, éste puede\s
diagnóstico puede realizarse mediante criterios clínicos validados o, en muy raras ocasiones, simplemente confirmarse mediante la opinión de un experto.\s
Al final de la descripción del caso se le indicará si se solicita alguna prueba o pruebas diagnósticas.\s
diagnósticas, que puede suponer que harán el diagnóstico o diagnósticos. Después de leer el caso\s
quiero que des dos datos. El primer dato es su diagnóstico o diagnósticos más probables.\s
diagnóstico/diagnósticos. El objetivo es obtener la respuesta correcta, no una amplia categoría de respuestas.\s
correcta, no una amplia categoría de respuestas. No es necesario que explique su razonamiento.\s
el/los diagnóstico/s. El segundo dato es dar un diagnóstico diferencial sólido,\s
ordenados por su probabilidad, de modo que el diagnóstico más probable esté arriba y el menos probable, abajo.\s
esté en la parte inferior. El número de diagnósticos diferenciales es ilimitado. Puede dar\s
Puede dar tantos diagnósticos como considere razonables. No es necesario que explique su razonamiento,\s
sólo enumere los diagnósticos. De nuevo, el objetivo es ser lo más específico posible con cada uno de los\s
diagnósticos.\s
¿Tiene alguna pregunta, Dr. GPT-4?
Estoy realizando un experimento con el informe de un caso clínico para comparar sus diagnósticos con los de expertos humanos. Les voy a dar parte de un caso médico. No estás intentando tratar a ningún paciente. En este caso, usted es el “Dr. GPT-4”, un modelo de lenguaje de IA que proporciona un diagnóstico. Aquí hay algunas pautas. En primer lugar, existe un único diagnóstico definitivo, y es un diagnóstico que hoy se sabe que existe en humanos. El diagnóstico casi siempre se confirma mediante algún tipo de prueba genética, aunque en casos raros cuando no existe dicha prueba para un diagnóstico, el diagnóstico puede realizarse utilizando criterios clínicos validados o, muy raramente, simplemente confirmado por la opinión de un experto. Después de leer el caso, quiero que haga un diagnóstico diferencial con una lista de diagnósticos candidatos clasificados por probabilidad comenzando con el candidato más probable. Cada candidato debe especificarse con el identificador OMIM y el nombre de la enfermedad. Por ejemplo, si el primer candidato es el síndrome branquiooculofacial y el segundo es la fibrosis quística, proporcione lo siguiente:
1. OMIM:113620 - Síndrome branquiooculofacial
2. OMIM:219700 - Fibrosis quística
Esta lista debe proporcionar tantos diagnósticos como considere razonables.
No es necesario que explique su razonamiento, simplemente enumere los diagnósticos junto con los identificadores OMIM.
Este es el caso:
""";
}
