Use Anthropic Claude 3 Haiku to generate topics via AWS Bedrock #503

Open · wants to merge 5 commits into base: develop
120 changes: 65 additions & 55 deletions THIRD-PARTY-LICENSES.txt

Large diffs are not rendered by default.

@@ -6,7 +6,7 @@ token: 08932ab0c9ae39a880905666902f8659633ae0232e94ba9f3d2094cb928397e7

[s3]
bucketName: local-dockstore-metrics-data
-endpointOverride: http://localhost:4566
+endpointOverride: https://s3.localhost.localstack.cloud:4566

Is this spurious or does it relate to the ai topics somehow?

@denis-yuen (Member), Sep 27, 2024:


Think this was a side-effect of a necessary AWS SDK upgrade

Contributor (PR author):

I should've commented in this PR as well, but here's the reason why this changed: dockstore/dockstore#6003 (comment)


[athena]
workgroup: local-dockstore-metrics-workgroup
2 changes: 1 addition & 1 deletion pom.xml
@@ -38,7 +38,7 @@

<github.url>scm:git:git@github.com:dockstore/dockstore-support.git</github.url>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
-<dockstore-core.version>1.16.0-alpha.16</dockstore-core.version>
+<dockstore-core.version>1.16.0-beta.1</dockstore-core.version>
<maven-surefire.version>3.0.0-M5</maven-surefire.version>
<maven-failsafe.version>2.22.2</maven-failsafe.version>
<skipTests>false</skipTests>
34 changes: 27 additions & 7 deletions topicgenerator/README.md
@@ -1,22 +1,19 @@
# Topic Generator

-This is a Java program that generates topics for public Dockstore entries using OpenAI's gpt-3.5-turbo-16k AI model.
+This is a Java program that generates topics for public Dockstore entries using AI.

The [entries.csv](entries.csv) file contains the TRS ID and default versions of public Dockstore entries to generate topics for. The [results](results) directory contains the generated topics for those entries from running the topic generator.

## Setup

### Configuration file

-Create a configuration file like the following. A template `metrics-aggregator.config` file can be found [here](templates/topic-generator.config).
+Create a configuration file like the following. A template `topic-generator.config` file can be found [here](templates/topic-generator.config).

```
[dockstore]
server-url: <Dockstore server url>
token: <Dockstore admin or curator token>
-[ai]
-openai-api-key: <OpenAI API key>
```

**Required:**
@@ -26,7 +23,26 @@ openai-api-key: <OpenAI API key>
- `https://staging.dockstore.org/api`
- `https://dockstore.org/api`
- `token`: The Dockstore token of an admin or curator. This token is used to upload topics to the webservice.
- `openai-api-key`: The OpenAI API key required for using the OpenAI APIs. See https://platform.openai.com/docs/api-reference/authentication for more details. This is used to generate topics.

### Authentication to invoke AI models

#### AWS Bedrock

By default, the program uses AWS Bedrock to invoke the Anthropic Claude 3 Haiku model to generate topics.
AWS credentials with permission to use the AWS Bedrock API are required, and they must have access to the Anthropic Claude models on AWS.
Credentials can be provided in several ways; see the [default credential provider chain](https://docs.aws.amazon.com/sdk-for-java/latest/developer-guide/credentials.html#credentials-chain) documentation for details.
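
For example, one option from that chain is plain environment variables (a minimal sketch; the region value is an assumption and must be a region where your account has access to the Claude models):

```
export AWS_ACCESS_KEY_ID=<AWS access key ID>
export AWS_SECRET_ACCESS_KEY=<AWS secret access key>
export AWS_REGION=us-east-1
```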


#### OpenAI (deprecated)

We have moved away from using OpenAI models to generate topics, but if you wish to use one, add the following section to your configuration file.
See https://platform.openai.com/docs/api-reference/authentication for more details.

```
[ai]
openai-api-key: <OpenAI API key>
```

## Running the program

@@ -49,6 +65,10 @@ Usage: <main class> [options] [command] [command options]
name of the entries to generate topics for. The first line of the
file should contain the CSV fields: trsID,version
Default: ./entries.csv
-m, --model
The AI model to use
Default: CLAUDE_3_HAIKU
Possible Values: [CLAUDE_3_5_SONNET, CLAUDE_3_HAIKU, GPT_4O_MINI]
upload-topics Upload AI topics, generated by the generate-topics
command, for public Dockstore entries.
@@ -59,7 +79,7 @@ Usage: <main class> [options] [command] [command options]
of the entries to upload topics for. The first line of the file
should contain the CSV fields: trsId,aiTopic. The output file
generated by the generate-topics command can be used as the
argument.
```
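
For example, generating topics with the default Claude 3 Haiku model might look like this (a sketch; the jar name is an assumption):

```
java -jar topicgenerator-*.jar generate-topics -e ./entries.csv -m CLAUDE_3_HAIKU
```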

### generate-topics
15 changes: 14 additions & 1 deletion topicgenerator/pom.xml
@@ -112,7 +112,7 @@
<dependency>
<groupId>com.knuddels</groupId>
<artifactId>jtokkit</artifactId>
-<version>1.0.0</version>
+<version>1.1.0</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
@@ -148,6 +148,19 @@
<groupId>org.apache.commons</groupId>
<artifactId>commons-csv</artifactId>
</dependency>
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>bedrockruntime</artifactId>
</dependency>
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>auth</artifactId>
</dependency>
<dependency>
<groupId>software.amazon.awssdk</groupId>
<artifactId>sdk-core</artifactId>
</dependency>

<dependency>
<groupId>io.dockstore</groupId>
<artifactId>dockstore-webservice</artifactId>
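The new `bedrockruntime`, `auth`, and `sdk-core` dependencies enable invoking Claude through the Bedrock runtime API. A minimal sketch of such an invocation with the AWS SDK for Java v2 (the region, prompt, and `max_tokens` value are assumptions, and the PR's actual code may differ; the model ID and `anthropic_version` follow Anthropic's Messages API format on Bedrock):

```java
import software.amazon.awssdk.core.SdkBytes;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.bedrockruntime.BedrockRuntimeClient;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelRequest;
import software.amazon.awssdk.services.bedrockruntime.model.InvokeModelResponse;

public class BedrockTopicSketch {
    public static void main(String[] args) {
        // Credentials and region are resolved via the default provider chain.
        try (BedrockRuntimeClient client = BedrockRuntimeClient.builder()
                .region(Region.US_EAST_1) // assumption: a region with Claude 3 Haiku access
                .build()) {
            // Request body follows Anthropic's Messages API format on Bedrock.
            String body = """
                {
                  "anthropic_version": "bedrock-2023-05-31",
                  "max_tokens": 256,
                  "messages": [{"role": "user",
                                "content": "Summarize this workflow descriptor in one sentence."}]
                }""";
            InvokeModelResponse response = client.invokeModel(InvokeModelRequest.builder()
                    .modelId("anthropic.claude-3-haiku-20240307-v1:0")
                    .body(SdkBytes.fromUtf8String(body))
                    .build());
            // The response body is JSON containing the model's completion.
            System.out.println(response.body().asUtf8String());
        }
    }
}
```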
@@ -0,0 +1,24 @@
trsId,version,descriptorUrl,descriptorChecksum,isTruncated,promptTokens,completionTokens,cost,finishReason,aiTopic
"#workflow/github.com/iwc-workflows/sars-cov-2-pe-illumina-artic-variant-calling/COVID-19-PE-ARTIC-ILLUMINA",main,https://raw.githubusercontent.com/iwc-workflows/sars-cov-2-pe-illumina-artic-variant-calling/main//pe-artic-variation.ga,dcc2761eb35156d7d09479112daf089439774fc29938f02cb6ee8cda87906758,false,24595,43,0.0062025,end_turn,"Trim ARTIC primer sequences, realign reads, call and filter variants, annotate variants, and apply a strand-bias soft filter to the final annotated variants."
Contributor:

How come some of the verbs have s and some don't? Trim here, but Filters on the next line? I guess it depends on the content?

I'd like it if they were all the same, but that's probably just me and probably isn't worth the effort and/or may not even make sense.

Reply:

Could be easy to fix by adding a description of the tense to the prompt, à la "starts with a first-person present tense verb".

"#workflow/github.com/iwc-workflows/sars-cov-2-variation-reporting/COVID-19-VARIATION-REPORTING",main,https://raw.githubusercontent.com/iwc-workflows/sars-cov-2-variation-reporting/main//variation-reporting.ga,2dce46106d669248d5858b56269c9cbc26057acfdb121f693dd6321d0350105c,false,24503,48,0.00618575,end_turn,"Filters and extracts variants from a VCF dataset, generates tabular lists of variants by Samples and by Variant, and creates an overview plot of variants and their allele-frequencies."
"#workflow/github.com/iwc-workflows/sars-cov-2-ont-artic-variant-calling/COVID-19-ARTIC-ONT",main,https://raw.githubusercontent.com/iwc-workflows/sars-cov-2-ont-artic-variant-calling/main//ont-artic-variation.ga,03c9318d50df9a3c2d725da77f9a175ce7eb78e264cc4a40fa17d48a99dd124b,false,20290,57,0.00514375,end_turn,"Perform read filtering, mapping, primer trimming, variant calling, and annotation on ONT-sequenced ARTIC data using tools like fastp, minimap2, ivar, medaka, and SnpEff."
"#workflow/github.com/iwc-workflows/sars-cov-2-pe-illumina-wgs-variant-calling/COVID-19-PE-WGS-ILLUMINA",main,https://raw.githubusercontent.com/iwc-workflows/sars-cov-2-pe-illumina-wgs-variant-calling/main//pe-wgs-variation.ga,f2357986dd72af73efb2b8110f0f08f6f715017518498edda61d50d23b2e560f,false,11352,48,0.0028980000000000004,end_turn,"Perform paired-end read mapping with bwa-mem, deduplicate and realign the reads, and then call and annotate variants using lofreq and SnpEff."
"#workflow/github.com/iwc-workflows/sars-cov-2-se-illumina-wgs-variant-calling/COVID-19-SE-WGS-ILLUMINA",main,https://raw.githubusercontent.com/iwc-workflows/sars-cov-2-se-illumina-wgs-variant-calling/main//se-wgs-variation.ga,89993f9570eebd53983cdf4f8e4e0f44631f536e855b2e0741f4a5b907858e8d,false,9521,57,0.0024515000000000006,end_turn,"Perform single-end read mapping with Bowtie2, mark duplicates with Picard, realign reads with LoFreq, call variants with LoFreq, and annotate variants with SnpEff."
"#workflow/github.com/iwc-workflows/sars-cov-2-consensus-from-variation/COVID-19-CONSENSUS-CONSTRUCTION",main,https://raw.githubusercontent.com/iwc-workflows/sars-cov-2-consensus-from-variation/main//consensus-from-variation.ga,cb8acce2a1b0d059b117ae3737307be82e81807dad3efcbfe6c8f6a09cbd2798,false,15395,45,0.003905,end_turn,"Build a consensus sequence from FILTER PASS variants with intrasample allele-frequency above a configurable consensus threshold, hard-mask regions with low coverage, and ambiguous sites."
"#workflow/github.com/iwc-workflows/sars-cov-2-pe-illumina-artic-ivar-analysis/SARS-COV-2-ILLUMINA-AMPLICON-IVAR-PANGOLIN-NEXTCLADE",main,https://raw.githubusercontent.com/iwc-workflows/sars-cov-2-pe-illumina-artic-ivar-analysis/main//pe-wgs-ivar-analysis.ga,11e9e375b01f1a5b33fb64b184798bc6e46181201a255cb23147a7f021007864,false,13894,49,0.00353475,end_turn,"Find and annotate variants in ampliconic SARS-CoV-2 Illumina sequencing data, classify samples with pangolin and nextclade, and generate a quality control report."
"#workflow/github.com/iwc-workflows/parallel-accession-download/main",main,https://raw.githubusercontent.com/iwc-workflows/parallel-accession-download/main//parallel-accession-download.ga,3e9aee6218674651a981b3437b9ca8ec294ff492984b153db0526989e716e674,false,3629,39,9.56E-4,end_turn,"Downloads fastq files for sequencing run accessions provided in a text file using fasterq-dump, creating one job per listed run accession."
"#workflow/github.com/nf-core/rnaseq",1.4.2,https://raw.githubusercontent.com/nf-core/rnaseq/1.4.2//main.nf,0d0db0adf13e907e33e44b1357486cf57747ad78c3a5e31f187919d844530fe9,false,24811,32,0.00624275,end_turn,"Trim the raw reads, align them to the reference genome, perform quality control analysis, and quantify gene expression."
"#workflow/github.com/nf-core/vipr",master,https://raw.githubusercontent.com/nf-core/vipr/master//main.nf,a8b50e8afa5730e6ede16e4588cec2f03c3ae881385daed22edb3e8376af5793,false,4734,78,0.001281,end_turn,"Execute the ViPR workflow, which performs viral amplicon/enrichment analysis and intrahost variant calling, starting with trimming and combining read pairs, followed by decontamination, metagenomics classification, assembly, polishing, mapping, variant calling, coverage computation, and finally plotting and preparing the final reference sequence."
"#workflow/github.com/nf-core/methylseq",1.4,https://raw.githubusercontent.com/nf-core/methylseq/1.4//main.nf,5123fc239af84cd0357ddfe1ca1f582d1f7412f08ecaa9bb2b8f19160bd75cfb,false,16091,46,0.0040802500000000005,end_turn,"Runs the nf-core/methylseq pipeline, which performs alignment, deduplication, methylation extraction, and quality control analysis on bisulfite-sequencing data."
"#workflow/github.com/sevenbridges-openworkflows/Broad-Best-Practice-Data-pre-processing-CWL1.0-workflow-GATK-4.1.0.0/GATK_4_1_0_0_data_pre_processing_workflow",master,https://raw.githubusercontent.com/sevenbridges-openworkflows/Broad-Best-Practice-Data-pre-processing-CWL1.0-workflow-GATK-4.1.0.0/master//broad-best-practice-data-pre-processing-workflow-4-1-0-0_decomposed.cwl,2f87eaf01d47acf0d70b41609d6faba135e11db5ca337f99ae18c614996e387e,false,7682,35,0.0019642500000000003,end_turn,"Perform data pre-processing by aligning to a reference genome, cleaning up the data, and preparing it for variant calling analysis."
"#workflow/github.com/DataBiosphere/topmed-workflows/UM_variant_caller_wdl",1.32.0,https://raw.githubusercontent.com/DataBiosphere/topmed-workflows/1.32.0/variant-caller/variant-caller-wdl/topmed_freeze3_calling.wdl,03daf00da2f90efd50368bd392f1c32f93b8f48e968dcc573598c9432df5ba21,false,18179,66,0.004627249999999999,end_turn,"Execute the variant caller workflow by creating symlinks for CRAM and CRAI files, configuring the reference files, running the variant detection and merging steps, and optionally performing variant filtering using pedigree information, and finally compressing the output directories into a tar.gz file."
"#workflow/github.com/DataBiosphere/analysis_pipeline_WDL/vcf-to-gds-wdl",v7.1.1,https://raw.githubusercontent.com/DataBiosphere/analysis_pipeline_WDL/v7.1.1/vcf-to-gds/vcf-to-gds.wdl,e92888af471f11793f2c5c4fa7f8da825dfb51603aa6b7f6ecb8d51c9a815fd7,false,3195,41,8.5E-4,end_turn,"Converts VCF files to GDS files, assigns unique variant IDs, and optionally checks the GDS files against the original VCF files."
"#workflow/github.com/DataBiosphere/analysis_pipeline_WDL/ld-pruning-wdl",v7.1.1,https://raw.githubusercontent.com/DataBiosphere/analysis_pipeline_WDL/v7.1.1/ld-pruning/ld-pruning.wdl,eabd52bf3f223c58ad220a418f87f2ec23d2dd91958b61f12066144a8a84a444,false,4460,41,0.0011662500000000002,end_turn,"Calculates linkage disequilibrium, subsets GDS files, merges the subsetted files, and checks the merged file against the inputs."
"#workflow/github.com/AnalysisCommons/genesis_wdl/genesis_GWAS",v1_5,https://raw.githubusercontent.com/AnalysisCommons/genesis_wdl/v1_5//genesis_GWAS.wdl,9c0c50df22bb95869dcc66d77693ecaf702b980878f38feca3ebd75e00eb9fbe,false,4265,43,0.00112,end_turn,"Execute the null model generation, association testing, and summarization tasks to perform a genome-wide association study (GWAS) using the GENESIS biostatistical package."
"#workflow/github.com/aofarrel/covstats-wdl",master,https://raw.githubusercontent.com/aofarrel/covstats-wdl/master/covstats/covstats.wdl,49dec26c695bbb88e0e1198e04c6857f497c80b8b59f8bd86377eaf76ee74a4a,false,2532,42,6.855E-4,end_turn,"Perform read length and coverage analysis on input BAM/CRAM files, generate a report summarizing the results, and handle various file types and runtime configurations."
"#workflow/github.com/broadinstitute/warp/Optimus",aa-PD2413,https://raw.githubusercontent.com/broadinstitute/warp/aa-PD2413/pipelines/skylab/optimus/Optimus.wdl,fb1b9fbd4be73e7210dec444d446c7405afcbcb11f9030391b5e63dd9defe4b6,false,3305,66,9.0875E-4,end_turn,"Imports necessary WDL workflows, defines input parameters, performs input checks, splits FASTQ files, aligns reads, merges BAM files, calculates gene and cell metrics, generates sparse count matrix, runs EmptyDrops, and produces an H5AD output file."
Member:

"Imports necessary WDL workflows"
Amusing but I guess it's not wrong

"#workflow/github.com/theiagen/terra_utilities/Concatenate_Column_Content",v1.4.1,https://raw.githubusercontent.com/theiagen/terra_utilities/v1.4.1/workflows/wf_cat_column.wdl,e080a455cb741c152c056e55af55cb42b2b3b46e24cb7185c2e1a6f1b74389bc,false,225,33,9.750000000000001E-5,end_turn,"Import task files, concatenate column content, capture versioning, and output the concatenated files and versioning details."
"#workflow/github.com/gatk-workflows/seq-format-conversion/BAM-to-Unmapped-BAM",3.0.0,https://raw.githubusercontent.com/gatk-workflows/seq-format-conversion/3.0.0//bam-to-unmapped-bams.wdl,d75de73b26fd49d71a29e4709f77c4decb2b51209d169c1b3fa6fa427f53dd04,false,1266,43,3.7025E-4,end_turn,"Converts a BAM file into unmapped BAMs by reverting the BAM, sorting the unmapped BAMs, and outputting the sorted unmapped BAMs."
"#notebook/github.com/denis-yuen/test-notebooks/ibm-tax-maps",0.2,https://raw.githubusercontent.com/denis-yuen/test-notebooks/0.2/ibm-et/jupyter-samples/tax-maps/Interactive_Data_Maps.ipynb,7ffefdbf8c4ab6333b9ad78c6b811365365e2ae433bca2acf261cc1209c59027,false,107540,42,0.026937500000000003,end_turn,"This notebook analyzes state tax data from the US Census Bureau, creates interactive maps to visualize the data, and provides insights into the tax revenue collected by different states."
quay.io/pancancer/pcawg-sanger-cgp-workflow,2.1.0,https://raw.githubusercontent.com/ICGC-TCGA-PanCancer/CGP-Somatic-Docker/2.1.0//Dockstore.cwl,5eb9e182fc2e313606a9445604a4b47e0546ed17a641448da12fcc0371ead3d8,false,2573,57,7.145000000000001E-4,end_turn,Execute the Seqware-Sanger-Somatic-Workflow command-line tool to perform somatic variant calling on tumor and normal whole-genome sequencing data using the PCAWG Sanger variant calling workflow.
github.com/dockstore/dockstore-tool-bamstats/bamstats_sort_cwl,1.25-9,https://raw.githubusercontent.com/dockstore/dockstore-tool-bamstats/1.25-9//bamstats_sort.cwl,1fd7d8637cb91e031a6db604898f5728e51f4591bf4acfbb75ecefd0bfda3448,false,426,37,1.5275E-4,end_turn,Utilize the commandlinetool to execute the sort command within a Docker container and generate a sorted file based on the specified key positions.