Use transcription service for transcripting/translation of audio/video files #239

Merged
marjisound merged 11 commits into main from use-transcription-service
Oct 9, 2024

Conversation

marjisound

@marjisound marjisound commented Oct 2, 2024

Paired on this with @philmcmahon

What does this change?

This PR integrates the transcription service into Giant.

  • Adds ExternalTranscriptionExtractor, which sends a transcription message to the transcription service task (input) queue
  • Adds ExternalWorkerScheduler, which runs at intervals to check whether there are any transcription output messages
  • Adds ExternalTranscriptionWorker, which retrieves messages from the Giant transcription output queue
    • Success message:
      • updates the Elasticsearch index with the resulting transcript
      • updates the neo4j relationship between the blob and extractor to processed
      • deletes the message
    • Failure message (retried 3 times; if all attempts fail):
      • updates the neo4j relationship between the blob and extractor to failure
      • doesn't delete the message, because it will be moved to the dead letter queue
  • Creates a signed download URL (for downloading the audio/video file) before sending the message to the transcription service task queue
  • Creates signed upload URLs (for uploading the transcript output) before sending the message to the transcription service task queue
  • Adds a new relationship between blob and extractor, PROCESSING_EXTERNALLY, which applies from when the message is sent to the external transcription service until the transcript output is ready and the output message has been delivered to the output queue
  • Handles translation if the audio/video is not in English
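
The lifecycle of the blob/extractor relationship described in the list above can be sketched as a small state machine. This is only an illustration, not the code in this PR: the object, type and method names are hypothetical, and the threshold of 3 receives is taken from the "retries 3 times" wording in the description.

```scala
object ExtractionLifecycleSketch {
  // Relationship states between a blob and the external extractor (names are illustrative)
  sealed trait Relationship
  case object ProcessingExternally extends Relationship
  case object Processed extends Relationship
  case object Failure extends Relationship

  // Messages read from the transcription output queue
  sealed trait OutputMessage
  case object Success extends OutputMessage
  final case class Failed(receiveCount: Int) extends OutputMessage

  // Assumption: a message counts as failed for good once it has been received 3 times
  val MaxReceives = 3

  def next(current: Relationship, message: OutputMessage): Relationship =
    (current, message) match {
      case (ProcessingExternally, Success)                       => Processed
      case (ProcessingExternally, Failed(n)) if n >= MaxReceives => Failure
      case (ProcessingExternally, Failed(_))                     => ProcessingExternally // retried later
      case (settled, _)                                          => settled // terminal states do not change
    }
}
```

Keeping the transition pure makes it straightforward to unit test without neo4j or SQS.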

The following SSM parameters were created for playground but should also be created for pfi-giant (prod):

  • /pfi/pfi-playground/rex/transcribe/transcriptionServiceQueueUrl
  • /pfi/pfi-playground/rex/transcribe/transcriptionOutputQueueUrl
  • /pfi/pfi-playground/rex/transcribe/transcriptionOutputDeadLetterQueueUrl

TODO in upcoming PR

  • zipping & unzipping the transcripts file rather than handling 3 file formats separately

How to test

Tested locally and in code

The relevant PRs for this change, in the order they need to be released, are as follows:
1- https://github.com/guardian/investigations-platform/pull/521
2- guardian/transcription-service#103
3- Current PR

* status: z.literal('SUCCESS'),
* languageCode: z.string(),
* outputBucketKeys: OutputBucketKeys,
*/

this could probably be removed - maybe replace with a comment linking to the relevant types.ts file in the transcription-service repo
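
For reference, a Scala-side type mirroring the three fields shown in the comment block above might look roughly like this. It is a sketch under assumptions: the names are hypothetical, and the contents of OutputBucketKeys are not visible in this thread, so it is modelled here as an opaque map rather than its real fields (those live in the transcription-service types.ts).

```scala
object TranscriptionOutputSketch {
  // Placeholder only: the real field names are defined in the transcription-service repo
  final case class OutputBucketKeys(keysByFormat: Map[String, String])

  // Mirrors the commented shape: status 'SUCCESS', languageCode, outputBucketKeys
  final case class TranscriptionOutputSuccess(
    languageCode: String,
    outputBucketKeys: OutputBucketKeys
  ) {
    val status: String = "SUCCESS"
  }
}
```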

if (completed > 0) {
go()
} else {
println(s"try again ExternalWorkerScheduler in ${interval}")

either remove or change to log


def go(): Unit = {
try {
println("running ExternalWorkerScheduler")

rogue println
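
One way to address both println comments is to route the messages through a logger and keep the "run again now or wait" decision in a small pure function. A minimal sketch, assuming java.util.logging for self-containment (the codebase's own logger would be used in practice); `SchedulerSketch` and `nextDelay` are hypothetical names:

```scala
import java.util.logging.Logger
import scala.concurrent.duration._

object SchedulerSketch {
  private val logger = Logger.getLogger("ExternalWorkerScheduler")

  // If the last run completed some work, poll again immediately; otherwise wait for the interval
  def nextDelay(completed: Int, interval: FiniteDuration): FiniteDuration =
    if (completed > 0) {
      Duration.Zero
    } else {
      logger.fine(s"No external output processed, trying again in $interval")
      interval
    }
}
```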

"uri", uri,
"extractorName", extractorName,
)
)

congratulations on writing all this Cypher!

@@ -83,6 +85,7 @@ class AppComponents(context: Context, config: Config)
// data storage services
val ingestStorage = S3IngestStorage(s3Client, config.s3.buckets.ingestion, config.s3.buckets.deadLetter).valueOr(failure => throw new Exception(failure.msg))
val blobStorage = S3ObjectStorage(s3Client, config.s3.buckets.collections).valueOr(failure => throw new Exception(failure.msg))
val transcriptionStorage = S3ObjectStorage(s3Client, config.s3.buckets.transcription).valueOr(failure => throw new Exception(failure.msg))

could this be transcriptStorage or transcriptionOutputStorage - just to make clear that it should only have transcripts in it, not source media

@philmcmahon

This is looking great - just a few minor comments above

logger.error(s"failed to process sqs message", failure.toThrowable)
if (messageAttributes.receiveCount > 2) {
markAsFailure(new Uri(messageAttributes.messageGroupId), "ExternalTranscriptionExtractor", failure.msg)
}

I think this handler needs updating to deal with the case where the message itself was processed successfully but it was a failure message


@marjisound marjisound Oct 8, 2024


Good spot 👏 I updated the failure cases to handle the error differently when the message is a failure one. In the failure message scenario, the following now happens:

  • move message to dead letter queue
  • delete message from output queue
  • mark blob/extractor relationship as failure
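
The revised handling can be sketched as a function from outcome to actions. This is an illustration only: the names are hypothetical, and the processing-error branch assumes the `receiveCount > 2` check from the snippet under review still applies to errors raised while handling a message.

```scala
object OutputQueueHandlingSketch {
  sealed trait Action
  case object MoveToDeadLetterQueue extends Action
  case object DeleteFromOutputQueue extends Action
  case object MarkAsFailure extends Action
  case object LeaveForRetry extends Action

  sealed trait Outcome
  // The transcription service reported that transcription failed
  case object FailureMessageReceived extends Outcome
  // Giant itself failed while handling a message (assumption: retried via SQS redelivery)
  final case class ProcessingError(receiveCount: Int) extends Outcome

  def actionsFor(outcome: Outcome): List[Action] = outcome match {
    case FailureMessageReceived =>
      List(MoveToDeadLetterQueue, DeleteFromOutputQueue, MarkAsFailure)
    case ProcessingError(receiveCount) if receiveCount > 2 =>
      List(MarkAsFailure)
    case ProcessingError(_) =>
      List(LeaveForRetry)
  }
}
```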

@marjisound marjisound marked this pull request as ready for review October 8, 2024 17:45
@marjisound marjisound requested a review from a team as a code owner October 8, 2024 17:45
@marjisound marjisound changed the title WIP - Use transcription service Use transcription service for transcripting/translation audio/video files Oct 8, 2024
@marjisound marjisound changed the title Use transcription service for transcripting/translation audio/video files Use transcription service for transcripting/translation of audio/video files Oct 8, 2024

@philmcmahon philmcmahon left a comment


Fantastic work! Very excited to see this go into giant and try it out!

@marjisound marjisound merged commit f9aabcb into main Oct 9, 2024
4 checks passed
@marjisound marjisound deleted the use-transcription-service branch October 9, 2024 10:31