AJ-1517 download files only once #448

Merged: 25 commits merged into main on Jan 26, 2024
Conversation

@calypsomatic (Contributor) commented Jan 9, 2024

https://broadworkbench.atlassian.net/browse/AJ-1507
Although I couldn't find it explicitly stated in any documentation, it's strongly implied and empirically verified that Avro's ParquetReader simply cannot handle URLs with query parameters. So this PR still downloads each parquet file to a temporary file, but now does so ahead of time, so that each file is downloaded only once.
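For context, reading one of these downloaded files looks roughly like the sketch below (not this PR's exact code). parquet-avro's reader builder takes a Hadoop Path rather than an arbitrary URL, which is why the signed URLs, query parameters and all, have to be materialized to local files first; tempFile here is assumed to be the java.io.File produced by the download.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

// Sketch: read a parquet file that has already been downloaded locally.
try (ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(new Path(tempFile.getPath())).build()) {
  GenericRecord record;
  while ((record = reader.read()) != null) {
    // process each record
  }
}
```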

Todo:

  • Return a custom object instead of a directory path?
  • Validate with AppSec
  • Will there be problems downloading all parquet files at once?
  • Add retry for file downloads?
  • Verify/allowlist URLs?

Reminder:

PRs merged into main will not automatically generate a PR in https://github.com/broadinstitute/terra-helmfile to update the WDS image deployed to Kubernetes; this action must be triggered manually by running the following GitHub Action: https://github.com/DataBiosphere/terra-workspace-data-service/actions/workflows/tag.yml. Don't forget to provide a Jira ID when triggering the manual action; if no Jira ID is provided, the action will not fully succeed.

After you manually trigger the GitHub Action (and it completes with no errors), you must go to the terra-helmfile repo and verify that this generated a PR that merged successfully.

The terra-helmfile PR merge will then generate a PR in Leonardo. This will automerge if all tests pass, but if Jenkins tests fail it will not; be sure to watch it to ensure it merges. To trigger a Jenkins retest, simply comment on the PR with "jenkins retest".

@calypsomatic changed the title from "first pass one download" to "AJ-1517 download files only once" on Jan 9, 2024
// Azure urls, with SAS tokens, don't need any particular auth.
File tempFile = File.createTempFile("tdr-", "download");
logger.info("downloading to temp file {} ...", tempFile.getPath());
FileUtils.copyURLToFile(path, tempFile);
Contributor:

Coincidentally, I just completed server-side request forgery training, so I have a heightened awareness of downloading arbitrary URLs... one of the suggested attack vectors was failing to properly lock down which schemes are allowed in URLs. Do we know anything about the source location of the files being downloaded here that we can validate against?

Possible ideas (one possible shape is sketched below):

  • allowlist of domains where we expect these files to live
  • allowlist of schemes (e.g. http / https) that we expect these URLs to be composed from
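A minimal sketch of what that validation could look like; the allowed scheme and host values are illustrative placeholders, not anything specified in this PR.

```java
import java.net.URL;
import java.util.Set;

// Hypothetical allowlist check for download URLs; the allowed values
// below are placeholders, not confirmed for this codebase.
public class ImportUrlValidator {
  private static final Set<String> ALLOWED_SCHEMES = Set.of("https");
  private static final Set<String> ALLOWED_HOSTS =
      Set.of("example.blob.core.windows.net"); // placeholder storage host

  public static void validate(URL url) {
    if (!ALLOWED_SCHEMES.contains(url.getProtocol())) {
      throw new IllegalArgumentException("Disallowed URL scheme: " + url.getProtocol());
    }
    if (!ALLOWED_HOSTS.contains(url.getHost())) {
      throw new IllegalArgumentException("Disallowed URL host: " + url.getHost());
    }
  }
}
```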

try {
// download the file from the URL to a temp file on the local filesystem
// Azure urls, with SAS tokens, don't need any particular auth.
File tempFile = File.createTempFile("tdr-", "download");
Contributor:

How do the temp files get cleaned up? Might be worth adding a code comment explaining their lifecycle.

calypsomatic (Contributor, Author):

Good question! Seems like they get cleaned up on JVM exit, which is not useful for us. I guess I assumed they were garbage collected, but since they're not, we should delete them manually.

Collaborator:

Might want to use Files.createTempDirectory to create a dedicated subdirectory for this import; then you can delete the entire directory. Bonus: I think you can configure this dedicated subdirectory so that all the files you download are read-only and non-executable, which is good for security.
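A rough sketch of that suggestion, assuming a POSIX filesystem (the read-only permission set mirrors the one this PR ends up using):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// One owner-only directory per import; downloaded files are then made
// read-only and non-executable, and the whole directory can be deleted
// when the import finishes.
static Path createImportDir() throws IOException {
  return Files.createTempDirectory(
      "tdr-import-",
      PosixFilePermissions.asFileAttribute(
          PosixFilePermissions.fromString("rwx------")));
}

static void makeReadOnly(Path downloaded) throws IOException {
  Files.setPosixFilePermissions(
      downloaded,
      Set.of(
          PosixFilePermission.OWNER_READ,
          PosixFilePermission.GROUP_READ,
          PosixFilePermission.OTHERS_READ));
}
```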

calypsomatic (Contributor, Author):

Without a custom object for holding the files and directory information, it seems easier to leave out the dedicated subdirectory. Is there a big difference between setting permissions on each file and setting one on an entire directory?

FileSystem fileSystem = FileSystem.get(configuration);
FileStatus fileStatus = fileSystem.getFileStatus(hadoopFilePath);
if (fileStatus.getLen() == 0) {
logger.info("Empty file in parquet, skipping");
Contributor:

Would it be worth logging out the URL path here in the log statement so we know which file is considered empty?

(Assuming the path includes enough information to tell us the original filename)

Collaborator:

There may be security considerations with logging the full original URL, since it's a signed URL - we could log a sanitized version of it.
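One possible sanitization: keep scheme, host, and path (which identify the file) and drop the query string, where the SAS token lives. An illustrative sketch:

```java
import java.net.URL;

// Illustrative: strip the query string (and with it the SAS token)
// before logging.
static String sanitizeForLogging(URL url) {
  return url.getProtocol() + "://" + url.getHost() + url.getPath();
}

// e.g. logger.info("Empty file in parquet, skipping {}", sanitizeForLogging(path));
```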

davidangb (Collaborator) left a comment:

I'm disappointed (in avro-parquet/Hadoop, not you!) that we can't read these files directly over HTTP.

A few comments inline about temp dirs. I agree with Josh's comment about making sure these temp files get cleaned up.

// Azure urls, with SAS tokens, don't need any particular auth.
File tempFile = File.createTempFile("tdr-", "download");
logger.info("downloading to temp file {} ...", tempFile.getPath());
FileUtils.copyURLToFile(path, tempFile);
Collaborator:

We probably want a retry (with backoff) for these downloads, for resilience against temporary network conditions. This could be a separate/follow-on PR.
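A sketch of what such a retry could look like with Spring Retry's RetryTemplate; the attempt count and backoff values are illustrative assumptions, not settled choices.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import org.apache.commons.io.FileUtils;
import org.springframework.retry.support.RetryTemplate;

// Illustrative retry-with-backoff wrapped around the download call.
static void downloadWithRetry(URL path, File tempFile) throws IOException {
  RetryTemplate retry = RetryTemplate.builder()
      .maxAttempts(3)
      .exponentialBackoff(500, 2.0, 5_000) // initial ms, multiplier, max ms
      .build();
  retry.execute(ctx -> {
    FileUtils.copyURLToFile(path, tempFile);
    return null;
  });
}
```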


// check files for length and ignore any that are empty
if (tempFile.length() == 0) {
logger.info("Empty file in parquet, skipping");
Files.delete(tempFile.toPath());
calypsomatic (Contributor, Author):

If this empty temp file fails to delete, then as currently written it'll fail the import. Unsure how likely this is to happen, but if we want to avoid it I could:

  • Put the deletion in its own try/catch (sketched below)
  • Not delete the file, since it's empty anyway, and assume the accumulation of empty temp files will be small and not an issue
  • Check for emptiness before creating the temp file, which I believe means making an extra HTTP call to ask for the content length of the file located at the remote URL
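The first option might look roughly like this (an illustrative sketch against the excerpt above, using the same tempFile and logger):

```java
// Illustrative: tolerate a failed delete instead of failing the import.
if (tempFile.length() == 0) {
  logger.info("Empty file in parquet, skipping");
  try {
    Files.delete(tempFile.toPath());
  } catch (IOException e) {
    // log and move on; an orphaned empty temp file is harmless
    logger.warn("Failed to delete empty temp file {}", tempFile.getPath(), e);
  }
}
```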

@@ -106,16 +107,24 @@ protected void executeInternal(UUID jobId, JobExecutionContext context) {
List<TdrManifestImportTable> tdrManifestImportTables =
extractTableInfo(snapshotExportResponseModel);

// get all the parquet files from the manifests
// TODO do we really want to download them all at once ahead of time?
Multimap<String, File> fileMap = getFilesForImport(tdrManifestImportTables);
calypsomatic (Contributor, Author):

Not sure if there's a good way to avoid downloading all parquet files ahead of time, given that we're downloading them, but I don't know how large we expect them to be and if there might be any performance or memory issues with the pod.

@calypsomatic (Contributor, Author) commented:

Added custom download class in 7122290. Please give your opinion on whether this is better than without it. I moved the actual downloading of the files in there as well; that could then be a place to add in retries.

this.fileMap = HashMultimap.create();
}

public void downloadFileFromURL(String tableName, URL pathToRemoteFile) {
Contributor:

This class is juggling three different responsibilities (being a multimap data structure, dealing with temp directory shenanigans, and doing file downloads and permissions), so I'm thinking about ways you could break those responsibilities apart.

I have an idea to kick out the Multimap responsibility:

  1. Change downloadFileFromUrl to return Optional<File> from this class. The file would only be absent in the empty file scenario.

  2. Move all the Multimap creation and management out of this class and into the caller. This would let you eliminate the isEmpty() and get() methods, and leave this class with just two responsibilities (which maybe can be broken apart further after the dust settles).

Contributor:

And here's my followup idea to eliminate the temp directory shenanigans.

Make the caller supply Path fileDir as a constructor arg. Then this class just has the responsibility of downloading its stuff into that directory. This class doesn't even have to know it's temporary, or that everything will be deleted some day. Then the caller can decide when it's time to blow everything up.
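Putting the two suggestions together, the class might end up shaped something like this; a hypothetical sketch, not the code as merged:

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import org.apache.commons.io.FileUtils;

// Hypothetical shape: the caller owns the directory (and its eventual
// deletion) and the Multimap; this class only downloads files.
public class FileDownloader {
  private final Path fileDir; // supplied and lifecycle-managed by the caller

  public FileDownloader(Path fileDir) {
    this.fileDir = fileDir;
  }

  // Empty downloads are deleted and reported as absent, so the caller
  // decides what (not) to put in its Multimap.
  public Optional<File> downloadFileFromUrl(URL pathToRemoteFile) throws IOException {
    File tempFile = Files.createTempFile(fileDir, "tdr-", ".parquet").toFile();
    FileUtils.copyURLToFile(pathToRemoteFile, tempFile);
    if (tempFile.length() == 0) {
      Files.delete(tempFile.toPath());
      return Optional.empty();
    }
    return Optional.of(tempFile);
  }
}
```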

calypsomatic (Contributor, Author):

That almost makes me think it's not necessary to have the class at all... Originally the idea of the class was to hold both the multimap and the path to the directory, so that I could know where to find the files but also know which files were there, so as to easily iterate over them. Once the class existed, it seemed to make sense to give it some methods to deal with the file download. I agree it's kind of a mess as written, but if all it does is download files it's not a big win, plus I still have to deal with keeping track of a multimap and a path to a directory separately.

@@ -106,16 +102,24 @@ protected void executeInternal(UUID jobId, JobExecutionContext context) {
List<TdrManifestImportTable> tdrManifestImportTables =
extractTableInfo(snapshotExportResponseModel);

// get all the parquet files from the manifests

Contributor:

I think this class should be responsible for creating the temp directory and passing the Path to that directory down through getFilesForImport.

Then, later on, when this class wants to delete all the downloaded files, it can just delete the directory it already has a reference to.
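A sketch of that ownership, with recursive cleanup in a finally block (method and variable names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative: the job creates the directory, passes it down, and
// deletes it (deepest entries first) once the import is done.
static void importWithTempDir() throws IOException {
  Path importDir = Files.createTempDirectory("tdr-import-");
  try {
    // e.g. getFilesForImport(tdrManifestImportTables, importDir);
  } finally {
    try (Stream<Path> walk = Files.walk(importDir)) {
      walk.sorted(Comparator.reverseOrder())
          .forEach(p -> p.toFile().delete()); // children before parent
    }
  }
}
```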

@calypsomatic calypsomatic marked this pull request as ready for review January 19, 2024 20:41
private final DownloadHelper downloadHelper;

public interface DownloadHelper {
default void copyURLToFile(URL sourceUrl, File destinationFile) throws IOException {
calypsomatic (Contributor, Author):

Needed this method to not be abstract in order to mock it

Contributor:

Very weird, I wasn't expecting that

.when(mockDownloadHelper)
.copyURLToFile(any(URL.class), any(File.class));

// Create a RetryTemplate to set off Spring's retryable
calypsomatic (Contributor, Author):

Spring's @Retryable wasn't working, and among the several hideous options I could find to rig it to trigger, this one seemed the least convoluted.

Collaborator:

The fact that you need to include a RetryTemplate here is an indication that @Retryable is not working correctly in FileDownloadHelper. From this test, you should be able to just call helper.downloadFileFromURL() outside of a RetryTemplate and it will work.

I am pretty certain the reason that @Retryable is not working is that FileDownloadHelper is not a Spring bean, and therefore Spring can't set up any proxying for it and therefore can't implement retries. To use @Retryable you'll have to lean into Spring-ness. I can see a few options:

  1. FileDownloadHelper is a singleton bean (which is what we use everywhere else), and you move all statefulness about the temp dir into DownloadHelper
  2. FileDownloadHelper is a prototype bean which gets created on demand, including its temp-dir statefulness. You'll probably need a separate singleton factory bean to do the creation.

there are probably other ways to do it …

(I could pair on this / contribute code if my explanations don't make sense)
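For illustration, option 1 might look roughly like the sketch below, assuming spring-retry is on the classpath and @EnableRetry is set on a configuration class; this is not the code as merged.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import org.apache.commons.io.FileUtils;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Component;

// Illustrative singleton bean: because Spring manages the instance,
// the @Retryable proxy can actually intercept calls to this method.
@Component
public class FileDownloadHelper {
  @Retryable(maxAttempts = 3, backoff = @Backoff(delay = 500, multiplier = 2.0))
  public void downloadFileFromURL(URL sourceUrl, File destinationFile) throws IOException {
    FileUtils.copyURLToFile(sourceUrl, destinationFile);
  }
}
```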

yuliadub (Contributor) left a comment:

Looks like Sonar is still unhappy about us writing to a "public" dir, but otherwise the changes make sense to me!

Still sad there isn't a straightforward way to stream this; not sure if we want to create a ticket to revisit in a bit (6 months?) to see if anything changes.

this(dirName, FileUtils::copyURLToFile);
this(
dirName,
new DownloadHelper() {
@jladieu (Contributor) commented Jan 24, 2024:

I think this doesn't need the anonymous subclass & override anymore, since the default impl does the right thing. Just pass in new DownloadHelper().

@jladieu (Contributor) left a comment:

Bummer about @Retryable not doing the trick; I think that's worthy of a follow-up ticket to sort out, as it'd be really nice to have access to those annotation-based features. I think @ashanhol encountered problems with the @Observable annotation too.

EnumSet.of(
PosixFilePermission.OWNER_READ,
PosixFilePermission.GROUP_READ,
PosixFilePermission.OTHERS_READ);
Collaborator:

could this be locked down even further, to just OWNER_READ? Is there a need for group/others to also read?
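The stricter variant the comment suggests would be something like this (POSIX systems; the file argument is a placeholder):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;

// Illustrative: owner-read only; no group/others access at all.
static void lockDown(Path downloadedFile) throws IOException {
  Files.setPosixFilePermissions(downloadedFile, EnumSet.of(PosixFilePermission.OWNER_READ));
}
```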


@calypsomatic (Contributor, Author) commented:
Removed retry in 51e0682 and created https://broadworkbench.atlassian.net/browse/AJ-1555 to put it back and also take a look at possible memory issues.

Resource emptyParquet;

@Test
void downloadEmptyFile() throws IOException {
calypsomatic (Contributor, Author):

This test is a duplicate of the test in TdrManifestQuartzJobTest. This seems like the place for it, but should I alter the other test to verify the behavior of the actual TdrManifestQuartzJob with an empty file, or is this test sufficient?

Collaborator:

I think it's worth testing both TdrManifestQuartzJob.getFilesForImport() in TdrManifestQuartzJobTest and FileDownloadHelper.downloadFileFromURL() here, even though they're very similar.

davidangb (Collaborator) left a comment:

thaaaank you


sonarcloud bot commented Jan 25, 2024

@calypsomatic merged commit 499ad69 into main on Jan 26, 2024 (14 checks passed)
@calypsomatic deleted the aj-1517-direct-http branch on January 26, 2024 14:15