AJ-1517 download files only once #448

Merged: 25 commits merged into main on Jan 26, 2024
Conversation

@calypsomatic (Contributor) commented Jan 9, 2024

https://broadworkbench.atlassian.net/browse/AJ-1507
Although I couldn't find it explicitly stated in any documentation, it's strongly implied and empirically verified that Avro's ParquetReader simply cannot handle URLs with query parameters. So this PR still downloads each parquet file to a temporary file, but now does so ahead of time, so that each file is downloaded only once.
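For context, reading one of these downloaded files looks roughly like the sketch below (not this PR's exact code). parquet-avro's reader builder takes a Hadoop Path rather than an arbitrary URL, which is why the signed URLs, query parameters and all, have to be materialized to local files first; tempFile here is assumed to be the java.io.File produced by the download.

```java
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.hadoop.ParquetReader;

// Sketch: read a parquet file that has already been downloaded locally.
try (ParquetReader<GenericRecord> reader =
    AvroParquetReader.<GenericRecord>builder(new Path(tempFile.getPath())).build()) {
  GenericRecord record;
  while ((record = reader.read()) != null) {
    // process each record
  }
}
```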

Todo:

  • Return a custom object instead of a directory path?
  • Validate with AppSec
  • Will there be problems downloading all parquet files at once?
  • Add retry for file downloads?
  • Verify/allowlist URLs?

Reminder:

PRs merged into main will not automatically generate a PR in https://github.com/broadinstitute/terra-helmfile to update the WDS image deployed to Kubernetes; this action must be triggered manually by running the following GitHub Action: https://github.com/DataBiosphere/terra-workspace-data-service/actions/workflows/tag.yml. Don't forget to provide a Jira ID when triggering the manual action; if no Jira ID is provided, the action will not fully succeed.

After you manually trigger the GitHub Action (and it completes with no errors), you must go to the terra-helmfile repo and verify that this generated a PR that merged successfully.

The terra-helmfile PR merge will then generate a PR in Leonardo. This will automerge if all tests pass, but if Jenkins tests fail it will not; be sure to watch it to ensure it merges. To trigger a Jenkins retest, simply comment on the PR with "jenkins retest".

@calypsomatic changed the title from "first pass one download" to "AJ-1517 download files only once" on Jan 9, 2024
// Azure urls, with SAS tokens, don't need any particular auth.
File tempFile = File.createTempFile("tdr-", "download");
logger.info("downloading to temp file {} ...", tempFile.getPath());
FileUtils.copyURLToFile(path, tempFile);
Contributor:

Coincidentally, I just completed server-side request forgery training, so I have a heightened awareness of downloading arbitrary URLs... one of the suggested attack vectors was failing to properly lock down which schemes are allowed in URLs. Do we know anything about the source location of the files being downloaded here that we can validate against?

Possible ideas (one possible shape is sketched below):

  • allowlist of domains where we expect these files to live
  • allowlist of schemes (e.g. http / https) that we expect these URLs to be composed from
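A minimal sketch of what that validation could look like; the allowed scheme and host values are illustrative placeholders, not anything specified in this PR.

```java
import java.net.URL;
import java.util.Set;

// Hypothetical allowlist check for download URLs; the allowed values
// below are placeholders, not confirmed for this codebase.
public class ImportUrlValidator {
  private static final Set<String> ALLOWED_SCHEMES = Set.of("https");
  private static final Set<String> ALLOWED_HOSTS =
      Set.of("example.blob.core.windows.net"); // placeholder storage host

  public static void validate(URL url) {
    if (!ALLOWED_SCHEMES.contains(url.getProtocol())) {
      throw new IllegalArgumentException("Disallowed URL scheme: " + url.getProtocol());
    }
    if (!ALLOWED_HOSTS.contains(url.getHost())) {
      throw new IllegalArgumentException("Disallowed URL host: " + url.getHost());
    }
  }
}
```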

try {
// download the file from the URL to a temp file on the local filesystem
// Azure urls, with SAS tokens, don't need any particular auth.
File tempFile = File.createTempFile("tdr-", "download");
Contributor:

How do the temp files get cleaned up? Might be worth adding a code comment explaining their lifecycle.

calypsomatic (Contributor, Author):

Good question! Seems like they get cleaned up on JVM exit, which is not useful for us. I guess I assumed they were garbage collected, but since they're not, we should delete them manually.

Collaborator:

Might want to use Files.createTempDirectory to create a dedicated subdirectory for this import; then you can delete the entire directory. Bonus: I think you can configure this dedicated subdirectory so that all the files you download are read-only and non-executable, which is good for security.
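A rough sketch of that suggestion, assuming a POSIX filesystem (the read-only permission set mirrors the one this PR ends up using):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// One owner-only directory per import; downloaded files are then made
// read-only and non-executable, and the whole directory can be deleted
// when the import finishes.
static Path createImportDir() throws IOException {
  return Files.createTempDirectory(
      "tdr-import-",
      PosixFilePermissions.asFileAttribute(
          PosixFilePermissions.fromString("rwx------")));
}

static void makeReadOnly(Path downloaded) throws IOException {
  Files.setPosixFilePermissions(
      downloaded,
      Set.of(
          PosixFilePermission.OWNER_READ,
          PosixFilePermission.GROUP_READ,
          PosixFilePermission.OTHERS_READ));
}
```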

calypsomatic (Contributor, Author):

Without a custom object for holding the files and directory information, it seems easier to leave out the dedicated subdirectory. Is there a big difference between setting permissions on each file and setting one on an entire directory?

FileSystem fileSystem = FileSystem.get(configuration);
FileStatus fileStatus = fileSystem.getFileStatus(hadoopFilePath);
if (fileStatus.getLen() == 0) {
logger.info("Empty file in parquet, skipping");
Contributor:

Would it be worth logging out the URL path here in the log statement so we know which file is considered empty?

(Assuming the path includes enough information to tell us the original filename)

Collaborator:

There may be security considerations with logging the full original URL, since it's a signed URL - we could log a sanitized version of it.
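One possible sanitization: keep scheme, host, and path (which identify the file) and drop the query string, where the SAS token lives. An illustrative sketch:

```java
import java.net.URL;

// Illustrative: strip the query string (and with it the SAS token)
// before logging.
static String sanitizeForLogging(URL url) {
  return url.getProtocol() + "://" + url.getHost() + url.getPath();
}

// e.g. logger.info("Empty file in parquet, skipping {}", sanitizeForLogging(path));
```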

davidangb (Collaborator) left a comment:

I'm disappointed (in avro-parquet/Hadoop, not you!) that we can't read these files directly over HTTP.

A few comments inline about temp dirs. I agree with Josh's comment about making sure these temp files get cleaned up.

// Azure urls, with SAS tokens, don't need any particular auth.
File tempFile = File.createTempFile("tdr-", "download");
logger.info("downloading to temp file {} ...", tempFile.getPath());
FileUtils.copyURLToFile(path, tempFile);
Collaborator:

We probably want a retry (with backoff) for these downloads, for resilience against temporary network conditions. This could be a separate/follow-on PR.
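A sketch of what such a retry could look like with Spring Retry's RetryTemplate; the attempt count and backoff values are illustrative assumptions, not settled choices.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import org.apache.commons.io.FileUtils;
import org.springframework.retry.support.RetryTemplate;

// Illustrative retry-with-backoff wrapped around the download call.
static void downloadWithRetry(URL path, File tempFile) throws IOException {
  RetryTemplate retry = RetryTemplate.builder()
      .maxAttempts(3)
      .exponentialBackoff(500, 2.0, 5_000) // initial ms, multiplier, max ms
      .build();
  retry.execute(ctx -> {
    FileUtils.copyURLToFile(path, tempFile);
    return null;
  });
}
```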


// check files for length and ignore any that are empty
if (tempFile.length() == 0) {
logger.info("Empty file in parquet, skipping");
Files.delete(tempFile.toPath());
calypsomatic (Contributor, Author):

If this empty temp file fails to delete, then as currently written it'll fail the import. Unsure how likely this is to happen, but if we want to avoid it I could:

  • Put the deletion in its own try/catch (sketched below)
  • Not delete the file, since it's empty anyway, and assume the accumulation of empty temp files will be small and not an issue
  • Check for emptiness before creating the temp file, which I believe means making an extra HTTP call to ask for the content length of the file located at the remote URL
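The first option might look roughly like this (an illustrative sketch against the excerpt above, using the same tempFile and logger):

```java
// Illustrative: tolerate a failed delete instead of failing the import.
if (tempFile.length() == 0) {
  logger.info("Empty file in parquet, skipping");
  try {
    Files.delete(tempFile.toPath());
  } catch (IOException e) {
    // log and move on; an orphaned empty temp file is harmless
    logger.warn("Failed to delete empty temp file {}", tempFile.getPath(), e);
  }
}
```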

@@ -106,16 +107,24 @@ protected void executeInternal(UUID jobId, JobExecutionContext context) {
List<TdrManifestImportTable> tdrManifestImportTables =
extractTableInfo(snapshotExportResponseModel);

// get all the parquet files from the manifests
// TODO do we really want to download them all at once ahead of time?
Multimap<String, File> fileMap = getFilesForImport(tdrManifestImportTables);
calypsomatic (Contributor, Author):

Not sure if there's a good way to avoid downloading all parquet files ahead of time, given that we're downloading them, but I don't know how large we expect them to be and if there might be any performance or memory issues with the pod.

@calypsomatic (Contributor, Author) commented:

Added custom download class in 7122290. Please give your opinion on whether this is better than without it. I moved the actual downloading of the files in there as well; that could then be a place to add in retries.

this.fileMap = HashMultimap.create();
}

public void downloadFileFromURL(String tableName, URL pathToRemoteFile) {
Contributor:

This class is juggling three different responsibilities (being a multimap data structure, dealing with temp directory shenanigans, and doing file downloads and permissions), so I'm thinking about ways you could break those responsibilities apart.

I have an idea to kick out the Multimap responsibility:

  1. Change downloadFileFromUrl to return Optional<File> from this class. The file would only be absent in the empty file scenario.

  2. Move all the Multimap creation and management out of this class and into the caller. This would let you eliminate the isEmpty() and get() methods, and leave this class with just two responsibilities (which maybe can be broken apart further after the dust settles).

Contributor:

And here's my followup idea to eliminate the temp directory shenanigans.

Make the caller supply Path fileDir as a constructor arg. Then this class just has the responsibility of downloading its stuff into that directory. This class doesn't even have to know it's temporary, or that everything will be deleted some day. Then the caller can decide when it's time to blow everything up.
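Putting the two suggestions together, the class might end up shaped something like this; a hypothetical sketch, not the code as merged:

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Optional;
import org.apache.commons.io.FileUtils;

// Hypothetical shape: the caller owns the directory (and its eventual
// deletion) and the Multimap; this class only downloads files.
public class FileDownloader {
  private final Path fileDir; // supplied and lifecycle-managed by the caller

  public FileDownloader(Path fileDir) {
    this.fileDir = fileDir;
  }

  // Empty downloads are deleted and reported as absent, so the caller
  // decides what (not) to put in its Multimap.
  public Optional<File> downloadFileFromUrl(URL pathToRemoteFile) throws IOException {
    File tempFile = Files.createTempFile(fileDir, "tdr-", ".parquet").toFile();
    FileUtils.copyURLToFile(pathToRemoteFile, tempFile);
    if (tempFile.length() == 0) {
      Files.delete(tempFile.toPath());
      return Optional.empty();
    }
    return Optional.of(tempFile);
  }
}
```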

calypsomatic (Contributor, Author):

That almost makes me think it's not necessary to have the class at all... Originally the idea of the class was to hold both the multimap and the path to the directory, so that I could know where to find the files but also know which files were there, so as to easily iterate over them. Once the class existed, it seemed to make sense to give it some methods to deal with the file download. I agree it's kind of a mess as written, but if all it does is download files it's not a big win, plus I still have to deal with keeping track of a multimap and a path to a directory separately.

@@ -106,16 +102,24 @@ protected void executeInternal(UUID jobId, JobExecutionContext context) {
List<TdrManifestImportTable> tdrManifestImportTables =
extractTableInfo(snapshotExportResponseModel);

// get all the parquet files from the manifests

Contributor:

I think this class should be responsible for creating the temp directory and passing the Path to that directory down through getFilesForImport.

Then, later on, when this class wants to delete all the downloaded files, it can just delete the directory it already has a reference to.
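A sketch of that ownership, with recursive cleanup in a finally block (method and variable names are illustrative):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Illustrative: the job creates the directory, passes it down, and
// deletes it (deepest entries first) once the import is done.
static void importWithTempDir() throws IOException {
  Path importDir = Files.createTempDirectory("tdr-import-");
  try {
    // e.g. getFilesForImport(tdrManifestImportTables, importDir);
  } finally {
    try (Stream<Path> walk = Files.walk(importDir)) {
      walk.sorted(Comparator.reverseOrder())
          .forEach(p -> p.toFile().delete()); // children before parent
    }
  }
}
```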

@calypsomatic calypsomatic marked this pull request as ready for review January 19, 2024 20:41
private final DownloadHelper downloadHelper;

public interface DownloadHelper {
default void copyURLToFile(URL sourceUrl, File destinationFile) throws IOException {
calypsomatic (Contributor, Author):

Needed this method to not be abstract in order to mock it

Contributor:

Very weird, I wasn't expecting that

.when(mockDownloadHelper)
.copyURLToFile(any(URL.class), any(File.class));

// Create a RetryTemplate to set off Spring's retryable
calypsomatic (Contributor, Author):

Spring's @Retryable wasn't working, and among the several hideous options I could find to rig it to trigger, this one seemed the least convoluted.

Collaborator:

The fact that you need to include a RetryTemplate here is an indication that @Retryable is not working correctly in FileDownloadHelper. From this test, you should be able to just call helper.downloadFileFromURL() outside of a RetryTemplate and it will work.

I am pretty certain the reason that @Retryable is not working is that FileDownloadHelper is not a Spring bean, and therefore Spring can't set up any proxying for it and therefore can't implement retries. To use @Retryable you'll have to lean into Spring-ness. I can see a few options:

  1. FileDownloadHelper is a singleton bean (which is what we use everywhere else), and you move all statefulness about the temp dir into DownloadHelper
  2. FileDownloadHelper is a prototype bean which gets created on demand, including its temp-dir statefulness. You'll probably need a separate singleton factory bean to do the creation.

there are probably other ways to do it …

(I could pair on this / contribute code if my explanations don't make sense)
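For illustration, option 1 might look roughly like the sketch below, assuming spring-retry is on the classpath and @EnableRetry is set on a configuration class; this is not the code as merged.

```java
import java.io.File;
import java.io.IOException;
import java.net.URL;
import org.apache.commons.io.FileUtils;
import org.springframework.retry.annotation.Backoff;
import org.springframework.retry.annotation.Retryable;
import org.springframework.stereotype.Component;

// Illustrative singleton bean: because Spring manages the instance,
// the @Retryable proxy can actually intercept calls to this method.
@Component
public class FileDownloadHelper {
  @Retryable(maxAttempts = 3, backoff = @Backoff(delay = 500, multiplier = 2.0))
  public void downloadFileFromURL(URL sourceUrl, File destinationFile) throws IOException {
    FileUtils.copyURLToFile(sourceUrl, destinationFile);
  }
}
```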

yuliadub (Contributor) left a comment:

Looks like Sonar is still unhappy about us writing to a "public" dir, but otherwise the changes make sense to me!

Still sad there isn't a straightforward way to stream this; not sure if we want to create a ticket to revisit in a bit (6 months?) to see if anything changes.

this(dirName, FileUtils::copyURLToFile);
this(
dirName,
new DownloadHelper() {
@jladieu (Contributor) commented Jan 24, 2024:

I think this doesn't need the anonymous subclass & override anymore, since the default impl does the right thing. Just pass in new DownloadHelper().

@jladieu (Contributor) left a comment:

Bummer about @Retryable not doing the trick; I think that's worthy of a follow-up ticket to sort out, as it'd be really nice to have access to those annotation-based features. I think @ashanhol encountered problems with the @Observable annotation too.

EnumSet.of(
PosixFilePermission.OWNER_READ,
PosixFilePermission.GROUP_READ,
PosixFilePermission.OTHERS_READ);
Collaborator:

could this be locked down even further, to just OWNER_READ? Is there a need for group/others to also read?
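The stricter variant the comment suggests would be something like this (POSIX systems; the file argument is a placeholder):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.PosixFilePermission;
import java.util.EnumSet;

// Illustrative: owner-read only; no group/others access at all.
static void lockDown(Path downloadedFile) throws IOException {
  Files.setPosixFilePermissions(downloadedFile, EnumSet.of(PosixFilePermission.OWNER_READ));
}
```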


@calypsomatic (Contributor, Author) commented:
Removed retry in 51e0682 and created https://broadworkbench.atlassian.net/browse/AJ-1555 to put it back and also take a look at possible memory issues.

Resource emptyParquet;

@Test
void downloadEmptyFile() throws IOException {
calypsomatic (Contributor, Author):

This test is a duplicate of the test in TdrManifestQuartzJobTest. This seems like the place for it, but should I alter the other test to verify the behavior of the actual TdrManifestQuartzJob with an empty file, or is this test sufficient?

Collaborator:

I think it's worth testing both TdrManifestQuartzJob.getFilesForImport() in TdrManifestQuartzJobTest and FileDownloadHelper.downloadFileFromURL() here, even though they're very similar.

davidangb (Collaborator) left a comment:

thaaaank you


sonarcloud bot commented Jan 25, 2024

@calypsomatic merged commit 499ad69 into main on Jan 26, 2024 (14 checks passed)
@calypsomatic deleted the aj-1517-direct-http branch on January 26, 2024 14:15