Always use the original file format when downloading files #314
Otherwise, the files will not have the correct hashes, because Dataverse replaces, e.g., CSV or R table files with its own tabular data format.
To reproduce issue #307 before and after the fix, either use an account at demo.dataverse.org or use the Docker setup.
Preparation
I used pipx to install everything into a venv, but pip should do the trick as well; with pipx I was able to ensure that git-annex would pick up the binaries easily without having to fiddle with the PATH too much.
Next we do the git setup, a local and "remote" repository:
After that, we can set up the datalad dataset and the token with the credential helper; you might have to remove the credential if you already stored it. Also note that copying the token from the website and then doing a "formatted paste" (which unfortunately became the default on macOS a while ago) inserts a couple of newlines, causing the credential helper to fail to store the token; watch out for that little checkmark. I also set up the sibling here, for a dataset which I had already created at my local instance (or at the demo instance, it does not really matter).
```
$ datalad create --force
$ datalad credentials set demodataverse
secret:
secret (repeat):
demodataverse(secret ✓):
$ datalad add-sibling-dataverse --credential demodataverse -s dataverse http://localhost:8080 doi:10.5072/FK2/S1OTS1
```
At this point, the current main branch will throw an exception and fail; I prepared a fix in #309.
After applying that fix, the add-sibling-dataverse call works as expected.
Actual issue
As @behinger described, and as I already discussed briefly in #307, the issue occurs only with tabular data, so let's add a simple csv:
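For reference, any small csv will do, since Dataverse ingests csv files into its tabular format regardless of content. A minimal sketch (file name and contents are hypothetical):

```python
import csv
import os
import tempfile

# Write a tiny csv; Dataverse's tabular ingest will kick in for any csv.
path = os.path.join(tempfile.mkdtemp(), "test.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value"])
    writer.writerow([1, "a"])
    writer.writerow([2, "b"])
```

Saving and pushing this file to the dataverse sibling is then the usual datalad workflow.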
So far, so good. Next, we need to ensure that the file is "gone", so we drop it in order to get it again later.
Before we manage to get to @behinger's issue, there are two other errors, which I reported in #310 and #312, with corresponding PRs #311 and #313. However, with the fixes #309, #311, and #313 in place, we can finally reproduce the issue:
To debug this, I adjusted the download_file method to write the retrieved file to a location where git-annex would not delete it after a successful transfer but failed validation (comments removed for brevity):
Now we can see the two files:

- Downloaded (has tabs on the filesystem...)
- Original

and it becomes clear that the checksums will not match.
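The mismatch is easy to illustrate: git-annex validates the download against the checksum of the originally uploaded bytes, but the tab-delimited derivative Dataverse serves back has different bytes. A minimal sketch with made-up file contents:

```python
import hashlib

original = b"id,value\n1,a\n2,b\n"    # bytes as uploaded
ingested = b"id\tvalue\n1\ta\n2\tb\n"  # Dataverse's tab-delimited derivative (illustrative)

def md5(data: bytes) -> str:
    return hashlib.md5(data).hexdigest()

# Different bytes, different checksum: git-annex rejects the download.
assert md5(original) != md5(ingested)
```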
To fix it, we can pass `data_format="original"` as a parameter to the `get_datafile` method, which is all this PR is about.
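If I understand the Dataverse data access API correctly, this parameter simply adds a `format=original` query parameter to the datafile request, which makes Dataverse return the file as uploaded instead of its tabular derivative. Roughly (file id and host are hypothetical):

```python
from urllib.parse import urlencode

file_id = 42  # hypothetical database id of the datafile
base = f"http://localhost:8080/api/access/datafile/{file_id}"

ingested_url = base  # default: tabular files come back in Dataverse's own format
original_url = base + "?" + urlencode({"format": "original"})  # file as uploaded
```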
Doing so, with the other patches from #309, #311, and #313 applied, leads to:
Note that this pull request will cause a conflict with #311; to resolve it, pass both parameters (unless we rework #311).