Always use the original file format when downloading files #314

Merged: 2 commits into datalad:main, Jul 19, 2024

Conversation

shoeffner
Contributor

Otherwise, the files will not have the correct hashes, as Dataverse replaces, e.g., CSV or R table files with its own tabular data format.

To reproduce issue #307 before and after the fix, either use an account at demo.dataverse.org or set up a local instance via the Docker setup.

Preparation

I used pipx to install everything into a venv, but pip should do the trick as well. With pipx I was able to ensure that git-annex would pick up the binaries without having to fiddle around with the PATH too much.

$ cd datalad-dataverse
$ pipx install datalad
$ pipx inject --include-apps --force datalad datalad-next
$ pipx inject --include-apps --force --editable datalad .

Next, we do the git setup: a local repository and a "remote" one:

$ cd ..  # back out of the git checkout of datalad-dataverse
$ mkdir remote-git  # this will be our git remote "origin", but any git remote will be fine; in fact we could probably do without
$ cd remote-git
$ git init --bare
$ cd ..
$ mkdir local-ds  # this is our local dataset
$ cd local-ds
$ git init
$ git remote add origin ../remote-git
$ git commit --allow-empty -m "initial commit"
$ git push --set-upstream origin main

After that, we can set up the datalad dataset and store the token with the credential helper; you might have to remove the credential first if you already stored it. Also note that copying the token from the website and then doing a "formatted paste" (which unfortunately became the default on macOS a while ago) inserts a couple of newlines, causing the credential helper to fail to store the token; watch out for that little checkmark.
I also set up the sibling here, pointing at a dataset which I already created at my local instance (or at the demo instance, it does not really matter).

$ datalad create --force
$ datalad credentials set demodataverse
secret:
secret (repeat):
demodataverse(secret ✓):
$ datalad add-sibling-dataverse --credential demodataverse -s dataverse http://localhost:8080 doi:10.5072/FK2/S1OTS1

At this point, the current main branch will throw an exception and fail; I prepared a fix in #309.
After applying that fix, the add-sibling-dataverse call works as expected.

Actual issue

As @behinger described and as I already discussed briefly in #307, the issue occurs only with tabular data, so let's add a simple CSV:

$ cat <<EOF > test.csv
a,b,c,d,e
1,2,3,4,5
6,7,8,9,0
EOF
$ datalad save test.csv -m "add test.csv"
$ git push
$ datalad push --to dataverse

So far, so good. Next, we need to ensure that the file is "gone", so we drop it in order to get it again later.

$ datalad drop test.csv

Before we manage to get to @behinger's issue, there are two other errors occurring, which I report in #310 and #312, with corresponding PRs #311 and #313. However, with the fixes #309, #311, and #313 in place, we can finally reproduce the issue:

$ datalad get test.csv
get(error): test.csv (file) [Verification of content failed]

To debug this, I adjusted the download_file method and let it also write the retrieved file to a location where git-annex would not delete it after a transfer that fails content verification (comments removed for brevity):

    def download_file(self, fid: int, path: Path):
        response = self.data_access_api.get_datafile(fid, is_pid=False)
        response.raise_for_status()
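        # additionally mirror each chunk to ~/test.csv, a location that
        # git-annex does not clean up when content verification fails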
        with path.open("wb") as f, Path('~/test.csv').expanduser().open('wb') as f2:
            for chunk in response.iter_bytes(chunk_size=None):
                f.write(chunk)
                f2.write(chunk)

Now we can see the two files:
Downloaded (has tabs on the filesystem...)

a       b       c       d       e
1       2       3       4       5
6       7       8       9       0

Original

a,b,c,d,e
1,2,3,4,5
6,7,8,9,0

and it becomes clear that the checksums will not match (a quick hash comparison, sketched below, confirms this).
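
For completeness, here is a minimal sketch (not part of the PR) that confirms the mismatch with hashlib; it recreates the original CSV content from the heredoc above and reads the debug copy written by the adjusted download_file:

    import hashlib
    from pathlib import Path

    # content of test.csv exactly as created above
    original = b"a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,0\n"
    # debug copy written by the adjusted download_file
    downloaded = (Path.home() / "test.csv").read_bytes()

    print(hashlib.sha256(original).hexdigest())
    print(hashlib.sha256(downloaded).hexdigest())  # differs: tabs instead of commas
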
To fix it, we can pass data_format="original" as a parameter to the get_datafile method, which is all this PR is about.
Doing so, and applying the other patches in #309, #311, #313, leads to:

    def download_file(self, fid: int, path: Path):
        # pydataverse does not support streaming downloads
        # https://github.com/gdcc/pyDataverse/issues/49
        # the code below is nevertheless readied for such a
        # scenario
        response = self.data_access_api.get_datafile(fid, is_pid=False, data_format="original")
        # http error handling
        response.raise_for_status()
        with path.open("wb") as f:
            # `chunk_size=None` means
            # "read data in whatever size the chunks are received"
            for chunk in response.iter_bytes(chunk_size=None):
                f.write(chunk)
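
For reference, pyDataverse's get_datafile with data_format="original" maps to Dataverse's Data Access API and its format=original query parameter. The sketch below is only an illustration with placeholder host, file id, and token; it is not part of this PR:

    import requests

    base_url = "http://localhost:8080"  # placeholder: your Dataverse instance
    file_id = 42                        # placeholder: numeric id of the datafile
    api_token = "xxxxxxxx"              # placeholder: your API token

    response = requests.get(
        f"{base_url}/api/access/datafile/{file_id}",
        # without format=original, tabular files come back in the archival
        # tab-separated format and the checksums will not match
        params={"format": "original"},
        headers={"X-Dataverse-key": api_token},
        timeout=60,
    )
    response.raise_for_status()
    print(response.content)  # comma-separated, matching the original file
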

Note that this merge request will cause a conflict with #311; to resolve it, add both parameters (unless we rework #311).


Closes datalad#307.
@shoeffner
Contributor Author

Note that I force-pushed twice on the branch: once to include the "Closes" in the git message, and a second time because I accidentally pushed my local copy with all other merge requests resolved instead of the fixed commit. Sorry!

Member

@mih mih left a comment


Thank you for the PR and the detailed description!

I resolved the conflict you anticipated and will merge.

@mih mih merged commit 0621e2d into datalad:main Jul 19, 2024
4 of 5 checks passed