Always use the original file format when downloading files #314

Merged: 2 commits into datalad:main, Jul 19, 2024

Conversation

shoeffner
Contributor

Otherwise, the files will not have the correct hashes, as Dataverse replaces, e.g., CSV or R table files with its own tabular data format.

To reproduce issue #307 before and after the fix, either use an account at demo.dataverse.org or set up a local instance via the Docker setup.

Preparation

I used pipx to install everything into a venv, but pip should do the trick as well. With pipx I was able to ensure that git-annex would pick up the binaries without having to fiddle around with the PATH too much.

$ cd datalad-dataverse
$ pipx install datalad
$ pipx inject --include-apps --force datalad datalad-next
$ pipx inject --include-apps --force --editable datalad .

Next, we do the git setup: a local repository and a "remote" one:

$ cd ..  # back out of the git checkout of datalad-dataverse
$ mkdir remote-git  # this will be our git remote "origin", but any git remote will be fine; in fact we could probably do without
$ cd remote-git
$ git init --bare
$ cd ..
$ mkdir local-ds  # this is our local dataset
$ cd local-ds
$ git init
$ git remote add origin ../remote-git
$ git commit --allow-empty -m "initial commit"
$ git push --set-upstream origin main

After that, we can set up the datalad dataset and store the token with the credential helper; you might have to remove the credential first if you already stored it. Also note that copying the token from the website and then doing a "formatted paste" (which unfortunately became the default on macOS a while ago) inserts a couple of newlines, causing the credential helper to fail to store the token; watch out for that little checkmark.
I also set up the sibling here, pointing at a dataset which I already created at my local instance (or at the demo instance, it does not really matter).

$ datalad create --force
$ datalad credentials set demodataverse
secret:
secret (repeat):
demodataverse(secret ✓):
$ datalad add-sibling-dataverse --credential demodataverse -s dataverse http://localhost:8080 doi:10.5072/FK2/S1OTS1

At this point, the current main branch will throw an exception and fail; I prepared a fix in #309.
After applying that fix, the add-sibling-dataverse call works as expected.

Actual issue

As @behinger described and as I already discussed briefly in #307, the issue occurs only with tabular data, so let's add a simple CSV:

$ cat <<EOF > test.csv
a,b,c,d,e
1,2,3,4,5
6,7,8,9,0
EOF
$ datalad save test.csv -m "add test.csv"
$ git push
$ datalad push --to dataverse

So far, so good. Next, we need to ensure that the file is "gone", so we drop it in order to get it again later.

$ datalad drop test.csv

Before we manage to get to @behinger's issue, there are two other errors occurring, which I report in #310 and #312, with corresponding PRs #311 and #313. However, with the fixes #309, #311, and #313 in place, we can finally reproduce the issue:

$ datalad get test.csv
get(error): test.csv (file) [Verification of content failed]

To debug this, I adjusted the download_file method and let it also write the retrieved file to a location where git-annex would not delete it after a transfer that fails content verification (comments removed for brevity):

    def download_file(self, fid: int, path: Path):
        response = self.data_access_api.get_datafile(fid, is_pid=False)
        response.raise_for_status()
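        # additionally mirror each chunk to ~/test.csv, a location that
        # git-annex does not clean up when content verification fails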
        with path.open("wb") as f, Path('~/test.csv').expanduser().open('wb') as f2:
            for chunk in response.iter_bytes(chunk_size=None):
                f.write(chunk)
                f2.write(chunk)

Now we can see the two files:
Downloaded (has tabs on the filesystem...)

a       b       c       d       e
1       2       3       4       5
6       7       8       9       0

Original

a,b,c,d,e
1,2,3,4,5
6,7,8,9,0

and it becomes clear that the checksums will not match (a quick hash comparison, sketched below, confirms this).
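
For completeness, here is a minimal sketch (not part of the PR) that confirms the mismatch with hashlib; it recreates the original CSV content from the heredoc above and reads the debug copy written by the adjusted download_file:

    import hashlib
    from pathlib import Path

    # content of test.csv exactly as created above
    original = b"a,b,c,d,e\n1,2,3,4,5\n6,7,8,9,0\n"
    # debug copy written by the adjusted download_file
    downloaded = (Path.home() / "test.csv").read_bytes()

    print(hashlib.sha256(original).hexdigest())
    print(hashlib.sha256(downloaded).hexdigest())  # differs: tabs instead of commas
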
To fix it, we can pass data_format="original" as a parameter to the get_datafile method, which is all this PR is about.
Doing so, and applying the other patches in #309, #311, #313, leads to:

    def download_file(self, fid: int, path: Path):
        # pydataverse does not support streaming downloads
        # https://github.com/gdcc/pyDataverse/issues/49
        # the code below is nevertheless readied for such a
        # scenario
        response = self.data_access_api.get_datafile(fid, is_pid=False, data_format="original")
        # http error handling
        response.raise_for_status()
        with path.open("wb") as f:
            # `chunk_size=None` means
            # "read data in whatever size the chunks are received"
            for chunk in response.iter_bytes(chunk_size=None):
                f.write(chunk)
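
For reference, pyDataverse's get_datafile with data_format="original" maps to Dataverse's Data Access API and its format=original query parameter. The sketch below is only an illustration with placeholder host, file id, and token; it is not part of this PR:

    import requests

    base_url = "http://localhost:8080"  # placeholder: your Dataverse instance
    file_id = 42                        # placeholder: numeric id of the datafile
    api_token = "xxxxxxxx"              # placeholder: your API token

    response = requests.get(
        f"{base_url}/api/access/datafile/{file_id}",
        # without format=original, tabular files come back in the archival
        # tab-separated format and the checksums will not match
        params={"format": "original"},
        headers={"X-Dataverse-key": api_token},
        timeout=60,
    )
    response.raise_for_status()
    print(response.content)  # comma-separated, matching the original file
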

Note that this merge request will cause a conflict with #311; to resolve it, add both parameters (unless we rework #311).


Closes datalad#307.
@shoeffner
Contributor Author

Note that I force-pushed twice on the branch: once to include the "Closes" in the git message, and a second time because I accidentally pushed my local copy with all other merge requests resolved instead of the fixed commit. Sorry!

Member

@mih mih left a comment


Thank you for the PR and the detailed description!

I resolved the conflict you anticipated and will merge.

@mih mih merged commit 0621e2d into datalad:main Jul 19, 2024
4 of 5 checks passed