Skip to content
This repository has been archived by the owner on Mar 8, 2020. It is now read-only.

Expose node hashes #105

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Expose node hashes #105

wants to merge 1 commit into from

Conversation

dennwc
Copy link
Member

@dennwc dennwc commented Jun 25, 2019

Node hashes were already available in libuast, but only on C level. This PR adds the same functionality to the C++ layer used by clients. Also adds a test case.

Fixes #78

Signed-off-by: Denys Smirnov denys@sourced.tech


This change is Reviewable

@dennwc dennwc requested review from creachadair and bzz June 25, 2019 15:34
@dennwc dennwc self-assigned this Jun 25, 2019
Copy link
Contributor

@creachadair creachadair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1.
Reviewable status: all files reviewed, 2 unresolved discussions (waiting on @bzz and @dennwc)


src/libuast.hpp, line 63 at r1 (raw file):

    // NodeHash is a hash of a node subtree.
    struct NodeHash {
        uint8_t data[UAST_HASH_SIZE];

Maybe this is already defined elsewhere, but what is the format of these data? A binary sha1 or something?


src/uast.h, line 93 at r1 (raw file):

typedef enum { UAST_BINARY = 0, UAST_YAML = 1 } UastFormat;

typedef enum {

Can you please add some comments here, to define what these mean?

Is this meant to be a bitwise-or of flags (e.g., all minus whatever is set?), or will these be mutually exclusive?

Copy link
Contributor

@bzz bzz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 4 of 4 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @dennwc)


tests/nodes_test.h, line 656 at r1 (raw file):

    0x37, 0xcc, 0xfc, 0x42, 0xf, 0x52, 0x65, 0x48,
    0x1d, 0x59, 0x18, 0xce, 0x1, 0xad, 0xda, 0xa2,
    0x82, 0x4c, 0x74, 0x77, 0xae, 0xa1, 0x26, 0xb5};

May be something obvious, but what is the best way to update this value, in case one would need to re-generate it later?

I.e is there some way to verify correctness of the expected hash value with some external tool e.g piping something to sha256sum, etc

Copy link
Member Author

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @bzz, @creachadair, and @dennwc)


src/libuast.hpp, line 63 at r1 (raw file):

Previously, creachadair (M. J. Fromberger) wrote…

Maybe this is already defined elsewhere, but what is the format of these data? A binary sha1 or something?

Yes, it's raw bytes of some hash. I hoped the uint8_t type will give an insight into it (as opposed to char that may mean text data).


src/uast.h, line 93 at r1 (raw file):

Previously, creachadair (M. J. Fromberger) wrote…

Can you please add some comments here, to define what these mean?

Is this meant to be a bitwise-or of flags (e.g., all minus whatever is set?), or will these be mutually exclusive?

Good point! Will document that.


tests/nodes_test.h, line 656 at r1 (raw file):

Previously, bzz (Alexander) wrote…

May be something obvious, but what is the best way to update this value, in case one would need to re-generate it later?

I.e is there some way to verify correctness of the expected hash value with some external tool e.g piping something to sha256sum, etc

The way to regenerate is to call the UastHash itself mostly, which will call into the same function in SDK. Manually piping won't work, because SDK constructs a special (fast, non-reversible) serialization of the tree for the hash, whatever it is.

It's not ideal of course, because the only way to check the hash is to trust the current implementation. But at least it will notify us if the binding breaks, or if SDK breaks the hash.

Also, it doesn't seem we have the same test case in SDK for some reason. Adding it will fix the issues you pointed out, at least partially.

Copy link
Contributor

@creachadair creachadair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @bzz and @dennwc)


src/libuast.hpp, line 63 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

Yes, it's raw bytes of some hash. I hoped the uint8_t type will give an insight into it (as opposed to char that may mean text data).

I was hoping we might document which hash somewhere. Of course it could be anything, and we can change it later if we need to, but to verify it someone will need to know which it is 🙂


tests/nodes_test.h, line 656 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

The way to regenerate is to call the UastHash itself mostly, which will call into the same function in SDK. Manually piping won't work, because SDK constructs a special (fast, non-reversible) serialization of the tree for the hash, whatever it is.

It's not ideal of course, because the only way to check the hash is to trust the current implementation. But at least it will notify us if the binding breaks, or if SDK breaks the hash.

Also, it doesn't seem we have the same test case in SDK for some reason. Adding it will fix the issues you pointed out, at least partially.

That's still useful to know, and might be worth a mention in the comment:

// This value must be updated if the algorithm for structural hashing is changed,
// or if the tree structure for the test module changes in a way that modifies the
// structural hash. To update it, copy the actual value from the test failure and
// update this array.

or words to that effect. I suggest also making sure the test failure will print the actual value so that someone doesn't have to write a special program to update the test.

Copy link
Member Author

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 2 of 4 files reviewed, all discussions resolved (waiting on @bzz and @creachadair)


src/libuast.hpp, line 63 at r1 (raw file):

Previously, creachadair (M. J. Fromberger) wrote…

I was hoping we might document which hash somewhere. Of course it could be anything, and we can change it later if we need to, but to verify it someone will need to know which it is 🙂

I'd rather not to. It's nearly useless for the client since it cannot verify the hash anyway because of custom node serialization.


tests/nodes_test.h, line 656 at r1 (raw file):

Previously, creachadair (M. J. Fromberger) wrote…

That's still useful to know, and might be worth a mention in the comment:

// This value must be updated if the algorithm for structural hashing is changed,
// or if the tree structure for the test module changes in a way that modifies the
// structural hash. To update it, copy the actual value from the test failure and
// update this array.

or words to that effect. I suggest also making sure the test failure will print the actual value so that someone doesn't have to write a special program to update the test.

Done!

Copy link
Contributor

@creachadair creachadair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! all files reviewed, all discussions resolved


src/libuast.hpp, line 63 at r1 (raw file):

Previously, dennwc (Denys Smirnov) wrote…

I'd rather not to. It's nearly useless for the client since it cannot verify the hash anyway because of custom node serialization.

That makes sense. It's probably still helpful to document (somewhere) what properties the hash requires. For example maybe:

// The data are a cryptographic-quality fingerprint of the tree structure.
// The specific algorithm is not specified and may change across library versions.
// Clients may compare hash values for equality to match equivalent nodes, but cannot
// recompute hash values directly, as the input depends on details of serialization.

or words to that effect.

If we say this, though, we're also implicitly saying that the client may not safely persist node hashes if they will be read later by a different library version. So maybe we don't want to allow it to change.

Signed-off-by: Denys Smirnov <denys@sourced.tech>
Copy link
Member Author

@dennwc dennwc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 4 files reviewed, all discussions resolved (waiting on @bzz and @creachadair)


src/libuast.hpp, line 63 at r1 (raw file):

Previously, creachadair (M. J. Fromberger) wrote…

That makes sense. It's probably still helpful to document (somewhere) what properties the hash requires. For example maybe:

// The data are a cryptographic-quality fingerprint of the tree structure.
// The specific algorithm is not specified and may change across library versions.
// Clients may compare hash values for equality to match equivalent nodes, but cannot
// recompute hash values directly, as the input depends on details of serialization.

or words to that effect.

If we say this, though, we're also implicitly saying that the client may not safely persist node hashes if they will be read later by a different library version. So maybe we don't want to allow it to change.

Added the comment. Also opened an issue in SDK for persistent hashes: bblfsh/sdk#410.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Feature request: node hashes
3 participants