-
Notifications
You must be signed in to change notification settings - Fork 5
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Indexing language improvements #81
base: main
Are you sure you want to change the base?
Indexing language improvements #81
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall supportive of the added narrative. There is some usage of attribute name/key/value that didn't seem to be consistent throughout the added text. We just need to pick something that's consistent.
index.html
Outdated
<a>storage agents</a> to search the information without leaking information | ||
that would create privacy concerns. | ||
that would create privacy concerns. See <a href="#creating-encrypted-indexes"></a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would create privacy concerns. See <a href="#creating-encrypted-indexes"></a> | |
that would create privacy concerns. See <a href="#blinded-document-attributes"></a> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed!
index.html
Outdated
@@ -2933,10 +2931,10 @@ <h3> | |||
|
|||
<p> | |||
If <code>returnFullDocuments</code> was set to false, a successful query will result in a standard HTTP 200 response with a list | |||
of <a>EncryptedDocument</a> identifiers that contain the value: | |||
of <a>EncryptedDocument</a> identifiers that contain the attribute key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shouldn't this be attribute value?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are the "has" query examples that only use an attribute key - but regardless, I changed it to say "...that match the query" to be more consistent with the previous examples - so you just have to check the example above it to see the query type
index.html
Outdated
</p> | ||
|
||
<pre class="example highlight" title="data vault query for a particular attribute name"> | ||
<pre class="example highlight" title="data vault query for a particular attribute key"> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The language above uses "attribute value" and "attribute name" -- wondering what we're going for wrt. consistency?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oops - I had intended to use the term "key" instead of "name" everywhere. Looks like I missed a few spots. I've updated it now.
index.html
Outdated
@@ -2947,10 +2945,10 @@ <h3> | |||
|
|||
<p> | |||
If <code>returnFullDocuments</code> was set to true, a successful query will result in a standard HTTP 200 response with a list | |||
of <a>EncryptedDocuments</a> that contain the value: | |||
of <a>EncryptedDocuments</a> that contain the attribute key: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
attribute key or attribute name or attribute value? :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed it to say "...that match the query" to be more consistent with the previous examples - so you just have to check the example to see what type of query it is (attribute key in this case)
index.html
Outdated
@@ -3001,6 +2999,38 @@ <h3> | |||
|
|||
</section> | |||
|
|||
<section class="normative"> | |||
<h3> | |||
Add Index |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add Index | |
Adding an index |
W3C style guides are moving to "sentence capitalization for titles" based on some longitudinal study they did on titles. Don't know if this title fits, but we might think about how folks would read this entry in a table of contents. The section seems to be about "adding an index"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
index.html
Outdated
</h3> | ||
|
||
<p> | ||
Encrypted Documents can have embedded key-value attribute pairs, as seen in <a href="#blinded-document-attributes"></a>. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Encrypted Documents can have embedded key-value attribute pairs, as seen in <a href="#blinded-document-attributes"></a>. | |
Encrypted Documents can have embedded attribute key-value pairs, as seen in <a href="#blinded-document-attributes"></a>. |
... or maybe "attribute name-value pairs"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done!
index.html
Outdated
|
||
{ | ||
"operation": "add", | ||
"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="] | |
"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="] |
is this an attribute key or an attribute name-value pair?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a set of attribute keys.
498bad2
to
8c67acf
Compare
Thanks so much for reviewing @msporny! I made some changes/fixes based on your comments. I was meaning to use "key-value" everywhere (replacing any previous instances of name-value). I was thinking that "key-value" is probably the most common term for this sort of thing, but I'm not opposed to "name-value". Let me know if you have any other questions/comments! |
Google says "key-value" has 8.8B results and "name-value" has 9.7B results... close enough to just follow your heart on this one, @DRK3. :P -- I'm fine w/ either. |
@msporny After reading your comment, it got me thinking - I did some searches and found a Wikipedia article for "Key-value database", but I also found an article for "Name-value pair" which uses the term "attribute" a lot... so now my heart is torn, haha... I'm thinking maybe "name-value" might be better after all? I'm not sure. Maybe it's better to avoid the term "key" to prevent confusion with cryptographic keys? hmm... Side note - totally inconsequential at this point, but here's something (somewhat) interesting - I tried doing the same Google searches you did, but I enclosed the terms with quotes to make Google do exact matches only, and I get 33.7M results for "key-value" and 11.4M results for "name-value"... I think because without the quotes, it's doing partial matches as well... Anyway, I think I'm starting to fall into the "name-value" camp! Anyone else that can help us with our bikeshedding? :P |
@msporny Even found someone asking the same question from 2001. They think "key-value" has more of a database lookup connotation to it. Which our usage here kind of sort of does...? But our usage also isn't exactly a hash table necessarily? Bleargh, names are hard. |
Yes, that would be my gut feeling as well.
Oh, weird... yes, I tried quotes and got totally different results:
😆 Welcome to spec-writers hell, @DRK3! One of @dlongley's favorite jokes is: There are only two hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. (originally quipped by Martin Fowler) |
8c67acf
to
9c9d6a3
Compare
@msporny Since it sounds like "name-value" is the better name at this point, I've updated my PR to use "name" instead of "key". I also realized that the actual Encrypted Document JSON uses the term "name" and "value" for each attribute... I somehow missed that in all this. But now they're aligned again with the spec text, which is good! |
I'm +1 to the language changes in this PR, but -1 to the new "add index" endpoint, at least at this time. I think adding a new endpoint for this significantly changes the design and responsibilities. That doesn't mean it's a direction we couldn't head, but right now, the EDV architecture and client / server responsibilities are different from traditional databases. A lot of that has to do with the server being blind to data relationships. Therefore, every attribute is always indexed -- and that's a guarantee that any client has when working with any server. Well, at the very least, queries can always be performed on any attribute. Additionally, if a client wants a compound index (e.g., to enforce uniqueness within groups), they create a new attribute for it -- such that it looks like any other attribute to the server. They don't instruct the server to combine existing attributes in new indexes to support their queries, potentially revealing information to the server in the process. If a server implementation doesn't want to build indexes to make queries on attributes fast, that's fine, I guess, but I don't think we should add more complexity such that clients must tell servers to build indexes for the attributes they would like queries to be fast for. Why do we need this feature? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-1 to the new "add index" endpoint
I completely missed the addition of the new feature, which is obvious in hindsight. I'm a -1 as well on that change in this particular PR. I thought all this PR was attempting to do was improve the narrative, not add indexing to the server.
That said, it's not clear to me what "add index" is meant to accomplish (making searching faster, yes, but our approach has always been to build these indexes client-side if necessary, IIRC).
@dlongley That's a good question! Here's why this endpoint is necessary: in theory, having the EDV server determine when/what to build indexes on sounds nice, but in practice this is problematic. Our EDV server implementation works with another database provider to do the actual storage, for instance MongoDB, CouchDB, Postgres, etc. The way these databases work (and I think how most work) is that, in general, they expect an explicit command to tell it what to create indexes on and when. So the EDV server must tell these databases what to create indexes on. If we want to just assume all attributes are indexed (which you probably should not, as building indexes by itself carries a cost), then there are some challenges here. How does the EDV server tell the underlying database to create indexes? When does it do this? Here are the possibilities I can think of:
So basically the issue here is that when you go and try to implement an EDV, you end up with this dilemma. It becomes complicated. The underlying databases expect index creation to be explicit calls. From what I can see, the proper solution is to make an EDV follow the pattern of other databases whenever possible (in this case, explicit index management). This would be essentially just a pass-through type of operation. I can't think of a realistic privacy issue here. You're basically telling the server that there's some attribute that you may or may not want to query a lot in the future. Remember that a server will be able to tell if some attribute is being queried a lot anyway, so this type of information can't really be hidden. This whole perspective is from my point of view as an implementer, and the actual issues that I just recently ran into (and I assume others will run into eventually when trying to make an EDV work in the real world). @msporny Having the client build and manage indexes limits the usefulness of EDVs. Here's the example from our Trustbloc platform: we have a web wallet that uses an EDV for storing credentials. The web wallet runs in a browser and does not have guaranteed persistent storage. Where would the index be stored, if not by the EDV? When logging in to another browser, the index would be lost. It makes sense to me that the EDV (the persistent storage) deals with indexes. Looking forward to hearing your thoughts! (thanks for reviewing my PR as well BTW!) |
We have implemented EDV storage using both MongoDB and PouchDB (which is CouchDB-based). We did not need to do either option above (1 nor 2). With MongoDB you can simply create one compound index on the fields in an EDV doc's With PouchDB/CouchDB, you can create computed fields of attribute names and attribute names + values (plus a uniqueness flag if indicated in the document) and index these. Again, no polling solutions are necessary. The same approach should be replicable with just about any other database backend. EDIT: It's worth noting that the attributes that a client wants indexed ... are the very attributes that are expressed in the document. So there's no need to communicate that information in another way via some other channel. If the document includes encrypted attributes, it's because those should be indexed for queries. |
@dlongley I spent some time today looking into it today and... You're absolutely right. While I knew that MongoDB could do this type of indexing using a multikey index, but what was causing me an issue was other DBs like CouchDB and PostgreSQL. CouchDB, for instance, allows you to query on those attributes, but can't directly index them - you have to do a workaround like what you were suggesting where you sort of "collapse" all the attributes into a flat array (and should work for any DB that can do an index on an array). The workaround I came up with was to put those attributes in a map in the JSON as direct name-value pairs and then I could create indexes directly using those attribute names, but I also realized now that that has the side effect of not allowing multiple name-value pairs that share the same name (due to the map structure), so your workaround is better and doesn't require a spec change. I'll remove that Add Index endpoint from my PR shortly. Oh, one more thing - one of the reasons I was doing the "Add Index" type of endpoint initially was because the spec actually already briefly mentions a "Create Index" type of operation for a vault in section 3.4.1, so I thought index creation was perhaps intended to be an explicit thing, but maybe it wasn't in the spec yet... Based on our discussion here, I suppose that should be removed, right? Or is that referring to something else? @msporny |
The spec has the concept of encrypted indexing which essentially is the use of blinded attributes/properties/tags in Encrypted Documents stored within a vault. Some of the language used in the spec on this subject is somewhat confusing. For instance, the term "index" is used in some places to refer to what are really just key-value attribute pairs. In database/storage terminology, in general an "index" refers to the underlying mechanism by which a database speeds up data queries. It is usually optional - queries can still be done without them. I've updated some of the terminology to be more consistent and hopefully easier to follow by using the term "attribute" when referring to the key-value pairs that may exist in Encrypted Documents and the term "index" to refer to the mechanism by which querying can be sped up.
9c9d6a3
to
5f3cfb7
Compare
Just updated the PR to remove the add index endpoint. I also updated the PR title and commit message. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
One other thing perhaps you can take a look at @dlongley:
|
The spec has the concept of encrypted indexing which essentially is the use of blinded attributes/properties/tags in Encrypted Documents stored within a vault. Some of the language used in the spec on this subject is somewhat confusing. For instance, the term "index" is used in some places to refer to what are really just key-value attribute pairs. In database/storage terminology, in general an "index" refers to the underlying mechanism by which a database speeds up data queries. It is usually optional - queries can still be done without them.
I've updated some of the terminology to be more consistent and hopefully easier to follow by using the term "attribute" when referring to the key-value pairs that may exist in Encrypted Documents and the term "index" to refer to the mechanism by which querying can be sped up.