Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Indexing language improvements #81

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

DRK3
Copy link

@DRK3 DRK3 commented Apr 26, 2022

The spec has the concept of encrypted indexing which essentially is the use of blinded attributes/properties/tags in Encrypted Documents stored within a vault. Some of the language used in the spec on this subject is somewhat confusing. For instance, the term "index" is used in some places to refer to what are really just key-value attribute pairs. In database/storage terminology, in general an "index" refers to the underlying mechanism by which a database speeds up data queries. It is usually optional - queries can still be done without them.

I've updated some of the terminology to be more consistent and hopefully easier to follow by using the term "attribute" when referring to the key-value pairs that may exist in Encrypted Documents and the term "index" to refer to the mechanism by which querying can be sped up.

Copy link
Contributor

@msporny msporny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall supportive of the added narrative. There is some usage of attribute name/key/value that didn't seem to be consistent throughout the added text. We just need to pick something that's consistent.

index.html Outdated
<a>storage agents</a> to search the information without leaking information
that would create privacy concerns.
that would create privacy concerns. See <a href="#creating-encrypted-indexes"></a>
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
that would create privacy concerns. See <a href="#creating-encrypted-indexes"></a>
that would create privacy concerns. See <a href="#blinded-document-attributes"></a>

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed!

index.html Outdated
@@ -2933,10 +2931,10 @@ <h3>

<p>
If <code>returnFullDocuments</code> was set to false, a successful query will result in a standard HTTP 200 response with a list
of <a>EncryptedDocument</a> identifiers that contain the value:
of <a>EncryptedDocument</a> identifiers that contain the attribute key:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be attribute value?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are the "has" query examples that only use an attribute key - but regardless, I changed it to say "...that match the query" to be more consistent with the previous examples - so you just have to check the example above it to see the query type

index.html Outdated
</p>

<pre class="example highlight" title="data vault query for a particular attribute name">
<pre class="example highlight" title="data vault query for a particular attribute key">
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The language above uses "attribute value" and "attribute name" -- wondering what we're going for wrt. consistency?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops - I had intended to use the term "key" instead of "name" everywhere. Looks like I missed a few spots. I've updated it now.

index.html Outdated
@@ -2947,10 +2945,10 @@ <h3>

<p>
If <code>returnFullDocuments</code> was set to true, a successful query will result in a standard HTTP 200 response with a list
of <a>EncryptedDocuments</a> that contain the value:
of <a>EncryptedDocuments</a> that contain the attribute key:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

attribute key or attribute name or attribute value? :)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed it to say "...that match the query" to be more consistent with the previous examples - so you just have to check the example to see what type of query it is (attribute key in this case)

index.html Outdated
@@ -3001,6 +2999,38 @@ <h3>

</section>

<section class="normative">
<h3>
Add Index
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Add Index
Adding an index

W3C style guides are moving to "sentence capitalization for titles" based on some longitudinal study they did on titles. Don't know if this title fits, but we might think about how folks would read this entry in a table of contents. The section seems to be about "adding an index"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

index.html Outdated
</h3>

<p>
Encrypted Documents can have embedded key-value attribute pairs, as seen in <a href="#blinded-document-attributes"></a>.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Encrypted Documents can have embedded key-value attribute pairs, as seen in <a href="#blinded-document-attributes"></a>.
Encrypted Documents can have embedded attribute key-value pairs, as seen in <a href="#blinded-document-attributes"></a>.

... or maybe "attribute name-value pairs"?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

index.html Outdated

{
"operation": "add",
"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]
"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]

is this an attribute key or an attribute name-value pair?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's a set of attribute keys.

@DRK3 DRK3 force-pushed the AddIndexEndpointAndLanguageImprovements branch from 498bad2 to 8c67acf Compare April 26, 2022 19:19
@DRK3
Copy link
Author

DRK3 commented Apr 26, 2022

Thanks so much for reviewing @msporny! I made some changes/fixes based on your comments. I was meaning to use "key-value" everywhere (replacing any previous instances of name-value). I was thinking that "key-value" is probably the most common term for this sort of thing, but I'm not opposed to "name-value". Let me know if you have any other questions/comments!

@msporny
Copy link
Contributor

msporny commented Apr 26, 2022

I was thinking that "key-value" is probably the most common term for this sort of thing, but I'm not opposed to "name-value".

Google says "key-value" has 8.8B results and "name-value" has 9.7B results... close enough to just follow your heart on this one, @DRK3. :P -- I'm fine w/ either.

@DRK3
Copy link
Author

DRK3 commented Apr 27, 2022

@msporny After reading your comment, it got me thinking - I did some searches and found a Wikipedia article for "Key-value database", but I also found an article for "Name-value pair" which uses the term "attribute" a lot... so now my heart is torn, haha... I'm thinking maybe "name-value" might be better after all? I'm not sure. Maybe it's better to avoid the term "key" to prevent confusion with cryptographic keys? hmm...

Side note - totally inconsequential at this point, but here's something (somewhat) interesting - I tried doing the same Google searches you did, but I enclosed the terms with quotes to make Google do exact matches only, and I get 33.7M results for "key-value" and 11.4M results for "name-value"... I think because without the quotes, it's doing partial matches as well...

Anyway, I think I'm starting to fall into the "name-value" camp! Anyone else that can help us with our bikeshedding? :P

@DRK3
Copy link
Author

DRK3 commented Apr 27, 2022

@msporny Even found someone asking the same question from 2001. They think "key-value" has more of a database lookup connotation to it. Which our usage here kind of sort of does...? But our usage also isn't exactly a hash table necessarily? Bleargh, names are hard.

@msporny
Copy link
Contributor

msporny commented Apr 27, 2022

Maybe it's better to avoid the term "key" to prevent confusion with cryptographic keys? hmm...

Yes, that would be my gut feeling as well.

Side note - totally inconsequential at this point, but here's something (somewhat) interesting - I tried doing the same Google searches you did, but I enclosed the terms with quotes to make Google do exact matches only, and I get 33.7M results for "key-value" and 11.4M results for "name-value"... I think because without the quotes, it's doing partial matches as well...

Oh, weird... yes, I tried quotes and got totally different results:

image

image

image

They think "key-value" has more of a database lookup connotation to it. Which our usage here kind of sort of does...? But our usage also isn't exactly a hash table necessarily? Bleargh, names are hard.

😆 Welcome to spec-writers hell, @DRK3!

One of @dlongley's favorite jokes is: There are only two hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. (originally quipped by Martin Fowler)

@DRK3 DRK3 force-pushed the AddIndexEndpointAndLanguageImprovements branch from 8c67acf to 9c9d6a3 Compare April 27, 2022 16:55
@DRK3
Copy link
Author

DRK3 commented Apr 27, 2022

@msporny Since it sounds like "name-value" is the better name at this point, I've updated my PR to use "name" instead of "key". I also realized that the actual Encrypted Document JSON uses the term "name" and "value" for each attribute... I somehow missed that in all this. But now they're aligned again with the spec text, which is good!

@dlongley
Copy link
Contributor

I'm +1 to the language changes in this PR, but -1 to the new "add index" endpoint, at least at this time.

I think adding a new endpoint for this significantly changes the design and responsibilities. That doesn't mean it's a direction we couldn't head, but right now, the EDV architecture and client / server responsibilities are different from traditional databases. A lot of that has to do with the server being blind to data relationships.

Therefore, every attribute is always indexed -- and that's a guarantee that any client has when working with any server. Well, at the very least, queries can always be performed on any attribute. Additionally, if a client wants a compound index (e.g., to enforce uniqueness within groups), they create a new attribute for it -- such that it looks like any other attribute to the server. They don't instruct the server to combine existing attributes in new indexes to support their queries, potentially revealing information to the server in the process.

If a server implementation doesn't want to build indexes to make queries on attributes fast, that's fine, I guess, but I don't think we should add more complexity such that clients must tell servers to build indexes for the attributes they would like queries to be fast for.

Why do we need this feature?

Copy link
Contributor

@msporny msporny left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

-1 to the new "add index" endpoint

I completely missed the addition of the new feature, which is obvious in hindsight. I'm a -1 as well on that change in this particular PR. I thought all this PR was attempting to do was improve the narrative, not add indexing to the server.

That said, it's not clear to me what "add index" is meant to accomplish (making searching faster, yes, but our approach has always been to build these indexes client-side if necessary, IIRC).

@DRK3
Copy link
Author

DRK3 commented Apr 27, 2022

@dlongley That's a good question! Here's why this endpoint is necessary: in theory, having the EDV server determine when/what to build indexes on sounds nice, but in practice this is problematic.

Our EDV server implementation works with another database provider to do the actual storage, for instance MongoDB, CouchDB, Postgres, etc. The way these databases work (and I think how most work) is that, in general, they expect an explicit command to tell it what to create indexes on and when. So the EDV server must tell these databases what to create indexes on.

If we want to just assume all attributes are indexed (which you probably should not, as building indexes by itself carries a cost), then there are some challenges here. How does the EDV server tell the underlying database to create indexes? When does it do this?

Here are the possibilities I can think of:

  1. When the EDV server receives a new document, it could then make an additional call to the underlying database to create indexes. But then the issue here is that the EDV server would constantly need to make these extra calls (potentially large overhead), unless it knows already that an index exists. How would it know this? A local cache? What about other EDV instances? How do these caches sync up? What if all documents containing a certain attribute are removed, rendering the index useless? How do we detect that?
  2. Periodically poll the database and somehow determine when to create indexes. This is fairly complicated. Who does this? When? How? Periodically iterating through all documents in a vault?

So basically the issue here is that when you go and try to implement an EDV, you end up with this dilemma. It becomes complicated. The underlying databases expect index creation to be explicit calls.

From what I can see, the proper solution is to make an EDV follow the pattern of other databases whenever possible (in this case, explicit index management). This would be essentially just a pass-through type of operation. I can't think of a realistic privacy issue here. You're basically telling the server that there's some attribute that you may or may not want to query a lot in the future. Remember that a server will be able to tell if some attribute is being queried a lot anyway, so this type of information can't really be hidden.

This whole perspective is from my point of view as an implementer, and the actual issues that I just recently ran into (and I assume others will run into eventually when trying to make an EDV work in the real world).

@msporny Having the client build and manage indexes limits the usefulness of EDVs. Here's the example from our Trustbloc platform: we have a web wallet that uses an EDV for storing credentials. The web wallet runs in a browser and does not have guaranteed persistent storage. Where would the index be stored, if not by the EDV? When logging in to another browser, the index would be lost. It makes sense to me that the EDV (the persistent storage) deals with indexes.

Looking forward to hearing your thoughts! (thanks for reviewing my PR as well BTW!)

@dlongley
Copy link
Contributor

dlongley commented May 1, 2022

@DRK3,

If we want to just assume all attributes are indexed (which you probably should not, as building indexes by itself carries a cost), then there are some challenges here. How does the EDV server tell the underlying database to create indexes? When does it do this?

Here are the possibilities I can think of:

  1. When the EDV server receives a new document, it could then make an additional call to the underlying database to create indexes. But then the issue here is that the EDV server would constantly need to make these extra calls (potentially large overhead), unless it knows already that an index exists. How would it know this? A local cache? What about other EDV instances? How do these caches sync up? What if all documents containing a certain attribute are removed, rendering the index useless? How do we detect that?
  2. Periodically poll the database and somehow determine when to create indexes. This is fairly complicated. Who does this? When? How? Periodically iterating through all documents in a vault?

So basically the issue here is that when you go and try to implement an EDV, you end up with this dilemma. It becomes complicated. The underlying databases expect index creation to be explicit calls.

We have implemented EDV storage using both MongoDB and PouchDB (which is CouchDB-based). We did not need to do either option above (1 nor 2).

With MongoDB you can simply create one compound index on the fields in an EDV doc's indexed property and to enforce the uniqueness constraint, another index on some meta field that holds an appropriate hash of any unique attributes (that is computed per doc). You only need to do this index creation once / ensure it exists on server application start up; there is no polling required.

With PouchDB/CouchDB, you can create computed fields of attribute names and attribute names + values (plus a uniqueness flag if indicated in the document) and index these. Again, no polling solutions are necessary.

The same approach should be replicable with just about any other database backend.

EDIT: It's worth noting that the attributes that a client wants indexed ... are the very attributes that are expressed in the document. So there's no need to communicate that information in another way via some other channel. If the document includes encrypted attributes, it's because those should be indexed for queries.

@DRK3
Copy link
Author

DRK3 commented May 2, 2022

@dlongley I spent some time today looking into it today and... You're absolutely right.

While I knew that MongoDB could do this type of indexing using a multikey index, but what was causing me an issue was other DBs like CouchDB and PostgreSQL. CouchDB, for instance, allows you to query on those attributes, but can't directly index them - you have to do a workaround like what you were suggesting where you sort of "collapse" all the attributes into a flat array (and should work for any DB that can do an index on an array). The workaround I came up with was to put those attributes in a map in the JSON as direct name-value pairs and then I could create indexes directly using those attribute names, but I also realized now that that has the side effect of not allowing multiple name-value pairs that share the same name (due to the map structure), so your workaround is better and doesn't require a spec change.

I'll remove that Add Index endpoint from my PR shortly.

Oh, one more thing - one of the reasons I was doing the "Add Index" type of endpoint initially was because the spec actually already briefly mentions a "Create Index" type of operation for a vault in section 3.4.1, so I thought index creation was perhaps intended to be an explicit thing, but maybe it wasn't in the spec yet... Based on our discussion here, I suppose that should be removed, right? Or is that referring to something else? @msporny

The spec has the concept of encrypted indexing which essentially is the use of blinded attributes/properties/tags in Encrypted Documents stored within a vault. Some of the language used in the spec on this subject is somewhat confusing. For instance, the term "index" is used in some places to refer to what are really just key-value attribute pairs. In database/storage terminology, in general an "index" refers to the underlying mechanism by which a database speeds up data queries. It is usually optional - queries can still be done without them.

I've updated some of the terminology to be more consistent and hopefully easier to follow by using the term "attribute" when referring to the key-value pairs that may exist in Encrypted Documents and the term "index" to refer to the mechanism by which querying can be sped up.
@DRK3 DRK3 force-pushed the AddIndexEndpointAndLanguageImprovements branch from 9c9d6a3 to 5f3cfb7 Compare May 2, 2022 21:46
@DRK3 DRK3 changed the title Indexing language improvements and "Add Index" endpoint Indexing language improvements May 2, 2022
@DRK3
Copy link
Author

DRK3 commented May 2, 2022

Just updated the PR to remove the add index endpoint. I also updated the PR title and commit message.

@DRK3 DRK3 requested a review from msporny May 2, 2022 21:47
@DRK3
Copy link
Author

DRK3 commented Jun 8, 2022

@dlongley @msporny Can you take another look when you have a chance?

Copy link
Contributor

@dlongley dlongley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@DRK3
Copy link
Author

DRK3 commented Jun 8, 2022

One other thing perhaps you can take a look at @dlongley:

Oh, one more thing - one of the reasons I was doing the "Add Index" type of endpoint initially was because the spec actually already briefly mentions a "Create Index" type of operation for a vault in section 3.4.1, so I thought index creation was perhaps intended to be an explicit thing, but maybe it wasn't in the spec yet... Based on our discussion here, I suppose that should be removed, right? Or is that referring to something else? @msporny

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants