Indexing language improvements #81

DRK3 · 2022-04-26T00:39:46Z

The spec has the concept of encrypted indexing which essentially is the use of blinded attributes/properties/tags in Encrypted Documents stored within a vault. Some of the language used in the spec on this subject is somewhat confusing. For instance, the term "index" is used in some places to refer to what are really just key-value attribute pairs. In database/storage terminology, in general an "index" refers to the underlying mechanism by which a database speeds up data queries. It is usually optional - queries can still be done without them.

I've updated some of the terminology to be more consistent and hopefully easier to follow by using the term "attribute" when referring to the key-value pairs that may exist in Encrypted Documents and the term "index" to refer to the mechanism by which querying can be sped up.

msporny

Overall supportive of the added narrative. There is some usage of attribute name/key/value that didn't seem to be consistent throughout the added text. We just need to pick something that's consistent.

msporny · 2022-04-26T13:10:30Z

index.html

  <a>storage agents</a> to search the information without leaking information
-  that would create privacy concerns.
+  that would create privacy concerns. See <a href="#creating-encrypted-indexes"></a>


Suggested change

that would create privacy concerns. See <a href="#creating-encrypted-indexes"></a>

that would create privacy concerns. See <a href="#blinded-document-attributes"></a>

msporny · 2022-04-26T13:11:41Z

index.html

@@ -2933,10 +2931,10 @@ <h3>

          <p>
  If <code>returnFullDocuments</code> was set to false, a successful query will result in a standard HTTP 200 response with a list
-  of <a>EncryptedDocument</a> identifiers that contain the value:
+  of <a>EncryptedDocument</a> identifiers that contain the attribute key:


Shouldn't this be attribute value?

These are the "has" query examples that only use an attribute key - but regardless, I changed it to say "...that match the query" to be more consistent with the previous examples - so you just have to check the example above it to see the query type

msporny · 2022-04-26T13:12:34Z

index.html

          </p>

-          <pre class="example highlight" title="data vault query for a particular attribute name">
+          <pre class="example highlight" title="data vault query for a particular attribute key">


The language above uses "attribute value" and "attribute name" -- wondering what we're going for wrt. consistency?

Oops - I had intended to use the term "key" instead of "name" everywhere. Looks like I missed a few spots. I've updated it now.

msporny · 2022-04-26T13:12:50Z

index.html

@@ -2947,10 +2945,10 @@ <h3>

          <p>
  If <code>returnFullDocuments</code> was set to true, a successful query will result in a standard HTTP 200 response with a list
-  of <a>EncryptedDocuments</a> that contain the value:
+  of <a>EncryptedDocuments</a> that contain the attribute key:


attribute key or attribute name or attribute value? :)

I changed it to say "...that match the query" to be more consistent with the previous examples - so you just have to check the example to see what type of query it is (attribute key in this case)

msporny · 2022-04-26T13:14:28Z

index.html

@@ -3001,6 +2999,38 @@ <h3>

        </section>

+        <section class="normative">
+            <h3>
+                Add Index


Suggested change

Add Index

Adding an index

W3C style guides are moving to "sentence capitalization for titles" based on some longitudinal study they did on titles. Don't know if this title fits, but we might think about how folks would read this entry in a table of contents. The section seems to be about "adding an index"?

msporny · 2022-04-26T13:15:24Z

index.html

+            </h3>
+
+            <p>
+  Encrypted Documents can have embedded key-value attribute pairs, as seen in <a href="#blinded-document-attributes"></a>.


Suggested change

Encrypted Documents can have embedded key-value attribute pairs, as seen in <a href="#blinded-document-attributes"></a>.

Encrypted Documents can have embedded attribute key-value pairs, as seen in <a href="#blinded-document-attributes"></a>.

... or maybe "attribute name-value pairs"?

msporny · 2022-04-26T13:16:37Z

index.html

+
+  {
+    "operation": "add",
+    "attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]


Suggested change

"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]

"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]

is this an attribute key or an attribute name-value pair?

It's a set of attribute keys.

DRK3 · 2022-04-26T19:25:27Z

Thanks so much for reviewing @msporny! I made some changes/fixes based on your comments. I was meaning to use "key-value" everywhere (replacing any previous instances of name-value). I was thinking that "key-value" is probably the most common term for this sort of thing, but I'm not opposed to "name-value". Let me know if you have any other questions/comments!

msporny · 2022-04-26T20:27:38Z

I was thinking that "key-value" is probably the most common term for this sort of thing, but I'm not opposed to "name-value".

Google says "key-value" has 8.8B results and "name-value" has 9.7B results... close enough to just follow your heart on this one, @DRK3. :P -- I'm fine w/ either.

DRK3 · 2022-04-27T16:19:54Z

@msporny After reading your comment, it got me thinking - I did some searches and found a Wikipedia article for "Key-value database", but I also found an article for "Name-value pair" which uses the term "attribute" a lot... so now my heart is torn, haha... I'm thinking maybe "name-value" might be better after all? I'm not sure. Maybe it's better to avoid the term "key" to prevent confusion with cryptographic keys? hmm...

Side note - totally inconsequential at this point, but here's something (somewhat) interesting - I tried doing the same Google searches you did, but I enclosed the terms with quotes to make Google do exact matches only, and I get 33.7M results for "key-value" and 11.4M results for "name-value"... I think because without the quotes, it's doing partial matches as well...

Anyway, I think I'm starting to fall into the "name-value" camp! Anyone else that can help us with our bikeshedding? :P

DRK3 · 2022-04-27T16:22:43Z

@msporny Even found someone asking the same question from 2001. They think "key-value" has more of a database lookup connotation to it. Which our usage here kind of sort of does...? But our usage also isn't exactly a hash table necessarily? Bleargh, names are hard.

msporny · 2022-04-27T16:29:29Z

Maybe it's better to avoid the term "key" to prevent confusion with cryptographic keys? hmm...

Yes, that would be my gut feeling as well.

Side note - totally inconsequential at this point, but here's something (somewhat) interesting - I tried doing the same Google searches you did, but I enclosed the terms with quotes to make Google do exact matches only, and I get 33.7M results for "key-value" and 11.4M results for "name-value"... I think because without the quotes, it's doing partial matches as well...

Oh, weird... yes, I tried quotes and got totally different results:

They think "key-value" has more of a database lookup connotation to it. Which our usage here kind of sort of does...? But our usage also isn't exactly a hash table necessarily? Bleargh, names are hard.

😆 Welcome to spec-writers hell, @DRK3!

One of @dlongley's favorite jokes is: There are only two hard problems in computer science: cache invalidation, naming things, and off-by-1 errors. (originally quipped by Martin Fowler)

DRK3 · 2022-04-27T16:56:37Z

@msporny Since it sounds like "name-value" is the better name at this point, I've updated my PR to use "name" instead of "key". I also realized that the actual Encrypted Document JSON uses the term "name" and "value" for each attribute... I somehow missed that in all this. But now they're aligned again with the spec text, which is good!

dlongley · 2022-04-27T20:55:19Z

I'm +1 to the language changes in this PR, but -1 to the new "add index" endpoint, at least at this time.

I think adding a new endpoint for this significantly changes the design and responsibilities. That doesn't mean it's a direction we couldn't head, but right now, the EDV architecture and client / server responsibilities are different from traditional databases. A lot of that has to do with the server being blind to data relationships.

Therefore, every attribute is always indexed -- and that's a guarantee that any client has when working with any server. Well, at the very least, queries can always be performed on any attribute. Additionally, if a client wants a compound index (e.g., to enforce uniqueness within groups), they create a new attribute for it -- such that it looks like any other attribute to the server. They don't instruct the server to combine existing attributes in new indexes to support their queries, potentially revealing information to the server in the process.

If a server implementation doesn't want to build indexes to make queries on attributes fast, that's fine, I guess, but I don't think we should add more complexity such that clients must tell servers to build indexes for the attributes they would like queries to be fast for.

Why do we need this feature?

msporny

-1 to the new "add index" endpoint

I completely missed the addition of the new feature, which is obvious in hindsight. I'm a -1 as well on that change in this particular PR. I thought all this PR was attempting to do was improve the narrative, not add indexing to the server.

That said, it's not clear to me what "add index" is meant to accomplish (making searching faster, yes, but our approach has always been to build these indexes client-side if necessary, IIRC).

DRK3 · 2022-04-27T23:35:07Z

@dlongley That's a good question! Here's why this endpoint is necessary: in theory, having the EDV server determine when/what to build indexes on sounds nice, but in practice this is problematic.

Our EDV server implementation works with another database provider to do the actual storage, for instance MongoDB, CouchDB, Postgres, etc. The way these databases work (and I think how most work) is that, in general, they expect an explicit command to tell it what to create indexes on and when. So the EDV server must tell these databases what to create indexes on.

If we want to just assume all attributes are indexed (which you probably should not, as building indexes by itself carries a cost), then there are some challenges here. How does the EDV server tell the underlying database to create indexes? When does it do this?

Here are the possibilities I can think of:

When the EDV server receives a new document, it could then make an additional call to the underlying database to create indexes. But then the issue here is that the EDV server would constantly need to make these extra calls (potentially large overhead), unless it knows already that an index exists. How would it know this? A local cache? What about other EDV instances? How do these caches sync up? What if all documents containing a certain attribute are removed, rendering the index useless? How do we detect that?
Periodically poll the database and somehow determine when to create indexes. This is fairly complicated. Who does this? When? How? Periodically iterating through all documents in a vault?

So basically the issue here is that when you go and try to implement an EDV, you end up with this dilemma. It becomes complicated. The underlying databases expect index creation to be explicit calls.

From what I can see, the proper solution is to make an EDV follow the pattern of other databases whenever possible (in this case, explicit index management). This would be essentially just a pass-through type of operation. I can't think of a realistic privacy issue here. You're basically telling the server that there's some attribute that you may or may not want to query a lot in the future. Remember that a server will be able to tell if some attribute is being queried a lot anyway, so this type of information can't really be hidden.

This whole perspective is from my point of view as an implementer, and the actual issues that I just recently ran into (and I assume others will run into eventually when trying to make an EDV work in the real world).

@msporny Having the client build and manage indexes limits the usefulness of EDVs. Here's the example from our Trustbloc platform: we have a web wallet that uses an EDV for storing credentials. The web wallet runs in a browser and does not have guaranteed persistent storage. Where would the index be stored, if not by the EDV? When logging in to another browser, the index would be lost. It makes sense to me that the EDV (the persistent storage) deals with indexes.

Looking forward to hearing your thoughts! (thanks for reviewing my PR as well BTW!)

dlongley · 2022-05-01T00:31:13Z

@DRK3,

If we want to just assume all attributes are indexed (which you probably should not, as building indexes by itself carries a cost), then there are some challenges here. How does the EDV server tell the underlying database to create indexes? When does it do this?

Here are the possibilities I can think of:

When the EDV server receives a new document, it could then make an additional call to the underlying database to create indexes. But then the issue here is that the EDV server would constantly need to make these extra calls (potentially large overhead), unless it knows already that an index exists. How would it know this? A local cache? What about other EDV instances? How do these caches sync up? What if all documents containing a certain attribute are removed, rendering the index useless? How do we detect that?

Periodically poll the database and somehow determine when to create indexes. This is fairly complicated. Who does this? When? How? Periodically iterating through all documents in a vault?

So basically the issue here is that when you go and try to implement an EDV, you end up with this dilemma. It becomes complicated. The underlying databases expect index creation to be explicit calls.

We have implemented EDV storage using both MongoDB and PouchDB (which is CouchDB-based). We did not need to do either option above (1 nor 2).

With MongoDB you can simply create one compound index on the fields in an EDV doc's indexed property and to enforce the uniqueness constraint, another index on some meta field that holds an appropriate hash of any unique attributes (that is computed per doc). You only need to do this index creation once / ensure it exists on server application start up; there is no polling required.

With PouchDB/CouchDB, you can create computed fields of attribute names and attribute names + values (plus a uniqueness flag if indicated in the document) and index these. Again, no polling solutions are necessary.

The same approach should be replicable with just about any other database backend.

EDIT: It's worth noting that the attributes that a client wants indexed ... are the very attributes that are expressed in the document. So there's no need to communicate that information in another way via some other channel. If the document includes encrypted attributes, it's because those should be indexed for queries.

DRK3 · 2022-05-02T21:23:17Z

@dlongley I spent some time today looking into it today and... You're absolutely right.

While I knew that MongoDB could do this type of indexing using a multikey index, but what was causing me an issue was other DBs like CouchDB and PostgreSQL. CouchDB, for instance, allows you to query on those attributes, but can't directly index them - you have to do a workaround like what you were suggesting where you sort of "collapse" all the attributes into a flat array (and should work for any DB that can do an index on an array). The workaround I came up with was to put those attributes in a map in the JSON as direct name-value pairs and then I could create indexes directly using those attribute names, but I also realized now that that has the side effect of not allowing multiple name-value pairs that share the same name (due to the map structure), so your workaround is better and doesn't require a spec change.

I'll remove that Add Index endpoint from my PR shortly.

Oh, one more thing - one of the reasons I was doing the "Add Index" type of endpoint initially was because the spec actually already briefly mentions a "Create Index" type of operation for a vault in section 3.4.1, so I thought index creation was perhaps intended to be an explicit thing, but maybe it wasn't in the spec yet... Based on our discussion here, I suppose that should be removed, right? Or is that referring to something else? @msporny

The spec has the concept of encrypted indexing which essentially is the use of blinded attributes/properties/tags in Encrypted Documents stored within a vault. Some of the language used in the spec on this subject is somewhat confusing. For instance, the term "index" is used in some places to refer to what are really just key-value attribute pairs. In database/storage terminology, in general an "index" refers to the underlying mechanism by which a database speeds up data queries. It is usually optional - queries can still be done without them. I've updated some of the terminology to be more consistent and hopefully easier to follow by using the term "attribute" when referring to the key-value pairs that may exist in Encrypted Documents and the term "index" to refer to the mechanism by which querying can be sped up.

DRK3 · 2022-05-02T21:47:30Z

Just updated the PR to remove the add index endpoint. I also updated the PR title and commit message.

DRK3 · 2022-06-08T18:10:46Z

@dlongley @msporny Can you take another look when you have a chance?

dlongley

Thanks!

DRK3 · 2022-06-08T18:23:17Z

One other thing perhaps you can take a look at @dlongley:

Oh, one more thing - one of the reasons I was doing the "Add Index" type of endpoint initially was because the spec actually already briefly mentions a "Create Index" type of operation for a vault in section 3.4.1, so I thought index creation was perhaps intended to be an explicit thing, but maybe it wasn't in the spec yet... Based on our discussion here, I suppose that should be removed, right? Or is that referring to something else? @msporny

DRK3 requested review from OR13, csuwildcat and msporny as code owners April 26, 2022 00:39

msporny requested changes Apr 26, 2022

View reviewed changes

DRK3 force-pushed the AddIndexEndpointAndLanguageImprovements branch from 498bad2 to 8c67acf Compare April 26, 2022 19:19

msporny approved these changes Apr 26, 2022

View reviewed changes

DRK3 force-pushed the AddIndexEndpointAndLanguageImprovements branch from 8c67acf to 9c9d6a3 Compare April 27, 2022 16:55

msporny requested changes Apr 27, 2022

View reviewed changes

DRK3 force-pushed the AddIndexEndpointAndLanguageImprovements branch from 9c9d6a3 to 5f3cfb7 Compare May 2, 2022 21:46

DRK3 changed the title ~~Indexing language improvements and "Add Index" endpoint~~ Indexing language improvements May 2, 2022

DRK3 requested a review from msporny May 2, 2022 21:47

dlongley approved these changes Jun 8, 2022

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Indexing language improvements #81

Indexing language improvements #81

DRK3 commented Apr 26, 2022 •

edited

Loading

msporny left a comment

msporny Apr 26, 2022

DRK3 Apr 26, 2022

msporny Apr 26, 2022

DRK3 Apr 26, 2022

msporny Apr 26, 2022

DRK3 Apr 26, 2022

msporny Apr 26, 2022

DRK3 Apr 26, 2022

msporny Apr 26, 2022

DRK3 Apr 26, 2022

msporny Apr 26, 2022

DRK3 Apr 26, 2022

msporny Apr 26, 2022

DRK3 Apr 26, 2022

DRK3 commented Apr 26, 2022

msporny commented Apr 26, 2022

DRK3 commented Apr 27, 2022

DRK3 commented Apr 27, 2022 •

edited

Loading

msporny commented Apr 27, 2022 •

edited

Loading

DRK3 commented Apr 27, 2022 •

edited

Loading

dlongley commented Apr 27, 2022

msporny left a comment

DRK3 commented Apr 27, 2022 •

edited

Loading

dlongley commented May 1, 2022 •

edited

Loading

DRK3 commented May 2, 2022

DRK3 commented May 2, 2022

DRK3 commented Jun 8, 2022

dlongley left a comment

DRK3 commented Jun 8, 2022

	that would create privacy concerns. See <a href="#creating-encrypted-indexes"></a>
	that would create privacy concerns. See <a href="#blinded-document-attributes"></a>

	Encrypted Documents can have embedded key-value attribute pairs, as seen in <a href="#blinded-document-attributes"></a>.
	Encrypted Documents can have embedded attribute key-value pairs, as seen in <a href="#blinded-document-attributes"></a>.

	"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]
	"attributeKeys":["DUQaxPtSLtd8L3WBAIkJ4DiVJeqoF6bdnhR7lSaPloZ", "AarngVIZLl0kIp2xEHUH5o5uVc-470roQaOIbqMUD7DFQQypWQ=="]

Indexing language improvements #81

Are you sure you want to change the base?

Indexing language improvements #81

Conversation

DRK3 commented Apr 26, 2022 • edited Loading

msporny left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

DRK3 commented Apr 26, 2022

msporny commented Apr 26, 2022

DRK3 commented Apr 27, 2022

DRK3 commented Apr 27, 2022 • edited Loading

msporny commented Apr 27, 2022 • edited Loading

DRK3 commented Apr 27, 2022 • edited Loading

dlongley commented Apr 27, 2022

msporny left a comment

Choose a reason for hiding this comment

DRK3 commented Apr 27, 2022 • edited Loading

dlongley commented May 1, 2022 • edited Loading

DRK3 commented May 2, 2022

DRK3 commented May 2, 2022

DRK3 commented Jun 8, 2022

dlongley left a comment

Choose a reason for hiding this comment

DRK3 commented Jun 8, 2022

DRK3 commented Apr 26, 2022 •

edited

Loading

DRK3 commented Apr 27, 2022 •

edited

Loading

msporny commented Apr 27, 2022 •

edited

Loading

DRK3 commented Apr 27, 2022 •

edited

Loading

DRK3 commented Apr 27, 2022 •

edited

Loading

dlongley commented May 1, 2022 •

edited

Loading