Data modeling in XTDB #1631

franz65 · 2021-09-24T22:39:02Z


How to model data in XTDB?
This time I'm back with a philosophical question about data modeling that could be interesting for many users.
What is the best way to map a graph model onto XTDB documents?

Example of a social network data model.

Vertices
  Persons
    with personal data such as name, surname, nickname, etc.
  Groups
    contain a subgroup of the whole network (like FB groups)
  ...

Edges
  with attributes
  Relationships
    Friends (bidirectional)
    Follower (directional)
    Group membership (directional)

In this example we have persons who can be members of groups and can have relationships with each other, such as friendship or following.

Person A is friend of Person B and Person C.
Person C follows Person B, is friend of Person A and is member of Group 1.
Person B is also member of Group 1, is friend of Person A and is followed by Person C.

Of course, while the graph is small there is no problem representing the data, but as the complexity increases I think it's necessary to find the best way to do it.
I see three different ways:

Insert data in every Person document

Person documents containing relationships

{:xt/id uuid
 :person/name "Person A"
 :person/friends [{:friend/id (uuid Person B)
                   :friend/date #inst "2021-08-21"}
                  {:friend/id (uuid Person C)
                   :friend/date #inst "2020-01-01"}]}

{:xt/id uuid
 :person/name "Person C"
 :person/friends [{:friend/id (uuid Person A)
                   :friend/date #inst "2020-01-01"}]
 :person/follows [{:following/id (uuid Person B)
                   :following/data ....}]
 :person/group [{:group/id (uuid Group 1)
                 :membership/date ...}]}

{:xt/id uuid
 :person/name "Person B"
 :person/friends [{:friend/id (uuid Person A)
                   :friend/date #inst "2021-08-21"}]
 :person/followers [{:follower/id (uuid Person C)
                     .....}]}

This way is overcomplicated. Every time there is a new friendship I have to update two documents, and as the network grows to, for example, 5,000 friends (the FB limit), I end up with Person documents containing 5,000 nested documents representing friendships.

The second way is a little different: create a document for every relationship.

Relationship node

{:xt/id uuid
 :friend/1 (uuid Person A)
 :friend/2 (uuid Person B)
 :friendship/date #inst "2021-08-21"}

At this point I can decide to insert the ids of the friendships inside the Person documents.

{:xt/id uuid
 :person/name "Person A"
 :person/friends [(uuid friendship 1) (uuid friendship 2) ...]}

{:xt/id uuid
 :person/name "Person B"
 :person/friends [(uuid friendship 1) (uuid friendship 3) ...]}

This way I have a vector of edge ids. The problem of writing a new version of both Person documents for every new friendship remains.
I could decide not to insert the friendship node ids in Person, but I think this could become problematic if I want to find all friends of, for example, Person A. In a large network with many Persons and relationships, I might have to search for the relevant friendships among millions of friendship nodes.

The third solution adds another table that contains only the list of friendship nodes belonging to a Person:

{:xt/id uuid
 :friendship/person (uuid Person A)
 :person/friendship [(uuid friendship 1) (uuid friendship 2) ...]}

{:xt/id uuid
 :friend/1 (uuid Person A)
 :friend/2 (uuid Person B)
 :friendship/date #inst "2021-08-21"}

{:xt/id uuid
 :person/name "Person A"
 :person/friendship (uuid friendship)}

This way I have a document representing a Person, linked to a document containing the list of friendships, which in turn links to every friendship node (representing an edge, if this were a graph). When adding a new friendship, I have to create a friendship node and update the two friendship lists owned by the two friends, but I don't have to update the two Person documents.

Is one of these models the best way to represent this kind of data in XTDB, or is there a different way?

Thanks a lot
Franz


deobald · 2021-09-26T16:25:49Z


Hi @franz65

We have an article which discusses the philosophy of "records" at a very high level. We have 3 or 4 more of these articles pending but I haven't found the days required to sit down and write them yet. 😅

https://xtdb.com/articles/strength-of-the-record.html

The most hand-wavy answer to your question would be "don't nest too deeply" or, if flipped around, "try to keep your documents reasonably flat." These are rules of thumb, but XTDB 1.x strongly supports this suggestion by not indexing beyond the root keys for any given document. Deeply-nested documents are permitted, but not recommended.

The "Strength of the Record" article references a post written by Sarah Mei, assertively titled "Why You Should Never Use MongoDB": http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/

Sarah's article addresses your question even more directly. If you start nesting "friendships" within the documents themselves, you have easy access to the data, but no way to traverse the graph. With that in mind, the first solution seems sub-optimal.

The second solution is probably the route you want to take, though it comes in different flavours. You do not need to model edges explicitly in xtdb and, initially, it might save you some grief not to do so. Instead, you can track a vector of friend ids on any given Person directly. Assuming you are querying with EDN Datalog, traversing these relationships is then quite easy (and direct).
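As an illustration of the "vector of friend ids" flavour, here is a minimal sketch, assuming XTDB 1.x with `xtdb.api` on the classpath and an in-memory node. The keyword ids (`:person/a` and so on) and the `:person/friends` attribute are hypothetical names chosen for the example. It relies on XTDB 1.x indexing each element of a top-level vector individually, so the vector can be joined on directly.

```clojure
(require '[xtdb.api :as xt])

;; In-memory node, suitable for experimentation only.
(def node (xt/start-node {}))

;; Hypothetical ids and attribute names; friend ids live in a top-level vector.
(xt/await-tx
 node
 (xt/submit-tx
  node
  [[::xt/put {:xt/id :person/a
              :person/name "Person A"
              :person/friends [:person/b :person/c]}]
   [::xt/put {:xt/id :person/b
              :person/name "Person B"
              :person/friends [:person/a]}]
   [::xt/put {:xt/id :person/c
              :person/name "Person C"
              :person/friends [:person/a]}]]))

;; Two-hop traversal: names of the friends of Person B's friends,
;; excluding Person B.
(def friends-of-friends
  (xt/q (xt/db node)
        '{:find [?name]
          :in [?person]
          :where [[?person :person/friends ?friend]
                  [?friend :person/friends ?fof]
                  [(not= ?fof ?person)]
                  [?fof :person/name ?name]]}
        :person/b))
;; => #{["Person C"]}
```

Each hop in the graph is one more triple clause in `:where`, which is what makes this flavour convenient while the model is still simple.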

Modeling relationships (such as friendships) as documents themselves is only necessary if you want to add (meta-)data to the edges themselves. "Friendship date" seems a bit contrived, but you might want to add weights to edges to alter the "shape" of the graph. If you are confident you want this extra data on edges, it makes sense to record them as documents, as in your second example.

The third solution feels like excessive denormalization. Unless you really want to avoid updating Person records when friendships change (and I would question, "why?") adding another layer of documents might just make all your queries unnecessarily awkward. In the past, we've recommended this kind of pseudo-denormalization for examples like "Likes" or "Clicks." If a user creates a Post and the Post is represented as its own record, you don't want to update the Post 100,000 times if it receives 100,000 Likes. In that case, a separate document to track Likes makes sense. But friendships/relationships are unlikely to accumulate at that sort of order-of-magnitude and — especially initially — it may be much easier to record them as fields on your Person records.
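The "Likes" case can be sketched as follows, assuming XTDB 1.x and an in-memory node. The ids and the `:like/post` and `:like/user` attributes are hypothetical. Each Like is its own small document pointing at the Post, so the Post itself is written only once.

```clojure
(require '[xtdb.api :as xt])

(def node (xt/start-node {}))

;; One Post document, plus one document per Like (hypothetical attribute names).
(xt/await-tx
 node
 (xt/submit-tx
  node
  [[::xt/put {:xt/id :post/p1 :post/title "Hello"}]
   [::xt/put {:xt/id :like/l1 :like/post :post/p1 :like/user :person/a}]
   [::xt/put {:xt/id :like/l2 :like/post :post/p1 :like/user :person/b}]]))

;; Count the Likes by joining on the top-level :like/post attribute.
(def like-count
  (xt/q (xt/db node)
        '{:find [(count ?like)]
          :in [?post]
          :where [[?like :like/post ?post]]}
        :post/p1))
;; => #{[2]}

;; The Post still has a single version in its history.
(def post-versions
  (count (xt/entity-history (xt/db node) :post/p1 :asc)))
;; => 1
```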

Caveat Regarding Software Life Cycles

I keep saying "initially." There is the very real possibility someone reading this discussion thinks to herself "but I'm migrating a legacy database" or "I already know what my records look like." Fair enough. None of these are hard and fast rules. XTDB can even be used as an exploratory data store for unstructured data, which pretty quickly breaks the "don't nest too deeply" rule of thumb. In a throw-away exploratory or analytics database, it might make complete sense to do

[put some-horrible-json-blob]

and then start playing around with it, teasing it into pieces, querying over the pieces, and reassembling them as you go. The earlier in the life cycle of your project, the more carefree I would encourage you to be. It's been hard for me to break out of my old habits, personally. Part of my brain still wants to set up my schema migrations and schema-on-write tools before I put my first record in the database. Of course, that doesn't make a lot of sense if I don't know what my data shapes look like yet.

Similarly, once you've broken out of the early phases of a project, you may want to return to your existing data and migrate it into a new shape — perhaps a more complicated one, where a collection of "Friendships" is separate from a "Person." Thanks to the absence of schema-on-write, xtdb encourages you to store the data you have rather than the data you think you'll need. Thanks to immutable records, you can always access your old data based on older (simpler) models. If and when you decide to migrate to a new schema, your old data is still there, saved forever in tx-time. Thus, if you create simpler models earlier on, you can watch your modeling evolve into something more sophisticated only as your system requires it.
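A minimal sketch of reading an older shape after a migration, assuming XTDB 1.x and an in-memory node. The attribute names and the `:friend-list/a` id are hypothetical, and the example pins explicit valid times so that it is deterministic; the same history calls apply when you rely on the default (transaction-time-derived) timestamps.

```clojure
(require '[xtdb.api :as xt])

(def node (xt/start-node {}))

;; Early, simple model: friend ids stored directly on the Person.
(xt/await-tx
 node
 (xt/submit-tx
  node
  [[::xt/put {:xt/id :person/a
              :person/name "Person A"
              :person/friends [:person/b]}
    #inst "2021-01-01"]]))

;; Later model: the Person points at a separate (hypothetical) friend-list document.
(xt/await-tx
 node
 (xt/submit-tx
  node
  [[::xt/put {:xt/id :person/a
              :person/name "Person A"
              :person/friend-list :friend-list/a}
    #inst "2021-06-01"]]))

;; Both versions remain available in the entity's history.
(def versions
  (mapv ::xt/doc
        (xt/entity-history (xt/db node) :person/a :asc {:with-docs? true})))

;; A database value as of a date between the two writes sees the old shape.
(def old-shape
  (xt/entity (xt/db node #inst "2021-03-01") :person/a))
```

Here `versions` holds the old document followed by the new one, and `old-shape` is the document that still carries `:person/friends`.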

Hope that makes sense — and I hope it helps! Have fun playing with xtdb. (And I do mean that literally: try to play before you build. XT is a lot of fun! 😃)

2 replies

kyleerhabor Oct 30, 2021

Instead, you can track a vector of friend ids on any given Person directly. Assuming you are querying with EDN Datalog, traversing these relationships is then quite easy (and direct).

With XTDB being schemaless, is there any support for references/foreign keys/data integrity? If the document the ID refers to were deleted, it could just remain in the vector pointing to nothing (hence, when queried, returns nil), but I could imagine scenarios where this is unwanted. For example, if a user is following another user and that other user deletes their account (either with delete or evict), the ID points to nothing. But, in a rare case, maybe a new user creates an account and happens to have the same ID.

refset Nov 1, 2021
Maintainer

There is no first class feature to enforce these kinds of integrity constraints, but you can achieve virtually anything needed using transaction functions. Let me know if you would like to see a concrete example for the case you described and I'd be happy to spend the time :)
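For readers wondering what such a transaction function could look like, here is one possible sketch (not the maintainer's example), assuming XTDB 1.x and an in-memory node. The function id `:follow-if-exists` and the `:person/follows` attribute are hypothetical. The function adds a follow only when both documents exist and returns `false` otherwise, which aborts the transaction.

```clojure
(require '[xtdb.api :as xt])

(def node (xt/start-node {}))

(xt/await-tx
 node
 (xt/submit-tx
  node
  [[::xt/put {:xt/id :person/b :person/name "Person B"}]
   [::xt/put {:xt/id :person/c :person/name "Person C"}]
   ;; Hypothetical transaction function: follow only if both people exist.
   [::xt/put {:xt/id :follow-if-exists
              :xt/fn '(fn [ctx follower-id followee-id]
                        (let [db (xtdb.api/db ctx)
                              follower (xtdb.api/entity db follower-id)
                              followee (xtdb.api/entity db followee-id)]
                          (if (and follower followee)
                            [[::xt/put (update follower :person/follows
                                               (fnil conj []) followee-id)]]
                            false)))}]]))

;; Followee exists: the transaction commits.
(def ok-tx
  (xt/await-tx node
               (xt/submit-tx node [[::xt/fn :follow-if-exists :person/c :person/b]])))

;; Followee is missing: the function returns false and the transaction aborts.
(def bad-tx
  (xt/await-tx node
               (xt/submit-tx node [[::xt/fn :follow-if-exists :person/c :person/missing]])))

(xt/tx-committed? node ok-tx)  ;; => true
(xt/tx-committed? node bad-tx) ;; => false
```

A similar function could clean up dangling ids when an account is deleted; this sketch only guards the write path.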

franz65 · 2021-09-26T16:49:46Z

Author

Thanks a lot @deobald.
I had read the articles you cited; that's why I had already decided not to use the first solution. I proposed it here only because I think it can still be useful for a small dataset.
As you guessed, I was leaning towards the second solution, but I wanted some confirmation from someone more experienced in the field.
I have used relational databases and, a little, graph databases, but I'm totally new to document databases and to XTDB.
Thanks again for your answer.

0 replies

olivergg · 2021-12-23T10:48:23Z


@deobald thanks for the thorough explanation, it helps a lot 👍

@franz65

I could decide not to insert the friendship node id in Person, but I think this could become problematic if I want to search all friends of, for example, Person A. If I have a large network with a lot of Persons and relationships I could have to look for interesting friendships among millions of friendship nodes.

But according to:

XTDB automatically indexes the top-level fields in all documents, supporting efficient ad-hoc joins and retrievals.

So shouldn't a simple query on the relationship nodes (knowing the :friend/1 uuid) be sufficient?

'{:find [x]
  :where [[i :friend/1 "uuid of person A"]
          [i :friend/2 x]]}
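Because friendships are bidirectional, Person A may appear under either `:friend/1` or `:friend/2`. A minimal sketch of a query covering both directions, assuming XTDB 1.x, an in-memory node, and hypothetical keyword ids in place of UUIDs:

```clojure
(require '[xtdb.api :as xt])

(def node (xt/start-node {}))

;; Friendship (edge) documents; Person A appears on either side.
(xt/await-tx
 node
 (xt/submit-tx
  node
  [[::xt/put {:xt/id :friendship/f1 :friend/1 :person/a :friend/2 :person/b}]
   [::xt/put {:xt/id :friendship/f2 :friend/1 :person/c :friend/2 :person/a}]]))

;; Both branches of `or` bind the same variables, as the rule requires.
(def friends-of-a
  (xt/q (xt/db node)
        '{:find [?friend]
          :in [?person]
          :where [(or (and [?f :friend/1 ?person]
                           [?f :friend/2 ?friend])
                      (and [?f :friend/2 ?person]
                           [?f :friend/1 ?friend]))]}
        :person/a))
;; => #{[:person/b] [:person/c]}
```

Both attributes are top-level, so each branch is an indexed lookup rather than a scan over all friendship nodes.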

1 reply

franz65 Dec 28, 2021
Author

I prefer not to put relationships in the node, to avoid an overly long list of versions of the same node that differ only in the relationship data.
I created a node containing an array of relationships, connected to the Person node. This way I can easily access the list of friends while keeping the Person node untouched.
