This repository has been archived by the owner on Jun 6, 2024. It is now read-only.

Schema & documentation working updates #163

Merged
merged 12 commits into from
Apr 23, 2024
52 changes: 6 additions & 46 deletions documentation/documentation.md
@@ -27,7 +27,7 @@ docker compose -f docker-compose-cluster.yaml up
Initialize the `awesome` UBI store:

```
curl -X PUT "http://localhost:9200/_plugins/ubi/awesome?index=ecommerce&id_field=id"
curl -X PUT "http://localhost:9200/_plugins/ubi/awesome?index=ecommerce&object_id=id"
```

Send an event to the `awesome` store:
@@ -71,58 +71,18 @@ The plugin has a concept of a "store", which is a logical collection of the even
index is used to store events, and the other index is for storing queries.

### OpenSearch Data Mappings

#### Schema for events:

The current event mappings file can be found [here](https://github.com/o19s/opensearch-ubi/blob/main/src/main/resources/events-mapping.json).

**Primary fields include:**
- `action_name` - (size 100) - any name you want to call your event
- `timestamp` - unix epoch time. if not set, will be set by the plugin when the event is received
- `user_id`, `session_id`, `page_id` - (size 100) - ids largely at the calling client's discretion for tracking users, sessions and pages
- `query_id` - (size 100) - ID for some query. Note that it could be a unique search string, or it could represent a cluster of related searches (i.e.: *dress*, *red dress*, *long dress* could all have the same `query_id`). Either the client could control these, or the `query_id` could be retrieved from the API's response headers as it keeps track of queries on the node
- `message_type` - (size 100) - originally thought of in terms of ERROR, INFO, WARN, but could be anything useful such as `QUERY` or `CONVERSION`. Use to group `action_name` together.
- `message` - (size 256) - optional text for the log entry

**Other fields & data objects**
- `event_attributes.object` - contains an associated JSONified data object (i.e. books, products, user info, etc) if there are any
- `event_attributes.object.object_id` - points to a unique, internal id representing an instance of that object
- `event_attributes.object.key_value` - points to a unique, external key matching the item that the user searched for, found and acted upon (i.e. sku, isbn, ean, etc.).
**This value should match the object's value in the `id_field` ([below](#id_field)) of the search store.**
It is possible that the `object_id` and `key_value` match if the same id is used both internally for indexing and externally for the users.
- `event_attributes.object.object_type` - indicates the type/class of object
- `event_attributes.object.description` - optional description of the object
- `event_attributes.object.transaction_id` - optionally points to a unique id representing a successful transaction
- `event_attributes.object.to_user_id` - optionally points to another user, if they are the recipient of this object
- `event_attributes.object.object_detail` - optional data object/map of further data details
- `event_attributes.position` - nested object to track user events to the location of the event origins
- `event_attributes.position.ordinal` - tracks the nth item within a list that a user could select, click
- `event_attributes.position.{x,y}` - tracks x and y values, that the client defines
- `event_attributes.position.page_depth` - tracks page depth
- `event_attributes.position.scroll_depth` - tracks scroll depth
- `event_attributes.position.trail` - text field for tracking the path/trail that a user took to get to this location

* Other mapped fields in the schema are intended to be optional placeholders for common attributes like `user_name`, `email`, `price`

**Users can dynamically add any further fields to the event mapping.**

#### Schema for queries:

The current query mappings file can be found [here](https://github.com/o19s/opensearch-ubi/blob/main/src/main/resources/queries-mapping.json).

- `timestamp` - A unix timestamp of when the query was received
- `query_id` - A unique ID of the query provided by the client or generated automatically by the plugin
- `query_response_id` - A unique ID for the collection of results for the query
- `user_id` - A user ID provided by the client
- `session_id` - An optional session ID provided by the client
Ubi has 2 primary indices:
- **Ubi Queries** stores all queries and results.
- **Ubi Events** stores the events that the Ubi client writes.
*See the [schema deep dive](./schemas.md) to understand how these two indices make Ubi a causal framework for search.*

## Plugin API

The plugin exposes a REST API for managing UBI stores and persisting events.

| Method | Endpoint | Purpose |
|--------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| `PUT` | `/_plugins/ubi/{store}?index={index}&id_field={id_field}` | <p id="id_field">Initialize a new UBI store for the given index. The `id_field` is optional and allows for providing the name of a field in the `index`'s schema to be used as the unique result/item ID for each search result. If not provided, the `_id` field is used. </p>|
| `PUT` | `/_plugins/ubi/{store}?index={index}&object_id={object_id}` | <p id="object_id">Initialize a new UBI store for the given index. The `object_id` is optional and allows for providing the name of a field in the `index`'s schema to be used as the unique result/item ID for each search result. If not provided, the `_id` field is used. </p>|
| `DELETE` | `/_plugins/ubi/{store}` | Delete a UBI store |
| `GET` | `/_plugins/ubi` | Get a list of all UBI stores |
| `POST` | `/_plugins/ubi/{store}` | Index an event into the UBI store |
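
As a minimal sketch of the store-initialization call in the table above, the following Python helper only builds the `PUT` URL; it does not send a request. The function name `ubi_store_init_url` is hypothetical and not part of the plugin; the host, store and index values are taken from the `curl` example earlier in this document.

```python
from urllib.parse import quote, urlencode


def ubi_store_init_url(base_url, store, index, object_id=None):
    """Build the PUT URL that initializes a UBI store (hypothetical helper).

    `object_id` is optional; per the table above, the `_id` field is used
    when it is not provided, so the parameter is simply left off the URL.
    """
    params = {"index": index}
    if object_id is not None:
        params["object_id"] = object_id
    path = f"/_plugins/ubi/{quote(store, safe='')}"
    return f"{base_url.rstrip('/')}{path}?{urlencode(params)}"


url = ubi_store_init_url("http://localhost:9200", "awesome", "ecommerce", object_id="id")
print(url)  # http://localhost:9200/_plugins/ubi/awesome?index=ecommerce&object_id=id
```

The resulting string can be passed to any HTTP client as the target of a `PUT` request.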
6 changes: 3 additions & 3 deletions documentation/queries/sql_queries.md
@@ -9,7 +9,7 @@ Although it's trivial on the server side to find queries with no results, we can
select
count(0)
from .ubi_log_queries
where query_response_hit_ids is null
where query_response_objects_ids is null
order by user_id
```

@@ -18,7 +18,7 @@ order by user_id
select
count(0)
from .ubi_log_events
where action_name='on_search' and event_attributes.data.data_detail.query_data.query_response_hit_ids is null
where action_name='on_search' and event_attributes.data.data_detail.query_data.query_response_objects_ids is null
order by timestamp
```

@@ -113,7 +113,7 @@ where query_id ='1065c70f-d46a-442f-8ce4-0b5e7a71a892'
order by timestamp
```
(In this generated data, the `query` field is plain text; however in the real implementation the query will be in the internal DSL of the query and parameters.)
query_response_id|query_id|user_id|query|query_response_hit_ids|session_id|timestamp
query_response_id|query_id|user_id|query|query_response_objects_ids|session_id|timestamp
---|---|---|---|---|---|---
1065c70f-d46a-442f-8ce4-0b5e7a71a892|1065c70f-d46a-442f-8ce4-0b5e7a71a892|155_7e3471ff-14c8-45cb-bc49-83a056c37192|Blanditiis quo sint repudiandae a sit.|8659955|fa6e3b1c-3212-44d2-b16b-690b4aeddbba_1975|2027-04-17 10:16:45

214 changes: 214 additions & 0 deletions documentation/schemas.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,214 @@

# Key User Behavior Insights concepts
**User Behavior Insights** (Ubi) **Logging** is really just a matter of linking and indexing queries, results and events within OpenSearch.
## Key ID's
Ubi is not functional unless the links between the following are consistently maintained within your Ubi-enabled application:

- [`user_id`](#user_id) represents a unique user.
- [`object_id`](#object_id) represents an id for whatever item the user is searching for, such as *epc*, *isbn*, *ssn*, *handle*, etc.
- [`query_id`](#query_id) is a unique id for the raw query that was executed and the resultant `object_id`'s that the query returned.
- [`action_name`](#action_name), though not technically an *id*, tells us exactly what action (such as `click` or `add_to_cart`) was taken (or not) with this `object_id`.

To summarize: the `query_id` signals the beginning of a `user_id`'s *Search Journey*, the `action_name` tells us how the user is interacting with the query results within the application, and [`event_attributes.object.object_id`](#object_id) refers to the precise query result that the user interacts with.
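
To make the summary concrete, here is a small sketch that groups already-indexed events into *Search Journeys* keyed by `user_id` and `query_id`. The function `search_journeys` and the sample ids are hypothetical; only the field names come from the schema described in this document.

```python
from collections import defaultdict


def search_journeys(events):
    """Group events by (user_id, query_id), ordered by timestamp.

    Each journey becomes a list of (action_name, object_id) pairs, i.e. what
    the user did with which query result after a given query.
    """
    journeys = defaultdict(list)
    for event in sorted(events, key=lambda e: e["timestamp"]):
        key = (event["user_id"], event["query_id"])
        object_id = event["event_attributes"]["object"]["object_id"]
        journeys[key].append((event["action_name"], object_id))
    return dict(journeys)


# Hypothetical sample events, deliberately out of timestamp order.
events = [
    {"user_id": "u1", "query_id": "q1", "timestamp": 2, "action_name": "add_to_cart",
     "event_attributes": {"object": {"object_id": "isbn-123"}}},
    {"user_id": "u1", "query_id": "q1", "timestamp": 1, "action_name": "click",
     "event_attributes": {"object": {"object_id": "isbn-123"}}},
    {"user_id": "u1", "query_id": "q2", "timestamp": 3, "action_name": "click",
     "event_attributes": {"object": {"object_id": "isbn-456"}}},
]

journeys = search_journeys(events)
```

A new `query_id` starts a new journey, so the two clicks above end up in separate groups even though they belong to the same user.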

## Ubi Roles
- **Search Client**: in charge of searching, and then receiving *objects* from some document index in OpenSearch.
&ensp;(1, 2, *5* & 7, below)
- **User Behavior Insights** module: once activated, manages the **Ubi Queries** store in the background. It indexes each underlying DSL query with a unique [`query_id`](#query_id) along with all returned [`object_id`](#object_id)'s, and then passes the `query_id` back to the **Search Client** so that events can be linked to this query.
&ensp;(3, 4 & *5*, below)
- **objects**: whatever items the user is searching for with the queries. Activating Ubi involves mapping your real-world objects (via their *isbn*, etc.) to the [`object_id`](#object_id) fields in the schemas below.
- The **Search Client**, if separate from the **Ubi Client**, forwards the indexed [`query_id`](#query_id) to the **Ubi Client**.
&ensp; *Note:* We break out the roles of *search* and *Ubi event indexing* here, but many implementations will likely use the same OpenSearch client instance for both searching and index writing.
&ensp;(6, below)
- The **Ubi Client** then indexes all user events with this [`query_id`](#query_id) until a new search is performed and a new `query_id` is generated by **User Behavior Insights** and passed back to the **Ubi Client**.
- If the user interacts with a result *object* through the **Ubi Client**, such as with an `onClick`, that [`object_id`](#object_id), the *onClick* [`action_name`](#action_name) and the `query_id` are all indexed together, signalling the causal link between the *search* and the *object*.
&ensp;(8 & 9, below)
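
The hand-off between these roles can be sketched with in-memory stand-ins. This is an illustration under stated assumptions, not the plugin's implementation: the classes `UbiModule` and `UbiClient`, the `search` function and the sample documents are all hypothetical, and plain Python dicts and lists stand in for the OpenSearch indices.

```python
import uuid


class UbiModule:
    """Stand-in for the server-side module that manages the Ubi Queries store."""

    def __init__(self):
        self.queries = {}  # stand-in for the Ubi Queries index

    def record_query(self, user_id, dsl, object_ids):
        # Store the DSL and the returned object ids under a fresh query_id.
        query_id = str(uuid.uuid4())
        self.queries[query_id] = {
            "user_id": user_id,
            "query": dsl,
            "query_response_objects_ids": list(object_ids),
        }
        return query_id


class UbiClient:
    """Stand-in for the client that writes to the Ubi Events store."""

    def __init__(self):
        self.events = []  # stand-in for the Ubi Events index
        self.query_id = None

    def set_query_id(self, query_id):
        # The Search Client forwards the query_id here.
        self.query_id = query_id

    def track(self, user_id, action_name, object_id):
        # Index the event together with the current query_id.
        event = {
            "user_id": user_id,
            "query_id": self.query_id,
            "action_name": action_name,
            "event_attributes": {"object": {"object_id": object_id}},
        }
        self.events.append(event)
        return event


def search(documents, ubi_module, user_id, text):
    """Toy Search Client: substring match on the title, then log the query."""
    hits = [doc for doc in documents if text.lower() in doc["title"].lower()]
    dsl = {"match": {"title": text}}
    query_id = ubi_module.record_query(user_id, dsl, [doc["object_id"] for doc in hits])
    return query_id, hits


documents = [
    {"object_id": "isbn-123", "title": "Red Dress Patterns"},
    {"object_id": "isbn-456", "title": "Long Dress Sewing"},
    {"object_id": "isbn-789", "title": "Gardening Basics"},
]
ubi_module = UbiModule()
ubi_client = UbiClient()

query_id, results = search(documents, ubi_module, "user-1", "dress")
ubi_client.set_query_id(query_id)
event = ubi_client.track("user-1", "onClick", results[0]["object_id"])
```

Because the event carries the same `query_id` that the queries store used as its key, the clicked `object_id` can later be checked against the list of ids that the query returned.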



```mermaid
graph LR
style L fill:none,stroke-dasharray: 5 5
subgraph L["`*Legend*`"]
style ss height:150px
subgraph ss["Standard Search"]
direction LR

style ln1a fill:blue
ln1a[ ]--->ln1b[ ];
end
subgraph ubi-leg["Ubi data flow"]
direction LR

ln2a[ ].->|"`**Ubi interaction**`"|ln2b[ ];
style ln1c fill:red
ln1c[ ]-->|<span style="font-family:Courier New">query_id</span> flow|ln1d[ ];
end
end
linkStyle 0 stroke-width:2px,stroke:#0A1CCF
linkStyle 2 stroke-width:2px,stroke:red
```
```mermaid
%%{init: {
"flowchart": {"htmlLabels": false},

}
}%%
graph TB

User--1) <i>raw search string</i>-->Search;
Search--2) <i>search string</i>-->Docs
style OS stroke-width:2px, stroke:#0A1CCF, fill:#62affb, opacity:.5
subgraph OS[OpenSearch Cluster fa:fa-database]
style E stroke-width:1px,stroke:red
E[(&emsp;<b>Ubi Events</b>&emsp;)]
style Docs stroke-width:1px,stroke:#0A1CCF
style Q stroke-width:1px,stroke:red
Docs[(Document Index)] -."3) {<i>DSL</i>...} & [<i>object_id's</i>,...]".-> Q[(&emsp;<b>Ubi Queries</b>&emsp;)];
Q -.4) <span style="font-family:Courier New">query_id</span>.-> Docs ;
end

Docs -- "5) <i>return</i> both <span style="font-family:Courier New">query_id</span> & [<i>objects</i>,...]" --->Search ;
Search-.6) <span style="font-family:Courier New">query_id</span>.->U;
Search --7) [<i>results</i>, ...]--> User

style *client-side* stroke-width:1px, stroke:#D35400
subgraph "`*client-side*`"
style User stroke-width:4px, stroke:#EC636
User["`**User**`" fa:fa-user]
App
Search
U
style App fill:#D35400,opacity:.35, stroke:#0A1CCF, stroke-width:2px
subgraph App[&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;&emsp;UserApp fa:fa-store]
style Search stroke-width:2px, stroke:#0A1CCF
Search(&emsp;Search Client&emsp;)
style U stroke-width:1px,stroke:red
U(&emsp;<b>Ubi Client</b>&emsp;)
end
end

User -.8) <i>selects</i> <span style="font-family:Courier New">object_id:123</span>.->U;
U-."9) <i>index</i> event:{<span style="font-family:Courier New">query_id, onClick, object_id:123</span>}".->E;

linkStyle 1,2,0,6 stroke-width:2px,fill:none,stroke:#0A1CCF
linkStyle 3,4,5,8 stroke-width:2px,fill:none,stroke:red
```

## Ubi Stores
There are 2 separate stores for Ubi:
### 1) **Ubi Queries**
All underlying query information and results ([`object_id`](#object_id)'s) are stored in the **Ubi Queries** store, which remains largely invisible in the background.
The only obvious difference will be in the `ubi` stanza of the JSON response, *which could cause index bloat if one forgets that this is enabled*.

**Ubi Queries** [schema](../src/main/resources/queries-mapping.json):
Since Ubi manages the **Ubi Queries** store, the developer should never have to write directly to this store (except for importing data).

- `timestamp`
&ensp; A unix timestamp of when the query was received

- `query_id`
&ensp; A unique ID of the query provided by the client or generated automatically. The same query text issued multiple times generates a different `query_id` each time.

- `user_id`
&ensp; A user ID provided by the client

- `session_id`
&ensp; An optional session ID provided by the client. _Whether to keep this field is currently under review_.

- `query_response_objects_ids`
&ensp; This is an array of the `object_id`'s. This *could* be the same id as the `_id` but is meant to be the externally valid id of the document/item/product.
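
For illustration, a record with the fields listed above might be assembled as follows. This is a sketch, not the plugin's code: the helper `new_query_record` is hypothetical, the use of a UUID for generated ids is an assumption, and the timestamp is written in whole seconds because the schema text does not say whether seconds or milliseconds are expected.

```python
import time
import uuid


def new_query_record(user_id, object_ids, session_id=None, query_id=None):
    """Build a dict shaped like a Ubi Queries record (hypothetical helper)."""
    record = {
        "timestamp": int(time.time()),  # assumption: unix time in seconds
        # Use the client-provided id if present, otherwise generate one.
        "query_id": query_id or str(uuid.uuid4()),
        "user_id": user_id,
        "query_response_objects_ids": list(object_ids),
    }
    if session_id is not None:
        record["session_id"] = session_id
    return record


# The same search issued twice yields two different generated query_ids.
first = new_query_record("user-1", ["isbn-123", "isbn-456"])
second = new_query_record("user-1", ["isbn-123", "isbn-456"])
provided = new_query_record("user-1", [], query_id="client-chosen-id")
```

An empty `query_response_objects_ids` list, as in the last record, is one way a zero-result query could show up in this shape.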



### 2) **Ubi Events**
This is the event store that the client side directly indexes events to, linking the event [`action_name`](#action_name), [`object_id`](#object_id)'s and [`query_id`](#query_id)'s together with any other important event information.
Since this schema is dynamic, the developer can add any new fields and structures (such as *user* information, *geo-location* information, etc.) at index time that are not in the current **Ubi Events** [schema](../src/main/resources/events-mapping.json):
- `application`
<p id="application">

&ensp; (size 100) - name of the application tracking UBI events (e.g. *amazon-shop*, *ABC-microservice*)
- `action_name`
<p id="action_name">

&ensp; (size 100) - any name you want to call your event. For example, with *javascript* events, you could include `on_click`, `logon`, `add_to_cart`, `page_scroll`.... _This should be formalized. A list of standard ones and then custom ones._

- `query_id`
<p id="query_id">

&ensp; (size 100) - ID for some query. Either the client provides this, or the `query_id` is generated at index time by **Ubi Queries**.
- `user_id`, `session_id`, `source_id` <p id="user_id">

&ensp; (size 100) - ids largely at the calling client's discretion for tracking users, sessions and sources (i.e. pages) of the event.
The `user_id` must be consistent in both the **Ubi Queries** and **Ubi Events** stores.

- `timestamp`:
&ensp; UTC-based, unix epoch time.

- `message_type`

&ensp; (size 100) - originally thought of in terms of ERROR, INFO, WARN, but could be anything useful such as `QUERY` or `CONVERSION`.
Can be used to group `action_name` together in logical bins. _Thinking this should be backend logic in analysis_

- `message`

&ensp; (size 256) - optional text message for the log entry. For example, with a `message_type` of `INFO`, people might expect informational or debug-type text in this field, but with a `message_type` of `QUERY`, we would expect the text to be more about what the user is searching on.


- `event_attributes`'s structure is where any relevant information about the event can be stored.
There are two primary structures in the `event_attributes`:
- **`event_attributes.position`** - structure that contains information on the location of the event origin, such as screen *x,y* coordinates, or the *n*th object out of 10 results, ....

- `event_attributes.position.ordinal`

&ensp; tracks the *n*th item within a list that a user could select or click (e.g. selecting the 3rd element could be event{`onClick, results[2]`} with zero-based indexing)

- `event_attributes.position.{x,y}`

&ensp; tracks x and y values, that the client defines

- `event_attributes.position.page_depth`

&ensp; tracks page depth of results

- `event_attributes.position.scroll_depth`

&ensp; tracks scroll depth of page results

- `event_attributes.position.trail`

&ensp; text field for tracking the path/trail that a user took to get to this location

<p id="object_id">

- **`event_attributes.object`**, which contains identifying information of the object returned from the query that the user interacts with (i.e.: a book, a product, a post, etc..).
The `object` structure has two ways to refer to the object, with `object_id` being the id that links prior queries to this object:

- `event_attributes.object.internal_id` is a unique id that OpenSearch can use internally to index the object; think of the `_id` field in the indices.
- `event_attributes.object.object_id`
&ensp; is the id that a user could look up to find the object instance within the **document corpus**. Examples include: *ssn*, *isbn*, *primary_ean*, etc. Variants need to be incorporated in the `object_id`, so for a red t-shirt, the `object_id` would need to be at the SKU level.
Initializing Ubi requires mapping from the **Document Index**'s primary key to this `object_id`.

- `event_attributes.object.object_type`

&ensp; indicates the type/class of object.

- `event_attributes.object.description`

&ensp; optional description of the object

- `event_attributes.object.transaction_id`

&ensp; optionally points to a unique id representing a successful transaction

- `event_attributes.object.to_user_id`

&ensp; optionally points to another user, if they are the recipient of this object, perhaps as a gift, from the user's `user_id`
- `event_attributes.object.object_detail`

&ensp; optional text for further data object details

- `event_attributes.object.object_detail.json`

&ensp; if the user has a JSON object representing what was acted upon, it can be stored here; however, note that this could lead to index bloat if the JSON objects are large.
- *extensible fields*: any new fields by any other names in the json objects that one indexes will dynamically expand this schema to that use-case.
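
The event fields above can be combined into a single document along these lines. This is a hedged sketch: the helper `build_event` is hypothetical, and it treats the "(size N)" annotations as maximum string lengths, which is an assumption about what the mapping enforces.

```python
# Assumption: "(size N)" in the schema text is read as a maximum string length.
MAX_LENGTHS = {
    "application": 100,
    "action_name": 100,
    "query_id": 100,
    "user_id": 100,
    "session_id": 100,
    "source_id": 100,
    "message_type": 100,
    "message": 256,
}


def build_event(action_name, query_id, user_id, object_id, ordinal=None, **extra):
    """Build a dict shaped like a Ubi Events document (hypothetical helper).

    Extra keyword arguments become top-level fields, mirroring the dynamic,
    extensible nature of the schema.
    """
    event = {
        "action_name": action_name,
        "query_id": query_id,
        "user_id": user_id,
        "event_attributes": {"object": {"object_id": object_id}},
    }
    if ordinal is not None:
        event["event_attributes"]["position"] = {"ordinal": ordinal}
    event.update(extra)

    # Reject values longer than the assumed field sizes.
    for field, limit in MAX_LENGTHS.items():
        value = event.get(field)
        if value is not None and len(str(value)) > limit:
            raise ValueError(f"{field} exceeds {limit} characters")
    return event


event = build_event(
    "add_to_cart", "q-1", "user-1", "sku-red-m",
    ordinal=3, application="demo-shop", message_type="CONVERSION",
)
```

Checking lengths on the client side is a design choice for this sketch only; how over-long values are actually handled depends on the mapping file linked above.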
@@ -95,7 +95,7 @@ public List<Setting<?>> getSettings() {

settings.add(Setting.intSetting(SettingsConstants.VERSION_SETTING, 1, -1, Integer.MAX_VALUE, Setting.Property.IndexScope));
settings.add(Setting.simpleString(SettingsConstants.INDEX, "", Setting.Property.IndexScope));
settings.add(Setting.simpleString(SettingsConstants.ID_FIELD, "", Setting.Property.IndexScope));
settings.add(Setting.simpleString(SettingsConstants.object_id, "", Setting.Property.IndexScope));

return settings;
