diff --git a/documentation/documentation.md b/documentation/documentation.md index 2bb18ce..ad74672 100644 --- a/documentation/documentation.md +++ b/documentation/documentation.md @@ -27,7 +27,7 @@ docker compose -f docker-compose-cluster.yaml up Initialize the `awesome` UBI store: ``` -curl -X PUT "http://localhost:9200/_plugins/ubi/awesome?index=ecommerce&id_field=id" +curl -X PUT "http://localhost:9200/_plugins/ubi/awesome?index=ecommerce&object_id=id" ``` Send an event to the `awesome` store: @@ -71,50 +71,10 @@ The plugin has a concept of a "store", which is a logical collection of the even index is used to store events, and the other index is for storing queries. ### OpenSearch Data Mappings - -#### Schema for events: - -The current event mappings file can be found [here](https://github.com/o19s/opensearch-ubi/blob/main/src/main/resources/events-mapping.json). - -**Primary fields include:** -- `action_name` - (size 100) - any name you want to call your event -- `timestamp` - unix epoch time. if not set, will be set by the plugin when the event is received -- `user_id`. `session_id`, `page_id` - (size 100) - are id's largely at the calling client's discretion for tracking users, sessions and pages -- `query_id` - (size 100) - ID for some query. Note that it could be a unique search string, or it could represent a cluster of related searches (i.e.: *dress*, *red dress*, *long dress* could all have the same `query_id`). Either the client could control these, or the `query_id` could be retrieved from the API's response headers as it keeps track of queries on the node -- `message_type` - (size 100) - originally thought of in terms of ERROR, INFO, WARN, but could be anything useful such as `QUERY` or `CONVERSION`. Use to group `action_name` together. -- `message` - (size 256) - optional text for the log entry - -**Other fields & data objects** -- `event_attributes.object` - contains an associated JSONified data object (i.e. 
books, products, user info, etc) if there are any - - `event_attributes.object.object_id` - points to a unique, internal, id representing and instance of that object - - `event_attributes.object.key_value` - points to a unique, external key, matching the item that the user searched for, found and acted upon (i.e. sku, isbn, ean, etc.). - **This field value should match the value in for the object's value in the `id_field` [below](#id_field) from the search store** - It is possible that the `object_id` and `key_value` match if the same id is used both internally for indexing and externally for the users. - - `event_attributes.object.object_type` - indicates the type/class of object - - `event_attributes.object.description` - optional description of the object - - `event_attributes.object.transaction_id` - optionally points to a unique id representing a successful transaction - - `event_attributes.object.to_user_id` - optionally points to another user, if they are the recipient of this object - - `event_attributes.object.object_detail` - optional data object/map of further data details -- `event_attributes.position` - nested object to track user events to the location of the event origins - - `event_attributes.position.ordinal` - tracks the nth item within a list that a user could select, click - - `event_attributes.position.{x,y}` - tracks x and y values, that the client defines - - `event_attributes.position.page_depth` - tracks page depth - - `event_attributes.position.scroll_depth` - tracks scroll depth - - `event_attributes.position.trail` - text field for tracking the path/trail that a user took to get to this location - -* Other mapped fields in the schema are intended to be optional placeholders for common attributes like `user_name`, `email`, `price` - -**the users can dynamically add any further fields to the event mapping - -#### Schema for queries: - -The current query mappings file can be found 
[here](https://github.com/o19s/opensearch-ubi/blob/main/src/main/resources/queries-mapping.json). - -- `timestamp` - A unix timestamp of when the query was received -- `query_id` - A unique ID of the query provided by the client or generated automatically by the plugin -- `query_response_id` - A unique ID for the collection of results for the query -- `user_id` - A user ID provided by the client -- `session_id` - An optional session ID provided by the client +Ubi has 2 primary indices: +- **UBi Queries** stores all queries and results. +- **UBi Events** store that the Ubi client writes events to. +*Please follow the [schema deep dive](./schemas.md) to understand how these two indices make Ubi into a causal framework for search.* ## Plugin API @@ -122,7 +82,7 @@ The plugin exposes a REST API for managing UBI stores and persisting events. | Method | Endpoint | Purpose | |--------|-----------------------------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------| -| `PUT` | `/_plugins/ubi/{store}?index={index}&id_field={id_field}` |
Initialize a new UBI store for the given index. The `id_field` is optional and allows for providing the name of a field in the `index`'s schema to be used as the unique result/item ID for each search result. If not provided, the `_id` field is used.
| +| `PUT` | `/_plugins/ubi/{store}?index={index}&object_id={object_id}` |Initialize a new UBI store for the given index. The `object_id` is optional and allows for providing the name of a field in the `index`'s schema to be used as the unique result/item ID for each search result. If not provided, the `_id` field is used.
| | `DELETE` | `/_plugins/ubi/{store}` | Delete a UBI store | | `GET` | `/_plugins/ubi` | Get a list of all UBI stores | | `POST` | `/_plugins/ubi/{store}` | Index an event into the UBI store | diff --git a/documentation/queries/sql_queries.md b/documentation/queries/sql_queries.md index a2dec00..68096c8 100644 --- a/documentation/queries/sql_queries.md +++ b/documentation/queries/sql_queries.md @@ -9,7 +9,7 @@ Although it's trivial on the server side to find queries with no results, we can select count(0) from .ubi_log_queries -where query_response_hit_ids is null +where query_response_objects_ids is null order by user_id ``` @@ -18,7 +18,7 @@ order by user_id select count(0) from .ubi_log_events -where action_name='on_search' and event_attributes.data.data_detail.query_data.query_response_hit_ids is null +where action_name='on_search' and event_attributes.data.data_detail.query_data.query_response_objects_ids is null order by timestamp ``` @@ -113,7 +113,7 @@ where query_id ='1065c70f-d46a-442f-8ce4-0b5e7a71a892' order by timestamp ``` (In this generated data, the `query` field is plain text; however in the real implementation the query will be in the internal DSL of the query and parameters.) -query_response_id|query_id|user_id|query|query_response_hit_ids|session_id|timestamp +query_response_id|query_id|user_id|query|query_response_objects_ids|session_id|timestamp ---|---|---|---|---|---|--- 1065c70f-d46a-442f-8ce4-0b5e7a71a892|1065c70f-d46a-442f-8ce4-0b5e7a71a892|155_7e3471ff-14c8-45cb-bc49-83a056c37192|Blanditiis quo sint repudiandae a sit.|8659955|fa6e3b1c-3212-44d2-b16b-690b4aeddbba_1975|2027-04-17 10:16:45 diff --git a/documentation/schemas.md b/documentation/schemas.md new file mode 100644 index 0000000..234e013 --- /dev/null +++ b/documentation/schemas.md @@ -0,0 +1,214 @@ + +# Key User Behavior Insights concepts +**User Behavior Insights** (Ubi) **Logging** is really just a matter of linking and indexing queries, results and events within OpenSearch. 
+## Key IDs
+Ubi is not functional unless the links between the following are consistently maintained within your Ubi-enabled application:
+
+- [`user_id`](#user_id) represents a unique user.
+- [`object_id`](#object_id) represents an id for whatever item the user is searching for, such as *epc*, *isbn*, *ssn*, *handle*, etc.
+- [`query_id`](#query_id) is a unique id for the raw query language executed and the resultant `object_id`'s that the query returned.
+- [`action_name`](#action_name), though not technically an *id*, tells us what exact action (such as `click` or `add_to_cart`) was taken (or not) with this `object_id`.
+
+To summarize: the `query_id` signals the beginning of a `user_id`'s *Search Journey*, the `action_name` tells us how the user is interacting with the query results within the application, and [`event_attributes.object.object_id`](#object_id) refers to the precise query result that the user interacts with.
+
+## Ubi Roles
+- **Search Client**: in charge of searching, and then receiving *objects* from some document index in OpenSearch.
+ (1, 2, *5* & 7, below)
+- **User Behavior Insights** module: once activated, manages the **Ubi Queries** store in the background, indexing each underlying DSL index query with a unique [`query_id`](#query_id) along with all returned [`object_id`](#object_id)'s, and then passing the `query_id` back to the **Search Client** so that events can be linked to this query.
+ (3, 4 & *5*, below)
+- **objects**: whatever items the user is searching for with the queries. Activating Ubi involves mapping your real-world objects (via their *isbn*, etc.) to the [`object_id`](#object_id) fields in the schemas below.
+- The **Search Client**, if separate from the **Ubi Client**, forwards the indexed [`query_id`](#query_id) to the **Ubi Client**.
+ *Note:* We break out the roles of *search* and *Ubi event indexing* here, but many implementations will likely use the same OpenSearch client instance for both searching and index writing.
+ (6, below)
+- The **Ubi Client** then indexes all user events with this [`query_id`](#query_id) until a new search is performed and a new `query_id` is generated by **User Behavior Insights** and passed back to the **Ubi Client**.
+- If the **Ubi Client** interacts with a result *object*, such as `onClick`, that [`object_id`](#object_id), *onClick* [`action_name`](#action_name) and `query_id` are all indexed together, signalling the causal link between the *search* and the *object*.
+ (8 & 9, below)
+
+
+
+```mermaid
+graph LR
+style L fill:none,stroke-dasharray: 5 5
+subgraph L["`*Legend*`"]
+ style ss height:150px
+ subgraph ss["Standard Search"]
+ direction LR
+
+ style ln1a fill:blue
+ ln1a[ ]--->ln1b[ ];
+ end
+ subgraph ubi-leg["Ubi data flow"]
+ direction LR
+
+ ln2a[ ].->|"`**Ubi interaction**`"|ln2b[ ];
+ style ln1c fill:red
+ ln1c[ ]-->|query_id flow|ln1d[ ];
+ end
+end
+linkStyle 0 stroke-width:2px,stroke:#0A1CCF
+linkStyle 2 stroke-width:2px,stroke:red
+```
+```mermaid
+%%{init: {
+ "flowchart": {"htmlLabels": false},
+
+ }
+}%%
+graph TB
+
+User--1) raw search string-->Search;
+Search--2) search string-->Docs
+style OS stroke-width:2px, stroke:#0A1CCF, fill:#62affb, opacity:.5
+subgraph OS[OpenSearch Cluster fa:fa-database]
+ style E stroke-width:1px,stroke:red
+ E[( Ubi Events )]
+ style Docs stroke-width:1px,stroke:#0A1CCF
+ style Q stroke-width:1px,stroke:red
+ Docs[(Document Index)] -."3) {DSL...} & [object_id's,...]".-> Q[( Ubi Queries )];
+ Q -.4) query_id.-> Docs ;
+end
+
+Docs -- "5) return both query_id & [objects,...]" --->Search ;
+Search-.6) query_id.->U;
+Search --7) [results, ...]--> User
+
+style *client-side* stroke-width:1px, stroke:#D35400
+subgraph "`*client-side*`"
+ style User stroke-width:4px, stroke:#EC636
+ User["`**User**`" fa:fa-user]
+ App
+ Search
+ U
+ style App fill:#D35400,opacity:.35, stroke:#0A1CCF, stroke-width:2px
+ subgraph App[ UserApp fa:fa-store]
+ style Search stroke-width:2px, stroke:#0A1CCF
+ Search( Search Client )
+ style U stroke-width:1px,stroke:red
+ U( Ubi Client )
+ end
+end
+
+User -.8) selects object_id:123.->U;
+U-."9) index event:{query_id, onClick, object_id:123}".->E;
+
+linkStyle 1,2,0,6 stroke-width:2px,fill:none,stroke:#0A1CCF
+linkStyle 3,4,5,8 stroke-width:2px,fill:none,stroke:red
+```
+
+## Ubi Stores
+There are 2 separate stores for Ubi:
+### 1) **Ubi Queries**
+All underlying query information and results ([`object_id`](#object_id)'s) are stored in the **Ubi Queries** store, and remain largely invisible in the background.
+The only obvious difference will be in the `ubi` stanza of the JSON response, *which could cause index bloat if one forgets that this is enabled*.
+
+**Ubi Queries** [schema](../src/main/resources/queries-mapping.json):
+Since Ubi manages the **Ubi Queries** store, the developer should never have to write directly to this store (except for importing data).
+
+- `timestamp`
+ A unix timestamp of when the query was received
+
+- `query_id`
+ A unique ID of the query provided by the client or generated automatically. The same query text issued multiple times would generate a different `query_id` each time.
+
+- `user_id`
+ A user ID provided by the client
+
+- `session_id`
+ An optional session ID provided by the client. _Whether to keep this field is currently under review_.
+
+- `query_response_objects_ids`
+ This is an array of the `object_id`'s. This *could* be the same id as the `_id` but is meant to be the externally valid id of the document/item/product.
+
+
+
+### 2) **Ubi Events**
+This is the event store that the client side directly indexes events to, linking the event [`action_name`](#action_name), [`object_id`](#object_id)'s and [`query_id`](#query_id)'s together with any other important event information.
+Since this schema is dynamic, the developer can add any new fields and structures (such as *user* information, *geo-location* information, etc.) at index time that are not in the current **Ubi Events** [schema](../src/main/resources/events-mapping.json):
+- `application`
+
+ (size 100) - name of the application tracking UBI events (e.g. *amazon-shop*, *ABC-microservice*)
+- `action_name`
+
+ (size 100) - any name you want to call your event. For example, with *javascript* events, you could include `on_click`, `logon`, `add_to_cart`, `page_scroll`, etc. _This should be formalized: a list of standard names, and then custom ones._
+
+- `query_id`
+
+ (size 100) - ID for some query. Either the client provides this, or the `query_id` is generated at index time by **Ubi Queries**.
+- `user_id`, `session_id`, `source_id`
+
+ (size 100) - are IDs largely at the calling client's discretion for tracking users, sessions and sources (i.e. pages) of the event.
+ The `user_id` must be consistent in both the **Ubi Queries** and **Ubi Events** stores.
+
+- `timestamp`
+ UTC-based, unix epoch time.
+
+- `message_type`
+
+ (size 100) - originally thought of in terms of ERROR, INFO, WARN, but could be anything useful such as `QUERY` or `CONVERSION`.
+ Can be used to group `action_name` values together in logical bins. _Thinking this should be backend logic in analysis_
+
+- `message`
+
+ (size 256) - optional text message for the log entry. For example, with a `message_type` of `INFO`, people might expect informational or debug text in this field, but with a `message_type` of `QUERY`, we would expect the text to be more about what the user is searching for.
+
+
+- `event_attributes`'s structure is where any relevant information about the event can be stored.
+ There are two primary structures in the `event_attributes`:
+ - **`event_attributes.position`** - structure that contains information on the location of the event origin, such as screen *x,y* coordinates, or the *n*th object out of 10 results.
+
+ - `event_attributes.position.ordinal`
+
+ tracks the *n*th item within a list that a user could select or click (e.g. selecting the 3rd element could be event{`onClick, results[2]`} with zero-based indexing)
+
+ - `event_attributes.position.{x,y}`
+
+ tracks x and y values that the client defines
+
+ - `event_attributes.position.page_depth`
+
+ tracks page depth of results
+
+ - `event_attributes.position.scroll_depth`
+
+ tracks scroll depth of page results
+
+ - `event_attributes.position.trail`
+
+ text field for tracking the path/trail that a user took to get to this location
+
+
+ - **`event_attributes.object`**, which contains identifying information of the object returned from the query that the user interacts with (e.g. a book, a product, a post, etc.).
+ The `object` structure has two ways to refer to the object, with `object_id` being the id that links prior queries to this object:
+
+ - `event_attributes.object.internal_id` is a unique id that OpenSearch can use internally to index the object (think of the `_id` field in the indices).
+ - `event_attributes.object.object_id`
+ is the id that a user could look up to find the object instance within the **document corpus**. Examples include: *ssn*, *isbn*, *primary_ean*, etc. Variants need to be incorporated in the `object_id`, so for a red t-shirt you would need the SKU-level id as the `object_id`.
+ Initializing Ubi requires mapping from the **Document Index**'s primary key to this `object_id`.
+
+ - `event_attributes.object.object_type`
+
+ indicates the type/class of object.
+
+ - `event_attributes.object.description`
+
+ optional description of the object
+
+ - `event_attributes.object.transaction_id`
+
+ optionally points to a unique id representing a successful transaction
+
+ - `event_attributes.object.to_user_id`
+
+ optionally points to another user, if they are the recipient of this object, perhaps as a gift, from the user's `user_id`
+ - `event_attributes.object.object_detail`
+
+ optional text for further data object details
+
+ - `event_attributes.object.object_detail.json`
+
+ if the user has a JSON object representing what was acted upon, it can be stored here; however, note that this could lead to index bloat if the JSON objects are large.
+- *extensible fields*: any new fields, under any other names, in the JSON objects that one indexes will dynamically extend this schema for that use case.
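To make the **Ubi Events** fields above concrete, here is a minimal Python sketch that assembles one click event and serializes it to JSON. Only the field names come from the schema in this file; the helper name `build_ubi_event`, the sample values, and the choice of epoch seconds for `timestamp` are assumptions made for illustration and are not part of the plugin.

```python
import json
import time
import uuid


def build_ubi_event(query_id, user_id, action_name, object_id, ordinal,
                    application="example-shop", session_id=None):
    """Assemble one event dict using field names from the Ubi Events schema.

    Hypothetical helper: only the field names are taken from the schema.
    """
    return {
        "application": application,
        "action_name": action_name,
        "query_id": query_id,   # links the event back to the originating query
        "user_id": user_id,     # must match the user_id stored with the query
        "session_id": session_id,
        "timestamp": int(time.time()),  # assumption: unix epoch seconds, UTC
        "message_type": "INFO",
        "message": f"{action_name} on {object_id}",
        "event_attributes": {
            "position": {"ordinal": ordinal},
            "object": {
                "object_id": object_id,  # external id, e.g. a SKU or ISBN
                "object_type": "product",
            },
        },
    }


if __name__ == "__main__":
    # Made-up ids; per the roles above, a real query_id would be passed
    # back to the client after a search.
    event = build_ubi_event(
        query_id=str(uuid.uuid4()),
        user_id="user-42",
        action_name="on_click",
        object_id="sku-123",
        ordinal=3,
    )
    print(json.dumps(event, indent=2))
```

The printed JSON is the kind of body one might send to `POST /_plugins/ubi/{store}`, the endpoint the Plugin API table lists for indexing an event. Whether the plugin accepts exactly this shape should be checked against `events-mapping.json`.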
diff --git a/src/main/java/com/o19s/ubi/UserBehaviorInsightsPlugin.java b/src/main/java/com/o19s/ubi/UserBehaviorInsightsPlugin.java
index 3b451a4..8ca10a0 100644
--- a/src/main/java/com/o19s/ubi/UserBehaviorInsightsPlugin.java
+++ b/src/main/java/com/o19s/ubi/UserBehaviorInsightsPlugin.java
@@ -95,7 +95,7 @@ public List