Merge pull request #788 from PowerGridModel/docs/columnar-data-serialization-native

Columnar data docs - native data interface serialization
PowerGridModel · Oct 18, 2024 · 33c184f
2 parents da0193b + 0a03cba
commit 33c184f
Showing 2 changed files with 47 additions and 3 deletions.
diff --git a/docs/advanced_documentation/native-data-interface.md b/docs/advanced_documentation/native-data-interface.md
@@ -55,21 +55,57 @@ node_dtype = np.dtype(
 To recreate the same node input dataset, we just create a `numpy` array using this special defined `dtype`.
 The `numpy` array has exactly the same data layout as the `std::vector<NodeInput>` above.
 
-
 ```python
 node = np.empty(shape=2, dtype=node_dtype)
 node['id'] = [1, 2]
 node['u_rated'] = [150e3, 10e3]
 ```
 
+## Columnar data format
+
+Additionally, we can represent only specific attributes of the `NodeInput` struct described in [Structured Array](#structured-array), with each attribute stored in its own array.
+This is especially useful when the component in question, e.g., a transformer, has many attributes that can be left at their default values. In that case, the user can save significantly on memory usage. For example, the `u_rated` attribute on its own can be given the type `NodeInputURated`, which is simply a `double`
+(note again that its representation in the C++ core might differ from `NodeInputURated`).
+
+One can create a `std::vector<NodeInputURated>` to hold this attribute for multiple nodes.
+As in the earlier example, we create `u_rated` attribute data for two nodes with rated voltages of 150 kV and 10 kV.
+
+```c++
+using NodeInputURated = double;
+std::vector<NodeInputURated> node_u_rated_input{ 150.0e3 , 10.0e3 };
+```
+
+The same applies to `NodeInputId` and `std::vector<NodeInputId>`.
+
+To recreate this in Python using NumPy arrays, we create one array per attribute, each with the correct dtype as defined in [Structured Array](#structured-array).
+
+```python
+# each columnar array is a plain (non-structured) array, so assign by slice, not by field name
+node_id = np.empty(shape=2, dtype=node_dtype["id"])
+node_id[:] = [1, 2]
+node_u_rated = np.empty(shape=2, dtype=node_dtype["u_rated"])
+node_u_rated[:] = [150e3, 10e3]
+```
+
+## Creating Dataset
+
 We further save this array into a dictionary.
 With other types of components, the dictionary is a valid input dataset for the constructor of `PowerGridModel`,
 see [Python API Reference](../api_reference/python-api-reference.md).
 
+For the row-based data format,
+
 ```python
 input_data = {'node': node}
 ```
 
+or for the columnar data format,
+
+```python
+input_data_columnar = {'node': {"id": node_id, "u_rated": node_u_rated}}
+```
+
+A single dataset can also combine both formats, with some components stored row based and others columnar.
+
 In the `ctypes` wrapper the pointers to all the array data will be retrieved and passed to the C++ code.
 This is also true for result dataset.
 The memory block of result dataset is allocated using `numpy`.
@@ -141,9 +177,14 @@ The code below creates an array which is compatible with transformer input datas
 ```python
 from power_grid_model import ComponentType, DatasetType, power_grid_meta_data
 
-transformer = np.empty(shape=5, dtype=power_grid_meta_data[DatasetType.input][ComponentType.transformer]['dtype'])
+transformer_dtype = power_grid_meta_data[DatasetType.input][ComponentType.transformer].dtype
+# Array for row based data
+transformer = np.empty(shape=5, dtype=transformer_dtype)
+# Array for columnar data
+transformer_tap_pos = np.empty(shape=5, dtype=transformer_dtype["tap_pos"])
+
 # direct string access is supported as well:
-# transformer = np.empty(shape=5, dtype=power_grid_meta_data['input']['transformer']['dtype'])
+# transformer = np.empty(shape=5, dtype=power_grid_meta_data['input']['transformer'].dtype)
 ```
 
 Furthermore, there is an even more convenient function `initialize_array`
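
To make the row-based versus columnar distinction in this change concrete, the following is a minimal sketch using only NumPy. The `node_dtype` here is an assumption: a hand-written stand-in with an `int32` `id` at offset 0 and a `float64` `u_rated` at offset 8, padded to 16 bytes per element. The dtype actually used by the library should be taken from `power_grid_meta_data` instead.

```python
import numpy as np

# Assumed stand-in for the node input dtype (not read from the library).
node_dtype = np.dtype(
    {
        "names": ["id", "u_rated"],
        "formats": ["<i4", "<f8"],
        "offsets": [0, 8],
        "itemsize": 16,
        "aligned": True,
    }
)

# Row-based: one structured array holding every attribute of every node.
node = np.empty(shape=2, dtype=node_dtype)
node["id"] = [1, 2]
node["u_rated"] = [150e3, 10e3]

# Columnar: one plain array per attribute, typed from the structured dtype.
node_id = np.empty(shape=2, dtype=node_dtype["id"])
node_id[:] = [1, 2]
node_u_rated = np.empty(shape=2, dtype=node_dtype["u_rated"])
node_u_rated[:] = [150e3, 10e3]

input_data = {"node": node}
input_data_columnar = {"node": {"id": node_id, "u_rated": node_u_rated}}

# The row-based array pays for padding and for every attribute in the struct;
# the columnar arrays only pay for the attributes that are actually supplied.
row_bytes = node.nbytes
columnar_bytes = sum(col.nbytes for col in input_data_columnar["node"].values())
print(row_bytes, columnar_bytes)
```

With only two attributes the saving is small, but for a component such as a transformer, where many attributes may be left unset, supplying only the needed columns avoids allocating the full struct per element.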

diff --git a/docs/user_manual/serialization.md b/docs/user_manual/serialization.md
@@ -94,6 +94,9 @@ A [`ComponentDataset`](#json-schema-component-dataset-object) is an array of [`C
 - [`ComponentDataset`](#json-schema-component-dataset-object): `Array`
   - [`ComponentData`](#json-schema-component-data-object): the data per single component.
 
+**NOTE:** The actual deserialized data representation may be row based or columnar, depending on the `data_filter` provided at deserialization (see {py:func}`json_deserialize <power_grid_model.utils.json_deserialize>` for an example).
+Regardless of whether the deserialized data representation is row based or columnar, the serialization format remains the same.
+
 #### JSON schema component data object
 
 A [`ComponentData`](#json-schema-component-data-object) object is either a [`HomogeneousComponentData`](#json-schema-homogeneous-component-data-object) object or an [`InhomogeneousComponentData`](#json-schema-inhomogeneous-component-data-object) object
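
The note about the serialization format being independent of the in-memory representation can be illustrated with a small, library-free sketch. The helpers `rows_to_columns` and `columns_to_rows` are hypothetical names introduced here for illustration only; they are not part of `power_grid_model`, and the JSON fragment is a simplified component list rather than the full serialization schema.

```python
import json


def rows_to_columns(rows):
    """Turn a list of per-component dicts into a dict of per-attribute lists."""
    attributes = sorted({key for row in rows for key in row})
    return {attr: [row.get(attr) for row in rows] for attr in attributes}


def columns_to_rows(columns):
    """Turn a dict of equally long per-attribute lists back into per-component dicts."""
    lengths = {len(values) for values in columns.values()}
    if len(lengths) > 1:
        raise ValueError("all columns must have the same length")
    size = lengths.pop() if lengths else 0
    return [{attr: values[i] for attr, values in columns.items()} for i in range(size)]


# Simplified serialized component data: one JSON object per node.
serialized = '[{"id": 1, "u_rated": 150000.0}, {"id": 2, "u_rated": 10000.0}]'

rows = json.loads(serialized)       # row-based view in memory
columns = rows_to_columns(rows)     # columnar view of the same data

# Serializing either view yields the same JSON text.
from_rows = json.dumps(rows, sort_keys=True)
from_columns = json.dumps(columns_to_rows(columns), sort_keys=True)
print(from_rows == from_columns)
```

The point of the sketch is only that both views carry the same information, so a single on-disk format can serve either representation.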