From 22dcf50ce01cee9bee89424308d041f3c4abd9ac Mon Sep 17 00:00:00 2001
From: Mingyu Kim
Date: Mon, 13 May 2024 09:17:12 +0400
Subject: [PATCH] [GPU] Documentation for use_device_mem (#24438)

### Details:
 - Documentation for use_device_mem option from benchmark_app

### Tickets:
 - 140756

---
 src/plugins/intel_gpu/README.md               |  1 +
 .../docs/dynamic_shape/in_memory_cache.md     | 18 ++++++------
 src/plugins/intel_gpu/docs/use_device_mem.md  | 28 +++++++++++++++++++
 3 files changed, 38 insertions(+), 9 deletions(-)
 create mode 100644 src/plugins/intel_gpu/docs/use_device_mem.md

diff --git a/src/plugins/intel_gpu/README.md b/src/plugins/intel_gpu/README.md
index cf220164645416..1b49895790778e 100644
--- a/src/plugins/intel_gpu/README.md
+++ b/src/plugins/intel_gpu/README.md
@@ -30,6 +30,7 @@ GPU Plugin contains the following components:
 * [Debug utils](./docs/gpu_debug_utils.md)
 * [OpenCL Runtime issues troubleshooting](./docs/gpu_plugin_driver_troubleshooting.md)
 * [GPU plugin unit test](./docs/gpu_plugin_unit_test.md)
+* [Run benchmark from device_mem](./docs/use_device_mem.md)
 
 ## Documentation on dynamic-shape
 This contents explain the internal implementation of dynamic shape support in the GPU Plugin. For general usage of dynamic shape and limitations of the GPU plugin, please refer to this link: [GPU Device — OpenVINO™ documentation - Version(2023.1)](https://docs.openvino.ai/2023.1/openvino_docs_OV_UG_supported_plugins_GPU.html#dynamic-shapes).
diff --git a/src/plugins/intel_gpu/docs/dynamic_shape/in_memory_cache.md b/src/plugins/intel_gpu/docs/dynamic_shape/in_memory_cache.md
index 79847c6648bb38..e961178d48acd2 100644
--- a/src/plugins/intel_gpu/docs/dynamic_shape/in_memory_cache.md
+++ b/src/plugins/intel_gpu/docs/dynamic_shape/in_memory_cache.md
@@ -2,21 +2,21 @@
 
 ## Motivation
 
-When creating a primitive_impl in the Dynamic Shape model, if each primitive_impls are created about the same primitive with the same type and input / output shapes, it creates duplicated primitive_impl including new cl kernel build for same kernel source. this may result in inefficiency and performance degradation due to build the exactly same cl kernel source code multiple times for same layout and primitive type on the run time for dynamic model. To resolve this issue, ImplementationCache handle is newly introduced.
+In a dynamic-shape model, creating a separate primitive_impl for each occurrence of the same primitive with the same type and input/output shapes produces duplicates, including a fresh OpenCL kernel build of the same kernel source. This can cause inefficiency and performance degradation, because exactly the same OpenCL kernel source is built multiple times for the same layout and primitive type at runtime. To resolve this issue, `ImplementationsCache` is newly introduced.
 
 ## Property
 
-* ImplementationCache only handles primitive_impl which is created in primitive_inst::update_impl() and primitive_inst::update_weights() on dynamic shape model. In the case of static shape, kernels_cache handles static shape kernel duplication.
-* ImplementationCache inherits LRUCacheThreadSafe which is ThreadSafe version of LRUCache which handles primitive_impl cache by increasing the cache hit rate for frequently used items. Therefore, ImplementationCache optimizes the performance of dynamic execution through frequently used primitive_impl.
-* Since cldnn::program creates ImplementationCache as unique_ptr at cldnn::program constructor, its lifecycle is set to cldnn::program.
-* ImplementationCache supports multi-stream, so the cldnn::network of each stream manages primitive_impl in same cache.
-* ImplementationCache Capacity is set to 10000 by default, but may change in the future optimization.
+* `ImplementationsCache` only handles primitive_impl objects created in `primitive_inst::update_impl()` and `primitive_inst::update_weights()` for dynamic-shape models. For static shapes, kernels_cache handles kernel deduplication.
+* `ImplementationsCache` inherits LruCacheThreadSafe, a thread-safe version of LruCache, which caches primitive_impl objects and increases the cache hit rate for frequently used items. Therefore, `ImplementationsCache` improves dynamic-execution performance by reusing frequently used primitive_impls.
+* Since `cldnn::program` creates `ImplementationsCache` as a unique_ptr in its constructor, its lifecycle is tied to `cldnn::program`.
+* `ImplementationsCache` supports multi-stream execution, so the `cldnn::network` of each stream manages primitive_impls in the same cache.
+* The capacity of `ImplementationsCache` is set to 10000 by default, but it may change with future optimizations.
 
 ## Usages
 
-ImplementationCache is used to handle primitive_impl cache at primitive_inst::update_impl() and primitive_inst::update_weights() in dynamic shape model.
+`ImplementationsCache` is used to cache primitive_impls in `primitive_inst::update_impl()` and `primitive_inst::update_weights()` for dynamic-shape models (a simplified sketch of the pattern follows the list below).
 
-* In primitive_inst::update_impl(), it looks up the cache with key which is hash value of kernel_impl_param which is updated by the current primitive_inst. If it is not found from ImplementationCache, new primitive_impl is created and save it into the cache.
-* In primitive_inst::update_weights(), if it is not found a primitive_impl with a hash key value which matches the weights_reorder_kernel_params of the primitive inst, it also create a new primitive_impl for weight reorder and put it in the cache.
+* In `primitive_inst::update_impl()`, the cache is looked up with a key that is the hash of the kernel_impl_param updated by the current primitive_inst. If no entry is found in `ImplementationsCache`, a new primitive_impl is created and saved into the cache.
+* In `primitive_inst::update_weights()`, if no primitive_impl is found whose hash key matches the weights_reorder_kernel_params of the primitive_inst, a new primitive_impl for the weight reorder is likewise created and put into the cache.
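+
+For illustration, here is a minimal, self-contained sketch of this lookup-or-create pattern. It is not the actual cldnn code: the names (`SimpleLruCache`, `ImplStub`, `get_or_create`) are invented for this example, while the real logic lives in LruCacheThreadSafe and `primitive_inst::update_impl()`.
+
+```cpp
+#include <cstddef>
+#include <functional>
+#include <list>
+#include <memory>
+#include <mutex>
+#include <string>
+#include <unordered_map>
+#include <utility>
+
+// Stand-in for a compiled primitive_impl.
+struct ImplStub {
+    std::string compiled_kernel;
+};
+
+// Simplified, thread-safe LRU cache in the spirit of LruCacheThreadSafe.
+class SimpleLruCache {
+public:
+    explicit SimpleLruCache(std::size_t capacity) : capacity_(capacity) {}
+
+    // Return the impl cached under `key`, or build and cache it on a miss.
+    std::shared_ptr<ImplStub> get_or_create(std::size_t key,
+            const std::function<std::shared_ptr<ImplStub>()>& build) {
+        std::lock_guard<std::mutex> lock(mutex_);
+        auto it = index_.find(key);
+        if (it != index_.end()) {
+            // Cache hit: mark as most recently used and skip the kernel build.
+            lru_.splice(lru_.begin(), lru_, it->second);
+            return it->second->second;
+        }
+        // Cache miss: build (i.e. compile) once, then remember the result.
+        auto impl = build();
+        lru_.emplace_front(key, impl);
+        index_[key] = lru_.begin();
+        if (index_.size() > capacity_) {
+            // Evict the least recently used entry.
+            index_.erase(lru_.back().first);
+            lru_.pop_back();
+        }
+        return impl;
+    }
+
+private:
+    std::size_t capacity_;
+    std::mutex mutex_;
+    std::list<std::pair<std::size_t, std::shared_ptr<ImplStub>>> lru_;
+    std::unordered_map<std::size_t,
+        std::list<std::pair<std::size_t, std::shared_ptr<ImplStub>>>::iterator> index_;
+};
+
+int main() {
+    SimpleLruCache cache(10000);  // matches the default capacity mentioned above
+    // The key plays the role of the kernel_impl_param hash.
+    std::size_t key = std::hash<std::string>{}("fully_connected:f16:shape_of_this_call");
+    auto impl = cache.get_or_create(key, [] {
+        return std::make_shared<ImplStub>(ImplStub{"<compiled OpenCL kernel>"});
+    });
+    return impl ? 0 : 1;
+}
+```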
diff --git a/src/plugins/intel_gpu/docs/use_device_mem.md b/src/plugins/intel_gpu/docs/use_device_mem.md
new file mode 100644
index 00000000000000..370f86815b93d6
--- /dev/null
+++ b/src/plugins/intel_gpu/docs/use_device_mem.md
@@ -0,0 +1,28 @@
+# Introduction
+
+This document describes the `--use_device_mem` option of benchmark_app. It makes a performance difference on platforms where access to host memory and device memory is not identical. Discrete GPUs and recent iGPUs get a performance boost from this feature.
+
+# Motivation
+
+You can achieve the best GPU performance when input data is placed in device memory. Intel OpenCL supports specifying such placement through the USM (Unified Shared Memory) feature, and it is recommended to place input data in device memory whenever possible. For example, if the input data is decoded from a video stream by the GPU, it is recommended to use that data directly on the GPU. On the other hand, if the input data is generated by the CPU, this feature cannot be used.
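+
+For reference, below is a minimal, hypothetical sketch of how an application can place its input in device memory with OpenVINO's GPU remote tensor API (`ov::intel_gpu::ocl::ClContext`). The model path is a placeholder, and the exact API may vary between OpenVINO releases:
+
+```cpp
+#include <openvino/openvino.hpp>
+#include <openvino/runtime/intel_gpu/ocl/ocl.hpp>
+
+int main() {
+    ov::Core core;
+    auto compiled = core.compile_model("model.xml", "GPU");  // placeholder model path
+
+    // Get the GPU remote context and allocate the input in USM device memory.
+    auto context = compiled.get_context().as<ov::intel_gpu::ocl::ClContext>();
+    auto input = compiled.input();
+    auto device_tensor = context.create_usm_device_tensor(input.get_element_type(),
+                                                          input.get_shape());
+
+    // The tensor should be produced on the GPU (e.g., by a video decoder);
+    // copying it from the CPU would negate the benefit of device placement.
+
+    auto request = compiled.create_infer_request();
+    request.set_tensor(input, device_tensor);  // zero-copy device input
+    request.infer();
+    return 0;
+}
+```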
+The bottom line is that the applicability of this feature depends on the application's data flow. If possible, place the input data in device memory.
+
+# Benchmark_app support for device memory
+
+The OpenVINO benchmark_app sample includes a feature that mimics placing input data in device memory: it allocates the inputs and outputs of the neural network in device memory. You can enable it with the use_device_mem option of benchmark_app.
+
+### Restriction of use_device_mem
+
+Currently, benchmark_app does not support filling input data when use_device_mem is on; inputs are filled with random numbers instead. This is fine for measuring networks whose performance does not depend on the input data. However, if the target network's performance does depend on the input data, this option may report an incorrect result. For example, some object detection networks contain an NMS layer whose execution time depends on the input data. For such detection networks, measuring performance with the use_device_mem option is not recommended.
+
+### How to build sample for use_device_mem (on Windows)
+
+The option depends on the Intel OpenCL USM (Unified Shared Memory) feature. To use the option, you need to build the sample with OpenCL enabled. Here are the steps to build the sample application with OpenCL:
+1. Set up environment variables for the compiler and the OpenVINO release package
+1. \> git clone https://github.com/microsoft/vcpkg
+1. \> cd vcpkg
+1. \> .\bootstrap-vcpkg.bat
+1. \> vcpkg search opencl
+1. \> vcpkg install opencl
+1. openvino_install\samples\cpp> cmake -DCMAKE_BUILD_TYPE=Release -B build -DCMAKE_TOOLCHAIN_FILE=path/to/vcpkg/scripts/buildsystems/vcpkg.cmake
+1. openvino_install\samples\cpp> cmake --build build --config Release --parallel
+
+### How to build sample for use_device_mem (on Ubuntu)
+1. \# apt install opencl-c-headers opencl-clhpp-headers
+1. Build the OpenVINO C++ samples with the build script
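+
+### Example run
+
+After building, a benchmark run that places the network inputs and outputs in device memory looks like the following (the model path is a placeholder; run benchmark_app -h to confirm the exact option spelling in your version):
+
+```
+./benchmark_app -m model.xml -d GPU --use_device_mem
+```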