Revision | Date | Changes | |
1.0 | June 21, 2022 | Initial version | |
1.1 | June 24, 2022 | ||
2.0 | July 14, 2023 | Updated designs for Libero 2023.2 | |
3.0 | Jan 22, 2024 | Updated for Libero 2024.1 | |
4.0 | Aug 7, 2024 | Updated for Libero 2024.2 |
This section provides all the requirements needed before starting the training.
You should install the following software:
- SmartHLS™ 2024.2 or later: this is packaged with Libero
- Libero® SoC 2024.2 (with QuestaSim Pro 2021.3) or later
- A terminal emulator such as PuTTY
This document uses the Windows versions of Libero® SoC 2024.2 and SmartHLS 2024.2. Depending on the version you use, the results generated by your Libero® SoC and SmartHLS could differ slightly from those presented in this document.
Additionally, while the default simulator for SmartHLS is now QuestaSim, ModelSim is still supported. Some screenshots of the simulator may have been captured using ModelSim.
Download the training design files in advance:
- Linux image:
- Github:
- SHA256: 4a1406ba9e764a94026fcea2ee8fbb84f91384e953e7ba6176fcb7dadcbc5522
- Training design files for the Running Vector-Add Reference SoC Generation on the Board section can be found on Github under Training4/vector_add_soc
- Training design files for the Integrating SmartHLS into an Existing SoC design section can be found on Github under Training4/icicle-kit-reference-design
- Download
from the Release Assets. This archive contains the pre-compiled bitstreams required for this training.
- Alternatively, you can re-generate the bitstreams and Libero project from scratch by following the instructions here: Compiling the hardware.
Later parts of the training involve running steps on the Icicle kit board. The following hardware is required:
- PolarFire® SoC FPGA Icicle Kit (MPFS-ICICLE-KIT-ES)
- 2 micro-USB cables for serial communication and flashing the Linux image
- Either a FlashPro6 external programmer or a micro-USB cable for the embedded FlashPro6
- Ethernet cable for network connection to the board for SSH access
This training will cover the following sections in the SmartHLS user guide: SoC Features, AXI4 Initiator Interface, AXI4 Target Interface, Driver Functions for AXI4 Target, and User-defined SmartDesigns.
We will use this cursor symbol throughout this tutorial to indicate sections where you need to perform actions to follow along.
Our previous trainings focused on using SmartHLS as an IP generator, where SmartHLS takes as input a C++ program and generates a SmartDesign IP component. The user then instantiates the generated SmartDesign IP component into their SmartDesign system in Libero before running synthesis, place and route, and ultimately programming the FPGA.
In this training, we introduce the SmartHLS SoC flow targeting a PolarFire SoC FPGA device as shown in Figure 4‑1. The SoC flow will now generate C++ software drivers with APIs that can be used to control the generated IP cores from the Microprocessor Sub-System (MSS). Given the generated software drivers and the generated SmartDesign component’s AXI4 interfaces, the SmartHLS IP block can be easily integrated into an existing PolarFire SoC system.
Figure 4‑1 SmartHLS IP Flow from Software to Hardware on FPGA
The SmartHLS SoC flow also supports partitioning the input C++ program between software running on the MSS processor and user-specified functions that SmartHLS synthesizes into FPGA hardware cores. The SmartHLS SoC flow refers to FPGA IP cores synthesized by SmartHLS from user-specified C++ functions as hardware accelerators. FPGA hardware accelerators typically see a performance speedup (acceleration) compared to the original C++ software running on the MSS processor due to parallelism in the FPGA fabric.
As shown in Figure 5‑1, SmartHLS takes the C++ program as input and performs user-guided hardware/software partitioning. For the software partition, SmartHLS automatically transforms the original C++ software program by replacing the user-specified functions with calls to FPGA hardware using the generated software drivers. Then SmartHLS compiles the software using the RISC-V compiler toolchain to get a RISC-V software binary to run on the MSS. For the hardware partition, SmartHLS generates the hardware accelerators for user-specified C++ functions and then connects these accelerators to the MSS via an AXI4 interconnect in a generated Reference SoC hardware system targeting the PolarFire SoC.
Figure 5‑1 SmartHLS SoC flow Details
Alongside the accelerators, SmartHLS also produces a Tcl script for easy SmartDesign integration, and C++ accelerator driver code to control the accelerators, which can be directly called by the software program running on the MSS.
All SmartHLS-generated hardware accelerators will implement an AXI4 target interface, with memory-mapped registers for control and data transfer. Therefore, the accelerators can be instantiated into any existing AXI4-compatible SoC design. Figure 5‑2 below gives a system diagram of the PolarFire SoC Reference SoC that can be generated by SmartHLS.
Figure 5‑2 SmartHLS Generated Reference SoC Architecture Overview
On the left, we have the PolarFire SoC Microprocessor Sub-System (MSS), which contains four RISC-V processors running the user’s software on Linux. On the right, we have one or more hardware accelerators. The processor communicates with the accelerators using a memory-mapped AXI interconnect. Additional hardware accelerators can be added, if there is room in the memory-map, by simply attaching them to the AXI interconnect.
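As a rough illustration of the "room in the memory-map" condition, the sketch below assigns accelerators back-to-back base addresses and rejects one that would not fit. The window size of 0x1000 and the names AcceleratorSlot and attach are assumptions made for this example. The base address 0x70000000 and the 0x100 range are the values SmartHLS reports for the vector-add accelerator later in this training.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical description of one accelerator's slot on the AXI interconnect.
struct AcceleratorSlot {
    std::uint64_t base;   // first address owned by the accelerator
    std::uint64_t range;  // number of bytes in its address space
};

// Assumed address window reserved for accelerators (size chosen for illustration).
constexpr std::uint64_t kWindowBase = 0x70000000;
constexpr std::uint64_t kWindowSize = 0x1000;

// Place a new accelerator directly after the ones already attached.
// Returns false (and changes nothing) when the window has no room left.
bool attach(std::vector<AcceleratorSlot>& slots, std::uint64_t range) {
    std::uint64_t next = kWindowBase;
    if (!slots.empty()) {
        next = slots.back().base + slots.back().range;
    }
    if (next + range > kWindowBase + kWindowSize) {
        return false;
    }
    slots.push_back({next, range});
    return true;
}
```

A real interconnect may also impose alignment rules on each base address. This sketch ignores them.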
In this section, you will use SmartHLS SoC flow to target a vector addition program written in C++ to the PolarFire SoC FPGA. The vector addition will take two input arrays, add these two arrays element-by-element, and store the sum for each element into the output array.
On Windows, this can be done by double-clicking on the SmartHLS shortcut either in the start menu or the desktop.
On Linux, make sure that $(SMARTHLS_INSTALL_DIR)/SmartHLS/bin is on your PATH. The SmartHLS IDE can then be opened by running the following command:
shls -g
You will first see a dialog box to select a workspace directory as shown in Figure 6‑1 below. You can use the default workspace for all parts of this tutorial by clicking on OK.
Figure 6‑1 Choosing a Workspace
Warning: Make sure there are no spaces in your workspace path. Otherwise, SmartHLS will give an error when running synthesis. Also, keep the path short as there’s a 90-character limit on file names on Windows.
Once the SmartHLS IDE opens, under the File menu, choose New and then SmartHLS C/C++ Project as shown below in Figure 6‑2.
Figure 6‑2 Creating a new SmartHLS C/C++ Project
For the project name, enter vector_add_soc and select “Example Project 5: Vector Add” from the list of example projects, as shown in Figure 6‑3. Then click on Next.
Figure 6‑3 Creating Vector Add SmartHLS Project
Finally, to complete the project creation, you will choose the FPGA device you intend to target. Use the selections shown in Figure 6‑4: for FPGA Family, choose PolarFireSoC, and for FPGA Device, choose MPFS250T_ES-FCVG484E on Icicle Board. Click on Finish when you are done. SmartHLS may take a few moments to create the project.
Figure 6‑4 Choosing FPGA device, select SoC IP Flow
If this is the first time you are using SmartHLS, you will need to set up the paths to QuestaSim (and Microsemi Libero® for later parts of this tutorial). To set up the paths, click on SmartHLS on the top menu bar, then click on Tool Path Settings. Once the dialog opens, set the paths for Simulator and Microsemi Libero® SoC as shown in Figure 6‑5 and click OK.
Figure 6‑5 SmartHLS Tool Path Settings
An important panel of the SmartHLS IDE is the Project Explorer on the left side of the window as shown in Figure 6‑6. We will use the project explorer throughout this tutorial to view source files and synthesis reports.
Click on the small arrow icon to expand the vector_add_soc project. You can now double click any of the source files, such as vector_add_soc.cpp, and you will see the source file appear in the main panel to the right of the Project Explorer.
Figure 6‑6 Project Explorer for browsing source files and reports
The SmartHLS IP flow refers to when SmartHLS generates a hardware IP core that can be integrated into a user’s SmartDesign system in Libero. SmartHLS can go one step further and integrate the generated IP core, which we refer to as an accelerator, into a Reference SoC targeting PolarFire SoC. We call this flow the SmartHLS SoC flow (described later).
Once a SmartHLS project is created, you should always open one of the source files (such as vector_add_soc.cpp) or double-click on the project directory in the Project Explorer pane (Figure 6‑6). This will make vector_add_soc the active project. You can also see the active project name in the Console tab after running a SmartHLS command, as shown in Figure 6‑7.
Figure 6‑7 SmartHLS Console active project
When there are multiple projects open in the workspace, you need to click on the project in the Project Explorer pane or open a file from the project to make the project active before running any SmartHLS commands. This is a standard guideline for Eclipse-based IDEs such as SmartHLS.
Figure 6‑8 SmartHLS Toolbar
Towards the top of SmartHLS, you will find a toolbar, as shown in Figure 6‑8, which you can use to execute the main features of the SmartHLS tool. Highlighted in red is the new SoC pulldown menu. We will describe the SoC pulldown menu in the SmartHLS SoC Flow section.
Figure 6‑9 SmartHLS Workflow
Figure 6‑9 summarizes the steps for the SmartHLS flow. We initially create a SmartHLS project and follow a standard software development flow on the C++ code (compile/run/debug). Then we apply HLS constraints using SmartHLS C++ pragmas. These include HLS constraints covered in previous trainings such as the target clock period, loop optimizations, and memory configuration. For more details see our optimization guide.
There are new SmartHLS interface pragmas used to specify the data transfer method for each top-level function argument. These pragmas specify how the generated hardware accelerators interface with the rest of the SoC. Figure 6‑10 below contains a summary of the SmartHLS pragmas used in the vector-add example. More details on the interfaces will be covered in the SoC Data Transfer Methods section. For a complete pragma reference, see our pragma guide.
In Figure 6‑9, after specifying the argument interfaces, we can compile the software into a hardware IP core using SmartHLS, and review reports about the generated hardware. Then we run software/hardware co-simulation to verify the generated hardware. Finally, we can try synthesizing the generated IP, and integrate the IP into an existing hardware system using the output SmartDesign Tcl script, software drivers, and Verilog for the FPGA hardware accelerators. The last SoC step of the workflow, “Generate SoC Project”, will be covered in the SmartHLS SoC Flow section.
Pragma | Description |
#pragma HLS function top | Identify the function being compiled into an accelerator |
#pragma HLS interface default type(<axi_target|simple>) | Set the default interface type, including the interface type for control and arguments. |
#pragma HLS interface control type(<axi_target|simple>) | Set the default module control interface type. The control interface is used for starting the accelerator, reading completion status, and retrieving return data. |
#pragma HLS interface argument(<ARGUMENT_NAME>) type(axi_target) num_elements(<NUM_ARRAY_ELEMENTS>) dma(<true|false>) | Set pointer argument of ARGUMENT_NAME to use axi_target as the interface. If dma(true), DMA will be used for transferring data. More details in the DMA Copy: AXI Target with DMA section. |
#pragma HLS interface argument(<ARGUMENT_NAME>) type(axi_initiator) ptr_addr_interface(<simple|axi_target>) num_elements(<NUM_ARRAY_ELEMENTS>) | Set pointer argument of ARGUMENT_NAME to use axi_initiator as the interface. ptr_addr_interface is the interface that receives the address from the MSS. NUM_ARRAY_ELEMENTS indicates the number of elements in the array. |
Figure 6‑10 Summary of Pragmas Used in Vector-Add
We can now browse through the code in the vector_add_soc.cpp file. We will first look at the vector_add_sw C++ function on line 25, as shown in Figure 6‑11. The function has three pointer arguments: two input arrays, a and b, and an output array, result. Each array is expressed in C++ as a pointer to an int (32-bit) array of size SIZE. The loop on line 26 performs a vector addition of a and b and stores the sum in the result array.
24 // The core logic of this example
25 void vector_add_sw(int* a, int* b, int* result) {
26 for (int i = 0; i < SIZE; i++) {
27 result[i] = a[i] + b[i];
28 }
29 }
Figure 6‑11 Core Logic of Vector-Add
Now we look at the vector_add_axi_target_memcpy C++ function on line 70, as shown in Figure 6‑12.
70 void vector_add_axi_target_memcpy(int* a, int* b, int* result) {
71 #pragma HLS function top
72 #pragma HLS interface control type(axi_target)
73 #pragma HLS interface argument(a) type(axi_target) num_elements(SIZE)
74 #pragma HLS interface argument(b) type(axi_target) num_elements(SIZE)
75 #pragma HLS interface argument(result) type(axi_target) num_elements(SIZE)
76 vector_add_sw(a, b, result);
77 }
Figure 6‑12 Accelerator Version of Vector-Add
We use SmartHLS to compile the vector addition into a hardware accelerator running on the FPGA by adding SmartHLS pragmas. Immediately following the function prototype, "#pragma HLS function top" on line 71 specifies that the vector_add_axi_target_memcpy function will be turned into a hardware accelerator by SmartHLS. The sub-functions called by top-level functions will also be compiled to hardware, for example vector_add_sw() here. SmartHLS can compile multiple top-level functions, each with a "top" pragma, into hardware accelerators, but for this example we will use a single accelerated function.
The pragmas on lines 73-75 describe the interface type of each argument to the accelerated function. On line 72, the control type is set to axi_target instead of the default simple. The control interface type must be axi_target if the user wishes SmartHLS to generate a Reference SoC automatically. Requiring an AXI target interface allows the generated Reference SoC to interact with the accelerator through an AXI4 interface without manually configuring the input and output wiring. If the control interface type is simple, the control interface will use individual wires for clock, reset, ready, etc., instead of an AXI target port, and users will be responsible for connecting these to their system.
On lines 73-75, the interface types for arguments a, b, and result are all set to axi_target. For an axi_target interface, the hardware accelerator expects data to be sent to the data AXI target port and the accelerator will store the data in local memory blocks. The num_elements field specifies the length of the array that will be transferred for each argument. For more information on the required pragmas and tradeoffs, please see our pragma guide.
In this example, we separated the core C++ algorithm into the vector_add_sw function. We can then call this function from multiple different SmartHLS top-level functions. We also call this function from our software test bench in main on line 158.
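To make the "one core function, several top-level functions" idea concrete, here is a minimal sketch. The first top-level function repeats the pragmas from Figure 6‑12. The second one, vector_add_axi_initiator_sketch, is a hypothetical name used only for illustration, and its pragmas follow the axi_initiator form summarized in Figure 6‑10. A regular C++ compiler ignores the HLS pragmas, so both functions can be run as plain software.

```cpp
#define SIZE 16

// Core algorithm, shared by every top-level function below.
void vector_add_sw(int* a, int* b, int* result) {
    for (int i = 0; i < SIZE; i++) {
        result[i] = a[i] + b[i];
    }
}

// Top-level function from the example: the CPU copies the arrays
// into the accelerator through its AXI target port.
void vector_add_axi_target_memcpy(int* a, int* b, int* result) {
#pragma HLS function top
#pragma HLS interface control type(axi_target)
#pragma HLS interface argument(a) type(axi_target) num_elements(SIZE)
#pragma HLS interface argument(b) type(axi_target) num_elements(SIZE)
#pragma HLS interface argument(result) type(axi_target) num_elements(SIZE)
    vector_add_sw(a, b, result);
}

// Hypothetical second top-level function (name assumed for this sketch):
// the accelerator reads and writes the arrays itself as an AXI initiator.
void vector_add_axi_initiator_sketch(int* a, int* b, int* result) {
#pragma HLS function top
#pragma HLS interface control type(axi_target)
#pragma HLS interface argument(a) type(axi_initiator) ptr_addr_interface(axi_target) num_elements(SIZE)
#pragma HLS interface argument(b) type(axi_initiator) ptr_addr_interface(axi_target) num_elements(SIZE)
#pragma HLS interface argument(result) type(axi_initiator) ptr_addr_interface(axi_target) num_elements(SIZE)
    vector_add_sw(a, b, result);
}
```

Because the algorithm lives in one place, changing the data transfer method only means changing the pragmas of the wrapper, not the computation.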
In the main function on line 141, we allocate the input arrays in contiguous physical memory using hls_malloc and initialize the input arrays on lines 150-155. The main function calls the top-level function that will be turned into hardware on line 160, and compares the result against a software-computed golden output on line 168. Note that the main function returns 0 if the results match, which is required to run Software-Hardware Co-Simulation. There are no restrictions on C++ code used in the main function, which will not be turned into hardware; for example, file I/O can be used for your software testbench.
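The overall shape of such a testbench can be sketched as follows. This is not the code from vector_add_soc.cpp: std::vector stands in for hls_malloc so the sketch runs on any host, and run_testbench and vector_add_top are names assumed for this example. The part that matters is the return convention: 0 when the output matches the golden result, non-zero otherwise.

```cpp
#include <vector>

constexpr int SIZE = 16;

// Core algorithm, also used to compute the golden output.
void vector_add_sw(int* a, int* b, int* result) {
    for (int i = 0; i < SIZE; i++) {
        result[i] = a[i] + b[i];
    }
}

// Stand-in for the accelerated top-level function (assumed name).
void vector_add_top(int* a, int* b, int* result) {
    vector_add_sw(a, b, result);
}

// Mirrors the structure of main(): initialize the inputs, run the function
// under test, compare against the golden output, and return 0 on a match.
int run_testbench(void (*top)(int*, int*, int*)) {
    std::vector<int> a(SIZE), b(SIZE), result(SIZE), golden(SIZE);
    for (int i = 0; i < SIZE; i++) {
        a[i] = i;
        b[i] = 10 * i;
    }
    vector_add_sw(a.data(), b.data(), golden.data());
    top(a.data(), b.data(), result.data());

    int mismatches = 0;
    for (int i = 0; i < SIZE; i++) {
        if (result[i] != golden[i]) {
            mismatches++;
        }
    }
    return mismatches == 0 ? 0 : 1;
}
```

In a real project, main would simply return the value computed this way, for example `return run_testbench(vector_add_top);`.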
Click on the Compile Software icon in the toolbar. This compiles the software with the GCC compiler. You will see the output from the compilation appearing in the bottom of the screen in the Console window of the IDE.
Now, execute the compiled software by clicking on the Run Software icon in the toolbar. You should see the message RESULT: PASS appearing in the Console window, as shown below.
13:58:58 **** Build of configuration LegUp for project vector_add_soc ****
"C:\Microchip\SmartHLS-2024.2\SmartHLS\bin\shls.bat" -s sw_compile sw_run
Info: Running the following targets: sw_compile sw_run
Info: Compiling Software...
13:59:06 Build Finished (took 7s.703ms)
Now we can compile the C++ software into hardware using SmartHLS by clicking on the toolbar icon to Compile Software to Hardware. This command invokes SmartHLS to compile functions designated with the pragma “HLS function top” into hardware. If the top function calls descendant functions, all descendant functions are also compiled into hardware.
When the Compile Software to Hardware command is finished, SmartHLS will open the report file hls_output/reports/summary.hls.vector_add_axi_target_memcpy.rpt. Notice that the top-level function name vector_add_axi_target_memcpy is specified in the report filename. There is one report generated for each top-level function.
The report shows the RTL interface of the generated Verilog module corresponding to the C++ top-level function, as shown below in Figure 6‑14. We can see that the generated IP’s interface has input ports for clock and reset, and a single AXI4 Target port. Due to the large number of AXI4 ports in the RTL, SmartHLS uses a wildcard “axi4target_*” to simplify the table. The “Control via AXI4 Target” entry indicates that start/finish control is done using the AXI target interface. Each of the function’s three arguments also uses the AXI target interface. The address map of the AXI target port is given later in the report.
| RTL Interface Generated by SmartHLS |
| C++ Name | Interface Type | Signal Name | Signal Bit-width | Signal Direction |
| | Clock & Reset | clk (positive edge) | 1 | input |
| | | reset (synchronous active high) | 1 | input |
| | Control via AXI4 Target | axi4target_* | | |
| a | AXI4 Target | axi4target_* | | |
| b | AXI4 Target | axi4target_* | | |
| result | AXI4 Target | axi4target_* | | |
Figure 6‑14 RTL Interface Generated for vector_add_axi_target_memcpy
The report section “Scheduling Result” gives the number of cycles scheduled for each basic block of the function. The report section “Memory Usage” lists the memories that are used in the hardware. Any memory that is accessed by both the software testbench (parent functions of the top-level function) and hardware functions (the top-level function and its descendants) becomes an I/O memory. These are any non-constant arguments of the top-level function or global variables that are accessed by both the software testbench and hardware functions. I/O memories become memory interfaces of the top-level module for the generated hardware. For more information on interfaces, please refer to Top-Level RTL Interface.
The “I/O Memories” table is shown in Figure 6‑15 and has an entry for each top-level function argument. Each entry has a "Data Width" of 32 bits (int), a "Depth" of 16 elements (SIZE), and a "Size [Bits]" of 512 bits (16x32).
| I/O Memories |
| Name | Accessing Function(s) | Type | Size [Bits] | Data Width | Depth | Read Latency |
| a | vector_add_axi_target_memcpy | ROM | 512 | 32 | 16 | 1 |
| b | vector_add_axi_target_memcpy | ROM | 512 | 32 | 16 | 1 |
| result | vector_add_axi_target_memcpy | RAM | 512 | 32 | 16 | 1 |
Figure 6‑15 An Example I/O Memory Usage Table
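The numbers in the table follow directly from the C++ types. The short sketch below recomputes them at compile time, assuming a 32-bit int as stated above. The constant and function names are chosen for this example only.

```cpp
#include <climits>
#include <cstddef>

constexpr int SIZE = 16;

// Size of a memory in bits = data width (bits per element) x depth (elements).
constexpr std::size_t memory_bits(std::size_t data_width, std::size_t depth) {
    return data_width * depth;
}

constexpr std::size_t kDataWidth = sizeof(int) * CHAR_BIT;  // 32 for a 32-bit int
constexpr std::size_t kDepth = SIZE;                        // 16 array elements
constexpr std::size_t kSizeBits = memory_bits(kDataWidth, kDepth);

static_assert(kDataWidth == 32, "this sketch assumes a 32-bit int");
static_assert(kSizeBits == 512, "16 elements x 32 bits = 512 bits");
```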
The report section on the AXI4 target interface address map is shown below in Figure 6‑16. This section first confirms with “Yes” that this HLS accelerator is compatible with the reference SoC features (to be covered later in the training). An accelerator is compatible if the control and all function arguments have an interface type of either axi_target or axi_initiator, so that the accelerator can be automatically integrated into the Reference SoC. If any of the interface types are the default of simple, the accelerator will be incompatible, and the user will not be able to generate a Reference SoC automatically. In addition, the target board must be the PolarFire SoC Icicle kit for the accelerator to be compatible with the Reference SoC features. If the accelerator is compatible, then the default base address for this accelerator when automatically integrated in the generated reference SoC is also shown: 0x70000000.
The AXI4 Target Interface Address Map table informs the user of the address offsets, size, and direction for the Module Control (start and finish registers), and the three function arguments which are each 16 array elements (SIZE) x 4 bytes per element (int) = 64 bytes.
====== 4. AXI4 Target Interface Address Map ======
Compatibility of HLS accelerator with reference SoC features: Yes.
Default base address in reference SoC: 0x70000000.
| Accelerator Function: vector_add_axi_target_memcpy (Address Space Range: 0x100) |
| Argument | Address Offset | Size [Bytes] | Direction |
| Module Control | 0x008 | 4 | inout |
| a | 0x040 | 64* | input |
| b | 0x080 | 64* | input |
| result | 0x0c0 | 64* | output |
* On PolarFire SoC devices, it is recommended to use the PDMA engine
for data transfer when the transfer size is bigger than 16KB, and use
the memcpy driver functions when the transfer size is smaller than 16KB.
See memcpy and dma transfer driver functions in
Figure 6‑16 Compatibility with Reference SoC Features and Address Space of Accelerator’s Module Control and Arguments
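The address map can be read as "base address plus offset". The sketch below encodes the values from the report and derives the absolute addresses, checks that each region fits in the 0x100 range, and applies the 16KB guideline from the footnote. All type and function names are assumptions for this example. The footnote does not say what to do at exactly 16KB, so this sketch arbitrarily picks memcpy in that case.

```cpp
#include <cstddef>
#include <cstdint>

// Values copied from the report in Figure 6-16.
constexpr std::uint64_t kBaseAddr = 0x70000000;  // default base in the reference SoC
constexpr std::uint64_t kAddrRange = 0x100;      // address space range of the accelerator

struct Region {
    std::uint64_t offset;  // address offset from the base
    std::size_t bytes;     // size in bytes
};

constexpr Region kModuleControl{0x008, 4};
constexpr Region kArgA{0x040, 64};
constexpr Region kArgB{0x080, 64};
constexpr Region kArgResult{0x0c0, 64};

// Absolute address the MSS uses to reach a region.
constexpr std::uint64_t absolute_address(Region r) {
    return kBaseAddr + r.offset;
}

// True if the region lies entirely inside the accelerator's address space.
constexpr bool fits(Region r) {
    return r.offset + r.bytes <= kAddrRange;
}

enum class Transfer { Memcpy, Dma };

// Guideline from the report: DMA above 16KB, memcpy below.
// Exactly 16KB is treated as memcpy here (an assumption).
constexpr Transfer recommended_transfer(std::size_t bytes) {
    return bytes > 16 * 1024 ? Transfer::Dma : Transfer::Memcpy;
}

static_assert(fits(kArgResult), "result must end inside the 0x100 range");
```

For the 64-byte arrays in this example, the guideline points to the memcpy driver functions.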
You can find the generated Verilog code in
Figure 6‑17 Finding the SmartHLS-Generated Verilog in the Project Explorer
If you open the Verilog file you will see the clk, reset, and AXI Target interface port (axi4target) as shown in Figure 6‑18.
module vector_add_axi_target_memcpy_top
input clk,
input reset,
output axi4target_arready,
input axi4target_arvalid,
input [8 - 1:0] axi4target_araddr,
input [1 - 1:0] axi4target_arid,
input [1:0] axi4target_arburst,
Figure 6‑18 Snippet of vector_add_soc_vector_add_axi_target_memcpy.v
Now we can simulate the Verilog RTL hardware with QuestaSim to find out the number of cycles needed to execute the circuit – the cycle latency.
Click on the SW/HW Co-Simulation icon in the toolbar. SW/HW co-simulation will simulate the generated Verilog module, vector_add_axi_target_memcpy_top, in RTL using QuestaSim, while running the rest of the program, main, in software. The co-simulation flow allows us to simulate and verify the SmartHLS-generated hardware without writing a custom RTL testbench.
In the Console window, you will see various messages printed by QuestaSim related to loading simulation models for the hardware. The hardware may take a few minutes to simulate. We want to focus on the messages near the end of the simulation which will look like this:
# run 1000000000000000ns
# Running SW/HW co-simulation...
# Initializing AXI target input arguments at cycle = 0
# AXI target initialization: Writing argument "a" at cycle = 0
# AXI target initialization: Writing argument "b" at cycle = 0
# Finished initializing of AXI target input arguments at cycle = 0
# Starting DUT using AXI target interface CSR at cycle = 0
# --- vector_add_axi_target_memcpy_top Call 0: start at cycle = 1
# Polling AXI target interface CSR for finish signal at cycle = 1
# ...
# Received AXI target interface CSR finish signal at cycle = 16
# 1 / 1 function calls completed.
# --- vector_add_axi_target_memcpy_top Call 0: finish at cycle = 17, total latency = 16
# Retrieving AXI target output arguments at cycle = 18
# AXI target retrieval: Reading argument "result" at cycle = 18
# Finished retrieving AXI target output arguments at cycle = 18
# vector_add_axi_target_memcpy_top execution time (cycles): 16
# Number of calls: 1
# vector_add_axi_target_memcpy_top simulation time (cycles): 18
# ** Note: $finish :
# Time: 865 ns Iteration: 1 Instance: /cosim_tb
# End time: 12:23:02 on Aug 07,2024, Elapsed time: 0:00:04
# Errors: 0, Warnings: 0, Suppressed Warnings: 13
Info: Verifying RTL simulation
Retrieving hardware outputs from RTL simulation for vector_add_axi_target_memcpy function call 1.
| Top-Level Name | Number of calls | Simulation time (cycles) | Call Latency (min/max/avg) | Call II (min/max/avg) |
| vector_add_axi_target_memcpy_top | 1 | 18 | 16 (single call) | N/A (single call) |
Simulation time (cycles): 18
SW/HW co-simulation: PASS
Figure 6‑19 Sample CoSim Results
The simulation printed “SW/HW co-simulation: PASS”, which indicates that the RTL generated by SmartHLS matches the software model.
The co-simulation flow uses the return value from the main software function to determine whether the co-simulation has passed. If the main function returns 0, then the co-simulation will PASS; otherwise, a non-zero return value will FAIL. Please make sure that your main function always follows this convention and returns 0 if the top-level function tests are all successful.
Click the icon on the toolbar to Synthesize Hardware to FPGA. SmartHLS will run Libero synthesis and place & route on the generated hardware accelerator.
Once the command completes, SmartHLS will open the summary.results.rpt report file. SmartHLS will summarize the resource usage and Fmax results reported by Libero® after place and route. You should get results similar to those shown below in Figure 6‑20. Your numbers may differ slightly, depending on the version of SmartHLS and Libero® you are using. This tutorial used Libero® SoC v2024.2. The timing results and resource usage might also differ depending on the random seed used in the Libero tool.
====== 2. Timing Result of HLS-generated IP Core (top-level module: vector_add_axi_target_memcpy_top) ======
| Clock Domain | Target Period | Target Fmax | Worst Slack | Period | Fmax |
| clk | 10.000 ns | 100.000 MHz | 6.275 ns | 3.725 ns | 268.456 MHz |
The reported Fmax is for the HLS core in isolation (from Libero's post-place-and-route timing analysis).
When the HLS core is integrated into a larger system, the system Fmax may be lower depending on the critical path of the system.
====== 3. Resource Usage of HLS-generated IP Core (top-level module: vector_add_axi_target_memcpy_top) ======
| Resource Type | Used | Total | Percentage |
| Fabric + Interface 4LUT* | 1246 + 336 = 1582 | 254196 | 0.62 |
| Fabric + Interface DFF* | 365 + 336 = 701 | 254196 | 0.28 |
| I/O Register | 0 | 432 | 0.00 |
| User I/O | 0 | 144 | 0.00 |
| uSRAM | 22 | 2352 | 0.94 |
| LSRAM | 2 | 812 | 0.25 |
| Math | 0 | 784 | 0.00 |
* Interface 4LUTs and DFFs are occupied due to the uses of LSRAM, Math, and uSRAM.
Number of interface 4LUTs/DFFs = (36 * #.LSRAM) + (36 * #.Math) + (12 * #.uSRAM) = (36 * 2) + (36 * 0) + (12 * 22) = 336.
Figure 6‑20 Timing and Resource Usage Results
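The figures in the report are related by simple arithmetic, which the sketch below reproduces. The function names are chosen for this example. The interface 4LUT/DFF formula is the one quoted in the report footnote, and the 16-cycle latency is the value reported by co-simulation earlier.

```cpp
// Fmax in MHz for a clock period given in nanoseconds.
constexpr double fmax_mhz(double period_ns) {
    return 1000.0 / period_ns;
}

// Worst slack = target period minus achieved period (both in ns).
constexpr double slack_ns(double target_ns, double achieved_ns) {
    return target_ns - achieved_ns;
}

// Interface 4LUTs/DFFs occupied by hard blocks, per the report footnote:
// (36 * #LSRAM) + (36 * #Math) + (12 * #uSRAM).
constexpr int interface_cells(int lsram, int math, int usram) {
    return 36 * lsram + 36 * math + 12 * usram;
}

// Wall-clock latency in ns for a cycle count at a given clock period.
constexpr double latency_ns(int cycles, double period_ns) {
    return cycles * period_ns;
}
```

With the values above, a 3.725 ns period gives roughly 268.456 MHz, 2 LSRAMs and 22 uSRAMs give 336 interface cells, and the 16-cycle latency corresponds to 160 ns at the 10 ns target clock period.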
SmartHLS generates C++ driver functions that can be used to control the generated hardware from an attached processor. This accelerator driver code can be found under hls_output in the accelerator_drivers directory as shown in Figure 6‑21.
Figure 6‑21 Accelerator Driver Files Location
The header file, <PROJ_NAME>_accelerator_driver.h, in that directory lists the user-callable functions that can be used to control each HLS accelerator, while the <PROJ_NAME>_accelerator_driver.cpp file implements the driver functions. Driver functions are generated for arguments and module control if they are configured to use the AXI4 target interface. Figure 6‑22 summarizes the different categories of driver functions. Please visit the Driver Functions for AXI4 Target section of our user guide for a more detailed explanation.
Example Function | Usage |
Module Control Driver Functions |
int MyTopFunc_is_idle(void *virt_addr) | Returns 1 if the HLS module is idle (or has finished the last invocation). |
void MyTopFunc_start(void *virt_addr) | This function starts the SmartHLS module. Input arguments, including the module's memory-mapped virtual address, are expected to have been set before this function is called. |
RETURN_TYPE MyTopFunc_join(void *virt_addr) | A blocking function that waits for the completion of the HLS module and returns the return value of the HLS module (if not void). |
Scalar Argument Driver Functions |
void MyTopFunc_write_MyScalarArg(TYPE val, void *virt_addr) | This function writes the value 'val' to the scalar argument MyScalarArg. This essentially causes an AXI Memory Map write transaction into the SmartHLS module's on-chip storage. |
TYPE MyTopFunc_read_MyScalarArg(void *virt_addr) | This function reads the value of MyScalarArg. This causes an AXI Memory Map read transaction from the SmartHLS module's on-chip storage. |
Pointer Argument Driver Functions - memcpy |
void MyTopFunc_memcpy_write_MyPtrArg(void* MyPtrArg, uint64_t byte_size, void *virt_addr); void MyTopFunc_memcpy_read_MyPtrArg(void* MyPtrArg, uint64_t byte_size, void *virt_addr) | These functions perform memory-mapped write/read operations (using the standard memcpy function). The CPU itself copies the data between its memory, as pointed to by MyPtrArg, and the SmartHLS module's on-chip storage. The total size to transfer is defined by the 'byte_size' argument. These functions do NOT use DMA. |
Pointer Argument Driver Functions - DMA |
void MyTopFunc_dma_write_MyPtrArg(void* MyPtrArg, uint64_t byte_size, void *virt_addr); void MyTopFunc_dma_read_MyPtrArg(void* MyPtrArg, uint64_t byte_size, void *virt_addr) | These functions perform memory-mapped write/read operations using the DMA engine in the HSS to move data between the CPU's memory at MyPtrArg and the SmartHLS module's on-chip storage. The total size to transfer is defined by the 'byte_size' argument. |
AXI-Initiator Argument’s Pointer Address Driver Function |
void MyTopFunc_write_MyPtrArg_ptr_addr(void* arg_virt_addr, void *virt_addr) | This function sets the address for MyPtrArg using 'arg_virt_addr'. 'arg_virt_addr' is a virtual address and internally it will be mapped to a physical address before sending it to the SmartHLS module, which uses that address to access the content of MyPtrArg. The 'virt_addr' argument is the memory-mapped virtual base address of the top-level module. When the SmartHLS project's type is set to Icicle_SoC, the driver is assumed to run on a Linux Operating System and the CPU's memory referenced by the pointer argument MyPtrArg must be allocated using hls_malloc() function and released using hls_free(). |
Top-Level Driver Functions |
RETURN_TYPE MyTopFunc_hls_driver(..., uint32_t base_addr = MyTopFunc_BASE_ADDR) | This blocking function initializes all input argument data, starts the HLS module, waits for its completion, and retrieves output argument data and return value. It can be used as a direct replacement for the original top-level function. The arguments and return type are the same as the top-level function’s. |
void MyTopFunc_write_input_and_start(..., void *virt_addr) | This function initializes all input argument data and starts the SmartHLS module. It is a non-blocking call that can be used to start the SmartHLS module and continue to execute other parts of the software while the SmartHLS module is running. The arguments of this function include the input arguments of the top-level function. |
RETURN_TYPE MyTopFunc_join_and_read_output(..., void *virt_addr) | This blocking function waits for the SmartHLS module to finish the execution, and retrieves output argument data and return value (if not void). The arguments are the same as those of the top-level function. |
Figure 6‑22 Summary of Driver Functions
Open vector_add_soc_accelerator_driver.h from the location shown in Figure 6‑21. Lines 10 and 11 define the base address and the size of the address space that the vector-add accelerator occupies. These macros have the same values as those shown in the report (see Figure 6‑23) and can be modified when incorporating these driver functions into your own hardware system if the accelerator base address changes.
====== 4. AXI4 Target Interface Address Map ======
Compatibility of HLS accelerator with reference SoC features: Yes.
Default base address in reference SoC: 0x70000000.
| Accelerator Function: vector_add_axi_target_memcpy (Address Space Range: 0x100) |
| Argument | Address Offset | Size [Bytes] | Direction |
| Module Control | 0x008 | 4 | inout |
| a | 0x040 | 64* | input |
| b | 0x080 | 64* | input |
| result | 0x0c0 | 64* | output |
Figure 6‑23 Module Base Address and Span in Header File
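To make the address map concrete, the following sketch recomputes the absolute address of each entry from the base address and offsets listed in the report. Only the numeric values come from the report; the constant and function names are hypothetical and do not appear in the generated driver header.

```cpp
#include <cassert>
#include <cstdint>

// Values copied from the address map report. The names are illustrative
// only and are not the macros defined in the generated header.
constexpr std::uint32_t kBaseAddr     = 0x70000000; // default base address in reference SoC
constexpr std::uint32_t kSpan         = 0x100;      // address space range of the accelerator
constexpr std::uint32_t kCtrlOffset   = 0x008;      // Module Control register
constexpr std::uint32_t kAOffset      = 0x040;      // argument a
constexpr std::uint32_t kBOffset      = 0x080;      // argument b
constexpr std::uint32_t kResultOffset = 0x0c0;      // argument result
constexpr std::uint32_t kArgBytes     = 64;         // size listed for each argument

// Absolute address of a register or buffer inside the accelerator's window.
constexpr std::uint32_t absolute_address(std::uint32_t base, std::uint32_t offset) {
    return base + offset;
}

// True if [offset, offset + size) lies inside the accelerator's span.
constexpr bool fits_in_span(std::uint32_t offset, std::uint32_t size, std::uint32_t span) {
    return offset + size <= span;
}
```

With these values, the last buffer (result) ends exactly at the end of the 0x100 span, so moving the accelerator only requires changing the base address.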
Open vector_add_soc_accelerator_driver.cpp
Line 207 to line 233 are the control module functions for the
top function.
writes to the
accelerator control register to start
the accelerator. The accelerator will write a 0 to the same control
register when the computation is done, and
runs a busy loop checking
for 0 on that same control register. These functions only control the
starting and waiting for the accelerator. They do NOT pass in the
int vector_add_axi_target_memcpy_is_idle(void *virt_addr) {
    volatile int *acc_start_addr =
        (volatile int *)((char*)virt_addr + 8); // base+8
    return *acc_start_addr == 0;
}

// This is a non-blocking function that starts the computation on the accelerator.
// Any arguments, if any, should be written using the write functions given.
// Use vector_add_axi_target_memcpy_join_and_read_output() to wait for the accelerator to finish and return with the result.
void vector_add_axi_target_memcpy_start(void *virt_addr) {
    // Run accelerator
    volatile int *acc_start_addr =
        (volatile int *)((char*)virt_addr + 8); // base+8
    *acc_start_addr = 1;
}

// This is a blocking function that waits for the computation started by vector_add_axi_target_memcpy_start() to return.
// The return value is the result computed by the accelerator.
void vector_add_axi_target_memcpy_join(void *virt_addr) {
    // Wait for accelerator to finish, acc_start_addr is set to 1 in the start function
    while (!vector_add_axi_target_memcpy_is_idle(virt_addr)) {}
}
Figure 6‑24 Control Module Function of vector_add_axi_target_memcpy Top Module
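The start/poll handshake can be tried on a host machine without a board. The sketch below is a software stand-in and not the generated driver: FakeControlReg, start and join are hypothetical names, and the accelerator is emulated by a countdown that clears the register after a few reads.

```cpp
#include <cassert>

// Hypothetical stand-in for the memory-mapped control register at base+8.
// Real hardware clears the register when the computation is done; here a
// countdown clears it after a fixed number of reads.
struct FakeControlReg {
    int value = 0;
    int reads_until_done = 0;

    void write(int v) {
        value = v;
        if (v == 1) reads_until_done = 3; // arbitrary emulated latency
    }
    int read() {
        if (value == 1 && --reads_until_done == 0) value = 0;
        return value;
    }
};

// Non-blocking: writing 1 starts the (emulated) accelerator.
void start(FakeControlReg &reg) { reg.write(1); }

// Blocking: poll until the register reads 0. Returns how many polls saw
// the accelerator still busy.
int join(FakeControlReg &reg) {
    int busy_polls = 0;
    while (reg.read() != 0) ++busy_polls;
    return busy_polls;
}
```

As in the generated functions, start returns immediately and join is the only call that waits.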
Go to line 236.
is the direct
replacement function for the software version of
. SmartHLS will automatically replace
the body of vector_add_axi_target_memcpy()
by a single call to
when you click “Run
software with accelerators” (covered in the SmartHLS SoC Flow section.)
has the same parameters as vector_add_axi_target_memcpy(), but the parameters are cast to void pointers, since a void pointer can point to any data type.
// This is a blocking function that calls and waits for the accelerator to return.
// The return value is the result computed by the accelerator.
void vector_add_axi_target_memcpy_hls_driver(void* a, void* b, void* result, uint32_t base_addr) {
// Run setup function
void *virt_addr = vector_add_axi_target_memcpy_setup(base_addr);
if (virt_addr == NULL) {
printf("Error: setup function failed for vector_add_axi_target_memcpy");
vector_add_axi_target_memcpy_write_input_and_start(a, b, virt_addr);
vector_add_axi_target_memcpy_join_and_read_output(result, virt_addr);
// This is a non-blocking function that starts the computation on the accelerator.
// Use vector_add_axi_target_memcpy_join() to wait for the accelerator to finish and return with the result.
void vector_add_axi_target_memcpy_write_input_and_start(void* a, void* b, void *virt_addr) {
vector_add_axi_target_memcpy_memcpy_write_a(a, 64, virt_addr);
vector_add_axi_target_memcpy_memcpy_write_b(b, 64, virt_addr);
// This is a blocking function that waits for the computation started by vector_add_axi_target_memcpy_start() to return.
// The return value is the result computed by the accelerator.
void vector_add_axi_target_memcpy_join_and_read_output(void* result, void *virt_addr) {
vector_add_axi_target_memcpy_memcpy_read_result(result, 64, virt_addr);
Figure 6‑25 Top-Level Function for vector_add_axi_target_memcpy Top Module
On line 247, vector_add_axi_target_memcpy_hls_driver()
makes a
call to the non-blocking
function that writes the input arguments and starts the accelerator.
Then, vector_add_axi_target_memcpy_hls_driver()
calls the
blocking vector_add_axi_target_memcpy_join_and_read_output()
function on line 248 to wait for and read back the accelerator’s output.
Users can use
to start the calculation, then execute other computations on the MSS in parallel with the hardware accelerator execution in the FPGA fabric. Later, users can use
to retrieve the results from the hardware accelerator. This is similar to using threads for parallel computation.
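The split into a non-blocking start call and a blocking read-back call can be sketched on a host machine as follows. FakeAccelerator and overlap_example are hypothetical names; the struct is a software stand-in whose two methods mirror the roles of the generated write_input_and_start and join_and_read_output driver functions, and it assumes the accelerator performs an element-wise add of 16 integers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

constexpr std::size_t SIZE = 16;

// Hypothetical software stand-in for the accelerator and its driver calls.
struct FakeAccelerator {
    std::array<int, SIZE> a{}, b{};
    bool busy = false;

    // Non-blocking: copy the inputs in and mark the computation as started.
    void write_input_and_start(const int *in_a, const int *in_b) {
        for (std::size_t i = 0; i < SIZE; ++i) {
            a[i] = in_a[i];
            b[i] = in_b[i];
        }
        busy = true;
    }

    // Blocking: this emulation computes here; real hardware would have been
    // running in the fabric since the start call.
    void join_and_read_output(int *out) {
        for (std::size_t i = 0; i < SIZE; ++i) out[i] = a[i] + b[i];
        busy = false;
    }
};

// Start the accelerator, do unrelated CPU work, then collect the result.
int overlap_example(FakeAccelerator &acc, const int *a, const int *b, int *result) {
    acc.write_input_and_start(a, b);
    int checksum = 0; // independent work done while the accelerator is busy
    for (std::size_t i = 0; i < SIZE; ++i) checksum += a[i];
    acc.join_and_read_output(result);
    return checksum;
}
```

The useful property is that any code placed between the two calls does not depend on the accelerator's result.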
SmartHLS can generate a reference SoC design, with user-specified
partitioning of software running on MSS and hardware accelerators
running on the FPGA fabric. We refer to this as the SmartHLS SoC flow.
No code changes are required to go from IP flow to SoC flow using the
example, since the top-level function is compatible
with the Reference SoC features (see Figure 6‑16). The main
function will run on the MSS in Linux and main
will call the driver
software APIs to run the hardware accelerator.
The SmartHLS SoC flow steps are broken down below (Figure 6‑26). These steps are available to users at the click of a button in the IDE, but we first want to provide further details to give users a better understanding of SmartHLS.
Figure 6‑26 SoC Project Generation Flow
Each box in Figure 6‑26 under User’s Source Code corresponds to a compilation step. Whenever the user code changes, Compile Software to Hardware and Transform C++ source to invoke accelerator become out of date and must be rerun. Following the right-hand path of Figure 6‑26, SmartHLS performs a C++ to C++ source transformation that replaces the body of each function marked by the user with a call that invokes the FPGA accelerator instead of running the code in software. After Compile Software to Hardware has generated the C++ driver code that runs the accelerators from software, SmartHLS cross-compiles the transformed C++ source together with that driver code to run on the RISC-V processor. This completes the software portion needed for running on the board.
Following the left-hand path of Figure 6‑26 and after Compile Software to Hardware has completed, Generate reference SoC with HLS Accelerator(s) invokes SmartDesign to integrate the accelerators using the TCL scripts generated by the previous step. SmartHLS then performs the place and route and generates the bitstream. Once the bitstream is available, SmartHLS programs the board and completes all the hardware prerequisites for running the system on board. Now that the board has been programmed and the software is generated, we can copy the RISC-V software binaries to the board using SSH and run the software with accelerator on the board.
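The effect of the C++ to C++ source transformation on the right-hand path can be illustrated with a small host-side sketch. This is not the code SmartHLS emits: fake_hls_driver is a hypothetical stand-in for the generated driver function, and the function names are invented for the example.

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t SIZE = 16;
int driver_calls = 0; // counts calls to the stand-in driver

// Hypothetical stand-in for a generated *_hls_driver function. On the board
// the driver would hand the data to the accelerator; here it adds in software.
void fake_hls_driver(void *a, void *b, void *result) {
    ++driver_calls;
    const int *pa = static_cast<const int *>(a);
    const int *pb = static_cast<const int *>(b);
    int *pr = static_cast<int *>(result);
    for (std::size_t i = 0; i < SIZE; ++i) pr[i] = pa[i] + pb[i];
}

// Before the transformation: the computation runs in software.
void vector_add_original(int *a, int *b, int *result) {
    for (std::size_t i = 0; i < SIZE; ++i) result[i] = a[i] + b[i];
}

// After the transformation: same signature, body replaced by one driver call.
void vector_add_transformed(int *a, int *b, int *result) {
    fake_hls_driver(a, b, result);
}
```

Because the signature is unchanged, callers such as main do not need to be modified.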
You can run the SmartHLS SoC flow from the top menu of the SmartHLS IDE, under SmartHLS -> RISC-V SoC Features (available for PolarFire SoC only), as shown in Figure 6-27. You can also find the same menu options by clicking the “SoC” button in the toolbar (previously shown in Figure 6‑8).
Figure 6-27: SmartHLS RISC-V SoC Features (available for PolarFire SoC only)
There are two options shown in Figure 6-27 under this menu:
Base SoC with no HLS Accelerators means that we are programming the prebuilt reference bitstream that ships with SmartHLS (Base SoC) and we are only running software on the MSS. The prebuilt FPExpress job file for the Base SoC can be found at
Reference SoC with HLS Accelerator(s) means that SmartHLS performs hardware/software partitioning between software running on the MSS and the FPGA accelerators. SmartHLS will generate a new bitstream (Reference SoC) with the accelerator connected to the MSS over AXI.
There are 3 options under Base SoC with no HLS Accelerators as shown in Figure 6-28. Running later steps (further down) can depend on running previous steps. For example, clicking Run software without accelerators will prompt a dialog asking to run Cross-compile software for RISC-V and Program board with prebuilt bitstream.
Figure 6-28: Base SoC with no HLS Accelerators menu options
There are 7 options under Reference SoC with HLS Accelerator(s) as shown in Figure 6-29. There are more options than in Figure 6-28 because the bitstream is not prebuilt like the Base SoC, so additional steps are needed to generate the Libero design, run RTL synthesis, and run place-and-route.
Figure 6-29: Reference SoC with HLS Accelerator(s) menu options
In the SmartHLS IDE, select SmartHLS -> RISC-V SoC Features (available for PolarFire SoC only) -> Reference SoC with HLS Accelerator(s) -> Generate Libero design, as seen in Figure 6‑30. This command generates a “Reference SoC” Libero project containing the Icicle kit MSS and the generated vector-add accelerator, with the accelerator connected to the MSS. The Libero project is then ready for synthesis, place-and-route and programming onto the board, as you would do for a regular Libero project.
Figure 6‑30 “Reference SoC with HLS Accelerator(s) -\> Generate Libero Design” Menu
Generate Libero design (Figure 6‑30) will create a Libero project that
you can open with Libero in hls_output/soc/Icicle_SoC.prjx
. The
generated SmartDesign hardware system contains the MSS connected via
AXI4 to the vector_add_axi_target_memcpy
accelerator as shown in
Figure 6‑31. Note that the generated accelerator is the same as the one
generated using IP flow in the previous section.
The Reference SoC Libero design is generated in the project directory, as the hls_output/soc/Icicle_SoC.prjx project file. Open this project in Libero to view the reference SoC design. In the Design Hierarchy tab, double-click FIC_0_PERIPHERALS to open the SmartDesign project, as seen in Figure 6‑31.
Figure 6‑31 SmartDesign for Vector-Add SmartHLS Generated Reference SoC
In the SmartDesign project, the accelerator IP
is instantiated on the right and
connected through an AXI interconnect IP (center) to the MSS. This is
the path through which the software main function running on the
processor communicates with the accelerator IPs as well as the path for
data transfers between the DDR and the accelerator. Any additional
accelerators would be connected to the same AXI interconnect. For more
information on the architecture of the Reference SoC, please see our
We can simplify the SmartDesign visualization by clicking Hide Nets and Compress Layout, then dragging the HLS_AXI4Interconnect_0 and vector add modules as shown in Figure 6‑32. We have highlighted the AXI4 interface connections between the MSS and the vector add accelerator.
Figure 6‑32 Simplified SmartDesign for Generated Reference SoC with Vector Add Accelerator
Now close the Libero project and go back to the SmartHLS IDE.
In this section, we will cover the three SoC data transfer methods supported between the MSS and the hardware accelerator: CPU Copy, DMA Copy, and Accelerator Direct Access. The transfer method is specified for each function argument using the interface type pragmas. AXI interfaces carry data between the MSS and the accelerator in both directions. Each function argument can be configured with a different interface type depending on the application; for example, larger arguments could use DMA transfers.
When sharing data between the MSS and the FPGA fabric, we need to transfer data from the MSS main memory in off-chip DDR memory. For each pointer argument of an accelerator, data can be copied from DDR memory to accelerator’s on-chip memory buffer, or the data can be accessed directly in DDR by the accelerator. Any access to DDR, whether data is copied or accessed directly, goes through the MSS data cache to maintain cache coherency. See the SoC Data Transfer Methods user guide section for further reference.
In CPU Copy mode, the MSS handles the transfer of data between the DDR and the accelerator. The MSS requests the data from DDR and passes the data through the AXI4 interconnect to the accelerator. The accelerator has an on-chip buffer storing the received data. This is the recommended mode when transferring data under 16 kB in size. Figure 6‑33 shows how data travels between the accelerator and the DDR.
Figure 6‑33 CPU Copy Data Path
CPU Copy mode occurs when a function argument interface type is AXI target, for example in the vector_add_axi_target_memcpy top-level function (see code previously in Figure 6‑12):
#pragma HLS interface argument(a) type(axi_target)
In DMA Copy mode, the MSS will use the hardened DMA engine (PDMA) to transfer data between the DDR and the accelerator (Figure 6‑34). This is the recommended mode when transferring data over 16 kB in size.
Figure 6‑34 DMA Copy Data Path
DMA Copy mode occurs when a function argument interface type is AXI
Target with the DMA sub-option specified. For example, in the function
as highlighted on line 85 shown in
Figure 6‑35. Note that the generated RTL for this accelerator is no
different from vector_add_axi_target_memcpy
. The only difference
is how the MSS transfers data to the accelerator. Since the size of the
array is only 16 in this example, the data transfer time doesn’t benefit
from using the DMA. We wrote this function for illustrative purposes.
81 void vector_add_axi_target_dma(int *a, int *b, int *result) {
82 #pragma HLS function top
83 #pragma HLS interface control type(axi_target)
84 #pragma HLS interface argument(a) type(axi_target) dma(true) num_elements(SIZE)
85 #pragma HLS interface argument(b) type(axi_target) dma(true) num_elements(SIZE)
86 #pragma HLS interface argument(result) type(axi_target) dma(true) \
87 num_elements(SIZE)
88 vector_add_sw(a, b, result);
89 }
Figure 6‑35 AXI Target DMA Pragma
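The size guideline for choosing between CPU Copy and DMA Copy can be written down as a small helper. This is only a sketch of the rule of thumb stated above, not part of SmartHLS: the names are hypothetical, 16 kB is taken as 16 * 1024 bytes, and treating a transfer of exactly 16 kB as a DMA case is an assumption, since the text only covers sizes under and over that value.

```cpp
#include <cassert>
#include <cstddef>

enum class TransferMode { CpuCopy, DmaCopy };

// 16 kB guideline from the text, assumed here to mean 16 * 1024 bytes.
constexpr std::size_t kDmaThresholdBytes = 16 * 1024;

// Recommend CPU Copy below the threshold and DMA Copy otherwise.
// The behavior at exactly the threshold is an assumption of this sketch.
constexpr TransferMode recommend_transfer_mode(std::size_t bytes) {
    return bytes < kDmaThresholdBytes ? TransferMode::CpuCopy
                                      : TransferMode::DmaCopy;
}
```

For the 16-element integer vectors in this example, each argument is 64 bytes, which falls well inside the CPU Copy range.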
Accelerator Direct Access mode allows the hardware accelerator to read and write DDR directly. Unlike the AXI Target interface, the AXI Initiator interface does not receive the data directly from the processor or the DMA. Instead, the accelerator has two AXI interfaces:
- AXI Target: The accelerator receives a pointer to where the argument is stored (e.g. an address in DDR memory) through the AXI Target interface from the MSS (Figure 6‑36).
- AXI Initiator: The accelerator then uses the AXI Initiator interface to access DDR memory through the MSS cache (Figure 6‑37).
The memory accesses are cache coherent between the accelerator and MSS since they share the L2 cache, but L1 cache could be invalidated. Since the data is accessed directly from DDR without copying, no additional on-chip memory is needed for the accelerator.
Figure 6‑36 MSS Sends the Pointer to Accelerator AXI Target Interface in Direct Access Mode
Figure 6‑37 Accelerator AXI Initiator Interface Requests Data Directly from DDR in Direct Access Mode
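The difference between copying data into the accelerator and handing it a pointer can be emulated in plain C++. Both structs below are hypothetical software stand-ins, not generated code: CopyAccelerator keeps its own buffers, as in the copy modes, while DirectAccelerator stores only the three addresses and works on the caller's memory.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t SIZE = 16;

// Copy style: the accelerator owns buffers and data is copied in and out.
struct CopyAccelerator {
    int a_buf[SIZE], b_buf[SIZE], result_buf[SIZE];

    void run(const int *a, const int *b, int *result) {
        std::memcpy(a_buf, a, sizeof(a_buf));
        std::memcpy(b_buf, b, sizeof(b_buf));
        for (std::size_t i = 0; i < SIZE; ++i) result_buf[i] = a_buf[i] + b_buf[i];
        std::memcpy(result, result_buf, sizeof(result_buf));
    }
};

// Direct access style: only the addresses are written to the accelerator,
// which then reads and writes the caller's memory itself.
struct DirectAccelerator {
    const int *a_ptr = nullptr;
    const int *b_ptr = nullptr;
    int *result_ptr = nullptr;

    void set_pointers(const int *a, const int *b, int *result) {
        a_ptr = a;
        b_ptr = b;
        result_ptr = result;
    }
    void run() {
        for (std::size_t i = 0; i < SIZE; ++i) result_ptr[i] = a_ptr[i] + b_ptr[i];
    }
};
```

The direct access stand-in holds three pointers instead of three 64-byte buffers, which mirrors the point that no extra on-chip memory is needed in that mode.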
Look at line 119 of vector_add_soc.cpp (Figure 6‑38). We defined another top-level function, vector_add_axi_initiator, that uses the AXI initiator interface for its arguments. Line 123 defines the default interface type for all arguments and control. If no default were specified, the default interface type for all arguments and control would be simple. With the default on line 123 in place, the interfaces for a, b and result would default to axi_target if they were not specified individually. So, when planning to use the SmartHLS SoC reference design, be sure to set the default or specify each interface as either axi_target or axi_initiator.
119 void vector_add_axi_initiator(int *a, int *b, int *result) {
120 // Note that both the control and ptr_addr_interface are redundant since the
121 // default is already axi_target
122 #pragma HLS function top
123 #pragma HLS interface default type(axi_target)
124 #pragma HLS interface control type(axi_target)
125 #pragma HLS interface argument(a) type(axi_initiator) \
126 ptr_addr_interface(axi_target) num_elements(SIZE)
127 #pragma HLS interface argument(b) type(axi_initiator) num_elements(SIZE)
128 #pragma HLS interface argument(result) type(axi_initiator) num_elements(SIZE)
129 vector_add_sw(a, b, result);
130 }
Figure 6‑38 AXI Initiator Example
Line 124 sets the control interface type to axi_target, which is redundant since the default interface type was already defined as axi_target on line 123. The interface type for arguments a, b and result
is set to
on lines 125-128. The ptr_addr_interface sub-option on line 126 specifies the type of interface used to receive the pointer address for accessing the argument. In this case, the pointer address of argument "a" is received through the AXI target interface, as shown in Figure 6‑36, and this pointer address is then used to access the data for argument "a" through the AXI initiator interface, as shown in Figure 6‑37. If ptr_addr_interface is not specified, as for argument b, SmartHLS uses the default interface type defined on line 123 (axi_target). See the AXI4 Initiator section of the user guide.
If users specify the ptr_addr_interface
or any other interface type as
, then the accelerator is not compatible with Reference SoC
features and they would have to manually connect the input for the
interface using a TCL script or in Libero.
We will now change the top-level accelerator
argument interface from AXI target to AXI initiator. Go to line 21 of
, and change the definition of INTERFACE
(highlighted in Figure 6‑39) to AXI_INITIATOR.
14 // Choose which interface to compile
16 #define AXI_TARGET_MEMCPY 0
17 #define AXI_TARGET_DMA 1
18 #define AXI_INITIATOR 2
20 #ifndef INTERFACE
22 #endif
Figure 6‑39 Pragma for Choosing Example’s Interface Type
Go to the SoC pulldown menu and select Reference SoC with HLS Accelerator(s) -> Generate Libero Design. You should see a pop-up window (Figure 6‑40) asking for confirmation to run Compile Software to Hardware. Click Yes to continue. If a later change to the source code does not affect the generated hardware, such as adding a comment, you can choose Skip above step(s) to save compilation time.
Figure 6‑40 Compilation Confirmation Pop-up Window
will be generated once the compilation has finished. You can see that
the RTL interface summary (Figure 6‑41) is drastically different from
s (Figure 6‑14). The pointer addresses of a
, b
, and result
are passed to the accelerator through the axi_target interface.
The accelerator then reads the data directly from DDR using the
given addresses and writes the result directly back to DDR memory.
====== 1. RTL Interface ======
| RTL Interface Generated by SmartHLS |
| C++ Name | Interface Type | Signal Name | Signal Bit-width | Signal Direction |
| | Clock & Reset | clk (positive edge) | 1 | input |
| | | reset (synchronous active high) | 1 | input |
| | Control via AXI4 Target | axi4target_* | | |
| a | AXI4 Initiator | axi4initiator_* | | |
| | with ptr_addr_interface(axi_target) | axi4target_* | | |
| b | AXI4 Initiator | axi4initiator_* | | |
| | with ptr_addr_interface(axi_target) | axi4target_* | | |
| result | AXI4 Initiator | axi4initiator_* | | |
| | with ptr_addr_interface(axi_target) | axi4target_* | | |
Figure 6-41 An Example RTL Interface Generated Table for AXI Initiator
In vector_add_soc.cpp, on lines 144-147 of the main function, we use the hls_malloc function to allocate physically contiguous memory regions for the data passed to and from the hardware accelerator, as shown in Figure 6‑42.
143 // Allocating memory from DDR memory
144 int *a = (int *)hls_malloc(SIZE * sizeof(int));
145 int *b = (int *)hls_malloc(SIZE * sizeof(int));
146 int *result_hw = (int *)hls_malloc(SIZE * sizeof(int));
147 int *result_sw = (int *)hls_malloc(SIZE * sizeof(int));
Figure 6‑42 Allocating Memory in the DDR for Vectors
DMA Copy mode and Accelerator Direct Access require the memory to be
allocated using the hls_malloc
function from the SmartHLS Memory
to keep data in physically contiguous memory for the DMA engine. Using
prevents splitting data across different virtual memory
pages in physical memory. The accelerators and DMA engine do not perform
translation from virtual to physical memory addresses.
Unlike DMA Copy mode and Accelerator Direct Access, CPU Copy mode does
not require the use of hls_malloc
for allocating the argument data. In
CPU Copy mode, the MSS controls all data that is read/written to
accelerators and DDR and the MSS will automatically handle the virtual
memory address translations.
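To see why ordinary allocations can be a problem for the DMA engine and for direct access, it helps to count how many virtual memory pages a buffer touches. The helper below is a hypothetical illustration, not part of the SmartHLS memory library, and the 4096-byte page size is an assumption about the Linux configuration. A buffer that touches more than one page is only guaranteed to be contiguous in virtual memory, so hardware that does no address translation cannot rely on it being contiguous in physical memory.

```cpp
#include <cassert>
#include <cstdint>

// Assumed page size; the actual value depends on the Linux configuration.
constexpr std::uint64_t kPageSize = 4096;

// Number of pages touched by the byte range [virt_addr, virt_addr + size_bytes).
constexpr std::uint64_t pages_spanned(std::uint64_t virt_addr,
                                      std::uint64_t size_bytes,
                                      std::uint64_t page_size = kPageSize) {
    return size_bytes == 0
               ? 0
               : (virt_addr + size_bytes - 1) / page_size - virt_addr / page_size + 1;
}
```

For example, a 64-byte vector that happens to start 16 bytes before a page boundary touches two pages, even though it is far smaller than a page.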
This section uses a PolarFire SoC Icicle Kit. The Icicle Kit is a low-cost development platform featuring a hardened five-core RISC-V processor, capable of running Linux, a PolarFire SoC FPGA, and many peripherals. For more details on the Icicle Kit, and information on how to obtain one, please see the product page.
In this part of the training, we will run the vector add application on the Icicle Kit board. We will generate the reference SoC, program the bitstream to the PolarFire SoC FPGA, and run the accelerator driver software on the MSS.
Users without an Icicle Kit can still follow along to learn about how a SoC reference project is generated.
To prepare your Icicle kit for use with SmartHLS, follow the Icicle Setup Instructions and note down the IP address of the board.
Create a new file named Makefile.user by right-clicking vector_add_soc and selecting New -> File (Figure 7‑1).
Figure 7‑1 Creating a New File
Insert a line for your board’s network IP address, as shown in Figure 7‑2.
BOARD_IP = <Your Board IP>
Figure 7‑2 Makefile.user’s Content
Since the makefile
is freshly regenerated every time the SmartHLS IDE
compiles, users must define makefile changes in Makefile.user
for the
changes to take effect. There are several predefined user flags that
SmartHLS reads in Makefile.user
where users can define and modify
options such as compiler and linker flags. For example, users can modify
to append additional C++ compilation flags for their
project. Visit the Makefile
section of our user guide for a full list of predefined user flags and
their uses.
From the SmartHLS menu, select SmartHLS -> RISC-V SoC Features (available for PolarFire SoC only) -> Base SoC with no HLS Accelerators -> Program Board with Prebuilt Bitstream (see Figure 7‑3). SmartHLS will program the prebuilt Base SoC bitstream to the attached Icicle board. After the Icicle board has been successfully programmed, you will see the message in Figure 7‑4.
Figure 7‑3 Program Board with Prebuilt Bitstream Option Menu
programmer '1380218' : device 'MPFS250T_ES' : Executing action PROGRAM PASSED.
programmer '1380218' : Chain programming PASSED.
Chain Programming Finished: Tue Feb 6 15:32:09 2024 (Elapsed time 00:00:58)
o - o - o - o - o - o
The 'run_selected_actions' command succeeded.
The Execute Script command succeeded.
Exported log file C:/Developers/hls_workspace/11/hls_output/FPExpress_project/job.log.
15:32:09 Build Finished (took 1m:13s.617ms)
Figure 7‑4 Program Board Successful
Users can run their program entirely in software
on the MSS without calling the accelerators. This is useful for
verifying the correctness of the software and the MSS system, as well as
profiling the performance of the system. To run only the software on the
board, go to SmartHLS -> RISC-V SoC Features (available for PolarFire
SoC only) -> Base SoC with no HLS Accelerators -> Run software without
accelerators (as shown in Figure 7‑5). SmartHLS will cross-compile the source code for RISC-V, then copy the RISC-V binary to the board over SSH, using the BOARD_IP specified in Makefile.user, and run it there. A correct run prints RESULT: PASS, as seen in Figure 7‑6.
Figure 7‑5 Run Software without Accelerators Option Menu
Info: Running the following targets: soc_base_proj_run
Info: Checking for SmartHLS feature license.
Info: SmartHLS feature license was successfully checked out.
Info: The Programmer ID is not set. All connected programmers will be programmed. If only one of the multiple connected programmers is to be programmed, please specify it using "board.programmerID" in projConfig.json
Info: Waiting on board ready...
Info: Board ready!
Info: Connected (version 2.0, client dropbear_2020.81)
Info: Authentication (publickey) failed.
Info: Authentication (password) successful!
Info: [chan 0] Opened sftp connection (server version 3)
Info: Make dir root@
Info: Copying C:/Developers/hls_workspace/vector_add_soc/hls_output/vector_add_soc.no_accel.elf to root@
Info: Application starting (over ssh root@
Info: Running: pushd .; ./vector_add_soc.no_accel.elf |& tee bin_cl_out.txt; popd
Application output:
~ ~
Info: Application finished!
Info: Copying bin_cl_out.txt from root@ to C:/Developers/hls_workspace/vector_add_soc/hls_output/files
Info: [chan 0] sftp session closed.
Figure 7‑6 Expected Output from Running Software on Board
Now that you have verified that your software program can run correctly
on your Icicle Kit, you can run the software with accelerators that
SmartHLS generates. This software executable is the same as
, but with the calls to
automatically replaced with driver code
to control the accelerator IP on the FPGA fabric.
In the same menu as before, click SmartHLS -> RISC-V SoC Features (available for PolarFire SoC only) -> Reference SoC with HLS Accelerator(s) -> Run software with accelerators (Figure 7‑7). SmartHLS will automatically run all the steps prior to Run software with accelerators: Generate Libero design, RTL synthesis, Place-and-route and generate bitstream, Program board, and Cross-compile software with accelerator drivers.
Figure 7‑7 Run Software with Accelerators Option Menu
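The way a later menu step pulls in the earlier ones can be modeled as a simple ordered list. The sketch below is an illustration only: it treats each comma-separated item in the paragraph above as one step and assumes a strictly linear order, both of which are assumptions about how the menu is divided, and steps_to_run is a hypothetical helper rather than anything SmartHLS provides.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Step names taken from the text, in the order they are listed there.
const std::vector<std::string> kSteps = {
    "Generate Libero design",
    "RTL synthesis",
    "Place-and-route and generate bitstream",
    "Program board",
    "Cross-compile software with accelerator drivers",
    "Run software with accelerators",
};

// Returns every step up to and including the target, or an empty list if
// the target is not a known step.
std::vector<std::string> steps_to_run(const std::string &target) {
    std::vector<std::string> out;
    for (const auto &step : kSteps) {
        out.push_back(step);
        if (step == target) return out;
    }
    return {};
}
```

Under this model, asking for the last step returns the whole list, which matches the behavior described for Run software with accelerators.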
If everything works correctly, you will see the output from the executable running on the board, and the output should match Figure 7‑6. In this case, the vector add computation is being performed by the hardware accelerator generated by SmartHLS.
Until this point, we have been targeting the Reference SoC Libero project generated by SmartHLS. This allows users with no experience using FPGAs to port C++ code to PolarFire SoC devices and offload parts of the software to the FPGA fabric without knowing much about Libero’s TCL commands, Verilog or VHDL. SmartHLS provides a fully automated flow.
However, users with FPGA knowledge may already have an existing Libero SoC project, which could be different from the SmartHLS reference SoC. They may also have their own Linux image because of differences in the device tree or simply because they have different software loaded on the image. In addition to the differences in SoC design and Linux image, advanced Libero users may have their own compilation flows, for example scripts that run specific tasks before, during or after calling Libero, with custom steps and different option settings for synthesis, place-and-route, bitstream generation, etc.
This section will show how SmartHLS can be used as a plugin into a custom compilation flow, and how to integrate the SmartHLS generated hardware modules into an existing SoC design. As an example, we show how you can integrate a SmartHLS system into the PolarFire SoC Icicle Kit Reference Design created by the Embedded Software Systems team, which is shipped with SmartHLS.
Custom SoC designs can have many different configurations. SmartHLS defines a set of TCL parameters, as shown underlined in Figure 8‑1, to simplify the automatic integration of SmartHLS-generated modules into a custom SoC.
Figure 8‑1 TCL Parameters for Interfacing between a custom SoC and SmartHLS Subsystem
Figure 8‑2 presents a brief description of each SoC integration parameter in Figure 8‑1 that needs to be passed on to SmartHLS to be able to automatically integrate the generated accelerators into a custom SoC.
TCL Parameter | Description |
SOC_BD_NAME | The name of the SmartDesign project into which the SmartHLS IP modules will be integrated. |
SOC_RESET | Identifies the reset signal to be used. |
SOC_CLOCK | Identifies the clock to use for the SmartHLS IP modules. Currently, the same clock is used for all modules. |
SOC_AXI_INITIATOR | Identifies the downstream AXI interface to use. This is used for register control and any data write and read transfers initiated by the CPU down to the SmartHLS IP modules. |
SOC_AXI_TARGET | Identifies the upstream AXI interface to use. This is used for writing and reading transfer requests issued by the SmartHLS IP modules targeting the CPU memory. |
SOC_CPU_MEM_SIZE | This is the size of the CPU memory window used when the SmartHLS IP modules act as AXI initiators. |
SOC_CPU_MEM_BASE_ADDRESS | This base address identifies the beginning of a memory window in the CPU physical memory address space that the SmartHLS IP modules can use when they are AXI Initiators. This address is used to configure the HLS AXI interconnect and allow transactions to move upstream towards the CPU’s memory. |
SOC_FABRIC_SIZE | Determines the size of the memory window used for mapping control registers and on-chip buffers for ALL modules in each SmartHLS project instantiated on the fabric. The size can be larger than what a specific function may need. For example, a 2MB memory window could be reserved but the IP module may only use half of it, leaving the other half for future growth. Reserving a larger window does not mean more on-chip memory will be used. |
SOC_FABRIC_BASE_ADDRESS | This is the base address of a memory window in the CPU memory address space that is reserved for all SmartHLS modules instantiated on the FPGA fabric. Control registers and on-chip memory buffers are allocated and mapped from this memory window. This address is also used to configure the HLS AXI interconnect to allow AXI transactions to move downstream from the CPU towards the SmartHLS IP modules. |
Figure 8‑2 Description of the TCL Parameters.
These parameters allow SmartHLS not just to convert C++ functions into IP cores, but also to:
- Create SmartDesign HDL+ wrappers
- Instantiate an AXI interconnect and configure its address decoding
- Attach the HDL+ cores to the AXI interconnect
- Connect the clock signal (same clock for all HW modules and interconnect)
- Connect the reset signal
- Connect to the CPU via AXI channels (Initiator & Target)
Users can perform these steps by hand in the GUI or using TCL commands. However, because SmartHLS makes it very easy to add functions to and remove functions from the system, an automated way of performing these steps is very helpful.
SmartHLS uses the Icicle Kit Reference Design as the base design to which the accelerators are automatically attached. For the Reference Design, these parameters have default values and only need to be adjusted for custom SoCs. These default parameters are specified in the SmartHLS config.tcl file, for example:
# Parameters used for SoC integration
# Using FIC-0 Address range: 0x7000_0000 - 0x7040_0000 (4MB)
set_parameter SOC_FABRIC_BASE_ADDRESS 0x70000000
set_parameter SOC_FABRIC_SIZE 0x400000
# Starting from Cached memory base address (0x80000000) all the way up to just
# before FIC-1 (~1.7GB)
# NOTE. In the Icicle board not all the memory is contiguous for buffer allocation.
# The SW driver should know about those memory partitions. On the hardware side,
# it's just easier to set the max address range and rely on the software driver
# to not program memory accesses in invalid regions.
set_parameter SOC_CPU_MEM_BASE_ADDRESS 0x80000000
set_parameter SOC_CPU_MEM_SIZE 0x60000000
Figure 8‑3 Default Parameter Values for Integrating SmartHLS
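The default values can be sanity-checked with a few lines of arithmetic. The sketch below is a hypothetical host-side check, not something SmartHLS runs: it computes the end address of the fabric window and of the CPU memory window from the base and size parameters, and confirms that the two windows do not overlap. The struct and function names are invented for the example.

```cpp
#include <cassert>
#include <cstdint>

// A memory window described by its base address and size in bytes.
struct Window {
    std::uint64_t base;
    std::uint64_t size;
};

// First address past the end of the window.
constexpr std::uint64_t end_of(Window w) { return w.base + w.size; }

// True if the address lies inside the window.
constexpr bool contains(Window w, std::uint64_t addr) {
    return addr >= w.base && addr < end_of(w);
}

// True if the two windows share at least one address.
constexpr bool overlaps(Window x, Window y) {
    return x.base < end_of(y) && y.base < end_of(x);
}

// Default values from the config.tcl excerpt.
constexpr Window kFabric{0x70000000, 0x400000};   // SOC_FABRIC_BASE_ADDRESS / SOC_FABRIC_SIZE
constexpr Window kCpuMem{0x80000000, 0x60000000}; // SOC_CPU_MEM_BASE_ADDRESS / SOC_CPU_MEM_SIZE
```

With the defaults, the fabric window ends at 0x70400000, matching the FIC-0 range given in the comment, and the CPU memory window ends at 0xE0000000.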
Users can change the default parameters by creating a
file inside their HLS project. For example, if we wanted to change the
to start at 0x70100000
, we would include the
following in our custom_config.tcl file:
set_parameter SOC_FABRIC_BASE_ADDRESS 0x70100000
Figure 8‑4 Custom Parameter Values for Integrating SmartHLS
Note: If the GUI is not used, users must add the following line to their Makefile.user:
LOCAL_CONFIG += -legup-config=custom_config.tcl
Figure 8‑5 Additional Makefile Line
This change works with our current Icicle Kit Reference Design (though it is not needed, as the default parameters work fine.) This exercise is used to demonstrate how given a different reference design, the SoC integration parameters may be changed, as long as the changes are valid for that specific design.
The difference in the compilation process between the Custom Flow and the SoC Flow lies in which tool drives the flow. In the SoC Flow section, we used the SmartHLS GUI as the main entry point and driver of the compilation process. SmartHLS has TCL scripts to generate the HDL hardware modules from the C++ description and integrate them automatically. In this case SmartHLS calls Libero to perform different tasks, such as synthesis and place and route. The series of compilation steps is defined by SmartHLS.
In a custom flow, users are responsible for integrating the SmartHLS-generated
subsystem into their own SoC. In this example, our custom design, based on
the PFSoC Icicle Kit Reference Design, has a TCL file that drives the overall
compilation process. The compilation steps are defined in a file called
and the compilation steps are
shown in Figure 8‑6. This script is executed by Libero, and the TCL
script goes through a series of steps, and then calls SmartHLS only as an
extra step to generate the HDL modules for the C++ functions. Once
the HDL modules have been generated, then SmartHLS can automatically
integrate them into the design.
This custom flow (not SmartHLS) continues with synthesis,
place-and-route, and bitstream generation.
Figure 8‑6 Steps for User-Defined SoC with SmartHLS Integration
Compilation of the SmartHLS modules can be done on-the-fly using a TCL
script. An example script that users may write to call SmartHLS in their
custom flow is provided in the
file, which you will
see in Figure 8‑7.
After SmartHLS generates the HDL modules from their C++ description, they
can be integrated by hand in Libero’s GUI or automatically by sourcing
the SmartHLS-generated TCL script, shls_integrate_accels.tcl, as shown in
Figure 8‑7.
Figure 8‑7 TCL Scripts Hierarchy
Figure 8‑7 shows the hierarchy of the TCL scripts generated by SmartHLS.
The user script, compile_and_integrate_shls_to_refdesign.tcl, can
source the SmartHLS-generated script shls_integrate_accels.tcl, which
is responsible for generating the SmartHLS subsystem. In turn,
will source create_hdl_plus.tcl
is a SmartHLS-generated TCL script which can be
run by Libero to automatically import the generated Verilog files into a
SmartDesign HDL+ component, which can then be integrated with existing
SmartDesign projects.
We now introduce a simple image processing example to highlight some performance and resource aspects to keep in mind when using the SmartHLS SoC flow. These examples are kept deliberately simple to make straightforward explanations of the necessary concepts. Our objective is not to produce the fastest, most useful image filters.
We will be working with two hardware modules: a pixel value inversion
(i.e. simply flip the bits of every pixel value) and a threshold_to_zero
transformation. The latter is defined as:
Figure 8‑8 Visual Example of Invert and Threshold_to_Zero Transformations
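To make the two transformations concrete, here is a minimal per-channel software sketch. The function names are hypothetical, and the threshold_to_zero rule is an assumption: it uses the conventional form in which values above the threshold are kept and all others are set to zero. Under that assumption a threshold of 0 leaves every value unchanged.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative per-channel definitions; names and the exact threshold rule
// are assumptions, not the code shipped with the training.

// Invert: flip every bit of an 8-bit channel value, which equals 255 - value.
constexpr std::uint8_t invert_channel(std::uint8_t value) {
    return static_cast<std::uint8_t>(~value);
}

// Threshold-to-zero (assumed conventional form): keep values strictly above
// the threshold and set everything else to zero.
constexpr std::uint8_t threshold_to_zero_channel(std::uint8_t value,
                                                 std::uint8_t threshold) {
    return value > threshold ? value : static_cast<std::uint8_t>(0);
}
```

For example, with a threshold of 200 a channel value of 201 is kept, while a value of 200 becomes 0.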
As in the Vector Add On-Board section, a Linux image needs to be flashed to the eMMC memory in the Icicle board. If users have already flashed the Linux image as described in the Icicle Setup Guide, this section may be skipped, and users may move on to this section. A similar procedure can be followed for the user’s own Linux image when integrating SmartHLS design into their own existing system.
If you haven’t already, please download core-image-minimal-dev-icicle-kit-es.wic.gz. This Linux image is the Icicle Reference Image and has the same or extended functionality compared to the pre-programmed FPGA design on the Icicle Kit. This Linux image is from PolarFire SoC Yocto BSP (v2023.02.1) .
Follow the instructions on Icicle Setup
for setting the Icicle kit. As explained in the guide, when flashing the
Icicle board (Step 5 in the Icicle Setup Guide), use
that you have downloaded in
the previous step. It is important that the Linux image version, design
version, Libero version, and Hart Software Services (HSS) version are all
compatible with each other. Otherwise, the board may behave unexpectedly.
Flashing the Linux image in this step can take 15-30 minutes.
Navigate to
Rename ref_design to icicle-kit-reference-design, and move it to your
drive. Open it, and you should see the following files and
Figure 8‑9 Icicle Kit Reference Design Folder
tcl script is
used to drive the custom flow (shown previously in Figure 8‑6). We run
this tcl script from Libero GUI with the SMARTHLS script argument set to
point to a directory where the SmartHLS project is located.
The SmartHLS project files are located under the directory:
as shown in Figure 8‑10.
Figure 8‑10 Files in SmartHLS Projects
A description of each file in invert_and_threshold
is given in Figure
File Name | Description |
main_variations |
bmp.h |
| |
config_pfsoc_ref.tcl |
Makefile |
Makefile.user |
| |
toronto.bmp |
compile_and_integrate_shls_to_refdesign.tcl |
pre_hls_integration.tcl |
Figure 8‑11 Description of Various Files in SmartHLS Example
In this section, we are going to generate a Libero project and the
bitstream for the PolarFire SoC Reference Design, but with a SmartHLS
subsystem that contains an invert function accelerator and a
function accelerator connected. We have generated
the bitstream in advance, which can be downloaded from the release assets on Github
(see this section).
Users can save time by using the precompiled bitstream instead and
continue onto the next section.
Before generating the bitstream, make sure SRCS is set as below in icicle-kit-reference-design\script_support\additional_configurations\smarthls\invert_and_threshold\Makefile.user
SRCS = main_variations/main.simple.cpp
# SRCS = main_variations/main.cpu_usage.cpp
# SRCS = main_variations/main.hls_driver.cpp
# SRCS = main_variations/main.non-blocking.cpp
# This option requires bitstream configuration
# SRCS = main_variations/main.fifo.cpp
We can compile by either using the Libero GUI, or
by running the
or run_libero.ps1
(which script you use depends on your OS) script in
To use the Libero GUI to compile, open Libero. Press Ctrl+U in Libero to
open the “Execute Script” dialog as shown in Figure 8‑12. In the
“Script” file field, enter the path to the
script. In the “Arguments”
field enter the following:
SMARTHLS:C:/icicle-kit-reference-design/script_support/additional_configurations/smarthls/invert_and_threshold EXPORT_FPE:C:/icicle-kit-reference-design/MPFS_ICICLE_SMARTHLS_DEMO/ HSS_UPDATE:1
The above argument assumes that you have extracted the reference folder
into C:\. Change the paths accordingly if you have extracted the folder
elsewhere. The first part of the argument, SMARTHLS:<Path to SmartHLS Project>,
tells the script where to find the SmartHLS project to be built and
integrated into the Icicle Kit reference design. The second part,
EXPORT_FPE:<Path>, specifies the location of the output .job file. The
last part, HSS_UPDATE:1, updates the Hart Software Services (HSS), which
performs boot and system monitoring functions for PolarFire SoC.
Figure 8‑12 Libero’s Execute Script Window
is the main driver script that
generates an Icicle kit reference demo design.
# Compile and integrate the SmartHLS code
if {[info exists SMARTHLS]} {
    # Prepare the SmartDesign for HLS integration
    source ./script_support/additional_configurations/smarthls/pre_hls_integration.tcl
    # Call SmartHLS tool
    source ./script_support/additional_configurations/smarthls/compile_and_integrate_shls_to_refdesign.tcl
}
Figure 8‑13 SmartHLS configuration of MPFS_ICICLE_KIT_REFERENCE_DESIGN.tcl
Figure 8‑13 shows a snippet of MPFS_ICICLE_KIT_REFERENCE_DESIGN.tcl.
When compiling with SmartHLS, MPFS_ICICLE_KIT_REFERENCE_DESIGN.tcl
sources two scripts. The first script, pre_hls_integration.tcl,
modifies the Icicle Kit Reference Design by adding the necessary AXI
ports to integrate the generated HLS modules. When you open
, you
will see two very long configure_core TCL commands at the start of the
script. The first configure_core command configures FIC0_INITIATOR
to have 4 AXImslaves instead of 3. The second configure_core command
configures PCIE_INITIATOR to have 2 AXImmasters instead of 1. The
additional ports are needed to connect to the SmartHLS subsystem.
The second script, compile_and_integrate_shls_to_refdesign.tcl,
takes in a SmartHLS project and calls SmartHLS to generate HDL from C++
and integrate the modules into the SoC. This script attempts to obtain
the path to SmartHLS based on the user’s PATH. If the script cannot find
SmartHLS, the script will attempt to look in the default
installation path for Windows. If SmartHLS
still cannot be found, the script will give an error and users will have
to manually modify the script or add SmartHLS to their PATH environment
variable.
Alternatively, you can compile using the run_libero scripts. If you are
using Windows to compile, you will run run_libero.ps1 in PowerShell. If
you are using Linux to compile, or can otherwise run bash scripts, you will run
to compile.
First, open your shell and navigate to the icicle-kit-reference-design folder. Then run:
./script_support/additional_configurations/smarthls/run_libero.(your extension here)
#!/bin/bash
#
# Usage:
# cd icicle-kit-reference-design
# ./script_support/additional_configurations/smarthls/
#
set -e

prjDir=soc

HLS_PATH=./script_support/additional_configurations/smarthls/invert_and_threshold

#
# Start from a clean state
#
rm -rf \
    $HLS_PATH/hls_output \
    $prjDir

#
# Compile the Icicle reference design
#

time libero \
script_args:$target \
Figure 8-14 Script:
This script does essentially the same thing as what a user would do
using the Libero GUI (see the instructions above on how to compile the
hardware using the Libero GUI).
After generating the project, we can program the Icicle board using FlashPro Express. FlashPro Express comes packaged with the Libero installation.
Open FPExpress. The program can be found by
pressing the Windows key and searching for “FPExpress”. Linux users can
find FPExpress under <Libero Installation Folder>/Libero/bin/
Click New…, select the Import FlashPro Express job file radio button,
and navigate to the Icicle reference design folder to select the
bitstream generated in the previous section, <icicle-kit-reference-design>/Icicle_SoC.job.
If you have skipped the previous section, you can program with the
precompiled .job file from the release assets, Training4/INVERT_AND_THRESHOLD_SIMPLE.job.
Set your FPExpress project location to wherever you please, then click OK.
Figure 8‑15 Create New Job Project Setting
From the drop-down box above the RUN button make sure that PROGRAM is selected.
Figure 8‑16 FlashPro Express Program Screen
Now press the RUN button, and you should see a confirmation that the programming passed:
Figure 8‑17 Program Successful
This will program a default bitstream to the FPGA fabric, as well as a compatible bootloader (HSS), which will allow the board to boot up with the newly added Linux image.
After the board has successfully booted, you can connect using a serial
terminal. Connect in the same way as the serial terminal used during the
writing of the Linux image, except this time use channel 1 (/dev/ttyUSB1
on Linux, and Interface 1 on Windows). You should see a login screen:
Figure 8‑18 Login Screen
The login is root, and no password is required.
After logging in, you should be able to see a terminal. Now enter
ifconfig, look for inet, and take note of the IP address that should
have been assigned to the Icicle Kit by the network:
Figure 8‑19 Getting IP Address from ifconfig
Now that the IP address of the board is determined, you can access it
remotely over the network using SSH with the command ssh root@[your board IP here]
We will now explore different versions of the image filter introduced in
earlier. The goal of this exercise is to demonstrate the design
considerations involved and to build an understanding of how the system
works as a whole. We implemented a simple version of the invert and
functions in main.simple.cpp. Although we will not be using the
SmartHLS IDE for compilation, we will be using it for exploring and
editing the code.
Open the SmartHLS project under your Icicle Kit
Reference Design folder. Go to File -> Open Projects from File
System… (Figure 8‑20), and then in Import Source, open the SmartHLS
project under your Icicle Kit Reference Design Folder (Figure 8‑21), under
Figure 8‑20 Open Projects from File Menu
Figure 8‑21 Import Projects from File System Settings
Open main_variations/main.simple.cpp. The file contains two top
functions, invert and threshold_to_zero. Each top function is an
independent hardware module connected to the AXI interconnect. There is
a limit to the PolarFire SoC FPGA on-chip memory of about 2MB for the
entire FPGA fabric (MPFS250T part on the Icicle kit). Thus, we have to
split the large Full-HD image (1920x1080) into multiple blocks. We have
set the N_ROWS constant in bmp.h to process the input image 45 rows at
a time.
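The block arithmetic implied by these numbers can be sketched as follows. The constant names WIDTH, HEIGHT, and N_ROWS match those used in the source files; the values for WIDTH and HEIGHT are taken from the Full-HD resolution quoted above, and the helper functions are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Image and block dimensions quoted in the text.
constexpr std::uint32_t WIDTH = 1920;
constexpr std::uint32_t HEIGHT = 1080;
constexpr std::uint32_t N_ROWS = 45;

// The image must divide evenly into blocks of N_ROWS rows.
static_assert(HEIGHT % N_ROWS == 0, "HEIGHT must be a multiple of N_ROWS");

// Number of accelerator calls needed to process one full image.
constexpr std::uint32_t blocks_per_image() { return HEIGHT / N_ROWS; }

// Number of pixels (one 32-bit word each) transferred per accelerator call.
constexpr std::uint32_t pixels_per_block() { return WIDTH * N_ROWS; }
```

This gives 24 blocks of 86,400 pixels each, so every accelerator is invoked 24 times per image.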
The program takes in two arguments. The first argument is either 0 or 1.
When the first argument is 0, the program will not perform pixel
inversion; otherwise, the program will. The second argument is the
threshold ranging from 0 to 255. A zero-value threshold will bypass the
Figure 8‑22 Data Movement of main.simple.cpp
Each hardware module has an input and an output on-chip buffer to store
the incoming and outgoing data. Figure 8‑22 shows the data movement of
one block of data. First, the CPU initiates a DMA read from the DDR
memory to the input on-chip buffer of the invert() module. The module
performs the inversion operation and stores the result into its output
on-chip buffer. After that, the CPU initiates another DMA transaction to
write the result back to the DDR memory. threshold_to_zero() follows a
similar flow. In total there are four DMA operations.
We can run the software and hardware accelerated versions of the code for comparison. If you are using Linux, open the command line interface. If you are using Windows, open the command prompt (cmd).
Add <SmartHLS Installation Path>/SmartHLS/bin to your PATH environment variable.
If you are using Linux, enter the following command:
export PATH=<SmartHLS Installation Path>/SmartHLS/bin:$PATH
If you are using Windows, enter the following command:
set PATH=<SmartHLS Installation Path>/SmartHLS/bin;%PATH%
After adding SmartHLS to the PATH environment variable, set the BOARD_IP
environment variable so that SmartHLS knows the IP address of the Icicle
board. Please refer to Figure 8‑19 to find the IP address of the Icicle
board. If you are using Linux, enter the following command:
export BOARD_IP=<Your Icicle Board IP>
If you are using Windows, enter the following command:
set BOARD_IP=<Your Icicle Board IP>
Now that we have finished setting up the environment, we can move on to compiling and running the software. Go to your Icicle Kit Reference Design folder. If you are using Linux, run the following commands:
cd script_support/additional_configurations/smarthls/invert_and_threshold
If you are using Windows, run the following commands:
cd script_support\additional_configurations\smarthls\invert_and_threshold
You might see a warning about “REMOTE HOST IDENTIFICATION HAS CHANGED”
because we have changed the OS image. Simply remove the stale host key
for the board from ~/.ssh/known_hosts, for example with
“ssh-keygen -R <Your Icicle Board IP>”, and accept the new RSA
fingerprint the next time ssh prompts.
While the code compiles, let’s look at these 2 scripts. The first
(Figure 8‑24), simply compiles the RISC-V executables with and without
accelerators. Line 10 is equivalent to Cross-compile with accelerator
drivers in Figure 7‑7 and Line 13 is equivalent to Cross-compile
software for RISC-V in Figure 7‑5 (Run Software without Accelerators
Option Menu). The -a option tells shls to build all dependencies in
Figure 6‑26 without prompting.
01 #!/bin/bash
03 set -eu
05 # Remove binaries and results from previous runs
06 ssh root@$BOARD_IP "rm -f output*.bmp *.elf"
07 shls clean
09 echo "Compiling w/HW module"
10 shls -a soc_sw_compile_accel
12 echo "Compiling SW-only"
13 shls -a soc_sw_compile_no_accel
Figure 8‑24 Compilation Script:
The second script,
(Figure 8‑25), runs the RISC-V
executables with and without accelerators on the board. Line 7 is
equivalent to Run software without accelerators (Figure 7‑5) and line
11 is equivalent to Run software with accelerators (Figure 7‑7).
However, unlike the options chosen from the IDE, lines 7 and 11 do not
build any dependencies as described in Figure 6‑26. The
soc_base_proj_run and soc_accel_proj_run commands skip all build
dependencies because we do not wish to program the board with a SmartHLS
SoC; we have already programmed our Custom SoC bitstream to the FPGA.
01 #!/bin/bash
03 set -eu
05 echo "---------------------"
06 echo "Run SW-only"
07 shls -s soc_base_proj_run
09 echo "---------------------"
10 echo "Run w/HW module"
11 shls -s soc_accel_proj_run
Figure 8‑25 Run Program Script:
defines various options related to compiling and running
the compiled program. Figure 8‑26 is a snippet of Makefile.user
containing the runtime settings. Visit the Makefile
section of our user guide for a full list of predefined user flags and
their uses. Important: Ensure that SRCS
is set to
#-------------------------------------------------
# Runtime settings
#-------------------------------------------------
# Specify the working directory on the board
# All input, output, binaries will be based off this folder.
BOARD_PATH = ./

# INPUT_FILES_RISCV should use host paths.
# It lists the files, separated by a space, to be copied onto the board
INPUT_FILES_RISCV = toronto.bmp

# OUTPUT_FILES_RISCV should use on-board paths.
# It lists the files, separated by a space, to be copied from the board
OUTPUT_FILES_RISCV = output*.bmp

# Arguments to the program
# First argument: <0|1> 0 for skipping invert
#                       1 for performing invert
# Second argument: <0..255> Threshold for not setting pixel to zero
Figure 8‑26 Runtime Settings Section of Makefile.user
If the run was successful, you should see output similar to Figure 8‑27 below. The "Elapsed time" may differ.
Here we go!
N_ROWS:45, buf_size :259200, do_invert:1, threshold:200, mode:sw
Elapsed time: 0.005379 [s]
Figure 8‑27 Sample Output of Successful Run
We can also SSH into the board to run the binaries directly. Log onto the board by entering the following command:
ssh root@$BOARD_IP
There are two .elf
files in the home directory. They were copied over
when we ran
. The exact location of where shls soc_accel_proj_run
and shls soc_base_proj_run
are run depends on BOARD_PATH defined in Makefile.user. We can
experiment with running either program with various parameters. The
accepted range for the arguments is explained in the comments of
PROGRAM_ARGUMENTS in Figure 8‑26.
Figure 8‑28 Sample main.simple.cpp Output |
Figure 8‑29 Runtime of main.simple.cpp |
We summarize the runtime of the application with various program
arguments in Figure 8‑29. When using a hardware accelerator, the
execution time of either invert or threshold_to_zero is roughly 55 ms
(Figure 8‑29). However, when running only in software, inversion takes
about 46 ms while the threshold_to_zero() function takes approximately
141 ms. So, a slight increase in complexity has a big effect on the
overall software runtime. 57 ms is approximately the time the MSS needs
to move data between the accelerator and DDR. This is, in fact, where
most of the time is spent in these simple hardware accelerators, as can
be seen in Figure 8‑30.
Figure 8‑30 Runtime Breakdown of invert()
If we look at the hls_output/reports/summary.hls.invert.rpt (Figure
8‑31), we can see that the invert latency is 86,402 cycles at 125MHz,
which is 0.7ms. The invert function is called 24 times, which means that
the invert pipeline total runtime is only about 17ms of the 55.2ms
measured runtime. This represents only 30% of the total runtime, with
the other 70% spent performing data transfer.
====== 2. Function and Loop Scheduling Results ======
| Function: invert takes 86407 cycles |
| Loop | Location In Source |...| Total Latency |
| for.loop:main_variations/main.simple.cpp:12:5 | line 12 of main_variations/main.simple.cpp |...| 86403 |
Figure 8‑31 Pipeline Result of invert()
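The 30% figure can be reproduced with a short calculation. The sketch below uses the cycle count, clock frequency, call count, and measured runtime quoted in the text; the helper names are hypothetical.

```cpp
#include <cassert>

// Convert a cycle count at a given clock frequency (in MHz) to milliseconds.
constexpr double cycles_to_ms(double cycles, double clock_mhz) {
    return cycles / (clock_mhz * 1000.0);
}

// Fraction of the measured runtime that is spent inside the accelerator
// pipeline, given the per-call latency and the number of calls.
constexpr double compute_fraction(double cycles_per_call, int calls,
                                  double clock_mhz, double measured_ms) {
    return cycles_to_ms(cycles_per_call, clock_mhz) * calls / measured_ms;
}
```

With 86,402 cycles per call at 125 MHz, one call takes roughly 0.69 ms, 24 calls take roughly 16.6 ms, and compute_fraction(86402, 24, 125, 55.2) comes out at approximately 0.30.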
Chaining the hardware modules introduces a data dependency because the
output of the invert() module is the input of the threshold_to_zero()
module. This serializes the execution, as we can see in the alternation
of data transfers in Figure 8‑22. The hardware modules cannot run in
parallel, but the three channels per pixel (Red, Green, and Blue) are
processed in parallel in the pipeline. When only one of the invert() or
threshold_to_zero() hardware accelerators is called, the processing time
is about the same (57 ms) because the data transfer time dominates the
overall runtime.
Faster execution time is one benefit of offloading functions to the FPGA; the other is leaving the CPU free to perform other tasks. Both contribute to reducing power consumption. In this case, the CPU does not have other tasks to perform, but we will compare the CPU usage with and without hardware acceleration.
We introduced a new for loop with N_ITER iterations in
main.cpu_usage.cpp to artificially increase the runtime and be able to
see the CPU usage using the Linux top command. Think of this loop as a
sequence of video frames, where the same processing is performed
Modify Makefile.user and select:
SRCS = main_variations/main.cpu_usage.cpp
Open a terminal on the Icicle board and run the Linux top
ssh root@$BOARD_IP
In another terminal, compile the software and run again. If you are using Linux, run:
shls clean
If you are using Windows, run:
shls.bat clean
Running software-only causes the CPU to reach 100% usage:
Figure 8‑32 CPU Usage when Running Software on RISC-V Only
Running with the hardware module, the CPU utilization is about 11%:
Figure 8‑33 CPU Usage when Running with Accelerators
SmartHLS has a TCL parameter called SOC_POLL_DELAY with a value specified in microseconds. This parameter is used for controlling how often the hardware driver polls the module to check for completion. Sometimes for long running tasks, the MSS only needs to check occasionally (e.g., every 1 second), instead of many thousands of times per second, which frees up the CPU to do other useful work.
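The effect of a poll delay can be illustrated with a generic polling loop. This is a minimal sketch and not the driver code that SmartHLS generates: wait_until_done and the is_done callable are hypothetical stand-ins for a status check on the accelerator.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Generic polling loop (illustrative; not the SmartHLS-generated driver).
// `is_done` is any callable that returns true once the accelerator has
// finished. Returns the number of status checks that were made.
template <typename DoneFn>
unsigned wait_until_done(DoneFn is_done, std::chrono::microseconds poll_delay) {
    unsigned polls = 1;
    while (!is_done()) {
        // While sleeping, the CPU is free to run other work.
        std::this_thread::sleep_for(poll_delay);
        ++polls;
    }
    return polls;
}
```

A longer poll delay means fewer status checks for the same task duration, at the cost of noticing completion up to one delay period late.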
In main.non-blocking.cpp, we change the objective and we no longer
require chaining the two image transformations as we did before with
. Now we want to output the inverted picture and
picture separately into two different output files.
This requires two new buffers to be allocated in memory, to hold the
output data of each image transformation. With this change, we have
removed the data dependency between the transformations and we can now
overlap the computation and data transmission between the hardware
modules. To accomplish this, we can use the non-blocking software driver
API functions generated by SmartHLS. See this section for explanation
on the generated software driver APIs.
Instead of calling invert() or threshold_to_zero(), we used a different
call in main.non-blocking.cpp as shown in Figure 8‑34.
64 for(int i = 0; i < HEIGHT/N_ROWS; i++) {
65 if (do_invert) {
67 invert_write_input_and_start((uint32_t *)&BitMap[i*WIDTH*N_ROWS]);
68 #else
69 invert((uint32_t *)&BitMap[i*WIDTH*N_ROWS], (uint32_t *)&OutBitMap1[i*WIDTH*N_ROWS]);
70 #endif
71 }
73 if (threshold > 0) {
75 threshold_to_zero_write_input_and_start((uint32_t *)&BitMap[i*WIDTH*N_ROWS], threshold);
76 #else
77 threshold_to_zero((uint32_t *)&BitMap[i*WIDTH*N_ROWS], (uint32_t *)&OutBitMap2[i*WIDTH*N_ROWS], threshold);
78 #endif
79 }
82 if (do_invert)
83 invert_join_and_read_output((uint32_t *)&OutBitMap1[i*WIDTH*N_ROWS]);
85 if (threshold > 0)
86 threshold_to_zero_join_and_read_output((uint32_t *)&OutBitMap2[i*WIDTH*N_ROWS]);
87 #endif
88 }
Figure 8‑34 Main Execution Loop of main.non-blocking.cpp
is a SmartHLS defined macro that indicates whether the program is
compiled with accelerators or not. The *_write_input_and_start()
functions send the data to the hardware accelerators and start them
without waiting for their completion. We check for completion on lines
83 and 86, where the *_join_and_read_output() functions are called. This
approach is like starting a thread and then waiting for the result at a
synchronization point. A full list of available driver functions can be
found under the hls_output/accelerator_driver directory as described
previously.
Although invert() and threshold_to_zero() can run independently of each
other, they still share the same physical DMA in the MSS, which can only
access a single DDR memory channel. Thus, their execution times do not
completely overlap with each other. We will explore an alternative in
the next section.
Figure 8‑35 Data Movement of main.non-blocking.cpp
Modify Makefile.user and select:
SRCS = main_variations/main.non-blocking.cpp
Then compile the software and run again. For Linux:
shls clean
For Windows:
shls.bat clean
Figure 8‑36 Runtime Results of main.non-blocking.cpp
Recall from Figure 8‑29 that performing a single invert or
threshold_to_zero takes approximately 57 ms with accelerators, as that
is the approximate amount of time required for the DMA to transfer the
data to the accelerator and back. Running invert and threshold_to_zero
in parallel in this example did not completely overlap the runtime of
the two functions. They can only be partially overlapped because they
share the same DMA.
In the past sections we have only been changing the software, and have
made no changes regarding the hardware. Now we will change the hardware
and generate a new bitstream. Alternatively, you can use the
precompiled bitstream in the
release assets on Github.
In this example, we will refactor the code and merge the two functions,
invert() and threshold_to_zero(), into a single top function called
invert_and_threshold_to_zero(), which essentially calls the two
functions internally as shown in Figure 8‑37.
Figure 8‑37 Data Movement of main.fifo.cpp
We consolidated invert and threshold_to_zero into a single top module,
instead of the two independent cores used before. Normally, a user would
have to disconnect and remove the previous two modules and connect the
new single hardware module by hand, using the GUI or TCL commands.
SmartHLS now takes care of that integration automatically. Figure 8‑38
shows how this new accelerator is implemented in SmartHLS:
void invert_and_threshold_to_zero(uint32_t *in, uint32_t *out, int do_invert, uint8_t thres) {
#pragma HLS function top
#pragma HLS interface default type(axi_target)
#pragma HLS interface argument(in) type(axi_target) dma(true) num_elements(WIDTH*N_ROWS)
#pragma HLS interface argument(out) type(axi_target) dma(true) num_elements(WIDTH*N_ROWS)

    hls::FIFO<uint32_t> fifo(16);
    hls::thread t1(invert, in, std::ref(fifo), do_invert);
    hls::thread t2(threshold_to_zero, std::ref(fifo), out, thres);
    t1.join();
    t2.join();
}
Figure 8‑38 Thread and FIFO in main.fifo.cpp
The operations performed in the invert_and_threshold_to_zero() function
are a classic example of the producer-consumer pattern. The in data is
received from the AXI target interface and passed to each stage of the
computation, namely invert and threshold_to_zero. We use a
for each stage as each stage can be run independently as long as there
are data available. The two stages are connected via a fifo between
By combining the 2 functions into one, we achieved the following:
- Doubled the performance and halved the memory resources, as we no longer need the output buffer of invert() and the input buffer of threshold_to_zero().
- Reduced the runtime by half compared to main.simple.cpp because the execution of the two functions is now pipelined. The execution of the two operations is almost fully overlapped, except for the initial DMA transfer and the pipeline latency of the first module.
- Reduced the number of LSRAMs by half because we only need 2 DMA transfers instead of 4 compared to the simple configuration. The image data stays longer on the fabric, increasing the amount of computation per data movement to/from the CPU.
Modify Makefile.user and select:
SRCS = main_variations/main.fifo.cpp
Because this variation requires a hardware change, rerun the entire flow as described in the Compiling the hardware and Programming the FPGA bitstream sections.
Alternatively, you can use the INVERT_AND_THRESHOLD_FIFO.job
precompiled bitstream included as a release asset on Github.
Then compile the software and run again. On Linux:
shls clean
On Windows:
shls.bat clean
Figure 8‑39 main.fifo.cpp Runtime with Hardware Acceleration
As shown in Figure 8‑39, the runtime is now ~57 ms for both hardware
modules, which is the same runtime as running only one of the
accelerators and almost half the runtime of running both accelerators
(~112ms) in the main.simple.cpp variation. Despite an increase in
computation in the accelerator, we do not see any difference in runtime
between a simple invert and the combined invert and threshold_to_zero
function. The runtime is still dominated by DMA transfers. Thus, we can
expect larger runtime savings as we increase the complexity of the
accelerator function.
The runtimes of the various implementations that we have explored have been summarized below in Figure 8‑40.
Main Variation | Arguments | Without Accelerators | With Accelerators |
main.simple.cpp | do_invert:0 threshold:0 | 0 ms | 0 ms |
main.simple.cpp | do_invert:0 threshold:200 | 141 ms | 57 ms |
main.simple.cpp | do_invert:1 threshold:0 | 46 ms | 57 ms |
main.simple.cpp | do_invert:1 threshold:200 | 173 ms | 112 ms |
main.non-blocking.cpp | do_invert:1 threshold:200 | 260 ms | 72 ms |
main.fifo.cpp | do_invert:0 threshold:0 | 4.6 s to 5.0 s | 57 ms |
main.fifo.cpp | do_invert:0 threshold:200 | 4.6 s to 5.0 s | 57 ms |
main.fifo.cpp | do_invert:1 threshold:0 | 4.6 s to 5.0 s | 57 ms |
main.fifo.cpp | do_invert:1 threshold:200 | 4.6 s to 5.0 s | 57 ms |
Figure 8‑40 Runtime of Various Implementations
Several things are of note here:
- DMA transfers dominate the overall runtime when running with accelerators. When main.fifo.cpp consolidated invert and threshold_to_zero into a single accelerator, the runtime was effectively halved (57 ms) compared to main.simple.cpp’s runtime of performing both transformations (112 ms). Regardless of the complexity of invert, threshold_to_zero, and the combined function, the runtime is about 55 ms for each function called.
- The DMA is shared and can become the bottleneck when multiple accelerators are accessing it at the same time. main.non-blocking.cpp produces an inverted image and a threshold_to_zero image in parallel. However, the execution of the invert and threshold_to_zero functions can only be partially overlapped due to the DMA being shared amongst them. Hence, the runtime is longer (72 ms) than running only one of the transformations (57 ms), but shorter than main.simple.cpp’s runtime of performing both transformations (112 ms).
- Savings can be achieved even for relatively simple functions despite the cost of DMA transfer. In main.simple.cpp, the accelerator version (57 ms) almost breaks even with the simple invert software function (46 ms). Running the accelerator version of threshold_to_zero (57 ms) took about 40% of the pure software runtime (141 ms).
- Threads are expensive in software but cheap in hardware. main.fifo.cpp uses hls::thread to implement the producer-consumer behaviour. Creating and destroying threads for very simple calculations is costly. Even though the runtime for running with accelerators improved, the runtime for pure software on the MSS has increased significantly.
- Software can be used to save computations. main.simple.cpp performs a check on the argument and does not send the data to the accelerator if calculations are not required, i.e., the argument is zero. On the other hand, main.fifo.cpp blindly sends the data to the accelerator to compute. Hence, main.fifo.cpp still takes about 57 ms to complete even when the arguments are zero, but main.simple.cpp saved time (0 ms) by not doing the unnecessary calculations.
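The ratios behind these observations can be reproduced directly from the runtimes in Figure 8‑40. The helper below is a hypothetical one-liner, included only to make the comparisons explicit.

```cpp
#include <cassert>

// Ratio of software-only runtime to accelerated runtime. Values above 1.0
// mean the accelerated version is faster; values below 1.0 mean it is slower.
constexpr double speedup(double sw_ms, double hw_ms) {
    return sw_ms / hw_ms;
}
```

Using the main.simple.cpp rows, threshold_to_zero alone gives speedup(141, 57), roughly 2.5x, whereas invert alone gives speedup(46, 57), roughly 0.8x, which is the near break-even case mentioned above. Chaining both gives speedup(173, 112), roughly 1.5x.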
We have shown how to integrate SmartHLS generated accelerators into your own SoC through the AXI interface, how to use non-blocking driver functions to parallelize computation, how the DMA affects runtime, and how the DMA can be the bottleneck in your system. We hope you take what we have shown here and incorporate SmartHLS into your own SoC designs.
In this release of the SmartHLS PolarFire SoC flow there are a few limitations:
- No AXI streaming arguments
- No AXI initiator arguments with burst support
- No arbitrary bit-width types (ap_[u]int) are supported for function arguments
- No variable length transfers
Even though the amount of memory was reduced by half in the main.fifo variation, it is still possible to eliminate the on-chip buffers altogether. However, to achieve this we need to address the first two limitations.
The AXI streaming arguments can be used to send data to hardware modules that can consume incoming data at line rate and eliminate the need for the on-chip buffer. That means the hardware modules should not backpressure the CPU interconnect, which is achieved when the modules are fully pipelined with an initiation interval (II) of 1.
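To see why an initiation interval of 1 matters for line-rate operation, the following sketch uses the common textbook approximation for the cycle count of a pipelined loop. The formula and the example numbers (1000 elements, pipeline depth of 4) are illustrative assumptions, not figures taken from SmartHLS reports, and `pipeline_cycles` is a hypothetical helper.

```cpp
#include <cstdint>

// Approximate cycle count of a pipelined loop: the first iteration completes
// after `depth` cycles, and each later iteration starts `ii` cycles after the
// previous one.
constexpr std::uint64_t pipeline_cycles(std::uint64_t n,
                                        std::uint64_t ii,
                                        std::uint64_t depth) {
    return n == 0 ? 0 : depth + (n - 1) * ii;
}

// With II = 1 the module accepts one element per cycle.
static_assert(pipeline_cycles(1000, 1, 4) == 1003, "II = 1");
// With II = 2 it accepts one element every other cycle, so the sender must
// stall about half of the time.
static_assert(pipeline_cycles(1000, 2, 4) == 2002, "II = 2");
```

Under this model, any II greater than 1 means the module cannot keep up with a source that delivers one element per cycle, which is when backpressure on the interconnect would appear.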
The AXI initiator with burst support optimization would also allow removing the outgoing on-chip buffer because the hardware module would be able to directly write into the CPU memory at line rate without the CPU having to initiate a DMA transfer.
The image below (Figure 9‑1) shows what a full streaming configuration may look like. In this case, since there is only one DDR bank, the performance would be limited by the memory controller and interconnect.
Figure 9‑1 Fully Streaming Configuration
Also, SmartHLS does not currently support using arbitrary bit-width types as function arguments like this:
foo(hls::ap_int<24> &in)
The function would have to be rewritten (padded) like this:
foo(uint32_t &in)
For this reason, in our Image Processing example, the 24-bit pixel format (3-channel RGB, 8 bits per channel) had to be padded with an extra 8-bit alpha channel even though the original .bmp image does not contain an alpha channel. The alpha channel is ignored when reading and writing back to the .bmp files.
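As a rough illustration of this padding, the sketch below packs three 8-bit channels into one 32-bit word and ignores the top byte when unpacking. The byte order (0x00RRGGBB) and the names `pack_rgb`, `unpack_rgb`, and `Rgb` are assumptions made for this example; the training sources may lay out the channels differently.

```cpp
#include <cstdint>

// Hypothetical 3-channel pixel as it would be stored in a 24-bit .bmp file.
struct Rgb {
    std::uint8_t r, g, b;
};

// Pad a 24-bit pixel to 32 bits. Assumed layout: 0x00RRGGBB, where the top
// byte plays the role of the unused alpha channel.
constexpr std::uint32_t pack_rgb(std::uint8_t r, std::uint8_t g, std::uint8_t b) {
    return (static_cast<std::uint32_t>(r) << 16) |
           (static_cast<std::uint32_t>(g) << 8) |
           static_cast<std::uint32_t>(b);
}

// Recover the three channels; whatever is in the top byte is dropped.
constexpr Rgb unpack_rgb(std::uint32_t px) {
    return Rgb{static_cast<std::uint8_t>(px >> 16),
               static_cast<std::uint8_t>(px >> 8),
               static_cast<std::uint8_t>(px)};
}
```

The padded word can then be passed through a `uint32_t` argument, at the cost of transferring 4 bytes per pixel instead of 3.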
Finally, the amount of data being transferred is determined by SmartHLS at compile time via the num_elements(WIDTH*N_ROWS) pragma option, as shown in Figure 9‑2. For example, if we wanted to work with two different image frame sizes, HD (1280x720) and Full HD (1920x1080), on the invert() function, we would have to use the largest size (Full HD in this case) for the value of num_elements and add a function argument indicating the actual size to use. This, however, would only limit the amount of data that is processed, not the amount of data that is transferred during the DMA transactions.
void invert(uint32_t *in, uint32_t *out) {
#pragma HLS function top
#pragma HLS interface default type(axi_target)
// The DMA transfer size of each argument is fixed at compile time by num_elements.
#pragma HLS interface argument(in) type(axi_target) dma(true) num_elements(WIDTH*N_ROWS)
#pragma HLS interface argument(out) type(axi_target) dma(true) num_elements(WIDTH*N_ROWS)

#pragma HLS loop pipeline II(1)
    for (int j = 0; j < WIDTH*N_ROWS; j++) {
        out[j] = ~in[j];
    }
}
Figure 9‑2 Compile-Time Determination of the Number of Elements to be Processed