High-level synthesis#

Learning goals#

  • Integrate logic that was implemented through high-level synthesis into a system-on-a-chip

  • Understand how an accelerator interacts with a microprocessor on a system-on-a-chip

Introductory problem#

After implementing a processor, you want to focus on your accelerator. As you remember, you want to accelerate data signing on the FPGA. Implementing a data signing module in SystemVerilog can be very time-consuming, so you opt for writing your algorithms in a high-level language and synthesizing them afterwards.

As a starter, you come up with a simple calculation that multiplies two vectors component-wise and accumulates the products:

Listing 82 code/macc/macc.cpp#
#include "macc.hpp"

// Multiply and accumulate
int macc(vec_t &xs, vec_t &ys) {
  int sum = 0;
  for (auto i = 0; i < ARRAY_SIZE; ++i)
    sum += xs[i] * ys[i];
  return sum;
}
Listing 83 code/macc/macc.hpp#
#include <array>

const auto ARRAY_SIZE = 1 << 8;
// Must be reasonably high. Otherwise Vitis does not infer an M-AXI interface

using vec_t = std::array<int, ARRAY_SIZE>;
int macc(vec_t &xs, vec_t &ys);

The corresponding testbench:

Listing 84 code/macc/main.cpp#
#include "macc.hpp"
#include <cstdlib> // exit, EXIT_FAILURE
#include <iostream>

int main() {
  vec_t xs, ys;

  unsigned int i = 0;
  for (auto &x : xs)
    x = ++i;
  for (auto &y : ys)
    y = 1;

  auto sum = macc(xs, ys);
  std::cout << "result: " << sum << std::endl;
  std::cout << "expect: " << (ARRAY_SIZE + 1) * ARRAY_SIZE / 2 << std::endl;
  // Assertion
  if (sum != (ARRAY_SIZE + 1) * ARRAY_SIZE / 2)
    exit(EXIT_FAILURE);
}
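As a cross-check of the expected value used in the testbench, the same calculation can be modelled in a few lines of Python. This is only a software sketch: the names `macc` and `ARRAY_SIZE` mirror the C++ listings, and the closed form is the sum 1 + 2 + ... + n = n(n + 1)/2 that the testbench also uses.

```python
ARRAY_SIZE = 1 << 8  # same size as in macc.hpp


def macc(xs, ys):
    """Multiply two equally long sequences component-wise and accumulate."""
    return sum(x * y for x, y in zip(xs, ys))


xs = list(range(1, ARRAY_SIZE + 1))  # 1, 2, ..., 256 as in main.cpp
ys = [1] * ARRAY_SIZE

result = macc(xs, ys)
expect = (ARRAY_SIZE + 1) * ARRAY_SIZE // 2
print("result:", result)  # prints "result: 32896"
print("expect:", expect)  # prints "expect: 32896"
```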

How would you integrate this functionality into a system-on-a-chip?

Requirements#

Use Vitis, which is a high-level synthesis tool by AMD. In this chapter we will use the PYNQ-Z2 platform to gain hands-on experience with another kind of reconfigurable chip: a system-on-a-chip that combines a hard processor with an FPGA. The Jupyter-notebook-based interface to the FPGA will ease the interaction with our accelerator.

Integrate the accelerator and carry out a calculation.

Tasks#

Read chapter 22 of the Zynq MPSoC book up to and including section 22.2.

Quiz#

<!-- What is PYNQ, does PYNQ use Python for synthesis? -->
# Which is true about PYNQ?
- [ ] is an experimentation board with AMD Zynq chip on it
- [x] uses Jupyter notebooks
- [ ] can be run on Intel CPUs
- [ ] uses Python instead of VHDL to generate bitstreams

<!-- Introduction, PYNQ layers -->
# Which is true about PYNQ?
- [ ] The application layer is used to create bitstreams
- [ ] The software layer includes the drivers for communication with the PS
- [ ] PS stands for *programmable system*
- [x] PL stands for *programmable logic*

<!-- introduction, lower layer, overlay -->
# Which is true?
- [x] overlays are like dynamically loadable *hardware libraries*
- [ ] PYNQ's hardware layer includes a Linux-based OS
- [ ] overlays are created automatically by Jupyter notebooks
- [ ] the overlays are typically developed by software engineers

<!-- introduction, Python -->
# Which is true?
- [ ] Python tends to run faster than C
- [ ] PYNQ integrates VHDL into Python
- [ ] PYNQ integrates Verilog into Python
- [x] A programmer can typically implement a solution faster in Python compared to C

<!-- typical applications for FPGAs -->
# Why does the image resizing application run faster when an overlay is used?
- [ ] the overlay contains efficient assembler code instead of Python
- [ ] the overlay utilizes DMA (direct memory access), which transfers data from the memory to the CPU much faster (than without DMA)
- [x] programmable logic can utilize more parallelization in image pixel processing compared to the CPU
- [ ] the overlay is applied on the processing system and the processor can pipeline the image resize operation

<!-- IP -->
# What does IP mean in FPGA design?
- [ ] *internet protocol* – a communication protocol to exchange data packages on the OSI network layer. Mostly used to exchange data between CPU and FPGA.
- [x] *intellectual property (core)* – reusable integrated circuit block. Mostly in VHDL or Verilog.
- [ ] *intelligent peripheral* – a peripheral component which is used for communication with the CPU in a heterogeneous architecture like CPU+FPGA
- [ ] *image processing* – use of FPGA components to analyze or edit an image, e.g., resizing

<!-- image resizer architecture -->
# Which block accelerates the image resize operation?
- [x] resize IP
- [ ] ARM processor
- [ ] memory controller
- [ ] DMA

<!-- memory bus, AXI -->
# which of the following acts as a bus to interconnect PS and PL?
- [ ] ARM
- [ ] API
- [x] AXI
- [ ] IP

<!-- AXI-Lite -->
# which of the following is a simple memory interface typically used for configuring registers?
- [x] AXI4-Lite
- [ ] AXI4-Stream
- [ ] DRAM
- [ ] AXI4-RegConf

<!-- 24 bit interface for the Resize IP -->
# what is the reason for converting the width of the data exchanged between DMA and resize IP?
- [ ] the PL does not support 32 bit data width
- [x] the resize IP consumes 24 bits at each cycle
- [ ] each data packet is 32 bits long, which consists of 8 bit header and 24 bit payload.

<!-- AXI_GP_0 vs AXI_HP_0 -->
# Look at [figure 22.5, the overview of the hardware system for the image resizer](https://www.zynq-mpsoc-book.com/wp-content/uploads/2019/04/MPSoC_ebook_web_v1.0.pdf#G33.1234829). Put them in order:
1. `AXI_GP_*`
1. and
1. `AXI_HP_*`
1. are used for setting up the resize IP and transferring the image to the resize IP, respectively.

<!-- file handling -->
# how do we program (i.e., configure) the FPGA in PYNQ?
- [ ] using Vivado FPGA configurator
- [ ] using the USB cable between the board and our computer
- [ ] FPGA is configured using the memory card
- [x] using the `Overlay()` function in Jupyter notebook

<!-- overlay requirements -->
# which (set of) file/s are required by the function `Overlay()`
- [ ] bitstream
- [ ] bitstream and payload, e.g., image file
- [ ] payload and hand-off file
- [x] bitstream and its metadata

<!-- hand-off file -->
# why is the bitstream alone not sufficient to load an overlay in PYNQ?
- [ ] the notebook cannot know where the bitstream on the filesystem is
- [x] the notebook needs to know addresses of the IPs integrated in the overlay to interact with them
- [ ] the bitstream needs to be decrypted using the hand-off file
- [ ] the bitstream needs to be decompressed using the metadata

<!-- hand-off vs bitstream files -->
# what is the extension for hand-off files?
- [ ] `.meda`
- [ ] `.hao`
- [x] `.hwh`
- [ ] `.bit`

<!-- Vivado vs vitis_hls -->
# which software is used to create the bitstream?
- [ ] AMD
- [ ] Vitis IDE
- [ ] Vitis HLS
- [x] Vivado

<!-- acceleration on Jupyter notebooks -->
# which is/are true?
- [ ] Jupyter notebooks are executed on the PL
- [x] the implementation included in the overlay is accelerated by the FPGA
- [ ] Jupyter notebooks are executed on the FPGA
- [ ] the implementation included in the overlay is accelerated by the CPU

<!-- pip -->
# which of the following is/are true?
- [x] `pip` can manage installation of Python packages
- [ ] `pip` can automatically load bitstreams to the FPGA
- [x] `pip` can be used to install Python projects from Git repositories

<!-- VHDL vs C/C++ vs Python -->
# which language is/are used typically for driver development?
- [x] C/C++
- [ ] VHDL
- [ ] Verilog
- [ ] Java

<!-- board we will use -->
# which board will we use in this chapter?
- [x] PYNQ-Z2
- [ ] ZCU104
- [ ] Ultra96
- [ ] ZCU111

Mini-lecture#

High-level synthesis#

Examples using other languages:

Interfaces#

In high-level programming, a function encapsulates an algorithm with defined inputs and outputs. These inputs and outputs are declared as arguments and return value, respectively. However, it is also possible to return a value through references or pointers in the arguments.

Exercise 37

You have seen many kinds of interfaces up to this chapter, e.g., the memory interface for RAM and AMBA APB. Imagine you want to convert a C++ function to an HDL module. How would you synthesize the following function interfaces?

int f1(int x, int y);
void f2(int *input, int *output);

int* f3(int *input);
int* f4(int *input);
// to be used as f4(f3(input))

The circuit generated through HLS can have a plethora of interfaces, e.g.,

  • ports that are as many bits wide as the corresponding arguments of the synthesized function

  • a memory-mapped interface with address, data and enable signals, e.g., AMBA AXI-based bus

  • a streaming or FIFO-based interface

  • a bus-based configuration interface to start/stop the circuit or check its status (e.g., idle, ready).
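To make the roles of these interfaces more concrete, the following Python sketch models a synthesized block from the processor's point of view. It is a behavioural toy, not generated code: the class `MaccModel`, its attribute names and the flat `memory` list are assumptions made purely for illustration. The address attributes play the role of the configuration interface, `memory` stands for the RAM behind the memory-mapped data port, and `start()`/`done` stand for the start/status control.

```python
class MaccModel:
    """Toy model of an accelerator with configuration, data and control interfaces."""

    def __init__(self, memory, length):
        self.memory = memory  # stands for RAM reached over the memory-mapped port
        self.length = length  # number of elements per input vector
        self.xs_addr = 0      # configuration register: start of xs (element index)
        self.ys_addr = 0      # configuration register: start of ys (element index)
        self.result = 0       # configuration register: return value
        self.done = False     # control/status flag

    def start(self):
        """Run one calculation to completion and raise the done flag."""
        self.done = False
        acc = 0
        for i in range(self.length):
            # The block fetches its operands itself using the configured addresses
            acc += self.memory[self.xs_addr + i] * self.memory[self.ys_addr + i]
        self.result = acc
        self.done = True


# Processor side: place data in memory, configure addresses, start, read result
memory = [1, 2, 3, 4, 10, 20, 30, 40]
ip = MaccModel(memory, length=4)
ip.xs_addr = 0
ip.ys_addr = 4
ip.start()
print(ip.done, ip.result)  # prints "True 300"
```

Note that the processor never passes the vectors themselves, only where they are located. A real block uses byte addresses and runs concurrently with the processor, which this sequential model does not capture.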

Tools#

Vitis Unified Software Platform#

Vitis Unified Software Platform caters for all software development aspects for FPGAs.

Vitis Unified IDE is an integrated development environment that acts as the frontend tool for the Vitis Unified Software Platform. Vitis Unified IDE can for example create an HDL implementation of a software-based algorithm, which can in turn be packaged as:

  • an IP that can be used in Vivado

  • a kernel object in .xo format that can be linked for and run on an FPGA accelerator card running the Xilinx Runtime (XRT).

In the background, Vitis Unified IDE uses the following tools:

  • Vitis HLS including v++

  • Vivado for the circuit to bitstream flow

Let us refer to the Vitis Unified IDE as Vitis in the following sections.

PYNQ#

Python productivity for ZYNQ

As a framework#
  • a software framework which combines Python + hardware acceleration

framework

a support structure comprising joined parts, arrangement of support beams that represent a building’s general shape

software framework

software providing generic functionality that can be selectively changed by additional code. This enables application-specific software

  • PYNQ can be seen as a starting point for ideas

Technically#
  • brings Linux, Jupyter notebooks and Python together

  • software components:

    • generic Python APIs for the accelerator

    • Linux drivers

  • hardware components:

    • hardware libraries (overlays), e.g., for audio and video processing

Image processing example#

Let us try the example on page 560 of the MPSoC book together:

  1. log in to Jupyter on a PYNQ board

  2. open a terminal:

    1. Right hand side: Click on New🔽

    2. Click on Open a terminal

  3. pip3 install pynq-helloworld --no-build-isolation
    pynq get-notebooks pynq-helloworld -p /home/xilinx/jupyter_notebooks
    
  4. close the terminal

  5. open the notebook pynq-helloworld/resizer_pl.ipynb.

    • note the processing time at the end

  6. open the notebook pynq-helloworld/resizer_ps.ipynb and compare the processing time

Key observations:

  • processing time drops from 24 ms to 7 ms

  • roughly 3.4x speed-up (24 ms / 7 ms ≈ 3.4)

Hardware architecture for the image resizer#

Page 563

  • Resize IP

  • DMA

  • processing system (PS) <-AXI-> programmable logic (PL)

  • ARM processor <-AXI-GeneralPurpose-> Resize IP

  • Memory controller <-AXI-HighPerf-> DMA

  • data width converter between DMA and Resize IP, based on the AXI4-Stream data width converter

    • 32 to 24 bit converter (RGB)
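The effect of the 32 to 24 bit conversion can be sketched in software. The following Python snippet regroups a stream of 32-bit words into 24-bit values. It is an illustration only: the function name `repack_32_to_24` is hypothetical, and the little-endian byte order is an assumption, so the byte-to-colour-channel assignment of the actual converter may differ.

```python
def repack_32_to_24(words):
    """Repack 32-bit words into 24-bit values via a little-endian byte stream."""
    # Serialize each 32-bit word into 4 bytes, least significant byte first
    stream = b"".join(w.to_bytes(4, "little") for w in words)
    usable = len(stream) - len(stream) % 3  # drop an incomplete trailing pixel
    # Regroup the byte stream into 3-byte (24-bit) values
    return [int.from_bytes(stream[i:i + 3], "little") for i in range(0, usable, 3)]


# Three 32-bit words carry exactly four 24-bit pixels (12 bytes)
words = [0x44332211, 0x88776655, 0xCCBBAA99]
pixels = repack_32_to_24(words)
print([hex(p) for p in pixels])
# prints ['0x332211', '0x665544', '0x998877', '0xccbbaa']
```

The example shows why the widths match up only every 12 bytes: three 32-bit memory words correspond to four RGB pixels.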

Exercise 38

Look at the architectural overview of the Zynq-7000 SoC. What could be the reason for naming the ports general-purpose and high-performance?

Solution to the introductory problem#

First we synthesize our C++ code using Vitis and then use Vivado to implement our design on the board.

Vitis#

Project creation#

To create the project, follow the steps below. If a setting is not mentioned, leave it at its default.

  1. Start Vitis

    • If you don’t have an existing workspace, create one. I recommend setting your code folder as a workspace.

  2. On the Welcome tab, click on HLS Development -> Create Component -> Create Empty HLS component:

  3. Name and location: I recommend:

    • creating a folder for the sources, e.g., macc and then using it as location

    • using simply hls as component name

  4. Configuration File: leave at the defaults

  5. Source Files:

    • DESIGN FILES: add macc.cpp and macc.hpp

    • select macc as top function

    • TEST BENCH FILES: add main.cpp

  6. Hardware: xc7z020clg400-1

  7. Settings:

    • clock: 50MHz

    • flow_target: Vitis Kernel Flow Target

  8. Summary: Click on Finish.

Executing the flow#

  1. Click on C SIMULATION -> ▶Run. Do not activate Code Analyzer.

    The process should output:

    result: 32896
    expect: 32896
    
  2. Run C SYNTHESIS:

    Notable output regarding the interface:

    INFO: [RTGEN 206-500] Setting interface mode on port 'macc/gmem' to 'm_axi'.
    INFO: [RTGEN 206-500] Setting interface mode on port 'macc/xs' to 's_axilite & ap_none'.
    INFO: [RTGEN 206-500] Setting interface mode on port 'macc/ys' to 's_axilite & ap_none'.
    INFO: [RTGEN 206-500] Setting interface mode on function 'macc' to 's_axilite & ap_ctrl_chain'.
    INFO: [HLS 200-1030] Apply Unified Pipeline Control on module 'macc' pipeline 'VITIS_LOOP_6_1' pipeline type 'loop pipeline'
    INFO: [RTGEN 206-100] Bundling port 'return', 'xs' and 'ys' to AXI-Lite port control.
    

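    We see here three of the interface kinds introduced before: the memory-mapped m_axi port for the data, the s_axilite port for configuration, and the ap_ctrl_chain control for starting the circuit and checking its status.

    The same information can also be seen in REPORTS -> Synthesis.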

    Schedule Viewer shows a time analysis of the generated circuit.

    Kernel Guidance shows performance improvement recommendations.

  3. C/RTL COSIMULATION allows a more accurate, cycle-based analysis.

  4. PACKAGE creates a package that can be imported in a Vivado project. The package is under hls/impl/ip/*.zip.

  5. IMPLEMENTATION places and routes the circuit using out-of-context (OOC) synthesis. Out-of-context implies that the environment is not synthesized, e.g., the processor that will use the circuit. In the context of Vitis, OOC is useful to get area and timing estimates for the circuit.

Integration of the IP in Vivado#

We want to connect the circuit to the hard ARM processor on the Zynq FPGA.

  1. Start Vivado

  2. Create a new project. Use macc-single-channel-pynq-z2 under your code folder as Project location. You can use prj as Project name.

  3. Project Type: RTL Project

  4. Add Sources: none

  5. Add Constraints: none

  6. Default Part: xc7z020clg400-1

  7. Click on Finish

In the new project:

  1. PROJECT MANAGER -> ⚙ Settings

  2. Project Settings -> IP -> Repository -> IP Repositories

  3. Add a new repository with the path from the packaging step of the previous Vitis flow (ending with hls/impl/ip).

  4. Close settings

  5. IP Integrator -> Create Block Design -> Design name -> use the name macc

  6. Diagram -> add IP using ➕ -> search for macc and add it

  7. add ZYNQ7 Processing System

  8. Configure the processing system by double-clicking it

  9. In Zynq Block Design click on High Performance AXI 32b/64b ...

  10. Activate S AXI HP0 Interface.

  11. Click OK.

  12. Run Connection Automation -> All Automation -> OK

  13. Run Block Automation, which connects the fixed IO such as DDR. (Connecting them does not seem to make a difference, however; the tool probably connects them automatically.)

  14. In the project settings -> General -> Top module name -> write macc

  15. Generate Bitstream

Using a script for HLS and bitstream generation#

Instead of going through the whole GUI flow, you can use the files under code/macc-single-channel-pynq-z2 and run make.

Using the overlay in PYNQ#

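  1. Log in to the Jupyter environment on PYNQ. The default password is xilinx.

  2. Upload your overlay (the .bit bitstream) and the hand-off file (.hwh). Both files must have the same base name, e.g., macc.bit and macc.hwh.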

  3. Create a new notebook

  4. Run the following code:

import numpy as np

from pynq import Overlay, allocate

ARRAY_SIZE = 2**8
overlay = Overlay("macc.bit")

# Allocate memory for data exchange between microprocessor and macc
xs = allocate(ARRAY_SIZE, np.int32)
ys = allocate(ARRAY_SIZE, np.int32)

# Convenience variable for the registers
regs = overlay.macc.register_map

# Initialize memory addresses for the input buffers `xs` and `ys`. Vitis HLS
# generates 64 bit addresses by default; `*_1` and `*_2` hold the least and
# most significant 32 bits of the address, respectively. PYNQ-Z2 is based on a
# Zynq 7000 series FPGA which includes a Cortex-A9 MPCore processor with a
# 32 bit data path, so only `*_1` is written here.
regs.xs_1 = xs.device_address
regs.ys_1 = ys.device_address

# Fill input data
for i in range(ARRAY_SIZE):
    xs[i] = i + 1
    ys[i] = 1

# Wait until idle before starting the IP
while not regs.CTRL.AP_IDLE:
    pass
regs.CTRL.AP_START = 1

# Wait until done
while not regs.CTRL.AP_DONE:
    pass

display(regs)
assert regs.ap_return.ap_return == sum(xs)  # Assuming ys[:] == 1

The result of the calculation will be under regs.ap_return.
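The split of a 64 bit address into the `*_1` and `*_2` registers can be illustrated without hardware. The sketch below assumes, as described in the code comment above, that `*_1` takes the lower and `*_2` the upper 32 bits. The helper `split_address` is hypothetical and the example address is made up.

```python
def split_address(address):
    """Split a 64 bit address into (lower, upper) 32 bit register values."""
    lower = address & 0xFFFF_FFFF          # would be written to e.g. xs_1
    upper = (address >> 32) & 0xFFFF_FFFF  # would be written to e.g. xs_2
    return lower, upper


# A made-up buffer address that fits into a 32 bit address space
lower, upper = split_address(0x1680_0000)
print(hex(lower), hex(upper))  # prints "0x16800000 0x0"
```

As long as the buffer address fits into 32 bits, the upper half is zero, which is why the notebook code only writes `xs_1` and `ys_1`.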

Warning

The IP gets stuck with AP_START=1 if the calculation is restarted by writing AP_START=1 again.

Workaround: If you want to repeat the calculation, reconfigure the FPGA by executing Overlay(...).

I could not fix this issue. It could be related to PYNQ, because the PYNQ version I used (3.0.1) targets Vivado version 2022.1.
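Because a stuck IP makes busy-wait loops like `while not regs.CTRL.AP_DONE: pass` spin forever, it can help to poll with a timeout so that the notebook cell returns. The following is a minimal sketch and does not fix the issue itself. The helper `wait_for` is hypothetical, and the lambdas are stand-ins for reading a status bit, so the snippet runs without a board.

```python
import time


def wait_for(read_flag, timeout_s=1.0, poll_s=0.001):
    """Poll read_flag() until it returns a truthy value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if read_flag():
            return True
        time.sleep(poll_s)
    return False


# Stand-in for a status bit that becomes 1 on the third read
reads = iter([0, 0, 1])
finished = wait_for(lambda: next(reads), timeout_s=0.5)

# Stand-in for a status bit that never becomes 1 (stuck IP)
stuck = wait_for(lambda: 0, timeout_s=0.05)
print(finished, stuck)  # prints "True False"
```

On the board, the stand-in would be replaced by something like `lambda: regs.CTRL.AP_DONE`, and a `False` result would indicate that the overlay needs to be reloaded.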

Homework#

Exercise 39

In the introductory problem we used the same AXI port for the inputs xs and ys. The following HLS configuration uses

  1. two different AXI ports instead for parallel loading of xs and ys

  2. 32 bit instead of 64 bit addressing, which saves unnecessary address bits because PYNQ-Z2 does not require 64 bit addressing.

Listing 85 code/macc-double-channel-32bit-addr-pynq-z2/hls_config.cfg#
part=xc7z020clg400-1

[hls]
clock=50MHz
flow_target=vitis
package.output.syn=true

# https://docs.amd.com/r/en-US/ug1399-vitis-hls/Interface-Configuration
syn.interface.m_axi_auto_max_ports=1
# Creates a separate AXI port for each argument

syn.interface.m_axi_addr64=0
# Disables default 64bit address width. Results in 32 bit addresses

syn.file=../macc/macc.cpp
syn.file=../macc/macc.hpp
tb.file=../macc/main.cpp
syn.top=macc

  1. Synthesize and package the design using the config above.

  2. Create a Vivado project and connect it to the microprocessor by introducing a second high-performance port.

  3. Check if the IP works using the Python code we used before. Note that you have to replace {x,y}s_1 with {x,y}s, because we synthesize using 32 bit addresses this time.