Timing & FPGA primitives#
Learning goals#
Know what lack of timing constraints can lead to
Use FPGA primitives to implement non-synthesizable features like a clock synthesizer or an analog-to-digital converter
Introductory problem#
(This problem follows Exercise and Exercise).
You are pondering why your RISC-V implementation with register-immediate instructions seemed to work on the board, but the version with the complete set of RV32I instructions did not 🤔.
“Maybe the FPGA that I use has some problems”, you suspect. You try with a second board and the result is the same. “Damn. What is the problem? A five minute walk maybe helps”, you presume and walk down the stairs of your building.
A bit refreshed, you decide to look into the lengthy logs of the synthesis results to search for answers. In the timing report of your implementation you notice the following excerpt:
[Max Delay Paths]
Slack: inf
Source: mp_i/pc_reg[3]/C
(rising edge-triggered cell FDCE)
Destination: mp_i/rf_reg[11][22]/D
Path Group: (none)
Path Type: Max at Slow Process Corner
Data Path Delay: 54.299ns (logic 1.701ns (3.133%) route 52.598ns (96.867%))
Logic Levels: 6 (FDCE=1 LUT4=2 LUT6=2 MUXF7=1)
Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
------------------------------------------------------------------- -------------------
SLICE_X10Y49 FDCE 0.000 0.000 r mp_i/pc_reg[3]/C
SLICE_X10Y49 FDCE (Prop_fdce_C_Q) 0.518 0.518 r mp_i/pc_reg[3]/Q
net (fo=155, routed) 5.016 5.534 mp_i/pc_reg_n_0_[3]
SLICE_X15Y11 MUXF7 (Prop_muxf7_S_O) 0.276 5.810 r mp_i/rf_reg[6][31]_i_2/O
net (fo=3516, routed) 44.871 50.681 mp_i/rf_reg[6][31]_i_2_n_0
SLICE_X44Y85 LUT4 (Prop_lut4_I1_O) 0.327 51.008 r mp_i/rf[29][22]_i_6/O
net (fo=2, routed) 0.734 51.742 mp_i/rf[29][22]_i_6_n_0
SLICE_X44Y84 LUT6 (Prop_lut6_I5_O) 0.332 52.074 r mp_i/rf[14][22]_i_5/O
net (fo=2, routed) 0.951 53.025 mp_i/rf[14][22]_i_5_n_0
SLICE_X44Y83 LUT6 (Prop_lut6_I4_O) 0.124 53.149 f mp_i/rf[11][22]_i_2/O
net (fo=1, routed) 1.027 54.175 mp_i/rf[11][22]_i_2_n_0
SLICE_X49Y83 LUT4 (Prop_lut4_I0_O) 0.124 54.299 r mp_i/rf[11][22]_i_1/O
net (fo=1, routed) 0.000 54.299 mp_i/rf[11][22]_i_1_n_0
SLICE_X49Y83 FDCE r mp_i/rf_reg[11][22]/D
------------------------------------------------------------------- -------------------
What do these lines mean in general?
The line about Data Path Delay states a delay of ~50 ns. Is this a problem?
If the delay of ~50 ns is a problem, how would you change this delay?
The frequency of our input clock is at 100 MHz. How would you decrease/increase the frequency?
Homework#
In section constraints methodology read until the beginning of project flows.
In section defining clocks read until the block create_clock -period 10 [get_ports sysclk].
Some questions to ponder:
Did you use any kind of constraints in your design before?
Mini lecture#
Timing constraints#
Synthesis is driven by constraints. We have seen that the top ports are connected to arbitrary pins of our FPGA if we don’t include any pin placement constraints. This is similar for timing constraints. Without any constraints Vivado does not try to fulfill any clock period requirements.
For example, to specify the period for a clock input pin clk clocked at 100 MHz:
create_clock -period 10 [get_ports clk]
However, the tool may still fail to implement a design that meets our constraints, so we should still analyze the timing results after synthesis.
When we use clocking primitives, the constraints for the generated clocks are derived automatically.
Timing analysis#
Definitions of critical path and slack
post-synthesis/-route simulation
FPGA primitives#
Until now we have seen SystemVerilog design elements called primitives, which are low-level logic gates like xor or and.
An FPGA primitive is a design element that is part of an FPGA model series. This means that an FPGA primitive generally cannot be instantiated on an FPGA of a different series. Example primitives in AMD 7-series and UltraScale FPGAs are:
BUFG: global clock buffer
FDCE: D-FF with clock enable and asynchronous clear
FIFO36E1: 36 Kb FIFO
LUT6: 6-input look-up table
XADC: analog-to-digital converter
OBUFDS: differential signaling output buffer
CMAC: 100G Ethernet medium-access-control
The list of primitives is available in the library of an FPGA series. For example:
The goal of synthesis is to infer these primitives from the HDL. We could also create a structural design using these primitives, but this would make HDL code less readable and non-portable, so we should typically stick to behavioral code. Having said that, there are cases when we cannot describe our intent using behavioral code, e.g.,
converting an analog input to a digital value
creating a 200 MHz clock from a 100 MHz clock source
decoding an Ethernet packet, e.g., CMAC
There are typically three methods to integrate these primitives in our design:
Inference by synthesis using HDL
Manual instantiation using HDL
Using a generator, e.g., IP catalog in the integrated design environment
Clocking#
FPGAs typically have special hardware blocks to generate additional clock signals with different properties (e.g., frequency, skew, duty cycle etc) using a clock source signal. These are called clock management tiles (CMT).
Each CMT consists of a PLL (phase-locked loop) and an MMCM (mixed-mode clock manager). The PLL has a subset of the features of the MMCM. The choice between them typically depends on the required parameters. For example, the MMCM can generate some fractional frequencies that the PLL cannot generate.
PLL concept#
PLL is a control loop:
This loop tracks the phase of the input signal and generates a voltage that is stabilized by the loop filter and fed to the VCO (voltage-controlled oscillator). The VCO generates an oscillating signal that matches the phase, and thus also the frequency, of the input signal. If we integrate a frequency divider, as follows,
then we get a frequency which can even be a multiple of the input frequency.
Why do we get a multiple of the input frequency if we integrate a divider in the feedback?
A PLL is used in digital circuits and also in FPGAs, as follows:
We should prefer CMTs to self-implemented clock dividers if we want to generate clocks in MHz range to save resources and avoid design errors. Generating a slow clock, e.g., 10 Hz, is not possible with a CMT and a divider-based approach must be used.
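For such a slow clock, the size of the required counter follows from the ratio of the two frequencies. The following Python sketch computes how many input clock cycles a divider has to count between two toggles of its output. The helper name `toggle_count` is hypothetical and only used for illustration, and the sketch assumes a 50 % duty cycle and an integer ratio:

```python
def toggle_count(f_in_hz: int, f_out_hz: int) -> int:
    """Input clock cycles between two output toggles (50 % duty cycle)."""
    cycles, remainder = divmod(f_in_hz, 2 * f_out_hz)
    if cycles < 1 or remainder != 0:
        raise ValueError("f_in must be an even integer multiple of f_out")
    return cycles


count = toggle_count(100_000_000, 10)  # 100 MHz input, 10 Hz output
counter_bits = (count - 1).bit_length()  # width needed to count 0 .. count-1
print(count, counter_bits)
```

For a 100 MHz input and a 10 Hz output, the counter has to reach five million, which needs a 23-bit register. This is cheap in logic, but the divided signal does not automatically use the dedicated clock network the way a CMT output does.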
Integration of clocking primitives#
We mentioned three methods for integrating primitives before. Inferring a PLL from behavioral SystemVerilog code is not feasible, so we can either instantiate it manually or use a generator.
Using manual instantiation#
The corresponding primitive for a PLL is PLLE2_BASE.
One idea is to first encapsulate PLLE2_BASE as follows, and to instantiate this wrapper later with appropriate parameters.
module pll_2port #(
    real PERIOD_IN = 10,
    // The resulting division value:
    // DIVIDE_BY_IN * DIVIDE_BY_OUT / FEEDBACK_MULTIPLIER
    int unsigned DIVIDE_BY_OUT0 = 10,  // 1 to 128
    int unsigned DIVIDE_BY_OUT1 = 10,
    int unsigned DIVIDE_BY_IN = 1,  // 1 to 56
    int unsigned FEEDBACK_MULTIPLIER = 5  // 2 to 64
) (
    input rst,
    in,
    output out0,
    out1,
    locked,
    input powerdown = 0
    // Default values for input ports are supported by Vivado,
    // but not by Verilator
);
  logic feedback;
  PLLE2_BASE #(
      .CLKFBOUT_MULT (FEEDBACK_MULTIPLIER),
      .CLKIN1_PERIOD (PERIOD_IN),
      .CLKOUT0_DIVIDE(DIVIDE_BY_OUT0),
      .CLKOUT1_DIVIDE(DIVIDE_BY_OUT1),
      .DIVCLK_DIVIDE (DIVIDE_BY_IN)
  ) pll_i (
      .CLKFBIN(feedback),
      .CLKFBOUT(feedback),
      .CLKIN1(in),
      .CLKOUT0(out0),
      .CLKOUT1(out1),
      .LOCKED(locked),
      .PWRDWN(powerdown),
      .RST(rst)
  );
endmodule
For example if we want to generate 200 MHz and 10 MHz from 100 MHz:
pll_2port #(
.PERIOD_IN(10),
.DIVIDE_BY_OUT0(5),
.DIVIDE_BY_OUT1(100),
.FEEDBACK_MULTIPLIER(10)
The corresponding project can be found on repo:code/pll-example.
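The parameter choice above can be checked with a few lines of arithmetic. The following Python sketch mirrors the parameter names of the wrapper. The function `pll_output_mhz` is a hypothetical helper for illustration, and it does not check whether the intermediate VCO frequency is within the range the device allows (see the data sheet of the FPGA for that):

```python
def pll_output_mhz(f_in_mhz, feedback_multiplier, divide_by_in, divide_by_out):
    """Return (f_vco, f_out) in MHz for one PLL output."""
    # The input is divided once, then multiplied by the feedback divider.
    f_vco = f_in_mhz * feedback_multiplier / divide_by_in
    # Each output has its own divider behind the VCO.
    return f_vco, f_vco / divide_by_out


f_vco, f_out0 = pll_output_mhz(100, feedback_multiplier=10, divide_by_in=1, divide_by_out=5)
_, f_out1 = pll_output_mhz(100, feedback_multiplier=10, divide_by_in=1, divide_by_out=100)
print(f_vco, f_out0, f_out1)
```

With the values from the example, the VCO runs at 1000 MHz and the two outputs are 200 MHz and 10 MHz.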
Using a generator#
You have probably noticed that a manual instantiation of a PLL requires many parameters, which increases the chance of errors. Synthesis IDEs typically include GUI-assisted generators which may be less error-prone.
For example let us create the same PLL above using Vivado’s IP generator:
Click on IP Catalog on the left bar.
Search for clock and click on Clocking Wizard.
Set Input Frequency(MHz) below.
Click on the Output Clocks tab and set Output Freq(MHz). Add other outputs as needed.
Click OK.
On the new pop-up window click on Generate and wait for the generation of the module.
Now you can instantiate the module. Find the module under Sources. Open its source file to see the interface of the module for instantiation.
If you prefer to work with scripts, you can use the following commands in the Tcl console of Vivado:
create_ip -name clk_wiz -module_name pll_2o
set_property -dict [list \
CONFIG.CLKOUT1_USED {true} \
CONFIG.CLKOUT2_USED {true} \
CONFIG.CLKOUT1_REQUESTED_OUT_FREQ {200} \
CONFIG.CLKOUT2_REQUESTED_OUT_FREQ {10} \
] [get_ips pll_2o]
generate_target synthesis [get_ips pll_2o]
Solution to the introductory problem#
First we try to implement the design at 100 MHz using constraints:
create_clock -period 10 [get_ports clk]
After constraining we notice a new table under timing summary that did not exist before we constrained our design:
| Design Timing Summary
| ---------------------
------------------------------------------------------------------------------------------------
WNS(ns) TNS(ns) TNS Failing Endpoints TNS Total Endpoints WHS(ns) THS(ns) THS Failing Endpoints THS Total Endpoints WPWS(ns) TPWS(ns) TPWS Failing Endpoints TPWS Total Endpoints
------- ------- --------------------- ------------------- ------- ------- --------------------- ------------------- -------- -------- ---------------------- --------------------
-5.422 -7116.198 2055 2055 0.226 0.000 0 2055 4.500 0.000 0 1543
Timing constraints are not met.
If the timing is not constrained, then there are no timing constraints to meet, so we did not have this table. The message Timing constraints are not met. is obvious, but why?
According to the design timing summary docs, WNS stands for the worst negative slack. A negative slack \(t_\mathrm{slack}\) means that there is a synchronous combinational path whose signal arrives \(|t_\mathrm{slack}|\) ns too late for the setup window of the flip-flop that ends this path on the FPGA (called endpoint in the report). In other words, we should reimplement this path so that its propagation time becomes at least \(|t_\mathrm{slack}|\) ns shorter.
T{NS, HS, PWS} are the totals, in other words the sums of the negative slacks of the respective category.
Failing Endpoints is the number of failing endpoints, i.e., endpoints of paths with a negative slack. The product of TNS Failing Endpoints and WNS is smaller (more negative) than TNS, because WNS is the worst value.
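As a quick check with the numbers from the summary above, multiplying the number of failing endpoints by the worst slack indeed gives a value that is more negative than the reported total:

\[ 2055 \cdot (-5.422\,\mathrm{ns}) \approx -11142.2\,\mathrm{ns} < -7116.198\,\mathrm{ns} = t_\mathrm{TNS} \]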
WHS stands for the worst hold slack. The hold time of a flip-flop is the minimum duration for which the data input must remain stable after the active clock edge so that the flip-flop reliably captures the value. So a negative hold slack means that the current implementation fails to meet this constraint for an endpoint. Setup and hold slack are checked separately: in the summary above WNS is negative while WHS is positive, so we will focus only on WNS.
Typically all slacks should be positive for proper functioning of our design.
Testing on the FPGA#
So our design should not work, right? When we program the FPGA, we see that our test works even though we saw some paths failing in the timing report. Presumably the short test program did not exercise a failing path, but we may still encounter a failure in the long run.
If we wanted to deliberately cause an error in our design, we should use a program that uses a failing path:
[Max Delay Paths]
Slack (VIOLATED) : -5.422ns (required time - arrival time)
Source: mp_i/mem_reg[0][0]/C
(rising edge-triggered cell FDRE clocked by clk {rise@0.000ns fall@5.000ns period=10.000ns})
Destination: mp_i/rf_reg[4][23]/D
(rising edge-triggered cell FDCE clocked by clk {rise@0.000ns fall@5.000ns period=10.000ns})
Path Group: clk
Path Type: Setup (Max at Slow Process Corner)
Requirement: 10.000ns (clk rise@10.000ns - clk rise@0.000ns)
Data Path Delay: 15.277ns (logic 3.872ns (25.345%) route 11.405ns (74.655%))
Logic Levels: 17 (CARRY4=1 LUT2=2 LUT3=2 LUT6=9 MUXF7=2 MUXF8=1)
Clock Path Skew: -0.139ns (DCD - SCD + CPR)
Destination Clock Delay (DCD): 4.773ns = ( 14.773 - 10.000 )
Source Clock Delay (SCD): 5.090ns
Clock Pessimism Removal (CPR): 0.179ns
Clock Uncertainty: 0.035ns ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
Total System Jitter (TSJ): 0.071ns
Total Input Jitter (TIJ): 0.000ns
Discrete Jitter (DJ): 0.000ns
Phase Error (PE): 0.000ns
Location Delay type Incr(ns) Path(ns) Netlist Resource(s)
------------------------------------------------------------------- -------------------
(clock clk rise edge) 0.000 0.000 r
F14 0.000 0.000 r clk (IN)
net (fo=0) 0.000 0.000 clk
F14 IBUF (Prop_ibuf_I_O) 1.458 1.458 r clk_IBUF_inst/O
net (fo=1, routed) 1.972 3.430 clk_IBUF
BUFGCTRL_X0Y16 BUFG (Prop_bufg_I_O) 0.096 3.526 r clk_IBUF_BUFG_inst/O
net (fo=1542, routed) 1.564 5.090 mp_i/CLK
SLICE_X34Y39 FDRE r mp_i/mem_reg[0][0]/C
------------------------------------------------------------------- -------------------
SLICE_X34Y39 FDRE (Prop_fdre_C_Q) 0.518 5.608 f mp_i/mem_reg[0][0]/Q
net (fo=9, routed) 1.010 6.618 mp_i/mem_reg[0]_31[0]
SLICE_X35Y39 LUT6 (Prop_lut6_I0_O) 0.124 6.742 f mp_i/rf[10][16]_i_24/O
net (fo=1, routed) 0.000 6.742 mp_i/rf[10][16]_i_24_n_0
SLICE_X35Y39 MUXF7 (Prop_muxf7_I1_O) 0.217 6.959 f mp_i/rf_reg[10][16]_i_18/O
net (fo=1, routed) 1.027 7.986 mp_i/rf_reg[10][16]_i_18_n_0
SLICE_X33Y46 LUT6 (Prop_lut6_I0_O) 0.299 8.285 r mp_i/rf[10][16]_i_14/O
net (fo=285, routed) 1.573 9.859 mp_i/rf[10][16]_i_14_n_0
SLICE_X31Y35 LUT6 (Prop_lut6_I2_O) 0.124 9.983 f mp_i/mem[37][7]_i_46/O
net (fo=1, routed) 0.590 10.573 mp_i/mem[37][7]_i_46_n_0
SLICE_X30Y34 LUT6 (Prop_lut6_I3_O) 0.124 10.697 r mp_i/mem[37][7]_i_20/O
net (fo=7, routed) 0.503 11.200 mp_i/mem[37][7]_i_20_n_0
SLICE_X29Y36 LUT3 (Prop_lut3_I2_O) 0.124 11.324 r mp_i/rf[8][1]_i_20/O
net (fo=40, routed) 0.650 11.975 mp_i/rf__0[1]
SLICE_X28Y37 LUT2 (Prop_lut2_I0_O) 0.124 12.099 r mp_i/pc[3]_i_15/O
net (fo=1, routed) 0.000 12.099 mp_i/pc[3]_i_15_n_0
SLICE_X28Y37 CARRY4 (Prop_carry4_S[1]_O[2])
0.580 12.679 r mp_i/pc_reg[3]_i_6/O[2]
net (fo=261, routed) 1.093 13.771 mp_i/p_6_in[2]
SLICE_X31Y42 MUXF7 (Prop_muxf7_S_O) 0.474 14.245 r mp_i/rf_reg[31][27]_i_55/O
net (fo=1, routed) 0.000 14.245 mp_i/rf_reg[31][27]_i_55_n_0
SLICE_X31Y42 MUXF8 (Prop_muxf8_I0_O) 0.104 14.349 r mp_i/rf_reg[31][27]_i_33/O
net (fo=2, routed) 0.661 15.010 mp_i/rf_reg[31][27]_i_33_n_0
SLICE_X31Y43 LUT6 (Prop_lut6_I0_O) 0.316 15.326 f mp_i/rf[8][7]_i_11/O
net (fo=24, routed) 0.642 15.969 mp_i/rf[8][7]_i_11_n_0
SLICE_X32Y45 LUT2 (Prop_lut2_I0_O) 0.124 16.093 f mp_i/rf[24][12]_i_7/O
net (fo=26, routed) 1.113 17.206 mp_i/rf[24][12]_i_7_n_0
SLICE_X29Y60 LUT6 (Prop_lut6_I2_O) 0.124 17.330 r mp_i/rf[4][31]_i_15/O
net (fo=1, routed) 0.403 17.733 mp_i/rf[4][31]_i_15_n_0
SLICE_X29Y60 LUT3 (Prop_lut3_I0_O) 0.124 17.857 r mp_i/rf[4][31]_i_13/O
net (fo=16, routed) 1.468 19.326 mp_i/rf[4][31]_i_13_n_0
SLICE_X29Y83 LUT6 (Prop_lut6_I0_O) 0.124 19.450 r mp_i/rf[4][23]_i_8/O
net (fo=1, routed) 0.327 19.776 mp_i/rf[4][23]_i_8_n_0
SLICE_X30Y84 LUT6 (Prop_lut6_I0_O) 0.124 19.900 r mp_i/rf[4][23]_i_5/O
net (fo=1, routed) 0.343 20.243 mp_i/rf[4][23]_i_5_n_0
SLICE_X31Y84 LUT6 (Prop_lut6_I5_O) 0.124 20.367 r mp_i/rf[4][23]_i_1/O
net (fo=1, routed) 0.000 20.367 mp_i/rf[4][23]_i_1_n_0
SLICE_X31Y84 FDCE r mp_i/rf_reg[4][23]/D
------------------------------------------------------------------- -------------------
(clock clk rise edge) 10.000 10.000 r
F14 0.000 10.000 r clk (IN)
net (fo=0) 0.000 10.000 clk
F14 IBUF (Prop_ibuf_I_O) 1.388 11.388 r clk_IBUF_inst/O
net (fo=1, routed) 1.868 13.256 clk_IBUF
BUFGCTRL_X0Y16 BUFG (Prop_bufg_I_O) 0.091 13.347 r clk_IBUF_BUFG_inst/O
net (fo=1542, routed) 1.426 14.773 mp_i/CLK
SLICE_X31Y84 FDCE r mp_i/rf_reg[4][23]/C
clock pessimism 0.179 14.952
clock uncertainty -0.035 14.916
SLICE_X31Y84 FDCE (Setup_fdce_C_D) 0.029 14.945 mp_i/rf_reg[4][23]
-------------------------------------------------------------------
required time 14.945
arrival time -20.367
-------------------------------------------------------------------
slack -5.422
In this path the bit from mem_reg[0][0] to rf_reg[4][23] takes too long. If we could come up with an instruction that uses this path, we may trigger an error. TODO
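The slack in the last line of the report can be reproduced from the values listed in it. The symbols \(t_\mathrm{arrival}\), \(t_\mathrm{required}\), \(t_\mathrm{SCD}\) and \(t_\mathrm{data}\) are introduced here only for illustration: the arrival time is the source clock delay plus the data path delay, and the slack is the required time minus the arrival time.

\[ t_\mathrm{arrival} = t_\mathrm{SCD} + t_\mathrm{data} = 5.090\,\mathrm{ns} + 15.277\,\mathrm{ns} = 20.367\,\mathrm{ns} \]

\[ t_\mathrm{slack} = t_\mathrm{required} - t_\mathrm{arrival} = 14.945\,\mathrm{ns} - 20.367\,\mathrm{ns} = -5.422\,\mathrm{ns} \]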
Searching for a positive slack#
To increase the time between failures, we should decrease the frequency. But by how much? One idea is to make our period at least \(|t_\mathrm{slack}| = |t_\mathrm{WNS}|\) longer.
So we should try a target period of \( 10\,\mathrm{ns} + |t_\mathrm{WNS}| \approx 15\,\mathrm{ns} \). This means we have to divide our input clock frequency by \( \frac{15}{10} = 1.5 \).
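The period arithmetic can be sketched in a few lines of Python. The helper names `relaxed_period_ns` and `period_ns_to_mhz` are hypothetical and only used for illustration. The result is a first estimate rather than a guarantee, because a new implementation run places and routes the design differently:

```python
def relaxed_period_ns(period_ns: float, wns_ns: float) -> float:
    """Lengthen the clock period by |WNS| if the worst slack is negative."""
    return period_ns + max(0.0, -wns_ns)


def period_ns_to_mhz(period_ns: float) -> float:
    """Convert a clock period in ns to a frequency in MHz."""
    return 1000.0 / period_ns


exact_period = relaxed_period_ns(10.0, -5.422)
print(round(exact_period, 3), round(period_ns_to_mhz(exact_period), 2))
print(round(period_ns_to_mhz(15.0), 2))
```

The unrounded sum is 15.422 ns, which corresponds to about 64.84 MHz, while the rounded period of 15 ns corresponds to about 66.67 MHz. Rounding down to 15 ns is therefore slightly more aggressive than the estimate, so the timing summary of the new implementation run has to confirm that the slack is positive.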
In this solution let us use the generator which selects the divisors and multipliers automatically to synthesize the target frequency. However it needs a target frequency as input. 15 ns period corresponds to ~67 MHz.
create_ip -name clk_wiz -module_name pll
set_property -dict [list \
CONFIG.CLKOUT1_USED {true} \
CONFIG.CLKOUT1_REQUESTED_OUT_FREQ {67} \
] [get_ips pll]
generate_target synthesis [get_ips pll]
The PLL requires some time to lock to the phase of the input signal, so we should leave the reset signal of the processor active until the PLL has locked:
module mp_boolean (
    input clk,
    input [3:0] btn,
    output logic [15:0] led
);
  logic clk_mp, pll_locked;
  // Declared before its first use in the PLL instantiation
  logic rst;

  pll pll_i (
      .reset(rst),
      .clk_in1(clk),
      .clk_out1(clk_mp),
      .locked(pll_locked)
  );

  mp #(
      .MEM_INIT_HEX_FILE("program.mem"),
      .MEM_DEPTH_LG2(4)
  ) mp_i (
      .*,
      // Keep the processor in reset until the PLL has locked
      .rst(!pll_locked),
      .clk(clk_mp)
  );

  always_comb begin
    led = 0;
    for (int reg_id = 10; reg_id <= 19; ++reg_id) led[reg_id-10] = mp_i.rf[reg_id][0];
  end

  // Pressing any button resets the PLL
  assign rst = |btn;
endmodule
Indeed we get a positive slack 🥳:
| Design Timing Summary
| ---------------------
------------------------------------------------------------------------------------------------
WNS(ns) TNS(ns) TNS Failing Endpoints TNS Total Endpoints WHS(ns) THS(ns) THS Failing Endpoints THS Total Endpoints WPWS(ns) TPWS(ns) TPWS Failing Endpoints TPWS Total Endpoints
------- ------- --------------------- ------------------- ------- ------- --------------------- ------------------- -------- -------- ---------------------- --------------------
0.558 0.000 0 1279 0.223 0.000 0 1279 3.000 0.000 0 1157
All user specified timing constraints are met.
There are also further optimization approaches and probably we can achieve higher frequencies but our current goal is to get an efficient design with reasonable effort.
Homework#
We used a generator in the solution for our introductory problem. Instantiate the PLL manually using primitives.
In this exercise you will create and instantiate clocking and differential output primitives for display output using HDMI.
Steps:
Create a project
Clone hdmi-util/hdmi and add all the SystemVerilog files under src/ to your project
Create a PLL with the following properties:
name: pll
choose clock input frequency
create a clock output called clk_pixel at 74.25 MHz
create a clock output called clk_pixel_x5 at 371.25 MHz
Instantiate four differential output buffers OBUFDS. These are three differential outputs for the signal tmds[2:0] and one for tmds_clock. The outputs should be connected to the ports hdmi_tx_* and hdmi_clk_*. p stands for positive and n for negative.
Synthesize, place & route
Analyze the timing report
What is the minimum slack?
Are the constraints met?
Connect the HDMI cable to a monitor and program your board. You should see a red screen.
Use the template hdmi_boolean.sv.
module hdmi_boolean (
input clk,
output [2:0] hdmi_tx_p,
hdmi_tx_n,
output hdmi_clk_p,
hdmi_clk_n
);
logic clk_pixel_x5, clk_pixel, tmds_clock;
logic [23:0] rgb;
logic [ 2:0] tmds;
logic [10:0] cx;
logic [ 9:0] cy;
hdmi #(
.DVI_OUTPUT(1),
.VIDEO_ID_CODE(4),
.VIDEO_REFRESH_RATE(60.0)
) hdmi (
.clk_pixel_x5,
.clk_pixel,
.tmds_clock,
.rgb,
.tmds,
.cx,
.cy
);
assign rgb = '1;
Optional challenges:
Display only green.
Display three vertical stripes with red, green and blue.
Display a left to right red gradient, leftmost side should be black and rightmost side should be red.
Use image2mem.py to display an image.
Display a white ball and make it move with a constant velocity. It should bounce back at the edges.
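# Convert an image to a hex memory file with 12-bit RGB pixels (4 bits per channel)
from PIL import Image

img = Image.open("image.jpg").convert("RGB")  # make sure every pixel has three channels
img.thumbnail((320, 240))  # shrink in place, keeping the aspect ratio
with open("image.memh", "w") as f:
    f.writelines(
        # Keep the upper 4 bits of each channel and pack them as 0xRGB
        hex((pixel[0] >> 4 << 8) + (pixel[1] >> 4 << 4) + (pixel[2] >> 4))[2:]
        + "\n"
        for pixel in img.getdata()
    )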