Timing & FPGA primitives#

Learning goals#

  • Know what lack of timing constraints can lead to

  • Use FPGA primitives to implement non-synthesizable features such as a clock synthesizer or an analog-to-digital converter

Introductory problem#

(This problem follows Exercise and Exercise).

You are pondering why your RISC-V implementation with register-immediate instructions seemed to work on the board, but the version with the complete set of RV32I instructions did not 🤔.

“Maybe the FPGA that I use has some problems”, you suspect. You try with a second board and the result is the same. “Damn. What is the problem? Maybe a five-minute walk helps”, you think, and walk down the stairs of your building.

A bit refreshed, you decide to look into the lengthy logs of the synthesis results to search for answers. In the timing report of your implementation you notice the following excerpt:

Listing 52 code/riscv-single-cycle-boolean/prj/prj.runs/impl_1/mp_boolean_timing_summary_routed.rpt#
[Max Delay Paths]
Slack:                    inf
  Source:                 mp_i/pc_reg[3]/C
                            (rising edge-triggered cell FDCE)
  Destination:            mp_i/rf_reg[11][22]/D
  Path Group:             (none)
  Path Type:              Max at Slow Process Corner
  Data Path Delay:        54.299ns  (logic 1.701ns (3.133%)  route 52.598ns (96.867%))
  Logic Levels:           6  (FDCE=1 LUT4=2 LUT6=2 MUXF7=1)

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
    SLICE_X10Y49         FDCE                         0.000     0.000 r  mp_i/pc_reg[3]/C
    SLICE_X10Y49         FDCE (Prop_fdce_C_Q)         0.518     0.518 r  mp_i/pc_reg[3]/Q
                         net (fo=155, routed)         5.016     5.534    mp_i/pc_reg_n_0_[3]
    SLICE_X15Y11         MUXF7 (Prop_muxf7_S_O)       0.276     5.810 r  mp_i/rf_reg[6][31]_i_2/O
                         net (fo=3516, routed)       44.871    50.681    mp_i/rf_reg[6][31]_i_2_n_0
    SLICE_X44Y85         LUT4 (Prop_lut4_I1_O)        0.327    51.008 r  mp_i/rf[29][22]_i_6/O
                         net (fo=2, routed)           0.734    51.742    mp_i/rf[29][22]_i_6_n_0
    SLICE_X44Y84         LUT6 (Prop_lut6_I5_O)        0.332    52.074 r  mp_i/rf[14][22]_i_5/O
                         net (fo=2, routed)           0.951    53.025    mp_i/rf[14][22]_i_5_n_0
    SLICE_X44Y83         LUT6 (Prop_lut6_I4_O)        0.124    53.149 f  mp_i/rf[11][22]_i_2/O
                         net (fo=1, routed)           1.027    54.175    mp_i/rf[11][22]_i_2_n_0
    SLICE_X49Y83         LUT4 (Prop_lut4_I0_O)        0.124    54.299 r  mp_i/rf[11][22]_i_1/O
                         net (fo=1, routed)           0.000    54.299    mp_i/rf[11][22]_i_1_n_0
    SLICE_X49Y83         FDCE                                         r  mp_i/rf_reg[11][22]/D
  -------------------------------------------------------------------    -------------------
  1. What do these lines mean in general?

  2. The line about Data Path Delay states a delay of ~50 ns. Is this a problem?

  3. If the delay of ~50ns is a problem, how would you change this delay?

  4. The frequency of our input clock is at 100 MHz. How would you decrease/increase the frequency?


Some questions to ponder:

  • Did you use any kind of constraints in your design before?

Mini lecture#

Timing constraints#

Synthesis is driven by constraints. We have seen that the top ports are connected to arbitrary pins of our FPGA if we don’t include any pin placement constraints. This is similar for timing constraints. Without any constraints Vivado does not try to fulfill any clock period requirements.

For example, to specify the period for a clock input pin clk clocked at 100 MHz:

create_clock -period 10 [get_ports clk]

However, the implementation may still fail to meet our constraints, so we should always analyze the timing results after synthesis and implementation.

When we use clocking primitives, the constraints for the generated clocks are set automatically.
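The value passed to -period is simply the clock period in nanoseconds, i.e. 1000 divided by the frequency in MHz. A minimal sketch of this conversion follows; the helper names period_ns and create_clock_line are hypothetical, and only the create_clock form shown above is assumed:

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in ns for a frequency given in MHz."""
    if freq_mhz <= 0:
        raise ValueError("frequency must be positive")
    return 1000.0 / freq_mhz


def create_clock_line(port: str, freq_mhz: float) -> str:
    """Format an XDC line of the form used in the text (hypothetical helper)."""
    return f"create_clock -period {period_ns(freq_mhz):.3f} [get_ports {port}]"


# 100 MHz input clock on port clk, as in the example above
print(create_clock_line("clk", 100))
```

For the 100 MHz clock this yields a period of 10 ns, matching the constraint above.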

Timing analysis#

FPGA primitives#

Until now we have seen SystemVerilog design elements called primitives, which are low-level logic gates like xor or and.

An FPGA primitive is a design element that is part of an FPGA model series. This means that a primitive of one series may not be available on a different FPGA model. Example primitives in AMD 7-series and UltraScale FPGAs are:

  • BUFG: global clock buffer

  • FDCE: D-FF with clock enable and asynchronous clear

  • FIFO36E1: 36 Kb FIFO

  • LUT6: 6-input look-up table

  • XADC: analog-to-digital converter

  • OBUFDS: differential signaling output buffer

  • CMAC: 100G Ethernet medium-access-control

The list of primitives is available in the library of an FPGA series. For example:

The goal of synthesis is to infer these primitives from the HDL. We could also create a structural design using these primitives, but this would make HDL code less readable and non-portable, so we should typically stick to behavioral code. Having said that, there are cases when we cannot describe our intent using behavioral code, e.g.,

  • converting an analog input to a digital value

  • creating a 200 MHz clock from a 100 MHz clock source

  • decoding an Ethernet packet, e.g., CMAC

There are typically three methods to integrate these primitives in our design:

  1. Inference by synthesis using HDL

  2. Manual instantiation using HDL

  3. Using a generator, e.g., IP catalog in the integrated design environment


FPGAs typically have special hardware blocks to generate additional clock signals with different properties (e.g., frequency, skew, duty cycle etc) using a clock source signal. These are called clock management tiles (CMT).

Each CMT consists of a PLL (phase-locked loop) and an MMCM (mixed-mode clock manager). The PLL has a subset of the features of the MMCM. Typically the choice depends on the required parameters. For example, an MMCM can generate some fractional frequencies that a PLL cannot generate.

PLL concept#

PLL is a control loop:


This loop tracks the phase of the input signal and generates a voltage stabilized by the loop filter, which in turn is fed to the VCO (voltage-controlled oscillator). The VCO generates an oscillating signal which matches the phase of the input signal and thus also has the same frequency. If we integrate a frequency divider, as follows,


then we get a frequency which can even be a multiple of the input frequency.

Exercise 30

Why do we get a multiple of the input frequency if we integrate a divider in the feedback?

A PLL is used in digital circuits and also in FPGAs, as follows:


We should prefer CMTs to self-implemented clock dividers if we want to generate clocks in the MHz range, to save resources and avoid design errors. Generating a slow clock, e.g., 10 Hz, is not possible with a CMT, so a divider-based approach must be used.
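For the divider-based approach, the main design parameters are how many input cycles to count before toggling the output and how wide the counter must be. The following sketch computes both for a 50% duty cycle; the function name divider_settings is hypothetical and the counter scheme (count from 0 to toggle_every - 1, then toggle) is one common choice, not necessarily the one used in this course:

```python
def divider_settings(f_in_hz: int, f_out_hz: int) -> tuple:
    """Return (toggle_every, counter_bits) for a counter-based clock divider.

    The output toggles once every `toggle_every` input cycles, so one full
    output period spans 2 * toggle_every input cycles.
    """
    if f_out_hz <= 0 or f_in_hz % (2 * f_out_hz) != 0:
        raise ValueError("f_in / f_out must be an even integer ratio")
    toggle_every = f_in_hz // (2 * f_out_hz)
    # The counter runs from 0 to toggle_every - 1.
    counter_bits = max(1, (toggle_every - 1).bit_length())
    return toggle_every, counter_bits


# 10 Hz from the 100 MHz board clock
print(divider_settings(100_000_000, 10))  # (5000000, 23)
```

So a 10 Hz signal derived from 100 MHz needs a 23-bit counter that toggles the output every 5,000,000 cycles.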

Integration of clocking primitives#

We mentioned before three methods for integrating primitives. Inferring a PLL from behavioral SystemVerilog code is not feasible, so we can either instantiate it manually or use a generator.

Using manual instantiation#

The corresponding primitive for a PLL is PLLE2_BASE.

For example if we want to generate 200 MHz and 10 MHz from 100 MHz:

Listing 53 code/pll-example/pll_2port.sv#
module pll_2port #(
    real PERIOD_IN = 10,
    // The resulting division value:
    int unsigned DIVIDE_BY_OUT0 = 10,  // 1 to 128
    int unsigned DIVIDE_BY_OUT1 = 10,
    int unsigned DIVIDE_BY_IN = 1,  // 1 to 56
    int unsigned FEEDBACK_MULTIPLIER = 5  // 2 to 64

) (
    input  rst,
    output out0,
    input  powerdown = 0
    // Default values for inputs not supported by Verilator
    // but Vivado
  logic feedback;
  ) pll_i (

The corresponding testbench can be found on repo:code/pll-example.
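To see how the multiplier and divider parameters determine the output frequencies, we can sketch the usual PLL relation: the VCO runs at f_in · M / D and each output is the VCO frequency divided by its own output divider. The function name pll_outputs is hypothetical, and the VCO range of 800 to 1600 MHz is an assumption that must be checked against the datasheet of the actual device and speed grade. The values M = 10 and output dividers 5 and 100 are one possible choice for 200 MHz and 10 MHz, not necessarily the ones used in the repository:

```python
def pll_outputs(f_in_mhz, multiplier, divide_in, divide_outs,
                vco_range_mhz=(800.0, 1600.0)):
    """Return (f_vco, [f_out, ...]) in MHz for a PLL configuration.

    vco_range_mhz is an assumed limit; check the datasheet of your device.
    """
    f_vco = f_in_mhz * multiplier / divide_in
    lo, hi = vco_range_mhz
    if not lo <= f_vco <= hi:
        raise ValueError(f"VCO frequency {f_vco} MHz outside {lo}..{hi} MHz")
    return f_vco, [f_vco / d for d in divide_outs]


# 100 MHz input, feedback multiplier 10, input divider 1,
# output dividers 5 and 100
f_vco, outs = pll_outputs(100.0, multiplier=10, divide_in=1,
                          divide_outs=[5, 100])
print(f_vco, outs)  # 1000.0 [200.0, 10.0]
```

Checking the VCO frequency up front is useful because a combination can produce the desired output frequency on paper and still be rejected by the tools.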

Using a generator#

You have probably noticed that a manual instantiation of PLL requires many parameters which increase the chance of writing errors. Synthesis IDEs typically include GUI-assisted generators which may be less error-prone.

For example let us create the same PLL above using Vivado’s IP generator:

  1. Click on IP Catalog on the left bar.

  2. Search for clock and click on Clocking Wizard.

  3. Set Input Frequency(MHz) below.

  4. Click on the Output Clocks tab and set Output Freq(MHz). Add other outputs as needed.

  5. Click OK.

  6. On the new pop-up window click on Generate and wait for the generation of the module.

  7. Now you can instantiate the module. Find the module under Sources. Open its source file to see the interface of the module for instantiation.

If you prefer to work with scripts, then you can use the following commands in the Tcl console of Vivado:

create_ip -name clk_wiz -module_name pll_2o
set_property -dict [list \
] [get_ips pll_2o]
generate_target synthesis [get_ips pll_2o]

Solution to the introductory problem#

First we try to implement the design at 100 MHz using constraints:

Listing 54 code/riscv-single-cycle-boolean-constrained/timing.xdc#
create_clock -period 10 [get_ports clk]

After constraining we notice a new table under timing summary that did not exist before we constrained our design:

Listing 55 code/riscv-single-cycle-boolean-constrained/prj/prj.runs/impl_1/mp_boolean_timing_summary_routed.rpt#
| Design Timing Summary
| ---------------------

    WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints      WHS(ns)      THS(ns)  THS Failing Endpoints  THS Total Endpoints     WPWS(ns)     TPWS(ns)  TPWS Failing Endpoints  TPWS Total Endpoints  
    -------      -------  ---------------------  -------------------      -------      -------  ---------------------  -------------------     --------     --------  ----------------------  --------------------  
     -5.422    -7116.198                   2055                 2055        0.226        0.000                      0                 2055        4.500        0.000                       0                  1543  

Timing constraints are not met.

If the timing is not constrained, then there are no timing constraints to meet, so we did not have this table. The message Timing constraints are not met. is clear, but why are they not met?

According to the design timing summary docs, WNS stands for the worst negative slack. A negative slack \(t_\mathrm{slack}\) means that a synchronous combinational path missed the setup window of the flip-flop that ends this path on the FPGA (called endpoint in the report) by \(|t_\mathrm{slack}|\) ns. In other words, we should reimplement this path so that the propagation time is \(|t_\mathrm{slack}|\) ns shorter.

TNS, THS and TPWS are the totals, in other words the sums of all negative slacks, for setup, hold and pulse width, respectively.

Failing Endpoints is the number of endpoints with a failing path. The product of TNS Failing Endpoints and WNS (2055 · −5.422 ns ≈ −11142 ns) is smaller than TNS (−7116.198 ns), because WNS is the worst value and most endpoints fail by less.

WHS stands for the worst hold slack. The hold time of a flip-flop is the minimum duration for which the data input must stay stable after the active clock edge so that the flip-flop reliably stores the bit. So a negative hold slack means that the current implementation fails to meet this constraint for an endpoint. In our report WHS is positive although WNS is negative, so we will focus only on WNS.

Typically all slacks should be positive for proper functioning of our design.
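The relations between the summary columns can be checked with a few lines. This is only a sketch: the dictionary keys are our own naming, and the numbers are copied by hand from the Design Timing Summary row of the constrained design above:

```python
# Values from the Design Timing Summary row (slacks in ns, counts unitless)
summary = {
    "WNS": -5.422, "TNS": -7116.198, "TNS_failing": 2055,
    "WHS": 0.226, "THS": 0.0,
    "WPWS": 4.5, "TPWS": 0.0,
}


def timing_met(s):
    """Timing is met when none of the worst-case slacks is negative."""
    return all(s[key] >= 0 for key in ("WNS", "WHS", "WPWS"))


# If every failing endpoint failed as badly as the worst one,
# TNS would reach this lower bound.
tns_bound = summary["TNS_failing"] * summary["WNS"]

print("timing met:", timing_met(summary))
print("TNS lower bound (ns):", round(tns_bound, 3))
print("bound holds:", tns_bound <= summary["TNS"] <= 0)
```

Only the setup check fails here; the hold and pulse width slacks are positive.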

Testing on the FPGA#

So our design should not work, right? When we program the FPGA, we see that our test works, even though we saw some paths failing in the timing report. Presumably our short test program did not trigger an error, but we may still encounter a failure in the long run.

If we wanted to deliberately cause an error in our design, we should use a program that uses a failing path:

Listing 56 code/riscv-single-cycle-boolean-constrained/prj/prj.runs/impl_1/mp_boolean_timing_summary_routed.rpt#
[Max Delay Paths]
Slack (VIOLATED) :        -5.422ns  (required time - arrival time)
  Source:                 mp_i/mem_reg[0][0]/C
                            (rising edge-triggered cell FDRE clocked by clk  {rise@0.000ns fall@5.000ns period=10.000ns})
  Destination:            mp_i/rf_reg[4][23]/D
                            (rising edge-triggered cell FDCE clocked by clk  {rise@0.000ns fall@5.000ns period=10.000ns})
  Path Group:             clk
  Path Type:              Setup (Max at Slow Process Corner)
  Requirement:            10.000ns  (clk rise@10.000ns - clk rise@0.000ns)
  Data Path Delay:        15.277ns  (logic 3.872ns (25.345%)  route 11.405ns (74.655%))
  Logic Levels:           17  (CARRY4=1 LUT2=2 LUT3=2 LUT6=9 MUXF7=2 MUXF8=1)
  Clock Path Skew:        -0.139ns (DCD - SCD + CPR)
    Destination Clock Delay (DCD):    4.773ns = ( 14.773 - 10.000 ) 
    Source Clock Delay      (SCD):    5.090ns
    Clock Pessimism Removal (CPR):    0.179ns
  Clock Uncertainty:      0.035ns  ((TSJ^2 + TIJ^2)^1/2 + DJ) / 2 + PE
    Total System Jitter     (TSJ):    0.071ns
    Total Input Jitter      (TIJ):    0.000ns
    Discrete Jitter          (DJ):    0.000ns
    Phase Error              (PE):    0.000ns

    Location             Delay type                Incr(ns)  Path(ns)    Netlist Resource(s)
  -------------------------------------------------------------------    -------------------
                         (clock clk rise edge)        0.000     0.000 r  
    F14                                               0.000     0.000 r  clk (IN)
                         net (fo=0)                   0.000     0.000    clk
    F14                  IBUF (Prop_ibuf_I_O)         1.458     1.458 r  clk_IBUF_inst/O
                         net (fo=1, routed)           1.972     3.430    clk_IBUF
    BUFGCTRL_X0Y16       BUFG (Prop_bufg_I_O)         0.096     3.526 r  clk_IBUF_BUFG_inst/O
                         net (fo=1542, routed)        1.564     5.090    mp_i/CLK
    SLICE_X34Y39         FDRE                                         r  mp_i/mem_reg[0][0]/C
  -------------------------------------------------------------------    -------------------
    SLICE_X34Y39         FDRE (Prop_fdre_C_Q)         0.518     5.608 f  mp_i/mem_reg[0][0]/Q
                         net (fo=9, routed)           1.010     6.618    mp_i/mem_reg[0]_31[0]
    SLICE_X35Y39         LUT6 (Prop_lut6_I0_O)        0.124     6.742 f  mp_i/rf[10][16]_i_24/O
                         net (fo=1, routed)           0.000     6.742    mp_i/rf[10][16]_i_24_n_0
    SLICE_X35Y39         MUXF7 (Prop_muxf7_I1_O)      0.217     6.959 f  mp_i/rf_reg[10][16]_i_18/O
                         net (fo=1, routed)           1.027     7.986    mp_i/rf_reg[10][16]_i_18_n_0
    SLICE_X33Y46         LUT6 (Prop_lut6_I0_O)        0.299     8.285 r  mp_i/rf[10][16]_i_14/O
                         net (fo=285, routed)         1.573     9.859    mp_i/rf[10][16]_i_14_n_0
    SLICE_X31Y35         LUT6 (Prop_lut6_I2_O)        0.124     9.983 f  mp_i/mem[37][7]_i_46/O
                         net (fo=1, routed)           0.590    10.573    mp_i/mem[37][7]_i_46_n_0
    SLICE_X30Y34         LUT6 (Prop_lut6_I3_O)        0.124    10.697 r  mp_i/mem[37][7]_i_20/O
                         net (fo=7, routed)           0.503    11.200    mp_i/mem[37][7]_i_20_n_0
    SLICE_X29Y36         LUT3 (Prop_lut3_I2_O)        0.124    11.324 r  mp_i/rf[8][1]_i_20/O
                         net (fo=40, routed)          0.650    11.975    mp_i/rf__0[1]
    SLICE_X28Y37         LUT2 (Prop_lut2_I0_O)        0.124    12.099 r  mp_i/pc[3]_i_15/O
                         net (fo=1, routed)           0.000    12.099    mp_i/pc[3]_i_15_n_0
    SLICE_X28Y37         CARRY4 (Prop_carry4_S[1]_O[2])
                                                      0.580    12.679 r  mp_i/pc_reg[3]_i_6/O[2]
                         net (fo=261, routed)         1.093    13.771    mp_i/p_6_in[2]
    SLICE_X31Y42         MUXF7 (Prop_muxf7_S_O)       0.474    14.245 r  mp_i/rf_reg[31][27]_i_55/O
                         net (fo=1, routed)           0.000    14.245    mp_i/rf_reg[31][27]_i_55_n_0
    SLICE_X31Y42         MUXF8 (Prop_muxf8_I0_O)      0.104    14.349 r  mp_i/rf_reg[31][27]_i_33/O
                         net (fo=2, routed)           0.661    15.010    mp_i/rf_reg[31][27]_i_33_n_0
    SLICE_X31Y43         LUT6 (Prop_lut6_I0_O)        0.316    15.326 f  mp_i/rf[8][7]_i_11/O
                         net (fo=24, routed)          0.642    15.969    mp_i/rf[8][7]_i_11_n_0
    SLICE_X32Y45         LUT2 (Prop_lut2_I0_O)        0.124    16.093 f  mp_i/rf[24][12]_i_7/O
                         net (fo=26, routed)          1.113    17.206    mp_i/rf[24][12]_i_7_n_0
    SLICE_X29Y60         LUT6 (Prop_lut6_I2_O)        0.124    17.330 r  mp_i/rf[4][31]_i_15/O
                         net (fo=1, routed)           0.403    17.733    mp_i/rf[4][31]_i_15_n_0
    SLICE_X29Y60         LUT3 (Prop_lut3_I0_O)        0.124    17.857 r  mp_i/rf[4][31]_i_13/O
                         net (fo=16, routed)          1.468    19.326    mp_i/rf[4][31]_i_13_n_0
    SLICE_X29Y83         LUT6 (Prop_lut6_I0_O)        0.124    19.450 r  mp_i/rf[4][23]_i_8/O
                         net (fo=1, routed)           0.327    19.776    mp_i/rf[4][23]_i_8_n_0
    SLICE_X30Y84         LUT6 (Prop_lut6_I0_O)        0.124    19.900 r  mp_i/rf[4][23]_i_5/O
                         net (fo=1, routed)           0.343    20.243    mp_i/rf[4][23]_i_5_n_0
    SLICE_X31Y84         LUT6 (Prop_lut6_I5_O)        0.124    20.367 r  mp_i/rf[4][23]_i_1/O
                         net (fo=1, routed)           0.000    20.367    mp_i/rf[4][23]_i_1_n_0
    SLICE_X31Y84         FDCE                                         r  mp_i/rf_reg[4][23]/D
  -------------------------------------------------------------------    -------------------

                         (clock clk rise edge)       10.000    10.000 r  
    F14                                               0.000    10.000 r  clk (IN)
                         net (fo=0)                   0.000    10.000    clk
    F14                  IBUF (Prop_ibuf_I_O)         1.388    11.388 r  clk_IBUF_inst/O
                         net (fo=1, routed)           1.868    13.256    clk_IBUF
    BUFGCTRL_X0Y16       BUFG (Prop_bufg_I_O)         0.091    13.347 r  clk_IBUF_BUFG_inst/O
                         net (fo=1542, routed)        1.426    14.773    mp_i/CLK
    SLICE_X31Y84         FDCE                                         r  mp_i/rf_reg[4][23]/C
                         clock pessimism              0.179    14.952    
                         clock uncertainty           -0.035    14.916    
    SLICE_X31Y84         FDCE (Setup_fdce_C_D)        0.029    14.945    mp_i/rf_reg[4][23]
                         required time                         14.945    
                         arrival time                         -20.367    
                         slack                                 -5.422    

In this path the signal from mem_reg[0][0] to rf_reg[4][23] takes too long: it arrives at 20.367 ns, but is required at 14.945 ns. If we could come up with an instruction that uses this path, we may trigger an error. TODO
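The slack in the report is the required time minus the arrival time. As a sanity check, we can recompute both from the individual entries of the path report above. The variable names are ours; the numbers are the rounded values printed in the report, so the result can differ from the reported slack by about a picosecond:

```python
# All values in ns, copied from the path report
source_clock_delay = 5.090   # SCD: clock arrival at the source flip-flop
data_path_delay = 15.277     # clock-to-Q, logic and routing delay
arrival = source_clock_delay + data_path_delay

capture_edge = 10.000        # next rising edge of clk (the requirement)
dest_clock_delay = 4.773     # DCD: clock arrival at the destination
pessimism = 0.179            # clock pessimism removal (CPR)
uncertainty = 0.035          # clock uncertainty, subtracted
setup_entry = 0.029          # Setup_fdce_C_D entry, added in the report
required = (capture_edge + dest_clock_delay + pessimism
            - uncertainty + setup_entry)

slack = required - arrival
print(f"arrival={arrival:.3f} required={required:.3f} slack={slack:.3f}")
```

The recomputed slack is about −5.421 ns, which agrees with the reported −5.422 ns up to rounding of the printed values.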

Searching for a positive slack#

To increase the time between failures, we should decrease the frequency. By how much, however? An idea is to make our period about \(|t_\mathrm{slack}| = |t_\mathrm{WNS}|\) longer.

So we should set our target period to \( 10 \mathrm{ns} + |t_\mathrm{WNS}| \approx 15 \mathrm{ns} \). This means we have to divide the frequency of our input clock by \( \frac{15}{10} = 1.5 \).

In this solution let us use the generator, which selects the divisors and multipliers automatically to synthesize the target frequency. However, it needs a target frequency as input. A 15 ns period corresponds to ~67 MHz.
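The arithmetic behind the target frequency can be sketched as follows. The function name relaxed_clock is hypothetical; it only lengthens the period by the magnitude of a negative WNS and converts the result to MHz:

```python
def relaxed_clock(period_ns: float, wns_ns: float):
    """Return (new_period_ns, new_freq_mhz) after adding |WNS| to the period.

    A non-negative WNS leaves the period unchanged.
    """
    new_period = period_ns + max(0.0, -wns_ns)
    return new_period, 1000.0 / new_period


exact_period, exact_freq = relaxed_clock(10.0, -5.422)
print(f"{exact_period:.3f} ns -> {exact_freq:.2f} MHz")

# The text rounds the period to 15 ns
print(f"15 ns -> {1000.0 / 15:.1f} MHz")
```

Note that the unrounded estimate is 15.422 ns (about 64.8 MHz), so the 15 ns used here is slightly more aggressive. The estimate is only a starting point, because placement and routing are repeated for the new constraint and the resulting delays change.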

Listing 57 code/riscv-single-cycle-boolean-pll/generate-pll.tcl#
create_ip -name clk_wiz -module_name pll
set_property -dict [list \
] [get_ips pll]
generate_target synthesis [get_ips pll]

The PLL requires some time to lock to the phase of the input signal, so we should leave the reset signal of the processor active until the PLL has locked:

Listing 58 code/riscv-single-cycle-boolean-pll/mp_boolean.sv#
module mp_boolean (
    input clk,
    input [3:0] btn,
    output logic [15:0] led
  logic clk_mp, pll_locked;
  pll pll_i (

  logic rst;
  mp #(
  ) mp_i (

  always_comb begin
    led = 0;
    for (int reg_id = 10; reg_id <= 19; ++reg_id) led[reg_id-10] = mp_i.rf[reg_id][0];
  assign rst = |btn;

Indeed we get a positive slack 🥳:

Listing 59 code/riscv-single-cycle-boolean-pll/prj/prj.runs/impl_1/mp_boolean_timing_summary_routed.rpt#
| Design Timing Summary
| ---------------------

    WNS(ns)      TNS(ns)  TNS Failing Endpoints  TNS Total Endpoints      WHS(ns)      THS(ns)  THS Failing Endpoints  THS Total Endpoints     WPWS(ns)     TPWS(ns)  TPWS Failing Endpoints  TPWS Total Endpoints  
    -------      -------  ---------------------  -------------------      -------      -------  ---------------------  -------------------     --------     --------  ----------------------  --------------------  
      0.558        0.000                      0                 1279        0.223        0.000                      0                 1279        3.000        0.000                       0                  1157  

All user specified timing constraints are met.

There are also further optimization approaches, and we can probably achieve higher frequencies, but our current goal is to get an efficient design with reasonable effort.


Exercise 31

We used a generator in the solution for our introductory problem. Instantiate the PLL manually using primitives.

Exercise 32

In this exercise you will create and instantiate clocking and differential output primitives for display output using HDMI.


  1. Create a project

  2. Clone hdmi-util/hdmi and add all the SystemVerilog files under src/ to your project

  3. Create a PLL with the following properties:

    1. name: pll

    2. choose clock input frequency

    3. create a clock output called clk_pixel at 74.25 MHz

    4. create a clock output called clk_pixel_x5 at 371.25 MHz

  4. Instantiate four differential output buffers OBUFDS. These are three differential outputs for the signal tmds[2:0] and one for tmds_clock. The outputs should be connected to the ports hdmi_tx_* and hdmi_clk_*. p stands for positive and n for negative.

  5. Synthesize, place & route

  6. Analyze the timing report

    1. What is the minimum slack?

    2. Are the constraints met?

  7. Connect the HDMI cable to a monitor and program your board. You should see a red screen.

Use the template hdmi_boolean.sv.

Listing 60 code/hdmi-boolean/hdmi_boolean.sv#
module hdmi_boolean (
    input clk,
    output [2:0] hdmi_tx_p,
    output hdmi_clk_p,

  logic clk_pixel_x5, clk_pixel, tmds_clock;
  logic [23:0] rgb;
  logic [ 2:0] tmds;
  logic [10:0] cx;
  logic [ 9:0] cy;

  hdmi #(
  ) hdmi (

  assign rgb = '1;

Optional challenges:

  1. Display only green.

  2. Display three vertical stripes with red, green and blue.

  3. Display a left to right red gradient, leftmost side should be black and rightmost side should be red.

  4. Use image2mem.py to display an image.

  5. Display a white ball and make it move with a constant velocity. It should bounce back at the edges.

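from PIL import Image

# Convert to RGB so that every pixel is an (r, g, b) tuple
img = Image.open("image.jpg").convert("RGB")
img.thumbnail((320, 240))
with open("image.memh", "w") as f:
    # One 12-bit hex word per line: upper 4 bits of red, green and blue
    f.writelines(
        hex((pixel[0] >> 4 << 8) + (pixel[1] >> 4 << 4) + (pixel[2] >> 4))[2:]
        + "\n"
        for pixel in img.getdata()
    )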
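The per-pixel packing used by the script can be checked in isolation without Pillow. The sketch below wraps the same expression in a hypothetical helper pack_rgb444, which keeps the upper four bits of each 8-bit channel and returns the hex word without leading zeros:

```python
def pack_rgb444(pixel):
    """Pack an 8-bit (r, g, b) pixel into a 12-bit hex string."""
    r, g, b = pixel[:3]
    return hex((r >> 4 << 8) + (g >> 4 << 4) + (b >> 4))[2:]


for name, pixel in [("red", (255, 0, 0)), ("green", (0, 255, 0)),
                    ("blue", (0, 0, 255)), ("white", (255, 255, 255))]:
    print(name, pack_rgb444(pixel))
```

Red becomes f00, green f0, blue f and white fff, so red occupies the upper four bits of each 12-bit word.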