Memory#

Learning goals#

  • Apply features of the language to describe memory

Introductory problem#

You are designing a game on an FPGA that uses various sound effects. The sounds are stored as a sequence of bytes or, more generally, words of arbitrary size. For resource-efficient storage, you want to use a random-access memory (RAM).

Design a memory module with the following requirements:

  1. Must have the following interface:

    module ram #(
        int unsigned DEPTH_LG2 = 10,
        int unsigned WORD_SIZE = 8
    ) (
        input clk,
    
        input [WORD_SIZE-1 : 0] din,  // Data in
        input [DEPTH_LG2-1 : 0] din_addr,  // Address
        input wen,  // Write enable
    
        output logic [WORD_SIZE-1 : 0] dout,  // Data out
        input [DEPTH_LG2-1 : 0] dout_addr  // Address
    );
    
  2. The testbench must write the RAM with values that correspond to the least significant bits of the address and then check the written values using the read operation.

Implementation on the board#

  1. Use a depth of 16 and word size of 4

  2. Connect the:

    1. least significant slide switches to din_addr

    2. slide switches next to din_addr to din

    3. most significant slide switches to dout_addr

    4. most significant LEDs to dout

    5. A button to wen

  3. Check the synthesis logs to see whether the synthesizer inferred a RAM from your description. If your description does not mimic the behavior of an on-chip RAM primitive, it may be synthesized by other means. For example, in your synthesis logs you should see:

    | Module Name      | RTL Object  | Inference | Size (Depth x Width) | Primitives |
    +------------------+-------------+-----------+----------------------+------------|
    | ram_boolean_test | dut/mem_reg | Implied   | 16 x 4               | RAM32M x 1 |
    

    And:

    +------+-------+------+
    |      |Cell   |Count |
    +------+-------+------+
    ...
    |*     |RAM32M |     1|
    

    RAM32M is a memory primitive available on 7 Series AMD FPGAs.

Testing on the board#

RAM will likely contain zeroes in the beginning. Write at least one value to an address as follows:

  1. Set an address using din_addr and non-zero data using din (with zero data you won’t notice whether anything was written).

  2. Activate wen for at least one clock cycle

  3. For reading, set the address you wrote to using dout_addr.

  4. The LEDs should show the data that you wrote. Note that this data is only available at the address that you wrote to and not at other addresses.

Tasks#

Read section 7 of SV Guide.

Quiz#

TODO

Mini-lecture#

Memory infrastructure on FPGAs#

  • FF

  • Block RAM

    • high capacity but requires one clock cycle (i.e., synchronous) to read/write

  • distributed RAM using LUTs, also called SelectRAM

    • low capacity but immediate read (i.e., asynchronous) capability, one clock write

  • Ultra RAM, high-bandwidth memory (HBM)

Case study:

The synthesizer infers Block RAM or distributed RAM based on:

  1. Memory size. For example our introductory example infers a distributed RAM. If we increase the DEPTH_LG2 from 4 to 8, then we increase the size from 16*4 bits to 256*4 bits.

  2. Number of read/write ports: For example, distributed RAM on 7 Series FPGAs supports a quad-port configuration:

    Distributed RAM configurations include:

    Quad port

    • One port for synchronous writes and asynchronous reads

    • Three ports for asynchronous reads

    However Block RAM supports only up to two ports:

    … the two ports are symmetrical and totally independent, …

  3. Read behavior: For example, Block RAM cannot be read asynchronously, i.e., in the same clock cycle. Distributed RAM supports this.

    Note that distributed RAM can also be inferred even if we read the memory output synchronously. The synthesizer does this by adding flip-flops to the memory data outputs. This happens if the memory size is under a threshold, which is why the synthesizer infers a distributed RAM in our introductory example.
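
The size threshold describes the synthesizer's default choice. Vivado synthesis also accepts a ram_style attribute on the memory array to request a specific implementation. The following is a minimal sketch under that assumption; the module name ram_styled is hypothetical and used only for illustration, and other tools may ignore the attribute or spell it differently.

```systemverilog
// Sketch: the synchronous-read RAM from the introductory problem with a
// Vivado-specific attribute that requests a particular implementation.
// `ram_styled` is a hypothetical name used only for this illustration.
module ram_styled #(
    int unsigned DEPTH_LG2 = 4,
    int unsigned WORD_SIZE = 4
) (
    input clk,

    input [WORD_SIZE-1 : 0] din,  // Data in
    input [DEPTH_LG2-1 : 0] din_addr,  // Write address
    input wen,  // Write enable

    output logic [WORD_SIZE-1 : 0] dout,  // Data out
    input [DEPTH_LG2-1 : 0] dout_addr  // Read address
);
  // "block" requests Block RAM, "distributed" requests LUT-based RAM.
  (* ram_style = "block" *) logic [WORD_SIZE-1 : 0] mem[2**DEPTH_LG2];

  always_ff @(posedge clk) begin
    if (wen) mem[din_addr] <= din;
    dout <= mem[dout_addr];  // Synchronous read, as Block RAM requires
  end
endmodule
```

The attribute is a request, not a guarantee. Check the synthesis log afterwards to see which primitive was actually used.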

Ready-Valid Handshake#

Memory inference in synthesis#

Packed vs unpacked arrays#

  • packed: logic [7:0] mem

    • the logic bits are very closely related, e.g., the bits of a byte

  • unpacked: logic mem[8]

    • the logic bits are independent, e.g., a bit addressable RAM
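
The difference shows up in how the two kinds of arrays can be used. The following sketch (the module and signal names are made up for illustration) treats a packed array as a single number and an unpacked array element by element. A typical RAM combines both: an unpacked array of packed words.

```systemverilog
// Sketch with hypothetical names: packed vs. unpacked arrays.
module packed_vs_unpacked;
  logic [7:0] packed_byte;  // Packed: one 8-bit vector
  logic unpacked_bits[8];  // Unpacked: 8 separate 1-bit elements
  logic [7:0] mem[4];  // Unpacked array of packed words, like a RAM

  initial begin
    // A packed array can be assigned and used in arithmetic as a whole.
    packed_byte = 8'ha5;
    packed_byte = packed_byte + 1;

    // An unpacked array is accessed element by element.
    foreach (unpacked_bits[i]) unpacked_bits[i] = packed_byte[i];

    // One packed word per address; individual bits can still be selected.
    mem[0] = packed_byte;
    $display("packed_byte=%h low nibble of mem[0]=%h total bits in mem=%0d",
             packed_byte, mem[0][3:0], $bits(mem));
    $finish;
  end
endmodule
```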

Tracing large memory arrays in Verilator#

Verilator traces arrays up to a specific depth, so you may wonder why the waveform does not include some arrays. To include these, use the following argument:

verilator --trace-max-array 100 ...
# Traces up to a depth of 100

Initializing memory#

  1. Manual

  2. $readmemh()

Loading memory array data from a file
$readmemb(filename, memory_name [, start_addr[, finish_addr]]);
$readmemh(...
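
For the manual option, the memory can be filled in an initial block. The following is a minimal sketch; the module name rom_manual_init and its single read port are assumptions made for illustration. Each word is set to the least significant bits of its address. Whether such a loop is honored as initial memory content in synthesis depends on the tool.

```systemverilog
// Sketch with a hypothetical module: manual memory initialization.
module rom_manual_init #(
    int unsigned DEPTH_LG2 = 4,
    int unsigned WORD_SIZE = 4
) (
    input clk,
    input [DEPTH_LG2-1 : 0] addr,
    output logic [WORD_SIZE-1 : 0] dout
);
  logic [WORD_SIZE-1 : 0] mem[2**DEPTH_LG2];

  // Each word holds the least significant bits of its own address.
  initial begin
    for (int i = 0; i < 2 ** DEPTH_LG2; i++) mem[i] = WORD_SIZE'(i);
  end

  // Synchronous read
  always_ff @(posedge clk) dout <= mem[addr];
endmodule
```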

Example#

module ram #(
    ...
    string INIT_HEX_FILE = "init.mem"
)
  ...
  initial $readmemh(INIT_HEX_FILE, mem);
endmodule
Listing 20 code/ram_boolean_test/test_data.mem#
1
2
3
4
5
6
7
8
9
a
b
c
d
e
f
0
Listing 21 code/ram_boolean_test/ram_boolean_test.sv#
module ram_boolean_test #(
    int unsigned DEPTH_LG2 = 4,
    int unsigned WORD_SIZE = 4
) (
    input clk,
    input [15:0] sw,
    output [15:0] led,
    input [3:0] btn
);
  logic wen;
  logic [WORD_SIZE-1 : 0] din, dout;
  logic [DEPTH_LG2-1 : 0] din_addr, dout_addr;
  ram #(
      .DEPTH_LG2(DEPTH_LG2),
      .WORD_SIZE(WORD_SIZE),
      .INIT_HEX_FILE("test_data.mem")
  ) ram_i (
      .*
  );

  assign din_addr = sw[0+:4], din = sw[4+:4], dout_addr = sw[15-:4], led[15-:4] = dout, wen = |btn;
endmodule

You have to add the initialization file to your project. Otherwise it won’t be visible to Vivado. If you use the extension .mem, then Vivado will automatically recognize the file as a memory initialization file.

Solution for the introductory problem#

Listing 22 code/ram_basic/ram.sv#
module ram #(
    int unsigned DEPTH_LG2 = 10,
    int unsigned WORD_SIZE = 8
) (
    input clk,

    input [WORD_SIZE-1 : 0] din,  // Data in
    input [DEPTH_LG2-1 : 0] din_addr,  // Address
    input wen,  // Write enable

    output logic [WORD_SIZE-1 : 0] dout,  // Data out
    input [DEPTH_LG2-1 : 0] dout_addr  // Address
);
  type (din) mem[2**DEPTH_LG2];
  always_ff @(posedge clk) begin
    if (wen) mem[din_addr] <= din;
    dout <= mem[dout_addr];
  end
endmodule

We drive and check the signals at @(negedge clk). To see why, consider what happens if we activate wen right after @(posedge clk): in simulation, the RAM would update the memory immediately. This behavior is not realistic. In real hardware, signals are typically driven between two rising edges and must be stable before the setup time window of a flip-flop. We get a similar behavior by driving the signals after @(negedge clk).

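To check the output in the same clock cycle after driving a combinational signal, a basic assert won’t work. An immediate assert checks the output at the moment the statement executes, i.e., before the design has reacted to the newly driven value. The ASSERT_FINAL macro defined further below defers the check instead (using assert final, or a #1 delay on Verilator).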
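
The timing of an immediate assert can be observed in a small experiment. In the following sketch (module and signal names are made up for illustration) an inverter stands in for the combinational design. The first check runs right after the input changes and typically still sees the old output; the second check waits one time unit and sees the updated output.

```systemverilog
// Sketch with hypothetical names: when does an immediate assert sample?
module assert_timing_demo;
  logic a, y;
  assign y = ~a;  // Combinational "design under test"

  initial begin
    a = 0;
    #1;  // y settles to 1

    a = 1;
    // The continuous assignment has not run yet, so y typically still
    // holds the old value here and this check fails.
    assert (y == 0)
    else $display("immediate check: y=%b is stale", y);

    // After a small delay the design has reacted and the check passes.
    #1
    assert (y == 0)
    else $display("delayed check failed: y=%b", y);

    $finish;
  end
endmodule
```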

Listing 23 code/ram_basic/tb.sv#
module tb #(
    int unsigned DEPTH_LG2 = 10,
    int unsigned WORD_SIZE = 8
);
  logic clk, wen;
  logic [WORD_SIZE-1 : 0] din, dout;
  logic [DEPTH_LG2-1 : 0] din_addr, dout_addr;
  ram dut (.*);
  type (din) test_word;

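  // In a 4-state simulator an uninitialized clk stays X and never toggles.
  initial clk = 0;
  always #1 clk = !clk;

  initial begin
    test_word = '1;
    wen = 0;
    @(negedge clk);

    // Drive the write between two rising edges.
    din = test_word;
    din_addr = 1;
    wen = 1;
    @(negedge clk);

    // The rising edge in between has written the word. Now read it back.
    wen = 0;
    dout_addr = 1;
    @(negedge clk);

    // The rising edge in between has registered mem[1] into dout.
    assert (test_word == dout);

    @(posedge clk) $finish;
  end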
  end

  import util::dump_and_timeout;
  initial dump_and_timeout;
endmodule
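
The testbench above checks a single address. Requirement 2 of the introductory problem asks for every address to be written with the least significant bits of the address and then read back. The following is a sketch of one way to do this, not the reference solution. It assumes that the ram module from Listing 22 is compiled alongside; tb_all_addresses is a hypothetical name, and the small default depth only keeps the simulation short.

```systemverilog
// Sketch: assumes the `ram` module from Listing 22 is compiled alongside.
// `tb_all_addresses` is a hypothetical name.
module tb_all_addresses #(
    int unsigned DEPTH_LG2 = 4,
    int unsigned WORD_SIZE = 4
);
  logic clk = 0, wen = 0;
  logic [WORD_SIZE-1 : 0] din, dout;
  logic [DEPTH_LG2-1 : 0] din_addr, dout_addr;
  ram #(
      .DEPTH_LG2(DEPTH_LG2),
      .WORD_SIZE(WORD_SIZE)
  ) dut (
      .*
  );

  always #1 clk = !clk;

  initial begin
    // Write phase: one word per clock cycle, driven at the falling edge.
    for (int i = 0; i < 2 ** DEPTH_LG2; i++) begin
      @(negedge clk);
      din_addr = DEPTH_LG2'(i);
      din = WORD_SIZE'(i);  // Least significant bits of the address
      wen = 1;
    end
    @(negedge clk);  // The last word is written at the rising edge before this
    wen = 0;

    // Read phase: set the address, let one rising edge pass, then check.
    for (int i = 0; i < 2 ** DEPTH_LG2; i++) begin
      dout_addr = DEPTH_LG2'(i);
      @(negedge clk);
      assert (dout == WORD_SIZE'(i))
      else $error("Address %0d: read %h", i, dout);
    end

    $finish;
  end
endmodule
```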
Listing 24 code/ram_basic_boolean_test/ram_boolean_test.sv#
module ram_boolean_test #(
    int unsigned DEPTH_LG2 = 4,
    int unsigned WORD_SIZE = 4
) (
    input clk,
    input [15:0] sw,
    output [15:0] led,
    input [3:0] btn
);
  logic wen;
  logic [WORD_SIZE-1 : 0] din, dout;
  logic [DEPTH_LG2-1 : 0] din_addr, dout_addr;
  ram #(
      .DEPTH_LG2(DEPTH_LG2),
      .WORD_SIZE(WORD_SIZE)
  ) ram_i (
      .*
  );

  assign din_addr = sw[0+:4], din = sw[4+:4], dout_addr = sw[15-:4], led[15-:4] = dout, wen = |btn;
endmodule

In every Verilator testbench we want to dump the signals and have a maximum simulation duration. We introduce the following task for this purpose:

Listing 25 code/util/util.sv#
package util;
  task static dump_and_timeout(time t = 100, string dumpfile = "signals.fst");
    /* Creates the signal dump for a waveform viewer and stops the
       simulation after the timeout.

       To be called in an `initial` block in a testbench.
    */
    begin
      $dumpfile(dumpfile);
      $dumpvars;
      #t $fatal(1, "*** Timeout ***");  // The first argument is the finish number
      $finish;
    end
  endtask

  task automatic clk_and_dump_and_timeout(ref logic clk, input time period = 2, time timeout = 100,
                                          string dumpfile = "signals.fst");
    /* Warning: if period has a unit, e.g., 2ps, and another file sets
    * a larger precision, e.g., ns, then period may be converted to 0, because
    * time datatype does not support floats */
    fork
      forever #(period / 2) clk = ~clk;
      dump_and_timeout(timeout, dumpfile);
    join
  endtask
endpackage
`ifdef VERILATOR
// _verilator does not support assert final
// https://github.com/verilator/verilator/issues/5081
// Use #1 delay instead.
`define ASSERT_FINAL(arg) #1 assert (arg)
`else
`define ASSERT_FINAL(arg) assert final (arg)
`endif

Homework#

Exercise 25

  1. Describe a RAM which can read data in the same clock cycle but writes after a single clock cycle, with the same interface as in the introductory problem.

  2. Verify the RAM by copying your testbench from the introductory problem and tweaking it.

  3. Implement the RAM on the board similar to the introductory problem.

  4. Which primitive was used for your description? Provide the synthesizer log snippet that proves your answer.

Exercise 26

Implement a sound player. Requirements:

  1. Has the following interface:

    module sound_player #(
        string ROM_INIT_HEX_FILE = "sound.mem",
        int unsigned ROM_DEPTH_LG2 = 4
    ) (
        input  clk,
        rst,
        output o
    );
    
  2. The sound data is stored in a RAM which will be used read-only. The ROM is initialized using the bitstream (e.g., $readmemh).

  3. The sound data in the memory has the following structure:

    CLK_DIV: Clock divisor value for creating the sound frequency for the current note.

    CLK_CYCLES: Duration of the note played using CLK_DIV, in clock cycles.

    CLK_DIV1
    CLK_CYCLES1
    CLK_DIV2
    CLK_CYCLES2
    ...
    

    An example file follows:

    2  // Creates a signal with a frequency of MAIN_CLK_FREQ/2
    5  // ... for 5 clock cycles.
    3  // MAIN_CLK_FREQ/3
    7  // ... for 7 clock cycles.
    0  // Sets the output to low
    a  // ... for 10 clock cycles (the file is read as hexadecimal).
    0  //
    0  // Skips.
    

    Note that the above values won’t create any audible sound if MAIN_CLK_FREQ is in the MHz range. After you have verified your module with these values, implement it on the FPGA board and test it with audible frequencies. For example, sound.mem contains the Imperial March, which was produced using notes_to_soundmem.py.

  4. If CLK_DIV is zero or one, then the output o is constantly low or high, respectively.

  5. If CLK_CYCLES is zero, then the circuit skips immediately to the next note.

  6. Use the dynamic clock divider and RAM that we implemented before.

  7. Board implementation:

    1. Use the left-most slide switch for muting. The sound output is active when the slide switch is high.

    2. Use the push buttons for reset.

Optional ideas:

  • Toggle an on-board LED each time a new note is encountered.

  • Configurable playback speed

  • Configurable volume

Warning

Use a speaker with a volume knob if you did not implement configurable volume. The sudden full volume sound may be dangerous.