AXI DMA in Scatter Gather Mode

Xilinx SoC based FPGA

Kavindu Vindika
10 min read · Oct 16, 2020

The AXI DMA IP in Xilinx SoC-based FPGAs off-loads data transactions from the CPU, so that the CPU can spend its time on more useful processing.

AXI DMA can be configured in Direct Register mode or in SG (Scatter/Gather) mode. Direct Register mode is less resource intensive but gives lower performance. In SG mode, DMA transactions are set up and managed through buffer descriptors (BDs), which can be placed in any memory-mapped storage unit such as BRAM. As a result, SG mode can achieve higher performance, since the AXI DMA fetches the BDs itself and they can be placed on the PL side of the FPGA.

Now let’s move on to the topic.

Here I’m going to create an AXI subsystem in which data is transferred from DRAM to an AXI4-Stream Data FIFO via AXI DMA. Create the following subsystem and follow the instructions given below. (The address mapping is also given.)

For test and debug purposes I’m using an Ultra96-V1, which comes with a Xilinx Zynq UltraScale+ MPSoC ZU3EG.

  1. To transfer data between AXI DMA and DRAM, first enable a slave interface of the Zynq MPSoC. (Here I have enabled S_AXI_HP0.)
  2. For debugging, attach debug probes to M_AXI_SG, M_AXI_MM2S and M_AXIS_MM2S of the AXI DMA. Then connect an ILA.
AXI DMA subsystem
Address map of the subsystem

Now export the hardware and launch SDK to configure AXI DMA in SG mode. Follow the instructions given below.

  1. Create a board support package (BSP) in standalone mode, since we are going to program the SoC bare-metal.
  2. Modify the BSP stdin and stdout to use uart_1, so that you can communicate with the board through the SDK terminal.
  3. Create an empty application project, so that you can build your application from scratch.

Now I’m going to explain every code snippet that goes into the C source file, so that the application can be understood thoroughly.

We are going to start from the SG mode polling example provided by Xilinx and modify it. Since we only transfer data from DRAM to the FIFO, we only need to configure the transmission channel of AXI DMA (the MM2S channel).

Here you can find the modified source file (axiDma.c).

Now let’s walk through the modified source file line by line.

The SoC does not know by itself that we have created an AXI DMA in the FPGA fabric, so we need to introduce it explicitly before any data transactions can be performed. We do this by creating an AXI DMA driver instance, which contains all the details of the physical DMA device we created inside the FPGA.

The driver instance holds the following information in a C structure.

RegBase holds the address given to the AXI DMA in the address map of the subsystem (here the AXI DMA base address, 0x00a0000000). The other fields are filled in according to the configuration you made in the Vivado block diagram when you run XAxiDma_CfgInitialize(&DMA_HW, AXI_DMA_CONFIG). As a result, the driver instance picks up all the data contained in the configuration structure instance, XAxiDma_Config *AXI_DMA_CONFIG.

Here you can see that TxBdRing is a single struct instance of XAxiDma_BdRing, but RxBdRing is an array of such struct instances. This array is relevant for multi-channel transmission in SG mode, which will not be discussed here.

The following structure is taken from the defined variables shown when I run the application in debug mode.

Here I have removed the TxBdRing and RxBdRing details. (They are discussed later.)

Now, to configure the transmission channel, TxSetup(XAxiDma *AxiDmaInstPtr) is used. Pass the pointer to the configured driver instance to the function.

Status = TxSetup(&DMA_HW);

Before moving on to the above function, you need to understand the fundamentals of SG mode data transmission.

For data transmission, the most important things are the data itself and the address where the data is stored. Only with that address can the device that needs the data reach the storage unit and read it.

In SG mode, the processor does not directly give the AXI DMA the DRAM address where the data resides. Instead, the address is provided through structures called buffer descriptors (BDs). Each BD holds the details of one data buffer that the AXI DMA has to read from or write to DRAM; the most important details are the address and the length of the data buffer.

Please be careful with the terms transmission and receive. Transmission BDs (TxBDs) describe data moving from DRAM to the DMA, that is, the DMA reading data from DRAM via the MM2S channel. Receive BDs (RxBDs) describe data moving from the DMA to DRAM, that is, the DMA writing data to DRAM via the S2MM channel.

We can define more than one BD, and these BDs can be stored in a BRAM. The AXI DMA processes the BDs sequentially, and once it reaches the final BD, which is called the tail BD, it stops the data transmission (if cyclic mode is not enabled). Once transmission has stopped, the processor can update the already processed BDs for another data transmission if necessary.
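The sequential walk up to the tail BD can be modelled in a few lines of plain C. This is only a sketch of the idea, not the AXI DMA hardware or the Xilinx driver: the `ModelBd` struct, its fields and both functions are names made up for illustration, and the "DMA" here is just a loop.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified descriptor: a real BD has more fields. */
typedef struct ModelBd {
    struct ModelBd *next;   /* link to the next BD in the ring */
    uint32_t        length; /* bytes in the data buffer this BD describes */
    int             done;   /* set once the modelled DMA has processed it */
} ModelBd;

/* Link n BDs into a ring: the last BD points back to the first. */
void model_link_ring(ModelBd *bds, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        bds[i].next = &bds[(i + 1) % n];
        bds[i].done = 0;
    }
}

/*
 * Process BDs one by one, starting at 'current' and stopping after 'tail'
 * (non-cyclic behaviour). Returns the total number of bytes "moved".
 */
uint32_t model_process_until_tail(ModelBd *current, ModelBd *tail)
{
    uint32_t total = 0;
    for (;;) {
        total += current->length;
        current->done = 1;
        if (current == tail)
            break;
        current = current->next;
    }
    return total;
}
```

In this model, a ring of four BDs processed from the first BD with the third BD as tail leaves the fourth BD untouched, which is the point of the tail: it marks where the DMA stops.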

Buffer Descriptor Ring

In our subsystem, since we are only using a single BD, no BRAM has been configured to store it. Instead, the processor prepares the BD in DRAM and the AXI DMA fetches it through its M_AXI_SG port.

Now let’s hop to the TxSetup function.

TxRingPtr = XAxiDma_GetTxRing(AxiDmaInstPtr);

This gets the pointer to the TxRing held in the AXI DMA driver instance. It can be viewed in the defined variables in debug mode.

Refer to the XAxiDma_BdRing structure defined in the xaxidma_bdring.h header file to see the definition of each element in the structure.

Now we need to understand the definitions of FreeHead, PreHead, HwHead, HwTail and PostHead. Before that, let’s go through the life cycle of buffer descriptors. (Refer to the BD Ring Management section of the xaxidma.h header file.)

Each BD goes through 4 phases in its life cycle.

  1. Free
  2. Pre-process
  3. Hardware
  4. Post-process

For transmission BDs (TxBDs), we are going to allocate a memory region within the following address range. (Include the xparameters.h header file to access the base address of DDR memory; here it is 0x00000000.)

To find out how many buffer descriptors can be defined in the given memory region, we use the following function. It returns the number of BDs that can possibly be allocated in that region.

BdCount = XAxiDma_BdRingCntCalc(XAXIDMA_BD_MINIMUM_ALIGNMENT,
        TX_BD_SPACE_HIGH - TX_BD_SPACE_BASE + 1);
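To see where such a count comes from, here is a small stand-alone model of the calculation. It is a sketch under assumptions, not the driver macro itself: it assumes each BD occupies sixteen 32-bit words (64 bytes), that BDs are spaced at the BD size rounded up to the alignment, and that the alignment is a power of two. The names `model_align_up` and `model_bd_count` are made up for illustration.

```c
#include <stdint.h>

#define MODEL_BD_NUM_WORDS 16u                       /* words per BD (assumed) */
#define MODEL_BD_BYTES     (MODEL_BD_NUM_WORDS * 4u) /* 64 bytes per BD */

/* Round 'value' up to the next multiple of 'align' (align: power of two). */
uint32_t model_align_up(uint32_t value, uint32_t align)
{
    return (value + (align - 1u)) & ~(align - 1u);
}

/* Number of BDs that fit into 'bytes' of memory at the given alignment. */
uint32_t model_bd_count(uint32_t align, uint32_t bytes)
{
    uint32_t spacing = model_align_up(MODEL_BD_BYTES, align);
    return bytes / spacing;
}
```

Under these assumptions, a hypothetical region of 0x1000 bytes with 64-byte alignment holds `model_bd_count(64, 0x1000)`, that is 64 BDs, while doubling the alignment to 128 bytes halves the count to 32.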

Now we fill in the details of the BdRing given above with the following function.

XAxiDma_BdRingCreate(TxRingPtr, TX_BD_SPACE_BASE, TX_BD_SPACE_BASE,
        XAXIDMA_BD_MINIMUM_ALIGNMENT, BdCount);

Now let’s check the BdRing again using the defined variables in debug mode.

Now compare the TxRing before and after creation.

Here you can see that the first BD address and the last BD address have been filled in, and that the first BD address is the same as TX_BD_SPACE_BASE. (Reminder: since we are using the processor in standalone mode, physical and virtual addresses are the same.)

AllCnt is the number of BDs that can possibly be allocated, not the number actually allocated. Since the processor may allocate any of these BDs, all of them are initially added to the free group, which is why FreeCnt shows the same number as AllCnt.

Pre-process means that the processor actually allocates a BD from the free group in order to define it. (Each time the processor allocates a BD from the free group, FreeCnt is reduced by 1 and PreCnt is increased by 1.)

The hardware phase means that a BD defined by the processor has been handed to the AXI DMA, which fetches it and reads the data from the data buffer that the BD points to.

Post-process means that the BD has been fetched and processed, so the processor can check its status or update it.
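The four phases can be pictured as four counters, with each step moving BDs from one group to the next. The sketch below is a toy model and not the driver code: `ModelRing` and the `model_*` functions are hypothetical names, and only the counts are tracked. The steps loosely mirror the driver calls XAxiDma_BdRingAlloc, XAxiDma_BdRingToHw, XAxiDma_BdRingFromHw and XAxiDma_BdRingFree.

```c
#include <stdint.h>

/* Hypothetical ring bookkeeping: only the four group counters are modelled. */
typedef struct {
    uint32_t all_cnt;   /* total BDs in the ring */
    uint32_t free_cnt;  /* BDs available for allocation */
    uint32_t pre_cnt;   /* BDs being prepared by the processor */
    uint32_t hw_cnt;    /* BDs handed to the DMA */
    uint32_t post_cnt;  /* BDs processed, waiting to be checked */
} ModelRing;

/* After creation every BD is in the free group. */
void model_ring_create(ModelRing *r, uint32_t bd_count)
{
    r->all_cnt = bd_count;
    r->free_cnt = bd_count;
    r->pre_cnt = r->hw_cnt = r->post_cnt = 0;
}

/* Move n BDs from one group to the next; returns 0 on success, -1 if too few. */
int model_move(uint32_t *from, uint32_t *to, uint32_t n)
{
    if (n > *from)
        return -1;
    *from -= n;
    *to += n;
    return 0;
}

/* Free -> Pre-process -> Hardware -> Post-process -> Free */
int model_alloc(ModelRing *r, uint32_t n)   { return model_move(&r->free_cnt, &r->pre_cnt, n); }
int model_to_hw(ModelRing *r, uint32_t n)   { return model_move(&r->pre_cnt, &r->hw_cnt, n); }
int model_from_hw(ModelRing *r, uint32_t n) { return model_move(&r->hw_cnt, &r->post_cnt, n); }
int model_free(ModelRing *r, uint32_t n)    { return model_move(&r->post_cnt, &r->free_cnt, n); }
```

Whatever sequence of steps is applied, the four counters always add up to `all_cnt`; a BD is never created or lost, it only changes group.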

This may be hard to follow at first, but once you read the AXI DMA header files to see how these functions are written and which elements these structures define, you’ll get a good grasp of the BD life cycle.

How our application moves through these phases is shown in the following diagram.

Buffer Descriptor life cycle

Now let’s check what happens with the following functions used inside TxSetup.

XAxiDma_BdClear(&BdTemplate);

You have to understand what a BD represents. (Earlier I briefly mentioned that it mainly includes the data buffer address and the length of the data to be transmitted.)

Now let’s study it thoroughly. Turn to page 38 of Xilinx pg021 and you’ll find the buffer descriptor fields as follows.

BD fields from Xilinx pg021 user guide (page 38)

Each field is 32 bits wide, so overall we need 13 * 32 bits to store a single BD. Usually the user application fields are not used, and our project only needs the first 8 fields. To leave room for all the fields, each buffer descriptor is defined as an array of sixteen 32-bit words. (Check the xaxidma_bd.h header file.)

Here XAXIDMA_BD_NUM_WORDS is given as 16 (hence sixteen 32-bit words).

The BdClear function clears the necessary fields of the BD array to zero.
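As a rough picture of a BD in memory, the sketch below models it as an array of sixteen 32-bit words and names the word positions as I read them from the descriptor table in pg021. Everything prefixed with `MODEL_` or `model_` is a made-up name, and the clear function is a simplification: it zeroes the whole array, whereas the driver’s XAxiDma_BdClear may clear only part of it.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_BD_NUM_WORDS 16u

/* Word indices of the descriptor fields (byte offset = index * 4). */
enum {
    MODEL_BD_NXTDESC     = 0,  /* 0x00 next descriptor pointer */
    MODEL_BD_NXTDESC_MSB = 1,  /* 0x04 upper 32 bits of next pointer */
    MODEL_BD_BUFADDR     = 2,  /* 0x08 data buffer address */
    MODEL_BD_BUFADDR_MSB = 3,  /* 0x0C upper 32 bits of buffer address */
    MODEL_BD_CONTROL     = 6,  /* 0x18 control: SOF/EOF flags and length */
    MODEL_BD_STATUS      = 7,  /* 0x1C status written back by the DMA */
    MODEL_BD_APP0        = 8   /* 0x20 first of five user application words */
};

typedef uint32_t ModelBd[MODEL_BD_NUM_WORDS];

/* Zero every word of a BD (simplification: the whole BD is cleared). */
void model_bd_clear(ModelBd bd)
{
    memset(bd, 0, sizeof(ModelBd));
}

/* Byte offset of a field inside the BD. */
uint32_t model_bd_offset(unsigned word_index)
{
    return (uint32_t)(word_index * sizeof(uint32_t));
}
```

The first 8 words end with the status word at offset 0x1C, the five application words take the count to 13, and the remaining 3 words are padding that brings the BD to 64 bytes.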

XAxiDma_BdRingClone(TxRingPtr, &BdTemplate);

This copies the BdTemplate to all the BDs in the free group (64 in our case). Refer to the xaxidma_bdring.c source file for more information about the above function.
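A template copy of this kind can be sketched as follows. This is a model, not the driver function: `model_ring_clone` is a hypothetical name, and I am assuming that the copy must leave each BD’s own next-descriptor words untouched, since those links are what chain the BDs into a ring. Check xaxidma_bdring.c for what the driver actually does.

```c
#include <stdint.h>
#include <string.h>

#define MODEL_BD_NUM_WORDS  16u
#define MODEL_BD_LINK_WORDS 2u  /* words 0 and 1: next-descriptor pointer */

typedef uint32_t ModelBd[MODEL_BD_NUM_WORDS];

/*
 * Copy 'tmpl' into each of the 'count' BDs in 'ring', keeping every BD's
 * own next-descriptor words so the chain between BDs stays intact.
 */
void model_ring_clone(ModelBd *ring, uint32_t count, const ModelBd tmpl)
{
    for (uint32_t i = 0; i < count; i++) {
        memcpy(&ring[i][MODEL_BD_LINK_WORDS],
               &tmpl[MODEL_BD_LINK_WORDS],
               (MODEL_BD_NUM_WORDS - MODEL_BD_LINK_WORDS) * sizeof(uint32_t));
    }
}
```

The design choice to skip the first two words means a template can be prepared without knowing where any BD sits in the ring.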

XAxiDma_BdRingStart(TxRingPtr);

This sets the run/stop bit to 1 in the control register, after which the halted bit in the status register goes to 0. Both registers belong to the SG mode register space, which holds the main configuration of the AXI DMA. (Refer to page 12 of pg021.)
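The two bits involved can be illustrated with a small register model. This is a sketch, not the driver: it assumes, as I read pg021, that run/stop is bit 0 of the MM2S control register at offset 0x00 and that halted is bit 0 of the MM2S status register at offset 0x04. The `MODEL_` macros and both functions are made-up names, and a plain array stands in for the hardware registers.

```c
#include <stdint.h>

/* Word indices into the MM2S register block (byte offsets 0x00 and 0x04). */
#define MODEL_MM2S_DMACR 0u  /* control register */
#define MODEL_MM2S_DMASR 1u  /* status register */

#define MODEL_DMACR_RS     0x00000001u  /* run/stop, written by software */
#define MODEL_DMASR_HALTED 0x00000001u  /* halted, reported by hardware */

/* Request a start by setting run/stop; other control bits are preserved. */
void model_request_start(volatile uint32_t *regs)
{
    regs[MODEL_MM2S_DMACR] |= MODEL_DMACR_RS;
}

/* Returns 1 while the channel reports halted, 0 once it is running. */
int model_is_halted(const volatile uint32_t *regs)
{
    return (regs[MODEL_MM2S_DMASR] & MODEL_DMASR_HALTED) != 0u;
}
```

Note the split of responsibilities in the model: software only writes the control register, and the halted bit is something it reads back, because on real hardware that bit is cleared by the DMA itself once the channel is running.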

SG mode register address space

Once you check it in debug mode, you’ll clearly see the changes in the control and status registers of the Tx channel as follows.

Tx channel changes when the BD ring starts

Now let’s move on to the next important function, which is sendPacket(XAxiDma *AxiDmaInstPtr).

XAxiDma_BdRingAlloc(TxRingPtr, 1, &BdPtr);

This moves one BD from the free group to the pre-process group. (Refer to the xaxidma_bdring.h header file for more information.)

Now comes the most important part of handling the allocated BD in the pre-process phase: setting its data buffer address and the length of the data to be transmitted. Check the following code snippet.

XAxiDma_BdSetBufAddr(BdPtr, (UINTPTR) TxArray);
XAxiDma_BdSetLength(BdPtr, MAX_PKT_LEN * sizeof(u32), TxRingPtr->MaxTransferLen);

Since we are only sending a frame that consists of a single data packet, set both SOF (start of frame) and EOF (end of frame) for the BD.

XAxiDma_BdSetCtrl(BdPtr, XAXIDMA_BD_CTRL_TXEOF_MASK | XAXIDMA_BD_CTRL_TXSOF_MASK);
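What ends up in the BD control word can be sketched without the driver. The model below assumes, as I read pg021, that SOF is bit 27 and EOF is bit 26 of the control word, with the buffer length in the low bits. The `MODEL_` macros and both functions are hypothetical names, and the width of the length field is passed in as a mask because it depends on how the AXI DMA was configured.

```c
#include <stdint.h>

#define MODEL_CTRL_TXSOF 0x08000000u  /* start of frame, bit 27 */
#define MODEL_CTRL_TXEOF 0x04000000u  /* end of frame, bit 26 */

/*
 * Store a buffer length in the low bits of a control word.
 * 'len_mask' marks the bits available for the length field.
 * Returns 0 on success, -1 if the length is zero or does not fit.
 */
int model_set_length(uint32_t *ctrl, uint32_t len_bytes, uint32_t len_mask)
{
    if (len_bytes == 0u || len_bytes > len_mask)
        return -1;
    *ctrl = (*ctrl & ~len_mask) | len_bytes;
    return 0;
}

/* Mark a BD as carrying a whole frame: both SOF and EOF set. */
void model_set_single_bd_frame(uint32_t *ctrl)
{
    *ctrl |= MODEL_CTRL_TXSOF | MODEL_CTRL_TXEOF;
}
```

For a hypothetical packet of 32 words (128 bytes) and a 26-bit length field, the resulting control word is 0x0C000080: the two frame flags on top and the length 0x80 at the bottom.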

Now let’s summarize what the above code snippets do, using the results from our application.

Configure the BD during pre-process phase

Before setting the tail descriptor, it’s better to flush the cache range for the BD, so that the AXI DMA sees what the processor has written.
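A flush works on whole cache lines, so the range actually written back is usually a little larger than the range requested. The sketch below only computes which lines a range touches; it does not flush anything. It assumes a 64-byte cache line, and `model_line_start` and `model_lines_to_flush` are made-up names. On the standalone BSP the flush itself is typically done with Xil_DCacheFlushRange from xil_cache.h.

```c
#include <stdint.h>

#define MODEL_CACHE_LINE 64u  /* assumed line size in bytes */

/* First byte of the cache line containing 'addr'. */
uint64_t model_line_start(uint64_t addr)
{
    return addr & ~(uint64_t)(MODEL_CACHE_LINE - 1u);
}

/* Number of cache lines touched by the range [addr, addr + len). */
uint64_t model_lines_to_flush(uint64_t addr, uint64_t len)
{
    if (len == 0u)
        return 0u;
    uint64_t first = model_line_start(addr);
    uint64_t last  = model_line_start(addr + len - 1u);
    return (last - first) / MODEL_CACHE_LINE + 1u;
}
```

In this model, a 64-byte BD that starts on a 64-byte boundary occupies exactly one line, which is one reason why aligning BDs to 64 bytes is convenient; the same 64 bytes starting in the middle of a line would touch two.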

Now, to start the transmission, we set the tail descriptor as follows. This allows the AXI DMA to fetch BDs up to and including the tail descriptor. (Setting the tail descriptor kicks off the data transmission.)

XAxiDma_BdRingToHw(TxRingPtr, 1, BdPtr);

Now let’s check the ILA results once we run the application.

  1. Overall view of the data transmission (the ILA window has been set at 512 / 1024, to be triggered by TVALID assertion of the AXIS_MM2S interface)

overall view of data transmission via MM2S channel

  2. M_AXI_SG AR and R channels respectively read the first BD address and the first BD

M_AXI_SG AR and R channels

  3. M_AXI_MM2S AR and R channels respectively read the data buffer address and the data buffer

M_AXI_MM2S AR and R channels

  4. Data transmission from the memory mapped interface (M_AXI_MM2S) to the stream interface (M_AXIS_MM2S)

data transmission from M_AXI_MM2S to M_AXIS_MM2S

A brief video of the ILA results for the above application is given below.

AXI DMA in Scatter Gather Mode

References

  1. pg021 — AXI DMA v7.0
  2. AXI DMA SG polling method examples provided by Xilinx
  3. Results received from SDK using Debug mode
  4. Results from ILA

This article has been a long one, but I wanted to explain everything relevant to SG mode data transmission, since simply explained resources such as articles or YouTube videos are rare for SG mode.

If you have any issue with the methods used above, please feel free to ask questions in the comment section below.
