Implementing Gaussian Filter by Using VHDL to Blur Images

Muhammed Kocaoğlu
Dec 12, 2021


In this article, I will explain the steps I followed to implement a Gaussian filter in VHDL.

In this project, I have used:

  • VHDL as the language
  • The Jidan floating-point unit [1]
  • Procedures
  • User-defined data types
  • 2D and 3D arrays
  • A self-checking test bench (assertion-based verification)
  • Packages
  • for generate statements
  • for loops
  • Pipelining

What is a Gaussian Filter?

A Gaussian filter is used in image processing to reduce noise, remove detail, and blur images, as will be shown in my results. After this brief introduction, let me continue with the steps I followed.
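For completeness, the coefficients of such a filter are samples of the standard two-dimensional Gaussian function (sigma is the standard deviation, 5 in my case), taken on the kernel grid and then normalized so that the 25 values sum to one:

G(x, y) = \frac{1}{2\pi\sigma^2} e^{-\frac{x^2 + y^2}{2\sigma^2}}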

Implementing the Algorithm in Software

I determined the coefficients using Octave. The fspecial function can be used to generate them; in my case, both the filter size and the sigma value are 5.

Here are the coefficients I obtained. As you can see, they are floating-point numbers. In VHDL, we can either use a floating-point unit or scale the coefficients up to integers. Let me first continue with floating point.

IEEE 754 floating-point single-precision representation of coefficients

After obtaining the coefficients, we choose the image to be filtered. I chose the very popular cameraman image. It is a 256x256 uint8 image, which leaves 252*252 points to be filtered, since I ignore the edges of the image (I will explain why later). The pixel values are integers, so they do not really need an IEEE 754 representation. However, since the coefficients are converted to IEEE 754 single precision, we convert the pixel values to floating point as well. The num2dec function can be utilized for this purpose.

After obtaining both the coefficients and the image, let me explain the algorithm.

To find the filtered value of each pixel (the small rectangle), we convolve the 5x5 neighborhood around it (the big rectangle) with the coefficients. At this point, I can explain why I ignored the 2-pixel border: a 5x5 kernel centered on a pixel needs 2 neighbors on every side. If we try to filter the upper-right corner, for example, there are only 9 points available to convolve with the coefficients, and this is not enough. If you do not want to ignore the border, you should apply a technique that brings the pixel count up to 25; one way is simply to pad with zeros.

In the figure below, we can see the problem at the edges.

The coefficient window is moved over each point and convolved with it. The result of the convolution is the filtered value.

First, I will show you the results I obtained with the software implementation. The software implementation is very simple: a short Octave script that slides the 5x5 coefficient window over the image and sums the products is enough to implement the Gaussian filter. The VHDL implementation will not be that easy.

If we convolve each point and store the results in a new array, we get the result shown in the figure below. As you can see, the image is blurred.

This result is our golden result and I will use it in the verification step.

Implementing the Algorithm in Hardware Using a Floating-Point Unit

Now we can continue with the hardware implementation. We need 25 multiplications: each coefficient is multiplied by the corresponding pixel, and the products are then summed up. I divided the multiplication into 5 steps: in each step, 5 pixels are multiplied in parallel with 5 coefficients, and the 5 products are added up. The multiplication itself is completed in a single step, while the addition of the 5 products takes 3 steps (pairwise, like an adder tree). This is repeated 5 times for each output pixel, so after 5 rounds all 25 coefficients have been multiplied with the 25 pixels and we obtain the filtered pixel.

The parallel multiplication is performed using for generate and for loop statements.

gaussian_coeff is a 3D array and is defined in the package.
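The actual package and generate statements are in the Github repository. As a rough sketch of the idea (my reconstruction, not the original code: the entity and port names are assumptions, fp_mult is only a placeholder for the fpu100 multiplier, and the real coefficient bit patterns are omitted), the types and the parallel multiplication stage could look like this:

```vhdl
-- Sketch only: not the code from the repository. fp_mult is a placeholder
-- component standing in for the fpu100 multiplier; the FPU handshaking is
-- omitted for brevity.
library ieee;
use ieee.std_logic_1164.all;

package gaussian_pkg is
    subtype word_t is std_logic_vector(31 downto 0);
    -- 25 IEEE 754 single-precision coefficients stored as 5 rows of 5 words
    -- (rows x columns x 32 bits -- one way to read the "3D array")
    type coeff_row_t   is array (0 to 4) of word_t;
    type coeff_array_t is array (0 to 4) of coeff_row_t;
    -- Real coefficient bit patterns (generated with Octave) omitted here
    constant gaussian_coeff : coeff_array_t := (others => (others => (others => '0')));
end package gaussian_pkg;

library ieee;
use ieee.std_logic_1164.all;
use work.gaussian_pkg.all;

entity mult_stage is
    port (
        clk        : in  std_logic;
        row        : in  integer range 0 to 4;  -- which of the 5 steps is active
        pixel_row  : in  coeff_row_t;           -- 5 pixels of that row, in IEEE 754
        mult_start : in  std_logic;
        products   : out coeff_row_t            -- the 5 products of this step
    );
end entity mult_stage;

architecture rtl of mult_stage is
    component fp_mult is  -- placeholder wrapper around the fpu100 multiplier
        port (
            clk_i    : in  std_logic;
            opa_i    : in  word_t;
            opb_i    : in  word_t;
            start_i  : in  std_logic;
            result_o : out word_t
        );
    end component;
    signal coeff_row_sel : coeff_row_t;
begin
    -- Select the row of coefficients used in the current step
    coeff_row_sel <= gaussian_coeff(row);

    -- 5 floating-point multipliers working in parallel
    gen_mult : for i in 0 to 4 generate
        mult_i : fp_mult
            port map (
                clk_i    => clk,
                opa_i    => pixel_row(i),
                opb_i    => coeff_row_sel(i),
                start_i  => mult_start,
                result_o => products(i)
            );
    end generate gen_mult;
end architecture rtl;
```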

Now we can write the test bench to verify the code.

The complete VHDL source is shared in my Github account, so you can examine the full code there. The Jidan floating-point unit is downloaded from OpenCores.

Verification (Self-Checking Test Bench)

In the test bench, we read the golden result from a text file and compare it with the output of the design. If the difference is small enough, the simulation continues; otherwise it stops and reports a failure.

FILTERDATA is the procedure used in the test bench. In the procedure, we apply the required stimulus and wait for the completion of the 5 steps. After the 5 steps, the pixel value is ready to be compared with the golden result.
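The real test bench is in the repository; the sketch below only illustrates the shape of the self-checking loop. The FILTERDATA signature, the golden-file name and format, and the comparison tolerance are my assumptions:

```vhdl
-- Sketch only: a self-checking test bench in the spirit of the article.
-- The real FILTERDATA drives the DUT and waits for the 5 processing steps;
-- here its body is stubbed out.
library ieee;
use ieee.std_logic_1164.all;
use std.textio.all;

entity gaussian_tb is
end entity gaussian_tb;

architecture sim of gaussian_tb is
    -- DUT instantiation and stimulus signals would be declared here
begin
    check_proc : process
        file golden_file  : text open read_mode is "golden_data.txt";  -- assumed name
        variable l        : line;
        variable golden   : integer;
        variable filtered : integer;

        -- Assumed shape of the procedure described in the article
        procedure FILTERDATA (pixel_index : in  integer;
                              result      : out integer) is
        begin
            -- body omitted in this sketch: drive the DUT inputs,
            -- wait for the 5 processing steps, then sample the output
            result := 0;
        end procedure FILTERDATA;
    begin
        for px in 0 to 252*252 - 1 loop
            FILTERDATA(px, filtered);
            readline(golden_file, l);
            read(l, golden);
            -- Assertion-based check: stop with a failure on a mismatch
            assert abs(filtered - golden) <= 1
                report "Mismatch at pixel " & integer'image(px)
                severity failure;
        end loop;
        report "All pixels matched the golden data.";
        wait;
    end process check_proc;
end architecture sim;
```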

This is performed for each pixel (for 252*252 pixels).

The complete simulation results, golden data, raw data, and filtered data are also shared in my Github account. You can analyze them in detail.

When I run the simulation, it runs to the end without reporting any failure, which means we have verified our design with the self-checking test bench. The report says that it compared almost 63,500 pixels and all of them matched the golden data.

If we reconstruct the image from the filtered data, the result is as shown in the figure below.

The resource utilization is shown in the figure below. In this case, I targeted the Arty A7-35T FPGA board.

Implementing the Algorithm in Hardware Using Integers

After implementing the algorithm with floating point and experiencing its difficulty, we can now implement it using simple integer multiplication and addition. To achieve this, we scale the coefficients up to integers; in this case, I scaled them by 4096. When we reconstruct the image, we divide by 4096 again to get the correct result (and since 4096 = 2^12, that division is just a 12-bit right shift).

The scaled coefficients become as shown in the figure below.

After that, we do not need to use floating-point arithmetic anymore because both the pixel values and coefficients are integer values. In this way, the algorithm runs much faster.

Since we use integer multiplication, we could also use the DSP resources in our FPGA. However, there is no need for this, because the coefficients are constants and the tool does not use DSP resources when the multiplication is performed with a constant. If you still want to use DSPs, you should use an attribute and force the tool to use DSP resources.

In this case, we multiply all the coefficients with the pixels at once: the multiplication of the 25 coefficients with the 25 pixels takes only one clock cycle. In the previous example, we divided it into 5 steps to reduce resource usage, and each floating-point multiplication step took 15 clock cycles, so 15*5 = 75 cycles in total just for the multiplication. In the second method, it takes only one cycle. Here is my implementation with a for loop to multiply them all together; the results are kept in a 3D array.
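The actual code is in the repository. As a sketch (the names are mine, and the constant shown is only a placeholder for the real coefficients rounded after scaling by 4096), the one-cycle multiplication could look like this:

```vhdl
-- Sketch only: all 25 multiplications in a single clock cycle using the
-- coefficients scaled by 4096. The products are kept as a 5x5 array of
-- 24-bit words (rows x columns x bits -- one reading of the "3D array").
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;

package gaussian_int_pkg is
    type pixel_row_t   is array (0 to 4) of unsigned(7 downto 0);
    type pixel_array_t is array (0 to 4) of pixel_row_t;
    type prod_row_t    is array (0 to 4) of unsigned(23 downto 0);
    type prod_array_t  is array (0 to 4) of prod_row_t;

    type coeff_row_t   is array (0 to 4) of integer;
    type coeff_array_t is array (0 to 4) of coeff_row_t;
    -- Placeholder value only; use the real Octave coefficients scaled by 4096
    constant gaussian_coeff_int : coeff_array_t := (others => (others => 164));
end package gaussian_int_pkg;

library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.gaussian_int_pkg.all;

entity int_mult_stage is
    port (
        clk      : in  std_logic;
        window   : in  pixel_array_t;   -- 5x5 window of 8-bit pixels
        products : out prod_array_t     -- 5x5 products, one per coefficient
    );
end entity int_mult_stage;

architecture rtl of int_mult_stage is
begin
    process (clk)
    begin
        if rising_edge(clk) then
            -- All 25 products are computed in one clock cycle
            for i in 0 to 4 loop
                for j in 0 to 4 loop
                    products(i)(j) <= resize(
                        window(i)(j) * to_unsigned(gaussian_coeff_int(i)(j), 13), 24);
                end loop;
            end loop;
        end if;
    end process;
end architecture rtl;
```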

After the multiplication comes the addition, which could also be completed in a single cycle, but that is very bad for timing in an FPGA. That is why I pipeline the addition (i.e., put extra registers in it). As you can see, the addition takes 5 clock cycles to complete, and once it is done the filtered pixel is ready.

With the second method, the filtering of each pixel is completed in 6 clock cycles. If we run at 100 MHz, that is only 60 ns per pixel, and the complete image is filtered in 60 * 252 * 252 ns, which corresponds to roughly 3.8 ms.

Here is the VHDL code for the addition.
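The exact code is in the repository; the sketch below shows one way to spread the 24 additions over 5 register stages, reusing the types from the multiplication sketch above. The grouping of the adders is my choice and may differ from the original:

```vhdl
-- Sketch only: a 5-stage pipelined summation of the 25 products.
library ieee;
use ieee.std_logic_1164.all;
use ieee.numeric_std.all;
use work.gaussian_int_pkg.all;   -- types from the multiplication sketch

entity pipelined_adder is
    port (
        clk          : in  std_logic;
        products     : in  prod_array_t;          -- 5x5 products
        filtered_sum : out unsigned(28 downto 0)  -- sum of the 25 products
    );
end entity pipelined_adder;

architecture rtl of pipelined_adder is
    type usum_row_t is array (0 to 4) of unsigned(28 downto 0);
    signal s1_ab, s1_cd, s1_e : usum_row_t;
    signal s2_abcd, s2_e      : usum_row_t;
    signal s3_row             : usum_row_t;
    signal s4_01, s4_23, s4_4 : unsigned(28 downto 0);
begin
    process (clk)
    begin
        if rising_edge(clk) then
            for i in 0 to 4 loop
                -- Stage 1: pairwise sums inside each row
                s1_ab(i) <= resize(products(i)(0), 29) + resize(products(i)(1), 29);
                s1_cd(i) <= resize(products(i)(2), 29) + resize(products(i)(3), 29);
                s1_e(i)  <= resize(products(i)(4), 29);
                -- Stage 2
                s2_abcd(i) <= s1_ab(i) + s1_cd(i);
                s2_e(i)    <= s1_e(i);
                -- Stage 3: complete row sums
                s3_row(i) <= s2_abcd(i) + s2_e(i);
            end loop;
            -- Stage 4: combine the row sums
            s4_01 <= s3_row(0) + s3_row(1);
            s4_23 <= s3_row(2) + s3_row(3);
            s4_4  <= s3_row(4);
            -- Stage 5: final result; dividing by 4096 (a 12-bit right shift)
            -- later undoes the coefficient scaling
            filtered_sum <= s4_01 + s4_23 + s4_4;
        end if;
    end process;
end architecture rtl;
```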

If we run the test bench on this HDL code, the simulation again completes without any failure.

Now, let's reconstruct the image from the output text file and see the result. As you can see, the image is reconstructed successfully, blurred as expected.

The resource utilization in the second case is shown in the figure below.

Next time, I will try to implement the algorithm with HLS.

I will also run this code on my FPGA board and communicate with the FPGA via the UART protocol.

The code is available in my Github account. You can analyze it in detail.

You can reach me via LinkedIn.

References:

[1] https://opencores.org/projects/fpu100
