A Portable Image Processing Accelerator using FPGA

Aditya Patil
Image processing using FPGA
7 min read · Feb 27, 2021

FPGA is an initialism for Field Programmable Gate Array, a topic of interest for many and one that has boomed in the 21st century. A great deal of research has been done in this field of electronics, and it continues even today. A field-programmable gate array is an integrated circuit designed to be configured by a customer or a designer after manufacturing — hence the term "field-programmable". FPGAs are semiconductor devices built around a matrix of configurable logic blocks (CLBs) connected via programmable interconnects, and they can be reprogrammed to meet new application or functionality requirements after manufacturing. But what exactly is a CLB? A configurable logic block is the basic repeating logic resource on an FPGA. When linked together by routing resources, the components in CLBs execute complex logic functions, implement memory functions, and synchronize code on the FPGA.

Now, coming to the subtopic reviewed and summarized in this blog, "A Portable Image Processing Accelerator using FPGA". An image processing accelerator is an FPGA-based acceleration solution that greatly improves the performance of image processing and image analytics by transferring computational workload from the CPU to the FPGA. Image processing on a serial processor is usually hard to achieve, mainly because of the large dataset represented by the image and the complex operations that must be performed on it. A single operation performed on every pixel of a 768 by 576 colour image (a PAL frame) equates to roughly 33 million operations per second, without even counting the overhead of storing and retrieving pixel values. What exactly is a PAL frame? PAL is an abbreviation for Phase Alternating Line, the video format standard used in many European countries. A PAL picture is made up of 625 interlaced lines displayed at a rate of 25 frames per second. Many image processing applications require several operations to be performed on each pixel of an image, so the number of operations per second grows rapidly. When we compare different computer architectures, the conclusive expectation is that each implementation platform has its own advantages and disadvantages. What exactly is a computer architecture? In computer engineering, computer architecture is the set of rules and methods that describe the functionality, organization, and implementation of computer systems; some definitions restrict it to the capabilities and programming model of a computer rather than a particular implementation. A hardware-based solution is preferred for this implementation, as the parallelization benefits are huge compared to other platforms. Now, what is parallelization?
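The pixel-rate arithmetic above is easy to check. A quick sketch, assuming one operation per colour channel (3 channels) at the 25 fps PAL rate:

```python
# Back-of-the-envelope cost of one operation per pixel on a PAL colour frame.
# Assumes 3 colour channels per pixel and the PAL rate of 25 frames/second.
width, height, channels, fps = 768, 576, 3, 25

ops_per_frame = width * height * channels   # 1,327,104 operations per frame
ops_per_second = ops_per_frame * fps        # ~33 million operations per second
print(f"{ops_per_second:,} operations per second")  # → 33,177,600
```

This is for a single operation per pixel; real pipelines chain many such operations, which is exactly why a serial processor struggles.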
Parallelization is the act of designing a computer program or system to process data in parallel. Image processing involves a massive number of pixel operations which can be parallelized in hardware due to the repetitive nature of the algorithms applied. As a result, a field-programmable gate array (FPGA) is a good alternative. FPGAs have indeed proved popular as implementation platforms, mainly due to their continual growth in functionality and size, especially for image and video processing applications. In general, edge detection can significantly reduce the amount of data in an image while preserving its structural properties for further processing. It is the first step of many digital image processing and computer vision applications, and it has been found to reduce the complexity of further processing and facilitate the implementation of the main algorithm. In the reviewed work, a grayscale module and an edge detection module based on the Sobel operator are applied in order to produce an output image containing only the edges of the input. What exactly is a Sobel operator? The Sobel operator, sometimes called the Sobel–Feldman operator or Sobel filter, is used in image processing and computer vision, particularly within edge detection algorithms, where it creates an image emphasising edges. The paper begins with a list of similar research; here, the hardware architecture of the design, the software and bitstream configuration required, and a detailed analysis of the image processing modules are explained with ease.
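To make the two modules concrete, here is a minimal software sketch of grayscale conversion and the Sobel operator in NumPy. This is an illustration of the algorithms, not the hardware design from the paper; the function names and the BT.601 luma weights are my own choices:

```python
import numpy as np

# The two 3x3 Sobel kernels: horizontal and vertical gradient.
SOBEL_X = np.array([[-1, 0, 1],
                    [-2, 0, 2],
                    [-1, 0, 1]])
SOBEL_Y = SOBEL_X.T

def to_grayscale(rgb):
    # Weighted sum of R, G, B (ITU-R BT.601 luma weights) — the same idea
    # a hardware grayscale module implements per pixel.
    return rgb @ np.array([0.299, 0.587, 0.114])

def sobel_magnitude(gray):
    # Slide the 3x3 window over the interior pixels; each output pixel is
    # the gradient magnitude sqrt(gx^2 + gy^2). Borders are left at zero.
    h, w = gray.shape
    out = np.zeros((h, w))
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            window = gray[y - 1:y + 2, x - 1:x + 2]
            gx = np.sum(SOBEL_X * window)
            gy = np.sum(SOBEL_Y * window)
            out[y, x] = np.hypot(gx, gy)
    return out
```

On hardware, every 3×3 window can be evaluated by its own multiply-accumulate logic in parallel, whereas this Python loop visits one pixel at a time — which is precisely the gap the accelerator exploits.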

In recent times, the most stereotypical use of FPGAs is as implementation platforms for graphics processing applications. Their structure can exploit spatial and temporal parallelism, but such parallelization depends on the processing model and the hardware constraints of the system. What exactly is spatial parallelism? Spatial parallelism refers to the simultaneous execution of tasks by several processing units; at a given instant, these units can be executing the same task (or instruction) or different tasks. On a similar note, what is meant by temporal parallelism? Temporal parallelism, or pipelining, refers to the execution of a task as a 'cascade' of sub-tasks; each functional unit can be seen as a "specialized" processor in the sense that it always executes the same sub-task. Such restrictions can force the designer to reformulate the algorithm. This review briefly summarizes an FPGA design, a portable USB accelerator device, which implements the Grayscale and Sobel Edge Detection algorithms mentioned above — two of the most fundamental algorithms in digital image processing.
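As a software analogy for spatial parallelism (not the hardware design itself), the same per-pixel sub-task can be handed to several workers, one image strip each. The `invert_strip` operation below is a hypothetical stand-in for any per-pixel operation:

```python
from concurrent.futures import ThreadPoolExecutor

def invert_strip(strip):
    # The per-pixel sub-task; every worker executes this same operation.
    return [[255 - p for p in row] for row in strip]

def invert_spatial(image, workers=4):
    # Spatial parallelism: split the image into horizontal strips and let
    # each worker process its strip simultaneously, then reassemble.
    chunk = max(1, len(image) // workers)
    strips = [image[i:i + chunk] for i in range(0, len(image), chunk)]
    out = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for strip in pool.map(invert_strip, strips):
            out.extend(strip)
    return out
```

On an FPGA, the "workers" are replicated logic blocks rather than threads, so the parallelism is real rather than time-sliced; temporal parallelism would instead chain fixed stages (grayscale, then Sobel) with different frames in flight in each stage.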

A prodigious amount of literature is present on this subtopic, which conveys that a lot of research has been done in this field of electronics. One piece of research describes the design of a high-level framework which can be used to implement 1D and 2D FFTs in real time. A common framework has been implemented to simplify the development process for designers and to meet different platform requirements. The framework includes a wide range of FFT algorithms such as radix-2, radix-4, split-radix and the fast Hartley transform (FHT). The resulting parallel implementation of the 2D FFT achieves linear speed-up and can be used for large matrix sizes with real-time performance. FFT-based applications (such as signal and image processing) have both increased needs for computational power and a need to experiment with algorithm parameters. FPGAs are widely used as a way to obtain a high performance/cost ratio, but their low-level programming requires detailed knowledge of the architecture on which the design is implemented. In the end, an FPGA-based environment using the 2D FFT framework is presented as a solution for a frequency-domain image filtering application. Similar research has been carried out on a PCI-compatible FPGA coprocessor for 2D/3D image processing. Now, what exactly is 3D image processing? 3D image processing is the visualization, processing, and analysis of 3D image data through geometric transformations, filtering, image segmentation, and other morphological operations. These 3D image processing techniques can also be used in microscopy to detect and analyse tissue samples or trace neurons. Coming back to the topic, an FPGA board for PCI systems is presented that features one XC3195A FPGA, three XC4013 devices (each up to 13K gate equivalents), 2 MByte of Flash memory, 256 KByte of high-speed SRAM and a 16-bit high-speed multiply-and-accumulate unit.
This design is used for scientific visualization, in order to speed up algorithms on 3D datasets. Due to the large number of bit-level and short-integer operations, the calculations can be efficiently offloaded to the FPGA. This research can also be used as a starting point for speech or image processing. The transfer bandwidth required by these intensive computations is high, but it is not an issue on the PCI bus.
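The frequency-domain image filtering application mentioned above can be sketched in a few lines of NumPy. This is a minimal software illustration, not the FFT framework from the paper; `lowpass_filter_2d` and its `cutoff` radius are illustrative names:

```python
import numpy as np

def lowpass_filter_2d(image, cutoff):
    # Frequency-domain filtering via the 2D FFT: transform the image,
    # zero out frequency bins farther than `cutoff` from the centre
    # (the DC component after fftshift), then transform back.
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    h, w = image.shape
    yy, xx = np.ogrid[:h, :w]
    dist = np.hypot(yy - h / 2, xx - w / 2)
    spectrum[dist > cutoff] = 0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
```

The FFT, mask, and inverse FFT are exactly the three stages a hardware framework would pipeline, with each stage operating on a different image at once.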

In this review, it is found that the entire design synthesized completely, without any inferred latches, timing arcs, or sensitivity-list warnings. It can also be concluded that the source and mapped versions of the complete design behaved the same for all test cases. The mapped version simulated without timing errors apart from time zero, and the complete IC layout passed all geometry and connectivity checks. The maximum number of clock cycles needed to process a single pixel value was 16. The maximum time to process a 640×480 image is therefore 16 × (1/50 MHz) × 640 × 480 = 0.098304 s. It has also been discussed that the average time for a C program to execute the same algorithm on a modern Linux machine is around 2.7 seconds; as a result, the FPGA implementation was around 27 times faster. The resulting gray edge-detected image could be displayed on the host application, verifying that the Sobel Edge Detection produced visible and convincing edges.
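These timing figures are easy to reproduce; a quick sketch of the arithmetic:

```python
# Reproducing the review's timing figures: 16 clock cycles per pixel at a
# 50 MHz clock for a 640x480 image, versus ~2.7 s for the software version.
cycles_per_pixel = 16
clock_hz = 50_000_000
pixels = 640 * 480

fpga_seconds = cycles_per_pixel / clock_hz * pixels   # 0.098304 s
software_seconds = 2.7
speedup = software_seconds / fpga_seconds             # ~27x
print(f"FPGA: {fpga_seconds:.6f} s, speedup: {speedup:.1f}x")
```

Note that 16 cycles is the worst case per pixel; a fully pipelined design that accepts one pixel per clock would be faster still.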

We can summarize that the most particular requirements of this design are geared towards the needs of the application. The main reason for implementing edge detection on an FPGA, rather than in pure software, was to decrease the runtime; as a result, computational latency is the highest priority among the qualities specified. To make sure the algorithm runs efficiently, attention was required to the organization of the design and how it affected the critical path. Design specifications geared towards efficiency included optimized pre-processing logic and accelerated matrix handling. An extended version of this implementation could be used to accelerate image processing software. The workstation (host) software could be either standalone or implemented as a plugin for existing software (such as GIMP). The advantages can be a huge improvement in performance and productivity while maintaining the usability of a simple plug-and-play device for the end user. In summary, this implementation differs from similar accelerators since the requirement for heavy workstation modifications is completely removed.

In conclusion, the FPGA can execute an image processing algorithm nearly 20 times faster than the CPU. Each processing step in these algorithms operates on individual pixels, or small groups of pixels, at the same time, so the algorithms can take advantage of the massive parallelism of the FPGA to process the images.
