Commit 2d503a6e authored by Sabyasachi Mondal


parent 609d0782
......@@ -13,52 +13,63 @@ In FPGA for example may implement multiple multipliers or registers to work in p
But we, as software designers, can develop our own algorithms bottom-up, from the register level to high-level code (Python, for example), which may prove immensely powerful for a task-specific algorithm. In our case we use Python as a host to drive our FPGA.
*The FPGA should be able to process multiple streams in a synchronized manner. We want to take the streams coming from an image, pass them through a convolution algorithm (the Roberts operator), and then use another function to filter out the relevant parts.*
# Objective
Our objective is to develop better-integrated code, so that our hardware and software work hand in hand to deliver the best result. We start by designing an algorithm in Python and then consider how it can be optimized when it runs in the FPGA's Logic Unit. We develop the hardware in C++, write/burn it onto the FPGA, and use our Python code to drive it.
Our objective is to enable continuous data stream processing in a pipeline that runs faster on the FPGA than on the CPU.
We try to implement an image filter that takes data streams and processes them on the fly, and the FPGA should work faster than the CPU. Our objective is not to make the image-processing algorithm itself fast.
In this case we are going to use the FPGA to implement a processing unit in hardware from High Level C code that will be able to perform image processing (like inversion, color specific background sieve) at a much faster rate:
We should be able to:
1. *Perform Image processing by using the registers, axi_streaming and DMA* [Future scope multi-agent control]
1. *Remove limitations on length and size of data so the structure can be adapted for real-time continuous use*
*1.a Implement image inversion and build / test IP*
*1.b Implement image layer extraction using modified convolution (Roberts operator).*
2. *Enable multiple data stream processing in parallel using the strategies used in FPGA for faster processing*
Compare how the CPU performs in comparison to our FPGA hardware, which is wired up exactly for the kind of data we expect to provide as input.
3. *FPGA should be reasonably faster than our CPU for processing streams*
# Implementation Strategy
Previously we saw that the image resizer takes in the whole data at once and that DMA makes the data transfer much faster, but there were several instances where the CPU performed better and faster, specifically across a wider range of image dimensions, colors and sizes.
Previously we saw that the image resizer takes in the whole data at once and that DMA makes the data transfer much faster, but we cannot process an image or stream of data that is received indefinitely and requires processing.
We intend to implement the following:
1. make faster multichannel operations at a hardware level integrated with similar high level software constructs
1. *fast multichannel stream operations at a hardware level integrated with similar high level software constructs in python*
*1.a Highl Level Code structure to enable parallel operation and optimization*
*1.a High Level Code structure to enable parallel operation and optimization in functions*
*1.b Maintain same level of parallelism (multiple data streams and logical processing constructs) in H/W level*
*1.b Maintain same level of parallelism (multiple processing streams) in unrolled loops*
2. *make the FPGA capable of processing a continuous stream of data that is infeasible to store in a large space*
*2.a The CPU packs data and feeds it to the FPGA until the image is processed (but we can simply loop it forever for continuous data)*
2. make the FPGA capable of processing images over as wide a range as our CPU supports
*2.b Synchronized operation between packets of each stream, which is essential for processing multiple streams together.*
*2.a The CPU has large storage and the FPGA does not; we can make high-level Python code drive large data into the maximum chunks the DMA accepts*
We try to read each row in the image as a pack of 3 streams, process it in 2 separate blocks, and return the output as an array.
*2.b Increase number of data channels into and out of FPGA for faster processing (higher utilization).*
This would mean we can store real-time data in frames and feed them continuously from our Python code. Each processing block consists of a 2x2 array holding the convolution weights that are applied to our stream of data, and we return the output.
This is how a typical OpenCV resizer works:
<Data Transfer Image>
(DMA1 + DMA2) streams are processed in PU1 and (DMA2 + DMA3) streams in PU2. However, because the Roberts convolution algorithm needs data to be processed in a 2x2 array, the streams must enter and be processed in a synchronized manner.
This becomes clearer if we study the resizer code: the 2D image is fed to our DMA and internally the whole image is read row by row, column by column. The image array size is static because we have finite space in the FPGA.
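The pairing of streams can be modelled in software. The following is a minimal Python sketch, assuming the standard Roberts cross kernels rather than the modified convolution used in the IP. The names `roberts_pu` and `process_three_rows` are hypothetical and stand in for PU1 and PU2; the three lists stand in for the three DMA streams.

```python
# Software model only: assumes the standard Roberts cross kernels,
# not the modified convolution implemented in the FPGA IP.

def roberts_pu(upper, lower):
    """Model one processing unit that consumes two row streams in lock-step.

    For each 2x2 window [[a, b], [c, d]] it returns |a - d| + |b - c|,
    where a, b come from the upper row and c, d from the lower row.
    """
    if len(upper) != len(lower):
        raise ValueError("row streams must have equal length (synchronized)")
    out = []
    for i in range(len(upper) - 1):
        a, b = upper[i], upper[i + 1]
        c, d = lower[i], lower[i + 1]
        out.append(abs(a - d) + abs(b - c))
    return out


def process_three_rows(row1, row2, row3):
    """PU1 pairs streams 1 and 2; PU2 pairs streams 2 and 3."""
    pu1 = roberts_pu(row1, row2)
    pu2 = roberts_pu(row2, row3)
    return pu1, pu2
```

The length check is a software stand-in for the synchronization requirement: a 2x2 window only exists if both rows deliver the same column positions at the same time.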
On the higher level the interaction between CPU and FPGA looks like the schematic shown below:
This may be made more efficient and robust (accommodating any image width and video) by implementing the following changes:
1. Multichannel image operation where we use parallel threads for processing. Each of these processing logic entities (utilizing multiple CLBs) is expected to be faster.
2. By chunking and sending data in packets from our high-level code we can also ensure that our FPGA can process an image much larger than its own memory or DMA allocation space.
3. Creating un-rolled loop for read write operations along with function calls.
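The chunking idea in item 2 can be sketched on the host side as follows. This is a minimal Python sketch under stated assumptions: `iter_packets`, `stream_rows` and the `send` callback are hypothetical names, and `send` stands in for whatever call actually hands a buffer to the DMA.

```python
def iter_packets(row, max_len):
    """Yield consecutive slices of `row`, each at most `max_len` items long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, len(row), max_len):
        yield row[start:start + max_len]


def stream_rows(rows, max_len, send):
    """Feed every row to `send` packet by packet; return the packet count.

    `send` is a placeholder for the real transfer routine (an assumption).
    """
    count = 0
    for row in rows:
        for packet in iter_packets(row, max_len):
            send(packet)
            count += 1
    return count
```

Because `rows` can be any iterable, including a generator that never ends, the same loop would serve a continuous data source as well as a finite image.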
We said we use two blocks to process the streams, but that does not mean we use a single thread: we do not wait for the Nth data packet to be read before we start processing the (N+2)th. Since the convolution algorithm needs two sequential data packets at a time, the first thread in the unrolled loop can process packets N and N+1 from Stream1 and Stream2, while its cloned thread reads packets N+2 and N+3. It looks something like this due to loop unrolling and parallel processing.
We use two streams of data in each process, each with its own processing unit in our IP, which can be schematically represented as:
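A software model of this unrolling, offered as a rough sketch and not as the actual HLS code: the function names are hypothetical, the window formula assumes the standard Roberts cross, and the stride of two packets per window follows the description above. Python runs the two loop bodies one after the other; the point is only that they share no data, which is what lets hardware schedule them side by side.

```python
def window(s1, s2, n):
    """Roberts-style response for packets n and n+1 of both streams."""
    return abs(s1[n] - s2[n + 1]) + abs(s1[n + 1] - s2[n])


def rolled(s1, s2):
    """Reference loop: one window per iteration, advancing two packets."""
    return [window(s1, s2, n) for n in range(0, len(s1) - 1, 2)]


def unrolled_by_two(s1, s2):
    """Same result with the loop body duplicated (unroll factor 2)."""
    out = []
    n = 0
    last = len(s1) - 1  # a window starting at n needs index n + 1
    while n + 3 <= last:
        first = window(s1, s2, n)      # packets N, N+1
        clone = window(s1, s2, n + 2)  # packets N+2, N+3, independent of `first`
        out.extend((first, clone))
        n += 4
    while n + 1 <= last:               # leftover window when the count is odd
        out.append(window(s1, s2, n))
        n += 2
    return out
```

Both functions return the same list for any pair of equal-length streams, so the unrolled form changes only how the work could be scheduled, not the output.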
<image for our Implementation>
# What we achieved and the caveat :
*We intended to build an architecture that can process multiple streams at the same level of parallelism, and we were successful.*
*Our main goal was to ensure such an architecture runs faster in the FPGA, and it was reasonably fast; most importantly, it can be scaled up to handle multiple streams.*
It is not very suitable for image processing tasks, as arrays stored in memory do a better job there, so a Roberts convolution algorithm is faster in the OpenCV library.
*The CPU average for images was 10 s and the FPGA average about 6 s.*
#### Future scope
*This is a new idea and has no previous references except implementation guides.*
*The image processing can serve as a stepping stone for controlling multi-agent systems, where each streaming interface can be used for instruction input and output for each agent.*
*We achieved good synchronization between the input streams in terms of pixel processing. We can consider the real-world environment as an array of pixels, with each pixel representing the coordinates of a bot. In this scenario we can process all inputs (pixels) from each bot and implement collision avoidance and basic navigation using the same architecture.*
In the background extraction technique we use a modified form of convolution to extract a layer / feature from the image, for example IR bands, which can be applied as night-vision references for navigation.
<Image modified convolution>
# Tasks
The Tasks and maximum actual time:
......@@ -91,9 +102,6 @@ The Tasks and maximum actual time:
11. RAW,WAR,WAW.. :
12: Loop Pipelining Roll Unroll :
#### Future scope
The image processing can serve as a stepping stone for controlling multi-agent systems, where each streaming interface can be used for instruction input and output for each agent. Instead of using an RTOS in each bot, we can have multiple data streams from each bot processed in an IP designed to emulate an FSM for each agent and decide its action. This can lead to higher robustness and fault tolerance, and lower costs.
# Errors Logs and Issues encountered
The input pins (listed below) are either not connected or do not have a source port, and they don't have a tie-off specified. These pins are tied-off to all 0's to avoid error in Implementation flow.