# FPGA in Image Feature detection and comparison with CPU
#### Group 4: Sabyasachi Mondal, Ravi Yadav
We use an FPGA to streamline computation-intensive tasks, for example where signals or images from sensors are processed continuously.

#### I have tried to implement this from scratch to have much more control over the design. It has been successful: for very large images received as continuous streams, the FPGA is about 3 times as fast as the CPU.

| Index | Title | Content |
| - | ------ | ------ |
| 1 | Overview | Why do we use FPGA |
| 2 | Problem background | Where can we apply our FPGA and get results |
| 3 | Objective | Our vision of the solution |
| 4 | Implementation | The hardware architecture and HLS code constructs |
| 5 | Roberts cross | The algorithm we implemented |
| 6 | What we achieved and Caveats | (RESULTS) Success and limitations |
| 7 | Future Scope | Possible use in multi-agent robot control using continuous streaming |
| 8 | Tasks, Errors, References | Epilogue |

## Overview
We want to use an FPGA to implement an algorithm in hardware and perform the computation more efficiently. CPU hardware is not flexible: the code always runs on the same set of registers and ALU, so we can't optimize the hardware for our code. Our objective here is to build a processing unit in the FPGA (something similar to a flexible ALU built from the CLBs) using high-level code.


## Problem background
<b>*For applications like real-time image processing, using CPU resources can be expensive, and the reaction time may increase in applications where decisions are based on calculations. We need dedicated hardware that can process a continuous stream of data coming in endlessly from sensors or a camera.*</b>

<b>*The FPGA should be able to process multiple streams in a synchronized manner. We want to take the streams coming from an image, process them with a convolution algorithm (the Roberts cross matrix), and then use another function to filter out bands of pixel values.*</b>

## Objective

Our objective is to enable continuous data stream processing in a pipeline that runs faster on an FPGA than on a CPU.
We try to implement an image filter that takes data streams and processes them in real time, and the FPGA should work faster than the CPU. *My primary objective is to adapt the FPGA logic design so that it can process multi-channel streams in a synchronized manner and still be faster than the CPU.*

We should be able to:

1. <b>*Remove limitations on the length and size of the data so the design can be adapted for real-time continuous use with streaming data (ideally, a really large image can be fed to our FPGA using streams)*</b>
    
2. <b>*Enable processing of multiple data streams in parallel, using HLS strategies for faster processing*</b>

3. <b>*The FPGA should be reasonably faster than our CPU at processing streams*</b>

## Implementation Strategy
Previously we saw that the image resizer takes in the whole image at once. DMA makes the data transfer much faster, but that design can't process an image or a stream of data that is received indefinitely and has no limit on total length.

I intend to implement the following:
1. <b>*Fast multi-channel stream operations at the hardware level (HLS synthesis), integrated with similar software definitions and constructs in Python*</b>
    
    *1.a High-level code structure in Python to enable parallel operation and feed data continuously using DMA*
    
    *1.b The CPU packs data and feeds it to the FPGA until the image is processed (but we can simply loop forever for continuous data)*

2. <b>*Make the FPGA capable of processing a continuous stream of data in separate threads without interdependent variables*</b>

    *2.a Maintain the same level of parallelism (multiple processing streams) in unrolled loops*

    *2.b Synchronized operation between packets of each stream, which is essential for processing multiple streams together.*

We try to read each row of the image as a pack of 3 streams, process it in 2 separate blocks, and return the output as an array. *Since the Roberts cross works on a specific 2x2 set of pixels of the image, we must synchronize the data streams coming in from our DMAs: if we read the Nth packet from DMA A, we must also read the Nth packet from DMA B and the Nth packet from DMA E. This is the only dependency between the variables, so they need to wait for each other to complete the processing.*

![Schematic streaming rows and output](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/HLSolution.JPG "Schematic streaming rows and output")

This means we can store real-time data in frames and feed them continuously from our Python code. Each processing block consists of a 2x2 array of convolution weights that is applied to our stream of data, and we return the output after further processing in another function.

![Convolution on streaming row](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/RobertCross.JPG "Convolution on streaming row")

The (DMA1 + DMA2) streams are processed in PU1 and the (DMA2 + DMA3) streams in PU2. However, because the Roberts convolution algorithm needs the data to be processed as a 2x2 array, the streams must enter and be processed in a synchronized manner.
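
The lockstep behaviour described above can be modelled in plain Python. This is only a software sketch, not the HLS kernel: the names `pu` and `process_three_rows` are made up for illustration, and the `|Gx| + |Gy|` form of the Roberts cross response is an assumption about how each processing unit combines its 2x2 window.

```python
def pu(top_left, top_right, bottom_left, bottom_right):
    """Hypothetical processing unit: Roberts cross response of one 2x2 window."""
    return abs(top_left - bottom_right) + abs(top_right - bottom_left)


def process_three_rows(row1, row2, row3):
    """Consume three row streams in lockstep and return both PU outputs."""
    # zip() hands out the Nth packet of every stream together and stops at
    # the shortest stream, which mirrors the "wait for each other" dependency.
    packets = zip(iter(row1), iter(row2), iter(row3))
    out_pu1, out_pu2 = [], []
    prev = next(packets, None)  # first column of the first 2x2 window
    for cur in packets:
        # PU1 sees rows 1 and 2, PU2 sees rows 2 and 3 (row 2 is shared).
        out_pu1.append(pu(prev[0], cur[0], prev[1], cur[1]))
        out_pu2.append(pu(prev[1], cur[1], prev[2], cur[2]))
        prev = cur
    return out_pu1, out_pu2
```

Feeding it three short rows with one bright pixel in the middle row, for example `process_three_rows([10, 10, 10, 10], [10, 50, 10, 10], [10, 10, 10, 10])`, gives a non-zero response in both outputs around that pixel, because the middle row is shared by both units.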

![CPU FPGA interconnection and data transfer](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/CPU_FPGA.JPG "CPU FPGA interconnection and data transfer")

At a higher level, the interaction between the CPU and the FPGA looks like the schematic shown above.

We use two blocks to process the streams, but that does not mean we use a single thread: we don't wait for the Nth set of data to be processed before we start processing the (N+1)th set. Since the convolution algorithm does not wait for processing to finish, it can start to read and process the (N+1)th set of data from the stream as soon as the Nth set has been read. Due to loop unrolling, it looks something like this and leads to parallel processing.

![Unravelling of streams in loop and parallel processing](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Parallel_process.JPG "Unravelling of streams in loop and parallel processing")
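
To get a feeling for why this overlap matters, the cycle counts can be estimated with a small model. This is a rough sketch under assumptions, not a measurement of our design: it uses the usual pipelined-loop estimate (iteration latency plus one initiation interval per additional item), and `stage_depth` and `initiation_interval` are illustrative parameters rather than values taken from our synthesis report.

```python
def sequential_cycles(n_items, stage_depth):
    """Cycles when item N+1 starts only after item N has fully finished."""
    return n_items * stage_depth


def pipelined_cycles(n_items, stage_depth, initiation_interval=1):
    """Cycles when a new item may start every `initiation_interval` cycles."""
    if n_items == 0:
        return 0
    # The first item pays the full depth; every later item adds one interval.
    return stage_depth + initiation_interval * (n_items - 1)


if __name__ == "__main__":
    n, depth = 1000, 4
    print(sequential_cycles(n, depth), pipelined_cycles(n, depth))
```

With 1000 items and an assumed depth of 4 cycles, the model gives 4000 cycles without overlap and 1003 with it, which also hints at why the benefit grows with the length of the stream.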

We also use the Schedule Viewer to find the interdependent variables, so that we can achieve a higher level of loop unrolling and identify the variables that can become bottlenecks in parallel processing. For example, synchronization requires a dependency between the three variables in a single execution instant of the loop (when DMA A's Nth packet is read, the data at the same Nth index must be read from DMA B and DMA E).

![The interdependent packets of data in the three input DMAs](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Schedule_Viewer.JPG "Interdependency to synchronize our image processing which can also be a bottleneck since we need to wait for other 2 DMA data to arrive.")

[The notebook for analysis](https://mygit.th-deg.de/sm11312/fpga_final_project/-/blob/main/Image_Filter_FPGA_CPU.ipynb) covers the algorithm itself and its corresponding CPU implementation.

![The block diagram illustrated](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Block_diagram_illustrated.jpg 'Blockdiagram')

https://mygit.th-deg.de/sm11312/fpga_final_project/-/blob/main/design_1.pdf

## Introduction to the algorithm
<br>*The Roberts cross operation is very effective at detecting features in an image, specifically for images with more precise features*<br>

<br>*The Roberts operator is a 2x2 matrix that can be used to find differences at image boundaries because of its weights*<br>

*It uses different weights at different pixel positions in a 2x2 cell, which effectively acts like a differential operation, so places with larger differences in pixel values become pronounced in the output.*
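
As a reference for the weights, here is a minimal pure-Python sketch of the operator on a whole image. It assumes the common textbook kernel pair `[[+1, 0], [0, -1]]` and `[[0, +1], [-1, 0]]` and the `|Gx| + |Gy|` magnitude approximation; the function name `roberts_cross` is illustrative and the code in the notebook and the HLS kernel may differ in detail.

```python
def roberts_cross(image):
    """Roberts cross response of a non-empty, rectangular 2D list of grey values.

    The result has one row and one column fewer than the input, because each
    output pixel is computed from a full 2x2 window.
    """
    rows, cols = len(image), len(image[0])
    result = []
    for y in range(rows - 1):
        out_row = []
        for x in range(cols - 1):
            gx = image[y][x] - image[y + 1][x + 1]  # kernel [[+1, 0], [0, -1]]
            gy = image[y][x + 1] - image[y + 1][x]  # kernel [[0, +1], [-1, 0]]
            out_row.append(abs(gx) + abs(gy))
        result.append(out_row)
    return result


if __name__ == "__main__":
    patch = [
        [0, 0, 0, 0],
        [0, 0, 9, 9],
        [0, 0, 9, 9],
    ]
    for line in roberts_cross(patch):
        print(line)
```

On this small patch the flat regions map to 0 and only the windows that straddle the edge of the bright block give a non-zero response, which is the "differences get pronounced" effect described above.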

[Introduction to the Roberts cross operation](https://homepages.inf.ed.ac.uk/rbf/HIPR2/roberts.htm "Brief introduction to the Roberts cross operation")

[Output image](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Output.jpg "Output image with the terrain contours")

## What we achieved and the caveats
<b>*We intended to build an architecture that can process multiple streams at the same level of parallelism, and we were successful.*</b>

<b>*Our main goal was to ensure that such an architecture runs faster on the FPGA, and it was reasonably fast; most importantly, it can be scaled up to handle multiple streams.*</b>

<b>*For 6 images, the CPU average was about 25 s and the FPGA average about 10 s. For smaller images, a comparison table is shown below:*</b>

![Speed comparison in single images](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Speed_table.JPG "Speed comparison in single images")

But the best part of the result was that the FPGA's speed advantage increases when images are processed for a longer time, that is, with larger images and larger streams of data.
<b>*An important observation was that for very large images (greater than 1 MB) our FPGA design is almost 3 times faster, but it is only about 0.2 times (20 percent) faster for the smaller images seen in the above table (an analysis graph for different image sizes is linked below).*</b>

Please refer to the [Jupyter notebook linked here](https://mygit.th-deg.de/sm11312/fpga_final_project/-/blob/main/Notebook_Speed_Comparison.ipynb), which shows this result (images 2 and 3 are around 200 KB; the rest are above 800 KB).

<b>*In some cases the FPGA took 11 s and the CPU took roughly three times as long, 33 s. For images in the 800-1000 kB size range we achieve a 200 percent speed-up, and around 20 percent for 200-300 kB images.*</b>
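
The relation between "three times as fast" and "200 percent speed-up" can be checked with a couple of lines. The helper names below are illustrative, and the timings are the rounded figures quoted above rather than fresh measurements.

```python
def speedup_factor(cpu_seconds, fpga_seconds):
    """How many times faster the FPGA run is than the CPU run."""
    return cpu_seconds / fpga_seconds


def speedup_percent(cpu_seconds, fpga_seconds):
    """Speed-up expressed as a percentage gain over the CPU baseline."""
    return (speedup_factor(cpu_seconds, fpga_seconds) - 1.0) * 100.0


if __name__ == "__main__":
    # 33 s on the CPU against 11 s on the FPGA (large images).
    print(speedup_factor(33, 11), speedup_percent(33, 11))
    # About 25 s against about 10 s (average over 6 images).
    print(speedup_factor(25, 10), speedup_percent(25, 10))
```

So 33 s against 11 s is a factor of 3, or a 200 percent speed-up, while the 25 s against 10 s average corresponds to a factor of 2.5, or 150 percent.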

## Future scope
*This is a new idea and has no previous references except implementation guides. All the code and ideas were developed from the ground up.*

<b>*The image processing can serve as a stepping stone for controlling multi-agent systems, where each streaming interface can be used for instruction input and output for each agent or bot.*</b>

*We achieved good synchronization between the input streams in terms of pixel processing. We can consider extending the filter to streaming video, which might be possible with a similar kind of streaming interface. In particular, we have seen in the results that the FPGA's advantage over the CPU is more pronounced for longer streams.*


#### Tasks
The tasks and the maximum actual time spent on each:

1. Problem statement and brainstorming for project selection : *24 hrs*
2. Design a basic model and build overlay : *6 hrs*
3. Python code adjustment and integration : *4 hrs*
4. Plan next stage of overlay design : *4 hrs*
5. Develop algorithm for FPGA using C++ : *4 hrs*
6. Optimize code and add synchronization of multiple channels : *24 hrs*
7. Implement block diagram : *4 hrs*
8. Upload code and test in IPy notebook : *3 hrs* 


#### Resources used and Future project topics

##### Resources used
0. Images: https://serc.carleton.edu/earth_analysis/image_analysis/introduction/day_4_part_2.html
1. Image segmentation : https://theailearner.com/2020/11/29/image-segmentation-with-watershed-algorithm/
2. Operation with stream: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/hls_stream_library.html#ivv1539734234667__ad398476
3. Stream Interface : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/managing_interface_synthesis.html#ariaid-title32
3. Specialized Constructs : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/special_graph_constructs.html?hl=template
4. Vitis Examples : https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/cpp_kernels/README.md
5. Running Accelerator : https://pynq.readthedocs.io/en/v2.6.1/pynq_alveo.html#running-accelerators
6. Pragma Interfaces : https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/jit1504034365862.html
7. AXI4 : https://ch.mathworks.com/help/hdlcoder/ug/getting-started-with-axi4-stream-interface-in-zynq-workflow.html
8. Interface of Streaming : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/managing_interface_synthesis.html#ariaid-title34
9. Database in FPGA : https://dspace.mit.edu/bitstream/handle/1721.1/91829/894228451-MIT.pdf, https://www.xilinx.com/publications/events/developer-forum/2018-frankfurt/accelerating-databases-with-fpgas.pdf, https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_process.html#djn1584047476918
10. Muxed Stream : https://liu.diva-portal.org/smash/get/diva2:1057270/FULLTEXT01.pdf
11. RAW,WAR,WAW.. : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_optimization_techniques.html#wen1539734225565__aa1299615
12. Loop Pipelining Roll Unroll : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_optimization_techniques.html#kcq1539734224846


### Error logs and issues encountered
The input pins (listed below) are either not connected or do not have a source port, and they don't have a tie-off specified. These pins are tied-off to all 0's to avoid error in Implementation flow.
Please check your design and connect them as needed: 
/color_filter/ap_start
This happens when the control mode `ap_ctrl_none` is not specified in the design.

Can't find the custom IP in Vivado: add the IP zip path, open the IP Integrator view, and add the IP manually from the IP configuration window.

Can't connect an hls::stream<> type object in the IP. Note: The hls::stream class should always be passed between functions as a C++ reference argument. For example, &my_stream.
IMPORTANT: The hls::stream class is only used in C++ designs. Array of streams is not supported.

Non-blocking writes are not allowed on non-FIFO interfaces such as axis; try using FIFO m_axi instead.

The DMA transfer size must be less than 16383, so we can't feed very large datasets directly to a single DMA.
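
One way to live with this limit is to cut the data into pieces on the CPU side before handing each piece to the DMA. The sketch below only shows the splitting step in plain Python; `split_for_dma` is a hypothetical helper, the unit of the limit is assumed to be bytes, and the actual DMA transfer calls are left out.

```python
DMA_LIMIT = 16383  # limit quoted above; the unit (bytes) is an assumption


def split_for_dma(data, max_len=DMA_LIMIT - 1):
    """Yield consecutive slices of `data`, each at most `max_len` long."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    for start in range(0, len(data), max_len):
        yield data[start:start + max_len]


if __name__ == "__main__":
    row = bytes(40000)  # stand-in for one long row of pixel data
    chunks = list(split_for_dma(row))
    print([len(chunk) for chunk in chunks])
```

For a 40000-byte buffer this yields chunks of 16382, 16382 and 7236 bytes, each below the limit, and concatenating them gives back the original data.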

WARNING: [HLS 200-786] Detected dataflow-on-top in function  'color_filter' (../project_3/color_filter.cpp:45)  with default interface mode 'ap_ctrl_hs'. Overlapped execution of successive kernel calls will not happen unless interface mode 'ap_ctrl_chain' is used (or 'ap_ctrl_none' for a purely data-driven design).
Resolution: For help on HLS 200-786 see www.xilinx.com/cgi-bin/docs/rdoc?v=2020.2;t=hls+guidance;d=200-786.html

DMA stuck and not responding: [fixed thanks to Lauri's blog](https://lauri.võsandi.com/hdl/zynq/xilinx-dma.html), [problems others faced](https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/Why-AXI-DMA-starts-acquiring-data-during-configuration/td-p/766605) and more [problems](https://forums.xilinx.com/t5/AXI-Infrastructure-Archive/tkeep-signal-in-AXI-DMA-and-tstrb-3-0-in-Custom-AXI-Stream-IP/td-p/921850)