# FPGA_final_project
#### Group 4: Sabyasachi Mondal, Ravi Yadav
An FPGA design for streamlining computation-intensive tasks. In this case we take a hyperspectral image, of the kind generally analysed by satellites or drones, consisting mostly of single-band image data. This can be used for both maritime and vehicular navigation.

## Overview
We want to use an FPGA to implement an algorithm in hardware so that the computation runs more efficiently. CPU hardware is not flexible: code always runs on the same set of registers and the same ALU, so we cannot optimize the hardware for our code. Our objective here is to hardwire a processing unit in the FPGA (something similar to a flexible ALU built from the CLBs) using high-level code.


## Problem background
<b>*For applications like real-time image processing, using CPU resources can be expensive, and the reaction time may increase in applications where decisions are based on calculations. We need dedicated hardware that can process a continuous stream of data arriving endlessly from sensors or cameras.*</b>

<b>*The FPGA should be able to process multiple streams in a synchronized manner. We want to take the streams coming from an image, run them through a convolution algorithm (the Roberts cross operator), and then use another function to filter out the relevant parts.*</b>

## Objective

Our objective is to enable continuous data stream processing in a pipeline that runs faster on an FPGA than on a CPU.
We try to implement an image filter that takes data streams and processes them in real time, and the FPGA should work faster than the CPU. Our objective is not to make the image-processing algorithm itself fast.

We should be able to:

1. <b>*Remove limitations on the length and size of data so the structure can be adapted for real-time continuous use*</b>
    
2. <b>*Enable processing of multiple data streams in parallel, using the strategies available in an FPGA for faster processing*</b>

3. <b>*The FPGA should be reasonably faster than our CPU at processing streams*</b>

## Implementation Strategy
Previously we saw that the image resizer takes in the whole image at once. DMA makes the data transfer much faster, but with that approach we cannot process an image or a stream of data that arrives indefinitely and needs processing as it comes in.

We intend to implement the following:
1. <b>*Fast multichannel stream operations at the hardware level, integrated with similar high-level software constructs in Python*</b>
    
    *1.a High-level code structure to enable parallel operation and optimization in functions*
    
    *1.b Maintain the same level of parallelism (multiple processing streams) in unrolled loops*

2. <b>*Make the FPGA capable of processing a continuous stream of data that is infeasible to store*</b>

    *2.a The CPU packs data and feeds it to the FPGA until the image is processed (for continuous data we can simply loop forever)*

    *2.b Synchronized operation between the packets of each stream, which is essential for processing multiple streams together*

We read each row of the image as part of a pack of 3 streams, process the pack in 2 separate blocks, and return the output as an array.

![Schematic streaming rows and output](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/HLSolution.JPG "Schematic streaming rows and output")
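
The row packing described above could be sketched on the Python side roughly as follows. This is only an illustration under assumptions: the helper name `pack_rows` is hypothetical, and the step of two rows per packet is inferred from the fact that three input rows feed two processing blocks; the notebook in this repository may pack rows differently.

```python
from typing import Iterator, Sequence, Tuple

Row = Sequence[int]


def pack_rows(image: Sequence[Row]) -> Iterator[Tuple[Row, Row, Row]]:
    """Yield (row_i, row_i+1, row_i+2) packets for the three input streams.

    Assumption: packets advance by two rows, because the two processing
    blocks consume three input rows and emit two output rows per packet.
    A trailing row that cannot fill a packet is left out in this sketch.
    """
    for top in range(0, len(image) - 2, 2):
        yield image[top], image[top + 1], image[top + 2]
```

For continuous data the same generator pattern applies; the loop would simply read from the live source instead of a finished image.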

This means we can store real-time data in frames and feed them continuously from our Python code. Each processing block holds a 2x2 array of convolution weights that is applied to our stream of data, and we return the output after a pre-processing step in another function.

![Convolution on streaming row](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/RobertCross.JPG "Convolution on streaming row")

The (DMA1 + DMA2) streams are processed in PU1 and the (DMA2 + DMA3) streams in PU2. However, because the Roberts cross convolution needs the data to be processed as a 2x2 array, the streams must enter and be processed in a synchronized manner.
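
As a software reference for what each processing unit computes, here is a minimal pure-Python sketch of the Roberts cross operator on a pair of adjacent rows. It is not the HLS code from edge_filter.cpp: the function names are hypothetical, and combining the two diagonal differences as `|gx| + |gy|` is an assumed (commonly used) approximation of the gradient magnitude.

```python
from typing import List, Sequence, Tuple


def roberts_row(upper: Sequence[int], lower: Sequence[int]) -> List[int]:
    """Roberts cross response for one pair of adjacent rows."""
    out = []
    for x in range(len(upper) - 1):
        gx = upper[x] - lower[x + 1]   # kernel [[1, 0], [0, -1]]
        gy = upper[x + 1] - lower[x]   # kernel [[0, 1], [-1, 0]]
        out.append(abs(gx) + abs(gy))  # assumed magnitude approximation
    return out


def process_packet(
    row1: Sequence[int], row2: Sequence[int], row3: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """PU1 sees streams 1 and 2, PU2 sees streams 2 and 3."""
    return roberts_row(row1, row2), roberts_row(row2, row3)
```

Each 2x2 window needs pixels `x` and `x + 1` from both rows at the same time, which is why the streams have to stay aligned pixel by pixel.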

![CPU FPGA interconnection and data transfer](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/CPU_FPGA.JPG "CPU FPGA interconnection and data transfer")

At a higher level, the interaction between the CPU and the FPGA looks like the schematic shown above.

We use two blocks to process the streams, but that does not mean we use a single thread of execution: we do not wait for the Nth set of data to be processed before we start processing set N+1. Since the convolution algorithm does not wait for processing to finish, it can start to read and process set N+1 from the stream as soon as the Nth set has been read. Because of loop unrolling this leads to parallel processing, and it looks something like this:

![Unravelling of streams in loop and parallel processing](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Parallel_process.JPG "Unravelling of streams in loop and parallel processing")
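
The benefit of starting set N+1 before set N has finished can be estimated with the usual pipeline cycle count. This is a rough sketch: `latency` and `interval` are hypothetical illustration values, not numbers taken from the synthesized design.

```python
def sequential_cycles(n_items: int, latency: int) -> int:
    """Cycles needed when each set must finish before the next one starts."""
    return n_items * latency


def pipelined_cycles(n_items: int, latency: int, interval: int = 1) -> int:
    """Cycles needed when a new set starts every `interval` cycles.

    The first result appears after `latency` cycles; every further set
    adds only `interval` cycles because its work overlaps the previous one.
    """
    if n_items <= 0:
        return 0
    return latency + (n_items - 1) * interval
```

With an assumed latency of 4 cycles, 100 sets take 400 cycles when processed one after another and 103 cycles when a new set can start every cycle.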

## Instructions to run
Simply clone the repository onto a PYNQ-Z2 board and run the notebook "Hyperspectral_Image_Filter_FPGA_CPU_comparison.ipynb" cell by cell.

Make sure that design.bit, design.hwh and design.tcl are present in the same folder as the Jupyter notebook above.
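
A quick pre-flight check for these files could look like the sketch below. The helper `missing_overlay_files` is hypothetical and not part of the repository; it only uses the Python standard library and does not touch the FPGA.

```python
from pathlib import Path
from typing import List

# File names taken from the instructions above.
REQUIRED_FILES = ("design.bit", "design.hwh", "design.tcl")


def missing_overlay_files(folder: str = ".") -> List[str]:
    """Return the required overlay files that are absent from `folder`."""
    base = Path(folder)
    return [name for name in REQUIRED_FILES if not (base / name).is_file()]
```

Running `missing_overlay_files()` in the first notebook cell and checking that it returns an empty list can catch a missing file before the overlay is loaded.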

There is also a sample image, lakemead_2004.jpg, that can be used as input to check the output of our IP.

The edge_filter.cpp file can also be used to create the IP and generate a bitstream by building the block design shown in BlockDiagram.JPG.

The code should generate an output file that highlights the rough edges in the image, for example contours or hills.

## What we achieved and the caveat
<b>*We intended to build an architecture that can take multiple streams and process them at the same level of parallelism, and we were successful.*</b>

<b>*Our main goal was to ensure that such an architecture runs faster on an FPGA, and it was reasonably fast; most importantly, it can be scaled up to handle multiple streams.*</b>

It is not very well suited to image processing tasks, because arrays stored in memory do a better job there, so a Roberts cross convolution is faster with the OpenCV library.

<b>*The CPU average for images was 10 s and the FPGA average was about 6 s*</b>
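
Taking these two averages at face value, the speedup and the relative time saved work out as follows (a back-of-the-envelope calculation, not an additional measurement):

```math
S = \frac{t_{\mathrm{CPU}}}{t_{\mathrm{FPGA}}} \approx \frac{10\,\mathrm{s}}{6\,\mathrm{s}} \approx 1.67,
\qquad
1 - \frac{t_{\mathrm{FPGA}}}{t_{\mathrm{CPU}}} \approx 1 - \frac{6}{10} = 0.4
```

That is, the FPGA run is roughly 1.7 times faster and takes about 40% less time.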

![Speed comparison in single images](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Speed_table.JPG "Speed comparison in single images")

## Future scope
*This is a new idea and has no previous references except implementation guides.*

<b>*The image processing can serve as a stepping stone for controlling multi-agent systems, where each streaming interface can be used for instruction input and output for each agent or bot.*</b>

*We achieved good synchronization between the input streams in terms of pixel processing. We can consider extending the filter to streaming video, which might be possible with a similar kind of streaming interface.*


#### Tasks
The tasks and the maximum actual time spent on each:

1. Problem statement and brainstorming for project selection : *24 hrs*
2. Design a basic model and build overlay : *6 hrs*
3. Python code adjustment and integration : *4 hrs*
4. Plan next stage of overlay design : *4 hrs*
5. Develop algorithm for FPGA using C++ : *4 hrs*
6. Optimize code and add synchronization of multiple channels : *24 hrs*
7. Implement block diagram : *4 hrs*
8. Upload code and test in IPy notebook : *3 hrs* 


#### Resources used and Future project topics

##### Resources used
0. Images: https://serc.carleton.edu/earth_analysis/image_analysis/introduction/day_4_part_2.html
1. Image segmentation : https://theailearner.com/2020/11/29/image-segmentation-with-watershed-algorithm/
2. Operation with stream: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/hls_stream_library.html#ivv1539734234667__ad398476
3. Stream Interface : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/managing_interface_synthesis.html#ariaid-title32
4. Specialized Constructs : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/special_graph_constructs.html?hl=template
5. Vitis Examples : https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/cpp_kernels/README.md
6. Running Accelerator : https://pynq.readthedocs.io/en/v2.6.1/pynq_alveo.html#running-accelerators
7. Pragma Interfaces : https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/jit1504034365862.html
8. AXI4 : https://ch.mathworks.com/help/hdlcoder/ug/getting-started-with-axi4-stream-interface-in-zynq-workflow.html
9. Interface of Streaming : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/managing_interface_synthesis.html#ariaid-title34
10. Database in FPGA : https://dspace.mit.edu/bitstream/handle/1721.1/91829/894228451-MIT.pdf, https://www.xilinx.com/publications/events/developer-forum/2018-frankfurt/accelerating-databases-with-fpgas.pdf, https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_process.html#djn1584047476918
11. Muxed Stream : https://liu.diva-portal.org/smash/get/diva2:1057270/FULLTEXT01.pdf
12. RAW,WAR,WAW.. : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_optimization_techniques.html#wen1539734225565__aa1299615
13. Loop Pipelining Roll Unroll : https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_optimization_techniques.html#kcq1539734224846


### Error logs and issues encountered
The input pins (listed below) are either not connected or do not have a source port, and they don't have a tie-off specified. These pins are tied-off to all 0's to avoid error in Implementation flow.
Please check your design and connect them as needed: 
/color_filter/ap_start
This occurs when ap_ctrl_none is not specified in the design.

Cannot find the custom IP in Vivado: add the IP zip path, open the IP Integrator view, and manually add the IP from the IP configuration window.

Cannot connect an hls::stream<> type object in the IP. Note: The hls::stream class should always be passed between functions as a C++ reference argument. For example, &my_stream.
IMPORTANT: The hls::stream class is only used in C++ designs. Array of streams is not supported.

Non-blocking writes are not allowed on non-FIFO interfaces like axis; try using FIFO m_axi instead.

The DMA transfer size must be less than 16383, so we cannot feed very large datasets directly to a single DMA.
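
One way to work within this limit is to split a large buffer into pieces before handing each piece to the DMA. The sketch below is an assumption-laden illustration: the helper `chunked` is hypothetical, the unit of the limit (bytes or elements) depends on the DMA configuration, and chunks here are at most `limit` long, so pass a smaller `limit` if the bound has to be strict.

```python
from typing import Iterator, Sequence

# Limit quoted above; whether it counts bytes or elements is an assumption.
MAX_TRANSFER = 16383


def chunked(data: Sequence, limit: int = MAX_TRANSFER) -> Iterator[Sequence]:
    """Yield consecutive slices of `data`, each at most `limit` items long."""
    if limit <= 0:
        raise ValueError("limit must be positive")
    for start in range(0, len(data), limit):
        yield data[start:start + limit]
```

Each yielded slice can then be copied into the DMA input buffer and sent as its own transfer.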

WARNING: [HLS 200-786] Detected dataflow-on-top in function  'color_filter' (../project_3/color_filter.cpp:45)  with default interface mode 'ap_ctrl_hs'. Overlapped execution of successive kernel calls will not happen unless interface mode 'ap_ctrl_chain' is used (or 'ap_ctrl_none' for a purely data-driven design).
Resolution: For help on HLS 200-786 see www.xilinx.com/cgi-bin/docs/rdoc?v=2020.2;t=hls+guidance;d=200-786.html

DMA stuck and not responding: [fixed thanks to Lauri's blog](https://lauri.võsandi.com/hdl/zynq/xilinx-dma.html), [problems others face](https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/Why-AXI-DMA-starts-acquiring-data-during-configuration/td-p/766605) and more [problems](https://forums.xilinx.com/t5/AXI-Infrastructure-Archive/tkeep-signal-in-AXI-DMA-and-tstrb-3-0-in-Custom-AXI-Stream-IP/td-p/921850)