# FPGA in Image Feature Detection and Comparison with CPU

#### Group 4: Sabyasachi Mondal, Ravi Yadav

This project uses an FPGA to streamline computation-intensive tasks. In this case we take a hyperspectral image of the kind generally analysed by satellites or drones, mostly consisting of single-band image data. This can be used for both maritime and vehicular navigation.

#### I have tried to build an implementation from scratch in which all code structures were created by hand, so that we have much more control over the design

| Title | Content |
| ------ | ------ |
| Overview | Why do we use FPGA |
| Problem background | Where can we apply our methods and get results |
| Objective | Our vision of the solution |
| Implementation | The hardware architecture and HLS code constructs |
| Roberts cross | The algorithm we implemented |
| What we achieved and Caveats | Objectives successful and limitations |
| Future Scope | Possible use in multi-agent robot control using continuous streaming |

## Overview

We want to use an FPGA to implement an algorithm in hardware so that the computation runs more efficiently. CPU hardware is not flexible: code always runs on the same set of registers and ALUs, so we cannot optimize the hardware for our code. Our objective here is to build a processing unit in hardware (something similar to a flexible ALU made from the CLBs) in the FPGA using high-level code.

## Problem background

*For applications like real-time image processing, using CPU resources can be expensive, and the reaction time may increase in applications where decisions are based on calculations. We need dedicated hardware that can process a continuous stream of data coming in endlessly from sensors or a camera.*

*The FPGA should be able to process multiple streams in a synchronized manner.
We want to take the streams coming from an image, process them through a convolution algorithm (the Roberts cross) and then use another function to filter out the relevant parts.*

## Objective

Our objective is to enable continuous data-stream processing in a pipeline that runs faster on the FPGA than on the CPU. We try to implement an image filter that takes data streams and processes them in real time, and the FPGA should work faster than the CPU.

*My objective is to adapt the FPGA logic design so that it can process multi-channel streams in a synchronized manner and still be faster than the CPU.*

We should be able to:

1. *Remove limitations on the length and size of the data so the design can be adapted for real-time continuous use with streaming data*
2. *Enable processing of multiple data streams in parallel using the strategies available in HLS for faster processing*
3. *Make the FPGA reasonably faster than our CPU for processing streams*

## Implementation Strategy

Previously we saw that the image resizer takes in the whole data at once. DMA makes the data transfer much faster, but with that approach we cannot process an image or a stream of data that is received endlessly and needs processing as it arrives.

We intend to implement the following:

1. *Fast multichannel stream operations at the hardware level, integrated with similar high-level software constructs in Python*
   - *1.a High-level code structure in Python to enable parallel operation and feed data continuously using DMA*
   - *1.b Maintain the same level of parallelism (multiple processing streams) in unrolled loops*
2. *Make the FPGA capable of processing a continuous stream of data which is infeasible to store*
   - *2.a The CPU packs the data and feeds it to the FPGA until the image is processed (but we can simply loop forever for continuous data)*
   - *2.b Synchronized operation between the packets of each stream, which is essential for processing multiple streams together*

We try to read each row of the image as part of a pack of 3 streams, process the pack in 2 separate blocks and return the output as an array.

*Since the Roberts cross works on a specific 2x2 set of pixels of the image, we must synchronize the streams of data coming in from our DMAs such that if we read the Nth packet from DMA A we must also read the Nth packet from DMA B and the Nth packet from DMA C.*

![Schematic streaming rows and output](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/HLSolution.JPG "Schematic streaming rows and output")

This means we can store real-time data in frames and feed them continuously from our Python code.

The processing blocks each consist of a 2x2 array holding the convolution weights that are applied to our stream of data, and we return the output after a pre-processing step in another function.

![Convolution on streaming row](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/RobertCross.JPG "Convolution on streaming row")

The (DMA1 + DMA2) streams are processed in PU1 and the (DMA2 + DMA3) streams in PU2. However, because the Roberts convolution needs the data to be processed as a 2x2 array, the streams must enter and be processed in a synchronized manner.

![CPU FPGA interconnection and data transfer](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/CPU_FPGA.JPG "CPU FPGA interconnection and data transfer")

At a higher level, the interaction between CPU and FPGA looks like the schematic shown above.

We use two blocks to process the streams, but that does not mean we use a single thread: we do not wait for the Nth set of data to be processed before we start processing the (N+1)th set.
Since the convolution algorithm does not wait for processing to finish, it can start to read and process the (N+1)th set of data from the stream as soon as the Nth set has been read. Because of loop unrolling this leads to parallel processing, which looks something like this:

![Unravelling of streams in loop and parallel processing](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Parallel_process.JPG "Unravelling of streams in loop and parallel processing")

## Introduction to the algorithm
*The Roberts operator is very effective at detecting features in an image, particularly in images with fine, sharply defined features.*

*The Roberts operator is a pair of 2x2 matrices whose weights pick out intensity differences along the two diagonals, which is why it can be used to find boundaries in an image.*
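
As a rough software reference for the operator described above, the sketch below applies the two standard Roberts cross kernels to a small grayscale image in plain Python. It mirrors the row pairing used in this project (each output row is computed from two adjacent input rows), but the names `roberts_row` and `roberts_cross`, the choice of |gx| + |gy| as the magnitude approximation and the tiny test image are assumptions made for illustration; this is not the HLS code of the overlay.

```python
"""Reference sketch of the Roberts cross on a grayscale image (pure Python)."""


def roberts_row(upper, lower):
    """Roberts cross response for one pair of adjacent rows.

    Standard kernels:
        Gx = [[+1, 0], [0, -1]]    Gy = [[0, +1], [-1, 0]]
    Assumption: the magnitude is approximated as |gx| + |gy|
    instead of the square-root form.
    """
    out = []
    for x in range(len(upper) - 1):
        gx = upper[x] - lower[x + 1]   # difference along the main diagonal
        gy = upper[x + 1] - lower[x]   # difference along the anti-diagonal
        out.append(abs(gx) + abs(gy))
    return out


def roberts_cross(image):
    """Apply roberts_row to every pair of adjacent rows of a 2D list."""
    return [roberts_row(image[y], image[y + 1]) for y in range(len(image) - 1)]


# Hypothetical 4x4 test image: dark left half, bright right half.
image = [
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
    [10, 10, 200, 200],
]

edges = roberts_cross(image)
for row in edges:
    print(row)
```

Each printed row is `[0, 380, 0]`: the flat regions give 0 and only the 2x2 window that straddles the vertical edge responds. Because the operator only ever needs two neighbouring rows at a time, it fits the pairing of adjacent row streams described in the implementation strategy.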
[Introduction to the Roberts cross operation](https://homepages.inf.ed.ac.uk/rbf/HIPR2/roberts.htm "Brief introduction to the Roberts cross operation")

## What we achieved and the caveat

*We intended to build an architecture that can take multiple streams and process them at the same level of parallelism, and we were successful.*

*Our main goal was to ensure that such an architecture runs faster on the FPGA, and it was reasonably fast; most importantly, it can be scaled up to handle multiple streams.*

#### An important observation was that for very large images (greater than 1 MB) our resizer is almost 3 times faster on the FPGA, though the figure is only about 0.2 times for smaller images.

*The CPU average was about 25 s and the FPGA about 10 s for 6 images.*

![Speed comparison in single images](https://mygit.th-deg.de/sm11312/fpga_final_project/-/raw/main/Speed_table.JPG "Speed comparison in single images")

## Future scope

*This is a new idea and has no previous references except implementation guides.*

*The image processing can serve as a stepping stone for controlling multi-agent systems, where each streaming interface can be used for instruction input and output for each agent/bot.*

*We achieved good synchronization between the input streams in terms of pixel processing. We can consider extending the filter to streaming video; this might be possible with a similar kind of streaming interface.*

#### Tasks

The tasks and the maximum actual time spent on each:

1. Problem statement and brainstorming for project selection: *24 hrs*
2. Design a basic model and build overlay: *6 hrs*
3. Python code adjustment and integration: *4 hrs*
4. Plan next stage of overlay design: *4 hrs*
5. Develop algorithm for FPGA using C++: *4 hrs*
6. Optimize code and add synchronization of multiple channels: *24 hrs*
7. Implement block diagram: *4 hrs*
8. Upload code and test in IPy notebook: *3 hrs*

#### Resources used and Future project topics

##### Resources used

0. Images: https://serc.carleton.edu/earth_analysis/image_analysis/introduction/day_4_part_2.html
1. Image segmentation: https://theailearner.com/2020/11/29/image-segmentation-with-watershed-algorithm/
2. Operation with stream: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/hls_stream_library.html#ivv1539734234667__ad398476
3. Stream Interface: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/managing_interface_synthesis.html#ariaid-title32
4. Specialized Constructs: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/special_graph_constructs.html?hl=template
5. Vitis Examples: https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/cpp_kernels/README.md
6. Running Accelerator: https://pynq.readthedocs.io/en/v2.6.1/pynq_alveo.html#running-accelerators
7. Pragma Interfaces: https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/jit1504034365862.html
8. AXI4: https://ch.mathworks.com/help/hdlcoder/ug/getting-started-with-axi4-stream-interface-in-zynq-workflow.html
9. Interface of Streaming: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/managing_interface_synthesis.html#ariaid-title34
10. Database in FPGA: https://dspace.mit.edu/bitstream/handle/1721.1/91829/894228451-MIT.pdf, https://www.xilinx.com/publications/events/developer-forum/2018-frankfurt/accelerating-databases-with-fpgas.pdf, https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_process.html#djn1584047476918
11. Muxed Stream: https://liu.diva-portal.org/smash/get/diva2:1057270/FULLTEXT01.pdf
12. RAW, WAR, WAW: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_optimization_techniques.html#wen1539734225565__aa1299615
13. Loop Pipelining Roll Unroll: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/vitis_hls_optimization_techniques.html#kcq1539734224846

### Error logs and issues encountered

The input pins (listed below) are either not connected or do not have a source port, and they don't have a tie-off specified.
These pins are tied-off to all 0's to avoid error in Implementation flow. Please check your design and connect them as needed: /color_filter/ap_start

- The message above appears when `ap_ctrl_none` is not specified in the design.
- Can't find the custom IP in Vivado: add the IP zip path, open the IP Integrator view, and add the IP manually from the IP configure window.
- Can't connect an `hls::stream<>` type object in the IP. Note: The hls::stream class should always be passed between functions as a C++ reference argument. For example, &my_stream. IMPORTANT: The hls::stream class is only used in C++ designs. Array of streams is not supported.
- Non-blocking writes are not allowed on non-FIFO interfaces like `axis`; try using FIFO / `m_axi` instead.
- The DMA size must be less than 16383, so we can't feed very large datasets directly to a single DMA.
- WARNING: [HLS 200-786] Detected dataflow-on-top in function 'color_filter' (../project_3/color_filter.cpp:45) with default interface mode 'ap_ctrl_hs'. Overlapped execution of successive kernel calls will not happen unless interface mode 'ap_ctrl_chain' is used (or 'ap_ctrl_none' for a purely data-driven design). Resolution: For help on HLS 200-786 see www.xilinx.com/cgi-bin/docs/rdoc?v=2020.2;t=hls+guidance;d=200-786.html
- DMA stuck and not responding: [fixed thanks to Lauri's Blog](https://lauri.võsandi.com/hdl/zynq/xilinx-dma.html), [problems others faced](https://forums.xilinx.com/t5/Processor-System-Design-and-AXI/Why-AXI-DMA-starts-acquiring-data-during-configuration/td-p/766605) and more [problems](https://forums.xilinx.com/t5/AXI-Infrastructure-Archive/tkeep-signal-in-AXI-DMA-and-tstrb-3-0-in-Custom-AXI-Stream-IP/td-p/921850)
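
One way to live with the DMA size limit noted above is to cut each row into packets and always send the Nth packet of all three rows together, which also keeps the streams in step. The sketch below shows only the host-side bookkeeping in plain Python. `MAX_DMA_ITEMS`, `split_row` and `synchronized_packets` are hypothetical names, the limit is treated as a count of items per transfer (an assumption, since the unit is not stated above), and the actual DMA transfer calls are deliberately left out.

```python
"""Host-side sketch: cut three rows into packets that respect a DMA size limit."""

# Limit quoted in the issue list; the unit (items per transfer) is an assumption.
MAX_DMA_ITEMS = 16383


def split_row(row, max_items=MAX_DMA_ITEMS - 1):
    """Split one row into consecutive packets of at most max_items items."""
    if max_items <= 0:
        raise ValueError("max_items must be positive")
    return [row[i:i + max_items] for i in range(0, len(row), max_items)]


def synchronized_packets(row_a, row_b, row_c, max_items=MAX_DMA_ITEMS - 1):
    """Yield (n, packet_a, packet_b, packet_c) so the Nth packets travel together."""
    if not (len(row_a) == len(row_b) == len(row_c)):
        raise ValueError("rows must have equal length to stay synchronized")
    packets = zip(split_row(row_a, max_items),
                  split_row(row_b, max_items),
                  split_row(row_c, max_items))
    for n, (pa, pb, pc) in enumerate(packets):
        yield n, pa, pb, pc


# Hypothetical rows of 10 pixels each, with an artificially small limit of 4.
row_a = list(range(0, 10))
row_b = list(range(100, 110))
row_c = list(range(200, 210))

schedule = list(synchronized_packets(row_a, row_b, row_c, max_items=4))
for n, pa, pb, pc in schedule:
    # In a real overlay, the three DMA transfers for packet n would be issued here.
    print(n, pa, pb, pc)
```

With these inputs the rows are cut into three packets of 4, 4 and 2 items, and every step hands out the same packet index for all three rows. A 2x2 operator also needs the first pixel of the following packet, so a real design would probably overlap packets by one item; that is left out here to keep the sketch short.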