Commit 88cddfea authored by Sabyasachi Mondal

commit alpha

parent 1d926e1f
@@ -7,22 +7,44 @@ We want to use FPGA for implementing an algorithm in hardware to perform computa
# Background
CPUs are known for their general-purpose use, and in the same way GPUs can power all kinds of applications. A CPU can simulate any finite state machine, but it cannot be reprogrammed as hardware. In a CPU the hardware is static, so all data gets converted to the same fixed instruction set, executed one instruction at a time.
In an FPGA, by contrast, we may implement multiple multipliers or registers that work in parallel, or in a specific order, at the hardware level. Depending on the kind of data we expect to receive, we can implement hardware that processes exactly that type of data much faster.
For application-specific needs like signal processing, the CPU relies on the same compilation techniques and the same machine-level instructions, which cannot be optimized beyond designing better software at the high or mid level.
But we can break that stereotype: as software designers we can develop our own algorithms bottom-up, from the register level to high-level code (Python, for example), which can prove immensely powerful for task-specific algorithms.
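To make the contrast concrete, here is a toy HLS C++ sketch (illustrative only, not part of this project's code) of the kind of fixed-function parallelism an FPGA allows: if we only ever needed to add two 4x4 matrices, the unrolled loops below can become sixteen adders that all fire in the same clock cycle, whereas a CPU would issue the additions one instruction at a time.

```cpp
// Toy example: a fully unrolled 4x4 matrix adder.
// ARRAY_PARTITION exposes every element at once and UNROLL replicates the
// adder, so all 16 sums are computed in parallel in hardware.
void mat_add_4x4(const int a[4][4], const int b[4][4], int c[4][4]) {
#pragma HLS ARRAY_PARTITION variable=a complete dim=0
#pragma HLS ARRAY_PARTITION variable=b complete dim=0
#pragma HLS ARRAY_PARTITION variable=c complete dim=0
    for (int i = 0; i < 4; i++) {
#pragma HLS UNROLL
        for (int j = 0; j < 4; j++) {
#pragma HLS UNROLL
            c[i][j] = a[i][j] + b[i][j];
        }
    }
}
```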
# Objective
Our objective is to develop better-integrated code so that our hardware and software work hand in hand to deliver the best result. We start by thinking of an algorithm in Python and how it can be optimized to run in the FPGA's logic fabric. We then develop the hardware in C++, write/burn it into the FPGA, and use our Python code to drive it.
In this case we are going to use the FPGA to implement a processing unit in hardware, from high-level C code, that will be able to perform image processing (like inversion, color swap, color sieve) at a much faster rate:
1. *Perform image processing using the registers, streaming interface and DMA* [Future scope: multi-agent control]
and
compare how the CPU performs against our FPGA hardware, which is wired up exactly for the kind of data we expect to provide as input.
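As a rough sketch of what such a processing unit could look like in HLS C++ (assuming one packed 24-bit RGB pixel per 32-bit AXI-Stream word; the function name, pixel packing and the inversion operation are placeholders, not the final design):

```cpp
#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 1, 1, 1> pixel_t;   // 32-bit data word plus TLAST/TKEEP side channels

// Sketch of a streaming inversion IP: one pixel in, one pixel out per clock
// cycle once the pipeline fills, connected to the DMA over AXI-Stream.
void invert_stream(hls::stream<pixel_t> &in, hls::stream<pixel_t> &out) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE ap_ctrl_none port=return

    pixel_t px;
    do {
#pragma HLS PIPELINE II=1
        px = in.read();
        px.data = ~px.data & 0x00FFFFFF;   // invert the lower 24 RGB bits
        out.write(px);                     // TLAST/TKEEP travel along inside px
    } while (!px.last);                    // stop at the end of the packet/frame
}
```

On the Python side the same IP is simply fed and drained through the DMA send/receive channels, which is where the CPU-vs-FPGA timing comparison is made.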
# Implementation Strategy
First we need to determine the type of data we will be using in our project, and based on that decide the type of ports and hardware to use in the FPGA. Previously we have seen that the image resizer takes in the whole image and that DMA makes the data transfer rate much faster, but there were several instances where the CPU performed better and faster, specifically over a wider range of image dimensions, colors and sizes.
We intend to implement the following:
1. Make faster multichannel operations at the hardware level, integrated with similar high-level software constructs
2. Make the FPGA capable of processing images over as wide a range as our CPU supports
This is how a typical OpenCV resizer works:
<Data Transfer Image>
We will notice this further if we study the resizer code: the 2D image is fed to our DMA and internally the whole image is read row by row, column by column. The image array size is static because we have finite space in the FPGA.
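Schematically, the whole-frame pattern described above looks roughly like the sketch below (illustrative code, not the actual resizer source); the compile-time bounds are exactly what limit the supported image size:

```cpp
#include "ap_axi_sdata.h"
#include "hls_stream.h"

#define MAX_H 512   // compile-time bounds: the on-chip buffer cannot grow at runtime
#define MAX_W 512

// Sketch of the whole-frame approach: the complete image is copied from the
// stream into a statically sized on-chip array, row by row and column by
// column, before any processing happens.
void read_frame(hls::stream<ap_axiu<32, 1, 1, 1> > &in,
                ap_uint<32> frame[MAX_H][MAX_W], int rows, int cols) {
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
#pragma HLS PIPELINE II=1
            frame[r][c] = in.read().data;   // rows/cols must fit within the static bounds
        }
    }
}
```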
This may be made more efficient and robust (accommodating any image width) by implementing the following changes:
1. Multichannel image operation, where we use parallel threads for processing. Each of these processing logic entities (utilizing multiple CLBs) is expected to be faster.
2. By chunking and sending data in packets from our high-level code, we can also ensure that our FPGA can process an image much larger than its own memory or DMA allocation space (as sketched below).
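One way to realize the chunking idea is sketched below, assuming the host splits the image into DMA-sized packets and passes each packet's length to the IP over AXI-Lite (the names and the per-pixel operation are placeholders):

```cpp
#include "ap_axi_sdata.h"
#include "hls_stream.h"

typedef ap_axiu<32, 1, 1, 1> word_t;

// Sketch: process one DMA-sized chunk per call. The IP never buffers a full
// frame, only one word at a time, so the host can stream an image far larger
// than the FPGA's memory or the DMA allocation space, one chunk per transfer.
void process_chunk(hls::stream<word_t> &in, hls::stream<word_t> &out, int len) {
#pragma HLS INTERFACE axis port=in
#pragma HLS INTERFACE axis port=out
#pragma HLS INTERFACE s_axilite port=len
#pragma HLS INTERFACE s_axilite port=return

    for (int i = 0; i < len; i++) {
#pragma HLS PIPELINE II=1
        word_t w = in.read();
        w.data = w.data ^ 0x00FFFFFF;   // placeholder per-pixel operation
        w.last = (i == len - 1);        // assert TLAST so the DMA ends the transfer
        out.write(w);
    }
}
```

The Python driver then loops over the image, issuing one DMA transfer per chunk, which also keeps every transfer under the DMA size limit noted in the error log below.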
We use two streams of data, each with its own processing unit in our IP, which can be represented schematically as:
<image for our Implementation>
After this we need to form a mental sketch of the hardware that, if implemented, can make the processing faster.
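One possible sketch of that hardware, in HLS C++, for the two-stream arrangement above: two independent AXI-Stream channels, each with its own processing unit, running concurrently under DATAFLOW (stream names and the per-channel operations are placeholders):

```cpp
#include "ap_axi_sdata.h"
#include "ap_int.h"
#include "hls_stream.h"

typedef ap_axiu<32, 1, 1, 1> pix_t;

// One self-contained processing unit: reads a packet word by word, applies a
// placeholder mask operation, and forwards TLAST untouched.
static void channel_unit(hls::stream<pix_t> &in, hls::stream<pix_t> &out,
                         ap_uint<32> mask) {
    pix_t p;
    do {
#pragma HLS PIPELINE II=1
        p = in.read();
        p.data ^= mask;        // placeholder operation for this channel
        out.write(p);
    } while (!p.last);
}

// Top level: two streams, each with its own unit, executing in parallel.
void dual_stream_top(hls::stream<pix_t> &in0, hls::stream<pix_t> &out0,
                     hls::stream<pix_t> &in1, hls::stream<pix_t> &out1) {
#pragma HLS INTERFACE axis port=in0
#pragma HLS INTERFACE axis port=out0
#pragma HLS INTERFACE axis port=in1
#pragma HLS INTERFACE axis port=out1
#pragma HLS INTERFACE ap_ctrl_none port=return
#pragma HLS DATAFLOW
    channel_unit(in0, out0, 0x00FFFFFF);   // channel 0: e.g. inversion mask
    channel_unit(in1, out1, 0x0000FF00);   // channel 1: a different mask
}
```

Each `channel_unit` instance becomes its own block of logic, so the two streams are processed genuinely in parallel rather than time-sliced on one core.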
###### At this point we will do a project estimate analysis and select one of the above problem statements if needed (to fit within the time)
@@ -37,18 +59,15 @@ Then finally we can check the runtime and reach a conclusion on which is faster
# Tasks
The Tasks and maximum estimated time:
1. Problem statement and brainstorming for project selection: *24 hrs*
2. Design a basic model and build overlay: *4 hrs*
3. Python code adjustment and integration: *3 hrs*
4. Implement next stage of overlay design: *_ hrs*
5. Writing the code in Vivado: *6 hrs*
6. Implementing the code, checking hardware features and making final adjustments: *16 hrs*
7. Bitstream generation and Python code for the overlay: *2 hrs*
8. Drafting the report and analysis: *4 hrs*
# Resources used and Future project topics
#### Resources used
Operation with stream: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/hls_stream_library.html#ivv1539734234667__ad398476
Specialized Constructs: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/special_graph_constructs.html?hl=template
Database in FPGA: https://dspace.mit.edu/bitstream/handle/1721.1/91829/894228451-MIT.pdf?sequence=2&isAllowed=y
@@ -58,12 +77,10 @@ Vitis Examples: https://github.com/Xilinx/Vitis_Accel_Examples/blob/master/cpp_
Running Accelerator: https://pynq.readthedocs.io/en/v2.6.1/pynq_alveo.html#running-accelerators
Pragma Interfaces: https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/jit1504034365862.html
Interface of Streaming: https://www.xilinx.com/html_docs/xilinx2020_2/vitis_doc/managing_interface_synthesis.html#ariaid-title34
#### Future scope
The image processing work can serve as a stepping stone for controlling multi-agent systems, where each streaming interface is used as the instruction input and output for one agent. Instead of running an RTOS on each bot, we can have multiple data streams from the bots processed in an IP designed to emulate an FSM for each agent and decide its actions. This can lead to higher robustness, fault tolerance and lower costs.
# Error Logs and Issues encountered
[BD 41-759] The input pins (listed below) are either not connected or do not have a source port, and they don't have a tie-off specified. These pins are tied-off to all 0's to avoid error in Implementation flow.
@@ -77,3 +94,5 @@ Cant connect hls::stream<> type object in IP : Note: The hls::stream class shoul
IMPORTANT: The hls::stream class is only used in C++ designs. Array of streams is not supported.
Non-blocking write is not allowed in non-FIFO interfaces like axis; instead try using FIFO m_axi.
DMA transfer size must be less than 16383, so we can't feed very large datasets directly to a single DMA.