2. Hardware Overview
The hardware is structured in 3 hierarchical levels:
The System level
The IP Core level
The Engine Core level
2.1. System Level
2.1.1. Description
The figure gives an overview of the system level of the project. A host PC passes input data, weights, and configuration to the service processor over the Ethernet interface. The service processor runs C and Python scripts on an installed PetaLinux to drive the accelerator over a direct connection to the THANNA IP core. Data such as parameters, inputs, and outputs is stored in and loaded from the external DDR memory. The programmable THANNA IP core also has direct access to this memory, which enables the exchange of large amounts of data. Because the CPU and the IP core have separate bus connections to the DDR, each bus can perform either a read or a write access at any given time, but not both. Cache coherency is enabled to synchronize CPU and IP core accesses to the DDR and to avoid data loss. The components on the Zybo Z7-20 are connected via 32-bit data buses, so each bus transfers 4 bytes at a time.
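As a small worked example of the 4-bytes-per-transfer figure, the following sketch estimates how many bus transfers a buffer of a given size occupies. The function name and the example buffer sizes are illustrative assumptions, and the calculation ignores burst handling and protocol overhead.

```python
BUS_WIDTH_BITS = 32                        # bus width stated in the text
BYTES_PER_TRANSFER = BUS_WIDTH_BITS // 8   # 4 bytes per transfer


def transfers_needed(num_bytes: int) -> int:
    """Return the number of bus transfers needed to move num_bytes."""
    if num_bytes < 0:
        raise ValueError("num_bytes must be non-negative")
    # A partially filled last word still occupies a whole transfer,
    # so round up using integer arithmetic.
    return (num_bytes + BYTES_PER_TRANSFER - 1) // BYTES_PER_TRANSFER


# Hypothetical buffer sizes, chosen only for illustration.
print(transfers_needed(1024))  # 1 KiB buffer
print(transfers_needed(10))    # size that is not a multiple of 4
```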
2.1.2. Block Design
The hardware on the FPGA board is connected via the block design, which can be opened
and adapted with the Vivado editor. If changes made to the block design file in Vivado should persist,
copy the file from out/system/design_1.bd to board_files/<<name_of_fpga_device>>/design_1.bd.
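The copy step can also be scripted. The following is a minimal sketch, assuming the script is given the repository root and the device folder name; the helper name `persist_block_design` is hypothetical and not part of the project.

```python
import shutil
from pathlib import Path


def persist_block_design(repo_root, device_name):
    """Copy the generated block design into the board_files folder of one device."""
    root = Path(repo_root)
    source = root / "out" / "system" / "design_1.bd"
    target_dir = root / "board_files" / device_name
    if not source.is_file():
        raise FileNotFoundError(f"block design not found: {source}")
    if not target_dir.is_dir():
        # A missing folder most likely means a misspelled device name,
        # so fail instead of silently creating a new folder.
        raise NotADirectoryError(f"unknown device folder: {target_dir}")
    target = target_dir / "design_1.bd"
    shutil.copy2(source, target)  # copy2 also preserves file timestamps
    return target
```

Called as `persist_block_design(".", "<<name_of_fpga_device>>")` from the repository root, with the placeholder replaced by the actual device folder, it returns the path of the copied file.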
With the block design editor in Vivado one can:
integrate new peripheral components or IP cores
configure the current IP Core through parameters
generate the bitstream to program the FPGA board
2.2. IP Core Level
2.2.1. Description
The IP Core level describes the data interfaces between the peripherals and the main engines. The following diagram presents an overview of the THANNA_IP_CORE:
The THANNA IP Core and the service processor are directly connected over the AXI_SLAVE
component. Both also share access to the external DDR memory, which the IP core reaches
through the AXI_MASTER and MASTER_CONTROLLER.
The data flow and the command flow are kept strictly separated at the interface between
the IP core and the host processor, since they have different requirements. The data flow
includes the inputs, intermediate results, and parameters. The command flow
consists of instructions, responses, and debug information. Within the IP core, the
SLAVE_CONTROLLER module routes the command flow to the targeted engine.
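The routing idea can be illustrated with a small behavioural model. This is only a sketch under assumptions: the class name, the integer core ids, and the handler signature are hypothetical and do not describe the actual SLAVE_CONTROLLER implementation or its command format.

```python
class SlaveControllerModel:
    """Behavioural stand-in for command routing (not the hardware design)."""

    def __init__(self):
        self._engines = {}  # maps a core id to the handler of one engine

    def attach(self, core_id, handler):
        """Register an engine handler under a core id (assumed to be an int)."""
        if core_id in self._engines:
            raise ValueError(f"core id {core_id} is already in use")
        self._engines[core_id] = handler

    def dispatch(self, core_id, instruction):
        """Forward one instruction to the targeted engine and return its response."""
        if core_id not in self._engines:
            raise LookupError(f"no engine attached at core id {core_id}")
        return self._engines[core_id](instruction)


# Usage: two dummy engines that acknowledge whatever instruction they receive.
controller = SlaveControllerModel()
controller.attach(0, lambda instruction: ("engine0", instruction))
controller.attach(1, lambda instruction: ("engine1", instruction))
print(controller.dispatch(1, "start"))
```

Keeping the routing table separate from the engines mirrors the point made above: engines can be added or exchanged as long as they follow the interface.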
The number and types of engines deployed to the FPGA are fully
customizable, as long as they are connected correctly and follow the rules of
the interfaces.
2.2.2. Modules
MAIN: Top-level file for deployment on the FPGA device
MAIN_SIM: Top-level file for the simulation on the FPGA device
AXI_TEST: Example engine core that tests the AXI interfaces
SLAVE_CONTROLLER: Command transfer logic
AXI_SLAVE: Vivado AXI interface module to the host CPU
MASTER_CONTROLLER: Write / Read and engine core DDR access prioritization
AXI_MASTER: Vivado AXI interface module to the DDR
2.2.3. Interfaces
Hardware Interfaces:
SLAVE_CONTROLLER_INTERFACE: Example engine core that tests the AXI interfaces
MASTER_CONTROLLER_INTEFACE: Write / Read and engine core DDR access prioritization
Software Interfaces:
MASTER_CONTROLLER_INTEFACE: Interface to the host CPU which includes command handshakes and core targeting
AXI_TEST_INTEFACE: Interface to the host CPU that explains all provided command types
2.3. The Engine Core Level
2.3.1. Description
Each engine core can receive a custom set of control commands from the host CPU and can send data to and receive data from the external memory. Beyond that, it is free to do anything with the data and commands.
The AXI_TEST engine core serves as a demo engine core that tests the interface functionality.
It accepts essentially three control commands:
read data from the DDR
increment the data
write the data back
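The three commands listed above can be mimicked in software to check expectations before running on the board. The following behavioural sketch makes several assumptions that the text does not state: data is handled as 32-bit little-endian words, "increment" adds one to each word with wraparound, and addresses are byte offsets into a `bytearray` standing in for the DDR. The class and method names are hypothetical.

```python
WORD_BYTES = 4  # assumed word size, matching the 32-bit data bus


class AxiTestModel:
    """Software stand-in for the AXI_TEST engine core (not the hardware design)."""

    def __init__(self, ddr: bytearray):
        self.ddr = ddr     # simulated external memory
        self.buffer = []   # words currently held inside the engine

    def _check_range(self, address, num_words):
        if address < 0 or address + num_words * WORD_BYTES > len(self.ddr):
            raise IndexError("access outside the simulated DDR")

    def read(self, address, num_words):
        """Command 1: load num_words words from the DDR into the engine."""
        self._check_range(address, num_words)
        self.buffer = [
            int.from_bytes(self.ddr[address + i * WORD_BYTES:
                                    address + (i + 1) * WORD_BYTES], "little")
            for i in range(num_words)
        ]

    def increment(self):
        """Command 2: add one to every buffered word (32-bit wraparound assumed)."""
        self.buffer = [(word + 1) & 0xFFFFFFFF for word in self.buffer]

    def write_back(self, address):
        """Command 3: store the buffered words to the DDR at address."""
        self._check_range(address, len(self.buffer))
        for i, word in enumerate(self.buffer):
            start = address + i * WORD_BYTES
            self.ddr[start:start + WORD_BYTES] = word.to_bytes(WORD_BYTES, "little")
```

A typical check runs `read`, `increment`, and `write_back` in that order and compares the memory contents with the values read back from the real engine.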
The SINGLE_ENGINE_CORE is intended as a lightweight but scalable engine core that can compute
all kinds of layers within a neural network.
The currently supported layers are:
convolutional layers
fully connected layers
In addition it can perform:
a ReLU activation function
max pooling
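For comparing accelerator results against a software reference, the two additional operations can be written in a few lines. This is a plain-Python sketch of the standard definitions, not the engine's implementation; the non-overlapping window with a default size of 2 is an assumption, since the text does not specify the pooling parameters.

```python
def relu(values):
    """Apply ReLU element-wise: negative values become zero."""
    return [max(0, value) for value in values]


def max_pool_2d(feature_map, size=2):
    """Max-pool a 2D list with non-overlapping size x size windows (assumed)."""
    rows, cols = len(feature_map), len(feature_map[0])
    if rows % size or cols % size:
        raise ValueError("feature map dimensions must be multiples of size")
    return [
        [
            # Maximum over one size x size window starting at (row, col).
            max(feature_map[row + dr][col + dc]
                for dr in range(size) for dc in range(size))
            for col in range(0, cols, size)
        ]
        for row in range(0, rows, size)
    ]
```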