BMNNSDK provides a lightweight set of C/C++ APIs for deep learning application developers. It consists of the TPU BMRuntime Library, the BMKernel Library and the BMNet Library, which are described in detail in this section.

bmerr_t bm_init(int index,bmctx_t *ctx)

bm_init() initializes BM device and creates a handle to BM context.

Parameter | Type | Description |

index | Input | Not used now. Set 0 as default value. |

ctx | Output | The pointer of BM context handle. |

void bm_exit(bmctx_t ctx)

bm_exit() must be called before the application exits. It releases all internal resources.

Parameter | Type | Description |

ctx | Input | The BM context handle which is created by bm_init(). |

void bm_enum_devices(int *count,bm_devinfo_t devinfo[])

bm_enum_devices() enumerates all BM devices in the system.

Parameter | Type | Description |

count | Output | The count of BM device. |

devinfo | Output | The array of device info. |

bmerr_t bm_device_open(int index,bmdev_t *dev)

bm_device_open() opens a BM device.

Parameter | Type | Description |

index | Input | The index of BM device. |

dev | Output | The pointer of BM device handle. |

void bm_device_close(bmdev_t dev)

bm_device_close() closes an opened BM device.

Parameter | Type | Description |

dev | Input | The BM device handle. |

bmerr_t bm_device_query(bmdev_t dev,int id,void *buf)

bm_device_query() always returns BM_ERR_NOT_SUPPORTED now.

bmerr_t bm_device_config(bmdev_t dev,int id,void *buf)

bm_device_config() always returns BM_ERR_NOT_SUPPORTED now.

bm_devinfo_t bm_device_get_info(bmdev_t dev)

bm_device_get_info() returns the information of a BM device.

Parameter | Type | Description |

dev | Input | The BM device handle. |

bmerr_t bm_context_create(bmctx_t *ctx)

bm_context_create() creates a BM context.

Parameter | Type | Description |

ctx | Output | The pointer of BM context handle. |

void bm_context_destroy(bmctx_t ctx)

bm_context_destroy() destroys a BM context.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

bmerr_t bm_bind_device(bmctx_t ctx,bmdev_t dev)

bm_bind_device() binds a BM context with a BM device.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

dev | Input | The BM device handle. |

void bm_unbind_device(bmctx_t ctx)

bm_unbind_device() unbinds a BM context from the BM device.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

bmdev_t bm_get_device(bmctx_t ctx)

bm_get_device() returns the BM device handle which is bound with the BM context.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

bmerr_t bmruntime_bmkernel_create(bmctx_t ctx,bmkernel_handle_t **p_bk_ctx)

bmruntime_bmkernel_create() creates a BM kernel with the BM context. p_bk_ctx points to a thread-local variable, so this API can be called from multiple threads to create multiple BM contexts that are independent of each other. A single thread, however, must not own more than one BM context at the same time; otherwise memory will leak.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

p_bk_ctx | Output | The pointer of BM kernel handle. |

void bmruntime_bmkernel_submit(bmctx_t ctx)

bmruntime_bmkernel_submit() submits the BM kernel with the BM context.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

void bmruntime_bmkernel_destroy(bmctx_t ctx)

bmruntime_bmkernel_destroy() destroys the BM kernel with the BM context.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

bmmem_device_t bmmem_device_alloc_raw(bmctx_t ctx,size_t size)

bmmem_device_alloc_raw() allocates device memory as the input size.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

size | Input | The size of device memory. |

bmmem_device_t bmmem_device_prealloc_raw(bmctx_t ctx,bmmem_device_t mem,uint64_t offset,size_t size)

bmmem_device_prealloc_raw() allows the application to allocate memory from previously allocated device memory. The requested region must fall entirely within the previously allocated device memory.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

mem | Input | The previously allocated device memory. |

offset | Input | The offset in the previously allocated device memory. |

size | Input | The size of device memory. |
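The containment requirement for bmmem_device_prealloc_raw() can be pictured as a simple range check. The sketch below is not SDK code: region_fits is a hypothetical helper, and the parent size is assumed to be known to the caller (for example from bmmem_device_size()).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not part of BMNNSDK): returns true when the
 * sub-region [offset, offset + size) lies inside a parent allocation
 * of parent_size bytes. Written so that offset + size cannot overflow. */
bool region_fits(size_t parent_size, uint64_t offset, size_t size)
{
    if (offset > parent_size)
        return false;
    return size <= parent_size - offset;
}
```

A caller could run such a check before requesting the sub-allocation, so that an out-of-range offset is caught on the host side.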

bmmem_device_t bmmem_device_alloc(bmctx_t ctx,bmshape_t *shape)

bmmem_device_alloc() allocates device memory as the input shape.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

shape | Input | The shape of device memory. |

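bmmem_device_t bmmem_device_prealloc(bmctx_t ctx,bmmem_device_t mem,uint64_t offset,bmshape_t *shape)

bmmem_device_prealloc() allocates device memory as the input shape from previously allocated device memory.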

Parameter | Type | Description |

ctx | Input | The BM context handle. |

mem | Input | The previously allocated device memory. |

offset | Input | The offset in the previously allocated device memory. |

shape | Input | The shape of device memory. |

void bmmem_device_free(bmctx_t ctx,bmmem_device_t mem)

bmmem_device_free() frees device memory that was allocated by the allocation functions above.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

mem | Input | The device memory handle. |

bmmem_host_t bmmem_host_alloc(bmctx_t ctx,bmshape_t *shape)

bmmem_host_alloc() always returns BM_ERR_NOT_SUPPORTED now.

void bmmem_host_free(bmctx_t ctx,bmmem_host_t mem)

bmmem_host_free() is not supported now.

size_t bmmem_device_size(bmctx_t ctx,bmmem_device_t mem)

bmmem_device_size() returns the device memory size.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

mem | Input | The device memory handle. |

uint64_t bmmem_device_addr(bmctx_t ctx,bmmem_device_t mem)

bmmem_device_addr() returns the device memory address.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

mem | Input | The device memory handle. |

void* bmmem_host_v_addr(bmctx_t ctx,bmmem_host_t mem)

bmmem_host_v_addr() always returns BM_ERR_NOT_SUPPORTED now.

uint64_t bmmem_host_p_addr(bmctx_t ctx,bmmem_host_t mem)

bmmem_host_p_addr() always returns BM_ERR_NOT_SUPPORTED now.

bmerr_t bm_memcpy_s2d(bmctx_t ctx,bmmem_device_t dst,uint8_t* src)

bm_memcpy_s2d() copies data from system memory to device memory. "s" stands for system and "d" for device.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

dst | Input | The device memory handle. |

src | Input | The system memory pointer. |

bmerr_t bm_memcpy_d2s(bmctx_t ctx,uint8_t* dst,bmmem_device_t src)

bm_memcpy_d2s() copies data from device memory to system memory.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

dst | Input | The system memory pointer. |

src | Input | The device memory handle. |

bmerr_t bmnet_register(bmctx_t ctx,bmnet_info_t *info,bmnet_t *net)

bmnet_register() registers a neural network with bmnet info.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

info | Input | The BM network info. |

net | Output | The registered network handle. |

bmerr_t bmnet_register_bmodel(bmctx_t ctx,char *bmodel,bmnet_t *net)

bmnet_register_bmodel() registers a neural network with a bmodel file.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

bmodel | Input | bmodel filename. |

net | Output | The registered network handle. |

bmerr_t bmnet_register_noalloc(bmctx_t ctx,bmnet_info_t *info,bmnet_t *net)

bmnet_register_noalloc() registers a compiled neural network without allocating weight and neuron device memory.

Parameter | Type | Description |

ctx | Input | The BM context handle. |

info | Input | The BM network info. |

net | Output | The registered network handle. |

bmerr_t bmnet_set_input_shape(bmnet_t net,shape_t input_shape)

bmnet_set_input_shape() sets an input shape for a registered BM network. A bmodel can support several input shapes; this API selects one of them.

Parameter | Type | Description |

net | Input | The BM network handle. |

input_shape | Input | The input shape. |

bmerr_t bmnet_get_output_info(bmnet_t net,bmnet_output_info_t *output_info)

bmnet_get_output_info() gets the output info of a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

output_info | Output | The output info. |

void bmnet_cleanup(bmnet_t net)

bmnet_cleanup() cleans up a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

bmerr_t bmnet_run(bmnet_t net)

bmnet_run() runs a registered BM network. The application must load the input and store the output itself.

Parameter | Type | Description |

net | Input | The BM network handle. |

bmmem_device_t bmnet_weight_devmem(bmnet_t net)

bmnet_weight_devmem() retrieves the weight device memory handle from a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

bmmem_device_t bmnet_neuron_devmem(bmnet_t net)

bmnet_neuron_devmem() retrieves the neuron device memory handle from a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

bmmem_device_t bmnet_input_devmem(bmnet_t net)

bmnet_input_devmem() retrieves the input device memory handle from a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

bmmem_device_t bmnet_output_devmem(bmnet_t net)

bmnet_output_devmem() retrieves the output device memory handle from a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

bmerr_t bmnet_import_weight_devmem(bmnet_t net,bmmem_device_t weight_mem)

bmnet_import_weight_devmem() imports weight device memory for a registered BM network. The application should first allocate the weight device memory and then call this function to import it. This function and bmnet_import_neuron_devmem() are usually used together with bmnet_register_noalloc(): the application registers a BM network without allocating weight and neuron device memory, and then uses these two functions to import the weight and neuron memory.

Parameter | Type | Description |

net | Input | The BM network handle. |

weight_mem | Input | The weight device memory handle. |

bmerr_t bmnet_import_neuron_devmem(bmnet_t net,bmmem_device_t neuron_mem)

bmnet_import_neuron_devmem() imports neuron device memory for a registered BM network. The application should first allocate the neuron device memory and then call this function to import it.

Parameter | Type | Description |

net | Input | The BM network handle. |

neuron_mem | Input | The neuron device memory handle. |

bmerr_t bmnet_load_input(bmnet_t net,uint8_t *input)

bmnet_load_input() loads input data for a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

input | Input | The input data pointer. |

bmerr_t bmnet_load_neuron(bmnet_t net,uint64_t neuron_offset,int neuron_size,uint8_t *neuron)

bmnet_load_neuron() loads neuron data for a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

neuron_offset | Input | The offset of neuron buffer. |

neuron_size | Input | The neuron buffer size. |

neuron | Input | The pointer to the neuron buffer. |

bmerr_t bmnet_store_output(bmnet_t net,uint8_t *output)

bmnet_store_output() stores output data for a registered BM network. Application uses this function to copy output data from device memory to host memory.

Parameter | Type | Description |

net | Input | The BM network handle. |

output | Input | The output buffer pointer. |

bmerr_t bmnet_store_neuron(bmnet_t net,uint64_t neuron_offset,int neuron_size,uint8_t *neuron)

bmnet_store_neuron() stores neuron data for a registered BM network. Application uses this function to copy neuron data from device memory to host memory.

Parameter | Type | Description |

net | Input | The BM network handle. |

neuron_offset | Input | The offset of neuron buffer. |

neuron_size | Input | The neuron buffer size. |

neuron | Input | The pointer to the neuron buffer. |

bmerr_t bmnet_inference(bmnet_t net,uint8_t *input,uint8_t *output)

bmnet_inference() runs inference with a registered BM network.

Parameter | Type | Description |

net | Input | The BM network handle. |

input | Input | The input buffer pointer. |

output | Input | The output buffer pointer. |
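The runtime functions above are typically combined in a fixed order: create the context, register the network, run inference, clean up the network, and release the context. The sketch below illustrates that order only. So that it compiles without the SDK headers, the types, the BM_SUCCESS code and all function bodies are stand-ins invented for this example; only the function names and parameter lists are taken from the reference above, and run_once is a hypothetical wrapper.

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-ins so the sketch compiles without the SDK headers. The real
 * types, error codes and function bodies come from BMNNSDK; everything
 * in this section is an assumption made for illustration only. */
typedef int bmerr_t;
#define BM_SUCCESS 0                 /* hypothetical success code */
struct mock_ctx { int live; };
struct mock_net { int live; };
typedef struct mock_ctx *bmctx_t;
typedef struct mock_net *bmnet_t;

static struct mock_ctx g_ctx;
static struct mock_net g_net;

bmerr_t bm_init(int index, bmctx_t *ctx)
{
    (void)index;                     /* not used now, pass 0 */
    g_ctx.live = 1;
    *ctx = &g_ctx;
    return BM_SUCCESS;
}

void bm_exit(bmctx_t ctx) { ctx->live = 0; }

bmerr_t bmnet_register_bmodel(bmctx_t ctx, char *bmodel, bmnet_t *net)
{
    if (!ctx->live || bmodel == NULL)
        return -1;
    g_net.live = 1;
    *net = &g_net;
    return BM_SUCCESS;
}

void bmnet_cleanup(bmnet_t net) { net->live = 0; }

bmerr_t bmnet_inference(bmnet_t net, uint8_t *input, uint8_t *output)
{
    if (!net->live)
        return -1;
    output[0] = input[0];            /* mock: echo one byte, no model is run */
    return BM_SUCCESS;
}

/* Call order: init, register, infer, clean up the network, release
 * the context. Returns 0 on success and -1 on any failure. */
int run_once(char *bmodel, uint8_t *input, uint8_t *output)
{
    bmctx_t ctx;
    bmnet_t net;
    int rc = -1;

    if (bm_init(0, &ctx) != BM_SUCCESS)
        return -1;
    if (bmnet_register_bmodel(ctx, bmodel, &net) == BM_SUCCESS) {
        if (bmnet_inference(net, input, output) == BM_SUCCESS)
            rc = 0;
        bmnet_cleanup(net);          /* always paired with a successful register */
    }
    bm_exit(ctx);                    /* must run before the application exits */
    return rc;
}
```

In a real application the stand-in section would be replaced by the SDK headers, and the comparison against BM_SUCCESS by whatever success code bmerr_t actually defines.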

The user allocates a BMKernel context by filling in a bmk1880_info_t structure and passing it to the bmk1880_register function. The function returns a handle to the initialized context.

In the bmk1880_info_t structure, chip_version is an integer describing the version of the chip to work with, and can be 1880; cmdbuf (short for “command buffer”) is a user-allocated buffer that holds the generated hardware instructions; and cmdbuf_size is its size in bytes. Note that the user is responsible for freeing cmdbuf once the BMKernel context that refers to it is no longer in use.

typedef struct {
    u32 chip_version;
    u8 *cmdbuf;
    u32 cmdbuf_size;
} bmk1880_info_t;

void *bmk1880_register(bmk1880_info_t *info);

bmk1880_cleanup frees the context previously allocated by bmk1880_register.

void bmk1880_cleanup(void *ctx);

bmk1880_acquire_cmdbuf returns a buffer of the hardware instructions generated so far and sets (*size) to the buffer's valid size in bytes. The buffer is an array of cmd_hdr_t structures, each containing one variable-sized generated hardware instruction.

u8 *bmk1880_acquire_cmdbuf(void *ctx, u32 *size);

typedef struct {
    u8 engine_id : 4;
    ...
    u8 len;
    u8 cmd[0];
} cmd_hdr_t;

In the cmd_hdr_t structure, engine_id identifies the engine on which the contained instruction is to be executed, and len gives the length in bytes of the hardware instruction that immediately follows this cmd_hdr_t structure.
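Because each header is followed by a variable-length instruction, a consumer has to walk the buffer record by record. The sketch below shows that walk under a simplifying assumption: the real cmd_hdr_t contains further fields that are elided in the listing, so the 2-byte header used here (engine id in the low 4 bits of the first byte, len in the second) is hypothetical, as is the function name.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for cmd_hdr_t. The real structure has additional
 * fields, so this 2-byte header layout is an assumption for the sketch. */
#define SKETCH_HDR_BYTES 2u

/* Counts the instructions in buf[0..size) that target the given engine.
 * Returns -1 if a record would run past the end of the buffer. */
int count_cmds_for_engine(const uint8_t *buf, size_t size, unsigned engine)
{
    size_t pos = 0;
    int count = 0;

    while (pos < size) {
        if (size - pos < SKETCH_HDR_BYTES)
            return -1;                          /* truncated header */
        unsigned engine_id = buf[pos] & 0x0Fu;  /* low 4 bits: engine id */
        size_t len = buf[pos + 1];              /* payload length in bytes */
        if (size - pos - SKETCH_HDR_BYTES < len)
            return -1;                          /* truncated payload */
        if (engine_id == engine)
            count++;
        pos += SKETCH_HDR_BYTES + len;          /* next header follows payload */
    }
    return count;
}
```

The same loop shape applies to the real layout once the header size is taken from sizeof(cmd_hdr_t).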

bmk1880_reset resets the current BMKernel context to its initial state, as returned by bmk1880_register. This function is usually called after bmk1880_acquire_cmdbuf to empty the cmdbuf buffer.

void bmk1880_reset(void *ctx);

bmk1880_parallel_enable declares that the following computations on different engines can be executed with no synchronization with each other. This function enables the engine-oriented parallel programming style.

void bmk1880_parallel_enable(void *ctx);

bmk1880_parallel_disable disables the engine-oriented parallel programming style.

void bmk1880_parallel_disable(void *ctx);

bmk1880_create_streams creates nr_streams streams, indexed 0 to (nr_streams - 1), that subsequent calls to bmk1880_set_stream can refer to. This function enables the dependency-oriented parallel programming style. Note that this style cannot be disabled once enabled.

void bmk1880_create_streams(void *ctx, int nr_streams);

bmk1880_destroy_streams destroys all the streams created by the previous call to bmk1880_create_streams and resets the system back to serial mode.

void bmk1880_destroy_streams(void *ctx);

bmk1880_set_stream sets the current stream to stream i, which must have been created by calling bmk1880_create_streams. Subsequent computations are put into this stream until bmk1880_set_stream is called again with a different stream index.

void bmk1880_set_stream(void *ctx, int i);

bmk1880_add_dependency further requires that the computation represented by before takes place strictly before the one represented by after. Both before and after are pointers returned by some computation API.

void bmk1880_add_dependency(void *ctx, void *before, void *after);

During all kinds of computation, input values are first converted into 32-bit values before any internal computation, and the final 32-bit values are saturated into the range that can be represented by the final 8-bit or 16-bit integer format. That is, if the value before saturation can be represented by the final integer format, it is unchanged. Otherwise it is saturated to the maximum or minimum of the final integer format, whichever is nearer to the original value. For example, if the final integer format is FMT_U8, the representable maximum and minimum are 255 and 0 respectively. In this case, any value greater than 255 becomes 255 after saturation, and any value smaller than 0 becomes 0.
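The saturation rule can be written as a pair of clamping functions. This is a host-side reference sketch, not SDK code; the function names are hypothetical, and the signed range of -128 to 127 is the usual two's-complement range for an 8-bit integer.

```c
#include <stdint.h>

/* Clamp a 32-bit intermediate value into the FMT_U8 range [0, 255]. */
uint8_t saturate_u8(int32_t v)
{
    if (v > 255)
        return 255;
    if (v < 0)
        return 0;
    return (uint8_t)v;
}

/* Clamp a 32-bit intermediate value into the FMT_I8 range [-128, 127]. */
int8_t saturate_i8(int32_t v)
{
    if (v > INT8_MAX)
        return INT8_MAX;
    if (v < INT8_MIN)
        return INT8_MIN;
    return (int8_t)v;
}
```

Values that already fit pass through unchanged, which matches the first case of the rule.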

Regarding signedness, one general rule applies to all kinds of computation unless otherwise specified: the result is unsigned if and only if all input tensors or matrices are unsigned. A tensor or matrix is signed if its format is FMT_I8, and unsigned if its format is FMT_U8.

fmt_t describes the type of basic data in a tensor or matrix. The name consists of three parts: “FMT” is a fixed prefix, the following “I” or “U” stands for signed or unsigned integer respectively, and “8” is the bit width of the type.

typedef u32 fmt_t;
#define FMT_I8 4
#define FMT_U8 9

shape_t describes the shape of a tensor or matrix. shape_t4 and shape_t2 construct a shape_t for a tensor and a matrix, respectively.

typedef struct {
    u32 dim;
    u32 n;
    u32 c;
    union {
        u32 h;
        u32 row;
    };
    union {
        u32 w;
        u32 col;
    };
} shape_t;

shape_t shape_t4(int n, int c, int h, int w);
shape_t shape_t2(int row, int col);

stride_t describes the stride of a tensor or matrix. stride_t4 and stride_t2 construct a stride_t for a tensor and a matrix, respectively.

typedef struct {
    u32 n;
    u32 c;
    union {
        u32 h;
        u32 row;
    };
    union {
        u32 w;
        u32 col;
    };
} stride_t;

stride_t stride_t4(int n, int c, int h, int w);
stride_t stride_t2(int row, int col);

tensor_lmem represents a tensor or matrix in lmem. fmt, shape and stride are as explained above. If stride is NULL, aligned selects one of two frequently used sets of stride values.

typedef struct {
    fmt_t fmt;
    shape_t shape;
    stride_t *stride;
    bool aligned;
    ...
} tensor_lmem;

For tensors, if aligned is false, the stride values are those of the default unaligned stride on page 5. If aligned is true, they are those of the default aligned stride on page 5. For matrices, stride values are computed from the shapes of the corresponding specially shaped tensors, following the same rule.

tensor_gmem represents a tensor or matrix in gmem.

typedef struct {
    u64 addr;
    shape_t shape;
    stride_t stride;
} tensor_gmem;

bmk1880_chip_info returns a structure describing the design parameters of the BM1880 chip.

typedef struct {
    u32 version;
    u32 npu_num;
    u32 eu_num;
    u32 lmem_size;
    u32 lmem_banks;
    u32 lmem_bank_size;
} bmk1880_chip_info_t;

bmk1880_chip_info_t bmk1880_chip_info();

bmk1880_tl_prealloc allocates a tensor_lmem structure on heap memory and constructs it as dictated by the parameters. The parameter la is the starting address in lmem. The aligned field of the tensor_lmem is set to false. If the allocation succeeds, a pointer to the constructed structure is returned; otherwise NULL is returned.

tensor_lmem *bmk1880_tl_prealloc(void *ctx, laddr_t la, shape_t s, fmt_t fmt);

bmk1880_tl_prealloc_align is the same as bmk1880_tl_prealloc, except that the aligned field is set to true.

tensor_lmem *bmk1880_tl_prealloc_align(void *ctx, laddr_t la, shape_t s, fmt_t fmt);

bmk1880_tl_alloc allocates a tensor_lmem structure on heap memory and constructs it as dictated by the parameters. Unlike bmk1880_tl_prealloc, the starting address is not taken from the parameters but is assigned by BMKernel automatically. BMKernel manages the starting addresses in lmem with a simple stack: the starting address in each returned tensor_lmem increases monotonically across successive bmk1880_tl_alloc calls, and the last allocated tensor_lmem must be freed first, using the function bmk1880_tl_free described below. If the available memory in lmem is not enough to satisfy an allocation request, or some other error occurs, a NULL pointer is returned.

tensor_lmem *bmk1880_tl_alloc(void *ctx, shape_t s, fmt_t fmt, u32 ctrls);

The aligned field of the tensor_lmem is set to false when ctrls is CTRL_NULL, and to true when ctrls is CTRL_AL.
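The stack discipline described for bmk1880_tl_alloc can be modelled in a few lines. The sketch below is a toy model, not BMKernel's implementation: the type lmem_stack_t, the MAX_LIVE limit and all function names are hypothetical, and sizes are plain byte counts rather than shapes.

```c
#include <stdbool.h>
#include <stdint.h>

#define MAX_LIVE 16   /* hypothetical cap on simultaneously live allocations */

/* Toy model of a stack-managed local memory. */
typedef struct {
    uint32_t capacity;           /* total bytes available */
    uint32_t top;                /* next free start address */
    uint32_t starts[MAX_LIVE];   /* start addresses of live allocations */
    int depth;                   /* number of live allocations */
} lmem_stack_t;

void lmem_stack_init(lmem_stack_t *s, uint32_t capacity)
{
    s->capacity = capacity;
    s->top = 0;
    s->depth = 0;
}

/* Returns true and writes the start address on success. Successive
 * calls hand out monotonically increasing start addresses. */
bool lmem_stack_alloc(lmem_stack_t *s, uint32_t size, uint32_t *addr)
{
    if (s->depth == MAX_LIVE || size > s->capacity - s->top)
        return false;            /* not enough room left */
    *addr = s->top;
    s->starts[s->depth++] = s->top;
    s->top += size;
    return true;
}

/* Only the most recently allocated region may be freed. */
bool lmem_stack_free(lmem_stack_t *s, uint32_t addr)
{
    if (s->depth == 0 || s->starts[s->depth - 1] != addr)
        return false;            /* violates last-allocated, first-freed */
    s->top = addr;
    s->depth--;
    return true;
}
```

Freeing in any order other than the reverse of allocation is rejected, which is the rule the allocator enforces on callers.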

bmk1880_tl_alloc_bank allocates memory from a specific lmem bank, as dictated by the bank_id parameter.

tensor_lmem *bmk1880_tl_alloc_bank(void *ctx, u32 bank_id, shape_t s, fmt_t fmt, u32 ctrls);

bmk1880_tl_free frees a tensor_lmem structure allocated by bmk1880_tl_prealloc, bmk1880_tl_prealloc_align, bmk1880_tl_alloc or bmk1880_tl_alloc_bank back to heap memory. If the structure was allocated by bmk1880_tl_alloc or bmk1880_tl_alloc_bank, bmk1880_tl_free also increases the available lmem memory managed by BMKernel and checks that the last-allocated, first-freed rule is obeyed (see bmk1880_tl_alloc).

void bmk1880_tl_free(void *ctx, tensor_lmem *tlp);

bmk1880_gdma_copy_gmem instructs the DMA to copy a tensor or matrix within gmem. src and dst must both be tensors or both be matrices, and must contain 8-bit basic data only. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. When src and dst are tensors, ctrls can be CTRL_TP, indicating N/C-transposition. In all other cases, ctrls must be CTRL_NULL.

void *bmk1880_gdma_copy_gmem(void *ctx, tensor_gmem *dst, tensor_gmem *src, ctrl_t ctrls);
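The condition that two differently shaped operands hold the same total number of basic data can be checked on the host before issuing the copy. The sketch below uses a hypothetical dims4_t type instead of the SDK's shape_t so that it stays self-contained; the function names are likewise illustrative.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical 4-D shape used only in this sketch. */
typedef struct {
    uint32_t n, c, h, w;
} dims4_t;

/* Total number of basic data in a tensor of the given shape. */
uint64_t element_count(dims4_t s)
{
    return (uint64_t)s.n * s.c * s.h * s.w;
}

/* True when src and dst may be copied into each other despite
 * having different shapes. */
bool copy_shapes_compatible(dims4_t dst, dims4_t src)
{
    return element_count(dst) == element_count(src);
}
```

For instance, a (2, 3, 4, 5) tensor and a (1, 6, 20, 1) tensor both hold 120 elements and would pass this check.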

bmk1880_gdma_copy_lmem instructs the DMA to copy a tensor (not matrix) within lmem, from src to dst. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. The basic data must be 8-bit.

void *bmk1880_gdma_copy_lmem(void *ctx, tensor_lmem *dst, tensor_lmem *src);

bmk1880_gdma_load instructs the DMA to copy a tensor or matrix from gmem to lmem. The tensor or matrix starts at gaddr in gmem and is strided by default values. When ctrls is CTRL_TP (instead of CTRL_NULL), it indicates N/C-transposition for a tensor, or row/column-transposition for a matrix. The basic data must be 8-bit.

void *bmk1880_gdma_load(void *ctx, tensor_lmem *t, u64 gaddr, ctrl_t ctrls);

bmk1880_gdma_store is similar to bmk1880_gdma_load, but copies the tensor or matrix from lmem to gmem.

void *bmk1880_gdma_store(void *ctx, tensor_lmem *t, u64 gaddr, ctrl_t ctrls);

bmk1880_gdma_load_stride is similar to bmk1880_gdma_load, but lets the user specify the stride values in gmem.

void *bmk1880_gdma_load_stride(void *ctx, tensor_lmem *t, u64 gaddr, stride_t stride, ctrl_t ctrls);

bmk1880_gdma_store_stride is similar to bmk1880_gdma_store, but lets the user specify the stride values in gmem.

void *bmk1880_gdma_store_stride(void *ctx, tensor_lmem *t, u64 gaddr, stride_t stride, ctrl_t ctrls);

bmk1880_gdma_lrn_shift instructs the DMA to compute a tensor (not matrix) dst from tensor src, both of the same shape (N, C, H, W). If right_shift is true, the computation copies the datum at index (n_i, c_i, h_i, w_i) in tensor src to index (n_i, c_i + lrn_step, h_i, w_i) in tensor dst for each 0 ≤ c_i < C − lrn_step, and sets the datum at index (n_i, c_i, h_i, w_i) in tensor dst to zero for each 0 ≤ c_i < lrn_step. If right_shift is false, the computation copies the datum at index (n_i, c_i, h_i, w_i) in tensor src to index (n_i, c_i − lrn_step, h_i, w_i) in tensor dst for each lrn_step ≤ c_i < C, and sets the datum at index (n_i, c_i, h_i, w_i) in tensor dst to zero for each C − lrn_step ≤ c_i < C. The basic data must be 8-bit.

void *bmk1880_gdma_lrn_shift(void *ctx, tensor_lmem *dst, tensor_lmem *src, bool right_shift, int lrn_step);
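The channel shift can be restated from the destination's point of view: each destination channel either receives one source channel or is cleared. The host-side reference below is a sketch, not SDK code. It assumes contiguous NCHW arrays (the real tensors live in lmem with their own strides), a non-negative lrn_step, and a hypothetical function name.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reference model of the LRN channel shift on contiguous NCHW buffers.
 * dst and src each hold N*C*H*W bytes and must not overlap. */
void lrn_shift_ref(uint8_t *dst, const uint8_t *src,
                   int N, int C, int H, int W,
                   bool right_shift, int lrn_step)
{
    size_t plane = (size_t)H * (size_t)W;   /* bytes per (n, c) slice */

    for (int n = 0; n < N; n++) {
        for (int c = 0; c < C; c++) {
            uint8_t *d = dst + ((size_t)n * C + c) * plane;
            /* Source channel feeding destination channel c. */
            int sc = right_shift ? c - lrn_step : c + lrn_step;
            if (sc >= 0 && sc < C)
                memcpy(d, src + ((size_t)n * C + sc) * plane, plane);
            else
                memset(d, 0, plane);        /* channels shifted in are zero */
        }
    }
}
```

With C = 4, lrn_step = 1 and one value per channel, the source (1, 2, 3, 4) becomes (0, 1, 2, 3) for a right shift and (2, 3, 4, 0) otherwise.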

bmk1880_tpu_mul instructs the TPU to compute res_i = (a_i × b_i) ≫ rshift_width for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit. rshift_width is the number of bits by which each result value is shifted right before saturation.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a;
    tensor_lmem *b;
    int rshift_width;
} bmk1880_mul_param_t;

void *bmk1880_tpu_mul(void *ctx, const bmk1880_mul_param_t *p);
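For a single element, the multiply, shift and saturate steps can be modelled on the host as follows. This is a reference sketch for the case of FMT_I8 inputs and an 8-bit signed result; the function name is hypothetical, and the rounding behaviour of the hardware shift is not described in this section, so plain C shift semantics are assumed.

```c
#include <stdint.h>

/* One-element reference model: res = saturate((a * b) >> rshift_width). */
int8_t mul_i8_ref(int8_t a, int8_t b, int rshift_width)
{
    int32_t wide = (int32_t)a * (int32_t)b;   /* widen before multiplying */

    /* Right-shifting a negative value is implementation-defined in C;
     * mainstream compilers perform an arithmetic shift. */
    wide >>= rshift_width;

    if (wide > INT8_MAX)
        return INT8_MAX;
    if (wide < INT8_MIN)
        return INT8_MIN;
    return (int8_t)wide;
}
```

The shift happens on the wide product, so a large rshift_width can bring a product that would otherwise saturate back into the 8-bit range.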

bmk1880_tpu_mul_const is similar to bmk1880_tpu_mul, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a;
    s8 b;
    bool b_is_signed;
    int rshift_width;
} bmk1880_mul_const_param_t;

void *bmk1880_tpu_mul_const(void *ctx, const bmk1880_mul_const_param_t *p);

bmk1880_tpu_mac instructs the TPU to compute res_i = (a_i × b_i + (res_i ≪ lshift_width)) ≫ rshift_width for each datum a_i in tensor a, b_i in tensor b and res_i represented by res_high and res_low together, where res_i, a_i and b_i have the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. The result is a 16-bit tensor if res_is_int8 is false, and an 8-bit tensor otherwise. rshift_width is the number of bits by which each result value is shifted right before saturation. Note that res_high and res_low are used both as the input res_i and as the output res_i. The input res_i is always 16-bit, so both res_high and res_low must be non-NULL. When the result is an 8-bit tensor, it is stored into res_low.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    bool res_is_int8;
    tensor_lmem *a;
    tensor_lmem *b;
    int lshift_width;
    int rshift_width;
} bmk1880_mac_param_t;

void *bmk1880_tpu_mac(void *ctx, const bmk1880_mac_param_t *p);
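The multiply-accumulate formula can be modelled per element as shown below. The sketch assumes signed operands and a 16-bit result (res_is_int8 false), treats the accumulator as an already joined 16-bit value, and uses a hypothetical function name; the hardware's rounding on the right shift is not described here.

```c
#include <stdint.h>

/* One-element reference model:
 * res = saturate16((a * b + (res << lshift_width)) >> rshift_width). */
int16_t mac_i16_ref(int16_t res, int8_t a, int8_t b,
                    int lshift_width, int rshift_width)
{
    /* Multiply by a power of two instead of left-shifting, because
     * shifting a negative value left is undefined in C. */
    int64_t wide = (int64_t)a * b
                 + (int64_t)res * ((int64_t)1 << lshift_width);

    /* Arithmetic right shift on mainstream compilers. */
    wide >>= rshift_width;

    if (wide > INT16_MAX)
        return INT16_MAX;
    if (wide < INT16_MIN)
        return INT16_MIN;
    return (int16_t)wide;
}
```

The left shift rescales the existing accumulator before the new product is added, and the right shift rescales the sum before it is saturated.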

bmk1880_tpu_mac_const is similar to bmk1880_tpu_mac, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    bool res_is_int8;
    tensor_lmem *a;
    s8 b;
    bool b_is_signed;
    int lshift_width;
    int rshift_width;
} bmk1880_mac_const_param_t;

void *bmk1880_tpu_mac_const(void *ctx, const bmk1880_mac_const_param_t *p);

bmk1880_tpu_add instructs the TPU to compute res_i = a_i + b_i for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensors a and b must both be 16-bit, so a_high and b_high must not be NULL. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a_high;
    tensor_lmem *a_low;
    tensor_lmem *b_high;
    tensor_lmem *b_low;
} bmk1880_add_param_t;

void *bmk1880_tpu_add(void *ctx, const bmk1880_add_param_t *p);
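Since 16-bit values are carried as separate high and low 8-bit tensors, an addition conceptually joins the halves, adds, saturates, and splits the result again. The sketch below models this for one signed element. The pair16_t type and the function names are hypothetical, and it assumes that the high part holds bits 15 to 8 of a two's-complement value.

```c
#include <stdint.h>

/* Hypothetical holder for the two 8-bit halves of one 16-bit datum. */
typedef struct {
    uint8_t high;   /* bits 15..8 */
    uint8_t low;    /* bits 7..0 */
} pair16_t;

/* Join the halves into a signed 16-bit value (two's complement). */
int32_t join_i16(pair16_t p)
{
    int32_t v = ((int32_t)p.high << 8) | p.low;
    return v >= 32768 ? v - 65536 : v;
}

/* Saturate to the signed 16-bit range and split into halves. */
pair16_t split_i16(int32_t v)
{
    pair16_t out;
    if (v > INT16_MAX)
        v = INT16_MAX;
    if (v < INT16_MIN)
        v = INT16_MIN;
    uint16_t u = (uint16_t)v;        /* modulo-65536 conversion */
    out.high = (uint8_t)(u >> 8);
    out.low = (uint8_t)(u & 0xFFu);
    return out;
}

/* One-element reference model of res = a + b with saturation. */
pair16_t add_i16_ref(pair16_t a, pair16_t b)
{
    return split_i16(join_i16(a) + join_i16(b));
}
```

For example, 0x0100 + 0x00FF gives 0x01FF, while 0x7FFF + 0x0001 saturates and stays at 0x7FFF.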

bmk1880_tpu_add_const is similar to bmk1880_tpu_add, but tensor b is replaced by a 16-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a_high;
    tensor_lmem *a_low;
    s16 b;
    bool b_is_signed;
} bmk1880_add_const_param_t;

void *bmk1880_tpu_add_const(void *ctx, const bmk1880_add_const_param_t *p);

bmk1880_tpu_sub instructs the TPU to compute res_i = a_i − b_i for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensors a and b must both be 16-bit, so a_high and b_high must not be NULL. The result must be signed, so the fmt field in res_high and res_low must be FMT_I8. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a_high;
    tensor_lmem *a_low;
    tensor_lmem *b_high;
    tensor_lmem *b_low;
} bmk1880_sub_param_t;

void *bmk1880_tpu_sub(void *ctx, const bmk1880_sub_param_t *p);

bmk1880_tpu_max instructs the TPU to compute res_i = max(a_i, b_i) for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensors a and b must be either both signed or both unsigned.

typedef struct {
    tensor_lmem *max;
    tensor_lmem *a;
    tensor_lmem *b;
} bmk1880_max_param_t;

void *bmk1880_tpu_max(void *ctx, const bmk1880_max_param_t *p);

bmk1880_tpu_min is similar to bmk1880_tpu_max, but computes res_i = min(a_i, b_i).

typedef struct {
    tensor_lmem *min;
    tensor_lmem *a;
    tensor_lmem *b;
} bmk1880_min_param_t;

void *bmk1880_tpu_min(void *ctx, const bmk1880_min_param_t *p);

bmk1880_tpu_min_const is similar to bmk1880_tpu_min, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.

typedef struct {
    tensor_lmem *min;
    tensor_lmem *a;
    s8 b;
    bool b_is_signed;
} bmk1880_min_const_param_t;

void *bmk1880_tpu_min_const(void *ctx, const bmk1880_min_const_param_t *p);

bmk1880_tpu_arith_shift instructs the TPU to compute res_i = a_i ≫ bits_i for each datum a_i in tensor a and bits_i in tensor bits, where res_i, a_i and bits_i have the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensor a must be 16-bit and signed, so the fmt fields in a_high and a_low must be FMT_I8. Tensor bits must be signed, and every datum in it must lie in the range [−16, 16]. The result tensor must be 16-bit, so res_high must be non-NULL.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a_high;
    tensor_lmem *a_low;
    tensor_lmem *bits;
} bmk1880_arith_shift_param_t;

void *bmk1880_tpu_arith_shift(void *ctx, const bmk1880_arith_shift_param_t *p);

bmk1880_tpu_and_int8 instructs the TPU to compute res_i = a_i ∧ b_i for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit.

typedef struct {
    tensor_lmem *res;
    tensor_lmem *a;
    tensor_lmem *b;
} bmk1880_and_int8_param_t;

void *bmk1880_tpu_and_int8(void *ctx, const bmk1880_and_int8_param_t *p);

bmk1880_tpu_and_int16 is similar to bmk1880_tpu_and_int8, but all input and output tensors are 16-bit, so res_high, a_high and b_high must be non-NULL.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a_high;
    tensor_lmem *a_low;
    tensor_lmem *b_high;
    tensor_lmem *b_low;
} bmk1880_and_int16_param_t;

void *bmk1880_tpu_and_int16(void *ctx, const bmk1880_and_int16_param_t *p);

bmk1880_tpu_or_int8 is similar to bmk1880_tpu_and_int8, but computes res_i = a_i ∨ b_i.

typedef struct {
    tensor_lmem *res;
    tensor_lmem *a;
    tensor_lmem *b;
} bmk1880_or_int8_param_t;

void *bmk1880_tpu_or_int8(void *ctx, const bmk1880_or_int8_param_t *p);

bmk1880_tpu_or_int16 is similar to bmk1880_tpu_and_int16, but computes res_i = a_i ∨ b_i.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a_high;
    tensor_lmem *a_low;
    tensor_lmem *b_high;
    tensor_lmem *b_low;
} bmk1880_or_int16_param_t;

void *bmk1880_tpu_or_int16(void *ctx, const bmk1880_or_int16_param_t *p);

bmk1880_tpu_xor_int8 is similar to bmk1880_tpu_and_int8, but computes res_i = a_i ⊕ b_i.

typedef struct {
    tensor_lmem *res;
    tensor_lmem *a;
    tensor_lmem *b;
} bmk1880_xor_int8_param_t;

void *bmk1880_tpu_xor_int8(void *ctx, const bmk1880_xor_int8_param_t *p);

bmk1880_tpu_xor_int16 is similar to bmk1880_tpu_and_int16, but computes res_i = a_i ⊕ b_i.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a_high;
    tensor_lmem *a_low;
    tensor_lmem *b_high;
    tensor_lmem *b_low;
} bmk1880_xor_int16_param_t;

void *bmk1880_tpu_xor_int16(void *ctx, const bmk1880_xor_int16_param_t *p);

bmk1880_tpu_copy instructs the TPU to copy tensors within lmem, from src to dst. The basic data must be 8-bit.

typedef struct {
    tensor_lmem *dst;
    tensor_lmem *src;
} bmk1880_copy_param_t;

void *bmk1880_tpu_copy(void *ctx, const bmk1880_copy_param_t *p);

bmk1880_tpu_copy_with_stride is similar to bmk1880_tpu_copy, but the user provides stride_t structures specifying the layouts of tensors dst and src. The basic data must be 8-bit.

typedef struct {
    tensor_lmem *dst;
    stride_t dst_stride;
    tensor_lmem *src;
    stride_t src_stride;
} bmk1880_copy_with_stride_param_t;

void *bmk1880_tpu_copy_with_stride(void *ctx, const bmk1880_copy_with_stride_param_t *p);

bmk1880_tpu_mdsum instructs the TPU to compute a tensor res of shape (1, C, 1, 1) from a tensor a of shape (N, C, H, W). Every datum res_{c_i} of index (0, c_i, 0, 0) in tensor res is computed as

$$res_{c_i} = \sum_{n_i=0}^{N-1} \sum_{h_i=0}^{H-1} \sum_{w_i=0}^{W-1} a_{n_i, c_i, h_i, w_i}$$

where a_{n_i, c_i, h_i, w_i} is the datum of index (n_i, c_i, h_i, w_i) in tensor a. The basic data in all tensor_lmem structures must be 8-bit. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit. a and res must be either both signed or both unsigned.

typedef struct {
    tensor_lmem *res_high;
    tensor_lmem *res_low;
    tensor_lmem *a;
} bmk1880_mdsum_param_t;

void *bmk1880_tpu_mdsum(void *ctx, const bmk1880_mdsum_param_t *p);

bmk1880_tpu_lut instructs the TPU to compute a tensor res from tensor idx, using tensor table as a lookup table and the values in tensor idx as indices. Tensor table must be of shape (1, slices, 16, 16), where slices is the number of lmem slices. Tensors idx and res must have the same shape. Assuming their shape is (N, C, H, W), the datum $res_i$ at index $(n_i, c_i, h_i, w_i)$ in tensor res is computed from $idx_i$ at the same index in tensor idx as $res_i = table_i$, where $table_i$ is at index $(0, c_t, \lfloor idx_i / 16 \rfloor, idx_i \bmod 16)$ in tensor table, and $c_t$ is the index of the lmem slice in which the datum $idx_i$ resides. The basic data in all tensor_lmem structures must be 8-bit.

typedef struct {
    tensor_lmem *ofmap;
    tensor_lmem *ifmap;
    tensor_lmem *table;
} bmk1880_lut_param_t;

void *bmk1880_tpu_lut(void *ctx, const bmk1880_lut_param_t *p);
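The table addressing can be illustrated for a single slice on the host. The sketch below assumes one 16 x 16 slice stored row-major in a 256-entry buffer and treats the third table index as the integer quotient of the index value by 16; it does not model how slices map onto lmem. `lut_row`, `lut_col`, `lut_lookup` and `lut_apply` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Position of an 8-bit index inside one 16 x 16 table slice.
int lut_row(uint8_t idx) { return idx / 16; }
int lut_col(uint8_t idx) { return idx % 16; }

// Look up one value in a single slice stored row-major (256 entries).
uint8_t lut_lookup(const std::vector<uint8_t>& slice, uint8_t idx) {
  assert(slice.size() == 256);
  return slice[static_cast<std::size_t>(lut_row(idx)) * 16 + lut_col(idx)];
}

// Apply the lookup to every index value; the output has the same length.
std::vector<uint8_t> lut_apply(const std::vector<uint8_t>& slice,
                               const std::vector<uint8_t>& idx) {
  std::vector<uint8_t> res;
  res.reserve(idx.size());
  for (uint8_t i : idx) res.push_back(lut_lookup(slice, i));
  return res;
}
```

Since row * 16 + col equals the index value itself, each slice behaves as a flat 256-entry table; the (16, 16) shape only describes how those entries are arranged.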

bmk1880_tpu_relu instructs the TPU to compute $res_i = \max(0, a_i)$ for each datum $a_i$ in tensor a, where $res_i$ and $a_i$ are at the same index. The basic data in all tensor_lmem structures must be 8-bit.

typedef struct {
    tensor_lmem *ofmap;
    tensor_lmem *ifmap;
} bmk1880_relu_param_t;

void *bmk1880_tpu_relu(void *ctx, const bmk1880_relu_param_t *p);

bmk1880_tpu_conv instructs the TPU to compute a tensor ofmap from tensors ifmap, weight and bias, using ifmap as the input feature map, weight as the convolution kernel and bias as the bias to be added to the convolution result. relu_enable may be true, indicating that ReLU activations are applied after adding the bias values but before shifting every basic datum. rshift_width specifies the number of bits by which every basic datum is shifted rightward after the optional ReLU activations.

typedef struct {
    tensor_lmem *ofmap;
    tensor_lmem *ifmap;
    tensor_lmem *weight;
    tensor_lmem *bias;
    u8 ins_h, ins_last_h;
    u8 ins_w, ins_last_w;
    u8 pad_top, pad_bottom;
    u8 pad_left, pad_right;
    u8 stride_h, stride_w;
    u8 dilation_h, dilation_w;
    bool relu_enable;
    int rshift_width;
} bmk1880_conv_param_t;

void *bmk1880_tpu_conv(void *ctx, const bmk1880_conv_param_t *p);
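The order of the post-processing steps (add bias, optional ReLU, then right shift) can be sketched per output datum as below. This is a host-side illustration only: `conv_postprocess` is a hypothetical name, the accumulator is assumed to be 32-bit, and neither rounding nor saturation to 8 bits is modeled because the text above does not specify them.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Post-process one convolution accumulator value.
// ASSUMPTIONS: 32-bit accumulator, plain arithmetic right shift, no saturation.
int32_t conv_postprocess(int32_t acc, int16_t bias, bool relu_enable,
                         int rshift_width) {
  assert(rshift_width >= 0 && rshift_width < 32);
  int32_t v = acc + bias;                          // 1. add the bias value
  if (relu_enable) v = std::max<int32_t>(v, 0);    // 2. optional ReLU
  return v >> rshift_width;                        // 3. shift rightward
}
```

For example, with acc = 100, bias = 28 and rshift_width = 3 the result is 128 >> 3 = 16, with or without ReLU; with bias = -140 and ReLU enabled the sum is rectified to 0 before the shift.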

ofmap and ifmap must be aligned (see BMKernel 1880 Guide.pdf).

weight has a special layout that is very different from the one described in the programming model (see BMKernel 1880 Guide.pdf). If ifmap is of shape $(N_{in}, C_{in}, H_{in}, W_{in})$, ofmap is of shape $(N_{out}, C_{out}, H_{out}, W_{out})$ and the convolution kernels are of shape $(H_{kernel}, W_{kernel})$, then weight should be of shape $(C_{in}, C_{out}, H_{kernel}, W_{kernel})$.

The layout of weight, however, is as if it were of shape $(1, C_{out}, H_{kernel} \times W_{kernel}, C_{in})$. This special layout can be precisely defined by applying the following stride values to weight’s logical shape $(C_{in}, C_{out}, H_{kernel}, W_{kernel})$:
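Independently of the exact stride values, the as-if shape implies an index mapping that can be sketched on the host. The first function below maps a logical weight index (ci, co, hi, wi) to its index in the as-if shape (1, Cout, Hkernel × Wkernel, Cin). The second function additionally assumes a dense row-major host buffer of that as-if shape; that is an assumption made for illustration and does not describe lmem placement. Both function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Map logical index (ci, co, hi, wi) in shape (Cin, Cout, Hk, Wk)
// to the index in the as-if shape (1, Cout, Hk * Wk, Cin).
std::array<int, 4> weight_as_if_index(int ci, int co, int hi, int wi,
                                      int Cin, int Cout, int Hk, int Wk) {
  assert(0 <= ci && ci < Cin && 0 <= co && co < Cout);
  assert(0 <= hi && hi < Hk && 0 <= wi && wi < Wk);
  return std::array<int, 4>{{0, co, hi * Wk + wi, ci}};
}

// ASSUMPTION: offset inside a dense row-major host buffer of the as-if shape.
std::size_t weight_dense_offset(int ci, int co, int hi, int wi,
                                int Cin, int Cout, int Hk, int Wk) {
  const std::array<int, 4> idx =
      weight_as_if_index(ci, co, hi, wi, Cin, Cout, Hk, Wk);
  return (static_cast<std::size_t>(idx[1]) * Hk * Wk + idx[2]) * Cin + idx[3];
}
```

The mapping shows that the input channel varies fastest, then the position inside the kernel window, then the output channel.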

bias may be NULL, indicating no bias values. If it is non-NULL, and assuming ofmap is of shape (N, C, H, W), then bias must be a 16-bit tensor of shape (1, C, 1, 1). Since a 16-bit tensor is stored as two 8-bit tensors in lmem, bias’s tensor_lmem structure must be of shape (2, C, 1, 1) and must be unaligned (see unaligned stride values in section 2.4 BMKernel 1880 Guide.pdf). During the phase of adding bias, the value of the datum at index $(0, c_i, 0, 0)$ in the 16-bit tensor is added to all data in ofmap whose C-dimension index is $c_i$.

param contains detailed convolution parameters that fall into four categories by function: insertion, padding, striding and dilation parameters, which are detailed below.

Insertion parameters specify the number of zeros to be inserted at specific locations within ifmap. They are ins_h, ins_last_h, ins_w and ins_last_w. ins_h specifies the number of zeros to be inserted after every non-last basic datum along the H-dimension. Consider an ifmap of shape (N, C, H, W), for example. After inserting zeros, ifmap′ is of shape (N, C, H′, W), where $H' = 1 + (H - 1) \times (ins\_h + 1)$. Denoting by $x_{n_i,c_i,h_i,w_i}$ the value of the basic datum at index $(n_i, c_i, h_i, w_i)$ of tensor ifmap, and by $x'_{n_i,c_i,h_i,w_i}$ that of tensor ifmap′, the following holds:

$$x'_{n_i,c_i,h_i,w_i} = \begin{cases} x_{n_i,c_i,h_i/(ins\_h+1),w_i} & \text{if } h_i \bmod (ins\_h + 1) = 0 \\ 0 & \text{otherwise} \end{cases}$$

ins_last_h specifies the number of zeros to be inserted only after every last basic datum. Similarly, ins_w and ins_last_w specify the number of zeros to be inserted along the W-dimension.

Padding parameters specify the number of zeros to be inserted around the elements of ifmap. pad_top specifies the number of zeros to be inserted before every first basic datum along the H-dimension. pad_bottom specifies the number inserted after every last basic datum along the H-dimension. Similarly, pad_left and pad_right specify the numbers along the W-dimension.

Striding parameters specify the number of basic data the convolution kernel strides over after each convolution step. stride_h and stride_w specify the number along the H-dimension and W-dimension, respectively.

Dilation parameters specify the dilation of the convolution kernel weight. That is, (dilation_h − 1) zeros are inserted between every two adjacent basic data of the kernel along the H-dimension. Similarly, (dilation_w − 1) zeros are inserted along the W-dimension.
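The effect of the insertion parameters along one dimension, and the resulting output extent, can be sketched as follows. `insert_zeros` follows the description above directly. `conv_out_extent` combines it with the usual convolution size arithmetic for padding, dilation and stride; that combination is an assumption, since this section does not state the output-size formula explicitly. Both names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Insert `ins` zeros after every non-last element and `ins_last` zeros
// after the last element of a 1-D run of data.
std::vector<int> insert_zeros(const std::vector<int>& x, int ins, int ins_last) {
  assert(!x.empty() && ins >= 0 && ins_last >= 0);
  std::vector<int> out;
  for (std::size_t i = 0; i < x.size(); ++i) {
    out.push_back(x[i]);
    const int zeros = (i + 1 < x.size()) ? ins : ins_last;
    out.insert(out.end(), static_cast<std::size_t>(zeros), 0);
  }
  return out;
}

// ASSUMPTION: standard convolution arithmetic applied to the expanded input.
int conv_out_extent(int in, int ins, int ins_last, int pad_before,
                    int pad_after, int kernel, int dilation, int stride) {
  const int expanded =
      1 + (in - 1) * (ins + 1) + ins_last + pad_before + pad_after;
  const int kernel_extent = 1 + (kernel - 1) * dilation;
  assert(stride > 0 && expanded >= kernel_extent);
  return (expanded - kernel_extent) / stride + 1;
}
```

With three input data, ins = 1 and ins_last = 2, the expanded run is 1, 0, 2, 0, 3, 0, 0, whose length 7 matches 1 + (3 − 1) × (1 + 1) plus the two trailing zeros.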

Similar to bmk1880_tpu_conv, but uses the Winograd algorithm to accelerate the computation. Moreover, weight must contain only 3 × 3 kernels and must be default strided in lmem (see section 2.4 BMkernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in function bmk1880_tpu_conv (see section 4.41 BMkernel 1880 Guide.pdf).

typedef struct {
    tensor_lmem *ofmap;
    tensor_lmem *ifmap;
    tensor_lmem *weight;
    tensor_lmem *bias;
    u8 ins_h, ins_last_h;
    u8 ins_w, ins_last_w;
    u8 pad_top, pad_bottom;
    u8 pad_left, pad_right;
    bool relu_enable;
    int rshift_width;
} bmk1880_winograd_param_t;

void *bmk1880_tpu_winograd(void *ctx, const bmk1880_winograd_param_t *p);

Similar to bmk1880_tpu_conv, but computes a depthwise convolution. Moreover, weight is default strided in lmem (see section 2.4 BMkernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in function bmk1880_tpu_conv.

typedef struct {
    tensor_lmem *ofmap;
    tensor_lmem *ifmap;
    tensor_lmem *weight;
    tensor_lmem *bias;
    u8 ins_h, ins_last_h;
    u8 ins_w, ins_last_w;
    u8 pad_top, pad_bottom;
    u8 pad_left, pad_right;
    u8 stride_h, stride_w;
    int rshift_width;
} bmk1880_depthwise_param_t;

void *bmk1880_tpu_depthwise(void *ctx, const bmk1880_depthwise_param_t *p);

bmk1880_tpu_max_pooling instructs the TPU to compute a tensor ofmap from tensor ifmap by doing a (kh × kw) max pooling over ifmap. The size parameters of the pooling kernel, kh and kw, are specified in param. Other parameters in param are similar to those of the same names in bmk1880_conv_param_t.

ofmap and ifmap must be aligned.

typedef struct {
    tensor_lmem *ofmap;
    tensor_lmem *ifmap;
    u8 kh, kw;
    u8 ins_h, ins_last_h;
    u8 ins_w, ins_last_w;
    u8 pad_top, pad_bottom;
    u8 pad_left, pad_right;
    u8 stride_h, stride_w;
} bmk1880_max_pooling_param_t;

void *bmk1880_tpu_max_pooling(void *ctx, const bmk1880_max_pooling_param_t *p);

Similar to bmk1880_tpu_max_pooling, but does an average pooling over ifmap as controlled by avg_pooling_const. At every pooling step, all related basic data in ifmap are summed together, multiplied by avg_pooling_const, and then shifted rightward by rshift_width bits.

typedef struct {
    tensor_lmem *ofmap;
    tensor_lmem *ifmap;
    u8 kh, kw;
    u8 ins_h, ins_last_h;
    u8 ins_w, ins_last_w;
    u8 pad_top, pad_bottom;
    u8 pad_left, pad_right;
    u8 stride_h, stride_w;
    u8 avg_pooling_const;
    int rshift_width;
} bmk1880_avg_pooling_param_t;

void *bmk1880_tpu_avg_pooling(void *ctx, const bmk1880_avg_pooling_param_t *p);
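Because the average is formed as sum × avg_pooling_const >> rshift_width, the two parameters together approximate the factor 1 / (kh × kw). One possible way to choose them is sketched below: pick the largest shift for which the rounded constant still fits in a u8. This selection strategy, the helper names `AvgScale`, `choose_avg_scale` and `avg_pool_step`, and the absence of rounding in the shift are all assumptions; the range of rshift_width accepted by the hardware is not stated here.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical pair of values for avg_pooling_const and rshift_width.
struct AvgScale {
  uint8_t avg_pooling_const;
  int rshift_width;
};

// Choose const / 2^shift as close as possible to 1 / (kh * kw), const <= 255.
AvgScale choose_avg_scale(int kh, int kw) {
  assert(kh > 0 && kw > 0);
  const int area = kh * kw;
  AvgScale best{1, 0};
  for (int shift = 0; shift < 31; ++shift) {
    const long long c = ((1LL << shift) + area / 2) / area;  // rounded quotient
    if (c > 255) break;
    if (c >= 1) best = {static_cast<uint8_t>(c), shift};
  }
  return best;
}

// One pooling step on non-negative data: sum, multiply, shift rightward.
int avg_pool_step(const std::vector<int>& window, AvgScale s) {
  int sum = 0;
  for (int v : window) sum += v;
  return (sum * s.avg_pooling_const) >> s.rshift_width;
}
```

For a 3 × 3 window this yields a constant of 228 with a shift of 11, that is 228 / 2048 ≈ 0.1113 compared with the exact 1 / 9 ≈ 0.1111.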

bmk1880_tpu_matrix_mac instructs the TPU to compute a matrix res by multiplying the left matrix left with the right matrix right, then adding matrix bias (if not NULL), and finally shifting right by rshift_width bits. Note that all tensor_lmem structures involved must be matrices instead of tensors. ctrls may have the CTRL_RELU or the CTRL_RA flag set, but not both. After adding bias but before right shifting, ReLU activations are performed, in which negative values are rectified to 0, if ctrls is CTRL_RELU; or the original values in res are shifted leftward by lshift_width bits and then added to the results if ctrls is CTRL_RA. res_is_int8 indicates whether the result is 8-bit or 16-bit.

The use of the res matrix is unusual when ctrls is CTRL_RA or when res_is_int8 is false. Assume that the result is a matrix of shape (R, C). When ctrls is CTRL_RA, the original result is a 16-bit matrix of shape (R, C) represented by res. Since a 16-bit matrix’s high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, res’s tensor_lmem structure must be of 8-bit format (FMT_I8 or FMT_U8), must be of shape (R × 2, C), and must be aligned (see aligned stride values in section 2.4 BMkernel 1880 Guide.pdf). When res_is_int8 is false, the final result is a 16-bit matrix similarly represented by res. When ctrls is CTRL_RA but res_is_int8 is true, the original result is 16-bit while the final result is 8-bit. In this case, only the low 8-bit parts (located at lower addresses) of the res matrix are written with the final result. In the remaining case, where both the original and the final result are 8-bit matrices, res is a normal 8-bit matrix of shape (R, C).

Note that bias is different from those in bmk1880_tpu_conv, bmk1880_tpu_winograd or bmk1880_tpu_depthwise. Firstly, it is a matrix. Moreover, if res is of shape (R, C), then bias must be a 16-bit matrix of shape (1, C). Since a 16-bit matrix’s high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, bias’s tensor_lmem structure must be of shape (2, C) and must be aligned.

res, left, right and bias must all be aligned.

typedef struct {
    tensor_lmem *res;
    tensor_lmem *left;
    tensor_lmem *right;
    tensor_lmem *bias;
    int lshift_width;
    int rshift_width;
    bool res_is_int8;
    ctrl_t ctrls;
} bmk1880_matrix_mac_param_t;

void *bmk1880_tpu_matrix_mac(void *ctx, const bmk1880_matrix_mac_param_t *p);
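The sequence of steps described for bmk1880_tpu_matrix_mac (multiply, add bias, then either ReLU or accumulation of the left-shifted original result, then right shift) can be emulated on the host as sketched below. The enum `Ctrl` is a hypothetical stand-in for the ctrls flags, not the SDK's ctrl_t. Matrices are assumed to be dense and row-major, results are kept in 32 bits, and rounding, saturation and the split into high and low 8-bit parts are not modeled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the ctrls field; not the SDK's ctrl_t.
enum class Ctrl { None, Relu, Ra };

// left is R x K, right is K x C, bias has C entries (empty means no bias),
// orig is R x C and is only read when ctrl == Ctrl::Ra.
std::vector<int32_t> matrix_mac_ref(const std::vector<int8_t>& left,
                                    const std::vector<int8_t>& right,
                                    const std::vector<int16_t>& bias,
                                    const std::vector<int32_t>& orig,
                                    int R, int K, int C, Ctrl ctrl,
                                    int lshift_width, int rshift_width) {
  assert(left.size() == static_cast<std::size_t>(R) * K);
  assert(right.size() == static_cast<std::size_t>(K) * C);
  assert(bias.empty() || bias.size() == static_cast<std::size_t>(C));
  assert(ctrl != Ctrl::Ra || orig.size() == static_cast<std::size_t>(R) * C);
  std::vector<int32_t> res(static_cast<std::size_t>(R) * C, 0);
  for (int r = 0; r < R; ++r) {
    for (int c = 0; c < C; ++c) {
      int32_t acc = 0;
      for (int k = 0; k < K; ++k)
        acc += int32_t(left[r * K + k]) * int32_t(right[k * C + c]);
      if (!bias.empty()) acc += bias[c];                        // add bias
      if (ctrl == Ctrl::Relu) acc = std::max<int32_t>(acc, 0);  // rectify
      if (ctrl == Ctrl::Ra)                                     // accumulate
        acc += orig[r * C + c] * (int32_t(1) << lshift_width);
      res[r * C + c] = acc >> rshift_width;                     // shift right
    }
  }
  return res;
}
```

Keeping the accumulation path separate from the ReLU path mirrors the rule that CTRL_RELU and CTRL_RA are mutually exclusive.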

bmk1880_tpu_matrix_mac_2 instructs the TPU to compute a matrix res by multiplying the left matrix left with the right matrix right. left, right and res must be tensors, though the computation is a matrix multiplication. res and left must be of shape (1, 256, 1, 256). right must be of shape (256, 16, 1, 16). The basic data in all tensor_lmem structures must be 8-bit.

typedef struct {
    tensor_lmem *res;
    tensor_lmem *left;
    tensor_lmem *right;
} bmk1880_matrix_mac_2_param_t;

void *bmk1880_tpu_matrix_mac_2(void *ctx, const bmk1880_matrix_mac_2_param_t *p);

TensorOp represents a BMNET IR, which is a bridge between the front end and the back end. It provides member methods for setting information on the IR and reading information from it. Below is the prototype:

namespace bmnet {
class TensorOp {
public:
    int input_shape_size();
    int output_shape_size();
    const TensorShape& input_shape(int index);
    const TensorShape& output_shape(int index);
    TensorShape* add_output_shape();
    u64 global_input(int index);
    u64 global_output(int index);
    TGCustomizedParameter* mutable_tg_customized_param();
    const TGCustomizedParameter& tg_customized_param();
};
}

int TensorOp::input_shape_size()

Return the number of inputs.

int TensorOp::output_shape_size()

Return the number of outputs.

const TensorShape& TensorOp::input_shape(int index)

Return the shape of the input at the given index.

Parameter | Type | Description |

index | int | [Required] Index of the input whose shape is returned. |

const TensorShape& TensorOp::output_shape(int index)

Return the shape of the output at the given index.

Parameter | Type | Description |

index | int | [Required] Index of the output whose shape is returned. |

TensorShape* TensorOp::add_output_shape()

Return a mutable pointer to a newly added TensorShape in the outputs. The returned TensorShape can be modified later.

u64 TensorOp::global_input(int index)

Return the offset, in device memory, of the input tensor at the given index.

Parameter | Type | Description |

index | int | [Required] Index of the input whose offset is returned. |

u64 TensorOp::global_output(int index)

Return the offset, in device memory, of the output tensor at the given index.

Parameter | Type | Description |

index | int | [Required] Index of the output whose offset is returned. |

TGCustomizedParameter* TensorOp::mutable_tg_customized_param()

Return a mutable pointer to parameters of customized BMNET IR.

const TGCustomizedParameter& TensorOp::tg_customized_param()

Return a reference to the customized BMNET IR’s parameters.

CustomizedCaffeLayer is an abstract class used to implement a Layer that converts a CAFFE layer into BMNet IR (please refer to Chapter 5 for details about BMNet IR). If you want to introduce a customized CAFFE layer into BMNet, inherit from this class and implement all of its pure virtual functions. CustomizedCaffeLayer inherits from the CaffeLayer/Layer classes. Below are their prototypes:

namespace bmnet {
class Layer {
public:
    Layer();
    virtual ~Layer(void);
    virtual std::string layer_name() = 0;
    virtual void dump() = 0;
    virtual void codegen(TensorOp *op) = 0;
protected:
    void add_output_offset(int offset);
};
}

namespace bmnet {
class CaffeLayer : public Layer {
public:
    CaffeLayer() {}
    virtual ~CaffeLayer(void);
protected:
    caffe::LayerParameter &layer_;
};
}

namespace bmnet {
class CustomizedCaffeLayer : public CaffeLayer {
public:
    CustomizedCaffeLayer();
    ~CustomizedCaffeLayer();
    void setup(TensorOp* op) override {
        ......
        TGCustomizedParameter* param = op->mutable_tg_customized_param();
        param->set_sub_type(layer_name());
    }
};
}

std::string CustomizedCaffeLayer::layer_name()

Pure virtual function. Returns the type of the newly added CAFFE layer.

void CustomizedCaffeLayer::dump()

Pure virtual function, used to print information about the CAFFE layer.

void CustomizedCaffeLayer::setup(TensorOp* op)

Optional. It only sets the sub type of the customized layer and is implemented by default. If a child class overrides it, the parent class's setup function must be called first.

void CustomizedCaffeLayer::codegen(TensorOp *op)

Pure virtual function, used to set up the BMNET IR according to the LayerParameter of the CAFFE layer. In this function, you should set up the output shape and fill parameters into the TensorOp.

Parameter | Type | Description |

op | TensorOp* | [Required] Pointer to an instance of BMNET IR. |

void CustomizedCaffeLayer::add_output_offset(int offset)

Protected member method. It should be called when setting up the output offset of the layer’s top.

Parameter | Type | Description |

offset | int | [Required] offset of output, should be 0. |

caffe::LayerParameter& CustomizedCaffeLayer::layer_

Protected member variable, which is a reference to the customized CAFFE layer’s LayerParameter.
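How the members described above fit together can be sketched with a small derived class. The sketch below does not include the real bmnet headers. `TensorShape`, `TGCustomizedParameter`, `TensorOp` and `CustomizedCaffeLayer` are minimal stand-ins written only for this illustration, with just the methods named in the prototypes above; the `dim` and `sub_type` fields, the public `output_offsets` vector and the layer `MyEltwiseLayer` are hypothetical. The real classes live in namespace bmnet and may differ.

```cpp
#include <cassert>
#include <iostream>
#include <string>
#include <utility>
#include <vector>

// ---- Minimal stand-ins for bmnet types (hypothetical, for illustration) ----
struct TensorShape { std::vector<int> dim; };
struct TGCustomizedParameter {
  std::string sub_type;
  void set_sub_type(const std::string& s) { sub_type = s; }
};
class TensorOp {
 public:
  explicit TensorOp(std::vector<TensorShape> in) : inputs_(std::move(in)) {}
  int input_shape_size() const { return static_cast<int>(inputs_.size()); }
  int output_shape_size() const { return static_cast<int>(outputs_.size()); }
  const TensorShape& input_shape(int i) const { return inputs_.at(i); }
  const TensorShape& output_shape(int i) const { return outputs_.at(i); }
  TensorShape* add_output_shape() {
    outputs_.emplace_back();
    return &outputs_.back();
  }
  TGCustomizedParameter* mutable_tg_customized_param() { return &param_; }
  const TGCustomizedParameter& tg_customized_param() const { return param_; }
 private:
  std::vector<TensorShape> inputs_, outputs_;
  TGCustomizedParameter param_;
};
class CustomizedCaffeLayer {
 public:
  virtual ~CustomizedCaffeLayer() = default;
  virtual std::string layer_name() = 0;
  virtual void dump() = 0;
  virtual void codegen(TensorOp* op) = 0;
  virtual void setup(TensorOp* op) {
    op->mutable_tg_customized_param()->set_sub_type(layer_name());
  }
  std::vector<int> output_offsets;  // public only so the sketch can be checked
 protected:
  void add_output_offset(int offset) { output_offsets.push_back(offset); }
};

// ---- The part a user would write: an element-wise customized layer ----
class MyEltwiseLayer : public CustomizedCaffeLayer {
 public:
  std::string layer_name() override { return "MyEltwise"; }
  void dump() override { std::cout << "layer: " << layer_name() << "\n"; }
  void setup(TensorOp* op) override {
    CustomizedCaffeLayer::setup(op);  // the parent setup must be called first
  }
  void codegen(TensorOp* op) override {
    assert(op->input_shape_size() >= 1);
    *op->add_output_shape() = op->input_shape(0);  // output shape = input shape
    add_output_offset(0);                          // offset should be 0
  }
};
```

An element-wise layer is used here because its output shape equals its input shape, so the sketch needs nothing beyond the documented accessors to fill in the TensorOp.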

CustomizedTensorFixedInst is an abstract class used to implement a Layer that converts BMNET IR into instructions through the BMKernel APIs. Inherit from this class and implement all of its pure virtual functions. CustomizedTensorFixedInst inherits from the TensorFixedInst/TensorInst classes. Below are their prototypes:

namespace bmnet {class TensorFixedInst: public TensorInst {public:TensorFixedInst() : TensorInst() {}TensorFixedInst(TensorOp &op) : TensorInst(op) {}virtual ~ TensorFixedInst (void);void SetCalibrationParameter(const LayerCalibrationParameter &calibration_parameter) {m_calibrationParameter = calibration_parameter;}void AddInputCalibrationParameter(const LayerCalibrationParameter &calibration_parameter){m_inputCalibrationParameter.push_back(calibration_parameter);}protected:LayerCalibrationParameter m_calibrationParameter;