# BMNNSDK API v1

BMNNSDK provides a lightweight set of C/C++ APIs for deep learning application developers. It consists of the BM Runtime Library, the BMKernel Library and the BMNet Library, each of which is described in detail in this section.

## BM Runtime Library

### bm_init

bm_init() initializes the BM device and creates a handle to the BM context.

| Parameter | Type | Description |
|-----------|------|-------------|
| index | Input | Not used now. Set 0 as the default value. |
| ctx | Output | The pointer to the BM context handle. |

### bm_exit

bm_exit() must be called before application exits. It will release all internal resources.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle which is created by bm_init(). |

### bm_enum_devices

bm_enum_devices() enumerates all BM devices in the system.

| Parameter | Type | Description |
|-----------|------|-------------|
| count | Output | The number of BM devices. |
| devinfo | Output | The array of device info. |

### bm_device_open

bm_device_open() opens a BM device.

| Parameter | Type | Description |
|-----------|------|-------------|
| index | Input | The index of the BM device. |
| dev | Output | The pointer to the BM device handle. |

### bm_device_close

bm_device_close() closes an opened BM device.

| Parameter | Type | Description |
|-----------|------|-------------|
| dev | Input | The BM device handle. |

### bm_device_query

bm_device_query() always returns BM_ERR_NOT_SUPPORTED now.

### bm_device_config

bm_device_config() always returns BM_ERR_NOT_SUPPORTED now.

### bm_device_get_info

bm_device_get_info() returns the information of a BM device.

| Parameter | Type | Description |
|-----------|------|-------------|
| dev | Input | The BM device handle. |

### bm_context_create

bm_context_create() creates a BM context.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Output | The pointer to the BM context handle. |

### bm_context_destroy

bm_context_destroy() destroys a BM context.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |

### bm_bind_device

bm_bind_device() binds a BM context with a BM device.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| dev | Input | The BM device handle. |

### bm_unbind_device

bm_unbind_device() unbinds a BM context with the BM device.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |

### bm_get_device

bm_get_device() returns the BM device handle which is bound with the BM context.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |

### bmruntime_bmkernel_create

bmruntime_bmkernel_create() creates a BM kernel with the BM context. The p_bk_ctx points to a thread-local variable, so this API can be used to create multiple independent BM kernel contexts in multiple threads. However, a thread must not own more than one BM kernel context at a time; otherwise memory will leak.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| p_bk_ctx | Output | The pointer to the BM kernel handle. |

### bmruntime_bmkernel_submit

bmruntime_bmkernel_submit() submits the BM kernel associated with the BM context for execution.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |

### bmruntime_bmkernel_destroy

bmruntime_bmkernel_destroy() destroys the BM kernel with the BM context.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
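
The create/submit/destroy cycle above might look like the following sketch. This is pseudocode-style C: the handle types (bmctx_t) and exact signatures are assumptions inferred from the parameter tables, not confirmed prototypes.

```c
/* Sketch only: types and signatures are inferred from the tables above. */
void *worker(void *arg) {
    bmctx_t ctx = *(bmctx_t *)arg;
    void *bk_ctx;
    bmruntime_bmkernel_create(ctx, &bk_ctx);   /* one kernel context per thread */
    /* ... emit BMKernel operations through bk_ctx ... */
    bmruntime_bmkernel_submit(ctx);            /* run the generated commands */
    bmruntime_bmkernel_destroy(ctx);           /* destroy before creating another */
    return NULL;
}
```

Because the kernel context is thread-local, each worker thread runs this cycle independently; the destroy call must come before any second create in the same thread.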

### bmmem_device_alloc_raw

bmmem_device_alloc_raw() allocates device memory of the given size.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| size | Input | The size of the device memory. |

### bmmem_device_prealloc_raw

bmmem_device_prealloc_raw() allows the application to allocate memory from previously allocated device memory. The range to be allocated must fall within the previously allocated device memory.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| mem | Input | The previously allocated device memory. |
| offset | Input | The offset in the previously allocated device memory. |
| size | Input | The size of the device memory. |

### bmmem_device_alloc

bmmem_device_alloc() allocates device memory according to the given shape.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| shape | Input | The shape of the device memory. |

### bmmem_device_prealloc

bmmem_device_prealloc() allows the application to allocate device memory of a given shape from previously allocated device memory.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| mem | Input | The previously allocated device memory. |
| offset | Input | The offset in the previously allocated device memory. |
| shape | Input | The shape of the device memory. |

### bmmem_device_free

bmmem_device_free() frees device memory that was allocated by the allocation functions above.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| mem | Input | The device memory handle. |

### bmmem_host_alloc

bmmem_host_alloc() always returns BM_ERR_NOT_SUPPORTED now.

### bmmem_host_free

bmmem_host_free() always returns BM_ERR_NOT_SUPPORTED now.

### bmmem_device_size

bmmem_device_size() returns the device memory size.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| mem | Input | The device memory handle. |

### bmmem_device_addr

bmmem_device_addr() returns the device memory address.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| mem | Input | The device memory handle. |

### bmmem_host_v_addr

bmmem_host_v_addr() always returns BM_ERR_NOT_SUPPORTED now.

### bmmem_host_p_addr

bmmem_host_p_addr() always returns BM_ERR_NOT_SUPPORTED now.

### bm_memcpy_s2d

bm_memcpy_s2d() copies data from system memory to device memory ("s" stands for system, "d" for device).

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| dst | Input | The device memory handle. |
| src | Input | The system memory pointer. |

### bm_memcpy_d2s

bm_memcpy_d2s() copies data from device memory to system memory.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| dst | Input | The system memory pointer. |
| src | Input | The device memory handle. |
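
A typical data-transfer round trip through the functions above might look like the following sketch. This is pseudocode-style C: the handle types (bmctx_t, bmmem_device_t) and exact signatures are assumptions inferred from the parameter tables, not confirmed prototypes.

```c
/* Sketch only: types and signatures are inferred from the tables above. */
bmctx_t ctx;
bm_init(0, &ctx);                                /* index 0, as recommended */

bmmem_device_t dmem = bmmem_device_alloc_raw(ctx, 4096);

uint8_t host_in[4096], host_out[4096];
bm_memcpy_s2d(ctx, dmem, host_in);               /* system -> device */
/* ... run some computation on the device ... */
bm_memcpy_d2s(ctx, host_out, dmem);              /* device -> system */

bmmem_device_free(ctx, dmem);
bm_exit(ctx);                                    /* release all internal resources */
```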

### bmnet_register

bmnet_register() registers a neural network with BM network info.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| info | Input | The BM network info. |
| net | Output | The registered network handle. |

### bmnet_register_bmodel

bmnet_register_bmodel() registers a neural network from a bmodel file.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| bmodel | Input | The bmodel filename. |
| net | Output | The registered network handle. |

### bmnet_register_noalloc

bmnet_register_noalloc() registers a compiled neural network without allocating weight and neuron device memory.

| Parameter | Type | Description |
|-----------|------|-------------|
| ctx | Input | The BM context handle. |
| info | Input | The BM network info. |
| net | Output | The registered network handle. |

### bmnet_set_input_shape

bmnet_set_input_shape() sets an input shape for a registered BM network. A bmodel may support several different input shapes; this API selects one of them.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| input_shape | Input | The input shape. |

### bmnet_get_output_info

bmnet_get_output_info() retrieves the output information of a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| output_info | Output | The output info. |

### bmnet_cleanup

bmnet_cleanup() cleans up a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |

### bmnet_run

bmnet_run() runs a registered BM network. You need to load the input and store the output yourself.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |

### bmnet_weight_devmem

bmnet_weight_devmem() retrieves the weight device memory handle from a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |

### bmnet_neuron_devmem

bmnet_neuron_devmem() retrieves the neuron device memory handle from a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |

### bmnet_input_devmem

bmnet_input_devmem() retrieves the input device memory handle from a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |

### bmnet_output_devmem

bmnet_output_devmem() retrieves the output device memory handle from a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |

### bmnet_import_weight_devmem

bmnet_import_weight_devmem() imports weight device memory for a registered BM network. The application should allocate weight device memory first, then call this function to import it. This function and bmnet_import_neuron_devmem() are usually used together with bmnet_register_noalloc(): the application registers a BM network without allocating weight and neuron device memory, and then uses these two functions to import the weight and neuron memory.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| weight_mem | Input | The weight device memory handle. |

### bmnet_import_neuron_devmem

bmnet_import_neuron_devmem() imports neuron device memory for a registered BM network. The application should allocate neuron device memory first, then call this function to import it.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| neuron_mem | Input | The neuron device memory handle. |

### bmnet_load_input

bmnet_load_input() loads input data for a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| input | Input | The input data pointer. |

### bmnet_load_neuron

bmnet_load_neuron() loads neuron data for a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| neuron_offset | Input | The offset of the neuron buffer. |
| neuron_size | Input | The neuron buffer size. |
| neuron | Input | The pointer to the neuron buffer. |

### bmnet_store_output

bmnet_store_output() stores the output data of a registered BM network. The application uses this function to copy output data from device memory to host memory.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| output | Input | The output buffer pointer. |

### bmnet_store_neuron

bmnet_store_neuron() stores the neuron data of a registered BM network. The application uses this function to copy neuron data from device memory to host memory.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| neuron_offset | Input | The offset of the neuron buffer. |
| neuron_size | Input | The neuron buffer size. |
| neuron | Input | The pointer to the neuron buffer. |

### bmnet_inference

bmnet_inference() runs inference with a registered BM network.

| Parameter | Type | Description |
|-----------|------|-------------|
| net | Input | The BM network handle. |
| input | Input | The input buffer pointer. |
| output | Input | The output buffer pointer. |
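
A typical end-to-end flow through the BMNet functions above might look like the following sketch. This is pseudocode-style C: the handle types (bmctx_t, bmnet_t, bmnet_output_info_t), the filename and the exact signatures are assumptions inferred from the parameter tables, not confirmed prototypes.

```c
/* Sketch only: types, signatures and the filename are illustrative. */
bmctx_t ctx;
bm_init(0, &ctx);

bmnet_t net;
bmnet_register_bmodel(ctx, "network.bmodel", &net);

bmnet_set_input_shape(net, input_shape);    /* pick one supported input shape */

bmnet_output_info_t info;
bmnet_get_output_info(net, &info);          /* use this to size the output buffer */

bmnet_inference(net, input_buf, output_buf);/* load input, run, store output */

bmnet_cleanup(net);
bm_exit(ctx);
```

Compared with bmnet_run(), bmnet_inference() handles loading the input and storing the output in one call.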

## BMKernel Library

### System API

### bmk1880_register

The user allocates a BMKernel context by filling a bmk1880_info_t structure and passing it to the bmk1880_register function. The function returns a handle to the initialized context.

In the bmk1880_info_t structure: chip_version is an integer describing the version of the chip to work with (1880 for this chip); cmdbuf (short for "command buffer") is a user-allocated buffer that holds the generated hardware instructions, and cmdbuf_size gives its size in bytes. Note that the user is responsible for freeing cmdbuf once the BMKernel context that refers to it is no longer in use.

### bmk1880_cleanup

bmk1880_cleanup frees the context previously allocated by bmk1880_register.

### bmk1880_acquire_cmdbuf

bmk1880_acquire_cmdbuf returns a buffer of the hardware instructions generated so far and sets (*size) to the buffer's valid size in bytes. The buffer is an array of cmd_hdr_t structures, each containing one variable-sized generated hardware instruction.

In the cmd_hdr_t structure, engine_id identifies the engine on which the contained instruction is supposed to be executed, and len indicates, in bytes, the length of the hardware instruction immediately following the cmd_hdr_t structure.

### bmk1880_reset

bmk1880_reset resets the current BMKernel context to its initial state as returned by bmk1880_register. This function is usually called after bmk1880_acquire_cmdbuf to empty the cmdbuf buffer.

### bmk1880_parallel_enable

bmk1880_parallel_enable declares that the following computations on different engines can be executed with no synchronization with each other. This function enables the engine-oriented parallel programming style.

### bmk1880_parallel_disable

bmk1880_parallel_disable disables the engine-oriented parallel programming style.

### bmk1880_create_streams

bmk1880_create_streams creates nr_streams streams, indexed 0 to (nr_streams - 1), which following calls to bmk1880_set_stream can refer to. This function enables the dependency-oriented parallel programming style. Note that this style cannot be disabled once enabled.

### bmk1880_destroy_streams

bmk1880_destroy_streams destroys all the streams created by the previous call to bmk1880_create_streams and resets the system back to serial mode.

### bmk1880_set_stream

bmk1880_set_stream sets the current stream to stream i, which must have been created by a call to bmk1880_create_streams. Following computations are put into this stream until another bmk1880_set_stream call specifies a different stream index.

### bmk1880_add_dependency

bmk1880_add_dependency further restricts the computation represented by before to take place strictly before the one represented by after. Both before and after are pointers returned by some computation API.

### Computation API

During all kinds of computation, input values are first converted into 32-bit values before any internal computation, and the final 32-bit values are saturated into the range representable by the final 8-bit or 16-bit integer format. That is, if a value before saturation can be represented by the final integer format, it is unchanged; otherwise it is saturated to the maximum or minimum of the final integer format, whichever is nearer to the original value. For example, if the final integer format is FMT_U8, the representable maximum and minimum are 255 and 0 respectively. In this case, any value bigger than 255 becomes 255 after saturation, and values smaller than 0 are saturated to 0.

Regarding signedness, one general rule applies to all kinds of computation unless otherwise specified: the result is unsigned if and only if all input tensors or matrices are unsigned. A tensor or matrix is said to be signed if it is of format FMT_I8, and unsigned if it is of format FMT_U8.

### fmt_t

fmt_t describes the type of the basic data in a tensor or matrix. The naming consists of three parts: "FMT" is a fixed prefix; a following "I" or "U" stands for signed or unsigned integer, respectively; "8" gives the bit width of the type.

### shape_t

shape_t describes the shape of a tensor or matrix. shape_t4 and shape_t2 are used to construct shape_t's for tensors and matrices, respectively.

### stride_t

stride_t describes the stride of a tensor or matrix. stride_t4 and stride_t2 are used to construct stride_t's for tensors and matrices, respectively.

### tensor_lmem

tensor_lmem represents a tensor or matrix in lmem. fmt, shape and stride are as explained above. If stride is NULL, aligned is consulted to choose between two frequently used sets of stride values.

For tensors, if aligned is false, the stride values are the default unaligned stride; if aligned is true, the default aligned stride. For matrices, the stride values are computed from the shapes of the corresponding specially shaped tensors, following the same rule.

### tensor_gmem

tensor_gmem represents a tensor or matrix in gmem.

### bmk1880_chip_info

bmk1880_chip_info returns a structure describing the design parameters of the BM1880 chip.

### bmk1880_tl_prealloc

bmk1880_tl_prealloc allocates a tensor_lmem structure on heap memory and constructs it as dictated by the parameters. The parameter la is the starting address in lmem. The tensor_lmem's aligned field is set to false. If the allocation succeeds, a pointer to the constructed structure is returned; otherwise NULL is returned.

### bmk1880_tl_prealloc_align

Same as bmk1880_tl_prealloc, except that the aligned field is set to true.

### bmk1880_tl_alloc

bmk1880_tl_alloc allocates a tensor_lmem structure on heap memory and constructs it as dictated by the parameters. Unlike in bmk1880_tl_prealloc, the starting address is not determined from the parameters but is assigned by BMKernel automatically. BMKernel manages the starting addresses in lmem with a simple stack: the starting address in each returned tensor_lmem increases monotonically across successive bmk1880_tl_alloc calls, and the last allocated tensor_lmem must be freed first, using the function bmk1880_tl_free explained below. If the available lmem memory is not enough to satisfy an allocation request, or some other error occurs, a NULL pointer is returned.

The tensor_lmem's aligned field is set to false when ctrls is CTRL_NULL, and to true when it is CTRL_AL.

### bmk1880_tl_alloc_bank

bmk1880_tl_alloc_bank allocates memory from a specific lmem bank, as dictated by the bank_id parameter.

### bmk1880_tl_free

bmk1880_tl_free frees a tensor_lmem structure allocated by bmk1880_tl_prealloc, bmk1880_tl_prealloc_align, bmk1880_tl_alloc or bmk1880_tl_alloc_bank back to heap memory. If the structure was allocated by bmk1880_tl_alloc or bmk1880_tl_alloc_bank, bmk1880_tl_free also increases the available lmem memory managed by BMKernel and checks that the last-allocated, first-freed rule is obeyed (see bmk1880_tl_alloc).

### bmk1880_gdma_copy_gmem

bmk1880_gdma_copy_gmem instructs the DMA to copy a tensor or matrix within gmem. src and dst must both be tensors or both be matrices, and must contain only 8-bit basic data. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. When src and dst are tensors, ctrls can be CTRL_TP, indicating N/C transposition. In all other cases, ctrls must be CTRL_NULL.

### bmk1880_gdma_copy_lmem

bmk1880_gdma_copy_lmem instructs the DMA to copy a tensor (not a matrix) within lmem, from src to dst. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. The basic data must be 8-bit.

### bmk1880_gdma_load

bmk1880_gdma_load instructs the DMA to copy a tensor or matrix from gmem to lmem. The tensor or matrix starts at gaddr in gmem and is strided by the default values. When ctrls is CTRL_TP (instead of CTRL_NULL), it indicates N/C transposition for a tensor, or row/column transposition for a matrix. The basic data must be 8-bit.

### bmk1880_gdma_store

Similar to bmk1880_gdma_load, but copies the tensor or matrix from lmem to gmem.

### bmk1880_gdma_load_stride

Similar to bmk1880_gdma_load, but enables the user to specify stride values in gmem.

### bmk1880_gdma_store_stride

Similar to bmk1880_gdma_store, but enables the user to specify stride values in gmem.

### bmk1880_gdma_lrn_shift

bmk1880_gdma_lrn_shift instructs the DMA to compute a tensor (not a matrix) dst from tensor src, both of the same shape (N, C, H, W). If right_shift is true, the computation copies the datum at index (ni, ci, hi, wi) in tensor src to index (ni, ci + lrn_step, hi, wi) in tensor dst for each 0 ≤ ci < C − lrn_step, and sets the datum at index (ni, ci, hi, wi) in tensor dst to zero for each 0 ≤ ci < lrn_step. If right_shift is false, the computation copies the datum at index (ni, ci, hi, wi) in tensor src to index (ni, ci − lrn_step, hi, wi) in tensor dst for each lrn_step ≤ ci < C, and sets the datum at index (ni, ci, hi, wi) in tensor dst to zero for each C − lrn_step ≤ ci < C. The basic data must be 8-bit.

### bmk1880_tpu_mul

bmk1880_tpu_mul instructs the TPU to compute resi = (ai × bi) ≫ rshift_width for each datum ai in tensor a and bi in tensor b, where resi, ai and bi are of the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit. rshift_width indicates the number of bits each result value is shifted to the right before saturation.

### bmk1880_tpu_mul_const

Similar to bmk1880_tpu_mul, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, unsigned otherwise.

### bmk1880_tpu_mac

bmk1880_tpu_mac instructs the TPU to compute resi = (ai × bi + (resi ≪ lshift_width)) ≫ rshift_width for each datum ai in tensor a, bi in tensor b and resi represented by res_high and res_low together, where resi, ai and bi are of the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. The result is a 16-bit tensor if res_is_int8 is false, or an 8-bit tensor otherwise. rshift_width indicates the number of bits each result value is shifted to the right before saturation. Note that res_high and res_low are used both as input resi's and as output resi's. The input resi's are fixed to be 16-bit, so both res_high and res_low must be non-NULL. When the result is an 8-bit tensor, it is stored into res_low.

### bmk1880_tpu_mac_const

Similar to bmk1880_tpu_mac, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, unsigned otherwise.

### bmk1880_tpu_add

bmk1880_tpu_add instructs the TPU to compute resi = ai + bi for each datum ai in tensor a and bi in tensor b, where resi, ai and bi are of the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensors a and b must both be 16-bit, so a_high and b_high must not be NULL. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit.

### bmk1880_tpu_add_const

Similar to bmk1880_tpu_add, but tensor b is replaced by a 16-bit constant. The constant is signed if b_is_signed is true, unsigned otherwise.

### bmk1880_tpu_sub

bmk1880_tpu_sub instructs the TPU to compute resi = ai − bi for each datum ai in tensor a and bi in tensor b, where resi, ai and bi are of the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensors a and b must both be 16-bit, so a_high and b_high must not be NULL. The result must be signed, so the fmt_t field in res_high and res_low must be FMT_I8. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit.

### bmk1880_tpu_max

bmk1880_tpu_max instructs the TPU to compute resi = max(ai, bi) for each datum ai in tensor a and bi in tensor b, where resi, ai and bi are of the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensors a and b must both be signed or both be unsigned.

### bmk1880_tpu_min

Similar to bmk1880_tpu_max, but computes resi = min(ai, bi).

### bmk1880_tpu_min_const

Similar to bmk1880_tpu_min, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, unsigned otherwise.

### bmk1880_tpu_arith_shift

bmk1880_tpu_arith_shift instructs the TPU to compute resi = ai ≫ bitsi for each datum ai in tensor a and bitsi in tensor bits, where resi, ai and bitsi are of the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensor a must be 16-bit and signed, so the fmt fields in a_high and a_low must be FMT_I8. Tensor bits must be signed and every datum in it must be in the range [−16, 16]. The result tensor must be 16-bit, so res_high must be non-NULL.

### bmk1880_tpu_and_int8

bmk1880_tpu_and_int8 instructs the TPU to compute resi = ai ∧ bi for each datum ai in tensor a and bi in tensor b, where resi, ai and bi are of the same index. All tensors must be of the same shape, and the basic data in all tensor_lmem structures must be 8-bit.

### bmk1880_tpu_and_int16

Similar to bmk1880_tpu_and_int8, but all input and output tensors are 16-bit, so res_high, a_high and b_high must be non-NULL.

### bmk1880_tpu_or_int8

Similar to bmk1880_tpu_and_int8, but computes resi = ai ∨ bi.

### bmk1880_tpu_or_int16

Similar to bmk1880_tpu_and_int16, but computes resi = ai ∨ bi.

### bmk1880_tpu_xor_int8

Similar to bmk1880_tpu_and_int8, but computes resi = ai ⊕ bi.

### bmk1880_tpu_xor_int16

Similar to bmk1880_tpu_and_int16, but computes resi = ai ⊕ bi.

### bmk1880_tpu_copy

bmk1880_tpu_copy instructs the TPU to copy tensors within lmem, from src to dst. The basic data must be 8-bit.

### bmk1880_tpu_copy_with_stride

Similar to bmk1880_tpu_copy, but the user provides stride_t structures specifying the layouts of tensors dst and src. The basic data must be 8-bit.

### bmk1880_tpu_mdsum

### bmk1880_tpu_lut

bmk1880_tpu_lut instructs the TPU to compute a tensor res from tensor idx, using tensor table as a lookup table and the values in tensor idx as indices. Tensor table must be of shape (1, slices, 16, 16), where slices is the number of lmem slices. Tensors idx and res must be of the same shape. Assuming their shape is (N, C, H, W), the datum resi at index (ni, ci, hi, wi) in tensor res is computed from idxi at the same index (ni, ci, hi, wi) in tensor idx as resi = tablei, where tablei is at index (0, ct, idxi / 16, idxi mod 16) in tensor table, and ct is the index of the lmem slice in which the datum idxi resides. The basic data in all tensor_lmem structures must be 8-bit.

### bmk1880_tpu_relu

bmk1880_tpu_relu instructs the TPU to compute resi = max(0, ai) for each datum ai in tensor a, where resi and ai are of the same index. The basic data in all tensor_lmem structures must be 8-bit.

### bmk1880_tpu_conv

bmk1880_tpu_conv instructs the TPU to compute a tensor ofmap from tensors ifmap, weight and bias, using ifmap as the input feature map, weight as the convolution kernel and bias as the bias to be added to the convolution result. relu_enable may be true, indicating ReLU activation after adding the bias values but before shifting every basic datum. rshift_width specifies the number of bits to shift every basic datum rightward after the optional ReLU activation.

ofmap and ifmap must be aligned (see BMKernel 1880 Guide.pdf).

weight has a special layout which is very different from the one described in the programming model (see BMKernel 1880 Guide.pdf). If ifmap is of shape (Nin, Cin, Hin, Win), ofmap is of shape (Nout, Cout, Hout, Wout) and the convolution kernels are of shape (Hkernel, Wkernel), then weight should be of shape (Cin, Cout, Hkernel, Wkernel).

The layout of weight, however, is as if it were of shape (1, Cout, Hkernel × Wkernel, Cin). This special layout can be precisely defined by applying the following stride values to weight's logical shape (Cin, Cout, Hkernel, Wkernel):

bias may be NULL, indicating no bias values. If it is non-NULL, and assuming ofmap is of shape (N, C, H, W), then bias must be a 16-bit tensor of shape (1, C, 1, 1). Since a 16-bit tensor is stored as two 8-bit tensors in lmem, bias's tensor_lmem structure must be of shape (2, C, 1, 1) and must be unaligned (see the unaligned stride values in section 2.4 of BMKernel 1880 Guide.pdf). During the phase of adding the bias, the value of the datum at index (0, ci, 0, 0) in the 16-bit tensor is added to all data in ofmap whose C-dimension index is ci.

param contains detailed convolution parameters that can be classified into four categories by function: insertion, padding, striding and dilation parameters, which are detailed below.

Insertion parameters specify the number of zeros to be inserted at specific locations within ifmap. They include ins_h, ins_last_h, ins_w and ins_last_w. ins_h specifies the number of zeros to be inserted after every non-last basic datum along the H-dimension. Consider ifmap of shape (N, C, H, W) for example. After inserting zeros, ifmap′ will be of shape (N, C, H′, W), where H′ = 1 + (H − 1) × (ins_h + 1). Denoting by x(ni, ci, hi, wi) the value of the basic datum at index (ni, ci, hi, wi) of tensor ifmap, and by x′(ni, ci, hi, wi) that of tensor ifmap′, the following holds: x′(ni, ci, hi × (ins_h + 1), wi) = x(ni, ci, hi, wi), and all other data in ifmap′ are zero. ins_last_h specifies the number of zeros to be inserted only after every last basic datum. Similarly, ins_w and ins_last_w specify the numbers of zeros to be inserted along the W-dimension.

Padding parameters specify the number of zeros to be inserted around the elements of ifmap. pad_top specifies the number of zeros to be inserted before every first basic datum along the H-dimension, and pad_bottom the number after every last basic datum along the H-dimension. Similarly, pad_left and pad_right specify the numbers along the W-dimension.

Striding parameters specify the number of basic data the convolution kernel strides over after each convolution step. stride_h and stride_w specify the numbers along the H-dimension and W-dimension, respectively.

Dilation parameters specify the dilation of the convolution kernel weight. That is, (dilation_h − 1) zeros are inserted between every two basic data along the H-dimension, and similarly (dilation_w − 1) zeros are inserted along the W-dimension.

### bmk1880_tpu_winograd

Similar to bmk1880_tpu_conv, but uses the Winograd algorithm to accelerate the computation. Moreover, weight must contain only 3 × 3 kernels and must use the default stride in lmem (see section 2.4 of BMKernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in bmk1880_tpu_conv (see section 4.41 of BMKernel 1880 Guide.pdf).

### bmk1880_tpu_depthwise

Similar to bmk1880_tpu_conv, but computes a depthwise convolution. Moreover, weight uses the default stride in lmem (see section 2.4 of BMKernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in bmk1880_tpu_conv.

### bmk1880_tpu_max_pooling

bmk1880_tpu_max_pooling instructs the TPU to compute a tensor ofmap from tensor ifmap by performing a (kh × kw) max pooling over ifmap. The size parameters of the pooling kernel, kh and kw, are specified in param. The other parameters in param are similar to those of the same names in bmk1880_conv_param_t.

ofmap and ifmap must be aligned.

### bmk1880_tpu_avg_pooling

Similar to bmk1880_tpu_max_pooling, but performs an average pooling over ifmap as controlled by avg_pooling_const. At every pooling step, all related basic data in ifmap are summed together, multiplied by avg_pooling_const, and then shifted rightward by rshift_width bits.

### bmk1880_tpu_matrix_mac

bmk1880_tpu_matrix_mac instructs the TPU to compute a matrix res by multiplying the left matrix left with the right matrix right, then adding the matrix bias (if not NULL), and finally shifting to the right by rshift_width bits. Note that all tensor_lmem structures involved must be matrices instead of tensors. ctrls may have the CTRL_RELU or CTRL_RA flag set, but not both. After adding the bias but before the right shift, if ctrls is CTRL_RELU, ReLU activation is performed, in which negative values are rectified to 0; if ctrls is CTRL_RA, the original values in res are shifted leftward by lshift_width bits and then added into the results. res_is_int8 indicates whether the result is 8-bit or 16-bit.

The use of the res matrix is unusual when ctrls is CTRL_RA or when res_is_int8 is false. Assume that the result is a matrix of shape (R, C). When ctrls is CTRL_RA, the original result is a 16-bit matrix of shape (R, C) represented by res. Since a 16-bit matrix's high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, res's tensor_lmem structure must be of an 8-bit format (FMT_I8 or FMT_U8), must be of shape (R × 2, C), and must be aligned (see the aligned stride values in section 2.4 of BMKernel 1880 Guide.pdf). When res_is_int8 is false, the final result is a 16-bit matrix similarly represented by res. When ctrls is CTRL_RA but res_is_int8 is true, the original result is 16-bit while the final result is 8-bit; in this case, only the low 8-bit parts (located at lower addresses) of the res matrix are written with the final result. In the final case, where both the original and the final result are 8-bit matrices, res is a normal 8-bit matrix of shape (R, C).

Note that bias is different from that in bmk1880 tpu conv, bmk1880 winograd or bmk1880 tpu depthwise. First, it is a matrix. Moreover, if res is of shape (R, C), then bias must be a 16-bit matrix of shape (1, C). Since a 16-bit matrix’s high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, bias’s tensor lmem structure must be of shape (2, C) and must be aligned.

res, left, right and bias must all be aligned.
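The per-element pipeline (multiply-accumulate, add bias, optional ReLU, then the final right shift) can be sketched as a host-side reference model. This is not the SDK API; the function and parameter names below are invented for illustration, and the intermediate accumulator is held in 32 bits for simplicity:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Reference model of an (R, K) x (K, C) matrix mac with a (1, C) bias,
// optional CTRL_RELU behavior, and a final right shift by rshift_width bits.
std::vector<int32_t> matrix_mac(const std::vector<int8_t>& left,   // R * K
                                const std::vector<int8_t>& right,  // K * C
                                const std::vector<int16_t>& bias,  // C
                                int R, int K, int C,
                                bool relu, int rshift_width) {
  std::vector<int32_t> res(R * C);
  for (int r = 0; r < R; ++r)
    for (int c = 0; c < C; ++c) {
      int32_t acc = bias[c];
      for (int k = 0; k < K; ++k)
        acc += left[r * K + k] * right[k * C + c];
      if (relu && acc < 0) acc = 0;          // CTRL_RELU: rectify after bias,
      res[r * C + c] = acc >> rshift_width;  // before the final right shift
    }
  return res;
}
```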

### bmk1880 tpu matrix mac 2

bmk1880 tpu matrix mac 2 instructs TPU to compute a matrix res by multiplying left matrix left with right matrix right. left, right and res must be tensors, though the computation is matrix multiplication. res and left must be of shape (1, 256, 1, 256). right must be of shape (256, 16, 1, 16). The basic data in all tensor lmem structures must be 8-bit.

## BMNet Library

### TensorOp

TensorOp represents a BMNET IR, which serves as a bridge between the front end and the back end. It provides many member methods for setting information on the IR or retrieving it. The methods are described below.

### TensorOp::input_shape_size

Returns the number of inputs.

### TensorOp::output_shape_size

Returns the number of outputs.

### TensorOp::input_shape

const TensorShape& TensorOp::input_shape(int index)

Returns the shape of the input at the given index.

Parameter | Type | Description |

index | int | [Required] Index of the input whose shape is returned. |

### TensorOp::output_shape

Returns the shape of the output at the given index.

Parameter | Type | Description |

index | int | [Required] Index of the output whose shape is returned. |

### TensorOp::add_output_shape

Returns a mutable pointer to a newly added TensorShape in the outputs. The returned TensorShape can be modified later.

### TensorOp::global_input

Returns the offset at which the input tensor at the given index is stored in device memory.

Parameter | Type | Description |

index | int | [Required] Index of the input whose offset is returned. |

### TensorOp::global_output

Returns the offset at which the output tensor at the given index is stored in device memory.

Parameter | Type | Description |

index | int | [Required] Index of the output whose offset is returned. |

### TensorOp::mutable_tg_customized_param

Returns a mutable pointer to the parameters of the customized BMNET IR.

### TensorOp::tg_customized_param

Returns a reference to the parameters of the customized BMNET IR.

### CustomizedCaffeLayer

CustomizedCaffeLayer is an abstract class used to implement a layer that converts a CAFFE layer into BMNet IR (please refer to Chapter 5 for details about BMNet IR). To introduce a customized CAFFE layer into BMNet, inherit from this class and implement all of its pure virtual functions. CustomizedCaffeLayer inherits from the CaffeLayer/Layer class. The member methods are described below.

### CustomizedCaffeLayer::layer_name

Pure virtual function. Returns the type of the newly added CAFFE layer.

### CustomizedCaffeLayer::dump

Pure virtual function. Prints information about the CAFFE layer.

### CustomizedCaffeLayer::setup

Optional. It is used only to set the subtype of a customized layer, and a default implementation is provided. If a child class overrides it, this parent class's setup() must be called first.

### CustomizedCaffeLayer::codegen

Pure virtual function. Sets up the BMNET IR according to the LayerParameter of the CAFFE layer. In this function, you should set up the output shape and fill in the parameters of the TensorOp.

Parameter | Type | Description |

op | TensorOp* | [Required] Pointer to an instance of the BMNET IR. |
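A typical codegen body reads the input shape and registers a matching output shape on the IR. The sketch below is self-contained: TensorShape and TensorOp here are minimal stand-in stubs that mimic only the documented method names (input_shape, add_output_shape), not the real BMNet classes, and codegen_identity is a hypothetical example layer whose output shape mirrors its input:

```cpp
#include <cassert>
#include <vector>

// Stand-in stubs, NOT the real BMNet classes.
struct TensorShape {
  std::vector<int> dim;
};

struct TensorOp {
  std::vector<TensorShape> inputs, outputs;
  int input_shape_size() const { return static_cast<int>(inputs.size()); }
  const TensorShape& input_shape(int index) const { return inputs[index]; }
  TensorShape* add_output_shape() {   // mutable pointer to a newly added shape
    outputs.emplace_back();
    return &outputs.back();
  }
};

// Codegen-style routine in the spirit of CustomizedCaffeLayer::codegen:
// query the input shape, then set up an identical output shape on the IR.
void codegen_identity(TensorOp* op) {
  const TensorShape& in = op->input_shape(0);
  TensorShape* out = op->add_output_shape();  // still mutable after return
  out->dim = in.dim;                          // output shape mirrors input
}
```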

### CustomizedCaffeLayer::add_output_offset

Protected member method. It should be called when setting up the output offset of the layer's top.

Parameter | Type | Description |

offset | int | [Required] Offset of the output; should be 0. |

### CustomizedCaffeLayer::layer_

Protected member variable; a reference to the customized CAFFE layer's LayerParameter.

### CustomizedTensorFixedInst

CustomizedTensorFixedInst is an abstract class used to implement a layer that converts BMNET IR into instructions via the BMKernel APIs. Inherit from this class and implement all of its pure virtual functions. CustomizedTensorFixedInst inherits from the TensorFixedInst/TensorInst class. The member methods are described below.