# BMNNSDK API v1

BMNNSDK provides a lightweight set of C/C++ APIs for deep learning application developers. It consists of the TPU BMRuntime Library, the BMKernel Library, and the BMNet Library, which are described in detail in this section.

## BM Runtime Library <a href="#bm_init" id="bm_init"></a>

### bm\_init

```c
bmerr_t  bm_init(
int        index,
bmctx_t   *ctx)
```

bm\_init() initializes the BM device and creates a handle to the BM context.

| Parameter | Type   | Description                           |
| --------- | ------ | ------------------------------------- |
| index     | Input  | Not used now. Set 0 as default value. |
| ctx       | Output | The pointer of BM context handle.     |

### bm\_exit

```c
void  bm_exit(
bmctx_t  ctx)
```

bm\_exit() must be called before the application exits. It releases all internal resources.

| Parameter | Type  | Description                                           |
| --------- | ----- | ----------------------------------------------------- |
| ctx       | Input | The BM context handle which is created by bm\_init(). |

### bm\_enum\_devices

```c
void  bm_enum_devices(
int            *count, 
bm_devinfo_t   devinfo[])
```

bm\_enum\_devices() enumerates all BM devices in the system.

| Parameter | Type   | Description               |
| --------- | ------ | ------------------------- |
| count     | Output | The number of BM devices. |
| devinfo   | Output | The array of device info. |

### bm\_device\_open

```c
bmerr_t  bm_device_open(
int         index,
bmdev_t    *dev)
```

bm\_device\_open() opens a BM device.

| Parameter | Type   | Description                      |
| --------- | ------ | -------------------------------- |
| index     | Input  | The index of BM device.          |
| dev       | Output | The pointer of BM device handle. |

### bm\_device\_close

```c
void    bm_device_close(
bmdev_t    dev)
```

bm\_device\_close() closes an opened BM device.

| Parameter | Type  | Description           |
| --------- | ----- | --------------------- |
| dev       | Input | The BM device handle. |

### bm\_device\_query

```c
bmerr_t    bm_device_query(
bmdev_t     dev, 
int          id, 
void        *buf)
```

bm\_device\_query() always returns BM\_ERR\_NOT\_SUPPORTED now.

### bm\_device\_config

```c
bmerr_t    bm_device_config(
bmdev_t    dev, 
int         id, 
void        *buf)
```

bm\_device\_config() always returns BM\_ERR\_NOT\_SUPPORTED now.

### bm\_device\_get\_info

```c
bm_devinfo_t   bm_device_get_info(
bmdev_t   dev)
```

bm\_device\_get\_info() returns information about a BM device.

| Parameter | Type  | Description           |
| --------- | ----- | --------------------- |
| dev       | Input | The BM device handle. |

### bm\_context\_create

```c
bmerr_t    bm_context_create(
bmctx_t     *ctx)
```

bm\_context\_create() creates a BM context.

| Parameter | Type   | Description                       |
| --------- | ------ | --------------------------------- |
| ctx       | Output | The pointer of BM context handle. |

### bm\_context\_destroy

```c
void   bm_context_destroy(
bmctx_t     ctx)
```

bm\_context\_destroy() destroys a BM context.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| ctx       | Input | The BM context handle. |

### bm\_bind\_device

```c
bmerr_t   bm_bind_device(
bmctx_t      ctx,
bmdev_t     dev)
```

bm\_bind\_device() binds a BM context with a BM device.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| ctx       | Input | The BM context handle. |
| dev       | Input | The BM device handle.  |

### bm\_unbind\_device

```c
void     bm_unbind_device(
bmctx_t      ctx)  
```

bm\_unbind\_device() unbinds a BM context from its BM device.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| ctx       | Input | The BM context handle. |

### bm\_get\_device

```c
bmdev_t    bm_get_device(
bmctx_t    ctx)
```

bm\_get\_device() returns the BM device handle which is bound with the BM context.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| ctx       | Input | The BM context handle. |

### bmruntime\_bmkernel\_create

```c
bmerr_t    bmruntime_bmkernel_create(
bmctx_t             ctx,
bmkernel_handle_t    **p_bk_ctx)
```

bmruntime\_bmkernel\_create() creates a BM kernel with the BM context. p\_bk\_ctx points to a thread-local variable, so this API can be used to create multiple independent BM kernel contexts in multiple threads. However, a single thread must not own more than one of these contexts at the same time; otherwise memory will leak.

| Parameter  | Type   | Description                      |
| ---------- | ------ | -------------------------------- |
| ctx        | Input  | The BM context handle.           |
| p\_bk\_ctx | Output | The pointer of BM kernel handle. |

### bmruntime\_bmkernel\_submit

```c
void   bmruntime_bmkernel_submit(
bmctx_t     ctx)
```

bmruntime\_bmkernel\_submit() submits the BM kernel with the BM context.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| ctx       | Input | The BM context handle. |

### bmruntime\_bmkernel\_destroy

```c
void   bmruntime_bmkernel_destroy(
bmctx_t     ctx)
```

bmruntime\_bmkernel\_destroy() destroys the BM kernel with the BM context.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| ctx       | Input | The BM context handle. |

### bmmem\_device\_alloc\_raw

```c
bmmem_device_t   bmmem_device_alloc_raw(
bmctx_t     ctx,
size_t       size)
```

bmmem\_device\_alloc\_raw() allocates device memory as the input size.

| Parameter | Type  | Description                |
| --------- | ----- | -------------------------- |
| ctx       | Input | The BM context handle.     |
| size      | Input | The size of device memory. |

### bmmem\_device\_prealloc\_raw

```c
bmmem_device_t    bmmem_device_prealloc_raw(
bmctx_t          ctx, 
bmmem_device_t  mem,
uint64_t          offset,
size_t            size)
```

bmmem\_device\_prealloc\_raw() lets an application allocate memory from previously allocated device memory. The requested range must fall entirely within that previously allocated device memory.

| Parameter | Type  | Description                                           |
| --------- | ----- | ----------------------------------------------------- |
| ctx       | Input | The BM context handle.                                |
| mem       | Input | The previously allocated device memory.               |
| offset    | Input | The offset in the previously allocated device memory. |
| size      | Input | The size of device memory.                            |
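
The containment requirement above can be checked on the host before calling bmmem\_device\_prealloc\_raw(). The following is a minimal sketch, not SDK code: `subrange_fits` is a hypothetical helper, and `parent_size` is assumed to be the size of the previously allocated memory (for example, as reported by bmmem\_device\_size()).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of BMNNSDK: returns true when the
 * sub-range [offset, offset + size) lies inside a parent allocation
 * of parent_size bytes. Written so that offset + size is never
 * computed directly and therefore cannot overflow. */
bool subrange_fits(size_t parent_size, uint64_t offset, size_t size)
{
    if (offset > parent_size) {
        return false;               /* start is already past the end */
    }
    return size <= parent_size - offset;
}
```

A caller would typically refuse to issue the prealloc request when this check fails, rather than relying on the runtime to reject it.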

### bmmem\_device\_alloc

```c
bmmem_device_t   bmmem_device_alloc(
bmctx_t       ctx,
bmshape_t    *shape)
```

bmmem\_device\_alloc() allocates device memory as the input shape.

| Parameter | Type  | Description                 |
| --------- | ----- | --------------------------- |
| ctx       | Input | The BM context handle.      |
| shape     | Input | The shape of device memory. |

### bmmem\_device\_prealloc

```c
bmmem_device_t    bmmem_device_prealloc(
bmctx_t          ctx,
bmmem_device_t  mem,
uint64_t          offset,
bmshape_t       *shape)
```

bmmem\_device\_prealloc() allocates device memory of the given shape from previously allocated device memory.

| Parameter | Type  | Description                                           |
| --------- | ----- | ----------------------------------------------------- |
| ctx       | Input | The BM context handle.                                |
| mem       | Input | The previously allocated device memory.               |
| offset    | Input | The offset in the previously allocated device memory. |
| shape     | Input | The shape of device memory.                           |

### bmmem\_device\_free

```c
void    bmmem_device_free(
bmctx_t          ctx, 
bmmem_device_t   mem)
```

bmmem\_device\_free() frees device memory that was allocated by the allocation functions above.

| Parameter | Type  | Description               |
| --------- | ----- | ------------------------- |
| ctx       | Input | The BM context handle.    |
| mem       | Input | The device memory handle. |

### bmmem\_host\_alloc

```c
bmmem_host_t  bmmem_host_alloc(
bmctx_t          ctx, 
bmshape_t        *shape)
```

bmmem\_host\_alloc() always returns BM\_ERR\_NOT\_SUPPORTED now.

### bmmem\_host\_free

```c
void    bmmem_host_free(
bmctx_t          ctx, 
bmmem_host_t   mem)
```

bmmem\_host\_free() is not supported now.

### bmmem\_device\_size

```c
size_t     bmmem_device_size(
bmctx_t            ctx, 
bmmem_device_t    mem)
```

bmmem\_device\_size() returns the device memory size.

| Parameter | Type  | Description               |
| --------- | ----- | ------------------------- |
| ctx       | Input | The BM context handle.    |
| mem       | Input | The device memory handle. |

### bmmem\_device\_addr

```c
uint64_t    bmmem_device_addr(
bmctx_t            ctx, 
bmmem_device_t    mem)
```

bmmem\_device\_addr() returns the device memory address.

| Parameter | Type  | Description               |
| --------- | ----- | ------------------------- |
| ctx       | Input | The BM context handle.    |
| mem       | Input | The device memory handle. |

### bmmem\_host\_v\_addr

```c
void*    bmmem_host_v_addr(
bmctx_t         ctx, 
bmmem_host_t   mem)
```

bmmem\_host\_v\_addr() always returns BM\_ERR\_NOT\_SUPPORTED now.

### bmmem\_host\_p\_addr

```c
uint64_t   bmmem_host_p_addr(
bmctx_t         ctx, 
bmmem_host_t   mem)
```

bmmem\_host\_p\_addr() always returns BM\_ERR\_NOT\_SUPPORTED now.

### bm\_memcpy\_s2d

```c
bmerr_t     bm_memcpy_s2d(
bmctx_t           ctx, 
bmmem_device_t   dst, 
uint8_t*           src)
```

bm\_memcpy\_s2d() copies data from system memory to device memory. In the name, s stands for system and d stands for device.

| Parameter | Type  | Description                |
| --------- | ----- | -------------------------- |
| ctx       | Input | The BM context handle.     |
| dst       | Input | The device memory handle.  |
| src       | Input | The system memory pointer. |

### bm\_memcpy\_d2s

```c
bmerr_t     bm_memcpy_d2s(
bmctx_t          ctx, 
uint8_t*          dst,
bmmem_device_t  src)
```

bm\_memcpy\_d2s() copies data from device memory to system memory.

| Parameter | Type  | Description                |
| --------- | ----- | -------------------------- |
| ctx       | Input | The BM context handle.     |
| dst       | Input | The system memory pointer. |
| src       | Input | The device memory handle.  |

### bmnet\_register

bmnet\_register() registers a neural network with bmnet info.

```c
bmerr_t      bmnet_register(
bmctx_t         ctx, 
bmnet_info_t    *info, 
bmnet_t        *net)
```

| Parameter | Type   | Description                    |
| --------- | ------ | ------------------------------ |
| ctx       | Input  | The BM context handle.         |
| info      | Input  | The BM network info.           |
| net       | Output | The registered network handle. |

### bmnet\_register\_bmodel

```c
bmerr_t      bmnet_register_bmodel (
bmctx_t         ctx, 
char           *bmodel, 
bmnet_t        *net)
```

bmnet\_register\_bmodel() registers a neural network from a bmodel file.

| Parameter | Type   | Description                    |
| --------- | ------ | ------------------------------ |
| ctx       | Input  | The BM context handle.         |
| bmodel    | Input  | bmodel filename.               |
| net       | Output | The registered network handle. |

### bmnet\_register\_noalloc

```c
bmerr_t      bmnet_register_noalloc(
bmctx_t         ctx, 
bmnet_info_t    *info, 
bmnet_t        *net)
```

bmnet\_register\_noalloc() registers a compiled neural network without allocating weight and neuron device memory.

| Parameter | Type   | Description                    |
| --------- | ------ | ------------------------------ |
| ctx       | Input  | The BM context handle.         |
| info      | Input  | The BM network info.           |
| net       | Output | The registered network handle. |

### bmnet\_set\_input\_shape

```c
bmerr_t     bmnet_set_input_shape(
bmnet_t     net, 
shape_t     input_shape)
```

bmnet\_set\_input\_shape() sets an input shape for a registered BM network. A bmodel can support several input shapes; this API selects one of them.

| Parameter    | Type  | Description            |
| ------------ | ----- | ---------------------- |
| net          | Input | The BM network handle. |
| input\_shape | Input | The input shape.       |

### bmnet\_get\_output\_info

```c
bmerr_t     bmnet_get_output_info(
bmnet_t             net, 
bmnet_output_info_t  *output_info)
```

bmnet\_get\_output\_info() gets the output info of a registered BM network.

| Parameter    | Type   | Description            |
| ------------ | ------ | ---------------------- |
| net          | Input  | The BM network handle. |
| output\_info | Output | The output info.       |

### bmnet\_cleanup

```c
void       bmnet_cleanup(
bmnet_t    net)
```

bmnet\_cleanup() cleans up a registered BM network.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| net       | Input | The BM network handle. |

### bmnet\_run

```c
bmerr_t    bmnet_run(
bmnet_t  net)
```

bmnet\_run() runs a registered BM network. The application must load the input and store the output itself.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| net       | Input | The BM network handle. |

### bmnet\_weight\_devmem

```c
bmmem_device_t   bmnet_weight_devmem(
bmnet_t         net)
```

bmnet\_weight\_devmem() retrieves the weight device memory handle from a registered BM network.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| net       | Input | The BM network handle. |

### bmnet\_neuron\_devmem

```c
bmmem_device_t    bmnet_neuron_devmem(
bmnet_t         net)
```

bmnet\_neuron\_devmem() retrieves the neuron device memory handle from a registered BM network.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| net       | Input | The BM network handle. |

### bmnet\_input\_devmem

```c
bmmem_device_t     bmnet_input_devmem(
bmnet_t        net)
```

bmnet\_input\_devmem() retrieves the input device memory handle from a registered BM network.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| net       | Input | The BM network handle. |

### bmnet\_output\_devmem

```c
bmmem_device_t     bmnet_output_devmem(
bmnet_t        net)
```

bmnet\_output\_devmem() retrieves the output device memory handle from a registered BM network.

| Parameter | Type  | Description            |
| --------- | ----- | ---------------------- |
| net       | Input | The BM network handle. |

### bmnet\_import\_weight\_devmem

```c
bmerr_t    bmnet_import_weight_devmem(
bmnet_t           net,
bmmem_device_t   weight_mem)
```

bmnet\_import\_weight\_devmem() imports weight device memory for a registered BM network. The application should first allocate the weight device memory and then call this function to import it. This function and bmnet\_import\_neuron\_devmem() are usually used together with bmnet\_register\_noalloc(): the application registers a BM network without allocating weight and neuron device memory, and then uses these two functions to import the weight and neuron memory.

| Parameter   | Type  | Description                      |
| ----------- | ----- | -------------------------------- |
| net         | Input | The BM network handle.           |
| weight\_mem | Input | The weight device memory handle. |

### bmnet\_import\_neuron\_devmem

```c
bmerr_t    bmnet_import_neuron_devmem(
bmnet_t           net,
bmmem_device_t   neuron_mem)
```

bmnet\_import\_neuron\_devmem() imports neuron device memory for a registered BM network. The application should first allocate the neuron device memory and then call this function to import it.

| Parameter   | Type  | Description                      |
| ----------- | ----- | -------------------------------- |
| net         | Input | The BM network handle.           |
| neuron\_mem | Input | The neuron device memory handle. |

### bmnet\_load\_input

```c
bmerr_t      bmnet_load_input(
bmnet_t      net, 
uint8_t     *input)
```

bmnet\_load\_input() loads input data for a registered BM network.

| Parameter | Type  | Description             |
| --------- | ----- | ----------------------- |
| net       | Input | The BM network handle.  |
| input     | Input | The input data pointer. |

### bmnet\_load\_neuron

```c
bmerr_t     bmnet_load_neuron(
bmnet_t     net, 
uint64_t     neuron_offset,
int          neuron_size,
uint8_t      *neuron)
```

bmnet\_load\_neuron() loads neuron data for a registered BM network.

| Parameter      | Type  | Description                       |
| -------------- | ----- | --------------------------------- |
| net            | Input | The BM network handle.            |
| neuron\_offset | Input | The offset of neuron buffer.      |
| neuron\_size   | Input | The neuron buffer size.           |
| neuron         | Input | The pointer to the neuron buffer. |

### bmnet\_store\_output

```c
bmerr_t    bmnet_store_output (
bmnet_t    net,
uint8_t    *output)
```

bmnet\_store\_output() stores output data for a registered BM network. Application uses this function to copy output data from device memory to host memory.

| Parameter | Type  | Description                |
| --------- | ----- | -------------------------- |
| net       | Input | The BM network handle.     |
| output    | Input | The output buffer pointer. |

### bmnet\_store\_neuron

```c
bmerr_t     bmnet_store_neuron(
bmnet_t     net, 
uint64_t     neuron_offset, 
int          neuron_size,
uint8_t      *neuron)
```

bmnet\_store\_neuron() stores neuron data for a registered BM network. Application uses this function to copy neuron data from device memory to host memory.

| Parameter      | Type  | Description                       |
| -------------- | ----- | --------------------------------- |
| net            | Input | The BM network handle.            |
| neuron\_offset | Input | The offset of neuron buffer.      |
| neuron\_size   | Input | The neuron buffer size.           |
| neuron         | Input | The pointer to the neuron buffer. |

### bmnet\_inference

```c
bmerr_t     bmnet_inference(
bmnet_t     net, 
uint8_t      *input,
uint8_t      *output)
```

bmnet\_inference() runs inference with a registered BM network.

| Parameter | Type  | Description                |
| --------- | ----- | -------------------------- |
| net       | Input | The BM network handle.     |
| input     | Input | The input buffer pointer.  |
| output    | Input | The output buffer pointer. |

## BMKernel Library

### System API

### bmk1880\_register

The user allocates a BMKernel context by filling in a bmk1880\_info\_t structure and passing it to the bmk1880\_register function. The function returns a handle to the initialized context.

In the bmk1880\_info\_t structure, chip\_version is an integer describing the version of the chip to work with, and can be 1880; cmdbuf (short for “command buffer”) is a user-allocated buffer that holds the generated hardware instructions; and cmdbuf\_size is its size in bytes. Note that the user is responsible for freeing cmdbuf once the BMKernel context that refers to it is no longer in use.

```c
typedef struct {
  u32 chip_version;
  u8 *cmdbuf;
  u32 cmdbuf_size;
} bmk1880_info_t;

void *bmk1880_register(bmk1880_info_t *info);
```

### bmk1880\_cleanup

bmk1880\_cleanup frees the context previously allocated by bmk1880\_register.

```c
void bmk1880_cleanup(void *ctx);
```

### bmk1880\_acquire\_cmdbuf

bmk1880\_acquire\_cmdbuf returns a buffer of the hardware instructions generated so far and sets (\*size) to the buffer’s valid size in bytes. The buffer is an array of cmd\_hdr\_t structures, each containing one variable-sized generated hardware instruction.

```c
u8 *bmk1880_acquire_cmdbuf(void *ctx, u32 *size);
typedef struct {
  u8 engine_id : 4; ...
  u8 len;
  u8 cmd [0];
} cmd_hdr_t;
```

In the cmd\_hdr\_t structure, engine\_id identifies the engine on which the contained instruction is to be executed, and len gives the length in bytes of the hardware instruction that immediately follows this cmd\_hdr\_t structure.
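
Because each header is followed by len instruction bytes, a consumer walks the buffer by repeatedly skipping one header plus its payload. The sketch below illustrates that walk under a loudly stated assumption: it uses a simplified two-byte header (engine id in the low 4 bits of byte 0, len in byte 1), whereas the real cmd\_hdr\_t contains further fields that are elided in the listing above. `toy_count_cmds` and `TOY_HDR_SIZE` are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>

/* ASSUMED toy layout: byte 0 carries engine_id in its low 4 bits,
 * byte 1 carries len, and len instruction bytes follow. The real
 * cmd_hdr_t has additional fields, so this header size is
 * illustrative only. */
#define TOY_HDR_SIZE 2u

/* Counts the instructions stored in buf[0..size). Returns -1 when a
 * header or its payload would run past the end of the buffer. */
int toy_count_cmds(const uint8_t *buf, uint32_t size)
{
    uint32_t off = 0;
    int count = 0;
    while (off < size) {
        if (size - off < TOY_HDR_SIZE) {
            return -1;                        /* truncated header */
        }
        uint8_t len = buf[off + 1];           /* payload length in bytes */
        if (size - off - TOY_HDR_SIZE < len) {
            return -1;                        /* truncated payload */
        }
        off += TOY_HDR_SIZE + len;            /* jump to the next header */
        count++;
    }
    return count;
}
```

In this toy layout the engine id of the current instruction would be read as `buf[off] & 0x0f` before advancing.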

### bmk1880\_reset

bmk1880\_reset resets the current BMKernel context to its initial state, as returned by bmk1880\_register. This function is usually called after bmk1880\_acquire\_cmdbuf to empty the cmdbuf buffer.

```c
void bmk1880_reset(void *ctx); 
```

### bmk1880\_parallel\_enable

bmk1880\_parallel\_enable declares that the following computations on different engines can be executed without synchronization with each other. This function enables the engine-oriented parallel programming style.

```c
void bmk1880_parallel_enable(void *ctx); 
```

### bmk1880\_parallel\_disable

bmk1880\_parallel\_disable disables the engine-oriented parallel programming style.

```c
void bmk1880_parallel_disable(void *ctx); 
```

### bmk1880\_create\_streams

bmk1880\_create\_streams creates nr\_streams streams, indexed 0 to (nr\_streams - 1), that subsequent calls to bmk1880\_set\_stream can refer to. This function enables the dependency-oriented parallel programming style. Note that this style cannot be disabled once enabled.

```c
void bmk1880_create_streams(void *ctx, int nr_streams); 
```

### bmk1880\_destroy\_streams

bmk1880\_destroy\_streams destroys all the streams created by the previous call to bmk1880\_create\_streams and resets the system back to serial mode.

```c
void bmk1880_destroy_streams(void *ctx); 
```

### bmk1880\_set\_stream

bmk1880\_set\_stream sets the current stream to stream i, which must have been created by calling bmk1880\_create\_streams. Subsequent computations are put into this stream until another bmk1880\_set\_stream call specifies a different stream index.

```c
void bmk1880_set_stream(void *ctx, int i); 
```

### bmk1880\_add\_dependency

bmk1880\_add\_dependency further requires that the computation represented by before takes place strictly before the one represented by after. Both before and after are pointers returned by a computation API.

```c
void bmk1880_add_dependency( void *ctx, void *before, void *after); 
```

### Computation API

During all kinds of computation, input values are first converted into 32-bit values before any internal computation, and the final 32-bit values are saturated into the range representable by the final 8-bit or 16-bit integer format. That is, if the value before saturation can be represented in the final integer format, it is unchanged. Otherwise it is saturated to the maximum or minimum of the final integer format, whichever is nearer to the original value. For example, if the final integer format is FMT\_U8, then the representable maximum and minimum are 255 and 0 respectively. In this case, any value greater than 255 becomes 255 after saturation, and any value smaller than 0 becomes 0.
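
The saturation rule can be modeled on the host in a few lines. This is a minimal sketch for checking expected results, not SDK code; `saturate_u8` and `saturate_i8` are hypothetical helper names, and the signed 8-bit range of -128 to 127 is the usual two's-complement range.

```c
#include <stdint.h>

/* Saturate a 32-bit intermediate value into the FMT_U8 range [0, 255]. */
uint8_t saturate_u8(int32_t v)
{
    if (v > 255) return 255;
    if (v < 0)   return 0;
    return (uint8_t)v;              /* representable: unchanged */
}

/* Saturate a 32-bit intermediate value into the FMT_I8 range [-128, 127]. */
int8_t saturate_i8(int32_t v)
{
    if (v > 127)  return 127;
    if (v < -128) return -128;
    return (int8_t)v;               /* representable: unchanged */
}
```

For instance, an intermediate value of 300 becomes 255 in FMT\_U8 and 127 in FMT\_I8, while 42 is left unchanged in both.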

Regarding signedness, one general rule applies to all kinds of computation unless otherwise specified: the result is unsigned if and only if all input tensors or matrices are unsigned. A tensor or matrix is signed if its format is FMT\_I8 and unsigned if its format is FMT\_U8.

### fmt\_t

fmt\_t describes the type of the basic data in a tensor or matrix. The name consists of three parts: “FMT” is a fixed prefix, the following “I” or “U” stands for signed or unsigned integer respectively, and “8” is the bit width of the type.

```c
typedef u32 fmt_t;
#define FMT_I8 4 
#define FMT_U8 9 
```

### shape\_t

shape\_t describes the shape of a tensor or matrix. shape\_t4 and shape\_t2 construct a shape\_t for a tensor and a matrix, respectively.

```c
typedef struct {
  u32 dim;
  u32 n;
  u32 c;
  union {
    u32 h;
    u32 row;
  };
  union {
    u32 w;
    u32 col;
  };
} shape_t;

shape_t shape_t4(int n, int c, int h, int w);
shape_t shape_t2(int row, int col);
```

### stride\_t

stride\_t describes the stride of a tensor or matrix. stride\_t4 and stride\_t2 construct a stride\_t for a tensor and a matrix, respectively.

```c
typedef struct {
  u32 n;
  u32 c;
  union {
    u32 h;
    u32 row;
  };
  union {
    u32 w;
    u32 col;
  };
} stride_t;

stride_t stride_t4(int n, int c, int h, int w);
stride_t stride_t2(int row, int col);
```
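
To illustrate what a stride means in general, the sketch below computes the offset of one element from four per-dimension strides. It is a generic model under assumptions, not the SDK's addressing logic: `toy_stride_t`, `toy_offset`, and `toy_packed_stride` are hypothetical names, each stride field is assumed to be the distance between consecutive indices along its dimension, and the densely packed layout shown is not claimed to match the SDK's default aligned or unaligned strides.

```c
#include <stdint.h>

/* Local stand-in for the tensor view of the stride fields. */
typedef struct {
    uint32_t n, c, h, w;
} toy_stride_t;

/* Offset of element (n, c, h, w) from the start of the tensor.
 * Whether strides count bytes or elements is an assumption here;
 * for 8-bit data the two coincide. */
uint64_t toy_offset(toy_stride_t s,
                    uint32_t n, uint32_t c, uint32_t h, uint32_t w)
{
    return (uint64_t)n * s.n + (uint64_t)c * s.c
         + (uint64_t)h * s.h + (uint64_t)w * s.w;
}

/* Strides of a densely packed NCHW tensor with dimensions C, H, W. */
toy_stride_t toy_packed_stride(uint32_t C, uint32_t H, uint32_t W)
{
    toy_stride_t s = { C * H * W, H * W, W, 1 };
    return s;
}
```

With a packed (C, H, W) = (3, 4, 5) layout, the element at (1, 2, 3, 4) sits at offset 60 + 40 + 15 + 4 = 119.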

### tensor\_lmem

tensor\_lmem represents a tensor or matrix in lmem. fmt, shape, and stride are as explained above. If stride is NULL, aligned selects between two frequently used default stride layouts.

```c
typedef struct { 
  fmt_t fmt; 
  shape_t shape; 
  stride_t *stride; 
  bool aligned;
  ...
} tensor_lmem; 
```

For tensors, if aligned is false, the stride values are those of the default unaligned stride; if aligned is true, they are those of the default aligned stride. For matrices, stride values are computed from the shapes of the corresponding specially shaped tensors, following the same rule.

### tensor\_gmem

tensor\_gmem represents a tensor or matrix in gmem.

```c
typedef struct { 
   u64 addr; 
   shape_t shape; 
   stride_t stride; 
} tensor_gmem; 
```

### bmk1880\_chip\_info

bmk1880\_chip\_info returns a structure describing the design parameters of the BM1880 chip.

```c
typedef struct { 
   u32 version; 
   u32 npu_num; 
   u32 eu_num; 
   u32 lmem_size; 
   u32 lmem_banks; 
   u32 lmem_bank_size; 
} bmk1880_chip_info_t; 
bmk1880_chip_info_t bmk1880_chip_info(); 
```

### bmk1880\_tl\_prealloc

bmk1880\_tl\_prealloc allocates a tensor\_lmem structure in heap memory and constructs it as dictated by the parameters. The parameter la is the starting address in lmem. The tensor\_lmem’s aligned field is set to false. If the allocation succeeds, a pointer to the constructed structure is returned; otherwise NULL is returned.

```c
tensor_lmem * bmk1880_tl_prealloc( 
  void *ctx, 
  laddr_t la , 
  shape_t s, 
  fmt_t fmt); 
```

### bmk1880\_tl\_prealloc\_align

Same as bmk1880\_tl\_prealloc, except that the aligned field is set to true.

```c
tensor_lmem * bmk1880_tl_prealloc_align( 
  void *ctx, 
  laddr_t la , 
  shape_t s, 
  fmt_t fmt); 
```

### bmk1880\_tl\_alloc

bmk1880\_tl\_alloc allocates a tensor\_lmem structure in heap memory and constructs it as dictated by the parameters. Unlike bmk1880\_tl\_prealloc, the starting address is not taken from the parameters but assigned automatically by BMKernel. BMKernel manages the starting addresses in lmem with a simple stack: the starting address in each returned tensor\_lmem increases monotonically across successive bmk1880\_tl\_alloc calls, and the most recently allocated tensor\_lmem must be freed first, using the function bmk1880\_tl\_free described below. If the available memory in lmem is not enough to satisfy an allocation request, or some other error occurs, a NULL pointer is returned.

```c
tensor_lmem * bmk1880_tl_alloc( 
  void *ctx, 
  shape_t s, 
  fmt_t fmt , 
  u32 ctrls); 
```

{% hint style="info" %}
The tensor\_lmem’s aligned field is set to false when ctrls is CTRL\_NULL, and to true when ctrls is CTRL\_AL.
{% endhint %}

### bmk1880\_tl\_alloc\_bank

bmk1880\_tl\_alloc\_bank allocates memory from a specific lmem bank, as dictated by the bank\_id parameter.

```c
tensor_lmem * bmk1880_tl_alloc_bank( 
  void *ctx, 
  u32 bank_id , 
  shape_t s, 
  fmt_t fmt , 
  u32 ctrls); 
```

### bmk1880\_tl\_free

bmk1880\_tl\_free frees a tensor\_lmem structure allocated by bmk1880\_tl\_prealloc, bmk1880\_tl\_prealloc\_align, bmk1880\_tl\_alloc, or bmk1880\_tl\_alloc\_bank back to heap memory. If the structure was allocated by bmk1880\_tl\_alloc or bmk1880\_tl\_alloc\_bank, bmk1880\_tl\_free also returns its lmem range to the pool managed by BMKernel and checks that the last-allocated, first-freed rule is obeyed (see bmk1880\_tl\_alloc).

```c
void bmk1880_tl_free(void *ctx, tensor_lmem *tlp);
```
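
The stack discipline described for bmk1880\_tl\_alloc and bmk1880\_tl\_free can be pictured with a toy bump allocator. This is a sketch of the rule only, not BMKernel's implementation: `toy_lmem_stack`, `toy_alloc`, and `toy_free` are hypothetical names, and bank and alignment handling are deliberately left out.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_MAX_ALLOCS 16

/* Toy model only: a stack of lmem ranges with a bump pointer. */
typedef struct {
    uint32_t capacity;                /* total lmem bytes managed */
    uint32_t top;                     /* next free starting address */
    uint32_t starts[TOY_MAX_ALLOCS];  /* start of each live allocation */
    int depth;                        /* number of live allocations */
} toy_lmem_stack;

/* Returns the starting address, or UINT32_MAX if the request cannot be met. */
uint32_t toy_alloc(toy_lmem_stack *s, uint32_t size)
{
    if (s->depth == TOY_MAX_ALLOCS || size > s->capacity - s->top) {
        return UINT32_MAX;
    }
    uint32_t start = s->top;
    s->starts[s->depth++] = start;
    s->top += size;                   /* addresses grow monotonically */
    return start;
}

/* Frees an allocation; succeeds only for the most recent live one. */
bool toy_free(toy_lmem_stack *s, uint32_t start)
{
    if (s->depth == 0 || s->starts[s->depth - 1] != start) {
        return false;                 /* violates last-allocated, first-freed */
    }
    s->top = start;                   /* range becomes available again */
    s->depth--;
    return true;
}
```

Freeing in reverse allocation order always succeeds in this model, while freeing an older allocation first is rejected.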

### bmk1880\_gdma\_copy\_gmem

bmk1880\_gdma\_copy\_gmem instructs DMA to copy a tensor or matrix within gmem. src and dst must both be tensors or both be matrices, and must contain 8-bit basic data only. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. When src and dst are tensors, ctrls can be CTRL\_TP, indicating N/C-transposition. In other cases, ctrls must be CTRL\_NULL.

```c
void * bmk1880_gdma_copy_gmem( 
  void *ctx, 
  tensor_gmem *dst , 
  tensor_gmem *src , 
  ctrl_t ctrls); 
```
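
The shape constraint on src and dst amounts to comparing element counts. The following is a minimal host-side check for 4-D tensor shapes, written as a sketch: `toy_elem_count` and `toy_same_elem_count` are hypothetical helpers, and shapes are passed as plain `{n, c, h, w}` arrays rather than the SDK's shape\_t.

```c
#include <stdbool.h>
#include <stdint.h>

/* Number of basic data in a tensor of shape {n, c, h, w}. */
uint64_t toy_elem_count(const uint32_t shape[4])
{
    return (uint64_t)shape[0] * shape[1] * shape[2] * shape[3];
}

/* True when two tensor shapes hold the same number of basic data,
 * which is the condition under which a copy between them is allowed. */
bool toy_same_elem_count(const uint32_t dst[4], const uint32_t src[4])
{
    return toy_elem_count(dst) == toy_elem_count(src);
}
```

For example, shapes (1, 3, 4, 5) and (1, 1, 6, 10) both hold 60 values and would pass, whereas (2, 3, 4, 5) and (1, 3, 4, 5) would not.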

### bmk1880\_gdma\_copy\_lmem

bmk1880\_gdma\_copy\_lmem instructs DMA to copy a tensor (not a matrix) within lmem, from src to dst. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. The basic data must be 8-bit.

```c
void * bmk1880_gdma_copy_lmem( 
  void *ctx, 
  tensor_lmem *dst , 
  tensor_lmem *src); 
```

### bmk1880\_gdma\_load

bmk1880\_gdma\_load instructs DMA to copy a tensor or matrix from gmem to lmem. The tensor or matrix starts at gaddr in gmem and uses the default stride values. When ctrls is CTRL\_TP (instead of CTRL\_NULL), it indicates N/C-transposition for a tensor, or row/column-transposition for a matrix. The basic data must be 8-bit.

```c
void * bmk1880_gdma_load( 
  void *ctx, 
  tensor_lmem *t, 
  u64 gaddr , 
  ctrl_t ctrls); 
```

### bmk1880\_gdma\_store

Similar to bmk1880\_gdma\_load, but copies the tensor or matrix from lmem to gmem.

```c
void * bmk1880_gdma_store( 
  void *ctx, 
  tensor_lmem *t, 
  u64 gaddr , 
  ctrl_t ctrls); 
```

### bmk1880\_gdma\_load\_stride

Similar to bmk1880\_gdma\_load, but lets users specify the stride values in gmem.

```c
void * bmk1880_gdma_load_stride( 
  void *ctx, 
  tensor_lmem *t,
  u64 gaddr , 
  stride_t stride , 
  ctrl_t ctrls); 
```

### bmk1880\_gdma\_store\_stride

Similar to bmk1880\_gdma\_store, but lets users specify the stride values in gmem.

```c
void * bmk1880_gdma_store_stride( 
  void *ctx, 
  tensor_lmem *t, 
  u64 gaddr , 
  stride_t stride , 
  ctrl_t ctrls); 
```

### bmk1880\_gdma\_lrn\_shift

bmk1880\_gdma\_lrn\_shift instructs DMA to compute a tensor (not a matrix) dst from a tensor src, both of the same shape (N, C, H, W). If right\_shift is true, the computation copies the datum at index (ni, ci, hi, wi) in src to index (ni, ci + lrn\_step, hi, wi) in dst for each 0 ≤ ci < C − lrn\_step, and sets the datum at index (ni, ci, hi, wi) in dst to zero for each 0 ≤ ci < lrn\_step. If right\_shift is false, the computation copies the datum at index (ni, ci, hi, wi) in src to index (ni, ci − lrn\_step, hi, wi) in dst for each lrn\_step ≤ ci < C, and sets the datum at index (ni, ci, hi, wi) in dst to zero for each C − lrn\_step ≤ ci < C. The basic data must be 8-bit.

```c
void * bmk1880_gdma_lrn_shift( 
  void *ctx, 
  tensor_lmem *dst , 
  tensor_lmem *src , 
  bool right_shift , 
  int lrn_step); 
```
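
The channel shift described above can be expressed as a host-side reference model, which is handy for checking device results. This sketch makes several assumptions that are not from the SDK: `ref_lrn_shift` is a hypothetical name, both tensors are densely packed NCHW byte arrays that do not overlap, and a step larger than C is clamped to C so that the output is all zeros.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Host-side reference for packed NCHW uint8 tensors of N*C*H*W bytes. */
void ref_lrn_shift(uint8_t *dst, const uint8_t *src,
                   size_t N, size_t C, size_t H, size_t W,
                   bool right_shift, size_t lrn_step)
{
    size_t plane = H * W;                   /* bytes in one (n, c) slice */
    if (lrn_step > C) {
        lrn_step = C;                       /* assumption: clamp the step */
    }
    size_t kept = (C - lrn_step) * plane;   /* bytes that survive the shift */
    size_t gap  = lrn_step * plane;         /* bytes that become zero */
    for (size_t n = 0; n < N; n++) {
        uint8_t *d = dst + n * C * plane;
        const uint8_t *s = src + n * C * plane;
        if (right_shift) {
            memset(d, 0, gap);              /* channels [0, lrn_step) */
            memcpy(d + gap, s, kept);       /* src c -> dst c + lrn_step */
        } else {
            memcpy(d, s + gap, kept);       /* src c -> dst c - lrn_step */
            memset(d + kept, 0, gap);       /* channels [C - lrn_step, C) */
        }
    }
}
```

Because channels are contiguous within one batch entry in a packed layout, each direction reduces to one copy and one zero fill per value of n.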

### bmk1880\_tpu\_mul

bmk1880\_tpu\_mul instructs the TPU to compute resi = (ai × bi) ≫ rshift\_width for each datum ai in tensor a and bi in tensor b, where resi, ai, and bi have the same index. All tensors must have the same shape, and the basic data in all tensor\_lmem structures must be 8-bit. If the result is a 16-bit tensor, res\_high and res\_low represent its high and low 8-bit parts, respectively. res\_high should be NULL if the result is 8-bit. rshift\_width is the number of bits by which each result value is shifted right before saturation.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a; 
  tensor_lmem *b; 
  int rshift_width; 
} bmk1880_mul_param_t; 

void * bmk1880_tpu_mul(void *ctx, const bmk1880_mul_param_t *p); 
```
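
A per-element model of this computation may help when predicting results on the host. The sketch below is not SDK code and rests on stated assumptions: both inputs are signed (FMT\_I8), the right shift is an arithmetic shift without rounding, and 0 ≤ rshift\_width < 32. `asr32`, `ref_mul_i8`, and `ref_mul_i16` are hypothetical helpers.

```c
#include <stdint.h>

/* Arithmetic shift right that is well defined for negative values. */
static int32_t asr32(int32_t v, int r)
{
    return v >= 0 ? v >> r : ~((~v) >> r);
}

/* 8-bit result: shift the widened product, then saturate to [-128, 127]. */
int8_t ref_mul_i8(int8_t a, int8_t b, int rshift_width)
{
    int32_t v = asr32((int32_t)a * (int32_t)b, rshift_width);
    if (v > 127)  v = 127;
    if (v < -128) v = -128;
    return (int8_t)v;
}

/* 16-bit result: saturate to the signed 16-bit range, then split the
 * value into its high and low bytes, as res_high and res_low do. */
void ref_mul_i16(int8_t a, int8_t b, int rshift_width,
                 uint8_t *res_high, uint8_t *res_low)
{
    int32_t v = asr32((int32_t)a * (int32_t)b, rshift_width);
    if (v > INT16_MAX) v = INT16_MAX;
    if (v < INT16_MIN) v = INT16_MIN;
    uint16_t bits = (uint16_t)v;    /* modular conversion keeps the bit pattern */
    *res_high = (uint8_t)(bits >> 8);
    *res_low  = (uint8_t)(bits & 0xff);
}
```

With a = b = 100, the product 10000 saturates to 127 as an 8-bit result, while shifting right by 7 first yields 78.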

### bmk1880\_tpu\_mul\_const

Similar to bmk1880\_tpu\_mul, but tensor b is replaced by an 8-bit constant. The constant is signed if b\_is\_signed is true, and unsigned otherwise.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a; 
  s8 b; 
  bool b_is_signed; 
  int rshift_width; 
} bmk1880_mul_const_param_t; 

void * bmk1880_tpu_mul_const(void *ctx, const 
bmk1880_mul_const_param_t *p); 
```

### bmk1880\_tpu\_mac

bmk1880\_tpu\_mac instructs TPU to compute res\_i = (a\_i × b\_i + (res\_i ≪ lshift\_width)) ≫ rshift\_width for each datum a\_i in tensor a, b\_i in tensor b and res\_i represented by res\_high and res\_low together, where res\_i, a\_i and b\_i share the same index. All tensors must be of the same shape, and the basic data in all tensor\_lmem structures must be 8-bit. The result is a 16-bit tensor if res\_is\_int8 is false, or an 8-bit tensor otherwise. rshift\_width is the number of bits by which each result value is shifted right before saturation. Note that res\_high and res\_low are used both as the input res\_i values and as the output res\_i values. The input res\_i values are always 16-bit, so both res\_high and res\_low must be non-NULL. When the result is an 8-bit tensor, it is stored into res\_low.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  bool res_is_int8; 
  tensor_lmem *a; 
  tensor_lmem *b; 
  int lshift_width;
  int rshift_width;
} bmk1880_mac_param_t; 

void * bmk1880_tpu_mac(void *ctx, const bmk1880_mac_param_t *p); 
```
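The multiply-accumulate formula can be modelled per element as follows. This is a sketch with hypothetical names (`mac_ref`, `shr_floor`); it assumes signed data, a floor-style right shift, and saturation to the signed range of the selected result width.

```c
#include <assert.h>
#include <stdint.h>

/* Floor-style right shift that is well defined for negative values. */
static int32_t shr_floor(int32_t v, int bits)
{
    assert(bits >= 0 && bits < 31);
    if (v >= 0)
        return v >> bits;
    return -(int32_t)((-(int64_t)v + ((int64_t)1 << bits) - 1) >> bits);
}

/* res = saturate((a * b + (res_in << lshift_width)) >> rshift_width).
 * res_in is the 16-bit value formed by res_high and res_low on input. */
int16_t mac_ref(int16_t res_in, int8_t a, int8_t b,
                int lshift_width, int rshift_width, int res_is_int8)
{
    assert(lshift_width >= 0 && lshift_width <= 15);
    /* Multiply instead of shifting so negative res_in stays well defined. */
    int32_t acc = (int32_t)a * (int32_t)b
                + (int32_t)res_in * ((int32_t)1 << lshift_width);
    int32_t v = shr_floor(acc, rshift_width);
    int32_t hi = res_is_int8 ? INT8_MAX : INT16_MAX;
    int32_t lo = res_is_int8 ? INT8_MIN : INT16_MIN;
    if (v > hi) v = hi;
    if (v < lo) v = lo;
    return (int16_t)v;   /* an 8-bit result would live in res_low */
}
```

Calling `mac_ref` repeatedly with lshift\_width = 0 and rshift\_width = 0 accumulates products into a 16-bit running sum; a final call with a non-zero rshift\_width and res\_is\_int8 set rescales the sum to 8 bits.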

### bmk1880\_tpu\_mac\_const

Similar to bmk1880\_tpu\_mac, but tensor b is replaced by an 8-bit constant. The constant is signed if b\_is\_signed is true, unsigned otherwise.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  bool res_is_int8; 
  tensor_lmem *a; 
  s8 b; 
  bool b_is_signed; 
  int lshift_width; 
  int rshift_width; 
} bmk1880_mac_const_param_t; 

void * bmk1880_tpu_mac_const(void *ctx, const 
bmk1880_mac_const_param_t *p); 
```

### bmk1880\_tpu\_add

bmk1880\_tpu\_add instructs TPU to compute res\_i = a\_i + b\_i for each datum a\_i in tensor a and b\_i in tensor b, where res\_i, a\_i and b\_i share the same index. All tensors must be of the same shape, and the basic data in all tensor\_lmem structures must be 8-bit. Tensor a and tensor b must both be 16-bit, so a\_high and b\_high must not be NULL. If the result is a 16-bit tensor, res\_high and res\_low hold its high and low 8-bit parts, respectively. res\_high should be NULL if the result is 8-bit.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a_high; 
  tensor_lmem *a_low; 
  tensor_lmem *b_high; 
  tensor_lmem *b_low; 
} bmk1880_add_param_t; 

void * bmk1880_tpu_add(void *ctx, const bmk1880_add_param_t *p); 
```
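How a 16-bit tensor is carried by a pair of 8-bit tensors can be illustrated with a host-side model of the addition. `add_ref` and `join16` are hypothetical helpers; the sketch assumes signed 16-bit values and saturation of the sum, neither of which is spelled out in this section.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Combine high and low bytes into a signed 16-bit value. */
static int32_t join16(uint8_t high, uint8_t low)
{
    int32_t v = ((int32_t)high << 8) | (int32_t)low;
    return v >= 32768 ? v - 65536 : v;
}

/* Element-wise 16-bit add over byte-split operands (assumed saturating). */
void add_ref(uint8_t *res_high, uint8_t *res_low,
             const uint8_t *a_high, const uint8_t *a_low,
             const uint8_t *b_high, const uint8_t *b_low, size_t count)
{
    assert(res_high && res_low && a_high && a_low && b_high && b_low);
    for (size_t i = 0; i < count; i++) {
        int32_t v = join16(a_high[i], a_low[i]) + join16(b_high[i], b_low[i]);
        if (v > INT16_MAX) v = INT16_MAX;
        if (v < INT16_MIN) v = INT16_MIN;
        uint16_t u = (uint16_t)v;            /* conversion is modulo 2^16 */
        res_high[i] = (uint8_t)(u >> 8);     /* high 8-bit part */
        res_low[i]  = (uint8_t)(u & 0xFF);   /* low 8-bit part  */
    }
}
```

For example, 300 (0x012C) plus −45 (0xFFD3) gives 255, stored as high byte 0x00 and low byte 0xFF.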

### bmk1880\_tpu\_add\_const

Similar to bmk1880\_tpu\_add, but tensor b is replaced by a 16-bit constant. The constant is signed if b\_is\_signed is true, unsigned otherwise.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a_high; 
  tensor_lmem *a_low; 
  s16 b; 
  bool b_is_signed; 
} bmk1880_add_const_param_t; 

void * bmk1880_tpu_add_const(void *ctx, const bmk1880_add_const_param_t *p); 
```

### bmk1880\_tpu\_sub

bmk1880\_tpu\_sub instructs TPU to compute res\_i = a\_i − b\_i for each datum a\_i in tensor a and b\_i in tensor b, where res\_i, a\_i and b\_i share the same index. All tensors must be of the same shape, and the basic data in all tensor\_lmem structures must be 8-bit. Tensor a and tensor b must both be 16-bit, so a\_high and b\_high must not be NULL. The result must consist of signed integers, so the fmt\_t field in res\_high and res\_low must be FMT\_I8. If the result is a 16-bit tensor, res\_high and res\_low hold its high and low 8-bit parts, respectively. res\_high should be NULL if the result is 8-bit.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a_high; 
  tensor_lmem *a_low;  
  tensor_lmem *b_high; 
  tensor_lmem *b_low; 
} bmk1880_sub_param_t; 

void * bmk1880_tpu_sub(void *ctx, const bmk1880_sub_param_t *p); 
```

### bmk1880\_tpu\_max

bmk1880\_tpu\_max instructs TPU to compute res\_i = max(a\_i, b\_i) for each datum a\_i in tensor a and b\_i in tensor b, where res\_i, a\_i and b\_i share the same index. All tensors must be of the same shape, and the basic data in all tensor\_lmem structures must be 8-bit. Tensor a and tensor b must be either both signed or both unsigned.

```c
typedef struct { 
  tensor_lmem *max; 
  tensor_lmem *a; 
  tensor_lmem *b; 
} bmk1880_max_param_t; 

void * bmk1880_tpu_max(void *ctx, const bmk1880_max_param_t *p); 
```

### bmk1880\_tpu\_min

Similar to bmk1880\_tpu\_max, but computes res\_i = min(a\_i, b\_i).

```c
typedef struct { 
  tensor_lmem *min; 
  tensor_lmem *a; 
  tensor_lmem *b; 
} bmk1880_min_param_t; 

void * bmk1880_tpu_min(void *ctx, const bmk1880_min_param_t *p); 
```

### bmk1880\_tpu\_min\_const

Similar to bmk1880\_tpu\_min, but tensor b is replaced by an 8-bit constant. The constant is signed if b\_is\_signed is true, unsigned otherwise.

```c
typedef struct { 
  tensor_lmem *min; 
  tensor_lmem *a; 
  s8 b; 
  bool b_is_signed; 
} bmk1880_min_const_param_t; 

void * bmk1880_tpu_min_const(void *ctx, const bmk1880_min_const_param_t *p); 
```

### bmk1880\_tpu\_arith\_shift

bmk1880\_tpu\_arith\_shift instructs TPU to compute res\_i = a\_i ≫ bits\_i for each datum a\_i in tensor a and bits\_i in tensor bits, where res\_i, a\_i and bits\_i share the same index. All tensors must be of the same shape, and the basic data in all tensor\_lmem structures must be 8-bit. Tensor a must be 16-bit and signed, so the fmt fields in a\_high and a\_low must be FMT\_I8. Tensor bits must be signed, and every datum in it must lie in the range \[−16, 16]. The result tensor must be 16-bit, so res\_high must be non-NULL.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a_high; 
  tensor_lmem *a_low; 
  tensor_lmem *bits; 
} bmk1880_arith_shift_param_t; 

void * bmk1880_tpu_arith_shift(void *ctx, const bmk1880_arith_shift_param_t *p); 
```

### bmk1880\_tpu\_and\_int8

bmk1880\_tpu\_and\_int8 instructs TPU to compute res\_i = a\_i ∧ b\_i for each datum a\_i in tensor a and b\_i in tensor b, where res\_i, a\_i and b\_i share the same index. All tensors must be of the same shape, and the basic data in all tensor\_lmem structures must be 8-bit.

```c
typedef struct { 
  tensor_lmem *res; 
  tensor_lmem *a; 
  tensor_lmem *b; 
} bmk1880_and_int8_param_t; 

void * bmk1880_tpu_and_int8(void *ctx, const 
bmk1880_and_int8_param_t *p); 
```

### bmk1880\_tpu\_and\_int16

Similar to bmk1880\_tpu\_and\_int8, but all input and output tensors are 16-bit, so res\_high, a\_high and b\_high must be non-NULL.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a_high; 
  tensor_lmem *a_low; 
  tensor_lmem *b_high; 
  tensor_lmem *b_low; 
} bmk1880_and_int16_param_t; 

void * bmk1880_tpu_and_int16(void *ctx, const 
bmk1880_and_int16_param_t *p); 
```

### bmk1880\_tpu\_or\_int8

Similar to bmk1880\_tpu\_and\_int8, but computes res\_i = a\_i ∨ b\_i.

```c
typedef struct {
  tensor_lmem *res; 
  tensor_lmem *a; 
  tensor_lmem *b; 
} bmk1880_or_int8_param_t; 

void * bmk1880_tpu_or_int8(void *ctx, const 
bmk1880_or_int8_param_t *p); 
```

### bmk1880\_tpu\_or\_int16

Similar to bmk1880\_tpu\_and\_int16, but computes res\_i = a\_i ∨ b\_i.

```c
typedef struct { 
  tensor_lmem *res_high;
  tensor_lmem *res_low; 
  tensor_lmem *a_high; 
  tensor_lmem *a_low; 
  tensor_lmem *b_high; 
  tensor_lmem *b_low; 
} bmk1880_or_int16_param_t; 

void * bmk1880_tpu_or_int16(void *ctx, const 
bmk1880_or_int16_param_t *p); 
```

### bmk1880\_tpu\_xor\_int8

Similar to bmk1880\_tpu\_and\_int8, but computes res\_i = a\_i ⊕ b\_i.

```c
typedef struct { 
  tensor_lmem *res;
  tensor_lmem *a; 
  tensor_lmem *b; 
} bmk1880_xor_int8_param_t; 

void * bmk1880_tpu_xor_int8(void *ctx, const 
bmk1880_xor_int8_param_t *p); 
```

### bmk1880\_tpu\_xor\_int16

Similar to bmk1880\_tpu\_and\_int16, but computes res\_i = a\_i ⊕ b\_i.

```c
typedef struct { 
  tensor_lmem *res_high;   
  tensor_lmem *res_low; 
  tensor_lmem *a_high; 
  tensor_lmem *a_low; 
  tensor_lmem *b_high; 
  tensor_lmem *b_low; 
} bmk1880_xor_int16_param_t; 

void * bmk1880_tpu_xor_int16(void *ctx, const 
bmk1880_xor_int16_param_t *p); 
```

### bmk1880\_tpu\_copy

bmk1880\_tpu\_copy instructs TPU to copy a tensor within lmem, from src to dst. The basic data must be 8-bit.

```c
typedef struct { 
  tensor_lmem *dst; 
  tensor_lmem *src; 
} bmk1880_copy_param_t; 

void * bmk1880_tpu_copy(void *ctx, const bmk1880_copy_param_t *p) ; 
```

### bmk1880\_tpu\_copy\_with\_stride

Similar to bmk1880\_tpu\_copy, but the user provides stride\_t structures specifying the layouts of tensors dst and src. The basic data must be 8-bit.

```c
typedef struct { 
  tensor_lmem *dst; 
  stride_t dst_stride; 
  tensor_lmem *src; 
  stride_t src_stride; 
} bmk1880_copy_with_stride_param_t; 

void * bmk1880_tpu_copy_with_stride( 
void *ctx, 
const bmk1880_copy_with_stride_param_t *p); 
```
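The effect of per-tensor strides can be illustrated with a flat-memory model. The sketch below is an assumption-heavy simplification: `ref_stride_t` is a hypothetical stand-in for stride\_t with strides counted in elements, and the model ignores how lmem is divided into slices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for stride_t: element strides per dimension. */
typedef struct { size_t n, c, h, w; } ref_stride_t;

/* Copy an (n, c, h, w) tensor where src and dst each use their own layout. */
void copy_with_stride_ref(int8_t *dst, ref_stride_t ds,
                          const int8_t *src, ref_stride_t ss,
                          int n, int c, int h, int w)
{
    assert(dst != NULL && src != NULL);
    for (int ni = 0; ni < n; ni++)
        for (int ci = 0; ci < c; ci++)
            for (int hi = 0; hi < h; hi++)
                for (int wi = 0; wi < w; wi++) {
                    size_t s = ni * ss.n + ci * ss.c + hi * ss.h + wi * ss.w;
                    size_t d = ni * ds.n + ci * ds.c + hi * ds.h + wi * ds.w;
                    dst[d] = src[s];
                }
}
```

Choosing destination strides that swap the roles of H and W, for instance, turns the copy into a transpose of each (H, W) plane.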

### bmk1880\_tpu\_mdsum

bmk1880\_tpu\_mdsum instructs TPU to compute a tensor res of shape (1,C,1,1) from tensor a of shape (N,C,H,W). Every datum res\_ci at index (0,ci,0,0) in tensor res is computed as <img src="https://2115705518-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LOME7sTpZZ0QmlqOlnw%2F-LPFNH7ZOCsoZt5Lp51r%2F-LPFVD0mjSADia6dK0B-%2F%E5%9B%BE%E7%89%87%201.png?alt=media&#x26;token=4c2da989-2b73-42ba-ac01-da340772149a" alt="" data-size="line"> where a\_(ni,ci,hi,wi) is the datum at index (ni,ci,hi,wi) in tensor a. The basic data in all tensor\_lmem structures must be 8-bit. If the result is a 16-bit tensor, res\_high and res\_low hold its high and low 8-bit parts, respectively. res\_high should be NULL if the result is 8-bit. a and res must be either both signed or both unsigned.

```c
typedef struct { 
  tensor_lmem *res_high; 
  tensor_lmem *res_low; 
  tensor_lmem *a; 
} bmk1880_mdsum_param_t; 

void * bmk1880_tpu_mdsum(void *ctx, const bmk1880_mdsum_param_t * p); 
```
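A host-side model of the reduction is sketched below. It assumes, based on the shapes given above, that each output channel is the sum of all data in that channel over the N, H and W dimensions; `mdsum_ref` is a hypothetical helper that returns unsaturated 32-bit sums and leaves narrowing to 8 or 16 bits to the caller.

```c
#include <assert.h>
#include <stdint.h>

/* res[ci] = sum over ni, hi, wi of a[ni][ci][hi][wi] (dense NCHW input). */
void mdsum_ref(int32_t *res, const int8_t *a, int n, int c, int h, int w)
{
    assert(n > 0 && c > 0 && h > 0 && w > 0);
    for (int ci = 0; ci < c; ci++)
        res[ci] = 0;
    for (int ni = 0; ni < n; ni++)
        for (int ci = 0; ci < c; ci++)
            for (int hi = 0; hi < h; hi++)
                for (int wi = 0; wi < w; wi++)
                    res[ci] += a[((ni * c + ci) * h + hi) * w + wi];
}
```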

### bmk1880\_tpu\_lut

bmk1880\_tpu\_lut instructs TPU to compute a tensor res from tensor idx, using tensor table as a lookup table and the values in tensor idx as indices. Tensor table must be of shape (1, slices, 16, 16), where slices is the number of lmem slices. Tensor idx and tensor res must be of the same shape. Assuming their shape is (N,C,H,W), the datum res\_i at index (ni,ci,hi,wi) in tensor res is computed from idx\_i at the same index in tensor idx as res\_i = table\_i, where table\_i is at index (0, ct, ⌊idx\_i / 16⌋, idx\_i mod 16) in tensor table, and ct is the index of the lmem slice in which the datum idx\_i resides. The basic data in all tensor\_lmem structures must be 8-bit.

```c
typedef struct { 
  tensor_lmem *ofmap; 
  tensor_lmem *ifmap; 
  tensor_lmem *table; 
} bmk1880_lut_param_t; 

void * bmk1880_tpu_lut(void *ctx, const bmk1880_lut_param_t *p); 
```
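The table indexing can be sketched for a single lmem slice as follows. `lut_ref` is a hypothetical host-side helper; it assumes unsigned 8-bit indices and a row-major 16 × 16 table, so that row ⌊idx / 16⌋ and column idx mod 16 select one of 256 entries.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Look up count indices in one slice's 16x16 table (256 entries). */
void lut_ref(uint8_t *res, const uint8_t *idx, size_t count,
             const uint8_t *table_slice)
{
    assert(res != NULL && idx != NULL && table_slice != NULL);
    for (size_t i = 0; i < count; i++) {
        unsigned row = idx[i] / 16u;   /* H index within the table */
        unsigned col = idx[i] % 16u;   /* W index within the table */
        res[i] = table_slice[row * 16u + col];
    }
}
```

Because every slice consults its own copy of the table, a caller would typically replicate the same 256 entries across all slices.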

### bmk1880\_tpu\_relu

bmk1880\_tpu\_relu instructs TPU to compute res\_i = max(0, a\_i) for each datum a\_i in tensor a, where res\_i and a\_i share the same index. The basic data in all tensor\_lmem structures must be 8-bit.

```c
typedef struct { 
  tensor_lmem *ofmap; 
  tensor_lmem *ifmap; 
} bmk1880_relu_param_t; 

void * bmk1880_tpu_relu(void *ctx, const bmk1880_relu_param_t *p) ; 
```

### bmk1880\_tpu\_conv

bmk1880\_tpu\_conv instructs TPU to compute a tensor ofmap from tensors ifmap, weight and bias, using ifmap as the input feature map, weight as the convolution kernel and bias as the bias added to the convolution result. If relu\_enable is true, ReLU activations are applied after adding the bias values but before shifting every basic datum. rshift\_width specifies the number of bits by which every basic datum is shifted right after the optional ReLU activations.

```c
typedef struct { 
  tensor_lmem *ofmap; 
  tensor_lmem *ifmap; 
  tensor_lmem *weight; 
  tensor_lmem *bias; 
  u8 ins_h, ins_last_h; 
  u8 ins_w, ins_last_w; 
  u8 pad_top, pad_bottom;
  u8 pad_left, pad_right;
  u8 stride_h, stride_w;
  u8 dilation_h, dilation_w;
  bool relu_enable; 
  int rshift_width; 
} bmk1880_conv_param_t; 

void * bmk1880_tpu_conv(void *ctx, const bmk1880_conv_param_t *p) ; 
```

{% hint style="info" %}
ofmap and ifmap must be aligned (see BMKernel 1880 Guide.pdf).
{% endhint %}

weight has a special layout that differs substantially from the one described in the programming model (see BMKernel 1880 Guide.pdf). If ifmap is of shape (Nin, Cin, Hin, Win), ofmap is of shape (Nout, Cout, Hout, Wout) and the convolution kernels are of shape (Hkernel, Wkernel), then weight should be of shape (Cin, Cout, Hkernel, Wkernel).

The layout of weight, however, is as if it were of shape (1, Cout, Hkernel × Wkernel, Cin). This special layout can be defined precisely by applying the following stride values to weight’s logical shape (Cin, Cout, Hkernel, Wkernel):

![](https://2115705518-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LOME7sTpZZ0QmlqOlnw%2F-LPFVxkeOQa405QhuARV%2F-LPFXfHsZXBeHTTPrSsL%2F121.png?alt=media\&token=86099f76-5baf-4aee-983f-95da0bce32fc)

bias may be NULL, indicating no bias values. If it is non-NULL and ofmap is of shape (N, C, H, W), then bias must be a 16-bit tensor of shape (1, C, 1, 1). Since a 16-bit tensor is stored as two 8-bit tensors in lmem, bias’s tensor\_lmem structure must be of shape (2,C,1,1) and must be unaligned (see unaligned stride values in section 2.4 of BMKernel 1880 Guide.pdf). During the bias-adding phase, the value of the datum at index (0,ci,0,0) in the 16-bit tensor is added to all data in ofmap whose C-dimension index is ci.

param contains detailed convolution parameters, which fall into four categories by function: insertion, padding, striding and dilation parameters. They are detailed below.

Insertion parameters specify the number of zeros to be inserted at specific locations within ifmap. They are ins\_h, ins\_last\_h, ins\_w and ins\_last\_w. ins\_h specifies the number of zeros to be inserted after every non-last basic datum along the H-dimension. Consider an ifmap of shape (N, C, H, W), for example. After inserting zeros, ifmap′ is of shape (N, C, H′, W), where H′ = 1 + (H − 1) × (ins\_h + 1). Denoting by x\_(ni,ci,hi,wi) the value of the basic datum at index (ni,ci,hi,wi) of tensor ifmap, and by x′\_(ni,ci,hi,wi) the value of that of tensor ifmap′, the following holds:

![](https://2115705518-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LOME7sTpZZ0QmlqOlnw%2F-LPFVxkeOQa405QhuARV%2F-LPFYBf-AqeUfc_Ic3if%2F%E5%9B%BE%E7%89%87%201.png?alt=media\&token=de4129ad-e3a5-4054-a06d-53326f97c16e)

ins\_last\_h specifies the number of zeros to be inserted only after every last basic datum along the H-dimension. Similarly, ins\_w and ins\_last\_w specify the number of zeros to be inserted along the W-dimension.

Padding parameters specify the number of zeros to be inserted around the elements of ifmap. pad\_top specifies the number of zeros to be inserted before every first basic datum along the H-dimension. pad\_bottom specifies the number after every last basic datum along the H-dimension. Similarly, pad\_left and pad\_right specify the numbers along the W-dimension.

Striding parameters specify the number of basic data the convolution kernel strides over after each convolution step. stride\_h and stride\_w specify the number along the H-dimension and W-dimension, respectively.

Dilation parameters specify the dilation of the convolution kernel weight. That is, (dilation\_h − 1) zeros are inserted between every two adjacent basic data of the kernel along the H-dimension. Similarly, (dilation\_w − 1) zeros are inserted along the W-dimension.
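The interaction of the four parameter categories can be sketched along a single dimension. The helpers below are hypothetical and host-side only. They assume that ins\_last zeros are appended after the last datum, that padding is applied around the zero-inserted data, and that the output extent follows the usual convolution arithmetic; the hardware's exact rule should be confirmed against the BMKernel guide.

```c
#include <assert.h>
#include <stdint.h>

/* Extent after zero insertion: 1 + (len - 1) * (ins + 1), plus ins_last. */
int inserted_extent(int len, int ins, int ins_last)
{
    assert(len >= 1 && ins >= 0 && ins_last >= 0);
    return 1 + (len - 1) * (ins + 1) + ins_last;
}

/* Spread x[0..len) into out, leaving ins zeros after each non-last datum.
 * out must hold inserted_extent(len, ins, 0) entries. */
void insert_zeros(int8_t *out, const int8_t *x, int len, int ins)
{
    int out_len = inserted_extent(len, ins, 0);
    for (int i = 0; i < out_len; i++)
        out[i] = 0;
    for (int i = 0; i < len; i++)
        out[i * (ins + 1)] = x[i];
}

/* Output extent of one dimension (assumed standard convolution formula). */
int conv_out_extent(int len, int ins, int ins_last,
                    int pad_before, int pad_after,
                    int kernel, int dilation, int stride)
{
    assert(kernel >= 1 && dilation >= 1 && stride >= 1);
    int in_ext = pad_before + inserted_extent(len, ins, ins_last) + pad_after;
    int k_ext = 1 + (kernel - 1) * dilation;  /* dilation - 1 zeros between taps */
    assert(in_ext >= k_ext);
    return (in_ext - k_ext) / stride + 1;
}
```

For H = 5 with a 3-tap kernel, one row of padding on each side, stride 1 and dilation 1, the sketch gives an output height of 5; raising the dilation to 2 without padding shrinks it to 1.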

### bmk1880\_tpu\_winograd

Similar to bmk1880\_tpu\_conv, but uses the Winograd algorithm to accelerate the computation. Moreover, weight must contain only 3 × 3 kernels and must be default strided in lmem (see section 2.4 of BMKernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in function bmk1880\_tpu\_conv (see section 4.41 of BMKernel 1880 Guide.pdf).

```c
typedef struct { 
  tensor_lmem *ofmap; 
  tensor_lmem *ifmap; 
  tensor_lmem *weight; 
  tensor_lmem *bias; 
  u8 ins_h, ins_last_h; 
  u8 ins_w, ins_last_w; 
  u8 pad_top , pad_bottom; 
  u8 pad_left , pad_right;
  bool relu_enable; 
  int rshift_width; 
} bmk1880_winograd_param_t; 

void * bmk1880_tpu_winograd(void *ctx, const bmk1880_winograd_param_t *p); 
```

### bmk1880\_tpu\_depthwise

Similar to bmk1880\_tpu\_conv, but computes a depthwise convolution. Moreover, weight is default strided in lmem (see section 2.4 of BMKernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in function bmk1880\_tpu\_conv.

```c
typedef struct { 
  tensor_lmem *ofmap; 
  tensor_lmem *ifmap; 
  tensor_lmem *weight; 
  tensor_lmem *bias;
  u8 ins_h, ins_last_h;
  u8 ins_w, ins_last_w;
  u8 pad_top, pad_bottom;
  u8 pad_left, pad_right;
  u8 stride_h, stride_w;
  int rshift_width; 
} bmk1880_depthwise_param_t; 

void * bmk1880_tpu_depthwise(void *ctx, const 
bmk1880_depthwise_param_t *p); 
```

### bmk1880\_tpu\_max\_pooling

bmk1880\_tpu\_max\_pooling instructs TPU to compute a tensor ofmap from tensor ifmap by doing a (kh × kw) max pooling over ifmap. The size parameters of the pooling kernel, kh and kw, are specified in param. The other parameters in param are similar to those of the same names in bmk1880\_conv\_param\_t.

{% hint style="info" %}
ofmap and ifmap must be aligned.
{% endhint %}

```c
typedef struct { 
  tensor_lmem *ofmap; 
  tensor_lmem *ifmap; 
  u8 kh, kw; 
  u8 ins_h, ins_last_h; 
  u8 ins_w, ins_last_w; 
  u8 pad_top , pad_bottom;
  u8 pad_left , pad_right;
  u8 stride_h , stride_w; 
} bmk1880_max_pooling_param_t; 

void * bmk1880_tpu_max_pooling(void *ctx, const 
bmk1880_max_pooling_param_t *p); 
```

### bmk1880\_tpu\_avg\_pooling

Similar to bmk1880\_tpu\_max\_pooling, but does an average pooling over ifmap as controlled by avg\_pooling\_const. At every pooling step, all related basic data in ifmap are summed together, multiplied by avg\_pooling\_const, and then shifted right by rshift\_width bits.

```c
typedef struct { 
  tensor_lmem *ofmap; 
  tensor_lmem *ifmap; 
  u8 kh, kw; 
  u8 ins_h, ins_last_h; 
  u8 ins_w, ins_last_w; 
  u8 pad_top ,  pad_bottom;
  u8 pad_left ,  pad_right;
  u8 stride_h ,  stride_w; 
  u8 avg_pooling_const; 
  int rshift_width; 
} bmk1880_avg_pooling_param_t; 

void * bmk1880_tpu_avg_pooling(void *ctx, const 
bmk1880_avg_pooling_param_t *p); 
```
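Since the division by the window size is expressed as a multiply followed by a right shift, avg\_pooling\_const and rshift\_width have to be chosen together so that avg\_pooling\_const / 2^rshift\_width approximates 1 / (kh × kw). The sketch below models one pooling step; `avg_pool_step_ref` is a hypothetical helper that assumes unsigned 8-bit data and saturation of the result to 255.

```c
#include <assert.h>
#include <stdint.h>

/* One pooling step: (sum of window * avg_pooling_const) >> rshift_width. */
uint8_t avg_pool_step_ref(const uint8_t *window, int kh, int kw,
                          uint8_t avg_pooling_const, int rshift_width)
{
    assert(kh > 0 && kh <= 255 && kw > 0 && kw <= 255);
    assert(rshift_width >= 0 && rshift_width < 32);
    uint32_t sum = 0;
    for (int i = 0; i < kh * kw; i++)
        sum += window[i];
    uint32_t v = (sum * avg_pooling_const) >> rshift_width;
    return v > 255u ? 255u : (uint8_t)v;   /* assumed saturation */
}
```

For a 2 × 2 window, avg\_pooling\_const = 64 with rshift\_width = 8 is an exact division by 4. For a 3 × 3 window, 57 / 512 is a close approximation of 1 / 9.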

### bmk1880\_tpu\_matrix\_mac

bmk1880\_tpu\_matrix\_mac instructs TPU to compute a matrix res by multiplying the left matrix left with the right matrix right, then adding the matrix bias (if not NULL), and finally shifting right by rshift\_width bits. Note that all tensor\_lmem structures involved must be matrices rather than tensors. ctrls may have the CTRL\_RELU or CTRL\_RA flag set, but not both. After adding bias but before right shifting, one of two things happens: if ctrls is CTRL\_RELU, ReLU activations are performed, in which negative values are rectified to 0; if ctrls is CTRL\_RA, the original values in res are shifted left by lshift\_width bits and then added into the results. res\_is\_int8 indicates whether the result is 8-bit or 16-bit.

The use of the res matrix is unusual when ctrls is CTRL\_RA or when res\_is\_int8 is false. Assume that the result is a matrix of shape (R,C). When ctrls is CTRL\_RA, the original result is a 16-bit matrix of shape (R,C) represented by res. Since a 16-bit matrix’s high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, res’s tensor\_lmem structure must be of 8-bit format (FMT\_I8 or FMT\_U8), must be of shape (R × 2, C), and must be aligned (see aligned stride values in section 2.4 of BMKernel 1880 Guide.pdf). When res\_is\_int8 is false, the final result is a 16-bit matrix similarly represented by res. When ctrls is CTRL\_RA but res\_is\_int8 is true, the original result is 16-bit while the final result is 8-bit. In this case, only the low 8-bit parts (located at lower addresses) of the res matrix are written with the final result. In the remaining case, where both the original and final results are 8-bit matrices, res is a normal 8-bit matrix of shape (R,C).

Note that bias is different from those in bmk1880\_tpu\_conv, bmk1880\_tpu\_winograd or bmk1880\_tpu\_depthwise. Firstly, it is a matrix. Moreover, if res is of shape (R,C), then bias must be a 16-bit matrix of shape (1,C). Since a 16-bit matrix’s high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, bias’s tensor\_lmem structure must be of shape (2,C) and must be aligned.

res, left, right and bias must all be aligned.

```c
typedef struct { 
  tensor_lmem *res; 
  tensor_lmem *left; 
  tensor_lmem *right;   
  tensor_lmem *bias; 
  int lshift_width; 
  int rshift_width; 
  bool res_is_int8; 
  ctrl_t ctrls; 
} bmk1880_matrix_mac_param_t; 

void * bmk1880_tpu_matrix_mac(void *ctx, const 
bmk1880_matrix_mac_param_t *p); 
```
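The order of operations (multiply, add bias, optional ReLU, right shift) can be sketched with a host-side model. `matrix_mac_ref` is a hypothetical helper covering only the 8-bit result case without CTRL\_RA; it assumes signed data in dense row-major arrays, a floor-style shift and saturation to the signed 8-bit range.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Floor-style right shift that is well defined for negative values. */
static int32_t shr_floor(int32_t v, int bits)
{
    assert(bits >= 0 && bits < 31);
    if (v >= 0)
        return v >> bits;
    return -(int32_t)((-(int64_t)v + ((int64_t)1 << bits) - 1) >> bits);
}

/* res (rows x cols) = saturate(relu?(left * right + bias) >> rshift_width).
 * left is rows x inner, right is inner x cols, bias has cols entries or is NULL. */
void matrix_mac_ref(int8_t *res, const int8_t *left, const int8_t *right,
                    const int16_t *bias, int rows, int inner, int cols,
                    int relu, int rshift_width)
{
    assert(res != NULL && left != NULL && right != NULL);
    for (int r = 0; r < rows; r++) {
        for (int c = 0; c < cols; c++) {
            int32_t acc = bias ? bias[c] : 0;
            for (int k = 0; k < inner; k++)
                acc += (int32_t)left[r * inner + k] * right[k * cols + c];
            if (relu && acc < 0)
                acc = 0;                       /* ReLU before shifting */
            acc = shr_floor(acc, rshift_width);
            if (acc > INT8_MAX) acc = INT8_MAX;
            if (acc < INT8_MIN) acc = INT8_MIN;
            res[r * cols + c] = (int8_t)acc;
        }
    }
}
```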

### bmk1880\_tpu\_matrix\_mac\_2

bmk1880\_tpu\_matrix\_mac\_2 instructs TPU to compute a matrix res by multiplying the left matrix left with the right matrix right. left, right and res must be tensors, even though the computation is a matrix multiplication. res and left must be of shape (1, 256, 1, 256). right must be of shape (256, 16, 1, 16). The basic data in all tensor\_lmem structures must be 8-bit.

```c
typedef struct { 
  tensor_lmem *res; 
  tensor_lmem *left; 
  tensor_lmem *right; 
} bmk1880_matrix_mac_2_param_t; 

void * bmk1880_tpu_matrix_mac_2(void *ctx, const bmk1880_matrix_mac_2_param_t *p);
```

## BMNet Library

### TensorOp

TensorOp represents a BMNET IR, which is the bridge between the front end and the back end. It provides member methods to set information on it or get information from it. Below is the prototype:

```cpp
namespace bmnet {
class TensorOp {
public:
  int input_shape_size();
  int output_shape_size();
  const TensorShape& input_shape(int index);
  const TensorShape& output_shape(int index);
  TensorShape* add_output_shape();
  u64 global_input(int index);
  u64 global_output(int index);
  TGCustomizedParameter* mutable_tg_customized_param(); 
  const TGCustomizedParameter& tg_customized_param();
};
}
```

### TensorOp::input\_shape\_size

```cpp
int TensorOp::input_shape_size()
```

Return the number of inputs.

### TensorOp::output\_shape\_size

```cpp
int TensorOp::output_shape_size()
```

Return the number of outputs.

### TensorOp::input\_shape

```cpp
const TensorShape& TensorOp::input_shape(int index)
```

Return the shape of the input selected by index.

| Parameter | Type | Description                                     |
| --------- | ---- | ----------------------------------------------- |
| index     | int  | \[Required] Index of the input whose shape is returned. |

### TensorOp::output\_shape

```c
const TensorShape& TensorOp::output_shape(int index)
```

Return the shape of the output selected by index.

| Parameter | Type | Description                                      |
| --------- | ---- | ------------------------------------------------ |
| index     | int  | \[Required] Index of the output whose shape is returned. |

### TensorOp::add\_output\_shape

```c
TensorShape* TensorOp::add_output_shape() 
```

Return a mutable pointer to a newly added TensorShape of the outputs. The returned TensorShape can be modified later.

### TensorOp::global\_input

```cpp
u64 TensorOp::global_input(int index) 
```

Return the offset in device memory at which the input tensor selected by index is stored.

| Parameter | Type | Description                                     |
| --------- | ---- | ----------------------------------------------- |
| index     | int  | \[Required] Index of the input whose offset is returned. |

### TensorOp::global\_output

```cpp
u64 TensorOp::global_output(int index) 
```

Return the offset in device memory at which the output tensor selected by index is stored.

| Parameter | Type | Description                                      |
| --------- | ---- | ------------------------------------------------ |
| index     | int  | \[Required] Index of the output whose offset is returned. |

### TensorOp::mutable\_tg\_customized\_param

```cpp
TGCustomizedParameter* TensorOp::mutable_tg_customized_param() 
```

Return a mutable pointer to the parameters of the customized BMNET IR.

### TensorOp::tg\_customized\_param

```cpp
const TGCustomizedParameter& TensorOp::tg_customized_param() 
```

Return a reference to the customized BMNET IR’s parameters.

### CustomizedCaffeLayer

CustomizedCaffeLayer is an abstract class used to implement a Layer that converts a CAFFE layer into BMNet IR (please refer to Chapter 5 for details about BMNet IR). To introduce a customized CAFFE layer into BMNet, inherit from this class and implement all of its pure virtual functions. CustomizedCaffeLayer inherits from the CaffeLayer/Layer classes. Below are their prototypes:

```cpp
namespace bmnet {
class Layer {
 public:
   Layer();
   virtual  ~Layer(void);
   virtual  std::string layer_name() = 0;
   virtual  void dump () = 0;
   virtual  void codegen(TensorOp *op) = 0;

protected:
   void add_output_offset(int offset);
};
} 

namespace bmnet {

class CaffeLayer : public Layer { 
public:
   CaffeLayer(){}
   virtual ~CaffeLayer(void); 
protected:
   caffe::LayerParameter &layer_; 
};
} 

namespace bmnet {

class CustomizedCaffeLayer : public CaffeLayer { 
public:
   CustomizedCaffeLayer(); 
   ~CustomizedCaffeLayer();
   void setup(TensorOp* op) override {
 ... 
 ...
     TGCustomizedParameter* param = op->mutable_tg_customized_param ();
     param->set_sub_type(layer_name()); 
    }
};
}

```

### CustomizedCaffeLayer::layer\_name

```cpp
std::string CustomizedCaffeLayer::layer_name() 
```

Pure virtual function. Returns the type of the newly added CAFFE layer.

### CustomizedCaffeLayer::dump

```cpp
void CustomizedCaffeLayer::dump() 
```

Pure virtual function, used to print information about the CAFFE layer.

### CustomizedCaffeLayer::setup

```cpp
void CustomizedCaffeLayer::setup(TensorOp* op) 
```

Optional. It only sets the sub type of the customized layer and is implemented by default. If a child class overrides it, the child must call this parent class setup function first.

### CustomizedCaffeLayer::codegen

Pure virtual function, used to set up the BMNET IR according to the LayerParameter of the CAFFE layer. In this function, you should set up the output shape and fill parameters into the TensorOp.

| Parameter | Type       | Description                                   |
| --------- | ---------- | --------------------------------------------- |
| op        | TensorOp\* | \[Required] Pointer to an instance of BMNET IR. |

### CustomizedCaffeLayer::add\_output\_offset

```cpp
void CustomizedCaffeLayer::add_output_offset(int offset)
```

Protected member method. It should be called when setting up the output offset of the layer’s top.

| Parameter | Type | Description                                |
| --------- | ---- | ------------------------------------------ |
| offset    | int  | \[Required] Offset of the output; should be 0. |

### CustomizedCaffeLayer::layer\_

```cpp
caffe::LayerParameter& CustomizedCaffeLayer::layer_ 
```

Protected member variable, a reference to the customized CAFFE layer’s LayerParameter.

### CustomizedTensorFixedInst

CustomizedTensorFixedInst is an abstract class used to implement a Layer that converts BMNET IR into instructions through the BMKernel APIs. Inherit from this class and implement all of its pure virtual functions. CustomizedTensorFixedInst inherits from the TensorFixedInst/TensorInst classes. Below are their prototypes:

```cpp
namespace bmnet {
class TensorFixedInst: public TensorInst { 
public:
  TensorFixedInst() : TensorInst() {} 
  TensorFixedInst(TensorOp &op) : TensorInst(op) {} 
  virtual ~ TensorFixedInst (void);
  void SetCalibrationParameter(
     const LayerCalibrationParameter &calibration_parameter) { 
	m_calibrationParameter = calibration_parameter;
}
  void AddInputCalibrationParameter(
     const LayerCalibrationParameter &calibration_parameter){  
	m_inputCalibrationParameter.push_back(calibration_parameter);
} 
protected:
  LayerCalibrationParameter m_calibrationParameter; 
  std::vector <LayerCalibrationParameter >
     m_inputCalibrationParameter;
};
}

```

```cpp
namespace bmnet {
class TensorInst {
public:
  TensorInst();
  virtual ~TensorInst(void);
  virtual std::string inst_name() = 0;
  virtual void dump () = 0;
  virtual void encode () = 0;

protected:
  TensorOp &op_; 
};
}

```

```cpp
namespace bmnet {

class CustomizedTensorFixedInst : public TensorFixedInst { 
public:
  CustomizedTensorFixedInst ();
  ~CustomizedTensorFixedInst (); 
protected:
  u64 get_global_neuron_base();
  u64 get_global_weight_base();
}; 
}

```

### CustomizedTensorFixedInst::inst\_name

```cpp
std::string CustomizedTensorFixedInst::inst_name()
```

Pure virtual function. Returns the type of the customized BMNET IR.

### CustomizedTensorFixedInst::dump

```cpp
void CustomizedTensorFixedInst::dump()
```

Pure virtual function, used to print information about the BMNET IR.

### CustomizedTensorFixedInst::encode

```cpp
void CustomizedTensorFixedInst::encode() 
```

Pure virtual function, used to convert the BMNET IR into instructions using the BMKernel APIs.

### CustomizedTensorFixedInst::get\_global\_neuron\_base

```cpp
u64 CustomizedTensorFixedInst::get_global_neuron_base() 
```

Protected member method. Returns the base address at which the neurons are stored in device memory.

### CustomizedTensorFixedInst::get\_global\_weight\_base

```cpp
u64 CustomizedTensorFixedInst::get_global_weight_base() 
```

Protected member method. Returns the base address at which the weights are stored in device memory.

### CustomizedTensorFixedInst::op\_

```cpp
TensorOp& CustomizedTensorFixedInst::op_ 
```

Protected member variable, a reference to the BMNET IR.

### TGCustomizedParameter

TGCustomizedParameter represents a customized BMNET IR’s parameters. It provides member methods to set parameters on it or get parameters from it. Below is the prototype:

```cpp
namespace bmnet {

class TGCustomizedParameter {
public:
  int i32_param_size();
  int f32_param_size();
  int i32_param(int index);
  float f32_param(int index);
  void add_i32_param(int value);
  void add_f32_param(float value);
};
}

```

### TGCustomizedParameter::i32\_param\_size

```cpp
int TGCustomizedParameter::i32_param_size()
```

Return the number of int parameters stored in TGCustomizedParameter.

### TGCustomizedParameter::f32\_param\_size

```cpp
int TGCustomizedParameter::f32_param_size() 
```

Return the number of float parameters stored in TGCustomizedParameter.

### TGCustomizedParameter::i32\_param

```cpp
int TGCustomizedParameter::i32_param(int index) 
```

Return the int parameter selected by index.

| Parameter | Type  | Description                                             |
| --------- | ----- | ------------------------------------------------------- |
| index     | int   | \[Required] Index of the int parameter to be returned. |

### TGCustomizedParameter::f32\_param

```cpp
float TGCustomizedParameter::f32_param(int index) 
```

Return the float parameter selected by index.

| Parameter | Type  | Description                                               |
| --------- | ----- | --------------------------------------------------------- |
| index     | int   | \[Required] Index of the float parameter to be returned. |

### TGCustomizedParameter::add\_i32\_param

```cpp
void TGCustomizedParameter::add_i32_param(int value) 
```

Append a new int parameter to TGCustomizedParameter.

| Parameter | Type | Description                 |
| --------- | ---- | --------------------------- |
| value     | int  | \[Required]  int parameter. |

### TGCustomizedParameter::add\_f32\_param

```cpp
void TGCustomizedParameter::add_f32_param(float value)
```

Append a new float parameter to TGCustomizedParameter.

| Parameter | Type  | Description                   |
| --------- | ----- | ----------------------------- |
| value     | float | \[Required]  float parameter. |

### TensorShape

TensorShape represents the shape of a tensor. Below is the prototype:

```cpp
namespace bmnet {

class TensorShape { 
public:
  void CopyFrom(const TensorShape& from); 
  int dim_size() const;
  int dim(int index);
  void add_dim(int value);
};
}
```

### TensorShape::dim\_size

```cpp
int TensorShape::dim_size() const
```

Return the number of dims.

### TensorShape::dim

```cpp
int TensorShape::dim(int index) 
```

Return the dim selected by index.

| Parameter | Type | Description                                    |
| --------- | ---- | ---------------------------------------------- |
| index     | int  | \[Required] Index of the dim to be returned. |

### TensorShape::add\_dim

```cpp
void TensorShape::add_dim(int value)
```

Append a dim to TensorShape.

| Parameter | Type | Description                          |
| --------- | ---- | ------------------------------------ |
| value     | int  | \[Required]  new dim to be appended. |

### TensorShape::CopyFrom

```cpp
void TensorShape::CopyFrom(const TensorShape& from) 
```

Copy from another TensorShape instance.

| Parameter | Type               | Description                              |
| --------- | ------------------ | ---------------------------------------- |
| from      | const TensorShape& | \[Required] Source TensorShape instance. |
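
As an illustration of how these four members fit together, the sketch below builds an NCHW shape, copies it, and computes its element count. The `TensorShape` defined here is a stand-in with the same member signatures as the prototype above, backed by a `std::vector`; the helper names `make_nchw`, `clone`, and `element_count` are hypothetical and not part of the SDK.

```cpp
#include <cassert>
#include <vector>

// Stand-in for bmnet::TensorShape (assumption): same member signatures as the
// documented prototype, backed by a std::vector so the example builds alone.
class TensorShape {
public:
  void CopyFrom(const TensorShape& from) { dims_ = from.dims_; }
  int dim_size() const { return static_cast<int>(dims_.size()); }
  int dim(int index) { return dims_.at(index); }
  void add_dim(int value) { dims_.push_back(value); }

private:
  std::vector<int> dims_;
};

// Builds an N x C x H x W shape by appending one dim at a time.
TensorShape make_nchw(int n, int c, int h, int w) {
  TensorShape shape;
  shape.add_dim(n);
  shape.add_dim(c);
  shape.add_dim(h);
  shape.add_dim(w);
  return shape;
}

// Copies a shape with CopyFrom and checks that the rank is preserved.
TensorShape clone(const TensorShape& src) {
  TensorShape dst;
  dst.CopyFrom(src);
  assert(dst.dim_size() == src.dim_size());
  return dst;
}

// Product of all dims. Taken by value because dim() is non-const in the prototype.
long long element_count(TensorShape shape) {
  long long count = 1;
  for (int i = 0; i < shape.dim_size(); ++i) count *= shape.dim(i);
  return count;
}
```

For example, `element_count(make_nchw(1, 3, 224, 224))` multiplies 1, 3, 224, and 224.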

### CaffeBuilder

CaffeBuilder is a class that provides a uniform interface combining the front end, optimizer, and back end to compile a Caffe neural network graph into a bmodel file. CaffeBuilder inherits from Builder, the base compiler class. Their prototypes are shown below:

```cpp
namespace bmnet { 

class Builder {
public:
  Builder(CHIP_VER ver);
  virtual ~Builder();
  void addCustomizedTensorInst(TensorInst *inst);
  void build(int n, int c, int h, int w, int opt);
  void store_prototxt(const char *dst);
  void store_model(const char *net_name, const char *dst);
};
}

```

```cpp
namespace bmnet { 

class CaffeBuilder : public Builder { 
public:
  CaffeBuilder(CHIP_VER ver, const char *modified_proto, const char *caffemodel,
               const char *weight_bin, const char *in_ctable, const char *out_ctable);
  ~CaffeBuilder();
  void addCustomizedLayer(Layer *layer);
};
}
```

### CaffeBuilder::CaffeBuilder

```cpp
CaffeBuilder::CaffeBuilder(
CHIP_VER ver,
const char *modified_proto,
const char *caffemodel,
const char *weight_bin,
const char *in_ctable,
const char *out_ctable)
```

Constructor function of CaffeBuilder class.

| Parameter       | Type         | Description                                                                        |
| --------------- | ------------ | ---------------------------------------------------------------------------------- |
| ver             | CHIP\_VER    | \[Required] The target chip version. Currently only BM\_CHIP\_BM1880 is available. |
| modified\_proto | const char\* | \[Optional] The modified prototxt file. See Chapter 4 for details.                 |
| caffemodel      | const char\* | \[Required] The caffemodel file of the network.                                    |
| weight\_bin     | const char\* | \[Optional] The weight file of the network.                                        |
| in\_ctable      | const char\* | \[Required] The calibration table file of the network.                             |
| out\_ctable     | const char\* | \[Required] The output calibration table file of the network.                      |

modified\_proto is an optional parameter, so you do not need to supply every argument. The following combinations are valid: 1) caffemodel only; 2) caffemodel together with modified\_proto.

### CaffeBuilder::build

Core member function of CaffeBuilder class, used to compile the network by specifying input shape and optimization level.

| Parameter | Type | Description                                                                                        |
| --------- | ---- | -------------------------------------------------------------------------------------------------- |
| n,c,h,w   | int  | \[Required]  The input shape                                                                       |
| opt       | int  | \[Optional] The input optimization options. The default value is BM\_OPT\_LAYER\_GROUP\_WITH\_WEIG |

Below are the values for opt.

| value                             | Description                                                                           |
| --------------------------------- | ------------------------------------------------------------------------------------- |
| OPT\_NONE                         | No optimization                                                                       |
| BM\_OPT\_LAYER\_GROUP             | Divides layers into clusters to optimize the bandwidth overhead.                      |
| BM\_OPT\_LAYER\_GROUP\_WITH\_WEIG | Add additional optimization to reduce the device memory footprint and reshape weight. |

### CaffeBuilder::store\_prototxt

Store the optimized network graph as a file.

| Parameter | Type         | Description                    |
| --------- | ------------ | ------------------------------ |
| dst       | const char\* | \[Required]  File to be stored |

### CaffeBuilder::store\_model

```cpp
void CaffeBuilder::store_model(
const char* net_name ,
const char* dst,
const char* plugin_path = nullptr)
```

Store compiled instructions, weight and other information of the network as a bmodel file.

| Parameter    | Type         | Description                                              |
| ------------ | ------------ | -------------------------------------------------------- |
| net\_name    | const char\* | \[Required] The network name.                            |
| dst          | const char\* | \[Required] File to store the bmodel.                    |
| plugin\_path | const char\* | \[Optional] The CPU op plugins. The default is nullptr.  |
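
Taken together, the members described in this section suggest a three-step flow: construct the builder, call `build` with the input shape and optimization level, then call `store_model`. The sketch below shows only that call order. `CaffeBuilder`, `CHIP_VER`, and the option enum are stand-ins that record which step ran; the file names, the enum values, the `steps` member, and the use of `nullptr` for omitted optional arguments are all assumptions rather than SDK behavior.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-ins (assumption): signatures follow the documented prototypes, but
// the bodies only record which step ran. Enum values are placeholders.
enum CHIP_VER { BM_CHIP_BM1880 };
enum { OPT_NONE, BM_OPT_LAYER_GROUP };

class CaffeBuilder {
public:
  CaffeBuilder(CHIP_VER /*ver*/, const char* /*modified_proto*/,
               const char* caffemodel, const char* /*weight_bin*/,
               const char* /*in_ctable*/, const char* /*out_ctable*/) {
    assert(caffemodel != nullptr);  // documented as required
    steps.push_back("construct");
  }
  void build(int /*n*/, int /*c*/, int /*h*/, int /*w*/, int /*opt*/) {
    steps.push_back("build");
  }
  void store_model(const char* /*net_name*/, const char* /*dst*/,
                   const char* /*plugin_path*/ = nullptr) {
    steps.push_back("store_model");
  }
  std::vector<std::string> steps;  // not part of the SDK; illustration only
};

// Compiles a hypothetical network for a 1x3x224x224 input and stores the bmodel.
std::vector<std::string> compile_example() {
  // File names are placeholders. Passing nullptr for the optional arguments
  // is an assumption about how they are omitted.
  CaffeBuilder builder(BM_CHIP_BM1880, nullptr, "net.caffemodel", nullptr,
                       "in_ctable.pb", "out_ctable.pb");
  builder.build(1, 3, 224, 224, BM_OPT_LAYER_GROUP);
  builder.store_model("net", "net.bmodel");
  return builder.steps;
}
```

With the real SDK, the same three calls would be made against the class declared in the BMNet headers instead of this stand-in.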

### CaffeBuilder::addCustomizedLayer

```cpp
void CaffeBuilder::addCustomizedLayer( Layer* layer)
```

Register a newly added customized layer, which is used to convert a Caffe layer into BMNet IR (intermediate representation).

| Parameter | Type    | Description                                         |
| --------- | ------- | --------------------------------------------------- |
| layer     | Layer\* | \[Required] Pointer to an instance of class Layer.  |

### CaffeBuilder::addCustomizedTensorInst

```cpp
void CaffeBuilder::addCustomizedTensorInst(TensorInst* inst) 
```

Register a newly added customized TensorInst (Tensor Instruction), which is used to convert BMNet IR into instructions.

| Parameter | Type         | Description                                             |
| --------- | ------------ | ------------------------------------------------------- |
| inst      | TensorInst\* | \[Required] Pointer to an instance of class TensorInst. |
