BMNNSDK API v1
BMNNSDK provides a lightweight set of C/C++ APIs for deep learning application developers. It consists of the TPU BMRuntime library, the BMKernel library and the BMNet library, each of which is described in detail in this section.
bm_init() initializes BM device and creates a handle to BM context.
bm_exit() must be called before application exits. It will release all internal resources.
bm_enum_devices() enumerates all BM devices in the system.
bm_device_open() opens a BM device.
bm_device_close() closes an opened BM device.
bm_device_query() always returns BM_ERR_NOT_SUPPORTED now.
bm_device_config() always returns BM_ERR_NOT_SUPPORTED now.
bm_device_get_info() returns information about a BM device.
bm_context_create() creates a BM context.
bm_context_destroy() destroys a BM context.
bm_bind_device() binds a BM context with a BM device.
bm_unbind_device() unbinds a BM context from its BM device.
bm_get_device() returns the handle of the BM device that is bound to the BM context.
bmruntime_bmkernel_create() creates a BM kernel with the BM context. p_bk_ctx points to a thread-local variable, so you can use this API to create independent BM kernel contexts in multiple threads. However, a thread must not own more than one BM kernel context at a time; otherwise memory will leak.
bmruntime_bmkernel_submit() submits the BM kernel with the BM context.
bmruntime_bmkernel_destroy() destroys the BM kernel with the BM context.
bmmem_device_alloc_raw() allocates device memory of the given size.
bmmem_device_prealloc_raw() lets the application allocate memory from previously allocated device memory. The requested region must lie entirely within the previously allocated device memory.
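The containment rule for bmmem_device_prealloc_raw() can be checked by the application before the call. The sketch below shows one way to do so; region_fits and sub_region_end are hypothetical helper names, and the plain byte counts are an assumption for illustration, not part of BMNNSDK.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper (not an SDK call): true when the sub-region
// [offset, offset + size) lies entirely inside a parent block of
// parent_size bytes.
bool region_fits(uint64_t parent_size, uint64_t offset, uint64_t size) {
    if (offset > parent_size) return false;  // starts past the end
    return size <= parent_size - offset;     // written to avoid overflow
}

// Hypothetical helper: end offset of a sub-region that is known to fit.
uint64_t sub_region_end(uint64_t parent_size, uint64_t offset, uint64_t size) {
    assert(region_fits(parent_size, offset, size));
    return offset + size;
}
```

Comparing size against parent_size - offset, rather than offset + size against parent_size, keeps the check correct even when the sum would wrap around.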
bmmem_device_alloc() allocates device memory for the given shape.
bmmem_device_free() frees device memory that was allocated by the allocation functions above.
bmmem_host_alloc() always returns BM_ERR_NOT_SUPPORTED now.
bmmem_host_free() always returns BM_ERR_NOT_SUPPORTED now.
bmmem_device_size() returns the device memory size.
bmmem_device_addr() returns the device memory address.
bmmem_host_v_addr() always returns BM_ERR_NOT_SUPPORTED now.
bmmem_host_p_addr() always returns BM_ERR_NOT_SUPPORTED now.
bm_memcpy_s2d() copies data from system memory to device memory. Here s stands for system and d for device.
bm_memcpy_d2s() copies data from device memory to system memory.
bmnet_register() registers a neural network with bmnet info.
bmnet_register_bmodel() registers a neural network from a bmodel file.
bmnet_register_noalloc() registers a compiled neural network without allocating weight and neuron device memory.
bmnet_set_input_shape() sets the input shape for a registered BM network. A bmodel can support several input shapes; this API selects one of them.
bmnet_get_output_info() retrieves the output info of a registered BM network.
bmnet_cleanup() cleans up a registered BM network.
bmnet_run() runs a registered BM network. The application must load the input and store the output itself.
bmnet_weight_devmem() retrieves the weight device memory handle from a registered BM network.
bmnet_neuron_devmem() retrieves the neuron device memory handle from a registered BM network.
bmnet_input_devmem() retrieves the input device memory handle from a registered BM network.
bmnet_output_devmem() retrieves the output device memory handle from a registered BM network.
bmnet_import_weight_devmem() imports weight device memory for a registered BM network. The application should allocate the weight device memory first and then call this function to import it. This function and bmnet_import_neuron_devmem() are usually used together with bmnet_register_noalloc(): the application registers a BM network without allocating weight and neuron device memory, and then uses these two functions to import the weight and neuron memory.
bmnet_import_neuron_devmem() imports neuron device memory for a registered BM network. The application should allocate the neuron device memory first and then call this function to import it.
bmnet_load_input() loads input data for a registered BM network.
bmnet_load_neuron() loads neuron data for a registered BM network.
bmnet_store_output() stores output data for a registered BM network. Application uses this function to copy output data from device memory to host memory.
bmnet_store_neuron() stores neuron data for a registered BM network. Application uses this function to copy neuron data from device memory to host memory.
bmnet_inference() runs inference with a registered BM network.
The user allocates a BMKernel context by filling a bmk1880_info_t structure and passing it to the bmk1880_register function. The function returns a handle to the initialized context.
In the bmk1880_info_t structure, chip_version is an integer describing the version of the chip to work with, here 1880; cmdbuf (short for “command buffer”) is a user-allocated buffer that holds the generated hardware instructions, and cmdbuf_size is its size in bytes. Note that the user is responsible for freeing cmdbuf once the BMKernel context that refers to it is no longer in use.
bmk1880_cleanup frees the context previously allocated by bmk1880_register.
bmk1880_acquire_cmdbuf returns a buffer of the hardware instructions generated so far and sets (*size) to the buffer’s valid size in bytes. The buffer is an array of cmd_hdr_t structures, each containing one variable-sized generated hardware instruction.
In the cmd_hdr_t structure, engine_id identifies the engine on which the contained instruction is supposed to be executed, and len gives the length in bytes of the hardware instruction immediately following this cmd_hdr_t structure.
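A buffer made of header-plus-payload records like this is typically walked by reading one header, skipping len bytes, and repeating. The sketch below illustrates that traversal only: CmdHdr is a hypothetical stand-in whose field widths and padding are assumptions (the real cmd_hdr_t may differ), and list_engines and append_cmd are illustrative helpers, not BMKernel functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical header layout; the real cmd_hdr_t may use other widths.
struct CmdHdr {
    uint8_t engine_id;  // engine that should execute the instruction
    uint32_t len;       // bytes of instruction data following the header
};

// Walk [header][len bytes] records and collect the engine ids in order.
std::vector<uint8_t> list_engines(const uint8_t* buf, size_t size) {
    assert(buf != nullptr || size == 0);
    std::vector<uint8_t> engines;
    size_t pos = 0;
    while (pos + sizeof(CmdHdr) <= size) {
        CmdHdr hdr;
        std::memcpy(&hdr, buf + pos, sizeof(hdr));  // avoids unaligned reads
        pos += sizeof(hdr);
        if (hdr.len > size - pos) break;            // truncated record
        engines.push_back(hdr.engine_id);
        pos += hdr.len;                             // skip the instruction
    }
    return engines;
}

// Illustration helper: append one record with a zero-filled instruction.
void append_cmd(std::vector<uint8_t>& buf, uint8_t engine_id, uint32_t len) {
    CmdHdr hdr;
    std::memset(&hdr, 0, sizeof(hdr));
    hdr.engine_id = engine_id;
    hdr.len = len;
    const uint8_t* p = reinterpret_cast<const uint8_t*>(&hdr);
    buf.insert(buf.end(), p, p + sizeof(hdr));
    buf.insert(buf.end(), static_cast<size_t>(len), uint8_t{0});
}
```

In real use the buffer and its valid size would come from bmk1880_acquire_cmdbuf rather than from append_cmd.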
bmk1880_reset resets the current BMKernel context to its initial state as returned by bmk1880_register. This function is usually called after bmk1880_acquire_cmdbuf to empty the cmdbuf buffer.
bmk1880_parallel_enable declares that subsequent computations on different engines can be executed without synchronization with each other. This function enables the engine-oriented parallel programming style.
bmk1880_parallel_disable disables the engine-oriented parallel programming style.
bmk1880_create_streams creates nr_streams streams, indexed 0 to (nr_streams - 1), that subsequent calls to bmk1880_set_stream can refer to. This function enables the dependency-oriented parallel programming style. Note that this style cannot be disabled once enabled.
bmk1880_destroy_streams destroys all the streams created by the previous call to bmk1880_create_streams and resets the system back to serial mode.
bmk1880_set_stream sets the current stream to stream i, which must have been created by calling bmk1880_create_streams. Subsequent computations are put into this stream until another bmk1880_set_stream call specifies a different stream index.
bmk1880_add_dependency further requires that the computation represented by before take place strictly before the one represented by after. Both before and after are pointers returned by some computation API.
During all kinds of computation, input values are first converted into 32-bit values before any internal computation, and the final 32-bit values are saturated into the range that can be represented by the final 8-bit or 16-bit integer format. That is, if the value before saturation can be represented by the final integer format, it is unchanged. Otherwise it is saturated to the maximum or minimum of the final integer format, whichever is nearer to the original value. For example, if the final integer format is FMT_U8, the representable maximum and minimum are 255 and 0, respectively. In this case, any value greater than 255 becomes 255 after saturation, and any value smaller than 0 becomes 0.
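The saturation rule can be written as a simple clamp. The following is a minimal sketch for the 8-bit formats; the Fmt enum and the saturate8 function are illustrative names defined here, not BMKernel declarations.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative stand-in for the two 8-bit formats named in the text.
enum Fmt { FMT_I8, FMT_U8 };

// Clamp a 32-bit intermediate value into the range of the final format.
int32_t saturate8(int32_t v, Fmt fmt) {
    assert(fmt == FMT_I8 || fmt == FMT_U8);
    const int32_t lo = (fmt == FMT_U8) ? 0 : -128;
    const int32_t hi = (fmt == FMT_U8) ? 255 : 127;
    if (v < lo) return lo;  // below the range: nearest value is the minimum
    if (v > hi) return hi;  // above the range: nearest value is the maximum
    return v;               // representable: unchanged
}
```

For instance, saturate8(300, FMT_U8) yields 255 and saturate8(-5, FMT_U8) yields 0, matching the FMT_U8 example above.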
Regarding signedness, one general rule applies to all kinds of computation unless otherwise specified: the result is unsigned if and only if all input tensors or matrices are unsigned. A tensor or matrix is said to be signed if it is of format FMT_I8, and unsigned if it is of format FMT_U8.
fmt_t describes the type of basic data in a tensor or matrix. The naming consists of three parts. “FMT” is a fixed prefix. The following “I” or “U” stands for signed or unsigned integer, respectively. “8” is the bit-width of the type.
shape_t describes the shape of a tensor or matrix. shape_t4 and shape_t2 are used to construct shape_t values for tensors and matrices, respectively.
stride_t describes the stride of a tensor or matrix. stride_t4 and stride_t2 are used to construct stride_t values for tensors and matrices, respectively.
tensor_lmem represents a tensor or matrix in lmem. fmt, shape and stride are as explained above. If stride is NULL, aligned selects between two frequently used sets of stride values.
For tensors, if aligned is false, the stride values are the default unaligned strides; if aligned is true, they are the default aligned strides (see section 2.4 of BMKernel 1880 Guide.pdf). For matrices, the stride values are computed from the shapes of the corresponding specially shaped tensors, following the same rule.
tensor_gmem represents a tensor or matrix in gmem.
bmk1880_chip_info returns a structure describing the design parameters of the BM1880 chip.
bmk1880_tl_prealloc allocates a tensor_lmem structure on heap memory and constructs it as dictated by the parameters. The parameter la is the starting address in lmem. The tensor_lmem’s aligned field is set to false. If the allocation succeeds, a pointer to the constructed structure is returned; otherwise NULL is returned.
bmk1880_tl_prealloc_align is the same as bmk1880_tl_prealloc, except that the aligned field is set to true.
bmk1880_tl_alloc allocates a tensor_lmem structure on heap memory and constructs it as dictated by the parameters. Unlike bmk1880_tl_prealloc, the starting address is not taken from the parameters but assigned automatically by BMKernel. BMKernel manages the starting addresses in lmem with a simple stack. The starting address in each returned tensor_lmem increases monotonically over successive bmk1880_tl_alloc calls, and the last allocated tensor_lmem must be freed first, using the function bmk1880_tl_free described below. If the available memory in lmem is not enough to satisfy an allocation request, or some other error occurs, a NULL pointer is returned.
The tensor_lmem’s aligned field is set to false when ctrls is CTRL_NULL, and to true when ctrls is CTRL_AL.
bmk1880_tl_alloc_bank allocates memory from a specific lmem bank, as dictated by the bank_id parameter.
bmk1880_tl_free frees a tensor_lmem structure allocated by bmk1880_tl_prealloc, bmk1880_tl_prealloc_align, bmk1880_tl_alloc or bmk1880_tl_alloc_bank back to heap memory. If the structure was allocated by bmk1880_tl_alloc or bmk1880_tl_alloc_bank, bmk1880_tl_free also increases the available lmem memory managed by BMKernel and checks that the last-allocated, first-freed rule is obeyed (see bmk1880_tl_alloc).
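The last-allocated, first-freed discipline described for bmk1880_tl_alloc and bmk1880_tl_free behaves like a bump allocator backed by a stack of start addresses. The class below is only a toy model of that behaviour: LmemStack, its methods and the byte-granular addresses are assumptions for illustration, not BMKernel code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Toy model of stack-managed local memory; all sizes are in bytes.
class LmemStack {
public:
    explicit LmemStack(uint32_t capacity) : capacity_(capacity) {}

    // Returns the start address, or -1 when the request does not fit.
    int64_t alloc(uint32_t size) {
        assert(top_ <= capacity_);
        if (size > capacity_ - top_) return -1;
        const uint32_t addr = top_;
        starts_.push_back(addr);  // remember the allocation order
        top_ += size;             // addresses grow monotonically
        return addr;
    }

    // Frees the block starting at addr; fails unless it is the newest one.
    bool free(uint32_t addr) {
        if (starts_.empty() || starts_.back() != addr) return false;
        top_ = addr;              // the space becomes available again
        starts_.pop_back();
        return true;
    }

    uint32_t available() const { return capacity_ - top_; }

private:
    uint32_t capacity_;
    uint32_t top_ = 0;
    std::vector<uint32_t> starts_;
};
```

Freeing out of order is rejected here with a false return value; how the real library reports a violated rule is not stated in the text.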
bmk1880_gdma_copy_gmem instructs DMA to copy a tensor or matrix within gmem. src and dst must both be tensors or both be matrices, and must contain 8-bit basic data only. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. When src and dst are tensors, ctrls can be CTRL_TP, indicating N/C-transposition. In all other cases, ctrls must be CTRL_NULL.
bmk1880_gdma_copy_lmem instructs DMA to copy a tensor (not matrix) within lmem, from src to dst. The shapes of src and dst may differ, as long as their total numbers of basic data are equal. The basic data must be 8-bit.
bmk1880_gdma_load instructs DMA to copy a tensor or matrix from gmem to lmem. The tensor or matrix starts at gaddr in gmem and uses the default stride values. When ctrls is CTRL_TP (instead of CTRL_NULL), it indicates N/C-transposition for a tensor, or row/column-transposition for a matrix. The basic data must be 8-bit.
bmk1880_gdma_store is similar to bmk1880_gdma_load, but copies the tensor or matrix from lmem to gmem.
Similar to bmk1880_gdma_load, but enables users to specify stride values in gmem.
Similar to bmk1880_gdma_store, but enables users to specify stride values in gmem.
bmk1880_gdma_lrn_shift instructs DMA to compute a tensor (not matrix) dst from tensor src, both of which have the same shape (N, C, H, W). If right_shift is true, the computation copies the datum at index (n_i, c_i, h_i, w_i) in tensor src to index (n_i, c_i + lrn_step, h_i, w_i) in tensor dst for each 0 ≤ c_i < C − lrn_step, and sets the datum at index (n_i, c_i, h_i, w_i) in tensor dst to zero for each 0 ≤ c_i < lrn_step. If right_shift is false, the computation copies the datum at index (n_i, c_i, h_i, w_i) in tensor src to index (n_i, c_i − lrn_step, h_i, w_i) in tensor dst for each lrn_step ≤ c_i < C, and sets the datum at index (n_i, c_i, h_i, w_i) in tensor dst to zero for each C − lrn_step ≤ c_i < C. The basic data must be 8-bit.
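The channel shift is easier to see on the host. The function below is a reference sketch of the index rule just described, assuming a dense row-major (N, C, H, W) array in host memory; lrn_shift_ref is a name chosen for this example and does not model lmem strides or the DMA engine.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side reference of the channel shift on a dense NCHW array.
std::vector<int8_t> lrn_shift_ref(const std::vector<int8_t>& src,
                                  int N, int C, int H, int W,
                                  bool right_shift, int lrn_step) {
    assert(N >= 0 && C >= 0 && H >= 0 && W >= 0);
    assert(src.size() == static_cast<size_t>(N) * C * H * W);
    assert(lrn_step >= 0 && lrn_step <= C);
    std::vector<int8_t> dst(src.size(), 0);  // vacated channels stay zero
    auto at = [=](int n, int c, int h, int w) {
        return ((static_cast<size_t>(n) * C + c) * H + h) * W + w;
    };
    for (int n = 0; n < N; ++n)
        for (int c = 0; c < C; ++c) {
            // Destination channel: c + step (right) or c - step (left).
            const int dc = right_shift ? c + lrn_step : c - lrn_step;
            if (dc < 0 || dc >= C) continue;  // shifted out of the tensor
            for (int h = 0; h < H; ++h)
                for (int w = 0; w < W; ++w)
                    dst[at(n, dc, h, w)] = src[at(n, c, h, w)];
        }
    return dst;
}
```

With C = 4 and lrn_step = 1, channels (1, 2, 3, 4) become (0, 1, 2, 3) for a right shift and (2, 3, 4, 0) for a left shift.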
bmk1880_tpu_mul instructs TPU to compute res_i = (a_i × b_i) ≫ rshift_width for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must have the same shape, and the basic data in all tensor_lmem structures must be 8-bit. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit. rshift_width is the number of bits by which each result value is shifted right before saturation.
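Per element, the multiply is a 32-bit product, a right shift, and then a clamp into the result range. The sketch below is a host-side model under stated assumptions: the shift is a plain arithmetic shift with no rounding (the text does not say whether the hardware rounds), and tpu_mul_ref with its boolean flags is a hypothetical helper rather than the BMKernel signature.

```cpp
#include <cassert>
#include <cstdint>

// Host-side model of one element: (a * b) >> rshift_width, then saturate.
int32_t tpu_mul_ref(int32_t a, int32_t b, int rshift_width,
                    bool res_signed, bool res_is_16bit) {
    assert(a >= -128 && a <= 255 && b >= -128 && b <= 255);  // 8-bit inputs
    assert(rshift_width >= 0 && rshift_width < 32);
    // Arithmetic shift; no rounding is modelled (an assumption).
    const int32_t shifted = (a * b) >> rshift_width;
    const int bits = res_is_16bit ? 16 : 8;
    const int32_t lo = res_signed ? -(1 << (bits - 1)) : 0;
    const int32_t hi = res_signed ? (1 << (bits - 1)) - 1 : (1 << bits) - 1;
    return shifted < lo ? lo : (shifted > hi ? hi : shifted);
}
```

For example, 100 × 100 with no shift saturates to 255 for an unsigned 8-bit result but is kept as 10000 for an unsigned 16-bit result.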
Similar to bmk1880_tpu_mul, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.
bmk1880_tpu_mac instructs TPU to compute res_i = (a_i × b_i + (res_i ≪ lshift_width)) ≫ rshift_width for each datum a_i in tensor a, b_i in tensor b and res_i represented by res_high and res_low together, where res_i, a_i and b_i have the same index. All tensors must have the same shape, and the basic data in all tensor_lmem structures must be 8-bit. The result is a 16-bit tensor if res_is_int8 is false, and an 8-bit tensor otherwise. rshift_width is the number of bits by which each result value is shifted right before saturation. Note that res_high and res_low are used both as the input res_i values and as the output res_i values. The input res_i values are always 16-bit, so both res_high and res_low must be non-NULL. When the result is an 8-bit tensor, it is stored into res_low.
Similar to bmk1880_tpu_mac, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.
bmk1880_tpu_add instructs TPU to compute res_i = a_i + b_i for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must have the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensor a and tensor b must both be 16-bit, so a_high and b_high must not be NULL. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit.
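Because each 16-bit operand is held as two 8-bit parts, a host-side model of the addition has to join the parts, add in a wider type, saturate, and split again. The sketch below assumes the high part holds bits 15..8 and the low part bits 7..0 of a two's-complement value, and uses a single signedness flag for both operands; Split16, join16, split16 and add16_ref are illustrative names, not BMKernel APIs.

```cpp
#include <cassert>
#include <cstdint>

// A 16-bit datum stored as separate high and low 8-bit parts.
struct Split16 {
    uint8_t high;
    uint8_t low;
};

// Rebuild the numeric value from its two parts.
int32_t join16(Split16 v, bool is_signed) {
    const int32_t raw = (static_cast<int32_t>(v.high) << 8) | v.low;
    return (is_signed && raw >= 0x8000) ? raw - 0x10000 : raw;
}

// Split a value that already fits in 16 bits into high and low parts.
Split16 split16(int32_t v) {
    assert(v >= -32768 && v <= 65535);
    const uint16_t raw = static_cast<uint16_t>(v);  // two's-complement bits
    return Split16{static_cast<uint8_t>(raw >> 8),
                   static_cast<uint8_t>(raw & 0xFF)};
}

// res = a + b in 32-bit, saturated to the 16-bit range, split again.
Split16 add16_ref(Split16 a, Split16 b, bool is_signed) {
    const int32_t sum = join16(a, is_signed) + join16(b, is_signed);
    const int32_t lo = is_signed ? -32768 : 0;
    const int32_t hi = is_signed ? 32767 : 65535;
    return split16(sum < lo ? lo : (sum > hi ? hi : sum));
}
```

Adding 30000 and 10000 as signed values saturates to 32767 (high 0x7F, low 0xFF), while the unsigned sum 40000 is stored as high 0x9C, low 0x40.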
Similar to bmk1880_tpu_add, but tensor b is replaced by a 16-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.
bmk1880_tpu_sub instructs TPU to compute res_i = a_i − b_i for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must have the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensor a and tensor b must both be 16-bit, so a_high and b_high must not be NULL. The result must consist of signed integers, so the fmt_t field in res_high and res_low must be FMT_I8. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit.
bmk1880_tpu_max instructs TPU to compute res_i = max(a_i, b_i) for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must have the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensor a and tensor b must be either both signed or both unsigned.
bmk1880_tpu_min is similar to bmk1880_tpu_max, but computes res_i = min(a_i, b_i).
Similar to bmk1880_tpu_min, but tensor b is replaced by an 8-bit constant. The constant is signed if b_is_signed is true, and unsigned otherwise.
bmk1880_tpu_arith_shift instructs TPU to compute res_i = a_i ≫ bits_i for each datum a_i in tensor a and bits_i in tensor bits, where res_i, a_i and bits_i have the same index. All tensors must have the same shape, and the basic data in all tensor_lmem structures must be 8-bit. Tensor a must be 16-bit and signed, so the fmt fields in a_high and a_low must be FMT_I8. Tensor bits must be signed, and every datum in it must lie in the range [−16, 16]. The result tensor must be 16-bit, so res_high must be non-NULL.
bmk1880_tpu_and_int8 instructs TPU to compute res_i = a_i ∧ b_i for each datum a_i in tensor a and b_i in tensor b, where res_i, a_i and b_i have the same index. All tensors must have the same shape, and the basic data in all tensor_lmem structures must be 8-bit.
bmk1880_tpu_and_int16 is similar to bmk1880_tpu_and_int8, but all input and output tensors are 16-bit, so res_high, a_high and b_high must be non-NULL.
Similar to bmk1880_tpu_and_int8, but computes res_i = a_i ∨ b_i.
Similar to bmk1880_tpu_and_int16, but computes res_i = a_i ∨ b_i.
Similar to bmk1880_tpu_and_int8, but computes res_i = a_i ⊕ b_i.
Similar to bmk1880_tpu_and_int16, but computes res_i = a_i ⊕ b_i.
bmk1880_tpu_copy instructs TPU to copy tensors within lmem, from src to dst. The basic data must be 8-bit.
Similar to bmk1880_tpu_copy, but the user provides stride_t structures specifying the layouts of tensors dst and src. The basic data must be 8-bit.
bmk1880_tpu_lut instructs TPU to compute a tensor res from tensor idx, using tensor table as a lookup table and the values in tensor idx as indices. Tensor table must be of shape (1, slices, 16, 16), where slices is the number of lmem slices. Tensor idx and tensor res must have the same shape. Assuming their shape is (N, C, H, W), the datum res_i at index (n_i, c_i, h_i, w_i) in tensor res is computed from idx_i at the same index (n_i, c_i, h_i, w_i) in tensor idx as res_i = table_i, where table_i is at index (0, c_t, ⌊idx_i / 16⌋, idx_i mod 16) in tensor table, and c_t is the index of the lmem slice in which the datum idx_i resides. The basic data in all tensor_lmem structures must be 8-bit.
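Within one lmem slice, the lookup amounts to indexing a 16 × 16 table by the quotient and remainder of the index value divided by 16. The sketch below models a single slice only; how data are distributed across slices is not modelled, and Table16x16 and lut_ref are names invented for this example.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <vector>

// One slice of the lookup table: 16 rows of 16 entries (256 values).
using Table16x16 = std::array<std::array<uint8_t, 16>, 16>;

// res_i = table[idx_i / 16][idx_i mod 16] for every index value.
std::vector<uint8_t> lut_ref(const Table16x16& table,
                             const std::vector<uint8_t>& idx) {
    std::vector<uint8_t> res;
    res.reserve(idx.size());
    for (uint8_t i : idx) {
        res.push_back(table[i / 16][i % 16]);
    }
    assert(res.size() == idx.size());  // res has the same shape as idx
    return res;
}
```

Since an 8-bit index covers exactly 0 to 255, every index value maps to one of the 256 table entries.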
bmk1880_tpu_relu instructs TPU to compute res_i = max(0, a_i) for each datum a_i in tensor a, where res_i and a_i have the same index. The basic data in all tensor_lmem structures must be 8-bit.
bmk1880_tpu_conv instructs TPU to compute a tensor ofmap from tensors ifmap, weight and bias, using ifmap as the input feature map, weight as the convolution kernel and bias as the bias to be added to the convolution result. relu_enable may be true, indicating that ReLU activations are applied after adding the bias values but before shifting every basic datum. rshift_width specifies the number of bits by which every basic datum is shifted right after the optional ReLU activations.
ofmap and ifmap must be aligned (see BMKernel 1880 Guide.pdf).
weight has a special layout which is very different from the one described in the programming model (see BMKernel 1880 Guide.pdf). If ifmap is of shape (N_in, C_in, H_in, W_in), ofmap is of shape (N_out, C_out, H_out, W_out) and the convolution kernels are of shape (H_kernel, W_kernel), then weight should be of shape (C_in, C_out, H_kernel, W_kernel).
The layout of weight, however, is as if it were of shape (1, C_out, H_kernel × W_kernel, C_in). This special layout can be precisely defined by applying specific stride values to weight’s logical shape (C_in, C_out, H_kernel, W_kernel) (see BMKernel 1880 Guide.pdf).
bias may be NULL, indicating no bias values. If it is non-NULL, and ofmap is of shape (N, C, H, W), then bias must be a 16-bit tensor of shape (1, C, 1, 1). Since a 16-bit tensor is stored as two 8-bit tensors in lmem, bias’s tensor_lmem structure must be of shape (2, C, 1, 1) and must be unaligned (see unaligned stride values in section 2.4 of BMKernel 1880 Guide.pdf). During the bias-adding phase, the value of the datum at index (0, c_i, 0, 0) in the 16-bit tensor is added to all data in ofmap whose C-dimension index is c_i.
param contains detailed convolution parameters that can be classified into four categories by function: insertion, padding, striding and dilation parameters, which are detailed below. Insertion parameters specify the number of zeros to be inserted into specific locations within ifmap. They include ins_h, ins_last_h, ins_w and ins_last_w. ins_h specifies the number of zeros to be inserted after every non-last basic datum along the H-dimension. Consider an ifmap of shape (N, C, H, W) for example. After inserting zeros, ifmap′ will be of shape (N, C, H′, W), where H′ = 1 + (H − 1) × (ins_h + 1). Denoting by x_{n_i,c_i,h_i,w_i} the value of the basic datum at index (n_i, c_i, h_i, w_i) of tensor ifmap, and by x′_{n_i,c_i,h_i,w_i} the value of that of tensor ifmap′, the following holds:
\[ x'_{n_i,c_i,h_i,w_i} = \begin{cases} x_{n_i,\,c_i,\,h_i/(\mathrm{ins\_h}+1),\,w_i} & \text{if } h_i \bmod (\mathrm{ins\_h}+1) = 0 \\ 0 & \text{otherwise} \end{cases} \qquad 0 \le h_i < H' \]
ins_last_h specifies the number of zeros to be inserted only after every last basic datum. Similarly, ins_w and ins_last_w specify the number of zeros to be inserted along the W-dimension. Padding parameters specify the number of zeros to be inserted around the elements within ifmap. pad_top specifies the number of zeros to be inserted before every first basic datum along the H-dimension. pad_bottom specifies the number after every last basic datum along the H-dimension. Similarly, pad_left and pad_right specify the numbers along the W-dimension. Striding parameters specify the number of basic data the convolution kernel should stride over after each convolution step. stride_h and stride_w specify the numbers along the H-dimension and the W-dimension, respectively. Dilation parameters specify the dilation of the convolution kernel weight. That is, (dilation_h − 1) zeros are inserted between each two basic data of the kernel along the H-dimension. Similarly, (dilation_w − 1) zeros are inserted along the W-dimension.
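The combined effect of the insertion and padding parameters along one dimension can be modelled on a single column of data. The sketch below assumes the order suggested by the descriptions above (padding before the first datum, inserted zeros between data, ins_last zeros after the last datum, then trailing padding); insert_and_pad_ref is a hypothetical host-side helper, not a BMKernel function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// 1-D model of zero insertion and padding along the H-dimension.
// x holds the H data of one column; the result is the expanded column.
std::vector<int> insert_and_pad_ref(const std::vector<int>& x,
                                    int ins, int ins_last,
                                    int pad_before, int pad_after) {
    assert(!x.empty());
    assert(ins >= 0 && ins_last >= 0 && pad_before >= 0 && pad_after >= 0);
    std::vector<int> out(static_cast<size_t>(pad_before), 0);  // like pad_top
    for (size_t i = 0; i < x.size(); ++i) {
        out.push_back(x[i]);
        // ins zeros after every non-last datum, ins_last after the last one.
        const int zeros = (i + 1 < x.size()) ? ins : ins_last;
        out.insert(out.end(), static_cast<size_t>(zeros), 0);
    }
    out.insert(out.end(), static_cast<size_t>(pad_after), 0);  // like pad_bottom
    return out;
}
```

With data (1, 2, 3), ins = 1, ins_last = 0 and padding 1 before and 2 after, the result is (0, 1, 0, 2, 0, 3, 0, 0): the middle five entries have length 1 + (3 − 1) × (1 + 1) = 5, in line with the H′ formula.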
Similar to bmk1880_tpu_conv, but uses the Winograd algorithm to accelerate the computation. Moreover, weight must contain only 3 × 3 kernels and must be default strided in lmem (see section 2.4 of BMKernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in function bmk1880_tpu_conv (see section 4.41 of BMKernel 1880 Guide.pdf).
Similar to bmk1880_tpu_conv, but computes a depthwise convolution. Moreover, weight is default strided in lmem (see section 2.4 of BMKernel 1880 Guide.pdf). The other parameters, including those in param, are similar to those of the same names in function bmk1880_tpu_conv.
bmk1880_tpu_max_pooling instructs TPU to compute a tensor ofmap from tensor ifmap by performing a (kh × kw) max pooling over ifmap. The size parameters of the pooling kernel, kh and kw, are specified in param. The other parameters in param are similar to those of the same names in bmk1880_conv_param_t.
ofmap and ifmap must be aligned.
Similar to bmk1880_tpu_max_pooling, but performs an average pooling over ifmap as controlled by avg_pooling_const. At every pooling step, all related basic data in ifmap are summed together, multiplied by avg_pooling_const, and then shifted right by rshift_width bits.
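Since the hardware multiplies and shifts instead of dividing, the averaging factor 1/(kh × kw) has to be encoded as avg_pooling_const / 2^rshift_width. The sketch below models one pooling step under stated assumptions: a signed 8-bit result, an arithmetic shift without rounding, and a helper name (avg_pool_step_ref) invented for this example.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// One average-pooling step: (sum * const) >> rshift, saturated to int8.
int32_t avg_pool_step_ref(const std::vector<int32_t>& window,
                          int32_t avg_pooling_const, int rshift_width) {
    assert(rshift_width >= 0 && rshift_width < 32);
    int32_t sum = 0;
    for (int32_t v : window) sum += v;  // all data covered by the kernel
    // Arithmetic shift; no rounding is modelled (an assumption).
    const int32_t scaled = (sum * avg_pooling_const) >> rshift_width;
    return scaled < -128 ? -128 : (scaled > 127 ? 127 : scaled);
}
```

For a 2 × 2 window, avg_pooling_const = 64 with rshift_width = 8 gives 64 / 256 = 1/4, so the window (10, 20, 30, 40) averages to 25.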
bmk1880_tpu_matrix_mac instructs TPU to compute a matrix res by multiplying the left matrix left by the right matrix right, then adding the matrix bias (if not NULL), and finally shifting right by rshift_width bits. Note that all tensor_lmem structures involved must be matrices instead of tensors. ctrls may have the CTRL_RELU or the CTRL_RA flag set, but not both. After adding bias but before right shifting, one of two things may happen: if ctrls is CTRL_RELU, ReLU activations are performed, in which negative values are rectified to 0; if ctrls is CTRL_RA, the original values in res are shifted left by lshift_width bits and then added to the results. res_is_int8 indicates whether the result is 8-bit or 16-bit.
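The order of operations (multiply, add bias, optional ReLU, right shift, saturate) can be checked against a small host-side reference. The sketch below covers only the CTRL_NULL and CTRL_RELU cases with a signed 8-bit result; the CTRL_RA accumulation and the 16-bit result layout are left out, and Matrix and matrix_mac_ref are names chosen for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

using Matrix = std::vector<std::vector<int32_t>>;

// Reference: res = saturate(relu?(left * right + bias) >> rshift_width).
// bias may be nullptr; when present it holds one value per result column.
Matrix matrix_mac_ref(const Matrix& left, const Matrix& right,
                      const std::vector<int32_t>* bias,
                      bool relu, int rshift_width) {
    assert(!left.empty() && !right.empty());
    assert(rshift_width >= 0 && rshift_width < 32);
    const size_t rows = left.size();
    const size_t inner = right.size();
    const size_t cols = right[0].size();
    assert(bias == nullptr || bias->size() == cols);
    Matrix res(rows, std::vector<int32_t>(cols, 0));
    for (size_t r = 0; r < rows; ++r) {
        assert(left[r].size() == inner);
        for (size_t c = 0; c < cols; ++c) {
            int32_t acc = 0;
            for (size_t k = 0; k < inner; ++k) acc += left[r][k] * right[k][c];
            if (bias != nullptr) acc += (*bias)[c];  // bias row of shape (1, C)
            if (relu && acc < 0) acc = 0;            // before the right shift
            acc >>= rshift_width;
            res[r][c] = acc < -128 ? -128 : (acc > 127 ? 127 : acc);
        }
    }
    return res;
}
```

Multiplying ((1, 2), (3, 4)) by ((5, 6), (7, 8)) gives ((19, 22), (43, 50)); with bias (1, −100), ReLU and a shift of 1, the result becomes ((10, 0), (22, 0)).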
The use of the res matrix is unusual when ctrls is CTRL_RA or when res_is_int8 is false. Assume that the result is a matrix of shape (R, C). When ctrls is CTRL_RA, the original result is a 16-bit matrix of shape (R, C) represented by res. Since a 16-bit matrix’s high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, res’s tensor_lmem structure must be of an 8-bit format (FMT_I8 or FMT_U8), must be of shape (R × 2, C), and must be aligned (see aligned stride values in section 2.4 of BMKernel 1880 Guide.pdf). When res_is_int8 is false, the final result is a 16-bit matrix similarly represented by res. When ctrls is CTRL_RA but res_is_int8 is true, the original result is 16-bit while the final result is 8-bit. In this case, only the low 8-bit parts (located at lower addresses) of the res matrix are written with the final result. In the remaining case, where both the original and the final result are 8-bit matrices, res is a normal 8-bit matrix of shape (R, C).
Note that bias is different from the one in bmk1880_tpu_conv, bmk1880_winograd or bmk1880_tpu_depthwise. Firstly, it is a matrix. Moreover, if res is of shape (R, C), then bias must be a 16-bit matrix of shape (1, C). Since a 16-bit matrix’s high and low 8-bit parts are stored separately as two 8-bit matrices in lmem, bias’s tensor_lmem structure must be of shape (2, C) and must be aligned.
res, left, right and bias must all be aligned.
bmk1880_tpu_matrix_mac_2 instructs TPU to compute a matrix res by multiplying the left matrix left by the right matrix right. left, right and res must be tensors, even though the computation is a matrix multiplication. res and left must be of shape (1, 256, 1, 256). right must be of shape (256, 16, 1, 16). The basic data in all tensor_lmem structures must be 8-bit.
TensorOp represents a BMNet IR, which is a bridge between the front end and the back end. It provides many member methods for setting and getting its information. Below is the prototype:
Return the number of inputs.
Return the number of outputs.
const TensorShape& TensorOp::input_shape(int index)
Return the shape of an input by index.
Return the shape of an output by index.
Return a mutable pointer to a newly added TensorShape of the outputs. The returned TensorShape can be modified later.
Return the offset of an input tensor by index, as stored in device memory.
Return the offset of an output tensor by index, as stored in device memory.
Return a mutable pointer to the parameters of the customized BMNet IR.
Return a reference to the customized BMNet IR’s parameters.
CustomizedCaffeLayer is an abstract class used to implement a Layer that converts a CAFFE Layer into BMNet IR (please refer to Chapter 5 for details about BMNet IR). If you want to introduce a customized CAFFE layer into BMNet, inherit from this class and implement all of its pure virtual functions. CustomizedCaffeLayer inherits from the CaffeLayer/Layer classes. Below are their prototypes:
Pure virtual function; returns the type of the newly added CAFFE layer.
Pure virtual function; used to print information about the CAFFE Layer.
Optional. It is used only to set the sub type of the customized Layer and has a default implementation. If a child class overrides it, this parent class setup function must be called first.
Pure virtual function; used to set up the BMNet IR according to the LayerParameter of the CAFFE Layer. In this function, you should set up the output shape and fill in the parameters of the TensorOp.
Protected member method; should be called when setting up the output offset of the Layer’s top.
Protected member variable; a reference to the customized CAFFE layer’s LayerParameter.
CustomizedTensorFixedInst is an abstract class used to implement a Layer that converts BMNet IR into instructions by means of the BMKernel APIs. Inherit from this class and implement all of its pure virtual functions. CustomizedTensorFixedInst inherits from the TensorFixedInst/TensorInst classes. Below are their prototypes:
Pure virtual function; returns the type of the customized BMNet IR.
Pure virtual function; used to print information about the BMNet IR.
Pure virtual function; used to convert the BMNet IR into instructions using the BMKernel APIs.
Protected member method; returns the base address at which the neurons are stored in device memory.
Protected member method; returns the base address at which the weight is stored in device memory.
Protected member variable; a reference to the BMNet IR.
TGCustomizedParamter represents the parameters of a customized BMNet IR. It provides member methods for setting and getting these parameters. Below is the prototype:
Return the number of int parameters stored in TGCustomizedParamter.
Return the number of float parameters stored in TGCustomizedParamter.
Return an int parameter by index.
Return a float parameter by index.
Append a new int parameter to TGCustomizedParamter.
Append a new float parameter to TGCustomizedParamter.
TensorShape represents the shape of a tensor. Below is the prototype:
Return the number of dims.
Return one dim by index.
Append a dim to TensorShape.
Copy from another TensorShape instance.
CaffeBuilder is a class that provides a uniform interface combining the front end, optimizer and back end core code into one, in order to compile a CAFFE neural network graph into a bmodel file. CaffeBuilder inherits from the Builder class, which is the base compiler class. Below are their prototypes:
Constructor of the CaffeBuilder class.
modified_proto is an optional parameter, so you do not need to supply every parameter. The following combinations are valid: 1) caffemodel only; 2) caffemodel together with modified_proto.
Core member function of the CaffeBuilder class, used to compile the network by specifying the input shape and the optimization level.
Below are the values for opt.
Store the optimized network graph as a file.
Store the compiled instructions, weight and other information of the network as a bmodel file.
Register a newly added customized layer, which is used to convert a CAFFE layer into BMNet IR (intermediate representation).
Register a newly added customized TensorInst (tensor instruction), which is used to convert BMNet IR into instructions.
bmk1880_tpu_mdsum instructs TPU to compute a tensor res of shape (1, C, 1, 1) from tensor a of shape (N, C, H, W). Every datum res_{c_i} at index (0, c_i, 0, 0) in tensor res is computed from the data a_{n_i,c_i,h_i,w_i} at indices (n_i, c_i, h_i, w_i) in tensor a. The basic data in all tensor_lmem structures must be 8-bit. If the result is a 16-bit tensor, res_high and res_low represent its high and low 8-bit parts, respectively. res_high should be NULL if the result is 8-bit. a and res must be either both signed or both unsigned.

Parameter | Type | Description
index | Input | Not used now. Set 0 as default value.
ctx | Output | The pointer of BM context handle.

Parameter | Type | Description
ctx | Input | The BM context handle which is created by bm_init().

Parameter | Type | Description
count | Output | The count of BM device.
devinfo | Output | The array of device info.

Parameter | Type | Description
index | Input | The index of BM device.
dev | Output | The pointer of BM device handle.

Parameter | Type | Description
dev | Input | The BM device handle.

Parameter | Type | Description
dev | Input | The BM device handle.

Parameter | Type | Description
ctx | Output | The pointer of BM context handle.

Parameter | Type | Description
ctx | Input | The BM context handle.

Parameter | Type | Description
ctx | Input | The BM context handle.
dev | Input | The BM device handle.

Parameter | Type | Description
ctx | Input | The BM context handle.

Parameter | Type | Description
ctx | Input | The BM context handle.

Parameter | Type | Description
ctx | Input | The BM context handle.
p_bk_ctx | Output | The pointer of BM kernel handle.

Parameter | Type | Description
ctx | Input | The BM context handle.

Parameter | Type | Description
ctx | Input | The BM context handle.

Parameter | Type | Description
ctx | Input | The BM context handle.
size | Input | The size of device memory.

Parameter | Type | Description
ctx | Input | The BM context handle.
mem | Input | The previously allocated device memory.
offset | Input | The offset in the previously allocated device memory.
size | Input | The size of device memory.

Parameter | Type | Description
ctx | Input | The BM context handle.
shape | Input | The shape of device memory.

Parameter | Type | Description
ctx | Input | The BM context handle.
mem | Input | The previously allocated device memory.
offset | Input | The offset in the previously allocated device memory.
shape | Input | The shape of device memory.

Parameter | Type | Description
ctx | Input | The BM context handle.
mem | Input | The device memory handle.

Parameter | Type | Description
ctx | Input | The BM context handle.
mem | Input | The device memory handle.

Parameter | Type | Description
ctx | Input | The BM context handle.
mem | Input | The device memory handle.
Parameter
Type
Description
ctx
Input
The BM context handle.
dst
Input
The device memory handle.
src
Input
The system memory pointer.
Parameter
Type
Description
ctx
Input
The BM context handle.
dst
Input
The system memory pointer.
src
Input
The device memory handle.
Parameter
Type
Description
ctx
Input
The BM context handle.
info
Input
The BM network info.
net
Output
The registered network handle.
Parameter
Type
Description
ctx
Input
The BM context handle.
bmodel
Input
The bmodel filename.
net
Output
The registered network handle.
Parameter
Type
Description
ctx
Input
The BM context handle.
info
Input
The BM network info.
net
Output
The registered network handle.
Parameter
Type
Description
net
Input
The BM network handle.
input_shape
Input
The input shape.
Parameter
Type
Description
net
Input
The BM network handle.
output_info
Output
The output info.
Parameter
Type
Description
net
Input
The BM network handle.
Parameter
Type
Description
net
Input
The BM network handle.
Parameter
Type
Description
net
Input
The BM network handle.
Parameter
Type
Description
net
Input
The BM network handle.
Parameter
Type
Description
net
Input
The BM network handle.
Parameter
Type
Description
net
Input
The BM network handle.
Parameter
Type
Description
net
Input
The BM network handle.
weight_mem
Input
The weight device memory handle.
Parameter
Type
Description
net
Input
The BM network handle.
neuron_mem
Input
The neuron device memory handle.
Parameter
Type
Description
net
Input
The BM network handle.
input
Input
The input data pointer.
Parameter
Type
Description
net
Input
The BM network handle.
neuron_offset
Input
The offset of neuron buffer.
neuron_size
Input
The neuron buffer size.
neuron
Input
The pointer to the neuron buffer.
Parameter
Type
Description
net
Input
The BM network handle.
output
Input
The output buffer pointer.
Parameter
Type
Description
net
Input
The BM network handle.
neuron_offset
Input
The offset of neuron buffer.
neuron_size
Input
The neuron buffer size.
neuron
Input
The pointer to the neuron buffer.
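One way to read the neuron_offset, neuron_size and neuron parameters is as a byte window into the network's neuron memory. The sketch below illustrates that reading on the host side only. It assumes the offset and size are counted in bytes, and `read_neuron_window` together with the `neuron_mem` vector are hypothetical stand-ins, not SDK functions or types.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Copies neuron_size bytes, starting at neuron_offset, from a host-side
// mirror of the neuron memory (hypothetical) into the caller's buffer.
// Returns false when the buffer is null or the window is out of range.
bool read_neuron_window(const std::vector<std::uint8_t>& neuron_mem,
                        std::size_t neuron_offset, std::size_t neuron_size,
                        void* neuron) {
  if (neuron == nullptr) {
    return false;
  }
  // Arranged to avoid overflow in neuron_offset + neuron_size.
  if (neuron_offset > neuron_mem.size() ||
      neuron_size > neuron_mem.size() - neuron_offset) {
    return false;
  }
  if (neuron_size > 0) {
    std::memcpy(neuron, neuron_mem.data() + neuron_offset, neuron_size);
  }
  return true;
}
```

The caller owns the destination buffer and must make it at least neuron_size bytes long; the sketch checks the source window but cannot check the destination's capacity.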
Parameter
Type
Description
net
Input
The BM network handle.
input
Input
The input buffer pointer.
output
Input
The output buffer pointer.
Parameter
Type
Description
index
int
[Required] Index of the input to be returned.
Parameter
Type
Description
index
int
[Required] Index of the output to be returned.
Parameter
Type
Description
index
int
[Required] Index of the input to be returned.
Parameter
Type
Description
index
int
[Required] Index of the output to be returned.
Parameter
Type
Description
op
TensorOp*
[Required] Pointer to an instance of BMNET IR.
Parameter
Type
Description
offset
int
[Required] Offset of the output; should be 0.
Parameter
Type
Description
index
int
[Required] Index of the int parameter to be returned.
Parameter
Type
Description
index
int
[Required] Index of the float parameter to be returned.
Parameter
Type
Description
value
int
[Required] int parameter.
Parameter
Type
Description
value
float
[Required] float parameter.
Parameter
Type
Description
index
int
[Required] Index of the dim to be returned.
Parameter
Type
Description
value
int
[Required] new dim to be appended.
Parameter
Type
Description
value
const TensorShape&
[Required] source TensorShape instance.
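The three shape operations described above (read a dim by index, append a dim, copy from another shape) can be pictured with a small stand-in class. This is a sketch under assumptions, not the BMNet `TensorShape` implementation: `ShapeSketch` and its member names are hypothetical, and throwing on an out-of-range index is a choice made for this example rather than documented SDK behavior.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in that mimics the documented shape operations.
class ShapeSketch {
 public:
  // Returns the dim stored at the given index.
  // Throws std::out_of_range for an invalid index (assumed behavior).
  int dim(int index) const {
    if (index < 0 || static_cast<std::size_t>(index) >= dims_.size()) {
      throw std::out_of_range("dim index out of range");
    }
    return dims_[static_cast<std::size_t>(index)];
  }

  // Appends a new dim at the end of the shape.
  void add_dim(int value) { dims_.push_back(value); }

  // Replaces this shape's dims with a copy of the source shape's dims.
  void copy_from(const ShapeSketch& value) { dims_ = value.dims_; }

  // Returns the number of dims currently stored.
  int dim_size() const { return static_cast<int>(dims_.size()); }

 private:
  std::vector<int> dims_;
};
```

Appending 1, 3, 224 and 224 in that order yields a four-dim shape whose dim at index 1 is 3; copying it into a second `ShapeSketch` produces an independent shape with the same dims.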
Parameter
Type
Description
ver
CHIP_VER
[Required] The target chip version. Currently only BM_CHIP_BM1880 is available.
modified_proto
const char*
[Optional] The modified prototxt file. See Chapter 4 for details.
caffemodel
const char*
[Required] The caffemodel file of the network.
weight_bin
const char*
[Optional] The weight file of the network.
in_ctable
const char*
[Required] The calibration table file of the network.
out_ctable
const char*
[Required] The output calibration table file of the network.
Parameter
Type
Description
n,c,h,w
int
[Required] The input shape
opt
int
[Optional] The input optimization options. The default value is BM_OPT_LAYER_GROUP_WITH_WEIG
Value
Description
OPT_NONE
No optimization
BM_OPT_LAYER_GROUP
Divides layers into clusters to optimize the bandwidth overhead.
BM_OPT_LAYER_GROUP_WITH_WEIG
Adds further optimization to reduce the device memory footprint and reshape weights.
Parameter
Type
Description
dst
const char*
[Required] The destination file to write to.
Parameter
Type
Description
net_name
const char*
[Required] The network name.
dst
const char*
[Required] The file in which to store the bmodel.
plugin_path
const char*
[Required] Path to the CPU op plugins.
Parameter
Type
Description
layer
Layer*
[Required] Pointer to an instance of class Layer.
Parameter
Type
Description
inst
TensorInst*
[Required] Pointer to an instance of class TensorInst.