Programming with BMNNSDK

There are two ways to program with the runtime library:

  • BMNet

  • BMKernel.

Programming by BMNet

We provide several utility tools that convert CAFFE models into machine instructions. These instructions, together with the model’s weights, are packed into a file called a bmodel (the model file for BITMAIN targets), which can be executed directly on a BITMAIN board. BMNet already implements many common layers. The built-in layers are listed below, and more are under development:

Activation

BatchNorm

Concat


Convolution

Eltwise

Flatten

InnerProduct

Join

LRN

Normalize

Permute

Pooling

PReLU

PriorBox

Reorg

Reshape

Scale

Split

Upsample

The programming flow is as follows:

BMNet takes the caffemodel generated by the CAFFE framework and the deployment file deploy.prototxt as input. After passing through the front end, optimizer and back end stages, a bmodel file is generated.

$ bm_builder.bin \
-t bm1880 \
-n googlenet \
-c /data/bmnet_models/googlenet/googlenet.caffemodel \
-m /data/bmnet_models/googlenet/googlenet_deploy.prototxt \
--enable-layer-group=yes \
-s 1,3,224,224 \
-o bmnet/out/googlenet_1_3_224_224.bmodel

If all layers of your network model are supported by BMNet, compiling the network from the command line is the most convenient option. Otherwise, refer to the BMKernel programming model.
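The -s option in the command above passes the input shape as a comma-separated list. For the GoogLeNet example, 1,3,224,224 reads naturally as batch, channels, height and width, although that ordering is an assumption here rather than something stated by the tool. As a small sketch, the helpers below parse such a string and compute the number of input elements, which can help when sizing input buffers. The names parse_shape and element_count are hypothetical and not part of BMNNSDK.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split "1,3,224,224" into {1, 3, 224, 224}.
std::vector<int> parse_shape(const std::string& text) {
    std::vector<int> dims;
    std::stringstream stream(text);
    std::string item;
    while (std::getline(stream, item, ',')) {
        dims.push_back(std::stoi(item));
    }
    return dims;
}

// Number of elements in a tensor with the given dimensions.
std::size_t element_count(const std::vector<int>& dims) {
    assert(!dims.empty());
    std::size_t count = 1;
    for (int d : dims) {
        assert(d > 0);
        count *= static_cast<std::size_t>(d);
    }
    return count;
}
```

For the shape used in the command above, element_count(parse_shape("1,3,224,224")) gives 150528 elements.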

Programming by BMKernel

To program at the kernel level, first call bmruntime_bmkernel_create() to create a BMKernel. Once the BMKernel is created, the application can use the BMKernel interfaces to generate kernel commands and then submit them with bmruntime_bmkernel_submit(). Finally, bmruntime_bmkernel_destroy() should be called to release the kernel resources. The programming flow chart is as follows:

Programming flow chart
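The create, submit and destroy sequence described above maps well onto a C++ scope guard, so that the kernel is released even on an early return. The sketch below illustrates only the calling order. The three stub_ functions and the KernelSession class are hypothetical stand-ins defined locally, because the real bmruntime_bmkernel_* signatures are not given in this section.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins that only record the call order.
// They are NOT the real bmruntime_bmkernel_* functions.
static std::vector<std::string> g_calls;
struct StubKernel { int pending_commands = 0; };

StubKernel* stub_kernel_create() {
    g_calls.push_back("create");
    return new StubKernel();
}
void stub_kernel_submit(StubKernel* k) {
    g_calls.push_back("submit");
    k->pending_commands = 0;
}
void stub_kernel_destroy(StubKernel* k) {
    g_calls.push_back("destroy");
    delete k;
}

// Scope guard: create in the constructor, destroy in the destructor.
class KernelSession {
public:
    KernelSession() : kernel_(stub_kernel_create()) {}
    ~KernelSession() { stub_kernel_destroy(kernel_); }
    KernelSession(const KernelSession&) = delete;
    KernelSession& operator=(const KernelSession&) = delete;

    // Stands in for "use BMKernel interfaces to generate kernel commands".
    void add_command() { ++kernel_->pending_commands; }
    int pending() const { return kernel_->pending_commands; }
    void submit() { stub_kernel_submit(kernel_); }

private:
    StubKernel* kernel_;
};

// Runs one create -> generate -> submit -> destroy cycle
// and returns the recorded call order.
std::vector<std::string> run_kernel_cycle() {
    g_calls.clear();
    {
        KernelSession session;
        session.add_command();
        session.add_command();
        assert(session.pending() == 2);
        session.submit();
    }  // the destructor releases the kernel here
    return g_calls;
}
```

In real code, the constructor and destructor would wrap the actual create and destroy calls with whatever arguments they require.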

BMNet provides a series of APIs for adding customized layers without modifying the BMNet core code. A customized layer can be an entirely new layer or a replacement for an original CAFFE layer in BMNet. The tutorial below walks through the steps to create a simple custom layer that replaces an original CAFFE layer in BMNet, using the LeakyRelu layer as an example (the source code can be found in bmnet/example/customized_layer_1880/).

Add new caffe layer definition

Modify bmnet_caffe.proto in the path “bmnet/examples/customized_layer_1880/proto”. First, check whether the layer definition already exists. If it does, skip this step. If it does not, append a new line at the end of LayerParameter with a new index, and add the definition of the new parameter message.

Note: the new layer must be added as the last line of LayerParameter. For example, to add a ReLUParameter:

message LayerParameter {
  optional string name = 1; // the layer name
  optional string type = 2; // the layer type
  repeated string bottom = 3; // the name of each bottom blob
  repeated string top = 4; // the name of each top blob
  ...
  optional AccuracyParameter accuracy_param = 102;
  optional EmbedParameter embed_param = 137;
  optional ExpParameter exp_param = 111;
  optional ReLUParameter relu_param = 123;
}

// Message that stores parameters used by ReLU Layer
message ReLUParameter {
  optional float negative_slope = 1 [default = 0];
  enum Engine {
    DEFAULT = 0;
    CAFFE = 1;
    CUDNN = 2;
  }
  optional Engine engine = 2 [default = DEFAULT];
}

Add new CAFFE layer class

Create a child class that inherits from CustomizedCaffeLayer, and implement the layer_name(), setup(), dump() and codegen() member methods:

  • layer_name(): returns the string name of the layer type.

  • setup(): optional. It is only needed to set the sub type with set_sub_type(). If setup() is not implemented, the sub type defaults to the layer type.

  • dump(): dumps the parameter details of the newly added CAFFE layer.

  • codegen(): converts the parameters of the CAFFE layer to tg_customized_param, which is the parameter of the customized IR.

#include <bmnet/frontend/caffe/CaffeFrontendContext.hpp>
#include <bmnet/frontend/caffe/CustomizedCaffeLayer.hpp>
#include <iostream>
#include <string>

class LeakyReluLayer : public CustomizedCaffeLayer {
public:
  LeakyReluLayer() : CustomizedCaffeLayer() {}

  // return type name of new added CAFFE layer.
  std::string layer_name() {
    return std::string("ReLU");
  }

  // dump parameters of CAFFE layer object layer_.
  void dump() {
    const caffe::ReLUParameter &in_param = layer_.relu_param();
    float negative_slope = in_param.negative_slope();
    std::cout << "negative_slope:" << negative_slope;
  }

  // set the sub type of the customized IR.
  void setup(TensorOp *op) {
    CustomizedCaffeLayer::setup(op);
    TGCustomizedParameter *param = op->mutable_tg_customized_param();
    param->set_sub_type("leakyrelu");
  }

  // convert parameters of CAFFE layer to customized
  // IR(TensorOp *op)'s parameter(tg_customized_param)
  void codegen(TensorOp *op) {
    // get input shape
    const TensorShape &input_shape = op->input_shape(0);
    // get parameter from caffe proto
    const caffe::ReLUParameter &in_param = layer_.relu_param();
    float negative_slope = in_param.negative_slope();
    // set normal output shape
    TensorShape *output_shape = op->add_output_shape();
    output_shape->CopyFrom(input_shape);
    // set out_param
    TGCustomizedParameter *out_param = op->mutable_tg_customized_param();
    out_param->add_f32_param(negative_slope);
  }
};
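For reference, the function this custom layer computes is y = max(x, 0) + negative_slope * min(x, 0), the same formula used by the CPU version later in this section. A minimal float sketch that does not depend on BMNet can be useful for checking results. The name leaky_relu_reference is hypothetical, and the non-negative slope check is an assumption of this sketch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Float reference for LeakyReLU: positive values pass through,
// negative values are multiplied by negative_slope.
std::vector<float> leaky_relu_reference(const std::vector<float>& input,
                                        float negative_slope) {
    assert(negative_slope >= 0.0f);  // assumption of this sketch
    std::vector<float> output(input.size());
    for (std::size_t i = 0; i < input.size(); ++i) {
        output[i] = std::max(input[i], 0.0f) +
                    negative_slope * std::min(input[i], 0.0f);
    }
    return output;
}
```

With negative_slope = 0.5, the inputs {2, -4, 0} map to {2, -2, 0}.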

Add new Tensor Instruction class

Create a child class that inherits from CustomizedTensorFixedInst, the class that converts IR to instructions, and implement the inst_name(), dump() and encode() member functions:

  • inst_name(): returns the IR name, in lowercase, formed from the prefix “tg” and the sub_type (here, tg_leakyrelu). The sub_type is the one set in section 6.2.

  • dump(): dumps the tg_customized_param details of the IR op.

  • encode(): converts the IR to instructions. If the IR can be deployed to the NPU, use the BMKernel API to implement it. Otherwise, you can implement a pure CPU version in C++.

#include <bmnet/targets/plat-bm188x/BM188xBackendContext.hpp>
#include <bmnet/targets/plat-bm188x/CustomizedTensorFixedInst.hpp>
#include <bmkernel/bm_kernel.h>
#include <iostream>
#include <string>

namespace bmnet {

class TGLeakyReluFixedInst : public CustomizedTensorFixedInst {
public:
  TGLeakyReluFixedInst() : CustomizedTensorFixedInst() {}
  ~TGLeakyReluFixedInst() {}

  // return type name of IR
  std::string inst_name() {
    return std::string("tg_leakyrelu");
  }

  // dump tg_customized_param of IR op_.
  void dump() {
    const TGCustomizedParameter& param = op_.tg_customized_param();
    float alpha = param.f32_param(0);
    std::cout << "alpha:" << alpha << std::endl;
  }

  // extract parameters of tg_customized_param,
  // and implement instructions.
  void encode();

private:
  void forward(
      gaddr_t bottom_gaddr, gaddr_t top_gaddr,
      int input_n, int input_c,
      int input_h, int input_w);
};

} // namespace bmnet
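The naming rule for inst_name() can be written as a small helper. The sketch below assumes the name is "tg", an underscore, and the lowercased sub_type, which is inferred from the "tg_leakyrelu" example above rather than from a stated rule. The function make_inst_name is hypothetical and not part of BMNet.

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper: build the IR name from a sub_type,
// assuming the pattern "tg_" + lowercase(sub_type).
std::string make_inst_name(const std::string& sub_type) {
    assert(!sub_type.empty());
    std::string name = "tg_";
    for (unsigned char ch : sub_type) {
        name += static_cast<char>(std::tolower(ch));
    }
    return name;
}
```

With the sub_type "leakyrelu" set in setup(), this yields "tg_leakyrelu", matching the string returned by inst_name() above.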

NPU Version

If the IR can be deployed to the NPU, use the BMKernel APIs to implement the encode() function. For more details about the BMKernel APIs, please refer to the related document.

void TGLeakyReluFixedInst::encode() {
  const TGCustomizedParameter& param = op_.tg_customized_param();
  const TensorShape& input_shape = op_.input_shape(0);
  float negative_slope = param.f32_param(0);
  assert(negative_slope > 0);
  gaddr_t input_data_gaddr = op_.global_input(0) + get_global_neuron_base();
  gaddr_t output_data_gaddr = op_.global_output(0) + get_global_neuron_base();
  forward(
      input_data_gaddr,
      output_data_gaddr,
      input_shape.dim(0),
      input_shape.dim(1),
      input_shape.dim(2),
      input_shape.dim(3));
}

void TGLeakyReluFixedInst::forward(
    gaddr_t bottom_gaddr,
    gaddr_t top_gaddr,
    int input_n,
    int input_c,
    int input_h,
    int input_w)
{
  gaddr_t slice_bottom_gaddr = bottom_gaddr;
  gaddr_t slice_top_gaddr = top_gaddr;
  int count = input_n * input_c * input_h * input_w;
  int slice_num = get_slice_num_element_wise(*_ctx, 3, count + 1);
  int gt_right_shift_width =
      m_calibrationParameter.relu_param().gt_right_shift_width();
  int le_right_shift_width =
      m_calibrationParameter.relu_param().le_right_shift_width();
  int gt_scale = m_calibrationParameter.relu_param().gt_scale();
  int le_scale = m_calibrationParameter.relu_param().le_scale();

  for (int slice_idx = 0; slice_idx < slice_num; slice_idx++) {
    int count_sec = count / slice_num + (slice_idx < count % slice_num);
    // set shape
    shape_t input_shape = shape_t1(count_sec);
    tensor_lmem *bottom = _ctx->tl_alloc(input_shape, FMT_I8, CTRL_AL);
    tensor_lmem *relu = _ctx->tl_alloc(input_shape, FMT_I8, CTRL_AL);
    tensor_lmem *neg = _ctx->tl_alloc(input_shape, FMT_I8, CTRL_AL);
    // load
    _ctx->gdma_load(bottom, slice_bottom_gaddr, CTRL_NEURON);
    // positive part of the input
    bmk1880_relu_param_t p13;
    p13.ofmap = relu;
    p13.ifmap = bottom;
    _ctx->tpu_relu(&p13);
    // scale the positive part
    bmk1880_mul_const_param_t p;
    p.res_high = NULL;
    p.res_low = relu;
    p.a = relu;
    p.b = gt_scale;
    p.b_is_signed = true;
    p.rshift_width = gt_right_shift_width;
    _ctx->tpu_mul_const(&p);
    // negative part of the input
    bmk1880_min_const_param_t p1;
    p1.min = neg;
    p1.a = bottom;
    p1.b = 0;
    p1.b_is_signed = 1;
    _ctx->tpu_min_const(&p1);
    // scale the negative part
    p.res_high = NULL;
    p.res_low = neg;
    p.a = neg;
    p.b = le_scale;
    p.b_is_signed = true;
    p.rshift_width = le_right_shift_width;
    _ctx->tpu_mul_const(&p);
    // merge the two parts
    bmk1880_or_int8_param_t p2;
    p2.res = bottom;
    p2.a = relu;
    p2.b = neg;
    _ctx->tpu_or_int8(&p2);
    // move result to global
    _ctx->gdma_store(bottom, slice_top_gaddr, CTRL_NEURON);
    // free
    _ctx->tl_free(neg);
    _ctx->tl_free(relu);
    _ctx->tl_free(bottom);
    slice_bottom_gaddr += count_sec * INT8_SIZE;
    slice_top_gaddr += count_sec * INT8_SIZE;
  }
}
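Two pieces of arithmetic in forward() are easy to check on the host. First, count_sec distributes count elements over slice_num slices so that the first count % slice_num slices receive one extra element. Second, each branch multiplies by an integer scale and shifts right, which approximates multiplication by scale / 2^rshift. Because the ReLU branch is zero for negative inputs and the min branch is zero for non-negative inputs, a bitwise OR merges the two results. The sketch below emulates this in plain C++. It assumes that the multiply truncates toward zero and saturates to the int8 range, which may differ from the actual behaviour of tpu_mul_const, and all function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sizes of the slices produced by the loop in forward().
std::vector<int> slice_sizes(int count, int slice_num) {
    assert(slice_num > 0 && count >= 0);
    std::vector<int> sizes;
    for (int slice_idx = 0; slice_idx < slice_num; ++slice_idx) {
        sizes.push_back(count / slice_num +
                        (slice_idx < count % slice_num ? 1 : 0));
    }
    return sizes;
}

// Assumed model of a multiply-by-constant with right shift:
// truncate toward zero, then saturate to the int8 range.
std::int8_t mul_const_shift(std::int8_t a, int scale, int rshift) {
    assert(rshift >= 0 && rshift < 31);
    int product = static_cast<int>(a) * scale;
    int shifted = (product >= 0) ? (product >> rshift)
                                 : -((-product) >> rshift);
    if (shifted > 127) shifted = 127;
    if (shifted < -128) shifted = -128;
    return static_cast<std::int8_t>(shifted);
}

// Host emulation of the relu / min / mul_const / or sequence
// for a single element.
std::int8_t leaky_relu_int8(std::int8_t x, int gt_scale, int gt_rshift,
                            int le_scale, int le_rshift) {
    std::int8_t relu = (x > 0) ? x : 0;  // positive part
    std::int8_t neg = (x < 0) ? x : 0;   // negative part
    relu = mul_const_shift(relu, gt_scale, gt_rshift);
    neg = mul_const_shift(neg, le_scale, le_rshift);
    // One operand is always zero, so OR selects the non-zero branch.
    return static_cast<std::int8_t>(static_cast<std::uint8_t>(relu) |
                                    static_cast<std::uint8_t>(neg));
}
```

For example, slice_sizes(10, 3) yields {4, 3, 3}, and with le_scale = 13 and le_rshift = 7 (a slope of roughly 0.1) an input of -100 maps to -10.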

CPU Version

If the IR can only be executed on the CPU, add a new CPU op and store the IR op_ in it:

void TGLeakyReluFixedInst::encode() {
  op_.add_threshold_x(m_inputCalibrationParameter[0].blob_param(0).threshold_y());
  op_.add_threshold_y(m_calibrationParameter.blob_param(0).threshold_y());
  add_cpu_op(_ctx->bm_get_bmk(), "LeakyReluOp", op_);
}

Navigate to the cpu_op folder and create a new C++ source file whose name is the same as the type name of the customized layer. In this file, create a child class that inherits from CpuOp and implement the run() member method in C++. Finally, register the new class with REGISTER_CPU_OP().

#include <bmnet/targets/cpu/cpu_op.hpp>
#include <algorithm>
#include <cassert>
#include <cstdlib>

namespace bmnet {

class LeakyReluOp : public CpuOp {
public:
  void run() {
    assert(op_.type() == "ELU");
    const TensorShape& input_shape = op_.input_shape(0);
    int count = GetTensorCount(input_shape);
    char *input_data = NULL;
    char *output_data = NULL;
    float *bottom_data = NULL;
    float *top_data = NULL;

    // convert the int8 input to float when an input threshold is set
    if (op_.threshold_x_size() > 0) {
      input_data = reinterpret_cast<char*>(op_.global_input(0));
      bottom_data = (float*)malloc(sizeof(float) * count);
      for (int i = 0; i < count; ++i) {
        bottom_data[i] = input_data[i] * op_.threshold_x(0) / 128.0;
      }
    } else {
      bottom_data = reinterpret_cast<float*>(op_.global_input(0));
    }

    // use a temporary float buffer when the output must be int8
    if (op_.threshold_y_size() > 0) {
      output_data = reinterpret_cast<char*>(op_.global_output(0));
      top_data = (float*)malloc(sizeof(float) * count);
    } else {
      top_data = reinterpret_cast<float*>(op_.global_output(0));
    }

    float negative_slope = op_.tg_customized_param().f32_param(0);
    for (int i = 0; i < count; ++i) {
      top_data[i] = std::max(bottom_data[i], (float)0)
          + negative_slope * (std::min(bottom_data[i], (float)0));
    }

    // convert the float result back to int8 when an output threshold is set
    for (int i = 0; i < count; ++i) {
      if (op_.threshold_y_size() > 0) {
        int fixed_data = (int)(top_data[i] * 128 / op_.threshold_y(0) + 0.5);
        output_data[i] = (fixed_data < -128) ? -128 : ((fixed_data > 127) ? 127 : fixed_data);
      }
    }

    if (op_.threshold_y_size() > 0) {
      free(top_data);
    }
    if (op_.threshold_x_size() > 0) {
      free(bottom_data);
    }
  }
};

} // namespace bmnet

// register CPU OP LeakyReluOp
REGISTER_CPU_OP(LeakyReluOp);

To compile the newly added source file, add it to the CMakeLists.txt in the same folder.
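The run() method above converts between int8 and float using the calibration thresholds: an int8 value q maps to q * threshold / 128, and a float value v maps back to v * 128 / threshold, rounded and clamped to [-128, 127]. A minimal standalone sketch of that round trip follows. The function names are hypothetical, and unlike the listing above, which adds 0.5 and truncates, this sketch rounds to nearest with std::lround so that negative values are rounded symmetrically.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// int8 -> float, following bottom_data[i] = input_data[i] * threshold / 128.0
float dequantize(std::int8_t q, float threshold) {
    return static_cast<float>(q) * threshold / 128.0f;
}

// float -> int8 with round-to-nearest and saturation to [-128, 127].
std::int8_t quantize(float value, float threshold) {
    assert(threshold > 0.0f);
    long fixed = std::lround(value * 128.0f / threshold);
    if (fixed < -128) fixed = -128;
    if (fixed > 127) fixed = 127;
    return static_cast<std::int8_t>(fixed);
}
```

With a threshold of 2.0, the int8 value 64 corresponds to 1.0, and any float at or above 2.0 saturates to 127.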

Programming application

Introduction to development environment

We provide a Docker development image that includes the tools and dependent libraries required for BMNNSDK application development. Users can use it to develop BMNNSDK applications.

BMNNSDK Docker development image: bmtap2-dev_latest.docker (Note: the Docker development image in this section is different from the Docker deployment image in the previous chapter.)

The Docker development image does not contain the BMNNSDK. Import the BMNNSDK into the Docker development image before using it for development.

Use the development environment

Make sure you have installed the BMNNSDK before using the Docker development environment, and then import it into the Docker development environment.

The example for compiling USB mode

$ tar xvf bmtap2-bm1880-usb-x.y.z.tar.gz
$ cd bmtap2-bm1880-usb-x.y.z
$ docker run -v $PWD:/workspace/ -e LOCAL_USER_ID=`id -u` -it bmtap2-dev:latest

After entering the Docker container, compile the examples for USB mode (commands executed inside the container are shown with the user@:/workspace$ prompt):

// example for compiling bmnet inference
user@:/workspace$ cd examples/bmnet_inference
user@:/workspace$ make -f Makefile.pcie
// example for compiling tensor scalar
user@:/workspace$ cd examples/tensor_scalar
user@:/workspace$ make -f Makefile.pcie

The example for compiling SoC mode

Unpack the SoC-mode BMNNSDK package, import it into the Docker development image, and run the image.

$ tar xvf bmtap2-bm1682-SoC-x.y.z.tar.gz
$ cd bmtap2-bm1682-SoC-x.y.z
$ docker run -v $PWD:/workspace/ -e LOCAL_USER_ID=`id -u` -it bmtap2-dev:latest

After entering the Docker container, compile the examples for SoC mode:

// the example for compiling bmnet inference
user@:/workspace$ cd examples/bmnet_inference
user@:/workspace$ make -f Makefile.SoC
// the example for compiling tensor scalar
user@:/workspace$ cd examples/tensor_scalar
user@:/workspace$ make -f Makefile.SoC

Running the sample code

The compiled programs are generated in the local directory:

user@:/workspace$ exit
$ ls examples/bmnet_inference/bmnet_inference
$ ls examples/tensor_scalar/tensor_scalar

Deploy the compiled programs to the deployment environment and run them. For USB mode, you can deploy them to a PC with the BM1880 development board installed. For SoC mode, you can deploy them to the BM1880 SoC development board via SD card, Ethernet, or a packaged file system.

Running

The BMNet inference engine API is needed for programming. The programming flow chart is as follows:

Programming flow chart

Example code as follows:

bmctx_t ctx;
bm_init(0, &ctx);
bmnet_t net;
bmnet_output_info_t output_info;
bmnet_register_bmodel(ctx, bmodel, &net); // bmodel = "test.bmodel"
bmnet_set_input_shape(net, input_shape);
bmnet_get_output_info(net, &output_info);
size_t output_size = output_info.output_size;
// std::nothrow (from <new>) makes new return NULL on failure instead of throwing
uint8_t *output = new (std::nothrow) uint8_t[output_size];
if (output == NULL) {
  fprintf(stderr, "output memory alloc failed.\n");
  exit(-1);
}
bmnet_inference(net, input, output);
f_output.write((char *)output, output_size);
f_output.close();
delete[] output;
delete[] input;
bmnet_cleanup(net);
bm_exit(ctx);
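The example above leaves out how the input buffer is filled and how f_output is opened. One possible way to handle both with standard C++ streams and std::vector, so that no manual delete[] is needed, is sketched below. The helper names read_binary_file and write_binary_file are hypothetical and not part of the BMNet API. The data() pointers of the resulting vectors could then be passed where the example uses input and output.

```cpp
#include <cassert>
#include <cstdint>
#include <fstream>
#include <iterator>
#include <string>
#include <vector>

// Hypothetical helper: read a whole binary file into a byte vector.
// Returns false if the file cannot be opened.
bool read_binary_file(const std::string& path,
                      std::vector<std::uint8_t>& data) {
    std::ifstream file(path, std::ios::binary);
    if (!file) return false;
    data.assign(std::istreambuf_iterator<char>(file),
                std::istreambuf_iterator<char>());
    return true;
}

// Hypothetical helper: write a byte vector to a binary file.
// Returns false if the file cannot be opened or the write fails.
bool write_binary_file(const std::string& path,
                       const std::vector<std::uint8_t>& data) {
    std::ofstream file(path, std::ios::binary);
    if (!file) return false;
    file.write(reinterpret_cast<const char*>(data.data()),
               static_cast<std::streamsize>(data.size()));
    return file.good();
}
```

A std::vector<std::uint8_t> sized with output_info.output_size would play the role of the output buffer, and it is released automatically when it goes out of scope.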