# Programming with BMNNSDK

There are two ways to program with the runtime library:

* BMNet
* BMKernel

## Programming by BMNet

We provide utility tools to convert CAFFE models into machine instructions. These instructions, together with the model's weights, are packed into a file called a bmodel (the model file format for BITMAIN targets), which can be executed directly on a BITMAIN board. BMNet implements many common layers; the full list of built-in layers is in the table below, and more layers are in development:

|            |              |         |             |           |
| ---------- | ------------ | ------- | ----------- | --------- |
| Activation | BatchNorm    | Concat  | Convolution | Eltwise   |
| Flatten    | InnerProduct | Join    | LRN         | Normalize |
| Permute    | Pooling      | PReLU   | PriorBox    | Reorg     |
| Reshape    | Scale        | Split   | Upsample    |           |

The programming flow is as follows:

BMNet takes the caffemodel generated by the CAFFE framework and the deploy file deploy.prototxt as input. After processing in stages such as the front end, optimizer, and back end, the bmodel file is generated.

```bash
$ bm_builder.bin \
   -t bm1880 \
   -n googlenet \
   -c /data/bmnet_models/googlenet/googlenet.caffemodel \
   -m /data/bmnet_models/googlenet/googlenet_deploy.prototxt \
   --enable-layer-group=yes \
   -s 1,3,224,224 \
   -o bmnet/out/googlenet_1_3_224_224.bmodel
```

If all layers of your network model are supported in BMNet, compiling the network from the command line is very convenient; otherwise, refer to the BMKernel mode below.

## Programming by BMKernel

When programming with BMKernel, first call the bmruntime\_bmkernel\_create() function to create a BMKernel context. After the context is created, the application can use the BMKernel interfaces to generate kernel commands and then submit the commands with bmruntime\_bmkernel\_submit(). Finally, bmruntime\_bmkernel\_destroy() should be called to release the kernel resources. The programming flow chart is as follows:

![Programming flow chart](https://2115705518-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LOME7sTpZZ0QmlqOlnw%2F-LPENsoGlZ5oROG7gfFC%2F-LPEOAhACuK0soCuo3oT%2F2.jpg?alt=media\&token=97aa830e-f0f3-4801-ae7a-41c00e34ce5e)
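
A minimal sketch of this create/submit/destroy flow is shown below. The header name and the exact prototypes of the three calls are assumptions here; consult the BMKernel reference for the real signatures.

```cpp
// Sketch only: the header name and the exact signatures of the
// bmruntime_bmkernel_* calls are assumptions, not taken from SDK headers.
#include <bmruntime_bmkernel.h>

void run_custom_kernel(bmctx_t ctx) {
  void *bk_ctx = NULL;
  bmruntime_bmkernel_create(ctx, &bk_ctx);  // create a BMKernel context

  // ... use the BMKernel interfaces on bk_ctx to generate kernel commands ...

  bmruntime_bmkernel_submit(ctx);   // submit the generated commands
  bmruntime_bmkernel_destroy(ctx);  // release the kernel resources
}
```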

BMNet provides a series of APIs for adding customized layers without modifying the BMNet core code. A customized layer can be an entirely new layer or a replacement for an original CAFFE layer in BMNet. The tutorial below walks through the steps to create a simple custom layer that replaces an original CAFFE layer in BMNet, using the LeakyRelu layer as an example (the source code can be found in bmnet/example/customized\_layer\_1880/).

### Add a new CAFFE layer definition

Modify bmnet\_caffe.proto in the path "bmnet/examples/customized\_layer\_1880/proto". First, check whether the layer already exists. If it does, skip this step; if it does not, append a new field at the end of LayerParameter with a new index, and add the definition of the new parameter message.

{% hint style="info" %}
Note: the new parameter must be appended at the end, as with the ReLUParameter example below.
{% endhint %}

```c
message LayerParameter {
  optional string name = 1; // the layer name
  optional string type = 2; // the layer type
  repeated string bottom = 3; // the name of each bottom blob
  repeated string top = 4; // the name of each top blob
  ...
  optional AccuracyParameter accuracy_param = 102;
  optional EmbedParameter embed_param = 137;
  optional ExpParameter exp_param = 111;
  optional ReLUParameter relu_param = 123;
}
```

```c
// Message that stores parameters used by ReLU Layer
message ReLUParameter {
  optional float negative_slope = 1 [default = 0];
  enum Engine {
    DEFAULT = 0;
    CAFFE = 1;
    CUDNN = 2;
  }
  optional Engine engine = 2 [default = DEFAULT];
}
```

### Add a new CAFFE layer class

Create a child class that inherits from CustomizedCaffeLayer and implement the layer\_name(), dump(), and codegen() member methods:

* layer\_name(): returns the string name of the layer type.
* setup(): optional; override it only to call set\_sub\_type() if necessary. If it is not implemented, the sub\_type defaults to the layer type.
* dump(): dumps the parameter details of the newly added CAFFE layer.
* codegen(): converts the parameters of the CAFFE layer to tg\_customized\_param, the parameter of the customized IR.

```cpp
#include <bmnet/frontend/caffe/CaffeFrontendContext.hpp>
#include <bmnet/frontend/caffe/CustomizedCaffeLayer.hpp>

class LeakyReluLayer : public CustomizedCaffeLayer {
public:
  LeakyReluLayer() : CustomizedCaffeLayer() {}

  // return type name of the new CAFFE layer.
  std::string layer_name() {
    return std::string("ReLU");
  }

  // dump parameters of CAFFE layer object layer_.
  void dump() {
    const caffe::ReLUParameter &in_param = layer_.relu_param();
    float negative_slope = in_param.negative_slope();
    std::cout << "negative_slope:" << negative_slope;
  }

  void setup(TensorOp *op) {
    CustomizedCaffeLayer::setup(op);
    TGCustomizedParameter *param = op->mutable_tg_customized_param();
    param->set_sub_type("leakyrelu");
  }

  // convert parameters of the CAFFE layer to the customized
  // IR (TensorOp *op)'s parameter (tg_customized_param).
  void codegen(TensorOp *op) {
    // get input shape
    const TensorShape &input_shape = op->input_shape(0);

    // get parameter from caffe proto
    const caffe::ReLUParameter &in_param = layer_.relu_param();
    float negative_slope = in_param.negative_slope();

    // set normal output shape
    TensorShape *output_shape = op->add_output_shape();
    output_shape->CopyFrom(input_shape);

    // set out_param
    TGCustomizedParameter *out_param = op->mutable_tg_customized_param();
    out_param->add_f32_param(negative_slope);
  }
};
```

### Add a new Tensor Instruction class

Create a child class that inherits from CustomizedTensorFixedInst, the class that converts IR to instructions, and implement the inst\_name(), dump(), and encode() member functions:

* inst\_name(): returns the IR name, in lowercase: the prefix "tg\_" plus the sub\_type set in setup() above.
* dump(): dumps the tg\_customized\_param details of the IR op\_.
* encode(): converts the IR to instructions. If the IR can be deployed to the NPU, use the BMKernel API to implement it; otherwise, implement a pure CPU version in C++.

```cpp
#include <bmnet/targets/plat-bm188x/BM188xBackendContext.hpp>
#include <bmnet/targets/plat-bm188x/CustomizedTensorFixedInst.hpp>
#include <bmkernel/bm_kernel.h>
namespace bmnet {

class TGLeakyReluFixedInst : public CustomizedTensorFixedInst {
public:
  TGLeakyReluFixedInst() : CustomizedTensorFixedInst() {}
  ~TGLeakyReluFixedInst() {}

  // return type name of IR
  std::string inst_name() {
    return std::string("tg_leakyrelu");
  }

  // dump tg_customized_param of IR op_.
  void dump() {
    const TGCustomizedParameter &param = op_.tg_customized_param();
    float alpha = param.f32_param(0);
    std::cout << "alpha:" << alpha << std::endl;
  }

  // extract parameters of tg_customized_param,
  // and implement instructions.
  void encode();

private:
  void forward(gaddr_t bottom_gaddr, gaddr_t top_gaddr,
               int input_n, int input_c,
               int input_h, int input_w);
};
}
```

#### NPU Version

If the IR can be deployed to the NPU, use the BMKernel APIs to implement the encode() function. For more details about the BMKernel APIs, please refer to the related documentation.

```cpp
void TGLeakyReluFixedInst::encode() {
  const TGCustomizedParameter &param = op_.tg_customized_param();
  const TensorShape &input_shape = op_.input_shape(0);
  float negative_slope = param.f32_param(0);
  assert(negative_slope > 0);
  gaddr_t input_data_gaddr = op_.global_input(0) + get_global_neuron_base();
  gaddr_t output_data_gaddr = op_.global_output(0) + get_global_neuron_base();
  forward(input_data_gaddr,
          output_data_gaddr,
          input_shape.dim(0),
          input_shape.dim(1),
          input_shape.dim(2),
          input_shape.dim(3));
}

void TGLeakyReluFixedInst::forward(
    gaddr_t bottom_gaddr,
    gaddr_t top_gaddr,
    int input_n,
    int input_c,
    int input_h,
    int input_w)
{
  gaddr_t slice_bottom_gaddr = bottom_gaddr;
  gaddr_t slice_top_gaddr = top_gaddr;
  int count = input_n * input_c * input_h * input_w;
  int slice_num = get_slice_num_element_wise(*_ctx, 3, count + 1);

  int gt_right_shift_width =
      m_calibrationParameter.relu_param().gt_right_shift_width();
  int le_right_shift_width =
      m_calibrationParameter.relu_param().le_right_shift_width();
  int gt_scale = m_calibrationParameter.relu_param().gt_scale();
  int le_scale = m_calibrationParameter.relu_param().le_scale();

  for (int slice_idx = 0; slice_idx < slice_num; slice_idx++) {
    int count_sec = count / slice_num + (slice_idx < count % slice_num);
    // set shape
    shape_t input_shape = shape_t1(count_sec);
    tensor_lmem *bottom = _ctx->tl_alloc(input_shape, FMT_I8, CTRL_AL);
    tensor_lmem *relu = _ctx->tl_alloc(input_shape, FMT_I8, CTRL_AL);
    tensor_lmem *neg = _ctx->tl_alloc(input_shape, FMT_I8, CTRL_AL);
    // load the input slice from global memory
    _ctx->gdma_load(bottom, slice_bottom_gaddr, CTRL_NEURON);
    // positive part: relu = max(bottom, 0), then scale it
    bmk1880_relu_param_t p13;
    p13.ofmap = relu;
    p13.ifmap = bottom;
    _ctx->tpu_relu(&p13);
    bmk1880_mul_const_param_t p;
    p.res_high = NULL;
    p.res_low = relu;
    p.a = relu;
    p.b = gt_scale;
    p.b_is_signed = true;
    p.rshift_width = gt_right_shift_width;
    _ctx->tpu_mul_const(&p);
    // negative part: neg = min(bottom, 0), then scale it
    bmk1880_min_const_param_t p1;
    p1.min = neg;
    p1.a = bottom;
    p1.b = 0;
    p1.b_is_signed = 1;
    _ctx->tpu_min_const(&p1);
    p.res_high = NULL;
    p.res_low = neg;
    p.a = neg;
    p.b = le_scale;
    p.b_is_signed = true;
    p.rshift_width = le_right_shift_width;
    _ctx->tpu_mul_const(&p);
    // combine the two parts
    bmk1880_or_int8_param_t p2;
    p2.res = bottom;
    p2.a = relu;
    p2.b = neg;
    _ctx->tpu_or_int8(&p2);
    // move result to global memory
    _ctx->gdma_store(bottom, slice_top_gaddr, CTRL_NEURON);
    // free local memory
    _ctx->tl_free(neg);
    _ctx->tl_free(relu);
    _ctx->tl_free(bottom);
    slice_bottom_gaddr += count_sec * INT8_SIZE;
    slice_top_gaddr += count_sec * INT8_SIZE;
  }
}
```
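
In this implementation, forward() splits the element-wise workload into slice\_num chunks so that each chunk's working tensors fit in local memory. Each slice scales the positive part max(x, 0) by gt\_scale with gt\_right\_shift\_width, scales the negative part min(x, 0) by le\_scale with le\_right\_shift\_width, and combines the two with a bitwise OR before storing the result back to global memory.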

#### CPU Version

If the IR can only be executed on the CPU, add a new CPU op and store the IR op\_ in it:

```cpp
void TGLeakyReluFixedInst::encode() {
  op_.add_threshold_x(m_inputCalibrationParameter[0].blob_param(0).threshold_y());
  op_.add_threshold_y(m_calibrationParameter.blob_param(0).threshold_y());
  add_cpu_op(_ctx->bm_get_bmk(), "LeakyReluOp", op_);
}
```

Navigate to the cpu\_op folder and create a new .cpp source file whose name matches the type name of the customized layer. In the file, create a child class that inherits from CpuOp and implement the run() member method in C++. Finally, register the new class with REGISTER\_CPU\_OP().

```cpp
#include <bmnet/targets/cpu/cpu_op.hpp>

namespace bmnet {

class LeakyReluOp : public CpuOp {
public:
  void run() {
    assert(op_.type() == "LeakyReluOp");
    const TensorShape &input_shape = op_.input_shape(0);
    int count = GetTensorCount(input_shape);
    char *input_data = NULL;
    char *output_data = NULL;
    float *bottom_data = NULL;
    float *top_data = NULL;
    // dequantize the int8 input to float if a threshold is present
    if (op_.threshold_x_size() > 0) {
      input_data = reinterpret_cast<char *>(op_.global_input(0));
      bottom_data = (float *)malloc(sizeof(float) * count);
      for (int i = 0; i < count; ++i) {
        bottom_data[i] = input_data[i] * op_.threshold_x(0) / 128.0;
      }
    } else {
      bottom_data = reinterpret_cast<float *>(op_.global_input(0));
    }
    if (op_.threshold_y_size() > 0) {
      output_data = reinterpret_cast<char *>(op_.global_output(0));
      top_data = (float *)malloc(sizeof(float) * count);
    } else {
      top_data = reinterpret_cast<float *>(op_.global_output(0));
    }
    // LeakyReLU: y = max(x, 0) + negative_slope * min(x, 0)
    float negative_slope = op_.tg_customized_param().f32_param(0);
    for (int i = 0; i < count; ++i) {
      top_data[i] = std::max(bottom_data[i], (float)0) +
                    negative_slope * std::min(bottom_data[i], (float)0);
    }
    // quantize the float output back to int8 if a threshold is present
    if (op_.threshold_y_size() > 0) {
      for (int i = 0; i < count; ++i) {
        int fixed_data = (int)(top_data[i] * 128 / op_.threshold_y(0) + 0.5);
        output_data[i] =
            (fixed_data < -128) ? -128 : ((fixed_data > 127) ? 127 : fixed_data);
      }
      free(top_data);
    }
    if (op_.threshold_x_size() > 0) {
      free(bottom_data);
    }
  }
};
} // namespace bmnet

// register CPU OP LeakyReluOp
REGISTER_CPU_OP(LeakyReluOp);
```

To compile the newly added source file, add it to the CMakeLists.txt in the same folder.

## Programming applications

### Introduction to the development environment

We provide a docker development image that includes the tools and dependent libraries required for BMNNSDK application development; users can use it to develop BMNNSDK applications.

{% hint style="info" %}
BMNNSDK docker development image: bmtap2-dev\_latest.docker (Note: the docker development image in this section is different from the docker deployment image in the previous chapter.)
{% endhint %}

The docker development image does not contain the BMNNSDK itself; please import the BMNNSDK into the docker development image before using it for development.

### Use the development environment

Please make sure you have installed the BMNNSDK before using the docker development environment, and then import it into the docker development environment.

#### Example: compiling in USB mode

```bash
$ tar xvf bmtap2-bm1880-usb-x.y.z.tar.gz
$ cd bmtap2-bm1880-usb-x.y.z
$ docker run -v $PWD:/workspace/ -e LOCAL_USER_ID=`id -u` -it bmtap2-dev:latest
```

After entering the docker container, compile the USB-mode examples (the commands below are executed inside the container, at the user\@:/workspace$ prompt):

```bash
# example for compiling bmnet_inference
user@:/workspace$ cd examples/bmnet_inference
user@:/workspace$ make -f Makefile.pcie

# example for compiling tensor_scalar
user@:/workspace$ cd examples/tensor_scalar
user@:/workspace$ make -f Makefile.pcie
```

#### Example: compiling in SoC mode

Unpack the SoC-mode BMNNSDK package, import it into the docker development image, and run the docker development image.

```bash
$ tar xvf bmtap2-bm1682-SoC-x.y.z.tar.gz
$ cd bmtap2-bm1682-SoC-x.y.z
$ docker run -v $PWD:/workspace/ -e LOCAL_USER_ID=`id -u` -it bmtap2-dev:latest
```

After entering the docker container, compile the SoC-mode examples:

```bash
# example for compiling bmnet_inference
user@:/workspace$ cd examples/bmnet_inference
user@:/workspace$ make -f Makefile.SoC

# example for compiling tensor_scalar
user@:/workspace$ cd examples/tensor_scalar
user@:/workspace$ make -f Makefile.SoC
```

### Running the sample code

The compiled binaries will be generated in the local directory:

```
user@:/workspace$ exit
$ ls examples/bmnet_inference/bmnet_inference
$ ls examples/tensor_scalar/tensor_scalar
```

Deploy the binaries to the deployment environment and run them. For USB mode, deploy to a PC with a BM1880 development board installed. For SoC mode, deploy to the BM1880 SoC development board via SD card, Ethernet, or a packaged file system.

## Running

Applications run bmodel files through the API of the BMNet inference engine. The programming flow chart is as follows:

![Programming flow chart](https://2115705518-files.gitbook.io/~/files/v0/b/gitbook-legacy-files/o/assets%2F-LOME7sTpZZ0QmlqOlnw%2F-LPEXCfs5OoEpG5Iltie%2F-LPEcDr9hRBgxwdyKrkm%2F66.png?alt=media\&token=8dbe3f15-f39a-42a4-bd74-69cff024cfbe)

Example code is as follows; it assumes that input, input\_shape, and the output stream f\_output have been prepared by the caller (see the sketch after the example):

```cpp
  bmctx_t ctx;
  bm_init(0, &ctx);

  bmnet_t net;
  bmnet_output_info_t output_info;
  bmnet_register_bmodel(ctx, bmodel, &net); // bmodel = "test.bmodel"
  bmnet_set_input_shape(net, input_shape);

  bmnet_get_output_info(net, &output_info);
  size_t output_size = output_info.output_size;
  uint8_t *output = new uint8_t[output_size];
  if (output == NULL) {
    fprintf(stderr, "output memory alloc failed.\n");
    exit(-1);
  }
  bmnet_inference(net, input, output);

  f_output.write((char *)output, output_size);
  f_output.close();

  delete[] output;
  delete[] input;

  bmnet_cleanup(net);
  bm_exit(ctx);
```
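
A possible sketch of that preparation is shown below. The helper load\_input and the file names "input.bin" and "output.bin" are hypothetical, and constructing input\_shape is omitted because it depends on the SDK's shape type.

```cpp
// Hypothetical setup for the inference example above; the helper name
// and file names are assumptions, not part of the SDK.
#include <cstdint>
#include <fstream>

uint8_t *load_input(size_t *input_size) {
  // read the raw (already quantized) input tensor from disk
  std::ifstream f_input("input.bin", std::ios::binary | std::ios::ate);
  *input_size = static_cast<size_t>(f_input.tellg());
  f_input.seekg(0);
  uint8_t *input = new uint8_t[*input_size];  // released with delete[] above
  f_input.read(reinterpret_cast<char *>(input), *input_size);
  return input;
}

// file that receives the inference result
std::ofstream f_output("output.bin", std::ios::binary);
```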
