Generating Python Extensions Using Nim

A Python extension is code written in a language other than Python that extends the Python ecosystem by exposing an interface compatible with the Python API. Such extensions are generally shipped as compiled code and loaded dynamically by the Python runtime.

The ability to create extensions quickly, to be used directly from Python, is a great advantage. Ideally, users would write the bottleneck parts of their code in their favorite (generally faster) language to offload compute-intensive work, and call it from Python with all its flexibility. Extensions written this way let users tap into all the features of the foreign language, including but not limited to multiprocessing, assembly-level instructions, and metaprogramming, and they can also bypass the Python GIL. Many modules fundamental to data science, such as NumPy and scikit-learn, are themselves largely written in C/C++.

There are many options for writing such extensions, ranging from Python-specific dialects like Cython and Numba (for accelerating NumPy array based operations) to bindings for general-purpose languages like Rust, C++, and C. Still, writing extensions for Python is anything but a delightful experience, generally because a lot of boilerplate/glue code is needed to wrap the extension.

In this post, we will be looking at Nim. Nim compiles to C, so it can generate compiled code that Python loads as a module. I have personally found Nim to be a highly productive language and have written a lot of code in it for various use cases. For me, it fulfills what it promises while offering a lot of flexibility, such as directly compiling already existing C code for fast prototyping.

We will cover a concrete example to show the practical usefulness of a Nim-based Python extension by writing a simple image preprocessing pipeline. We will establish a simple baseline with generic single-threaded code, without going into optimizations.

Image preprocessing.

Image preprocessing is generally the first step for computer-vision based ML models, and roughly follows this order:

  1. Converting uint8 [HWC] data to float32 in the range [0, 1].
  2. Optionally converting [HWC] format to [CHW], depending on the framework.
  3. Resizing to the expected input size (dictated by the model being used), such as 224x224 or 256x256.
  4. Normalizing the data (subtracting the mean and dividing by the standard deviation, per channel).

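For reference, an unfused version of these steps on the Python side might look roughly like the NumPy/OpenCV sketch below (the function name and defaults are illustrative, assuming a uint8 HWC input). Each line materializes a full intermediate array; the Nim routine developed later computes each output value in a single pass instead.

import cv2
import numpy as np

def preprocess_reference(image, out_hw=(224, 224), mean=(0.0, 0.0, 0.0), std=(1.0, 1.0, 1.0)):
    # step 1: uint8 [0, 255] -> float32 [0, 1]
    x = image.astype(np.float32) / 255.0
    # step 3: resize to the model's expected input size (nearest neighbour); cv2 expects (width, height)
    x = cv2.resize(x, (out_hw[1], out_hw[0]), interpolation=cv2.INTER_NEAREST)
    # step 4: per-channel normalization
    x = (x - np.asarray(mean, dtype=np.float32)) / np.asarray(std, dtype=np.float32)
    # step 2: HWC -> CHW
    return x.transpose(2, 0, 1)
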
Optimizing this part is worthwhile, since the pipeline has to run for every frame/image. Preprocessing and post-processing code is often the least optimized part of an inference path, because it is highly specific to each model, and it can contribute significant latency if not handled carefully.

Code:

Assumptions:

  1. The input data is contiguous in memory, i.e. packed without any padding, in consecutive memory locations.
  2. The input contains uint8 data, i.e. values in [0, 255].
  3. The input follows HWC format, i.e. the stride of the channel dimension is 1.
  4. The output produced is also contiguous.
  5. The output format is [CHW], i.e. the stride of the width dimension is 1.

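On the Python side, the first three assumptions are cheap to enforce before handing data to the extension; a minimal guard (a hypothetical helper, not part of the extension itself) could look like this:

import numpy as np

def as_extension_input(image):
    # enforce assumptions 1-3: contiguous memory, uint8 dtype, HWC layout with 3 channels
    image = np.ascontiguousarray(image, dtype=np.uint8)
    assert image.ndim == 3 and image.shape[2] == 3, "expected an HWC image with 3 channels"
    return image
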
Resizing involves collecting an input/source pixel (nearest neighbour) or a weighted combination of input pixels (bilinear) for the logical coordinates (i, j) of each output pixel. We can convert the collected input (uint8) value to float32 before writing it to the corresponding output/destination memory location, effectively fusing steps 1 and 3.

We can decide beforehand whether we want the output format to be CHW or HWC (or write two different implementations), thus also fusing step 2.

# A simple resize routine that fuses steps 1, 2 and 3 of our pipeline, for the case where the output format is CHW.

import math

# nearest neighbour based routine to calculate the corresponding input source index, given the output index.
proc nearest_neighbour_compute_source_index(scale:float, out_index:int, input_size:int):int=
    return min( int( floor(out_index.float * scale)), input_size-1 )

proc hwc2chw_resize_simple(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C:int=3)=

    let
        inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
        outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)

    let
        scale_h:float = inpH.float/outH.float
        scale_w:float = inpW.float/outW.float

    # For each position in the output image/array, get the corresponding input pixel value.
    for h in 0..<outH:
      for w in 0..<outW:
        for c in 0..<C:
          let src_h = nearest_neighbour_compute_source_index(scale=scale_h, out_index=h, input_size=inpH) # logical index
          let src_w = nearest_neighbour_compute_source_index(scale=scale_w, out_index=w, input_size=inpW) # logical index

          var inp_f32 = float32(inpRawData[src_h*inpW*C + src_w*C + c])/255'f32

          outRawData[c*outH*outW + h*outW + w] = inp_f32

A model may need image data in either BGR or RGB format, depending on the data it was trained with. Let us include that option in our code at no extra cost.

proc hwc2chw_resize_simple(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C=3, reverse_channels:bool=false)=

    let reverse_channels = int(reverse_channels) #0/1
    let
        inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
        outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)

    let
        scale_h:float = inpH.float/outH.float
        scale_w:float = inpW.float/outW.float

    # For each position in the output image/array, get the corresponding input pixel value.
    for h in 0..<outH:
      for w in 0..<outW:
        for c in 0..<C:
          let src_h = nearest_neighbour_compute_source_index(scale=scale_h, out_index=h, input_size=inpH) #logical index
          let src_w = nearest_neighbour_compute_source_index(scale=scale_w, out_index=w, input_size=inpW) #logical index

          #if reverse_channels is true, we get (C-1-c)'th channel, otherwise c'th channel.  
          var inp_f32 = float32(inpRawData[src_h*inpW*C + src_w*C + (1-reverse_channels)*c + reverse_channels*(C-1-c)])/255'f32


          #update corresponding memory-location for output array by converting logical indices to array indices.
          outRawData[c*outH*outW + h*outW + w] = inp_f32

Let us also fuse the normalization step, i.e. subtracting the mean and dividing by the standard deviation along the channel dimension.

proc hwc2chw_resize_simple(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C=3, mean_array:array[3, float32]=[0'f32,0,0], std_array:array[3, float32]=[1'f32,1,1], reverse_channels:bool=false)=

    let reverse_channels = int(reverse_channels) #0/1
    let
        inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
        outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)

    let
        scale_h:float = inpH.float/outH.float
        scale_w:float = inpW.float/outW.float

    # For each position in the output image/array, get the corresponding input pixel value.
    for h in 0..<outH:
      for w in 0..<outW:
        for c in 0..<C:
          let src_h = nearest_neighbour_compute_source_index(scale=scale_h, out_index=h, input_size=inpH) #logical index
          let src_w = nearest_neighbour_compute_source_index(scale=scale_w, out_index=w, input_size=inpW) #logical index

          #if reverse_channels is true, we get (C-1-c)'th channel, otherwise c'th channel.  
          var inp_f32 = float32(inpRawData[src_h*inpW*C + src_w*C + (1-reverse_channels)*c + reverse_channels*(C-1-c)])/255'f32

          # subtract the corresponding mean and divide by the standard deviation.
          inp_f32 = (inp_f32 - mean_array[(1-reverse_channels)*c + reverse_channels*(C-1-c)])/(std_array[(1-reverse_channels)*c + reverse_channels*(C-1-c)] + 1e-5'f32)

          #update corresponding memory-location for output array by converting logical indices to array indices.
          outRawData[c*outH*outW + h*outW + w] = inp_f32

In this way we were able to fuse essentially all the steps of the preprocessing pipeline at almost no extra cost. We can write more specific code depending on a model's needs while keeping similar speeds, significantly reducing the latency of the preprocessing stage.

Generating the Python extension.

We can now generate a Python extension to expose this functionality as a compiled Python module. We will be using the awesome nimpy module, which is designed around ABI compatibility and hence should work independently of the Python version.

Reading the rawData produced on the Python side.

Given the Nim code above, we need access to the raw uint8 values of a given frame/image. To produce this data on the Python side we will use NumPy, which implements the buffer protocol and hence gives easy access to the underlying buffer/raw data on the Nim side, which is exactly what we need. Since NumPy is one of the most-used modules for matrix/array manipulation, being able to pass a NumPy array directly to the routines exposed by the compiled extension is a great benefit.
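
As a quick illustration of what the buffer protocol provides, a contiguous NumPy array can be inspected through a memoryview; the 'B' format character and the C-contiguity flag are exactly the guarantees the Nim side relies on (illustrative snippet with an arbitrary image shape):

import numpy as np

image = np.zeros((576, 768, 3), dtype=np.uint8)     # uint8, HWC, contiguous
view = memoryview(image)                             # NumPy exposes the buffer protocol
print(view.format, view.shape, view.c_contiguous)    # 'B' (576, 768, 3) True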

Writing the extension code

import nimpy
import nimpy / [raw_buffers, py_types]

proc preprocessPipeline_nim(image:PyObject, output:PyObject, reverse_channels:bool=false){.exportpy.}=

  var image_buf:RawPyBuffer
  image.getBuffer(image_buf, PyBUF_SIMPLE or PyBUF_ND)

  var output_buf:RawPyBuffer
  output.getBuffer(output_buf, PyBUF_WRITABLE or PyBUF_ND)


  #access to rawData
  let rawData_ptr = cast[ptr uint8](image_buf.buf)
  let output_ptr =  cast[ptr float32](output_buf.buf)

  # get the dimensions of the input image and of the output buffer.
  let
      H = cast[ptr UncheckedArray[Py_ssize_t]](image_buf.shape)[0].int  
      W = cast[ptr UncheckedArray[Py_ssize_t]](image_buf.shape)[1].int
      C = cast[ptr UncheckedArray[Py_ssize_t]](image_buf.shape)[2].int

      outH = cast[ptr UncheckedArray[Py_ssize_t]](output_buf.shape)[1].int  
      outW = cast[ptr UncheckedArray[Py_ssize_t]](output_buf.shape)[2].int
      C_out = cast[ptr UncheckedArray[Py_ssize_t]](output_buf.shape)[0].int

  assert C == C_out
  doAssert C == 3,"Expected No of channels to be 3" & " but got " & $C


  hwc2chw_resize_simple(
    inpRawData_ptr = rawData_ptr,
    outRawData_ptr = output_ptr,
    inpH = H,
    inpW = W,
    outH = outH,
    outW = outW,
    C = C,
    reverse_channels = bool(reverse_channels)
  )


  # release the buffer references so Python's GC can manage the objects again.
  image_buf.release()
  output_buf.release()

Compiling the extension

Since I am compiling on Windows, to please the Windows gods I have to offer some sacrifice in the form of extra flags like --tlsEmulation:off and --passL:-static. (On Linux/macOS the output library would typically be named preprocessPipeline.so instead of .pyd.)

# the default C compiler is gcc; -d:danger is optional and disables runtime checks
nim c [-d:danger] --cc:gcc --threads:on --app:lib --tlsEmulation:off --passL:-static --out:preprocessPipeline.pyd preprocessPipeline.nim

Using the extension from Python

import numpy as np

import preprocessPipeline as pipeline

# Wrap the extension call in a simple function for easy usage; minimal extra latency in the form of function-call overhead.
def preprocess(image, output_shape:tuple, reverse_channels:bool=False):
  """ Takes an input image of uint8 in HWC format; returns the resized image as float32 in [0, 1] with CHW format."""

  #image: numpy nd array of uint8 data-type of shape [H, W, C]
  #output_shape: (out_height, out_width)

  #returns: np.array of shape [C, out_height, out_width]
  inpH = image.shape[0]
  inpW = image.shape[1]
  C = image.shape[2]

  outH = output_shape[0]
  outW = output_shape[1]

  output = np.empty((C, outH, outW), dtype=np.float32)
  pipeline.preprocessPipeline_nim(image, output, reverse_channels=reverse_channels)  # output is updated in-place.

  return output

Timing:

Now we can try out our extension and visualize some results. On my system, the preprocess function takes about ~700 microseconds for an image with a resolution of 576x768x3.

import cv2
import numpy as np

frame = cv2.imread("test.jpg") #Uint8 HWC format [576,768,3]
output = preprocess(frame, output_shape=(240,256), reverse_channels=False)
output_reversed = preprocess(frame, output_shape=(240,256), reverse_channels=True)

#This preprocessed output, can be directly used as an input to computer-vision model.

cv2.imshow("frame", output.transpose(1,2,0))
cv2.waitKey(0)
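
A minimal sketch of how such a timing can be measured (reusing frame and the preprocess wrapper from above) is shown below; exact numbers will of course vary across machines.

import time

preprocess(frame, output_shape=(240, 256))  # warm-up call

runs = 1000
start = time.perf_counter()
for _ in range(runs):
    preprocess(frame, output_shape=(240, 256))
print(f"average per call: {(time.perf_counter() - start) / runs * 1e6:.0f} us")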

The corresponding pipeline in the PyTorch framework would look as shown below. Even though PyTorch is a research-oriented framework, a pipeline like this is used quite regularly for making predictions, even with models trained for and deployed in production.

import PIL.Image
from torchvision import transforms

pipeline = transforms.Compose([
        transforms.Resize((240,256), interpolation=0), # 0 = PIL.Image.NEAREST (nearest-neighbour)
        transforms.ToTensor(),  #[HWC] -> [CHW] [0-1] float32
        transforms.Normalize(mean=[0,0,0], std=[1,1,1])
])

image = PIL.Image.open("test.jpg") # uint8, 576x768, 3 channels
output = pipeline(image)

Timing:

This pipeline takes about 5.6 milliseconds, without even a colour-conversion routine being run/fused. It is a very basic timing comparison, but it indicates the flexibility we gain by writing extensions for very specific needs, with much lower latency than generic pre-compiled modules.
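
For comparison, the PyTorch pipeline above can be timed in the same way; a minimal sketch, reusing pipeline and image from the previous snippet:

import time

pipeline(image)  # warm-up call

runs = 100
start = time.perf_counter()
for _ in range(runs):
    pipeline(image)
print(f"average per call: {(time.perf_counter() - start) / runs * 1e3:.2f} ms")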

Remarks:

In the same way, we should be able to fuse more operations depending on model requirements, such as resizing while keeping the aspect ratio unchanged; a small sketch of the sizing logic for that case follows.
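
The sizing logic for an aspect-ratio-preserving resize is just a scale computation; a hypothetical helper (names are illustrative) might look like:

def fit_within(inp_h, inp_w, max_h, max_w):
    # pick the largest scale that keeps the whole image inside (max_h, max_w)
    scale = min(max_h / inp_h, max_w / inp_w)
    return round(inp_h * scale), round(inp_w * scale)

print(fit_within(576, 768, 240, 256))  # -> (192, 256)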

In our experience, by writing preprocessing/postprocessing code from scratch in Nim/C/Rust/Zig and wrapping existing fast implementations of operations like convolution, we can make many deep-learning based models run in real time on consumer-grade CPUs, even without AVX-512 instructions. This opens up a lot of exciting opportunities, with the added benefit of deploying such models in production without much friction.