A Python extension is, generally, code written in a language other than Python to extend the Python ecosystem by exposing an interface compatible with the Python API. Such extensions usually come in the form of compiled code that is loaded dynamically by the Python runtime.
The ability to quickly create extensions that can be used directly from Python is a great advantage. Ideally, users would write the bottleneck parts of their code in their favorite (generally faster) language to offload the compute-intensive parts, and use them from Python with all its flexibility. Extensions written this way let users leverage all the features offered by the foreign language/dialect, including but not limited to multiprocessing, assembly-level instructions, and meta-programming, hence also bypassing the Python GIL. Many modules fundamental to data science, like numpy and scikit-learn, are also written in C/C++.
There are lots of options for writing such extensions, ranging from more specific dialects like Cython and Numba (for accelerating Numpy array based operations) to general foreign-language bindings for Rust, C++, or C. Still, writing extensions for Python is anything but a delightful experience, generally because a lot of boilerplate/glue code is needed to wrap the extension.
In this post we will look at Nim. Nim compiles to C, hence it can generate compiled code to be loaded as a module from Python. I personally have found Nim to be a highly productive language and have written a lot of code in it for various use cases. For me it fulfills what it promises, while offering a lot of flexibility, like directly compiling any already existing C code for fast prototyping.
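As a quick taste of that last point, here is a minimal sketch of pulling existing C code into a Nim program (the file name legacy.c and the function clamp_u8 are hypothetical, just for illustration):

# legacy.c contains:
#   unsigned char clamp_u8(int v){ return v < 0 ? 0 : (v > 255 ? 255 : (unsigned char)v); }

{.compile: "legacy.c".}                    # compile the existing C file together with the Nim module
proc clamp_u8(v: cint): uint8 {.importc.}  # bind the C symbol directly, no wrapper code needed

echo clamp_u8(300) # prints 255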
We will cover a concrete example to show the practical usefulness of a Nim-based Python extension by writing a simple image preprocessing pipeline. We will try to establish a simple baseline with generic single-threaded code, without going into optimizations.
Image preprocessing is generally the first step in computer-vision based ML models, and follows roughly the same order as:

1. Resize the image to the model's expected input resolution.
2. Convert the layout from HWC to CHW (or keep HWC, depending on the model).
3. Convert the uint8 data to float32, scaling values to [0-1].
4. Normalize, i.e. subtract the mean and divide by the std-deviation along the channel dimension.

Optimizing this part is worthwhile, since the pipeline has to run for every frame/image. Preprocessing and post-processing code is generally the least optimized code, given that it is highly specific to each model, and hence contributes significant latency if not properly optimized.
We make a few assumptions about the data:

- The input is Uint8 data, i.e. values from [0-255].
- The stride for the channels' dimension is 1, i.e. the input is contiguous in HWC layout.
- The output is contiguous as well, so for a CHW output the stride for the Width dimension would be 1.

Resizing involves collecting either a single input/source pixel (nearest neighbour) or a weighted combination of input pixels (bilinear), based on the logical coordinates (i, j) of each output pixel location. We can convert the collected input (uint8) data into float32 before writing to the corresponding output/destination memory location, thus effectively fusing steps 1 and 3.
We can decide beforehand whether we want the output format to be CHW or HWC, writing 2 different implementations if both are needed, thus also fusing step 2; the CHW implementation follows, with a sketch of the HWC variant after it.
# Writing a simple Resize function that fuses steps 1, 2 and 3 in our pipeline, for the case where the output format is CHW.
import math

# nearest-neighbour based routine to calculate the corresponding input/source index, given the output index.
proc nearest_neighbour_compute_source_index(scale:float, out_index:int, input_size:int):int=
  return min( int( floor(out_index.float * scale)), input_size-1 )

proc hwc2chw_resize_simple(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C:int=3)=
  let
    inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
    outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)
  let
    scale_h:float = inpH.float/outH.float
    scale_w:float = inpW.float/outW.float
  # For each position in the output image/array, get the corresponding input pixel value.
  for h in 0..<outH:
    for w in 0..<outW:
      for c in 0..<C:
        let src_h = nearest_neighbour_compute_source_index(scale=scale_h, out_index=h, input_size=inpH) # logical index
        let src_w = nearest_neighbour_compute_source_index(scale=scale_w, out_index=w, input_size=inpW) # logical index
        var inp_f32 = float32(inpRawData[src_h*inpW*C + src_w*C + c])/255'f32
        outRawData[c*outH*outW + h*outW + w] = inp_f32
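For completeness, here is a sketch of the second implementation mentioned above, for the case where the output format stays HWC (the name hwc2hwc_resize_simple is mine, for illustration); only the destination index computation changes:

proc hwc2hwc_resize_simple(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C:int=3)=
  let
    inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
    outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)
  let
    scale_h:float = inpH.float/outH.float
    scale_w:float = inpW.float/outW.float
  for h in 0..<outH:
    for w in 0..<outW:
      for c in 0..<C:
        let src_h = nearest_neighbour_compute_source_index(scale=scale_h, out_index=h, input_size=inpH)
        let src_w = nearest_neighbour_compute_source_index(scale=scale_w, out_index=w, input_size=inpW)
        # destination stays in HWC layout: h*outW*C + w*C + c instead of c*outH*outW + h*outW + w.
        outRawData[h*outW*C + w*C + c] = float32(inpRawData[src_h*inpW*C + src_w*C + c])/255'f32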
A model may need image data either in BGR format or in RGB format, depending on the data the model was trained with. Let us try to include that constraint in our code at no extra cost.
proc hwc2chw_resize_simple(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C=3, reverse_channels:bool=false)=
  let rev = int(reverse_channels) # 0/1 (a new name, since Nim does not allow shadowing a parameter in the proc body)
  let
    inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
    outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)
  let
    scale_h:float = inpH.float/outH.float
    scale_w:float = inpW.float/outW.float
  # For each position in the output image/array, get the corresponding input pixel value.
  for h in 0..<outH:
    for w in 0..<outW:
      for c in 0..<C:
        let src_h = nearest_neighbour_compute_source_index(scale=scale_h, out_index=h, input_size=inpH) # logical index
        let src_w = nearest_neighbour_compute_source_index(scale=scale_w, out_index=w, input_size=inpW) # logical index
        # if reverse_channels is true, we read the (C-1-c)'th channel, otherwise the c'th channel.
        var inp_f32 = float32(inpRawData[src_h*inpW*C + src_w*C + (1-rev)*c + rev*(C-1-c)])/255'f32
        # update the corresponding memory location in the output array by converting logical indices to array indices.
        outRawData[c*outH*outW + h*outW + w] = inp_f32
Let us also fuse the normalization step, i.e. subtracting the mean and dividing by the std-deviation along the channel dimension.
proc hwc2chw_resize_simple(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C=3, mean_array:array[3, float32]=[0'f32,0,0], std_array:array[3, float32]=[1'f32,1,1], reverse_channels:bool=false)=
  let rev = int(reverse_channels) # 0/1
  let
    inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
    outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)
  let
    scale_h:float = inpH.float/outH.float
    scale_w:float = inpW.float/outW.float
  # For each position in the output image/array, get the corresponding input pixel value.
  for h in 0..<outH:
    for w in 0..<outW:
      for c in 0..<C:
        let src_h = nearest_neighbour_compute_source_index(scale=scale_h, out_index=h, input_size=inpH) # logical index
        let src_w = nearest_neighbour_compute_source_index(scale=scale_w, out_index=w, input_size=inpW) # logical index
        # if reverse_channels is true, we read the (C-1-c)'th channel, otherwise the c'th channel.
        var inp_f32 = float32(inpRawData[src_h*inpW*C + src_w*C + (1-rev)*c + rev*(C-1-c)])/255'f32
        # subtract the corresponding mean and divide by the standard deviation.
        inp_f32 = (inp_f32 - mean_array[(1-rev)*c + rev*(C-1-c)])/(std_array[(1-rev)*c + rev*(C-1-c)] + 1e-5'f32)
        # update the corresponding memory location in the output array by converting logical indices to array indices.
        outRawData[c*outH*outW + h*outW + w] = inp_f32
In this way we are able to fuse essentially all the steps of a preprocessing pipeline at almost no extra cost. We can write more model-specific code in the same fashion at similar speed, significantly reducing the preprocessing latency for a model.
We can now generate a Python extension to expose this functionality as a compiled Python module. We will use the awesome nimpy module, which is designed with ABI compatibility in mind and hence should work independently of the Python version.
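To get a feel for how little glue nimpy needs, here is its canonical hello-world (adapted from the nimpy README); any proc annotated with the {.exportpy.} pragma becomes callable from Python:

# mymodule.nim
import nimpy

proc greet(name: string): string {.exportpy.} =
  return "Hello, " & name

Compiled with nim c --app:lib --out:mymodule.pyd mymodule.nim (use a .so extension on Linux/macOS), it can be used from Python as import mymodule; mymodule.greet("world").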
Back to our pipeline: according to the code written in Nim, we need access to the raw Uint8 values for a given frame/image, i.e. the rawData produced on the Python side. To produce that data we use Numpy, which implements the buffer protocol, hence easily allowing access to the underlying buffer/rawData on the Nim side, which is exactly what we need. Since Numpy is one of the most-used modules for matrix/array manipulation, being able to pass a Numpy array directly to the routines exposed by the compiled extension is a great benefit.
import nimpy
import nimpy / [raw_buffers, py_types]

proc preprocessPipeline_nim(image:PyObject, output:PyObject, reverse_channels:bool=false){.exportpy.}=
  var image_buf:RawPyBuffer
  image.getBuffer(image_buf, PyBUF_SIMPLE or PyBUF_ND)
  var output_buf:RawPyBuffer
  output.getBuffer(output_buf, PyBUF_WRITABLE or PyBUF_ND)
  # access to rawData
  let rawData_ptr = cast[ptr uint8](image_buf.buf)
  let output_ptr = cast[ptr float32](output_buf.buf)
  # get the dimensions of our input image and output array.
  let
    H = cast[ptr UncheckedArray[Py_ssize_t]](image_buf.shape)[0].int
    W = cast[ptr UncheckedArray[Py_ssize_t]](image_buf.shape)[1].int
    C = cast[ptr UncheckedArray[Py_ssize_t]](image_buf.shape)[2].int
    outH = cast[ptr UncheckedArray[Py_ssize_t]](output_buf.shape)[1].int
    outW = cast[ptr UncheckedArray[Py_ssize_t]](output_buf.shape)[2].int
    C_out = cast[ptr UncheckedArray[Py_ssize_t]](output_buf.shape)[0].int
  assert C == C_out
  doAssert C == 3, "Expected no of channels to be 3 but got " & $C
  hwc2chw_resize_simple(
    inpRawData_ptr = rawData_ptr,
    outRawData_ptr = output_ptr,
    inpH = H,
    inpW = W,
    outH = outH,
    outW = outW,
    C = C,
    reverse_channels = reverse_channels
  )
  # release the buffers' references for the Python GC.
  image_buf.release()
  output_buf.release()
Since I am compiling on Windows, to please the Windows gods I have to offer some sacrifice in the form of extra flags: --tlsEmulation:off and --passL:-static. (On Linux/macOS one would build a .so instead of a .pyd, and these two flags are generally not needed.)

# the default C compiler (--cc) is gcc; -d:danger is optional and disables runtime checks for maximum speed.
nim c [-d:danger] --cc:gcc --threads:on --app:lib --tlsEmulation:off --passL:-static --out:preprocessPipeline.pyd preprocessPipeline.nim
import numpy as np
import preprocessPipeline as pipeline

# Wrap this into a simple function for easy usage; minimal extra latency in the form of function-call overhead.
def preprocess(image, output_shape:tuple, reverse_channels:bool=False):
    """Takes an input image of uint8, HWC format; returns the resized image as float32 [0-1] in CHW format."""
    # image: numpy ndarray of uint8 data-type, of shape [H, W, C]
    # output_shape: (out_height, out_width)
    # returns: np.array of shape [C, out_height, out_width]
    C = image.shape[2]
    outH = output_shape[0]
    outW = output_shape[1]
    output = np.empty((C, outH, outW), dtype=np.float32)
    pipeline.preprocessPipeline_nim(image, output, reverse_channels=reverse_channels) # output is updated in-place.
    return output
Now we can try out our extension to visualize some results. It takes about ~700 microseconds on my system for the preprocess function to run on an image with a resolution of 576x768x3.
import cv2
import numpy as np
frame = cv2.imread("test.jpg") #Uint8 HWC format [576,768,3]
output = preprocess(frame, output_shape=(240,256), reverse_channels=False)
output_reversed = preprocess(frame, output_shape=(240,256), reverse_channels=True)
#This preprocessed output, can be directly used as an input to computer-vision model.
cv2.imshow("frame", output.transpose(1,2,0))
cv2.waitKey(0)
The corresponding pipeline in the PyTorch framework would look as shown below. Even though PyTorch is a research-oriented framework, a pipeline like this is used quite regularly for making predictions, even with trained models deployed in production.
from PIL import Image
from torchvision import transforms

pipeline = transforms.Compose([
    transforms.Resize((240,256), interpolation=Image.NEAREST), # nearest-neighbour, to match our Nim implementation
    transforms.ToTensor(),                                     # [HWC] uint8 -> [CHW] float32 in [0-1]
    transforms.Normalize(mean=[0,0,0], std=[1,1,1])
])
image = Image.open("test.jpg") # Uint8 HWC format [576,768,3]
output = pipeline(image)
It takes about 5.6 milliseconds, without even the color-conversion routine being run/fused.
Even though this is a very basic timing comparison, it indicates the flexibility we can have when writing an extension for very specific needs, with much lower latency than already-compiled general-purpose modules.
In the same manner we should be able to fuse more operations depending upon model requirements, such as resizing while keeping the aspect ratio unchanged, as sketched below.
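Here is a rough sketch of what such a fused, aspect-ratio-preserving variant could look like (my own sketch, untested: it computes a single common scale and pads the uncovered output region with a constant value):

proc hwc2chw_resize_keep_aspect(inpRawData_ptr:ptr uint8, outRawData_ptr:ptr float32, inpH:int, inpW:int, outH:int, outW:int, C:int=3, pad_value:float32=0'f32)=
  let
    inpRawData = cast[ptr UncheckedArray[uint8]](inpRawData_ptr)
    outRawData = cast[ptr UncheckedArray[float32]](outRawData_ptr)
  # a single common scale, so the whole input fits inside [outH, outW] without distortion.
  let scale = max(inpH.float/outH.float, inpW.float/outW.float)
  # extent of the output region actually covered by image data; the rest is padding.
  let validH = min(outH, int(inpH.float/scale))
  let validW = min(outW, int(inpW.float/scale))
  for h in 0..<outH:
    for w in 0..<outW:
      for c in 0..<C:
        var inp_f32 = pad_value # constant padding outside the valid region
        if h < validH and w < validW:
          let src_h = nearest_neighbour_compute_source_index(scale=scale, out_index=h, input_size=inpH)
          let src_w = nearest_neighbour_compute_source_index(scale=scale, out_index=w, input_size=inpW)
          inp_f32 = float32(inpRawData[src_h*inpW*C + src_w*C + c])/255'f32
        outRawData[c*outH*outW + h*outW + w] = inp_f32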
In our experience, by writing preprocessing/post-processing code from scratch in Nim/C/Rust/Zig and wrapping existing fast implementations of operations like Convolution, we have been able to make a lot of deep-learning based models run in real-time on consumer-grade CPUs even without AVX512 instructions. This opens up a lot of exciting opportunities, along with the added benefit of deploying such models in production without much friction.