Implement Computer Vision on a Remote Live Camera Feed Using WebRTC
Computer vision has become immensely popular, and its applications have grown rapidly since the COVID-19 pandemic. However, there are still many gaps when it comes to implementing CV algorithms on real-time feeds. Imagine a scenario where you have to perform object detection on CCTV footage, but the edge device is simply not powerful enough to run your deep learning inference code, and you don't want to set up an IP camera network either. Edge inference is usually recommended for time-critical workflows, but what if I told you that near-real-time inference in the cloud is possible even for those?
In this blog, I will illustrate how to run machine learning tasks such as computer vision, frame by frame, on a remote live stream using WebRTC. But before that, let's dive into what WebRTC is.
WebRTC is a low-latency, open-source framework originally released by Google. It is a combination of standards, protocols, and JavaScript APIs that work together to deliver real-time communication. And when we say “real-time communication”, we mean that WebRTC streaming is capable of not just sub-second delivery but sub-500-millisecond video delivery. It is among the lowest-latency streaming technologies currently available, achieving an almost instantaneous stream. WebRTC was created as a free, open and plugin-free alternative to proprietary technologies. It supports browser-to-browser communication (and person-to-person live streaming) through a set of standardised protocols. So many of those “click to join” meetings that you attended while working from home were probably conducted using WebRTC.
Enough theory, now let’s get our hands dirty!
So here’s a flowchart explaining the workflow of our architecture.
We will first work on the back end, i.e. the Python Flask app. We will create a server.py file in which we load our model and define a function to handle the frames that we receive from the front end.
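To give a rough idea of the shape of such a server.py, here is a minimal sketch. The route name /image, the form field name image and the handle_frame placeholder are my assumptions for illustration, not necessarily what the repository uses; the real handler would run the model instead of just counting bytes.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)


def handle_frame(frame_bytes: bytes) -> dict:
    """Placeholder for the inference step: only reports the payload size."""
    return {"received_bytes": len(frame_bytes)}


@app.route("/image", methods=["POST"])  # assumed route name
def image():
    # "image" is an assumed form field name; it must match what the front end sends.
    upload = request.files.get("image")
    if upload is None:
        return jsonify({"error": "no frame in request"}), 400
    # Always answer with JSON so the browser knows it can post the next frame.
    return jsonify(handle_frame(upload.read()))
```

Assuming the file is saved as server.py, it can be started locally with, for example, `flask --app server run`.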
We will define the model architecture in code and then load the trained weights into it, which is the way I recommend loading TensorFlow models. If the weights are not loaded, the freshly built model only has its random initial weights and the predictions are meaningless. Please don't judge my AI skills from the model ;-). I have developed a very simple model here, since the AI model is not the focus of this article.
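As an illustration of the "define the architecture, then load the weights" pattern, here is a small sketch using the Keras API. The layer stack, IMG_SIZE, NUM_CLASSES and the weights file name are all placeholders I made up; substitute the architecture and checkpoint you actually trained.

```python
import os

import tensorflow as tf

IMG_SIZE = 64                      # assumed input resolution
NUM_CLASSES = 2                    # assumed number of classes
WEIGHTS_PATH = "model.weights.h5"  # hypothetical checkpoint file


def build_model() -> tf.keras.Model:
    """Rebuild the same architecture that was used during training."""
    return tf.keras.Sequential([
        tf.keras.Input(shape=(IMG_SIZE, IMG_SIZE, 3)),
        tf.keras.layers.Conv2D(8, 3, activation="relu"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
    ])


def load_model(weights_path: str = WEIGHTS_PATH) -> tf.keras.Model:
    """Build the model and restore trained weights if the file exists."""
    model = build_model()
    if os.path.exists(weights_path):
        model.load_weights(weights_path)
    else:
        # Without this step the model keeps its random initial weights.
        print(f"warning: {weights_path} not found, using random weights")
    return model
```

The architecture in build_model has to match the one that produced the checkpoint, otherwise load_weights fails or restores the wrong tensors.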
The next step is to write the function that fetches frames one by one, preprocesses each of them and feeds it to the loaded model to get predictions. We resize the array, expand its dimensions (to add the batch axis) and pass it to the model. We print the model output to the console. You can instead send the predictions to your target destination (maybe over MQTT or with a database query). For now we don’t have anything meaningful to return, but we still return a JSON response so that our JS code can receive it and send the next POST request.
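The resize, expand and predict steps could look roughly like the sketch below. It assumes the uploaded image has already been decoded into an HxWx3 NumPy array (for example with OpenCV or Pillow, not shown here), uses a plain nearest-neighbour resize to stay dependency-free, and assumes the model expects inputs scaled to [0, 1]. IMG_SIZE and DummyModel are stand-ins of mine so the example runs without a trained network.

```python
import numpy as np

IMG_SIZE = 64  # assumed model input resolution


def preprocess(frame: np.ndarray, size: int = IMG_SIZE) -> np.ndarray:
    """Resize an HxWx3 uint8 frame to size x size and add a batch axis."""
    height, width = frame.shape[:2]
    # Nearest-neighbour resize: pick evenly spaced source rows and columns.
    rows = np.arange(size) * height // size
    cols = np.arange(size) * width // size
    resized = frame[rows][:, cols]
    scaled = resized.astype("float32") / 255.0  # assumes [0, 1] training inputs
    return np.expand_dims(scaled, axis=0)       # shape (1, size, size, 3)


def predict_frame(model, frame: np.ndarray) -> dict:
    """Run one frame through the model, log the output and return JSON-ready data."""
    probs = model.predict(preprocess(frame))
    label = int(np.argmax(probs[0]))
    print("prediction:", label, probs[0])
    return {"label": label}


class DummyModel:
    """Stand-in exposing predict() like a Keras classifier would."""

    def predict(self, batch: np.ndarray) -> np.ndarray:
        return np.array([[0.25, 0.75]] * len(batch))
```

Returning a small dictionary keeps the function easy to plug into a Flask view, which can pass it straight to jsonify.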
Now let's focus on the front end. Here we have to do the following: fetch the stream in the browser, draw each frame on a canvas, convert it into a Blob (binary image data) and send it to our Python Flask API for prediction. To capture the first frame, we will write a JavaScript function startFrameCapture, in which we set the canvas size based on the video dimensions.
We now write a function postFile that sends a frame to our Python API and, once the response arrives, calls itself again with the next frame. Make sure to edit the apiServer variable if you are planning to host this on a server. You can find more details on this in my GitHub README.
Our index page contains only a simple JavaScript snippet that captures video frames from the browser itself using HTML5.
This marks the end of our tutorial. You can refer to my GitHub repository for the full code. However, it does not contain model checkpoints, so you will have to train the model yourself. The rest can be forked from the repository. If you want to reduce latency, you can lower the upload width parameter in the JS code, but that reduces the resolution and clarity of the feed, so make sure you train your model on similar data to get relevant predictions.
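To get a feel for the upload width trade-off, here is a rough back-of-the-envelope calculation in Python. The function names are mine, and it assumes the canvas height is derived from the upload width so that the video's aspect ratio is kept. The actual number of bytes sent also depends on image compression, so treat the pixel fraction only as an approximation of the saving.

```python
def upload_size(video_width: int, video_height: int, upload_width: int):
    """Canvas size for a given upload width, keeping the video's aspect ratio."""
    upload_height = round(upload_width * video_height / video_width)
    return upload_width, upload_height


def pixel_fraction(video_width: int, video_height: int, upload_width: int) -> float:
    """Fraction of the original pixels that is still sent per frame."""
    width, height = upload_size(video_width, video_height, upload_width)
    return (width * height) / (video_width * video_height)


# Compare a few upload widths for a 1280x720 camera feed.
for candidate in (1280, 640, 320):
    size = upload_size(1280, 720, candidate)
    print(candidate, size, pixel_fraction(1280, 720, candidate))
```

For a 1280x720 feed, an upload width of 640 gives a 640x360 frame with a quarter of the pixels, because the pixel count shrinks with the square of the width.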
In search of spoon-feeding? Find the full code on my GitHub!
Don’t forget to clap for this tutorial, star my GitHub repository and comment with any queries you have. I will try my best to get back to you as soon as possible.