Experimenting with Selfie Segmentation (Teams “background effect”)

Julien de Charentenay
6 min read · Oct 11, 2021


MediaPipe provides Machine Learning models packaged and readily available on a variety of platforms — Android, iOS, C++, Python and JavaScript, according to the documentation. This story uses the JavaScript implementation.

This post uses the MediaPipe Selfie Segmentation model, which "segment[s] the prominent humans in the scene". The post shows how to use this model to extract the foreground — where the prominent human is — and the background — the rest of the image — from a webcam feed. A live version is presented on the demo page https://www.video-mash.com/demo.html.

An important limitation is that the model works well only when the prominent human is located close to the camera (less than 2 m).

Casually observing — from the background

The views/opinions expressed in this story are my own. This story relates my personal experience and choices, and is provided for information in the hope that it will be useful but without any warranty.

Like a lot of people, I have been working from home for the best part of the last 18 months. During that time I noticed that people would point at the screen to emphasize the point they were trying to make. This works great face to face… but not so much in a video call…

I got together with a friend of mine, Sebastian Reddy, to look at how to overlay someone's image on top of a screen share as a way of allowing this kind of interaction to take place. The brief included "the setup should be easy so that a dad could install and use it". We quickly found that the largest hurdle was creating a virtual webcam feed to broadcast the combined screen share and webcam feed.

At the same time, MS Teams released the presenter mode feature — see https://www.theverge.com/2021/3/2/22308927/microsoft-teams-presenter-mode-powerpoint-live-features. So the need for that use case was no longer there.

So we changed the idea to do something a little more fun. Can we extract the video foreground and superimpose it on a camera feed to give the impression that the activity is taking place in one's living room? And can we superimpose oneself on the background of a video so that one can videobomb any video?

We got a website together with a crude implementation of these features and released it at https://www.video-mash.com.

This story focuses on setting up MediaPipe Selfie Segmentation, handling the webcam feed, using MediaPipe Selfie Segmentation to manipulate the webcam feed, and displaying the result. This image manipulation is the cornerstone of the website mentioned above.

The source code for this story is available in the following GitHub project.

Preliminaries

Set up the website using the Vue CLI to create a new project with the VueJS framework. Select Vue 2.x and configure the project as a multi-page web application without reliance on a Vuex store. Also install Tailwind CSS with the PostCSS 7 compatibility build.
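
As an illustration, the multi-page setup relies on the pages option of vue.config.js. The entries below are hypothetical placeholders rather than the project's actual pages:

// vue.config.js — hypothetical multi-page configuration
module.exports = {
  pages: {
    // each key becomes a separate page of the built website
    index: {
      entry: 'src/pages/index/main.js',
      template: 'public/index.html',
      filename: 'index.html',
    },
    demo: {
      entry: 'src/pages/demo/main.js',
      template: 'public/index.html',
      filename: 'demo.html',
    },
  },
};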

Once the project is created, install MediaPipe Selfie Segmentation using:

npm install @mediapipe/selfie_segmentation

The Selfie Segmentation model requires a link to the model files when it is initialized. This example uses the models provided by the CDN, as listed in the documentation. The website uses a different approach, with the models provided as assets within the website, to avoid depending on an external third party that may change in the future and hence break the functionality. This issue happened to https://www.share-ml.io and implementing a fix is still on my to-do list.

The models were copied from node_modules\@mediapipe\selfie_segmentation to ensure that they are consistent with the installed version of MediaPipe.
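
As a sketch of the asset-based approach, once the model files are copied into the website (for example under public/selfie_segmentation — a hypothetical location), the locateFile option used when creating the model (see below) can point at them instead of the CDN:

// Hypothetical sketch: serve the model files copied into public/selfie_segmentation
this.selfie_segmentation = new SelfieSegmentation({
  locateFile: (file) => {
    // resolved relative to the website root, so no external CDN is needed
    return `/selfie_segmentation/${file}`;
  }
});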

Webcam Feed

The webcam feed is obtained in JavaScript using Navigator.mediaDevices, as described in the MDN Web Docs. The MediaDevices object is used to request user permission and obtain the webcam feed via the getUserMedia function, as shown in the script below. The function is called with a preference for the user-facing camera, and without audio. It returns a promise that resolves to a MediaStream when successful. This media stream is set as the source of a video HTML element. The element is hidden from view, but is used to turn the media stream into an input suitable for the MediaPipe selfie segmentation model.

The code extract below shows the query of the webcam feed and its handling via a video element created on the fly.

navigator
  .mediaDevices
  .getUserMedia({audio: false, video: {facingMode: "user"}})
  .then((media_stream) => {
    this.webcam_video = document.createElement('video');
    this.webcam_video.srcObject = media_stream;
    this.webcam_video.style.display = 'none';
    document.body.appendChild(this.webcam_video);
    this.webcam_video.onplay = this.playing;
    this.webcam_video.play();
  })
  .catch((e) => {this.on_error("Unable to start webcam", e);});

MediaPipe Selfie Segmentation

The selfie segmentation model is created as described in the model documentation, with a number of minor differences: (a) the model is stored in a component variable this.selfie_segmentation, and (b) a flag is set to true after the selfie segmentation model is created.

The selfie segmentation model exposes an onResults function that is used to set the callback function to be called after the model has run. This callback is set to the component function this.on_results in the present case.

this.selfie_segmentation = new SelfieSegmentation({
  locateFile: (file) => {
    return `https://cdn.jsdelivr.net/npm/@mediapipe/selfie_segmentation/${file}`;
  }
});
this.selfie_segmentation.setOptions({ modelSelection: 1 });
this.selfie_segmentation.onResults(this.on_results);
this.selfie_segmentation_ready = true;

The first call to the model takes significantly longer than subsequent calls and may make the window unresponsive for a little while. No attempt to alleviate this drawback is made here. On the website, an interstitial loading page is shown while the model runs its first iteration — done as a warm-up prior to processing relevant images.
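
A minimal sketch of such a warm-up is shown below. It assumes that send() accepts a canvas element (which appears to be the case for the JavaScript solutions) and uses a hypothetical loading_overlay element to represent the interstitial page:

warm_up: async function() {
  // Run the model once on a blank canvas so that the slow first
  // call happens before any relevant webcam frame is processed.
  const canvas = document.createElement('canvas');
  canvas.width = 640; canvas.height = 480;
  canvas.getContext('2d').fillRect(0, 0, canvas.width, canvas.height);
  await this.selfie_segmentation.send({image: canvas});
  // Hide the hypothetical loading overlay once the warm-up is done
  const overlay = document.getElementById('loading_overlay');
  if (overlay) { overlay.style.display = 'none'; }
}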

Handling Selfie Segmentation Results

The selfie segmentation results consist of an object with two keys: image, which contains the original image, and segmentationMask, which contains a bitmap image with colored pixels where the prominent humans are located, as shown in the image below.

Original image and segmentation mask result returned from MediaPipe selfie segmentation model

These two images can be used with the canvas compositing operations described here to extract the relevant sections of the image. For example, drawing the image, setting the composite operation to destination-in and drawing the segmentation mask will render the prominent humans. The same approach with the composite operation destination-out is used to render the background. The context parameters are saved before setting the composite operation and restored afterwards, so that the composite operation reverts to its original state.

The on_results function below shows how the selfie segmentation result is handled to render the original image, the segmentation mask, the background and the foreground — i.e. the prominent human(s).

on_results: function(results) {
  { // Draw the original image
    const c = this.canvas_in();
    const ctx = c.getContext('2d');
    ctx.clearRect(0, 0, c.width, c.height);
    ctx.drawImage(results.image, 0, 0, c.width, c.height);
  }
  { // Draw the segmentation mask
    const c = this.canvas_mask();
    const ctx = c.getContext('2d');
    ctx.clearRect(0, 0, c.width, c.height);
    ctx.drawImage(results.segmentationMask, 0, 0, c.width, c.height);
  }
  { // Draw the background
    const c = this.canvas_background();
    const ctx = c.getContext('2d');
    ctx.save();
    ctx.clearRect(0, 0, c.width, c.height);
    ctx.drawImage(results.image, 0, 0, c.width, c.height);
    ctx.globalCompositeOperation = 'destination-out';
    ctx.drawImage(results.segmentationMask, 0, 0, c.width, c.height);
    ctx.restore();
  }
  { // Draw the foreground
    const c = this.canvas_foreground();
    const ctx = c.getContext('2d');
    ctx.save();
    ctx.clearRect(0, 0, c.width, c.height);
    ctx.drawImage(results.image, 0, 0, c.width, c.height);
    ctx.globalCompositeOperation = 'destination-in';
    ctx.drawImage(results.segmentationMask, 0, 0, c.width, c.height);
    ctx.restore();
  }
}

Triggering Selfie Segmentation

The only missing part is triggering the selfie segmentation model — i.e. calling it with the webcam frame. This is done in the playing function that was assigned to the onplay property of the webcam_video element in the Webcam Feed section above. This function is called once when the webcam starts to play. Its implementation schedules itself again using the requestAnimationFrame callback and calls the send function of the selfie segmentation model object. The send call is asynchronous, but needs to be awaited as, in my experience, it cannot handle multiple simultaneous calls.

playing: async function() {
  if (this.selfie_segmentation_ready) {
    await this.selfie_segmentation.send({image: this.webcam_video});
  }
  window.requestAnimationFrame(this.playing);
}

Limitations

The following should be kept in mind in addition to the limitations mentioned above — the long first call and the inability to request two segmentations concurrently. I was not able to create multiple instances of the selfie segmentation model to run concurrently. The segmentation works well when the person is located close to the camera, but is less effective when people are further away.

The implementation does not account for the camera and canvas aspect ratios. This can lead to image distortion. A more appropriate implementation would draw the images on the canvas based on the camera aspect ratio to avoid this distortion.
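
As an illustration, a minimal sketch of an aspect-ratio-preserving draw is shown below. The draw_contain helper name is hypothetical, and the image is assumed to expose width and height properties; it could replace the plain drawImage calls above:

// Draw `image` on canvas `c` while preserving its aspect ratio
// (letterboxing instead of stretching).
function draw_contain(ctx, c, image) {
  const scale = Math.min(c.width / image.width, c.height / image.height);
  const dw = image.width * scale, dh = image.height * scale;
  const dx = (c.width - dw) / 2, dy = (c.height - dh) / 2;
  ctx.drawImage(image, dx, dy, dw, dh);
}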

Next

I am thinking of taking this project in a couple of different directions, such as integrating phone movement, but would be keen to hear your thoughts and feedback.

I tried to provide a small selection of videos, but if you can think of good videos to use on the https://www.video-mash.com website, please feel free to let me know.
