Throughout my last few years in machine learning, I've always wanted to build real machine learning products.
A few months ago, after taking the great Fast.AI deep learning course, it seemed like the stars had aligned and I had the opportunity: advances in deep learning technology made many things possible that weren't before, and new tools made the deployment process more accessible than ever.
In the aforementioned course, I met Alon Burg, an experienced web developer, and we partnered up to pursue this goal. Together, we set ourselves the following goals:

  1. Improving our deep learning skills
  2. Improving our AI product deployment skills
  3. Making a useful product, with a market need
  4. Having fun (for us and for our users)
  5. Sharing our experience

Considering the above, we were exploring ideas which:

  1. Haven’t been done yet (or haven’t been done properly)
  2. Will not be too hard to plan and implement — our plan was 2–3 months of work, with a load of 1 work day per week.
  3. Will have an easy and appealing user interface — we wanted to make a product that people will actually use, not only for demonstration purposes.
  4. Will have training data readily available — as any machine learning practitioner knows, sometimes the data is more expensive than the algorithm.
  5. Will use cutting edge deep learning techniques (which were still not commoditized by Google, Amazon and friends in their cloud platforms) but not too cutting edge (so we will be able to find some examples online)
  6. Will have the potential to achieve a “production ready” result.

Our early thoughts were to take on some medical project, since this field is very close to our hearts, and we felt (and still feel) that there is an enormous number of low-hanging fruits for deep learning in the medical field. However, we realized that we would stumble upon issues with data collection, and perhaps legality and regulation, which contradicted our wish to keep things simple. Our second choice was a background removal product.

Background removal is a task that is quite easy to do manually, or semi-manually (Photoshop, and even PowerPoint, have such tools) if you use some kind of a “marker” and edge detection; see an example here. However, fully automated background removal is quite a challenging task, and as far as we know, there is still no product with satisfactory results, although some do try.

What background will we remove? This turned out to be an important question, since the more specific a model is in terms of objects, angle, etc., the higher quality the separation will be. When starting our work, we thought big: a general background remover that would automatically identify the foreground and background in every type of image. But after training our first model, we understood that it would be better to focus our efforts on a specific set of images. Therefore, we decided to focus on selfies and human portraits.

A selfie is an image with a salient and focused foreground (one or more “persons”), which guarantees us a good separation between the object (face + upper body) and the background, along with a fairly constant angle and always the same object (a person).

With these assumptions in mind, we embarked on a journey of research, implementation and hours of training to create a one click easy to use background removal service.

The main part of our work was training the model, but we couldn’t underestimate the importance of proper deployment. Good segmentation models are still not as compact as classification models (e.g. SqueezeNet), and we actively examined both server and browser deployment options.

If you want to read more details about the deployment process(es) of our product, you are welcome to check out our posts on server side and client side.

If you want to read about the model and its training process, keep going.

Semantic Segmentation

When examining deep learning and computer vision tasks which resemble ours, it is easy to see that our best option is the semantic segmentation task.

Other strategies, like separation by depth detection, also exist, but they didn’t seem ripe enough for our purposes.

Semantic segmentation is a well-known computer vision task, one of the top three, along with classification and object detection. Segmentation is actually a classification task, in the sense of classifying every pixel to a class. Unlike image classification or detection, a segmentation model really shows some “understanding” of the image: it not only says “there is a cat in this image”, but points out, at the pixel level, where the cat is.
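The “classify every pixel” idea can be made concrete with a minimal sketch. Here random numbers stand in for a real model’s output (a per-pixel score map of shape classes × height × width); the class names and sizes are illustrative, not from our actual model:

```python
import numpy as np

# Hypothetical per-pixel class scores from a segmentation model:
# shape (num_classes, height, width). Here: 3 classes, 4x4 "image".
rng = np.random.default_rng(0)
logits = rng.random((3, 4, 4))

# Per-pixel classification: each pixel gets the class with the highest score.
class_map = logits.argmax(axis=0)   # shape (4, 4), values in {0, 1, 2}

# A binary foreground mask for one class (say "person" = class 1) is then just:
person_mask = (class_map == 1)
```

For background removal, that binary mask is essentially the product: keep the “person” pixels, drop the rest.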

So how does the segmentation work? To better understand, we will have to examine some of the early works in this field.

The earliest idea was to adopt some of the early classification networks such as VGG and Alexnet. VGG was the state-of-the-art model for image classification back in 2014, and is still very useful nowadays because of its simple and straightforward architecture. When examining VGG’s early layers, you may notice that there are high activations around the item to be classified. Deeper layers have even stronger activations, but they are coarse in nature due to the repetitive pooling operations. With these understandings in mind, it was hypothesized that classification training can also be used, with some tweaks, to find/segment the object.
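The coarseness mentioned above is easy to quantify. VGG16 halves the spatial resolution five times via 2×2 max pooling, so a standard 224×224 input ends up as a 7×7 feature map, and each unit in that map corresponds to (at least) a 32×32 patch of the input:

```python
# VGG16 applies five 2x2 max-pooling layers, each halving the resolution.
size = 224
for _ in range(5):
    size //= 2

# size is now 7: the final feature map is 7x7, so its activations can only
# localize the object to roughly 32x32-pixel blocks of the original image
# (even more coarsely once the convolutions' receptive fields are counted).
print(size)
```

This is exactly why deep activations alone give only a rough object location, and why segmentation architectures need extra machinery to recover pixel-level detail.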