In my work I often hear the opinion that deploying DL models is a long, expensive and complex process. As a result, the problem ends up being solved with regexes and crutches at best, or by falling back to manual processing at worst. This decision can even be rational, especially when the problem seems to require an investment such as perfect manual annotation or a hardware upgrade. Real-time image classification is a good example. But we focus on the goal, not the obstacles. With the right attitude, a fast model like this can be deployed even without a GPU on the production server and without any ready-made annotation.
- Two challenges: NSFW/SFW Classification + Text Classification/Labelling;
- CPU inference time for pretrained models: ~700-800 ms per image for NSFW classification, ~200-1500 ms per image for text detection;
- Our classifier:
  - MobileNetv3 encoder;
  - Multi-head: classifiers + semseg;
  - Can be extended/customized;
  - Metrics are comparable with the pretrained models;
  - In total: 50-100ms for both tasks;
- No ready annotation, no GPU for inference, no complex algorithmic optimizations, no rocket science.
More details about the task
We faced two challenges at once:
- Text search in the image: is there any text or not?
- Safe for work / not safe for work.
While the text classifier's criteria are brutally simple (no text = 0, text = 1), the NSFW problem is vaguer and its interpretation may differ. In general, it looks something like this:
That's roughly how it works. Add a couple more random criteria for a complete picture.
Our NSFW classifier had to work on this principle, plus detect two additional categories.
Since NSFW classification and text detection are both common CV tasks, there is plenty of research in this field. You are lucky if you find a paper with fast, clean PyTorch code. Sad but true, most papers either have no open-source code at all or come with implementations that resemble black boxes.
Running a picture through monolithic code from GitHub
For instance, for NSFW classification there is the Yahoo classifier on Caffe, with a Docker image to run it. It works, but slowly: ~700-800 ms on CPU per image.
For text detection, there is a cool EAST paper which has several implementations on GitHub:
- Approved repo with TF and C++ with inference time ~200-300ms for one pic on CPU;
- PyTorch repo with >300 stars and no pretrained model;
- PyTorch repo with result ~700-1500ms on CPU.
Such black boxes are fine for a proof of concept: getting an idea of how things work, estimating runtime, performance, etc. But deploying one is not the best decision, at least because:
- Fixing bugs in a black box is a sort of BDSM, especially if the author doesn't care about proper error tracking;
- Usually, there is no way to effectively extend or optimize a black box. You can even retrain it, but the inference time will remain the same.
A much better option is to train your own model with clean, clear and reproducible code that can be extended and optimized. And a lack of annotation is not an obstacle, even if you don't have the resources to create it yourself.
If there's no annotation, then
- Nice NSFW datasets: 1, 2 with >1M images in total;
- Text detection/recognition:
At some point, the public datasets are no longer sufficient and you have to look for shortcuts. The most obvious source of data for custom labels is Google Images Search. Parsing Google from scratch is a task for strong-willed people, but fortunately there are services and wrappers built to do it for you:
- Python script;
- Downloads images from search by text request;
- Killer feature - search for similar images by image URL;
- The number of downloads per day is limited;
- Paid service for SEO stuff;
- Returns image URLs from search by text;
- Instead of inventing each text request yourself, you can parse synonyms from Wordstat service.
However, while grabbing images from the internet is a great way to extend your dataset, it comes with a drawback: the labels are noisy (especially for NSFW). This can be mitigated with a simple heuristic, such as dropping all images except the top 20 results for each query.
Often only top results from Google Images are relevant. Yep, children are also NSFW
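The top-20 heuristic fits in a few lines. This is a minimal sketch, assuming the scraper returns URLs in search-rank order; `keep_top_results` and its parameters are hypothetical names, not part of any of the services mentioned above.

```python
from typing import List


def keep_top_results(ranked_urls: List[str], top_k: int = 20) -> List[str]:
    """Keep the first top_k search results, dropping duplicate URLs.

    ranked_urls is assumed to be ordered by search rank (best first).
    """
    seen = set()
    kept = []
    for url in ranked_urls[:top_k]:
        if url not in seen:
            seen.add(url)
            kept.append(url)
    return kept
```

Every kept URL then inherits the label implied by its search query, which is still noisy, just less so.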
In fact, the whole dataset is only partly labelled: we know neither how many NSFW pictures are in the 'Text' part, nor how many pictures with text are in the NSFW part.
To make sure our classifier doesn't just learn "NSFW = no text", we distilled the dataset using the black boxes from above: we ran all images without annotation through the pretrained models and labelled only those that were classified with good confidence. For example, in the case of text detection, we noticed that only the images where the bounding box area was > 0.1% of the whole image actually contained text.
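The distillation step can be sketched roughly as below. The 0.1% area threshold comes from the text; everything else is an assumption for illustration: the function names, the box format, summing the areas of all detected boxes, and the 0.2/0.8 confidence cut-offs for the NSFW score.

```python
from typing import List, Optional, Tuple

# Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def text_area_fraction(boxes: List[Box], width: int, height: int) -> float:
    """Total area of detected text boxes divided by the image area.

    Overlapping boxes are counted twice, which is a simplification.
    """
    total = sum(max(0.0, x1 - x0) * max(0.0, y1 - y0) for x0, y0, x1, y1 in boxes)
    return total / float(width * height)


def pseudo_label_text(boxes: List[Box], width: int, height: int,
                      min_fraction: float = 0.001) -> int:
    """Return 1 (text) only if the boxes cover more than 0.1% of the image."""
    return int(text_area_fraction(boxes, width, height) > min_fraction)


def pseudo_label_nsfw(score: float, low: float = 0.2,
                      high: float = 0.8) -> Optional[int]:
    """Label only confident predictions; None means 'leave unlabelled'.

    The low/high cut-offs are hypothetical and should be tuned.
    """
    if score >= high:
        return 1
    if score <= low:
        return 0
    return None
```

Images that get `None` simply stay out of the loss for that head, so an uncertain black-box prediction never turns into a training label.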
In practice, the output is correct only some of the time, so the threshold must be chosen depending on which matters more: fewer false positives or fewer false negatives.
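One way to make that trade-off explicit is to sweep the threshold over a small hand-checked sample and pick the lowest value that still reaches a target precision. This is only a sketch: the hand-checked sample, the function names and the 0.95 target are assumptions, not something described in this post.

```python
from typing import List, Optional, Tuple


def precision_recall(scores: List[float], labels: List[int],
                     threshold: float) -> Tuple[float, float]:
    """Precision and recall when predicting positive for score >= threshold."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(scores: List[float], labels: List[int],
                   min_precision: float = 0.95
                   ) -> Optional[Tuple[float, float, float]]:
    """Lowest threshold that meets the precision target.

    Recall never increases as the threshold grows, so the lowest
    qualifying threshold also gives the best recall among candidates.
    Returns (threshold, precision, recall) or None if nothing qualifies.
    """
    for t in sorted(set(scores)):
        p, r = precision_recall(scores, labels, t)
        if p >= min_precision:
            return t, p, r
    return None
```

If missing positives is the bigger problem, the same sweep can be flipped to enforce a minimum recall instead.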
Even though part of the data ends up mislabelled, distillation still adds diversity to the dataset and extends it with good-quality data.
The main idea was to train a multi-head model with a light and fast encoder. As a first approach, we tried EfficientNet B0-B1 (github) for this purpose, but then switched to the MobileNet v3 (github) architecture.
The whole point of MobileNet is to run on mobile devices, so it is even faster and lighter than EfficientNet. The only catch is a slight loss of accuracy, but in real-life tasks it fades into the background. MobileNet is really fast, even on the CPU.
EfficientNet B1 and MobileNet v3 performance on val.
Decoder heads include:
- NSFW/SFW classification head;
- Classification by custom categories: nudes or children in swimsuits, for example;
- Text classification head;
- Head for semantic segmentation of text.
In general, a multi-head model is better than several separate models, because the shared encoder is regularized by all the tasks, which improves performance on each of them. As a bonus, inference is faster.
So the last semseg head was added for a reason, although it does not participate in inference. It plays the role of a regularizer for the related text classifier head. By the way, a semseg-like output is used even in the EAST paper, as was mentioned on the channel.
Surprisingly, semseg trains well even when bounding boxes are used as polygons.
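Turning boxes into a segmentation target only requires filling the boxes in a binary mask. A minimal sketch, assuming axis-aligned boxes given as `(x_min, y_min, x_max, y_max)` in pixels (rotated boxes or arbitrary polygons would need a proper polygon fill); `boxes_to_mask` is a hypothetical helper name.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def boxes_to_mask(boxes: List[Box], width: int, height: int) -> List[List[int]]:
    """Rasterize axis-aligned boxes into a height x width binary mask."""
    mask = [[0] * width for _ in range(height)]
    for x_min, y_min, x_max, y_max in boxes:
        # Clip each box to the image so out-of-range coordinates are safe.
        x0, x1 = max(0, int(x_min)), min(width, int(x_max))
        y0, y1 = max(0, int(y_min)), min(height, int(y_max))
        for y in range(y0, y1):
            for x in range(x0, x1):
                mask[y][x] = 1
    return mask
```

The resulting mask is coarse, since everything inside a box counts as text, but that is enough for a head whose only job is to regularize the classifier.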
As a result, CPU inference of the trained model takes, for images resized to:
- 224x224: ~50 ms;
- 384x384: ~80 ms.
That is 3-5 times faster than the ready-made implementations. If it is enabled, the semseg head adds another ~50-100 ms to inference, depending on the size of the image.
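Numbers like these are easy to reproduce with a small timing harness. The sketch below is an assumption about how one might measure latency, not the procedure used for the figures above; `measure_latency_ms` is a hypothetical helper that takes any zero-argument callable, for example a lambda wrapping a single forward pass.

```python
import statistics
import time
from typing import Callable


def measure_latency_ms(fn: Callable[[], object], n_warmup: int = 3,
                       n_runs: int = 20) -> float:
    """Median wall-clock time of fn() in milliseconds."""
    # Warm-up runs are discarded: the first calls often include one-off costs.
    for _ in range(n_warmup):
        fn()
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        fn()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(timings)
```

The median is used instead of the mean so that a single slow run on a busy server does not distort the result.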
- Public datasets + Google images + Distillation;
- Light and fast encoder - MobileNet v3 - pretrained on ImageNet;
- Multi-head classifier + Semseg head;
- Simple service on Flask + Docker.