Get acquainted with U-NET architecture + some keras shortcuts

Or U-NET for newbies, or a list of useful links, insights and code snippets to get you started with U-NET

Posted by snakers41 on August 14, 2017

TLDR of the article, taken from the original working paper from here

In the field of Machine Learning / Data Science / Deep Learning there is a list of powerful image classification architectures that have become de facto standard approaches to many computer vision and related problems:

  • AlexNet;
  • VGG-16, VGG-19;
  • Inception Nets;
  • ResNet;
  • SqueezeNet, mentioned briefly as a combination of many of the above;

Their accuracy, computational cost and number of parameters are nicely summarized in these charts (also taken from the above resource):

Very many tasks can be reduced to using similar architectures; for example, I managed to reduce a bird voice recognition task to a simple VGG-16 style network as a proof of concept. This is possible because most of these networks contain convolutional layers that act as filters and learn specific features of the input data, such as circles, specific shapes, edges and curves, which are applicable in many scenarios. For example, one of the approaches in modern DS competitions is to take a list of pre-trained networks, cut off the last layer, and train a new fully connected layer on the features that were previously learned by the larger network. This works because (I will not bother looking up the link, but I guess it was somewhere in Andrej Karpathy's blog) the more data is fed to the neural network, the more abstract features it can learn and the more generalizable it becomes.

Fabled VGG-16 style architecture

0. Why U-NET

So, all of this is really nice, but what connection does it have to the U-NET architecture? Since machine vision is considered (btw read the amazing article under the link) "semi-solved" for general-purpose image classification, it is only rational that more specialized architectures will emerge. As far as I know, a whole bunch of encoder-decoder style network architectures and generative models are in active development now (I did not have enough time to start the second MOOC though).

But I also learned from reading Kaggle kernels and forums that U-NET is considered one of the standard architectures for image segmentation tasks, where we need not only to classify the whole image, but also to segment areas of the image by class, i.e. produce a mask that separates the image into several classes.
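To make the difference concrete, here is a minimal sketch in plain Python. The tiny 4x4 "image", the threshold and the class ids are made up for illustration: classification yields one label per image, while segmentation yields one label per pixel, i.e. a mask with the same height and width as the input.

```python
# Toy 4x4 grayscale "image": bright pixels belong to the object.
image = [
    [0, 0, 0, 0],
    [0, 200, 220, 0],
    [0, 210, 230, 0],
    [0, 0, 0, 0],
]

# Image classification: a single class id for the whole image
# (hypothetical ids: 0 = "background only", 1 = "contains object").
image_label = 1 if any(p > 127 for row in image for p in row) else 0

# Segmentation: one class id per pixel. A real U-NET learns this mapping;
# a fixed threshold merely stands in for the network here.
mask = [[1 if p > 127 else 0 for p in row] for row in image]

print("image label:", image_label)
for row in mask:
    print(row)
```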

What is the easiest way to obtain hands-on experience on such a topic? Of course, participate in some relatively easy Kaggle competition, like this one (this is more about your GPU-farm handling skills and patience than Data Science, like most modern 2017+ competitions), to learn the caveats and employ the network. So let's begin! (This will be my first real participation in Kaggle where I devote a significant portion of my time to the challenge.)

This article will focus more on practical aspects of applying U-NET with keras, reviewing some existing kernels and papers that demonstrate this architecture's abilities. I try not to focus on this particular competition, but to provide as much information as may be useful for the general public or people educating themselves on the subject. Also, I have not studied GANs in depth yet.

1. U-NET's capabilities

The easiest way to judge a neural network is by its:

  • Performance on various real-life tasks;
  • Computational difficulty (how many and which GPUs you need, how long it will train);
  • Availability of examples in the community;

Applied tasks solved by U-NET

This is the list of the most notable examples I have found on the topic:

Original video explanation of how and why the network works logically:

From these examples we can immediately see the pros and cons of this type of architecture.

Pros:


  • As the examples show, this tool is really versatile and can be used for any reasonable image masking task;
  • High accuracy given proper training, adequate dataset and training time;
  • This architecture is input image size agnostic since it does not contain fully connected layers (!);
  • This also leads to smaller model weight size (for 512x512 U-NET - ca. 89 MB);
  • Can be easily scaled to have multiple classes;
  • Code samples are abundant (though none of them worked for me out of the box, given that the majority was for keras >1.2. Even keras 2.0 required some layer cropping to launch properly);
  • Pure tensorflow black-box implementation also exists (did not try it);
  • Relatively easy to understand why the architecture works, if you have basic understanding of how convolutions work;
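
"Input image size agnostic" has one practical catch, which is probably what the layer cropping mentioned above is about. The sketch below traces one spatial dimension through the network under stated assumptions: 'same'-padded convolutions (size unchanged), 2x2 max pooling without padding (size halved, rounded down) and 2x up-sampling. The `depth` parameter (number of pooling steps) and both function names are hypothetical; this is an illustration, not the code of any particular implementation.

```python
def trace_sizes(size, depth):
    """Return (encoder sizes, decoder sizes) along one spatial dimension.

    Assumes 'same'-padded convolutions, 2x2 max pooling without padding
    (floor division by 2) and 2x up-sampling.
    """
    encoder = [size]
    for _ in range(depth):
        encoder.append(encoder[-1] // 2)
    decoder = [encoder[-1]]
    for _ in range(depth):
        decoder.append(decoder[-1] * 2)
    return encoder, decoder


def needs_cropping(size, depth):
    """True if some skip connection does not match its up-sampled partner."""
    encoder, decoder = trace_sizes(size, depth)
    # decoder[i] is concatenated with the encoder output at level depth - i
    return any(decoder[i] != encoder[depth - i] for i in range(1, depth + 1))


for size in (640, 959, 960):
    print(size, trace_sizes(size, 4), needs_cropping(size, 4))
```

Under these assumptions a size that is divisible by 2 ** depth passes through cleanly, while an odd size such as 959 comes back a few pixels short, so the skip connections have to be cropped (or the input padded) before concatenation.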


Cons:

  • The architecture was initially designed to overcome the ad hoc choice of 'rolling window' / patch hyper-parameters. To get decent results, the size of the U-NET you use must be comparable to the size of the features and their surroundings (i.e. you should include some context for the model to work properly);
  • Because of its many layers, it takes a significant amount of time to train;
  • Relatively high GPU memory footprint for larger images:
    • 640x959 image => you can fit 4-8 images in one batch with 6GB GPU;
    • 640x959 image => you can fit 8-16 images in one batch with 12GB GPU;
    • 1280x1918 image => you can fit 1-2 images in one batch with 12GB GPU;
  • It is less covered in blogs / tutorials than more conventional architectures;
  • The majority of Keras implementations are for outdated Keras versions;
  • Pre-trained models are usually not widely available (it's too task specific);
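
The small weight file and the high GPU memory footprint are two sides of the same coin: convolution weights do not depend on the image size, but the feature maps do. A back-of-the-envelope sketch, where the layer shape (3x3 kernel, 3 input channels, 64 filters) is an illustrative assumption and not the configuration of any specific U-NET; real memory use also depends on the framework and on how many feature maps are kept for back-propagation:

```python
def conv_params(kernel, c_in, c_out):
    """Weights + biases of one 2D convolution; independent of image size."""
    return kernel * kernel * c_in * c_out + c_out


def activation_bytes(height, width, channels, batch, bytes_per_value=4):
    """Memory of one float32 feature map for a whole batch, in bytes."""
    return height * width * channels * batch * bytes_per_value


params = conv_params(3, 3, 64)  # the same for any input size
small = activation_bytes(640, 959, 64, batch=1)
large = activation_bytes(1280, 1918, 64, batch=1)

print("conv parameters:", params)
print("640x959 feature map, MB:", small / 1024 ** 2)
print("1280x1918 feature map, MB:", large / 1024 ** 2)
print("ratio:", large / small)
```

Doubling both image dimensions quadruples the memory of every feature map, which is why the batch size has to shrink so sharply at full resolution.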

2. More advanced tasks and use-cases

If we consider a list of more advanced U-NET usage examples, we can see some more applied patterns:

A bit more advanced U-NET implementation


  • For more advanced tasks smaller U-NETS are used in ensembles;
  • An alteration of the original U-NET design: add a dropout layer for regularization and batch-norm for regularization and training speed-up (I did not test this, I just normalized the data). I guess this can be useful for noisy data;
  • Balanced classes and proper sampling are key for noisy, high-dimensional data;
  • Serious Deep Learning people usually do competitions in teams and train 40+ models;

3. Getting more practical, we will use mostly Keras, right?

When reviewing current libraries for U-NET, it basically was a choice between:

  1. In-memory numpy arrays and / or writing your own generators;
  2. Tensorflow only or tensorflow for actual training and keras for model shortcut;
  3. Keras with tensorflow or theano back-end;
  4. Black box tensorflow model;

I will evaluate the merits of each approach from my standpoint and provide programming sugar samples that I ended up using in my models.

In memory numpy arrays

I have often seen that many Kaggle kernels or code samples rely on simply loading datasets into numpy arrays in RAM and then doing some manual preprocessing. This is insanely fast if you can read all the data into a single huge numpy array that actually fits into memory. It may be a good idea to investigate sparse data structures if your data is sparse (images usually are not really sparse).

The immediate problem with this is that images are usually stored compressed, so a 5GB dataset will probably end up needing 20-30GB of RAM once loaded into memory. Of course, a machine with 64-128GB of RAM will give you blazing fast speed, but I believe GPU processing speed will be the bottleneck anyway.
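
A rough way to check whether a dataset will fit into RAM is to count decoded bytes rather than file sizes. The sketch below is only an estimate; the image count is a hypothetical number chosen for illustration, not taken from the competition, and the function name is my own.

```python
def decoded_gigabytes(n_images, height, width, channels=3, bytes_per_value=1):
    """RAM needed to hold n decoded images in one array, in GB (1024**3 bytes)."""
    return n_images * height * width * channels * bytes_per_value / 1024 ** 3


n_images = 5000  # hypothetical dataset size

as_uint8 = decoded_gigabytes(n_images, 1280, 1918)
as_float32 = decoded_gigabytes(n_images, 1280, 1918, bytes_per_value=4)

print("uint8:   %.1f GB" % as_uint8)
print("float32: %.1f GB" % as_float32)
```

Note that converting the whole array to float32 up front quadruples the footprint, so it is usually cheaper to keep the data as uint8 and convert one batch at a time.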

Some more advanced examples actually implement data-preprocessing and generators from scratch or using cv2. In my opinion it's much easier just to use provided Keras functions.

Tensorflow only

After reading this tutorial or the code from this repository it may seem that using tensorflow directly is easy, but it's not. It is more low level, and when you actually try to power through any tutorial or example you will immediately face more low-level problems. That may be OK if you are a full-time Deep Learning researcher, but probably not when you are essentially doing this as a hobby.

Also I found some inspiring examples of using multi-GPU setups using Keras + tf mixtures, which may be really handy, but the bulk of the work is managed by Keras anyway.

I tried to find a starting point, but I was unable to get anywhere with it and abandoned this approach.

Keras with tensorflow or theano back-end

In 2017, given Google's mobile efforts and focus on machine learning, it seems reasonable to use tensorflow + keras, as it supports multi-GPU training and you can deploy your models to a mobile SDK. But with one GPU and a research set-up there is essentially no difference between Keras + tf and Keras + theano.

Keras has a whole bunch of nice flow_from_directory methods and image preprocessing sugar that can be handy for a variety of deep learning tasks, especially when you are facing overfitting issues. But you cannot really use them for regression purposes (at least it is not straightforward), because out of the box these methods support files structured like this (a folder per class):

cd /path/to/our/images/
tree imgs
 ├── 622
 │   └── 57
 │       └── 49557622.png
 ├── 651
 │   └── 03
 │       ├── 3303651.png
 │       ├── 35903651.png
 │       ├── 44603651.png
 │       └── 95403651.png
 └── 756
     ├── 55
     │   ├── 31255756.png
     │   └── 95555756.png
     ├── 61
     │   ├── 1161756.png
     │   ├── 461756.png
     │   └── 5561756.png
     ├── 64
     │   └── 58164756.png
     └── 67
         ├── 3367756.png
         ├── 3767756.png
         └── 5467756.png

The official Keras documentation gives this example to avoid this issue:

from keras.callbacks import ModelCheckpoint
from keras.models import Sequential
from keras.layers import Dense

model = Sequential()
model.add(Dense(10, input_dim=784, kernel_initializer='uniform'))
model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
# saves the model weights after each epoch if the validation loss decreased
checkpointer = ModelCheckpoint(filepath='/tmp/weights.hdf5', verbose=1, save_best_only=True)
# X_train, y_train, X_test and Y_test are assumed to be loaded elsewhere
model.fit(X_train, y_train, batch_size=128, epochs=20, verbose=0, validation_data=(X_test, Y_test), callbacks=[checkpointer])

There was even a number of discussions on Github related to extending Keras' methods (most notably this one) and a whole extensive article by AppNexus developers about overcoming Keras' limitations when training on a really high number of classes. It turned out that the solution was the simple fact that Keras can accept the following inputs for the fit_generator method:

generator: A generator. The output of the generator must be either
  • a tuple (inputs, targets)
  • a tuple (inputs, targets, sample_weights).
All arrays should contain the same number of samples. The generator is expected to loop over its data indefinitely. An epoch finishes when steps_per_epoch batches have been seen by the model.
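
The contract quoted above can be sketched without Keras at all. In this minimal example plain Python lists stand in for image arrays, and `batch_generator` is a hypothetical name; the point is only the shape of the protocol: yield (inputs, targets) tuples forever, and let steps_per_epoch define where an epoch ends.

```python
import itertools
import math


def batch_generator(inputs, targets, batch_size):
    """Yield (inputs, targets) batches forever, as fit_generator expects."""
    while True:
        for start in range(0, len(inputs), batch_size):
            end = start + batch_size
            yield inputs[start:end], targets[start:end]


# Stand-in data: plain lists instead of image and mask arrays.
inputs = list(range(10))
targets = [x * 2 for x in inputs]
batch_size = 4

# Number of batches that make up one pass over the data.
steps_per_epoch = math.ceil(len(inputs) / batch_size)

gen = batch_generator(inputs, targets, batch_size)
one_epoch = list(itertools.islice(gen, steps_per_epoch))
for x_batch, y_batch in one_epoch:
    print(x_batch, y_batch)
```

With a real model the same generator object would be passed to fit_generator together with steps_per_epoch; since the generator never stops on its own, that argument is the only thing that tells Keras when one epoch is over.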

I would like to share the final code for mask and image generators that I used for preprocessing, which is really useful:

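from keras.preprocessing import image

# NOTE: parts of this snippet were lost; arguments marked "assumed" below are
# reconstructions and may differ from the original code.
def get_generator(train_folder, train_mask_folder, valid_folder, valid_mask_folder):
    batch_size = 4
    seed = 42
    # Example taken from
    # We create two instances with the same arguments
    data_gen_args_train = dict(rescale = 1. / 255)  # assumed placeholder
    data_gen_args_masks = dict(rescale = 1. / 255)  # assumed placeholder
    image_datagen = image.ImageDataGenerator(**data_gen_args_train)
    mask_datagen = image.ImageDataGenerator(**data_gen_args_masks)
    # class_mode = None yields only images, without labels (assumed); the files
    # still have to sit in a subfolder of the given directory. The shared seed
    # (assumed) keeps images and masks in the same order with the same transforms.
    image_generator = image_datagen.flow_from_directory(
        train_folder,
        class_mode = None,
        batch_size = batch_size,
        target_size = (640, 959),
        seed = seed)
    mask_generator = mask_datagen.flow_from_directory(
        train_mask_folder,
        class_mode = None,
        batch_size = batch_size,
        target_size = (640, 959),
        seed = seed)
    valid_image_generator = image_datagen.flow_from_directory(
        valid_folder,
        class_mode = None,
        batch_size = batch_size,
        target_size = (640, 959),
        seed = seed)
    valid_mask_generator = mask_datagen.flow_from_directory(
        valid_mask_folder,
        class_mode = None,
        batch_size = batch_size,
        target_size = (640, 959),
        seed = seed)
    # combine generators into one which yields image and masks
    # (zip is lazy in Python 3; in Python 2 use itertools.izip instead)
    train_generator = zip(image_generator, mask_generator)
    valid_generator = zip(valid_image_generator, valid_mask_generator)
    return train_generator, valid_generator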
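
Zipping an image generator with a mask generator only makes sense if both apply exactly the same random transformation to each image / mask pair, which is presumably why a single seed is defined for both. The sketch below illustrates the idea with the standard library's random module standing in for Keras' augmentation; `random_shifts` is a hypothetical helper, not part of Keras.

```python
import random


def random_shifts(seed, n):
    """Stand-in for an augmenting generator: yields n random pixel shifts."""
    rng = random.Random(seed)
    for _ in range(n):
        yield rng.randint(-10, 10)


# Same seed: image and mask receive identical shifts, so they stay aligned.
aligned = list(zip(random_shifts(42, 5), random_shifts(42, 5)))

# Different seeds: the mask is shifted differently from its image.
misaligned = list(zip(random_shifts(42, 5), random_shifts(7, 5)))

print("same seed:      ", aligned)
print("different seeds:", misaligned)
```

A quick sanity check in a real pipeline is to pull one batch from the zipped generator and plot an image with its mask overlaid before starting a long training run.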

Also if you want to write a generator from scratch, these code snippets may be handy (use them as starting points for your own code):

import cv2
import numpy as np

# ids_train_split, batch_size, input_size and the augmentation helpers
# randomShiftScaleRotate / randomHorizontalFlip are defined elsewhere
def train_generator():
    while True:
        for start in range(0, len(ids_train_split), batch_size):
            x_batch = []
            y_batch = []
            end = min(start + batch_size, len(ids_train_split))
            ids_train_batch = ids_train_split[start:end]
            for id in ids_train_batch.values:
                img = cv2.imread('input/train/{}.jpg'.format(id))
                img = cv2.resize(img, (input_size, input_size))
                mask = cv2.imread('input/train_masks/{}_mask.png'.format(id), cv2.IMREAD_GRAYSCALE)
                mask = cv2.resize(mask, (input_size, input_size))
                img, mask = randomShiftScaleRotate(img, mask,
                                                   shift_limit=(-0.0625, 0.0625),
                                                   scale_limit=(-0.1, 0.1),
                                                   rotate_limit=(-0, 0))
                img, mask = randomHorizontalFlip(img, mask)
                mask = np.expand_dims(mask, axis=2)
                # collect the augmented pair, otherwise the batch stays empty
                x_batch.append(img)
                y_batch.append(mask)
            # scale pixel values to [0, 1]
            x_batch = np.array(x_batch, np.float32) / 255
            y_batch = np.array(y_batch, np.float32) / 255
            yield x_batch, y_batch

# imread, rescale_width, rescale_height and train_data_centering
# are imported / defined elsewhere
def test_generator():
    global ids_test_split
    global batch_size
    while True:
        for start in range(0, len(ids_test_split), batch_size):
            x_batch = []
            end = min(start + batch_size, len(ids_test_split))
            ids_test_split_batch = ids_test_split[start:end]
            for id in ids_test_split_batch.values:
                img = imread('data/test/{}.jpg'.format(id))
                img = cv2.resize(img, (rescale_width,rescale_height))
                # collect the resized image, otherwise the batch stays empty
                x_batch.append(img)
            x_batch = np.array(x_batch, np.float32)
            yield train_data_centering(x_batch)

Black box model

Did not go in depth with this approach, but you basically have a pre-defined model (unless you want to go under the hood to learn tensorflow) that you can use on images in folders with a specific file format. Cannot say it's bad, but having no image preprocessing sugar makes this option worse than pure tensorflow or keras + tf approach because this lacks Keras' flexibility.

Also if for some reason I wanted to test a model with dropout and batch_norm, I would not be able to do it.

4. What is next?

Next I will try as many hypotheses as I can in the above challenge and will report all my code and findings.