A long journey with 1.5m of flat photos - Part 1 - proof of concept

It may end well, or it may end in nothing - let us see together

Posted by snakers41 on July 8, 2017

Although I consider myself proficient in foreign languages, this is my first post on my website in English. Enjoy.

TL;DR

I personally like starting all my articles with a mysterious picture... but if you are reading this, you probably have a hunch what this is...


0. How it started

This article will be a bit different from the rest, because I cannot publish the code and all the details for personal reasons.

Without further ado: one day I woke up and found ssh server access in one of my Telegram chats. The machine I ssh-ed into had some peculiar details:

  • It was a virtual machine inside a VMware container;
  • It had a Titan X GPU passed through from the host machine via kvm-qemu, which is kinda advanced stuff (for GPU benchmarks for deep learning go here);
  • (Also it had a second similar GPU not passed through);
  • It had ca. 1.5m random flat photos on it;
  • Also I found this paper in my Telegram chat;


Well, what can you do with 1.5m photos (obviously scraped from websites I am kinda familiar with - Russian flat booking / purchase websites) and a bleeding-edge GPU? Of course train neural networks (or mine cryptocurrencies)! But for that you need annotation and a clear purpose. The purpose may be kinda obvious:

  • Wall carpet recognition (a running joke in Russia); 
  • Architectural style detection;
  • Finding similar objects / flats;
  • Finding objects like TVs, fridges, etc;

There was no annotated subset of these pictures. So I decided to have fun with them and see what I could squeeze from the dataset in a couple of hours using just plain-vanilla unsupervised learning methods and off-the-shelf pre-trained neural networks.

First of all, I searched the Net for a moment and found a list of amazing articles and results:

[Image: top models of the last 5-10 years in ImageNet competitions]

So, having no training dataset, the best thing we can do is take a couple of existing big pre-trained neural networks and play with their layers. Let's begin.

1. Preliminary investigation

First of all, let's count all the pictures we have (I will omit some steps as the owner of the machine asked me not to show any real details).

import os
import pandas as pd

currentDir = os.getcwd()
images_path = '/mnt/dataset1/Images/'
pic_path = '/mnt/dataset1/Images/Download/'
# the list of all downloaded images
pic_list = pd.read_csv(images_path + "downloadedImages.txt")
pic_list.shape


We get (1530210, 1) ~ 1.5m pictures. Not bad. I will not go into great detail about my analysis of where the pictures came from / their sizes / formats etc. It's enough to say that they come from ca. 10-15 top Russian flat boards, are usually dull pictures of Russian flats, and are mostly ~HD quality with mediocre lighting, in JPG and PNG formats.
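
If you want to run a quick sanity check on the formats and sizes yourself, here is a minimal sketch (assuming PIL/Pillow is installed and that the raw column holds relative file paths, as it is used below):

from PIL import Image

# sample a few files and print their format and resolution
for fname in pic_list.raw.sample(5):
    with Image.open(pic_path + fname) as img:
        print(fname, img.format, img.size)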

Let's create a subset of pictures for our purposes.

import numpy as np

# take the pictures from one of the folders and shuffle them
pic_array = pic_list[pic_list['2_split']=='folder1'].raw.values
pic_array = np.random.permutation(pic_array)

# keep a random sample of 10,000
pic_array_shuf = pic_array[0:10000]
pic_array_shuf.size


Because the pictures are located on an external mounted volume, let's also copy them to our virtual drive with a simple set of commands:

from shutil import copyfile

# copy the sampled pictures into a local test folder
for pic in log_progress(pic_array_shuf):
    orig_path = pic_path + pic
    copyfile(orig_path, test_path + pic.split('/')[5])

%ls -ls $test_path | wc -l


Now we have 10,000 randomly selected pictures in our folder. Why 10,000? To keep all the calculations within a reasonable amount of time. This is only an investigation, after all.

2. Naive approach

So. We have 10,000 pictures and we do not know anything about them except for their names, paths and sizes. Let's steal a keras documentation example from here and do the following.

Oh, I have not mentioned that to do that within reasonable time, you need to set up keras and your GPU properly. My dependency list looks small:

sudo passwd
sudo apt-get install tmux
sudo pip3 install jupyter_contrib_nbextensions
sudo pip3 install jupyter_nbextensions_configurator
sudo jupyter nbextensions_configurator enable --user
sudo pip3 install numpy 
sudo pip3 install matplotlib 
sudo pip3 install keras
sudo pip3 install tensorflow 
sudo pip3 install sklearn
sudo apt install glances
cd ~/
cd flat-nn/
jupyter notebook --no-browser --port=8888 --ip=server-ip


But to run keras on the GPU properly you also need to set up the CUDA drivers and do some configuration. I posted a compilation here on my Telegram channel. Initially this config was taken from the fast.ai forums. Key parts of this config include (this is very system-specific, use with caution, and replace pip with pip3 if necessary):

# install and configure theano
pip install theano
echo "[global]
device = gpu
floatX = float32
[cuda]
root = /usr/local/cuda" > ~/.theanorc
# install and configure keras
pip install keras==1.2.2
mkdir ~/.keras
echo '{
    "image_dim_ordering": "th",
    "epsilon": 1e-07,
    "floatx": "float32",
    "backend": "theano"
}' > ~/.keras/keras.json
# install cudnn libraries
wget "http://platform.ai/files/cudnn.tgz" -O "cudnn.tgz"
tar -zxf cudnn.tgz
cd cuda
sudo cp lib64/* /usr/local/cuda/lib64/
sudo cp include/* /usr/local/cuda/include/


So, having installed everything, let's start:

# Extract features from an arbitrary intermediate layer with VGG19
from keras.applications.vgg19 import VGG19
from keras.preprocessing import image
from keras.applications.vgg19 import preprocess_input
from keras.models import Model
import numpy as np


If you did everything properly, you should receive some variation of this message:

Using gpu device 0: GeForce GTX TITAN X (CNMeM is disabled, cuDNN not available)


Let's see what is inside the VGG-19 model:

base_model = VGG19(weights='imagenet')
base_model.summary()


We can take fc1 or fc2 and run some tests with them. They are supposed to contain high-level abstract features that recognize shapes, small objects, corners, text, eyes, etc.

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         (None, 224, 224, 3)       0         
_________________________________________________________________
block1_conv1 (Conv2D)        (None, 224, 224, 64)      1792      
_________________________________________________________________
block1_conv2 (Conv2D)        (None, 224, 224, 64)      36928     
_________________________________________________________________
block1_pool (MaxPooling2D)   (None, 112, 112, 64)      0         
_________________________________________________________________
block2_conv1 (Conv2D)        (None, 112, 112, 128)     73856     
_________________________________________________________________
block2_conv2 (Conv2D)        (None, 112, 112, 128)     147584    
_________________________________________________________________
block2_pool (MaxPooling2D)   (None, 56, 56, 128)       0         
_________________________________________________________________
block3_conv1 (Conv2D)        (None, 56, 56, 256)       295168    
_________________________________________________________________
block3_conv2 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv3 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_conv4 (Conv2D)        (None, 56, 56, 256)       590080    
_________________________________________________________________
block3_pool (MaxPooling2D)   (None, 28, 28, 256)       0         
_________________________________________________________________
block4_conv1 (Conv2D)        (None, 28, 28, 512)       1180160   
_________________________________________________________________
block4_conv2 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv3 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_conv4 (Conv2D)        (None, 28, 28, 512)       2359808   
_________________________________________________________________
block4_pool (MaxPooling2D)   (None, 14, 14, 512)       0         
_________________________________________________________________
block5_conv1 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv2 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv3 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_conv4 (Conv2D)        (None, 14, 14, 512)       2359808   
_________________________________________________________________
block5_pool (MaxPooling2D)   (None, 7, 7, 512)         0         
_________________________________________________________________
flatten (Flatten)            (None, 25088)             0         
_________________________________________________________________
fc1 (Dense)                  (None, 4096)              102764544 
_________________________________________________________________
fc2 (Dense)                  (None, 4096)              16781312  
_________________________________________________________________
predictions (Dense)          (None, 1000)              4097000   
=================================================================
Total params: 143,667,240
Trainable params: 143,667,240
Non-trainable params: 0
_________________________________________________________________



base_model = VGG19(weights='imagenet')
# use the second fully-connected layer as our feature extractor
model = Model(inputs=base_model.input, outputs=base_model.get_layer('fc2').output)

images_path = '/mnt/dataset1/Images/'
pic_path = '/mnt/dataset1/Images/Download/'
test_path = currentDir + '/flat_pics/'

# get_batches is a helper wrapper around keras' image generators (not shown here)
def predict_batches(model, path, batch_size=8):
    test_batches = get_batches(
        path,
        shuffle=False,
        batch_size=batch_size,
        class_mode=None,
        imageSizeTuple=(224, 224)
    )
    return \
        test_batches, \
        model.predict_generator(test_batches, test_batches.samples / batch_size)

test_batches, preds = predict_batches(model, test_path, batch_size=128)
preds.shape


So we get a (10000, 4096) matrix filled with VGG-19 features (the plain imagenet-pretrained model works for this purpose). Ideally we would do the following (see the sketch after the list):

  • Stack two neural networks together using the keras functional API;
  • Set all the layers except for the new ones as non-trainable;
  • Train a siamese NN for a day to a week and see the results;

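A minimal, untested sketch of that siamese idea (the layer choices and the L1-distance head are my assumptions, not a tried recipe):

# freeze the pre-trained VGG-19 base and train only a small head
# on top of paired inputs (hypothetical setup)
from keras.applications.vgg19 import VGG19
from keras.models import Model
from keras.layers import Input, Dense, Lambda
import keras.backend as K

base_model = VGG19(weights='imagenet')
feature_extractor = Model(inputs=base_model.input,
                          outputs=base_model.get_layer('fc2').output)
for layer in feature_extractor.layers:
    layer.trainable = False  # keep the pre-trained weights intact

input_a = Input(shape=(224, 224, 3))
input_b = Input(shape=(224, 224, 3))
features_a = feature_extractor(input_a)
features_b = feature_extractor(input_b)

# L1 distance between the two embeddings, squashed into a similarity score
distance = Lambda(lambda t: K.abs(t[0] - t[1]))([features_a, features_b])
similarity = Dense(1, activation='sigmoid')(distance)

siamese = Model(inputs=[input_a, input_b], outputs=similarity)
siamese.compile(optimizer='adam', loss='binary_crossentropy')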

...but we have no annotation, remember? So let's do the easiest thing possible: let's use the sklearn version of the affinity propagation algorithm to calculate distances between the 4096-dimensional VGG-19 feature vectors. This is really inefficient because it is not parallelized and it's not using the GPU. Of course we could modify this example and calculate everything in seconds, but let's just be lazy and wait for a couple of minutes instead of thinking too much =). Let's also fill the diagonal elements of the matrix with a large negative value (a picture is most similar to itself - we do not need that).

from sklearn.cluster import AffinityPropagation

af = AffinityPropagation().fit(preds)
np.save('/mnt/virtfs/home/av/flat-nn/flat-nn/aff_matrix', af.affinity_matrix_)
aff_matrix = af.affinity_matrix_

# the affinity matrix holds negative squared distances,
# so take the most different picture before touching the diagonal
most_different = aff_matrix.argmin(axis=1)

# a picture is most similar to itself - mask the diagonal out
# with a large negative value before taking the argmax
np.fill_diagonal(aff_matrix, -9000)
most_similar = aff_matrix.argmax(axis=1)

pic_filenames = test_batches.filenames
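
By the way, "calculating everything in seconds" is not hard here: with the default euclidean setting, sklearn's affinity matrix is just the negative squared euclidean distance matrix, which numpy can compute vectorized in one go (a sketch, assuming preds fits in memory twice):

# negative squared euclidean distances, fully vectorized:
# ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * a.b
sq_norms = (preds ** 2).sum(axis=1)
aff_matrix = -(sq_norms[:, None] + sq_norms[None, :] - 2 * np.dot(preds, preds.T))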


3. Naive results

Let's see what our naive approach gives us! Do not forget that this was just exploration without any data annotation whatsoever. Also note that I did not do any proper picture size analysis and adjustment - I just fed everything to keras as-is.

# utilities for easy plot generation
import matplotlib.pyplot as plt

def plots(ims,
          figsize=(12,6),
          rows=1,
          interp=False,
          titles=None):
    # accept both arrays of images and lists of PIL images
    if type(ims[0]) is np.ndarray:
        ims = np.array(ims).astype(np.uint8)
        if (ims.shape[-1] != 3):
            # channels-first -> channels-last for imshow
            ims = ims.transpose((0,2,3,1))
    f = plt.figure(figsize=figsize)
    for i in range(len(ims)):
        sp = f.add_subplot(rows, len(ims)//rows, i+1)
        if titles is not None:
            sp.set_title(titles[i], fontsize=18)
        plt.imshow(ims[i], interpolation=None if interp else 'none')

def plots_idx(idx, titles, path, filenames, figsize):
    plots([image.load_img(path + filenames[i]) for i in idx],
          titles=titles, figsize=figsize, rows=1)

# for each of the first 9 pictures, plot the picture itself,
# its most different and its most similar counterpart
for pic in np.arange(9):
    idx = [pic, most_different[pic], most_similar[pic]]
    titles = ['pic', 'most_different', 'most_similar']
    plots_idx(idx, titles, test_path, pic_filenames, figsize=(10, 20))


See some random examples below.

[Images: example triplets - a picture, its most different and its most similar counterpart]

Apart from the awful Soviet toilet, which pops up randomly, we can see that this naive approach:

  • Detects copies of the same picture easily;
  • Can distinguish corners;
  • Can distinguish outdoor photos vs. indoor photos;
  • Can distinguish city landscapes and sea views;
  • Can distinguish flat floor maps;


Not bad for a totally unrelated model (plain imagenet weights) without any data processing and / or training on our random images! Imagine what the results would be if we trained our own classifier.

4. Let's go deeper

Let's use a t-SNE projection (or plain PCA) to project our 4,096-dimensional data onto a 2D plane.

%%time
from sklearn.manifold import TSNE

tsne = TSNE(random_state=17)
X_tsne = tsne.fit_transform(preds)

plt.figure(figsize=(12,10))
plt.scatter(X_tsne[:, 0], X_tsne[:, 1],
            edgecolor='none', alpha=0.7, s=40,
            cmap=plt.cm.get_cmap('nipy_spectral', 10))
plt.title('Flats VGG-19 last FCN layer. t-SNE projection')
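
The plain-PCA alternative mentioned above is a two-liner and runs much faster, though its clusters are typically less pronounced than t-SNE's (a sketch):

from sklearn.decomposition import PCA

# linear 2D projection of the same 4096-dimensional features
X_pca = PCA(n_components=2).fit_transform(preds)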

We can clearly spot some clusters here:

[Image: t-SNE scatter plot of the 10,000 pictures]

What is remarkable is that the clusters clearly correspond to different picture types.

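This is easy to check: take the points that fall into one blob and plot the corresponding pictures (a sketch - the bounding-box coordinates here are hypothetical, read them off your own plot):

# pick the points inside one visual cluster of the t-SNE plane
cluster_mask = (X_tsne[:, 0] > 20) & (X_tsne[:, 1] > 10)
cluster_idx = np.where(cluster_mask)[0][:6]
plots_idx(cluster_idx, titles=None, path=test_path,
          filenames=pic_filenames, figsize=(16, 8))
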
Floor maps


Exteriors of blocks of flats



And interiors



Not bad for exploratory data analysis, huh?