What did the bird say? Part 7 - full dataset preprocessing (169GB)

Or how I prepared a huge dataset for playing with neural networks - 169GB of bird songs

Posted by snakers41 on August 7, 2017


Well, basically this article in a nutshell


Article list

What did the bird say? Bird voice recognition. Part 1 - the beginning

What did the bird say? Bird voice recognition. Part 2 - taxonomy

What did the bird say? Bird voice recognition. Part 3 - Listen to the birds

What did the bird say? Bird voice recognition. Part 4 - Dataset choice, data download and pre-processing, visualization and analysis

What did the bird say? Bird voice recognition. Part 5 - Data pre-processing for CNNs

What did the bird say? Bird voice recognition. Part 6 - Neural network MVP

What did the bird say? Bird voice recognition. Part 7 - full dataset preprocessing (169GB)

What did the bird say? Bird voice recognition. Part 8 - fast Squeeze-net on 800k images - 500s per epoch (10-20x speed up)

0. TLDR

Preprocessing 169GB of .mp3 files is really tricky if you are using high-level programming libraries because:

  • Multi-threading may be tricky if you have no prior experience in systems programming / high-load python programming;
  • If you rely on high-level libraries for data preprocessing, speed may be an issue (rewriting audio preprocessing in something like pytorch or tensorflow can be even trickier and is not worth the time investment if you can just wait a couple of days for the scripts to finish) - audio unpacking and spectrogram extraction take ca. 10x more time than all the other data manipulations combined;
  • Though the input data weighs a lot (~150GB), the output images weigh only ca. 2-3GB, which is a couple of orders of magnitude less. This is really nice for neural network training and exploration (CNN weight size grows roughly with the square of the image dimensions, e.g. if picture resolution increases 2x, the weights become ca. 4x heavier);


Obvious top-level advice (probably applicable to any domain):

  • Divide and conquer - divide a task into subtasks and slowly gnaw at each of them;
  • If you know an 'ugly' but doable solution - it's probably best to just go for it;
  • Do not try to have more than 2 degrees of freedom in each task that is new to you (ideally have only one); e.g. if you have to write the preprocessing logic and implement multi-threading, you probably should not add any more sophistication on top of that;
  • Try as many simple approaches as fast as you can;


But I nailed it. See below how.

1. Choosing the dataset

We finished the previous article with a proof-of-concept CNN that could tell birds apart by their songs with ~70% accuracy - decent, but not really good. The obvious next step is to increase the dataset size 10-100x: it is well known that drastically increasing the dataset size lets neural networks learn much better, often better than expected.

As you may have noticed, the MVP pipeline consisted of the following stages:

  • Choose a balanced dataset;
  • Download the files;
  • Extract spectrograms, cut them into sliding windows and save them as images;


I will not bore you with the details of the implementation (as usual, I will just provide my scripts and notebooks at the end of the article), but basically I just compared how many songs we can get per bird genus with a colloquial name (vgenus) versus per bird species. It turns out that we can have either ca. 200 bird genera with at least 50 songs per genus, or ca. 1500-2000 bird species with a much lower number of songs per species.

Though going for bird genus may not make much sense scientifically, I decided to go for genus instead of species to ensure that we can ship a reasonable application.

merged_df = pd.read_csv('merged_data.csv')
sample_df = pd.DataFrame()
cols_meta = ['id','area', 'l_genus', 'class', 'vclass','order', 'vorder', 'family', 'vfamily', 'vgenus', 'sp', 'file']
# Download only files with vgenus count of bird songs > 50
for vgenus in merged_df[pd.notnull(merged_df.vgenus)].vgenus.unique():
    if(merged_df[merged_df['vgenus']==vgenus].shape[0] > 50):
        sample_df = sample_df.append(merged_df[merged_df['vgenus']==vgenus][cols_meta])
        
sample_df['processed'] = 0
sample_df['length'] = 0
sample_df['sr'] = 0
sample_df['mel_size'] = 0
cols_collect = ['id', 'file', 'processed','length','sr','mel_size']
sample_df


2. Downloading the bird songs with multi-curl (multicurl)

In a nutshell, it's not difficult to download 150+GB of files, but it's more difficult to do so file by file, in a reasonable amount of time and utilizing 100% of the network speed. Therefore I started with a bit of research. It turned out that advanced web-scraping is not really popular nowadays, and the most obvious candidate - the multi-curl library - has not been updated for ca. 5+ years. My friend advised me to use this high-level library for professional crawlers (also check out the developer's list of projects, it is impressive) that has multi-curl as a dependency, but in the end I decided that this is overkill and just found a working example of a multi-curl implementation.


In the end I used a slightly modified version of the multi-curl download script. It allowed me to download ca. 130k files in a very reasonable amount of time, most of it utilizing ~50% of my network connection speed.

# Usage: python retriever-multi.py <file with URLs to fetch> [<# of
#          concurrent connections>]

import os.path
import pandas as pd
import numpy as np
import sys
import pycurl
current_dir = os.getcwd()
FILE_PATH = os.path.join(current_dir,'bird_songs')
sample_df = pd.read_csv(os.path.join(current_dir,'sample_processed.csv'))
sample_df = sample_df.set_index('id')
# We should ignore SIGPIPE when using pycurl.NOSIGNAL - see
# the libcurl tutorial for more info.
try:
    import signal
    from signal import SIGPIPE, SIG_IGN
    signal.signal(signal.SIGPIPE, signal.SIG_IGN)
except ImportError:
    pass
num_conn = 10
'''
try:
    if sys.argv[1] == "-":
        urls = sys.stdin.readlines()
    else:
        urls = open(sys.argv[1]).readlines()
    if len(sys.argv) >= 3:
        num_conn = int(sys.argv[2])
except:
    print "Usage: %s <file with URLs to fetch> [<# of concurrent connections>]" % sys.argv[0]
    raise SystemExit
'''
# Make a queue with (url, filename) tuples
queue = []
for index, row in sample_df[sample_df['processed']==0].iterrows():
    url = row.file.strip()
    if not url or url[0] == "#":
        continue
    filename = FILE_PATH+'/{}.mp3'.format(str(index))
    queue.append((url, filename,index))
    
# Check args
assert queue, "no URLs given"
num_urls = len(queue)
num_conn = min(num_conn, num_urls)
assert 1 <= num_conn <= 10000, "invalid number of concurrent connections"
print ("PycURL %s (compiled against 0x%x)" % (pycurl.version, pycurl.COMPILE_LIBCURL_VERSION_NUM))
print ("----- Getting", num_urls, "URLs using", num_conn, "connections -----")
# Pre-allocate a list of curl objects
m = pycurl.CurlMulti()
m.handles = []
for i in range(num_conn):
    c = pycurl.Curl()
    c.fp = None
    c.setopt(pycurl.FOLLOWLOCATION, 1)
    c.setopt(pycurl.MAXREDIRS, 5)
    c.setopt(pycurl.CONNECTTIMEOUT, 30)
    c.setopt(pycurl.TIMEOUT, 300)
    c.setopt(pycurl.NOSIGNAL, 1)
    m.handles.append(c)
# Main loop
freelist = m.handles[:]
num_processed = 0
while num_processed < num_urls:
    # If there is an url to process and a free curl object, add to multi stack
    while queue and freelist:
        url, filename, file_id = queue.pop(0)
        c = freelist.pop()
        c.fp = open(filename, "wb")
        c.setopt(pycurl.URL, url)
        c.setopt(pycurl.WRITEDATA, c.fp)
        m.add_handle(c)
        # store some info
        c.filename = filename
        c.url = url
        c.file_id = file_id
    # Run the internal curl state machine for the multi stack
    while 1:
        ret, num_handles = m.perform()
        if ret != pycurl.E_CALL_MULTI_PERFORM:
            break
    # Check for curl objects which have terminated, and add them to the freelist
    while 1:
        num_q, ok_list, err_list = m.info_read()
        for c in ok_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            # print ("Success:", c.filename, c.url, c.getinfo(pycurl.EFFECTIVE_URL))
            sample_df.set_value(index = c.file_id, col = 'processed', value=1)  
            
            if (num_processed % 50  == 0) & (num_processed>0):
                print('{} / {} items processed'.format(num_processed,num_urls))
                sample_df.to_csv('sample_processed.csv')
         
            freelist.append(c)
        for c, errno, errmsg in err_list:
            c.fp.close()
            c.fp = None
            m.remove_handle(c)
            print ("Failed: ", c.filename, c.url, errno, errmsg)
            freelist.append(c)
        num_processed = num_processed + len(ok_list) + len(err_list)
        if num_q == 0:
            break
    # Currently no more I/O is pending, could do something in the meantime
    # (display a progress bar, etc.).
    # We just call select() to sleep until some more data is available.
    m.select(1.0)
# Cleanup
for c in m.handles:
    if c.fp is not None:
        c.fp.close()
        c.fp = None
    c.close()
m.close()


3. Preprocessing the sound with librosa, numpy and scipy

If you have read the previous articles, you have probably seen that I adopted the following pipeline for file preprocessing (a condensed sketch follows the list):

  • Use the librosa library to produce sound spectrograms;
  • Normalize them by applying logarithms;
  • Cut them into 5s rolling windows with 3s steps (an arbitrary choice made by manually inspecting the data);
  • Save them as images;
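
Condensed, the core transform looks roughly like this (the file id and output path are placeholders; the full scripts below add logging, timeouts, train/test folders and dataframe bookkeeping on top of it):

import numpy as np
import librosa
import scipy.misc

slice_length = 200   # ~5s of mel frames at librosa's default hop length
gap_length = 100     # frames skipped between consecutive slices

y, sr = librosa.load('bird_songs/12345.mp3')              # mp3 decoding - the slow part
S = librosa.feature.melspectrogram(y, sr=sr, n_mels=64)   # mel spectrogram
log_S = librosa.logamplitude(S, ref_power=np.max)         # log-amplitude normalization

pointer, counter = 0, 0
while pointer < log_S.shape[1]:
    # save each ~5s slice as a small greyscale image
    scipy.misc.imsave('sound_features/12345_{}.jpg'.format(counter),
                      log_S[:, pointer:pointer + slice_length])
    pointer = pointer + slice_length + gap_length
    counter = counter + 1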


When making my final decision on which preprocessing approach to use, I mostly kept in mind advice from this article and the below video.


After writing the full end-to-end preprocessing and logging script, I timed it (see the sketch after this list) and realized that:

  • Processing each file takes ca. 0.25-0.7s, and mp3 unpacking takes ca. 90% of that time (basically all the other operations combined take 10x-100x less time);
  • Ca. 5-7% of all .mp3 files just do not open, and python hangs forever trying to process them - probably because of corrupted files, or because the library is just unstable (it was released early this year, after all);
  • 0.5s * 130,000 ~ a couple of days' worth of computation - a 3-10x speed-up would be really welcome;
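
A minimal way to check where the time goes (an untested sketch; the file path is a placeholder):

import time
import numpy as np
import librosa

start = time.time()
y, sr = librosa.load('bird_songs/12345.mp3')               # mp3 decoding - the dominant cost
decode_time = time.time() - start

start = time.time()
S = librosa.feature.melspectrogram(y, sr=sr, n_mels=64)
log_S = librosa.logamplitude(S, ref_power=np.max)          # spectrogram + log scaling
feature_time = time.time() - start

print('decode: {:.2f}s, features: {:.2f}s'.format(decode_time, feature_time))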


So, since we cannot avoid using a high-level library for sound processing (rewriting it does not make sense given the time constraints), a multi-threading approach could in theory give a speed-up of 4x (the number of cores in my CPU) or 8-12x (2 or 3 processes per core). When I started researching multi-threading, I mostly found a lot of old, stale boilerplate code where junk takes up ca. 90% of the space. Making sure that stale processes get terminated also has to be taken into consideration.

Given that, I found a few amazing blog posts / articles on the topic that make life much easier:


After painful examination of the insides of python libraries and after testing a lot of scripts, I debugged the following multi-threaded script. There was just one problem - it quickly processed 4-12 items and then just hung (regardless of the number of workers - 4 survived a bit longer, 4+ died after one pass).

The worst thing is that I cannot really understand why. My working hypothesis is that it has something to do with the ffmpeg decoder used by the librosa library - it probably has some issues with multi-threading, which I had no time to investigate.

import os.path
import requests
import pandas as pd
import numpy as np
import time
import librosa
import signal
import scipy.misc
import os
from multiprocessing.dummy import Pool as ThreadPool 
# hyper params
global_counter = 0
slice_length = 200 
gap_length = 100
train_test = 'train'
fraction = 0.7
CURRENT_PATH = os.getcwd()
AUDIO_PATH = os.path.join(os.getcwd(),'bird_songs')
MEL_PATH = os.path.join(os.getcwd(),'sound_features')
sample_df = pd.read_csv(os.path.join(CURRENT_PATH,'feature_extraction_database.csv'))
sample_df = sample_df.set_index('id')
subsample = sample_df.sample(frac=fraction)

total_list_length = len(list(subsample.index.values))
# note: the [0:100] slice limits the run for testing - drop it to process the whole subsample
list1 = list(subsample.index.values)[0:100]
list2 = list(subsample.vgenus_fold.values)[0:100]
def preprocess_sound_timeout(file_id,folder):
    pool = ThreadPool(processes=1)
    try:
        return pool.apply_async(preprocess_sound, [file_id,folder]).get(timeout=15)
    except BaseException as e:
        print(str(e))
        return 0
    
def preprocess_sound(file_id,folder):
    global sample_df
    global slice_length
    global gap_length
    global time_slices
    global global_counter
    global train_test
    
    try:
        if (global_counter % 50 == 0) & (global_counter>0):
            print('{} / {} items processed'.format(global_counter,total_list_length))
            sample_df.to_csv('feature_extraction_database.csv')
                
        y, sr = librosa.load(os.path.join(AUDIO_PATH,str(file_id)+'.mp3'))
        S = librosa.feature.melspectrogram(y, sr=sr, n_mels=64)
        log_S = librosa.logamplitude(S, ref_power=np.max)
        
        end = time.time()
    
        # Slice the data into ~5s slices
        pointer = 0
        counter = 0
        song_length = log_S.shape[1]
        while (pointer<song_length):
            scipy.misc.imsave(os.path.join(MEL_PATH,str(train_test),str(folder),str(file_id))+'_'+str(counter)+'.jpg',log_S[:,pointer:pointer+slice_length])
            pointer = pointer+slice_length+gap_length
            counter = counter+1
        
        
        # Save results to the global dataframe with the results 
        sample_df.set_value(index = file_id, col = 'features_extracted', value = 1)
        sample_df.set_value(index = file_id, col = 'cuts', value = counter)
        sample_df.set_value(index = file_id, col = 'mel_size', value=log_S.itemsize)
        sample_df.set_value(index = file_id, col = 'length_s', value=y.shape[0]/sr)
        
        del log_S,S,y,sr
   
    except BaseException as e:
        print(str(e))
        return 0
    else:
        global_counter = global_counter + 1
        print('Item processed: {}'.format(file_id))
        return 1
    
pool = ThreadPool(24) 
results = pool.starmap(preprocess_sound_timeout, zip(list1, list2))
pool.close()
pool.join()


There is probably a reason why all multi-processing code examples show only the simplest workers - writing to files, opening files, making HTTP calls, etc.
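
For reference, a process-based pool (as opposed to the thread pool above) would sidestep any thread-safety issues in the native decoder, since every worker runs in its own interpreter. Below is a rough, untested sketch reusing the preprocess_sound worker from the script above; note that the global dataframe bookkeeping would have to be reworked (e.g. by collecting the return values), because child processes do not share the parent's globals, and a per-file timeout would still be needed for files that never finish decoding:

from multiprocessing import Pool

if __name__ == '__main__':
    # maxtasksperchild recycles workers periodically, so a misbehaving
    # decoder only takes down a single worker process
    with Pool(processes=4, maxtasksperchild=50) as pool:
        results = pool.starmap(preprocess_sound, zip(list1, list2))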

Having no time to spare, I decided to use the old and faithful approach of just slowly iterating over all of the data, which produced the following script. Notice the .sample(frac=1) call on the dataframe. It shuffles the rows, because I learned from experience that bad files usually cluster in big chunks, which makes evaluating the script's throughput impossible.

import os.path
import pandas as pd
import numpy as np
import time
import librosa
import signal
import scipy.misc
import os

def printProgressBar (iteration, total, prefix = '', suffix = '', decimals = 1, length = 100, fill = '█'):
    """
    Call in a loop to create terminal progress bar
    @params:
        iteration   - Required  : current iteration (Int)
        total       - Required  : total iterations (Int)
        prefix      - Optional  : prefix string (Str)
        suffix      - Optional  : suffix string (Str)
        decimals    - Optional  : positive number of decimals in percent complete (Int)
        length      - Optional  : character length of bar (Int)
        fill        - Optional  : bar fill character (Str)
    """
    percent = ("{0:." + str(decimals) + "f}").format(100 * (iteration / float(total)))
    filledLength = int(length * iteration // total)
    bar = fill * filledLength + '-' * (length - filledLength)
    print('\r%s |%s| %s%% %s' % (prefix, bar, percent, suffix), end = '\r')
    # Print New Line on Complete
    if iteration == total: 
        print()
# hyper params
global_counter = 0
slice_length = 200 
gap_length = 100
train_test = 'train'
CURRENT_PATH = os.getcwd()
AUDIO_PATH = os.path.join(os.getcwd(),'bird_songs')
MEL_PATH = os.path.join(os.getcwd(),'sound_features')
sample_df = pd.read_csv(os.path.join(CURRENT_PATH,'feature_extraction_database.csv'))
sample_df = sample_df.set_index('id')
subsample = sample_df[(sample_df['train']==1)&(sample_df['features_extracted']==0)].sample(frac=1)
total_list_length = len(list(subsample.index.values))
# note: the [0:5] slice limits the run for testing - drop it to process the whole subsample
list1 = list(subsample.index.values)[0:5]
list2 = list(subsample.vgenus_fold.values)[0:5]
class TimeoutException(Exception):   # Custom exception class
    pass
def timeout_handler(signum, frame):   # Custom signal handler
    raise TimeoutException
    
def preprocess_sound(file_id,folder):
    global sample_df
    global slice_length
    global gap_length
    global time_slices
    global global_counter
    global train_test
    
    signal.signal(signal.SIGALRM, timeout_handler)
    signal.alarm(5)
    
    try:
        if (global_counter % 50 == 0) & (global_counter>0):
            
            printProgressBar(global_counter, total_list_length, prefix = 'Progress:', suffix = 'Complete', length = 50)
            print('{} / {} items processed'.format(global_counter,total_list_length))
            
            sample_df.to_csv('feature_extraction_database.csv')
                
        y, sr = librosa.load(os.path.join(AUDIO_PATH,str(file_id)+'.mp3'))
        S = librosa.feature.melspectrogram(y, sr=sr, n_mels=64)
        log_S = librosa.logamplitude(S, ref_power=np.max)
        
        end = time.time()
        # Slice the data into ~5s slices
        pointer = 0
        counter = 0
        song_length = log_S.shape[1]
        while (pointer<song_length):
            scipy.misc.imsave(os.path.join(MEL_PATH,str(train_test),str(folder),str(file_id))+'_'+str(counter)+'.jpg',log_S[:,pointer:pointer+slice_length])
            pointer = pointer+slice_length+gap_length
            counter = counter+1
        
        # Save results to the global dataframe with the results 
        sample_df.set_value(index = file_id, col = 'features_extracted', value = 1)
        sample_df.set_value(index = file_id, col = 'cuts', value = counter)
        sample_df.set_value(index = file_id, col = 'length_s', value=y.shape[0]/sr)
        
        del log_S,S,y,sr
    except TimeoutException:
        # must come before the generic handler below - otherwise the
        # BaseException catch-all would swallow the timeout as well
        print('Timeout for item {}'.format(str(file_id)))
        return 0
    except BaseException as e:
        signal.alarm(0)
        print(str(e))
        return 0
    else:
        signal.alarm(0)
        global_counter = global_counter + 1
        print('Item processed: {}'.format(file_id))
        return 1
printProgressBar(0, total_list_length, prefix = 'Progress:', suffix = 'Complete', length = 50)
for file_id, folder in zip(list1, list2):
    preprocess_sound(file_id, folder)


After running the above script for a couple of days, I managed to pre-process almost all of the files I had. Interestingly enough, ~169GB of bird sounds compressed into ca. 2-3GB of spectrograms, which is nice.



Compression level



File count



Mesmerizing progress indicator


4. Downloads and links

As usual, I attach the files that I used, but this week they are a bit messy because I had to do a lot of fiddling with data and scripts:


Phew, all of this required a lot of fiddling with scripts, data and python (and was sometimes frustrating)! I hope this will help you with your own endeavors.