What did the bird say? Bird voice recognition. Part 1 - the beginning
What did the bird say? Bird voice recognition. Part 2 - taxonomy
What did the bird say? Bird voice recognition. Part 3 - Listen to the birds
What did the bird say? Bird voice recognition. Part 4 - Dataset choice, data download and pre-processing, visualization and analysis
What did the bird say? Bird voice recognition. Part 5 - Data pre-processing for CNNs
What did the bird say? Bird voice recognition. Part 6 - Neural network MVP
What did the bird say? Bird voice recognition. Part 7 - full dataset preprocessing (169GB)
What did the bird say? Bird voice recognition. Part 8 - fast Squeeze-net on 800k images - 500s per epoch (10-20x speed up)
There is such a variety of species! Who should we choose, how and what to do with the data?
1. What is next?
Now that we have a basic grasp of our bird data and have listened extensively to their calls, let's choose a random subset and think about feature preprocessing for our neural networks. We have to make sure that:
- For a proof of concept our dataset is not too big (remember - we have access to 200k bird calls in the xeno-canto database), so that neural network training is fast;
- The dataset is balanced;
- We apply the right preprocessing methods;
- Ideally, the whole dataset fits into the memory of our GPU at once, so that we can pre-process the data using matrix operations on the GPU;
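To make the last point concrete, here is a back-of-the-envelope sketch of how much GPU memory a subset would need. All the numbers are illustrative assumptions, not measured values: 64 mel bands, ~1300 spectrogram frames per song (roughly 30 seconds at librosa's default hop length), 4 bytes per float32 value.

```python
# Rough GPU memory estimate for a candidate subset.
# Assumptions (not measured values): 64 mel bands, ~1300 frames
# per song (~30 s at librosa's default hop), 4 bytes per float32.
def dataset_size_gb(n_songs, n_mels=64, n_frames=1300, bytes_per_value=4):
    return n_songs * n_mels * n_frames * bytes_per_value / 1024**3

print(round(dataset_size_gb(26400), 1))  # ca. 8 GB - a tight fit for most GPUs
print(round(dataset_size_gb(5000), 1))   # ca. 1.5 GB - comfortable for an MVP
```

Under these assumptions the full random subset is already a tight fit, which is why subset size matters so much here.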
2. Considerations for choosing random subset
Obviously we should favour the entries where we have taxonomic data and vernacular names (it will be easier to interpret the data without consulting Wikipedia each time - for example, to know that Gallus means chicken).
As expected, this produces the counts below. We have ca. 365k bird songs, 286k of which have taxonomic data and only 134k of which have vernacular genus names. Actually this is really not bad, given that the average recording may contain 5-10 actual songs (see later).
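The filtering logic behind these counts boils down to counting non-null values. A toy illustration, where 'genus' (taxonomic) and 'vgenus' (vernacular genus) are assumed column names and the four rows are made up:

```python
import pandas as pd

# Toy illustration of the counting logic; 'genus' and 'vgenus' are
# assumed column names, and the four rows are made-up examples.
df = pd.DataFrame({
    'id':     [1, 2, 3, 4],
    'genus':  ['Gallus', 'Corvus', None, 'Parus'],
    'vgenus': ['Junglefowl', None, None, 'Tits'],
})

print(len(df))                         # all recordings
print(int(df.genus.notnull().sum()))   # with taxonomic data
print(int(df.vgenus.notnull().sum()))  # with vernacular genus names
```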
Let's see how many bird songs we have on average for birds with vernacular genus names.
# count recordings per vernacular genus; 'sp' (species) is used as the column axis
table = pd.pivot_table(merged_df,
                       index=['vgenus'],
                       columns=['sp'],
                       values=['id'],
                       aggfunc='count',
                       margins=True)
table = table.sort_values(by=('id','All'), ascending=False)
table[table[('id','All')] < 4000][('id','All')].apply(lambda x: np.log10(x+1)).plot.hist(alpha=0.7)
We see that the average is between 10^2 (100) and 10^2.5 (ca. 300) songs per genus, which is significantly better than having just 30 songs per species on average. One might argue that species within a genus may have significantly different songs, but for a proof of concept, being able to tell a crow from a titmouse will do.
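For readers less used to log scales, reading the histogram's x-axis back in raw counts is just a matter of undoing the log10(x+1) transform:

```python
# Undo the log10(x+1) transform used for the histogram's x-axis.
def unlog(edge):
    return 10**edge - 1

print(int(unlog(2.0)))  # 99 songs
print(int(unlog(2.5)))  # 315 songs
```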
I will omit some boring eyeballing steps where I basically produced pivot tables and looked through the majority of the data manually. In the end I decided to use this simple algorithm to produce a random dataset:
- Choose only bird genera with > 200 songs;
- Choose 200 random songs for each such genus.
This gives us 132 unique bird genera and 26400 bird songs, which I assume is enough for a proof of concept, provided that each song weighs ca. 100kb (it was naive of me to suppose that).
merged_df = pd.read_csv('merged_data.csv')

# Choose only the genera with > 200 calls and sample 200 calls
# per genus, so that the sample is not biased
sample_df = pd.DataFrame()
for vgenus in merged_df[pd.notnull(merged_df.vgenus)].vgenus.unique():
    if merged_df[merged_df['vgenus'] == vgenus].shape[0] > 200:
        sample_df = sample_df.append(
            merged_df[merged_df['vgenus'] == vgenus].sample(n=200))
3. Downloading and preprocessing the data
Obviously, we cannot feed raw sound into CNNs (simplifying a lot - they learn shapes). Decoded raw sound data looks something like this in the simplest case (check out this awesome notebook for more):
A general rule of thumb is that if a person cannot tell the difference with the naked eye in 90-95% of cases, then the machine will probably also have a hard time telling the difference. In a more realistic setting, sound may look like this (check out this amazing notebook):
Compare these three examples - you will see that they are not really so different when visualized. Now compare all of this with a spectrogram like this:
It immediately becomes apparent that we should feed our networks not raw sound but preprocessed sound in the form of spectrograms or any deeper form of sound analysis available in librosa (I believe that logs of mel-spectrograms and MFCCs are the obvious candidates). Also note that we have to download all the bird songs separately (one of my professional web-scraping friends recommended this tool to me - http://docs.grablib.org/en/latest/spider/intro.html - but I guess mastering multi-curl scraping tools is a task in itself).
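To demystify what librosa does under the hood, here is a minimal spectrogram sketch in plain numpy - no mel filterbank or log scaling, just framing, windowing and an FFT. The 440 Hz sine wave stands in for a bird call:

```python
import numpy as np

def spectrogram(y, n_fft=512, hop=256):
    # slice the signal into overlapping frames, window each frame,
    # and take the magnitude of the real FFT
    frames = [y[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(y) - n_fft, hop)]
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (freq_bins, time_frames)

sr = 22050
t = np.arange(sr) / sr
y = np.sin(2 * np.pi * 440 * t)   # one second of a 440 Hz "bird"
S = spectrogram(y)
print(S.shape)                    # (257, 85)
peak_hz = S[:, 0].argmax() * sr / 512
print(peak_hz)                    # close to 440 (resolution is sr/n_fft ≈ 43 Hz)
```

A real pipeline would additionally project the frequency bins onto the mel scale and take logs, which is exactly what librosa's melspectrogram helpers do for us.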
So we need to download data, preprocess it and save it as numpy arrays.
Below you can find the simplistic Python script I used for this purpose. It manages to download 1500-1700 audio files and then dies for some reason, without errors or explanations. Pay attention to the timeout alarm bit and the try-except statements - sometimes files just take too much time to be processed. Also, this script uses only 50% of my 4 CPUs and ca. 10% of my internet bandwidth, which is not efficient, but we are still in the MVP phase here...
(Remove the last break if you want to download more than one file. sample.csv is the random subset produced above.)
import os
import signal

import numpy as np
import pandas as pd
import requests
import librosa

cwd = os.getcwd()
sample_df = pd.read_csv('sample.csv')
sample_df = sample_df.set_index('id')

class TimeoutException(Exception):  # Custom exception class
    pass

def timeout_handler(signum, frame):  # Custom signal handler
    raise TimeoutException

# Change the behavior of SIGALRM
signal.signal(signal.SIGALRM, timeout_handler)

def download_file(url, local_filename):
    r = requests.get(url, stream=True)
    with open(local_filename, 'wb') as f:
        for chunk in r.iter_content(chunk_size=1024):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)
                # f.flush() commented by recommendation from J.F.Sebastian
    return local_filename

c = 0
for index, row in sample_df[sample_df['processed'] == 0].sample(n=1000).iterrows():
    file_path = row.file
    page = requests.get(file_path)  # resolve redirects to the actual mp3 url
    file_path = page.url
    file_id = index
    audio_path = 'bird_calls/' + str(file_id) + '.mp3'
    mel_path = 'sound_features/mel_' + str(file_id)
    harm_path = 'sound_features/harm_' + str(file_id)
    perc_path = 'sound_features/perc_' + str(file_id)
    mfcc_path = 'sound_features/mfcc_' + str(file_id)

    signal.alarm(5)  # give up on files that take too long
    try:
        download_file(file_path, audio_path)
        # (a jupyter alternative: ! curl $file_path --output $audio_path)
        y, sr = librosa.load(audio_path)
        sample_df.set_value(index=file_id, col='length_s', value=y.shape[0] / sr)
        sample_df.set_value(index=file_id, col='length', value=y.shape[0])
        sample_df.set_value(index=file_id, col='sr', value=sr)

        S = librosa.feature.melspectrogram(y, sr=sr, n_mels=64)
        log_S = librosa.logamplitude(S, ref_power=np.max)
        mfcc = librosa.feature.mfcc(S=log_S, n_mfcc=13)
        y_harm, y_perc = librosa.effects.hpss(y)

        np.save(mel_path, log_S)
        np.save(mfcc_path, mfcc)
        np.save(harm_path, y_harm)
        np.save(perc_path, y_perc)

        sample_df.set_value(index=file_id, col='mel_size', value=os.path.getsize(cwd + '/' + mel_path + '.npy'))
        sample_df.set_value(index=file_id, col='mfcc_size', value=os.path.getsize(cwd + '/' + mfcc_path + '.npy'))
        sample_df.set_value(index=file_id, col='processed', value=1)
    except TimeoutException:
        continue  # continue the for loop if processing takes more than 5 seconds
    finally:
        signal.alarm(0)  # Reset the alarm

    c += 1
    if c % 50 == 0:
        sample_df.to_csv('sample.csv')  # persist the progress
        print('50 entries saved')
    break  # remove this break to download more than one file
The whole process is a bit slow and kinda looks like this. The boring part is that I cannot just leave it overnight because it's failing. I guess for full-scale models I will need to learn the scraping libraries properly...
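Before properly learning a scraping framework, one cheap improvement would be to run several downloads concurrently. Here is a hedged sketch with `concurrent.futures`, where `fetch_and_process` is a hypothetical stand-in for the download-plus-librosa body of the script above:

```python
import concurrent.futures

def fetch_and_process(file_id):
    # hypothetical stand-in: the real version would download the mp3
    # and save its mel/MFCC features, returning a status per file
    return file_id, 'ok'

def run_parallel(file_ids, workers=8):
    # a thread pool overlaps the network waits, so bandwidth and CPUs
    # are used much better than in the strictly sequential loop above
    with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(fetch_and_process, file_ids))

statuses = run_parallel(range(10))
print(len(statuses))  # 10
```

One caveat: the SIGALRM timeout trick from the script above only works in the main thread, so a threaded version would use the `timeout` parameter of `requests.get` instead.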
4. So, what do we have inside? Let's play with the visuals and the data!
Well, obviously I already had some results when I was writing this article =) Let's see what we managed to download and how it will affect the later stages of our project. This is based on ca. 1700 observations from our ca. 22k random subset (I guess we can start testing basic CNNs on ca. 3-5k bird songs).
sample_df = pd.read_csv('sample.csv')
sample_df[sample_df.processed==1].length_s.fillna(value=0).apply(lambda x: np.log10(x+1)).plot(kind='hist')
This is the log10 of song lengths in seconds. We can see that a song typically lasts from 10 to 60 seconds.
sample_df[sample_df.processed==1].mel_size.fillna(value=0).apply(lambda x: np.log10(x+1)).plot(kind='hist')
This is a bit of a problem - mel spectrogram files mostly range from 100kb to 1mb in size. I hoped that we would have 10x smaller files. This mostly impacts the feasibility of running an MVP of the model by loading the whole dataset into memory at once...
All of this is nice, but shouldn't we have some fun after all? Let's do it. Let's visualize our spectrograms and play 10 random bird songs. This is really beautiful and inspiring - like some form of modern art. Just look and enjoy (you can repeat all of this yourself if you want to...) =)
import matplotlib.pyplot as plt
import librosa.display

cols = ['vfamily', 'vgenus', 'sp']
for index, row in sample_df[sample_df['processed'] == 1].sample(n=10).iterrows():
    sr = row.sr
    log_S = np.load('sound_features/mel_' + str(row.id) + '.npy')
    # log_S = librosa.logamplitude(S, ref_power=np.max)

    # Make a new figure
    plt.figure(figsize=(12, 4))

    # Display the spectrogram on a mel scale
    # sample rate and hop length parameters are used to render the time axis
    librosa.display.specshow(log_S, sr=sr, x_axis='time', y_axis='mel')

    # Put a descriptive title on the plot
    plt.title('mel power spectrogram')

    # draw a color bar
    plt.colorbar(format='%+02.0f dB')

    # Make the figure layout compact
    plt.tight_layout()
    print(row[cols])
Sit down and relax now...
vgenus ["Bewick's Wrens"]
vfamily ["American Sparrows", "Buntings", "Emberizid F...
vgenus ["Crowned Sparrows"]
vfamily ["Crows", "Jays", "Magpies"]
vgenus ["Tufted Jays"]
vgenus ["Least Grebes"]
vfamily ["Tyrant Flycatchers"]
vgenus ["Crested Flycatchers"]
vfamily ["Crows", "Jays", "Magpies"]
vgenus ["Typical Grebes"]
vfamily ["Sandpipers", "Snipes"]
vgenus ["Barn Swallows"]
5. Resources, downloads and further reading
- This notebook (please use the collapsible cells plugin):
- Notes on Music Information Retrieval:
- Especially this notebook;
- The sound of Hydrogen notebook;
And finally a somewhat related video for fun: