As always, the first thing I do when working with machine learning or deep learning projects is collecting all required data. Today, I am taking thousands of audio files from a Kaggle competition which you can download from this link: https://www.kaggle.com/c/freesound-audio-tagging/data. The dataset contains 41 classes in which each of those represents the name of a musical instrument such as cello, chime and clarinet. Actually there are also some other non-musical instrument sounds like telephone and fireworks in the dataset. However, here in my project I decided to choose only 5 musical instruments to be classified for simplicity.
Now let’s start with importing all required modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from tqdm import tqdm
from python_speech_features import mfcc
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix
from keras.layers import Conv2D, MaxPool2D, Flatten, Dense, Dropout
from keras.models import Sequential
Here I would like to highlight several imports that you probably might still not familiar with, those are: librosa, tqdm and mfcc. First, librosa is a Python module which I use to load all audio data. Next, tqdm is actually not very necessary, I just like to use it to display progress bar in a loop operation. Lastly, mfcc is a function coming with python_speech_features module which is very important to extract audio features to make those audio wave data more informative.
The next step is to load train.csv in form of a pandas data frame using the following code.
df = pd.read_csv('train.csv')
Below is how the data frame looks like. You can see here that it contains filename-label pair and manually_verified column which I guess it’s used to tag whether an audio clip is verified by a real person.
As I’ve mentioned earlier, in this project I will only use 5 out of 41 classes in the dataset. Those classes are Cello, Saxophone, Acoustic_guitar, Double_bass and Clarinet. Here is how to filter out those classes.
df = df[df['label'].isin(['Cello','Saxophone','Acoustic_guitar','Double_bass', 'Clarinet'])]
1. Machine Learning Concepts Every Data Scientist Should Know
2. AI for CFD: byteLAKE’s approach (part3)
3. AI Fail: To Popularize and Scale Chatbots, We Need Better Data
4. Top 5 Jupyter Widgets to boost your productivity!
If you check the shape of the data you will find that the number of data is now only 1500 (previously there are 9473 files). As the data already filtered to only 5 classes, we will load those actual audio files using the following approach:
path = 'audio_train/'
audio_data = list()for i in tqdm(range(df.shape)):
audio_data.append(librosa.load(path+df['fname'].iloc[i]))audio_data = np.array(audio_data)
Well, this process is relatively simple yet it takes several minutes to run. Here I declared an empty list called audio_data and append each of raw audio data using librosa.load() function. Keep in mind that the shape of this audio_data variable is (1500, 2) in which the first axis represents the number of raw audio waves and the second axis represents 2 columns (which stores audio waves and sample rate respectively). Lastly, I also convert the audio_data list into Numpy array.
By the way, if you use tqdm() function like what I did, you will have the output which looks something like this: