EECS 351 MUSIC ARTIST IDENTIFIER
Final Product
​
We decided to implement our strongest classifier and filtering technique in our final model: k-nearest neighbors (KNN) and notch filtering. The notch filter was applied at the fundamental frequency, with the sole task of removing outlying data, which allowed us to get better results from our feature extraction.
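As a rough illustration of this filtering step, the sketch below estimates a clip's fundamental frequency and notches it out. It assumes Python with librosa and scipy, which is only one possible implementation; the function name notch_fundamental, the pitch search range, and the Q value are illustrative choices rather than the project's actual parameters.

```python
import numpy as np
import librosa
import scipy.signal as signal

def notch_fundamental(audio, sr, quality=30.0):
    """Remove the estimated fundamental frequency from a clip with a notch filter."""
    # Estimate the per-frame fundamental frequency (Hz) over a vocal-ish range,
    # then take the median as a single notch frequency for the clip.
    f0 = librosa.yin(audio, fmin=80, fmax=1000, sr=sr)
    f0_hz = float(np.median(f0))

    # Second-order IIR notch at the estimated fundamental, applied zero-phase.
    b, a = signal.iirnotch(w0=f0_hz, Q=quality, fs=sr)
    return signal.filtfilt(b, a, audio)
```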
​
For feature extraction, we decided to extract a range of audio features that we felt were critical to identifying specific artists. These features are listed below with accompanying descriptions; a sketch of how comparable features might be computed follows the list.
​
Features
MFCC: A representation of the spectral characteristics of a signal, obtained by transforming the signal into a set of coefficients derived from its mel-scaled power spectrum.
Pitch: The perceived frequency of a sound.
Zero-Crossing Rate: The rate at which a signal changes its sign.
Short-Time Energy: Quantifies the magnitude of a signal within specific time intervals.
Spectral Centroid: Represents the center of mass of a sound spectrum, indicating the average frequency at which the energy of a signal is concentrated.
Harmonic Ratio: Quantifies the degree to which a signal's frequency components adhere to a harmonic series.
MFCC Delta: Represents the rate of change of the Mel-frequency cepstral coefficients over time.
MFCC Delta-Delta: Represents the acceleration, or second-order temporal change, of the Mel-frequency cepstral coefficients.
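To make these concrete, here is a minimal sketch of how a comparable per-frame feature matrix could be built in Python with librosa (a library choice we are assuming; the frame and hop sizes, the 13-coefficient MFCC setting, and the harmonic-ratio stand-in are illustrative rather than the project's exact choices):

```python
import numpy as np
import librosa

def extract_features(audio, sr, frame_length=2048, hop_length=512):
    """Stack per-frame features into a (num_frames, num_features) matrix."""
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=13,
                                n_fft=frame_length, hop_length=hop_length)
    mfcc_d = librosa.feature.delta(mfcc)             # MFCC delta
    mfcc_dd = librosa.feature.delta(mfcc, order=2)   # MFCC delta-delta

    zcr = librosa.feature.zero_crossing_rate(audio, frame_length=frame_length,
                                             hop_length=hop_length)
    rms = librosa.feature.rms(y=audio, frame_length=frame_length,
                              hop_length=hop_length)
    energy = rms ** 2                                # short-time energy proxy
    centroid = librosa.feature.spectral_centroid(y=audio, sr=sr,
                                                 n_fft=frame_length,
                                                 hop_length=hop_length)
    pitch = librosa.yin(audio, fmin=80, fmax=1000, sr=sr,
                        frame_length=frame_length,
                        hop_length=hop_length)[np.newaxis, :]

    # Rough harmonic-ratio stand-in: fraction of per-frame RMS retained after
    # harmonic/percussive separation (an approximation, not a true harmonic ratio).
    harm_rms = librosa.feature.rms(y=librosa.effects.harmonic(audio),
                                   frame_length=frame_length,
                                   hop_length=hop_length)
    harmonic_ratio = harm_rms / (rms + 1e-10)

    feats = np.vstack([mfcc, mfcc_d, mfcc_dd, zcr, energy,
                       centroid, pitch, harmonic_ratio])
    return feats.T  # one row per frame, one column per feature (43 total)
```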
​
Using these features, we trained our KNN classifier. We decided the model should consider 20 neighbors, as this was the best balance between performance and computational load. Furthermore, we used the Euclidean distance metric along with squared-inverse distance weighting to assign more significance to closer neighbors. The classifier was trained on 19 clips per artist and tested on 1 clip per artist.
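Under those settings, the classifier configuration could look like the following scikit-learn sketch (assuming the per-frame feature matrix and labels described above; the helper name squared_inverse is illustrative):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def squared_inverse(distances):
    """Squared-inverse distance weighting: closer neighbors count more."""
    return 1.0 / (distances ** 2 + 1e-10)  # small constant avoids divide-by-zero

# X_train: per-frame feature rows from the 19 training clips per artist,
# y_train: the artist label of the clip each frame came from.
knn = KNeighborsClassifier(n_neighbors=20,         # 20 neighbors per the write-up
                           metric="euclidean",     # Euclidean distance
                           weights=squared_inverse)
# knn.fit(X_train, y_train)
# frame_predictions = knn.predict(X_test)
```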
​
Classification was not done on entire clips but on frames of the clips, and only on frames where a voice was detected. Voice detection was implemented by looking at the short-time energy and zero-crossing rate: if each of these features crossed its set threshold, we determined that vocals were present and classified the frame. Going frame by frame instead of clip by clip provided more data and allowed clip sizes to vary, which made the model much stronger.
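A minimal sketch of that frame-selection rule is shown below. It assumes the common convention that energy must sit above its threshold while the zero-crossing rate sits below its own; the threshold values are placeholders, not the project's tuned settings.

```python
import numpy as np

def vocal_frame_mask(energy, zcr, energy_thresh=1e-3, zcr_thresh=0.15):
    """Flag frames that appear to contain vocals.

    Assumed convention: short-time energy above its threshold (something is
    being sung) and zero-crossing rate below its threshold (not just noisy
    or percussive content). Threshold values here are placeholders.
    """
    energy = np.asarray(energy).ravel()
    zcr = np.asarray(zcr).ravel()
    return (energy > energy_thresh) & (zcr < zcr_thresh)

# Only frames where the mask is True are passed to the classifier:
# X_vocal = frame_features[vocal_frame_mask(energy, zcr)]
```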
​
The final model is implemented for four artists: Drake, Lana Del Rey, Pitbull, and Taylor Swift. It allows users to provide a .wav file of any length, and it will predict which of these artists appears in that clip.
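One way such a clip-level prediction could be assembled, reusing the sketches above and taking a simple majority vote over the vocal frames (the voting scheme is an assumption, not necessarily the project's exact rule):

```python
from collections import Counter

import librosa

def predict_artist(wav_path, knn, sr=22050):
    """Predict the artist for a .wav file of any length.

    Reuses the extract_features and vocal_frame_mask sketches above:
    extract per-frame features, keep only vocal frames, classify each
    frame with the trained KNN model, and return the majority vote.
    """
    audio, _ = librosa.load(wav_path, sr=sr, mono=True)
    feats = extract_features(audio, sr)          # (num_frames, 43)

    # Column indices follow the stacking order used in extract_features.
    zcr, energy = feats[:, 39], feats[:, 40]
    mask = vocal_frame_mask(energy, zcr)
    frames = feats[mask] if mask.any() else feats  # fall back to all frames

    frame_labels = knn.predict(frames)
    return Counter(frame_labels).most_common(1)[0][0]
```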
​

Above we can see the frame-by-frame confusion matrix for our final KNN classifier. Its frame-by-frame accuracy is roughly 77% on four unseen files, just above our set goal of 75%. These results are especially impressive given that they reflect frame-by-frame accuracy.

Above we can see the clip-level accuracy of the model, which, as expected given the frame-by-frame prediction accuracy, is perfect. This is not to say the model is without flaw, as these are only four clips. The clip-level accuracy is also somewhat misleading, since it can drop off quickly due to its all-or-nothing nature; that is why the frame-by-frame accuracy should be weighted more heavily.
​
Overall, the final model accomplishes our team's set goals fairly well, and it does so in a fairly simple manner, which leaves room for future improvement. If the restrictions of this project, specifically the time, were removed, we feel we could create a much stronger model by including more data and improving our filtering methods.