Leveraging Spotify’s API and other sources for the Hit Song Science research field. The whole project (report and source code) can be viewed here.
I decided to use my Master’s thesis to apply the Machine Learning concepts I had been learning over the year through online courses. The idea was to treat it as my first Data Science project. Being passionate about music, I chose to tackle Hit Song Science, which consists of predicting the overall popularity of a track.
Methodology and Results
To do so, I built my own database of Spotify’s Top 2018 and 2019 songs and extracted additional information from Genius.com, Google Trends, MusicBrainz and LastFM. To define the popularity of a track, I used two targets: the continuous popularity variable provided by Spotify, and a binary one marking the top 20% of the dataset according to that same popularity variable. From those data, I created new features, which include the Google Trends Standard Deviation over a 3-month period ending one week after the release of an album, Peak (an indicator of the highest interest position) and Holiday Period (a dummy variable indicating that a track was released during May, June, July or August). The idea was to use three different feature subsets: Audio Features, Artist Metadata and Song Metadata.
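The derived features described above could be sketched roughly as follows. This is an illustration and not the project's actual code: the function names are hypothetical, the Google Trends series is assumed to be already cut to the 3-month window, the population standard deviation is an arbitrary choice, and Peak is read here as the index of the highest-interest point, which is only one possible interpretation.

```python
from datetime import date
from statistics import pstdev


def holiday_period(release: date) -> int:
    """Dummy variable: 1 if released in May, June, July or August, else 0."""
    return int(release.month in (5, 6, 7, 8))


def top_fraction_labels(popularity: list[float], fraction: float = 0.2) -> list[int]:
    """Binary target: 1 for tracks in the top `fraction` by popularity.

    Ties at the threshold are all labelled 1, so the positive class can be
    slightly larger than `fraction` of the dataset.
    """
    n_hits = max(1, round(len(popularity) * fraction))
    threshold = sorted(popularity, reverse=True)[n_hits - 1]
    return [int(p >= threshold) for p in popularity]


def trends_features(interest: list[float]) -> dict[str, float]:
    """Summarise a Google Trends interest series (assumed pre-windowed)."""
    return {
        # Assumption: population standard deviation over the window.
        "trends_std": pstdev(interest),
        # Assumption: "Peak" is the position of the highest interest value.
        "trends_peak_position": interest.index(max(interest)),
    }


labels = top_fraction_labels([10, 50, 90, 30, 70, 20, 60, 40, 80, 100])
features = trends_features([10, 20, 100, 40])
summer_release = holiday_period(date(2019, 6, 14))
```

With ten tracks and a 20% cut-off, only the two most popular tracks receive the positive label.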
First, I carried out an Exploratory Data Analysis to examine the variable distributions, measure the correlations between them and visualize other kinds of relationships. Pearson and Spearman tests were also performed between the quantitative variables and song popularity. Then, following the literature review, I tested these models: Linear Regression, Logistic Regression, K-Nearest Neighbors, Random Forest, Support Vector Machines (linear and Gaussian kernels) and a Single-Layer Perceptron. I also used feature selection and regularization (L1 and L2) to improve my results and prevent the models from overfitting. Finally, the results were compared with a dummy model on a test set. Unsurprisingly, the perceptron yielded the best result, with an F1 score of 0.70 (Precision of 0.58 and Recall of 0.88).
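As a quick check on how the reported metrics relate, the F1 score is the harmonic mean of precision and recall. The sketch below recomputes it from the figures above. The baseline shown is hypothetical: it assumes a dummy classifier that always predicts "hit" on a test set with about 20% positives, which is not necessarily the dummy strategy used in the thesis.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall reported for the perceptron.
perceptron_f1 = f1_score(0.58, 0.88)

# Hypothetical baseline: always predicting "hit" with ~20% positives gives
# precision 0.20 (one prediction in five is right) and recall 1.0.
baseline_f1 = f1_score(0.20, 1.0)

print(f"perceptron F1 = {perceptron_f1:.2f}, always-hit baseline F1 = {baseline_f1:.2f}")
```

The recomputed value rounds to the 0.70 reported above. The low precision next to the high recall suggests the model flags most true hits at the cost of a fair number of false positives.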
Conclusion and Recommendations
To conclude, audio features alone are not sufficient to explain the popularity of a track, and metadata are essential. The selection methods I used gave me a set of influential features for both kinds of tasks (regression and classification), most of them metadata (although the dataset itself consisted mostly of song and artist metadata). It was interesting to observe that the Google Trends features exerted a significant influence on the outcome of the classifiers, increasing the probability of a track being a hit. I also found the classification task to be more promising, judging by how the models compared with a dummy classifier and a dummy regressor.
I would like to end this article by presenting some ideas that could be tried in further work in the Hit Song Science research field:
- Build one dataset per genre and train models on each of them.
- Build a large dataset, paying attention to the distribution of Spotify’s popularity value. There is now an offset limit on the search endpoint of Spotify’s API, so I would recommend using lists of Spotify track IDs, which can be found on Kaggle for instance, while paying attention to the release dates of the tracks.
- Use detailed audio features which are available on Spotify’s API. The ones I used summarize that information but probably simplify it.
- Use the market feature to create an indicator of the number of countries where the song is available. Maybe various strategies for launching an artist could be interpreted: is it better to focus on some markets or to promote a track worldwide (use historical data)?
- Many tags can be obtained from LastFM’s API, which can be useful to understand how the song is perceived by the listeners.
- Tackle the Hit Song Science problem as a multi-class classification task to smooth the definition of popularity.
- Use a combination of Spotify’s followers, LastFM’s subscribers, Deezer’s subscribers, Genius’s page views or Instagram’s followers to analyse the popularity of an artist.
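Two of the ideas above, the market indicator and the multi-class target, can be sketched in a few lines. The function names and the cut points (25, 50, 75) are hypothetical choices made for illustration. The sketch assumes a track represented as a dictionary carrying an `available_markets` list of country codes and a popularity value between 0 and 100.

```python
from bisect import bisect_right


def market_count(track: dict) -> int:
    """Number of countries where the track is available."""
    return len(track.get("available_markets", []))


def popularity_class(popularity: int, cut_points: tuple = (25, 50, 75)) -> int:
    """Map a 0-100 popularity value to an ordinal class (0 = lowest).

    The cut points are arbitrary placeholders; quantiles of the actual
    popularity distribution may be a better choice.
    """
    return bisect_right(cut_points, popularity)


# Made-up track used only to exercise the helpers.
example_track = {"available_markets": ["FR", "US", "GB"], "popularity": 62}
n_markets = market_count(example_track)
pop_class = popularity_class(example_track["popularity"])
```

Quantile-based cut points would keep the classes balanced, which matters given how skewed popularity values tend to be.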
This project was very interesting, and I learnt a lot as it was my first application of the content I had been working on with online courses since the beginning of the year. I discovered many other resources and books to deepen my understanding of Data Science and I look forward to improving myself and evolving in the field of data.
Huge thanks to Julien Fouquau, my thesis tutor at ESCP Business School, and to Ulysse Couerbe for giving me a great list of online courses to start learning the basics of Machine Learning.