Polyphonic music transcription is a complex problem that lends itself well to deep learning. Because music is built on temporal structure (e.g. scales, arpeggios), recurrent neural networks are well suited to the task. In this Stanford CS229 Machine Learning final project, we implement a 4-layer recurrent neural network that operates on spectrogram representations of the audio. We present an analysis of the temporal structures learned by the network and compare its performance to a baseline multivariate linear regression model.
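A minimal sketch of such a model, assuming a PyTorch implementation with 88 piano-key outputs; the spectrogram bin count, hidden size, and batch shapes are illustrative placeholders rather than the project's exact hyperparameters, while the 4 recurrent layers and MSE training loss follow the description above.

```python
import torch
import torch.nn as nn

class TranscriptionRNN(nn.Module):
    """4-layer RNN mapping spectrogram frames to per-note activations."""

    def __init__(self, n_bins=252, hidden_size=256, n_notes=88):
        super().__init__()
        # Stacked recurrent layers over the time axis of the spectrogram.
        self.rnn = nn.RNN(input_size=n_bins, hidden_size=hidden_size,
                          num_layers=4, batch_first=True)
        # Per-frame projection to one activation per piano key.
        self.output = nn.Linear(hidden_size, n_notes)

    def forward(self, spec):
        # spec: (batch, time, n_bins) spectrogram frames.
        hidden_seq, _ = self.rnn(spec)
        # Sigmoid gives an independent activation per note and frame.
        return torch.sigmoid(self.output(hidden_seq))

# Example: one batch of 8 clips, 100 frames each (placeholder shapes).
model = TranscriptionRNN()
spec = torch.randn(8, 100, 252)
pred = model(spec)                                      # (8, 100, 88)
target = torch.randint(0, 2, (8, 100, 88)).float()      # binary piano-roll labels
loss = nn.MSELoss()(pred, target)                       # MSE loss as in the training curves
loss.backward()
```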
Piano-roll visualization of RNN predictions compared to ground truth labels. From left to right: the input spectrogram, RNN predictions thresholded at a predetermined cutoff value, and the ground truth labels.
(LEFT) MSE loss on the training and validation datasets over training epochs for the best-performing RNN model. (RIGHT) Precision-recall curve for the fully trained RNN on the validation dataset.
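A hedged sketch of how such a precision-recall curve and a prediction cutoff could be computed with scikit-learn; the F1-maximizing choice of cutoff, the placeholder arrays, and their shapes are assumptions for illustration, not the project's exact procedure.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# pred: (time, 88) sigmoid outputs on the validation set;
# truth: (time, 88) binary piano-roll labels. Both are random placeholders here.
pred = np.random.rand(1000, 88)
truth = (np.random.rand(1000, 88) > 0.9).astype(int)

# Precision-recall curve over all (frame, note) decisions.
precision, recall, thresholds = precision_recall_curve(truth.ravel(), pred.ravel())

# One common way to pick a cutoff: maximize F1, then binarize to a piano-roll.
f1 = 2 * precision * recall / (precision + recall + 1e-8)
cutoff = thresholds[np.argmax(f1[:-1])]
piano_roll = (pred >= cutoff).astype(int)
```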
The top 10 highest-weighted previous notes in the final layer's hidden-to-hidden weights were examined for C4 and C#/Db4. These show that the two notes are favoured when preceded by their octaves and by notes from the scales of C major and A-flat major, respectively. This suggests that RNNs may be capable of learning musical structure.
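As an illustration only, the sketch below extracts the top-k incoming weights for a given note from an 88x88 hidden-to-hidden matrix. It assumes the final layer's hidden units align one-to-one with piano-roll notes, which is a simplifying assumption made for this example rather than a detail taken from the project.

```python
import numpy as np

NOTE_NAMES = ['C', 'C#/Db', 'D', 'D#/Eb', 'E', 'F',
              'F#/Gb', 'G', 'G#/Ab', 'A', 'A#/Bb', 'B']

def note_name(index, lowest_midi=21):
    """Map a piano-roll index (0 = A0) to a note name such as 'C4'."""
    midi = index + lowest_midi
    return f"{NOTE_NAMES[midi % 12]}{midi // 12 - 1}"

def top_previous_notes(w_hh, note_index, k=10):
    """Return the k previous-frame notes with the largest incoming weight
    into `note_index`, assuming hidden units correspond to notes (an
    assumption for this sketch)."""
    incoming = w_hh[note_index]                 # row: weights from previous hidden state
    top = np.argsort(incoming)[::-1][:k]
    return [(note_name(j), float(incoming[j])) for j in top]

# Example with a random 88x88 hidden-to-hidden matrix.
w_hh = np.random.randn(88, 88)
c4_index = 60 - 21                              # MIDI 60 (C4) -> piano-roll index 39
print(top_previous_notes(w_hh, c4_index))
```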