Explores various machine learning techniques for generating music. Compares the performance of traditional RNNs, LSTMs, and transformers on generating sample sequences of music.
The problem of music generation has been widely explored for a long time. Music has strong parallels to how speech and language are structured. Just like language, music is temporal and, in the traditional Western sense, has a defined set of standards/rules for how it should be structured. What makes music generation a more challenging problem than language is that music has an artistic/expressive component as well as both low and high level structure. For “good” music, it isn’t enough to simply generate a series of notes and harmonies that obey music theory conventions. At the low level, “good” music makes use of varying dynamics, note groupings, and articulation. At the high level, “good” music may feature overarching motifs and specific forms (round, sonata form, ABAB, etc.). This level of complexity is analogous to the problem of generating poetry together with speech that mimics a human reciting it. The poetry will have structures like rhyme, rhythm, motifs, and metaphors, and the speech reading it will have to replicate expressiveness to be convincing. Current speech generators, LLMs, and NLP methods cannot yet achieve this level of complexity robustly.
It is this level of structural complexity required for generating “good” music that makes machine learning methods, specifically deep learning, a compelling approach to the problem. Deep learning methods should be able to capture music’s low level music theory structure as well as its high level form. The hope is that, given enough data and the right architectures, music generation will be able to reach a level akin to the best human composers; systems such as OpenAI’s Jukebox are early steps in that direction.
There have been several studies and projects on the generation of music. OpenAI has done work using raw audio samples to generate music, taking a representation learning and autoencoder approach that leverages VQ-VAEs. Other work has focused on symbolic representations of music, such as MIDI.
Before music can be learned and generated, it needs to first be converted to a format that can be input to an ML model. To achieve this I used a subset of a piano MIDI dataset and converted each piece into a sequence of note encodings that the models could consume.
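As a rough illustration (not my exact pipeline), the conversion from MIDI to note tokens might look something like the following, assuming the pretty_midi library and a simple pitch-only encoding:

```python
import pretty_midi

def midi_to_tokens(path):
    """Convert a piano MIDI file into a simple sequence of pitch tokens.

    Illustrative sketch: keeps only MIDI pitch numbers (0-127), sorted by
    onset time, and drops duration/velocity information.
    """
    pm = pretty_midi.PrettyMIDI(path)
    notes = []
    for instrument in pm.instruments:
        if instrument.is_drum:
            continue
        notes.extend(instrument.notes)
    notes.sort(key=lambda n: (n.start, n.pitch))
    return [n.pitch for n in notes]

# Build fixed-length training sequences from one file (file name is hypothetical)
tokens = midi_to_tokens("chopin_op53.mid")
seq_len = 48
sequences = [tokens[i:i + seq_len + 1]   # +1 so the final note serves as the target
             for i in range(len(tokens) - seq_len)]
```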
I then tested three different models to see how they performed. The first model I tested was an RNN with hidden_size = 64, RNN_layers = 2, and sequence lengths of 24, 48, 64, and 200. I next tested LSTM models with hidden_size = 64, RNN_layers = 2, and the same sequence lengths, and compared a bidirectional vs. a unidirectional model. The last model I analyzed was a transformer, for which I first took my note encodings and created an embedded representation of the notes, then combined this with a positional encoding of each note’s place in the sequence to get the final embedding passed into the transformer architecture.
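A simplified PyTorch sketch of these pieces is below (illustrative only, not my exact code; the vocabulary size and layer choices are assumptions, and the plain RNN variant is identical with nn.RNN in place of nn.LSTM):

```python
import math
import torch
import torch.nn as nn

VOCAB_SIZE = 128   # MIDI pitch range; assumption for illustration

class NoteLSTM(nn.Module):
    """Next-note predictor: embedding -> (bi)LSTM -> linear over pitches."""
    def __init__(self, hidden_size=64, num_layers=2, bidirectional=False):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, hidden_size)
        self.lstm = nn.LSTM(hidden_size, hidden_size, num_layers,
                            batch_first=True, bidirectional=bidirectional)
        out_dim = hidden_size * (2 if bidirectional else 1)
        self.fc = nn.Linear(out_dim, VOCAB_SIZE)

    def forward(self, x):                 # x: (batch, seq_len) pitch ids
        h, _ = self.lstm(self.embed(x))   # (batch, seq_len, out_dim)
        return self.fc(h)                 # per-step logits for the next note

def positional_encoding(seq_len, d_model):
    """Standard sinusoidal positional encoding (d_model assumed even),
    added to the note embeddings before the transformer."""
    pos = torch.arange(seq_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)
    pe[:, 1::2] = torch.cos(pos * div)
    return pe
```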
I found the RNN architecture to be the worst performing model. Its loss rang (oscillated) heavily during some training runs, and its output was mostly unstructured and random. The results of a sample music generation can be found here. The LSTM model took longer to train but performed better with hidden_size = 64, sequence_length = 48, and 30 epochs. I found that it worked even better when using a bidirectional architecture. A sample generation can be found here, in which the model was fed the starting 10 notes of Polonaise in A-flat major, Op. 53 and asked to generate a long sequence from that. The transformer took the longest to train, and its results can be found here.
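The generation procedure itself was a simple autoregressive loop: feed in a short seed, predict a distribution over the next note, sample, append, and repeat. A rough sketch (assuming the token format above; the real script also handled things like device placement):

```python
import torch

@torch.no_grad()
def generate(model, seed_tokens, length=200, temperature=1.0):
    """Autoregressively extend a seed sequence of pitch tokens."""
    model.eval()
    tokens = list(seed_tokens)
    for _ in range(length):
        x = torch.tensor(tokens, dtype=torch.long).unsqueeze(0)   # (1, t)
        logits = model(x)[0, -1] / temperature                    # next-note logits
        next_pitch = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        tokens.append(next_pitch)
    return tokens
```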
As expected, the base RNN architecture failed to generate anything meaningful. It took a while to find hyperparameters that would make the LSTM generate something of note, but when it did successfully generate music I was surprised by some of the resemblances it had to music in the training data.
One noticeable flaw in my work is that my metric for success, outside of training error, is qualitative. It would have been useful for evaluating my model implementations if I had a quantitative metric. I originally calculated the loss of my models based on how well they replicated unseen music from a test set given sequences from the same music; however, the losses for every model failed to converge in a reasonable amount of time. It is certainly difficult to tell whether poor performance is due to my implementation or to a small dataset and limited compute resources.
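For reference, the held-out loss I describe was essentially next-note cross-entropy on unseen pieces, roughly like this (a sketch, assuming the sequence format used earlier):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def test_loss(model, test_sequences):
    """Average next-note cross-entropy over held-out sequences.
    Each sequence is a list of seq_len + 1 pitch tokens; the model predicts
    token t+1 from tokens up to t."""
    model.eval()
    total, count = 0.0, 0
    for seq in test_sequences:
        x = torch.tensor(seq[:-1], dtype=torch.long).unsqueeze(0)
        y = torch.tensor(seq[1:], dtype=torch.long).unsqueeze(0)
        logits = model(x)                                    # (1, t, vocab)
        loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)), y.reshape(-1))
        total += loss.item()
        count += 1
    return total / max(count, 1)
```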
Continuing on the idea of a lack of data, one of the challenges I faced was in the curation of my dataset. I was originally going to generate music tokens for my network based on a very descriptive musical format called LilyPond. However, there were inconsistencies between samples of music in how they were represented in the LilyPond text format, so writing a program to transcribe the text into a good representation of the music proved very difficult, which is why I turned to the more standardized MIDI file format. This is unfortunate because a lot of the complex expression in music is lost in the MIDI format, making it harder if not impossible for models trained on MIDI input to learn these complex representations/behaviors. I say impossible because if data for musical expression is completely absent from training, then this important component of music is simply out of distribution and impossible to learn. So a better way to encode/represent music is needed for better results.
Moving forward, it would be interesting to explore how representation learning can be used to enhance the generation of music. I want to explore the use of VAEs and some of their more advanced variations, like the VQ-VAE used in OpenAI’s Jukebox. These methods may be able to capture both the high level structure and the complex low level structure found in music. I also want to explore methods for encoding the dynamics, articulation, and expression found in music, something I was not able to do this time around. Lastly, exploring a better way to encode and learn the duration of notes would lead to better music generation; one simple option is sketched below.
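For example (a purely hypothetical encoding, not something I implemented), note durations could be quantized into a small set of bins and folded into the token itself, so that duration becomes something the model can learn rather than ignore:

```python
# Hypothetical joint (pitch, duration) encoding.
DURATION_BINS = [0.125, 0.25, 0.5, 1.0, 2.0]   # in beats; illustrative choice

def encode_note(pitch, duration_beats):
    """Map a (pitch, duration) pair to a single integer token."""
    bin_idx = min(range(len(DURATION_BINS)),
                  key=lambda i: abs(DURATION_BINS[i] - duration_beats))
    return pitch * len(DURATION_BINS) + bin_idx   # vocab size = 128 * 5

def decode_note(token):
    """Invert encode_note back to (pitch, duration in beats)."""
    pitch, bin_idx = divmod(token, len(DURATION_BINS))
    return pitch, DURATION_BINS[bin_idx]
```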