Coursework 2:
Submitted in partial fulfilment of the DSM150 course
By Hendrik Matthys van Rooyen
230221176
Music Generation: ADL Piano MIDI Dataset
Background
Coursework Evolution
The development of the coursework, from the first attempt to the most recent, saw a series of changes aimed at overcoming the specific challenges encountered along the way.
Initially, the objective was to read and process a MIDI file using the mido Python library. This stage focused primarily on extracting pitches from MIDI files. However, difficulties arose with mido processing certain files, attributed to issues with the selected data structures. An initial attempt was also made to train a basic generative network.
Subsequent revisions led to the adoption of music21 instead of mido. This change was motivated by the need to resolve structural issues and because music21 made it simpler to extract a wider range of features. Despite these improvements, only pitch was utilized as a feature, and no attention was given to chords. Efforts were made to implement a system for converting these features into categorical inputs for the network, alongside developing a method for converting the data back into MIDI format.
The focus then expanded to include not just pitch but also the duration of notes. This approach enabled the training of functional networks and the generation of music, though it became apparent that the generated music lacked overlapping notes, because chords were not treated as single units and note offsets were not considered.
By the fourth attempt, significant advancements were made by incorporating note pitch, duration, and offsets. This phase aimed to accurately convert songs into data and then back into MIDI without major losses. It also explored accommodating songs that use multiple instruments. However, a challenge emerged with chords, which occupied a large portion of the pitch vocabulary used by the network but were rarely selected in the generated music. Various network models and training durations were experimented with, including the use of floating-point values for durations and offsets; this was ultimately abandoned because the generated music converged towards a single value.
In the most recent iteration, the handling of chords was refined by eliminating duplicate notes within chords, sorting them alphabetically, and ultimately focusing on the root note of each chord. This adjustment significantly reduced the complexity of the dataset, from over 10,000 combinations to 95. Additionally, durations and offsets were rounded to reduce complexity, and songs were segmented to minimize overlap.
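To make this reduction concrete, the short sketch below is illustrative only: the notebook's own simplify_chord and round_time helpers appear in the next section, and round_to_sixteenth is a hypothetical stand-in. It shows a duplicated, unsorted chord collapsing to its root and a timing value snapping to the nearest 1/16 of a quarter note.
from music21 import chord
# Hypothetical chord with a duplicate pitch, given out of order
c = chord.Chord(['E4', 'C4', 'G4', 'C4'])
print(c.root())  # C4 -- the single pitch kept after simplification
# Snap a duration or offset to the nearest 1/16 of a quarter note
def round_to_sixteenth(t):
    return round(t * 16) / 16
print(round_to_sixteenth(1.33))  # 1.3125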
Implementation
Read and write MIDI files
Reading MIDI Files
Directory Traversal for MIDI File Collection: The process begins with the find_midi_files function, which recursively searches the specified directory and its subdirectories for files ending in .mid or .midi, compiling a list of paths to every candidate MIDI file for analysis.
Musical Data Extraction: With all MIDI files located, get_notes_and_durations processes each file individually to extract the essential musical information. It distinguishes between files with distinct instrument parts and files with a flat note structure, allowing for a nuanced extraction of notes, chords, and rests. Each musical element's duration and offset (the time at which it begins relative to the start of the piece) are also captured. This dataset forms the foundation for any subsequent musical analysis or manipulation.
Writing MIDI Files
Reconstruction of MIDI Files: The create_midi_from_notes function reconstructs a MIDI file from scratch, based on lists of notes, their durations, and offsets. This function exemplifies the synthesis side of working with MIDI files, transforming abstract musical representations back into a structured digital format. By leveraging music21's stream objects, it assembles a sequence of musical elements, each positioned and timed according to the input lists. This capability enables not only the replication of existing musical pieces but also the creation of entirely new compositions.
Application Example
The example code at the end demonstrates how to read specific MIDI files, process them to extract musical elements, and then create a new MIDI file from that data, completing a full cycle of reading, processing, and writing MIDI data that can serve as the basis for more complex operations such as musical analysis or automated composition.
from music21 import converter, note, chord, instrument
import numpy as np
import glob
import os
def find_midi_files(directory):
if os.path.isfile(directory) and directory.endswith(('.mid', '.midi')):
return [directory]
midi_files = []
for root, dirs, files in os.walk(directory):
for file in files:
if file.endswith(('.mid', '.midi')):
full_path = os.path.join(root, file)
midi_files.append(full_path)
return midi_files
def simplify_chord(ch):
    # Reduce a chord to a single representative pitch; earlier attempts kept the
    # third/fifth/seventh (commented out), but only the root is used to shrink the vocabulary
    new_chord = []
    #if(ch.third != None): new_chord.append(ch.third)
    #if(ch.fifth != None): new_chord.append(ch.fifth)
    #if(ch.seventh != None): new_chord.append(ch.seventh)
    if(len(new_chord) == 0): new_chord.append(ch.root())
    return chord.Chord(new_chord)
def round_time(time):
    # Cap very long values at 10 quarter lengths and snap to the nearest 1/16 of a quarter note
    if time > 10.0:
        return 10.0
    return round(time * 16) / 16
def get_notes_and_durations(directory):
"""
Extracts notes, durations, and offsets from MIDI files in a given directory.
Parameters:
directory (str): The path to the directory containing MIDI files.
Returns:
list: A list of tuples, where each tuple contains three lists: notes, durations, and offsets for a song. Each list in the tuple represents a song in the MIDI files. The notes list contains the pitch of every note (or chord) in the song. The durations list contains the duration of every note (or chord) in quarter lengths. The offsets list contains the start time of every note (or chord) in quarter lengths.
"""
midi_files = find_midi_files(directory)
all_songs = []
for index, file in enumerate(midi_files):
print(f"[{index}] {file}")
try:
midi = converter.parse(file)
parts = instrument.partitionByInstrument(midi)
if parts: # file has instrument parts
notes_to_parse = []
for part in parts.parts:
notes_to_parse.extend([element for element in part.recurse() if isinstance(element, (note.Note, chord.Chord))])
else: # file has notes in a flat structure
notes_to_parse = midi.flat.notesAndRests
except Exception as e:
print(f"Error processing {file}: {e}")
continue
        notes_to_parse = sorted(notes_to_parse, key=lambda x: x.offset)
        if not notes_to_parse:
            # Skip files that contain no notes, chords, or rests
            continue
        song_notes, song_durations, song_offsets = [], [], []
        prev_offset = 0
        first_offset = round_time(notes_to_parse[0].offset)
for element in notes_to_parse:
if isinstance(element, (note.Note)):
note_instance = str(element.pitch)
elif isinstance(element, chord.Chord):
unique_sorted_notes = sorted(str(n.pitch) for n in simplify_chord(element).notes)
note_instance = '.'.join(unique_sorted_notes)
elif isinstance(element, note.Rest):
note_instance = 'Rest'
element_offset = element.offset
offset = round_time(element_offset - prev_offset - first_offset)
prev_offset = element_offset - first_offset
song_notes.append(note_instance)
song_durations.append(round_time(element.duration.quarterLength))
song_offsets.append(offset)
all_songs.append((song_notes, song_durations, song_offsets))
return all_songs
from music21 import stream, note, chord, duration, tempo
def create_midi_from_notes(notes, durations, offsets, output_file_path='output.mid'):
"""
Create a MIDI file from lists of notes, durations, and offsets.
Parameters:
notes (list): A list of note representations. Each representation can be a note (e.g., 'C4'), a chord (e.g., 'C4.E4.G4'), or a rest ('Rest').
durations (list): A list of durations for each note or chord. Each duration is a float representing the duration in quarter lengths.
offsets (list): A list of offsets for each note or chord. Each offset is a float representing the time at which the note or chord should start, in quarter lengths.
output_file_path (str, optional): The path to the output MIDI file. If the directory does not exist, it will be created. Defaults to 'output.mid'.
Returns:
None. The function writes the output to a MIDI file at the specified path.
"""
    output_directory = os.path.dirname(output_file_path)
    # Only create a directory when the path actually contains one (e.g. not for the default 'output.mid')
    if output_directory and not os.path.exists(output_directory):
        os.makedirs(output_directory)
output_stream = stream.Score()
current_offset = 0
for i, note_repr in enumerate(notes):
current_offset += offsets[i]
if note_repr == 'Rest':
new_element = note.Rest()
elif '.' in note_repr: # it's a chord
pitches = note_repr.split('.')
new_element = chord.Chord(pitches)
else: # it's a note
new_element = note.Note(note_repr)
new_element.duration = duration.Duration(durations[i])
#print(f"{new_element}: - {new_element.duration.quarterLength} - {offsets[i]}({current_offset})")
output_stream.insert(current_offset, new_element)
output_stream.write('midi', fp=output_file_path)
import matplotlib.pyplot as plt
from music21 import converter
def plot_piano_roll(midi_path, title='Piano Roll'):
# Load the MIDI file
midi = converter.parse(midi_path)
notes_to_parse = midi.flat.notes
plt.figure(figsize=(10, 10))
    # Draw each note as a horizontal segment; 'element' avoids shadowing the music21 note module
    for element in notes_to_parse:
        if element.isNote:
            start = element.offset
            length = element.duration.quarterLength
            pitch = str(element.pitch)
            plt.plot([start, start + length], [pitch, pitch], color="blue", linewidth=5)
        elif element.isChord:
            start = element.offset
            length = element.duration.quarterLength
            for p in element.pitches:
                pitch = str(p)
                plt.plot([start, start + length], [pitch, pitch], color="blue", linewidth=5)
    plt.title(title)
    plt.xlabel('Time (in quarter lengths)')
    plt.ylabel('Pitch')
plt.grid(True)
plt.show()
#adl-piano-midi\\Rock\\Soft Rock\\Phil Collins
#adl-piano-midi\\Classical\\Classical\\Frederic Chopin
#adl-piano-midi\\Rock\\Glam Rock\\Elton John\\
#adl-piano-midi\Rock\Album Rock
#'\\Blues\\Ragtime\\Sue Keller'
#'\\World\\Swedish Pop\\Abba'
folder_path = '\\Rock\\Soft Rock\\Lionel Richie\\Say You Say Me.mid'
songs = get_notes_and_durations('adl-piano-midi'+folder_path)
[0] adl-piano-midi\Rock\Soft Rock\Lionel Richie\Say You Say Me.mid
#test_index = 5
#create_midi_from_notes(songs[test_index][0], songs[test_index][1], songs[test_index][2], './generation test files/soft rock 11.mid')
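As a hedged usage sketch of the full read-process-write cycle (the output path and plot title here are illustrative), the first parsed song can be written back to MIDI and inspected with the piano-roll plot:
notes_seq, durations_seq, offsets_seq = songs[0]
create_midi_from_notes(notes_seq, durations_seq, offsets_seq, './generation test files/roundtrip.mid')
plot_piano_roll('./generation test files/roundtrip.mid', title='Round-trip check')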
Pre-process Data
The main goal of this section is to organize and transform the musical data into a structured form that neural networks can learn from. This involves converting musical elements into numerical representations and structuring these into sequences that serve as inputs for the model.
Steps Involved
1. Initialization and Data Preparation:
Empty lists (all_notes, all_durations, all_offsets) are initialized to store the extracted musical elements from multiple MIDI files.
A transition_marker is defined to signify the end of a song within the dataset, ensuring that the model recognizes song boundaries during training.
2. Data Consolidation:
Iterating through a collection of songs, the code appends notes, durations, and offsets from each song into the respective lists. This step includes adding the transition_marker between songs to maintain the distinction between consecutive songs in the dataset.
3. Vocabulary Construction:
By removing the transition markers and calculating the unique elements in each category (notes, durations, offsets), the code establishes the vocabulary size for each aspect of the musical data.
4. Mapping Creation:
Dictionaries (note_to_int, duration_to_int, offset_to_int) are created to map each unique musical element to an integer, facilitating the numerical representation necessary for machine learning models.
5. Sequence Preparation:
The code constructs sequences of a fixed length (sequence_length) from the combined lists, excluding sequences containing the transition marker to prevent the model from learning to transition between unrelated songs.
6. Data Formatting for Neural Network Training:
Input sequences are reshaped and normalized based on the vocabulary size for each element type. This normalization is critical for learning, as it helps in maintaining a consistent scale across the dataset.
Outputs are one-hot encoded, turning them into binary class matrices, essential for classification tasks within neural networks.
from keras.utils import to_categorical
import numpy as np
sequence_length = 50
all_notes = []
all_durations = []
all_offsets = []
# Define a special marker for transitions between songs
transition_marker = ('<end_song>', -1, -1)
# Iterate through each tuple in the list and add transition markers
for tuple_ in songs:
all_notes.extend(tuple_[0] + [transition_marker[0]])
all_durations.extend(tuple_[1] + [transition_marker[1]])
all_offsets.extend(tuple_[2] + [transition_marker[2]])
# Remove the last added marker as it is not needed after the last song
all_notes.pop()
all_durations.pop()
all_offsets.pop()
n_vocab = len(set(all_notes))
d_vocab = len(set(all_durations))
o_vocab = len(set(all_offsets))
if(len(songs) > 1):
    # With multiple songs the transition marker is present in each set, so exclude it from the vocabulary sizes
    n_vocab = n_vocab - 1
    d_vocab = d_vocab - 1
    o_vocab = o_vocab - 1
print((n_vocab, d_vocab, o_vocab))
pitchnames = sorted(set(all_notes) - {transition_marker[0]})
durationnames = sorted(set(all_durations) - {transition_marker[1]})
offsetnames = sorted(set(all_offsets) - {transition_marker[2]})
note_to_int = dict((note, number) for number, note in enumerate(pitchnames))
duration_to_int = dict((duration, number) for number, duration in enumerate(durationnames))
offset_to_int = dict((offset, number) for number, offset in enumerate(offsetnames))
network_input_notes = []
network_input_durations = []
network_input_offsets = []
network_output_notes = []
network_output_durations = []
network_output_offsets = []
for i in range(len(all_notes) - sequence_length):
if transition_marker[0] not in all_notes[i:i + sequence_length + 1]:
sequence_in_notes = all_notes[i:i + sequence_length]
sequence_out_note = all_notes[i + sequence_length]
sequence_in_durations = all_durations[i:i + sequence_length]
sequence_out_duration = all_durations[i + sequence_length]
sequence_in_offsets = all_offsets[i:i + sequence_length]
sequence_out_offset = all_offsets[i + sequence_length]
network_input_notes.append([note_to_int[note] / n_vocab for note in sequence_in_notes])
network_input_durations.append([duration_to_int[duration] / d_vocab for duration in sequence_in_durations])
network_input_offsets.append([offset_to_int[offset] / o_vocab for offset in sequence_in_offsets])
network_output_notes.append(note_to_int[sequence_out_note])
network_output_durations.append(duration_to_int[sequence_out_duration])
network_output_offsets.append(offset_to_int[sequence_out_offset])
network_input_notes = np.reshape(network_input_notes, (len(network_input_notes), sequence_length, 1))
network_input_durations = np.reshape(network_input_durations, (len(network_input_durations), sequence_length, 1))
network_input_offsets = np.reshape(network_input_offsets, (len(network_input_offsets), sequence_length, 1))
network_output_notes = to_categorical(network_output_notes, num_classes=n_vocab)
network_output_durations = to_categorical(network_output_durations, num_classes=d_vocab)
network_output_offsets = to_categorical(network_output_offsets, num_classes=o_vocab)
(31, 7, 12)
network_input_notes.shape
(777, 50, 1)
This preprocessing step is fundamental to the machine learning pipeline presented in the notebook. It bridges the gap between the raw musical data extracted from MIDI files and the data's actual application in training neural network models. By converting musical elements into a structured and numerical format, it enables the subsequent steps of model design, training, and eventually music generation based on learned patterns. This approach exemplifies how data from creative domains like music can be made amenable to computational techniques, paving the way for innovative applications like automated composition and interactive music systems.
Design Network
This model utilizes separate LSTM (Long Short-Term Memory) layers for each type of musical element (notes, durations, offsets), integrating dropout layers to reduce overfitting. The outputs from these LSTM layers are then fed into dense layers to predict the next note, duration, and offset, respectively. This design aims to learn and generate music by understanding the patterns in note sequences, note durations, and their timings independently before making predictions.
model_num = 4
from keras.layers import Input, LSTM, Dropout, Dense
from keras.models import Model
if(model_num == 1):
# Input layer
input_notes = Input(shape=(network_input_notes.shape[1], network_input_notes.shape[2]), name='input_notes')
input_durations = Input(shape=(network_input_durations.shape[1], network_input_durations.shape[2]), name='input_durations')
input_offsets = Input(shape=(network_input_offsets.shape[1], network_input_offsets.shape[2]), name='input_offsets')
# Shared LSTM layers for notes
lstm_notes1 = LSTM(256, return_sequences=True)(input_notes)
dropout_notes1 = Dropout(0.3)(lstm_notes1)
lstm_notes2 = LSTM(256)(dropout_notes1)
dropout_notes2 = Dropout(0.3)(lstm_notes2)
# Shared LSTM layers for durations
lstm_durations1 = LSTM(256, return_sequences=True)(input_durations)
dropout_durations1 = Dropout(0.3)(lstm_durations1)
lstm_durations2 = LSTM(256)(dropout_durations1)
dropout_durations2 = Dropout(0.3)(lstm_durations2)
# Shared LSTM layers for offsets
lstm_offsets1 = LSTM(256, return_sequences=True)(input_offsets)
dropout_offsets1 = Dropout(0.3)(lstm_offsets1)
lstm_offsets2 = LSTM(256)(dropout_offsets1)
dropout_offsets2 = Dropout(0.3)(lstm_offsets2)
# Separate output layers for notes and durations
notes_output = Dense(n_vocab, activation='softmax', name='notes_output')(dropout_notes2)
durations_output = Dense(d_vocab, activation='softmax', name='durations_output')(dropout_durations2)
offsets_output = Dense(o_vocab, activation='softmax', name='offsets_output')(dropout_offsets2)
model = Model(inputs=[input_notes, input_durations, input_offsets], outputs=[notes_output, durations_output, offsets_output])
model.compile(loss={'notes_output': 'categorical_crossentropy', 'durations_output': 'categorical_crossentropy', 'offsets_output': 'categorical_crossentropy'}, optimizer='rmsprop')
model.summary()
The second model introduces bidirectional LSTMs and a concatenation layer, enhancing the model's ability to capture patterns in both forward and backward directions of the sequences. This design is more complex and aims to improve the network's understanding of the musical data by integrating information from notes, durations, and offsets simultaneously before making predictions. This could potentially lead to better generation of musical sequences that are coherent and musically pleasing.
from keras.layers import Input, LSTM, Dropout, Dense, Concatenate, Bidirectional
from keras.models import Model
if(model_num == 2):
# Input layer
input_notes = Input(shape=(network_input_notes.shape[1], network_input_notes.shape[2]), name='input_notes')
input_durations = Input(shape=(network_input_durations.shape[1], network_input_durations.shape[2]), name='input_durations')
input_offsets = Input(shape=(network_input_offsets.shape[1], network_input_offsets.shape[2]), name='input_offsets')
# Shared LSTM layers for notes
lstm_notes1 = Bidirectional(LSTM(256, return_sequences=True))(input_notes)
dropout_notes1 = Dropout(0.3)(lstm_notes1)
# Shared LSTM layers for durations
lstm_durations1 = LSTM(256, return_sequences=True)(input_durations)
dropout_durations1 = Dropout(0.3)(lstm_durations1)
# Shared LSTM layers for offsets
lstm_offsets1 = LSTM(256, return_sequences=True)(input_offsets)
dropout_offsets1 = Dropout(0.3)(lstm_offsets1)
# Concatenation layer - Combining all features
concat_layer = Concatenate()([dropout_notes1, dropout_durations1, dropout_offsets1])
# Final LSTM layer to interpret the combined features
combined_features = LSTM(256)(concat_layer)
combined_dropout = Dropout(0.3)(combined_features)
# Separate output layers for notes and durations
notes_output = Dense(n_vocab, activation='softmax', name='notes_output')(combined_dropout)
durations_output = Dense(d_vocab, activation='softmax', name='durations_output')(combined_dropout)
offsets_output = Dense(o_vocab, activation='softmax', name='offsets_output')(combined_dropout)
model = Model(inputs=[input_notes, input_durations, input_offsets], outputs=[notes_output, durations_output, offsets_output])
model.compile(loss={'notes_output': 'categorical_crossentropy', 'durations_output': 'categorical_crossentropy', 'offsets_output': 'categorical_crossentropy'}, optimizer='rmsprop')
model.summary()
The third and most advanced model incorporates embedding layers for notes, durations, and offsets, followed by bidirectional LSTMs and an attention mechanism. The embeddings provide a dense representation of the musical elements, while the attention mechanism allows the model to focus on specific parts of the input sequences when making predictions. This model is designed to capture more nuanced patterns and relationships within the musical data, potentially leading to the generation of complex and varied musical compositions.
from keras.layers import Input, LSTM, Dropout, Dense, Concatenate, Embedding, Bidirectional, Attention
from keras.models import Model
if(model_num == 3):
max_sequence_length = 50
note_embedding_size = 100
duration_embedding_size = 50
offset_embedding_size = 50
# Input layers
input_notes = Input(shape=(max_sequence_length,), name='input_notes')
input_durations = Input(shape=(max_sequence_length,), name='input_durations')
input_offsets = Input(shape=(max_sequence_length,), name='input_offsets')
    # Embedding layers (these expect integer-encoded sequences, so this model
    # would need its own, non-normalized preprocessing of the inputs)
    note_embedding = Embedding(input_dim=n_vocab, output_dim=note_embedding_size)(input_notes)
    duration_embedding = Embedding(input_dim=d_vocab, output_dim=duration_embedding_size)(input_durations)
    offset_embedding = Embedding(input_dim=o_vocab, output_dim=offset_embedding_size)(input_offsets)
# Shared LSTM layers for notes, durations, and offsets
# Using Bidirectional LSTMs to capture patterns in both directions
lstm_notes = Bidirectional(LSTM(256, return_sequences=True))(note_embedding)
dropout_notes = Dropout(0.3)(lstm_notes)
lstm_durations = Bidirectional(LSTM(256, return_sequences=True))(duration_embedding)
dropout_durations = Dropout(0.3)(lstm_durations)
lstm_offsets = Bidirectional(LSTM(256, return_sequences=True))(offset_embedding)
dropout_offsets = Dropout(0.3)(lstm_offsets)
# Attention Mechanism
attention_notes = Attention()([dropout_notes, dropout_notes])
attention_durations = Attention()([dropout_durations, dropout_durations])
attention_offsets = Attention()([dropout_offsets, dropout_offsets])
# Concatenation layer - Combining all features
concat_layer = Concatenate()([attention_notes, attention_durations, attention_offsets])
# Final LSTM layer to interpret the combined features
combined_features = LSTM(256)(concat_layer)
combined_dropout = Dropout(0.3)(combined_features)
# Separate output layers for notes, durations, and offsets
notes_output = Dense(n_vocab, activation='softmax', name='notes_output')(combined_dropout)
durations_output = Dense(d_vocab, activation='softmax', name='durations_output')(combined_dropout)
offsets_output = Dense(o_vocab, activation='softmax', name='offsets_output')(combined_dropout)
model = Model(inputs=[input_notes, input_durations, input_offsets], outputs=[notes_output, durations_output, offsets_output])
model.compile(loss={'notes_output': 'categorical_crossentropy',
'durations_output': 'categorical_crossentropy',
'offsets_output': 'categorical_crossentropy'},
optimizer='adam')
model.summary()
For the fourth model, a simplified approach is explored: all three inputs are received and concatenated immediately, in order to test the capability of simpler models.
In exploring this model, it also became clear that the networks overfit nearly instantly (a possible mitigation is sketched below).
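As a possible mitigation, not part of the original notebook and assuming training is run with a validation split, an EarlyStopping callback could halt training once validation loss stops improving:
from keras.callbacks import EarlyStopping
# Stop once validation loss plateaus and restore the best-performing weights
early_stop = EarlyStopping(monitor='val_loss', patience=5, restore_best_weights=True)
# Hypothetical fit call with a validation split:
# model.fit([network_input_notes, network_input_durations, network_input_offsets],
#           [network_output_notes, network_output_durations, network_output_offsets],
#           epochs=100, batch_size=64, validation_split=0.1, callbacks=[early_stop])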
from keras.layers import Input, LSTM, Dropout, Dense, Concatenate, Bidirectional
from keras.models import Model
from keras.optimizers import Adam
if(model_num == 4):
# Input layer
input_notes = Input(shape=(network_input_notes.shape[1], network_input_notes.shape[2]), name='input_notes')
input_durations = Input(shape=(network_input_durations.shape[1], network_input_durations.shape[2]), name='input_durations')
input_offsets = Input(shape=(network_input_offsets.shape[1], network_input_offsets.shape[2]), name='input_offsets')
# Concatenation layer - Combining all features
concat_layer = Concatenate()([input_notes, input_durations, input_offsets])
# LSTM layer to interpret the combined features
combined_features1 = Bidirectional(LSTM(512, return_sequences=True))(concat_layer)
dropout_offsets1 = Dropout(0.3)(combined_features1)
# Final LSTM layer to interpret the combined features
combined_features2 = Bidirectional(LSTM(512))(dropout_offsets1)
dropout_offsets2 = Dropout(0.3)(combined_features2)
# Separate output layers for notes and durations
notes_output = Dense(n_vocab, activation='softmax', name='notes_output')(dropout_offsets2)
durations_output = Dense(d_vocab, activation='softmax', name='durations_output')(dropout_offsets2)
offsets_output = Dense(o_vocab, activation='softmax', name='offsets_output')(dropout_offsets2)
model = Model(inputs=[input_notes, input_durations, input_offsets], outputs=[notes_output, durations_output, offsets_output])
optimizer = Adam(learning_rate=0.005)
model.compile(
loss={'notes_output': 'categorical_crossentropy', 'durations_output': 'categorical_crossentropy', 'offsets_output': 'categorical_crossentropy'},
loss_weights={'notes_output': 1.0, 'durations_output': 1.0, 'offsets_output': 1.0},
optimizer=optimizer)
model.summary()
Model: "functional_1"
┏━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┓
┃