TIANHAO ZHANG

Composing Machines: Pivotal Works in the History of Score-Level Music Generation

I. Introduction

The process of music composition can often be encapsulated in an algorithm or a specific set of rules. In Western classical music, particularly in earlier genres, many composers adhere to clear part-writing rules in most cases. While human composers have employed such algorithms for hundreds of years to assist or even accelerate composition, generating music algorithmically without any human intervention became popular only after the advent of computers and their vastly greater computing power.

      

Although the concept of algorithmic composition existed long before computers, this paper focuses on computer-aided approaches. This survey summarizes pivotal and influential works in the history of score-level music generation, spanning from early attempts to recent complex neural network designs. It covers projects on melody generation, polyphony generation, melody harmonization, music inpainting, and several other noteworthy approaches.


II. Earlier Attempts

II-a. The Dawn: 1956 – 1957

The earliest attempts at automating the composition process with computers were made in the 1950s. In 1956, Richard C. Pinkerton published an algorithm in Scientific American that may constitute the first study of generating music with a computer.[18] The algorithm could only output monophonic music, or melodies, probably the simplest form of music.

 

Using a statistical approach, Pinkerton first analyzed 39 selected nursery tunes to derive the sequential probabilities of notes. More specifically, the melodies he chose were transposed into C major or A minor and used only the seven pitches of the diatonic scale. He introduced the symbol “O” to represent a one-beat rest or a holdover of the previous note into the next beat, giving a total of eight possible choices. Following the analysis, he employed the probability tables to reconstruct similar melodies.

 

The idea of modeling an existing musical style using a specific dataset in Pinkerton’s work is very inspiring. Many subsequent significant research endeavors shared a similar methodology.

 

Shortly thereafter, in 1957, computer scientists Frederick P. Brooks et al. built upon Pinkerton’s work and developed a Markov model.[2] This may be the first time a machine learning algorithm for music generation was implemented on a digital computer.

 

Similar to Pinkerton’s approach, Brooks et al. also employed probabilistic analysis. However, instead of examining only the single note preceding the note being generated, they enhanced the methodology by considering the preceding (m - 1) notes in a Markov analysis of order m. In other words, the model takes not a single pitch but an array of (m - 1) elements as input, while the output is still a single pitch. During their experiments, they tested various values of m to compare the outcomes.
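To make this family of models concrete, the following minimal Python sketch builds a transition table over the eight symbols described above (seven diatonic pitches plus “O”) and samples a new melody from an order-m context. The function names and the toy two-tune corpus are illustrative only and are not taken from either paper.

```python
# Minimal sketch of the Pinkerton/Brooks-style approach: estimate transition
# probabilities over an 8-symbol alphabet (7 diatonic pitches plus "O" for a
# rest/holdover) and sample new melodies. The corpus below is a toy stand-in.
import random
from collections import defaultdict, Counter

SYMBOLS = ["C", "D", "E", "F", "G", "A", "B", "O"]

def train_markov(melodies, m=2):
    """Count how often each symbol follows each context of (m - 1) symbols."""
    counts = defaultdict(Counter)
    for mel in melodies:
        for i in range(m - 1, len(mel)):
            context = tuple(mel[i - (m - 1):i])
            counts[context][mel[i]] += 1
    return counts

def generate(counts, seed, length=16):
    """Sample a melody by repeatedly drawing the next symbol given the context."""
    melody = list(seed)
    k = len(seed)                      # context size = m - 1
    for _ in range(length - k):
        context = tuple(melody[-k:])
        dist = counts.get(context)
        if not dist:                   # unseen context: fall back to a random symbol
            melody.append(random.choice(SYMBOLS))
            continue
        symbols, weights = zip(*dist.items())
        melody.append(random.choices(symbols, weights=weights)[0])
    return melody

corpus = [["C", "C", "G", "G", "A", "A", "G", "O"],      # toy "nursery tune"
          ["E", "E", "D", "D", "C", "O", "C", "O"]]
table = train_markov(corpus, m=2)
print(generate(table, seed=("C",), length=12))
```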

 

In their paper, Brooks et al. enumerated three important difficulties in generating music or any form of art using a machine learning approach, which are still relevant nowadays:

1. The generalization or analysis may be overly simple or naïve, preventing the system from generating structures akin to the training examples.

2. The training set may be not large enough to allow sophisticated modeling.

3. The generation result may be too similar or even identical to one of the training examples, resulting in a deficiency in creativity.

 

The first problem is reminiscent of what is now known as high bias in modern machine learning theory. The second and third problems are essentially two sides of the same issue: a training set that is too small or insufficiently diverse leads to outputs that merely copy the examples, which is what researchers nowadays call high variance. Brooks et al. also stated that the first problem is more likely to manifest when m is too low, whereas the second and third are more prone to arise when m is large.

 

Also, interestingly, Brooks et al. observed that music composition may be a smaller or simplified version of natural language generation. They noted the similarity between these two forms of sequential data and anticipated that later studies would apply similar algorithms to both tasks.

 

II-b. Illiac Suite

The earliest computer-generated musical score widely recognized in musical academia is the Illiac Suite. It was created by Lejaren Hiller and Leonard Isaacson in 1957, using the ILLIAC I at the University of Illinois at Urbana–Champaign.[8] The piece was written for string quartet and has four movements (or “experiments”), each generated through a distinct algorithm.

 

The first experiment is mainly the generation of cantus firmi. The piece begins with monodies of 3 to 12 notes in length, using only the white keys. The program embedded a set of rules to ensure that each melody adheres to a specific style: for example, the first and last notes of a cantus firmus must both be C, and melodic intervals of a major or minor seventh are prohibited.

 

       The second experiment delves into generating four-voice first-species counterpoint. Starting from randomly generated notes confined to the white keys, conventional counterpoint rules were added successively. All harmonic and melodic intervals were permitted before the first step. The 6th step limited the use of dissonant harmonic intervals, including seconds, sevenths, and tritones. Parallel fifths and octaves were banned at the 7th step.

 

The third experiment is more innovative. As far as we know, it was the first time in history that a computer wrote chromatic music. The composers also tried to include three musical elements other than pitch in the generative process: rhythm, dynamics, and articulation. Starting from purely random chromatic music, simple rules were added later in the movement to control the pitches, and tone rows were generated for use in the last section.

 

The fourth experiment is, like the work of Brooks et al., based on probabilistic analysis of existing music. A fascinating design element embedded in this movement is the “ith order” Markov chain. The composers reasoned that tonality, in its simplest form, is essentially the melody returning to its beginning note. No matter how large m is, once a generated melody grows longer than m notes, the beginning note no longer enters the calculation, so the music generated may no longer be tonal in this sense. To address this concern, in the Coda of this movement, all notes on the strong beats were generated according to the initial note of the melody instead of the note(s) immediately preceding. This technique was described as the “ith order” Markov chain, where i is the index of the note currently being generated.
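The sketch below illustrates this idea in Python under simplified assumptions: strong-beat notes are drawn from a distribution conditioned on the melody’s initial note, while all other notes follow an ordinary first-order table. Both probability tables are invented toy values, not those derived by Hiller and Isaacson.

```python
# Illustrative sketch of the "ith order" idea from the Illiac Suite's fourth
# experiment: strong-beat notes are conditioned on the very first note of the
# melody, and the remaining notes on the immediately preceding note.
import random

first_order = {                     # P(next | previous note), toy values
    "C": {"D": 0.4, "E": 0.4, "G": 0.2},
    "D": {"C": 0.5, "E": 0.5},
    "E": {"D": 0.3, "F": 0.4, "C": 0.3},
    "F": {"E": 0.6, "G": 0.4},
    "G": {"C": 0.5, "E": 0.5},
}
from_initial = {                    # P(strong-beat note | initial note), toy values
    "C": {"C": 0.5, "E": 0.3, "G": 0.2},
}

def sample(dist):
    notes, probs = zip(*dist.items())
    return random.choices(notes, weights=probs)[0]

def generate(initial="C", length=16, beats_per_bar=4):
    melody = [initial]
    for i in range(1, length):
        on_strong_beat = (i % beats_per_bar == 0)
        if on_strong_beat:
            melody.append(sample(from_initial[initial]))   # look back to the start
        else:
            melody.append(sample(first_order[melody[-1]])) # ordinary Markov step
    return melody

print(generate())
```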

 

       The ith-order Markov chain likely represents the earliest attempt to incorporate long-term dependencies into a musical generative system. Evidently, later content in a long musical composition should be related to the materials found in the beginning. However, an obvious shortcoming found in Hiller’s approach is that the short-term dependency was sacrificed for those strong beat notes. Architectures capable of handling both long and short-term “memory” simultaneously are considerably more intricate and were developed long after Hiller’s time.

 

II-c. Gottfried Michael Koenig

The German-Dutch composer Gottfried Michael Koenig was a significant early contributor to computer-aided music generation. While the Illiac Suite stood as a pioneering success, Hiller’s programs were written specifically for the compositional goals of that one piece; he did not build software that could generate many different works. Koenig’s “Project 1”, developed in 1964, was one of the earliest endeavors to do so.[11]

 

The earliest version of this program ran on an IBM 7090 and was tested at the University of Bonn in Germany. It was written in IBM’s FORTRAN II language. Koenig had experience working in the Cologne electronic music studio and was heavily influenced by musical serialism, an influence that can be seen in Project 1’s generation procedure.

 

The program generates the musical elements sequentially: timbre, rhythm, pitch, register, and finally dynamics, instead of considering them together comprehensively as a human composer typically does. For example, it decides the rhythms for all notes before filling in any pitch. That means there is little interdependency among the elements.

 

A piece of music composed by Project 1 has seven “form-sections”, a term coined to distinguish them from the seven “program-sections”. Each program-section generates an array of numbers according to a distinct rule, and the array is then used in one of the seven form-sections to determine musical elements such as pitch.

 

Somewhat disappointingly, Koenig pointed out that users of Project 1 had minimal influence over the output. The number of parameters that could be modified by the user was rather limited. The program can generate many different pieces, but they are essentially all derived from the same algorithm. As a response to this limitation, the program was soon revised and improved to “Project 2”. This updated version offered users substantially more creative freedom [12].

 

Project 2 was written in ALGOL 60, an early programming language. Koenig wrote a much more detailed explanation of this new program in 1970 compared to the short report of Project 1. While Project 1 primarily catered to his personal needs, Project 2 aimed to be accessible to any composer, serving “as many purposes as possible.” The input data includes 63 entries that empower the user to customize the instruments, chords, pitches, intervals, rhythms, dynamics, and so on.

 

Unlike the approach of modeling an existing musical style by analyzing a collection of pieces, Koenig's work leaned towards a rule-based algorithmic composition method. Although early serial compositions, including those by Arnold Schoenberg, could also be seen as being “generated” by specific rules, Koenig’s rules were notably more intricate and relied on the computational capabilities of machines. As he wrote in his paper, “the computer can help solve problems which previously had to remain unsolved because of the time it took to solve them”.

 

II-d. David Cope and EMI

EMI (Experiments in Musical Intelligence) was a program developed by American composer and scientist David Cope in 1987.[4] Cope’s approach drew inspiration from natural language processing (NLP) technologies of the time.

 

The system was designed to model the styles of different composers and output “original” music in the chosen style. In his paper, Cope defined “style” as a set of rules or restrictions specifying which actions or symbols could appear in the musical sequence, given the context. A composer with enough computer literacy could customize the “dictionary” in the software to define their desired rules and thus generate different musical outputs. Entries in a dictionary could include conditional statements, forming the “grammar” of the music. Ideally, Cope believed, through sustained interaction a composer could use the system to model their own style almost perfectly.
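As a rough illustration of treating a style as a dictionary of context-conditional rules, the toy Python sketch below restricts which scale degree may follow the current one. The rule set and function names are invented for this example and do not reflect EMI’s actual internals.

```python
# Toy illustration of a "style" as a dictionary of context-conditional rules.
import random

# Each entry maps the current scale degree to the degrees allowed to follow it.
style_dictionary = {
    1: [2, 3, 5],      # from the tonic, allow stepwise motion or a leap to the dominant
    2: [1, 3],
    3: [2, 4, 1],
    4: [3, 5],
    5: [4, 6, 1],      # the dominant may resolve back to the tonic
    6: [5, 7],
    7: [1],            # the leading tone must resolve to the tonic
}

def generate_phrase(start=1, length=8):
    phrase = [start]
    while len(phrase) < length:
        allowed = style_dictionary[phrase[-1]]
        phrase.append(random.choice(allowed))
    return phrase

print(generate_phrase())
```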

 

While the program could generate music without any human intervention, Cope encouraged composers to regard the generated scores as sources of inspiration or initial points of departure. Composers can then accept, adjust, or discard the musical excerpts according to their personal aesthetics.

 

With a sense of humor, Cope claimed that “the inspiration derived explicitly from the author’s duress over an impending deadline for a commissioned work.” Being a composer himself, Cope had a direct motivation for this line of work, and he may be the first to research computer-assisted composition, or human-computer collaboration in musical tasks. His work inspired later researchers who likewise aim to accelerate, not replace, humans’ compositional process.


III. Neural Networks

III-a. Overview

The concept of artificial neural networks has existed for a long time. In 1958, Frank Rosenblatt developed the first practical neural network called the Perceptron.[21] However, it was not until the 1980s and the 1990s, with the development of new neural network architectures and learning algorithms, that neural networks gained more widespread use in practical applications. Like Markov models, artificial neural networks can also be considered statistical models. They harness statistical techniques to learn from data and make predictions.

 

Neural networks have been widely used in music generation tasks, but many systems are audio-based, meaning both the training examples and the outputs are in an audio format. Score-based approaches, or symbolic music generation research, are comparatively fewer, limited in part by the size of the available training sets.

 

One of the neural networks’ most significant limitations is the need for a huge training set. The abundance of online music audio files has fueled the success of AI models that deal with musical audio. For example, a famous dataset named MSD, the Million Song Dataset, contains the audio feature analysis results of one million contemporary popular music tracks.[1]

 

On the other hand, the most influential file format for symbolic music notation is MIDI. Yet the quantity of MIDI files accessible online remains relatively small, posing challenges for MIDI-related machine-learning tasks. The Lakh MIDI Dataset, collected by computer scientist Colin Raffel in 2016, is often considered one of the largest MIDI datasets, and it includes 176,581 unique MIDI files.[19] This figure falls far short of a million even though the author imposed no limitation on musical style, meaning every genre, not just popular music, is already included. Moreover, the author did not attempt to remove corrupt files from the dataset, so a portion of it may be invalid. The size and quality of symbolic music datasets thus present obstacles to further research.

 

Nevertheless, score-level generation has its unique advantages. An exciting one is its potential to foster human-machine collaboration. Music generation at the audio level is relatively more popular yet less helpful for musicians. Generated audio cannot be de-mixed into separate instrumental tracks (with current technology) and does not contain the insights musicians need to understand the piece. Mixed audio of multi-instrument music is difficult to modify or improve, whereas composers can freely edit scores to align with their aesthetics. Scores can also be used by human performers to produce live performances, which remain an important habitat of music even in modern society.

 

The following subsections delve into several pivotal studies in this field.

 

III-b. Peter M. Todd

The first attempt to apply neural networks to algorithmic composition was made as early as 1989 by Peter M. Todd at Stanford University.[22] Leveraging parallel distributed processing (PDP), a new technology at the time, the computation could be carried out in parallel by many smaller units instead of one powerful central unit, making neural network architectures with many neurons more efficient.

 

In Todd’s experiment, the form of the musical examples was limited to monophony. He used a neural network structure now known as a recurrent neural network (RNN). In short, a recurrent neural network routes the output layer back to the input layer, so that when determining the next element in the sequence, the network has information about the “context”, i.e., the previously generated notes. The part of the input layer that holds this information is called the context units in Todd’s writing. RNNs were designed specifically to deal with sequential data and also found extensive application in natural language processing.

      

Todd also discussed the representation of rhythm. He believed it is best to define a smallest unit of time in the model, termed a “time slice”. Naturally, the size of a time slice should be the greatest common divisor of all note lengths found in the training set, and it becomes the shortest possible note length in the generated music.

 

A unit described as the note-begin marking unit was added to the structure to make a clear distinction between two separate notes of the same pitch and a single held note. For example, if the unit for pitch C5 is on for two time slices, we check whether the note-begin unit is on at the second time slice: if it is, there are two attacks of the same pitch; if it is off, it is a single note of longer duration. Similar designs remain necessary in later music-generating neural networks in order to resolve this ambiguity.
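The sketch below encodes a short monophonic melody into such time slices. The exact vector layout is illustrative rather than Todd’s, but it shows how a note-begin flag disambiguates repeated pitches from held notes.

```python
# Minimal sketch of a time-slice encoding with a note-begin unit.
# Each (pitch, duration) pair becomes `duration` slices: the pitch is present
# in every slice, while the note-begin flag is on only for the first slice.
melody = [(60, 2), (60, 2), (62, 1), (64, 3)]   # MIDI pitch, length in time slices

def encode(melody):
    slices = []
    for pitch, duration in melody:
        for i in range(duration):
            slices.append({"pitch": pitch, "note_begin": (i == 0)})
    return slices

for s in encode(melody):
    print(s)
# The two consecutive C4 notes (pitch 60) are distinguishable from one long C4
# because note_begin turns on again at the start of the second note.
```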

 

III-c. LSTM for Music Generation

A vanilla RNN design has a flaw known as the vanishing gradient problem. In short, it is difficult for a vanilla RNN to “realize” that it needs to memorize an early piece of information. For example, a sonata form typically has three sections: exposition, development, and recapitulation. A recapitulation section is approximately a repetition of the exposition section with merely small changes. Since the development section could be fairly long, it is difficult for a sequential model to capture this long-term dependency.

 

In more technical terms, during training, i.e., backpropagation through time, the sheer number of elements in a musical sequence means the gradient has a very hard time propagating all the way back through the sequence to affect computations that occur much earlier.

 

To deal with the vanishing gradient problem, scientists have come up with multiple modifications to the RNN model. One of the most successful solutions is the LSTM (long short-term memory). Introduced by Sepp Hochreiter and Jürgen Schmidhuber in 1997, LSTM adds the input gate, forget gate, and output gate to the RNN model [9]. During the training process, an LSTM network can learn which pieces of information should be remembered and which could be forgotten.

 

It was not until 2002 that LSTM was harnessed for music generation for the first time. Douglas Eck and Jürgen Schmidhuber (one of the inventors of LSTM) published a paper titled “A First Look at Music Composition using LSTM Recurrent Neural Networks” [6]. In this study, they successfully trained a model to generate blues melodies with accompanying chords.

 

In this study, the training set and the generated results were limited to 12-bar blues in 4/4 time. The quantization was set to eighth notes, meaning there are always 96 units or steps in a musical sequence. Although the authors acknowledged that these limitations made the AI composition task much easier than it should be, they considered the study a success because it was the first time an RNN model generated music with a long-term structure. They also saw it as an answer to Mozer, who had doubted whether RNN models could create music with “global coherence” at all [15].
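As a minimal sketch of this kind of setup, the PyTorch snippet below runs one training step of an LSTM next-token predictor over 96-step symbolic sequences. The vocabulary size, hyperparameters, and random training batch are placeholders, not those used by Eck and Schmidhuber.

```python
# Minimal sketch of LSTM next-token prediction over a symbolic music vocabulary.
import torch
import torch.nn as nn

VOCAB_SIZE = 64          # e.g., pitches, chords, and a rest/hold token

class MelodyLSTM(nn.Module):
    def __init__(self, vocab=VOCAB_SIZE, embed=32, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)
        self.head = nn.Linear(hidden, vocab)

    def forward(self, tokens):
        x = self.embed(tokens)            # (batch, time, embed)
        out, _ = self.lstm(x)             # gates decide what to keep or forget
        return self.head(out)             # logits for the next token at each step

model = MelodyLSTM()
batch = torch.randint(0, VOCAB_SIZE, (8, 96))          # 8 sequences of 96 steps
logits = model(batch[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB_SIZE), batch[:, 1:].reshape(-1))
loss.backward()                                         # gradients flow through the gates
print(float(loss))
```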

 

The authors also pointed out the intricacies of objectively evaluating AI-generated music, or any AI art. They simply presented the results on a website without arguing that they are objectively good music, yet they subjectively believed the generated blues tunes were pleasant. A jazz musician himself, Eck was “struck by how much the compositions sound like real bebop jazz improvisation”. Although algorithmic evaluation methods were invented and used in later research, the lack of a universally accepted objective evaluation metric remains a problem for researchers even today. Given the inherent subjectivity of art, this problem is perhaps insoluble.

 

III-d. C-RNN-GAN

In a paper published in November 2016, machine learning researcher Olof Mogren proposed a structure named C-RNN-GAN that combined the GAN framework with an RNN model.[14] He used the architecture to generate polyphonic music, marking the first time the GAN design was used in a music generation task.

 

One of the most interesting inventions in the development of deep learning, the generative adversarial network (GAN) is an architecture that trains two networks named the generator and the discriminator with conflicting goals. They “fight” in a zero-sum game and force each other to improve. More specifically, the generator tries to compose a piece of music that looks similar to the real ones in the training set composed by humans, and the discriminator then tries to recognize any differences.
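The schematic PyTorch training step below makes the two roles concrete; the tiny feed-forward networks and noise-based “music” vectors are placeholders and are not C-RNN-GAN’s actual recurrent architecture.

```python
# Schematic GAN training step: the discriminator learns to separate real from
# generated data, and the generator learns to fool the discriminator.
import torch
import torch.nn as nn

SEQ_DIM = 32                                   # a flattened toy "music" vector
G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, SEQ_DIM))
D = nn.Sequential(nn.Linear(SEQ_DIM, 64), nn.ReLU(), nn.Linear(64, 1))
opt_g = torch.optim.Adam(G.parameters(), lr=1e-3)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

real = torch.randn(8, SEQ_DIM)                 # stand-in for human-composed examples

# Discriminator step: label real data 1, generated data 0.
fake = G(torch.randn(8, 16)).detach()
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(fake), torch.zeros(8, 1))
opt_d.zero_grad(); d_loss.backward(); opt_d.step()

# Generator step: try to make the discriminator call its output "real".
fake = G(torch.randn(8, 16))
g_loss = bce(D(fake), torch.ones(8, 1))
opt_g.zero_grad(); g_loss.backward(); opt_g.step()
print(float(d_loss), float(g_loss))
```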

 

The “C” in the name C-RNN-GAN stands for “continuous”. Mogren claimed that the symbolic representations used in earlier music generation RNNs were discrete. Using 3,697 MIDI files as the dataset, he considered the program to be modeling continuous sequences of musical data. This continuity was supported by the MIDI resolution of 384 ticks per quarter note in his dataset, probably a finer grid than any time slice defined in previous research.

 

From each note-on event found in the data, Mogren’s program extracts the note length, pitch, intensity (velocity), and the time interval since the attack of the previous note. This was an early attempt to include note intensity in a music generation model. Each note in the MIDI format is associated with a velocity value, which serves as a more nuanced representation of dynamics than what typically appears on a musical score; this information therefore goes beyond the score and reaches the level of musical performance.
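A sketch of this per-note feature extraction is shown below. The choice of the pretty_midi library and the placeholder file path are purely illustrative and do not reflect Mogren’s implementation.

```python
# Sketch of extracting four per-note features: duration, pitch, velocity,
# and the time elapsed since the previous note's attack.
import pretty_midi

def extract_features(path):
    pm = pretty_midi.PrettyMIDI(path)
    notes = sorted(
        (n for inst in pm.instruments if not inst.is_drum for n in inst.notes),
        key=lambda n: n.start)
    features, prev_start = [], 0.0
    for n in notes:
        features.append({
            "duration": n.end - n.start,
            "pitch": n.pitch,
            "velocity": n.velocity,
            "delta": n.start - prev_start,   # time since the previous attack
        })
        prev_start = n.start
    return features

# Example call (hypothetical file path):
# print(extract_features("example.mid")[:5])
```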

 

However, some MIDI files in Mogren’s dataset did not have meaningful velocity data. Some MIDI files on the Internet were not recorded with performance devices such as MIDI keyboards but were entered using notation software, which leads to uniform velocity values for all notes in such files, rendering them unsuitable for modeling dynamics. Mogren did not appear to have much prior experience with musical tasks, so it is possible that he was not aware of this fact. The predicament remains relevant today for anyone seeking to incorporate velocity data into their work.

 

III-e. Music Transformer

Since its invention in 2017, the powerful transformer model [24] has replaced RNNs as the state-of-the-art choice for natural language processing tasks. In December 2018, probably for the first time in history, Cheng-Zhi Anna Huang et al. applied the transformer architecture to music generation [10]. The project was named Music Transformer.

 

Much like RNNs, the transformer model grapples with sequential data. A detailed explanation of how a transformer works is beyond the scope of this paper, but one of its important characteristics is the self-attention mechanism: the model weighs the importance of each part of the input to determine where to pay more “attention”. Based on this mechanism, the algorithm has shown the ability to maintain long-term coherence in its outputs. By referencing its previously composed notes, the model can develop music around a motif, altering and reusing it much as a human composer does.
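The numpy sketch below shows plain scaled dot-product self-attention, the core operation that lets every position in a sequence of note tokens attend to every other position. Music Transformer’s specific contribution, an efficient relative-attention variant, is not reproduced here.

```python
# Minimal numpy sketch of scaled dot-product self-attention.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # pairwise relevance
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                               # weighted mix of all positions

rng = np.random.default_rng(0)
seq_len, d = 16, 8                                   # e.g., 16 note tokens, 8-dim embeddings
X = rng.normal(size=(seq_len, d))
out = self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
print(out.shape)                                     # (16, 8)
```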

 

Two renowned datasets were used in Huang et al.’s experiments to train two separate models. The first was JSB Chorales. To model the Bach counterpoint works, all the rhythms in the dataset were quantized with the smallest unit of a 16th note. The trained model could compose 4-voice polyphony in the Baroque chorale style.

 

The second dataset was the Piano-e-competition dataset, which contains about 1100 MIDI files of piano solo performances. Notably, these files encapsulated expressive dynamics and nuanced note timings because they were submitted for the performance competition. The “TIME_SHIFT” events were added to the vocabulary to factor in these expressive timings. The model trained using this dataset could output piano solo pieces with decent humanized performance.

 

This pioneering endeavor by Huang and colleagues definitively affirmed that transformer models are aptly suited for score-level music generation tasks. Since then, transformers have become arguably more popular than RNNs in cutting-edge research of AI composition.

 

III-f. LakhNES and Transfer Learning

Machine learning scientists have devised a method called transfer learning to deal with a small training set. When only a limited amount of data is available for the wanted task, it is possible to use other data in the same format to “pre-train” the neural network. For example, to model a specific musical style, music in other styles may be used for pre-training. This concept could be particularly helpful for score-level music generation tasks because of the shortage of symbolic musical data.

 

In 2019, Chris Donahue et al. developed a project called LakhNES.[5] The algorithm used was a modified transformer called Transformer-XL. Their goal was to generate multi-track music in the NES (Nintendo Entertainment System) style from a smaller dataset named the NES Music Database (NES-MDB), which contains 46 hours of music written for the NES ensemble. That ensemble’s instrumentation consists of two pulse-wave channels, one triangle-wave channel, and one noise generator.

 

To achieve better results, the researchers used the large Lakh MIDI dataset, which has 9000 hours of music in various styles, for the pre-training. The musical pieces in this larger dataset have vastly different instrumentations, so Donahue et al. ingeniously developed a method to “map” the data to the NES ensemble.

 

In simple terms, Donahue et al.’s program scanned through the Lakh dataset and identified monophonic tracks and percussion tracks. It mapped the monophonic tracks at random to either a pulse-wave or the triangle-wave channel and transcribed the percussion tracks to the noise generator. The researchers considered this mapping methodology the main contribution of their work. They also underscored that a comparable pre-training procedure could be applied to other instrumentations, be it a string quartet or a vocal choir. Experiments were conducted to validate that the pre-training significantly improved the model’s performance.
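A rough sketch of this kind of mapping is given below; the dictionary-based track representation and the channel-assignment logic are simplified illustrations, not the authors’ actual pipeline.

```python
# Rough sketch of mapping arbitrary multi-track data onto the NES channels:
# monophonic tracks go to a randomly chosen melodic channel (each used at most
# once), and percussion goes to the noise channel.
import random

NES_MELODIC = ["pulse1", "pulse2", "triangle"]

def map_to_nes(tracks):
    """tracks: {"name": {"notes": [...], "is_drum": bool, "is_monophonic": bool}}"""
    channels = NES_MELODIC.copy()
    random.shuffle(channels)
    mapping = {}
    for name, track in tracks.items():
        if track["is_drum"]:
            mapping[name] = "noise"
        elif track["is_monophonic"] and channels:
            mapping[name] = channels.pop()   # each melodic channel used at most once
        # remaining (polyphonic, non-drum) tracks are skipped in this sketch
    return mapping

song = {
    "lead":  {"notes": [60, 62, 64], "is_drum": False, "is_monophonic": True},
    "bass":  {"notes": [36, 43],     "is_drum": False, "is_monophonic": True},
    "kit":   {"notes": [35, 38, 42], "is_drum": True,  "is_monophonic": False},
    "piano": {"notes": [60, 64, 67], "is_drum": False, "is_monophonic": False},
}
print(map_to_nes(song))
```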

 

       The success of LakhNES provided a practical solution to the problem of lacking large musical datasets with the same instrumentation. Transfer learning opened up more opportunities for polyphonic music modeling.

 

III-g. Music InpaintNet and Music SketchNet

Most previous music generation programs, especially the earlier ones, assumed that music should be generated continuously: the generation of the current element depends only on the elements that appear before it, not on anything after. Drawing a parallel to the concept of image inpainting, music inpainting is the task of generating missing music from the surrounding information. The future musical context must somehow be considered during generation to ensure seamless continuity.

 

There are relatively few research projects related to music inpainting, but this topic is important if we hope to achieve a smooth and productive human-machine collaboration in composing music. In practice, it is common for human composers to write music in a non-sequential manner, wherein they may craft the ending of a piece before finalizing all the middle sections.

 

Published in 2019, Music InpaintNet developed by Pati et al. was a relatively early and successful project on music inpainting [17]. In the paper, the authors made a formal definition of the music inpainting problem: “given a past musical context Cp and a future musical context Cf, generate an inpainted sequence Ci which can connect Cp and Cf in a musically meaningful manner.” This definition demarcated music inpainting from other generative tasks.

      

Music InpaintNet aims to complete melodies with missing sections based on the surrounding context. The researchers trained an RNN that takes both the previous and the future musical material as input and outputs a vector representation of the predicted middle passage. That vector is then decoded into a musical score using a VAE (variational autoencoder).

      

A later project published in 2020, Music SketchNet by Chen et al., is also considered one of the most influential works on music inpainting [3]. The authors described their work as music “sketching”, one level beyond inpainting, because of an innovative design: in their framework, the user can optionally guide the model by specifying a pitch contour and/or a rhythmic pattern.

 

In the Music SketchNet pipeline, after a part named SketchInpainter predicts the sequence of the missing music, another component named SketchConnector combines the prediction with the user’s input before decoding. This connector is transformer-based. To simulate a user providing sketches, during training some pitch and rhythm information is randomly “unmasked” and presented to the model as ground truth.
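The snippet below sketches such random unmasking under simplified assumptions; the tensor shapes and the unmask probability are placeholders rather than Music SketchNet’s actual values.

```python
# Sketch of random "unmasking": reveal some ground-truth information to the
# model during training, as if a user had sketched it.
import torch

def random_unmask(ground_truth, p_unmask=0.3):
    """Return (sketch, mask): revealed values where mask is True, zeros elsewhere."""
    mask = torch.rand(ground_truth.shape[:1]) < p_unmask   # per-measure decision
    sketch = torch.where(mask.unsqueeze(-1), ground_truth,
                         torch.zeros_like(ground_truth))
    return sketch, mask

pitch_factors = torch.randn(8, 16)     # 8 measures, 16-dim latent pitch factors
sketch, mask = random_unmask(pitch_factors)
print(mask)                            # which measures the "user" specified
```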

 

One obvious limitation of these two works is that they can only process monophonic music. Once polyphonic music inpainting technology matures, it could vastly enhance the composers’ efficiency and impact our musical culture. As Chen et al. specified at the end of their paper, “how to represent a polyphonic music piece in the latent space is another pressing issue” that needs future work.

 

III-h. MTHarmonizer

Harmony is one of the major musical elements that composers need to consider, especially in tonal music. A harmonizer program is a system that takes a melody as input and generates a harmonic progression as output to accompany that melody. Earlier attempts to make a harmonizer used hidden Markov models (HMMs) [16].

 

In 2017, Hyungui Lim et al. proposed an RNN using BiLSTM (Bidirectional Long Short-term Memory) layers that could generate harmony for a given melody [13]. It was the first project that used deep learning in the harmonization task.

 

In 2021, Yin-Cheng Yeh et al. criticized the excessive use of common chords in the outputs of Lim et al.’s BiLSTM-based model [25], pointing out that the C, F, and G major chords were heavily repeated. To solve this problem, they proposed an improved approach named MTHarmonizer. Their most important change was taking harmonic functions into consideration.

 

       The harmonic function is arguably the most crucial concept in tonal music. Yeh et al. wrote that the distribution of the harmonic functions was more balanced than the chord labels, making the training process more efficient. Also, since the functions and the chord labels are interdependent, the neural network could learn which chords are interchangeable because they have the same function. This might trigger more exploration beyond the basic C, F, and G major triads.
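To illustrate the idea, the toy mapping below groups triads in C major by harmonic function; the exact label set used by Yeh et al. may differ.

```python
# Toy mapping from triads (in C major) to harmonic functions, showing why the
# function labels are more balanced and why chords sharing a function are
# interchangeable.
CHORD_TO_FUNCTION = {
    "C":  "tonic",        "Am": "tonic",
    "F":  "subdominant",  "Dm": "subdominant",
    "G":  "dominant",     "Bdim": "dominant",
}

progression = ["C", "Am", "F", "G", "Dm", "G", "C"]
functions = [CHORD_TO_FUNCTION[c] for c in progression]
print(functions)
# A model predicting functions first could later pick, say, Am instead of C
# wherever a "tonic" chord is required, reducing overuse of C, F, and G.
```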

 

       However, the researchers also confessed that the MTHarmonizer could only deal with triads. When simplifying the more complex chords to triads, their important harmonic “colors” were also eliminated. To develop a more powerful model, more chord structures, such as the seventh chords and suspended chords, should be defined in the training process.

 

As another useful application of human-machine collaboration, harmonizer programs can help develop the melodic materials created by human composers. They can also play an important role within a larger compositional AI framework, since the generated harmonic progression can guide the later generation of a complex multi-track musical texture.

 

III-i. PopMAG: Multi-track Generation

From the perspective of a musical AI researcher, polyphonic music can be classified into two types. A piano solo piece is certainly polyphonic, but all of its notes are played by a single instrument and thus share a similar, if not identical, timbre. An orchestral piece is also polyphonic, yet far more complicated given the number of instrument types. For automatic music generation, it is easier to generate simpler polyphony like a piano solo and much more challenging to take the differences among instruments into consideration. The task of generating polyphonic music with multiple instruments is sometimes called “multi-track music generation”.

 

In the past five years, multiple novel research projects have been undertaken in this realm, including MMM (Multi-track Music Machine) by Ens et al. [7] and MusAE (Music Adversarial Autoencoder) by Valenti et al. [23]. Researchers have worked to establish interdependence among the tracks so that the notes across the various instruments together form satisfying harmony. One of the successful works is PopMAG by Yi Ren et al. in 2020 [20].

 

In their paper, Ren et al. focused on the generation of pop songs. The task of pop music generation was usually separated into two parts: firstly, generate a melody and the harmonic progression, and secondly generate the multi-instrument accompaniment. PopMAG stands for “pop music accompaniment generation”, meaning it was built for the latter part. It takes a melody sequence and a harmony sequence as the input.

 

The authors pointed out that previous works generated the instrument tracks separately, which could undermine the interdependency. To surmount this issue, they proposed a new representation of multi-track music named “MuMIDI” which made the simultaneous generation of all tracks possible.

 

The key to the simultaneous generation is to represent the data from all tracks in one single sequence, as MuMIDI does, but the downside of that is the sequence may get too elongated. The long-term dependency problem is harder to deal with for longer sequences, while repetition and self-referencing are crucial components to make a piece of music understandable for human listeners.

 

To solve this conflict, the authors provided two designs. One is from the algorithm side: they applied the Transformer-XL model which is good at capturing long-term dependencies “as the backbone of the encoder and decoder”. The other is from the data side: some commonly used musical data formats including MIDI use three or more tokens to represent a single note. In MuMIDI, multiple parameters of one note, including pitch, duration, and velocity, are condensed into one unit or one token in the sequence, so the sequence length can be significantly compressed. This approach is grounded in the notion that all attributes belonging to a musical note should be part of that note instead of separate steps in the sequence. From a listener’s perspective, it is the note that forms a musical unit in one’s mind instead of those attributes.

 

       A token for a musical note in MuMIDI may look like “<Pitch_60, Vel_28, Dur_4>”. For velocity, the authors interestingly quantized the 128 possible values in MIDI into 32 levels. This significant decrease in the number of velocity levels speeds up the training while still maintaining enough detail in dynamics.
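The sketch below shows the two data-side ideas just described: quantizing MIDI’s 128 velocity values down to 32 levels and packing a note’s attributes into a single compound token. The helper functions are illustrative; only the printed token follows the format quoted above.

```python
# Sketch of compound note tokens with quantized velocity.
def quantize_velocity(velocity, levels=32):
    """Map a 0-127 MIDI velocity onto one of `levels` bins."""
    return min(velocity * levels // 128, levels - 1)

def note_token(pitch, velocity, duration_steps):
    return f"<Pitch_{pitch}, Vel_{quantize_velocity(velocity)}, Dur_{duration_steps}>"

print(note_token(pitch=60, velocity=114, duration_steps=4))
# -> <Pitch_60, Vel_28, Dur_4>   (one token instead of three or more)
```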

 

A limitation of PopMAG is that it can only process music in 4/4 time, since the authors believed this is the most common time signature in pop music. Another flaw lies in the harmony system. There are seven possible chord qualities in MuMIDI: major, minor, augmented, diminished, major 7th, minor 7th, and half-diminished. Although it may not be obvious to machine learning scientists, it is apparent to anyone familiar with music theory that the dominant 7th chord should be on this list. Nonetheless, compared to MTHarmonizer, adding three types of 7th chords beyond triads is progress.

 

       Despite its limitations, PopMAG boasted commendable achievements. The trained model can generate accompaniments for the given melody with five instrument tracks: piano, string, guitar, bass, and drum. In the subjective evaluations, about 40% of the generated pieces have reached a quality level on par with human-composed music. In terms of human-machine collaboration, the authors also mentioned that PopMAG can be used to finish partially composed pieces where some instrument tracks are already authored by a human composer.

 

The success of PopMAG shows how crucial the data format design can be. The original structure of the common MIDI format may not be optimal for training. Through innovative alterations to musical data representation, the learning outcome could be significantly improved.


IV. Conclusion

IV-a. Concession

       Given the huge number of studies conducted in this field, this survey paper is by no means a complete history of score-level music generation. Actually, the term “score-level” is not even well defined. For example, the MIDI format includes velocity information for each note which is far beyond the detail level of traditional musical scores. Many research projects, especially the earlier ones, did not involve velocities in the training. But some recent studies, including Music Transformer mentioned above, produced AI models that can generate music with expressive performance. Some might argue these two types of research need to be distinguished instead of calling both “score-level”. For this survey, score-level music is any symbolic representation of music. In simpler terms, it is any non-audio format of music that can be readily translated into human-readable scores.

 

IV-b. Conclusion

       Music is an old form of art with many existing styles and deep cultural backgrounds. A purely rule-based algorithmic approach seems insufficient to tackle the intricate task of automated musical composition. Most studies of score-level music generation focused on the statistical methodologies that analyze and learn from datasets of human-composed music.

 

Earlier machines with limited computing power could not carry out the modeling of a very large musical dataset, hence the outputs of the early Markov-chain programs were relatively simplistic compositions. With the fast GPUs built in the past decade and the arrival of the “big data” era, far greater advances have been made in music generation. An unprecedentedly large number of musical files available on the internet has enabled us to create artificial neural networks capable of composing complex multi-track music, even with sophisticated rhythmic patterns and expressive performance. Breaking through more and more limitations, machines are now much “freer” in music-making than the earlier programs were.

 

       However, music is multi-dimensional and complicated. With all the current successes discussed above, machine music generators are still much inferior to human composers in many aspects. From a collaborative standpoint, existing systems are not yet capable of consistently producing high-quality compositional advice across various musical styles. More work remains to be done.

 

IV-c. Future Directions

To further improve score-level music generation, future works could concentrate on addressing three important issues: symbolic representations of musical data, multi-track generation, and evaluation metrics of music.

 

The quantity and quality of data have been key to success in machine learning research. The traditional musical score captures the essential rhythms and pitches, but that format is far from containing everything about music. The data format most popular today, MIDI, can be much more informative than a traditional written score, yet it still has constraints. With further engineering of symbolic music representation, future formats may better capture the structure of pieces and facilitate the training process.

 

Secondly, multi-track music generation is still in its nascent stages. Current models such as PopMAG usually focus on roughly five instruments, while at least 30 types of instruments are in very common use. The differences in instrumentation among musical pieces also make the modeling difficult: although the number of pieces in a whole dataset may be large, it is hard to ensure there is enough data for a specific ensemble. Hopefully, future AI models can master a broader array of instruments and generate more varied music.

 

       Finally, as for evaluation metrics, while it is apparently unrealistic to look for an algorithm to determine how “good” a piece is, there are numerous worthwhile possibilities to explore from the perspective of music theory. For example, it may be reasonable to devise an algorithm to calculate how much a piece fits the jazz style, which could be used for assessing AI models that aim at generating jazz music. In tandem with subjective ratings, such objective quantitative evaluations of the generated pieces could provide valuable guidance for the research.

 

IV-d. Evaluation

The earliest research on algorithmic music and music generation, started by musicians and electroacoustic composers, seems to have been done mainly within musical academia. As we have moved into the era of deep learning over the past decade or two, a significant portion of the research on machine-learning-based music generation has been done in the computer science departments of academic institutions. Remarkably, many of the researchers who published recent studies in this field had minimal or no conservatory training.

 

As machine learning develops rapidly, professional musicians, especially those with techno-fluency, should participate more in this interdisciplinary research so that the outcomes of these projects better align with musicians’ needs. For all three future directions mentioned above, knowledge of music theory and experience in composition can be significantly helpful. This survey aims to present musicians with the history and the current state of these studies in understandable language to facilitate such collaborations. Hopefully, with future achievements in score-level music generation, machines can assist and accelerate humans’ compositional process in a creative manner, so that our musical culture becomes more abundant and diverse.

 

Bibliography

[1]      Bertin-Mahieux, Thierry, Daniel PW Ellis, Brian Whitman, and Paul Lamere. "The Million Song Dataset." (2011): 591-596.

 

[2]      Brooks, Frederick P., A. L. Hopkins, Peter G. Neumann, and William V. Wright. "An experiment in musical composition." IRE Transactions on Electronic Computers 3 (1957): 175-182.

 

[3]      Chen, Ke, Cheng-I. Wang, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. "Music sketchnet: Controllable music generation via factorized representations of pitch and rhythm." arXiv preprint arXiv:2008.01291 (2020).

 

[4]      Cope, David. "Experiments in music intelligence (EMI)." In ICMC. 1987.

 

[5]      Donahue, Chris, Huanru Henry Mao, Yiting Ethan Li, Garrison W. Cottrell, and Julian McAuley. "LakhNES: Improving multi-instrumental music generation with cross-domain pre-training." arXiv preprint arXiv:1907.04868 (2019).

 

[6]      Eck, Douglas, and Juergen Schmidhuber. "A first look at music composition using lstm recurrent neural networks." Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale 103, no. 4 (2002): 48-56.

 

[7]      Ens, Jeff, and Philippe Pasquier. "Mmm: Exploring conditional multi-track music generation with the transformer." arXiv preprint arXiv:2008.06048 (2020).

 

[8]      Hiller, Lejaren Arthur, and Leonard M. Isaacson. Experimental Music: Composition with an Electronic Computer. Greenwood Publishing Group Inc., 1979.

 

[9]      Hochreiter, Sepp, and Jürgen Schmidhuber. "Long short-term memory." Neural computation 9, no. 8 (1997): 1735-1780.

 

[10]   Huang, Cheng-Zhi Anna, Ashish Vaswani, Jakob Uszkoreit, Noam Shazeer, Ian Simon, Curtis Hawthorne, Andrew M. Dai, Matthew D. Hoffman, Monica Dinculescu, and Douglas Eck. "Music transformer." arXiv preprint arXiv:1809.04281 (2018).

 

[11]   Koenig, Gottfried Michael. "Project 1." Electronic Music Reports, no. 2 (1970): 32-44.

 

[12]   Koenig, Gottfried Michael. "PROJECT2: A Programme for Musical Composition." Electronic Music Reports, no. 3 (1970).

 

[13]   Lim, Hyungui, Seungyeon Rhyu, and Kyogu Lee. "Chord generation from symbolic melody using BLSTM networks." arXiv preprint arXiv:1712.01011 (2017).

 

[14]   Mogren, Olof. "C-RNN-GAN: Continuous recurrent neural networks with adversarial training." arXiv preprint arXiv:1611.09904 (2016).

 

[15]   Mozer, M. C. "Neural network composition by prediction: Exploring the benefits of psychophysical constraints and multiscale processing." Cognitive Science 6 (1994): 247-280.

 

[16]   Paiement, Jean-François, Douglas Eck, and Samy Bengio. "Probabilistic melodic harmonization." In Advances in Artificial Intelligence: 19th Conference of the Canadian Society for Computational Studies of Intelligence, Canadian AI 2006, Québec City, Québec, Canada, June 7-9, 2006. Proceedings 19, pp. 218-229. Springer Berlin Heidelberg, 2006.

 

[17]   Pati, Ashis, Alexander Lerch, and Gaëtan Hadjeres. "Learning to traverse latent spaces for musical score inpainting." arXiv preprint arXiv:1907.01164 (2019).

 

[18]   Pinkerton, Richard C. "Information theory and melody." Scientific American 194, no. 2 (1956): 77-87.

 

[19]   Raffel, Colin. "Learning-based methods for comparing sequences, with applications to audio-to-midi alignment and matching." PhD diss., Columbia University, 2016.

 

[20]   Ren, Yi, Jinzheng He, Xu Tan, Tao Qin, Zhou Zhao, and Tie-Yan Liu. "Popmag: Pop music accompaniment generation." In Proceedings of the 28th ACM international conference on multimedia, pp. 1198-1206. 2020.

 

[21]   Rosenblatt, Frank. "The perceptron: a probabilistic model for information storage and organization in the brain." Psychological review 65, no. 6 (1958): 386.

 

[22]   Todd, Peter M. "A connectionist approach to algorithmic composition." Computer Music Journal 13, no. 4 (1989): 27-43.

 

[23]   Valenti, Andrea, Antonio Carta, and Davide Bacciu. "Learning style-aware symbolic music representations by adversarial autoencoders." arXiv preprint arXiv:2001.05494 (2020).

 

[24]   Vaswani, Ashish, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. "Attention is all you need." Advances in neural information processing systems 30 (2017).

 

[25]   Yeh, Yin-Cheng, Wen-Yi Hsiao, Satoru Fukayama, Tetsuro Kitahara, Benjamin Genchel, Hao-Min Liu, Hao-Wen Dong, Yian Chen, Terence Leong, and Yi-Hsuan Yang. "Automatic melody harmonization with triad chords: A comparative study." Journal of New Music Research 50, no. 1 (2021): 37-51.