My first AI Song Contest experience: it’s not a bug; it’s a feature!

For the last couple of months I have been busy hacking away at my contribution to the Australian entry to the AI Eurovision song contest. It was a great opportunity to e-meet other individuals passionate about music and technology, led by the Uncanny Valley people. We had the benefit of Uncanny Valley’s experience in creating the AI Christmas carol back in 2018, which allowed the team to produce something of quality within a very short time frame.

After some initial brainstorming and planning with input from our entire diverse team, roles soon fell into place based on everyone’s strengths. Caroline Pegram’s vision and drive brought the team together. She contributed to the creation of an “instrument” trained on Australian koala, kookaburra and Tasmanian devil noises, used for the song’s lead break, and was a key part of the lyrics team with Charlton Hill and others. Caroline also seems to be a marketing genius: her photo concept became a feature on the song contest website and in various articles about the competition, and many people were quite taken by the idea of incorporating Australian animal noises.

Charlton Hill is Head of Innovation at Uncanny Valley, and collaborated closely with Justin Shave, Uncanny Valley’s head composer and sonic technologist. Charlton is quoted as saying “We like to collaborate and rage with the machine”. He has some great songs in his back catalogue.

Brendan Wright was our resident deep learning specialist, working with Justin Shave, Charlton Hill and Oliver Bown. He generated MIDI melodies with a modified Sample-RNN trained on a 200-melody Eurovision dataset, and generated lyrics with a modified GPT-2, a pre-trained language model that was then tuned on Eurovision lyrics. The team also explored audio generation from source-separated Eurovision vocal tracks, but it wasn’t quite delivering the results they wanted. It was, however, an excellent learning opportunity.

Using my background in information retrieval, pattern matching, music theory, songwriting and algorithmic vocal composition, I hacked together solutions in Python, awk and bash for three tasks: pre-processing MIDI melodies for the melody generator; converting lyrics and melodies into stress patterns to allow pattern matching, using old code from my PhD; and generating synthetically sung snippets from aligned melodies and lyrics. This allowed the production team at Uncanny Valley to easily select choice snippets for the final song.

Limited development time meant that there were plenty of quirks in the snippets I provided. For example, the first version I uploaded for the team didn’t render the sung lyrics with syllables going to separate notes. This is evident in some of the song’s lines where a polysyllabic word is all at one pitch, as in “Dreams still live on the wings of happiness”. (Note that vocals start a third of the way through the audio files here, due to sinsy adding a bar of silence at the beginning and end of any vocal rendering, and my code making an entire phrase equate to a bar of music, since rhythms were not aligned to bars or beats.)

A couple of other things are evident here. First, the pitch is rather high, so I wrote code later to change the octave to something more “normal” based on average pitch. This had no impact on the final snippet selection, however, so it was more for my own satisfaction.
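As a rough illustration of that octave fix, here is a minimal Python sketch, not the code I actually used. It assumes the melody is a list of MIDI note numbers; the function name and the target pitch of 60 (middle C) are hypothetical choices for the example.

```python
def shift_to_comfortable_octave(pitches, target=60):
    """Transpose MIDI pitches by whole octaves so their mean lands nearest to target.

    `target` is an assumed "normal" centre pitch (60 = middle C), not a value
    taken from the original code.
    """
    if not pitches:
        return []
    mean_pitch = sum(pitches) / len(pitches)
    # Whole number of octaves (12 semitones each) that brings the mean closest to target.
    octaves = round((target - mean_pitch) / 12)
    return [p + 12 * octaves for p in pitches]


high_phrase = [84, 86, 88, 84, 91]  # a made-up phrase sitting two octaves too high
print(shift_to_comfortable_octave(high_phrase))
```

Shifting by whole octaves keeps every note in the same key and leaves the intervals untouched, which is why it is safe to apply after the melody has been generated.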

Second, the rendered line is actually “Dreams still live on the wings of happiness, dreams still”. This is a happy accident from the combination of my not yet fully functioning code and my students’ robust code. The lack of syllable breaking meant that there were often more notes available for the words than originally matched via the syllable stress patterns, as the code that allocates words to notes would only add one word/syllable to each note. My students’ original code very cleverly loops through input text until there are no more notes to apply lyrics to. This can be handy for quick and dirty sinsy renderings, for example, by adding “la” to everything. In my later versions, I kept that feature, but also ensured that “- ” could be used in input to indicate syllabification of a word (and to extend a syllable to more than one note if desired), and that the musicXML would be correctly generated, so that sinsy would interpret the syllables as belonging to one word. I have, after the competition closing date, added “_” processing to allow a final syllable to be extended to more notes if required. But if I had done this sooner we may never have had “Welcome home oh welcome home oh oh oh welcome home”:
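The looping behaviour can be sketched in a few lines of Python. This is a simplified illustration rather than my students’ code or my own: the function names are hypothetical, a trailing “-” on a chunk is assumed to mark a syllable that continues into the next chunk, and the “_” extension is left out. The syllabic labels are the values MusicXML allows in its syllabic element (single, begin, middle, end).

```python
from itertools import cycle


def tokenize(lyrics):
    """Split lyrics into (text, syllabic) pairs.

    A chunk ending in '-' is treated as a word-internal syllable (assumed
    input convention for this sketch).
    """
    tokens = []
    inside_word = False
    for chunk in lyrics.split():
        continues = chunk.endswith("-")
        text = chunk.rstrip("-")
        if continues:
            syllabic = "middle" if inside_word else "begin"
        else:
            syllabic = "end" if inside_word else "single"
        tokens.append((text, syllabic))
        inside_word = continues
    return tokens


def assign_lyrics(n_notes, lyrics):
    """Give each note one token, looping back to the start of the text if notes remain."""
    tokens = tokenize(lyrics)
    if not tokens:
        return []
    source = cycle(tokens)
    return [next(source) for _ in range(n_notes)]


# Six notes, unsyllabified text: the four words run out and the text wraps around.
print([text for text, _ in assign_lyrics(6, "the world is beautiful")])

# Six notes, syllabified text: every syllable gets its own note and nothing repeats.
print(assign_lyrics(6, "the world is beau- ti- ful"))
```

The first call reproduces the wrap-around: six notes but only four words, so the text starts again. The second call shows how marking the syllables uses up the notes exactly.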

or “The world is beautiful the world” of our chorus. If this song becomes popular, I can see this style of communication catching on in a similar way to LOLcat phrases (“serious cat is serious”), Yoda grammar, or Buffy’s influence on language, e.g. “my bad”.

Unfortunately, not all “features” of the sinsy output were identified in time. It was only after the submission deadline that I discovered that sinsy only understands tempo tags, and not the metronome tags used by Sibelius and Musescore, so my final renderings were not only 40 cents flat, but at 100 bpm instead of 120. Fortunately the Uncanny Valley team had planned ahead and recorded everything with real vocalists, and the final vocal was a blend of post-processed sinsy with lead singer Antigone.
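For anyone hitting the same problem, the following Python sketch shows one possible workaround. It assumes the “tempo tag” is MusicXML’s sound element with a tempo attribute, and that the notation programs write the tempo only as a metronome marking inside a direction. The function name is hypothetical, and only quarter-note markings are handled.

```python
import xml.etree.ElementTree as ET


def add_sound_tempo(musicxml_text):
    """Add <sound tempo="..."/> to each <direction> that only has a <metronome> mark.

    Sketch only: markings whose beat unit is not a quarter note are skipped,
    and ET.tostring drops any XML declaration or DOCTYPE from the input.
    """
    root = ET.fromstring(musicxml_text)
    for direction in root.iter("direction"):
        per_minute = direction.find("direction-type/metronome/per-minute")
        beat_unit = direction.find("direction-type/metronome/beat-unit")
        if per_minute is None or not per_minute.text:
            continue  # no metronome marking here
        if direction.find("sound") is not None:
            continue  # a playback tempo is already present
        if beat_unit is not None and beat_unit.text != "quarter":
            continue  # would need converting to quarter notes per minute
        ET.SubElement(direction, "sound", tempo=per_minute.text.strip())
    return ET.tostring(root, encoding="unicode")


sample = """<measure number="1">
  <direction>
    <direction-type>
      <metronome><beat-unit>quarter</beat-unit><per-minute>120</per-minute></metronome>
    </direction-type>
  </direction>
</measure>"""

fixed = add_sound_tempo(sample)
print(fixed)
```

Running the function on its own output changes nothing further, since directions that already carry a sound element are skipped.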

Justin Shave not only has a degree in music, computer science and mathematics (not too different from my own degree), he has also produced songs for artists such as Sia and Darren Hayes, co-writing songs for Hayes’s third album and playing keytar on-stage on its world tour. He has had a long-standing collaboration with singer Antigone, who features as lead vocalist on the track. Justin’s substantial music production experience turned a subset of the sung snippets into an amazing commercial-quality song. It has been a great experience working with the team. Meanwhile, enjoy Beautiful the World, and vote in the AI Song Contest for your favourite!