I just found out how good AI music generation has become, and have spent a little time learning what I could about the technology. This is all completely new to me – I play and record music, and I know something about AI (mostly related to language), but I hadn’t previously looked into what the state of the art was in music generation in the current era of Large Language Models. I’m pretty blown away by how realistic-sounding AI-generated music can now be, in much the same way that I was blown away by how LLMs could generate English text that is indistinguishable from human-generated writing. There are undoubtedly interesting questions about *how* these models work, and to what extent they are using processes similar to humans, parallel to the questions linguists and computer scientists ask about LLMs. So far I haven’t been able to find out anything at all about how the music models work, and I assume they largely work in a similar way to the language systems. What I have been able to quickly learn something about are the societal implications of the emergence of this technology and the ethical and legal issues around its creation and use, and I’ll share some of that along with a few of my own initial thoughts. If I were teaching a course on AI, I’d definitely have plenty to work with for a few classes on music, and it looks to me like the case of music is a good illustration of some general points about AI and its implications.
So first, why do I think AI music generation is so good? It’s because I’ve just listened to some music that a couple of people have produced using current music generation software. The platform I will focus on is the new version of Suno (v. 4, released in Beta Nov. 2024). Michael J. Epstein has made a series of Facebook posts on an experiment he did with creating a song with the new Suno, submitting it to Spotify playlist placement services, and tracking its performance. His first post from Nov. 23, along with a Dec. 3 update, is here, and his Jan. 3 update is here. The song currently has 64,593 listens. As any musician who has music on Spotify knows, this is a lot. My band’s highest number is 5,604, and how it got that high is another story.
You can hear the song on Spotify by following the link in the title above, and you can hear a preview here by pushing the play button. It starts with a finger-plucked acoustic guitar accompanying a female voice, with bass, electric guitar and drums coming in as the song progresses. As far as I can tell, nothing about the song sounds “fake”, at least no more than all recorded music is fake to varying degrees, especially now. There are no analogues of the “extra fingers” that sometimes tip you off that an image is AI-generated (in the Nov. 23 post, Epstein points to some recent work indicating that “[t]he days of extra fingers in AI art…are over”). Epstein notes that none of the playlist curators he submitted his song to “identified anything (at least out loud) as inauthentic about the music” (Jan. 6 comment on Facebook, not linkable).
Some more examples of authentic-sounding AI-generated music, as well as plenty of food for thought on the societal implications of this new technology, can be found in the podcast that Epstein’s Jan. 3 post links to. In the second part of a Dec. 27 episode of “On the Media” entitled “How AI and Algorithms Are Transforming Music”, Mark Henry Phillips talks about the existential crisis that his own experiments with current music generation have created for him as a composer and producer of commercial music. The podcast includes samples of the music from his experiments, and compares them to his non-AI-assisted productions in terms of both their quality and the effort needed to produce them. Phillips suspects that he will soon no longer be able to make a living as a musician, as companies start to use the technology directly themselves.
Alongside pointing out the threats that this technology poses for the already precarious ability of musicians and music producers to make a living, Epstein and Phillips note its potential for their own creative practices. This software now allows you to upload recordings that it will extend. They both mention looking forward to using that technology to help finish off demos, and Phillips provides an example of a horn part that the software created for one of his songs. This resonated with me, since I have lots of my own compositions and productions in various stages of incompleteness, so I decided to try Suno myself.
I’m actually not sure if I will wind up using Suno or similar technologies in my own music work, for a few reasons. The first is that I enjoy making music largely because of how different it is from my academic day job in linguistics. I love getting together with the other people in my band and playing our songs – it’s just as much of a chance to hang out with friends as anything else. And when I’m playing music on my own, what I’m usually doing is fooling around on the guitar finding new riffs and chord progressions, and coming up with vocal parts (hence the pile of unfinished songs). I’m not averse to using technology – I enjoy working with Logic recording software, for instance – but I find it hard to get going on that kind of work, especially after a day of doing my “real” work on the computer.
The second is that I didn’t find my initial experiments with Suno very inspiring. I tried uploading a bit of one of my Voice Memo recordings that had me singing with my acoustic guitar, strumming fast. I gave Suno the style prompts “indie rock, post-punk, indie pop” and asked it to extend the recording. I also gave it some prompts for lyrics. The extensions it created did sound like a human performer, but they were unusable to me. They were very bland, both in style, which ended up being middle-of-the-road pop-folk, and in lyrical and melodic content. I was using the free Suno, and this was the first thing I did. I don’t doubt that if I worked more with the paid version, I could get some usable ideas and sounds, based on Epstein and Phillips’ reports and results. But at least for now, that seems like too much work, and not the type of work I want to be doing when I have time for music.
I also tried just having Suno generate a couple of pieces of instrumental music based on style prompts. I thought these were terrible. Again, they sounded like “real” music (though less so than Epstein and Phillips’ examples), but I didn’t think they were good examples of the styles, and I didn’t enjoy listening to them. The most egregious example was the response to “punk, 1970s, New York City”, which, if forced to categorize it myself, I’d call “video game hair metal”. This experience does make me wonder if there will still be a need for commercial music producers after all (though undoubtedly fewer of them, given the speed at which AI-assisted producers will be able to work).
The final reason that I may end up not using this technology is an ethical one. Suno and Udio are being sued by a group of major record labels in lawsuits coordinated by the Recording Industry Association of America (RIAA). In its response to the lawsuits (p. 9), Suno says that its model’s
“training data includes essentially all music files of reasonable quality that are accessible on the open Internet, abiding by paywalls, password protections, and the like, combined with similarly available text descriptions.”
Suno’s response at the above link is worth reading in full, and a summary of it, and of the RIAA’s reply to it, can be found in this article.
I very much value the protection of creators’ rights, and of their ability to make a living, so there is a big part of me that would be happy to boycott both music and language generation software, insofar as they interfere with these. But I have not thought nearly enough about these issues, and I am also certainly on board with the critiques of the music industry in the Suno response.
In his podcast, Phillips draws some analogies between how human musical composition works and how the music generation software works, and suggests that there is a closer link between those two than between text generation and human writing. It’d be interesting from a scientific standpoint to try to look at those connections in more detail. In linguistics there is currently a great deal of controversy over whether LLMs are useful as models of human language (for two poles of the debate, see Piantadosi and Kodner et al.). There is also something about the way that Phillips describes the human and the computer creation processes that brings to mind a potential argument that using web-available training data is analogous to how humans wind up creating original music. But even in the case of human creation, questions of authorship, copyright, and fair use are incredibly difficult to arbitrate, both ethically and legally.
Update Jan 7: Michael J. Epstein shared with me in a message these details about the process for making his songs in Suno, which highlight how much human intervention and creative choice there is in the work he did, and also how much faster it was than non-AI work: “…it does take a lot of thought and practice to prompt to get what you want. For every song I was posting, it probably took 20 initial generations of it to pick a baseline and then another 20+ section regenerations with dynamic prompting. So, I was not just dropping something in and releasing the output. That said, it’s obviously a trivial amount of work relative to writing and recording songs the way I have been for decades.” And in case you are curious about the playlist placement services he used, I saw this in one of his Facebook comments: “I used Sound Campaign and Playlist Push and had success with Sound Campaign and not much with Playlist Push, but I think it’s more about the genre issue. I do hear many people do not have success with services like these, so it’s definitely hit or miss, and I suspect the more generic, boring, and mainstream your music is, the better it will do…”
*Update Jan 8: I just added a parenthesized asterisk to the title, after having done another quick experiment in Suno. I had already been feeling that I needed to qualify what I meant by “good”, and now I’m sure I need to. What I mean is that Suno seems capable of generating music that is indistinguishable from non-AI music, at least in some genres, and when used correctly. Here’s a good example of how it is not good in a broader sense (besides the questions of ethical and societal goodness).
I asked Suno to generate songs in the style of “Balinese gamelan” and “Javanese gamelan”. I am no expert, but I could likely sort examples of these two styles accurately, and I wanted to see how Suno would do with a non-western musical system. It wound up producing a variety of things that I would label soundtrack or elevator music, and as far as I can tell none of the instruments sounded like they were from a gamelan orchestra, and none of the songs used anything but standard western musical structure. There are huge issues in AI language generation around its use with low-resource languages, where it creates errorful examples of writing in those languages that might be taken as genuine because they use the correct orthography and get some other things right. At least no one will take these to be examples of gamelan music!