Machine Learning in Linux: Audiocraft – audio processing and generation with deep learning


Audiocraft produces remarkable results. It’s not going to turn us into music maestros, but the generated samples are impressive even without much tweaking of the text descriptions.

We were initially disappointed to read that a GPU with at least 16GB of VRAM is necessary to use the melody model. Graphics cards with that much VRAM are expensive for the average user. Fortunately, that requirement doesn’t appear to be strict: our test machine, which has a mid-range graphics card with 8GB of VRAM, is able to generate 30-second clips with the melody model.

If you don’t have an NVIDIA GPU, how long does it take to generate music extracts with just the CPU? We made a small code change to audiocraft/models/ to force the software to use the CPU instead of the dedicated GPU.
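To illustrate the idea of forcing CPU use, here is a minimal sketch of device selection. It is not the change we made to the Audiocraft source. The helper name pick_device and its force_cpu flag are hypothetical, and the sketch assumes PyTorch as the backend, falling back to the CPU if torch is not installed.

```python
def pick_device(force_cpu: bool = False) -> str:
    """Return "cuda" when a CUDA GPU is usable, otherwise "cpu"."""
    if force_cpu:
        return "cpu"
    try:
        import torch  # assumption: PyTorch is the deep learning backend in use
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


# Force CPU generation even on a machine that has a dedicated GPU.
device = pick_device(force_cpu=True)
print(device)
```

The resulting string can be passed wherever the model expects a device. Recent Audiocraft releases appear to accept a device argument when loading a pretrained model, but check the version you have installed.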

Here are the times taken to generate a 10-second music extract using the text description “A cheerful country song with acoustic guitars”. For the melody model, we used an MP3 file of Ravel’s Bolero.

All times are in seconds, with the model pre-loaded. CPU: Intel i5-12400F; GPU: NVIDIA GeForce RTX 3060 Ti.

The table should help give you an indication of how long it will take to generate music extracts on your system.
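If you want comparable figures for your own hardware, a timing harness along these lines may help. It is only a sketch: time_generation and fake_generate are hypothetical names, and fake_generate simply sleeps in place of a real generation call made with the model already loaded.

```python
import time
from statistics import mean


def time_generation(generate, runs: int = 3) -> float:
    """Return the mean wall-clock seconds taken by generate() over several runs."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        generate()
        timings.append(time.perf_counter() - start)
    return mean(timings)


def fake_generate():
    # Stand-in for a real generation call on a pre-loaded model; it only sleeps.
    time.sleep(0.01)


elapsed = time_generation(fake_generate)
print(f"{elapsed:.2f} s per clip")
```

When timing on a GPU with PyTorch, bear in mind that CUDA work is asynchronous, so calling torch.cuda.synchronize() before stopping the clock gives more trustworthy numbers.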

Using the GPU offers a huge speed advantage over the CPU. No surprise there. But if you’re happy waiting a minute or two to generate a clip, you can use the software without a dedicated graphics card. Or you can use Google Colab.

On our test machine, we can only run the large model on the CPU. The GPU has insufficient VRAM, and generation fails with the error message torch.cuda.OutOfMemoryError: CUDA out of memory.
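One way to cope with this is to try the GPU first and retry on the CPU when memory runs out. The sketch below shows the pattern with stand-in code. generate_with_fallback and fake_generate are hypothetical names, and the out-of-memory error is simulated with Python's built-in MemoryError so the example runs without a GPU. With PyTorch you would pass oom_errors=(torch.cuda.OutOfMemoryError,) instead.

```python
def generate_with_fallback(generate, oom_errors=(MemoryError,)):
    """Try generate("cuda") first; on an out-of-memory error retry on the CPU."""
    try:
        return generate("cuda")
    except oom_errors:
        return generate("cpu")


def fake_generate(device: str) -> str:
    # Stand-in for loading a model on `device` and generating a clip.
    if device == "cuda":
        raise MemoryError("CUDA out of memory (simulated)")
    return f"clip generated on {device}"


print(generate_with_fallback(fake_generate))
```

In a real script it is probably worth releasing cached GPU memory with torch.cuda.empty_cache() before the CPU retry.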

Developer: Meta Platforms, Inc. and affiliates
License: MIT License

Audiocraft is written in Python.




1 Comment

Thanks for the article. This is one of the best deep learning tools I’ve tried although it’s dog slow on my AMD CPU.

It’s like Stable Diffusion but for audio.

What irks me is the large RAM requirements. Fortunately I’ll be getting a GeForce RTX 4060 Ti so this will have enough RAM.