Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
ICML 2024
Technion - Israel Institute of Technology

Abstract
Editing signals using large pre-trained models, in a zero-shot manner, has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion on pre-trained diffusion models. The first, adopted from the image domain, allows text-based editing. The second is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody.
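To make the recipe behind both editing modes concrete, below is a minimal sketch of edit-friendly DDPM inversion followed by text-guided resampling: the source signal is noised with independent noise per timestep, the noise maps that exactly reproduce those samples under the source prompt are extracted, and the reverse process is then re-run with the target prompt while reusing those noise maps. This is an illustration of the general technique, not the released implementation; `eps_model` (a pre-trained conditional noise predictor), the prompt embeddings `src_emb`/`tgt_emb`, the schedule `alphas_bar`, and `t_start` are all placeholder names.

```python
import torch

def ddim_sigma(a_prev, a_cur, eta=1.0):
    # Per-step noise level; eta=1 recovers the stochastic DDPM sampler.
    return eta * torch.sqrt((1 - a_prev) / (1 - a_cur)) * torch.sqrt(1 - a_cur / a_prev)

def reverse_step_mean(x_t, eps_hat, a_cur, a_prev, sigma):
    # Deterministic part of the reverse step (the posterior mean mu_t).
    x0_pred = (x_t - torch.sqrt(1 - a_cur) * eps_hat) / torch.sqrt(a_cur)
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt((1 - a_prev - sigma**2).clamp(min=0.0)) * eps_hat

@torch.no_grad()
def edit_friendly_inversion(x0, eps_model, src_emb, alphas_bar):
    """Sample x_1..x_T with statistically independent noise, then extract the
    noise maps z_t that make the reverse process (under the source prompt)
    reproduce them exactly."""
    T = len(alphas_bar)
    xs = [x0]
    for t in range(1, T + 1):
        a = alphas_bar[t - 1]
        xs.append(torch.sqrt(a) * x0 + torch.sqrt(1 - a) * torch.randn_like(x0))
    zs = [None] * (T + 1)
    for t in range(T, 0, -1):
        a_cur = alphas_bar[t - 1]
        a_prev = alphas_bar[t - 2] if t > 1 else torch.tensor(1.0)
        sigma = ddim_sigma(a_prev, a_cur)
        eps_hat = eps_model(xs[t], t, src_emb)
        mu = reverse_step_mean(xs[t], eps_hat, a_cur, a_prev, sigma)
        zs[t] = torch.zeros_like(x0) if sigma == 0 else (xs[t - 1] - mu) / sigma
    return xs, zs

@torch.no_grad()
def text_based_edit(xs, zs, eps_model, tgt_emb, alphas_bar, t_start):
    """Re-run the reverse process from x_{t_start} with the target prompt,
    reusing the stored noise maps z_t to preserve the source structure."""
    x = xs[t_start]
    for t in range(t_start, 0, -1):
        a_cur = alphas_bar[t - 1]
        a_prev = alphas_bar[t - 2] if t > 1 else torch.tensor(1.0)
        sigma = ddim_sigma(a_prev, a_cur)
        eps_hat = eps_model(x, t, tgt_emb)
        x = reverse_step_mean(x, eps_hat, a_cur, a_prev, sigma) + sigma * zs[t]
    return x

# Call pattern (with dummy stand-ins, purely illustrative):
# alphas_bar = torch.linspace(0.9999, 0.01, 200)
# eps_model = lambda x, t, c: torch.zeros_like(x)  # stand-in for a real denoiser
# xs, zs = edit_friendly_inversion(x0, eps_model, src_emb, alphas_bar)
# x_edit = text_based_edit(xs, zs, eps_model, tgt_emb, alphas_bar, t_start=110)
```

In practice the signal is edited in the latent space of a pre-trained text-conditioned audio diffusion model and then decoded back to a waveform; the sketch omits that encoding/decoding step.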
Video Overview
A short overview for people in a hurry. Images generated by DALL-E 2 and Copilot.
1. Samples of Editing
We present samples of audio editing using our proposed methods. The samples are organized into two sections: text-based editing and unsupervised editing.
1.1. Samples of Text-Based Editing
# | Source Prompt | Target Prompt | Original Audio | Edited Audio | Edit T_start
---|---|---|---|---|---
1 | A recording of a sneaky jazz song. | A recording of a tense classical music score. | (audio) | (audio) | 110
2 | A recording of a hard rock song. | A recording of a jazz song. | (audio) | (audio) | 100
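The "Edit T_start" column is the timestep from which the reverse process is re-run with the target prompt; larger values generally permit stronger edits at the cost of fidelity to the original recording. A hypothetical sweep, reusing the placeholder functions sketched above:

```python
# Hypothetical T_start sweep: higher values let the target prompt reshape
# more of the signal, lower values stay closer to the original recording.
for t_start in (80, 100, 110, 130):
    edited = text_based_edit(xs, zs, eps_model, tgt_emb, alphas_bar, t_start=t_start)
    # ... decode `edited` back to a waveform and compare against the source
```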