Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion
ICML 2024
Technion - Israel Institute of Technology

Abstract
Editing signals using large pre-trained models, in a zero-shot manner, has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, which use DDPM inversion on pre-trained diffusion models. The first, adopted from the image domain, allows text-based editing. The second is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody.
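To make the recipe behind both editing modes concrete, below is a minimal sketch of edit-friendly DDPM inversion followed by text-guided resampling: the source signal is noised with independent noise per timestep, the noise maps that exactly reproduce those samples under the source prompt are extracted, and the reverse process is then re-run with the target prompt while reusing those noise maps. This is an illustration of the general technique, not the released implementation; `eps_model` (a pre-trained conditional noise predictor), the prompt embeddings `src_emb`/`tgt_emb`, the schedule `alphas_bar`, and `t_start` are all placeholder names.

```python
import torch

def ddim_sigma(a_prev, a_cur, eta=1.0):
    # Per-step noise level; eta=1 recovers the stochastic DDPM sampler.
    return eta * torch.sqrt((1 - a_prev) / (1 - a_cur)) * torch.sqrt(1 - a_cur / a_prev)

def reverse_step_mean(x_t, eps_hat, a_cur, a_prev, sigma):
    # Deterministic part of the reverse step (the posterior mean mu_t).
    x0_pred = (x_t - torch.sqrt(1 - a_cur) * eps_hat) / torch.sqrt(a_cur)
    return torch.sqrt(a_prev) * x0_pred + torch.sqrt((1 - a_prev - sigma**2).clamp(min=0.0)) * eps_hat

@torch.no_grad()
def edit_friendly_inversion(x0, eps_model, src_emb, alphas_bar):
    """Sample x_1..x_T with statistically independent noise, then extract the
    noise maps z_t that make the reverse process (under the source prompt)
    reproduce them exactly."""
    T = len(alphas_bar)
    xs = [x0]
    for t in range(1, T + 1):
        a = alphas_bar[t - 1]
        xs.append(torch.sqrt(a) * x0 + torch.sqrt(1 - a) * torch.randn_like(x0))
    zs = [None] * (T + 1)
    for t in range(T, 0, -1):
        a_cur = alphas_bar[t - 1]
        a_prev = alphas_bar[t - 2] if t > 1 else torch.tensor(1.0)
        sigma = ddim_sigma(a_prev, a_cur)
        eps_hat = eps_model(xs[t], t, src_emb)
        mu = reverse_step_mean(xs[t], eps_hat, a_cur, a_prev, sigma)
        zs[t] = torch.zeros_like(x0) if sigma == 0 else (xs[t - 1] - mu) / sigma
    return xs, zs

@torch.no_grad()
def text_based_edit(xs, zs, eps_model, tgt_emb, alphas_bar, t_start):
    """Re-run the reverse process from x_{t_start} with the target prompt,
    reusing the stored noise maps z_t to preserve the source structure."""
    x = xs[t_start]
    for t in range(t_start, 0, -1):
        a_cur = alphas_bar[t - 1]
        a_prev = alphas_bar[t - 2] if t > 1 else torch.tensor(1.0)
        sigma = ddim_sigma(a_prev, a_cur)
        eps_hat = eps_model(x, t, tgt_emb)
        x = reverse_step_mean(x, eps_hat, a_cur, a_prev, sigma) + sigma * zs[t]
    return x

# Call pattern (with dummy stand-ins, purely illustrative):
# alphas_bar = torch.linspace(0.9999, 0.01, 200)
# eps_model = lambda x, t, c: torch.zeros_like(x)  # stand-in for a real denoiser
# xs, zs = edit_friendly_inversion(x0, eps_model, src_emb, alphas_bar)
# x_edit = text_based_edit(xs, zs, eps_model, tgt_emb, alphas_bar, t_start=110)
```

In practice the signal is edited in the latent space of a pre-trained text-conditioned audio diffusion model and then decoded back to a waveform; the sketch omits that encoding/decoding step.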
Video Overview
A short overview for people in a hurry. Images generated by DALL-E 2 and Copilot.
1. Samples of Editing
We present samples of audio editing using our proposed methods. The samples are organized into two sections: text-based editing and unsupervised editing.
1.1. Samples of Text-Based Editing
# | Source Prompt | Target Prompt | Original Audio | Edited Audio | Edit T_start
---|---|---|---|---|---
1 | A recording of a sneaky jazz song. | A recording of a tense classical music score. | (audio) | (audio) | 110
2 | A recording of a hard rock song. | A recording of a jazz song. | (audio) | (audio) | 100
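The "Edit T_start" column is the timestep from which the reverse process is re-run with the target prompt; larger values generally permit stronger edits at the cost of fidelity to the original recording. A hypothetical sweep, reusing the placeholder functions sketched above:

```python
# Hypothetical T_start sweep: higher values let the target prompt reshape
# more of the signal, lower values stay closer to the original recording.
for t_start in (80, 100, 110, 130):
    edited = text_based_edit(xs, zs, eps_model, tgt_emb, alphas_bar, t_start=t_start)
    # ... decode `edited` back to a waveform and compare against the source
```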