Zero-Shot Unsupervised and Text-Based Audio Editing Using DDPM Inversion

ICML 2024

Technion - Israel Institute of Technology

Abstract

Zero-shot editing of signals using large pre-trained models has recently seen rapid advancements in the image domain. However, this wave has yet to reach the audio domain. In this paper, we explore two zero-shot editing techniques for audio signals, both of which use DDPM inversion on pre-trained diffusion models. The first, adopted from the image domain, allows text-based editing. The second is a novel approach for discovering semantically meaningful editing directions without supervision. When applied to music signals, this method exposes a range of musically interesting modifications, from controlling the participation of specific instruments to improvisations on the melody.
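To make the idea of DDPM inversion concrete, here is a minimal sketch on a toy 1-D signal. It is not the paper's implementation: `toy_eps_model` is a hypothetical stand-in for a pre-trained noise predictor, the linear beta schedule and the choice sigma_t = sqrt(beta_t) are assumptions, and all function names are illustrative. The sketch extracts one noise vector per timestep such that re-running the sampler with those vectors reproduces the input signal. A text-based edit would presumably rerun the sampler with a different prompt while keeping the extracted noise fixed.

```python
import numpy as np

def make_schedule(T, beta_start=1e-4, beta_end=2e-2):
    # Assumed linear beta schedule; a real model ships its own schedule.
    betas = np.linspace(beta_start, beta_end, T)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    return betas, alphas, alpha_bars

def toy_eps_model(x, t):
    # Hypothetical stand-in for a pre-trained noise-prediction network.
    return np.tanh(x) * (t / 100.0)

def posterior_mean(x_t, t, eps_model, betas, alphas, alpha_bars):
    # Mean of the DDPM reverse step; t is 1-indexed, arrays are 0-indexed.
    eps = eps_model(x_t, t)
    coef = betas[t - 1] / np.sqrt(1.0 - alpha_bars[t - 1])
    return (x_t - coef * eps) / np.sqrt(alphas[t - 1])

def ddpm_invert(x0, eps_model, T, rng):
    """Return x_T and the noise vectors z_T..z_1 that reproduce x0."""
    betas, alphas, alpha_bars = make_schedule(T)
    sigmas = np.sqrt(betas)
    # Draw each x_t from q(x_t | x_0) with independent noise.
    xs = [x0]
    for t in range(1, T + 1):
        noise = rng.standard_normal(x0.shape)
        xs.append(np.sqrt(alpha_bars[t - 1]) * x0
                  + np.sqrt(1.0 - alpha_bars[t - 1]) * noise)
    # Solve the reverse update x_{t-1} = mu_t(x_t) + sigma_t * z_t for z_t.
    zs = {}
    for t in range(T, 0, -1):
        mu = posterior_mean(xs[t], t, eps_model, betas, alphas, alpha_bars)
        zs[t] = (xs[t - 1] - mu) / sigmas[t - 1]
    return xs[T], zs

def sample_with_noise(x_T, zs, eps_model, T):
    """Run the reverse process using the fixed, extracted noise vectors."""
    betas, alphas, alpha_bars = make_schedule(T)
    sigmas = np.sqrt(betas)
    x = x_T
    for t in range(T, 0, -1):
        mu = posterior_mean(x, t, eps_model, betas, alphas, alpha_bars)
        x = mu + sigmas[t - 1] * zs[t]
    return x

rng = np.random.default_rng(0)
T = 50
x0 = rng.standard_normal(16)  # toy "signal"
x_T, zs = ddpm_invert(x0, toy_eps_model, T, rng)
x_rec = sample_with_noise(x_T, zs, toy_eps_model, T)
```

With the same model and the same noise vectors, `x_rec` matches `x0` up to floating-point error. Changing what the model is conditioned on during `sample_with_noise` is what would turn this reconstruction into an edit.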

Video Overview

For people in a hurry. Images generated by DALL-E 2 and Copilot.

1. Samples of Editing

We present samples of audio editing using our proposed methods. The samples are organized into two sections: text-based editing and unsupervised editing.

1.1. Samples of Text-Based Editing

# | Source Prompt | Target Prompt | Original Audio | Edited Audio | Edit Tstart
1 | A recording of a sneaky jazz song. | A recording of a tense classical music score. | [audio] | [audio] | 110
2 | A recording of a hard rock song. | A recording of a jazz song. | [audio] | [audio] | 100