AEROMamba: Efficient Audio Super-Resolution

AEROMamba: An efficient architecture for audio super-resolution using generative adversarial networks and state space models

Federal University of Rio de Janeiro, Brazil
LAMIR Workshop 2024

Abstract

Audio super-resolution aims to enhance low-resolution signals by generating the missing high-frequency content. In this work, we modify the architecture of AERO (a state-of-the-art system for this task) for music super-resolution. Specifically, we replace its original Attention and LSTM layers with Mamba, a State Space Model (SSM), across all network layers. Mamba can effectively substitute for these modules, as it offers a mechanism similar to that of Attention while also functioning as a recurrent network. With the proposed AEROMamba, training requires 2-4x less GPU memory, since Mamba exploits the convolutional formulation and leverages the GPU memory hierarchy. Additionally, during inference, Mamba operates in constant memory due to recurrence, avoiding the memory growth associated with Attention. This results in a 14x speed improvement using 5x less GPU memory. Subjective listening tests (0 to 100 scale) show that the proposed model surpasses the AERO model. In the MUSDB dataset, degraded signals scored 38.22, while AERO and AEROMamba scored 60.03 and 66.74, respectively. For the PianoEval dataset, scores were 72.92 for degraded signals, 76.89 for AERO, and 84.41 for AEROMamba.
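The two views of a state space model mentioned in the abstract (recurrent and convolutional) can be illustrated with a toy example. The sketch below assumes a time-invariant SSM with a single scalar state and hypothetical parameters `a`, `b`, `c`; it is not Mamba itself, whose parameters depend on the input, and it is not taken from the AEROMamba code. It only shows that stepping through the sequence with one stored state value gives the same output as a causal convolution with the kernel c·a^k·b.

```python
# Toy linear time-invariant SSM with a scalar state (illustrative only).
#   h[t] = a * h[t-1] + b * x[t]
#   y[t] = c * h[t]


def ssm_recurrent(x, a, b, c):
    """Step through the sequence keeping a single state value (constant memory)."""
    h = 0.0
    y = []
    for x_t in x:
        h = a * h + b * x_t
        y.append(c * h)
    return y


def ssm_convolutional(x, a, b, c):
    """Same output via a causal convolution with kernel K[k] = c * a**k * b."""
    kernel = [c * (a ** k) * b for k in range(len(x))]
    return [
        sum(kernel[k] * x[t - k] for k in range(t + 1))
        for t in range(len(x))
    ]


signal = [1.0, 0.5, -0.25, 2.0, 0.0]
y_rec = ssm_recurrent(signal, a=0.9, b=0.5, c=2.0)
y_conv = ssm_convolutional(signal, a=0.9, b=0.5, c=2.0)

# Largest difference between the two formulations (floating-point noise only).
max_gap = max(abs(u - v) for u, v in zip(y_rec, y_conv))
```

The recurrent form stores only `h` regardless of sequence length, which is the property the abstract refers to for inference. The convolutional form exposes all time steps at once, which suits parallel training.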

Section Ⅰ: Results

Results for the MUSDB18 and PianoEval datasets comparing ViSQOL, LSD, and subjective scores across different models, as well as performance metrics on two GPU types (NVIDIA RTX 3090 and RTX 2080 Ti) for 10-second samples.

MUSDB18 Results

Model	ViSQOL ↑	LSD ↓	Score ↑
Low-Resolution	1.82	3.98	38.22
AERO	2.90	1.34	60.03
AEROMamba	2.93	1.23	66.47

Comparison of ViSQOL, LSD, and subjective scores for various models on the MUSDB18 dataset.

PianoEval Results

Model	ViSQOL ↑	LSD ↓	Score ↑
Low-Resolution	4.36	1.09	72.92
AERO	4.38	0.99	76.89
AEROMamba-HQ	4.38	1.00	84.41

Comparison of ViSQOL, LSD, and subjective scores for various models on the PianoEval dataset. Models labeled with `-HQ` were trained on PianoEval-HQ.
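For readers unfamiliar with LSD (log-spectral distance, lower is better), the following is a minimal sketch of one commonly used definition: the root-mean-square difference of log10 power spectra over frequency, averaged over frames. The exact variant and STFT settings used for the tables above are not stated on this page, so the function name, the `eps` floor, and the input layout (a list of frames, each a list of power values) are assumptions.

```python
import math


def log_spectral_distance(ref_power, est_power, eps=1e-10):
    """Mean over frames of the RMS log10-power difference over frequency bins.

    ref_power, est_power: lists of frames; each frame is a list of
    non-negative power-spectrum values (assumed layout, not from the paper).
    """
    if len(ref_power) != len(est_power):
        raise ValueError("spectrograms must have the same number of frames")
    frame_distances = []
    for ref_frame, est_frame in zip(ref_power, est_power):
        if len(ref_frame) != len(est_frame):
            raise ValueError("frames must have the same number of bins")
        squared = [
            (math.log10(max(r, eps)) - math.log10(max(e, eps))) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        frame_distances.append(math.sqrt(sum(squared) / len(squared)))
    return sum(frame_distances) / len(frame_distances)


reference = [[100.0, 100.0], [100.0, 100.0]]
estimate = [[1.0, 1.0], [1.0, 1.0]]

lsd_same = log_spectral_distance(reference, reference)  # identical spectra
lsd_diff = log_spectral_distance(reference, estimate)   # 2 decades apart in every bin
```

In this toy case every bin differs by two decades of power, so the distance is 2; identical spectra give 0.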

Performance Comparison

Method	RTX 3090 GPU Usage (MB)	RTX 3090 Time (s)	RTX 2080 Ti GPU Usage (MB)	RTX 2080 Ti Time (s)	Parameters
AERO	17091	1.246	16420*	--	19,432,958
AEROMamba	3000	0.087	1914	0.063	20,964,190
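The speed and memory ratios quoted in the abstract can be checked against the RTX 3090 columns of the table above. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```python
# Values copied from the RTX 3090 columns of the performance table.
aero = {"memory_mb": 17091, "time_s": 1.246, "params": 19_432_958}
aeromamba = {"memory_mb": 3000, "time_s": 0.087, "params": 20_964_190}

# How many times faster AEROMamba is (roughly 14.3x).
speedup = aero["time_s"] / aeromamba["time_s"]

# How many times less GPU memory it uses (roughly 5.7x).
memory_ratio = aero["memory_mb"] / aeromamba["memory_mb"]

# Relative increase in parameter count, in percent (a little under 8%).
param_increase_pct = (aeromamba["params"] - aero["params"]) / aero["params"] * 100

print(f"speedup: {speedup:.1f}x, memory: {memory_ratio:.1f}x less, "
      f"parameters: +{param_increase_pct:.1f}%")
```

The gains come despite AEROMamba having slightly more parameters than AERO.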

Section Ⅱ: Examples for MUSDB tracks upsampled from 11.025 kHz to 44.1 kHz.

	Original low resolution 11.025 kHz	Original high resolution 44.1 kHz	AERO 11.025 -> 44.1 kHz	AEROMamba 11.025 -> 44.1 kHz

Section Ⅲ: Examples for PianoEval tracks upsampled from 11.025 kHz to 44.1 kHz.

	Original low resolution 11.025 kHz	Original high resolution 44.1 kHz	AERO 11.025 -> 44.1 kHz	AEROMamba 11.025 -> 44.1 kHz

Section Ⅳ: PianoEval Metadata

We collected the PianoEval dataset, which consists of two parts. The first comprises the 24 Preludes for Piano, Op. 28, by Chopin, performed by 33 pianists in 45 different recordings available on CD (Compact Disc), totaling approximately 22 hours. The second part contains excerpts of Ligeti piano études, a Schumann sonata, and the Barber sonata, each played by a different performer, totaling approximately 3.5 hours. Each file is stored in WAV format, in stereo, sampled at 44.1 kHz. Information about performers, record labels, and recording years is detailed in the tables below.
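When assembling a dataset with the file format described above (WAV, stereo, 44.1 kHz), it can help to check each file's header before training. The sketch below uses Python's standard `wave` module; the helper names and the temporary test files are hypothetical and are not part of the PianoEval distribution.

```python
import os
import tempfile
import wave


def matches_expected_format(path, channels=2, sample_rate=44100):
    """Return True if the WAV header reports the expected channels and rate."""
    with wave.open(path, "rb") as wav:
        return wav.getnchannels() == channels and wav.getframerate() == sample_rate


def write_silent_wav(path, channels, sample_rate, n_frames=100):
    """Write a short silent 16-bit WAV file (used here only for the demo)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(sample_rate)
        wav.writeframes(b"\x00\x00" * channels * n_frames)


with tempfile.TemporaryDirectory() as tmp:
    stereo_path = os.path.join(tmp, "stereo_44k.wav")
    mono_path = os.path.join(tmp, "mono_11k.wav")
    write_silent_wav(stereo_path, channels=2, sample_rate=44100)
    write_silent_wav(mono_path, channels=1, sample_rate=11025)

    stereo_ok = matches_expected_format(stereo_path)  # stereo, 44.1 kHz
    mono_ok = matches_expected_format(mono_path)      # mono, 11.025 kHz
```

Only the header is inspected, so the check is cheap even for long recordings.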

Train/Validation - Part 1

Pianist	Record label	Year
Arrau, C.	Columbia	1950/1
Arrau, C.	Philips	1973
Argerich, M.	Deutsche Grammophon	1975
Ashkenazy, V.	Decca	1976
Ashkenazy, V.	Decca	1992
Bolet, J.	RCA	1974
Blechacz, R.	Deutsche Grammophon	2007
Cherkassky, S.	ASV	1968
Cortot, A.	HMV	1926
Cortot, A.	HMV	1933/4
Cortot, A.	Gramophone	1942
Cortot, A.	Archipel [live]	1955
Cortot, A.	EMI	1957
Davidovich, B.	Decca	1979
de Larrocha, A.	Decca	1974
Duchable, F.	Erato	1988
Dutra, G.	Yellow Tail	1997
El Bacha, A. R.	Forlane	1999
François, S.	EMI	1959
Freire, N.	Columbia	1970
Harasiewicz, A.	Philips	1963

Train/Validation - Part 2

Pianist	Record label	Year
Katsaris, C.	Sony	1992
Kissin, Y.	RCA	1999
Lima, A. M.	Caras¹	1981
Lucchesini, A.	EMI	1988²
Magaloff, N.	Philips	1975
Novaes, G.	Music and Arts [live]	1949
Ohlsson, G.	EMI	1974
Ohlsson, G.	Hyperion	1989
Perahia, M.	Columbia	1975
Petri, E.	Columbia	1942
Pires, M.	Erato	1975
Pires, M.	Deutsche Grammophon	1992
Pogorelich, I.	Deutsche Grammophon	1989
Pollini, M.	Deutsche Grammophon	1974
Pollini, M.	Deutsche Grammophon	2011
Proença, M.	Delphos	1999
Rubinstein, A.	RCA	1946
Switala, W.	NIFC	2006/7
Tiempo, S.	Victor	1990
Varsi, D.	Genuin	1988

The superscript ¹ refers to a magazine, and the superscript ² refers to the release year, not the recording year.

Test

Pianist	Record label	Year
Glemser, B.	Naxos	1993
Pollack, D.	Naxos	1995
Aimard, P. L.	Sony	1995

BibTeX

@inproceedings{Abreu2024lamir,
  author    = {Wallace Abreu and Luiz Wagner Pereira Biscainho},
  title     = {AEROMamba: An Efficient Architecture for Audio Super-Resolution Using Generative Adversarial Networks and State Space Models},
  booktitle = {Proceedings of the 1st Latin American Music Information Retrieval Workshop},
  year      = {2024},
  address   = {Rio de Janeiro, Brazil},
}