AEROMamba: An efficient architecture for audio super-resolution using generative adversarial networks and state space models

Federal University of Rio de Janeiro, Brazil
LAMIR Workshop 2024

Abstract

Audio super-resolution aims to enhance low-resolution signals by creating high-frequency content. In this work, we modify the architecture of AERO (a state-of-the-art system for this task) for music super-resolution. Specifically, we replace its original Attention and LSTM layers with Mamba, a State Space Model (SSM), across all network layers. Mamba is capable of effectively substituting the mentioned modules, as it offers a mechanism similar to that of Attention while also functioning as a recurrent network. With the proposed AEROMamba, training requires 2-4x less GPU memory, since Mamba exploits the convolutional formulation and leverages the GPU memory hierarchy. Additionally, during inference, Mamba operates in constant memory due to recurrence, avoiding the memory growth associated with Attention. This results in a 14x speed improvement using 5x less GPU memory. Subjective listening tests (0 to 100 scale) show that the proposed model surpasses the AERO model. On the MUSDB dataset, degraded signals scored 38.22, while AERO and AEROMamba scored 60.03 and 66.74, respectively. For the PianoEval dataset, scores were 72.92 for degraded signals, 76.89 for AERO, and 84.41 for AEROMamba.
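The abstract's point that a state space model can be evaluated either as a convolution or as a recurrence with a fixed-size state can be illustrated with a toy example. The sketch below is a plain linear time-invariant SSM with a diagonal state matrix and made-up parameters; it is not Mamba's input-dependent (selective) mechanism and not the AEROMamba code, and all names in it are hypothetical.

```python
def ssm_recurrent(a, b, c, u):
    """Run h[t] = a * h[t-1] + b * u[t], y[t] = c . h[t] (diagonal state matrix).

    Only the current state `h` is stored, so memory does not grow with len(u).
    """
    h = [0.0] * len(a)
    y = []
    for u_t in u:
        h = [a_i * h_i + b_i * u_t for a_i, b_i, h_i in zip(a, b, h)]
        y.append(sum(c_i * h_i for c_i, h_i in zip(c, h)))
    return y


def ssm_convolutional(a, b, c, u):
    """Same model written as a causal convolution y = k * u.

    The kernel is k[j] = sum_i c_i * a_i**j * b_i.
    """
    n = len(u)
    k = [
        sum(c_i * (a_i ** j) * b_i for a_i, b_i, c_i in zip(a, b, c))
        for j in range(n)
    ]
    return [sum(k[j] * u[t - j] for j in range(t + 1)) for t in range(n)]


# Hypothetical toy parameters (not taken from AEROMamba).
A = [0.9, 0.5]
B = [1.0, 0.5]
C = [0.3, -0.2]
signal = [1.0, 0.0, 2.0, -1.0, 0.5]

y_rec = ssm_recurrent(A, B, C, signal)
y_conv = ssm_convolutional(A, B, C, signal)
```

Both functions produce the same output sequence. The recurrent form keeps a state of fixed size regardless of sequence length, which is the property the abstract relies on for constant-memory inference.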

Section Ⅰ: Results

This section reports ViSQOL, LSD, and subjective listening scores for each model on the MUSDB18 and PianoEval datasets, along with performance metrics on two GPU types (NVIDIA RTX 3090 and RTX 2080 Ti).
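For readers unfamiliar with LSD (log-spectral distance), the following is a minimal sketch of a commonly used definition: the root-mean-square difference of log power spectra over frequency bins, averaged over frames. The STFT settings and the exact implementation used for the reported numbers are not stated on this page, so the function name, the `eps` constant, and the toy spectrograms are assumptions for illustration only.

```python
import math


def log_spectral_distance(ref_mag, est_mag, eps=1e-10):
    """LSD between two magnitude spectrograms given as [frame][bin] lists."""
    per_frame = []
    for ref_frame, est_frame in zip(ref_mag, est_mag):
        # Squared difference of log power spectra in each frequency bin.
        sq_diffs = [
            (math.log10(r * r + eps) - math.log10(e * e + eps)) ** 2
            for r, e in zip(ref_frame, est_frame)
        ]
        # RMS over frequency bins for this frame.
        per_frame.append(math.sqrt(sum(sq_diffs) / len(sq_diffs)))
    # Average over frames.
    return sum(per_frame) / len(per_frame)


# Hypothetical two-frame, three-bin magnitude spectrograms.
reference = [[1.0, 0.5, 0.1], [0.8, 0.4, 0.2]]
estimate = [[1.0, 0.5, 0.01], [0.8, 0.4, 0.2]]

lsd_same = log_spectral_distance(reference, reference)
lsd_diff = log_spectral_distance(reference, estimate)
```

Identical spectrograms give an LSD of zero, and the value grows as the estimate's spectrum departs from the reference, which is why lower is better in the tables.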

MUSDB18 Results

| Model | ViSQOL ↑ | LSD ↓ | Score ↑ |
|---|---|---|---|
| Low-Resolution | 1.82 | 3.98 | 38.22 |
| AERO | 2.90 | 1.34 | 60.03 |
| AEROMamba | 2.93 | 1.23 | 66.47 |

Comparison of ViSQOL, LSD, and subjective scores for various models on the MUSDB18 dataset.

PianoEval Results

| Model | ViSQOL ↑ | LSD ↓ | Score ↑ |
|---|---|---|---|
| Low-Resolution | 4.36 | 1.09 | 72.92 |
| AERO | 4.38 | 0.99 | 76.89 |
| AEROMamba-HQ | 4.38 | 1.00 | 84.41 |

Comparison of ViSQOL, LSD, and subjective scores for various models on the PianoEval dataset. Models labeled with `-HQ` were trained on PianoEval-HQ.

Performance Comparison

| Method | RTX 3090 GPU Usage (MB) | RTX 3090 Time (s) | RTX 2080 Ti GPU Usage (MB) | RTX 2080 Ti Time (s) | Parameters |
|---|---|---|---|---|---|
| AERO | 17091 | 1.246 | 16420* | -- | 19,432,958 |
| AEROMamba | 3000 | 0.087 | 1914 | 0.063 | 20,964,190 |
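The speed and memory ratios quoted in the abstract can be checked against the RTX 3090 figures in the table above. The short calculation below reads those figures as 17091 MB and 1.246 s for AERO and 3000 MB and 0.087 s for AEROMamba; the variable names are chosen here for illustration.

```python
# Values read from the RTX 3090 columns and the Parameters column of the table.
aero = {"gpu_mb": 17091, "time_s": 1.246, "params": 19_432_958}
aeromamba = {"gpu_mb": 3000, "time_s": 0.087, "params": 20_964_190}

# How many times faster AEROMamba is, and how many times less GPU memory it uses.
speedup = aero["time_s"] / aeromamba["time_s"]
memory_ratio = aero["gpu_mb"] / aeromamba["gpu_mb"]

# Relative growth in parameter count from AERO to AEROMamba.
param_increase = aeromamba["params"] / aero["params"] - 1.0

print(f"speedup: {speedup:.1f}x")
print(f"memory reduction: {memory_ratio:.1f}x")
print(f"parameter increase: {param_increase:.1%}")
```

The ratios come out at roughly 14x for time and between 5x and 6x for memory, consistent with the abstract, while the parameter count grows by about 8%.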

Section Ⅱ: Examples for MUSDB tracks upsampled from 11.025 kHz to 44.1 kHz.

Original low resolution: 11.025 kHz
Original high resolution: 44.1 kHz
AERO: 11.025 -> 44.1 kHz
AEROMamba: 11.025 -> 44.1 kHz

Section Ⅲ: Examples for PianoEval tracks upsampled from 11.025 kHz to 44.1 kHz.

Original low resolution: 11.025 kHz
Original high resolution: 44.1 kHz
AERO: 11.025 -> 44.1 kHz
AEROMamba: 11.025 -> 44.1 kHz

Section Ⅳ: PianoEval Metadata

We collected the PianoEval dataset, which consists of two parts. The first comprises the 24 Preludes for Piano, Op. 28, by Chopin, performed by 33 pianists in 45 different recordings available on CD (Compact Disc), totaling approximately 22 hours. The second part contains excerpts of Ligeti piano études, a Schumann sonata, and the Barber sonata, played by three different performers, respectively, totaling approximately 3.5 hours. Each file is stored in WAV format, in stereo, and sampled at 44.1 kHz. Information about performers, record labels, and recording years is detailed in the tables below.

Train/Validation - Part 1

| Pianist | Record label | Year |
|---|---|---|
| Arrau, C. | Columbia | 1950/1 |
| Arrau, C. | Philips | 1973 |
| Argerich, M. | Deutsche Grammophon | 1975 |
| Ashkenazy, V. | Decca | 1976 |
| Ashkenazy, V. | Decca | 1992 |
| Bolet, J. | RCA | 1974 |
| Blechacz, R. | Deutsche Grammophon | 2007 |
| Cherkassky, S. | ASV | 1968 |
| Cortot, A. | HMV | 1926 |
| Cortot, A. | HMV | 1933/4 |
| Cortot, A. | Gramophone | 1942 |
| Cortot, A. | Archipel [live] | 1955 |
| Cortot, A. | EMI | 1957 |
| Davidovich, B. | Decca | 1979 |
| de Larrocha, A. | Decca | 1974 |
| Duchable, F. | Erato | 1988 |
| Dutra, G. | Yellow Tail | 1997 |
| El Bacha, A. R. | Forlane | 1999 |
| François, S. | EMI | 1959 |
| Freire, N. | Columbia | 1970 |
| Harasiewicz, A. | Philips | 1963 |

Train/Validation - Part 2

| Pianist | Record label | Year |
|---|---|---|
| Katsaris, C. | Sony | 1992 |
| Kissin, Y. | RCA | 1999 |
| Lima, A. M. | Caras¹ | 1981 |
| Lucchesini, A. | EMI | 1988² |
| Magaloff, N. | Philips | 1975 |
| Novaes, G. | Music and Arts [live] | 1949 |
| Ohlsson, G. | EMI | 1974 |
| Ohlsson, G. | Hyperion | 1989 |
| Perahia, M. | Columbia | 1975 |
| Petri, E. | Columbia | 1942 |
| Pires, M. | Erato | 1975 |
| Pires, M. | Deutsche Grammophon | 1992 |
| Pogorelich, I. | Deutsche Grammophon | 1989 |
| Pollini, M. | Deutsche Grammophon | 1974 |
| Pollini, M. | Deutsche Grammophon | 2011 |
| Proença, M. | Delphos | 1999 |
| Rubinstein, A. | RCA | 1946 |
| Switala, W. | NIFC | 2006/7 |
| Tiempo, S. | Victor | 1990 |
| Varsi, D. | Genuin | 1988 |

The superscript 1 refers to a magazine, and the superscript 2 refers to the release year, not the recording year.

Test

| Pianist | Record label | Year |
|---|---|---|
| B. Glemser | Naxos | 1993 |
| D. Pollack | Naxos | 1995 |
| P. L. Aimard | Sony | 1995 |
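The file format stated for PianoEval (WAV, stereo, 44.1 kHz) can be checked programmatically before training or evaluation. The sketch below uses Python's standard `wave` module; the function names and the synthetic test files are hypothetical and are not part of the dataset or its tooling.

```python
import os
import tempfile
import wave


def matches_pianoeval_format(path, channels=2, sample_rate=44100):
    """Return True if the WAV file at `path` is stereo and sampled at 44.1 kHz."""
    with wave.open(path, "rb") as wav_file:
        return (
            wav_file.getnchannels() == channels
            and wav_file.getframerate() == sample_rate
        )


def write_silence(path, channels, sample_rate, n_frames=100):
    """Write a short silent 16-bit WAV file, used here only for the demo."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(channels)
        wav_file.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)
        wav_file.writeframes(b"\x00\x00" * channels * n_frames)


# Demo on two synthetic files: one in the stated format, one mono at 11.025 kHz.
with tempfile.TemporaryDirectory() as tmp_dir:
    good_path = os.path.join(tmp_dir, "stereo_44k.wav")
    bad_path = os.path.join(tmp_dir, "mono_11k.wav")
    write_silence(good_path, channels=2, sample_rate=44100)
    write_silence(bad_path, channels=1, sample_rate=11025)
    good_ok = matches_pianoeval_format(good_path)
    bad_ok = matches_pianoeval_format(bad_path)
```

Only the header fields are inspected, so the check is fast even on long recordings.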

BibTeX

@inproceedings{Abreu2024lamir,
        author    = {Wallace Abreu and Luiz Wagner Pereira Biscainho},
        title     = {AEROMamba: An Efficient Architecture for Audio Super-Resolution Using Generative Adversarial Networks and State Space Models},
        booktitle = {Proceedings of the 1st Latin American Music Information Retrieval Workshop},
        year      = {2024},
        address   = {Rio de Janeiro, Brazil},
      }