
Abstract

In this paper, we present a deep-learning-based framework for audio-visual speech inpainting, i.e. the task of restoring the missing parts of an acoustic speech signal from reliable audio context and uncorrupted visual information. Recent work focuses solely on audio-only methods and generally aims at inpainting music signals, which have a very different structure from speech. Instead, we inpaint speech signals with gaps ranging from 100 ms to 1600 ms to investigate the contribution that vision can provide for gaps of different durations. We also experiment with a multi-task training approach in which a phone recognition task is learned together with speech inpainting. Results show that the performance of audio-only speech inpainting approaches degrades rapidly when gaps get large, while the proposed audio-visual approach is able to plausibly restore the missing information. In addition, we show that multi-task learning is effective, although the largest contribution to performance comes from vision.
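To make the inpainting setup concrete, the sketch below shows one way a gap of a given duration (e.g. 800 ms, as in the demo clips) could be simulated by zero-masking frames of a log-magnitude spectrogram. This is only an illustration under assumed settings: the sampling rate, hop length, function name `mask_gap`, and zero-masking convention are our assumptions here, not details taken from the paper or its code.

```python
import numpy as np

# Minimal sketch (not the paper's code): simulate a missing segment in a
# log-magnitude spectrogram by zero-masking all frames inside a gap of a
# chosen duration, mirroring the 100 ms - 1600 ms settings described above.

SAMPLE_RATE = 16000   # assumed sampling rate
HOP_LENGTH = 160      # assumed STFT hop, i.e. 10 ms frames

def mask_gap(spectrogram: np.ndarray, gap_ms: int, start_frame: int):
    """Zero out the frames covered by a gap of `gap_ms` milliseconds.

    spectrogram: (num_frames, num_bins) log-magnitude features.
    Returns the corrupted spectrogram and a boolean mask of the gap frames.
    """
    frames_per_ms = SAMPLE_RATE / HOP_LENGTH / 1000.0
    gap_frames = int(round(gap_ms * frames_per_ms))
    end_frame = min(start_frame + gap_frames, spectrogram.shape[0])

    mask = np.zeros(spectrogram.shape[0], dtype=bool)
    mask[start_frame:end_frame] = True

    corrupted = spectrogram.copy()
    corrupted[mask] = 0.0  # the inpainting model must restore these frames
    return corrupted, mask

# Toy usage: a random "spectrogram" with an 800 ms gap starting at frame 50.
spec = np.random.randn(300, 257).astype(np.float32)
corrupted, gap_mask = mask_gap(spec, gap_ms=800, start_frame=50)
print(f"masked {gap_mask.sum()} of {spec.shape[0]} frames")
```

An inpainting model is then trained to predict the original frames under the mask from the surrounding audio context (and, in the audio-visual case, from the uncorrupted visual stream); the masked region is where audio-only and audio-visual systems differ most as the gap grows.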

Demo

The following video contains several examples of inpainted speech signals generated by the models proposed in our paper.

800 ms Gap

Example 1

Input
Audio-only MTL
Audio-Visual MTL
Ground Truth

Example 2

Input
Audio-only MTL
Audio-Visual MTL
Ground Truth

1600 ms Gap

Example 1

Input
Audio-only MTL
Audio-Visual MTL
Ground Truth

Example 2

Input
Audio-only MTL
Audio-Visual MTL
Ground Truth

Paper

The paper is available here. If this project is useful for your research, please cite:

@inproceedings{morrone2021audio,
  title={Audio-visual speech inpainting with deep learning},
  author={Morrone, Giovanni and Michelsanti, Daniel and Tan, Zheng-Hua and Jensen, Jesper},
  booktitle={2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={6653--6657},
  year={2021},
  organization={IEEE}
}