Hush-Hush Speak: Speech Reconstruction Using Silent Videos

Published in The Twentieth Annual Conference of the International Speech Communication Association, 2019

Recommended citation: Shashwat Uttam*, Yaman Kumar*, Dhruva Sahrawat*, Mansi Agarwal, Rajiv Ratn Shah, Debanjan Mahata. "Hush-Hush Speak: Speech Reconstruction Using Silent Videos." The Twentieth Annual Conference of the International Speech Communication Association (InterSpeech 2019).

[PDF] [DOI]

Abstract

Speech reconstruction is the task of recreating speech from silent videos. In the literature, it is also referred to as lipreading. In this paper, we design an encoder-decoder architecture that takes silent videos as input and outputs an audio spectrogram of the reconstructed speech. Despite being speaker-independent, the model achieves speech reconstruction results comparable to those of the current state-of-the-art speaker-dependent model. We also perform user studies to assess speech intelligibility. Additionally, we test the usability of the trained model on bilingual speech.