Methods of speech decoding from neural activity play an important role in developing neuroprosthetic devices for individuals with severe neuromuscular and communication disorders. Several neurodegenerative impairments can lead to such disorders: amyotrophic lateral sclerosis, for example, affects motor production and speech articulation, while various forms of aphasia impair language production or comprehension. Measuring the electrical activity of the brain is the first major step in any neural-based speech decoding method. Electrocorticography (ECoG) is an invasive procedure that captures brain signals by placing electrodes directly on the exposed surface of the brain to record electrical activity from the cerebral cortex.
ECoG offers several advantages for the study of speech articulation and perception compared with conventional neuroimaging methods (e.g., functional MRI and positron emission tomography) and non-invasive electromagnetic techniques (e.g., electroencephalography [EEG] and magnetoencephalography). This superiority stems from ECoG’s high spatial and temporal resolution, which accurately captures the fast-changing dynamics of brain signals correlated with speech production and perception. The main advantages of ECoG include millimeter-scale spatial resolution; a frequency bandwidth of up to 200 Hz or higher, which makes ECoG better suited than EEG for analyzing the high gamma band (modulations in the 70–180 Hz range are highly correlated with speech perception and production); signal amplitudes of up to 100 µV, versus roughly 20 µV for EEG, giving ECoG a higher signal-to-noise ratio; and lower sensitivity to movement and myoelectrical artefacts than EEG.1
In this article, we’ll review the methods proposed for speech recognition based on ECoG signals and the progress made in the use of deep learning methods in this area.
ECOG-BASED SPEECH DECODING MODELS
ECoG-based speech decoding methods usually provide a framework for associating ECoG recordings (i.e., high gamma band power) with the behavioral tasks of speech production and perception. In this framework, a neural-based decoding model will be trained using articulated or perceived speech for reconstruction of speech from ECoG signals. In general, ECoG-based speech decoding models can be examined in three major aspects:
1. Speech: The level of speech production (i.e., sentences, words, syllables, etc.) that needs to be decoded, which varies across approaches. For example, decoding formants or phonemes may enable a model to reconstruct speech continuously, whereas decoding individual words or phrases yields a discrete classification of words or sentences.
2. ECoG signals: Brain signals that can be invasively captured from the different areas of the cortex involved in speech perception and articulation2 (e.g., the premotor area, primary motor area, and Broca’s area, associated with speech preparation and articulation, and the posterior and middle superior temporal gyrus and Wernicke’s area, involved in speech perception and processing). ECoG recordings need to be preprocessed for further analysis. This step consists of various filters, such as a spatial common average reference filter to remove undesirable noise, fluctuations, and artefacts; a high-pass filter starting between 0.5 and 2 Hz to further attenuate low-frequency fluctuations and heartbeat artifacts; and a notch filter at the harmonics of 60 or 50 Hz to eliminate power line interference. After the preprocessing step, the spectro-temporal features of the ECoG signals are extracted using different processing methods, such as the discrete Fourier transform (DFT),3 an autoregressive model,4 or band-power filtering.5
3. Machine learning methods: Mapping the spectro-temporal features extracted from ECoG signals to speech materials may not be a simple linear procedure. As such, machine learning methods should be employed to learn useful representations and create a complex mapping between ECoG features and speech materials. Although conventional machine learning methods have previously shown an acceptable performance for ECoG-based speech recognition, deep learning has recently introduced state-of-the-art methods with significant improvement in speech recognition. Thanks to recent developments in computer science and big data, deep neural networks (DNNs) have demonstrated great success in classification and regression models. For example, the use of deep convolutional neural networks and deep recurrent neural networks in mapping ECoG features to speech materials has resulted in much better performance than that with the use of conventional machine learning methods such as shallow neural networks, support vector machines (SVM), and linear discrimination analysis (LDA).
Before applying the spectro-temporal features of ECoG signals to a machine learning method, they should be compared to the features of a reference signal (e.g., recorded speech) in the decoding step to identify the most important channels related to speech perception or production (i.e., using an ANOVA)6 and reduce the feature dimensions (i.e., using principal component analysis).7 This stage often requires additional advanced signal processing and machine learning techniques to decode articulated or perceived speech from ECoG signals.
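As an illustration, the channel-selection and dimension-reduction steps described above can be sketched as follows. All data, class counts, and channel numbers here are hypothetical, and the specific statistical choices (a one-way ANOVA per channel, PCA via SVD) are one plausible instantiation of the general procedure, not the code of any particular study:

```python
import numpy as np
from scipy.stats import f_oneway

rng = np.random.default_rng(0)

# Hypothetical high gamma band powers: trials x channels x time frames.
n_trials, n_channels, n_frames = 60, 16, 20
labels = rng.integers(0, 3, size=n_trials)   # 3 hypothetical stimulus classes
X = rng.standard_normal((n_trials, n_channels, n_frames))
X[labels == 1, 3] += 1.0                     # channel 3 is stimulus-modulated

# 1) Channel selection: one-way ANOVA across classes on mean band power.
mean_power = X.mean(axis=2)                  # trials x channels
f_vals = np.array([
    f_oneway(*[mean_power[labels == c, ch] for c in range(3)]).statistic
    for ch in range(n_channels)
])
selected = np.argsort(f_vals)[::-1][:4]      # keep the 4 most modulated channels

# 2) Dimension reduction: PCA via SVD on the selected channels' features.
feats = X[:, selected, :].reshape(n_trials, -1)
feats = feats - feats.mean(axis=0)
_, _, Vt = np.linalg.svd(feats, full_matrices=False)
reduced = feats @ Vt[:5].T                   # project onto 5 principal components
print(reduced.shape)
```

In practice, the reference signal (e.g., recorded speech) would supply the class labels, and the number of retained components would be chosen from the explained-variance curve rather than fixed in advance.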
Moses, et al.,8 introduced a real-time method to decode perceived speech from ECoG signals captured from the superior temporal gyrus while the subjects listened to spoken speech. The stimulus comprised multiple repetitions of 10 sentences from the TIMIT dataset. The phoneme sequences of the spoken speech were the decoding targets, and the high gamma band powers of the ECoG signals served as the neural features. Two classifiers, LDA and the hidden Markov model (HMM), were applied to decode perceived speech from the ECoG features. To calculate the power of the high gamma band, the ECoG signals were notch-filtered at 60, 120, and 180 Hz to reduce line noise. Then, each channel was band-passed at 70 to 150 Hz, squared, and smoothed using a low-pass filter at 10 Hz. Thereafter, z-scoring and two-tailed Welch’s t-tests were applied to reject poor channels and identify channels modulated by the presence of the stimulus, and principal component analysis (PCA) was used for dimension reduction. The PCA-LDA learning method treated each stimulus (sentence) as one of 10 classes, then HMM-based classification decoded the phoneme sequences in each sentence. The results showed stimulus prediction accuracies of 90 percent and 98 percent using the PCA-LDA classification and the HMM-based classification, respectively, with a chance accuracy of 10 percent.
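The high gamma power extraction in this pipeline can be approximated in a few lines. Only the cutoff frequencies come from the description above; the filter orders, notch Q factor, sampling rate, and test signal are assumptions for illustration:

```python
import numpy as np
from scipy.signal import butter, filtfilt, iirnotch

def high_gamma_power(x, fs=1000.0):
    """Estimate smoothed, z-scored high gamma power for one ECoG channel.

    Cutoffs follow the pipeline described above; filter orders and the
    notch Q factor are illustrative assumptions.
    """
    # Notch filters at 60, 120, and 180 Hz to suppress line noise harmonics.
    for f0 in (60.0, 120.0, 180.0):
        b, a = iirnotch(f0, Q=30.0, fs=fs)
        x = filtfilt(b, a, x)
    # Band-pass 70-150 Hz (high gamma), then square to estimate power.
    b, a = butter(4, [70.0, 150.0], btype="bandpass", fs=fs)
    p = filtfilt(b, a, x) ** 2
    # Smooth with a 10 Hz low-pass, then z-score across time.
    b, a = butter(4, 10.0, btype="lowpass", fs=fs)
    p = filtfilt(b, a, p)
    return (p - p.mean()) / p.std()

rng = np.random.default_rng(1)
sig = rng.standard_normal(5000)   # 5 s of hypothetical ECoG at 1 kHz
hg = high_gamma_power(sig)
print(hg.shape)
```

Zero-phase filtering (`filtfilt`) is used here to avoid shifting the power envelope in time; a real-time decoder such as the one described above would instead need causal filters.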
Mugler, et al.,9 proposed a method to decode articulated speech using ECoG recordings while the subjects produced words from a Modified Rhyme Test (MRT). This research attempted to decode the entire set of phonemes in American English. ECoG feature selection was performed on the time-frequency features (short-time Fourier transform features in the mu, beta, and high gamma frequency bands) using an ANOVA, then LDA was used for classification. The results showed accuracies of 36.1 percent for all consonant phonemes (with a chance accuracy of 7.4 percent) and 23.9 percent for all vowel phonemes (with a chance accuracy of 12.9 percent).
Bouchard, et al.,10 investigated the phonetic organization of the speech sensorimotor cortex by examining modulations of the ECoG high gamma band during the production of consonant-vowel syllables. Spatial patterns of cortical activity showed that the gamma band activity recorded by electrodes over the sensorimotor cortex was present and had different spatial organizations for consonants versus vowels.
Ramsey, et al.,11 presented a model for decoding four spoken phonemes based on ECoG signals captured from the sensorimotor cortex by using a high-density electrode grid. Three classifiers—SVM and spatiotemporal- and spatial-matched filters—were applied to decode speech from ECoG signals. The results showed that the spatiotemporal-matched filters outperformed the other methods. Moreover, high-density grids and discrete phonemes positively affected the classification performance.
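A matched-filter classifier of the kind compared in this study can be sketched simply: a template is averaged from the training trials of each class, and each test trial is assigned to the template with which it correlates best. All dimensions, noise levels, and data below are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical task: classify 4 phonemes from channels x time high gamma trials.
n_classes, n_channels, n_time, n_train, n_test = 4, 8, 50, 20, 10
class_patterns = rng.standard_normal((n_classes, n_channels, n_time))

def simulate(n_per_class, noise=0.8):
    """Generate noisy trials around each class's spatiotemporal pattern."""
    X = np.concatenate([
        class_patterns[c] + noise * rng.standard_normal((n_per_class, n_channels, n_time))
        for c in range(n_classes)
    ])
    y = np.repeat(np.arange(n_classes), n_per_class)
    return X, y

X_train, y_train = simulate(n_train)
X_test, y_test = simulate(n_test)

# Templates: the mean training response per class (the "matched filter").
templates = np.stack([X_train[y_train == c].mean(axis=0) for c in range(n_classes)])

def matched_filter_predict(X, templates):
    # Correlate each mean-removed, normalized trial with each template.
    Xf = X.reshape(len(X), -1)
    Tf = templates.reshape(len(templates), -1)
    Xf = (Xf - Xf.mean(1, keepdims=True)) / Xf.std(1, keepdims=True)
    Tf = (Tf - Tf.mean(1, keepdims=True)) / Tf.std(1, keepdims=True)
    return (Xf @ Tf.T).argmax(axis=1)

acc = (matched_filter_predict(X_test, templates) == y_test).mean()
print(acc)
```

Flattening channels and time together gives the spatiotemporal variant; averaging over time before correlating would give the purely spatial variant that performed worse in the study.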
In a different study, Badino, et al.,12 presented research on motor contribution to speech perception from neurobiological and computational perspectives. Neurobiological research that applies different types of transcranial magnetic stimulation (TMS) (i.e., repetitive or focal TMS) to the motor cortex region during speech production and discrimination demonstrates the activation of motor centers during speech perception tasks. From the computational perspective, they claimed that combining acoustic features with articulatory data recovered from acoustics considerably improves the automatic speech recognition (ASR) performance.
Recently, Anumanchipalli, et al., proposed an efficient neural decoder to leverage kinematic and sound representations encoded in human cortical activity to synthesize audible speech.13 They employed recurrent neural networks to decode recorded cortical activity into representations of articulatory movement, then mapped the representations to speech acoustics. Their proposed decoder model can also synthesize speech when subjects silently mime words. Their findings can lead to significant clinical developments in using speech neuroprosthetic technology to retrieve speech produced by patients with severe communication disorders.
COMBINING NEURAL, ACOUSTIC FEATURES
ECoG recordings (i.e., the high gamma band) captured from the motor cortex during speech production contain a wide range of sensorimotor data related to the vocal tract. As shown in Figure 1, combining acoustic features with articulatory data can improve ASR performance.12 In that work, however, the motor information (i.e., the articulatory gestures) was recovered from the acoustic signal through an acoustic-to-articulatory mapping. Combining pure motor data, extracted directly from ECoG signals during speech production, with acoustic features may therefore enhance the performance of speech recognition systems even further. In other words, if we can precisely extract the motor features related to the vocal tract from ECoG signals (i.e., recorded from the motor cortex during speech production) and add them to the observation vectors (i.e., containing acoustic coefficients such as mel-frequency cepstral coefficients [MFCCs] or mel-frequency spectral coefficients [MFSCs]), then word recognition accuracy should increase considerably. Accordingly, the procedure in Figure 2 is proposed.
In the preprocessing stage for the ECoG signals, various filters can be applied to remove undesirable noise, fluctuations, and artefacts. Each channel is band-passed at 70 to 180 Hz to obtain high gamma signals. Thereafter, a time-frequency analysis (i.e., windowing followed by DFT calculation and segmentation) is applied to the high gamma signals to calculate the high gamma band power of the time-frequency cells in each channel. Subsequently, a statistical technique (i.e., an ANOVA or Welch’s t-tests) can be used to determine the relevant channels. In the next step, principal component analysis can be applied to reduce the dimensionality of the features to the minimum number required to explain the desired variance.

On the acoustic side, the signal is segmented into windows of the same time length as those used on the neural side, and the desired acoustic features are then extracted from each frame; MFCCs and MFSCs, for example, are efficient features for ASR. Feature vectors are then created by combining the neural and acoustic features. Lastly, a machine learning method can be used to decode speech from the combined input features. A previous hybrid system combining a deep neural network with an HMM demonstrated satisfactory speech recognition performance using acoustic features and articulatory data recovered from speech signals. In the proposed model, a convolutional neural network (CNN) can be employed to create a complex mapping between the multidimensional input features (neural and acoustic) and the system output (speech material such as the related words or phrases). Because CNNs use weight sharing in their convolutional layers, the number of system parameters (weights and biases) is smaller than in fully connected layers. This property makes CNNs more efficient during training and well suited to subsampling the input features.
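The frame-aligned fusion of neural and acoustic features described above can be sketched as follows. The log band energies are a crude stand-in for MFSCs, and all sampling rates, channel counts, frame lengths, and data are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
fs_audio, fs_ecog = 16000, 1000
frame_ms, hop_ms = 25, 10                      # shared framing on both sides

audio = rng.standard_normal(fs_audio)          # 1 s of hypothetical speech audio
ecog_hg = rng.standard_normal((8, fs_ecog))    # 8 channels of high gamma power

def frame(x, fs, frame_ms, hop_ms):
    """Slice the last axis of x into overlapping frames."""
    n, h = fs * frame_ms // 1000, fs * hop_ms // 1000
    starts = np.arange(0, x.shape[-1] - n + 1, h)
    return np.stack([x[..., i:i + n] for i in starts], axis=-2)

# Acoustic side: log band energies per frame (a crude stand-in for MFSCs).
aframes = frame(audio, fs_audio, frame_ms, hop_ms)
spec = np.abs(np.fft.rfft(aframes * np.hanning(aframes.shape[-1]), axis=-1)) ** 2
bands = np.array_split(np.arange(spec.shape[-1]), 13)
acoustic = np.log(np.stack([spec[:, b].sum(-1) for b in bands], -1) + 1e-8)

# Neural side: mean high gamma power per channel over the same time frames.
nframes = frame(ecog_hg, fs_ecog, frame_ms, hop_ms)   # channels x frames x samples
neural = nframes.mean(-1).T                           # frames x channels

# Fused observation vectors: one row per frame, ready for a CNN or DNN-HMM.
fused = np.concatenate([acoustic, neural], axis=1)
print(fused.shape)
```

Using the same frame length and hop on both sides keeps the two feature streams time-aligned, so each fused row describes the same 25 ms of speech from both the acoustic and the neural perspective.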
Notably, features extracted from other types of non-acoustic signals (e.g., signals describing facial movements related to speech production14 or audio-visual information captured during speech production15) can serve as complementary features to the neural-acoustic features proposed above.
In deep learning, using appropriate and sufficient data to train networks can improve the generalization of trained models in unseen conditions. Therefore, combining the neural, acoustic, facial, and visual information captured during speech production can provide the big data needed for appropriate speech recognition training of individuals with severe neuromuscular and communication disorders.