Speech Recognition Systems Based on Deep Neural Networks: Foreign-Language Translation Material

2021-12-22 10:12

Speech Recognition

Victor Zue, Ron Cole, & Wayne Ward

MIT Laboratory for Computer Science, Cambridge, Massachusetts, USA

Oregon Graduate Institute of Science & Technology, Portland, Oregon, USA

Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

1 Defining the Problem

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as commands & control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in the table below. An isolated-word speech recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment: a user must provide samples of his or her speech before using them. Other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.
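A finite-state network of this kind can be sketched as a table that lists, for each word, the words allowed to follow it. The vocabulary below and the `<s>` / `</s>` start and end markers are hypothetical, chosen only for illustration; they do not come from the original text.

```python
# Hypothetical toy network: each word maps to the set of words allowed after it.
# "<s>" and "</s>" are assumed sentence-start and sentence-end markers.
FOLLOWERS = {
    "<s>": {"call", "dial"},
    "call": {"home", "office"},
    "dial": {"home", "office"},
    "home": {"</s>"},
    "office": {"</s>"},
}


def is_permitted(words):
    """Return True if every adjacent word pair is allowed by the network."""
    path = ["<s>", *words, "</s>"]
    return all(
        nxt in FOLLOWERS.get(cur, set())
        for cur, nxt in zip(path, path[1:])
    )


print(is_permitted(["call", "home"]))   # allowed word sequence
print(is_permitted(["home", "call"]))   # rejected: "home" cannot start a sentence
```

A recognizer using such a network would only consider word sequences for which `is_permitted` holds, which is how the grammar narrows the search.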

One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (see section for a discussion of language modeling in general and perplexity in particular). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and the placement of the microphone.
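The loose definition above can be made concrete with a small sketch. It assumes the common formulation of perplexity as the exponential of the average negative log probability that the language model assigns to each word. Under the further assumption that all permitted followers are equally likely, this reduces to the geometric mean of the number of choices at each position. The function names are illustrative, not from the original text.

```python
import math


def perplexity(probabilities):
    """Perplexity of a word sequence, given the model probability of each word."""
    log_sum = sum(math.log(p) for p in probabilities)
    return math.exp(-log_sum / len(probabilities))


def branching_perplexity(branch_counts):
    """Geometric mean of the number of words allowed at each position.

    Assumes every allowed word at a position is equally likely.
    """
    return perplexity([1.0 / k for k in branch_counts])


# Two positions with 2 and 8 allowed words: geometric mean is sqrt(2 * 8).
print(round(branching_perplexity([2, 8]), 6))
```

A task where the model leaves about ten equally likely choices per word therefore has a perplexity of about 10, regardless of the total vocabulary size.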

Parameters	Range
Speaking Mode	Isolated words to continuous speech
Speaking Style	Read speech to spontaneous speech
Enrollment	Speaker-dependent to Speaker-independent
Vocabulary	Small (<20 words) to large (>20,000 words)
Language Model	Finite-state to context-sensitive
Perplexity	Small (<10) to large (>100)
SNR	High (>30 dB) to low (<10 dB)
Transducer	Voice-cancelling microphone to telephone

Table: Typical parameters used to characterize the capability of speech recognition systems
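The SNR row of the table is expressed in decibels. As a rough illustration, the sketch below computes a signal-to-noise ratio from two sample lists using the standard power-ratio definition, 10 * log10(P_signal / P_noise). It assumes a separate noise-only recording is available, which the original text does not specify, and the function name is hypothetical.

```python
import math


def snr_db(signal, noise):
    """Signal-to-noise ratio in decibels from two lists of samples."""
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(x * x for x in noise) / len(noise)
    return 10.0 * math.log10(signal_power / noise_power)


# A signal with amplitude 1.0 against noise with amplitude 0.1
# has 100 times the power of the noise.
speech = [1.0, -1.0] * 4
background = [0.1, -0.1] * 4
print(round(snr_db(speech, background), 3))
```

By the ranges in the table, a value above 30 dB would count as a high-SNR condition and a value below 10 dB as a low one.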

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences of the phoneme. At word boundaries, contextual variations can be quite dramatic, making "gas shortage" sound like "gash shortage" in American English, and "devo andare" sound like "devandare" in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

Figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10-20 msec (see sections and 11.3 for signal representation and digital signal processing, respectively). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.
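The first stage, turning the digitized signal into measurements at a fixed rate, can be sketched as follows. This is a minimal illustration, not the front end of any particular system: the 25 ms frame length, the 10 ms step, and the use of log energy as the per-frame measurement are all assumptions made for the example.

```python
import math


def frame_signal(samples, sample_rate, frame_ms=25, step_ms=10):
    """Split samples into overlapping frames, one starting every step_ms."""
    frame_len = sample_rate * frame_ms // 1000
    step = sample_rate * step_ms // 1000
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, step)
    ]


def log_energy(frame):
    """One simple per-frame measurement: log of the mean squared amplitude."""
    mean_square = sum(x * x for x in frame) / len(frame)
    return math.log(mean_square + 1e-10)  # small offset avoids log(0) on silence


# One second of a 440 Hz tone sampled at 16 kHz (synthetic test input).
rate = 16000
tone = [math.sin(2 * math.pi * 440 * n / rate) for n in range(rate)]
frames = frame_signal(tone, rate)
features = [log_energy(f) for f in frames]
print(len(frames), len(frames[0]))
```

Real systems compute richer measurements per frame, but the fixed-rate framing shown here is the part the paragraph describes; the resulting feature sequence is what the search stage consumes.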

Figure: Components of a typical speech recognition system.

Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic le
