
An Overview of Text-To-Speech Synthesis Techniques (foreign-literature translation)

 2022-08-24 11:08  

An Overview of Text-To-Speech Synthesis Techniques

Abstract

The goal of this paper is to provide a short but comprehensive overview of text-to-speech synthesis by highlighting its natural language processing (NLP) and digital signal processing (DSP) components. First, the front-end or NLP component, comprised of text analysis, phonetic analysis, and prosodic analysis, is introduced; then two rule-based synthesis techniques (formant synthesis and articulatory synthesis) are explained. After that, concatenative synthesis is explored. Compared to rule-based synthesis, concatenative synthesis is simpler since there is no need to determine speech production rules. However, concatenative synthesis introduces the challenges of prosodic modification to speech units and resolving discontinuities at unit boundaries. Prosodic modification results in artifacts in the speech that make the speech sound unnatural. Unit selection synthesis, which is a kind of concatenative synthesis, solves this problem by storing numerous instances of each unit with varying prosodies. The unit that best matches the target prosody is selected and concatenated. Finally, hidden Markov model (HMM) synthesis is introduced.

Keywords: Speech Synthesis, Grapheme-to-Phoneme (G2P) Conversion, Concatenative Synthesis, Hidden Markov Model (HMM)

1. Introduction

Speech is the primary means of communication between people. The goal of speech synthesis, or text-to-speech (TTS), is to automatically generate speech (acoustic waveforms) from text [1]. In other words, a text-to-speech synthesizer is a computer-based system that should be able to read any text aloud. There is a fundamental difference between a text-to-speech synthesizer and any other talking machine (such as a cassette player) in the sense that we are interested in the automatic production of new sentences [2]. Speech synthesis performs this mapping in two phases. The first is text analysis, where the input text is transcribed into a phonetic representation, and the second is the generation of speech waveforms, where the acoustic output is produced from this phonetic and prosodic information. These two phases are usually called high-level and low-level synthesis.

There are three main approaches to speech synthesis: articulatory synthesis, formant synthesis, and concatenative synthesis. Articulatory synthesis generates speech by direct modeling of human articulator behavior. Formant synthesis models the pole frequencies of the speech signal. Formants are the resonance frequencies of the vocal tract. Since the formants constitute the main frequencies that make sounds distinct, speech is synthesized using these estimated frequencies. Concatenative speech synthesis, on the other hand, produces speech by concatenating small, prerecorded units of speech, such as phonemes, diphones, and triphones, to construct the utterance. The following figure gives a high-level block diagram of the concatenative TTS synthesis process.
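The concatenation step just described can be sketched in a few lines. The snippet below is a toy illustration, not a real synthesizer: the "diphone inventory" is a dictionary of made-up sample lists, and the unit names are hypothetical. It shows the core idea of joining prerecorded units with a short linear crossfade at each boundary to reduce discontinuities.

```python
# Toy sketch of concatenative synthesis: prerecorded unit waveforms are
# joined with a short linear crossfade at each unit boundary.

def crossfade_concat(units, overlap=4):
    """Concatenate unit waveforms, linearly crossfading `overlap` samples."""
    out = list(units[0])
    for unit in units[1:]:
        tail, head = out[-overlap:], unit[:overlap]
        # Fade out the previous unit while fading in the next one.
        for i in range(overlap):
            w = (i + 1) / (overlap + 1)
            out[-overlap + i] = (1 - w) * tail[i] + w * head[i]
        out.extend(unit[overlap:])
    return out

# Hypothetical diphone units selected for an utterance (placeholder samples).
inventory = {"h-e": [0.1] * 8, "e-l": [0.3] * 8, "l-o": [0.2] * 8}
wave = crossfade_concat([inventory["h-e"], inventory["e-l"], inventory["l-o"]])
print(len(wave))  # 8 + 2 * (8 - 4) = 16 samples
```

In a real system the units come from a recorded, labeled speech database, and smoothing is done with signal-processing methods (e.g. overlap-add on pitch periods) rather than a plain linear crossfade.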

2. Text Analysis

2.1 Text normalization

The first task of all text-to-speech systems is to preprocess or normalize the input text in a variety of ways. We need to break the input text into sentences and divide each sentence into a sequence of tokens (such as words, numbers, and dates). Non-natural-language tokens such as acronyms and abbreviations must be converted to natural-language tokens. In the following subsections, the steps of text normalization are explained in more detail.
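The pipeline just outlined (sentence segmentation, tokenization, expansion of non-natural-language tokens) can be sketched as follows. The expansion tables are illustrative assumptions, not taken from any real system, and the sentence splitter is deliberately naive.

```python
import re

# Illustrative expansion tables (assumptions, not from a real TTS system).
ACRONYMS = {"HTML": "h t m l"}
NUMBERS = {"2": "two", "3": "three"}

def normalize(text):
    # 1. Sentence segmentation (naive: split after ., !, or ?).
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    result = []
    for sentence in sentences:
        # 2. Tokenization: words/numbers vs. punctuation marks.
        tokens = re.findall(r"[\w']+|[^\w\s]", sentence)
        # 3. Expand non-natural-language tokens into natural words.
        tokens = [NUMBERS.get(t, ACRONYMS.get(t, t)) for t in tokens]
        result.append(tokens)
    return result

print(normalize("I wrote the page in HTML. It took 2 hours."))
```

The naive splitter already shows why the next subsections are needed: it would wrongly break after the period in an abbreviation such as "Dr.", and the number table cannot decide between readings like "seventeen seventy-three" and "one thousand seven hundred seventy-three".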

2.1.1 Sentence Tokenization

The first task in text normalization is sentence tokenization. This step presents some difficulties because sentence boundaries are not always indicated by periods and can sometimes be indicated by other punctuation marks, such as colons. To determine sentence boundaries, the input text is divided into tokens separated by whitespace; any token containing one of the characters '!', '.', or '?' is then selected, and a machine learning classifier can be used to determine whether each of these characters inside these tokens indicates an end of sentence or not.
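The candidate-selection-plus-classifier scheme above can be sketched with a hand-written decision rule standing in for the trained classifier. The abbreviation list and the "next word is capitalized" feature are illustrative assumptions; a real system would learn such features from labeled data.

```python
# Minimal stand-in for the sentence-boundary classifier: tokens containing
# '.', '!' or '?' are candidates; a simple decision rule (in place of a
# trained model) labels each candidate as end-of-sentence or not.

ABBREV = {"Dr.", "Mr.", "St.", "etc."}  # illustrative abbreviation list

def is_sentence_end(token, next_token):
    if not any(c in token for c in ".!?"):
        return False            # not a candidate at all
    if token in ABBREV:
        return False            # known abbreviation, not a boundary
    # Heuristic feature: a boundary is usually followed by a capitalized word.
    return next_token is None or next_token[:1].isupper()

tokens = "Dr. Smith has a dog. It barks.".split()
boundaries = [i for i, tok in enumerate(tokens)
              if is_sentence_end(tok, tokens[i + 1] if i + 1 < len(tokens) else None)]
print(boundaries)  # [4, 6]
```

Note that the rule correctly skips "Dr." but would still fail on harder cases (e.g. an abbreviation that genuinely ends a sentence), which is why a trained classifier with richer features is preferred.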

2.1.2 Non-Standard Words

The second task in text normalization is normalizing non-standard words such as numbers, abbreviations or acronyms. These tokens need to be converted to a sequence of natural words so that a synthesizer can pronounce them correctly. The difficulty with non-standard words is that they are often ambiguous. For example, a number like 1773 can be spoken in a variety of ways, and the correct way is determined from the context. The previous number is read as seventeen seventy-three if it is a part of a date.

The final task in text normalization is homograph disambiguation. Homographs are words that have the same spelling but differ in their pronunciation. For example, the two forms of the word use in the sentence 'It's no use to ask to use the telephone.' have different pronunciations. The correct pronunciation of each of these forms can easily be determined if the part-of-speech is known: the first form of the word use is a noun whereas the second one is a verb. Indeed, Liberman and Church (1992) showed that the part-of-speech can disambiguate many of the most frequent homographs in 44 million words [4]. For situations in which homographs cannot be resolved by the part-of-speech, a word sense disambiguation algorithm can be used to resolve them.

Formant synthesis is based on the source-filter model of speech production. In this model, speech is generated by a basic sound source and then modified by the vocal tract. The sound source for vowels is a periodic signal with a fundamental frequency. For unvoiced consonants, a random noise generator is used. Voiced fricatives use both sources. To produce intelligible speech, three formants are needed.
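The source-filter model just described can be sketched as a periodic impulse train (the voiced source) passed through cascaded two-pole resonators tuned to the first three formants. The formant frequencies and bandwidths below are rough /a/-like values chosen for illustration only, and the resonator is a bare-bones difference equation rather than a full synthesizer.

```python
import math

def impulse_train(f0, fs, n_samples):
    """Periodic voiced source at fundamental frequency f0 (Hz)."""
    period = int(fs / f0)
    return [1.0 if n % period == 0 else 0.0 for n in range(n_samples)]

def resonator(signal, freq, bandwidth, fs):
    """Second-order all-pole resonance at `freq` Hz with given bandwidth."""
    r = math.exp(-math.pi * bandwidth / fs)
    theta = 2 * math.pi * freq / fs
    a1, a2 = 2 * r * math.cos(theta), -r * r
    y1 = y2 = 0.0
    out = []
    for x in signal:
        y = x + a1 * y1 + a2 * y2   # y[n] = x[n] + a1*y[n-1] + a2*y[n-2]
        out.append(y)
        y2, y1 = y1, y
    return out

fs = 16000
source = impulse_train(f0=120, fs=fs, n_samples=1600)   # 0.1 s of voicing
speech = source
for f, bw in [(730, 90), (1090, 110), (2440, 170)]:     # rough F1-F3 of /a/
    speech = resonator(speech, f, bw, fs)
print(len(speech))  # 1600
```

Replacing the impulse train with a noise generator gives the unvoiced source mentioned above; mixing both gives a crude voiced fricative, which is exactly the source-selection logic a formant synthesizer implements per segment.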
