



Chinese-English parallel text for foreign-literature translation (the document contains the English original and a Chinese translation)

Speech Recognition

1 Defining the Problem

Speech recognition is the process of converting an acoustic signal, captured by a microphone or a telephone, to a set of words. The recognized words can be the final results, as for applications such as command and control, data entry, and document preparation. They can also serve as the input to further linguistic processing in order to achieve speech understanding, a subject covered in a separate section.

Speech recognition systems can be characterized by many parameters, some of the more important of which are shown in the table below. An isolated-word speech
recognition system requires that the speaker pause briefly between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is much more difficult to recognize than speech read from a script. Some systems require speaker enrollment: a user must provide samples of his or her speech before using them. Other systems are said to be speaker-independent, in that no enrollment is necessary. Some of the other parameters depend on the specific task. Recognition is generally more difficult when vocabularies are large
or have many similar-sounding words. When speech is produced in a sequence of words, language models or artificial grammars are used to restrict the combination of words.

The simplest language model can be specified as a finite-state network, where the permissible words following each word are given explicitly. More general language models approximating natural language are specified in terms of a context-sensitive grammar.

One popular measure of the difficulty of the task, combining the vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (language modeling in general and perplexity in particular are discussed in a separate section). Finally, there are some external parameters that can affect speech recognition system performance, including the characteristics of the environmental noise and the type and placement of the microphone.

Parameters        Range
Speaking Mode     Isolated words to continuous speech
Speaking Style    Read speech to spontaneous speech
Enrollment        Speaker-dependent to speaker-independent
Vocabulary        Small (under 20 words) to large (over 20,000 words)
Language Model    Finite-state to context-sensitive
Perplexity        Small (under 10) to large (over 100)
SNR               High (over 30 dB) to low (under 10 dB)
Transducer        Voice-cancelling microphone to telephone

Table: Typical parameters used to characterize the capability of speech recognition systems

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units of which words are composed, are highly dependent on the context in which they appear. These phonetic variabilities are exemplified by the acoustic differences among realizations of the same phoneme. At word boundaries, contextual variations can be quite dramatic, making "gas shortage" sound like "gash shortage" in American English, and "devo andare" sound like "devandare" in Italian.

Second, acoustic variabilities can result from changes in the environment as well as in the position and characteristics of the transducer. Third, within-speaker variabilities can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape can contribute to across-speaker variabilities.

The figure shows the major components of a typical speech recognition system. The digitized speech signal is first transformed into a set of useful measurements or features at a fixed rate, typically once every 10-20 msec (signal representation is covered in a separate section; see section 11.3 for digital signal processing). These measurements are then used to search for the most likely word candidate, making use of constraints imposed by the acoustic, lexical, and language models. Throughout this process, training data are used to determine the values of the model parameters.

Figure: Components of a typical speech recognition system.

Speech recognition systems attempt to model the sources of variability described above in several ways. At the level of signal representation, researchers have developed representations that emphasize perceptually important speaker-independent features of the signal, and de-emphasize speaker-dependent characteristics. At the acoustic phonetic level, speaker variability is typically modeled using statistical techniques applied to large amounts of data. Speaker adaptation algorithms have also been developed that adapt speaker-independent acoustic models to those of the current speaker during system use (covered in a separate section). Effects of linguistic context at the acoustic phonetic level are typically handled by training separate models for phonemes in different contexts; this is called context-dependent acoustic modeling.

Word-level variability can be handled by allowing alternate pronunciations of words in representations known as pronunciation networks. Common alternate pronunciations of words, as well as effects of dialect and accent, are handled by allowing search algorithms to find alternate paths of phonemes through these networks. Statistical language models, based on estimates of the frequency of occurrence of word sequences, are often used to guide the search through the most probable sequence of words.

The dominant recognition paradigm in the past fifteen years is known as hidden Markov models (HMM). An HMM is a doubly stochastic model, in which the generation of the underlying phoneme string and the frame-by-frame, surface acoustic realizations are both represented probabilistically as Markov processes, as discussed in section 11.2, among others. Neural networks have also been used to estimate the frame-based scores; these scores are then integrated into HMM-based system architectures, in what has come to be known as hybrid systems, as described
in section 11.5.

An interesting feature of frame-based HMM systems is that speech segments are identified during the search process, rather than explicitly. An alternate approach is to first identify speech segments, then classify the segments and use the segment scores to recognize words. This approach has produced competitive recognition performance in several tasks.

2 State of the Art

Comments about the state of the art need to be made in the context of specific applications which reflect the constraints on the task. Moreover, different technologies are sometimes appropriate for different tasks. For example, when the vocabulary is small, the entire word can be modeled as a single unit. Such an approach is not practical for large vocabularies, where word models must be built up from subword units.

Performance of speech recognition systems is typically described in terms of word error rate E, defined as:

$$E = \frac{S + I + D}{N} \times 100\%$$

where N is the total number of words in the test set, and S, I, and D are the total number of substitutions, insertions, and deletions, respectively.

The past decade has witnessed significant progress in speech recognition technology. Word error rates continue to drop by a factor of 2 every two years. Substantial progress has been made in the basic technology, leading to the lowering of barriers to speaker independence, continuous speech, and large vocabularies. There are several factors that have contributed to this rapid progress. First, there is the coming of age of the HMM. The HMM is powerful in that, with the availability of training data, the parameters of the model can be trained automatically to give optimal performance.

Second, much effort has gone into the development of large speech corpora for system development, training, and testing. Some of these corpora are designed for acoustic phonetic research, while others are highly task specific. Nowadays, it is not uncommon to have tens of thousands of sentences available for system training and testing. These corpora permit researchers to quantify the acoustic cues important for phonetic contrasts and to determine parameters of the recognizers in a statistically meaningful way. While many of these corpora (e.g., TIMIT, RM, ATIS, and WSJ; see section 12.3) were originally collected under the sponsorship of the U.S. Defense Advanced Research Projects Agency (ARPA) to spur human language technology development among its contractors, they have nevertheless gained world-wide acceptance (e.g., in Canada, France, Germany, Japan, and the U.K.) as standards on which to evaluate speech recognition.

Third, progress has been brought about by the establishment of standards for performance evaluation. Only a decade ago, researchers trained and tested
their systems using locally collected data, and had not been very careful in delineating training and testing sets. As a result, it was very difficult to compare performance across systems, and a system's performance typically degraded when it was presented with previously unseen data. The recent availability of a large body of data in the public domain, coupled with the specification of evaluation standards, has resulted in uniform documentation of test results, thus contributing to greater reliability in monitoring progress (corpus development activities and evaluation methodologies are summarized in chapters 12 and 13, respectively).

Finally, advances in computer technology have also indirectly influenced our progress. The availability of fast computers with inexpensive mass storage capabilities has enabled researchers to run many large-scale experiments in a short amount of time. This means that the elapsed time between an idea and its implementation and evaluation is greatly reduced. In fact, speech recognition systems with reasonable performance can now run in real time using high-end workstations without additional hardware, a feat unimaginable only a few years ago.

One of the most popular, and potentially most useful, tasks with low perplexity (PP = 11) is the recognition of digits. For American English, speaker-independent recognition of digit strings spoken continuously and restricted to telephone bandwidth can achieve an error rate of 0.3% when the string length is known.

One of the best known moderate-perplexity tasks is the 1,000-word so-called Resource Management (RM) task, in which inquiries can be made concerning various naval vessels in the Pacific Ocean. The best speaker-independent performance on the RM task is a word error rate of less than 4%, using a word-pair language model that constrains the
possible words following a given word (PP = 60). More recently, researchers have begun to address the issue of recognizing spontaneously generated speech. For example, in the Air Travel Information Service (ATIS) domain, word error rates of less than 3% have been reported for a vocabulary of nearly 2,000 words and a bigram language model with a perplexity of around 15.

High-perplexity tasks with a vocabulary of thousands of words are intended primarily for the dictation application. After working on isolated-word, speaker-dependent systems for many years, the community has since 1992 moved towards very-large-vocabulary (20,000 words and more), high-perplexity (PP 200), speaker-independent, continuous speech recognition. The best system in 1994 achieved an error rate of 7.2% on read sentences drawn from North American business news.

With the steady improvements in speech recognition performance, systems
are now being deployed within telephone and cellular networks in many countries. Within the next few years, speech recognition will be pervasive in telephone networks around the world. There are tremendous forces driving the development of the technology; in many countries, touch-tone penetration is low,
and voice is the only option for controlling automated services. In voice dialing, for example, users can dial 10-20 telephone numbers by voice (e.g., "call home") after having enrolled their voices by saying the words associated with telephone numbers. AT&T, on the other hand, has installed a call routing system using speaker-independent word-spotting technology that can detect a few key phrases (e.g., "person to person", "calling card") in sentences such as: "I want to charge it to my calling card."

At present, several very large vocabulary dictation systems are available for document generation. These systems generally require speakers to pause between words. Their performance can be further enhanced if one can apply constraints of the specific domain, such as dictating medical reports.

Even though much progress is being made, machines are a long way from recognizing conversational speech. Word recognition rates on telephone conversations in the Switchboard corpus are around 50%. It will be many years before unlimited-vocabulary, speaker-independent continuous dictation capability is realized.

3 Future Directions

In 1992, the U.S. National Science Foundation sponsored a workshop to identify the key research
challenges in the area of human language technology, and the infrastructure needed to support the work. The workshop summarized the key research challenges and identified the following areas for speech recognition research:

Robustness: In a robust system, performance degrades gracefully (rather than catastrophically) as conditions become more different from those under which it was trained. Differences in channel characteristics and acoustic environment should receive particular attention.

Portability: Portability refers to the goal of rapidly designing, developing, and deploying systems for new applications. At present, systems tend to suffer significant degradation when moved to a new task. In order to return to peak performance, they must be trained on examples specific to the new task, which is time consuming and expensive.

Adaptation: How can systems continuously adapt to changing conditions (new speakers, microphone, task, etc.) and improve through use? Such adaptation can occur at many levels in systems: subword models, word pronunciations, language models, etc.

Language Modeling: Current systems use statistical language models to help reduce the search space and resolve acoustic ambiguity. As vocabulary size grows and other constraints are relaxed to create more habitable systems, it will be increasingly important to get as much constraint as possible from language models, perhaps incorporating syntactic and semantic constraints that cannot be captured by purely statistical models.

Confidence Measures: Most speech recognition systems assign scores to hypotheses for the purpose of rank ordering them. These scores do not provide a good indication of whether a hypothesis is correct or not, just that it is better than the other hypotheses. As we move to tasks that require actions, we need better methods to evaluate the absolute correctness of hypotheses.

Out-of-Vocabulary Words: Systems are designed for use with a particular set of words, but system users may not know exactly which words are in the system vocabulary. This leads to a certain percentage of out-of-vocabulary words in natural conditions. Systems must have some method of detecting such out-of-vocabulary words, or they will end up mapping a word from the vocabulary onto the unknown word, causing an error.

Spontaneous Speech: Systems that are deployed for real use must deal with a variety of spontaneous speech phenomena, such as filled pauses, false starts, hesitations, ungrammatical constructions, and other common behaviors not found in read speech. Development on the ATIS task has resulted in progress in this area, but much work remains to be done.

Prosody: Prosody refers to acoustic structure that extends over several
segments or words. Stress, intonation, and rhythm convey important information for word recognition and the user's intentions (e.g., sarcasm, anger). Current systems do not capture prosodic structure. How to integrate prosodic information into the recognition architecture is a critical question that
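has not yet been answered.

Modeling Dynamics: Systems assume a sequence of input frames which are treated as if they were independent. But it is known that perceptual cues for words and phonemes require the integration of features that reflect the movements of the articulators, which are dynamic in nature. How to model dynamics and incorporate this information into recognition systems is an unsolved problem.

Speech Recognition (from the Chinese translation)

1 Defining the Problem

Speech recognition is the process of converting an audio signal, captured by a telephone or a microphone, into a sequence of words. The recognized words can serve as the final result, for applications such as command and control, data entry, and document preparation. They can also serve as input to further linguistic processing in order to achieve speech understanding, a subject covered in a separate section.

Speech recognition systems can be characterized by many parameters, the more important of which are shown in the table. An isolated-word speech recognition system requires a brief pause between words, whereas a continuous speech recognition system does not. Spontaneous, or extemporaneously generated, speech contains disfluencies and is harder to recognize than speech read from a script. Some systems require speaker enrollment, meaning that a user must provide speech samples before using the system, whereas other systems are said to be speaker-independent because no enrollment is necessary. Some parameters depend on the specific task. Recognition is generally more difficult when the vocabulary is large or contains many similar-sounding words. When speech is produced as a sequence of words, language models or artificial grammars restrict the combination of words.

The simplest language model can be specified as a finite-state network, in which the permissible words following each word are given explicitly. More general language models that approximate natural language are specified in terms of a context-sensitive grammar.

One popular measure of task difficulty, combining vocabulary size and the language model, is perplexity, loosely defined as the geometric mean of the number of words that can follow a word after the language model has been applied (language modeling in general and perplexity in particular are discussed in a separate section). Finally, some external parameters can affect the performance of a speech recognition system, including the characteristics of environmental noise and the type and placement of the microphone.

Parameters        Range
Speaking Mode     Isolated words to continuous speech
Speaking Style    Read speech to spontaneous speech
Enrollment        Speaker-dependent to speaker-independent
Vocabulary        Small (under 20 words) to large (over 20,000 words)
Language Model    Finite-state to context-sensitive
Perplexity        Small (under 10) to large (over 100)
SNR               High (over 30 dB) to low (under 10 dB)
Transducer        Voice-cancelling microphone to telephone

Table: Typical parameters used to characterize the capability of speech recognition systems

Speech recognition is a difficult problem, largely because of the many sources of variability associated with the signal. First, the acoustic realizations of phonemes, the smallest sound units that make up words, depend heavily on the context in which they appear. This phonetic variability is exemplified by the acoustic differences among realizations of a phoneme. At word boundaries, contextual variation can be quite dramatic, making "gas shortage" sound like "gash shortage" in American English, and "devo andare" sound like "devandare" in Italian. Second, acoustic variability can result from changes in the environment and in the position and characteristics of the transducer. Third, within-speaker variability can result from changes in the speaker's physical and emotional state, speaking rate, or voice quality. Finally, differences in sociolinguistic background, dialect, and vocal tract size and shape contribute to across-speaker variability. The figure shows the major components of a speech recognition system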
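The word error rate E from section 2 counts substitutions, insertions, and deletions against the N words of the test set. As a minimal sketch, assuming whitespace-tokenized transcripts and taking S + I + D to be the minimum word-level edit distance between reference and hypothesis (a common scoring convention, not something the text specifies), it could be computed as follows. The function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return E = 100 * (S + I + D) / N for whitespace-tokenized transcripts."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the minimum number of edits that turn the reference
    # words seen so far into the first j hypothesis words.
    prev = list(range(len(hyp) + 1))  # empty reference: j insertions
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return 100.0 * prev[-1] / len(ref)


# One substitution (home -> phone) and one deletion (please) in 4 words.
print(word_error_rate("call home now please", "call phone now"))  # prints 50.0
```

Because insertions are counted against the reference length, E can exceed 100% when the hypothesis is much longer than the reference.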
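Section 1 describes perplexity loosely as the geometric mean of the number of words that can follow a word once the language model is applied. A minimal sketch of that idea, assuming a bigram language model stored as nested dictionaries of probabilities (a hypothetical layout chosen for illustration), computes the geometric mean of the inverse word probabilities over a test sequence. When every word has k equally likely successors, the result is k, which matches the branching-factor reading in the text.

```python
import math


def perplexity(bigram_probs, words):
    """Geometric mean of 1 / P(word | previous word) over a word sequence."""
    pairs = list(zip(words, words[1:]))
    if not pairs:
        raise ValueError("need at least two words")
    # Sum log-probabilities instead of multiplying to avoid underflow.
    log_sum = sum(math.log(bigram_probs[prev][cur]) for prev, cur in pairs)
    return math.exp(-log_sum / len(pairs))


# Toy model (assumption): four words, each followed by any word with
# probability 0.25, i.e. a uniform branching factor of 4.
vocab = ["a", "b", "c", "d"]
uniform = {prev: {cur: 0.25 for cur in vocab} for prev in vocab}
pp = perplexity(uniform, list("abcdabca"))
print(round(pp, 6))  # close to 4, the branching factor
```

A sharper model, one that concentrates probability on fewer successors, would yield a lower perplexity on the same vocabulary.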
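Section 1 describes the HMM as a doubly stochastic model and the recognizer as searching for the most likely candidate given frame-by-frame measurements. The text does not name a search algorithm; one standard choice for finding the most likely hidden state sequence in an HMM is the Viterbi algorithm, sketched below in the log domain. The two states, the two observation symbols, and all probabilities are invented for illustration only.

```python
import math


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state path and its log-probability."""
    # best[s]: log-probability of the best path that ends in state s.
    best = {s: log_start[s] + log_emit[s][observations[0]] for s in states}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximizes the path score into s.
            prev = max(states, key=lambda p: best[p] + log_trans[p][s])
            new_best[s] = best[prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[s])
    path = [last]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path, best[last]


def logs(table):
    """Convert a nested dict of probabilities to log-probabilities."""
    return {k: {j: math.log(p) for j, p in row.items()} for k, row in table.items()}


# Toy model (assumption): silence vs. speech, observed as low/high frame energy.
states = ["sil", "speech"]
log_start = {"sil": math.log(0.8), "speech": math.log(0.2)}
log_trans = logs({"sil": {"sil": 0.7, "speech": 0.3},
                  "speech": {"sil": 0.2, "speech": 0.8}})
log_emit = logs({"sil": {"low": 0.9, "high": 0.1},
                 "speech": {"low": 0.2, "high": 0.8}})

path, score = viterbi(["low", "high", "high", "low"], states,
                      log_start, log_trans, log_emit)
print(path)  # prints ['sil', 'speech', 'speech', 'sil']
```

Here the segment boundaries (silence, speech, silence) fall out of the search itself, which mirrors the remark that frame-based HMM systems identify speech segments during the search rather than explicitly.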