Articulatory synthesis

Articulatory speech synthesis: synthetic speech and vocal tract model

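The German sentence "Lea und Doreen mögen Bananen" ("Lea and Doreen like bananas"), reproduced from the fundamental frequency and segment durations of a naturally spoken sentence using a consonant-vowel coarticulation model.^[1]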

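Articulatory synthesis, also called articulatory speech synthesis (Japanese: 調音合成 or 調音音声合成), refers to computational techniques for synthesizing speech based on models of the human vocal tract and the articulation processes that occur there. The shape of the vocal tract can be controlled in a number of ways, which usually involve modifying the position of the speech articulators such as the tongue, jaw, and lips. Speech is generated by digitally simulating the flow of air through the representation of the vocal tract.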
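As a rough illustration of what a "representation of the vocal tract" can look like, the following minimal sketch treats the tract as a chain of short cylindrical sections described by their cross-sectional areas (an area function). It is a textbook-style simplification, not the method of any synthesizer named in this article. The function names, the example area values, and the assumed speed of sound (350 m/s) are all hypothetical choices made for the example.

```python
# Illustrative sketch only: names and constants are assumptions made for
# this example, not taken from any synthesizer mentioned in the article.

SPEED_OF_SOUND_M_S = 350.0  # assumed value for warm, humid air


def reflection_coefficients(areas):
    """Pressure-wave reflection coefficient at each junction between
    adjacent tube sections: r = (A_i - A_next) / (A_i + A_next)."""
    return [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]


def uniform_tube_resonances(length_m, count=3, c=SPEED_OF_SOUND_M_S):
    """Resonances of a lossless uniform tube that is closed at the glottis
    and open at the lips: F_n = (2n - 1) * c / (4 * L)."""
    return [(2 * n - 1) * c / (4.0 * length_m) for n in range(1, count + 1)]


if __name__ == "__main__":
    neutral = [3.0] * 8  # hypothetical area function, cm^2
    constricted = [3.0, 3.0, 1.0, 1.0, 3.0, 3.0, 3.0, 3.0]
    print(reflection_coefficients(neutral))      # no reflections in a uniform tube
    print(reflection_coefficients(constricted))  # nonzero only around the constriction
    print(uniform_tube_resonances(0.175))        # resonances of a 17.5 cm tube, in Hz
```

Under these assumptions, a 17.5 cm uniform tube has resonances near 500, 1500, and 2500 Hz. Moving an articulator changes the area function, which changes the reflection coefficients and therefore the resonances of the simulated tract.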

Mechanical talking heads

See also: Speech synthesis § History

Attempts to build mechanical "talking heads" have a long history.^[2] Gerbert of Aurillac (d. 1003), Albertus Magnus (1198–1280), and Roger Bacon (1214–1294) are all said to have built speaking heads (Wheatstone 1837^[citation needed]). Historically confirmed speech synthesis, however, begins with Christian Kratzenstein (1723–1795)^[3] and Wolfgang von Kempelen (1734–1804); von Kempelen published an account of his research in 1791^[4] (see also Dudley & Tarnoczy (1950)).

Electrical vocal tracts

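The first electrical analog vocal tracts were static, like those of Dunn (1950), Stevens, Kasowski & Fant (1953), and Fant (1960). Rosen (1958) built a dynamic vocal tract (DAVO), which Dennis (1963) later attempted to control by computer. Dennis et al. (1964),^[citation needed] Hiki et al. (1968),^[citation needed] and Baxter & Strong (1969) have also described analog vocal tract hardware.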

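The first computer simulation was carried out by Kelly & Lochbaum (1962). Later digital computer simulations were carried out by, for example, Nakata & Mitsuoka (1965), Matsui (1968), and Mermelstein (1971).^[citation needed] Honda, Inoue & Ogawa (1968) carried out a simulation on an analog computer.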

Haskins and Maeda models

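The first software articulatory synthesizer to be used regularly for laboratory experiments was developed at Haskins Laboratories in the mid-1970s by Philip Rubin, Tom Baer, and Paul Mermelstein. This synthesizer, known as ASY (Articulatory Synthesis),^[5] was a computational model of speech production based on vocal tract models developed at Bell Laboratories in the 1960s and 1970s by Paul Mermelstein, Cecil Coker, and colleagues. Another well-known and frequently used model is that of Shinji Maeda (前田眞治), which uses a factor-based approach to control tongue shape.^[citation needed]^[clarification needed]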

Modern models

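Recent progress in speech production imaging, articulatory control modeling, and tongue biomechanics modeling has changed the way articulatory synthesis is performed.^[6] One example is the Haskins CASY model (Configurable Articulatory Synthesis),^[9] designed by Philip Rubin, Mark Tiede,^[7] and Louis Goldstein,^[8] which matches midsagittal vocal tract outlines to actual magnetic resonance imaging (MRI) data and uses MRI data to construct a 3D model of the vocal tract. A full 3D articulatory synthesis model has been described by Olov Engwall.^[10]^[11] A geometrically based^[citation needed] 3D articulatory speech synthesizer has been developed by Peter Birkholz (see VocalTractLab^[12]). The ArtiSynth project,^[13] led by Sidney Fels^[14] at the University of British Columbia, provides a 3D biomechanical modeling toolkit for the human vocal tract and upper airway. Biomechanical modeling of articulators such as the tongue has been pioneered by a number of scientists, including Reiner Wilhelms-Tricarico,^[15] Yohan Payan^[16] and Jean-Michel Gerard,^[17] and Jianwu Dang (党建武)^[18] and Kiyoshi Honda (本多清志).^[19]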

Commercial models

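One of the few commercial articulatory speech synthesis systems was a NeXT-based system developed and marketed by Trillium Sound Research, a spin-off company of the University of Calgary in Canada, where much of the original research was conducted. After the demise of the various incarnations of NeXT (founded by Steve Jobs in the late 1980s and merged with Apple Computer in 1997), the Trillium software was released under the GNU General Public License, and work continues as Gnuspeech.^[20] The system, first marketed in 1994, provides full articulatory-based text-to-speech conversion using a waveguide or transmission-line analog of the human oral and nasal tracts^[24] (the Tube Resonance Model, TRM^[25]), controlled by René Carré's^[21] "Distinctive Region Model" (DRM).^[22]^[23]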
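To give a sense of what a waveguide (tube) model computes, here is a minimal time-domain sketch in the general style of a Kelly-Lochbaum lattice. It is not the Tube Resonance Model or any Gnuspeech code: it has no nasal branch, no losses other than at the lips, and no articulatory control. The function name `simulate_tube`, the boundary reflection values, and the choice of one sample of delay per section in each direction are assumptions made for the example.

```python
# Illustrative sketch only: a lossless tube made of equal-length sections,
# each contributing one sample of delay per direction. Not the Gnuspeech
# TRM; all names and default values are assumptions for this example.

def simulate_tube(areas, source, r_glottis=1.0, r_lips=-0.5):
    """Propagate pressure waves through a chain of tube sections.

    areas     -- cross-sectional area of each section, glottis to lips
    source    -- excitation samples injected at the glottis end
    r_glottis -- reflection coefficient at the (nearly closed) glottis
    r_lips    -- reflection coefficient at the (open) lip end
    Returns the pressure samples transmitted at the lips.
    """
    n = len(areas)
    # Pressure reflection coefficient at each junction between sections.
    r = [(a - b) / (a + b) for a, b in zip(areas, areas[1:])]
    fwd = [0.0] * n  # forward wave arriving at the right end of each section
    bwd = [0.0] * n  # backward wave arriving at the left end of each section
    out = []
    for x in source:
        # Part of the wave reaching the lips is transmitted as output.
        out.append((1.0 + r_lips) * fwd[-1])
        new_fwd = [0.0] * n
        new_bwd = [0.0] * n
        # Glottis end: inject the source and reflect the returning wave.
        new_fwd[0] = x + r_glottis * bwd[0]
        # Interior junctions: scatter the incoming waves.
        for i in range(n - 1):
            new_fwd[i + 1] = (1.0 + r[i]) * fwd[i] - r[i] * bwd[i + 1]
            new_bwd[i] = r[i] * fwd[i] + (1.0 - r[i]) * bwd[i + 1]
        # Lip end: reflect part of the wave back into the tube.
        new_bwd[-1] = r_lips * fwd[-1]
        fwd, bwd = new_fwd, new_bwd
    return out


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 15
    print(simulate_tube([2.0] * 4, impulse))
```

With this delay convention each section corresponds to a physical length of c / fs, where c is the speed of sound and fs the sample rate, so the number of sections fixes the tract length. In a uniform tube an impulse reaches the lips after one pass through all the sections and then returns, with inverted sign, after each further round trip, which is what produces the quarter-wavelength resonances of an open-ended tube.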

Footnotes

^ Birkholz, Peter (2013). “Modeling Consonant-Vowel Coarticulation for Articulatory Speech Synthesis”. PLOS ONE 8 (4): e60603. Bibcode: 2013PLoSO...860603B. doi:10.1371/journal.pone.0060603. PMC 3628899. PMID 23613734. https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3628899/.
^ Rubin, Philip; Vatikiotis-Bateson, Eric (1998–2006), Talking Heads, Haskins Laboratories, http://www.haskins.yale.edu/featured/heads/heads.html
^ Paget 1930
^ Kempelen 1791
^ Articulatory Synthesis, Haskins Laboratories, http://www.haskins.yale.edu/facilities/asy.html
^ “15th ICPhS - Barcelona 2003 - Programme”, The 15th International Congress of Phonetic Sciences, Barcelona, 2003 (International Phonetic Association), archived from the original on 2007-05-22, https://web.archive.org/web/20070522223702/http://shylock.uab.es/icphs/plenariesandsymposia.htm
^ Mark Tiede, Haskins Laboratories, http://www.haskins.yale.edu/staff/tiede.html
^ Louis M. Goldstein, Haskins Laboratories, http://www.haskins.yale.edu/staff/goldstein.html
^ CASY, Haskins Laboratories, http://www.haskins.yale.edu/facilities/casy.html
^ Olov Engwall, Sweden: Royal Institute of Technology (KTH), http://www.speech.kth.se/~olov/
^ Engwall 2003
^ Peter Birkholz, VocalTractLab, http://www.vocaltractlab.de/, "An articulatory speech synthesizer and tool to visualize and explore the mechanism of speech production with regard to articulation, acoustics, and control."
^ ArtiSynth, Canada: University of British Columbia, http://www.magic.ubc.ca/artisynth/pmwiki.php, "A 3D Biomechanical Modeling Toolkit for Physical Simulation of Anatomical Structures"
^ Sidney Fels, Canada: University of British Columbia, http://www.ece.ubc.ca/~ssfels/
^ Reiner Wilhelms-Tricarico, Haskins Laboratories, http://www.haskins.yale.edu/staff/tricarico.html
^ Yohan Payan, TIMC-IMAG, http://www-timc.imag.fr/Yohan.Payan/
^ TIMC-IMAG, http://www-timc.imag.fr/gmcao/en-fiches-projets/modele-langue.htm
^ Intelligent Information Processing Laboratory (Dang Lab), JAIST, http://iipl.jaist.ac.jp/dang-lab/en/
^ Honda, Kiyoshi (Spring 2004), “生体イメージングによる音声生成機構の観測” [Observation of the speech production mechanism by in vivo imaging], ATR Journal (51) (in Japanese), http://results.atr.jp/atrj/ATRJ_51/12/12.html
^ Gnuspeech, GNU Project, Free Software Foundation (FSF), http://www.gnu.org/software/gnuspeech/
^ René Carré, Dynamique Du Langage, CNRS, http://www.ddl.ish-lyon.cnrs.fr/Annuaires/Index.asp?Langue=EN&Page=Rene%20CARRE
^ Mrayati, Carre & Guerin 1988
^ Mrayati, Carre & Guerin 1990
^ Hill, David; Manzara, Leonard; Schock, Craig (1995), “Real-time articulatory speech-synthesis-by-rules”, Proc. AVIOS Symposium: 27–44, http://pages.cpsc.ucalgary.ca/~hill/papers/avios95/body.htm
^ Manzara, Leonard, “The Tube Resonance Model Speech Synthesizer”, 49th Meeting of the Acoustical Society of America (ASA), http://www.gnu.org/software/gnuspeech/trm-write-up.pdf , poster

References

Baxter, Brent; Strong, William J. (1969), “WINDBAG—a vocal-tract analog speech synthesizer”, Journal of the Acoustical Society of America 45: 309(A), doi:10.1121/1.1971456, http://scitation.aip.org/getpdf/servlet/GetPDFServlet?filetype=pdf&id=JASMAN000045000001000309000001&idtype=cvips&doi=10.1121/1.1971456&prog=normal
Birkholz, P.; Jackel, D.; Kröger, B.J. (2007), “Simulation of losses due to turbulence in the time-varying vocal system”, IEEE Transactions on Audio, Speech, and Language Processing 15: 1218–1225, http://www.phonetik.phoniatrie.rwth-aachen.de/bkroeger/documents/Birkholz_2007_IEEE_ASLP.pdf
Birkholz P, Jackel D, Kröger BJ (2006), “Construction and control of a three-dimensional vocal tract model”, Proceedings of the International Conference on Acoustics, Speech, and Signal Processing (ICASSP 2006) (Toulouse, France): 873–876, http://www.vocaltractlab.de/publications/birkholz-2006-icassp.pdf
Coker, C. H. (1968), “Speech synthesis with a parametric articulatory model”, Proc. Speech. Symp., Kyoto, Japan , paper A-4.
Coker, C. H. (1976). “A model for articulatory dynamics and control”. Proceedings of the IEEE 64 (4): 452–460. doi:10.1109/PROC.1976.10154. http://www.philol.msu.ru/~otipl/new/main/courses/modelling/articulatornye_modeli.rar.
Coker, C. H.; Fujimura, O. (1966). “Model for the specification of the vocal tract area function”. Journal of the Acoustical Society of America 40: 1271. doi:10.1121/1.2143456. http://scitation.aip.org/getpdf/servlet/GetPDFServlet?filetype=pdf&id=JASMAN000040000005001271000004&idtype=cvips&doi=10.1121/1.2143456&prog=normal.
Dennis, Jack B. (1963), “Computer control of an analog vocal tract”, Journal of the Acoustical Society of America 35: 1115(A), http://oai.dtic.mil/oai/oai?verb=getRecord&metadataPrefix=html&identifier=AD0289271
Dudley, Homer; Tarnoczy, Thomas H. (1950). “The speaking machine of Wolfgang von Kempelen”. Journal of the Acoustical Society of America 22 (2): 151–66. doi:10.1121/1.1906583.
Dunn, Hugh K. (1950). “Calculation of vowel resonances, and an electrical vocal tract”. Journal of the Acoustical Society of America 22 (6): 740–53. doi:10.1121/1.1906681.
Engwall, O. (2003), “Combining MRI, EMA & EPG measurements in a three-dimensional tongue model”, Speech Communication 41: 303-329, doi:10.1016/S0167-6393(02)00132-2
Fant, C. Gunnar M (1960), Acoustic theory of speech production, The Hague: Mouton
Fant, Gunnar (1970), Acoustic theory of speech production: with calculations based on X-ray studies of Russian articulations, Mouton/Walter de Gruyter, ISBN 9789027916006, https://books.google.co.jp/books?id=qa-AUPdWg6sC&lpg=PP1
Gariel, M. (1879). “Machine parlante de M. Faber”. J. Physique Théorique et Appliquée 8: 274–5. doi:10.1051/jphystap:018790080027401.
Gerard, J.M.; Wilhelms-Tricarico, R.; Perrier, P.; Payan, Y. (2003). “A 3D dynamical biomechanical tongue model to study speech motor control”. Recent Research Developments in Biomechanics 1: 49–64.
Henke, W. L. (1966), “Dynamic Articulatory Model of Speech Production Using Computer Simulation”, Unpublished doctoral dissertation, MIT, Cambridge, MA., http://dspace.mit.edu/bitstream/handle/1721.1/22396/Henke_William_PhD_1966.pdf?sequence=1
Honda, Takashi; Inoue, Seiichi; Ogawa, Yasuo (1968), Kohasi, Y., ed., “A hybrid control system of a human vocal tract simulator”, Reports of the 6th International Congress on Acoustics (Tokyo, International Council of Scientific Unions.): 175–8
Kelly, John L.; Lochbaum, Carol (1962), “Speech synthesis”, Proceedings of the Speech Communications Seminar, paper F7 (Stockholm, Speech Transmission Laboratory, Royal Institute of Technology)

Kempelen, Wolfgang R. Von (1791), Mechanismus der menschlichen Sprache nebst der Beschreibung seiner sprechenden Maschine, Wien: J. B. Degen
Maeda, Shinji (1988), “Improved articulatory models”, Journal of the Acoustical Society of America 84 (Sup. 1): S146, doi:10.1121/1.2025845
Maeda, Shinji (1990), “Compensatory articulation during speech: evidence from the analysis and synthesis of vocal-tract shapes using an articulatory model”, in W. J. Hardcastle & A. Marchal, ed., Speech Production and Speech Modelling, Dordrecht: Kluwer Academic, pp. 131–149, http://ed268.univ-paris3.fr/lpp/publications/1990_Maeda_Compensatory_Articulation.pdf
Matsui, Eiichi (1968), Kohasi, Y., ed., “Computer-simulated vocal organs”, Reports of the 6th International Congress on Acoustics (Tokyo, International Council of Scientific Unions.): 151–4
Mermelstein, Paul. (1969), Walker, D. E., ed., “Computer simulation of articulatory activity in speech production”, Proceedings of the International Joint Conference on Artificial Intelligence, Washington, D.C., 1969 (New York: Gordon & Breach)
Mermelstein, P. (1973). “Articulatory model for the study of speech production”. Journal of the Acoustical Society of America 53 (4): 1070–1082. doi:10.1121/1.1913427. PMID 4697807. http://blog.aanugraha.web.id/wp-content/uploads/2008/10/articulatory-model-for-the-study-of-speech-production.pdf.
Nakata, Kazuo; Mitsuoka, T. (1965). “Phonemic transformation and control aspects of synthesis of connected speech”. J. Radio Res. Labs. 12: 171–86.
Mrayati, M.; Carre, R; Guerin, B. (1988), “Distinctive regions and modes: a new theory of speech production”, Speech Communication 7 (3): 257–286, October 1988, doi:10.1016/0167-6393(88)90073-8
Mrayati, M.; Carré, R; Guérin, B. (1990), “Distinctive regions and modes: articulatory-acoustic-phonetic aspects: A reply to Boë and Perrier's comments”, Speech Communication 9 (3): 231–238, June 1990, doi:10.1016/0167-6393(90)90059-I
Paget, R. (1930), Human Speech, New York: Harcourt
Rahim, M.; Goodyear, C.; Kleijn, W.; Schroeter, J.; Sondhi, M. (1993). “On the use of neural networks in articulatory speech synthesis”. Journal of the Acoustical Society of America 93 (2): 1109–1121. doi:10.1121/1.405559.
Rosen, George (1958). “Dynamic analog speech synthesizer”. Journal of the Acoustical Society of America 30 (3): 201–9. doi:10.1121/1.1909541.
Rubin, P. E.; Baer, T.; Mermelstein, P. (1981). “An articulatory synthesizer for perceptual research”. Journal of the Acoustical Society of America 70 (2): 321–328. doi:10.1121/1.386780.
Rubin, P.; Saltzman, E.; Goldstein, L.; McGowan, R.; Tiede, M.; Browman, C. (1996), “CASY and extensions to the task-dynamic model”, Proceedings of the 1st ESCA Tutorial and Research Workshop on Speech Producing Modeling - 4th Speech Production Seminar: 125-128, http://www.haskins.yale.edu/Reprints/HL1026.pdf . (other PDF)
Stevens, Kenneth N.; Kasowski, S.; Fant, C. Gunnar M. (1953). “An electrical analog of the vocal tract”. Journal of the Acoustical Society of America 25 (4): 734–42. doi:10.1121/1.1907169.

External links

From MRI and Acoustic Data to Articulatory Synthesis
Praat: doing phonetics by computer

“Smithsonian Speech Synthesis History Project (SSSHP) 1986-2002”. Archived from the original on 2013-10-03. Retrieved 2014-05-28.

Introduction to Articulatory Speech Synthesis
Simulated singing with the singing robot Pavarobotti or a description from the BBC on how the robot synthesized the singing.
