Windows Speech Recognition
A component of Microsoft Windows
[Image: Windows Speech Recognition tutorial included with Windows Vista.]
Included with: Windows Vista, Windows 7, Windows 8, Windows 8.1, Windows 10
Related components: Microsoft Speech API
Windows Speech Recognition is a speech recognition component developed by Microsoft that enables the use of voice commands to perform operations, such as the dictation of text, within applications and the operating system itself.
Microsoft has been involved in speech recognition and speech synthesis research for many years. In 1993, Microsoft hired Xuedong Huang from Carnegie Mellon University to lead its speech development efforts. The company's research ultimately led to the development of the Speech API, introduced in 1994. Speech recognition technology was used in some of Microsoft's products prior to Windows Speech Recognition. Versions of Microsoft Office, including Office XP and Office 2003, included support for speech recognition among Office applications and other applications such as Internet Explorer. Installation of Office would enable limited speech functionality in Windows NT 4.0, Windows 98 and Windows ME. The 2002 edition of Windows XP Tablet PC Edition also included support within the Tablet PC Input Panel feature, and the Microsoft Plus! for Windows XP expansion package enabled voice commands to be used in Windows Media Player. However, this support was limited to individual applications, and prior to Windows Vista, the Windows operating system did not include integrated support for speech recognition.
At the Windows Hardware Engineering Conference of 2002, Microsoft announced that Windows Vista, then known by its codename "Longhorn," would include advances in speech recognition technology and features such as support for microphone arrays. Bill Gates expanded upon this information during the Professional Developers Conference of 2003 where he stated that the company would "build speech capabilities into the system -- a big advance for that in 'Longhorn,' in both recognition and synthesis, real-time." Further reports said that the operating system would include integrated support for speech recognition, and certain pre-release builds throughout development of the operating system would include a speech engine with training features. In 2003, Microsoft clarified the extent of its intended integration for Windows Vista when the company stated within a pre-release software development kit that "the common speech scenarios, like speech-enabling menus and buttons, will be enabled system-wide" in the operating system.
During WinHEC 2004, Microsoft listed speech recognition as part of its "Longhorn" mobile PC strategy to improve productivity and listed microphone arrays as a hardware opportunity for the operating system. At WinHEC 2005, Microsoft released additional details pertaining to speech recognition in Windows Vista with a focus on accessibility, new mobility scenarios, and improvements to the speech user experience. Unlike the speech support included in Windows XP, which was integrated with the Tablet PC Input Panel and required switching between dictation and command modes, Windows Vista would separate the feature from the Tablet PC Input Panel by introducing a dedicated interface for speech input on the desktop and would also unify the previously separate dictation and command modes. In previous versions of Windows, speech recognition would not allow a user to speak a command after dictation or vice versa without first switching between these two modes. Microsoft also stated that speech recognition in Windows Vista would improve dictation accuracy, and support additional languages and microphone arrays. A demonstration of the feature at WinHEC 2005 focused on e-mail dictation with correction and editing commands, and a set of slides dedicated to microphone arrays was also released. Windows Vista Beta 1 would include an integrated speech recognition application. In an effort to persuade company employees to interact with Windows Speech Recognition during its development, Microsoft offered an opportunity to win a Premium model of its Xbox 360 video game console.
On July 27, 2006, before the operating system's release to manufacturing (RTM), a notable incident pertaining to speech recognition occurred during a demonstration by Microsoft at its annual Financial Analyst Meeting. Speech recognition initially failed to function correctly, which resulted in an unintended output of: "Dear aunt, let's set so double the killer delete select all" when several attempts to dictate led to consecutive output errors; the incident was a subject of significant derision among analysts and journalists present in the audience. Microsoft later revealed that the errors during the demonstration were due to an audio gain glitch that caused speech recognition to distort the dictated commands. The glitch was fixed prior to the operating system's release to manufacturing on November 8, 2006.
Reports surfaced in early 2007 that Windows Speech Recognition may be vulnerable to an attack that could allow attackers to take advantage of its capabilities to perform undesired operations on a targeted computer by playing audio through the targeted computer's speakers; it was the first vulnerability discovered after the operating system's general retail availability. While Microsoft stated that such an attack is theoretically possible, it would have to meet a number of prerequisites in order to be successful: the targeted system would be required to have the speech recognition feature previously activated and configured, speakers and microphone(s) connected to the targeted system would need to be turned on, and the exploit would require the software to interpret commands without a user noticing—an unlikely scenario as the affected system would perform user interface operations and produce audible feedback (as speakers would need to be active). Moreover, mitigating factors would include dictation clarity, and microphone feedback and placement. An exploit of this nature would also not be able to perform privileged operations for users or protected administrators without explicit user consent because of User Account Control.
Overview and features
Windows Speech Recognition allows a user to control a computer, including the operating system desktop user interface, through voice commands. Applications, including most of those bundled with Windows, can also be controlled through voice commands. By using speech recognition, users can dictate text within documents and e-mail messages, fill out forms, control the operating system user interface, perform keyboard shortcuts, and move the mouse cursor.
Speech recognition uses a speech profile to store information about a user's voice. Accuracy of speech recognition increases through use, which helps the feature adapt to a user's grammar, speech patterns, vocabulary, and word usage. Speech recognition also includes a tutorial to improve accuracy, and can optionally review a user's personal documents, including e-mail messages, to improve its command and dictation accuracy. In Windows 7 and later versions, an additional option is available that allows users to send speech information to Microsoft. Individual speech profiles can be created on a per-user basis, and backups of profiles can be performed via Windows Easy Transfer or through a downloadable utility developed by Microsoft. Profiles archived through this utility carry the WSRPROFILE filename extension. Windows Speech Recognition relies on the Microsoft Speech API, and third-party applications must support the Text Services Framework. Speech Recognition currently supports the following languages: Chinese (Traditional), Chinese (Simplified), English (U.S.), English (U.K.), French, German, Japanese, and Spanish.
The interface for Windows Speech Recognition primarily consists of a status area for instructions, for information about commands (e.g., if a command could not be heard by the speech recognizer), and also for information related to the state of the speech recognizer; a voice meter is also provided to display visual feedback to the user about voice volume levels. The status area represents the current state of Windows Speech Recognition in a total of three modes, listed below with their respective meanings:
- Listening: The speech recognizer is active and waiting for user input
- Sleeping: The speech recognizer will not listen for or respond to commands other than "Start listening"
- Off: The speech recognizer will not listen or respond to any commands; this mode can be enabled by speaking "Stop listening"
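The three modes can be read as a small state machine. The following Python sketch is only an illustration of the behavior listed above, not Microsoft's implementation; the `Mode` and `Recognizer` names are hypothetical, and the assumption that every phrase is acted on while listening is a simplification.

```python
from enum import Enum


class Mode(Enum):
    LISTENING = "Listening"
    SLEEPING = "Sleeping"
    OFF = "Off"


class Recognizer:
    """Toy model of the three recognizer modes (hypothetical class)."""

    def __init__(self, mode=Mode.LISTENING):
        self.mode = mode

    def hear(self, phrase):
        """Return True if the phrase was acted on, False if it was ignored."""
        phrase = phrase.strip().lower()
        if self.mode is Mode.OFF:
            # Off: no spoken command is handled at all.
            return False
        if self.mode is Mode.SLEEPING:
            # Sleeping: only "Start listening" is handled.
            if phrase == "start listening":
                self.mode = Mode.LISTENING
                return True
            return False
        # Listening: assume the phrase is handled; "Stop listening"
        # switches to Off, as the list above describes.
        if phrase == "stop listening":
            self.mode = Mode.OFF
        return True
```

For instance, a `Recognizer(Mode.SLEEPING)` ignores "open notepad" but switches to `Mode.LISTENING` on "Start listening".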
A disambiguation interface referred to as the alternates panel displays a list of items interpreted by the recognizer as being relevant to a user's spoken word(s); if the word or phrase that the user desired to insert into an application is listed among results, the user can speak the corresponding number of the word that appears among the results and confirm this choice by speaking "OK" to insert it within the application.
The alternates panel will also appear when launching programs or speaking commands that may refer to more than one item (e.g., speaking "Start Internet Explorer" may list the web browser and an alternate version of the web browser with add-ons disabled). However, a Windows Registry entry, ExactMatchOverPartialMatch, can limit commands to programs or commands with exact names if there is more than one instance of that item included among results.
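The disambiguation behavior described above can be sketched in Python. This is a simplified model under stated assumptions: candidates are matched by case-insensitive substring, which may differ from how the recognizer actually ranks items, and the `resolve` function and its `exact_match_over_partial` parameter are hypothetical names that only mirror the registry entry mentioned in the text.

```python
def resolve(spoken, candidates, exact_match_over_partial=False):
    """Return (chosen, alternates).

    chosen is the single item to act on, or None when the user must pick;
    alternates is the numbered list an alternates panel would display.
    """
    key = spoken.strip().lower()
    # Assumption: an item is a candidate if the spoken text occurs in its name.
    partial = [c for c in candidates if key in c.lower()]
    if len(partial) <= 1:
        return (partial[0] if partial else None), []
    if exact_match_over_partial:
        exact = [c for c in partial if c.lower() == key]
        if len(exact) == 1:
            # Prefer the one exact name over the partial matches.
            return exact[0], []
    # Ambiguous: number the alternatives so the user can speak a number.
    return None, list(enumerate(partial, start=1))
```

With the candidates "Internet Explorer" and "Internet Explorer (No Add-ons)", the phrase "Internet Explorer" yields a numbered list by default and resolves directly to the first item when the exact-match preference is enabled.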
Listed below are common commands available for Windows Speech Recognition. Words in italics indicate a variable that can be substituted for a desired item (e.g., the word "direction" in the "scroll direction" command can be substituted with the word "down" to scroll down). A "start typing" command enables Windows Speech Recognition to interpret dictation commands as keyboard shortcuts.
- Dictation commands: "New line," "new paragraph," "tab," "literal word," "numeral number," "go to word," "go after word," "no space," "go to start of sentence," "go to end of sentence," "go to start of paragraph," "go to end of paragraph," "go to start of document," "go to end of document," "go to field name" (e.g., go to address, cc, or subject). Special characters, such as a comma, can be dictated simply by stating the name of the special character.
- Navigation commands:
  - Keyboard shortcuts: "Press keyboard key," "press ⇧ Shift plus a," "press capital b." The NATO phonetic alphabet is also supported. Keys that can be pressed without first giving the press command include: ← Backspace, Delete, End, ↵ Enter, Home, Page Down, Page Up, Tab ↹.
  - Mouse commands: "Click," "click that," "double-click," "double-click that," "mark," "mark that," "right-click," "right-click that," "mousegrid."
  - Window management commands: "Close (alternatively maximize, minimize, or restore) window," "close that," or "close application name," "switch applications," "switch to program name," "scroll direction," "scroll direction in number of pages," "show desktop," "show numbers."
- Speech recognition commands: "Start listening," "stop listening," "show speech options," "open speech dictionary," "move speech recognition," "minimize speech recognition." A list of applicable commands can be shown by speaking "What can I say?" This command is currently only available in English. Users can also query the recognizer about tasks in Windows by speaking "How can I task name," which opens the Help Pane that displays related information.
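Commands with a variable part, such as "scroll direction" or "switch to program name," can be thought of as templates with slots. The Python sketch below shows one way to match a recognized phrase against such templates. The angle-bracket slot notation, the `TEMPLATES` list, and both function names are assumptions made for illustration; they are not the grammar format that Windows Speech Recognition uses.

```python
import re

# Hypothetical templates; <...> marks a variable slot.
TEMPLATES = [
    "scroll <direction>",
    "switch to <program name>",
    "press <keyboard key>",
    "show numbers",
]


def compile_template(template):
    """Turn 'scroll <direction>' into a regex with one named group per slot."""
    parts = []
    for token in re.split(r"(<[^>]+>)", template):
        if token.startswith("<") and token.endswith(">"):
            name = token[1:-1].replace(" ", "_")
            parts.append(f"(?P<{name}>.+)")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts), re.IGNORECASE)


def match_command(phrase, templates=TEMPLATES):
    """Return (template, slot values) for the first matching template, else None."""
    for template in templates:
        m = compile_template(template).fullmatch(phrase.strip())
        if m:
            return template, m.groupdict()
    return None
```

Under these assumptions, `match_command("Scroll down")` matches the "scroll <direction>" template with the slot `direction` bound to "down".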
A mousegrid command enables users to control the mouse cursor by overlaying numbers across nine regions on the screen; these regions narrow as a user speaks—by number(s)—which region to focus on until they reach a desired interface element to interact with. Entire regions can be interacted with by speaking "click number of region," which moves the mouse cursor to the desired region and then clicks it. An individual item within a region, such as a computer icon, can also be selected by speaking "mark number of region" where the item appears. A user can then specify where to move the marked item by speaking "click number of region." These commands also work for multiple regions of the mousegrid.
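The narrowing behavior of the mousegrid can be illustrated with a little geometry. The Python sketch below assumes that each region is split into a 3×3 grid numbered 1 to 9 in row-major order from the top-left corner; the numbering order and all function names are assumptions for illustration, not details confirmed by the text.

```python
def subregion(region, number):
    """Return cell `number` (1-9) of a 3x3 split of `region`.

    region is (left, top, width, height) in pixels.
    """
    if not 1 <= number <= 9:
        raise ValueError("region number must be between 1 and 9")
    left, top, width, height = region
    row, col = divmod(number - 1, 3)
    cell_w, cell_h = width / 3, height / 3
    return (left + col * cell_w, top + row * cell_h, cell_w, cell_h)


def narrow(screen, numbers):
    """Apply a sequence of spoken region numbers, narrowing each time."""
    region = screen
    for n in numbers:
        region = subregion(region, n)
    return region


def center(region):
    """Point a click would target: the middle of the region."""
    left, top, width, height = region
    return (left + width / 2, top + height / 2)
```

On a 900×900 area, speaking "5" and then "1" narrows the grid to the 100×100 cell whose top-left corner is at (300, 300), and a click would land at its center, (350, 350).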
Applications and operating system user interface elements that do not present obvious commands can still be controlled by asking the system to overlay numbers on top of them through a show numbers command. Once active, speaking the overlaid number selects that item so a user can open it or perform other operations. The command was designed so that users could interact with items that are not readily identifiable.
Windows Speech Recognition enables dictation of text in the operating system and applications. For applications that do not automatically support dictation, an option to enable dictation everywhere is available. If a mistake in dictation occurs, a user can correct it by saying "correct word" or "correct that," and the alternates panel will appear with suggestions for correction; a suggestion can be selected by speaking its number in the list and then speaking "OK." If the desired word is not listed among the suggestions, a user can speak the desired word so that it might appear. Alternatively, a "spell it" command allows a user to speak the desired word letter by letter so that it will appear among the suggestions. Multiple words in a sentence may also be corrected at a time. As an example, if a user says "dictating" but speech recognition recognizes this word as "the thing," the user can say "correct the thing" to correct both words.
Windows Speech Recognition includes a personal dictionary that allows users to include or exclude certain words or expressions from being dictated. By default, this dictionary includes over 100,000 words in the English language. When a user adds a word beginning with a capital letter to the dictionary, the user can specify whether it should always be capitalized during dictation or whether capitalization depends on the context in which the word is spoken. Users may also record pronunciations for words added to the dictionary to increase recognition accuracy. Words written via a stylus on a tablet PC for the Windows handwriting recognition feature are also stored. Most of the information stored within a dictionary is included as part of a user's speech profile.
Windows Speech Recognition supports custom macros through a separate utility released by Microsoft that enables the use of commands that are further based on natural language processing. As an example of this functionality, an e-mail macro released by Microsoft enables a natural language command where a user can state "send e-mail to contact about subject," which opens Microsoft Outlook to compose a new message with the designated contact and subject automatically inserted within the application. Microsoft has also released sample macros for the speech dictionary, for Windows Media Player, for Microsoft PowerPoint, for speech synthesis, to switch between multiple microphones, to customize various aspects of audio device configuration such as volume levels, and for general natural language queries such as, "What is the weather forecast?" "What time is it?" and "What's the date?" Answers to these queries are spoken to the user via a speech synthesizer.
Users and developers can create their own custom macros that can be based on text transcription and substitution, program execution (with support for command-line arguments), keyboard shortcuts, emulation of existing voice commands, or a combination of these items. XML, JScript and VBScript are supported. Macros can be limited to individual applications if desired, and rules for macros can be defined programmatically. In order for a macro to be loaded, it must be stored within a Speech Macros folder within the current user's Documents directory. By default, all macros are digitally signed if a user certificate is available in order to ensure that created commands are not loaded or tampered with by third parties; if one is not available, an administrator can create a certificate for use. The macros utility also includes security levels to prohibit unsigned macros from being loaded, to prompt users to sign macros, or to load unsigned macros if the user chooses to allow this.
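The three security levels amount to a simple decision about whether a macro file is loaded. The Python sketch below models that decision only; the `SecurityLevel` names and the `load_decision` function are hypothetical and do not correspond to settings or APIs of the macros utility.

```python
from enum import Enum


class SecurityLevel(Enum):
    REQUIRE_SIGNED = 1   # prohibit unsigned macros from being loaded
    PROMPT_TO_SIGN = 2   # ask the user to sign unsigned macros
    ALLOW_UNSIGNED = 3   # load unsigned macros as they are


def load_decision(is_signed, level):
    """Return 'load', 'prompt', or 'reject' for a macro file."""
    if is_signed:
        return "load"
    if level is SecurityLevel.ALLOW_UNSIGNED:
        return "load"
    if level is SecurityLevel.PROMPT_TO_SIGN:
        return "prompt"
    return "reject"
```

In this model a signed macro always loads, while an unsigned one is rejected, queued for signing, or loaded depending on the configured level.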
As of 2017, Windows Speech Recognition uses Microsoft Speech Recognizer 8.0, which has not been changed since Windows Vista. Mark Hachman, Senior Editor of PCWorld, dictated an article without first training the application and found dictation (as opposed to giving defined voice commands) to be 93.6% accurate, which was not as good as other voice recognition software. Microsoft employees have said that accuracy reaches 99% when the application is properly trained. Hachman commented that speech recognition was a feature Microsoft did not like to talk about, and that few users knew that documents could be dictated within Windows.
- List of speech recognition software
- Microsoft Cortana
- Microsoft Narrator
- Microsoft Voice Command
- Technical features new to Windows Vista