Voice command
Design voice command as a visible, bounded command lifecycle with explicit invocation, permission handling, listening state, transcript review, command matching, confidence and alternatives, disambiguation, confirmation for risky actions, status feedback, cancellation, undo or retry, and equivalent non-voice paths.
Use when
Users need hands-free control, accessibility speech input, or rapid spoken command activation.
The command set is bounded enough to show phrases, targets, confidence, alternatives, and confirmation rules.
The product can provide fallback controls and safe recovery for recognition failure.
Avoid when
The task is high-risk and cannot tolerate recognition uncertainty without strong review or approval.
Users are likely to be in public, noisy, multilingual, privacy-sensitive, or microphone-restricted environments with no equivalent path.
The product cannot show what was heard, what will run, or how to cancel.
The spoken input is open-ended AI composition rather than a bounded command.
A visible button, command palette, shortcut, or text field would be simpler and more reliable.
Problem it prevents
Voice commands can make products accessible and hands-free, but speech recognition is uncertain, socially constrained, permission-gated, language-dependent, and easy to confuse with dictation. Invisible listening or unconfirmed side effects can therefore execute the wrong command or block users entirely.
Pattern anatomy
What a strong implementation has to make clear
User need
Users may be hands-busy, operating an assistive setup, using switch or speech recognition software, dictating text, controlling a smart device, or using a microphone because touch and keyboard input are difficult.
Required state
Voice unavailable or unsupported state with non-voice alternatives.
Recovery path
The microphone starts recording before consent or without active-state feedback.
Access contract
Keep visible labels and accessible names aligned for speech-input activation.
Quality bar
The difference between expert and weak execution
Strong implementation
Specific, visible, recoverable
A mobile reporting app shows "Press and say a command", records the transcript 'send report', matches it to "Send report", reads back "Incident 482", and requires "Confirm send" before uploading.
A dashboard voice overlay displays command chips such as "Open alerts", "Filter critical", and "Read summary", shows confidence and alternatives, and offers "Tap instead" when the microphone is unavailable.
A user says 'filter critical alerts', sees the recognized phrase and target list, picks the correct alternative before execution, and can undo the applied filter.
A user in a kitchen denies microphone permission and completes the same timer setup with large touch controls.
Weak implementation
Vague, hidden, hard to recover from
A microphone icon starts listening silently and executes the first matched word without a visible transcript.
The visible button says "Export data" but the voice command requires 'download CSV', so speech-input users cannot speak the label they see.
A user says 'send it later' while dictating a note and the app immediately sends a message.
A user with a speech difference gets repeated "No match" errors with no alternative phrase, tap path, or command list.
UI guidance
Render voice command as an explicit listening surface with microphone permission state, wake or push-to-talk trigger, listening indicator, timeout, recognized phrase, confidence, matched command, target object, alternatives, and cancel path.
Show the exact spoken phrases users can say, keep visible labels aligned with accessible names, and separate dictation text from executable commands so people can speak what they see without triggering hidden side effects.
UX guidance
Use voice command when users benefit from hands-free spoken control, accessibility voice input, or quick command activation and the product can manage recognition uncertainty, privacy, confirmation, recovery, and fallback paths.
Design the command lifecycle from permission request through listening, partial recognition, final transcript, disambiguation, confirmation, execution, feedback, undo, retry, and non-voice alternatives for noisy, private, unsupported, or denied states.
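The denied, unsupported, and offline branches of that lifecycle can be sketched as one small decision function. This is a minimal TypeScript sketch under stated assumptions, not a platform API: the names VoiceEnvironment, InputPath, and chooseInputPath are hypothetical, the reason strings are placeholder copy, and treating offline as a reason to fall back assumes the recognizer needs a network connection.

```typescript
// Hypothetical types for illustration; they do not mirror any platform API.
type MicPermission = "not-requested" | "granted" | "denied" | "revoked";

interface VoiceEnvironment {
  recognitionSupported: boolean; // the platform exposes a speech recognizer
  languageSupported: boolean;    // the user's language can be recognized
  online: boolean;               // assumption: recognition needs the network
  permission: MicPermission;
}

type InputPath =
  | { kind: "voice" }
  | { kind: "ask-permission" }
  | { kind: "fallback"; reason: string };

// Pick the input surface to show. Every non-voice outcome carries a reason so
// the UI can explain it next to the tap, keyboard, or text alternative.
function chooseInputPath(env: VoiceEnvironment): InputPath {
  if (!env.recognitionSupported) {
    return { kind: "fallback", reason: "Voice is not supported here" };
  }
  if (!env.languageSupported) {
    return { kind: "fallback", reason: "Voice is not available in this language" };
  }
  if (env.permission === "denied" || env.permission === "revoked") {
    return { kind: "fallback", reason: "Microphone access is off" };
  }
  if (!env.online) {
    return { kind: "fallback", reason: "Voice needs a connection" };
  }
  if (env.permission === "not-requested") {
    return { kind: "ask-permission" };
  }
  return { kind: "voice" };
}
```

The fallback variant is a real path rather than an error: the caller is expected to render the equivalent non-voice controls together with the reason text.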
Full agent/debug reference
Problem Context
Users may be hands-busy, operating an assistive setup, using switch or speech recognition software, dictating text, controlling a smart device, or using a microphone because touch and keyboard input are difficult.
The product may need to recognize fixed commands, visible control names, open-ended dictation, text-editing instructions, navigation requests, app actions, or physical-world device controls.
Microphone permission, offline state, acoustic noise, privacy, accent, language, speech difference, latency, wake-word failure, and confidence thresholds affect whether the command can be recognized safely.
Voice command surfaces often sit near text input, prompt boxes, command palettes, keyboard shortcuts, touch gestures, screen-reader commands, and platform voice access tools.
Selection Rules
Choose voice command when the interaction problem is spoken activation or control, not typed search, command browsing, key-chord acceleration, or ordinary dictation alone.
Use text input when speech is only an optional input method for entering a value and no command execution, matching, or confirmation is owned by the product.
Use prompt box when the spoken content becomes an AI request that users review and send as natural language rather than a bounded command with a known target.
Use command palette when users need searchable command discovery, ranking, and selection before execution instead of remembering a spoken phrase.
Use keyboard shortcut when expert users need a key chord accelerator with scope and conflict rules, not microphone permission and recognition feedback.
Use touch gesture when movement, pressure, or pointer contact is the primary input; voice command can be an equivalent path but should not hide the gesture's visual affordance.
Prefer visible command phrases that match visible labels and accessible names, then support aliases only as documented alternatives.
Require read-back, confirmation, undo, or approval before executing destructive, paid, public, permission-changing, data-export, account, or physical-world actions.
Provide tap, keyboard, switch, text, or command-palette alternatives for every required task because voice can fail or be inappropriate in many environments.
Separate dictation mode from command mode so spoken prose does not unexpectedly trigger global commands.
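Two of the rules above, keeping dictation apart from commands and gating risky actions behind confirmation, can be sketched as a single routing function. This is a minimal sketch, not a prescribed design: VoiceCommand, Routed, normalizePhrase, and routeUtterance are hypothetical names, exact phrase matching stands in for a real command grammar, and documented text-editing commands inside dictation are left out.

```typescript
type InputMode = "dictation" | "command";

interface VoiceCommand {
  label: string;           // visible label, also the primary spoken phrase
  aliases: string[];       // documented alternatives only
  risk: "safe" | "risky";  // risky: destructive, paid, public, physical-world
}

type Routed =
  | { kind: "insert-text"; text: string }
  | { kind: "run"; command: VoiceCommand }
  | { kind: "confirm"; command: VoiceCommand }
  | { kind: "no-match"; heard: string };

// Lowercase, trim, and collapse whitespace so "  Send  Report " matches "Send report".
const normalizePhrase = (s: string): string =>
  s.trim().toLowerCase().replace(/\s+/g, " ");

function routeUtterance(
  mode: InputMode,
  transcript: string,
  commands: VoiceCommand[],
): Routed {
  // In dictation mode speech is always text, even if it sounds like a command.
  if (mode === "dictation") {
    return { kind: "insert-text", text: transcript };
  }
  const heard = normalizePhrase(transcript);
  for (const command of commands) {
    const phrases = [command.label].concat(command.aliases);
    if (phrases.some((p) => normalizePhrase(p) === heard)) {
      // Risky commands never run directly; they go to a confirmation step.
      return command.risk === "risky"
        ? { kind: "confirm", command }
        : { kind: "run", command };
    }
  }
  return { kind: "no-match", heard: transcript };
}
```

Because the visible label doubles as the primary phrase, a user who says what they see is matched without needing a hidden vocabulary.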
Required States
Voice unavailable or unsupported state with non-voice alternatives.
Microphone permission not requested, denied, granted, and revoked states.
Wake phrase, push-to-talk, or explicit start-listening state.
Listening state with visible microphone activity, privacy boundary, and timeout.
Partial transcript and final recognized phrase state.
Low-confidence, no-match, and alternative command state.
Disambiguation state when a phrase maps to multiple commands or targets.
Command matched state with visible command name, target, scope, and consequence.
Confirmation state for risky or irreversible commands.
Cancel, stop listening, retry, edit transcript, and choose fallback states.
Executed command state with status feedback, focus or context preservation, and undo when available.
Dictation mode, text-editing command mode, and global command mode boundaries.
Noisy environment, offline, unsupported language, and screen-reader or platform voice-access coexistence states.
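The states listed above can be read as a finite state machine. The following is a simplified TypeScript sketch of one possible transition table, offered as an illustration rather than a complete model: the state and event names are hypothetical, partial transcripts and mode boundaries are not modelled, and sending every disambiguated choice through confirmation is a conservative assumption.

```typescript
type VoiceState =
  | "idle" | "requesting-permission" | "unavailable" | "listening" | "matching"
  | "disambiguating" | "confirming" | "executing" | "done" | "failed";

type VoiceEvent =
  | "invoke" | "granted" | "denied" | "final-transcript" | "timeout"
  | "matched-safe" | "matched-risky" | "ambiguous" | "no-match"
  | "choose" | "confirm" | "cancel" | "success" | "error" | "retry" | "undo";

// Allowed transitions only. Anything not listed here is ignored, so a stray
// "confirm" while listening can never jump straight to execution.
const transitions: Record<VoiceState, Partial<Record<VoiceEvent, VoiceState>>> = {
  "idle": { "invoke": "requesting-permission" },
  "requesting-permission": { "granted": "listening", "denied": "unavailable", "cancel": "idle" },
  "unavailable": { "retry": "requesting-permission" },
  "listening": { "final-transcript": "matching", "timeout": "failed", "cancel": "idle" },
  "matching": {
    "matched-safe": "executing",
    "matched-risky": "confirming",
    "ambiguous": "disambiguating",
    "no-match": "failed",
  },
  "disambiguating": { "choose": "confirming", "cancel": "idle" },
  "confirming": { "confirm": "executing", "cancel": "idle" },
  "executing": { "success": "done", "error": "failed" },
  "done": { "undo": "idle", "invoke": "requesting-permission" },
  "failed": { "retry": "listening", "cancel": "idle" },
};

// Return the next state, or the current state when the event is not allowed.
function nextState(state: VoiceState, event: VoiceEvent): VoiceState {
  const target = transitions[state][event];
  return target === undefined ? state : target;
}
```

Keeping the table explicit makes it easy to review which states offer cancel or retry, and to check that the only routes into "executing" are a safe match or an explicit confirmation.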
Interaction Contract
Voice capture starts only after explicit user invocation, platform voice-access command, or documented wake condition.
The interface shows when it is listening, what it heard, which command it matched, and what target will be affected before any consequential execution.
Visible labels, accessible names, and primary spoken command phrases stay aligned so users can speak what they see.
Dictated text remains text unless the user is in command mode or speaks a documented text-editing command in a focused editing context.
Low-confidence or ambiguous recognition never executes silently; it asks users to choose, retry, type, tap, or cancel.
Microphone denial, unsupported browser, offline recognition, language mismatch, or timeout keeps the task available through another input path.
Risky commands require confirmation that repeats the command and target; cancellation and undo use the same command semantics as visible controls.
Recognition results, transcript snippets, and command logs respect privacy expectations and do not persist sensitive spoken content unless the user is told.
Define voice states for permission, unsupported platform, start listening, partial transcript, final result, alternatives, no match, timeout, confirmation, execution, failure, retry, and stop listening.
Keep voice command state separate from dictation text, AI prompt drafts, search queries, keyboard shortcut handlers, and platform voice access events.
Expose a visible command list or help surface with exact phrases, examples, aliases, language support, microphone status, and fallback controls.
Use confidence thresholds and command grammars or bounded command matching where available; route ambiguous or low-confidence results to disambiguation.
Align visible labels and accessible names for controls that users may activate by speech recognition.
Confirm or require undo for destructive, paid, public, permission-changing, physical-world, or multi-object actions.
Test quiet, noisy, private, offline, permission-denied, unsupported-language, screen-reader, platform Voice Access or Voice Control, speech-difference, mobile, desktop, and keyboard fallback scenarios.
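The rule that low-confidence or ambiguous recognition never executes silently can be sketched as a routing function over recognizer candidates. This is a hedged sketch: Candidate, Decision, and decideOnRecognition are hypothetical names, the candidates are assumed to carry a confidence between 0 and 1, and the three threshold constants are illustrative values that would need tuning against real recordings. An "execute" decision here means "eligible to proceed"; risky commands would still pass through confirmation.

```typescript
interface Candidate {
  command: string;
  target: string;
  confidence: number; // assumed to be in the range 0..1
}

type Decision =
  | { kind: "execute"; candidate: Candidate }
  | { kind: "disambiguate"; options: Candidate[] }
  | { kind: "no-match" };

// Illustrative thresholds, not recommended values.
const ACCEPT = 0.85;   // minimum confidence to proceed without asking
const CONSIDER = 0.5;  // below this a candidate is not even offered
const MARGIN = 0.15;   // required lead over the runner-up

function decideOnRecognition(candidates: Candidate[]): Decision {
  // filter() returns a new array, so sorting does not mutate the input.
  const ranked = candidates
    .filter((c) => c.confidence >= CONSIDER)
    .sort((a, b) => b.confidence - a.confidence);
  const [best, runnerUp] = ranked;
  if (best === undefined) {
    return { kind: "no-match" };
  }
  const clearWinner =
    best.confidence >= ACCEPT &&
    (runnerUp === undefined || best.confidence - runnerUp.confidence >= MARGIN);
  // Anything short of a clear winner is shown to the user as a choice.
  return clearWinner
    ? { kind: "execute", candidate: best }
    : { kind: "disambiguate", options: ranked.slice(0, 3) };
}
```

A single mid-confidence candidate still yields "disambiguate" with one option, which gives the interface a place to offer retry, type, tap, or cancel alongside it.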
Common Generated-UI Mistakes
Listening without a visible active state or clear stop control.
Executing commands from partial recognition or low-confidence results.
Making voice the only route to a required workflow.
Using hidden command phrases that do not match visible labels or accessible names.
Mixing dictation and command mode so prose triggers app actions.
Saving transcripts of sensitive speech without notice or retention controls.
Treating a voice command like an AI prompt and letting open-ended language run deterministic side effects.
Skipping confirmation for destructive or physical-world commands.
Critique Questions
How does the user know the product is listening, what it heard, and which command it matched?
What happens when microphone permission is denied, recognition is unsupported, the room is noisy, or the language is unavailable?
Can users speak the visible label of a control, and does that label appear in the accessible name?
Which phrases are commands, which are dictation, and which require disambiguation or confirmation?
What prevents low-confidence or partial recognition from executing the wrong target?
What non-voice path completes the same task, and is it visible before voice fails?
What transcript or command history is stored, for how long, and can users avoid sensitive capture?
Accessibility
Keep visible labels and accessible names aligned for speech-input activation.
Provide non-voice equivalents for all required commands.
Show and announce listening, recognition, no-match, disambiguation, confirmation, execution, and failure states.
Do not require continuous speech, exact accent, fast timing, or a single phrase when alternatives can be offered.
Support users who rely on platform voice access, screen readers, switch devices, keyboard, touch, and assistive touch without conflicting command handlers.
Avoid audio-only feedback; provide text transcript, command match, visual status, and accessible live updates.
Protect privacy by making microphone capture explicit and avoiding unnecessary transcript retention.
Test with speech recognition users, keyboard users, screen readers, mobile voice access, and fallback-only scenarios.
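The label and accessible-name alignment asked for above can be checked mechanically. The sketch below is a simplified illustration, assuming the visible label and accessible name of each control have already been extracted as strings; ControlNames, labelInName, and findMisalignedControls are hypothetical names, and plain substring matching ignores word boundaries and localization.

```typescript
interface ControlNames {
  visibleLabel: string;    // text the user can see and will try to speak
  accessibleName: string;  // name exposed to assistive technology
}

// Lowercase and collapse whitespace so cosmetic differences do not count.
const normalizeName = (s: string): string =>
  s.toLowerCase().replace(/\s+/g, " ").trim();

// True when the accessible name contains the visible label text.
function labelInName(control: ControlNames): boolean {
  const label = normalizeName(control.visibleLabel);
  return label.length > 0 &&
    normalizeName(control.accessibleName).indexOf(label) !== -1;
}

// Controls that speech-input users may be unable to activate by speaking
// what they see. Controls with no visible text are flagged for review too.
function findMisalignedControls(controls: ControlNames[]): ControlNames[] {
  return controls.filter((c) => !labelInName(c));
}
```

Applied to the weak example earlier, a button labelled "Export data" whose accessible name is "Download CSV" would be reported, while "Export data as CSV" would pass.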
Keyboard Behavior
Tab reaches the voice trigger, command list, transcript review, alternatives, confirm, cancel, retry, and fallback controls.
Space or Enter starts and stops push-to-talk or activates a visible voice trigger without requiring pointer input.
Escape stops listening, closes command help, or cancels a pending confirmation without executing the command.
Arrow keys or Tab can choose among recognition alternatives and disambiguation targets.
A visible text fallback accepts the same bounded command or task input when microphone use is unavailable.
Keyboard shortcuts and text fields suppress global voice-command handlers where dictated text or editing owns the input state.
After execution, focus remains on the changed region, confirmation result, or original trigger according to the command's visible equivalent.
Screen-reader and platform voice-access commands do not conflict with custom voice command shortcuts.
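The Space, Enter, and Escape rules above can be sketched as a pure function that maps a key press and the current context to an action. This is a minimal sketch under assumptions: VoicePhase, KeyContext, KeyAction, and actionForKey are hypothetical names, the key strings follow the KeyboardEvent.key values "Enter", " ", and "Escape", and letting Escape work even while a text field owns input is a design choice made here, not a rule from the text. Tab order, arrow-key selection, and activating Confirm are left to native focus and button behavior.

```typescript
type VoicePhase = "idle" | "listening" | "help-open" | "confirming";

type KeyAction =
  | "start-listening"
  | "stop-listening"
  | "close-help"
  | "cancel-confirmation"
  | "none";

interface KeyContext {
  phase: VoicePhase;
  focusOnVoiceTrigger: boolean; // keyboard focus is on the visible voice trigger
  textFieldOwnsInput: boolean;  // a text field or editor currently owns typing
}

function actionForKey(key: string, ctx: KeyContext): KeyAction {
  // Escape always backs out and never executes a command.
  if (key === "Escape") {
    if (ctx.phase === "listening") return "stop-listening";
    if (ctx.phase === "help-open") return "close-help";
    if (ctx.phase === "confirming") return "cancel-confirmation";
    return "none";
  }
  // Space and Enter belong to the text field while the user is typing.
  if (ctx.textFieldOwnsInput) return "none";
  // Space or Enter toggles listening only from the focused voice trigger.
  if ((key === "Enter" || key === " ") && ctx.focusOnVoiceTrigger) {
    if (ctx.phase === "idle") return "start-listening";
    if (ctx.phase === "listening") return "stop-listening";
  }
  return "none";
}
```

Keeping this logic free of DOM access makes it straightforward to unit test the keyboard contract separately from the listening surface.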