Building a Flutter AI Voice Assistant
A deep dive into building an AI-powered voice assistant with Flutter, speech recognition, and natural language processing
Voice assistants are everywhere — Siri, Google Assistant, Alexa. But what if you want to build your own, tailored to a specific use case, running on mobile with a beautiful cross-platform UI? That is exactly what I did — an AI-powered voice assistant built entirely in Flutter.
This post covers the full journey: architecture decisions, speech processing, NLP integration, and the UX patterns that make a voice assistant feel natural on mobile.
Why Flutter for a Voice Assistant
Building a voice assistant means dealing with real-time audio streams, animations that respond to voice state changes, and a UI that needs to feel fluid across iOS and Android. Flutter excels at all of these:
- Single codebase for both platforms with native performance
- Stream-based architecture that maps naturally to audio processing pipelines
- Rich animation framework for visual feedback during listening and processing states
- Platform channels for accessing native speech APIs when needed
The alternative would have been building separate native apps, which would have doubled the development time without meaningful performance gains for this type of application.
Architecture Overview
The assistant follows a straightforward pipeline architecture:
┌─────────────┐ ┌──────────────┐ ┌──────────────────┐ ┌─────────────┐
│ Speech-to- │───▶│ NLP Engine │───▶│ Response │───▶│ Text-to- │
│ Text │ │ (Intent + │ │ Generation │ │ Speech │
│ │ │ Entities) │ │ │ │ │
└─────────────┘ └──────────────┘ └──────────────────┘ └─────────────┘
Each stage runs asynchronously, and the UI reacts to state changes at each transition point using Flutter’s StreamBuilder and state management.
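Before looking at each stage, here is a rough sketch of how they chain together. The class names match the components covered in the sections below; the glue class itself is illustrative rather than lifted from the app.
/// Glue code that ties the first three stages together. SpeechService,
/// IntentDetector, EntityExtractor, and ResponseGenerator are shown in the
/// sections that follow; text-to-speech playback of the returned string is
/// handled by a TTS plugin and is not covered in this post.
class AssistantPipeline {
  AssistantPipeline(this._speech, this._intents, this._entities, this._responses);

  final SpeechService _speech;
  final IntentDetector _intents;
  final EntityExtractor _entities;
  final ResponseGenerator _responses;

  /// Runs one full turn: listen, understand, respond.
  Future<String> handleUtterance() async {
    final transcript = await _speech.listen().last;                // speech-to-text
    final intent = _intents.detectIntent(transcript);              // intent detection
    final entities = _entities.extract(intent.intent, transcript); // entity extraction
    return _responses.generate(intent, entities);                  // response generation
  }
}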
Speech-to-Text Integration
Flutter does not have a built-in speech recognition engine, but the speech_to_text package provides a solid wrapper around platform-native APIs — Apple’s Speech framework on iOS and Google’s Speech Recognizer on Android.
import 'dart:async';

import 'package:speech_to_text/speech_recognition_error.dart';
import 'package:speech_to_text/speech_to_text.dart';

class SpeechService {
  final SpeechToText _speech = SpeechToText();
  bool _isAvailable = false;

  Future<void> initialize() async {
    _isAvailable = await _speech.initialize(
      onStatus: _onStatusChange,
      onError: _onError,
    );
  }

  Stream<String> listen() async* {
    if (!_isAvailable) {
      throw StateError('Speech recognition not available');
    }
    final resultController = StreamController<String>();
    await _speech.listen(
      onResult: (result) {
        resultController.add(result.recognizedWords);
        if (result.finalResult) {
          resultController.close();
        }
      },
      listenOptions: SpeechListenOptions(
        listenMode: ListenMode.dictation,
        cancelOnError: true,
        partialResults: true,
      ),
    );
    yield* resultController.stream;
  }

  void _onStatusChange(String status) {
    // Notify UI about listening state changes
    print('Speech status: $status');
  }

  void _onError(SpeechRecognitionError error) {
    print('Speech error: ${error.errorMsg}');
  }

  void stop() {
    _speech.stop();
  }
}
The key design decision here was exposing the transcription as a Stream<String>. This allows the UI to display partial results in real-time while the user is still speaking, creating that familiar voice assistant experience where text appears word by word.
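For example, the partial results can be rendered with nothing more than a StreamBuilder (a simplified, illustrative widget; the real app routes the stream through its state layer):
import 'package:flutter/material.dart';

/// Sketch: shows partial transcription results as they arrive.
/// The stream would be created once from SpeechService.listen().
class TranscriptView extends StatelessWidget {
  const TranscriptView({super.key, required this.transcriptStream});

  final Stream<String> transcriptStream;

  @override
  Widget build(BuildContext context) {
    return StreamBuilder<String>(
      stream: transcriptStream,
      builder: (context, snapshot) {
        return Text(
          snapshot.data ?? 'Listening…',
          style: Theme.of(context).textTheme.headlineSmall,
        );
      },
    );
  }
}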
Handling Audio Permissions
Both iOS and Android require explicit microphone permissions. On iOS, you also need to add usage descriptions for speech recognition and microphone access to Info.plist:
<key>NSSpeechRecognitionUsageDescription</key>
<string>This app needs speech recognition to understand your voice commands</string>
<key>NSMicrophoneUsageDescription</key>
<string>This app needs microphone access to hear your voice</string>
On Android, the permissions go in AndroidManifest.xml:
<uses-permission android:name="android.permission.RECORD_AUDIO" />
<uses-permission android:name="android.permission.INTERNET" />
I wrapped permission handling into a reusable utility that checks and requests permissions before any speech operation, with graceful fallback messaging when permissions are denied.
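That utility is not shown here, but a minimal sketch built on the permission_handler package (an assumption on my part; any permission plugin works the same way) looks like this:
import 'package:permission_handler/permission_handler.dart';

/// Checks (and requests, if needed) the permissions required before starting
/// a speech session. Returns true only when everything is granted.
Future<bool> ensureSpeechPermissions() async {
  final statuses = await [
    Permission.microphone,
    Permission.speech, // speech recognition permission, required on iOS
  ].request();

  final granted = statuses.values.every((status) => status.isGranted);
  if (!granted) {
    // This is where the app shows its "open settings" fallback message,
    // e.g. before calling openAppSettings().
  }
  return granted;
}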
NLP Challenges
Natural language understanding is where the real complexity lives. The assistant needs to figure out what the user wants (intent detection) and extract relevant details (entity extraction) from free-form spoken text.
Intent Detection
I started with a simple keyword-matching approach, which worked for basic commands but failed on natural language variations. “What is the weather like” and “Is it going to rain today” both express the same intent.
The production solution uses a hybrid approach: a lightweight on-device classifier for common intents and a cloud-based NLP API for more complex queries.
enum UserIntent {
  weather,
  setReminder,
  setTimer,
  searchWeb,
  generalQuestion,
  unknown,
}

class IntentDetector {
  final List<IntentRule> _rules = [
    IntentRule(
      intent: UserIntent.weather,
      keywords: ['weather', 'temperature', 'rain', 'sunny', 'forecast'],
      patterns: [r'what.*(weather|temp)', r'is.*(rain|cold|hot)'],
    ),
    IntentRule(
      intent: UserIntent.setTimer,
      keywords: ['timer', 'alarm', 'countdown'],
      patterns: [r'set.*(timer|alarm)', r'(start|begin).*countdown'],
    ),
    IntentRule(
      intent: UserIntent.setReminder,
      keywords: ['remind', 'remember', 'don\'t forget'],
      patterns: [r'remind.*to', r'remember.*to'],
    ),
  ];

  IntentResult detectIntent(String text) {
    final normalized = text.toLowerCase().trim();
    for (final rule in _rules) {
      // Check keyword match first (fast path)
      if (rule.keywords.any((kw) => normalized.contains(kw))) {
        return IntentResult(
          intent: rule.intent,
          confidence: 0.8,
          rawText: text,
        );
      }
      // Check regex patterns (slower, more accurate)
      for (final pattern in rule.patterns) {
        if (RegExp(pattern).hasMatch(normalized)) {
          return IntentResult(
            intent: rule.intent,
            confidence: 0.9,
            rawText: text,
          );
        }
      }
    }
    // Fallback to cloud NLP for unrecognized intents
    return IntentResult(
      intent: UserIntent.generalQuestion,
      confidence: 0.5,
      rawText: text,
      needsCloudProcessing: true,
    );
  }
}
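The IntentRule and IntentResult value types used above are not shown in the post; a minimal sketch consistent with the call sites (the field names here are inferred, not taken from the original code) would be:
class IntentRule {
  const IntentRule({
    required this.intent,
    required this.keywords,
    this.patterns = const [],
  });

  final UserIntent intent;
  final List<String> keywords;
  final List<String> patterns;
}

class IntentResult {
  const IntentResult({
    required this.intent,
    required this.confidence,
    required this.rawText,
    this.needsCloudProcessing = false,
  });

  final UserIntent intent;
  final double confidence;
  final String rawText;
  final bool needsCloudProcessing;
}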
Entity Extraction
Once the intent is known, the assistant extracts relevant entities. For timers, that means duration values. For reminders, it means time references and action descriptions.
class EntityExtractor {
  Map<String, dynamic> extract(UserIntent intent, String text) {
    switch (intent) {
      case UserIntent.setTimer:
        return _extractTimerEntities(text);
      case UserIntent.setReminder:
        return _extractReminderEntities(text);
      case UserIntent.weather:
        return _extractLocationEntities(text);
      default:
        return {};
    }
  }

  Map<String, dynamic> _extractTimerEntities(String text) {
    final durationPattern = RegExp(
      r'(?:(\d+)\s*(?:hours?|hrs?|h))?'
      r'\s*(?:(\d+)\s*(?:minutes?|mins?|m))?'
      r'\s*(?:(\d+)\s*(?:seconds?|secs?|s))?',
    );
    // Every component is optional, so the pattern can also succeed with a
    // zero-width match. Keep the first match that actually captured a number.
    RegExpMatch? match;
    for (final m in durationPattern.allMatches(text.toLowerCase())) {
      if (m.group(1) != null || m.group(2) != null || m.group(3) != null) {
        match = m;
        break;
      }
    }
    if (match == null) {
      return {'valid': false};
    }
    final hours = int.tryParse(match.group(1) ?? '0') ?? 0;
    final minutes = int.tryParse(match.group(2) ?? '0') ?? 0;
    final seconds = int.tryParse(match.group(3) ?? '0') ?? 0;
    final totalSeconds = hours * 3600 + minutes * 60 + seconds;
    return {
      'valid': totalSeconds > 0,
      'hours': hours,
      'minutes': minutes,
      'seconds': seconds,
      'totalSeconds': totalSeconds,
    };
  }

  // The reminder and location extractors follow the same approach and are
  // omitted here; stubs keep the class compilable.
  Map<String, dynamic> _extractReminderEntities(String text) => {};

  Map<String, dynamic> _extractLocationEntities(String text) => {};
}
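As a quick usage example of the timer path (the expected output is shown in the trailing comment):
final extractor = EntityExtractor();
final entities = extractor.extract(
  UserIntent.setTimer,
  'set a timer for 1 hour 30 minutes',
);
// entities: {valid: true, hours: 1, minutes: 30, seconds: 0, totalSeconds: 5400}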
On-Device vs Cloud
This is a critical trade-off. On-device processing is faster and works offline, but has limited accuracy. Cloud processing is more accurate but adds latency and requires internet connectivity.
My approach: use on-device for the top 10 most common intents (weather, timers, reminders, alarms, search, navigation, calculations, translations, note-taking, and general questions). Fall back to cloud for anything the on-device classifier cannot handle with sufficient confidence.
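In code, that split can be expressed as a simple confidence check in front of the cloud call. In the app the decision surfaces as the needsCloudProcessing flag set by IntentDetector; the 0.7 threshold below is illustrative rather than the production value:
/// Decide whether a recognized utterance should be handed to the cloud NLP
/// service instead of the on-device pipeline.
bool shouldUseCloud(IntentResult result, {double threshold = 0.7}) {
  return result.needsCloudProcessing || result.confidence < threshold;
}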
Response Generation
For simple intents, the app uses template-based responses. For general questions, it connects to an LLM API.
class ResponseGenerator {
  ResponseGenerator(this._llmService);

  final OpenAIService _llmService;

  Future<String> generate(IntentResult intent, Map<String, dynamic> entities) async {
    if (intent.needsCloudProcessing) {
      return await _llmService.complete(intent.rawText);
    }
    return _templateResponse(intent.intent, entities);
  }

  String _templateResponse(UserIntent intent, Map<String, dynamic> entities) {
    switch (intent) {
      case UserIntent.setTimer:
        if (entities['valid'] == true) {
          final h = entities['hours'] ?? 0;
          final m = entities['minutes'] ?? 0;
          final s = entities['seconds'] ?? 0;
          final parts = <String>[];
          if (h > 0) parts.add('$h hour${h > 1 ? "s" : ""}');
          if (m > 0) parts.add('$m minute${m > 1 ? "s" : ""}');
          if (s > 0) parts.add('$s second${s > 1 ? "s" : ""}');
          return 'Setting a timer for ${parts.join(", ")}.';
        }
        return 'I could not understand the duration. How long should the timer be?';
      case UserIntent.weather:
        final location = entities['location'] ?? 'your area';
        return 'Let me check the weather for $location.';
      default:
        return 'Let me think about that...';
    }
  }
}
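The OpenAIService wrapper is not shown in the post. A bare-bones version that calls the OpenAI chat completions endpoint with the http package might look like the following; the model name and system prompt are placeholders, and error handling is kept minimal for illustration:
import 'dart:convert';

import 'package:http/http.dart' as http;

class OpenAIService {
  OpenAIService(this._apiKey);

  final String _apiKey;

  /// Sends the raw transcription to the chat completions endpoint and
  /// returns the assistant's reply as plain text.
  Future<String> complete(String prompt) async {
    final response = await http.post(
      Uri.parse('https://api.openai.com/v1/chat/completions'),
      headers: {
        'Authorization': 'Bearer $_apiKey',
        'Content-Type': 'application/json',
      },
      body: jsonEncode({
        'model': 'gpt-4o-mini', // illustrative; any chat model works
        'messages': [
          {'role': 'system', 'content': 'You are a concise voice assistant.'},
          {'role': 'user', 'content': prompt},
        ],
      }),
    );

    if (response.statusCode != 200) {
      throw Exception('LLM request failed: ${response.statusCode}');
    }
    final json = jsonDecode(response.body) as Map<String, dynamic>;
    return json['choices'][0]['message']['content'] as String;
  }
}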
UX Considerations
Building the voice interaction layer taught me that UX is just as important as the NLP accuracy. A voice assistant that feels sluggish or unresponsive will frustrate users regardless of how accurate its understanding is.
Visual Feedback
The UI provides continuous visual feedback through four distinct states:
import 'package:flutter/material.dart';

enum AssistantState { idle, listening, processing, speaking }

class VoiceAssistantWidget extends StatefulWidget {
  const VoiceAssistantWidget({super.key});

  @override
  _VoiceAssistantWidgetState createState() => _VoiceAssistantWidgetState();
}

class _VoiceAssistantWidgetState extends State<VoiceAssistantWidget>
    with SingleTickerProviderStateMixin {
  late AnimationController _pulseController;

  // Updated from the assistant's state stream (wiring not shown here).
  AssistantState _state = AssistantState.idle;

  @override
  void initState() {
    super.initState();
    _pulseController = AnimationController(
      vsync: this,
      duration: const Duration(milliseconds: 1200),
    )..repeat(reverse: true); // drives the pulsing glow while listening
  }

  @override
  void dispose() {
    _pulseController.dispose();
    super.dispose();
  }

  @override
  Widget build(BuildContext context) {
    return AnimatedBuilder(
      animation: _pulseController,
      builder: (context, child) {
        return Container(
          width: 120,
          height: 120,
          decoration: BoxDecoration(
            shape: BoxShape.circle,
            color: _stateColor.withValues(alpha: 0.3),
            boxShadow: _state == AssistantState.listening
                ? [
                    BoxShadow(
                      color: _stateColor.withValues(alpha: 0.4),
                      blurRadius: 20 + _pulseController.value * 15,
                      spreadRadius: _pulseController.value * 8,
                    ),
                  ]
                : [],
          ),
          child: Icon(
            _stateIcon,
            size: 48,
            color: _stateColor,
          ),
        );
      },
    );
  }

  Color get _stateColor {
    switch (_state) {
      case AssistantState.idle:
        return Colors.grey;
      case AssistantState.listening:
        return Colors.blue;
      case AssistantState.processing:
        return Colors.orange;
      case AssistantState.speaking:
        return Colors.green;
    }
  }

  IconData get _stateIcon {
    switch (_state) {
      case AssistantState.idle:
        return Icons.mic;
      case AssistantState.listening:
        return Icons.mic;
      case AssistantState.processing:
        return Icons.psychology;
      case AssistantState.speaking:
        return Icons.volume_up;
    }
  }
}
Conversation Flow and Error Handling
Not every voice input will be understood. The assistant needs to handle ambiguity gracefully:
- Low confidence: Ask for clarification instead of guessing wrong
- No speech detected: Prompt the user to try again with a helpful message
- Network error: Explain that cloud features are unavailable and offer offline alternatives
- Permission denied: Guide the user to settings with a clear explanation of why access is needed
The key principle: always tell the user what is happening. Silence is the enemy of trust in a voice interface.
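One way to keep those messages consistent is to map each failure condition to a canned response in one place; the enum and wording below are illustrative, not taken from the app:
/// Illustrative mapping from failure conditions to what the assistant says.
enum AssistantFailure { lowConfidence, noSpeech, network, permissionDenied }

String failureMessage(AssistantFailure failure) {
  switch (failure) {
    case AssistantFailure.lowConfidence:
      return "I'm not sure I understood. Could you rephrase that?";
    case AssistantFailure.noSpeech:
      return "I didn't catch anything. Tap the microphone and try again.";
    case AssistantFailure.network:
      return "I can't reach the internet right now, but timers and reminders still work offline.";
    case AssistantFailure.permissionDenied:
      return 'I need microphone access to hear you. You can enable it in Settings.';
  }
}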
Lessons Learned
Building this voice assistant reinforced several important lessons:
- Start with the platform-native speech APIs. They are surprisingly good and avoid the complexity of integrating custom models for basic speech recognition.
- Design for partial results. Showing text as it is being transcribed dramatically improves the perceived responsiveness of the app.
- Keep the on-device/cloud split intentional. Having core features work offline is a significant advantage. Do not make everything dependent on a network connection.
- Invest in error states. Users will encounter permission issues, network problems, and unclear speech. How you handle these cases defines the quality of the experience.
- State management matters. The voice assistant has multiple concurrent state streams (listening state, transcription text, processing status, TTS playback). Using StreamBuilder with clearly separated state objects kept the code manageable.
- Test with real accents and environments. Speech recognition accuracy in a quiet room with clear pronunciation is very different from a noisy street with a strong accent. Real-world testing is essential.
Conclusion
Flutter proved to be an excellent platform for building a voice assistant. The stream-based architecture, cross-platform UI capabilities, and easy access to native platform features through plugins and platform channels made it possible to build a polished experience without writing platform-specific code. The biggest challenge was not the speech technology itself but designing the interaction patterns — visual feedback, error handling, and conversation flow — that make a voice assistant feel natural and trustworthy.
If you are considering building a voice-powered app, start with the native speech APIs, invest in your UX feedback loop, and keep the architecture modular enough to swap out components as better models and APIs become available.
