Speech-to-Text in Flutter Using Free & Open Source Tools
Introduction
Why Go Open Source for Speech-to-Text?
The Flutter Speech-to-Text Landscape
Option 1: speech_to_text Flutter Plugin (Platform-Native, Free)
Option 2: Vosk — Offline, On-Device ASR
Option 3: OpenAI Whisper (Self-Hosted or via whisper.cpp)
Option 4: Mozilla DeepSpeech / Coqui STT
Choosing the Right Tool for Your Use Case
Recording Audio in Flutter
Handling Permissions Cleanly
Building a Complete Voice-to-Text Widget
Tips for Better Accuracy
Performance and Model Size Considerations
Conclusion
References
Introduction
Voice interfaces are no longer a luxury reserved for Siri or Google Assistant. Today, developers can embed powerful, accurate speech recognition directly into Flutter apps — without paying per API call, without handing audio data to a third-party cloud, and without locking themselves into a proprietary vendor. Thanks to a growing ecosystem of free and open source tools, speech-to-text in Flutter has never been more accessible.
In this guide, we’ll explore the best open source options available, walk through practical implementations, compare trade-offs, and help you choose the right approach for your use case — whether you’re building a note-taking app, a voice-controlled UI, or an offline assistant for low-connectivity regions.
Why Go Open Source for Speech-to-Text?
Before diving into tools and code, it’s worth asking: why not just use Google’s Speech-to-Text API, AWS Transcribe, or Azure Speech Services?
There are compelling reasons to look elsewhere:
Cost at scale. Cloud ASR (Automatic Speech Recognition) APIs typically charge per 15-second audio chunk. At small volumes this is manageable, but any app that processes significant audio traffic — think a transcription tool, a language-learning app, or a voice-enabled productivity suite — can rack up large bills quickly.
Privacy and data sovereignty. When you stream audio to a cloud API, you’re sending potentially sensitive user data to a third-party server. For enterprise apps, healthcare tools, or any product with strict data regulations, on-device or self-hosted recognition is often a hard requirement.
Offline functionality. Cloud APIs require internet connectivity. Many use cases — field workers in remote areas, low-bandwidth markets, emergency scenarios — demand that voice recognition work when the network doesn’t.
Vendor independence. Basing your product on a single vendor’s API creates fragility. Open source solutions give you control over your stack, your model versions, and your roadmap.
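To make the cost point concrete, here is a quick back-of-envelope estimate in Python. The per-chunk price used here is a hypothetical placeholder, not any specific vendor's published rate:

```python
def monthly_asr_cost(users, minutes_per_user_per_day, price_per_15s=0.006):
    """Rough monthly cloud-ASR bill; price_per_15s is a hypothetical rate."""
    chunks_per_minute = 4  # billing is typically in 15-second increments
    daily_chunks = users * minutes_per_user_per_day * chunks_per_minute
    return daily_chunks * 30 * price_per_15s

# 10,000 users dictating 5 minutes a day adds up fast.
print(round(monthly_asr_cost(10_000, 5)))  # 36000
```

Even modest per-user usage multiplies quickly, which is why transcription-heavy products tend to move off metered APIs.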
The Flutter Speech-to-Text Landscape
Flutter doesn’t have a built-in speech recognition API, so all approaches rely on platform plugins, native integrations, or embedded models. Here are the main categories:
Platform-native wrappers — Use the device’s built-in ASR (Android’s SpeechRecognizer, iOS's SFSpeechRecognizer) via Flutter plugins. Free, but requires internet on most devices and isn't truly "open source" under the hood.
On-device open source models — Embed a model like Vosk or Whisper directly in your app. Truly offline, privacy-preserving, and fully open source.
Self-hosted server ASR — Run an open source model (like Whisper via a local server) and call it from your Flutter app over a local network or private cloud.
Each approach has its own Flutter integration strategy. Let’s cover the most practical options in depth.
Option 1: speech_to_text Flutter Plugin (Platform-Native, Free)
The speech_to_text package on pub.dev is the most popular Flutter plugin for voice recognition. It wraps the native speech recognition APIs on Android and iOS, making it easy to get started with just a few lines of Dart code.
What It Uses Under the Hood
Android: Google’s SpeechRecognizer API (requires Google Play Services and usually internet)
iOS: Apple’s SFSpeechRecognizer (works offline on newer iOS versions for some languages)
Web: The browser’s SpeechRecognition API (Chrome and Edge)
Installation
Add to your pubspec.yaml:
```yaml
dependencies:
  speech_to_text: ^6.6.2
```
Android Setup
In android/app/src/main/AndroidManifest.xml, add:
```xml
<uses-permission android:name="android.permission.RECORD_AUDIO"/>
<uses-permission android:name="android.permission.INTERNET"/>
<queries>
  <intent>
    <action android:name="android.speech.RecognitionService" />
  </intent>
</queries>
```
iOS Setup
In ios/Runner/Info.plist, add:
```xml
<key>NSSpeechRecognitionUsageDescription</key>
<string>This app uses speech recognition to convert your voice to text.</string>
<key>NSMicrophoneUsageDescription</key>
<string>This app needs access to the microphone for speech recognition.</string>
```
Basic Implementation
```dart
import 'package:speech_to_text/speech_to_text.dart';

class SpeechController {
  final SpeechToText _speech = SpeechToText();
  bool _isAvailable = false;
  String _recognizedText = '';

  Future<void> initialize() async {
    _isAvailable = await _speech.initialize(
      onError: (error) => print('Error: $error'),
      onStatus: (status) => print('Status: $status'),
    );
  }

  void startListening() {
    if (_isAvailable) {
      _speech.listen(
        onResult: (result) {
          _recognizedText = result.recognizedWords;
          print('Recognized: $_recognizedText');
        },
        listenFor: const Duration(seconds: 30),
        pauseFor: const Duration(seconds: 3),
        partialResults: true,
        localeId: 'en_US',
      );
    }
  }

  void stopListening() => _speech.stop();
}
```
Limitations
The speech_to_text plugin is easy to use and works well for general-purpose apps, but it is not truly open source speech recognition — it delegates to platform services. On Android, it typically requires an internet connection and routes audio through Google's servers. If you need genuine open source, offline, or privacy-first recognition, read on.
Option 2: Vosk — Offline, On-Device ASR
Vosk is a fully offline, open source speech recognition toolkit. It supports over 20 languages, runs on Android, iOS, Linux, Windows, and macOS, and is lightweight enough for mobile deployment. Models range from around 40 MB (small, fast) to 1.8 GB (large, highly accurate).
Vosk uses Kaldi-based acoustic models and is licensed under Apache 2.0, making it suitable for both personal and commercial projects.
Flutter Integration via vosk_flutter
The vosk_flutter plugin provides a Dart/Flutter interface to the Vosk library.
```yaml
dependencies:
  vosk_flutter: ^0.2.0
```
Download a Model
Download a Vosk model from alphacephei.com/vosk/models and place the model zip in your assets/ folder. For example, vosk-model-small-en-us-0.15 is a good starting point (~40 MB).
In pubspec.yaml:
```yaml
flutter:
  assets:
    - assets/vosk-model-small-en-us-0.15.zip
```
Full Implementation Example
```dart
import 'dart:convert';

import 'package:vosk_flutter/vosk_flutter.dart';

class VoskSpeechRecognizer {
  late VoskFlutterPlugin _vosk;
  late Model _model;
  late Recognizer _recognizer;
  SpeechService? _speechService;

  Future<void> initialize() async {
    _vosk = VoskFlutterPlugin.instance();

    // Load the model zip bundled in assets.
    final modelPath = await ModelLoader().loadFromAssets(
      'assets/vosk-model-small-en-us-0.15.zip',
    );
    _model = await _vosk.createModel(modelPath);
    _recognizer = await _vosk.createRecognizer(
      model: _model,
      sampleRate: 16000,
    );
  }

  Future<void> startListening({required Function(String) onResult}) async {
    _speechService = await _vosk.initSpeechService(_recognizer);
    _speechService!.onResult().listen((result) {
      final decoded = jsonDecode(result);
      final text = decoded['text'] as String;
      if (text.isNotEmpty) {
        onResult(text);
      }
    });
    await _speechService!.start();
  }

  Future<void> stopListening() async {
    await _speechService?.stop();
  }

  void dispose() {
    _speechService?.dispose();
    _recognizer.dispose();
    _model.dispose();
  }
}
```
Key Advantages of Vosk
True offline operation. Audio never leaves the device. Vosk processes everything locally using the bundled model.
Multilingual support. Models are available for English, Hindi, Chinese, German, French, Spanish, Russian, Portuguese, and many more. There are even small models optimized for Indian English.
Low latency. Vosk streams results in real-time as the user speaks, which makes it suitable for interactive applications.
Customizable vocabulary. You can provide Vosk with a grammar or custom word list to improve accuracy for domain-specific terms (medical, legal, technical jargon).
Trade-offs
The small Vosk models sacrifice some accuracy for size and speed. For general conversational speech, expect accuracy in the 85–92% range depending on the speaker and noise conditions — good for most apps, but not quite at the level of cloud APIs.
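Accuracy figures like these are typically measured as Word Error Rate (WER): word-level substitutions, deletions, and insertions divided by the number of reference words. If you want to benchmark a model on your own test recordings, here is a minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance: (S + D + I) / N. Assumes non-empty reference."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edits to turn the first i reference words into the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

print(word_error_rate("turn on the lights", "turn on lights"))  # one deletion -> 0.25
```

A WER of 0.08–0.15 corresponds to the 85–92% accuracy range mentioned above.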
Option 3: OpenAI Whisper (Self-Hosted or Via whisper.cpp)
Whisper was released by OpenAI as an open source model in 2022. It offers near-human-level transcription accuracy across dozens of languages and is available under the MIT license. While the original Python implementation is too heavy for mobile, whisper.cpp — a C/C++ port — can run on mobile devices.
Approach A: Self-Hosted Whisper Server + Flutter HTTP Client
The simplest production approach is to run Whisper on a server (even a local machine or a cheap VPS) and call it from Flutter via HTTP.
Run a simple Whisper API server using faster-whisper and FastAPI:
```python
# server.py
from fastapi import FastAPI, UploadFile
from faster_whisper import WhisperModel

app = FastAPI()
model = WhisperModel("base", device="cpu")

@app.post("/transcribe")
async def transcribe(file: UploadFile):
    audio_bytes = await file.read()
    with open("/tmp/audio.wav", "wb") as f:
        f.write(audio_bytes)
    segments, _ = model.transcribe("/tmp/audio.wav")
    text = " ".join([seg.text for seg in segments])
    return {"text": text}
```
On the Flutter side, record audio and POST it:
```dart
import 'dart:convert';
import 'dart:io';

import 'package:http/http.dart' as http;
import 'package:record/record.dart';

class WhisperClient {
  final _recorder = AudioRecorder();

  Future<void> startRecording() async {
    if (await _recorder.hasPermission()) {
      await _recorder.start(
        const RecordConfig(encoder: AudioEncoder.wav),
        path: '/tmp/audio.wav',
      );
    }
  }

  Future<String?> stopAndTranscribe() async {
    final path = await _recorder.stop();
    if (path == null) return null;

    final file = File(path);
    final request = http.MultipartRequest(
      'POST',
      Uri.parse('http://your-server/transcribe'),
    );
    request.files.add(
      await http.MultipartFile.fromPath('file', file.path),
    );
    final response = await request.send();
    final body = await response.stream.bytesToString();
    return jsonDecode(body)['text'];
  }
}
```
This approach delivers Whisper’s excellent accuracy while keeping the model off the mobile device.
Approach B: whisper.cpp Directly on Device
For fully on-device use, there is work underway to integrate whisper.cpp into Flutter via FFI. Projects like flutter_whisper (in active development) expose whisper.cpp bindings for Dart. The tiny Whisper model (~75 MB) runs adequately on mid-range devices; the base model (~150 MB) gives significantly better accuracy.
This space is evolving rapidly — check the pub.dev listings for the latest stable integrations.
Option 4: Mozilla DeepSpeech / Coqui STT
Coqui STT (the successor to Mozilla DeepSpeech) is another strong open source option. It uses a deep neural network based on Baidu’s DeepSpeech research architecture and supports streaming recognition. While active development on Coqui STT has slowed, existing models and integrations remain functional and production-worthy.
Coqui STT models are available in a TensorFlow Lite format suitable for mobile deployment. Integration with Flutter follows a similar pattern to Vosk — load the model from assets, initialize a recognizer, and stream audio through it.
Choosing the Right Tool for Your Use Case
Quick prototyping, general app: speech_to_text plugin
Offline, privacy-first, mobile: Vosk (vosk_flutter)
Highest accuracy, server available: Whisper (self-hosted)
Multilingual, low-resource languages: Vosk or Whisper
Real-time streaming recognition: Vosk
Post-recording transcription: Whisper
Corporate/regulated environments: Vosk or Whisper (self-hosted)
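If your app ships more than one engine and picks between them at runtime, the mapping above collapses into a small lookup, sketched here in Python for brevity; the keys are illustrative labels, not an official taxonomy:

```python
# Illustrative use-case labels mapped to the tools recommended above.
ENGINE_BY_USE_CASE = {
    "prototype": "speech_to_text plugin",
    "offline_mobile": "Vosk (vosk_flutter)",
    "max_accuracy_server": "Whisper (self-hosted)",
    "realtime_streaming": "Vosk",
    "post_recording": "Whisper",
}

def pick_engine(use_case, default="speech_to_text plugin"):
    """Fall back to the platform-native plugin for unrecognized cases."""
    return ENGINE_BY_USE_CASE.get(use_case, default)

print(pick_engine("offline_mobile"))  # Vosk (vosk_flutter)
```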
Recording Audio in Flutter
Regardless of which STT engine you choose, you need good audio input. The record package is the most versatile Flutter audio recorder:
```yaml
dependencies:
  record: ^5.1.1
  permission_handler: ^11.3.1
```

```dart
import 'package:record/record.dart';
import 'package:permission_handler/permission_handler.dart';

class AudioCaptureService {
  final AudioRecorder _recorder = AudioRecorder();

  Future<bool> requestPermissions() async {
    final status = await Permission.microphone.request();
    return status.isGranted;
  }

  Future<void> startRecording(String outputPath) async {
    if (!await requestPermissions()) return;
    await _recorder.start(
      const RecordConfig(
        encoder: AudioEncoder.wav, // PCM WAV: most compatible with STT engines
        sampleRate: 16000,         // 16 kHz is the standard for most ASR models
        numChannels: 1,            // Mono audio
        bitRate: 128000,
      ),
      path: outputPath,
    );
  }

  Future<String?> stopRecording() async {
    return await _recorder.stop();
  }

  Future<void> dispose() async {
    await _recorder.dispose();
  }
}
```
Important: Most open source ASR models expect 16 kHz, 16-bit, mono PCM audio. Mismatched sample rates are a common source of poor accuracy. Always configure your recorder to match your model’s requirements.
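Because format mismatches are such a common failure mode, it can pay to validate a recording before handing it to a recognizer, for example as a server-side pre-check before transcription. A small sanity check using Python's standard-library wave module, mirroring the 16 kHz / 16-bit / mono expectation above:

```python
import wave

def check_wav_for_asr(path, expected_rate=16000, expected_channels=1):
    """Return a list of problems that would hurt ASR accuracy (empty list = OK)."""
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getframerate() != expected_rate:
            problems.append(f"sample rate {wav.getframerate()}, expected {expected_rate}")
        if wav.getnchannels() != expected_channels:
            problems.append(f"{wav.getnchannels()} channels, expected mono")
        if wav.getsampwidth() != 2:
            problems.append(f"{8 * wav.getsampwidth()}-bit samples, expected 16-bit")
    return problems
```

If the check reports problems, resample with ffmpeg or a DSP library before inference rather than feeding the file through as-is.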
Handling Permissions Cleanly
Both Android and iOS require explicit user permission for microphone access. Use permission_handler to manage this gracefully:
```dart
Future<bool> ensureMicrophoneAccess(BuildContext context) async {
  var status = await Permission.microphone.status;
  if (status.isGranted) return true;

  if (status.isPermanentlyDenied) {
    showDialog(
      context: context,
      builder: (_) => AlertDialog(
        title: const Text('Microphone Access Required'),
        content: const Text(
          'Please enable microphone access in your device settings to use voice features.',
        ),
        actions: [
          TextButton(
            onPressed: () => openAppSettings(),
            child: const Text('Open Settings'),
          ),
        ],
      ),
    );
    return false;
  }

  status = await Permission.microphone.request();
  return status.isGranted;
}
```
Building a Complete Voice-to-Text Widget
Here’s a minimal but production-ready Flutter widget that ties everything together:
```dart
import 'package:flutter/material.dart';
import 'package:speech_to_text/speech_to_text.dart';

class VoiceInputWidget extends StatefulWidget {
  final Function(String) onTextCaptured;

  const VoiceInputWidget({super.key, required this.onTextCaptured});

  @override
  State<VoiceInputWidget> createState() => _VoiceInputWidgetState();
}

class _VoiceInputWidgetState extends State<VoiceInputWidget> {
  final SpeechToText _speech = SpeechToText();
  bool _isListening = false;
  bool _isAvailable = false;
  String _liveText = '';

  @override
  void initState() {
    super.initState();
    _initSpeech();
  }

  Future<void> _initSpeech() async {
    final available = await _speech.initialize();
    if (mounted) setState(() => _isAvailable = available);
  }

  void _toggleListening() {
    if (_isListening) {
      _speech.stop();
      setState(() => _isListening = false);
      widget.onTextCaptured(_liveText);
    } else {
      _speech.listen(
        onResult: (result) {
          setState(() => _liveText = result.recognizedWords);
        },
        partialResults: true,
      );
      setState(() {
        _isListening = true;
        _liveText = '';
      });
    }
  }

  @override
  Widget build(BuildContext context) {
    return Column(
      children: [
        if (_liveText.isNotEmpty)
          Padding(
            padding: const EdgeInsets.all(16),
            child: Text(
              _liveText,
              style: Theme.of(context).textTheme.bodyLarge,
            ),
          ),
        GestureDetector(
          onTap: _isAvailable ? _toggleListening : null,
          child: AnimatedContainer(
            duration: const Duration(milliseconds: 200),
            width: _isListening ? 72 : 64,
            height: _isListening ? 72 : 64,
            decoration: BoxDecoration(
              color: _isListening ? Colors.red : Colors.blue,
              shape: BoxShape.circle,
              boxShadow: _isListening
                  ? [
                      BoxShadow(
                        color: Colors.red.withOpacity(0.4),
                        blurRadius: 20,
                        spreadRadius: 5,
                      ),
                    ]
                  : [],
            ),
            child: Icon(
              _isListening ? Icons.stop : Icons.mic,
              color: Colors.white,
              size: 32,
            ),
          ),
        ),
        const SizedBox(height: 8),
        Text(
          _isListening ? 'Tap to stop' : 'Tap to speak',
          style: Theme.of(context).textTheme.bodySmall,
        ),
      ],
    );
  }
}
```
Tips for Better Accuracy
Getting good transcription quality in real-world conditions requires attention beyond just choosing the right library:
Pre-process audio before sending to the model. Apply noise reduction if you’re operating in noisy environments. The flutter_sound package offers some built-in filters, or you can pass audio through a pre-processing step using native code.
Use the right language model. Don’t use an English model for Hindi speech. Vosk and Whisper both have language-specific models — always match the model to the user’s locale.
Detect silence properly. Most STT engines benefit from clear speech boundaries. Implement a voice activity detection (VAD) step to trim leading and trailing silence before passing audio to the recognizer.
Handle partial vs. final results differently. Show partial results in the UI as the user speaks (for live feedback) but only act on final results for downstream logic like form filling or commands.
Test on real devices. Emulators often have poor microphone simulation. Test early and often on physical hardware, especially for timing-sensitive streaming recognition.
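The VAD tip above can be prototyped with a simple energy gate: split the audio into short frames and drop leading and trailing frames whose RMS energy sits below a threshold. A pure-Python sketch over 16-bit PCM sample values (production apps would typically use a trained VAD such as WebRTC's):

```python
def trim_silence(samples, frame_len=320, threshold=500):
    """Drop leading/trailing low-energy frames from 16-bit PCM samples.

    frame_len=320 is 20 ms at 16 kHz; threshold is an empirical RMS level.
    """
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]

    def is_speech(frame):
        rms = (sum(s * s for s in frame) / len(frame)) ** 0.5
        return rms >= threshold

    flags = [is_speech(f) for f in frames]
    if not any(flags):
        return []  # all silence
    first = flags.index(True)
    last = len(flags) - 1 - flags[::-1].index(True)
    return [s for frame in frames[first:last + 1] for s in frame]

# silence, then a loud burst, then silence
audio = [0] * 1000 + [8000, -8000] * 500 + [0] * 1000
print(len(trim_silence(audio)))  # 1280: only the frames spanning the burst survive
```

Trimming to frame boundaries is deliberately conservative; it keeps a little context around the speech rather than cutting mid-word.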
Performance and Model Size Considerations
On-device speech recognition creates a real tension between accuracy, model size, and device performance. Here’s a rough guide:
Vosk small models (~40–80 MB): Fast inference, low RAM usage, suitable for older mid-range devices. Word Error Rate (WER) is higher, but acceptable for command-and-control use cases.
Vosk large models (~1–2 GB): High accuracy, close to cloud quality. Only suitable for apps where users expect to download a large language pack, or for server deployment.
Whisper tiny/base models (~75–150 MB): Excellent accuracy even at small sizes. Slower than Vosk for real-time streaming, but outstanding for post-recording transcription. Runs on most modern Android/iOS devices.
Whisper medium/large models (300 MB–3 GB): Best-in-class accuracy. Reserved for server deployment or desktop apps.
For most mobile apps, the Vosk small model or Whisper base model hits the right balance. Consider letting users choose their quality level, offering a “fast mode” (small model) and a “precise mode” (larger model they download on demand).
Conclusion
Building speech-to-text in Flutter is no longer a choice between convenience and freedom. With Vosk for offline on-device recognition, Whisper for high-accuracy transcription, and the speech_to_text plugin for quick platform-native integration, you have robust, production-ready tools at every point on the spectrum.
Open source ASR has matured significantly. The combination of Vosk’s streaming speed and Whisper’s transcription accuracy covers virtually every mobile use case — and both can be integrated without sending a single audio byte to a third-party cloud service.
Start with the speech_to_text plugin to validate your concept quickly, then graduate to Vosk or a self-hosted Whisper backend when you're ready for privacy, offline support, or scale. Your users' voices — and their data — stay where they belong.
References:
Converting Speech to Text in Flutter Applications, Deepgram Blog (deepgram.com)
Adding speech-to-text and text-to-speech support in a Flutter app, LogRocket Blog (blog.logrocket.com)
Streaming Speech-to-Text in Flutter, Picovoice Blog (https://picovoice.ai/blog/streaming-speech-to-text-in-flutter/)
FlutterDevs team of Flutter developers to build high-quality and functionally-rich apps. Hire a Flutter developer for your cross-platform Flutter mobile app project hourly or full-time as per your requirement! For any flutter-related queries, you can connect with us on Facebook, GitHub, Twitter, and LinkedIn.
We welcome feedback and hope that you share what you’re working on using #FlutterDevs. We truly enjoy seeing how you use Flutter to build beautiful, interactive experiences.
Need help building production-grade Flutter apps? FlutterDevs helps teams ship faster with solid architecture, better UX, and practical AI features. Reach us at support@flutterdevs.com.


