- You likely use voice-to-text APIs on a regular basis without knowing it, because users don’t interact with the back-end function directly.
- Audio-to-Text APIs recognize users’ voices and improve their understanding over time, within the limits of the API itself, such as which languages it supports.
- There are many options on the market, but we’ll cover the top 10 Speech-To-Text APIs available right now.
Application Programming Interfaces – more commonly known as APIs – are a method of connecting computers and/or computer programs together to provide a service to an end user. APIs are all around us. They link smartphones with ecommerce platforms, send weather reports, connect websites with external data sources, order taxis and facilitate mobile banking.
One of the most useful kinds of API is the ‘Audio to Text’, ‘Voice to Text’ or speech recognition API: a development tool that lets an application convert spoken audio into written text on a computer or mobile device.
Audio to Text APIs are characterized by their ability to improve their understanding of a user’s voice over time, using advanced AI algorithms and large language databases that, in some cases, cover 100+ languages.
How Do Voice to Text APIs Work?
It’s important to note that users don’t generally interact with the API directly. It’s a back-end function. Speech recognition applications are different from speech recognition APIs. Applications (Cortana, Alexa etc.) are what the end user sees and hears. APIs are the engine behind them.
Audio to text may seem like a straightforward process to the end user, but there’s quite a lot going on in the background to produce an instant, accurate version of what’s being said.
Let’s break it down step-by-step:
- The user talks into a microphone, which sends the audio to the application.
- The application breaks up audio into small snippets of data called ‘phonemes’ – distinctly different units of sound, unique to each language (the English language has 44 of them).
- The software makes an informed decision about what the user is saying, based on how the phonemes are organized.
- The API then consults a language database to make an informed decision on what the user was likely to have said.
- The software displays the written text within an application.
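To make the steps above a little more concrete, here is a toy sketch of the lookup stage only. It assumes the audio has already been converted into phonemes, and it uses a tiny hand-written lexicon (`LEXICON`, written with ARPAbet-style symbols) that is purely illustrative. Real speech recognition engines rely on statistical acoustic and language models rather than an exact dictionary match, so treat this as a simplified picture and not as how any particular API works.

```python
from typing import Dict, List, Tuple

# Hypothetical pronunciation lexicon: phoneme sequences mapped to words.
# The entries are illustrative and not taken from any real API.
LEXICON: Dict[Tuple[str, ...], str] = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}


def decode_phonemes(phonemes: List[str],
                    lexicon: Dict[Tuple[str, ...], str]) -> List[str]:
    """Greedy longest-match lookup of phoneme runs against the lexicon."""
    words: List[str] = []
    longest = max(len(key) for key in lexicon)
    i = 0
    while i < len(phonemes):
        # Try the longest possible run first, then shrink it.
        for length in range(min(longest, len(phonemes) - i), 0, -1):
            chunk = tuple(phonemes[i:i + length])
            if chunk in lexicon:
                words.append(lexicon[chunk])
                i += length
                break
        else:
            # No lexicon entry starts here: mark it and move on.
            words.append("<unk>")
            i += 1
    return words


def transcribe(phonemes: List[str]) -> str:
    """Join the decoded words into the text shown to the user."""
    return " ".join(decode_phonemes(phonemes, LEXICON))
```

Calling `transcribe(["HH", "AH", "L", "OW", "W", "ER", "L", "D"])` walks through the same stages as the list: phonemes in, a lexicon consulted, written text out.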
The Top 10 Voice to Text APIs
Speech recognition software is highly popular and universally easy to use. Given that the technology is at the forefront of modern neural networking and AI research, products and methodologies are in a constant state of development.
Now that you understand how the software works, let’s look at the best Voice-to-Text APIs available in 2021 (in no particular order). Read on to help you decide which API is best for your organization or application.
1. IBM Watson STT API
IBM Watson STT is a well-supported, highly customizable API that draws on IBM’s experience as a leading provider of enterprise IT services. Its selling point is the number of resources that are made available to you once you start using it, from software development kits to best practice documentation.
Number of languages supported: 7
Pricing:
- Free up to 100 minutes per month
- $0.02 per minute (up to 250k minutes)
- $0.015 per minute (250k – 500k minutes)
Pros:
- Built-in API development
- Commercial transcription (call centres and SEO functionality)
- Vast knowledge base
Cons:
- Limited language support
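As a rough illustration of how tiered pricing like this adds up, the sketch below estimates a monthly bill from the three tiers listed above. It assumes the tiers are graduated (each rate applies only to the minutes that fall inside its band) and that the first 100 minutes are always free. Those are assumptions made for the example, not confirmed billing rules, and the function name `monthly_cost` is made up here.

```python
from decimal import Decimal

# (upper bound of the tier in minutes, price per minute in USD),
# taken from the pricing list above. Treating the tiers as graduated
# is an assumption made for this example.
TIERS = [
    (100, Decimal("0")),          # free allowance
    (250_000, Decimal("0.02")),   # up to 250k minutes
    (500_000, Decimal("0.015")),  # 250k to 500k minutes
]


def monthly_cost(minutes: int) -> Decimal:
    """Estimate the monthly charge for a number of transcribed minutes."""
    if minutes < 0:
        raise ValueError("minutes must not be negative")
    if minutes > TIERS[-1][0]:
        raise ValueError("no rate is listed above 500k minutes")
    cost = Decimal("0")
    lower = 0
    for upper, rate in TIERS:
        if minutes > lower:
            # Bill only the minutes that fall inside this band.
            cost += (min(minutes, upper) - lower) * rate
        lower = upper
    return cost
```

Under these assumptions, 1,100 minutes in a month would be billed as 1,000 paid minutes at $0.02 each. `Decimal` is used instead of floats so that the cents add up exactly.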
2. Rev.AI
An increasingly popular speech-to-text platform that’s modelled on 50,000+ hours of transcribed speech and driven by DevOps/Agile KPIs such as ‘time to market’ and scalable CI/CD.
Number of languages supported: 31
Pricing:
- Pay as you go – $0.035 per minute
- Enterprise – $1.20 per hour for high volume work with dedicated support
Pros:
- Multiple speaker recognition
- Support for asynchronous and streamed audio
- Cataloguing functionality for searchable transcript repositories
Cons:
- Can be slower than average over short form transcriptions
3. Google Speech API
Google’s very own speech API remains one of the most popular audio to text platforms in the world, and benefits from some of the best minds in the field of AI research to develop its voice recognition features.
Number of languages supported: 120
Pricing (minus data logging):
- 0-60 minutes – Free
- 60+ minutes – $0.006 per 15 seconds
Pros:
- Data logging options available
- Google Workspace functionality
- Automatic language detection
Cons:
- Total monthly capacity is limited to 1 million minutes
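Prices quoted per 15 seconds can be hard to compare at a glance, so here is a small sketch that converts the rate above into a per-minute figure and estimates the cost of a single clip. It assumes each clip is rounded up to the next 15-second increment, which is an assumption for the example and worth checking against the provider’s current billing documentation. The function names are invented here.

```python
from decimal import Decimal

RATE_PER_INCREMENT = Decimal("0.006")  # USD, from the pricing list above
INCREMENT_SECONDS = 15


def per_minute_rate() -> Decimal:
    """Convert the per-15-second price into a per-minute price."""
    return RATE_PER_INCREMENT * (60 // INCREMENT_SECONDS)


def clip_cost(audio_seconds: int, free_seconds_left: int = 0) -> Decimal:
    """Estimate the charge for one clip after any remaining free allowance.

    Rounding up to whole 15-second increments is an assumption.
    """
    billable = max(0, audio_seconds - free_seconds_left)
    increments = -(-billable // INCREMENT_SECONDS)  # integer ceiling
    return increments * RATE_PER_INCREMENT
```

With these numbers the paid rate works out to $0.024 per minute, and a 16-second clip would be billed as two increments.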
4. Siri API
Not to be confused with Apple’s famous virtual assistant, Siri API is a cheap and cheerful third party voice to text platform provided by a company called Voice Actions.
Number of languages supported: English only
Pricing:
- Up to 30 minutes per day – Free
- 49,000 minutes – $0.01
- 1 million minutes – $0.009
Pros:
- Free for small volume users
- Built for smartphone STT development (including menu navigation)
Cons:
- Limited developer support
- English only
5. Microsoft Cognitive Services
Building on their status as a leading global provider of B2B/B2C tech services, Microsoft Cognitive Services offers enterprise-level speech recognition within its Azure framework. While there is a lack of specialized functionality, Microsoft have pledged to continue developing their machine learning division to broaden its operational scope in the coming years.
Number of languages supported: 103
Pricing:
- 5 hours per month – Free
- 5 hours+ per month – $1 per hour
Pros:
- Industry-leading security via the Microsoft Trust agreement
- Full integration with existing Microsoft IaaS/PaaS products
- Highly active development community
Cons:
- Large volume specialized work can be pricey
- Lack of specialized API tools
6. Speechmatics API
Speechmatics is a cloud-based API that relies on intuitive front-end functionality and ultra-fast transcription speeds for high volume workloads. It is an enterprise-level application, so you’ll need to contact them directly for demonstrations and pricing.
Number of languages supported: 31
Pricing:
- Volume-based pricing
Pros:
- One of the best transcription engines available
- Broad range of integration features
Cons:
- Lack of a free option
- Limited language support
7. ReadSpeaker API
ReadSpeaker’s ‘SpeechCloud’ API is a straightforward cloud-based API for desktop and mobile applications, as well as PBXs and interactive voice response systems. ReadSpeaker doesn’t support on-premise services but is fully compatible with popular open source communication platforms, such as Asterisk.
Number of languages supported: 50
Pricing:
- Volume-based pricing on demand
Pros:
- Free trial account
- Highly customizable transcription features
- Sample code available for a variety of platforms
Cons:
- No publicly available pricing
- Lack of a hybrid option
8. Amazon Transcribe
Amazon Web Services (AWS) has taken the world of commercial SaaS by storm since its introduction in the early 2000s. The platform’s speech to text offering, Amazon Transcribe, offers pay-as-you-go pricing models for both streaming and batch workloads.
Number of languages supported: 31
Pricing:
- Differs on a region-by-region basis across three product tiers
- Most US regions start off at $0.02 per minute for the first 250k minutes
Pros:
- Incredibly accurate
- Highly progressive AI
- Support for video file speech
Cons:
- Limited language support
- Custom vocabulary features are difficult to use
9. Vonage Voice API
Vonage is a cloud communications company that provides a bespoke API for capturing voice communication over the telephone and converting transcripts into marketing data for future analysis.
Number of languages supported: 120
Pricing:
- $0.019 per 15 seconds
Pros:
- Specialized marketing features for direct sales calls
- Metadata tracking and ‘call event’ data capturing
- Vast number of languages supported
Cons:
- Lack of high volume pricing plans
- Zero third party application functionality
10. AssemblyAI
AssemblyAI specializes in data analysis and transcription functionality. Their core product offering comes pre-packaged with a wide array of development tools, from word confidence scores that self-analyse the accuracy of the transcript, to multi-speaker recording and labelling.
Number of languages supported: English only
Pricing:
- $0.00025 per second
Pros:
- Support for hybrid deployments
- Vast array of out-of-the-box features
- Much cheaper than enterprise-level APIs
Cons:
- Lack of billing options
- English-only service
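Word confidence scores are easiest to appreciate with a quick example. The sketch below assumes a transcript arrives as a list of words, each with a `text` and a `confidence` value between 0 and 1. That shape, the 0.8 threshold and both function names are assumptions made for illustration, so check the provider’s documentation for the real response format.

```python
from typing import Dict, List, Union

# Assumed shape of one transcribed word, e.g. {"text": "taxi", "confidence": 0.93}.
Word = Dict[str, Union[str, float]]


def flag_low_confidence(words: List[Word], threshold: float = 0.8) -> List[str]:
    """Return the words whose confidence falls below the threshold."""
    return [str(w["text"]) for w in words if float(w["confidence"]) < threshold]


def mean_confidence(words: List[Word]) -> float:
    """Average confidence across the transcript (0.0 for an empty one)."""
    if not words:
        return 0.0
    return sum(float(w["confidence"]) for w in words) / len(words)
```

A reviewing tool might highlight the flagged words for a human to double-check, and use the average as a rough signal of whether a recording needs re-transcribing.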
Conclusion
The Speech to Text marketplace is awash with innumerable pricing plans, features, language support options and hosting scenarios. The reality is that when you’re consulting on which API is best for your organization, nothing beats some good old fashioned market research. Each company’s workload and front-end requirements are different from one another. Use this information to narrow down a few prospective partners, establish your requirements and make some enquiries.
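As a starting point for that research, it can help to put the quoted prices on a common footing, since the providers in this list bill per second, per 15 seconds, per minute or per hour. The sketch below converts the paid base rates quoted in this article into a price per hour of audio. It ignores free allowances, volume discounts and regional differences, and it leaves out providers that do not publish a clear unit rate, so treat the ranking as a rough guide only.

```python
from decimal import Decimal
from typing import Dict, List, Tuple

# (price in USD, billing unit in seconds), as quoted earlier in this article.
QUOTED_RATES: Dict[str, Tuple[Decimal, int]] = {
    "IBM Watson STT": (Decimal("0.02"), 60),
    "Rev.AI": (Decimal("0.035"), 60),
    "Google Speech API": (Decimal("0.006"), 15),
    "Microsoft Cognitive Services": (Decimal("1"), 3600),
    "Amazon Transcribe": (Decimal("0.02"), 60),
    "Vonage Voice API": (Decimal("0.019"), 15),
    "AssemblyAI": (Decimal("0.00025"), 1),
}


def price_per_hour(price: Decimal, unit_seconds: int) -> Decimal:
    """Scale a quoted price to one hour (3,600 seconds) of audio."""
    return price * 3600 / unit_seconds


def rank_by_hourly_price(
    rates: Dict[str, Tuple[Decimal, int]]
) -> List[Tuple[str, Decimal]]:
    """Return (provider, price per hour) pairs, cheapest first."""
    hourly = [(name, price_per_hour(p, unit)) for name, (p, unit) in rates.items()]
    return sorted(hourly, key=lambda item: item[1])
```

Running `rank_by_hourly_price(QUOTED_RATES)` gives a quick shortlist to sanity-check against each provider’s current price page before making enquiries.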
That being said, here are two standout providers.
SMEs
For micro-businesses, start-ups and SMEs looking to implement third party development tools, cost is often a major consideration. Given its broad range of pre-packaged features, big name clients and low base cost, it’s hard to look past AssemblyAI for small enterprises looking for an API that provides the most bang for their buck.
Large Organizations
For large and enterprise-level organizations on a mission to implement the very best machine learning algorithms the market has to offer, Google’s dedication to neural programming and cross-compatibility is undoubtedly leading the pack. The company represents the cutting edge of deep learning research, and its speech APIs are on an upwards trajectory.