13 Startups Transcribing Voice to Text Using AI

Transcribing voice or video recordings to text doesn’t require much “subject matter expertise”, as they say. Anyone with a decent grasp of grammar and a mediocre understanding of the source material can do it. Transcribing, or transcription, simply means changing the “format” of information from one medium to another – like “voice to text”. It may be a very boring and low-value-add task, but transcribing is essential in fields like writing, journalism, legal, and healthcare. Sounds like we need some artificial intelligence (AI) to the rescue.

Turns out there are lots of bored people who need rescuing. In 2016, there were 57,400 medical transcribers and 19,600 court reporters employed in the U.S., with median wages for these jobs between $17 and $30 per hour. Back-of-the-napkin math tells us that this opportunity represents about $3.7 billion a year. Since robots and AI are already stealing the jobs of file clerks and database administrators around the globe, why is transcribing so difficult to automate?
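For readers who want to check the napkin, here is a rough sketch of that calculation in Python. The headcounts and the wage range come from the figures above; the 2,080-hour working year (40 hours a week for 52 weeks) and the use of the wage midpoint are our own assumptions, not numbers from the labor statistics.

```python
# Figures quoted in the article (U.S., 2016)
MEDICAL_TRANSCRIBERS = 57_400
COURT_REPORTERS = 19_600
WAGE_LOW, WAGE_HIGH = 17.0, 30.0  # median hourly wages in USD

# Assumption (not from the article): a full-time year of 40 hours x 52 weeks
HOURS_PER_YEAR = 40 * 52


def annual_payroll(workers: int, hourly_wage: float, hours: int = HOURS_PER_YEAR) -> float:
    """Total yearly wages if every worker is paid hourly_wage for a full year."""
    return workers * hourly_wage * hours


workers = MEDICAL_TRANSCRIBERS + COURT_REPORTERS
wage_mid = (WAGE_LOW + WAGE_HIGH) / 2  # assumption: midpoint of the quoted range

low = annual_payroll(workers, WAGE_LOW)
mid = annual_payroll(workers, wage_mid)
high = annual_payroll(workers, WAGE_HIGH)

print(f"Workers: {workers:,}")
print(f"Low estimate:      ${low / 1e9:.2f} billion")
print(f"Midpoint estimate: ${mid / 1e9:.2f} billion")
print(f"High estimate:     ${high / 1e9:.2f} billion")
```

Under these assumptions the midpoint lands at roughly $3.76 billion, in line with the figure above, with the full range running from about $2.7 billion to $4.8 billion.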

In our recent article on 7 Machine Language Translator Startups, we talked about how difficult it is to translate text from one language to another. Correctly identifying what someone said, using the language and context in which it was said, is also very difficult, and it’s the main reason why transcription jobs are so hard to automate. It’s something we noticed while reviewing the World’s Best Voice Recognition Software a few years ago. Only now have Siri and Alexa reached the level where they can understand the average consumer well enough to be useful.

Accurately transcribing voice to text depends on accurately identifying context. For example, we pronounce “blue” and “blew” exactly the same way, but the algorithms need to differentiate these words based on their usage within a sentence. Microsoft’s machine learning algorithms reached the level of human accuracy (equal to a 5.1% error rate) less than a year ago, even though their voice recognition research has been going on since the ’90s. That’s why there is plenty of room for advances to be made in voice to text transcription, something that these 13 startups are actively working on today.
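To make that error rate a bit more concrete, here is a minimal sketch of how such a figure is commonly computed, assuming the 5.1% refers to word error rate (the article says only “error rate”). Word error rate is the number of word substitutions, insertions, and deletions needed to turn the machine transcript into the human reference, divided by the number of words in the reference. The function name and the sample sentences are ours, made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance between the two transcripts, divided by reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    # prev[j] = edits needed to turn the first j hypothesis words into
    # the reference words processed so far.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the recognizer confuses the homophones "blew" and "blue".
reference = "the wind blew the blue kite away"
hypothesis = "the wind blue the blue kite away"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

One wrong word out of seven gives an error rate of about 14.3%, which shows how much ground a single misheard homophone costs in a short sentence.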


AISense

Founded in 2016, Silicon Valley startup AISense has raised $13 million in funding to develop their “Otter Voice Notes” app, a solution for transcribing long conversations between multiple people. Otter separates and identifies speakers, and allows users to store, search, analyze and share voice conversations. AISense provides the service through a cloud platform that includes storage as well, running their algorithms on Nvidia graphics processors.

Otter is available for consumers through the App Store and Google Play with a free plan that includes up to 600 minutes of transcription a month, or ten times that (6,000 minutes) for $10 a month. Enterprise use cases include call centers, online meetings, and pre-production media content – all priced on a case-by-case basis.

Behavioral Signals

Founded in 2016, Los Angeles startup Behavioral Signals has raised $1.5 million to develop a conversation analytics suite complete with automated transcription and behavioral analytics. Their “callER Analytics Engine” transcribes and analyzes calls while looking at the speakers’ emotional state to come up with a final success score.

Measuring factors like tone, positivity, politeness, or arousal, the engine is well equipped to help sales teams increase revenue by as much as 10% and even reduce agent attrition, the company claims. Other use cases include call centers, finance, and of course HR – who will inevitably use such a tool to make sure nobody flirts with anyone else over the phone. Ever.


SpeakSee

Founded in 2017, Netherlands startup SpeakSee has raised an undisclosed amount of funding to develop a small handheld microphone for real-time transcriptions for people with hearing problems. The company is currently running an Indiegogo campaign which has already exceeded the $50,000 target by 63%. These handheld microphones connect to a smartphone using Wi-Fi and listen in the direction in which they are pointed, so background noise is effectively cancelled out.

Data is relayed to their base station, then transmitted to the SpeakSee app. Mics are compatible with conference call systems and televisions as well, and the platform supports more than 120 languages or dialects. (Really?) One mic+dock combo costs $250 and a dock with three mics costs $350 at current early bird rates on Indiegogo. Regular readers already know our feelings about crowdfunding.


Tetra

San Francisco startup Tetra received their first round of funding in August 2017, and since then the team has raised $1.5 million to develop a smartphone VoIP communications app that records and transcribes both incoming and outgoing calls. Transcripts are easily searchable as well. Currently, Tetra notifies the other participant by default that the conversation is being recorded, but subscribers can disable the announcement as long as they “stay compliant with local laws or get recording consent”. The app is available through the App Store with the Android version still in the pipeline. Tetra currently only understands U.S. English, so it won’t understand you when you tell your mate to stop taking the piss because you were on the lash last night and snogged some 21-stone slapper behind the pub.


VoiceBase

Founded in 2010, San Francisco startup VoiceBase has raised $23 million to develop speech-to-text and complementary analytical application programming interfaces (APIs) – specific services described by programming protocols that can be embedded into larger applications. In other words, it’s “voice to text as a service”.

VoiceBase offers voice transcription, information extraction, speech analytics, and predictive analytics as part of their API suite, and all of these can be embedded into client systems seamlessly. The company is working with high profile names like Amazon Web Services, Oracle, and Nasdaq. Ten languages are currently supported with five more in the pipeline including Indian, Russian, and Japanese.


AssemblyAI

AssemblyAI, another San Francisco startup, received seed funding of $120,000 in August of last year and is backed by Y Combinator. Like VoiceBase, the startup is developing speech-to-text APIs that can be integrated into any voice-based application. Developers can access the tool and transcribe up to five hours a month for free. Paid subscriptions cost $0.018 per minute of audio with discounts above 10,000 hours per month, way below the cost of human transcribers.


Speechmatics

Founded in 2009, Cambridge, UK startup Speechmatics has raised an undisclosed amount of funding to develop their version of speech-to-text software that employs AI algorithms. The company doesn’t offer speech analytics or predictive models like some of its competitors, but offers its services in 75 languages, both in real-time and using batch recordings, which can be used over the cloud or hosted on site.

Speechmatics has created a self-learning platform that only needs the speech corpus to learn new languages, and has expanded the languages covered extensively over the past year. According to client studies, Speechmatics services are 30-40% more accurate than rival solutions, offering a vocabulary of 250,000 words. To put this in context, foreign speakers of a language are considered fluent above 10,000 words. The Speechmatics cloud service is priced at $0.08 per minute of audio with discounts offered on larger volumes.

Update 10/18/2019: Speechmatics has raised $8.25 million in Series A funding for product development and geographical expansion. This brings the company’s total funding to $8.25 million to date.  


Chorus.ai

We first came across Chorus.ai in our article “8 Artificial Intelligence Startups Improving CRM” which was published in early 2017. Founded in 2015, the San Francisco startup has raised $22.3 million over two funding rounds to develop a tool to transcribe and analyze sales calls and meetings. The application also creates a short summary of each call, saving valuable time that agents can spend with clients.

Chorus.ai has expanded their connectivity to other CRM and conference systems since 2017 and can be integrated into Salesforce, Google Suite, and Slack, among others. Customers include names like Adobe and The Muse job portal.


Gong.io

Founded in 2015, Gong.io is another San Francisco startup geared towards salespeople. Also covered in our earlier article on CRM AI, the company has increased their total funding to $28 million with Cisco coming on board as an investor. Gong’s algorithms merge all email, phone and conference conversations into a central repository and look for common traits of success across all that big data. Gong acquired ONDiGO, a sales automation platform syncing all agent activities with Salesforce, and has also released Android and Apple smartphone apps to allow connectivity on the go.

Update 08/12/2020: Gong.io has raised $200 million in Series D funding at a $2.2 billion valuation to grow their product and possibly acquire other companies. This brings the company’s total funding to $333 million to date.  


Trint

Founded in 2014, London, UK startup Trint has raised $5 million to develop their AI-based transcription service. Trint transcribes recorded audio and video files and is targeting journalists with their offering. All major file formats like .MP3, .MP4, and .WMA are supported by the web-based platform and a dedicated app is available for recordings on iPhones. Trint markets their solution as the one doing the heavy lifting instead of the client, and works best with clear audio files recorded over a professional microphone. While overlapping speech and ambient sounds decrease accuracy, the app has still received positive feedback from journalists. Pay-as-you-go accounts cost $0.32 per minute of recording, as expensive as the lower end of human transcribers, innit.

Simon Says

Founded in 2016, San Francisco startup Simon Says has raised an undisclosed amount of funding to develop a transcription service for the media industry. Like Trint, it works with audio and video files uploaded to a web app and also supports 90 languages including English, French, Spanish, Arabic, and Chinese.

Customers can export the results into Word or Excel. The pay-as-you-go service costs $0.17 a minute, and Simon Says offers 50%+ discounts on monthly or annual subscriptions. The service is used by leading media outlets like CNN, the BBC, and Vice.


Sonix

Founded in 2017, San Francisco startup Sonix has raised an undisclosed amount of funding to develop a web service to quickly transcribe recorded audio. Sonix claims a 30-minute file can be transcribed in 3-4 minutes, and customers can edit and highlight results on the go in the browser window.

The platform has useful features like search & replace, timestamps, and word confidence levels, and can arrange multiple user access with different permissions. Sonix is targeting journalists, podcasters, researchers, and students with the tool, with pricing starting at a $15-a-month subscription fee and $0.083 per minute of audio.
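Since several of the startups above quote per-minute prices, a quick sketch helps put them side by side. The prices in the table are the pay-as-you-go rates quoted in this article; the snippet simply converts them to a cost per hour of audio. It ignores subscription fees (such as the Sonix monthly fee), free tiers, and volume discounts, so treat it as a rough comparison rather than a quote.

```python
# Per-minute prices in USD as quoted in this article (list prices only;
# subscription fees, free tiers, and volume discounts are ignored).
PRICE_PER_MINUTE = {
    "AssemblyAI": 0.018,
    "Speechmatics": 0.08,
    "Sonix": 0.083,
    "Simon Says": 0.17,
    "Trint": 0.32,
}


def cost_per_audio_hour(price_per_minute: float) -> float:
    """Cost in USD of transcribing 60 minutes of audio, rounded to cents."""
    return round(price_per_minute * 60, 2)


per_hour = {name: cost_per_audio_hour(p) for name, p in PRICE_PER_MINUTE.items()}

# Print cheapest first
for name, cost in sorted(per_hour.items(), key=lambda item: item[1]):
    print(f"{name:<13} ${cost:>6.2f} per hour of audio")
```

At these list prices the range runs from $1.08 per hour of audio for AssemblyAI to $19.20 for Trint. The latter falls inside the $17 to $30 hourly wage range quoted earlier for human transcribers, though an hourly wage and a price per hour of audio are not quite the same thing.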


SayKara

Founded in 2015, Seattle startup SayKara has raised $2.5 million to develop an AI-powered scribe for physicians. The founding team hails from places like Nuance and Amazon, and their collective mission is to reduce physicians’ time spent on administration and increase their time spent on patient interaction. According to SayKara, half of doctors’ time is spent on non-patient-facing activities, which are very costly at $100 an hour. The company emerged from stealth mode last September, and has been testing its application with private practices and larger hospitals since then.

Update 09/20/2018: SayKara raised an additional $5 million from a funding round led by SpringRock Ventures. This brings the company’s total funding to $7.5 million so far. 


The ability for humans to communicate seamlessly with artificial intelligence using voice-to-voice will be the tipping point for call centers with no humans, courtrooms with no typewriters, and a future where pretty much everything being said in public will be recorded and analyzed by AI algorithms. Exciting, innit?

