Exploring the Future of Voice AI: Deepgram's Contextual AI Approach

Deepgram is a pioneering AI company revolutionizing speech recognition and conversational experiences with its end-to-end deep learning solutions.

  • The speaker is from Deepgram, a company that has been working with audio AI for 9 years.
  • Deepgram specializes in end-to-end deep learning for speech recognition.
  • They have recently released a TTS (text-to-speech) product.
  • Their API can be used on-premises or in the cloud and can adapt models to specific acoustic environments.
  • Deepgram is a research-led company, with fundamental research informing their product development.
  • The speaker discusses the evolution of voice AI, distinguishing between "Voice AI 1.0" (like Siri) and the current "Voice AI 2.0", which is more open-ended and real-time.
  • Voice AI 2.0 involves transforming speech to text, then text to text using a language model, and finally text to speech.
  • The speaker believes that we are at the beginning of a significant explosion in voice AI and text-based AI in the next decade.
  • They compare the current stage of voice AI to the early days of the automobile industry.
  • Deepgram's systems include speech detection, language/text detection, and text-to-speech, each of which functions independently.
  • The accuracy and speed of these systems have improved dramatically in recent years.
  • Deepgram can now perform the entire voice AI roundtrip conversation in less than 500 milliseconds.
  • However, current models lack context throughout the conversation, which is a critical area for improvement.
  • By adding context to speech-to-text models, they can become more accurate and better understand the nuances of conversations.
  • This context can include not just text prompts but also other audio, images, documents, or even the previous turn of the conversation.
  • Adding context will make AI systems feel more human-like, as they can respond appropriately to emotions, background noise, and conversation flow.
  • The speaker mentions the possibility of multimodal or speech-to-speech models but notes that compartmentalized models offer better control for businesses.
  • Deepgram aims to be a platform for all audio needs, including low-latency real-time speech-to-text and expressive text-to-speech.
  • Their upcoming product is a full-stack voice AI agent that puts everything together and reduces latency for faster turn-taking in conversations.
  • Deepgram offers $250 in credit for anyone to try out their products.
  • The speaker encourages the audience to think about how AI will be everywhere during the upcoming intelligence revolution, which they believe will take 25-30 years to fully develop.
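The cascaded pipeline described above (speech-to-text, then a language model, then text-to-speech, with conversation context carried across turns and a sub-500 ms roundtrip target) can be sketched as follows. This is a minimal illustration with stub stages; the function names and their behavior are placeholders, not Deepgram's actual API:

```python
import time

# Hypothetical stand-ins for the three compartmentalized stages.
# A real system would call streaming STT/LLM/TTS services.
def speech_to_text(audio: bytes) -> str:
    return "what's the weather like today"

def language_model(prompt: str, context: list[str]) -> str:
    # The talk argues context (previous turns, other audio, documents)
    # is what current models lack; here it is simply carried along.
    return "It's sunny and 72 degrees."

def text_to_speech(text: str) -> bytes:
    return text.encode("utf-8")

def voice_roundtrip(audio: bytes, history: list[str]) -> tuple[bytes, float]:
    """One conversational turn: STT -> LLM -> TTS.

    Returns the synthesized reply audio and the elapsed time in seconds,
    mirroring the talk's roundtrip-latency framing.
    """
    start = time.perf_counter()
    transcript = speech_to_text(audio)
    history.append(transcript)            # persist context across turns
    reply = language_model(transcript, history)
    history.append(reply)
    speech = text_to_speech(reply)
    return speech, time.perf_counter() - start

history: list[str] = []
speech, elapsed = voice_roundtrip(b"<pcm audio>", history)
```

In a production system each stage would stream partial results to the next, which is how the roundtrip stays under the 500 ms target cited in the talk; the sequential calls here are only for clarity.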

Source: AI Engineer via YouTube

❓ What do you think? What are the most significant cultural and societal implications of voice assistants becoming increasingly sophisticated, contextual, and human-like? Feel free to share your thoughts in the comments!