Developing Storyteller: AI-Powered Audio Stories in 2 Mins with TypeScript & ModelFusion
Unlocking the power of AI-generated audio stories for preschool kids with Storyteller, an application that leverages TypeScript, ModelFusion, and open-source libraries to craft engaging tales from a voice input.
- 1. Storyteller is an application that generates short audio stories for preschool kids.
- 2. It is implemented using TypeScript and ModelFusion, an AI orchestration library.
- 3. The application generates audio stories that are about two minutes long.
- 4. All it needs is a voice input to start generating a story.
- 5. Storyteller is a client-server application with the client written in React.
- 6. The server is a custom Fastify implementation.
- 7. The main challenges were responsiveness, quality, and consistency.
- 8. When the user finishes recording their voice input, it is sent to the server as an audio buffer and transcribed using OpenAI Whisper (see the transcription sketch after this list).
- 9. Client-server communication is done through an event stream from the server (see the event-stream sketch after this list).
- 10. An event with the transcription goes back to the client and updates the screen.
- 11. In parallel, a story outline is generated using GPT-3.5 Turbo Instruct.
- 12. The title, the image, and the full audio story are all generated in parallel (see the parallel-generation sketch after this list).
- 13. The title is generated using OpenAI GPT-3.5 Turbo Instruct and sent to the client as an event.
- 14. An image is generated by extracting a prompt from the story and passing it into Stability AI's Stable Diffusion model.
- 15. The generated image is stored on the server and the path is sent to the client.
- 16. Generating the full audio story takes the most time, around 1.5 minutes, using GPT-4 with a low temperature.
- 17. Instead of streaming the text token by token, the structure of the story is streamed as it is generated.
- 18. The application uses ModelFusion to parse the streamed structure and determine which parts are finished so they can be narrated (see the narration sketch after this list).
- 19. Each story part is narrated as soon as it is finished, using predefined or retrieved voices for the speakers.
- 20. A GPT-3.5 prompt is used to generate a voice description for new speakers, and a voice is selected based on that description.
- 21. The audio synthesis step uses one of the supported text-to-speech providers (such as ElevenLabs), depending on the chosen voice.
- 22. An audio file is generated and stored on the server, and the path is sent to the client.
- 23. The first audio is played while more parts are being generated in the background.
- 24. Responsiveness is addressed with a loading state on the client that updates as results become available, combined with streaming and parallel processing on the back end.
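
Below are a few illustrative sketches for the steps above. First, the voice-input transcription (point 8): the recording reaches the server as a raw buffer and is transcribed with OpenAI Whisper. This sketch uses the official OpenAI Node SDK rather than ModelFusion's wrapper, and the file name and audio format are assumptions.

```typescript
import OpenAI, { toFile } from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Sketch: transcribe the recorded voice input that arrived as a buffer.
// "recording.webm" is an assumed file name/format for illustration.
async function transcribeVoiceInput(audio: Buffer): Promise<string> {
  const result = await client.audio.transcriptions.create({
    file: await toFile(audio, "recording.webm"),
    model: "whisper-1",
  });
  return result.text; // returned to the client as a transcription event
}
```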
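Second, the event stream between the Fastify server and the React client (points 9-10). The route path, event names, and payload shapes below are assumptions; the talk only states that the server pushes events (transcription, title, image path, audio paths) and the client updates its screen as they arrive.

```typescript
import Fastify from "fastify";

const fastify = Fastify();

// Assumed event shapes for illustration.
type StoryEvent =
  | { type: "transcription"; text: string }
  | { type: "title"; title: string }
  | { type: "image"; path: string }
  | { type: "audio"; part: number; path: string };

// Server-side sketch: keep the connection open and write one event per step.
fastify.get("/events/:storyId", (request, reply) => {
  reply.raw.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  const send = (event: StoryEvent) =>
    reply.raw.write(`data: ${JSON.stringify(event)}\n\n`);

  // In the real app the generation pipeline calls send() as results arrive.
  send({ type: "transcription", text: "A story about a brave rabbit" });
});

fastify.listen({ port: 3001 });
```

On the client, an `EventSource` (or a fetch-based stream reader) can subscribe to this route and update the loading state as each event arrives.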
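Third, the parallel fan-out after the transcription (points 11-16): the outline is generated first, then the title, the image, and the full story run concurrently. The sketch below uses the OpenAI Node SDK for the text calls; `generateImage`, `generateFullAudioStory`, and `publish` are hypothetical stand-ins for the rest of the pipeline.

```typescript
import OpenAI from "openai";

const openaiClient = new OpenAI();

// Hypothetical stubs standing in for the rest of the pipeline.
declare function generateImage(outline: string): Promise<string>; // e.g. Stable Diffusion via Stability AI
declare function generateFullAudioStory(
  outline: string,
  publish: (event: object) => void
): Promise<void>;

async function processStory(transcription: string, publish: (event: object) => void) {
  // Story outline first, since the later steps depend on it.
  const outline = await openaiClient.completions.create({
    model: "gpt-3.5-turbo-instruct",
    prompt: `Write a short story outline for preschool kids based on: ${transcription}`,
    max_tokens: 400,
  });
  const outlineText = outline.choices[0].text;

  // Title, image, and full audio story run in parallel.
  await Promise.all([
    (async () => {
      const title = await openaiClient.completions.create({
        model: "gpt-3.5-turbo-instruct",
        prompt: `Write a short title for this story outline:\n${outlineText}`,
        max_tokens: 20,
      });
      publish({ type: "title", title: title.choices[0].text.trim() });
    })(),
    (async () => {
      const imagePath = await generateImage(outlineText);
      publish({ type: "image", path: imagePath });
    })(),
    generateFullAudioStory(outlineText, publish), // GPT-4, see the narration sketch
  ]);
}
```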
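Finally, the structured streaming and narration step (points 17-23): the story is streamed as a growing structure, finished parts are detected, and each part is synthesized while the rest is still being generated. The talk uses ModelFusion's structure parsing for this; the stream, voice, and speech helpers below are hypothetical stand-ins, not ModelFusion's actual API.

```typescript
interface StoryPart {
  speaker: string;
  text: string;
  isFinished: boolean;
}

// Hypothetical stand-ins for the real ModelFusion / text-to-speech calls.
declare function streamStoryStructure(
  outline: string
): AsyncIterable<{ parts: StoryPart[] }>; // GPT-4, low temperature
declare function describeVoice(speaker: string): Promise<string>; // GPT-3.5 voice description
declare function pickVoice(description: string): Promise<string>; // select a provider voice id
declare function synthesizeSpeech(text: string, voiceId: string): Promise<string>; // returns stored file path

async function narrateStory(
  outline: string,
  publish: (event: { type: "audio"; part: number; path: string }) => void
) {
  const voices = new Map<string, string>(); // speaker -> voice id
  let narrated = 0;

  // The structure stream yields the partial story repeatedly as it grows.
  for await (const partial of streamStoryStructure(outline)) {
    const finished = partial.parts.filter((part) => part.isFinished);

    // Narrate each newly finished part while the rest is still generating.
    for (const part of finished.slice(narrated)) {
      let voice = voices.get(part.speaker);
      if (voice === undefined) {
        voice = await pickVoice(await describeVoice(part.speaker));
        voices.set(part.speaker, voice);
      }
      const path = await synthesizeSpeech(part.text, voice);
      publish({ type: "audio", part: narrated, path });
      narrated += 1;
    }
  }
}
```

The client can start playing the first audio path as soon as it arrives, while later parts keep streaming in (points 23-24).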
Source: AI Engineer via YouTube
❓ What do you think of the ideas shared in this video? Share your thoughts in the comments!