Developing Storyteller: AI-Powered Audio Stories in 2 Mins with TypeScript & ModelFusion
Unlocking the power of AI-generated audio stories for preschool kids with Storyteller, an application that leverages TypeScript, ModelFusion, and open-source libraries to craft engaging tales from a voice input.
- 1. Storyteller is an application that generates short audio stories for preschool kids.
- 2. It is implemented using TypeScript and ModelFusion, an AI orchestration library.
- 3. The application generates audio stories that are about two minutes long.
- 4. All it needs is a voice input to start generating a story.
- 5. Storyteller is a client-server application with the client written in React.
- 6. The server is a custom Fastify implementation.
- 7. The main challenges were responsiveness, quality, and consistency.
- 8. When the user finishes recording their voice input, it is sent to the server as an audio buffer and transcribed using OpenAI Whisper (see the transcription sketch after this list).
- 9. Client-server communication is done through an event stream from the server (see the event-stream sketch after this list).
- 10. An event with the transcription goes back to the client and updates the screen.
- 11. In parallel, a story outline is generated using GPT-3.5 Turbo Instruct.
- 12. The title, the image, and the full audio story are all generated in parallel (see the parallel-generation sketch after this list).
- 13. The title is generated using OpenAI GPT-3.5 Turbo Instruct and sent to the client as an event.
- 14. An image is generated by extracting a prompt from the story and passing it into Stability AI's Stable Diffusion model.
- 15. The generated image is stored on the server and the path is sent to the client.
- 16. Generating the full audio story takes the most time, around 1.5 minutes, using GPT-4 with a low temperature.
- 17. Instead of streaming the text token by token, the structure of the story is streamed as it is generated.
- 18. The application uses ModelFusion to parse the streamed structure and determine which parts are finished so they can be narrated (see the narration sketch after this list).
- 19. Each story part is narrated as soon as it is finished, using predefined or retrieved voices for the speakers.
- 20. A GPT-3.5 prompt is used to generate a voice description for new speakers, and a voice is selected based on that description.
- 21. The audio synthesis step uses one of the supported text-to-speech providers (such as ElevenLabs), depending on the chosen voice.
- 22. An audio file is generated and stored on the server, and the path is sent to the client.
- 23. The first audio is played while more parts are being generated in the background.
- 24. Responsiveness is addressed with a loading state on the client that updates as results become available, combined with streaming and parallel processing on the back end.
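
Below are a few illustrative sketches for the steps above. First, the voice-input transcription (point 8): the recording reaches the server as a raw buffer and is transcribed with OpenAI Whisper. This sketch uses the official OpenAI Node SDK rather than ModelFusion's wrapper, and the file name and audio format are assumptions.

```typescript
import OpenAI, { toFile } from "openai";

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

// Sketch: transcribe the recorded voice input that arrived as a buffer.
// "recording.webm" is an assumed file name/format for illustration.
async function transcribeVoiceInput(audio: Buffer): Promise<string> {
  const result = await client.audio.transcriptions.create({
    file: await toFile(audio, "recording.webm"),
    model: "whisper-1",
  });
  return result.text; // returned to the client as a transcription event
}
```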
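Second, the event stream between the Fastify server and the React client (points 9-10). The route path, event names, and payload shapes below are assumptions; the talk only states that the server pushes events (transcription, title, image path, audio paths) and the client updates its screen as they arrive.

```typescript
import Fastify from "fastify";

const fastify = Fastify();

// Assumed event shapes for illustration.
type StoryEvent =
  | { type: "transcription"; text: string }
  | { type: "title"; title: string }
  | { type: "image"; path: string }
  | { type: "audio"; part: number; path: string };

// Server-side sketch: keep the connection open and write one event per step.
fastify.get("/events/:storyId", (request, reply) => {
  reply.raw.writeHead(200, {
    "Content-Type": "text/event-stream",
    "Cache-Control": "no-cache",
    Connection: "keep-alive",
  });

  const send = (event: StoryEvent) =>
    reply.raw.write(`data: ${JSON.stringify(event)}\n\n`);

  // In the real app the generation pipeline calls send() as results arrive.
  send({ type: "transcription", text: "A story about a brave rabbit" });
});

fastify.listen({ port: 3001 });
```

On the client, an `EventSource` (or a fetch-based stream reader) can subscribe to this route and update the loading state as each event arrives.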
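Third, the parallel fan-out after the transcription (points 11-16): the outline is generated first, then the title, the image, and the full story run concurrently. The sketch below uses the OpenAI Node SDK for the text calls; `generateImage`, `generateFullAudioStory`, and `publish` are hypothetical stand-ins for the rest of the pipeline.

```typescript
import OpenAI from "openai";

const openaiClient = new OpenAI();

// Hypothetical stubs standing in for the rest of the pipeline.
declare function generateImage(outline: string): Promise<string>; // e.g. Stable Diffusion via Stability AI
declare function generateFullAudioStory(
  outline: string,
  publish: (event: object) => void
): Promise<void>;

async function processStory(transcription: string, publish: (event: object) => void) {
  // Story outline first, since the later steps depend on it.
  const outline = await openaiClient.completions.create({
    model: "gpt-3.5-turbo-instruct",
    prompt: `Write a short story outline for preschool kids based on: ${transcription}`,
    max_tokens: 400,
  });
  const outlineText = outline.choices[0].text;

  // Title, image, and full audio story run in parallel.
  await Promise.all([
    (async () => {
      const title = await openaiClient.completions.create({
        model: "gpt-3.5-turbo-instruct",
        prompt: `Write a short title for this story outline:\n${outlineText}`,
        max_tokens: 20,
      });
      publish({ type: "title", title: title.choices[0].text.trim() });
    })(),
    (async () => {
      const imagePath = await generateImage(outlineText);
      publish({ type: "image", path: imagePath });
    })(),
    generateFullAudioStory(outlineText, publish), // GPT-4, see the narration sketch
  ]);
}
```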
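Finally, the structured streaming and narration step (points 17-23): the story is streamed as a growing structure, finished parts are detected, and each part is synthesized while the rest is still being generated. The talk uses ModelFusion's structure parsing for this; the stream, voice, and speech helpers below are hypothetical stand-ins, not ModelFusion's actual API.

```typescript
interface StoryPart {
  speaker: string;
  text: string;
  isFinished: boolean;
}

// Hypothetical stand-ins for the real ModelFusion / text-to-speech calls.
declare function streamStoryStructure(
  outline: string
): AsyncIterable<{ parts: StoryPart[] }>; // GPT-4, low temperature
declare function describeVoice(speaker: string): Promise<string>; // GPT-3.5 voice description
declare function pickVoice(description: string): Promise<string>; // select a provider voice id
declare function synthesizeSpeech(text: string, voiceId: string): Promise<string>; // returns stored file path

async function narrateStory(
  outline: string,
  publish: (event: { type: "audio"; part: number; path: string }) => void
) {
  const voices = new Map<string, string>(); // speaker -> voice id
  let narrated = 0;

  // The structure stream yields the partial story repeatedly as it grows.
  for await (const partial of streamStoryStructure(outline)) {
    const finished = partial.parts.filter((part) => part.isFinished);

    // Narrate each newly finished part while the rest is still generating.
    for (const part of finished.slice(narrated)) {
      let voice = voices.get(part.speaker);
      if (voice === undefined) {
        voice = await pickVoice(await describeVoice(part.speaker));
        voices.set(part.speaker, voice);
      }
      const path = await synthesizeSpeech(part.text, voice);
      publish({ type: "audio", part: narrated, path });
      narrated += 1;
    }
  }
}
```

The client can start playing the first audio path as soon as it arrives, while later parts keep streaming in (points 23-24).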
Source: AI Engineer via YouTube
❓ What do you think of the ideas shared in this video? Share your thoughts in the comments!