Navigating AI App Infrastructure Challenges: Building for Resilience in Long-running Workflows

Join Evan Boyle, founder and CEO of GenSX, as he explains how agents broke infrastructure and what it takes to build reliable applications in the era of long-running workflows.

* The speaker, Evan Boyle, is the founder and CEO of GenSX and has experience in cloud and developer tools.
* In AI applications, P1 latency (the 1st-percentile, or near-best-case, response time) is a couple of seconds, whereas Web 2.0 services typically responded in tens of milliseconds.
* The infrastructure for AI applications needs to be designed differently than traditional web services to handle the longer response times and high levels of traffic.
* Current serverless providers have limitations such as short timeouts (mostly 5 minutes) and a lack of native streaming support, which make them less suitable for long-running workflows.
* To build reliable AI applications, it's essential to choose infrastructure that can handle bursty traffic and rate limits, and that can resume work after user interruptions or errors.
* Developers working on AI applications often transition from being AI engineers to data engineers as they focus on getting the right context into prompts, which may involve crawling a user's inbox, i
* The increasing complexity of workflows and infrastructure requirements can lead to issues such as outages, rate limits, and difficulties in experimentation.
* To address these challenges, developers can build custom solutions using tools like SQS, Airflow, or Temporal, but there is a need for more user-friendly, serverless options that support long-runnin
* The speaker's company, GenSX, aims to provide an open-source library and infrastructure solution tailored for building agentic workflows, with features such as resumable streams, retries, and error
* GenSX's framework is designed to be unopinionated and focused on providing building blocks rather than abstracting away infrastructure details.
* The framework includes a simple component model that supports reusable, independently testable steps and workflows composed of these steps.
* Components can use wrapped versions of existing SDKs (e.g., OpenAI) with additional tooling for retries, tracing, etc.
* Workflows can be turned into REST APIs automatically, supporting synchronous and asynchronous invocation.
* GenSX's infrastructure includes a custom-built serverless platform designed for long-running workflows and agentic UIs, with separate API and compute layers that communicate via Redis streams.
* This architecture allows for independent scaling of the API layer, pluggable compute layers, resumability, and transparent UI handling of errors and navigation away from the page.
* When building infrastructure for agentic workflows, it's crucial to:
1. Start simple and plan for future long-running needs
2. Keep the compute and API plane separate
3. Leverage Redis streams for resumability
4. Make it easy for users to navigate away from the page without losing progress and handle errors transparently
* The speaker encourages using GenSX on GitHub as a starting point for building agentic workflows instead of building everything from scratch.
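
The resumability idea in the list above (a compute layer that writes progress to a stream, and an API layer that replays it to clients from the last ID they saw) can be sketched roughly as follows. This is a minimal in-memory illustration, not the speaker's implementation: `InMemoryStream`, `ClientSession`, and the step names are hypothetical stand-ins. In a real deployment a Redis stream could plausibly play the role of the log, with `XADD` appending entries and `XREAD` reading entries after a given ID.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class InMemoryStream:
    """Append-only log standing in for a single stream (hypothetical, in-memory only)."""
    entries: List[Tuple[int, str]] = field(default_factory=list)

    def append(self, payload: str) -> int:
        # Assign a strictly increasing ID to each new entry.
        entry_id = len(self.entries) + 1
        self.entries.append((entry_id, payload))
        return entry_id

    def read_after(self, last_id: int, count: Optional[int] = None) -> List[Tuple[int, str]]:
        # Return entries whose ID is greater than last_id, oldest first.
        newer = [entry for entry in self.entries if entry[0] > last_id]
        return newer if count is None else newer[:count]


class ClientSession:
    """API-layer view of one client: it only needs to remember the last delivered ID."""

    def __init__(self) -> None:
        self.last_id = 0
        self.received: List[str] = []

    def pull(self, stream: InMemoryStream) -> None:
        # Deliver everything the client has not seen yet, then advance the cursor.
        for entry_id, payload in stream.read_after(self.last_id):
            self.received.append(payload)
            self.last_id = entry_id


stream = InMemoryStream()
session = ClientSession()

# The compute layer emits the first two steps while the client is connected.
stream.append("plan")
stream.append("search")
session.pull(stream)

# The client navigates away; the workflow keeps running and keeps writing.
stream.append("draft")
stream.append("done")

# The client reconnects and resumes from its last seen ID, with no gaps or duplicates.
session.pull(stream)
print(session.received)
```

Because the workflow writes only to the log and never to the client connection, a dropped connection does not interrupt the work, and the API layer can scale or restart without losing progress.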

Source: AI Engineer via YouTube
