ChatGPT is poorly designed. So I fixed it
A critique of ChatGPT's confusing voice UX and a proposed redesign that integrates voice and text simultaneously—inspired by the familiar pattern of FaceTime + iMessage, built with GPT-4o Realtime API and intelligent model routing.
"With hundreds of millions of people using this every day, has no one stopped to ask why this app is so confusing?"
Watch (00:00:15)
Separate voice modes
Built by different teams
Shipping the org chart
The Problem: "Shipping the Org Chart"
ChatGPT has two separate voice interaction modes—voice-to-text and voice-to-voice—that feel like they were built by different companies. This is a classic UX anti-pattern called "shipping the org chart", where products reflect internal team structure rather than user needs.
"we actually see two buttons to interact with voice. This is the voicetoext option and this is voice to voice... it can only respond through voice."
Demonstrating the two separate voice modes in ChatGPT that don't integrate
Watch (00:00:22)
"It really feels like these two apps were built by two different companies."
The disconnect between voice modes feels like separate products
Watch (00:01:20)
"Scott Hanselman called this 'shipping the org chart'... it was three Android tablets chained together and I could suddenly see the organizational chart of this large international auto company, and OpenAI is guilty of this as well."
Using the EV dashboard analogy to explain 'shipping the org chart' anti-pattern
Watch (00:01:24)
The Org Chart Anti-Pattern
When products reflect internal team structure instead of user workflows, users experience disjointed, confusing interfaces. The EV analogy—with three Android tablets chained together displaying different interfaces—is a perfect metaphor for organizational silos manifesting in product design. Users don't care which team built which feature; they expect a cohesive experience.
The Solution: FaceTime + iMessage
The solution is elegantly simple: enable simultaneous voice and text interaction within the same conversation. Voice becomes the primary mode (like a phone call), with a slide-up text panel for detailed collaboration (like texting during FaceTime).
"There's two things I want to change about chat GPT today. allowing voice and text at the same time and smartly choosing the right model depending on your ask."
The two-part solution: multimodal + intelligent model routing
Watch (00:02:14)
"This pulls up a panel that looks like iMessage. It kind of feels like texting your friend while you're on a FaceTime call, with your call controls at the top."
Describing the FaceTime + iMessage inspired multimodal interface
Watch (00:02:54)
"GPT-4o Realtime gives you a live audio chat, and tool calls can handle the rest."
Technical implementation using GPT-4o Realtime API and tool calls
Watch (00:02:22)
Voice First, Text as Support
Voice is the default, always-available interaction mode. The text panel slides up when needed for editing, sharing links, or detailed content.
Context Persistence
Conversation context flows seamlessly between voice and text. You can see what you asked in voice mode and continue the conversation in either modality.
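As a sketch of what context persistence could look like under the hood: both modalities append to a single shared history, so a text follow-up sees everything said in voice mode. The talk doesn't show its implementation; the class and method names below are illustrative assumptions.

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    modality: str   # "voice" or "text"
    role: str       # "user" or "assistant"
    content: str    # speech transcript, or the literal text

@dataclass
class Conversation:
    """One shared history for both modalities (illustrative names)."""
    turns: list[Turn] = field(default_factory=list)

    def add_voice(self, role: str, transcript: str) -> None:
        self.turns.append(Turn("voice", role, transcript))

    def add_text(self, role: str, text: str) -> None:
        self.turns.append(Turn("text", role, text))

    def context(self) -> list[dict]:
        # Flatten to the role/content shape chat APIs expect, so either
        # modality can continue the conversation with full history.
        return [{"role": t.role, "content": t.content} for t in self.turns]

convo = Conversation()
convo.add_voice("user", "Find me a flight to Tokyo next month")
convo.add_text("assistant", "Here are three options with links...")
print(len(convo.context()))  # → 2
```

The key design choice is that modality is metadata on a turn, not a separate store, which is what prevents the jarring context loss between modes.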
Why FaceTime + iMessage Works
This pattern leverages existing mental models—users already understand how to talk on the phone while texting. By mapping voice AI to a phone call and text interactions to messaging, the redesigned interface requires zero learning. The multimodal approach aligns with natural human communication: voice for quick interactions, text for detailed or shareable content.
Technical Implementation: Off-the-Shelf Tools
The solution is achievable with current APIs—no proprietary technology required. It combines GPT-4o Realtime API for voice conversations, tool calls for handling different interaction modes, and intelligent model routing based on query complexity.
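The talk doesn't show its code, but a function-calling setup along these lines would fit: register a tool with the Realtime session whose description tells the model to push long-form content to the text panel rather than read it aloud. The tool name and schema here are assumptions for illustration, not the speaker's actual definition.

```python
# Hypothetical tool definition for a GPT-4o Realtime session: the model
# calls this when a response is better delivered as text (links, code,
# long lists) than spoken aloud. Name and fields are illustrative.
send_to_text_panel = {
    "type": "function",
    "name": "send_to_text_panel",
    "description": (
        "Send detailed or shareable content (links, code, long lists) "
        "to the slide-up text panel instead of reading it aloud."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "markdown": {
                "type": "string",
                "description": "Content to render in the text panel.",
            }
        },
        "required": ["markdown"],
    },
}
```

As the speaker notes below, a clear description like this is often all the model needs to route content sensibly, with no system prompt required.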
"like refactor this entire codebase to use Flutter instead. it detects that it's complex and decides to write a plan with the reasoning model to make sure the code actually works."
Example of automatic complexity detection and model routing
Watch (00:03:35)
"If you asked for details and pros and cons, for example, we could hand off to reasoning, tell you how long it's thinking, and hand back a more detailed response."
Heuristic-based model routing for deeper analysis
Watch (00:03:48)
"Didn't even need a system prompt. Just added this description and it was smartly able to send the right stuff with text."
Modern models can accomplish much with simple tool descriptions
Watch (00:04:46)
Tool Calls
Handles text input, research queries, and model handoffs through function calling
View Docs →
Simple Heuristics Enable Smart Routing
Complex model selection doesn't require AI—simple pattern matching works. Detecting keywords like "details," "pros and cons," "deep dive," or "refactor" can trigger handoff to reasoning models. Modern LLMs are surprisingly good at understanding context from minimal tool descriptions, often eliminating the need for complex system prompts. The entire demo was built with very simple prompts.
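The keyword heuristic described above can be a few lines of pattern matching. The marker list and model labels below are illustrative assumptions, not the talk's exact rules.

```python
# Route a query to a fast conversational model or a slower reasoning
# model via simple keyword matching -- no classifier needed.
# Marker list and model labels are illustrative assumptions.
COMPLEX_MARKERS = (
    "details", "pros and cons", "deep dive", "refactor",
    "step by step", "compare", "plan",
)

def choose_model(query: str) -> str:
    q = query.lower()
    if any(marker in q for marker in COMPLEX_MARKERS):
        # Hand off to a reasoning model and tell the user it's thinking.
        return "reasoning"
    # Otherwise stay on the fast realtime voice model.
    return "fast"

print(choose_model("Refactor this entire codebase to use Flutter"))  # → reasoning
print(choose_model("What's the weather like today?"))                # → fast
```

A production router would likely add negations and length thresholds, but the point stands: a tuple of strings gets you most of the way there.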
Key Takeaways
1. Avoid 'Shipping the Org Chart'
Design Philosophy
- Organize products around user workflows, not internal teams
- Cross-functional integration is essential for cohesive UX
- Product structure should be invisible to users
2. Multimodal > Mode Switching
UX Pattern
- Enable simultaneous voice and text, not separate modes
- Leverage existing mental models (FaceTime + iMessage)
- Voice first, with text as supplementary tool
3. Smart Model Routing
AI Engineering
- Use heuristics for automatic model selection
- Fast models for simple tasks, reasoning models for complexity
- Tell users when longer processing is needed
4. Off-the-Shelf Tools Enable Innovation
Technical
- GPT-4o Realtime + Tool Calls = Multimodal AI
- No proprietary tech required for great UX
- Modern models work well with simple prompts
5. Context Persistence Matters
UX Principle
- Maintain conversation context across modalities
- Users should seamlessly switch between voice and text
- Eliminate jarring context loss between modes
6. Design for Human Communication Patterns
Product Strategy
- Map AI interactions to familiar real-world patterns
- Voice is default, text for specific purposes
- Reduce cognitive load through existing mental models
Source Video
ChatGPT is poorly designed. So I fixed it
AI Engineer World's Fair
Research Note: All quotes in this analysis are timestamped and link to exact moments in the video. This report critiques ChatGPT's voice UX problems and documents a proposed solution using GPT-4o Realtime API with simultaneous voice + text interaction.
Key Concepts: Shipping the org chart, multimodal AI, GPT-4o Realtime API, tool calls, model routing heuristics, FaceTime + iMessage pattern, context persistence, user workflow design