UX Case Study

ChatGPT is poorly designed. So I fixed it

A critique of ChatGPT's confusing voice UX and a proposed redesign that integrates voice and text simultaneously—inspired by the familiar pattern of FaceTime + iMessage, built with the GPT-4o Realtime API and intelligent model routing.

"With hundreds of millions of people using this every day, has no one stopped to ask: why is this app so confusing?"
Watch (00:00:15)
• 2 Modes: Separate voice modes
• Different Teams: Built by different teams
• Anti-Pattern: Shipping the org chart

The Problem: "Shipping the Org Chart"

ChatGPT has two separate voice interaction modes—voice-to-text and voice-to-voice—that feel like they were built by different companies. This is a classic UX anti-pattern called "shipping the org chart", where products reflect internal team structure rather than user needs.

"we actually see two buttons to interact with voice. This is the voicetoext option and this is voice to voice... it can only respond through voice."

Demonstrating the two separate voice modes in ChatGPT that don't integrate

Watch (00:00:22)
"It really feels like these two apps were built by two different companies."

The disconnect between voice modes feels like separate products

Watch (00:01:20)
"Scott Hansselman called this shipping the org chart... it was three Android tablets chained together and I could suddenly see the organizational chart of this large international auto company and OpenAI is guilty of this as well."

Using the EV dashboard analogy to explain 'shipping the org chart' anti-pattern

Watch (00:01:24)

The Org Chart Anti-Pattern

When products reflect internal team structure instead of user workflows, users experience disjointed, confusing interfaces. The EV analogy—with three Android tablets chained together displaying different interfaces—is a perfect metaphor for organizational silos manifesting in product design. Users don't care which team built which feature; they expect a cohesive experience.

The Solution: FaceTime + iMessage

The solution is elegantly simple: enable simultaneous voice and text interaction within the same conversation. Voice becomes the primary mode (like a phone call), with a slide-up text panel for detailed collaboration (like texting during FaceTime).

"There's two things I want to change about chat GPT today. allowing voice and text at the same time and smartly choosing the right model depending on your ask."

The two-part solution: multimodal + intelligent model routing

Watch (00:02:14)
"This pulls up a panel that looks like iMessage. It kind of feels like texting your friend while you're on a FaceTime call with your call controls at the top."

Describing the FaceTime + iMessage inspired multimodal interface

Watch (00:02:54)
"GPT-4o Realtime gives you a live audio chat and tool calls can handle the rest."

Technical implementation using GPT-4o Realtime API and tool calls

Watch (00:02:22)

Voice First, Text as Support

Voice is the default, always-available interaction mode. The text panel slides up when needed for editing, sharing links, or detailed content.

Context Persistence

Conversation context flows seamlessly between voice and text. You can see what you asked in voice mode and continue the conversation in either modality.
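
A rough sketch of how this could be achieved with the Realtime API's conversation events: text typed in the slide-up panel is appended to the same session that is handling voice, so the model responds with full awareness of the spoken conversation. The `ws` connection and the `send_text_from_panel` helper are illustrative assumptions, not code from the talk.

```python
import json

async def send_text_from_panel(ws, text: str) -> None:
    """Inject a typed message into the ongoing GPT-4o Realtime voice session.

    `ws` is assumed to be an open WebSocket to the Realtime API
    (wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview).
    """
    # Add the typed text to the same conversation the voice session is using.
    await ws.send(json.dumps({
        "type": "conversation.item.create",
        "item": {
            "type": "message",
            "role": "user",
            "content": [{"type": "input_text", "text": text}],
        },
    }))
    # Ask for a response; the model can reply in audio, text, or both,
    # with the full voice + text history as context.
    await ws.send(json.dumps({"type": "response.create"}))
```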

Why FaceTime + iMessage Works

This pattern leverages existing mental models—users already understand how to talk on the phone while texting. By mapping voice AI to a phone call and text interactions to messaging, the redesigned interface requires zero learning. The multimodal approach aligns with natural human communication: voice for quick interactions, text for detailed or shareable content.

Technical Implementation: Off-the-Shelf Tools

The solution is achievable with current APIs—no proprietary technology required. It combines GPT-4o Realtime API for voice conversations, tool calls for handling different interaction modes, and intelligent model routing based on query complexity.

"like refactor this entire codebase to use Flutter instead. it detects that it's complex and decides to write a plan with the reasoning model to make sure the code actually works."

Example of automatic complexity detection and model routing

Watch (00:03:35)
"If you asked for details and pros and cons, for example, we could hand off to reasoning, tell you how long it's thinking, and hand back a more detailed response."

Heuristic-based model routing for deeper analysis

Watch (00:03:48)
"Didn't even need a system prompt. Just added this description and it was smartly able to send the right stuff with text."

Modern models can accomplish much with simple tool descriptions

Watch (00:04:46)
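
As a hedged sketch of what "just adding a description" could look like in practice, the Realtime session can be configured with function tools whose descriptions tell the model when to push content to the text panel or hand off to a reasoning model. The tool names, descriptions, and the `register_tools` helper below are assumptions for illustration; the talk does not show its exact definitions.

```python
import json

async def register_tools(ws) -> None:
    """Register function tools on an open Realtime session (`ws`), no system prompt."""
    await ws.send(json.dumps({
        "type": "session.update",
        "session": {
            "tools": [
                {
                    "type": "function",
                    "name": "send_to_text_panel",
                    "description": "Send links, code, or long-form content to the "
                                   "user's text panel instead of reading it aloud.",
                    "parameters": {
                        "type": "object",
                        "properties": {"content": {"type": "string"}},
                        "required": ["content"],
                    },
                },
                {
                    "type": "function",
                    "name": "handoff_to_reasoning",
                    "description": "Delegate complex requests (refactors, pros and "
                                   "cons, deep dives) to a reasoning model.",
                    "parameters": {
                        "type": "object",
                        "properties": {"request": {"type": "string"}},
                        "required": ["request"],
                    },
                },
            ],
        },
    }))
```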

GPT-4o Realtime API

Provides low-latency, bidirectional voice conversations with the AI

View Docs →

Tool Calls

Handles text input, research queries, and model handoffs through function calling

View Docs →

Reasoning Models

Automatically routes complex queries to o1-preview for deeper analysis

View Docs →

Simple Heuristics Enable Smart Routing

Complex model selection doesn't require AI—simple pattern matching works. Detecting keywords like "details," "pros and cons," "deep dive," or "refactor" can trigger handoff to reasoning models. Modern LLMs are surprisingly good at understanding context from minimal tool descriptions, often eliminating the need for complex system prompts. The entire demo was built with very simple prompts.
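
A minimal sketch of that heuristic, using an illustrative keyword list and model names (the talk does not publish its exact rules):

```python
# Keywords that suggest the user wants depth rather than a quick answer.
COMPLEX_MARKERS = ("details", "pros and cons", "deep dive", "refactor")

def route_query(query: str) -> str:
    """Pick a model: a reasoning model for complex asks, a fast model otherwise."""
    q = query.lower()
    if any(marker in q for marker in COMPLEX_MARKERS):
        return "o1-preview"  # hand off, and tell the user it will think longer
    return "gpt-4o"          # fast path for simple conversational turns
```

In the voice flow, a check like this would run on the transcribed ask before deciding whether to invoke the reasoning handoff.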

Key Takeaways

1. Avoid 'Shipping the Org Chart'

Design Philosophy

  • Organize products around user workflows, not internal teams
  • Cross-functional integration is essential for cohesive UX
  • Product structure should be invisible to users

2. Multimodal > Mode Switching

UX Pattern

  • Enable simultaneous voice and text, not separate modes
  • Leverage existing mental models (FaceTime + iMessage)
  • Voice first, with text as supplementary tool

3. Smart Model Routing

AI Engineering

  • Use heuristics for automatic model selection
  • Fast models for simple tasks, reasoning models for complexity
  • Tell users when longer processing is needed

4. Off-the-Shelf Tools Enable Innovation

Technical

  • GPT-4o Realtime + Tool Calls = Multimodal AI
  • No proprietary tech required for great UX
  • Modern models work well with simple prompts

5. Context Persistence Matters

UX Principle

  • Maintain conversation context across modalities
  • Users should seamlessly switch between voice and text
  • Eliminate jarring context loss between modes

6. Design for Human Communication Patterns

Product Strategy

  • Map AI interactions to familiar real-world patterns
  • Voice is default, text for specific purposes
  • Reduce cognitive load through existing mental models

Source Video

ChatGPT is poorly designed. So I fixed it

AI Engineer World's Fair

Video ID: y6L5RkEqQ8g
Duration: ~5 minutes
Tags: UX Design, GPT-4o Realtime, Multimodal AI, Product Critique
Watch on YouTube

Research Note: All quotes in this analysis are timestamped and link to exact moments in the video. This report critiques ChatGPT's voice UX problems and documents a proposed solution using GPT-4o Realtime API with simultaneous voice + text interaction.

Key Concepts: Shipping the org chart, multimodal AI, GPT-4o Realtime API, tool calls, model routing heuristics, FaceTime + iMessage pattern, context persistence, user workflow design

Research sourced from AI Engineer World's Fair transcript. Analysis critiques ChatGPT's voice UX and proposes a multimodal solution with real quotes and technical implementation details. All quotes verified against original VTT transcript with exact timestamps.