Designing YouTube: Video Upload and Streaming Platform
Designing a massive-scale video platform like YouTube, from video uploads and transcoding to global streaming and recommendations.
I remember the meeting like it was yesterday. The Head of Product at a fast-growing B2B SaaS company I was with had a brilliant idea: "Let's add video testimonials to the platform." Simple. Elegant. High-impact. The engineering team, eager to please, spun up a proof-of-concept in a week. They added a file upload field to a Rails model, used a popular library to handle the multipart form data, and stored the video on the same server's file system. The path was just another column in the PostgreSQL database. In the staging environment, it worked flawlessly.
Then came the launch. The first dozen uploads went fine. Then a customer tried to upload a 500 MB file from a slow connection. The web server process held the connection open for fifteen minutes, consuming memory and a precious worker thread. A few more of these uploads happened concurrently, and the entire application ground to a halt. The site was down. The post-mortem was painful, not because the problem was complex, but because it was so predictable.
They had fallen for the most common trap in system design: they mistook a file for a simple piece of data. They treated a video upload like updating a user's profile picture.
Here is the thesis I've come to after years of building and scaling media systems: A video file is not data to be stored; it is a raw material that must enter a manufacturing pipeline. The naive "upload-and-serve" model is not just a less-scalable version of a real video platform; it is a fundamentally incorrect architecture. It’s like trying to build a global car company by having customers mail you raw steel and waiting for you to hammer it into a car in your garage. To succeed, you must think like a factory foreman, not a file clerk.
Unpacking the Hidden Complexity
The initial failure wasn't just about resource exhaustion. It was a symptom of a deeper misunderstanding. The team's quick fix was to "just put it on S3." They refactored the code to upload directly to an S3 bucket. This solved the immediate problem of their web servers crashing, but it was like fixing a leaky pipe by pointing it out the window. The flood was now happening somewhere else, and a host of new, more insidious problems were just around the corner.
Why does this seemingly logical step fail? Because it ignores the three core domains of a video platform: ingestion, processing, and delivery.
1. The Ingestion Problem: The Fallacy of the Single File
When a user gives you a video, they are not giving you a finished product. They are giving you a package of unknown quality, size, and format. A 4K .MOV file from a new iPhone is a completely different beast from a 480p .AVI file from a ten-year-old digital camera. Your system must be prepared for anything.
Simply dumping this raw material into a bucket creates several second-order problems:
- No Validation: What if the file is corrupted? Or not a video at all? You only find out when a downstream process fails.
- No Metadata: How do you know the video's resolution, duration, or codec without processing it? You need this information to make intelligent decisions.
- Inefficient Delivery: Serving a 1 GB original file to a user on a 3G connection is a recipe for a terrible user experience. The user doesn't need 4K quality; they need a video that starts playing now.
This is where the factory analogy becomes critical. A car factory doesn't just receive a lump of steel. It receives a manifest, inspects the material, and routes it to the correct assembly line. Our video ingestion pipeline must do the same. This involves a decoupled upload process where the client gets a temporary, secure credential to upload the file directly to object storage. This isolates your application servers from the slow, unpredictable nature of file uploads.
This diagram illustrates the Resilient Upload Pipeline. The user's app first requests permission to upload from your API server. The server creates a record in the metadata database (e.g., status: uploading) and returns a special, short-lived URL (a pre-signed URL). The client then uses this URL to upload the file directly to the cloud storage bucket, completely bypassing your servers for the heavy lifting. Once the upload is complete, the storage service automatically sends an event to a message queue, kicking off the next stage of the process.
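To make that flow concrete, here is a minimal sketch of the "request permission" step, assuming an S3-compatible store accessed with boto3. The bucket name, the in-memory VIDEOS table, and the ID scheme are illustrative stand-ins, not prescriptions.

```python
import uuid
import boto3

s3 = boto3.client("s3")
UPLOAD_BUCKET = "raw-video-uploads"   # assumed bucket name
VIDEOS: dict[str, dict] = {}          # in-memory stand-in for the metadata DB

def create_upload(user_id: str, filename: str) -> dict:
    """API handler: register the video, then hand back a pre-signed PUT URL."""
    video_id = str(uuid.uuid4())

    # 1. Record intent first, so the system knows this object is expected.
    VIDEOS[video_id] = {
        "id": video_id,
        "owner": user_id,
        "original_filename": filename,
        "status": "uploading",
    }

    # 2. Short-lived credential: the client PUTs the bytes straight to object
    #    storage, never through your application servers.
    upload_url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": UPLOAD_BUCKET, "Key": f"raw/{video_id}"},
        ExpiresIn=900,  # valid for 15 minutes
    )
    return {"video_id": video_id, "upload_url": upload_url}
```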
2. The Processing Problem: The Transcoding Assembly Line
Once the raw material is in our warehouse (the S3 bucket), the real manufacturing begins. You cannot just serve the original file. To provide a smooth streaming experience to a global audience on countless devices and network conditions, you need Adaptive Bitrate Streaming (ABR).
ABR works by re-encoding the original video into multiple versions (called renditions) at different resolutions and bitrates. For example:
- 2160p (4K) at 15-20 Mbps
- 1080p at 5-8 Mbps
- 720p at 2-4 Mbps
- 480p at 1-2 Mbps
- 360p at 0.5-1 Mbps
The video is also chopped into small segments, typically 2-10 seconds long. A manifest file is created that lists all the available renditions and the location of their segments. The user's video player (like video.js or the native YouTube player) downloads this manifest, detects the user's current network speed, and requests the appropriate segments. If the network speed changes, the player can seamlessly switch to a higher or lower quality stream between segments.
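To ground this, here is a sketch of the transcoding step a worker might run, driving FFmpeg from Python. The ladder mirrors the list above; the exact bitrates, six-second segments, and output layout are assumptions, not a canonical recipe.

```python
import os
import subprocess

# Encoding ladder: (name, height, video bitrate). Values are illustrative.
RENDITIONS = [
    ("1080p", 1080, "6000k"),
    ("720p", 720, "3000k"),
    ("480p", 480, "1500k"),
    ("360p", 360, "800k"),
]

def transcode_to_hls(source: str, out_dir: str) -> None:
    """Produce one segmented HLS rendition per ladder entry."""
    for name, height, bitrate in RENDITIONS:
        rendition_dir = os.path.join(out_dir, name)
        os.makedirs(rendition_dir, exist_ok=True)
        subprocess.run([
            "ffmpeg", "-y", "-i", source,
            "-vf", f"scale=-2:{height}",      # keep aspect ratio, even width
            "-c:v", "libx264", "-b:v", bitrate,
            "-c:a", "aac", "-b:a", "128k",
            "-hls_time", "6",                  # ~6-second segments
            "-hls_playlist_type", "vod",
            os.path.join(rendition_dir, "index.m3u8"),
        ], check=True)
```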
This transcoding process is computationally expensive and asynchronous. It is the core of your video factory. Trying to do this synchronously upon upload would lead to impossibly long request times. This is a job for a fleet of specialized worker services that consume tasks from the message queue.
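A worker in that fleet can be as simple as a long-polling loop. This sketch assumes AWS SQS via boto3; the queue URL is a placeholder, and process_video is a stub standing in for the FFmpeg pipeline above.

```python
import json
import boto3

sqs = boto3.client("sqs")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/transcode-jobs"  # placeholder

def process_video(event: dict) -> None:
    """Stub: run the transcoding pipeline, then mark the video published."""
    ...

def run_worker() -> None:
    """Long-poll the queue; each message is one 'upload complete' event."""
    while True:
        resp = sqs.receive_message(
            QueueUrl=QUEUE_URL,
            MaxNumberOfMessages=1,
            WaitTimeSeconds=20,  # long polling avoids busy-spinning
        )
        for msg in resp.get("Messages", []):
            process_video(json.loads(msg["Body"]))
            # Delete only after success: a crash means redelivery, not data loss.
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])
```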
The lifecycle of a video is a state machine. It is never just "there." It is always in a specific state: uploading, pending_transcoding, transcoding, published, or failed.
This state diagram shows the Video Processing Lifecycle. Each state transition is a well-defined event in your system. An "upload complete" event moves the video from uploading to pending_transcoding. A transcoding worker pulls the message, changes the state to transcoding, and begins its work. If it succeeds, it updates the state to published and writes the manifest file location to the database. If it fails, it enters the failed state, perhaps triggering an alert or a retry mechanism. Managing these states explicitly is fundamental to building a reliable system.
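A cheap way to keep those transitions honest in code is an explicit transition table. This sketch uses the state names from above and refuses any move the lifecycle does not define.

```python
# Legal transitions for the video lifecycle described above.
TRANSITIONS: dict[str, set[str]] = {
    "uploading": {"pending_transcoding", "failed"},
    "pending_transcoding": {"transcoding", "failed"},
    "transcoding": {"published", "failed"},
    "failed": {"pending_transcoding"},  # a retry re-queues the job
    "published": set(),                 # terminal state
}

def advance(video: dict, new_state: str) -> None:
    """Move a video to new_state, failing loudly on any undefined transition."""
    current = video["status"]
    if new_state not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current} -> {new_state}")
    video["status"] = new_state
```

A worker then calls advance(video, "transcoding") when it picks up a job, and accidental double-processing surfaces as a loud error instead of silent corruption.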
When choosing a streaming protocol, the two dominant industry standards are HLS and DASH.
| Feature | HLS (HTTP Live Streaming) | MPEG-DASH |
| --- | --- | --- |
| Originator | Apple | ISO/MPEG (industry consortium) |
| Compatibility | Native on all Apple devices; broad support elsewhere with JS players. The de facto standard for mobile. | Standards-based; requires a JavaScript player (e.g., Shaka Player, video.js) on most platforms. |
| Container Format | MPEG-2 Transport Stream (.ts) or fragmented MP4 (.fmp4) | Primarily fragmented MP4 (.fmp4) |
| Latency | Traditionally higher (6-30 s), but Low-Latency HLS (LL-HLS) is closing the gap significantly. | Generally lower and more flexible, with low-latency profiles available. |
| DRM Support | Primarily Apple's FairPlay, though fMP4/CMAF packaging can also carry Widevine and PlayReady. | Uses Common Encryption (CENC), designed to support multiple DRM systems (Widevine, PlayReady, etc.) natively. |
For most use cases today, starting with HLS is the pragmatic choice due to its unparalleled native support on iOS and macOS, while still being universally playable everywhere else with standard JavaScript players.
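For a feel of what the player actually downloads first, here is roughly what an HLS master playlist for the ladder above looks like, emitted from Python. The bandwidth figures and paths are illustrative.

```python
# Roughly what an HLS master playlist looks like for the ladder above.
# BANDWIDTH is bits per second; each path points at a per-rendition playlist.
MASTER_PLAYLIST = """\
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=6500000,RESOLUTION=1920x1080
1080p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=3500000,RESOLUTION=1280x720
720p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1800000,RESOLUTION=854x480
480p/index.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=1000000,RESOLUTION=640x360
360p/index.m3u8
"""

with open("master.m3u8", "w", encoding="utf-8") as f:
    f.write(MASTER_PLAYLIST)
```

The player downloads this one file first, then switches among the variant playlists on its own as measured bandwidth changes.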
3. The Delivery Problem: The Global Logistics Network
You've successfully ingested and processed the video. It's now a set of manifest files and video segments sitting in your S3 bucket. The final, and perhaps most critical, piece of the puzzle is delivery. If a user in Tokyo has to download video segments from a server in Virginia, the latency will kill the experience. The time to first frame (the time from hitting play to the video starting) will be abysmal.
This is where a Content Delivery Network (CDN) is not a luxury; it is a core, non-negotiable component of the architecture. A CDN is a globally distributed network of cache servers. When a user requests your video, the request is routed to the nearest CDN "edge" server.
- If the edge server has the video segment cached, it serves it directly with extremely low latency.
- If it doesn't (a "cache miss"), it requests the file from a regional mid-tier cache or, ultimately, from your "origin" server (the S3 bucket), and then caches it for future requests in that region.
This means the vast majority of your traffic is served by the CDN, dramatically reducing latency for users and egress costs from your cloud provider. Your application's job is simply to tell the user's browser the correct CDN URL for the video manifest.
This diagram shows the Global Streaming Delivery Flow. The key interaction is that your backend application is only involved in the initial page load to provide the metadata, including the CDN URL for the video manifest. All subsequent requests for video segments are handled entirely by the CDN, which intelligently pulls from the origin storage only when necessary. This architecture scales horizontally and globally.
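Tying it together, the application's entire role at watch time can be a lookup like this sketch; the CDN hostname and record shape are assumptions.

```python
CDN_BASE = "https://cdn.example-video.com"  # assumed CDN hostname

def playback_info(video: dict) -> dict:
    """Return what the player needs: metadata plus the CDN manifest URL."""
    # The database status, not the bucket contents, decides playability.
    if video["status"] != "published":
        raise ValueError("video is not playable yet")
    return {
        "title": video.get("title", ""),
        "manifest_url": f"{CDN_BASE}/{video['id']}/master.m3u8",
    }
```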
Traps the Hype Cycle Sets for You
As with any complex domain, the landscape is littered with buzzwords and tempting over-optimizations. Here are the most common traps I see teams fall into.
Trap 1: "We'll build our own transcoding service." Unless your company's core business is video encoding, do not do this. The complexity of managing codecs, resolutions, bitrates, audio tracks, captioning, and hardware acceleration is staggering. The open-source tool FFmpeg is the heart of this world, but wrapping it in a reliable, scalable, and observable service is a massive undertaking. The Pragmatic Path: Use a managed service like AWS Elemental MediaConvert, Azure Media Services, or Google Cloud's Transcoder API. These services are built by teams of experts, scale elastically, and charge only for what you use. They are the "boring" choice that lets you focus on your actual product.
Trap 2: "We must use the AV1 codec for maximum efficiency." AV1 is a fantastic, royalty-free codec that offers superior compression to its predecessors (like H.264 and HEVC). However, its adoption is still growing, and it requires significantly more computational power to encode. H.264, on the other hand, is the cockroach of video codecs: it's everywhere, it's supported by every device made in the last decade, and its hardware encoding/decoding is ubiquitous and cheap. The Pragmatic Path: Start with H.264 as your baseline for universal compatibility. Add newer codecs like VP9 or AV1 as a second option for supported clients (like Chrome and Android). This progressive enhancement strategy ensures everyone can watch your video, while power users get a better experience.
Trap 3: "The video file is the source of truth." The video file is just a dumb asset; the intelligence of your system lives in its metadata. The Pragmatic Path: Make the database record for a video your source of truth, containing its title, description, owner, privacy settings, and, most importantly, its current status in the processing pipeline and the final manifest_url. Your application logic should always consult the database, not the file system. A video with a status of transcoding should not be playable, even if some of its files already exist in the S3 bucket.
Architecting for the Future
We have moved from a simple file upload to a distributed, event-driven, asynchronous manufacturing pipeline. This is the only way to handle video at scale. The core principle is decoupling. The ingestion pipeline is decoupled from the processing pipeline via a message queue. The delivery architecture is decoupled from your application via a CDN. Each component can be scaled, updated, or replaced independently.
Your First Move on Monday Morning: If you are tasked with adding a video feature, do not start by looking for an upload library. Your first action should be to open a whiteboard and draw the state machine for a video's lifecycle. Define the states: Created, Uploading, Processing, Published, Failed. Define the events that trigger transitions between them. This state machine is the logical core of your entire system. Everything else is an implementation detail that serves this lifecycle.
This principles-first approach saves you from the catastrophic failure of the naive file server model and sets you on a path to building a system that is resilient, scalable, and ready for the future. It forces you to think in terms of pipelines and asynchronous workflows, which is the language of modern distributed systems.
So, let me ask you this: As AI continues its relentless march, and video generation becomes as common as video uploading, how will your content manufacturing pipeline need to evolve? When the raw material is no longer a file but a text prompt, will your architecture bend, or will it break?
TL;DR: Key Architectural Takeaways
- Don't Treat Video Like a File: A video is a raw material for a complex processing pipeline, not a simple blob to be stored.
- Decouple Ingestion: Use pre-signed URLs to allow clients to upload directly to object storage (S3, GCS). This protects your application servers from being overwhelmed by slow, large uploads.
- Embrace Asynchronous Processing: Video transcoding is slow and resource-intensive. Use a message queue (SQS, Kafka, RabbitMQ) to decouple the upload process from the transcoding process. This allows for scalability, retries, and resilience.
- Think in State Machines: Explicitly model the lifecycle of a video (uploading, processing, published, failed). This makes your system's behavior predictable and easier to manage.
- Use Adaptive Bitrate Streaming (ABR): Transcode every video into multiple resolutions and bitrates (using HLS or DASH) to ensure a smooth playback experience for all users on any network.
- A CDN is Non-Negotiable: Use a Content Delivery Network (CDN) to serve video segments. This provides low-latency global delivery and dramatically reduces the load and cost on your origin storage.
- Buy, Don't Build (for Transcoding): Leverage managed services like AWS Elemental MediaConvert for transcoding. The complexity of building and maintaining a transcoding service is immense and rarely a core business competency.
- Metadata is King: Your database, not your file storage, is the source of truth for a video's status and location. All application logic should be driven by this metadata.