Design a Video Upload and Processing System
Video systems look easy when you only think about “upload a file and play it.” In reality, a good video platform is a pipeline: uploads must be reliable, processing must be scalable, playback must be fast worldwide, and costs must not explode when users start uploading large files.
Imagine you’re building this for DevsCall. Creators upload course videos. Students watch them on mobile and desktop. You need smooth playback, multiple qualities (so videos don’t buffer), thumbnails, and basic analytics. You also need to protect the origin servers, avoid storing duplicate uploads, and handle failures without losing user files.
This lesson walks through a production-friendly design that stays simple but covers the real engineering decisions.
Start with requirements that guide the pipeline
The first step is deciding what your system must support. At minimum: users upload videos, the platform processes them into streaming formats, and viewers can play them with low buffering. You likely want multiple renditions (360p/720p/1080p), thumbnails, and maybe captions later. You also want secure uploads, because funneling large files through your backend is a recipe for timeouts and huge bandwidth bills.
Your main non-functional goals are reliability (uploads should not fail halfway silently), scalability (processing should handle spikes), latency (playback must start fast), and cost control (storage and transcoding can become expensive quickly).
The upload pipeline
A classic mistake is uploading large video files through the application server. It works at small scale, but it burns CPU, RAM, and bandwidth on your API fleet, and it increases failure rates as file sizes grow.
The standard production approach is direct-to-object-storage uploads using signed URLs.
Here’s the typical flow. The client asks your API for an upload session. Your API authenticates the user, validates basic constraints (max file size, allowed formats), and returns a signed upload URL for object storage (like S3-compatible storage). The client uploads directly to storage using that URL. When upload completes, the client notifies your API, or storage emits an event that the upload finished.
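As a concrete illustration, here is a minimal sketch of the "create upload session" endpoint, assuming S3-compatible storage and the AWS SDK v3. The bucket name, size limit, allowed types, and the "incoming/" prefix are placeholders, not fixed choices.

```typescript
// Minimal sketch of the upload-session endpoint, assuming S3-compatible
// storage and the AWS SDK v3. Bucket name, limits, and prefix are illustrative.
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";
import { randomUUID } from "crypto";

const s3 = new S3Client({ region: "us-east-1" });

const MAX_UPLOAD_BYTES = 5 * 1024 * 1024 * 1024; // assumed 5 GB limit
const ALLOWED_TYPES = ["video/mp4", "video/quicktime"];

export async function createUploadSession(
  userId: string,
  contentType: string,
  sizeBytes: number
) {
  if (!ALLOWED_TYPES.includes(contentType)) throw new Error("unsupported format");
  if (sizeBytes > MAX_UPLOAD_BYTES) throw new Error("file too large");

  // The upload lands in a temporary "incoming" prefix, not the final public location.
  const videoId = randomUUID();
  const key = `incoming/${userId}/${videoId}`;

  const command = new PutObjectCommand({
    Bucket: "devscall-video-incoming",
    Key: key,
    ContentType: contentType,
  });

  // Short-lived URL: the client PUTs the file directly to object storage.
  const uploadUrl = await getSignedUrl(s3, command, { expiresIn: 900 });
  return { videoId, key, uploadUrl };
}
```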
This design keeps your API servers lightweight and makes uploads much more reliable because object storage is built for large file transfer.
You also usually upload to a temporary “incoming” bucket/prefix first, not the final public location, because the file is not yet processed or safe to serve.
Metadata and status
The system needs to track the lifecycle of each video: uploaded, processing, ready, failed. This is usually stored in a database.
A simple model includes a videos table with fields like video_id, owner_id, original_storage_key, upload_status, processing_status, duration, created_at, and maybe visibility settings (private/unlisted/public). You also keep a renditions table or structured field that records output formats and their storage keys.
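To make the model concrete, here is one way to express those records as types. Field names mirror the tables described above; the exact columns and enums will vary by implementation.

```typescript
// Sketch of the metadata model as TypeScript types; field names mirror the
// videos/renditions tables described in the text.
type UploadStatus = "pending" | "uploaded" | "failed";
type ProcessingStatus = "queued" | "processing" | "ready" | "failed";
type Visibility = "private" | "unlisted" | "public";

interface VideoRecord {
  videoId: string;
  ownerId: string;
  originalStorageKey: string;  // key of the raw upload in the incoming bucket
  uploadStatus: UploadStatus;
  processingStatus: ProcessingStatus;
  durationSeconds?: number;    // known only after probing/transcoding
  visibility: Visibility;
  createdAt: Date;
}

interface RenditionRecord {
  videoId: string;
  label: "360p" | "720p" | "1080p";
  manifestKey: string;         // e.g. storage key of this rendition's HLS playlist
  bitrateKbps: number;
}
```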
This state tracking matters because processing can take minutes, and failures are common. Users need to see progress and retry safely.
Transcoding
Once a video is uploaded, you need to transcode it. Transcoding is CPU/GPU heavy and should never happen inside the user-facing request path.
The correct pattern is asynchronous jobs. When an upload is finalized, the system publishes a job to a queue that contains the video_id and source location. Dedicated transcoding workers consume jobs, download the source from storage, run transcoding, upload outputs, and update the database status.
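A minimal sketch of publishing that job, assuming an SQS-style queue; the queue URL and message shape are illustrative.

```typescript
// Sketch: enqueue a transcoding job once the upload is finalized.
// Assumes an SQS-style queue; URL and message shape are illustrative.
import { SQSClient, SendMessageCommand } from "@aws-sdk/client-sqs";

const sqs = new SQSClient({ region: "us-east-1" });

export async function enqueueTranscodeJob(videoId: string, sourceKey: string) {
  await sqs.send(
    new SendMessageCommand({
      QueueUrl: "https://sqs.us-east-1.amazonaws.com/123456789012/transcode-jobs",
      MessageBody: JSON.stringify({ videoId, sourceKey, jobId: `${videoId}:v1` }),
    })
  );
}
```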
This queue-based design gives you spike handling and retries. If 1,000 videos are uploaded after a marketing push, your queue grows, and workers process at capacity. You can scale workers horizontally based on queue depth.
You also want idempotency here. Jobs should be safe to retry without generating duplicate outputs or corrupting state. A common approach is to store a processing version or job_id and ensure workers only finalize outputs once.
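One way to get that idempotency is a conditional database update, so only one worker attempt can finalize a given video and retries become no-ops. The `db.execute` helper below is hypothetical; any client with parameterized queries works the same way.

```typescript
// Sketch of idempotent job finalization via a conditional update.
// `db.execute` is a hypothetical SQL helper used for illustration.
export async function finalizeJob(
  db: { execute: (sql: string, params: unknown[]) => Promise<{ rowCount: number }> },
  videoId: string,
  jobId: string
): Promise<boolean> {
  // Conditional update: succeeds at most once per video.
  const result = await db.execute(
    `UPDATE videos
        SET processing_status = 'ready', finalized_job_id = $2
      WHERE video_id = $1
        AND processing_status = 'processing'
        AND finalized_job_id IS NULL`,
    [videoId, jobId]
  );

  if (result.rowCount === 0) {
    // Another attempt already finalized this video; treat the retry as a no-op.
    return false;
  }
  return true;
}
```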
Generate streaming-friendly video, not one big MP4
For playback at scale, you want adaptive streaming, not one giant file. The common approach is to output HLS or DASH. These break the video into small segments and a manifest file. Players can switch quality based on network speed, which reduces buffering.
Your transcoding job typically produces multiple renditions (like 360p, 720p, 1080p), each segmented. This also helps cost and performance because users on slow networks don’t download large high-bitrate streams unnecessarily.
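As a minimal sketch, one rendition can be produced by shelling out to ffmpeg from a worker. A real pipeline builds a full bitrate ladder plus a master playlist; the flags and paths here are only one reasonable starting point.

```typescript
// Minimal sketch: produce one HLS rendition by shelling out to ffmpeg.
// Real pipelines build a bitrate ladder and a master playlist.
import { execFile } from "child_process";
import { promisify } from "util";

const run = promisify(execFile);

export async function transcodeToHls(sourcePath: string, outDir: string, height: number) {
  await run("ffmpeg", [
    "-i", sourcePath,
    "-vf", `scale=-2:${height}`,   // scale to target height, keep aspect ratio
    "-c:v", "libx264",
    "-c:a", "aac",
    "-hls_time", "6",              // ~6-second segments
    "-hls_playlist_type", "vod",
    "-hls_segment_filename", `${outDir}/${height}p_%03d.ts`,
    `${outDir}/${height}p.m3u8`,
  ]);
}
```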
Thumbnails
Thumbnails are part of the processing pipeline. Most systems generate thumbnails during transcoding by extracting frames at specific timestamps (for example, near the start and around the midpoint) and storing them separately.
You may generate a default thumbnail automatically and optionally allow the creator to choose or upload a custom one later. Thumbnails are usually served via CDN and cached aggressively since they are read-heavy.
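A small sketch of the frame-extraction step, again via ffmpeg; the 5-second offset is an arbitrary illustrative choice.

```typescript
// Sketch: extract a single frame as the default thumbnail.
import { execFile } from "child_process";
import { promisify } from "util";

const run = promisify(execFile);

export async function extractThumbnail(sourcePath: string, outPath: string) {
  await run("ffmpeg", [
    "-ss", "00:00:05",   // seek before decoding for a fast single-frame grab
    "-i", sourcePath,
    "-frames:v", "1",
    outPath,             // e.g. a temp path, then uploaded to storage
  ]);
}
```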
CDN delivery
Serving video segments from your origin storage directly to the world can be expensive and slower for global users. You place a CDN in front of your video storage so edge locations cache segments close to users.
The typical setup is: output segments stored in object storage; CDN configured with that storage as origin; playback URLs point to the CDN domain. For private videos, you use signed URLs or token-based access at the CDN layer so only authorized viewers can fetch segments.
This is critical: you don’t want someone to copy a direct URL and share it publicly if the content is paid.
Access control
For paid content, your system must enforce access consistently. A common approach is short-lived signed URLs generated when a user starts playback, or a signed cookie/token that the CDN validates on every segment request.
The key is to avoid putting authorization checks in front of every segment at your application servers. That would overload your backend. You want the CDN to enforce access so your API stays out of the hot path.
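The exact signing mechanism depends on your CDN (signed URLs, signed cookies, or edge tokens), but the shape of the idea is a short-lived token the edge can verify without calling your API. Here is a generic HMAC sketch of that idea, not any specific CDN's API; in practice the secret or key material lives in the CDN's signing configuration.

```typescript
// Generic illustration of a short-lived playback token. Real CDNs have their
// own signing schemes; this HMAC sketch only shows the shape of the idea.
import { createHmac, timingSafeEqual } from "crypto";

const SECRET = process.env.PLAYBACK_TOKEN_SECRET ?? "dev-only-secret";

export function issuePlaybackToken(videoId: string, ttlSeconds = 300): string {
  const expires = Math.floor(Date.now() / 1000) + ttlSeconds;
  const payload = `${videoId}:${expires}`;
  const sig = createHmac("sha256", SECRET).update(payload).digest("hex");
  return `${payload}:${sig}`;
}

export function verifyPlaybackToken(token: string, videoId: string): boolean {
  const [tokenVideoId, expires, sig] = token.split(":");
  if (!tokenVideoId || !expires || !sig) return false;
  if (tokenVideoId !== videoId) return false;
  if (Number(expires) < Math.floor(Date.now() / 1000)) return false;
  const expected = createHmac("sha256", SECRET)
    .update(`${tokenVideoId}:${expires}`)
    .digest("hex");
  return sig.length === expected.length &&
    timingSafeEqual(Buffer.from(sig), Buffer.from(expected));
}
```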
Cost planning
Video systems have three major cost centers: storage, transcoding compute, and CDN egress (bandwidth).
Storage costs grow with original uploads and all renditions. Transcoding costs scale with minutes of video processed. Bandwidth costs scale with views and quality levels streamed.
A practical cost control approach includes limiting maximum upload size, restricting allowed bitrates, generating only the renditions you actually need, and using lifecycle policies. For example, you may keep originals only for a limited time if not needed, or archive them to cheaper storage.
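A back-of-envelope sketch helps make these three cost centers tangible. Every rate below is a placeholder assumption; substitute your provider's actual pricing.

```typescript
// Back-of-envelope cost sketch. All rates are placeholder assumptions.
const ASSUMED = {
  storagePerGbMonth: 0.023, // $/GB-month, object storage
  transcodePerMinute: 0.015, // $/minute of source video processed
  egressPerGb: 0.08,         // $/GB delivered through the CDN
};

export function estimateMonthlyCost(params: {
  storedGb: number;          // originals + all renditions
  minutesTranscoded: number; // new uploads this month
  gbStreamed: number;        // total CDN delivery
}) {
  const storage = params.storedGb * ASSUMED.storagePerGbMonth;
  const transcode = params.minutesTranscoded * ASSUMED.transcodePerMinute;
  const egress = params.gbStreamed * ASSUMED.egressPerGb;
  return { storage, transcode, egress, total: storage + transcode + egress };
}
```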
Caching also matters. A high CDN cache hit rate reduces origin fetches and usually lowers overall egress costs.
You also plan capacity around peak usage. Upload spikes affect queue depth and worker capacity. View spikes affect CDN usage and manifest/segment cache behavior. A good design scales these independently: upload/processing and playback are separate planes.
Failure modes
Uploads fail due to network interruptions. Your client should support resumable uploads (multipart upload) for large files. The server should track partial uploads and allow retry without starting over.
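A sketch of the server-side pieces of a resumable upload, assuming S3-compatible multipart uploads: the client asks for a presigned URL per part, retries individual parts after interruptions, and completes the upload at the end. The bucket name is illustrative.

```typescript
// Sketch of server-side support for resumable (multipart) uploads, assuming
// S3-compatible storage. Clients retry or resume individual parts.
import {
  S3Client,
  CreateMultipartUploadCommand,
  UploadPartCommand,
  CompleteMultipartUploadCommand,
} from "@aws-sdk/client-s3";
import { getSignedUrl } from "@aws-sdk/s3-request-presigner";

const s3 = new S3Client({ region: "us-east-1" });
const BUCKET = "devscall-video-incoming"; // illustrative bucket name

export async function startMultipartUpload(key: string) {
  const { UploadId } = await s3.send(
    new CreateMultipartUploadCommand({ Bucket: BUCKET, Key: key })
  );
  return UploadId;
}

export async function presignPart(key: string, uploadId: string, partNumber: number) {
  return getSignedUrl(
    s3,
    new UploadPartCommand({
      Bucket: BUCKET,
      Key: key,
      UploadId: uploadId,
      PartNumber: partNumber,
    }),
    { expiresIn: 3600 }
  );
}

export async function completeMultipartUpload(
  key: string,
  uploadId: string,
  parts: { ETag: string; PartNumber: number }[]
) {
  await s3.send(
    new CompleteMultipartUploadCommand({
      Bucket: BUCKET,
      Key: key,
      UploadId: uploadId,
      MultipartUpload: { Parts: parts },
    })
  );
}
```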
Transcoding jobs fail due to corrupt files, unsupported formats, or worker crashes. Your queue should retry with backoff, but after a threshold, mark the video as failed and surface a clear error to the user. Keep logs and a dead-letter queue for debugging.
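A small sketch of that retry policy: capped exponential backoff with jitter, after which the job goes to the dead-letter queue and the video is marked failed. The constants are illustrative.

```typescript
// Sketch of capped exponential backoff for transcode retries.
// After MAX_ATTEMPTS, mark the video failed and send the job to the DLQ.
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 30_000;

export function nextRetryDelayMs(attempt: number): number | null {
  if (attempt >= MAX_ATTEMPTS) return null; // give up: mark failed, route to DLQ
  const exponential = BASE_DELAY_MS * 2 ** attempt;
  const jitter = Math.random() * 0.2 * exponential; // avoid retry stampedes
  return exponential + jitter;
}
```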
CDN issues happen. If the CDN has problems, playback degrades. You may fall back to the origin temporarily, but you must protect origin capacity. In practice, you monitor CDN error rates and adopt multi-CDN only when scale demands it.
Most importantly, “processing down” should not break uploads, and “analytics down” should not break playback. Keep your pipeline decoupled.
A clean, interview-ready architecture
A strong video system design usually looks like this: the client uploads directly to object storage using signed URLs; the backend stores metadata and publishes a processing job; transcoding workers consume jobs, generate adaptive streaming outputs and thumbnails, and write results back; playback is served through a CDN with token-based access; analytics and notifications run asynchronously; cost controls and lifecycle rules keep storage and compute predictable.
If you can explain this pipeline end-to-end, including why uploads are direct-to-storage, why processing is queued, and why CDN access is secured, you’re answering this question at a production level.
Frequently Asked Questions
Why shouldn't clients upload videos through the application server?
Large video files consume too much bandwidth and memory. Direct uploads to object storage using signed URLs are more reliable and scalable.
What are signed URLs for?
Signed URLs allow clients to upload or access videos securely without exposing storage credentials or overloading backend servers.
Why is transcoding done asynchronously?
Transcoding is compute-intensive and slow. Using queues and background workers prevents uploads from blocking and allows safe retries.
Why generate multiple quality levels?
Multiple qualities (like 360p, 720p, 1080p) enable adaptive streaming so users get smooth playback based on their network speed.
Why put a CDN in front of video storage?
A CDN caches video segments near users, reducing latency, protecting origin storage, and lowering bandwidth costs.
How is paid or private content protected?
Systems use signed URLs or token-based access at the CDN layer to ensure only authorized users can stream the content.