A visual, beginner-friendly guide to media workflows, from raw video to your screen. Step through each concept at your own pace.
Before diving into AWS services, let's understand what video actually is at a technical level.
Video is just a series of images (frames) shown rapidly. 30fps = 30 images per second. Your eyes perceive this as motion.
Common frame rates: 24fps (cinema), 30fps (TV/web), 60fps (sports/gaming). Higher = smoother but more data.
The number of pixels in each frame. More pixels = sharper image but bigger file.
SD (480p) = 720×480 pixels. HD (1080p) = 1920×1080. 4K = 3840×2160, which is 4x the pixels of 1080p!
How much data flows per second. Higher bitrate = better quality but needs more bandwidth. Measured in Mbps (megabits per second).
Think of it like a water pipe: a wider pipe (higher bitrate) carries more detail. A 1080p stream typically needs 4-8 Mbps. 4K needs 15-25 Mbps.
A container (.mp4, .ts, .mkv) is the box that holds everything together. A codec (H.264, H.265) is how the video and audio inside are compressed.
One container can hold multiple streams: video, audio (multiple languages), subtitles, and metadata, all in one file.
Raw video is massive: a single minute of uncompressed 1080p is about 10 GB. Codecs compress it by 100-1000x so it can travel over the internet.
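The 10 GB figure can be checked with back-of-the-envelope arithmetic. The sketch below assumes 8-bit RGB (3 bytes per pixel) at 30 fps and compares against an 8 Mbps stream; these are illustrative assumptions, and the function name is hypothetical. Other pixel formats give somewhat different totals.

```python
def raw_video_bytes(width: int, height: int, fps: int, seconds: int,
                    bytes_per_pixel: int = 3) -> int:
    """Size of uncompressed video, assuming a fixed number of bytes per pixel."""
    return width * height * bytes_per_pixel * fps * seconds


one_minute = raw_video_bytes(1920, 1080, fps=30, seconds=60)
print(f"{one_minute / 1e9:.1f} GB per minute of raw 1080p")  # 11.2 GB

# One minute of an 8 Mbps stream, converted from bits to bytes.
compressed = 8_000_000 / 8 * 60
print(f"compression ratio: about {one_minute / compressed:.0f}x")  # about 187x
```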
I-frame (Keyframe): A complete picture. The "reset point." Larger but self-contained.
P-frame: Stores only changes from the previous frame. Much smaller.
B-frame: References both past AND future frames. Smallest but most complex.
A GOP is the sequence from one I-frame to the next. Typical GOP = 2-4 seconds of video.
Shorter GOP = more I-frames = bigger file but easier to seek/cut. Longer GOP = better compression but harder to edit.
In live streaming, GOP size directly affects latency: the player must wait for the next I-frame to start displaying.
H.264 (AVC): The workhorse. Supported on literally everything: phones, browsers, TVs, toasters. Good compression but not the best by today's standards.
H.265 (HEVC): Roughly 50% better compression than H.264 at the same quality. The go-to for 4K content. Licensing is complicated (patent pools).
AV1: Open-source, royalty-free (backed by Google, Netflix, etc). Even better compression. Growing support, very high CPU cost to encode.
You can't just send one giant file. Modern streaming breaks video into chunks and adapts quality in real-time.
The same video is encoded at multiple quality levels, and each one is called a rendition. The full set is your ABR ladder (Adaptive Bitrate ladder).
A viewer on fast wifi gets 1080p; on a congested cellular connection they get 480p. The player switches dynamically.
The player continuously measures your available bandwidth and buffer level, then switches between renditions seamlessly. This is why Netflix quality fluctuates when your connection dips.
How switching works: Each rendition is split into aligned segments (same duration, same keyframe boundaries). The player can jump between renditions at any segment boundary without glitches.
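A minimal sketch of the selection step, assuming a simple throughput-based rule: pick the highest rendition whose bitrate fits within a safety fraction of the measured bandwidth. The ladder values, the 0.8 safety factor, and the names are hypothetical; real players combine this with buffer level and other signals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rendition:
    name: str
    bandwidth_bps: int  # peak bitrate advertised in the manifest


# Hypothetical ladder for illustration.
LADDER = [
    Rendition("480p", 800_000),
    Rendition("720p", 2_000_000),
    Rendition("1080p", 5_000_000),
]


def pick_rendition(measured_bps: float, ladder=LADDER, safety: float = 0.8) -> Rendition:
    """Return the highest rendition that fits within a fraction of measured bandwidth."""
    budget = measured_bps * safety
    ordered = sorted(ladder, key=lambda r: r.bandwidth_bps)
    fitting = [r for r in ordered if r.bandwidth_bps <= budget]
    # Fall back to the lowest rung when nothing fits.
    return fitting[-1] if fitting else ordered[0]


print(pick_rendition(10_000_000).name)  # 1080p
print(pick_rendition(3_000_000).name)   # 720p
print(pick_rendition(500_000).name)     # 480p
```

Because segments are aligned across renditions, the player can apply the new choice at the next segment boundary.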
HLS (HTTP Live Streaming): Apple's protocol, the most widely used. Splits video into small .ts or .m4s segments (2-6 sec each) with a .m3u8 playlist that tells the player what's available.
Uses a two-level manifest structure: master playlist (lists renditions) → child playlists (list segments for each rendition).
DASH (MPEG-DASH): Open standard (MPEG). Uses .mpd manifests (XML) and .m4s segments. Functionally similar to HLS but more flexible and used heavily on Android/smart TVs.
Single manifest file describes all AdaptationSets (video, audio) and Representations (renditions) in XML.
CMAF (Common Media Application Format): A unified format that works with BOTH HLS and DASH. Uses fragmented MP4 (.m4s) segments and can serve both protocols from the same encoded files.
Why it matters: Without CMAF, you'd package and store everything twice, once for HLS (.ts) and once for DASH (.m4s). CMAF = encode once, serve both.
DRM (Digital Rights Management): Encryption that prevents unauthorized copying. The video segments are encrypted; only authorized players with a license key can decrypt them.
Widevine (Google/Android/Chrome), FairPlay (Apple), PlayReady (Microsoft): most services use all three to cover every device.
Manifests are the "table of contents" that tell the video player everything it needs: what qualities exist, where the segments are, and when ads should play.
The top-level file the player fetches first. It lists all available renditions with their bandwidth and resolution, so the player can pick the right one.
#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,CODECS="avc1.64001f,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=854x480,CODECS="avc1.4d401e,mp4a.40.2"
480p/playlist.m3u8
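To make the structure of the master playlist concrete, here is a small parsing sketch. It is not a full HLS parser: it only handles `#EXT-X-STREAM-INF` lines followed by a URI, and the regular expression and function name are illustrative assumptions.

```python
import re

MASTER = """#EXTM3U
#EXT-X-STREAM-INF:BANDWIDTH=2000000,RESOLUTION=1280x720,CODECS="avc1.64001f,mp4a.40.2"
720p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=5000000,RESOLUTION=1920x1080,CODECS="avc1.640028,mp4a.40.2"
1080p/playlist.m3u8
#EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=854x480,CODECS="avc1.4d401e,mp4a.40.2"
480p/playlist.m3u8
"""

# KEY=value pairs; quoted values may contain commas.
ATTR = re.compile(r'([A-Z0-9-]+)=("[^"]*"|[^,]*)')


def parse_master(text: str) -> list[dict]:
    """Collect one dict of attributes (plus URI) per rendition."""
    variants = []
    pending = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-STREAM-INF:"):
            attrs = line.split(":", 1)[1]
            pending = {k: v.strip('"') for k, v in ATTR.findall(attrs)}
        elif line and not line.startswith("#") and pending is not None:
            # The line after the tag is the child playlist URI.
            pending["URI"] = line
            variants.append(pending)
            pending = None
    return variants


variants = sorted(parse_master(MASTER), key=lambda v: int(v["BANDWIDTH"]))
for v in variants:
    print(v["RESOLUTION"], v["BANDWIDTH"], v["URI"])
```

Sorting by `BANDWIDTH` shows that the order of entries in the file does not have to match the order of the ladder.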
#EXTM3U
File header. Identifies this as an HLS playlist.
#EXT-X-STREAM-INF
Describes a rendition: bandwidth, resolution, codecs used.
BANDWIDTH
Peak bitrate in bits/sec. The player uses this to decide which quality to pick.
CODECS
RFC 6381 codec string. Tells the player whether it can decode this stream.
Child playlist: One per rendition. Lists the actual video segments in order with their duration. For live streams, this file updates continuously as new segments become available.
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:6.006,
segment_001.ts
#EXTINF:6.006,
segment_002.ts
#EXTINF:5.839,
segment_003.ts
#EXT-X-ENDLIST
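The child playlist can be summarized with a few lines of code. This sketch (function name hypothetical) totals the `#EXTINF` durations, detects VOD via `#EXT-X-ENDLIST`, and checks the HLS rule that each duration, rounded to the nearest integer, must not exceed the target duration.

```python
MEDIA = """#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:6
#EXT-X-MEDIA-SEQUENCE:1
#EXTINF:6.006,
segment_001.ts
#EXTINF:6.006,
segment_002.ts
#EXTINF:5.839,
segment_003.ts
#EXT-X-ENDLIST
"""


def summarize_media_playlist(text: str) -> dict:
    durations, uris = [], []
    target = None
    ended = False
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            # Format: "#EXTINF:<duration>,<optional title>"
            durations.append(float(line[len("#EXTINF:"):].split(",", 1)[0]))
        elif line.startswith("#EXT-X-TARGETDURATION:"):
            target = int(line.split(":", 1)[1])
        elif line == "#EXT-X-ENDLIST":
            ended = True
        elif line and not line.startswith("#"):
            uris.append(line)
    return {
        "segments": uris,
        "total_seconds": round(sum(durations), 3),
        "target_duration": target,
        "is_vod": ended,
        "within_target": target is not None
        and all(round(d) <= target for d in durations),
    }


summary = summarize_media_playlist(MEDIA)
print(summary["total_seconds"], summary["is_vod"], summary["within_target"])
```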
#EXT-X-TARGETDURATION
Max segment duration in seconds. The player uses this for buffering decisions.
#EXT-X-MEDIA-SEQUENCE
Sequence number of the first segment. Critical for live streams to know position.
#EXTINF
Duration of the next segment in seconds.
#EXT-X-ENDLIST
Marks VOD (complete). Absent in live streams (playlist keeps growing).
SCTE-35 is the industry standard for signaling ad breaks, program boundaries, and other events in video streams. These markers tell downstream systems "an ad break should go here."
In HLS manifests, SCTE-35 appears as special tags that MediaTailor (or any SSAI system) reads to know where to splice in ads.
#EXTINF:6.006,
segment_044.ts
# Ad break starts here
#EXT-X-CUE-OUT:30.0
#EXTINF:6.006,
segment_045.ts
#EXT-X-CUE-OUT-CONT:ElapsedTime=6.006,Duration=30.0
#EXTINF:6.006,
segment_046.ts
# ...more ad segments...
#EXT-X-CUE-IN
#EXTINF:6.006,
segment_050.ts
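A sketch of how a downstream system might use these markers: walk the playlist and flag which segments fall inside the ad break. The cue tags are treated as simple on/off switches here, which is an assumption for illustration; a real SSAI system also uses the signaled durations.

```python
PLAYLIST = """#EXTINF:6.006,
segment_044.ts
#EXT-X-CUE-OUT:30.0
#EXTINF:6.006,
segment_045.ts
#EXT-X-CUE-OUT-CONT:ElapsedTime=6.006,Duration=30.0
#EXTINF:6.006,
segment_046.ts
#EXT-X-CUE-IN
#EXTINF:6.006,
segment_050.ts
"""


def classify_segments(text: str) -> list[tuple[str, bool]]:
    """Return (segment URI, is_in_ad_break) pairs in playlist order."""
    in_break = False
    result = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXT-X-CUE-OUT"):
            # Matches both CUE-OUT and CUE-OUT-CONT, so a player that
            # joins mid-break still knows it is inside the break.
            in_break = True
        elif line.startswith("#EXT-X-CUE-IN"):
            in_break = False
        elif line and not line.startswith("#"):
            result.append((line, in_break))
    return result


for uri, is_ad in classify_segments(PLAYLIST):
    print(uri, "AD BREAK" if is_ad else "content")
```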
#EXT-X-CUE-OUT:30
Start of ad break: "splice out" for 30 seconds.
#EXT-X-CUE-OUT-CONT
Continuation. Tells the player we're still in the ad break and how much time has elapsed.
#EXT-X-CUE-IN
End of ad break: "splice back in" to main content.
#EXT-X-KEY
DRM encryption info: method (AES-128, SAMPLE-AES), key URI, and IV. Tells the player how to decrypt segments.
#EXT-X-MAP
Points to the initialization segment (fMP4 header). Required for CMAF/fMP4 segments. Contains codec config, not video data.
#EXT-X-DISCONTINUITY
Signals a break in encoding parameters (codec, resolution change). Common at ad boundaries where the ad is encoded differently than content.
#EXT-X-PROGRAM-DATE-TIME
Wall-clock timestamp for a segment. Used for DVR/time-shift: lets the player show "10 minutes ago" labels.
#EXT-X-BYTERANGE
Multiple segments packed in one file. This tag says "read bytes X to Y." Reduces HTTP requests.
#EXT-X-DATERANGE
Metadata tied to a time range. Used for SCTE-35 in newer HLS specs, timed metadata events, and interstitials.
AWS provides purpose-built services for each step of the video pipeline. Here's what each one does.
MediaLive: Live video encoding in the cloud
Takes a live video input (camera, RTMP/RTP/HLS/MediaConnect) and encodes it in real-time into multiple renditions with ABR. Supports dual-pipeline for redundancy.
MediaConvert: File-based video transcoding
Serverless batch transcoding. Takes a source file from S3 and outputs multiple renditions in various formats. Pay per minute of video processed.
MediaPackage: Just-in-time packaging & origination
Receives encoded video and packages it into HLS, DASH, CMAF on-the-fly. Adds DRM, creates manifests, enables DVR/catchup-TV. Acts as the origin for your CDN.
MediaConnect: Reliable video transport
Broadcast-quality live video transport. Moves streams between AWS regions, to/from on-prem, with protocols like SRT, Zixi, RIST for error correction over the public internet.
MediaTailor: Server-side ad insertion & channel assembly
Reads SCTE-35 markers in manifests, fetches ads from your ad server (VAST/VMAP), and stitches them into the stream server-side. Also builds virtual linear channels from VOD assets.
Now let's see how these services combine into real workflows, from simple to complex.
Upload a file, transcode it, deliver via CDN. The most basic streaming setup.
How it works: A video file lands in S3. An EventBridge rule triggers a MediaConvert job that encodes it into multiple renditions as HLS (master + child manifests + segments). Output goes back to S3, and CloudFront distributes globally. Player fetches the master manifest, picks a rendition, and streams segments.
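The trigger in this workflow can be sketched as an EventBridge event pattern that matches S3 "Object Created" events. The helper below only builds the pattern as a Python dict; the function name, bucket name, and prefix are hypothetical, and it assumes the bucket has EventBridge notifications enabled. Creating the rule and the MediaConvert job itself is not shown.

```python
import json


def s3_upload_event_pattern(bucket: str, prefix: str = "") -> dict:
    """Build an EventBridge event pattern for new objects in one bucket."""
    pattern = {
        "source": ["aws.s3"],
        "detail-type": ["Object Created"],
        "detail": {"bucket": {"name": [bucket]}},
    }
    if prefix:
        # Only match keys under a given prefix, e.g. an uploads folder.
        pattern["detail"]["object"] = {"key": [{"prefix": prefix}]}
    return pattern


# Hypothetical bucket and prefix, for illustration only.
pattern = s3_upload_event_pattern("my-source-videos", prefix="uploads/")
print(json.dumps(pattern, indent=2))
```

The rule's target (for example a Lambda function that submits the MediaConvert job) would read the bucket and key from the matched event.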
Take a live camera feed and stream it to thousands of viewers in real-time.
How it works: A live feed arrives at MediaLive via RTMP or SRT. MediaLive encodes it in real-time into an ABR ladder and pushes to MediaPackage. MediaPackage generates live manifests (constantly updated with new segments), adds DRM if configured, and serves as the origin. CloudFront caches edge-close to viewers.
Broadcast-grade live streaming with reliable transport, redundancy, content protection, and monetization.
How it works: The live feed enters MediaConnect, which provides reliable transport with FEC error correction over the public internet. MediaLive encodes with dual pipelines (automatic failover) and passes through SCTE-35 ad markers. MediaPackage packages with Widevine/FairPlay DRM and enables a DVR window. MediaTailor reads SCTE-35 cue-out/cue-in markers, fetches personalized ads from your VAST ad server, and stitches them in seamlessly per viewer.
Complete media platform with live events, VOD library, and linear FAST channels, all monetized.
How it works: This is a complete OTT platform. Live → MediaConnect → MediaLive. VOD → MediaConvert. Both paths converge at MediaPackage (DRM + unified manifests). MediaTailor inserts ads AND assembles FAST (Free Ad-Supported Streaming Television) channels by stitching VOD assets + live feeds into a 24/7 linear schedule. CloudFront delivers to every device type globally.
Every video stream has audio, and audio has its own codecs, channel layouts, and standards that matter just as much as the visual side.
AAC: The default. Great quality at low bitrates (128-256 kbps). Universal support. Used in HLS, MP4, DASH.
Dolby Digital (AC-3): 5.1 surround sound. Standard for broadcast TV and Blu-ray. 384-640 kbps.
Dolby Digital Plus (EC-3): Improved AC-3 with higher efficiency. Supports 7.1 and object-based audio (Atmos). Used on Netflix, Disney+.
Dolby Atmos: Object-based audio embedded in EC-3. Sound can be placed in 3D space. Requires compatible soundbar/headphones.
Opus: Open-source, royalty-free. Excellent at all bitrates. Used in WebRTC and some DASH streams.
Sample rate = how many times per second audio is measured. 48 kHz is standard for video (48,000 samples/sec). CD uses 44.1 kHz.
Bit depth = precision of each sample. 16-bit = CD quality (96 dB dynamic range). 24-bit = professional/studio (144 dB range).
More samples × more bits = bigger file. But after encoding with AAC/AC-3, the final bitrate matters more than the source format.
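The relationship between sample rate, bit depth, and size can be worked through directly. This sketch computes the uncompressed (PCM) bitrate and the theoretical dynamic range for a given bit depth; the function names are hypothetical and the 128 kbps AAC comparison is just one illustrative target.

```python
import math


def pcm_bitrate_bps(sample_rate_hz: int, bit_depth: int, channels: int) -> int:
    """Uncompressed audio bitrate: samples/sec x bits/sample x channels."""
    return sample_rate_hz * bit_depth * channels


def dynamic_range_db(bit_depth: int) -> float:
    """Theoretical dynamic range, roughly 6 dB per bit."""
    return 20 * math.log10(2 ** bit_depth)


stereo = pcm_bitrate_bps(48_000, 16, 2)
print(stereo)                        # 1536000 bits/sec
print(round(dynamic_range_db(16)))   # 96
print(round(dynamic_range_db(24)))   # 144
print(stereo / 128_000)              # 12.0 (times larger than 128 kbps AAC)
```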
Mono (1.0): Single channel. Used for voice/commentary tracks.
Stereo (2.0): Left + Right. The baseline for streaming.
5.1 Surround: Front L/C/R + Rear L/R + Subwoofer. Standard for cinema/broadcast.
7.1: Adds side speakers. Premium home theater.
Atmos (object-based): Not channels but "objects" with 3D positions. Renderer maps to whatever speakers you have.
Without standards, ads would blast at max volume while content is quiet. Loudness normalization fixes this.
EBU R128: European broadcast standard. Target: -23 LUFS (Loudness Units relative to Full Scale).
ATSC A/85: US broadcast standard. Target: -24 LKFS (same unit, different name).
Why it matters: MediaLive has audio normalization filters. If your output fails loudness specs, broadcasters will reject it.
The delay between something happening live and the viewer seeing it. Shorter latency = harder engineering problem.
Total glass-to-glass latency is the sum of every step in the pipeline:
Typical total: Normal HLS = 20-40 sec. Low-Latency HLS = 3-5 sec. WebRTC = <1 sec.
Standard HLS/DASH with 6-second segments. The player buffers 3-5 segments before starting.
Pros: Rock-solid reliability, works everywhere, CDN-friendly, tolerates network jitter.
Use case: VOD, linear TV, non-interactive streams where delay doesn't matter.
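The player-side share of that latency follows from simple multiplication. A minimal sketch, assuming the player holds a fixed number of full segments before starting (the function name is hypothetical):

```python
def buffer_latency_seconds(segment_seconds: float, buffered_segments: int) -> float:
    """Delay added by the player's startup buffer alone."""
    return segment_seconds * buffered_segments


print(buffer_latency_seconds(6, 3))  # 18
print(buffer_latency_seconds(6, 5))  # 30
print(buffer_latency_seconds(2, 3))  # 6
```

With 6-second segments the buffer alone accounts for 18-30 seconds, which is why shorter segments are the first lever for reducing delay.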
LL-HLS and LL-DASH use partial segments (0.3-1 sec chunks) delivered before the full segment is complete. Player starts earlier with less buffer.
Key tech: Chunked Transfer Encoding, #EXT-X-PART tags, blocking playlist reload, preload hints.
Use case: Live sports, auctions, watch parties, where a few seconds of delay is acceptable.
WebRTC: Peer-to-peer, sub-second latency. No segments, no manifests. Direct media delivery. Doesn't scale easily via CDN.
SRT/RIST: Used for contribution (camera → cloud), not distribution. UDP-based with error correction.
Use case: Video calls, interactive live (betting, gaming), remote production.
Different protocols for different stages. Contribution (getting video to the cloud) uses different tech than distribution (sending to viewers).
Contribution = getting the raw/mezzanine video from the source (camera, venue) to the encoder/cloud. Needs reliability, not scale. Point-to-point.
Distribution = delivering the final stream from origin/CDN to millions of viewers. Needs scale, not point-to-point reliability. HTTP-based.
RTMP: Introduced by Macromedia (later Adobe) in 2002. TCP-based. Still the most common way to push a live stream from an encoder (OBS, Wirecast) to a service.
Pros: Universal encoder support, simple push model.
Cons: TCP = head-of-line blocking under packet loss. In practice limited to H.264 + AAC. No built-in encryption in plain RTMP (RTMPS adds TLS). Being phased out for ingest but still dominant.
SRT (Secure Reliable Transport): Open-source, UDP-based protocol designed by Haivision. Handles packet loss with ARQ (retransmission). AES-128/256 encryption built in.
Pros: Works over unpredictable internet, encrypted, low overhead, supports H.265.
Cons: Newer, so not as universally supported as RTMP in legacy gear.
AWS: MediaConnect and MediaLive both accept SRT input. This is the preferred contribution protocol.
RIST (Reliable Internet Stream Transport): Industry specification from the Video Services Forum (VSF), competing with SRT. Also UDP + ARQ retransmission. Interoperable between vendors by design.
Pros: Standards-body backed, multi-vendor interop, profile-based (Simple, Main, Advanced).
Cons: Less community adoption than SRT, more complex profiles.
RTP (Real-time Transport Protocol): bare UDP packets with sequence numbers. Used in professional broadcast (SDI over IP / SMPTE 2110).
RTSP: Control protocol that manages RTP streams (play, pause, seek). Used by IP cameras.
MediaLive accepts RTP input for professional contribution feeds.
WebRTC: Browser-native real-time protocol. Sub-second latency. Uses SRTP (encrypted RTP) + ICE/STUN/TURN for NAT traversal.
Use case: Video conferencing, interactive live streaming. Not a contribution protocol for broadcast; it's end-to-end.
Limitation: Doesn't scale via CDN easily. Each viewer is a peer connection. Solutions like Amazon IVS use WebRTC for low-latency at scale.
Text tracks are not optional: they're required by law in many contexts (FCC, ADA, EAA). Here's how they work in streaming.
Subtitles = translation of dialogue for viewers who don't speak the language. Assumes you can hear.
Closed Captions (CC) = transcription of ALL audio: dialogue, sound effects, music cues ("[door slams]"). For deaf/hard-of-hearing viewers.
Open captions = burned into the video pixels permanently. Can't be turned off.
Closed captions = separate data track. Viewer toggles on/off. This is what streaming uses.
CEA-608/708: US broadcast standard. Embedded in the video stream (in SEI NAL units for H.264). Carried through .ts segments. Legacy but required for US broadcast.
WebVTT: Web standard. Plain text file with timestamps. Used in HLS as sidecar files. Clean, simple, widely supported.
TTML / IMSC: XML-based. Used in DASH and for interchange. Supports rich styling, positioning, regions. IMSC is the profile for streaming.
SRT (SubRip): Simple text format. Common for file exchange but not used directly in streaming protocols.
Embedded: Captions inside the video stream itself (CEA-608/708 in .ts segments). Player extracts and renders them. No extra HTTP requests.
Sidecar: Captions in separate files referenced by the manifest. Player fetches alongside video. More flexible (add languages without re-encoding).
In HLS, sidecar captions use #EXT-X-MEDIA:TYPE=SUBTITLES in the master manifest, pointing to a .m3u8 with .vtt segment files.
MediaLive: Passthrough embedded 608/708, convert between formats, or burn in for preview outputs.
MediaConvert: Extracts embedded captions, converts to WebVTT/TTML sidecar, or burns in. Supports SCC, SRT, STL input.
MediaPackage: Passes through captions. Sidecar WebVTT tracks appear in HLS manifests as separate renditions.
Choosing the right encoding settings is the difference between wasting bandwidth and delivering sharp video.
CBR (Constant Bitrate): Same bitrate every second. Wastes bits on static scenes, starves complex scenes. Predictable file size. Used in broadcast.
VBR (Variable Bitrate): More bits for complex scenes, fewer for simple ones. Better quality-per-bit but unpredictable size.
QVBR (Quality-Defined VBR): AWS innovation. Set a quality level (1-10) + max bitrate ceiling. Encoder uses only what's needed. Best of both worlds.
VMAF: Netflix's perceptual quality metric. Score 0-100. 93+ is excellent. Industry standard.
PSNR: Mathematical pixel comparison. Higher = closer to original. Fast but doesn't always match human perception.
SSIM: Measures structural information loss. Better correlation with human eyes than PSNR. Score 0-1.
In practice: Use VMAF for final quality decisions. Target 93+ for premium, 85+ for mobile.
Animated shows need less bitrate than live sports at the same resolution. Smart encoding adapts the ladder per content.
Static ladder: Same bitrates for everything. Simple but wasteful.
Per-title: Analyze complexity first, then pick optimal bitrate per rendition. A cartoon might need 3 Mbps at 1080p while sports needs 10 Mbps.
AWS QVBR is content-adaptive: it adjusts bitrate based on scene complexity within your quality target.
• Each rendition should be perceptually different from adjacent rungs.
• Lowest rung = watchable on a phone (360p @ 0.5 Mbps).
• Highest rung = match source quality. Never upscale.
• Spacing = ~1.5-2x bitrate between rungs for smooth switching.
• Include an audio-only fallback for extremely poor connections.
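The spacing rule above is easy to check mechanically. This sketch flags adjacent rungs whose bitrate ratio falls outside the 1.5-2x range; the example ladders and function name are hypothetical.

```python
def badly_spaced_rungs(bitrates_bps, min_ratio=1.5, max_ratio=2.0):
    """Return (lower, higher) pairs whose bitrate ratio is outside the range."""
    ordered = sorted(bitrates_bps)
    return [
        (lo, hi)
        for lo, hi in zip(ordered, ordered[1:])
        if not min_ratio <= hi / lo <= max_ratio
    ]


good = [500_000, 1_000_000, 2_000_000, 3_500_000, 6_000_000]
bad = [500_000, 2_000_000, 2_200_000]

print(badly_spaced_rungs(good))  # []
print(badly_spaced_rungs(bad))   # [(500000, 2000000), (2000000, 2200000)]
```

In the second ladder the first gap is too wide (a 4x jump is a visible quality drop) and the second is too narrow (the two rungs look nearly identical).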
A CDN caches your video at edge locations worldwide so viewers get content from nearby servers, not your origin.
CloudFront has 400+ edge locations. Viewers get segments from the nearest edge. On cache miss, it fetches from origin once and caches for subsequent requests.
Key cache behaviors for video:
• Segments (.ts, .m4s): Cache aggressively (high TTL). Same segment serves millions of viewers.
• Manifests (.m3u8, .mpd): Brief cache for live (1-3s TTL), long for VOD. Live manifests update every segment duration.
• Personalized manifests (SSAI): Never cache. Each viewer gets different ad URLs stitched in.
Origin Shield: Extra caching layer between edge locations and origin. All edges in a region check the shield first.
Without: 50 edges each ask origin = 50 requests per cache miss.
With: 50 edges ask shield, shield asks origin once = 1 request. Massive origin load reduction.
VOD segments: 1 year. They never change.
Live segments: TTL = segment duration (6s). Immutable once created.
Live manifests: 1 second or half segment duration. Must refresh frequently.
VOD manifests: 1 day+. Content is static.
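These TTL rules can be expressed as a small lookup. The sketch below maps a request path to a TTL using the values listed above; the function name is hypothetical, and in practice the result would be applied through CDN cache policies or `Cache-Control` headers at the origin.

```python
ONE_DAY = 86_400
ONE_YEAR = 31_536_000


def cache_ttl_seconds(path: str, live: bool, segment_seconds: float = 6.0) -> float:
    """Pick a TTL (seconds) from file type and whether the stream is live."""
    if path.endswith((".m3u8", ".mpd")):
        # Live manifests change constantly; VOD manifests are static.
        return 1.0 if live else ONE_DAY
    if path.endswith((".ts", ".m4s")):
        # Segments never change once written.
        return segment_seconds if live else ONE_YEAR
    raise ValueError(f"not a manifest or segment: {path}")


print(cache_ttl_seconds("live/720p/playlist.m3u8", live=True))   # 1.0
print(cache_ttl_seconds("live/720p/segment_045.ts", live=True))  # 6.0
print(cache_ttl_seconds("vod/movie/master.m3u8", live=False))    # 86400
print(cache_ttl_seconds("vod/movie/seg_001.m4s", live=False))    # 31536000
```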
Large platforms use multiple CDN providers simultaneously:
• Active-active: Route to fastest CDN via DNS or client logic.
• Failover: Shift traffic if one CDN degrades.
• Cost optimization: Route by pricing tier per region.
Players like hls.js support mid-stream CDN switching on segment failures.
How ads get into video streams: the protocols, the players, and the trade-offs between client-side and server-side insertion.
VAST (Video Ad Serving Template): XML response that describes a single ad: what creative to play, its duration, tracking pixels, click-through URL.
MediaTailor calls the ad server, gets VAST XML back, extracts the video URL, and stitches that video into the stream at the ad break point.
Contains: MediaFile URL, duration, impression trackers, click URL, companion ads.
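To show where those fields live, here is a sketch that pulls them out of a minimal VAST-style document with the standard library. The sample XML, URLs, and function name are made up for illustration, and real responses (wrappers, multiple media files, error handling) need more care.

```python
import xml.etree.ElementTree as ET

VAST = """<VAST version="3.0">
  <Ad id="demo-ad">
    <InLine>
      <AdSystem>ExampleAdServer</AdSystem>
      <AdTitle>Demo</AdTitle>
      <Impression><![CDATA[https://ads.example.com/impression]]></Impression>
      <Creatives>
        <Creative>
          <Linear>
            <Duration>00:00:30</Duration>
            <VideoClicks>
              <ClickThrough><![CDATA[https://example.com/landing]]></ClickThrough>
            </VideoClicks>
            <MediaFiles>
              <MediaFile delivery="progressive" type="video/mp4" width="1920" height="1080"><![CDATA[https://ads.example.com/ad.mp4]]></MediaFile>
            </MediaFiles>
          </Linear>
        </Creative>
      </Creatives>
    </InLine>
  </Ad>
</VAST>"""


def parse_vast(xml_text: str) -> dict:
    root = ET.fromstring(xml_text)
    linear = root.find(".//Linear")
    # Duration is formatted as HH:MM:SS.
    h, m, s = linear.findtext("Duration").strip().split(":")
    return {
        "duration_seconds": int(h) * 3600 + int(m) * 60 + float(s),
        "media_url": linear.findtext("MediaFiles/MediaFile").strip(),
        "impressions": [e.text.strip() for e in root.iter("Impression")],
        "click_through": linear.findtext("VideoClicks/ClickThrough").strip(),
    }


ad = parse_vast(VAST)
print(ad["duration_seconds"], ad["media_url"])
```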
VMAP (Video Multiple Ad Playlist): Wraps around VAST. Defines WHEN ad breaks should happen for VOD content that has no embedded SCTE-35 markers.
timeOffset="start" = pre-roll. timeOffset="00:05:00" = mid-roll at 5 min. timeOffset="end" = post-roll.
Each break points to a VAST URL for the actual ad creative. MediaTailor supports both VAST and VMAP.
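The three `timeOffset` forms mentioned above can be classified with a few lines. This sketch handles only "start", "end", and HH:MM:SS values; the function name is hypothetical, and other forms VMAP allows (such as percentages) are deliberately left out.

```python
def classify_time_offset(value: str):
    """Map a VMAP timeOffset to (break type, offset in seconds or None)."""
    if value == "start":
        return ("pre-roll", 0.0)
    if value == "end":
        # Position depends on the content length, which is not known here.
        return ("post-roll", None)
    h, m, s = value.split(":")
    return ("mid-roll", int(h) * 3600 + int(m) * 60 + float(s))


print(classify_time_offset("start"))     # ('pre-roll', 0.0)
print(classify_time_offset("00:05:00"))  # ('mid-roll', 300.0)
print(classify_time_offset("end"))       # ('post-roll', None)
```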
Client-side ad insertion (CSAI): The player itself fetches ads from an ad server and plays them locally. Traditional web/mobile approach.
Pros: Mature ecosystem, interactive overlays, companion ads, viewability measurement.
Cons: Ad blockers defeat it entirely. Quality/resolution mismatch between ad and content. Buffering at ad transitions. Detectable by the client.
Server-side ad insertion (SSAI): Ads are stitched into the video stream on the server. The player sees one continuous stream and can't distinguish ads from content.
Pros: Ad-blocker proof, broadcast-quality transitions, works on all devices (smart TVs, Roku), consistent quality.
Cons: No client-side interactivity, harder viewability measurement, server cost. This is what MediaTailor does.
Ad pod = multiple ads played in sequence during a single break (like a TV commercial break).
Competitive separation: Don't show two competing brands (Coke then Pepsi) in the same pod.
Frequency capping: Limit how many times one viewer sees the same ad per session/day.
Fill rate: What percentage of ad breaks actually get filled with ads (vs showing slate). Target: 90%+.
FAST (Free Ad-Supported Streaming Television): Virtual linear channels assembled from VOD content, monetized with ads. Like traditional TV but over streaming.
How it works: MediaTailor Channel Assembly takes VOD assets, arranges them into a 24/7 schedule, inserts SCTE-35 markers at break points, and serves a live HLS manifest. Viewers tune in and see a "live" channel.
Examples: Pluto TV, Tubi, Samsung TV+, Amazon Freevee channels.
The player is the most complex piece of the puzzle on the client side. It handles manifest parsing, ABR decisions, buffering, DRM, and rendering.
Every stream playback follows this loop:
This loop repeats continuously. Every segment download, the player re-evaluates bandwidth and may switch renditions.
Why buffering happens: The player downloads segments faster than playback consumes them (building a buffer). If download speed drops below playback speed, the buffer drains and playback stalls.
Forward buffer = seconds of video downloaded but not yet played. Typical: 30s for VOD, 10s for live.
ABR logic: If buffer is low, switch DOWN to a lower rendition (downloads faster). If buffer is full, switch UP for better quality.
Rebuffering ratio = time spent buffering / total playback time. Target: <0.5%.
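A sketch of the buffer-driven rule and the rebuffering metric described above. The 5-second and 20-second thresholds are hypothetical, and rung indices count from 0 (lowest quality); production players blend this with throughput estimates.

```python
def next_rung(current: int, buffer_seconds: float, rung_count: int,
              low: float = 5.0, high: float = 20.0) -> int:
    """Step down when the buffer is low, up when it is comfortably full."""
    if buffer_seconds < low:
        return max(current - 1, 0)
    if buffer_seconds > high:
        return min(current + 1, rung_count - 1)
    return current


def rebuffering_ratio(stall_seconds: float, playback_seconds: float) -> float:
    """Time spent buffering divided by total playback time."""
    return stall_seconds / playback_seconds


print(next_rung(2, buffer_seconds=3.0, rung_count=4))    # 1 (switch down)
print(next_rung(2, buffer_seconds=25.0, rung_count=4))   # 3 (switch up)
print(next_rung(2, buffer_seconds=12.0, rung_count=4))   # 2 (hold)

ratio = rebuffering_ratio(stall_seconds=9, playback_seconds=3600)
print(f"{ratio:.2%}")  # 0.25%, within the <0.5% target
```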
hls.js: JavaScript HLS player. Runs in any browser via MSE. The most popular choice for web HLS playback. Handles ABR, subtitles, DRM (via EME).
dash.js: Reference DASH player by the DASH Industry Forum. Full DASH spec support including low-latency modes.
Shaka Player: Open-source player maintained by Google. Supports BOTH HLS and DASH, plus offline/download.
Video.js: Player UI framework. Wraps hls.js/dash.js with a consistent UI, plugin system, and analytics hooks.
Players use EME (Encrypted Media Extensions), a browser API that talks to the device's DRM module (CDM).
Flow: Player sees EXT-X-KEY or ContentProtection → requests a license from the license server → CDM decrypts segments in a secure sandbox → decoded frames go to the screen.
The player code never sees decrypted content: the CDM handles it in hardware or a trusted execution environment. This is why DRM works even on "open" platforms.