An H.26x video stream is:
A time‑ordered series of compressed pictures, plus instructions describing how to decode them, packaged in a way that works for files, networks, and live streaming.
When debugging, you are usually asking one of these questions:
- Why won’t this stream decode? → parameter sets, NAL units, reference frames
- Why does playback stutter or reorder frames? → PTS vs DTS, B‑frames
- Why can’t I start decoding mid‑stream? → missing SPS / IDR
- Why is video timing wrong? → time_base / time_scale confusion
Keep those questions in mind as we walk through the pieces.
NAL units (Network Abstraction Layer units) are containers that let encoded video data travel over:
- files (MP4, MKV)
- networks (RTP, RTSP)
- transport streams (MPEG‑TS)
A NAL unit is like an envelope.
Inside the envelope is either:
- part of a picture, or
- decoder instructions (metadata)
Every H.264/H.265 stream is fundamentally:
NAL | NAL | NAL | NAL | ...
| NAL Type | Meaning |
|---|---|
| Slice NALs | Actual compressed picture data |
| SPS | Sequence Parameter Set (global config) |
| PPS | Picture Parameter Set (per‑picture config) |
| SEI | Supplemental info (timing, HDR, captions, etc.) |
| IDR slice | Special “reset” picture |
In Annex B streams (raw .h264, MPEG‑TS):
00 00 00 01 [NAL header][payload]
In MP4/MKV:
[length][NAL payload]
✅ Debugging tip:
If your decoder says “missing SPS”, it literally hasn’t seen the right NAL unit yet.
This trips up almost everyone at first.
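To make the "stream = NAL | NAL | NAL" picture concrete, here is a minimal sketch that scans an Annex B buffer for start codes and reports each H.264 NAL type. The buffer contents are made up, and real parsers also have to handle emulation-prevention bytes inside payloads, which this skips:

```python
# Minimal Annex B NAL scanner (H.264). Illustrative sketch only:
# it finds 00 00 01 start codes and reads nal_unit_type from the
# low 5 bits of the first header byte. Emulation-prevention bytes
# and 4- vs 3-byte start-code trimming are glossed over.

H264_NAL_NAMES = {1: "non-IDR slice", 5: "IDR slice", 6: "SEI", 7: "SPS", 8: "PPS"}

def iter_nal_units(data: bytes):
    """Yield (nal_type, nal_bytes) for each NAL unit in an Annex B buffer."""
    starts = []
    i = 0
    while i + 3 <= len(data):
        if data[i:i + 3] == b"\x00\x00\x01":
            starts.append(i + 3)
            i += 3
        else:
            i += 1
    for n, start in enumerate(starts):
        end = starts[n + 1] - 3 if n + 1 < len(starts) else len(data)
        nal = data[start:end]
        yield nal[0] & 0x1F, nal  # low 5 bits of first byte = nal_unit_type

# Two fake NAL units: an SPS (header 0x67 → type 7), then an IDR slice (0x65 → type 5).
buf = b"\x00\x00\x00\x01\x67\xAA\xBB" + b"\x00\x00\x01\x65\xCC"
for nal_type, nal in iter_nal_units(buf):
    print(nal_type, H264_NAL_NAMES.get(nal_type, "other"))  # → 7 SPS, then 5 IDR slice
```

A scanner like this is often the fastest way to confirm whether an SPS is actually present in the bytes you are feeding the decoder.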
An access unit is “everything needed to decode ONE displayed picture”.
An access unit may contain:
- multiple slice NALs (one frame split into chunks)
- optional SPS/PPS
- SEI
- exactly one picture’s worth of image data
So:
[ Access Unit ] [ Access Unit ] [ Access Unit ]
Not:
[ Frame ][ Frame ][ Frame ]
✅ Debugging tip:
If you’re counting NALs and expecting “1 NAL = 1 frame”, timestamps will never make sense.
Instead of encoding every frame fully, H.26x encodes differences.
| Type | Depends on | Explanation |
|---|---|---|
| I (Intra) | nothing | A self‑contained image |
| P (Predicted) | past frame(s) | “What changed since before?” |
| B (Bi‑predicted) | past and future frames | “What changed between before and after?” |
B‑frames:
- compress very well
- require future frames to decode
That single fact explains:
- decode order vs display order
- DTS vs PTS
- reordering bugs
✅ Debugging tip:
If you remove B‑frames, PTS == DTS and life gets much simpler.
Frames are not always decoded in the order they are shown.
Example (display order):
I B B P
To decode the B‑frames, the decoder must first decode the P‑frame:
Decode order: I → P → B → B
Display order: I → B → B → P
Because B‑frames reference the future.
✅ Debugging tip:
If frames appear out of order unless you respect PTS, this is why.
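The decode-vs-display example above can be sketched in a few lines: frames arrive in decode order, and sorting by PTS recovers display order. The frame letters and timestamp values are made up for illustration:

```python
# Frames as the decoder receives them (decode order), with made-up PTS values.
decode_order = [("I", 0), ("P", 3), ("B", 1), ("B", 2)]  # (pict_type, pts)

# Sorting by PTS recovers display order.
display_order = sorted(decode_order, key=lambda f: f[1])

print([t for t, _ in decode_order])   # → ['I', 'P', 'B', 'B']
print([t for t, _ in display_order])  # → ['I', 'B', 'B', 'P']
```

This is exactly the reordering a player performs; skip the sort (i.e. ignore PTS) and you render the P-frame too early.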
PTS vs DTS is the most important concept for playback issues.
- DTS (Decode Time Stamp): “When must this be decoded?”
- PTS (Presentation Time Stamp): “When must this be shown?”
DTS is for the decoder; PTS is for the viewer.
With no B‑frames:
PTS == DTS
With B‑frames:
DTS: I P B B
PTS: I B B P
Typical symptoms:
- Missing DTS → decoder stalls
- Wrong PTS → jerky playback
- Constant DTS but reordered PTS → frames jump
✅ Debugging tip:
If video is smooth but frames are wrong, inspect PTS.
If video freezes or never decodes, inspect DTS.
IDR frames exist for random access and error recovery.
An IDR frame is a hard reset point for the decoder.
At an IDR:
- all reference history is discarded
- decoding can start cleanly
- no older frames are needed
- I‑frame: intra‑coded
- IDR frame: intra‑coded and resets reference state
✅ Debugging tip:
If you join a live stream mid‑way and see garbage until the next IDR — that’s expected.
The SPS is the global configuration for decoding.
Contains things like:
- resolution
- profile / level
- reference frame limits
- timing info (frame rate hints)
SPS is the decoder’s instruction manual.
Without it:
- decoder does not know how to interpret the bitstream
- even perfect frame data is useless
Parameter sets are:
- sent at stream start
- often resent before IDRs
- must be available before decoding frames
✅ Debugging tip:
“No SPS found” = decoder literally does not know the video’s shape.
Time bases are subtle and often mis‑implemented.
Timestamps are integers, but they represent time using a scale.
Example:
time_scale = 90000
PTS = 180000
→ 2 seconds
- MP4: timescale
- RTP: clock rate (often 90 kHz)
- FFmpeg: time_base
Timestamps are “ticks” — the time scale tells you how many ticks make up one second, so one tick lasts 1/time_scale seconds.
Common mistakes:
- Mixing 1/1000 vs 1/90000
- Assuming timestamps are milliseconds
- Rescaling incorrectly when remuxing
✅ Debugging tip:
If audio and video slowly drift apart, your time base math is wrong.
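The rescaling mistake above is worth seeing in code. This is a sketch of what FFmpeg's av_rescale_q does, using exact fractions so the ratio between bases is computed without float drift (the function name and constants here are illustrative, not a real API):

```python
# Sketch of time-base rescaling: a timestamp is meaningless without its
# time base, so converting between bases multiplies by their ratio.
from fractions import Fraction

def rescale(ts: int, src_base: Fraction, dst_base: Fraction) -> int:
    """Convert integer ticks from src_base (seconds per tick) to dst_base."""
    return round(ts * src_base / dst_base)

MPEG_90K = Fraction(1, 90000)  # RTP / MPEG-TS clock: 90000 ticks per second
MS = Fraction(1, 1000)         # millisecond time base

print(rescale(180000, MPEG_90K, MS))  # → 2000  (180000 ticks @ 90 kHz = 2 s = 2000 ms)
```

Doing this with floats, or assuming every timestamp is in milliseconds, is exactly how audio and video end up drifting apart over long playback.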
H.265 (HEVC) keeps the same concepts but adds:
- VPS (Video Parameter Set) above SPS
- More frame types internally
- Higher compression → more reordering
But:
PTS/DTS, NAL units, IDR, SPS all work the same way.
Think of an H.26x stream like this:
- SPS/PPS: “Here’s how to decode what follows”
- IDR frame: “You can start decoding now”
- P/B frames: “Here are changes relative to others”
- DTS: “Decode in this order”
- PTS: “Show in this order”
- time_scale: “Here’s how fast time moves”
If any of those are missing or inconsistent, playback breaks.
Since you’re new but clearly technical, the fastest way to solidify this:
- Use ffprobe -show_frames -show_packets
- Look at: pict_type, pts, dts, key_frame
- Compare decode vs display order
- Try the same stream with and without B‑frames
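Once you have ffprobe output, the JSON form (ffprobe -show_frames -of json) is the easiest to script against. The sample below is hand-written to mimic its shape; exact field names (e.g. pts vs pkt_pts) vary across ffprobe versions, so check your own dump first:

```python
# Sketch: inspecting frame metadata in the shape ffprobe's JSON output uses.
# The sample data is hand-written for illustration, not real ffprobe output.
import json

sample = json.dumps({"frames": [
    {"pict_type": "I", "pts": 0,    "key_frame": 1},
    {"pict_type": "P", "pts": 3000, "key_frame": 0},
    {"pict_type": "B", "pts": 1000, "key_frame": 0},
    {"pict_type": "B", "pts": 2000, "key_frame": 0},
]})

frames = json.loads(sample)["frames"]          # frames arrive in decode order
for f in sorted(frames, key=lambda f: f["pts"]):  # sort by PTS → display order
    print(f["pict_type"], f["pts"], "keyframe" if f["key_frame"] else "")
```

Comparing the list as-delivered against the PTS-sorted list makes the decode-vs-display reordering visible in your own streams.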
If you want, next we can:
- walk through a real ffprobe dump line‑by‑line
- explain Annex B vs MP4 in detail
- map RTP timestamps → PTS/DTS
- or explain how encoders decide GOP structure
Just tell me where you want to go deeper.