The response above is generally correct, as long as you differentiate between the first video packet and the first audio packet. This is important because audio packets generally have the same PTS and DTS (and never use 0 as their starting DTS).
Let me give you a concrete example where this could fail (real example from OBS):
- Packet 1 (Video): PTS: 33, DTS: 0 (start_pts=33, start_dts=0) => PTS: 0, DTS: 0 (this is already a problem, because the decoding time now overlaps the presentation time)
- Packet 2 (Video): PTS: 100, DTS: 17 => PTS: 67, DTS: 17
- Packet 3 (Video): PTS: 66, DTS: 33 => PTS: 33, DTS: 33 (another overlap)
- Packet 4 (Audio): PTS: 33, DTS: 33 => PTS: 0, DTS: 33 (PTS is now lower than DTS: you are asking the decoder to present the packet now but decode it in the future)
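The error that appears in this case is: pts (0) < dts (2970) in stream 1. The values in the message differ from the ones listed above; 2970 is consistent with a DTS of 33 ms expressed in a 90 kHz time base.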
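The failure above can be reproduced with a small simulation. This is only a sketch in plain Python, not FFmpeg code: the `Packet` class and the `rebase_shared`, `rebase_per_stream` and `invalid` helpers are hypothetical names introduced for illustration. It assumes the fix is to track the first PTS and DTS separately for each stream, which is how I read "differentiate the first video packet and the first audio packet".

```python
from dataclasses import dataclass


@dataclass
class Packet:
    stream: str  # "video" or "audio" (illustrative labels)
    pts: int
    dts: int


# The four packets from the OBS example above.
PACKETS = [
    Packet("video", 33, 0),
    Packet("video", 100, 17),
    Packet("video", 66, 33),
    Packet("audio", 33, 33),
]


def rebase_shared(packets):
    """Subtract the very first packet's PTS/DTS from every packet,
    regardless of which stream the packet belongs to."""
    start_pts, start_dts = packets[0].pts, packets[0].dts
    return [Packet(p.stream, p.pts - start_pts, p.dts - start_dts)
            for p in packets]


def rebase_per_stream(packets):
    """Subtract each stream's own first PTS/DTS instead."""
    starts = {}
    rebased = []
    for p in packets:
        # Remember the first (pts, dts) pair seen for this stream.
        start_pts, start_dts = starts.setdefault(p.stream, (p.pts, p.dts))
        rebased.append(Packet(p.stream, p.pts - start_pts, p.dts - start_dts))
    return rebased


def invalid(packets):
    """Return the packets that a muxer would reject because pts < dts."""
    return [p for p in packets if p.pts < p.dts]


if __name__ == "__main__":
    print("shared offsets:    ", invalid(rebase_shared(PACKETS)))
    print("per-stream offsets:", invalid(rebase_per_stream(PACKETS)))
```

With shared offsets, the audio packet ends up with PTS 0 and DTS 33 and is flagged; with per-stream offsets, no packet has a PTS below its DTS. Note that rebasing each stream to zero independently discards any initial offset between audio and video, so it is only safe when both streams are meant to start together.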