RTP can be reliably auto-detected without any SIP (or SAP/SDP) signaling. It takes a heuristic algorithm and very careful payload parsing, in some cases going down the wrong path for a bit, recognizing the mistake, and then re-adjusting. In this C++ source code, the function detect_codec_type_and_bitrate() does this. Note that it looks for video RTP streams first, because they forbid start code sequences (which of course will show up in just about any other random stream), and then checks for various voice codec types (which is harder). Audio and video command line examples are here. The idea is to recognize on the fly any new session in a UDP or pcap flow, create a session for it, detect the codec type, and then go from there (extract bitstreams, decode, merge streams, ASR, etc.). We have tested this on more than 150 pcaps collected over 5 years from users, plus several hundred more that we generate synthetically. It's deployed by F500s and it works.
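To give a feel for the first stage of this kind of heuristic, here is a minimal sketch of the structural checks a UDP payload has to pass before it is worth treating as an RTP candidate. This is not the code of detect_codec_type_and_bitrate(); the names `RtpInfo`, `parse_rtp_candidate`, `contains_start_code`, and `seq_follows` are hypothetical and exist only for illustration. The header layout follows RFC 3550, and the rejected payload type range follows RFC 3551.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>

// Fields of interest from the fixed RTP header (RFC 3550, section 5.1).
// Hypothetical struct, for illustration only.
struct RtpInfo {
    std::uint8_t  payload_type;
    std::uint16_t seq;
    std::uint32_t timestamp;
    std::uint32_t ssrc;
    std::size_t   payload_offset;  // index of the first payload byte
    std::size_t   payload_len;     // payload bytes, padding excluded
};

static std::uint32_t read_be32(const std::uint8_t* p) {
    return (std::uint32_t(p[0]) << 24) | (std::uint32_t(p[1]) << 16) |
           (std::uint32_t(p[2]) << 8)  |  std::uint32_t(p[3]);
}

// Returns header info if the UDP payload is structurally plausible RTP.
std::optional<RtpInfo> parse_rtp_candidate(const std::uint8_t* p, std::size_t len) {
    if (len < 12) return std::nullopt;            // fixed header is 12 bytes
    if ((p[0] >> 6) != 2) return std::nullopt;    // RTP version must be 2
    const bool padding   = (p[0] & 0x20) != 0;
    const bool extension = (p[0] & 0x10) != 0;
    const std::size_t csrc_count = p[0] & 0x0F;
    const std::uint8_t pt = p[1] & 0x7F;
    // PT 72..76 are reserved (RFC 3551): with the marker bit set they
    // look like RTCP packet types 200..204.
    if (pt >= 72 && pt <= 76) return std::nullopt;

    std::size_t off = 12 + 4 * csrc_count;        // skip CSRC list
    if (off > len) return std::nullopt;
    if (extension) {                              // skip header extension
        if (off + 4 > len) return std::nullopt;
        const std::size_t words = (std::size_t(p[off + 2]) << 8) | p[off + 3];
        off += 4 + 4 * words;
        if (off > len) return std::nullopt;
    }
    std::size_t end = len;
    if (padding) {                                // last byte holds pad count
        const std::size_t pad = p[len - 1];
        if (pad == 0 || pad > len - off) return std::nullopt;
        end -= pad;
    }
    RtpInfo h{};
    h.payload_type   = pt;
    h.seq            = std::uint16_t((p[2] << 8) | p[3]);
    h.timestamp      = read_be32(p + 4);
    h.ssrc           = read_be32(p + 8);
    h.payload_offset = off;
    h.payload_len    = end - off;
    return h;
}

// True if the payload contains the byte sequence 00 00 01.
bool contains_start_code(const std::uint8_t* p, std::size_t len, const RtpInfo& h) {
    assert(h.payload_offset + h.payload_len <= len);
    const std::uint8_t* b = p + h.payload_offset;
    for (std::size_t i = 0; i + 2 < h.payload_len; ++i)
        if (b[i] == 0 && b[i + 1] == 0 && b[i + 2] == 1) return true;
    return false;
}

// True if cur is 1..max_gap packets after prev, allowing 16-bit wraparound.
bool seq_follows(std::uint16_t prev, std::uint16_t cur, std::uint16_t max_gap = 1) {
    const std::uint16_t d = std::uint16_t(cur - prev);
    return d >= 1 && d <= max_gap;
}
```

A single packet passing these checks proves very little, since random UDP data passes them fairly often. In a sketch like this, confidence would come from several consecutive packets sharing an SSRC with `seq_follows()` holding between them, and only then would payload-level tests such as `contains_start_code()` be used to narrow down the codec family.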
Disclaimer: I work for the company that maintains this.