You can also get word-level timestamps directly from a speech-to-text API that includes timing data in its response. I've used AssemblyAI for this: its API returns start and end timestamps for each word as part of the transcription, so you don't need to run forced alignment as a separate step. This is simpler if you're starting with only audio and need both the text and the timing information.
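
To give a rough idea of what working with that response looks like, here is a minimal sketch. It assumes the transcript JSON contains a `words` array whose entries have `text`, `start`, and `end` fields, with times in milliseconds; check the current AssemblyAI docs for the exact shape. The `word_timings` helper and the sample payload are made up for illustration, and the upload/request step is left out so the snippet runs without an API key.

```python
from typing import Any, Dict, List, Tuple


def word_timings(response: Dict[str, Any]) -> List[Tuple[str, float, float]]:
    """Return (word, start_seconds, end_seconds) for each word in a transcript response.

    Assumes each entry in response["words"] has "text", "start" and "end",
    with start/end given in milliseconds.
    """
    timings = []
    for w in response.get("words") or []:
        timings.append((w["text"], w["start"] / 1000.0, w["end"] / 1000.0))
    return timings


# Hand-written sample shaped like a transcript response; not real API output.
sample = {
    "text": "Hello world",
    "words": [
        {"text": "Hello", "start": 250, "end": 650, "confidence": 0.98},
        {"text": "world", "start": 700, "end": 1150, "confidence": 0.95},
    ],
}

for text, start, end in word_timings(sample):
    print(f"{start:6.2f}s - {end:6.2f}s  {text}")
```

In practice you would pass in the JSON you get back for a completed transcript instead of `sample`. Converting to seconds up front makes it easier to feed the timings into subtitle or video tooling, which usually expects seconds rather than milliseconds.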