I don't believe that is possible using the XSI streaming APIs.
You may need to look into CUCM JTAPI or TAPI, which can create conference calls (e.g. conferencing in an application-controlled CTI Port device that plays audio into the call). The Agent Greeting feature is a simplified conference scenario that uses the phone's built-in bridge DSP mixing capability, and it may be a bit easier to implement.
This repo has JTAPI samples demonstrating several of these pieces (e.g. conference and CTI Port): CiscoDevNet/jtapi-samples