Regarding the Ray integration question: I think Ray Serve is a good fit for serving online requests in parallel when each request involves some computation. It is a general framework for running multiple replicas of your request-handling logic, and it can scale out across a Ray cluster.
In addition, Ray Serve supports resource allocation, so you should be able to specify the GPU resources each replica needs.
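
To make this concrete, here is a rough sketch of how replicas and per-replica GPUs could be configured. It assumes a recent Ray 2.x release; the `Predictor` class, its `compute` method, the `build_app` helper, and the two constants are made-up names for illustration, and the summing logic is only a placeholder for your real computation.

```python
# Hypothetical settings; adjust them to your workload and cluster.
NUM_REPLICAS = 2        # how many copies of the handler to run
GPUS_PER_REPLICA = 1    # GPUs reserved for each replica


class Predictor:
    """Request handler that each replica runs (illustrative only)."""

    def compute(self, values):
        # Placeholder for the real per-request computation.
        return {"result": sum(values)}

    async def __call__(self, request):
        # For HTTP traffic, Ray Serve passes in a Starlette Request.
        payload = await request.json()
        return self.compute(payload["values"])


def build_app():
    # Imported here so the class above can be tested without Ray installed.
    from ray import serve

    deployment = serve.deployment(
        num_replicas=NUM_REPLICAS,
        ray_actor_options={"num_gpus": GPUS_PER_REPLICA},
    )(Predictor)
    return deployment.bind()
```

With Ray installed and enough GPUs available, something like `serve.run(build_app())` should start the replicas, and each one would be scheduled onto a node that can provide the requested GPU. Note that replicas stay pending if the cluster does not have `NUM_REPLICAS * GPUS_PER_REPLICA` GPUs free.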