
Regarding the Ray integration question, I think Ray Serve is a good fit for serving online requests in parallel when each request involves some computation. It is a general framework for running multiple replicas of your request-handling logic, and it can scale out across a Ray cluster.

In addition, Ray Serve supports resource allocation per replica, so you should be able to specify the GPU resources each replica needs, for example by setting num_gpus in the deployment's ray_actor_options.
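As a rough illustration of the two points above (several replicas, each reserving a GPU), here is a minimal sketch. The `Scorer` class, its `compute` method, the `values` field in the request body, and the `build_app` helper are all hypothetical names made up for this example; the replica count and GPU amount are placeholder values, and the sketch assumes your Ray cluster actually has GPUs available.

```python
class Scorer:
    """Hypothetical request handler; swap compute() for your own logic."""

    def __init__(self, scale: float = 2.0):
        self.scale = scale

    def compute(self, values):
        # Placeholder computation: multiply every input by a constant.
        return [v * self.scale for v in values]

    async def __call__(self, request):
        # Ray Serve passes a Starlette request to HTTP-facing deployments.
        payload = await request.json()
        return {"result": self.compute(payload["values"])}


def build_app(num_replicas: int = 2, num_gpus: float = 1):
    """Wrap Scorer in a Serve deployment with per-replica GPU resources."""
    # Imported here so the plain class above can be used without Ray installed.
    from ray import serve

    deployment = serve.deployment(Scorer).options(
        num_replicas=num_replicas,                # copies handling requests in parallel
        ray_actor_options={"num_gpus": num_gpus}, # GPU resources reserved per replica
    )
    return deployment.bind()
```

To try it, call `serve.run(build_app())` from a process connected to your cluster (after `from ray import serve`). Fractional values such as `num_gpus=0.5` are also accepted if you want replicas to share a device.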
