79653509

Date: 2025-06-04 21:20:01
Score: 1.5
Natty:

You're running a FastAPI ML service that does CPU-intensive SVD computations on the main thread, you're seeing occasional throttling, and you've already scaled up Gunicorn workers. The question is whether to move the work to a separate thread, and what the best practices are.

CPU-bound work in FastAPI is tricky: threads won't help much because of Python's GIL, so the work needs to move to separate processes one way or another. Here are a few approaches that work well.

First option: ProcessPoolExecutor (quick fix)

For lighter workloads, offload to a separate process:

from concurrent.futures import ProcessPoolExecutor
import asyncio

import numpy as np

def _compute_svd(matrix: np.ndarray):
    # Module-level function so it can be pickled and sent to the worker process
    return np.linalg.svd(matrix, full_matrices=False)

async def svd_with_fallback(self, matrix):
    # Method on your service class; runs the SVD in a worker process without blocking the event loop
    with ProcessPoolExecutor() as executor:
        return await asyncio.get_running_loop().run_in_executor(executor, _compute_svd, matrix)

Pros: simple, uses multiple cores.
Cons: serialization overhead for large matrices, plus the cost of spinning up a new pool on every call (a sketch for reusing one pool follows below).
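
To avoid paying that pool-startup cost per request, you can create one executor for the life of the app and reuse it. Here's a minimal sketch of how that might look wired into an endpoint; the route path and payload shape are placeholders, not taken from your code:

import asyncio
from concurrent.futures import ProcessPoolExecutor
from typing import List

import numpy as np
from fastapi import FastAPI

app = FastAPI()
executor = ProcessPoolExecutor()  # one shared pool for the life of the app

@app.post("/svd")  # placeholder route
async def svd_endpoint(matrix: List[List[float]]):
    arr = np.asarray(matrix, dtype=float)
    loop = asyncio.get_running_loop()
    # _compute_svd is the module-level helper from the snippet above
    U, S, Vt = await loop.run_in_executor(executor, _compute_svd, arr)
    return {"singular_values": S.tolist()}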

If you're already hitting CPU limits, a job queue (like Celery) scales better.

Second option: Celery + Redis (production-grade)

import pickle
import numpy as np

@celery.task  # `celery` here is your Celery app instance
def async_svd(matrix_serialized: bytes) -> bytes:
    matrix = pickle.loads(matrix_serialized)
    U, S, V = np.linalg.svd(matrix)
    return pickle.dumps((U, S, V))

Pros: Decouples compute from API, scales horizontally.
Cons: Adds Redis/RabbitMQ as a dependency.
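
For completeness, here's roughly how the API side might enqueue the task and let clients poll for the result. The module name, broker/backend URLs, and route paths are placeholders, not from your setup; it assumes the async_svd task above lives in a tasks.py alongside its Celery app:

# tasks.py (hypothetical module) would hold the Celery app and the async_svd task above, e.g.:
#   celery = Celery("svd_tasks", broker="redis://localhost:6379/0", backend="redis://localhost:6379/0")
import pickle
from typing import List

import numpy as np
from celery.result import AsyncResult
from fastapi import FastAPI

from tasks import async_svd, celery  # hypothetical module

app = FastAPI()

@app.post("/svd")
def submit_svd(matrix: List[List[float]]):
    # Enqueue the job; the API returns immediately with a task id
    task = async_svd.delay(pickle.dumps(np.asarray(matrix, dtype=float)))
    return {"task_id": task.id}

@app.get("/svd/{task_id}")
def get_svd(task_id: str):
    # Clients poll until a Celery worker has finished the job
    result = AsyncResult(task_id, app=celery)
    if not result.ready():
        return {"status": "pending"}
    U, S, V = pickle.loads(result.get())
    return {"status": "done", "singular_values": S.tolist()}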

Third option: optimize Gunicorn further, if you're sticking with more workers:

gunicorn -w $(nproc) -k uvicorn.workers.UvicornWorker ...  

Match the number of workers to CPU cores.

Check for throttling in htop; if it's kernel-level, pinning workers to specific cores with taskset might help.
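
If you'd rather keep these settings in a config file than on the command line, a minimal gunicorn_conf.py (file name, module path, and values are just examples) could look like:

# gunicorn_conf.py -- run with: gunicorn -c gunicorn_conf.py myapp:app
import multiprocessing

worker_class = "uvicorn.workers.UvicornWorker"
workers = multiprocessing.cpu_count()  # one worker per core for CPU-bound SVD work
timeout = 120  # example value: give long-running SVD requests room before the worker is killed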

There are also a couple of low-hanging performance optimizations you can try.

For SVD specifically, you could try Numba (if numerical stability allows):

import numpy as np
from numba import njit

@njit
def svd_fast(matrix):
    # JIT-compiled wrapper; numba supports np.linalg.svd on contiguous 2-D float arrays
    return np.linalg.svd(matrix)

But test thoroughly: NumPy's SVD already calls into BLAS/LAPACK, so the JIT-compiled wrapper may not buy you much.
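
If you do try it, a quick sanity check on your actual matrix sizes will tell you whether it's worth keeping; the shape below is arbitrary:

import time

# np and svd_fast come from the snippet above
def bench(fn, matrix, repeats=5):
    fn(matrix)  # warm-up call so numba's JIT compilation isn't counted
    start = time.perf_counter()
    for _ in range(repeats):
        fn(matrix)
    return (time.perf_counter() - start) / repeats

matrix = np.random.rand(500, 500)  # arbitrary test size
print("numpy:", bench(np.linalg.svd, matrix))
print("numba:", bench(svd_fast, matrix))
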
I'd start with ProcessPoolExecutor since it's the least invasive change. If you're still throttled, Celery is the way to go. Want me to dive deeper into any of these?

Reasons:
  • Long answer (-1):
  • Has code block (-0.5):
  • Ends in question mark (2):
  • Low reputation (1):
Posted by: Arthur Boyd