I tested TensorRT via voltaML. Compilation works fine and inference gets a bit faster, but the quality of the generated images goes down.
On a 3090, a single image generation takes 7.5 seconds with voltaML versus 8.8 seconds with PyTorch fp16. After the model is loaded and warmed up, generating images continuously is much faster with voltaML.
Compiling takes about 25 minutes.
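For reference, this is roughly how I separate first-call latency from steady-state latency when comparing backends. It's just a sketch: it assumes an already loaded pipeline object named `pipe` and uses a placeholder prompt.

```python
import time
import torch

def time_one(pipe, prompt):
    # Synchronize so we measure actual GPU work, not just kernel launches.
    torch.cuda.synchronize()
    start = time.perf_counter()
    _ = pipe(prompt, num_inference_steps=50)
    torch.cuda.synchronize()
    return time.perf_counter() - start

# The first call pays for warmup (CUDA context init, kernel selection, caches).
print("cold:", time_one(pipe, "test prompt"))

# Later calls reflect the continuous-generation numbers quoted above.
for _ in range(3):
    print("warm:", time_one(pipe, "test prompt"))
```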
Now I think compiling with AITemplate, TensorRT, etc. will change the output quality compared to inference with the original PyTorch model. (GitHub - stochasticai/x-stable-diffusion: Real-time inference for Stable Diffusion - 0.88s latency. Covers AITemplate, nvFuser, TensorRT, FlashAttention.)
So for image quality, I think the best I can do at a production level is just loading from safetensors and applying xformers (probably with JIT as well; JIT was faster than without it and there was no quality downgrade, and maybe the reason it seemed slow before was that I didn't warm up), plus multiprocessing on a single GPU.
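A minimal sketch of that setup with diffusers: load fp16 weights from safetensors, enable xformers, and warm up once so the first real request isn't slow. The model id and prompts here are just placeholders.

```python
import torch
from diffusers import StableDiffusionPipeline

# Placeholder checkpoint; any SD 1.x repo with safetensors weights works the same way.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    use_safetensors=True,
)
pipe.enable_xformers_memory_efficient_attention()
pipe = pipe.to("cuda")

# Warmup: one throwaway generation before serving real traffic.
_ = pipe("warmup prompt", num_inference_steps=20)

# Subsequent calls run at steady-state latency with no quality change.
image = pipe("a photo of an astronaut riding a horse", num_inference_steps=50).images[0]
```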
As far as I know, you are also planning to serve multiple models. What's your plan if there are 100 requests per second? Optimizing loading and inference only goes so far. With affordable GPUs like the 3090, a single inference will take at least 7 seconds, so if 100 requests come in for 100 different models, the last user will get a response 700 seconds later.
Maybe running the different models in separate Docker images would solve this problem? I don't know whether that will run asynchronously correctly; I will test it.
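To show what I mean, here is a toy sketch of per-model workers behind a queue, so requests for one model don't block another. The names (`ModelWorker`, `MODEL_IDS`, `handle_request`) are made up for illustration; each worker could just as well be a separate container behind the same queue interface, with the GPU call replaced by real pipeline inference.

```python
import asyncio

MODEL_IDS = ["model-a", "model-b"]  # placeholder model identifiers

class ModelWorker:
    """One queue (and, in a real setup, one loaded pipeline) per model."""
    def __init__(self, model_id):
        self.model_id = model_id
        self.queue = asyncio.Queue()

    async def run(self):
        # A real worker would load its pipeline once here, paying the ~7 s
        # load/warmup cost a single time, then serve its queue forever.
        while True:
            prompt, fut = await self.queue.get()
            await asyncio.sleep(0)  # placeholder for the actual GPU inference
            fut.set_result(f"[{self.model_id}] image for: {prompt}")

async def handle_request(workers, model_id, prompt):
    # Route the request to the queue of the model it asks for.
    fut = asyncio.get_running_loop().create_future()
    await workers[model_id].queue.put((prompt, fut))
    return await fut

async def main():
    workers = {mid: ModelWorker(mid) for mid in MODEL_IDS}
    tasks = [asyncio.create_task(w.run()) for w in workers.values()]
    results = await asyncio.gather(
        handle_request(workers, "model-a", "a cat"),
        handle_request(workers, "model-b", "a dog"),
    )
    print(results)
    for t in tasks:
        t.cancel()

asyncio.run(main())
```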