Safetensors & our own Optimization (faster model init)

I tested TensorRT with voltaML. Compilation works well and inference gets a bit faster, but the quality of generated images goes down.

On a 3090, a single image generation takes 7.5 seconds with voltaML versus 8.8 seconds with PyTorch fp16. But after model load and warmup, generating images continuously is much faster with voltaML.
Compiling takes about 25 minutes.

So now I think compiling with AITemplate, TensorRT, etc. will change output quality relative to inference on the original PyTorch model. (GitHub - stochasticai/x-stable-diffusion: Real-time inference for Stable Diffusion - 0.88s latency. Covers AITemplate, nvFuser, TensorRT, FlashAttention.)

So for image quality, I think the best I can do at production level is to just load with safetensors and apply xformers (probably with JIT; JIT was faster than without, there was no quality downgrade, and the reason it seemed slow before was probably that I didn't warm up), plus multiprocessing on a single GPU.

As far as I know, you are also planning to serve multiple models. What's your plan if there are 100 requests per second? Loading and inference optimization only go so far. With affordable GPUs like the 3090, the first inference will take at least 7 seconds. Then, if 100 requests come in for 100 different models, the last person will get a response 700 seconds later.
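The back-of-the-envelope math here can be sketched as follows (a toy model that assumes perfectly serial requests, numbers from above):

```python
import math

def last_response_time(num_requests, seconds_per_request, num_workers=1):
    # With a single GPU (one worker), requests are served serially,
    # so the last request waits for everyone ahead of it.
    return math.ceil(num_requests / num_workers) * seconds_per_request

print(last_response_time(100, 7))     # 700 seconds on one GPU
print(last_response_time(100, 7, 5))  # 140 seconds spread over 5 GPUs
```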

Maybe running different models in separate Docker images will solve this problem? I don't know whether it will run async correctly; I will test it.


From my research, multiprocessing is not recommended with deep learning models; one whole GPU is needed per model inference. So I think the best way to increase concurrency across different models is to use Amazon Inf instances, build a cluster with Kubernetes, and manage the models with Docker.
The key question is whether Stable Diffusion can be compiled with the Neuron compiler, which Inf instances require.
Since an inf1.xlarge instance costs 0.228 USD/hour, running 5 instances as nodes would cost about 10,000 USD per year. Then, if generating an image takes 7 seconds end to end (model load, warmup, inference), 5 requests for different models can be served every 7 seconds.
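For reference, the cost arithmetic (assuming 24/7 on-demand usage for a full year, with the 0.228 USD/hour rate quoted above):

```python
hourly_rate_usd = 0.228   # inf1.xlarge on-demand price quoted above
instances = 5
hours_per_year = 24 * 365

annual_cost = hourly_rate_usd * instances * hours_per_year
print(round(annual_cost))  # 9986, i.e. roughly 10,000 USD per year
```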

With more research, I think building a Kubernetes cluster on runpod instances will be better. There is too much uncertainty around Neuron compilation.


Thanks for continuing to share your findings!

From my research, multiprocessing is not recommended with deep learning models; one whole GPU is needed per model inference.

Yeah, I also decided to use multiple 3090s rather than splitting an A100. I ran a quick test just now: with DPMSolverMultistepScheduler at 512x512 and 20 steps, I get 1.8 s warm inference on both a 3090 and Banana's split A100. It's hard to know whether the other "half" is in use at the same time, but the timings can vary quite a bit, and I'm sure that's just from splitting the GPU. (Inference = inference only, excluding any request latency.)
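A small helper for this kind of warm-latency measurement (a generic sketch; the lambda below is a stand-in workload, where a real test would call something like a diffusers pipeline instead):

```python
import time

def warm_latency(fn, warmup=2, runs=5):
    # Discard the first few (cold) calls, then average the rest,
    # so compile/caching overhead doesn't skew the measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(runs):
        fn()
    return (time.perf_counter() - start) / runs

# Stand-in workload; a real measurement would time the pipeline call itself.
print(f"{warm_latency(lambda: sum(range(100_000))):.6f} s per run")
```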

That's unfortunate to hear and something we'll need to solve before we can adopt any of these solutions. I'm also waiting to see results with int8 quantization.

Unfortunate but not a showstopper… presumably we can compile and upload to S3 just like we’re doing currently with safetensors.

I don't think anything beats serverless for quick scaling, but it has its downsides too (price, cold starts). My long-term plan is a hybrid approach:

  • Long-running containers.
  • Requests are queued. If queue goes above a certain size for a certain amount of time, spin up another container.
  • Containers will have “model affinity” (in my closed-source autoscaler), so requests will always be routed to a container that already has the requested model loaded (rather than needing to keep switching models between requests).
  • Hybrid: requests could be routed to serverless at certain thresholds, or while waiting for new (long-running) containers to finish booting.
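A toy sketch of the queue-threshold and model-affinity ideas above (all names, thresholds, and the in-memory state are illustrative assumptions, not the actual closed-source autoscaler):

```python
import collections

class ToyAutoscaler:
    def __init__(self, queue_threshold=10):
        self.queue = collections.deque()        # waiting requests
        self.queue_threshold = queue_threshold  # scale-up trigger
        self.containers = {}                    # container id -> loaded model

    def route(self, model):
        # Model affinity: prefer a container that already has this model
        # loaded, so we avoid the slow model switch.
        for cid, loaded in self.containers.items():
            if loaded == model:
                return cid
        # No warm container: queue the request, and if the queue has
        # grown past the threshold, boot another container for it.
        self.queue.append(model)
        if len(self.queue) > self.queue_threshold:
            self.queue.pop()
            return self._boot_container(model)
        return None  # caller waits in the queue

    def _boot_container(self, model):
        cid = f"container-{len(self.containers)}"
        self.containers[cid] = model  # stands in for boot + model load
        return cid

scaler = ToyAutoscaler(queue_threshold=0)
print(scaler.route("model-a"))  # container-0 (queue overflowed, booted one)
print(scaler.route("model-a"))  # container-0 (affinity hit, no new boot)
print(scaler.route("model-b"))  # container-1 (different model, new container)
```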

The key point in my mind is that switching models, while slow, takes less time than booting a new serverless instance.

That's a great takeaway! Neuron looks very interesting, but I guess we'll have to wait for that uncertainty to improve; otherwise those prices look great, and AWS knows how to scale.

For now I'm enjoying it (and, comparably, runpod pods)… it wouldn't be Kubernetes though, as we don't have raw access to the system (you could do it on Lambda and others, though). I'm currently using a custom autoscaler (which currently just handles queuing but will eventually start new instances as needed too). But, as you point out, it's difficult to react quickly to a sudden spike in requests, and I think we still need serverless for that (provided whoever we choose for serverless doesn't run into capacity issues themselves).

Thanks again for all this! Excited to see where we’ll take it :rocket: :raised_hands:

I will also note, though, that there's a balance between getting these optimizations working "early" (and maintaining that work) versus waiting for official implementations in diffusers (xformers is a great example, and the person who did the xformers support expressed interest in AITemplate too!).

Happy New Year, @hihihi! :tada: