Safetensors & our own Optimization (faster model init)

Thanks for continuing to share your findings!

From my research, multiprocessing is not recommended with deep learning models; a whole GPU is needed for a single model’s inference.

Yeah, I also decided to use multiple 3090s rather than splitting an A100. I did a quick test just now: with DPMSolverMultistepScheduler at 512x512 and 20 steps, I get 1.8s inference (warm) on both the 3090 and Banana’s split A100. It’s hard to know whether the other “half” is in use at the same time, but the timings can often vary quite a bit and I’m sure that’s just from splitting the GPU. (Inference = inference only, excluding any request latency.)
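
For reference, here’s roughly the shape of that timing test (a minimal sketch; the model id and prompt are placeholders, not what I actually ran):

```python
# Rough warm-inference benchmark sketch (placeholder model id and prompt).
import time
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)
pipe = pipe.to("cuda")

# Warm-up pass so first-call CUDA overhead doesn't skew the number.
pipe("a photo of an astronaut", num_inference_steps=20, height=512, width=512)

torch.cuda.synchronize()
start = time.perf_counter()
pipe("a photo of an astronaut", num_inference_steps=20, height=512, width=512)
torch.cuda.synchronize()
print(f"warm inference: {time.perf_counter() - start:.2f}s")
```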

That’s unfortunate to hear and something we’ll need to solve before we can adopt any of these solutions. Also waiting to see results with int8 quantization.

Unfortunate but not a showstopper… presumably we can compile and upload to S3, just like we currently do with safetensors.
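
Something like this is what I have in mind for that step (a sketch only; the bucket name and paths are placeholders, and it assumes boto3 credentials are already configured):

```python
# Serialize a pipeline to safetensors locally, then push the artifacts to S3.
# Placeholder bucket/paths; in practice you'd walk the whole save directory.
import boto3
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
pipe.save_pretrained("./sd15", safe_serialization=True)  # writes .safetensors files

s3 = boto3.client("s3")
s3.upload_file(
    "./sd15/unet/diffusion_pytorch_model.safetensors",
    "my-models-bucket",
    "sd15/unet/diffusion_pytorch_model.safetensors",
)
```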

I don’t think anything beats serverless for quick scaling, but it has its downsides too (price, cold starts). My long-term plan is a hybrid approach (rough sketch after the list):

  • Long-running containers.
  • Requests are queued. If the queue stays above a certain size for a certain amount of time, spin up another container.
  • Containers will have “model affinity” (in my closed-source autoscaler), so requests will always be routed to a container that already has the requested model loaded (rather than needing to keep switching models between requests).
  • Hybrid: requests could be routed to serverless at certain thresholds, or while waiting for new (long-running) containers to finish booting.
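
To make the idea concrete, here’s a toy sketch of the queue/affinity logic (all names and thresholds are hypothetical; the actual closed-source autoscaler isn’t shown here):

```python
# Toy sketch of queue-based scaling with model affinity (hypothetical names/thresholds).
import time
from collections import defaultdict
from dataclasses import dataclass

QUEUE_SIZE_THRESHOLD = 10   # spin up another container above this queue size...
SUSTAINED_SECONDS = 30      # ...if it stays there this long

@dataclass
class Request:
    model_id: str
    payload: dict

class Autoscaler:
    def __init__(self):
        self.queues = defaultdict(list)      # model_id -> pending requests
        self.containers = defaultdict(list)  # model_id -> containers with that model loaded
        self.over_since = {}                 # model_id -> when the queue first exceeded the threshold

    def route(self, req: Request):
        # Model affinity: prefer a container that already has the model loaded,
        # so we don't keep switching models between requests.
        pool = self.containers[req.model_id]
        if pool:
            return pool[0]  # naive choice; a real scheduler would pick the least-loaded
        self.queues[req.model_id].append(req)
        return None  # caller can fall back to serverless while a container boots

    def maybe_scale(self, model_id: str):
        now = time.monotonic()
        if len(self.queues[model_id]) > QUEUE_SIZE_THRESHOLD:
            since = self.over_since.setdefault(model_id, now)
            if now - since >= SUSTAINED_SECONDS:
                self.start_container(model_id)
                self.over_since.pop(model_id, None)
        else:
            self.over_since.pop(model_id, None)

    def start_container(self, model_id: str):
        # Placeholder: provision a long-running container with model_id preloaded.
        ...
```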

The key point in my mind is that switching models, while slow, takes less time than booting a new serverless instance.
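
For a rough sense of that model-switch cost on an already-running container, a timed load is enough (placeholder model id; `use_safetensors` availability depends on your diffusers version):

```python
# Time a "model switch": load a different pipeline from safetensors and move it to the GPU.
import time
import torch
from diffusers import StableDiffusionPipeline

start = time.perf_counter()
pipe = StableDiffusionPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")
print(f"model switch (load + move to GPU): {time.perf_counter() - start:.1f}s")
```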

That’s a great takeaway! Neuron looks very interesting, but I guess we’ll have to wait for that uncertainty to clear up; otherwise, those prices look great and AWS knows how to scale.

For now I’m enjoying vast.ai (and, comparably, runpod pods)… it wouldn’t be Kubernetes though, as we don’t have raw access to the system (you could do it on Lambda and others, though). I’m currently using a custom autoscaler (which just handles queuing for now but will eventually start new instances as needed too). But as you point out, it’s difficult to react quickly to a sudden spike in requests, and I think we still need serverless for that (provided whoever we choose for serverless doesn’t run into capacity issues themselves).

Thanks again for all this! Excited to see where we’ll take it :rocket: :raised_hands:

I will also note, though, that there’s a balance between getting these optimizations working “early” (and maintaining that work) vs. waiting for official implementations in diffusers (xformers is a great example, and the person who did the xformers support expressed interest in AITemplate too!).
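
For example, using the official xformers support in diffusers is now just a method call on the pipeline (sketch assumes the xformers package is installed; model id is a placeholder):

```python
# Official diffusers API for xformers memory-efficient attention.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")
pipe.enable_xformers_memory_efficient_attention()
```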

Happy New Year, @hihihi! :tada: