Definitely, a bit of documentation on the endpoint would make it easier, but that's normal, it's only just getting started!
I'm not sure how to handle PRECISION in this query; could you enlighten me?
Mmm, can you try adding the callInput MODEL_PRECISION: ""? I hope it works. Otherwise you might need to set the Docker env variable PRECISION="" when loading the container. Let me know!
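Something along these lines (an untested sketch; the payload shape follows the examples in test.py, and the model id and local URL are just placeholders for your own setup, so double-check against test.py):

```python
# Untested sketch: call a locally running docker-diffusers-api container and
# override the precision per request. Adjust the URL/port and model id for
# your own deployment; the payload shape follows the examples in test.py.
import requests

payload = {
    "modelInputs": {
        "prompt": "a photo of an astronaut riding a horse",
        "num_inference_steps": 20,
    },
    "callInputs": {
        "MODEL_ID": "runwayml/stable-diffusion-v1-5",  # placeholder: whichever model your image was built with
        "MODEL_PRECISION": "",  # "" = full precision; "fp16" = half precision
    },
}

resp = requests.post("http://localhost:8000/", json=payload)  # assumed local endpoint
print(resp.json())
```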
Haha yes, I will indeed do a proper doc… it's funny how the project developed. Initially the frontend was the big open source project, but the container was much more popular (and wasn't open source in the beginning). Probably the best “docs” so far is the test.py file (just scroll down to the middle or so, where there are a bunch of examples). But well done, you've really done great with what's available!! And I will clean this up.
I must admit, until now, most people have been using separate containers for train vs inference, on serverless… but there’s no reason we shouldn’t be able to use the same container for both… it’s just untested
Umm, setting the PRECISION docker var just sets the default, I think, and it should be overridable on every call with the MODEL_PRECISION callInput. Let me know what works and what doesn't and I'll fix this up… but it will only be after I finish some other stuff. Just remind me where / how you're deploying? Provider, and docker image vs build from source. There's a lot of cleanup coming early next year!
Do you only use safetensors for optimization?
In my experience, for a 2GB fp16 model, loading the original .bin file took 5.3 seconds, while loading the unet and vae as safetensors with SAFETENSORS_FAST_GPU took 4.5 seconds. I wonder how you optimized init time down to around 2.2s.
And do you now use RunPod for kiri.art?
I used RunPod serverless; most first responses finish within 14 seconds, and after the cold boot it takes just 4 seconds. But sometimes it took 50 seconds to get the first completed response. Maybe that's because the container was starting up, but I can't figure out why the container sometimes doesn't need to boot and sometimes does. (All based on a 2GB fp16 model.)
Because Banana is still unstable, I assume you are using your own optimization and RunPod to run kiri.art.
I dug into pipeline.ai, but I found async requests were slow there, and because they are not Docker-based, you can't set up your own environment. You have to fit into their server's env.
So if there's a good way to optimize cold boots, I think RunPod serverless will be the best way.
I'd love to know how you did that!
(FYI, I tried tracing to optimize, but it wasn’t effective)
- Improvements in diffusers (particularly the “fast load” that is part of their low-memory code, enabled by default).
- Improvements in docker-diffusers-api (I spent a lot of time measuring how long each part of init takes and improving where I could).
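To give a rough idea of the diffusers side, something like the sketch below (not the exact docker-diffusers-api code path; whether the .safetensors weights actually get used depends on your diffusers version and what's in the model repo):

```python
# Rough sketch of a "fast" load; not the exact docker-diffusers-api code path.
import os
import torch
from diffusers import StableDiffusionPipeline

# The safetensors library reads this env var and enables its faster GPU load path.
os.environ["SAFETENSORS_FAST_GPU"] = "1"

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder model id
    torch_dtype=torch.float16,         # fp16 weights, roughly the 2GB case above
    low_cpu_mem_usage=True,            # diffusers' low-memory "fast load" (the default)
)
pipe = pipe.to("cuda")
```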
Thanks for including the timings from your own experience. Is that with docker-diffusers-api too? On Banana? It may also depend on Banana's load at test time; I had more data for Banana's optimized containers and there was greater variance in those times… so I guess I'll have to do more testing over a longer period with “our optimization” on Banana.
Ah they’re still unstable? That’s pretty upsetting to hear
RunPod was a good assumption, but no, we're not using it for kiri.art. I'm actually just renting an RTX 3090 off vast.ai and running docker-diffusers-api there. So it's not serverless, it's always-on (with no cold boots)… but I'm paying about $220/mo (vs Banana's new pricing, where a shared A100 would cost about $1,600/mo!). There's also a custom (closed-source) load-balancer / queue / future autoscaler I put in front of it.
It's hard to want to go back to serverless after enjoying such great response times, but serverless does have scaling advantages. I think moving forward I'll probably have a hybrid solution, using a mix of long-running servers and scale-to-zero serverless, but it's a bit of work, as you can imagine. I'll almost definitely use RunPod for this (Banana's new pricing unfortunately makes them irrelevant for me).
Please keep sharing your experiences, I’ve only had limited experience with RunPod and their serverless offering is still in beta. I think I noticed that for the very first call, it still has to download/extract the docker image. I think that can include the first call that goes to a new worker, but not sure. I bet as a community we’ll make a lot of fun discoveries and improvements here. I’ll spend more time on this in a few more weeks after I’ve finished some more important things (in docker-diffusers-api).
Good to know, thanks for sharing this!
And an even bigger thanks for sharing this! Tracing was on my list of things to try, so I'm happy you did it for me. Without needing to worry about breaking Banana's optimization anymore, there are a lot of new possibilities opened up. Top of my list (after I complete the other things I mentioned) is PyTorch 2 and Meta's AITemplate. But I think that will help inference speeds more than anything else (though we won't know until we measure it :)).
I'm pretty happy with cold boots at the moment and not sure how much more can be improved, on our side at least. What I would really love to see, on both Banana and RunPod, is on-prem S3-compatible storage. We'd have to measure, but I'd prefer to have much smaller docker images. Switching models at runtime is “bad” for serverless, but I still think it will be faster than a cold boot, is much more maintainable (a single image for all models), and is more likely to re-use a running instance. Anyway, a lot of innovation is happening in this space, so let's see what develops moving forward.
My current roadmap, hopefully by mid-Jan: continuous integration testing with automatic publishing of images under semantic versioning. Then back to work on everything else.
Thank you for your details!
I measured the time locally, using just diffusers.
Serverless seems like a slow option for serving models to users quickly. Thanks for telling me that you're using your own server and load balancer; from now on, I'd better set up my own server too.
I heard that big companies with lots of users use Amazon Inferentia for serving and NVIDIA Triton Inference Server as the serving framework. To increase throughput, you need to batch the requests that arrive within a certain timeframe, and I heard Triton already does that, along with load balancing.
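The idea is roughly like the toy sketch below (just something to illustrate the batching window; Triton's dynamic batching does this for you, and all the names and the 50 ms window are made up):

```python
# Toy sketch of "collect requests for a short window, then run one batched
# inference" -- the idea behind Triton's dynamic batching. All names and the
# 50 ms window are placeholders, not from any real framework.
import queue
import time

request_queue = queue.Queue()  # holds (prompt, callback) tuples
MAX_BATCH = 8
WINDOW_SECONDS = 0.05          # wait up to 50 ms to fill a batch

def batching_loop(pipe):
    while True:
        batch = [request_queue.get()]  # block until the first request arrives
        deadline = time.time() + WINDOW_SECONDS
        while len(batch) < MAX_BATCH:
            remaining = deadline - time.time()
            if remaining <= 0:
                break
            try:
                batch.append(request_queue.get(timeout=remaining))
            except queue.Empty:
                break
        prompts = [prompt for prompt, _ in batch]
        images = pipe(prompts).images  # one batched forward pass through the pipeline
        for (_, callback), image in zip(batch, images):
            callback(image)
```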
I will look at your Docker image to see how you optimized cold boots! Thank you.
And I will also look into PyTorch 2 and Meta's AITemplate.
Thanks, @hihihi! This is great!! I don't always manage to keep up to date with everything, and hadn't heard of OneFlow before. Looks like they plan to integrate this code into diffusers too, which will make it super easy.
Now that we don’t have to worry about breaking banana’s optimization all the time, we’re free to try all this stuff. In general, I’m working on automated testing and release cycle at the moment, but after that, I hope we’ll have quite a fast release cycle, as it will get easier and easier to make changes, know they don’t break anything, and release with confidence
That's cool. I tested OneFlow and it works well. They made it super easy, like this:
```python
import oneflow as torch
from diffusers import OneflowStableDiffusionPipeline as StableDiffusionPipeline
```
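A full test then looks roughly like this (a sketch; beyond the two swapped imports, the rest is the usual diffusers API, and the model id is a placeholder):

```python
# Rough sketch of the OneFlow test: beyond the two swapped imports above,
# everything else is the usual diffusers API. Model id is a placeholder.
import oneflow as torch
from diffusers import OneflowStableDiffusionPipeline as StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
)
pipe = pipe.to("cuda")

# The first call is slow (OneFlow compiles the graph); later calls run ~2x faster.
image = pipe("a photo of an astronaut riding a horse", num_inference_steps=20).images[0]
image.save("out.png")
```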
With a T4, I confirmed that 5 it/s without OneFlow went to 10 it/s with OneFlow.
But the problem is that loading the model takes so long.
So under the same conditions (model load plus one inference), the result is 45 seconds with OneFlow vs 13 seconds without.
Maybe it's because loading their C++ compiler takes so long, so I will test again on a machine with a better CPU.
So I think AITemplate will be the best option for production Stable Diffusion, if it can use an already-compiled model so that model loading is fast.
Will continue to share!
Thanks.
If you use OneFlow and keep the model loaded in memory without releasing it, then I think every inference request can be done within 2 seconds. If you're serving a few models that don't change, I think that will be the best option for throughput.
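I just mean keeping the pipeline as a long-lived object and reusing it for every request, something like this sketch (the web framework and endpoint name are only for illustration):

```python
# Minimal sketch of "load once, keep in memory, serve many requests".
# FastAPI and the /generate endpoint are just illustrative choices.
import base64
import io

import torch
from diffusers import StableDiffusionPipeline
from fastapi import FastAPI

app = FastAPI()

# Loaded once at startup and never released, so each request only pays inference time.
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # placeholder model id
).to("cuda")

@app.post("/generate")
def generate(prompt: str):
    image = pipe(prompt, num_inference_steps=20).images[0]
    buf = io.BytesIO()
    image.save(buf, format="PNG")
    return {"image_base64": base64.b64encode(buf.getvalue()).decode()}
```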
Thanks again for sharing all this, @hihihi! Really helpful (and motivating). I'm very confident that together we'll get inference to be super fast! I'll be in touch when I actually start working on all this stuff, but in the meantime please do continue to share your findings.
The most promising one that could be adopted in production looks like this.
But from what I've researched, knowledge distillation may be ongoing for Stable Diffusion, and then all of these approaches will be useless, because that will be the fastest one (within 1 second).
I tested TensorRT with voltaML; compilation works well and it's a bit faster, but the quality of the generated images goes down.
On a 3090, for a single image generation, voltaML takes 7.5 seconds and PyTorch fp16 takes 8.8 seconds. But after the model loads and warms up, generating images continuously is much faster with voltaML.
Compiling takes about 25 minutes.
So for image quality, I think the best I can do at production level is just loading with safetensors and applying xformers (probably with JIT too; JIT was faster than without, and there was no quality downgrade, so maybe the reason it seemed slow before was that I didn't warm up), plus multiprocessing on a single GPU.
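Concretely, that setup is roughly the sketch below (fp16 load, xformers attention, and one warmup pass before serving real requests; the model id is a placeholder):

```python
# Rough sketch: fp16 load, xformers attention, and a warmup pass.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",  # placeholder model id
    torch_dtype=torch.float16,
).to("cuda")

pipe.enable_xformers_memory_efficient_attention()  # requires xformers to be installed

# Warmup: the first pass pays one-off costs (CUDA context, kernel selection, any JIT),
# so run a throwaway generation before timing or serving real requests.
_ = pipe("warmup", num_inference_steps=5).images
```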
As far as I know, you are also planning to serve multiple models. What's your plan if there are 100 requests per second? Loading and inference optimization only go so far. With affordable GPUs like the 3090, the first inference will take at least 7 seconds. So if 100 requests come in, each for a different model, the last person will get a response 700 seconds later.
Maybe putting different models in separate Docker images will solve this problem? I don't know whether it will run async correctly. I will test it.
From what I've researched, multiprocessing is not recommended with deep learning models; a whole GPU is needed for one model's inference. So I think the best way to increase concurrency with different models is to use Amazon Inf instances, build a cluster with Kubernetes, and manage the models with Docker.
The key question is whether Stable Diffusion can be compiled with the Neuron compiler, which is required for Inf instances.
Since an inf1.xlarge instance costs $0.228/hour, using 5 instances as nodes would cost about $10,000 per year (0.228 × 24 × 365 × 5 ≈ $9,986). And then, if generating an image needs 7 seconds for everything (model load, warmup, inference), 5 requests for different models can be handled in 7 seconds.
With more research, I think making a Kubernetes cluster with vast.ai or RunPod instances will be better. There is too much uncertainty about Neuron compilation.
> From what I've researched, multiprocessing is not recommended with deep learning models; a whole GPU is needed for one model's inference.
Yeah, I also decided to use multiple 3090s rather than splitting an A100. I did a quick test now, and for DPMSolverMultistepScheduler at 512x512 with 20 steps, I get 1.8s inference (warm) on both the 3090 and Banana's split A100. It's hard to know if the other “half” is in use at the same time, but the timings can often vary quite a bit, and I'm sure that's just from splitting the GPU. (Inference = inference only, excluding any request latency.)
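If you want to reproduce something similar, a rough sketch of that kind of timing (not the exact test I ran; the model id is a placeholder):

```python
# Rough timing sketch: swap in DPMSolverMultistepScheduler, warm up, then time
# a 512x512 / 20-step generation on its own (excludes load and request latency).
import time
import torch
from diffusers import StableDiffusionPipeline, DPMSolverMultistepScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16  # placeholder model id
).to("cuda")
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

prompt = "a photo of an astronaut riding a horse"
_ = pipe(prompt, num_inference_steps=20, height=512, width=512)  # warmup pass

torch.cuda.synchronize()
start = time.perf_counter()
_ = pipe(prompt, num_inference_steps=20, height=512, width=512)
torch.cuda.synchronize()
print(f"warm inference: {time.perf_counter() - start:.2f}s")
```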
That’s unfortunate to hear and something we’ll need to solve before we can adopt any of these solutions. Also waiting to see results with int8 quantization.
Unfortunate but not a showstopper… presumably we can compile and upload to S3 just like we’re doing currently with safetensors.
I don't think anything beats serverless for quick scaling, but it has its downsides too (price, cold starts). My long-term plan is a hybrid approach:
- Long-running containers.
- Requests are queued. If the queue goes above a certain size for a certain amount of time, spin up another container.
- Containers will have “model affinity” (in my closed-source autoscaler), so requests will always be routed to a container that already has the requested model loaded, rather than needing to keep switching models between requests (a toy routing sketch follows below).
- Hybrid: could route requests to serverless at certain thresholds, or while waiting for new (long-running) containers to finish booting.
Key point in my mind is that switching models, while slow, takes less time than booting a new serverless instance.
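To make the “model affinity” idea concrete, here's a toy sketch of the routing decision (nothing here is from the actual closed-source autoscaler; all names are made up):

```python
# Toy sketch of model-affinity routing, not the real autoscaler: prefer a
# container that already has the requested model loaded, otherwise fall back
# to the least-busy container (which then pays a model switch -- still cheaper
# than booting a new serverless instance).
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Container:
    url: str
    loaded_model: Optional[str] = None
    queue_depth: int = 0

def pick_container(containers: List[Container], model_id: str) -> Container:
    # 1. Containers that already have this model loaded, least busy first.
    warm = [c for c in containers if c.loaded_model == model_id]
    if warm:
        return min(warm, key=lambda c: c.queue_depth)
    # 2. Otherwise the least busy container overall; it will switch models.
    return min(containers, key=lambda c: c.queue_depth)

# Example usage:
fleet = [
    Container("http://gpu-1:8000", loaded_model="model-a", queue_depth=2),
    Container("http://gpu-2:8000", loaded_model="model-b", queue_depth=0),
]
print(pick_container(fleet, "model-a").url)  # -> http://gpu-1:8000
```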
That's a great takeaway! Neuron looks very interesting, but I guess we'll have to wait for that uncertainty to improve; otherwise those prices look great, and AWS knows how to scale.
For now I'm enjoying vast.ai (and, comparably, RunPod pods)… it wouldn't be Kubernetes though, as we don't have raw access to the system (you could do it on Lambda and others, though). I'm currently using a custom autoscaler (which currently just handles queuing, but will eventually start new instances as needed too). But as you point out, it's difficult to react quickly to a sudden spike in requests, and I think we still need serverless for that (provided whoever we choose for serverless doesn't run into capacity issues themselves).
Thanks again for all this! Excited to see where we’ll take it
I will also note, though, that there's a balance between getting these optimizations working “early” (and maintaining that work) vs waiting for official implementations in diffusers (xformers is a great example, and the person who did the xformers support expressed interest in AITemplate too!).