Safetensors & our own Optimization (faster model init)

What is It?

Safetensors is a newer model format with very fast model loading. Coupled with other advancements in both diffusers and docker-diffusers-api, model init time has gotten very fast and now outperforms Banana’s current optimization methods (see the comparison below).

We’ve provided links to the safetensors versions of common models below. The latest versions of docker-diffusers-api can also convert any model (diffusers or ckpt) to safetensors format for you and store it on your own S3-compatible storage (this is what was used to create the models below). These converted models can be downloaded and built into your image, for those who want that (e.g. on Banana).

How to use it

Consuming a previously converted model

Download at Build Time (e.g. for Banana)

Use the “build-download” repo:

(or the runpod variant).

See the README there and set the build-args appropriately, particularly MODEL_URL. Use one of the prebuilt models below, or, if you have your own S3-compatible storage set up, give either the full S3 URL (according to the Storage docs), or set it to s3:// to download from default_bucket/models--REPO_ID--MODEL_ID--PRECISION.tar.zst.

NB: there is a current known issue where build args don’t override values set previously in the build. I’m working on this, but for now please set any vars using ENV lines only.
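
As a rough sketch (not copied from the repo’s README, so double-check the exact build-arg names there), building the image yourself against one of the prebuilt models below might look something like this; per the NB above, you may need to hard-code the same value as an ENV line in the Dockerfile instead until the build-arg issue is fixed:

# Hypothetical image tag; MODEL_URL points at one of the prebuilt archives below.
$ docker build -t my-diffusers-api \
  --build-arg MODEL_URL="https://pub-bdad4fdd97ac4830945e90ed16298864.r2.dev/diffusers/models--stabilityai--stable-diffusion-2-1-base--fp16.tar.zst" \
  .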

Runtime

TODO

Producing your own converted models

Requires S3-compatible storage

Set up the default, regular repo (i.e., not the “build-download” variant). The default build allows for runtime downloads of models. Call it like this:

$ python test.py txt2img \
  --call-arg MODEL_ID="stabilityai/stable-diffusion-2-1-base" \
  --call-arg MODEL_URL="s3://" \
  --call-arg MODEL_PRECISION="fp16" \
  --call-arg MODEL_REVISION="fp16"

Docker-diffusers-api will download the model, convert it to safetensors format, and upload the archive to your S3-compatible bucket. CHECKPOINT_URL is also supported, for converting .ckpt files. It will also return an image when the process completes.
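
For example, converting a .ckpt might look roughly like this — the MODEL_ID value and checkpoint URL are made-up placeholders, and only CHECKPOINT_URL support itself is documented above, so treat the exact combination of args as a sketch:

$ python test.py txt2img \
  --call-arg MODEL_ID="my-custom-model" \
  --call-arg CHECKPOINT_URL="https://example.com/my-model.ckpt" \
  --call-arg MODEL_URL="s3://" \
  --call-arg MODEL_PRECISION="fp16"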

Prebuilt models

Note, these downloads are rate-limited and shouldn’t be considered the full speed of Cloudflare R2.

Comparison to Banana’s Optimization

A quick look through https://kiri.art/logs reveals:

  • Banana optimization: 2.3s - 6.4s init time (usually around 3.0s)
  • Our optimization: 2.0s - 2.5s init time (usually about 2.2s)*

*Our optimization was only tested over a limited period, so it could vary more.

Why we’re dropping support for Banana’s Optimization

  • It’s incredibly fickle… small changes can break it.
  • It’s incredibly slow and is the main reason why banana builds are so slow.
  • There’s no failure explanation. If it fails, we just have to try one thing after another, wait 1hr+ to see if it worked, and try again.
  • There’s no way to test it locally, which makes the above even more painful.
  • There has been a loooot of downtime recently, and when it’s down, we still get the regular failure message, with no idea that it’s system-wide and not in our own code.
  • It’s super limiting… I’ve previously had to avoid many improvements because they would break optimization.
  • Our own optimization is faster anyway.

I really hope Banana will give us a way to opt out of optimization in future, as it would really speed up deployments!

Questions, issues, etc

  • (Banana) “Warning: Optimization Failed”. You can safely ignore this warning; we don’t use Banana’s optimization anymore, since we already optimized the model in the previous step. You can be sure it’s working based on the fast load time (around 2.2s init time vs 30s init time).

Thanks a lot for your work, man!
I’m trying to use the models--stabilityai--stable-diffusion-2-1-base--fp16.tar.zst model
to train a model, and I get the famous issue: Attempting to unscale FP16 gradients.
Should I create an fp32 model instead?


Hey! Thanks for the kind words :slight_smile:

Ok, so, this could be clearer in diffusers, and there’s an open issue for it. The gist of it is:

  • You need to fine-tune an FP32 model (since even if we got past this error, fine-tuning an fp16 model can lead to poor results)

  • The new / saved / fine-tuned model (after training) will still be saved as fp16! And loaded as fp16 for future inference.

In short, on your container for dreambooth training, you should set PRECISION="" and use a full-precision model (one without the --fp16 suffix). If you get stuck creating that, let me know.
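
If it helps, here’s a sketch of producing that full-precision archive, based on the conversion example earlier in this thread — it drops MODEL_REVISION="fp16" and sets an empty MODEL_PRECISION, which I’m assuming gives you the full fp32 weights:

$ python test.py txt2img \
  --call-arg MODEL_ID="stabilityai/stable-diffusion-2-1-base" \
  --call-arg MODEL_URL="s3://" \
  --call-arg MODEL_PRECISION=""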

Otherwise, congrats on getting this far! These are super early days for the above, so I’m really happy it’s working so far… let me know if anything is unclear or not working, and I’ll be happy to improve the docs and/or fix anything as needed.

For now, this is the payload I use:

    const payloadTraining = {
        id: crypto.randomUUID(),
        created: Date.now(),
        apiKey: "***",
        modelKey: "***",
        startOnly: true,
        modelInputs: {
            name: "dreambooth",
            modelInputs: {
                instance_prompt: "a photo of sks",
                instance_images: ["IMAGE_BAS64"],
                num_inference_steps: 50,
                guidance_scale: 9,
                height: 512,
                width: 512,
                seed: 3242,
            },
            callInputs: {
                MODEL_ID: "stable-diffusion-2-1-base--fp16",
                MODEL_URL: 'https://pub-bdad4fdd97ac4830945e90ed16298864.r2.dev/diffusers/models--stabilityai--stable-diffusion-2-1-base--fp16.tar.zst',
                train: "dreambooth",
                dest_url: `https://***.r2.cloudflarestorage.com/test/1.ckpt`
            },
        },
    }

Definitely, a bit of documentation on the endpoint would make it easier, but that’s normal since it’s just getting started!
I’m not sure how to handle PRECISION in this query, if you can enlighten me.

Mmm, can you try adding the callInput MODEL_PRECISION: ""? I hope it works :sweat_smile: Otherwise you might need to use the docker env variable PRECISION="" when loading the container :sweat_smile: Let me know!
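
If you end up going the env-var route and are running the container yourself (rather than on Banana), it would look roughly like this — the image name and port here are placeholders, not real values:

$ docker run --gpus all -e PRECISION="" -p 8000:8000 my-docker-diffusers-api-image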

Haha, yes, I will indeed do a proper doc… it’s funny how the project developed. Initially the frontend was the big open source project, but the container was much more popular (and wasn’t open source in the beginning). Probably the best “docs” so far are in the test.py file (just scroll down to the middle or so, where there are a bunch of examples). But well done, you’ve really done great with what’s available!! And I will clean this up.

Will track this here:

Same :confused:
I’m scared to set PRECISION in the docker var; it seems that during inference on the same deploy I’ll have the issue as well, no?

I must admit, until now, most people have been using separate containers for train vs inference, on serverless… but there’s no reason we shouldn’t be able to use the same container for both… it’s just untested :sweat_smile:

Umm, setting the PRECISION docker var just sets the default, I think, and it should be overridable on every call with the MODEL_PRECISION callInput. Let me know what works and what doesn’t and I’ll fix this up… but it will only be after I finish some other stuff. Just remind me where / how you’re deploying? Provider, and docker image vs build from source. There’s a lot of cleanup coming early next year!

Ok, I will do 2 forks then! :slight_smile:
Here is mine: GitHub - riderx/docker-diffusers-api: Diffusers / Stable Diffusion in docker with a REST API, supporting various models, pipelines & schedulers.
I use Banana with these args:
RUNTIME_DOWNLOADS=1
USE_DREAMBOOTH=1

Still no better with the build arg PRECISION="".

Ok, I think I found the issue: I didn’t see that you said to remove the --fp16.


Do you use only safetensors for optimization?
In my experience, for a 2GB fp16 model, loading from the original .bin files took 5.3 seconds, while unet + vae safetensors with SAFETENSORS_FAST_GPU took 4.5 seconds. I wonder how you optimized init time to around 2.2s.

And do you now use RunPod for kiri.art?
I used RunPod serverless; most first responses finish within 14 seconds, and after a cold boot it took just 4 seconds. But sometimes it took 50 seconds to get the first completed response. Maybe that’s because of the container starting, but I can’t figure out why sometimes the container doesn’t need to boot and sometimes it does. (All based on a 2GB fp16 model.)

Because Banana is still unstable, I assume you are using your own optimization and RunPod to run kiri.art.

I dug into pipeline.ai, but I found async requests were slow there, and because they are not docker-based, you can’t set up your own environment. You have to fit into their server’s env.

So if there’s a good way to optimize cold boots, I think RunPod serverless will be the best way.
I hope to learn how you did that!

(FYI, I tried tracing to optimize, but it wasn’t effective)


Hey @hihihi :slight_smile:

There are 3 things that make the cold boots fast:

  1. safetensors
  2. improvements in diffusers (particularly the “fast load” as part of their low memory code - enabled by default)
  3. improvements in docker-diffusers-api (I spent a lot of time measuring how long each part of init takes and improving where I could)

Thanks for including the timings from your own experience. Is that with docker-diffusers-api too? On Banana? It may also depend on Banana’s load during test time; I had more data for the Banana-optimized containers and there was greater variance in those times… so I guess I’ll have to do more testing over a longer period with “our optimization” on Banana.

Ah they’re still unstable? That’s pretty upsetting to hear :confused:

RunPod was a good assumption, but no, we’re not using it for kiri.art :sweat_smile: I’m actually just renting an RTX 3090 off vast.ai and running docker-diffusers-api there. So it’s not serverless, it’s always-on (with no cold boots)… but I’m paying about $220/mo (vs Banana’s new pricing, where a shared A100 would cost about $1600/mo!). There’s also a custom (closed-source) load-balancer / queue / future autoscaler I put in front of it.

It’s hard to want to go back to serverless after enjoying such great response times, but serverless does have scaling advantages. I think, moving forward, I’ll probably have a hybrid solution, using a mix of long-running servers and scale-to-zero serverless, but it’s a bit of work, as you can imagine. I’ll almost definitely use RunPod for this (Banana’s new pricing unfortunately makes them irrelevant for me).

Please keep sharing your experiences, I’ve only had limited experience with RunPod and their serverless offering is still in beta. I think I noticed that for the very first call, it still has to download/extract the docker image. I think that can include the first call that goes to a new worker, but not sure. I bet as a community we’ll make a lot of fun discoveries and improvements here. I’ll spend more time on this in a few more weeks after I’ve finished some more important things (in docker-diffusers-api).

Good to know, thanks for sharing this!

And even bigger thanks for sharing this! Tracing was on my list of things to try, so I’m happy you did it for me :raised_hands: Without needing to worry about breaking banana’s optimization anymore, there are a loooot of new possibilities opened up. Top of my list (after I complete the other things I mentioned) is PyTorch 2 and Meta’s AITemplate. But I think it will help on inference speeds more than anything else (but we won’t know until we measure it :)).

I’m pretty happy with cold boots at the moment and am not sure how much more can be improved, on our side at least. What I would really love to see, on both Banana and RunPod, is on-prem S3-compatible storage. We’d have to measure, but I’d prefer to have much smaller docker images. Switching models at runtime is “bad” for serverless, but I still think it will be faster than a cold boot, it’s much more maintainable (a single image for all models), and it’s more likely to re-use a running instance. Anyways, a lot of innovation is happening in this space, so let’s see what develops moving forwards :smiley:

My current roadmap, hopefully by mid-Jan: continuous integration testing with automatic publishing of images with semantic versioning. Then back to work on everything else.

Thank you for the details!
I measured the time locally, just using diffusers.
Serverless seems like a slow option for serving models quickly to users. Thanks for telling me that you are using your own server and load balancer. From now on, I’d better set up my own server.
I heard that big companies with lots of users use Amazon Inferentia for serving and NVIDIA Triton Inference Server as the serving framework. To increase throughput, batching is needed for requests that arrive within a certain timeframe, and I heard Triton already handles that along with load balancing.
I will look at your docker image to see how you optimized cold boots! Thank you.
And I will also look at PyTorch 2 and Meta’s AITemplate.


Maybe this will help with your inference optimizations.
It seems TensorRT is the best.


This is interesting, too. They claim to be better than TensorRT and AITemplate.


Thanks, @hihihi! This is great!! I don’t always manage to keep up to date with everything, and I hadn’t heard of OneFlow before. It looks like they plan to integrate this code into diffusers too, which will make it super easy.

Now that we don’t have to worry about breaking banana’s optimization all the time, we’re free to try all this stuff. In general, I’m working on automated testing and release cycle at the moment, but after that, I hope we’ll have quite a fast release cycle, as it will get easier and easier to make changes, know they don’t break anything, and release with confidence :tada:

That’s cool. I tested OneFlow and it works well. They made it super easy, like this:

    import oneflow as torch
    from diffusers import OneflowStableDiffusionPipeline as StableDiffusionPipeline

With a T4, I confirmed that 5 it/s with no OneFlow went to 10 it/s with OneFlow.
But the problem is, loading the model took very long.
So under the same conditions (model load plus 1 inference), the result is 45 seconds vs 13 seconds (with OneFlow / without OneFlow).
Maybe it’s because loading their C++ compiler took so long, so I will test with a better CPU.
So I think AITemplate will be the best option for production Stable Diffusion, if they use an already compiled model so that model loading is fast.
Will continue to share!
Thanks.


If using OneFlow and keeping the model loaded in memory with no release, then I think every inference request can be done within 2 seconds. If you’re serving a few unchanging models, it will be the best option for throughput, I think.


Thanks again for sharing all this, @hihihi! Really helpful (and motivating :sweat_smile: ). I’m very confident that together we’ll get inference to be super fast! I’ll be in touch when I actually start working on all this stuff but in the meantime please do continue to share your findings :pray: :pray:

Yeah, I will keep testing!

The most promising one that could be adopted in production looks like this.
But from my research, knowledge distillation may be ongoing for Stable Diffusion, and then all these approaches will be useless because that will be the fastest one (within 1 second).