Runtime model modifications in a serverless setup

Hello!

First of all, I would like to thank you for the high-quality code!

I am trying to implement diffusers-based model modification during inference:

  1. Based on the request params, select weight tensors from secondary model(s) and add them to the main model
  2. Compute forward pass
  3. Subtract the weights to get the original model back, in case the worker gets another request with different params
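In code, the add / subtract step is roughly this (a simplified sketch; `pipe` and `selected_deltas` are just placeholder names for my pipeline and the chosen secondary-model tensors):

```python
import torch

@torch.no_grad()
def apply_delta(model, deltas, scale=1.0, sign=+1):
    """Add (sign=+1) or subtract (sign=-1) secondary-model tensors in place."""
    params = dict(model.named_parameters())
    for name, delta in deltas.items():
        params[name].add_(delta.to(params[name].device), alpha=sign * scale)

# per request: add the selected weights, run inference, then restore the originals
apply_delta(pipe.unet, selected_deltas, sign=+1)
images = pipe(prompt).images
apply_delta(pipe.unet, selected_deltas, sign=-1)
```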

This works perfectly on a local (and therefore on-demand) setup, but I would like to adapt it to a serverless (Banana/RunPod) setup - and here I get overwhelmed, because serverless is new to me. I would really appreciate your help on the following topics. (Baking multiple images - one per combination - won't work, because the number of secondary-model permutations is enormous.)

  1. Do I understand correctly that both RunPod and Banana store models globally for fast scaling? If so, I guess modifying weights on one worker would impact all other active workers too?

I have some workaround ideas that might work (a deep copy of the main model before the weights mod / a custom model paired with a custom pipeline that does the modifications during the forward pass without saving new weights), but I don't know whether these methods are possible in a serverless setup.

  2. What is the state of the serverless providers (Banana / RunPod), and which one should I choose? Both claim low cold boots, but I was told that Banana is stable while RunPod (which seems to be cheaper per second) had some issues with 50s boots - is that still the case?

Hey @George84

Thanks for your kind words and welcome to the forums!

Indeed, serverless has its pros and cons, and it can be a difficult balance to choose.

I’m afraid I’m out of date with both banana and runpod. The Banana guide and Runpod guide have some timing info but they’re a few months old. Best place to ask is probably on the respective Discords.

In general, for both cases:

  1. The (main) model is committed to the docker image at build time (a rough sketch of that step follows this list), and scaling occurs via the regular docker / kubernetes methods.

  2. Banana have their secret black box optimisation of models, which helps a lot with load times (although we have our own optimisation which on last check outperformed it anyway, but requires some extra work). RunPod copy the image to 2-3 extra machines before it's needed, for faster cold boots. Cold starts are slow for two reasons: copying the image over the network / cloud (let's call this a "very cold boot"), and loading the model into memory, which has to happen again every time the container is booted, even if its image / model is already on local disk.

  3. I've asked both about onsite S3-compatible storage, which would be a dream, but it's not a priority for them, and it's a big ask for a whole other service outside their core business. But speeds to cloud providers are good (considering) - in fact, I thiiink Banana even use AWS Elastic Container Registry for the docker image storage.
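To make the first point concrete, here's a rough sketch of the kind of build-time download step that gets baked into the image (illustrative only - not the actual docker-diffusers-api script, and the model id is just an example):

```python
# download.py - run at docker build time (e.g. `RUN python download.py` in the Dockerfile),
# so the weights end up inside an image layer and a cold boot only pays the
# "load into (GPU) memory" cost, not a fresh network download.
from diffusers import StableDiffusionPipeline

StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",       # illustrative model id
    cache_dir="/root/.cache/huggingface",   # cached inside the image
)
```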

You’re right, having lots of models (or permutations!) is not so practical on serverless atm, in light of the restrictions above. Also with so many models, you lose some advantages of serverless scaling if every request is a cold boot.

As for your specific question, well, first some questions:

  1. Is the primary model a "big" model, of which there are only a few (e.g. SDv2.1, etc.)?
  2. Is it possible to do the secondary models as LoRAs? That would mean the models are pretty small (iirc ~50MB for diffusers-generated, ~300MB for A1111-generated, with default options). These would be easy to download from cloud storage as needed, and to both add to and remove from the base model.
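For example, something along these lines should be possible per request (a rough sketch; the bucket, key, paths and model id are made up, and it assumes the LoRA is already in diffusers format):

```python
import os
import boto3
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")

# fetch a ~50MB diffusers-format LoRA from cloud storage at request time
os.makedirs("/tmp/lora-style-a", exist_ok=True)
boto3.client("s3").download_file(
    "my-lora-bucket", "style-a/pytorch_lora_weights.bin",
    "/tmp/lora-style-a/pytorch_lora_weights.bin",
)

# apply it on top of the base model, without creating a new merged model
pipe.unet.load_attn_procs("/tmp/lora-style-a")
```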

There’s some info in LoRA fine-tuning (no training / fine-tuning yet but the topic covers inference too). Tbh it’s still a bit of a mess in the ecosystem with different formats out there. When I have a chance I hope to make it easier with diffusers & docker-diffusers-api, but I’m super swamped with work for the next few months :confused:

In short, if the secondary models can be LoRAs, I think this can work great with serverless. If not, you'll have to consider your options. Downloading entire (big) models from the cloud at runtime is possible too, just slower… you could balance how long the containers stay up for, vs user experience; or consider a server-based option, which is possibly a better fit for this (but only, I'd say, until we can get local S3-compatible storage, which would completely change the game. Try to find my feature request for this on Banana and upvote it; for RunPod I think it was just in Discord DMs).

Anyway, hope that helps - let us know what you end up doing and what the experience is like :slight_smile: I'm sure this will interest a lot of others too.

Thanks for your reply!

The secondary models are indeed LoRAs - .safetensors from A1111, which I used to load directly by manipulating the weights.
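For reference, loading them looks roughly like this (the file name is just an example):

```python
from safetensors.torch import load_file

# load an A1111 / kohya-style LoRA file and peek at its keys
state = load_file("loras/my_style.safetensors")
for key in list(state)[:5]:
    print(key, tuple(state[key].shape))
# keys typically look like lora_unet_..._lora_down.weight / ..._lora_up.weight
```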

The thing is, there are many of them, and each request to txt2img can specify which to select. As far as I know, the basic implementation allows only one LoRA.

First, I would like to avoid runtime loading (or is this negligible for LoRAs?).

Second, modifying the attention layers (I believe load_attn_procs does the same thing I am doing manually - adds matrices to the model) would modify the global model and affect other instances, if I understand Banana's 'blackbox magic' correctly. Again, if I am mistaken, please correct me - I am new to distributed cloud computing.

There is a solution: a custom model with preloaded LoRAs and a custom pipeline which does not update the model weights but adds the LoRAs to them on the fly during the forward pass (but I am saving this for later - it would require a lot of work).
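Conceptually something like this wrapper, applied to the relevant attention layers (a toy sketch, not tied to diffusers internals; all names are made up):

```python
import torch

class LoRALinear(torch.nn.Module):
    """Wraps a frozen Linear layer and adds a low-rank update at forward time,
    without ever touching the base weights."""

    def __init__(self, base: torch.nn.Linear, rank: int = 4, alpha: float = 1.0):
        super().__init__()
        self.base = base  # original layer, weights stay untouched
        self.alpha = alpha
        self.lora_down = torch.nn.Linear(base.in_features, rank, bias=False)
        self.lora_up = torch.nn.Linear(rank, base.out_features, bias=False)

    def forward(self, x):
        # base output + scaled low-rank delta, computed on the fly per request
        return self.base(x) + self.alpha * self.lora_up(self.lora_down(x))
```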

And there is an off-topic question. I am trying to build via the 'build-download' Banana repo. I've packed a model from Civitai (a single safetensors file → diffusers-like bins → a tar of safetensors with the same structure as your sample model packed in tars) and put it on S3. But unlike your prebuilt models, it won't build.

I've also run some tests on one of your prebuilt models - 17s cold boot time. That's sad. I guess I'll have to try RunPod or go for 24/7 vast.ai :frowning:

Hey again, @George84.

Sorry for my late reply, currently in the middle of some overseas travel.

Ok great! So we're talking about LoRAs. This puts us in a better position, but not a perfect one… I'll address your points and a few other relevant matters:

  1. Correct, currently you can only load one LoRA at a time (with diffusers). The diffusers team do plan to add it, but as far as I know it hasn't happened yet… see [Question] Can I load mutiple lora model? · Issue #2739 · huggingface/diffusers · GitHub.

  2. I haven't actually measured it, but I think runtime loading of LoRAs should indeed be negligible. One concern is whether we can load the base model AND the LoRA into GPU RAM independently… and I'm pretty sure we can.

  3. Banana's "blackbox", at least from my educated guess, loads and initializes the full model, then saves that new state as a pickle which is built into the docker image (at build time, when you update your git repo) for faster future loads. But either way it shouldn't have any implications if we keep the LoRAs separate, as opposed to merging them with the base model into a new model, which doesn't serve your purposes anyway.

  4. LoRAs already work how you describe… or at least, you have two choices: you can merge a LoRA into an existing model to create a new model which includes those weights, or you can simply apply it on top of a base model without merging into a new model first, which is where it really shines. This is indeed what happens with load_attn_procs(), and you can revert back to the original model by simply reloading the original attn processor with e.g. pipeline.unet.set_attn_processor(CrossAttnProcessor()), like we do here.
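In other words, per request you can do something like this (sketch only; the import path for CrossAttnProcessor may differ slightly between diffusers versions, and the model id / LoRA path are illustrative):

```python
from diffusers import StableDiffusionPipeline
from diffusers.models.cross_attention import CrossAttnProcessor  # path may vary by version

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to("cuda")

# apply a diffusers-format LoRA on top of the base model for this request
pipe.unet.load_attn_procs("/tmp/lora-style-a")
image = pipe("a photo of an astronaut").images[0]

# revert to the original model before handling the next request
pipe.unet.set_attn_processor(CrossAttnProcessor())
```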

Having said all that, probably the biggest issue now is that you can't load A1111-generated LoRAs directly and have to first convert them to the diffusers format. I think I mentioned I'd like to make this easier, but it will probably be a few months until I have the necessary time. However, it seems like you're comfortable with a lot of this stuff - maybe you could take a look at upstream diffusers' existing code and see if this is something you could contribute :wink:

Ok, on to the off-topic question:

  • If you can send the build failures I’ll be happy to take a look.

  • Is there a reason you’re not just letting docker-diffusers-api do the conversion for you?

  • Ok, one reason might be that it didn't support the safetensors format… I actually had this fixed locally but only pushed it to dev now (make sure in build-download that you reference the :dev tag if you want to try it out).

  • I haven't been on Banana for a while, but from what I recall it should be about 2.3s. You might have just caught them at a bad time, or they may have changed some of their optimization process. I suggest trying again at a different time and letting me know - I'll take a look if it's still that long.

Anyways, keep me posted and please let me know if anything is not clear, happy to elaborate further. Good luck! :smiley:

Hello! Thanks a lot for your patience and comprehensive answers - I really appreciate that. Hope your travels go well!

I had quite a three-day torture with Banana until I got it booted and found that it would be very expensive.

Migrated to RunPod - 7s cold start, <3s inference on 512x768 images.

Wrote a custom server and Dockerfile in 5 hours or so (there were some issues with building your repo with my custom models) :slight_smile: - and their pricing is really affordable for us. At least 4x cheaper!
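The serverless side ends up being little more than a handler, something like this (a minimal sketch, not my actual code; the real one runs the diffusers pipeline and the LoRA selection):

```python
import runpod  # RunPod's serverless worker SDK

def handler(job):
    params = job["input"]  # request params, e.g. prompt / which LoRAs to apply
    # ... select & apply LoRAs, run the pipeline, encode the image here ...
    return {"ok": True, "echo": params}

runpod.serverless.start({"handler": handler})
```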

I also like their advanced scaling.

So for anyone who reads this past June - I would recommend using one of RunPod's server templates (they've made a lot of them for different SDs and other AI services).

That's some really interesting feedback, thanks for taking the time to post it! No doubt it will help a lot of others. I know just from their mailing list that Banana have some big improvements coming, and maybe those will address some of this too. But it's great to have options and be able to pick the best fit (Banana's pricing was an issue for me too).