Hey @George84
Thanks for your kind words and welcome to the forums!
Indeed, serverless has its pros and cons and it can be a difficult balance to choose.
I’m afraid I’m out of date with both Banana and RunPod. The Banana guide and RunPod guide have some timing info but they’re a few months old. The best place to ask is probably on the respective Discords.
In general, for both cases:
- The (main) model is committed to the docker image, and scaling occurs via the regular docker / kubernetes methods.
- Banana have their secret black-box optimisation of models, which helps a lot with load times (although we have our own optimisation which, on last check, outperformed it anyway, but requires some extra work). RunPod copy the image to 2-3 extra machines before it’s needed, for faster cold boots. Cold starts are slow for two reasons: copying the image over the network / cloud (let’s call this a “very cold boot”), and loading the model into memory, which has to happen again every time the container is booted, even if its image / model is already on local disk (see the sketch after this list).
- I’ve asked about on-site S3-compatible storage for both, which would be a dream, but it’s not a priority for them, and it’s a big ask for a whole other service outside their core business. But speeds to cloud providers are good (considering) - in fact, I thiiink Banana even use AWS Elastic Container Registry for the docker image storage.
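To make that cold-boot point concrete, here’s a minimal sketch of the usual pattern (paths and the handler name are hypothetical; the real entrypoint depends on the serverless framework): the weights are baked into the image at build time, and loaded into GPU memory once at module import, so only the first request after a boot pays the load cost.

```python
import torch
from diffusers import DiffusionPipeline

# Baked into the docker image at build time (assumed path), so no network
# fetch is needed at boot -- only the load into GPU memory.
MODEL_DIR = "/models/sd"

# Runs once per container boot, at import time; this is the part of a cold
# boot you can't avoid even when the image/model is already on local disk.
pipe = DiffusionPipeline.from_pretrained(
    MODEL_DIR, torch_dtype=torch.float16
).to("cuda")

def handler(inputs: dict) -> dict:
    # Per-request entrypoint; name and signature depend on the framework.
    image = pipe(inputs["prompt"]).images[0]
    image.save("/tmp/out.png")
    return {"image_path": "/tmp/out.png"}
```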
You’re right, having lots of models (or permutations!) is not so practical on serverless atm, in light of the restrictions above. Also with so many models, you lose some advantages of serverless scaling if every request is a cold boot.
As for your specific question, well, first some questions:
- Is the primary model a “big” model, of which there are only a few (e.g. SDv2.1, etc)?
- Is it possible to do the secondary models as LoRAs? That means the models will be pretty small (iirc ~50MB for diffusers-generated, ~300MB for A1111-generated, with default options). These will be easy to download from cloud storage as needed, and easy to add to / remove from the base model (see the sketch below).
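If that route works, here’s a rough sketch of how it could look with diffusers (assuming a recent version with `load_lora_weights` / `unload_lora_weights`; the bucket and file names are made up for illustration):

```python
import os
import boto3

LORA_CACHE = "/tmp/loras"

def fetch_lora(weight_name: str) -> None:
    """Download a small LoRA file from S3-compatible storage, if not already cached."""
    os.makedirs(LORA_CACHE, exist_ok=True)
    local_path = os.path.join(LORA_CACHE, weight_name)
    if not os.path.exists(local_path):
        # ~50-300 MB, so this is quick compared to a full model download.
        boto3.client("s3").download_file("my-lora-bucket", weight_name, local_path)

def infer_with_lora(pipe, prompt: str, weight_name: str):
    """Apply one LoRA on top of the resident base pipeline, run, then restore it."""
    fetch_lora(weight_name)
    pipe.load_lora_weights(LORA_CACHE, weight_name=weight_name)
    try:
        return pipe(prompt).images[0]
    finally:
        pipe.unload_lora_weights()  # base model is clean again for the next request
```

The nice part is that the big base model only ever loads once per container; each request just pays for a small download (the first time) plus applying the LoRA.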
There’s some info in LoRA fine-tuning (no training / fine-tuning yet, but the topic covers inference too). Tbh it’s still a bit of a mess in the ecosystem, with different formats out there. When I have a chance I hope to make it easier with diffusers & docker-diffusers-api, but I’m super swamped with work for the next few months.
In short, if the secondary models can be LoRAs I think this can work great with serverless. If not, you’ll have to consider your options. Downloading entire (big) models from the cloud at runtime is possible too, just slower… you could balance how long the containers stay up for vs user experience; or consider a server-based option, which is possibly a better fit for this (but only, I’d say, until we can get local S3-compatible storage, which would completely change the game). Try to find my feature request for this on Banana and upvote it; for RunPod I think it was just in Discord DMs.
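For completeness, a hedged sketch of that runtime-download fallback (bucket/key names are hypothetical, and it assumes the model was saved with `save_pretrained()` and tarred up with the pipeline files at the tarball root):

```python
import os
import tarfile

import boto3
import torch
from diffusers import DiffusionPipeline

def load_model_from_s3(key: str, cache_dir: str = "/tmp/models"):
    """Fetch a tarred save_pretrained() directory from S3-compatible storage and load it.

    This is the slow path: multiple GB over the network plus the usual load
    into GPU memory, so it only pays off if the container stays warm afterwards.
    """
    os.makedirs(cache_dir, exist_ok=True)
    local_dir = os.path.join(cache_dir, key.removesuffix(".tar"))
    if not os.path.exists(local_dir):
        tar_path = os.path.join(cache_dir, key)
        boto3.client("s3").download_file("my-model-bucket", key, tar_path)
        with tarfile.open(tar_path) as archive:
            # Assumes model_index.json, unet/, etc. sit at the archive root.
            archive.extractall(local_dir)
        os.remove(tar_path)
    return DiffusionPipeline.from_pretrained(
        local_dir, torch_dtype=torch.float16
    ).to("cuda")
```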
Anyway, hope that helps! Let us know what you end up doing and what the experience is like; I’m sure this will interest a lot of others too.