I was reading this interesting post about a new training method that Microsoft came up with: "Using LoRA for Efficient Stable Diffusion Fine-Tuning".
Full model fine-tuning of Stable Diffusion used to be slow and difficult, and that’s part of the reason why lighter-weight methods such as Dreambooth or Textual Inversion have become so popular. With LoRA, it is much easier to fine-tune a model on a custom dataset.
With LoRA, it is now possible to publish a single 3.29 MB file to allow others to use your fine-tuned model.
This would make runtime downloads of models a lot easier too. The blog post includes an example and a link to a Hugging Face Space UI.
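For reference, the usage example from the blog post looks roughly like this (the base model and LoRA weights names are just the ones used there as an example):

```python
# Roughly the usage example from the blog post: load the base model, then
# apply the tiny LoRA attention weights on top of it.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
)
# The LoRA file is only a few MB, so fetching it at runtime is cheap.
pipe.unet.load_attn_procs("sayakpaul/sd-model-finetuned-lora-t4")
pipe.to("cuda")

image = pipe("A pokemon with blue eyes.", num_inference_steps=30).images[0]
image.save("pokemon.png")
```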
Most assuredly! It became an official example in diffusers recently too. This is definitely on my list, but it might be a while until I have a chance to work on it. I'm reluctant to give an ETA now, but I'm dying to implement this for kiri.art too, so it's high on my priority list.
It is very high on my priority list too, actually. I need it in order to move from private beta to public beta. LoRA models are super small, and I need the diffusers repository to download the LoRA models at runtime, with the MODEL_ID being a parameter sent to my Banana or RunPod endpoint (Banana for now). This is much better than deploying hundreds or thousands of models.
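To make that concrete, here's roughly how I picture the call from my side, assuming the banana_dev SDK's banana.run() and guessing at the callInputs field names (the lora_weights input in particular is hypothetical):

```python
# Hypothetical sketch of the request I have in mind: the endpoint stays
# generic and the (tiny) LoRA weights are downloaded at call time.
# Field names here are my guesses, not a final/documented interface.
import banana_dev as banana

out = banana.run(
    api_key="...",
    model_key="...",
    model_inputs={
        "modelInputs": {
            "prompt": "portrait photo of sks person, studio lighting",
            "num_inference_steps": 30,
        },
        "callInputs": {
            "MODEL_ID": "runwayml/stable-diffusion-v1-5",
            # hypothetical: a LoRA weights repo/file to fetch and apply at runtime
            "lora_weights": "my-user/my-lora-fine-tune",
        },
    },
)
print(out)
```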
I hope you won’t take too long implementing this; having it before the 1st of March would be really great.
I’d very much like to have something workable by the end of the month too, but it’s definitely not something I can promise at this point… and even once the code is done, we’ll need a lot of testing before using it in production, especially as something that could be a turning point for your company.
But yes, I agree with you on all the advantages over Dreambooth, and I’m also very motivated to get this up and running sooner rather than later… so let’s see. I’ll post any updates here.
There are two other things I need before I go to public beta. I wasn’t sure if I should create a separate topic for them.
Training the text encoder
After talking to Kyle from Banana, I learned they now assign 20GB of VRAM to their replicas rather than 16. This makes Banana more than capable of also training the text encoder. On Google Colab I am able to train the text encoder just fine on 16GB, so I’m not sure why you said Banana can’t; 20GB should be more than enough anyway.
Training with image captions
Stable Diffusion has the option to train on images with captions. Captions let you describe the dataset much more specifically: rather than using one single prompt that labels all the images, you have a different prompt for each image.
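For example, with the Hugging Face datasets "imagefolder" loader, each image can carry its own caption via a metadata.jsonl file (the folder name and captions below are just placeholders):

```python
# One caption per image, using the datasets "imagefolder" loader.
# Illustrative layout:
#   my_dataset/
#     001.png
#     002.png
#     metadata.jsonl   # {"file_name": "001.png", "text": "a knight in red armor"}
#                      # {"file_name": "002.png", "text": "the same knight on horseback"}
from datasets import load_dataset

ds = load_dataset("imagefolder", data_dir="my_dataset", split="train")
print(ds[0]["text"])   # caption for the first image
print(ds[0]["image"])  # the matching PIL image
```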
Dataset URL (secondary)
Another thing I was thinking about: on runpod.io, instead of an array of base64-encoded images, you provide a URL to a zip file with the images (and optionally captions). On Banana this might be useful because they have a maximum POST size. This is not super important.
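To illustrate the server side of that, I mean something along these lines (the instance_images_url field name is made up for illustration):

```python
# Sketch: accept a URL to a zip of training images (and optional caption
# files) instead of a base64-encoded array, to stay under the max POST size.
# The "instance_images_url" field name is made up for illustration.
import io
import zipfile

import requests

def fetch_dataset(call_inputs: dict, dest_dir: str = "instance_images") -> str:
    resp = requests.get(call_inputs["instance_images_url"], timeout=60)
    resp.raise_for_status()
    with zipfile.ZipFile(io.BytesIO(resp.content)) as zf:
        zf.extractall(dest_dir)
    return dest_dir
```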
I will try to see what I can do implementing some of this stuff myself and do a pull request when I do. No promises either, but I understand this stuff a lot better now, enough to think maybe I can do it.
Oh wow, that’s really interesting that you managed on a 16GB Colab. The diffusers dreambooth README says: “Note: Training text encoder requires more memory, with this option the training won’t fit on 16GB GPU. It needs at least 24GB VRAM”. I guess things have improved since that was written, so thanks for reporting! I don’t recall if I ever tested it, but in theory, you may well already be able to pass modelInput { train_text_encoder: True } to the existing dreambooth training.
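If you want to try it, the training call might look something like this, completely untested and with the surrounding field names written from memory, so treat them as assumptions:

```python
# Completely untested sketch; field names written from memory, so treat
# them as assumptions rather than the documented interface.
training_call = {
    "modelInputs": {
        "instance_prompt": "a photo of sks person",
        "max_train_steps": 800,
        "train_text_encoder": True,  # the flag in question
    },
    "callInputs": {
        "train": "dreambooth",
    },
}
```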
Training with image captions
This would be a good feature request for diffusers, which we wrap. If they add this feature it should be trivial for us to make use of it.
Dataset URL (secondary)
Very reasonable request and not too much work to implement. It’s in line with my plans for the “Storage” class, which is becoming more flexible and powerful.
Happy to hear you’re understanding this stuff a lot better… I can totally relate, pretty much all of this is new to me too. So sure, PRs are always welcome, but no pressure at all.
For full training from scratch, sure - that’s how the whole model was trained, after all. But we’re talking about dreambooth fine-tuning, no? If that’s possible with dreambooth too (in diffusers), please send me a reference to look over… I couldn’t find any mention of this in the dreambooth README or on a quick scan of the dreambooth training code itself. But I’d love to allow for this if the code is already there.
Oh yeah, no: using captions would mean fine-tuning the whole model, not just adding a concept with Dreambooth. They actually did this using the Pokémon BLIP dataset, and they even reproduced it with LoRA, which is what this topic is about.
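For reference, that dataset is on the Hub with one caption per image; if I have the column names right, it looks like this:

```python
# The Pokémon BLIP captions dataset used for that full fine-tune (and the
# LoRA reproduction): every image comes with its own generated caption.
from datasets import load_dataset

ds = load_dataset("lambdalabs/pokemon-blip-captions", split="train")
print(ds[0]["text"])   # a short BLIP-generated caption
print(ds[0]["image"])  # the matching image
```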
Ah interesting, in the Colab he pulls his fork of diffusers from his updt branch… which has a modified train_dreambooth.py that adds both LoRA support (in official diffusers, LoRA is in a separate script) and things like --image_captions_filename, --external_captions, --captions_dir and related code.
I hope maybe he’ll submit these changes back to upstream diffusers, but either way I’m sure I’d be able to add a similar feature based off his code. In short, very possible and we’ll definitely get it - eventually. Probably a good thing to work on after I finish the initial LoRA support. Nice find, thanks!
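Just to note down the general idea for when I get to it: conceptually, “external captions” just means pairing each training image with its own caption text, something like this rough sketch (not his actual code, just how I picture it):

```python
# Rough sketch of the "external captions" idea, NOT the fork's actual code:
# each training image gets its caption from a matching .txt file in a
# captions dir, falling back to the image's own filename.
from pathlib import Path
from typing import Optional

def caption_for(image_path: Path, captions_dir: Optional[Path] = None) -> str:
    if captions_dir is not None:
        txt = captions_dir / (image_path.stem + ".txt")
        if txt.exists():
            return txt.read_text().strip()
    # fallback: derive a caption from the filename itself
    return image_path.stem.replace("_", " ").replace("-", " ")

for img in sorted(Path("instance_images").glob("*.png")):
    print(img.name, "->", caption_for(img, Path("captions")))
```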
Hey, sorry for the late reply… I missed the notification for this. No, unfortunately I didn’t. I’d like to, but I’m completely swamped with other work. Maybe someone else will be able to do it. Wish I had better news, sorry.