Dreambooth training [first look]

Hey ok great:

You need 3 forward slashes (“///”) for the S3 URL.

Don’t set an S3 endpoint unless you know you need it. The library will construct it for you automatically. But if you really want to set it, it looks like https://my-bucket.s3.us-west-2.amazonaws.com. More info at Methods for accessing a bucket - Amazon Simple Storage Service.

Generally people only set the endpoint if they’re NOT using AWS, i.e. using S3-compatible storage offered by some other provider, or running their own S3-compatible storage server like MinIO.

Just to explain the 3 slashes, quoting storage docs:

s3://endpoint/bucket/path/to/file
s3:///bucket/file (uses the default endpoint)

so using 3 slashes will use the default endpoint, which will be set automatically for you, or which you can override with AWS_S3_ENDPOINT_URL (if you really need to). Open to suggestions on how I can make this any clearer.
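
To illustrate (this is just a sketch using Python’s urlparse, not the library’s actual parsing code):

from urllib.parse import urlparse

# Three slashes: the endpoint part (netloc) is empty, so the default endpoint is used.
print(urlparse("s3:///my-bucket/model.tar.zst"))
# ParseResult(scheme='s3', netloc='', path='/my-bucket/model.tar.zst', ...)

# Two slashes: whatever follows them is treated as the endpoint.
print(urlparse("s3://my-storage.example.com/my-bucket/model.tar.zst"))
# ParseResult(scheme='s3', netloc='my-storage.example.com', path='/my-bucket/model.tar.zst', ...)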

Have added the following to known issues:

  • Banana runtime logs have a length limit! If you see (in the runtime logs) that training gets “stuck” early on (at around the 100 iteration point?), fear not… your run() / check() will still complete in the end after your model has been uploaded. I’ll look into making the logs… shorter :wink:

Big thanks to @grf who helped confirm this with some awesome detailed logs.

Hi!! Finally, I was very close to making it work. I just got this log:

Request took 562.9s
{
    "$error": {
        "code": "MODEL_MISMATCH",
        "message": "Model \"runwayml/stable-diffusion-v1-5\" not available on this container which hosts \"stabilityai/stable-diffusion-2\"",
        "requested": "runwayml/stable-diffusion-v1-5",
        "available": "stabilityai/stable-diffusion-2"
    }
}

Probably it’s an easy solution :grin:

Oh haha, yeah… I bumped the default model to stabilityai/stable-diffusion-2 (from SD v1-5).

Just specify the correct model in your call:

{
  callInputs: {
    MODEL_ID: "runwayml/stable-diffusion-v1-5"   // <-- remove this line
    MODEL_ID: "stabilityai/stable-diffusion-2"   // <-- replace with this
  },
  modelInputs: {
    // ...
  }
}

Or alternatively, to train against SDv1.5 instead of SDv2, set the correct build-arg, e.g.:

ARG MODEL_ID="stabilityai/stable-diffusion-2"  # Remove this
ARG MODEL_ID="runwayml/stable-diffusion-v1-5"  # Replace with this

Really just depends which base model you want to fine-tune against.

Hey @gadicc
I really appreciate the tutorial. I have been following it since yesterday to understand the workflow, and just now pushed the code; it’s deploying.

I have some questions about this setup.
Is there any way we can use multiple models (for inference) within the same repo?

For instance, I want to build multiple models and save them in S3, and then generate images from multiple models (one model at a time), all through the API, without creating multiple GitHub repos.

I know one solution is to load the model in the “inference” function itself, but I’m not sure if that’s a good solution.

I would really appreciate your input.

Thanks :blush:

1 Like

Hey @ayush. Firstly, welcome to the forums :smiley: And thanks for all your kind words.

Great that you’re up and running already! Please let us know if anything isn’t clear… I’ve been trying really hard to improve the docs and make things super clear lately (and one last time, I have to give a massive thanks to @Klaudioz for his patience and efforts here).

You’re not the first person to ask for this, so I think I will probably add it (when I have a chance)… as long as it’s clear that it will be much slower:

  • Downloading a particular model after every cold boot will take about a minute or so.
  • Switching models in GPU RAM can take a while too… I’m not sure how long, as I’m quite used to Banana’s optimizations; but also, I’ve done a lot of work to make init times much faster, so it’s maybe not as big a deal as it used to be.

So there’s quite a big delay every time you switch models before you can start inferring pictures (but yes, you’re right, the above all happens in the inference() function even though it’s not actually inference :sweat_smile: And those delays are the reason it’s not generally recommended).
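
To illustrate the idea (a rough sketch only, not the repo’s actual code; the inference() signature and MODEL_ID default here are assumptions), lazy model switching could look something like this:

import torch
from diffusers import StableDiffusionPipeline

_current_model_id = None
_pipeline = None

def get_pipeline(model_id: str):
    """Load the requested model, re-using it if it's already in GPU RAM."""
    global _current_model_id, _pipeline
    if model_id != _current_model_id:
        # downloading + loading the pipeline is the slow part after a cold boot
        _pipeline = StableDiffusionPipeline.from_pretrained(
            model_id, torch_dtype=torch.float16
        ).to("cuda")
        _current_model_id = model_id
    return _pipeline

def inference(model_inputs: dict, call_inputs: dict):
    pipe = get_pipeline(call_inputs.get("MODEL_ID", "stabilityai/stable-diffusion-2"))
    return pipe(model_inputs["prompt"]).images[0]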

I guess after the support is in we can do more timing tests to understand exactly how long it takes. What would help a lot is S3-compatible storage on-premises at Banana… there’s a feature request for that in these forums which you can upvote :slight_smile: It’s not a simple business decision for Banana, though, to commit to maintaining high-uptime object storage in addition to their core business offering. But let’s see :slight_smile:

1 Like

Hi! Thank you for your great work!

I’m trying your code on a brev.dev GPU instance.

Successfully built the image with USE_DREAMBOOTH=1, PRECISION=""

and ran a local server with the docker container.

python test.py dreambooth --model-arg max_train_steps=1

worked well, with the response:

Request took 44.5s (init: 44.4s, inference: 44.4s, training: 7.2s, upload: 0ms)
{
    "done": true,
    "$timings": {
        "init": 44404,
        "inference": 44415,
        "training": 7249,
        "upload": 0
    }
}

but when I try to upload the model to HuggingFace with
python test.py dreambooth --model-arg max_train_steps=1 --model-arg hub_model_id="my_huggingface_id/my_model_name" --model-arg push_to_hub=True

the HuggingFace model repo gets created, but Running test: dreambooth doesn’t finish and nothing is uploaded to the HuggingFace model.

It seems the training completes fine, but the problem occurs while the server uploads the model to HuggingFace.
Thank you again for your great work. Is there a way to solve my problem?

Hey, @hihihi :slight_smile:

So firstly, welcome to the forums, and well done on your great start to using the dreambooth training :raised_hands:

I agree with your assessment; it seems like training is working great but it’s getting stuck on upload. Unfortunately I’ve never used brev.dev before, so I’m unsure of the specifics, but is there any way you can provide the runtime logs? If you have direct docker access, you can probably just docker logs <containerId>. That will be really helpful to understand what’s going on.

One side note, it’s a little silly, but if you use a hub_model_id that you’ve used before, diffusers will first download the entire previously trained model, only to overwrite it and upload the newly fine-tuned model afterwards… which wastes a lot of time. I’ll try to improve that in a future release, but that is how the original train_dreambooth code from diffusers is written.

The other thing I can think of in the meantime is to make sure your HuggingFace auth token is set up on HuggingFace with both read AND write access, otherwise it will fail / break when trying to upload. This was one of my own early mistakes that took me a while to figure out :sweat_smile:
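
One quick way to check the token itself is valid from Python (placeholder token below; the read/write role itself is easiest to confirm on the HuggingFace access-tokens settings page):

from huggingface_hub import HfApi

# Prints account info for the given token; an invalid token raises an error.
print(HfApi().whoami(token="hf_xxx"))  # replace with your HF_AUTH_TOKEN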

Anyway, if you can provide the logs, it will be super helpful. Thanks!

Thank you for the fast and kind reply!
After I ate lunch, it had completed fine with this response:
Running test: dreambooth
Request took 1697.9s (init: 44.4s, inference: 28.3m, training: 2.8s, upload: 27.5m)
{
    "done": true,
    "$timings": {
        "init": 44404,
        "inference": 1697784,
        "training": 2782,
        "upload": 1650076
    }
}
Don’t know why it took so long. Does it usually take this long?

Ah, great news. Nothing better than going to eat and finding out something we thought was broken was actually working, and less work for us! :grin: :raised_hands:

Ok yes, so we can see that what really took so long was the upload (almost half an hour!). Average training result is about 4.2GB, so that means you were getting about 2.5 MB/s to HuggingFace.
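
For anyone curious, that estimate is just the upload figure from the $timings above divided into an assumed ~4.2GB model size:

model_size_mb = 4200             # assumed ~4.2GB trained model
upload_seconds = 1650076 / 1000  # "upload" from the $timings above, in seconds
print(f"{model_size_mb / upload_seconds:.1f} MB/s")  # ≈ 2.5 MB/s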

I’m not sure if brev.dev has limited bandwidth, or if the issue is HuggingFace being slow. However, I have recently realised that speeds to HuggingFace can vary quite a lot, and are much, much slower than to Amazon S3, so I guess I’m going to change my original recommendation and suggest S3 instead :confused:

1 Like

Hi all, based on my own experiments and feedback from other users, I’ve updated my recommendation (in the first post) from HuggingFace to S3, thus:

:white_check_mark: HuggingFace: allows unlimited private models even on their free plan, and is super well integrated into the diffusers library (and you already have your token all set up and in the repo! – just make sure your token has read/write access).

However, HuggingFace is also much slower. On initial tests from a Banana instance, for an average 4GB model, you’re looking at roughly 5 MB/s (40 Mbps), or about 15m, to upload. So, although you’re not paying for storage, you’ll end up paying much more to Banana, because you’re paying for those upload seconds in GPU seconds :confused:

:white_check_mark: S3 - see docs/storage.md. For an average 4GB model, from Banana to AWS us-west-1, about 60 MB/s (480 Mbps), or roughly 1m. That works out to 1/15th of the time (and GPU-second cost) of uploading to HuggingFace.

2 Likes

I set the AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION, and AWS_S3_ENDPOINT_URL environment variables when running the docker image and tested, but I got an error.

I typed
python test.py dreambooth --model-arg max_train_steps=1 --call-arg dest_url="s3:///MY_BUCKET_NAME/model.tar.zst"

response is

Running test: dreambooth
Request took 125.3s
{
    "description": "Internal Server Error",
    "status": 500,
    "message": "The server encountered an internal error and cannot complete your request."
}

I tested with boto3 that the AWS access key, secret key, and bucket name are correct for upload and download. But I don’t know why this error occurs, or how to debug it. Do you have any ideas?

You set a good pace! And great that you thought to check in boto3 too :ok_hand:

Hard to say without seeing the logs :confused: However, if I understood you correctly, you set all of this as environment variables, when it should be set as build arguments to docker (I’ll make this clearer). I’m not so sure how it works on brev.dev, but basically set them the same way you set USE_DREAMBOOTH=1 and PRECISION="" before; i.e., you’ll need to rebuild the image. Because of how it’s set up, the rebuild will use the cache and only redo the very last part, so it should be a very quick change (but again, I’m not sure how it will work on brev.dev specifically).

Otherwise, if you were already doing that, see if there’s any info from brev.dev on how to get the runtime log / docker log / container log, to see what’s actually happening inside.

And let us know how it goes :smiley:

Hey @gadicc, unless I can make some changes to docker-diffusers-api to download the model after the build step, I don’t see how things can scale, specifically the inference against the different fine-tuned models generated by training.

The training step can scale well, since it builds only once and keeps creating new models that are saved somewhere else for inference. But the unique inference that needs to happen against each of those models doesn’t get automated: scaling horizontally would require manually creating new models on the Banana dashboard, and since Banana has no API for that, it cannot be automated in that way.

The only alternative that I see is to make a separate model from the training one, which will only do inference, and change the way that model downloads the trained model that was uploaded to S3.

Do you see a different, easier approach to this? If not, how do you recommend doing it in a way that isn’t too disruptive to the existing repo? I don’t mind deviating my own code from the way things will be carried on in this repo and isolating my code from fork updates, as this is my sole purpose for using this repo, so I’m OK with radical changes. I really just wonder if you have suggestions, or if you agree that this could be done in the repo directly, as I suspect many others may have this need.

edit to fix typos

I rebuilt the image with

docker build --no-cache -t banana-sd-s3 --build-arg HF_AUTH_TOKEN="MY_HF_AUTH_TOKEN" --build-arg AWS_ACCESS_KEY_ID="MY_AWS_ACCESS_KEY_ID" --build-arg AWS_SECRET_ACCESS_KEY="MY_AWS_SECRET_ACCESS_KEY" --build-arg AWS_DEFAULT_REGION="MY_AWS_DEFAULT_REGION" --build-arg AWS_S3_ENDPOINT_URL="s3///MY_BUCKET_NAME"

and it built fine, but I still get the same “Internal Server Error” result.
Maybe I should change some code to see what error the server receives, and, as you advised, get the runtime log / docker log / container log.

After I do, I will share it!

Thank you!

Hey there @hihihi, do you have more logs you can provide as to why there was an error?

From your first message, you can see your JSON had upload: 0, meaning no upload was done.
Your second message showed upload: 1650076, meaning the upload completed.
For AWS you should expect the same thing: a successful upload returns a number, a failure returns 0, and sometimes there’s no error either, so just upload: 0 can indicate a failure.

When I was trying (a few times) to set up AWS and had problems, I really don’t remember seeing the 500 status error when my credentials were wrong; there was a clear error message. For example, when my bucket name was wrong I got a file-not-found response from AWS. So I wonder if this error you’re having is actually related to some other change in your code.

I have the impression this 500 error is a kind of timeout, indicating the app never finished processing whatever it was doing and got frozen somehow. Even typos in Python code return a clear error message, for example, so the generic 500 “cannot complete your request” error generally meant, to me, that the app got stuck somewhere. I could be wrong, but that’s why I believe it could be some other change, not just credentials.

Every time I was in this situation, I reverted small changes until things ran again, and only then started re-applying my changes one by one to locate exactly where the problem was.

It took me quite a few tries to make AWS upload and download work correctly. What I did was copy the storage folder from this repo into a separate test project, and create the same AWS credentials needed to run the storage class for download and upload. So basically I “borrowed” this repo’s storage class, made it work in a test project, and only then figured out exactly how to set everything up.
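
For anyone wanting to try the same kind of standalone check, a minimal sketch (the bucket, region and file names are placeholders, and this uses plain boto3 rather than the repo’s storage class) could look like:

import boto3

# Same credentials / region you'd pass as build-args to the container:
s3 = boto3.client(
    "s3",
    aws_access_key_id="MY_AWS_ACCESS_KEY_ID",
    aws_secret_access_key="MY_AWS_SECRET_ACCESS_KEY",
    region_name="us-west-1",
)

# Round-trip a small file to confirm the credentials and bucket are correct.
s3.upload_file("test.txt", "MY_BUCKET_NAME", "test.txt")
s3.download_file("MY_BUCKET_NAME", "test.txt", "test-roundtrip.txt")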

Hope this helps with some ideas.

1 Like

Ah, I can see your mistake! You’re setting AWS_S3_ENDPOINT_URL. You should only set this if you know you need to (usually only if you’re using a non-Amazon S3-compatible storage service). Otherwise you could pick one from the list at Amazon Simple Storage Service endpoints and quotas, but I recommend rather letting boto3 set it automatically for you.

You’re the second person to do this, so I feel like I must have implied somewhere that you should… if I did, please let me know where so I can change it :sweat_smile: I’ll modify storage.md to have a better comment about this.

P.S. You also don’t need --no-cache, and changing certain build-args will be much quicker without it (notably the AWS-related build-args).

@gadicc you’re right, I also had the impression that it was necessary to set up endpoint_url.

1 Like

@grf, regarding horizontal scaling and downloading models at runtime:

You’re not the first person to ask for this, and I’m going to do it when I have a chance. I also have a few ideas to speed things up beyond what would normally be possible without a build-optimized model; we’ll see if they pan out :slight_smile:

Have also started some very, very early work to automatically deploy built models :innocent: I’m not sure if it will work yet, but obviously it has the advantage of much faster cold starts.

So, I have two ways to deal with this issue… just a bit limited in time atm :confused: I have an exam and some travel coming up. But it’s next on my list when I have a chance, and I’m quite excited to see how it will work out :smiley:

1 Like

Thanks for the feedback, @grf and @hihihi.

Everyone, I’ve amended the build-args paragraph of the S3 section of docs/storage.md as follows:

Build Args

Set the following build-args, as appropriate (through the Banana dashboard,
by modifying the appropriate lines in the Dockerfile, or by specifying, e.g.
--build-arg AWS_ACCESS_KEY_ID="XXX" etc.)

ARG AWS_ACCESS_KEY_ID="XXX"
ARG AWS_SECRET_ACCESS_KEY="XXX"
ARG AWS_DEFAULT_REGION="us-west-1" # best for banana
# Optional. ONLY SET THIS IF YOU KNOW YOU NEED TO.
# Usually only if you're using non-Amazon S3-compatible storage.
# If you need this, your provider will tell you exactly what
# to put here. Otherwise leave it blank to automatically use
# the correct Amazon S3 endpoint.
ARG AWS_S3_ENDPOINT_URL

and have made similar remarks in the top post too. Sorry this wasn’t clearer earlier and thanks for the feedback :grin: