Introduction
Warning: This is still under active development and only for the truly adventurous. You need the dev branch of docker-diffusers-api (since 2022-11-22) until this is released into main. It's still very new and will undergo further development based on feedback from early users.
Dreambooth is a method to personalize text2image models like Stable Diffusion given just a few (3~5) images of a subject. Here we provide a REST API around diffusers' dreambooth implementation (consider reading that first if you haven't already).
There are two steps to using dreambooth. Each step happens in a separate container with its own build-args (but with the same docker-diffusers-api source).
- Fine-tuning an existing model with your new pics (and then uploading the fine-tuned model somewhere after training)
- Using your new model (deploying a new container that will download it at build time for nice and fast cold starts).
Let’s take a look at each step.
Training / Fine-tuning
Deploy a new repo with the following build-args (either via Banana’s dashboard, or by editing the Dockerfile in the appropriate places, committing and pushing).
- USE_DREAMBOOTH=1
- PRECISION="" (download fp32 weights needed for training; output still defaults to fp16)
- For S3 only, set AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY and AWS_DEFAULT_REGION. Don't set AWS_S3_ENDPOINT_URL unless your non-Amazon S3-compatible storage provider has told you exactly what to put here. More info in docs/storage.md.
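If you're building locally rather than via the dashboard, these can be passed as ordinary Docker build-args. Here's a minimal sketch (the image tag is arbitrary, the AWS values are placeholders, and it assumes the repo's Dockerfile exposes these as ARGs, per docs/storage.md):
# Sketch: local build of the training container (values are placeholders)
$ docker build -t diffusers-api-dreambooth \
    --build-arg USE_DREAMBOOTH=1 \
    --build-arg PRECISION="" \
    --build-arg AWS_ACCESS_KEY_ID="YOUR_KEY_ID" \
    --build-arg AWS_SECRET_ACCESS_KEY="YOUR_SECRET" \
    --build-arg AWS_DEFAULT_REGION="us-west-1" \
    .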
Now either build and run locally, or deploy (e.g. to banana), and test with any of the following methods:
# For both options below, make sure BANANA_API_KEY and BANANA_MODEL_KEY
# are set. Or, for local testing, simply remove the --banana parameter
# to have test.py connect to localhost:8000.
# Upload to HuggingFace. Make sure your hf token has read/write access!
$ python test.py dreambooth --banana \
--model-arg hub_model_id="huggingFaceUsername/modelName" \
--model-arg push_to_hub=True
# Upload to S3 (note the triple forward slash /// below)
$ python test.py dreambooth --banana \
--call-arg dest_url="s3:///bucket/model.tar.zst"
# For prior-preservation loss, add:
--model-arg with_prior_preservation=true \
--model-arg prior_loss_weight=1.0 \
--model-arg class_prompt="a photo of dog"
# One iteration only (great to test your workflow), add:
--model-arg max_train_steps=1 \
--model-arg num_class_images=1
NB: this is running the dreambooth test from test.py, which has a number of important JSON defaults that get sent along too. The default test trains with the dog pictures in tests/fixtures/dreambooth with the prompt "photo of sks dog". The --call-arg and --model-arg parameters allow you to override and add to these test defaults.
Alternatively, in your own code, you can send the full JSON yourself:
{
  "modelInputs": {
    "instance_prompt": "a photo of sks dog",
    "instance_images": [ b64encoded_image1, image2, etc ],
    // Option 1: upload to HuggingFace (see notes below)
    // Make sure your HF API token has read/write access.
    "hub_model_id": "huggingFaceUsername/targetModelName",
    "push_to_hub": true
  },
  "callInputs": {
    "MODEL_ID": "runwayml/stable-diffusion-v1-5",
    "PIPELINE": "StableDiffusionPipeline",
    "SCHEDULER": "DDPMScheduler", // train_dreambooth default
    "train": "dreambooth",
    // Option 2: store on S3. Note the s3:/// (three forward slashes). See notes below.
    "dest_url": "s3:///bucket/filename.tar.zst"
  }
}
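As a minimal Python sketch of how you might build and send that payload to a locally running container: it assumes the container's default HTTP endpoint is at http://localhost:8000/ (the same address test.py uses when you omit --banana), and the image paths and model names are placeholders.
import base64
import json
import requests

def b64(path):
    # Read an image file and return it as a base64 string
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

payload = {
    "modelInputs": {
        "instance_prompt": "a photo of sks dog",
        # 3~5 instance images, base64-encoded (paths are placeholders)
        "instance_images": [b64(p) for p in ["dog1.jpg", "dog2.jpg", "dog3.jpg"]],
        "hub_model_id": "huggingFaceUsername/targetModelName",
        "push_to_hub": True,
    },
    "callInputs": {
        "MODEL_ID": "runwayml/stable-diffusion-v1-5",
        "PIPELINE": "StableDiffusionPipeline",
        "SCHEDULER": "DDPMScheduler",
        "train": "dreambooth",
    },
}

# Assumption: "/" is the endpoint when the container runs locally on port 8000.
resp = requests.post("http://localhost:8000/", json=payload)
print(json.dumps(resp.json(), indent=2))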
Other Options
- modelInputs
  - mixed_precision (default: "fp16"): this takes half the training time and produces smaller models. If you want to, instead, create a new, full precision fp32 model, pass the modelInput { "mixed_precision": "no" }.
  - resolution (default: 512): set to 768 for Stable Diffusion 2 models (except for -base variants).
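For example, to train a full-precision fp32 model, or a Stable Diffusion 2 (non-base) model at its native 768 resolution, you could add these overrides to any of the test invocations above (this just illustrates the flag syntax; picking a compatible MODEL_ID is up to you):
$ python test.py dreambooth --banana \
  --model-arg mixed_precision="no" \
  --model-arg resolution=768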
Using your fine-tuned model (Inference)
Now you need to deploy another docker-diffusers-api instance that's built against your fine-tuned model, with the following build-args (again, either through the Banana dashboard, or by editing the Dockerfile):
- If you uploaded to HuggingFace:
  - Set MODEL_ID=<hub_model_id> from before.
- If you uploaded to S3:
  - Set MODEL_ID to an arbitrary (unique) name.
  - Set MODEL_URL="s3:///bucket/model.tar.zst" (filename from previous step).
  - Note the three forward slashes at the beginning (see the notes below).
That’s it! Docker-diffusers-api will download your model at build time and be ready for super fast inference.
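For example, if you stored the model on S3, a local build of the inference container might look like this sketch (the image tag and bucket path are placeholders; the AWS_* build-args are presumably needed again for a private bucket, as in the training step):
# Sketch: local build of the inference container (values are placeholders)
$ docker build -t diffusers-api-myfinetune \
    --build-arg MODEL_ID="my-dreambooth-dog" \
    --build-arg MODEL_URL="s3:///bucket/model.tar.zst" \
    --build-arg AWS_ACCESS_KEY_ID="YOUR_KEY_ID" \
    --build-arg AWS_SECRET_ACCESS_KEY="YOUR_SECRET" \
    --build-arg AWS_DEFAULT_REGION="us-west-1" \
    .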
As usual, you can override one of the default tests to specifically test your new model, e.g.:
$ python test.py txt2img --banana \
--call-arg MODEL_ID="<your_MODEL_ID_used_above>" \
--model-arg prompt="sks dog"
Roadmap
- Status updates
- S3 support - done, see example above and notes below.
With an eye to the future, I know both the banana team and I would love an API to deploy your model automatically on completion; I'll speak to them about this after I finish the fundamentals. Also, we may consider allowing a container to download your model at run time; there are some use-cases for this, but don't forget, it will be much slower.
Known Issues
- No fp16 support, pending #1247 "Attempting to unscale FP16 gradients"
- No xformers / memory_efficient_attention: now working!
- No prior-preservation: now supported, example above!
- Not tested on banana yet. From what I recall, the timeout limit was raised, but I don't remember if this has to be requested. I'll check closer to official release time. Also, it would be great to have the fp16 + xformers stuff fixed before then, as that will greatly speed up training. But focused on the basics first.
- Banana runtime logs have a length limit! If you see (in the runtime logs) that training gets "stuck" early on (at around the 100 iteration point?), fear not… your run() / check() will still complete in the end after your model has been uploaded. I'll look into making the logs… shorter.
Storage
This was explained above, but here’s a summary:
- HuggingFace: HuggingFace allows for unlimited private models even on their free plan, and is super well integrated into the diffusers library (and you already have your token all set up and in the repo! – just make sure your token has read/write access).
  However, HuggingFace is also much slower. On initial tests from a Banana instance, for an average 4GB model, you're looking at roughly 5 MB/s (40 Mbps), or 15m to upload. So, although you're not paying for storage, you'll end up paying much more to Banana because you're paying for those upload seconds with GPU seconds.
  - Upload: set { modelInputs: { hub_model_id: "hfUserName/modelName", push_to_hub: true } }
  - Download on build: set build-arg MODEL_ID=<hub_model_id_above>, that's it!
- S3 - see docs/storage.md. For an average 4GB model, from Banana to AWS us-west-1, 60 MB/s (480 Mbps), or 1m. This works out to 1/15th of the time/cost in GPU seconds compared to uploading to HuggingFace. Make sure to set the following build-args: AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_DEFAULT_REGION. Do not set AWS_S3_ENDPOINT_URL unless your non-Amazon S3-compatible storage provider has told you you need it. More info on S3-related build-args in docs/storage.md.
  - Uploads - set e.g. { callInputs: { dest_url: "s3:///bucket/model.tar.zst" } }
  - Download on build - set e.g. MODEL_URL="s3:///bucket/model.tar.zst"
  - Development - optionally run your own local S3 server with 1 command! (See the example after this list.)
  - NB: note that it's s3:/// at the beginning (three forward slashes). More details on this format in the storage doc above.
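On that last point, docs/storage.md is the authoritative reference; as one hedged example (MinIO is just a common S3-compatible server, not necessarily the one the project uses), a local S3 server really can be a one-liner:
# Assumption: MinIO as the local S3-compatible server; point AWS_S3_ENDPOINT_URL
# and the AWS_* credentials at it (the credentials shown are MinIO's defaults).
$ docker run -p 9000:9000 -p 9001:9001 \
    -e MINIO_ROOT_USER=minioadmin -e MINIO_ROOT_PASSWORD=minioadmin \
    minio/minio server /data --console-address ":9001"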
Acknowledgements
- Massive thanks to all our dreambooth early adopters: Klaudioz, hihihi, grf, Martin_Rauscher and jochemstoel. There is no way we would have reached this point without them.
- Special thanks to @Klaudioz, one of our earliest adopters, who patiently ploughed through our poor documentation (at the time) and helped us improve it for everyone! The detailed, clear docs you see above are largely a result of our back-and-forth in the thread below.