Not sure if the model really ran or not, here are all logs and outputs

grf · November 28, 2022, 1:12pm

I’m deploying the dev branch of this repo directly to banana. I used the test script as an example of how the call to the model should be made and created my own script based on it, with the same model and call inputs, grabbing the images from a folder the same way the test script does.

Here is the log on banana dashboard: 2022-11-28T05:08:58.000Z 2022-11-28 05:09:04.808803 {'type': 'init', 'status': ' - Pastebin.com

Here are the outputs on my terminal:

RUN finished with result:
{'id': '56a8b867-d1e4-4451-aad1-edbe6aba2449', 'message': '', 'created': 1669571311, 'apiVersion': '28 July 2022', 'modelOutputs': [{'test': '{\n  "modelInputs": {\n    "instance_prompt": "a photo of sks dog",\n    "instance_images": [\n      "/9j/4A..."\n    ]\n  },\n  "callInputs": {\n    "MODEL_ID": "runwayml/stable-diffusion-v1-5",\n    "PIPELINE": "StableDiffusionPipeline",\n    "SCHEDULER": "DDPMScheduler",\n    "train": "dreambooth"\n  }\n}'}]}

*not sure if the last run call was the one above or the one below*

RUN finished with result:
{'id': '596b736c-bbf4-4284-bbb9-c99355ff606a', 'message': 'success', 'created': 1669609220, 'apiVersion': '28 July 2022', 'modelOutputs': [{'done': True, '$timings': {'init': 7393, 'inference': 845194, 'training': 832903, 'upload': 0}}]}

START finished with result:
call_5cd8910c-79ac-4daa-996f-b1a9635aff1c

check:
{'id': 'a7ddc116-587b-41da-bf09-c3015c19155a', 'message': 'running', 'created': 1669612220, 'apiVersion': '28 July 2022', 'modelOutputs': None}

check:
{'id': '25c7bb8f-5053-4a55-8251-6053a52fb5ef', 'message': 'running', 'created': 1669612861, 'apiVersion': '28 July 2022', 'modelOutputs': None}

check:
{'id': '2de6e7d0-a119-43ae-9a1a-c89271cdd289', 'message': 'success', 'created': 1669613655, 'apiVersion': '28 July 2022', 'modelOutputs': [{'done': True, '$timings': {'init': 6718, 'inference': 847500, 'training': 831806, 'upload': 0}}]}

I was really just experimenting, not worrying too much about the outcomes, just really trying to make the model run correctly for the first time and check the results

From all the above, and looking at the banana dashboard logs in the pastebin link, I would say the model ran, since the step count started to increment, but apparently the logs got truncated (or perhaps not; I have no previous run to compare against).

After seeing the steps, I was expecting to just wait some time and then be able to call the check API, which I did, as you can see in the plain logs above, and the status changed from running to success. From that, what is upload=0 telling me? Would that be the generated images that are about to be uploaded? Or is it the model that just got trained and is supposed to be uploaded to S3?

I left it all as is and went to bed. This morning I checked the S3 bucket again and nothing was uploaded.
To my surprise after calling the check API with the same call id as before I got this error now, is it supposed to happen? :

check:
Exception: inference server error: taskID does not exist: task_1d298fc1-bb1e-4be9-a2ab-f1457f3678c5. This is a general inference pipeline error, and could be due to:

So, it’s not clear to me if things really finished running or not. Is anyone able to check one of those IDs and give me some info?

A few questions:

Where am I supposed to get the model generated images after the inference ran?
I was under the impression that just calling the run API does not really run everything, so I called the start API too. Coincidentally or not, it seems that things only really started running once I called the start API. Calling start also finally gave me a call ID, so I could call the check API.
Assuming my credentials were all set up correctly, should the trained model be uploaded to my S3 bucket? If there was an error there, where should the logs show up? On the dashboard?

Right now I’m confused as to whether I should just call the run API and expect everything to run from start to finish based on that single call, or whether I should call both the run and start APIs.
Is there anything else I’m missing?
Again, what about the generated images? Where should they go?

Thanks!

gadicc · November 28, 2022, 2:21pm

Hey! So firstly, @grf, welcome to the forums, and especially… welcome to your awesome DOTT avatar. Great memories! (And how did everything go so downhill since the 90s? :)).

Also, thanks for the detailed report, logs, etc., which make such a big difference!

It does indeed look like there’s an upper limit on banana runtime logs, since training did finish, as you pointed out, with the { message: 'success', modelOutputs: { /* ... */ } } result. I wonder if Banana will be willing to raise the limit; otherwise I guess I can just make it log less.

The upload: 0 is telling you that nothing was uploaded. Now, you might ask, why would we possibly ever train a model without uploading it anywhere? Which is a very fair question. It’s useful for making sure everything works and for performing timing tests on training only. But I think it’s fair that I’ll add a big warning somewhere that this is what’s happening.
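A minimal Python sketch of reading the $timings block from a finished result like the ones posted above. The helper name summarize_training_result is hypothetical, and treating an upload timing of 0 as "no upload happened" is an assumption based on this thread, not a documented contract.

```python
def summarize_training_result(result):
    """Summarize a finished call result shaped like the ones in this thread.

    Hypothetical helper: it only reads keys that appear in the posted output.
    """
    outputs = result.get("modelOutputs") or []
    if result.get("message") != "success" or not outputs:
        # Still running (modelOutputs is None) or not a success message.
        return {"finished": False, "trained": False, "uploaded": False}
    timings = outputs[0].get("$timings", {})
    return {
        "finished": bool(outputs[0].get("done")),
        "trained": timings.get("training", 0) > 0,
        # Assumption: an upload timing of 0 means no upload step ran.
        "uploaded": timings.get("upload", 0) > 0,
    }


# The second RUN result from the first post, copied as a Python dict.
result = {
    "id": "596b736c-bbf4-4284-bbb9-c99355ff606a",
    "message": "success",
    "created": 1669609220,
    "apiVersion": "28 July 2022",
    "modelOutputs": [
        {
            "done": True,
            "$timings": {"init": 7393, "inference": 845194, "training": 832903, "upload": 0},
        }
    ],
}

summary = summarize_training_result(result)
print(summary)
```

For the result above this reports that training ran but nothing was uploaded, which matches the empty bucket.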

In short (and again, I’ll make this clearer), you didn’t actually specify what to do with the model after training. For S3, you need:

{
  callInputs: {
    dest_url: "s3:///bucket/model-filename.tar.zst"
  }
}

(pay special attention to the triple /// at the beginning). I know that dest_url doesn’t appear in test.py (because in testing we really are just checking that training works, not uploading), but in the example section you’ll see we call test.py dreambooth --call-arg dest_url="s3:///bucket/filename.tar.zst" to add it. Hope that’s clear! I’ll add in some comments to test.py too for those learning from there.
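For anyone building the request in their own script rather than through test.py, here is a rough Python sketch of adding dest_url to the payload shown in the first post. The helper with_dest_url is hypothetical, and the bucket and filename are the placeholders used in this thread, not real values.

```python
import copy
import json


def with_dest_url(payload, bucket, filename):
    """Return a copy of the payload with callInputs.dest_url set (hypothetical helper)."""
    updated = copy.deepcopy(payload)
    # Note the three slashes after "s3:" (nothing between the second and third).
    updated.setdefault("callInputs", {})["dest_url"] = f"s3:///{bucket}/{filename}"
    return updated


# Same shape as the payload echoed back in the first post.
payload = {
    "modelInputs": {
        "instance_prompt": "a photo of sks dog",
        "instance_images": ["/9j/4A..."],  # truncated base64, as in the post
    },
    "callInputs": {
        "MODEL_ID": "runwayml/stable-diffusion-v1-5",
        "PIPELINE": "StableDiffusionPipeline",
        "SCHEDULER": "DDPMScheduler",
        "train": "dreambooth",
    },
}

training_payload = with_dest_url(payload, "bucket", "model-filename.tar.zst")
print(json.dumps(training_payload["callInputs"], indent=2))
```

The original payload is left untouched, so the same base dict can be reused for a timing-only run without a destination.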

Indeed, only the new fine-tuned model gets uploaded to S3. Then you can deploy that model to another instance (with optimized cold starts, etc) to do inference of new images with the new fine-tuned model. With docker-diffusers-api, just set the MODEL_URL build-arg with the same s3:/// url, and it will download it for you at build time.

Yes, exactly. I’ll make it clearer in the logs when training is called with no destination given. And if there was an error, it would indeed be in the logs, not on the dashboard.

Now on to the banana questions…

grf:

To my surprise after calling the check API with the same call id as before I got this error now, is it supposed to happen? :
check:
Exception: inference server error: taskID does not exist: task_1d298fc1-bb1e-4be9-a2ab-f1457f3678c5. This is a general inference pipeline error, and could be due to: 

Yeah, that is correct. The model results don’t stick around after you’ve consumed them. I’m not sure how long they stick around after inference (or in this case, training) finished without consumption.

So it’s basically:

start - start inference with the given options
check - check if inference is done and return the results
run = start + check

So calling run() absolutely should start everything for you, at least in banana’s SDK. I decided to use their REST API directly in test.py, which is definitely a bit more work. It can be useful for long-running tasks though (in kiri.art, we call start in a serverless function with credentials, and then have the user’s web browser keep doing the checks for the results).
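The start / check / run relationship above can be sketched in Python roughly as follows. This is not the SDK's implementation: start and check are hypothetical callables standing in for the real API calls, and the message values ("running", "success") are taken from the logs in this thread.

```python
import time


def run_via_polling(start, check, poll_interval=5.0, max_checks=720):
    """Emulate run() as one start() followed by repeated check() calls.

    `start` returns a call ID and `check(call_id)` returns a result dict.
    Both are hypothetical stand-ins for the real API calls.
    """
    call_id = start()
    for _ in range(max_checks):
        result = check(call_id)
        message = result.get("message") or ""
        if message == "success":
            return result
        if "error" in message.lower():
            # Stop polling as soon as an error is reported.
            raise RuntimeError(f"call {call_id} failed: {message}")
        time.sleep(poll_interval)
    raise TimeoutError(f"call {call_id} still running after {max_checks} checks")


# Fake start/check functions so the sketch runs without network access.
_responses = iter([
    {"message": "running", "modelOutputs": None},
    {"message": "running", "modelOutputs": None},
    {"message": "success", "modelOutputs": [{"done": True}]},
])


def fake_start():
    return "call_example"


def fake_check(call_id):
    return next(_responses)


final = run_via_polling(fake_start, fake_check, poll_interval=0)
print(final["message"])
```

Since results don’t stick around after they have been consumed, keep the dict returned by the successful check rather than calling check again with the same call ID.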

Also just wanted to double check that you saw the guide at

https://banana-forums.dev/t/dreambooth-training-first-look/36/2

Hope that’s all clear! Let me know how it goes. Don’t hesitate to ask for any further clarifications especially as we shore up the docs for everyone.

gadicc · November 28, 2022, 2:42pm

Thanks for all the feedback, @grf!

    chore(dreambooth): warn when neither push_to_hub or dest_url given
    chore(tests/dreambooth): show example destinations in comments
    chore(tests): stop the loop if message.contains(/error/) - from another issue

grf · November 28, 2022, 3:42pm

Wow @gadicc, I don’t know how to thank you enough for all the detailed explanation

DOTT is really a masterpiece

I’m a guy who likes to be thorough, especially in technical subjects that have so many details. Glad it helps to get the message across.

I did check that first look post 2 weeks ago, really helped understand things a bit more
Now with the rest of your explanation here things are making much more sense.

Now, the only thing that is not clear to me is where the images generated by the prompts on inference step will be located, or how am I going to be able to download them

gadicc · November 28, 2022, 3:57pm

Details make all the difference!

Ah that first post has been updated a lot since then, mostly due to the awesome back-and-forth I’ve had with @Klaudioz to make things clearer and clearer from the rather scarce info there in the beginning.

Ok, sorry, you did ask that and I answered it as part of another question, so let me try to be clearer. Side note: it’s misleading because banana only has an “init” and an “inference” stage, and we’re actually performing training (and not inference) in the second stage, but we still have to call it “inference”, and that’s confusing.

In short, when you’re doing training, no images are returned. Only the fine-tuned model is uploaded to your destination of choice. You should then deploy that fine-tuned model to a different container (that’s the “using your fine-tuned model” section of the “first look” post) and use it exactly as you usually would with the regular SD model. Open to suggestions on how I can make this clearer.

In other news… did you enjoy Return to Monkey Island as much as I did?