Not sure if the model really ran or not, here are all logs and outputs

gadicc · November 28, 2022, 2:21pm

Hey! So firstly, @grf, welcome to the forums, and especially… welcome to your awesome DOTT avatar. Great memories! (And how did everything go so downhill since the 90s? :)).

Also, thanks for the detailed report, logs, etc., which make such a big difference!

It does indeed look like there’s an upper limit on Banana runtime logs, since training did finish, as you pointed out, with { message: 'success', modelOutputs: { /* ... */ } }. I wonder if Banana would be willing to raise the limit; otherwise I guess I can just make it log less.

It’s telling you that nothing was uploaded. Now, you might ask, why would we possibly ever train a model without uploading it anywhere? Which is a very fair question. It’s useful for making sure everything works and for performing timing tests on training only. But I think it’s fair that I add a big warning somewhere that this is what’s happening.

In short (and again, I’ll make this clearer), you didn’t actually specify what to do with the model after training. For S3, you need a dest_url in callInputs:

{
  callInputs: {
    dest_url: "s3:///bucket/model-filename.tar.zst"
  }
}

(Pay special attention to the triple /// at the beginning.) I know that dest_url doesn’t appear in test.py (because in testing we really are just checking that training works, not uploading), but in the example section you’ll see we call test.py dreambooth --call-arg dest_url="s3:///bucket/filename.tar.zst" to add it. Hope that’s clear! I’ll add some comments to test.py too for those learning from there.
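Since the triple slash is easy to get wrong, here is a minimal Python sketch that builds the payload above and checks the URL shape. The helper names (build_call_inputs, has_empty_host) are made up for illustration and are not part of docker-diffusers-api; the "empty host" reading is just how standard URL parsing sees s3:///, not something the container is confirmed to check this way.

```python
from urllib.parse import urlsplit


def build_call_inputs(bucket: str, filename: str) -> dict:
    """Return a callInputs dict whose dest_url uses the triple-slash S3 form."""
    # Hypothetical helper: "s3://" + "/bucket/filename" gives "s3:///bucket/filename".
    dest_url = f"s3:///{bucket}/{filename}"
    return {"callInputs": {"dest_url": dest_url}}


def has_empty_host(url: str) -> bool:
    """True when the URL is s3:// with nothing between '//' and the next '/'."""
    parts = urlsplit(url)
    return parts.scheme == "s3" and parts.netloc == ""


payload = build_call_inputs("bucket", "model-filename.tar.zst")
good = payload["callInputs"]["dest_url"]
bad = "s3://bucket/model-filename.tar.zst"  # only two slashes

print(good, has_empty_host(good))
print(bad, has_empty_host(bad))
```

With only two slashes, a standard URL parser treats "bucket" as the host part rather than as the start of the path, which is presumably why the third slash matters.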

Indeed, only the new fine-tuned model gets uploaded to S3. Then you can deploy that model to another instance (with optimized cold starts, etc.) to run inference and generate new images with the new fine-tuned model. With docker-diffusers-api, just set the MODEL_URL build-arg to the same s3:/// URL, and it will download the model for you at build time.

Yes, exactly. I’ll make it clearer in the logs when training is called with no destination given. And if there was an error, it would indeed be in the logs, not on the dashboard.

Now on to the banana questions…

grf:

To my surprise after calling the check API with the same call id as before I got this error now, is it supposed to happen? :
check:
Exception: inference server error: taskID does not exist: task_1d298fc1-bb1e-4be9-a2ab-f1457f3678c5. This is a general inference pipeline error, and could be due to: 

Yeah, that is correct. The model results don’t stick around after you’ve consumed them. I’m not sure how long they stick around after inference (or in this case, training) has finished if they are never consumed.

So it’s basically:

start - start inference with the given options
check - check if inference is done and return the results
run = start + check

So calling run() absolutely should start everything for you. This is in Banana’s SDK at least. I decided to use their REST API directly in test.py, and that’s definitely a bit more work. It can be useful for long-running tasks though (in kiri.art, we call start in a serverless function with credentials, and then have the user’s web browser keep doing the checks for the results).
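To make the start / check / run relationship concrete, here is a rough Python sketch against an in-memory stand-in. FakeBackend, the "finished" field and the polling interval are all assumptions for illustration; this is not Banana’s SDK or REST API, it only mimics the behaviour described above, including results disappearing once they have been consumed.

```python
import time
import uuid


class FakeBackend:
    """In-memory stand-in for a task service; NOT Banana's real API."""

    def __init__(self, polls_until_done: int = 2):
        self._tasks = {}
        self._polls_until_done = polls_until_done

    def start(self, model_inputs: dict) -> str:
        """Begin a task and return its ID immediately."""
        task_id = f"task_{uuid.uuid4()}"
        self._tasks[task_id] = {
            "polls_left": self._polls_until_done,
            "inputs": model_inputs,
        }
        return task_id

    def check(self, task_id: str) -> dict:
        """Report progress; hand over the result once, then forget the task."""
        if task_id not in self._tasks:
            raise KeyError(f"taskID does not exist: {task_id}")
        task = self._tasks[task_id]
        if task["polls_left"] > 0:
            task["polls_left"] -= 1
            return {"finished": False}
        del self._tasks[task_id]  # results are consumed on the first successful read
        return {"finished": True, "message": "success", "modelOutputs": [task["inputs"]]}


def run(backend, model_inputs: dict, poll_interval: float = 0.0) -> dict:
    """run = start + check: start the task, then poll until it finishes."""
    task_id = backend.start(model_inputs)
    while True:
        result = backend.check(task_id)
        if result["finished"]:
            return result
        time.sleep(poll_interval)


result = run(FakeBackend(), {"prompt": "example"})
print(result["message"])
```

Splitting start and check like this is what lets one party kick off the job and another poll for the result, at the cost of having to hold on to the task ID yourself.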

Also just wanted to double check that you saw the guide at

https://banana-forums.dev/t/dreambooth-training-first-look/36/2

Hope that’s all clear! Let me know how it goes. Don’t hesitate to ask for any further clarifications especially as we shore up the docs for everyone.