Inference server error: taskID does not exist

I’m not sure whether this issue is related to docker-diffusers-api dreambooth, but I’ll describe the situation here. I first tried with a two-day-old version of the repo, and after updating to the latest version of the dev branch I got the same error.

Command: python3 test.py dreambooth --banana --call-arg dest_url="s3:///<S3_BUCKET>/model.tar.zst" --model-arg max_train_steps=1 --model-arg num_class=images=1

Logs:

Running test: dreambooth
2022-11-27 06:50:31.833318: t+0s
{
    "id": "54ad24a4-76b2-4e3a-8666-c5f0a5c58577",
    "message": "",
    "created": 1669557032,
    "apiVersion": "28 July 2022",
    "callID": "call_61bdf0cc-05cb-4ce8-a218-09b9310c7362",
    "finished": false,
    "modelOutputs": null
}
2022-11-27 06:50:56.673410: t+0s
{
    "id": "7aeb2573-3d55-4426-b1dc-2b874d709d73",
    "message": "inference server error: error sending payload to inference server",
    "created": 1669557056,
    "apiVersion": "28 July 2022",
    "modelOutputs": null
}
2022-11-27 06:50:56.874768: t+0s
{
    "id": "4789f77f-c7c9-4c97-91d0-6811cddbfd94",
    "message": "inference server error: taskID does not exist: task_a2c93abe-d1b0-4c82-9bb6-381726eb4071. This is a general inference pipeline error, and could be due to: \n-- An inference runtime error crashed the server before it could return 500 --> be sure to test for runtime crashes on your own GPU.\n-- The payload in or out was too large --> current limit is 50mb.\n-- The model was not yet fully deployed --> try again later once the dashboard confirms deployed.\n-- (Rare) Banana's GPUs are at capacity and the scaleup timed out --> try again later.\n\t\t\t The Banana infra team is working hard to resolve these.",
    "created": 1669557057,
    "apiVersion": "28 July 2022",
    "modelOutputs": null
}
2022-11-27 06:50:57.066066: t+0s
{
    "id": "81c948ab-e19e-416b-a874-6f1c3dfdb484",
    "message": "inference server error: taskID does not exist: task_a2c93abe-d1b0-4c82-9bb6-381726eb4071. This is a general inference pipeline error, and could be due to: \n-- An inference runtime error crashed the server before it could return 500 --> be sure to test for runtime crashes on your own GPU.\n-- The payload in or out was too large --> current limit is 50mb.\n-- The model was not yet fully deployed --> try again later once the dashboard confirms deployed.\n-- (Rare) Banana's GPUs are at capacity and the scaleup timed out --> try again later.\n\t\t\t The Banana infra team is working hard to resolve these.",
    "created": 1669557057,
    "apiVersion": "28 July 2022",
    "modelOutputs": null
}

Ah, thanks @Klaudioz, there’s some extra info in here that is very helpful, notably the middle response: inference server error: error sending payload to inference server. How many images are you sending for training? Is there any chance that all of them together, after base64 encoding, end up being > 50 MB?
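A rough way to answer the payload-size question is to estimate the base64-encoded size of the training images before sending them. The sketch below is only an illustration: the helper names are hypothetical, it assumes "50mb" means 50 MiB, and it ignores the small overhead of the surrounding JSON.

```python
from pathlib import Path

# Assumption: the "50mb" limit in the error message is 50 MiB.
PAYLOAD_LIMIT_BYTES = 50 * 1024 * 1024


def base64_size(raw_size: int) -> int:
    """Bytes needed to base64-encode raw_size bytes (padding included)."""
    # Base64 emits 4 output characters for every 3 input bytes, rounded up.
    return 4 * ((raw_size + 2) // 3)


def estimated_payload_size(image_paths) -> int:
    """Sum of the base64-encoded sizes of the given files."""
    return sum(base64_size(Path(p).stat().st_size) for p in image_paths)


def fits_limit(encoded_size: int, limit: int = PAYLOAD_LIMIT_BYTES) -> bool:
    """True if the encoded payload stays within the assumed limit."""
    return encoded_size <= limit
```

Because base64 grows the data by about a third, roughly 37.5 MB of raw images would already reach a 50 MB limit.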

If that’s not it, I’d be grateful if you could please still send:

  • banana build log
  • banana runtime log for this call
  • list of any build-args you used

It was helpful to know that this broke after an update… the only thing I’ve changed recently is to use the new diffusers version, so the problem could be somewhere there. However, everything is still running perfectly locally, even with the latest dev. I want to try on banana too but need your exact build-args for that, including which MODEL_ID you’re training against, to make sure I’m testing against an identical setup (since, as you know, builds can take a while :).

I think it’s sending the ones in the fixtures or dreambooth folder. The size of the images there is 6 MB.

Build logs:

SUCCESS: Model Registered

Your model was updated and is now deployed!
It is runnable with the same credentials:

The runtime logs aren’t showing anything special. I have to stop the error loop manually.

Uh ok yeah should be no problem with that. So, just to confirm, you’re building with:

  • USE_DREAMBOOTH=1
  • PRECISION=""

and no other changes to the default build args? I guess that would use SD v1.5 up until yesterday, and SD v2 since then (based on my commits to dev).

Would still be grateful for the runtime logs just to see everything that’s going on there. Just be careful, I currently log all environment variables, so please remove your AWS info before pasting here.
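For anyone pasting logs that include the environment dump, here is a minimal sketch of masking credentials first. It is not part of docker-diffusers-api; the function name and the name pattern are assumptions, so check the output by eye before posting.

```python
import os
import re

# Assumption: any variable whose name contains one of these words is a credential.
SENSITIVE_NAME = re.compile(r"SECRET|TOKEN|PASSWORD|ACCESS_KEY", re.IGNORECASE)


def redact_env(env):
    """Return a copy of env with the values of sensitive-looking keys masked."""
    return {
        key: ("XXXXXXX" if SENSITIVE_NAME.search(key) else value)
        for key, value in env.items()
    }


if __name__ == "__main__":
    # Print the current environment with credentials masked.
    print(redact_env(dict(os.environ)))
```

With this pattern, AWS_SECRET_ACCESS_KEY, AWS_ACCESS_KEY_ID and HF_AUTH_TOKEN are masked, while values such as MODEL_ID or AWS_S3_ENDPOINT_URL stay readable for debugging.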

Re the repeating error, it’s my fault that it “repeats”; I’ll fix that. test.py should stop after a single error instead of retrying. But what we really need to figure out is why we get that error in the first place. This will take me a little while, as it seems I have to get it to happen on banana, which is much slower than working on it locally, but don’t worry, it’s on my list for today!
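The intended "stop on the first error" behaviour could look roughly like the sketch below. This is not the actual test.py code: `poll_until_done` and its `check` callable are hypothetical, and the only thing taken from the logs above is the shape of the responses (a "message" string and an optional "finished" flag).

```python
import time


def poll_until_done(check, interval=1.0, max_polls=600, sleep=time.sleep):
    """Call check() until the call is finished; raise on the first error.

    check is any callable returning a dict shaped like the responses in the
    logs above. How it talks to the server is left to the caller.
    """
    for _ in range(max_polls):
        result = check()
        message = result.get("message") or ""
        if "error" in message.lower():
            # Stop immediately rather than polling again.
            raise RuntimeError(message)
        if result.get("finished"):
            return result
        sleep(interval)
    raise TimeoutError(f"call did not finish after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays.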

Banana dashboard:

Runtime logs:


2022-11-27T04:18:57.000Z total 36208
-rw-r--r-- 1 root root 7951765 Nov 27 04:19 image0.png
-rw-r--r-- 1 root root 7772122 Nov 27 04:19 image1.png
-rw-r--r-- 1 root root 7688237 Nov 27 04:19 image2.png
-rw-r--r-- 1 root root 9121967 Nov 27 04:19 image3.png
-rw-r--r-- 1 root root 4526831 Nov 27 04:19 image4.png
2022-11-27 04:19:20.353072 {'type': 'training', 'status': 'start', 'container_id': '8d46f1de86e18b2a5c0bbe1d3aaef85aff9fa4f05a59aa98c8ac5a0986e8d9ed', 'time': 1669522760353, 't': 0, 'tsl': 9896, 'payload': {}, 'init': True}


Steps: 100%|██████████| 1/1 [00:12<00:00, 12.18s/it, loss=0.0057, lr=5e-6]
self.endpoint_url s3:///selfieai-photos/
model.tar.zst
model.tar.zst

-rw-r--r-- 1 root root 4561822747 Nov 27 04:19 model.tar.zst
[2022-11-27 04:19:53 +0000] [24] [ERROR] Exception occurred while handling uri: 'http://0.0.0.0:8000/'
Traceback (most recent call last):
  File "handle_request", line 81, in handle_request
    FutureStatic,
  File "/api/server.py", line 36, in inference
    output = user_src.inference(model_inputs)
  File "/api/app.py", line 277, in inference
    result = TrainDreamBooth(model_id, pipeline, model_inputs, call_inputs)
  File "/api/train_dreambooth.py", line 140, in TrainDreamBooth
    upload_result = storage.upload_file(filename, filename)
  File "/api/utils/storage/S3Storage.py", line 74, in upload_file
    result = self.bucket().upload_file(source, dest)
  File "/api/utils/storage/S3Storage.py", line 66, in bucket
    self._bucket = self.s3().Bucket(self.bucket_name)
  File "/api/utils/storage/S3Storage.py", line 55, in s3
    self._s3 = boto3.resource(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/__init__.py", line 101, in resource
    return _get_default_session().resource(*args, **kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/session.py", line 446, in resource
    client = self.client(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/session.py", line 299, in client
    return self._session.create_client(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/session.py", line 976, in create_client
    client = client_creator.create_client(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 155, in create_client
    client_args = self._get_client_args(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 485, in _get_client_args
    return args_creator.get_client_args(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/args.py", line 129, in get_client_args
    endpoint = endpoint_creator.create_endpoint(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint.py", line 402, in create_endpoint
    raise ValueError("Invalid endpoint: %s" % endpoint_url)
ValueError: Invalid endpoint: s3:///selfieai-photos/
[2022-11-27 04:19:53 +0000] - (sanic.access)[INFO][127.0.0.1:58144]: POST http://0.0.0.0:8000/  500 139

2022-11-27T04:59:33.000Z /opt/conda/envs/xformers/lib/python3.10/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, "__version__") or LooseVersion(


Steps: 100%|██████████| 1/1 [00:15<00:00, 15.87s/it, loss=0.269, lr=5e-6]
self.endpoint_url s3:///selfieai-photos/
model.tar.zst
model.tar.zst

2022-11-27T05:01:43.000Z /api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a model, please use <class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'>.load_config(...) followed by <class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'>.from_config(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)

[2022-11-27 05:03:05 +0000] [25] [ERROR] Exception occurred while handling uri: 'http://0.0.0.0:8000/'
Traceback (most recent call last):
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/connection.py", line 174, in _new_conn
    conn = connection.create_connection(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/util/connection.py", line 72, in create_connection
    for res in socket.getaddrinfo(host, port, family, socket.SOCK_STREAM):
  File "/opt/conda/envs/xformers/lib/python3.10/socket.py", line 955, in getaddrinfo
    for res in _socket.getaddrinfo(host, port, family, type, proto, flags):
socket.gaierror: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/httpsession.py", line 455, in send
    urllib_response = conn.urlopen(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/connectionpool.py", line 787, in urlopen
    retries = retries.increment(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/util/retry.py", line 525, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/packages/six.py", line 770, in reraise
    raise value
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/connectionpool.py", line 703, in urlopen
    httplib_response = self._make_request(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/connectionpool.py", line 386, in _make_request
    self._validate_conn(conn)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/connectionpool.py", line 1042, in _validate_conn
    conn.connect()
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/connection.py", line 358, in connect
    self.sock = conn = self._new_conn()
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/urllib3/connection.py", line 186, in _new_conn
    raise NewConnectionError(
urllib3.exceptions.NewConnectionError: <botocore.awsrequest.AWSHTTPSConnection object at 0x7f2af15ac430>: Failed to establish a new connection: [Errno -2] Name or service not known

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "handle_request", line 81, in handle_request
    FutureStatic,
  File "/api/server.py", line 36, in inference
    output = user_src.inference(model_inputs)
  File "/api/app.py", line 277, in inference
    result = TrainDreamBooth(model_id, pipeline, model_inputs, call_inputs)
  File "/api/train_dreambooth.py", line 140, in TrainDreamBooth
    upload_result = storage.upload_file(filename, filename)
  File "/api/utils/storage/S3Storage.py", line 74, in upload_file
    result = self.bucket().upload_file(source, dest)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/s3/inject.py", line 233, in bucket_upload_file
    return self.meta.client.upload_file(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/s3/inject.py", line 143, in upload_file
    return transfer.upload_file(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/s3/transfer.py", line 288, in upload_file
    future.result()
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/tasks.py", line 139, in __call__
    return self._execute_main(kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/tasks.py", line 162, in _execute_main
    return_value = self._main(**kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/tasks.py", line 348, in _main
    response = client.create_multipart_upload(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 943, in _make_api_call
    http, parsed_response = self._make_request(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 966, in _make_request
    return self._endpoint.make_request(operation_model, request_dict)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint.py", line 119, in make_request
    return self._send_request(request_dict, operation_model)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint.py", line 202, in _send_request
    while self._needs_retry(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint.py", line 354, in _needs_retry
    responses = self._event_emitter.emit(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/hooks.py", line 412, in emit
    return self._emitter.emit(aliased_event_name, **kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/hooks.py", line 256, in emit
    return self._emit(event_name, kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/hooks.py", line 239, in _emit
    response = handler(**kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/retryhandler.py", line 207, in __call__
    if self._checker(**checker_kwargs):
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/retryhandler.py", line 284, in __call__
    should_retry = self._should_retry(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/retryhandler.py", line 320, in _should_retry
    return self._checker(attempt_number, response, caught_exception)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/retryhandler.py", line 363, in __call__
    checker_response = checker(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint.py", line 281, in _do_get_response
    http_response = self._send(request)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint.py", line 377, in _send
    return self.http_session.send(request)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/httpsession.py", line 484, in send
    raise EndpointConnectionError(endpoint_url=request.url, error=e)
botocore.exceptions.EndpointConnectionError: Could not connect to the endpoint URL: "https://selfieai-photos/model.tar.zst/text-inversion-model.tar.zstd?uploads"
[2022-11-27 05:03:05 +0000] - (sanic.access)[INFO][127.0.0.1:39978]: POST http://0.0.0.0:8000/  500 139

2022-11-27T12:44:34.000Z /opt/conda/envs/xformers/lib/python3.10/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, "__version__") or LooseVersion(
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/image_utils.py:239: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/image_utils.py:396: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  def rotate(self, image, angle, resample=PIL.Image.NEAREST, expand=0, center=None, translate=None, fillcolor=None):
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/models/clip/feature_extraction_clip.py:67: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  resample=Image.BICUBIC,

environ({'CONDA_SHLVL': '2', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'REQUESTS_CA_BUNDLE': '', 'CONDA_EXE': '/opt/conda/bin/conda', '_': '/opt/conda/envs/xformers/bin/python3', 'MODEL_URL': '', 'HOSTNAME': 'selfieaidreamboothapi00484d77580e70a458dae520b2d06a10a46-f55x8k', 'PRECISION': '', 'HF_AUTH_TOKEN': 'XXXXXXX', 'AWS_SECRET_ACCESS_KEY': 'XXXXXXX', 'PIPELINE': 'ALL', 'ALIYUN_COM_GPU_MEM_CONTAINER': '16', 'USE_DREAMBOOTH': '1', 'CONDA_PREFIX': '/opt/conda/envs/xformers', 'ALIYUN_COM_GPU_MEM_POD': '16', 'AWS_S3_ENDPOINT_URL': 's3://selfieai-photos/uploads/', 'NVIDIA_VISIBLE_DEVICES': 'GPU-d0cbe3aa-ff8f-fb3a-e5ad-dabefddf9002', 'AWS_DEFAULT_REGION': 'us-east-1', '_CE_M': '', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KUBERNETES_PORT_443_TCP_ADDR': '10.96.0.1', 'CONDA_PREFIX_1': '/opt/conda', 'ALIYUN_COM_GPU_MEM_DEV': '40', 'KUBERNETES_PORT': 'tcp://10.96.0.1:443', 'PWD': '/api', 'HOME': '/root', 'CONDA_PYTHON_EXE': '/opt/conda/bin/python', 'LC_CTYPE': 'C.UTF-8', 'CHECKPOINT_CONFIG_URL': '', 'PYTORCH_VERSION': 'v1.12.1-rc5', '_CONVERT_SPECIAL': '', 'https_proxy': '', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'DEBIAN_FRONTEND': 'noninteractive', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'http_proxy': '', '_CE_CONDA': '', 'MODEL_ID': 'runwayml/stable-diffusion-v1-5', 'KUBERNETES_PORT_443_TCP': 'tcp://10.96.0.1:443', 'CONDA_PROMPT_MODIFIER': '(xformers) ', 'ALIYUN_COM_GPU_MEM_IDX': '6', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'CONDA_ROOT': '/opt/conda', 'AWS_ACCESS_KEY_ID': 'XXXXXXXXX', 'SHLVL': '2', 'KUBERNETES_SERVICE_PORT': '443', 'CHECKPOINT_URL': '', 'PATH': '/opt/conda/envs/xformers/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'CONDA_DEFAULT_ENV': 'xformers', 'KUBERNETES_SERVICE_HOST': '10.96.0.1'})

2022-11-27 12:44:41.761231 {'type': 'init', 'status': 'start', 'container_id': 'ba87e86e46d0b8d845d87b44b59576b7e73bc3d203fbf42309e7fc45ac940c83', 'time': 1669553081761, 't': 0, 'tsl': 407, 'payload': {'device': 'NVIDIA A100-SXM4-40GB', 'hostname': 'selfieaidreamboothapi00484d77580e70a458dae520b2d06a10a46-f55x8k', 'model_id': 'runwayml/stable-diffusion-v1-5', 'diffusers': '0.8.0'}, 'init': True}
Loading model: runwayml/stable-diffusion-v1-5
Initializing LMSDiscreteScheduler for runwayml/stable-diffusion-v1-5...
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)
Initialized LMSDiscreteScheduler for runwayml/stable-diffusion-v1-5 in 29ms
<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead

/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.
  warnings.warn(warning + message, DeprecationWarning)
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a model, please use <class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'>.load_config(...) followed by <class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'>.from_config(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)

{
  "modelInputs": {
    "instance_prompt": "a photo of sks dog",
    "instance_images": [
      "/9j/4A...",
      "/9j/4A...",
      "/9j/4A...",
      "/9j/4A...",
      "/9j/4A..."
    ],
    "max_train_steps": 1,
    "num_class": "images=1"
  },
  "callInputs": {
    "MODEL_ID": "runwayml/stable-diffusion-v1-5",
    "PIPELINE": "StableDiffusionPipeline",
    "SCHEDULER": "DDPMScheduler",
    "train": "dreambooth",
    "dest_url": "s3:///selfieai-photos/model.tar.zst"
  }
}
Initializing DDPMScheduler for runwayml/stable-diffusion-v1-5...
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)
Initialized DDPMScheduler for runwayml/stable-diffusion-v1-5 in 1ms
Decoded image "instance_image": JPEG 2732x2736
Decoded image "instance_image": JPEG 2476x2612
Decoded image "instance_image": JPEG 2469x2558
Decoded image "instance_image": JPEG 2796x2656
Decoded image "instance_image": JPEG 1815x1967
2022-11-27 12:44:51.769836 {'type': 'inference', 'status': 'start', 'container_id': 'ba87e86e46d0b8d845d87b44b59576b7e73bc3d203fbf42309e7fc45ac940c83', 'time': 1669553091770, 't': 0, 'tsl': 1614, 'payload': {'startRequestId': None}, 'init': True}
pipeline.enable_xformers_memory_efficient_attention()
{'instance_prompt': 'a photo of sks dog', 'max_train_steps': 1, 'num_class': 'images=1'}
Namespace(pretrained_model_name_or_path='runwayml/stable-diffusion-v1-5', revision=None, tokenizer_name=None, instance_data_dir='instance_data_dir', class_data_dir='class_data_dir', class_prompt=None, with_prior_preservation=False, prior_loss_weight=1.0, num_class_images=100, output_dir='text-inversion-model', seed=None, resolution=512, center_crop=None, train_text_encoder=None, train_batch_size=1, sample_batch_size=1, num_train_epochs=1, max_train_steps=1, gradient_accumulation_steps=1, gradient_checkpointing=True, learning_rate=5e-06, scale_lr=False, lr_scheduler='constant', lr_warmup_steps=0, use_8bit_adam=True, adam_beta1=0.9, adam_beta2=0.999, adam_weight_decay=1e-06, adam_epsilon=1e-08, max_grad_norm=1.0, push_to_hub=None, hub_token='XXXXXXX', hub_model_id=None, logging_dir='logs', mixed_precision=None, local_rank=-1, instance_prompt='a photo of sks dog', num_class='images=1')

total 36208
-rw-r--r-- 1 root root 7951765 Nov 27 12:44 image0.png
-rw-r--r-- 1 root root 7772122 Nov 27 12:44 image1.png
-rw-r--r-- 1 root root 7688237 Nov 27 12:44 image2.png
-rw-r--r-- 1 root root 9121967 Nov 27 12:45 image3.png
-rw-r--r-- 1 root root 4526831 Nov 27 12:45 image4.png
2022-11-27 12:45:03.071653 {'type': 'training', 'status': 'start', 'container_id': 'ba87e86e46d0b8d845d87b44b59576b7e73bc3d203fbf42309e7fc45ac940c83', 'time': 1669553103072, 't': 0, 'tsl': 11302, 'payload': {}, 'init': True}

2022-11-27 12:45:09.186304 {'type': 'training', 'status': 'done', 'container_id': 'ba87e86e46d0b8d845d87b44b59576b7e73bc3d203fbf42309e7fc45ac940c83', 'time': 1669553109186, 't': 6114, 'tsl': 6114, 'payload': {}}

  0%|          | 0/1 [00:00<?, ?it/s]
Steps:   0%|          | 0/1 [00:00<?, ?it/s]
Steps: 100%|██████████| 1/1 [00:05<00:00,  5.90s/it]
Steps: 100%|██████████| 1/1 [00:05<00:00,  5.90s/it, loss=0.0224, lr=5e-6]/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.
  warnings.warn(warning + message, DeprecationWarning)

-rw-r--r-- 1 root root 4561838429 Nov 27 12:45 model.tar.zst
[2022-11-27 12:45:42 +0000] [25] [ERROR] Exception occurred while handling uri: 'http://0.0.0.0:8000/'
Traceback (most recent call last):
  File "handle_request", line 81, in handle_request
    FutureStatic,
  File "/api/server.py", line 36, in inference
    output = user_src.inference(model_inputs)
  File "/api/app.py", line 277, in inference
    result = TrainDreamBooth(model_id, pipeline, model_inputs, call_inputs)
  File "/api/train_dreambooth.py", line 140, in TrainDreamBooth
    upload_result = storage.upload_file(filename, filename)
  File "/api/utils/storage/S3Storage.py", line 74, in upload_file
    result = self.bucket().upload_file(source, dest)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/s3/inject.py", line 233, in bucket_upload_file
    return self.meta.client.upload_file(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/s3/inject.py", line 143, in upload_file
    return transfer.upload_file(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/boto3/s3/transfer.py", line 288, in upload_file
    future.result()
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/futures.py", line 103, in result
    return self._coordinator.result()
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/futures.py", line 266, in result
    raise self._exception
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/tasks.py", line 139, in __call__
    return self._execute_main(kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/tasks.py", line 162, in _execute_main
    return_value = self._main(**kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/s3transfer/tasks.py", line 348, in _main
    response = client.create_multipart_upload(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 530, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 916, in _make_api_call
    endpoint_url, additional_headers = self._resolve_endpoint_ruleset(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/client.py", line 1059, in _resolve_endpoint_ruleset
    endpoint_info = self._ruleset_resolver.construct_endpoint(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/regions.py", line 502, in construct_endpoint
    provider_result = self._provider.resolve_endpoint(
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint_provider.py", line 715, in resolve_endpoint
    endpoint = self.ruleset.evaluate(input_parameters)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint_provider.py", line 695, in evaluate
    evaluation = rule.evaluate(input_parameters.copy(), self.rule_lib)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint_provider.py", line 546, in evaluate
    rule_result = rule.evaluate(scope_vars.copy(), rule_lib)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint_provider.py", line 546, in evaluate
    rule_result = rule.evaluate(scope_vars.copy(), rule_lib)
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint_provider.py", line 546, in evaluate
    rule_result = rule.evaluate(scope_vars.copy(), rule_lib)
  [Previous line repeated 1 more time]
  File "/opt/conda/envs/xformers/lib/python3.10/site-packages/botocore/endpoint_provider.py", line 522, in evaluate
    raise EndpointResolutionError(msg=error)
botocore.exceptions.EndpointResolutionError: Custom endpoint `s3://selfieai-photos/uploads/` was not a valid URI
[2022-11-27 12:45:42 +0000] - (sanic.access)[INFO][127.0.0.1:44328]: POST http://0.0.0.0:8000/  500 139

2022-11-27T12:46:43.000Z /opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/image_utils.py:239: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/image_utils.py:396: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  def rotate(self, image, angle, resample=PIL.Image.NEAREST, expand=0, center=None, translate=None, fillcolor=None):
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/models/clip/feature_extraction_clip.py:67: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  resample=Image.BICUBIC,

environ({'CONDA_SHLVL': '2', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'REQUESTS_CA_BUNDLE': '', 'CONDA_EXE': '/opt/conda/bin/conda', '_': '/opt/conda/envs/xformers/bin/python3', 'MODEL_URL': '', 'HOSTNAME': 'selfieaidreamboothapi00484d77580e70a458dae520b2d06a10a46-fmg2gq', 'PRECISION': '', 'HF_AUTH_TOKEN': 'XXXXXXXX', 'AWS_SECRET_ACCESS_KEY': 'XXXXXXXXX', 'PIPELINE': 'ALL', 'ALIYUN_COM_GPU_MEM_CONTAINER': '16', 'USE_DREAMBOOTH': '1', 'CONDA_PREFIX': '/opt/conda/envs/xformers', 'ALIYUN_COM_GPU_MEM_POD': '16', 'AWS_S3_ENDPOINT_URL': 's3://selfieai-photos/uploads/', 'NVIDIA_VISIBLE_DEVICES': 'GPU-a097ad1c-88f3-4877-2e51-1f8d371fb2b4', 'AWS_DEFAULT_REGION': 'us-east-1', '_CE_M': '', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KUBERNETES_PORT_443_TCP_ADDR': '10.96.0.1', 'CONDA_PREFIX_1': '/opt/conda', 'ALIYUN_COM_GPU_MEM_DEV': '40', 'KUBERNETES_PORT': 'tcp://10.96.0.1:443', 'PWD': '/api', 'HOME': '/root', 'CONDA_PYTHON_EXE': '/opt/conda/bin/python', 'LC_CTYPE': 'C.UTF-8', 'CHECKPOINT_CONFIG_URL': '', 'PYTORCH_VERSION': 'v1.12.1-rc5', '_CONVERT_SPECIAL': '', 'https_proxy': '', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'DEBIAN_FRONTEND': 'noninteractive', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'http_proxy': '', '_CE_CONDA': '', 'MODEL_ID': 'runwayml/stable-diffusion-v1-5', 'KUBERNETES_PORT_443_TCP': 'tcp://10.96.0.1:443', 'CONDA_PROMPT_MODIFIER': '(xformers) ', 'ALIYUN_COM_GPU_MEM_IDX': '5', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'CONDA_ROOT': '/opt/conda', 'AWS_ACCESS_KEY_ID': 'XXXXXXXX', 'SHLVL': '2', 'KUBERNETES_SERVICE_PORT': '443', 'CHECKPOINT_URL': '', 'PATH': '/opt/conda/envs/xformers/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'CONDA_DEFAULT_ENV': 'xformers', 'KUBERNETES_SERVICE_HOST': '10.96.0.1'})

2022-11-27 12:48:57.721665 {'type': 'init', 'status': 'start', 'container_id': 'f0d6409d28af0244aed2e3293f5a241dbe79aaf7f049e1496e86743010c78bb1', 'time': 1669553337722, 't': 0, 'tsl': 266, 'payload': {'device': 'NVIDIA A100-SXM4-40GB', 'hostname': 'selfieaidreamboothapi00484d77580e70a458dae520b2d06a10a46-fmg2gq', 'model_id': 'runwayml/stable-diffusion-v1-5', 'diffusers': '0.8.0'}, 'init': True}
Loading model: runwayml/stable-diffusion-v1-5
Initializing LMSDiscreteScheduler for runwayml/stable-diffusion-v1-5...
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)
Initialized LMSDiscreteScheduler for runwayml/stable-diffusion-v1-5 in 3ms
<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a model, please use <class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'>.load_config(...) followed by <class 'diffusers.models.unet_2d_condition.UNet2DConditionModel'>.from_config(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)

/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.
  warnings.warn(warning + message, DeprecationWarning)

{
  "modelInputs": {
    "instance_prompt": "a photo of sks dog",
    "instance_images": [
      "/9j/4A...",
      "/9j/4A...",
      "/9j/4A...",
      "/9j/4A...",
      "/9j/4A..."
    ],
    "max_train_steps": 1,
    "num_class": "images=1"
  },
  "callInputs": {
    "MODEL_ID": "runwayml/stable-diffusion-v1-5",
    "PIPELINE": "StableDiffusionPipeline",
    "SCHEDULER": "DDPMScheduler",
    "train": "dreambooth",
    "dest_url": "s3:///selfieai-photos/model.tar.zst"
  }
}
Initializing DDPMScheduler for runwayml/stable-diffusion-v1-5...
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_ddpm.DDPMScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)
Initialized DDPMScheduler for runwayml/stable-diffusion-v1-5 in 2ms
Decoded image "instance_image": JPEG 2732x2736
Decoded image "instance_image": JPEG 2476x2612
Decoded image "instance_image": JPEG 2469x2558
Decoded image "instance_image": JPEG 2796x2656
Decoded image "instance_image": JPEG 1815x1967
2022-11-27 12:49:05.880331 {'type': 'inference', 'status': 'start', 'container_id': 'f0d6409d28af0244aed2e3293f5a241dbe79aaf7f049e1496e86743010c78bb1', 'time': 1669553345880, 't': 0, 'tsl': 1573, 'payload': {'startRequestId': None}, 'init': True}
pipeline.enable_xformers_memory_efficient_attention()
{'instance_prompt': 'a photo of sks dog', 'max_train_steps': 1, 'num_class': 'images=1'}
Namespace(pretrained_model_name_or_path='runwayml/stable-diffusion-v1-5', revision=None, tokenizer_name=None, instance_data_dir='instance_data_dir', class_data_dir='class_data_dir', class_prompt=None, with_prior_preservation=False, prior_loss_weight=1.0, num_class_images=100, output_dir='text-inversion-model', seed=None, resolution=512, center_crop=None, train_text_encoder=None, train_batch_size=1, sample_batch_size=1, num_train_epochs=1, max_train_steps=1, gradient_accumulation_steps=1, gradient_checkpointing=True, learning_rate=5e-06, scale_lr=False, lr_scheduler='constant', lr_warmup_steps=0, use_8bit_adam=True, adam_beta1=0.9, adam_beta2=0.999, adam_weight_decay=1e-06, adam_epsilon=1e-08, max_grad_norm=1.0, push_to_hub=None, hub_token='XXXXXXX', hub_model_id=None, logging_dir='logs', mixed_precision=None, local_rank=-1, instance_prompt='a photo of sks dog', num_class='images=1')

(continuation)

2022-11-27T13:49:42.000Z /opt/conda/envs/xformers/lib/python3.10/site-packages/torch/utils/tensorboard/__init__.py:4: DeprecationWarning: distutils Version classes are deprecated. Use packaging.version instead.
  if not hasattr(tensorboard, "__version__") or LooseVersion(
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/image_utils.py:239: DeprecationWarning: BILINEAR is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BILINEAR instead.
  def resize(self, image, size, resample=PIL.Image.BILINEAR, default_to_square=True, max_size=None):
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/image_utils.py:396: DeprecationWarning: NEAREST is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.NEAREST or Dither.NONE instead.
  def rotate(self, image, angle, resample=PIL.Image.NEAREST, expand=0, center=None, translate=None, fillcolor=None):
/opt/conda/envs/xformers/lib/python3.10/site-packages/transformers/models/clip/feature_extraction_clip.py:67: DeprecationWarning: BICUBIC is deprecated and will be removed in Pillow 10 (2023-07-01). Use Resampling.BICUBIC instead.
  resample=Image.BICUBIC,

environ({'CONDA_SHLVL': '2', 'LD_LIBRARY_PATH': '/usr/local/nvidia/lib:/usr/local/nvidia/lib64', 'REQUESTS_CA_BUNDLE': '', 'CONDA_EXE': '/opt/conda/bin/conda', '_': '/opt/conda/envs/xformers/bin/python3', 'MODEL_URL': '', 'HOSTNAME': 'selfieaidreamboothapi00484d77580e70a458dae520b2d06a10a46-fftmwc', 'PRECISION': '', 'HF_AUTH_TOKEN': 'XXXXXXXX', 'AWS_SECRET_ACCESS_KEY': 'XXXXXXXXXXX', 'PIPELINE': 'ALL', 'ALIYUN_COM_GPU_MEM_CONTAINER': '16', 'USE_DREAMBOOTH': '1', 'CONDA_PREFIX': '/opt/conda/envs/xformers', 'ALIYUN_COM_GPU_MEM_POD': '16', 'AWS_S3_ENDPOINT_URL': 's3://selfieai-photos/uploads/', 'NVIDIA_VISIBLE_DEVICES': 'GPU-68060eb8-0868-cf14-99ce-dfb73fe8a69f', 'AWS_DEFAULT_REGION': 'us-east-1', '_CE_M': '', 'KUBERNETES_PORT_443_TCP_PROTO': 'tcp', 'KUBERNETES_PORT_443_TCP_ADDR': '10.96.0.1', 'CONDA_PREFIX_1': '/opt/conda', 'ALIYUN_COM_GPU_MEM_DEV': '40', 'KUBERNETES_PORT': 'tcp://10.96.0.1:443', 'PWD': '/api', 'HOME': '/root', 'CONDA_PYTHON_EXE': '/opt/conda/bin/python', 'LC_CTYPE': 'C.UTF-8', 'CHECKPOINT_CONFIG_URL': '', 'PYTORCH_VERSION': 'v1.12.1-rc5', '_CONVERT_SPECIAL': '', 'https_proxy': '', 'KUBERNETES_SERVICE_PORT_HTTPS': '443', 'DEBIAN_FRONTEND': 'noninteractive', 'KUBERNETES_PORT_443_TCP_PORT': '443', 'http_proxy': '', '_CE_CONDA': '', 'MODEL_ID': 'runwayml/stable-diffusion-v1-5', 'KUBERNETES_PORT_443_TCP': 'tcp://10.96.0.1:443', 'CONDA_PROMPT_MODIFIER': '(xformers) ', 'ALIYUN_COM_GPU_MEM_IDX': '3', 'NVIDIA_DRIVER_CAPABILITIES': 'compute,utility', 'CONDA_ROOT': '/opt/conda', 'AWS_ACCESS_KEY_ID': 'XXXXXXXX', 'SHLVL': '2', 'KUBERNETES_SERVICE_PORT': '443', 'CHECKPOINT_URL': '', 'PATH': '/opt/conda/envs/xformers/bin:/opt/conda/condabin:/opt/conda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', 'CONDA_DEFAULT_ENV': 'xformers', 'KUBERNETES_SERVICE_HOST': '10.96.0.1'})

2022-11-27 13:49:48.903735 {'type': 'init', 'status': 'start', 'container_id': 'e0858d37787251406d51d0ef86f997751778ee63548bf4bb035a717150f9d4ed', 'time': 1669556988904, 't': 0, 'tsl': 268, 'payload': {'device': 'NVIDIA A100-SXM4-40GB', 'hostname': 'selfieaidreamboothapi00484d77580e70a458dae520b2d06a10a46-fftmwc', 'model_id': 'runwayml/stable-diffusion-v1-5', 'diffusers': '0.8.0'}, 'init': True}
Loading model: runwayml/stable-diffusion-v1-5
Initializing LMSDiscreteScheduler for runwayml/stable-diffusion-v1-5...
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.If you were trying to load a scheduler, please use <class 'diffusers.schedulers.scheduling_lms_discrete.LMSDiscreteScheduler'>.from_pretrained(...) instead. Otherwise, please make sure to pass a configuration dictionary instead. This functionality will be removed in v1.0.0.
  warnings.warn(warning + message, DeprecationWarning)
Initialized LMSDiscreteScheduler for runwayml/stable-diffusion-v1-5 in 3ms
<frozen importlib._bootstrap>:283: DeprecationWarning: the load_module() method is deprecated and slated for removal in Python 3.12; use exec_module() instead
/api/diffusers/src/diffusers/utils/deprecation_utils.py:35: DeprecationWarning: It is deprecated to pass a pretrained model name or path to `from_config`.
  warnings.warn(warning + message, DeprecationWarning)

Thank you!!

Ah, this is super helpful!! I can see in the environment that:

is still set. As per the other thread, this should be left blank (or set to a valid value). It sounds like you’ve already fixed that on the dashboard, but for whatever reason it’s still there in the container. So I would double-check that it’s not set directly in your Dockerfile, and then push a dummy commit to trigger a new build, to make sure it’s using the new dashboard build vars. Does that make sense?
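For the Dockerfile check, something like the rough sketch below might save some scrolling. It is only an illustration: `find_dockerfile_assignments` is a made-up helper name, the sample Dockerfile text is invented, and `PRECISION` is used purely as an example variable. It lists the lines where an `ENV` or `ARG` instruction names the given variable.

```python
import re


def find_dockerfile_assignments(dockerfile_text, var_name):
    """Return (line_number, line) pairs where ENV or ARG names var_name.

    Rough heuristic only: it does not follow line continuations and can
    misfire on the legacy space-separated `ENV key value` form.
    """
    pattern = re.compile(
        # Instruction keyword is case-insensitive; the variable name is not.
        r"^\s*(?i:ENV|ARG)\s.*?(?<=\s)" + re.escape(var_name) + r"(?=\s|=|$)"
    )
    hits = []
    for number, line in enumerate(dockerfile_text.splitlines(), start=1):
        if pattern.search(line):
            hits.append((number, line.strip()))
    return hits


# Invented sample, not the real Dockerfile from the repo.
sample = (
    "FROM python:3.10\n"
    'ARG PRECISION=""\n'
    "ENV PRECISION=${PRECISION}\n"
    "ENV MY_PRECISION=1\n"
    "RUN echo PRECISION\n"
)

for number, line in find_dockerfile_assignments(sample, "PRECISION"):
    print(f"line {number}: {line}")
```

If this prints nothing for your variable, the value is probably coming from the build vars rather than from the Dockerfile itself.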

Good find!! Yes, you’re right about this variable, but I don’t know how it was applied. Maybe it was a banana env variable, and after deleting it, it was still applied… I don’t know.

I just pushed a dummy commit, and I will check the logs. Since I don’t know how it got applied in the first place, I don’t know how to set it to empty.

Thanks again!!

Btw, it’s not so obvious but here’s the line to look for in the logs (it’s currently all on one line, I’ll improve this and also auto delete credentials when I have a chance):

Good questions re banana; please report back and let us all know. If deleting it didn’t work, maybe try setting it again to no value, like you’re managing to do with PRECISION. And yes, unfortunately, for now you have to push a dummy commit every time to trigger a rebuild with the new vars.
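On the logging improvement mentioned above (one variable per line, credentials removed), here is a minimal sketch of what that could look like. It is not the code in the repo: `format_environ` and the `SENSITIVE` name pattern are assumptions for illustration, and the pattern only catches names containing SECRET, TOKEN, PASSWORD or ACCESS_KEY.

```python
import re

# Assumed heuristic: treat a variable as a credential if its name
# contains one of these words.
SENSITIVE = re.compile(r"SECRET|TOKEN|PASSWORD|ACCESS_KEY", re.IGNORECASE)


def format_environ(env):
    """Format a mapping of env vars one per line, hiding credential values."""
    lines = []
    for name in sorted(env):
        value = env[name]
        if SENSITIVE.search(name) and value:
            shown = "<redacted>"
        else:
            # repr() makes blank values visible as ''.
            shown = repr(value)
        lines.append(f"{name}={shown}")
    return "\n".join(lines)


# Placeholder values; in the container this would be os.environ.
example_env = {
    "PRECISION": "",
    "HF_AUTH_TOKEN": "abc",
    "AWS_ACCESS_KEY_ID": "k",
    "MODEL_ID": "runwayml/stable-diffusion-v1-5",
}
print(format_environ(example_env))
```

Blank values still show up as `''`, so a variable that is set but empty (like PRECISION here) is easy to tell apart from one that holds a stale value.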