> The code is working but I do NOT think it is using the optimization since my response time is mostly like this: `Request took 12.1s (inference: 3.0s, init: 1.8s)`
Firstly, welcome to the forums, and thanks for providing all these details; they were really helpful.
To be honest, 12s total response time sounds really good for a cold start!
The model init time is where you really see the difference in optimization. 1.8s is great. I unfortunately don’t recall runpod’s init time for unoptimized models, but on banana at least, unoptimized ~= 90-120s (!) vs optimized ~= 2.5s.
The difference between the total request time and init+inference (i.e. 12.1s − 3.0s − 1.8s = 7.3s in your example above) is the container boot time (plus any network latency). You should be able to see exactly what’s happening in the runpod logs.
Subsequent requests should be super fast: just the inference time. But that depends on your “idle timeout” setting. After this time, the container shuts down again to save you money. You can make the timeout longer (the default is 5s) or set a number of “min workers” to keep containers warm, but you’ll pay more for keeping containers alive when no one is using them.
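To make the warm/cold behavior concrete, here’s a toy simulation. The numbers (boot 7.3s, init 1.8s, inference 3.0s) are just the ones from your example above, and the function is my own sketch, not runpod’s or banana’s actual scheduling logic:

```python
# Toy model of serverless cold starts vs. warm requests.
# All numbers are illustrative (taken from the example above);
# this is NOT the provider's actual billing/scheduling logic.

def response_times(request_times, idle_timeout=5.0,
                   boot=7.3, init=1.8, inference=3.0):
    """Return the total response time for each request.

    A request is "cold" (pays boot + init + inference) if the
    container has been idle longer than idle_timeout; otherwise
    it is "warm" and pays only the inference time.
    """
    results = []
    container_free_at = None  # when the container last finished work
    for t in sorted(request_times):
        cold = container_free_at is None or t - container_free_at > idle_timeout
        total = (boot + init + inference) if cold else inference
        results.append(round(total, 1))
        container_free_at = t + total
    return results

# First request is cold (~12.1s); a quick follow-up is warm (3.0s);
# a request after a long idle gap is cold again.
print(response_times([0, 13, 100]))  # → [12.1, 3.0, 12.1]
```

Bumping `idle_timeout` (or keeping min workers) turns that last cold start into a warm 3.0s request, at the cost of paying for the idle container in between.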
The only other thing I notice is that usually we need a triple slash (///) in S3 URLs, but I think if it hadn’t found the model, you would have gotten an error. Is there anything else in the logs on the runpod side? Also, if you’re using the default bucket and model name - which it seems you are - you should be able to just put MODEL_URL="s3://" and it will figure out the full URL for you.
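For reference, the two forms look something like this (the bucket and file names here are made up for illustration; only the `MODEL_URL` variable and the `s3://` shorthand come from the above):

```shell
# Explicit S3 URL: note the triple slash after "s3:".
# "my-bucket" and "model.tar.zst" are placeholder names.
MODEL_URL="s3:///my-bucket/model.tar.zst"

# Shorthand when using the default bucket and model name:
# the full URL is figured out for you.
MODEL_URL="s3://"
```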
Hope this helps and let me know if anything wasn’t clear or you have any other questions. And thanks for posting your experience on the forums.