Can I train / fine-tune on Banana?

gadicc · November 22, 2022, 8:00am

Introduction

Banana is awesome for inference, which is a great fit for serverless architecture. But what about training? Banana offers incredible value per GPU second: the price is similar to lambdalabs.com, who offer some of the cheapest A100 VMs in the industry (but without any serverless infrastructure, of course). This makes it a good fit for certain kinds of training too, with these limitations:

16 GB VRAM: Banana currently offers only 16 GB VRAM per instance (although there are plans to offer more at a higher cost per second in the future).
No distributed training: Although parallelisation is possible in theory, you'd be fighting the serverless architecture a lot to get it. As such, Banana works fantastically well for training that can be done on a single machine (e.g. dreambooth and other fine-tuning of existing models), but I wouldn't suggest it for training a huge model from scratch.
Time limit: Banana jobs have an upper run limit, which makes them better suited to short(er)-running tasks. At one stage this was 10m, but it was recently expanded to much more (TODO reference? I think 999m now?). In short, if your training will take longer than a few hours, it might fit better elsewhere.
Storage requirements: After training, where will you upload / store the trained model? Containers are ephemeral (i.e. their runtime storage is lost whenever they shut down). S3 works really well (Banana HQ is in San Francisco, so e.g. AWS us-west-1 works great).
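
One way to live with the time limit is to give the training loop its own budget, so it stops early and leaves time to upload the result. The following is only a minimal sketch: `train_with_budget`, `train_step` and `budget_seconds` are hypothetical names made up for illustration, not part of any Banana API, and the right budget depends on whatever the current job limit actually is.

```python
import time


def train_with_budget(train_step, max_steps, budget_seconds, clock=time.monotonic):
    """Call train_step(step) until max_steps or the time budget is reached.

    train_step is a hypothetical callback that runs one training step.
    Returns the number of steps that were actually completed.
    """
    start = clock()
    steps_done = 0
    for step in range(max_steps):
        # Check the elapsed time before each step and stop early if the
        # budget is used up, so there is still time left to save and upload.
        if clock() - start >= budget_seconds:
            break
        train_step(step)
        steps_done += 1
    return steps_done
```

Setting the budget comfortably below the job limit leaves headroom for saving and uploading the model before the container goes away.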

Rule of thumb: if your training happens in response to some user action on your site, Banana is probably a great fit, provided you don't need days of multi-GPU training with huge amounts of RAM.

Disclaimer: I don’t work for Banana and these are my personal opinions.

Architecture

The usual design principles for Banana inference apply, so use one of the various starter templates, with the following functions in your app.py:

In init():
- Load your model (if doing fine-tuning) - according to optimization rules
In inference(): (even though you’ll be doing training in it)
- Run training, and then
- Upload your trained model somewhere (e.g. S3)
- Return some relevant response as required
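
The layout above could be sketched roughly like this. It is a simplified illustration, not the actual starter template: `load_base_model` and `run_training` are hypothetical placeholders for your own code, the `bucket` and `key` input fields are made-up names, and the extra `upload` parameter exists only so the upload step can be swapped out when trying the code locally. The only real library call is boto3's `upload_file`, which needs boto3 installed and AWS credentials available in the container.

```python
import os
import tempfile

model = None  # populated once by init() and reused across calls


def load_base_model():
    """Hypothetical placeholder: load the base model to fine-tune."""
    return {"name": "base-model"}


def run_training(base_model, model_inputs, output_path):
    """Hypothetical placeholder: fine-tune and save the result to output_path."""
    with open(output_path, "w") as f:
        f.write("trained weights for " + base_model["name"])


def upload_to_s3(local_path, bucket, key):
    """Upload a local file to S3 (boto3 is imported lazily, only when needed)."""
    import boto3

    boto3.client("s3").upload_file(local_path, bucket, key)


def init():
    """Runs once at container start: load the model to be fine-tuned."""
    global model
    model = load_base_model()


def inference(model_inputs, upload=upload_to_s3):
    """Runs per request: train, upload the result, return a response."""
    bucket = model_inputs["bucket"]  # assumed input field
    key = model_inputs["key"]        # assumed input field
    # Container storage is ephemeral, so train into a temp dir and
    # upload before the directory (and later the container) goes away.
    with tempfile.TemporaryDirectory() as workdir:
        output_path = os.path.join(workdir, "model.bin")
        run_training(model, model_inputs, output_path)
        upload(output_path, bucket, key)
    return {"status": "ok", "bucket": bucket, "key": key}
```

Returning the bucket and key lets the caller find the trained model later, since nothing else survives once the container shuts down.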

Existing (Public) Projects / Examples

Dreambooth

(see Dreambooth training [first look] - #13 by gadicc)