Introducing VESSL Run: a unified YAML interface for running any AI model

VESSL Run makes fine-tuning and scaling the latest open-source models easier than ever

Published in VESSL AI · 5 min read · Jun 19, 2023

Today, we are releasing VESSL Run, the easiest way to train, fine-tune, and scale open-source AI/ML models. VESSL Run hides the complex compute backends and system details required to run off-the-shelf models behind a unified YAML interface. This means developers can start training without being bogged down in ML-specific peripherals like cloud infrastructure, CUDA configurations, and Python dependencies.

Together with VESSL Run, we are also releasing several custom Docker images for the latest Generative AI models and highlight papers from CVPR 2023 such as DreamBooth Stable Diffusion. Explore these models at bit.ly/cvpr2023 and run them right from the terminal.

pip install --upgrade vessl
vessl run hello

The problem we are tackling

We’ve seen an explosion of open-source models that are on par with, if not better than, their latest closed-source counterparts: Stable Diffusion for DALL-E 2 and LLaMA for GPT-3. Every day, we see ML enthusiasts and professionals create their own versions and applications of these models and showcase them on GitHub. These projects, despite their hundreds of forks and stars, have one major problem.

Most of them don’t work.

If you have tried actually running these models, whether by following the guides on model cards or by running Colab notebooks, you probably spent hours just configuring PyTorch and CUDA. Even if you get through this step, fine-tuning and scaling the model on your own datasets and cloud is a whole other story. The value of these models is either lost among CUDA errors, or the models remain toy projects that never reach production. It’s a common story that most of us in AI face today.

Our approach with VESSL Run

Our approach to solving this problem is to provide a simple, unified YAML interface that abstracts the peripherals surrounding the models. These range from CUDA configurations and Python dependencies for your first run, to custom data loaders and cloud infrastructure for fine-tuning and scaling, to endpoints and automatic scaling for serving and deployment. With VESSL Run, developers can experiment with the latest open-source models on their own datasets and GPUs without going through manual setup.

How it works

The following YAML snippet, for example, is all you need to run DreamBooth on Stable Diffusion with A100 GPUs.

Mount a public GitHub repo and a dataset from an S3 bucket.
Set up a training environment with our custom Docker image.
Run a training task on an on-premise DGX cluster using A100 GPUs.

name: dreamboothstablediffusion
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: A100:1
volumes:
  /root/examples: git://github.com/vessl-ai/examples
  /output:
    artifact: true
run:
  - workdir: /root/examples/Dreambooth-Stable-Diffusion
    command: |
      conda env create -f environment.yaml 
      source activate ldm
      pip install Omegaconf
      pip install pytorch-lightning 
      mkdir data/
      wget https://github.com/prawnpdf/prawn/raw/master/data/fonts/DejaVuSans.ttf -P data/
      wget https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4-full-ema.ckpt
      python main.py --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t --actual_resume ./sd-v1-4-full-ema.ckpt -n "generate_pikachu" --no-test --gpus "0," --data_root ./dataset --reg_data_root ./reg --class_word "${class_word}"
      rm -rf ./logs/*.ipynb_checkpoints
      python scripts/stable_txt2img.py --ddim_eta 0.0 --n_samples 2 --n_iter 4 --scale 10.0 --ddim_steps 100 --ckpt ./logs/*/checkpoints/last.ckpt --prompt "${prompt}"
      cp -r ./outputs /output

env:
  class_word: "pikachu"
  prompt: "A photo of sks pikachu playing soccer."
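
The run commands in the snippet read class_word and prompt from the env block. The sketch below shows how a POSIX shell expands such references. It assumes that VESSL passes each env entry to the commands as an ordinary environment variable, which this post does not state explicitly. The variable names braced and literal are illustrative only.

```shell
#!/bin/sh
# Assumption: entries under `env:` reach the run commands as ordinary
# environment variables. The values are copied from the YAML above.
export class_word="pikachu"
export prompt="A photo of sks pikachu playing soccer."

# The position of the dollar sign relative to the braces changes the result.
braced="${class_word}"     # parameter expansion: yields the bare value
literal="{$class_word}"    # braces are plain characters around the value

# Double quotes keep the multi-word prompt together as a single argument.
printf 'with ${name}:  %s\n' "$braced"
printf 'with {$name}:  %s\n' "$literal"
printf 'prompt:        %s\n' "$prompt"
```

Running it prints pikachu for the first form and {pikachu} for the second, so the dollar sign belongs outside the braces whenever the bare value is wanted.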

To see it in action, save the YAML above as dreambooth.yaml and launch it with the following command.

vessl run -f dreambooth.yaml
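
Before launching, a quick local check can catch a truncated or mis-indented file. The following is a minimal sketch, not VESSL's own validation: the function name check_run_yaml is hypothetical, and the key list is simply the set of top-level keys the DreamBooth example happens to use. Whether VESSL requires all of them is not stated in this post.

```shell
#!/bin/sh
# check_run_yaml FILE: report which of the example's top-level keys FILE lacks.
# This is a plain text check, not a YAML parser or a VESSL schema validator.
check_run_yaml() {
  file="$1"
  missing=""
  for key in name image resources run; do
    # A top-level key starts at column 0 and is followed by a colon.
    if ! grep -q "^${key}:" "$file"; then
      missing="${missing}${missing:+ }${key}"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing top-level keys: $missing"
    return 1
  fi
  echo "ok: $file"
}
```

One way to use it is to chain the check and the launch: check_run_yaml dreambooth.yaml && vessl run -f dreambooth.yaml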

In essence, with every vessl run, you are launching a Kubernetes pod that’s configured specifically for machine learning. Our custom Docker images are containerized versions of each machine learning GitHub repo (/Dreambooth-Stable-Diffusion, /nanoGPT, /LangChain, and more) with the right CUDA and application dependencies.

This means that, with the same YAML definition, you can not only launch individual training jobs but also create persistent workspaces for GPU-enabled inference tasks. You could, for example, use a tool like Streamlit to build a Lensa-like app, all without worrying about the peripherals.

interactive:
  runtime: 24h
  ports:
    - 8501
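
Port 8501 is Streamlit's default port. Inside such a workspace, a Streamlit app could be started along the following lines. This is a sketch under assumptions: app.py is a placeholder for your own script, the helper name start_app is hypothetical, and it presumes the port listed above is forwarded to the workspace.

```shell
#!/bin/sh
APP_FILE="app.py"   # hypothetical: placeholder for your own Streamlit script
PORT=8501           # the port listed under ports in the snippet above

start_app() {
  # Fail early with a hint if the Streamlit CLI is not available.
  if ! command -v streamlit >/dev/null 2>&1; then
    echo "streamlit is not installed; try: pip install streamlit" >&2
    return 1
  fi
  # Bind to 0.0.0.0 so the server accepts connections from outside the container.
  streamlit run "$APP_FILE" --server.port "$PORT" --server.address 0.0.0.0
}
```

Calling start_app from the workspace terminal serves the app on the exposed port until the process is stopped.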

How to get started

We prepared ready-to-run versions of the latest generative AI models and highlight papers from CVPR 2023 in our VESSL Hub gallery. These all come with our custom Docker images and can be launched with the same vessl run command.

DreamBooth
LangChain
VisProg
ImageBind
Segment Anything
MobileNeRF

We also made a few resources to help you get started:

Going forward

VESSL Run extends our effort to offer the easiest way to train and deploy production-ready ML models at scale, alongside our ML task launcher and workflow manager.

The arrival of Stable Diffusion and LLaMA showed that many people want to build AI-enabled tools, whether for a side project or a full-scale AI product. By creating a simple, unified interface that abstracts away the minor yet time-consuming peripherals, we hope to make the latest open-source models accessible to all builders and enable more enthusiasts to experiment rapidly with the latest developments in machine learning.

—
Yong Hee, Growth Manager
Floyd Ryoo, Product Manager
David Oh, ML Engineer Intern