Run CVPR 2023 highlights with VESSL Run

Run CVPR 2023 highlight models and papers with a single YAML file

VESSL AI

If you have ever tried to run GitHub or Colab code from top AI/ML conferences like NeurIPS, CVPR, ICML, and ICCV, you have probably realized that most of it doesn’t work out of the box. You have to spend hours just configuring CUDA and Python dependencies.

We created VESSL Run to help ML researchers and data scientists explore the latest models effortlessly with a unified YAML interface. With the release of VESSL Run, we are sharing YAML files for highlight papers and models from CVPR 2023. These YAML files make models like DreamBooth by Google Research, ImageBind by Meta AI, and VisProg by Allen AI ready to run from your laptop on any cloud.

You can run these models simply by pointing our vessl run command at the YAML file. Explore more models from CVPR 2023 in our model gallery at https://vessl.ai/hub.

pip install --upgrade vessl
vessl run -f dreambooth.yaml

DreamBooth by Google Research

DreamBooth presents a novel method to personalize text-to-image diffusion models by fine-tuning them on a small set of subject images. By binding the subject to a unique identifier and leveraging the semantic prior embedded in the model, the fine-tuned model can generate highly realistic images of the subject in new contexts, enabling tasks like subject recontextualization, text-guided view synthesis, appearance modification, and artistic rendering while preserving the subject’s key features.

name: dreamboothstablediffusion
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: A100:1
volumes:
  /root/examples: git://github.com/vessl-ai/examples
  /output:
    artifact: true
run:
  - workdir: /root/examples/Dreambooth-Stable-Diffusion
    command: |
      conda env create -f environment.yaml
      source activate ldm
      pip install Omegaconf
      pip install pytorch-lightning
      mkdir data/
      wget https://github.com/prawnpdf/prawn/raw/master/data/fonts/DejaVuSans.ttf -P data/
      wget https://huggingface.co/CompVis/stable-diffusion-v-1-4-original/resolve/main/sd-v1-4-full-ema.ckpt
      python main.py --base configs/stable-diffusion/v1-finetune_unfrozen.yaml -t --actual_resume ./sd-v1-4-full-ema.ckpt -n "generate_pikachu" --no-test --gpus "0," --data_root ./dataset --reg_data_root ./reg --class_word "{$class_word}"
      rm -rf ./logs/*.ipynb_checkpoints
      python scripts/stable_txt2img.py --ddim_eta 0.0 --n_samples 2 --n_iter 4 --scale 10.0 --ddim_steps 100 --ckpt ./logs/*/checkpoints/last.ckpt --prompt "{$prompt}"
      cp -r ./outputs /output
env:
  class_word: "pikachu"
  prompt: "A photo of sks pikachu playing soccer."

The YAML snippet uses the Docker image “nvcr.io/nvidia/pytorch:22.10-py3” to configure the runtime and allocates one NVIDIA A100 GPU. It specifies volumes for the GitHub repository and artifact creation. The project runs a sequence of commands, including environment setup, package installation, data download, model training, and output generation.

Under env, you can set the class word and the prompt used in the run (a short sketch for adapting them to your own subject follows this list).

  • class_word: The coarse class word for your subject, passed to main.py via --class_word. In this example, we use “pikachu”.
  • prompt: The text prompt that stable_txt2img.py uses to sample images from the fine-tuned model. Here we use “A photo of sks pikachu playing soccer.”, where sks is the unique identifier bound to the subject.
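As a minimal sketch, adapting the run to your own subject mostly comes down to editing the env block, assuming you also place your subject images under ./dataset in the working directory. The corgi values below are placeholders, not part of the original example:

env:
  class_word: "corgi"                            # placeholder: coarse class word for your subject
  prompt: "A photo of sks corgi on the beach."   # placeholder sampling prompt; keep the sks identifier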

Segment Anything by Meta AI

Segment Anything (SA) introduces a task, model, and dataset for image segmentation, including over 1 billion masks on 11 million images. Their promptable model demonstrates impressive zero-shot performance, rivaling or surpassing prior fully supervised methods. Meta AI released the Segment Anything Model along with the dataset (SA-1B) to foster research in computer vision.

The YAML uses the Docker image “nvcr.io/nvidia/pytorch:21.05-py3” and allocates one NVIDIA V100 GPU on AWS. It runs a setup script located in the “/root/segment-anything/” directory, with the GitHub repository “git://github.com/vessl-ai/segment-anything” mounted as a volume. For interactive usage, the workload has a runtime of 24 hours and exposes port 8501.

name: segment-anything
resources:
  accelerators: V100:1
image: nvcr.io/nvidia/pytorch:21.05-py3
run:
  - workdir: /root/segment-anything/
    command: |
      bash ./setup.sh
volumes:
  /root/segment-anything: git://github.com/vessl-ai/segment-anything
interactive:
  runtime: 24h
  ports:
    - 8501
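You can launch this config the same way as the DreamBooth example; the filename below is just an assumption about where you saved the YAML. Once the interactive session is up, connect to the demo through the exposed port 8501.

vessl run -f segment-anything.yaml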

Thin-Plate Spline Motion Model for Image Animation

The paper introduces a new end-to-end unsupervised motion transfer framework to address the challenge of large pose gaps between source and driving images in image animation. The framework utilizes thin-plate spline motion estimation for flexible optical flow, incorporates multi-resolution occlusion masks to realistically restore missing regions, and employs additional auxiliary loss functions to ensure high-quality image generation. Experimental results demonstrate the superiority of this method over existing approaches, showing significant improvements in pose-related metrics across various objects such as talking faces, human bodies, and pixel animations.

The YAML uses the “nvcr.io/nvidia/pytorch:21.05-py3” image with a V100 accelerator. It mounts the code from a GitHub repo and the dataset from an S3 bucket, then installs the requirements and runs run.py with the vox-256 config.

name: Thin-Plate-Spline-Motion-Model
image: nvcr.io/nvidia/pytorch:21.05-py3
resources:
  accelerators: V100:1
run:
  - workdir: /root/thin-plate-spline-motion-model
    command: |
      pip install -r requirements.txt && python run.py --config config/vox-256.yaml --device_ids 0
volumes:
  /root/thin-plate-spline-motion-model: git://github.com/saeyoon17/Thin-Plate-Spline-Motion-Model
  /root/vox: s3://vessl-public-apne2/vessl_run_datasets/vox/
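Launching follows the same pattern (the filename is again only an assumption about where you saved the config):

vessl run -f thin-plate-spline-motion-model.yaml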

MobileNeRF by Google Research

Neural Radiance Fields (NeRFs) have impressive image synthesis capabilities for 3D scenes. This paper introduces a new NeRF representation based on textured polygons that can be synthesized efficiently with standard rendering pipelines. Rendering the polygons with a z-buffer yields an image with features at every pixel, which a view-dependent MLP running in a fragment shader then converts into the final pixel colors. This approach lets NeRFs be rendered with the traditional polygon rasterization pipeline, achieving interactive frame rates on a wide range of compute platforms.

The YAML involves tasks like unzipping a dataset, cloning a GitHub repository, installing dependencies, and executing a Python script. The dataset is sourced from an S3 bucket.

name: mobilenerf
resources:
  accelerators: V100:1
image: quay.io/vessl-ai/ngc-pytorch-kernel:22.12-py3-202301160809
run:
  - command: |
      unzip /root/datasets/nerf_synthetic.zip -d /datasets/
      git clone https://github.com/treasuraid/jax3d.git
  - command: |
      pip3 install -r requirements.txt
      pip install jaxlib==0.1.69+cuda111 -f https://storage.googleapis.com/jax-releases/jax_cuda_releases.html
      python stage1.py
    workdir: jax3d/jax3d/projects/mobilenerf
volumes:
  /root/datasets/: s3://vessl-public-apne2/vessl_run_datasets/cvpr_candidates/nerf_synthetic.zip
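The first run entry unpacks the dataset and clones the code, and the second installs the dependencies and starts stage-1 training from the MobileNeRF project directory. As before, launching it is a one-liner (filename assumed):

vessl run -f mobilenerf.yaml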

ImageBind by Meta AI

ImageBind learns a joint embedding across diverse modalities such as images, text, audio, depth, thermal, and IMU data. By utilizing image-paired data, ImageBind effectively binds these modalities together and expands the zero-shot capabilities of large-scale vision-language models. It enables various applications, including cross-modal retrieval, arithmetic composition, detection, and generation, achieving state-of-the-art performance in emergent zero-shot and few-shot recognition tasks, while also serving as a valuable evaluation framework for vision models across visual and non-visual domains.

The YAML utilizes the “nvcr.io/nvidia/pytorch:22.10-py3” image with an A100 accelerator. It involves creating an environment, installing dependencies, and running a Streamlit demo. The code and resources are sourced from the “treasuraid/ImageBind” repository, and the project is set to run interactively for 24 hours on port 8501.

To run this YAML, you need an A100 GPU. You can bring your own GPU cluster using our vessl cluster create command, as sketched below. Refer to our documentation to get started.
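A minimal sketch of that workflow with the CLI (the exact arguments to vessl cluster create depend on your environment, so we omit them here; see the documentation for details):

pip install --upgrade vessl
# Register your own GPU cluster with VESSL; required arguments are omitted here.
vessl cluster create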

name: ImageBind
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: A100:1
run:
  - command: |
      cd ImageBind
      conda create --name imagebind python=3.8 -y
      source activate imagebind
      pip install numpy
      pip install vtk==9.0.1
      pip install mayavi
      pip install -r requirements.txt
      conda install -c conda-forge cartopy -y
      streamlit run streamlit_demo.py
volumes:
  /root/ImageBind: git://github.com/treasuraid/ImageBind
interactive:
  runtime: 24h
  ports:
    - 8501
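Once launched (filename assumed, as before), the Streamlit demo started by the last command is reachable through the exposed port 8501 for the duration of the 24-hour interactive session.

vessl run -f imagebind.yaml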

Visual Programming by Allen AI

VisProg is an innovative neuro-symbolic approach that utilizes natural language instructions to tackle complex visual tasks. By generating modular programs and employing computer vision models and image processing routines, VisProg offers flexible solutions for tasks like visual question answering and language-guided image editing. This approach broadens the capabilities of AI systems, allowing them to cater to diverse user needs and effectively handle a wide range of complex tasks.

For this YAML, you need to enter your OpenAI API key under env.

name: visprog
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: V100:1
run:
  - workdir: /root
    command: |
      echo $OPENAI_API_KEY
      git clone https://github.com/treasuraid/visprog.git
      cd visprog
      conda env create -f environment.yaml
      source activate visprog
      pip install vessl opencv-python-headless
      cd script
      python image_editing.py
env:
  OPENAI_API_KEY: "your openai api key"
Input query: “Replace man in black henley (person) with brick wall” (top: original, bottom: after the query)
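Before launching, replace the placeholder under env with your actual key; it is exposed to the run as the OPENAI_API_KEY environment variable, which is what the echo command above prints. The filename below is, as before, just an assumption:

vessl run -f visprog.yaml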

Top-Down Visual Attention from Analysis by Synthesis

Current attention algorithms, such as self-attention, highlight all salient objects in an image without considering the specific task. In contrast, humans use task-guided top-down attention to focus on task-related objects. This paper introduces AbSViT, a top-down modulated ViT model that approximates analysis-by-synthesis (AbS) and enables controllable top-down attention. AbSViT improves performance on vision-language tasks and serves as a versatile backbone for classification, semantic segmentation, and model robustness.

The YAML utilizes the “nvcr.io/nvidia/pytorch:22.10-py3” image and runs with a V100 accelerator. It installs the Python requirements along with the libmagickwand-dev system library. The code is fetched from the “bfshi/AbSViT” repository, and the workload runs interactively for 24 hours with port 8501 exposed.

name: AbSViT
image: nvcr.io/nvidia/pytorch:22.10-py3
resources:
  accelerators: V100:1
run:
  - workdir: /root/AbSViT
    command: |
      pip install -r requirements.txt
      apt-get install libmagickwand-dev
volumes:
  /root/AbSViT: git://github.com/bfshi/AbSViT.git
interactive:
  runtime: 24h
  ports:
    - 8501
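This config only installs the dependencies, leaving you to run the repository’s own training and evaluation scripts inside the 24-hour interactive session. Launching it follows the same pattern (filename assumed):

vessl run -f absvit.yaml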

VESSL AI @ CVPR 2023

VESSL AI will be at CVPR 2023 all week to host the official student social event, share our latest product updates, and showcase demos! Our team will also be at booth 📍1527, so stop by to see more of our latest work!


SungHyun Moon, ML Lead
David Oh, ML Engineer Intern
Yong Hee Lee, Growth Manager
