10 Highlights from CVPR 2023
0. Vancouver Played Host to CVPR 2023
In the scenic embrace of Vancouver, the highly anticipated CVPR 2023 (the IEEE/CVF Conference on Computer Vision and Pattern Recognition) was held from June 18th to 22nd, drawing many luminaries and attendees. The event kicked off with two days of engaging workshops and tutorials on the 18th and 19th. The main conference then took the stage from the 20th to the 22nd, presenting insightful oral presentations and informative posters, all against Vancouver's stunning vistas and bustling urban charm.
1. CVPR’s Growing Impact: More Participants, More Papers, More Events
CVPR 2023 saw significant growth in every dimension, a testament to its increasing global influence and relevance in the field. The conference brought together over 8,000 attendees from 79 countries, both online and on-site in Vancouver, setting the stage for a dynamic and insightful exchange.
Moreover, the field's research vigor was evident in the growing volume of paper submissions. The conference received 9,155 papers, a remarkable 29% increase over the previous year. Of these submissions, 2,359 papers passed the stringent selection process, an astounding 42% rise from 2022.
CVPR 2023 also marked the inauguration of official social events, adding a new dimension to the conference. Six engaging social events were held, fostering greater interaction and networking among attendees. The standout was "AMA with Senior Faculty and Industry Leaders," organized by VESSL AI, which drew more than 300 students and mentors and created an invaluable platform for emerging professionals to interact with established leaders in the field.
2. Showcasing Innovation Amidst Challenges: Booths from Tech Giants to Emerging Challengers
Although the recent economic downturn has loomed over the tech industry, CVPR 2023 served as a vibrant arena for a formidable array of booths, where tech giants stood shoulder to shoulder with emerging challengers. The exhibits spanned the full ML stack: data acquisition and preprocessing, inference on pre-trained models, model refinement and tuning, architecture optimization, model training, CI/CD, hyperparameter tuning, and even load balancing for model serving.
A total of 119 exhibitors graced the event, each contributing their unique technological offerings and insights. Scale AI, for instance, captured the attendees’ attention with impressive demos such as the Scale Data Engine and the Spellbook, Forge, and Enterprise Generative AI Platforms. Microsoft, on the other hand, delved into the intricacies of vision, language, and multi-modal encoding through 41 thought-provoking publications. Google stood out with its wide-ranging contribution of 90 papers, while Meta and Amazon further enriched the knowledge pool with 65 and 22 publications, respectively.
3. Interesting trend #1: Downstream Tasks from Large Models (LMs)
One fascinating trend that emerged from CVPR 2023 was the focus on downstream tasks built on Large Models (LMs). This approach seeks solutions for niche problems by harnessing the abstract, general knowledge captured in large pretrained models. While not precisely defined under a single term, this trend conceptually embraces downstream tasks and fine-tuning techniques.
Meta’s research on Language-augmented Video-Language Pretraining, or LaViLa, exemplifies this trend. LaViLa improves video-language representations by augmenting them with auto-generated narrations. These narrations provide comprehensive coverage, excellent temporal synchronization, and diverse textual input. As a result, the learned embeddings outperform previous methods on video tasks, and LaViLa shows promising scalability and potential for future developments.
Another study that resonates with this trend focuses on weakly-supervised few-shot image classification and segmentation. It uses a self-supervised pretrained Vision Transformer (ViT), allowing the model to learn classification and segmentation effectively from image-level labels alone. Attention maps generated from the ViT's tokens serve as pixel-level pseudo-labels, leading to significant performance improvements, especially when pixel-level labels are scarce or absent.
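To make the pseudo-labeling idea concrete, here is a toy NumPy sketch (our own illustration, not the paper's actual code; the function name and the simple threshold scheme are assumptions) of how a class-token attention map over ViT patches could be upsampled and thresholded into a binary pixel-level pseudo-mask:

```python
import numpy as np

def attention_pseudo_labels(attn, patch_grid, img_size, threshold=0.6):
    """Turn a ViT [CLS]-to-patch attention vector into a binary
    pixel-level pseudo-mask (toy illustration only).

    attn: (num_patches,) attention weights from the class token.
    patch_grid: (h, w) patches per side.
    img_size: (H, W) output mask resolution.
    """
    h, w = patch_grid
    # Reshape the per-patch attention into a 2-D map.
    amap = attn.reshape(h, w)
    # Normalize to [0, 1].
    amap = (amap - amap.min()) / (amap.max() - amap.min() + 1e-8)
    # Nearest-neighbor upsample to pixel resolution.
    H, W = img_size
    ys = np.arange(H) * h // H
    xs = np.arange(W) * w // W
    up = amap[np.ix_(ys, xs)]
    # High-attention pixels become foreground pseudo-labels.
    return (up >= threshold).astype(np.uint8)

# Toy example: 4x4 patch grid where the top-left patch dominates.
attn = np.zeros(16)
attn[0] = 1.0
mask = attention_pseudo_labels(attn, (4, 4), (8, 8))
print(mask.sum())  # → 4 (the 2x2 block of pixels under the top-left patch)
```

In the actual research, these pseudo-masks would then supervise a segmentation head in place of expensive human-drawn pixel annotations.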
Lastly, a study on enhancing visual grounding in Vision-Language Pretraining (VLP) models using a Position-guided Text Prompt (PTP) also falls within this trend. PTP improves performance across various benchmarks, matching object-detector-based methods while offering faster inference. The authors promise to release code and pre-trained weights, making it a valuable contribution to this emerging trend.
4. Interesting trend #2: Generative AI
Google’s DreamBooth stood as a testament to the immense potential of Generative AI in computer vision. The technology's prowess is amplified even further when applied to prompt-driven editing of specific regions of an image, or to generating personalized images and videos.
In particular, the generation of video content holds great promise for industries like advertising. Short-form video platforms such as YouTube Shorts or TikTok could greatly benefit from such technology, and the applications could even extend to the broader entertainment industry.
The industry’s growing emphasis on generative AI is readily apparent when you consider the volume of Generative AI papers published by tech giants such as Google, NVIDIA, and Qualcomm. This emphasis underscores the potential and the ever-growing interest in this area, indicating that the field of Generative AI is burgeoning with opportunities for further research and exploration.
5. Interesting trend #3: Connectivity with Real World
As mentioned, a series of insightful workshops were conducted at CVPR 2023, centered around real-world datasets. A workshop titled “Visual Perception via Learning in an Open World” encapsulated this focus perfectly. The workshop shed light on the unique features of open-world data, such as its open-vocabulary, ontology/taxonomy of object classes, and evolving class ontology.
For instance, the datasets used in this workshop spanned a wide range of categories, such as mixed reality, autonomous driving, safety, and security in real life, agriculture, data captured from Unmanned Aerial Vehicles (UAVs) or satellites, animal behavior, and open dataset object categories. The datasets bore characteristics similar to raw data, with features like a long-tailed distribution, open-set, unknowns, streaming data, biased data, unlabeled data, anomalies, and multi-modality features.
The workshops and their focus on real-world datasets underscore the concerted effort within computer vision to strengthen its connectivity with real-world applications and scenarios. The implications of this focus are far-reaching, indicating a shift towards more practical, application-oriented research in computer vision.
6. Problems to tackle #1: Cost Management
The recent shortage of GPUs has had a profound impact on the industry and conferences like CVPR, where a diverse range of researchers, professors, and students from academia and industry gather to explore advancements in computer vision. This scarcity has permeated the atmosphere of CVPR, reflecting the challenges the entire industry faces. Insufficient GPU availability poses a significant obstacle to the training and evaluating of complex deep learning models, which are fundamental to numerous computer vision tasks.
Researchers often encounter extended waiting times for GPU access or struggle to acquire adequate computing power for effective model training and experimentation. This limitation hampers research progress and restricts the ability to achieve state-of-the-art performance.
In light of these circumstances, efficient resource management becomes paramount in mitigating the effects of GPU scarcity. It is imperative to allocate and utilize the available GPU resources judiciously, enabling researchers to maximize their utilization despite the limited availability. Prioritizing tasks, optimizing workflows, and promoting responsible resource sharing are critical considerations in overcoming the challenges posed by GPU shortages.
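As a simple illustration of the "prioritizing tasks" idea, the sketch below (a minimal toy scheduler of our own devising, not any particular platform's implementation) greedily admits the most urgent jobs that fit into a limited GPU pool and defers the rest:

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Job:
    priority: int                       # lower number = more urgent
    name: str = field(compare=False)
    gpus_needed: int = field(compare=False)

def schedule(jobs, free_gpus):
    """Greedy priority scheduler: admit the most urgent jobs that fit
    into the free GPU pool; defer jobs that do not fit."""
    heap = list(jobs)
    heapq.heapify(heap)                 # order by priority
    running, deferred = [], []
    while heap:
        job = heapq.heappop(heap)
        if job.gpus_needed <= free_gpus:
            free_gpus -= job.gpus_needed
            running.append(job.name)
        else:
            deferred.append(job.name)
    return running, deferred

jobs = [
    Job(2, "hyperparam-sweep", 4),
    Job(1, "paper-deadline-run", 2),
    Job(3, "ablation", 4),
]
running, deferred = schedule(jobs, free_gpus=6)
print(running)   # → ['paper-deadline-run', 'hyperparam-sweep']
print(deferred)  # → ['ablation']
```

Real cluster schedulers add preemption, fairness quotas, and time-sharing on top of this basic priority-and-capacity logic, but the core trade-off is the same: urgent work gets scarce GPUs first, and everything else waits.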
7. Problems to tackle #2: LLM & Multi-modality
The rise of large language models like GPT-4 has revolutionized the field of computer vision, marking a significant shift from solely analyzing visual data to incorporating linguistic understanding into visual interpretation. This evolution has enabled models to generate detailed descriptions of images, create text based on visual content, and produce images in response to textual prompts. These advancements have unlocked new opportunities for deciphering and manipulating visual data, leveraging the models' expansive linguistic proficiency and contextual comprehension.
Meanwhile, multi-modality, which fuses computer vision with other areas like natural language processing, sound, and speech, has gained considerable momentum. With this approach, models can assimilate and interpret data from diverse sources, thus achieving a holistic understanding and analysis. Multi-modal models excel in tasks such as image captioning and visual question answering (VQA), where the model processes visual and textual information to generate accurate responses. This cross-modal proficiency stimulates progress in image recognition, visual reasoning, and image synthesis. Furthermore, it paves the way for multidisciplinary applications that integrate computer vision with NLP, speech recognition, or audio analysis to solve complex problems requiring the fusion of various modalities.
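The simplest way such models combine modalities is late fusion: encode each modality separately, concatenate the embeddings, and feed them to a shared head. The toy NumPy sketch below (random stand-in embeddings and weights, purely illustrative) shows the shape of that computation for a VQA-style classifier over answer classes:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_classify(image_emb, text_emb, w):
    """Toy late fusion: concatenate the vision and text embeddings,
    then apply a linear head over answer classes."""
    joint = np.concatenate([image_emb, text_emb])  # (256,)
    logits = w @ joint                             # (10,)
    return int(logits.argmax())

image_emb = rng.normal(size=128)   # stand-in for a vision-encoder output
text_emb = rng.normal(size=128)    # stand-in for a text-encoder output
w = rng.normal(size=(10, 256))     # linear head over 10 answer classes

answer_id = fuse_and_classify(image_emb, text_emb, w)
print(answer_id)  # index of the predicted answer class
```

Production multi-modal models replace this concatenation with learned cross-attention between modalities, but the principle of projecting different modalities into a shared space before a joint prediction is the same.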
8. Problems to tackle #3: The Forgotten Region, Complement of the Set Named ML Algorithm
Once the model weights have been fine-tuned for tasks like detection, segmentation, and video analysis, the next exciting step, particularly for academic researchers, is to seamlessly integrate these models into real-life pipelines that interact with the complexities of the world around us. The transformation from theoretical advancements to practical applications brings these models’ true impact to life. By bridging the gap between academia and real-world implementation, researchers can witness their innovations making a tangible difference in various domains and industries.
MLOps, short for Machine Learning Operations, is indispensable for the future of computer vision, NLP, sound, speech, and various industries. By providing a framework for efficient deployment, continuous monitoring, scalability, collaboration, governance, cost optimization, and agility, MLOps empowers researchers to leverage machine learning effectively, drive innovation, and unlock the full potential of these technologies across diverse domains.
9. The Future of Computer Vision: ἰδέα rather than εἶδος
In a plenary session, Professor Rodney Brooks of MIT delved into archaic concepts from the past. These ideas transcend cutting-edge technology and probe the essence and origins of human endeavors: behavior can be contemplated as ἰδέα, as once expounded by the ancient Greek philosopher Plátōn. It is a formidable task to stand resolute amidst the rapid changes of the present and to anticipate the future, encompassing not only computer vision but other frontiers of human achievement as well. In such a context, we may discover elusive paths toward the ἰδέα. Fully conveying Professor Brooks' ideas and philosophy is challenging in this short article, leaving us only with the visual imagery of his presentation slides and the questions he posed:
- What has existed since humanity's inception?
- What lies ahead for the progress of computer vision?
- Does deep learning triumph over computer vision?
- So, what awaits us in the future?
At the momentous CVPR 2023, VESSL AI connected with a dynamic cohort of trailblazing ML researchers and influential industry leaders across the communications, finance, and healthcare realms. Fuelled by this transformative encounter, our passionate team is poised to attend many forthcoming conferences, driven by an unwavering desire to listen closely to user voices and forge ahead in shaping world-class MLOps services with a global impact. 🚀🌍✨
SungHyun Moon, ML Lead
Intae Ryoo, Product Manager