Introduction

Test-time scaling, which has proven highly effective at improving the reasoning of large language models (LLMs) such as OpenAI o1/o3 and DeepSeek-R1, holds significant promise for computer vision models as well. By allocating more computation during inference, vision models could achieve greater accuracy, robustness, and interpretability on complex tasks ranging from perception and understanding to reasoning and decision-making. This could improve performance in high-stakes domains such as medical imaging, autonomous driving, and security surveillance, where precision and interpretability are crucial. Extending test-time scaling to multimodal and generative models could also enable more sophisticated cross-modal reasoning and higher-quality content generation.

However, applying test-time scaling to vision models poses unique challenges. Vision tasks typically involve high-dimensional inputs, which makes scaling computation at test time more resource-intensive. Efficient algorithms will be needed so that the added computation does not lead to impractical processing times or energy consumption. Ensuring robustness and safety when models perform more inference computation, particularly in dynamic or adversarial environments, will also be crucial.

The Workshop on Test-time Scaling for Computer Vision (ViSCALE) at CVPR 2025 aims to explore the frontiers of scaling test-time computation in vision models, addressing both theoretical advances and practical implementations. We will discuss how well test-time scaling suits traditional vision tasks such as perception, as well as its extension to multimodal and generative models, with the goal of improving performance in critical domains. The workshop will also cover efficient algorithms, robustness and safety considerations, and the new problems that test-time scaling poses for computer vision. By bringing together experts, it seeks to foster collaboration and innovation in applying this paradigm to push the limits of computer vision.

Speakers

Trevor Darrell, U.C. Berkeley

Saining Xie, New York University

Yue Zhao, U.T. Austin

Cihang Xie, U.C. Santa Cruz

Ludwig Schmidt, Stanford University

Chen Qiu, Bosch Center for AI

Schedule

Session	From	To
Opening Remarks	09:00 AM	09:05 AM
Keynote Talk by Trevor Darrell	09:05 AM	09:30 AM
Keynote Talk by Saining Xie	09:30 AM	09:55 AM
Keynote Talk by Yue Zhao	09:55 AM	10:20 AM
Coffee Break	10:20 AM	10:30 AM
Keynote Talk by Cihang Xie	10:30 AM	10:55 AM
Keynote Talk by Ludwig Schmidt	10:55 AM	11:20 AM
Keynote Talk by Chen Qiu	11:20 AM	11:45 AM
Lightning Talks	11:45 AM	12:25 PM
Closing Remarks	12:25 PM	12:30 PM
Poster Session at ExHall D	09:00 AM	12:30 PM

Important Dates

Submission Deadline: March 15, 2025 (AoE)

Notification of Acceptance: March 29, 2025 (AoE)

Camera-Ready Submission: April 14, 2025 (AoE)

Workshop Date: June 12, 2025 (morning)

Accepted Papers

The following papers were accepted to the workshop.

On the Suitability of Reinforcement Fine-Tuning to Visual Tasks
TTGen: Incorporating Test-time scaling to diffusion models
Can Neural Networks Decide Their Own Depth?
Centaur: Robust End-to-End Autonomous Driving with Test-Time Training
Get a GRIP on Test Time Adaptation! Group Robust Inference-Time Policy Optimization for Vision Models
EasyARC: Evaluating Vision Language Models on True Visual Reasoning
TimeZero: Temporal Video Grounding with Reasoning-Guided LVLM