Computer Vision

Fast Encoder-Based 3D from Casual Videos via Point Track Processing

This paper addresses the long-standing challenge of reconstructing 3D structures from videos with dynamic content. Current approaches to this problem either were not designed to operate on casual videos recorded by standard cameras or require a long …

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Abstract Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times.

Learning to Initiate and Reason in Event-Driven Cascading Processes

We describe “Cascade”, a new counterfactual reasoning setup. An agent is provided with a semantic instruction and the results of an already played-out dynamical system. Its goal is to intervene in the dynamic environment, triggering a cascade of events that will lead to a different, counterfactual outcome.

Point-Cloud Completion with Pretrained Text-to-image Diffusion Models

Abstract Point-cloud data collected in real-world applications are often incomplete because objects are observed from a specific viewpoint, which captures only one perspective. Data can also be incomplete due to occlusion and low-resolution sampling.

Key-Locked Rank One Editing for Text-to-Image Personalization

Summary: We present Perfusion, a new text-to-image personalization method. With only a 100KB model size, trained for roughly 4 minutes, Perfusion can creatively portray personalized objects. It allows significant changes in their appearance, while maintaining their identity, using a novel mechanism we call “Key-Locking”.
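
As a rough illustration of the two ingredients named in the title and summary (a rank-one weight edit and a locked key), here is a toy PyTorch sketch on a single cross-attention layer. The dimensions, tensors, and variable names are assumptions for illustration only, not the authors' implementation.

```python
import torch

# Toy illustration (not the authors' code): a rank-one edit of a cross-attention
# value projection, with the concept's key "locked" to its supercategory.
d_text, d_attn = 768, 320                    # assumed embedding sizes

W_k = torch.randn(d_attn, d_text)            # key projection, kept frozen
W_v = torch.randn(d_attn, d_text)            # value projection, edited below

concept_emb = torch.randn(d_text)            # embedding of the personalized concept's token
supercat_emb = torch.randn(d_text)           # embedding of its supercategory word (e.g. "cat")
v_target = torch.randn(d_attn)               # learned value the concept should map to

# Key-locking: the concept reuses its supercategory's key, so it attends to
# image regions the way the broader category would.
k_locked = W_k @ supercat_emb

# Rank-one edit: change W_v only along the concept-embedding direction, so that
# W_v_edited @ concept_emb == v_target while unrelated inputs are barely affected.
u = concept_emb / concept_emb.dot(concept_emb)
W_v_edited = W_v + torch.outer(v_target - W_v @ concept_emb, u)

print(torch.allclose(W_v_edited @ concept_emb, v_target, atol=1e-3))  # True
print(k_locked.shape, W_v_edited.shape)
```

Because the edit is rank one and tied to the concept's embedding direction, the stored change stays tiny, which is consistent with the very small model size reported above.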

Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models

Summary: We use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps. Abstract: Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user-provided concepts, embedding them into new scenes guided by natural language prompts.
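
To make the workflow concrete, here is a toy sketch of the encoder step described above: a small network maps the single concept image to an initial word embedding, which is then refined in only a handful of tuning steps. The tiny ConvNet and the 768-dimensional embedding are assumptions for illustration, not the paper's architecture.

```python
import torch

class ConceptEncoder(torch.nn.Module):
    """Maps one concept image to an initial word embedding for the new concept."""
    def __init__(self, d_word=768):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 32, 4, stride=4), torch.nn.ReLU(),
            torch.nn.Conv2d(32, 64, 4, stride=4), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
        )
        self.to_embedding = torch.nn.Linear(64, d_word)

    def forward(self, image):
        return self.to_embedding(self.backbone(image))

encoder = ConceptEncoder()
single_image = torch.randn(1, 3, 512, 512)   # the user's one concept photo (dummy tensor here)
init_embedding = encoder(single_image)       # starting point for the new concept's token

# This predicted initialization replaces a long per-concept optimization; it is
# then refined with a short fine-tuning run (the 5-15 steps mentioned above).
print(init_embedding.shape)                  # torch.Size([1, 768])
```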

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Abstract Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.
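
For readers unfamiliar with the technique named in the title, the rough idea of textual inversion is to learn one new word embedding for a placeholder token by minimizing the usual denoising loss on the user's images while the generator stays frozen. The sketch below uses a toy stand-in denoiser and made-up shapes; it is not the paper's training code.

```python
import torch
import torch.nn.functional as F

class ToyDenoiser(torch.nn.Module):
    """Stand-in for a frozen text-conditioned diffusion U-Net."""
    def __init__(self, d_word=768, channels=4):
        super().__init__()
        self.cond_proj = torch.nn.Linear(d_word, channels)
        self.body = torch.nn.Conv2d(channels, channels, 3, padding=1)

    def forward(self, noisy_latent, word_embedding):
        cond = self.cond_proj(word_embedding).view(1, -1, 1, 1)
        return self.body(noisy_latent + cond)

d_word = 768
denoiser = ToyDenoiser(d_word)
for p in denoiser.parameters():              # the generator is kept frozen
    p.requires_grad_(False)

v_star = torch.randn(d_word, requires_grad=True)   # the only trainable parameter
optimizer = torch.optim.Adam([v_star], lr=5e-3)

concept_latents = [torch.randn(1, 4, 64, 64) for _ in range(5)]   # dummy "user images"

for step in range(200):
    x0 = concept_latents[step % len(concept_latents)]
    noise = torch.randn_like(x0)
    noisy = x0 + noise                       # crude stand-in for the diffusion forward process

    # In the real method, v_star is inserted as the embedding of the placeholder
    # token inside prompts such as "a photo of S*".
    pred_noise = denoiser(noisy, v_star)
    loss = F.mse_loss(pred_noise, noise)     # standard denoising objective

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```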

"This is my unicorn, Fluffy": Personalizing frozen vision-language representations

Abstract Large Vision & Language models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language.

Perception and Reasoning

Understanding a complex scene goes far beyond top-down perception. When people operate in a natural scene, they can detect and recognize objects and relations using context, predict how objects and people will move next, and even reason about why they behave as they do.

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Can a generative model be trained to produce images from a specific domain, guided by a text prompt only, without seeing any image? In other words: can an image generator be trained blindly?
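
As a rough sketch of how such "blind" training can work, the snippet below shows a directional CLIP-style loss: the shift between the frozen and the adapted generator's outputs in image-embedding space is pushed to align with the shift between a source and a target text prompt in text-embedding space. Every module here is a random stand-in rather than the real CLIP or StyleGAN network.

```python
import torch
import torch.nn.functional as F

d_clip, d_latent, d_img = 512, 128, 3 * 64 * 64

clip_image_encoder = torch.nn.Linear(d_img, d_clip)        # stand-in for CLIP's image encoder
text_direction = F.normalize(torch.randn(d_clip), dim=0)   # stand-in for E_T(target) - E_T(source)

G_frozen = torch.nn.Linear(d_latent, d_img)                # source-domain generator, kept frozen
G_train = torch.nn.Linear(d_latent, d_img)                 # copy being adapted to the text-described domain
G_train.load_state_dict(G_frozen.state_dict())
with torch.no_grad():                                      # tiny perturbation so the image-space
    for p in G_train.parameters():                         # direction is non-zero at the first step
        p.add_(1e-3 * torch.randn_like(p))

for p in list(G_frozen.parameters()) + list(clip_image_encoder.parameters()):
    p.requires_grad_(False)

optimizer = torch.optim.Adam(G_train.parameters(), lr=1e-4)

for step in range(100):
    z = torch.randn(8, d_latent)                           # the same latents go through both generators
    img_src, img_new = G_frozen(z), G_train(z)

    # Directional loss: the image-space shift should match the text-space shift.
    image_direction = F.normalize(
        clip_image_encoder(img_new) - clip_image_encoder(img_src), dim=-1)
    loss = (1 - (image_direction * text_direction).sum(dim=-1)).mean()

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

No target-domain images are ever shown to the generator in this loop; only the text-derived direction steers the adaptation, which is the sense in which training is "blind".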