Computer Vision

Domain-Agnostic Tuning-Encoder for Fast Personalization of Text-To-Image Models

Abstract: Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts with natural language prompts. Recently, encoder-based techniques have emerged as an effective new approach for T2I personalization, reducing the need for multiple images and long training times.

Learning to Initiate and Reason in Event-Driven Cascading Processes

We describe “Cascade”, a new counterfactual reasoning setup. An agent is provided with a semantic instruction and the results of a played-out dynamical system. Its goal is to intervene in the dynamic environment, triggering a cascade of events that will lead to a different, counterfactual outcome.
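
As a toy illustration of this setup (a made-up domino chain, not the Cascade environment or its dynamics), replaying a process with a single intervention yields a different, counterfactual outcome:

```python
# Toy illustration of the counterfactual setup: replay a cascading process
# with one intervention and observe a different outcome.
# The domino chain below is a made-up stand-in, not the Cascade benchmark.

def run_cascade(dominoes, pushed):
    """Topple dominoes left to right from `pushed`; a gap (False) stops the chain."""
    fallen = []
    i = pushed
    while i < len(dominoes) and dominoes[i]:
        fallen.append(i)
        i += 1
    return fallen

# Observed roll-out: domino 2 is missing, so the cascade stops early.
observed = run_cascade(dominoes=[True, True, False, True, True], pushed=0)
print(observed)        # [0, 1] -> the last domino never falls

# Intervention: stand domino 2 back up, then replay the same dynamics.
counterfactual = run_cascade(dominoes=[True, True, True, True, True], pushed=0)
print(counterfactual)  # [0, 1, 2, 3, 4] -> a different, counterfactual outcome
```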

Point-Cloud Completion with Pretrained Text-to-image Diffusion Models

Abstract: Point-cloud data collected in real-world applications are often incomplete, because objects are observed from specific viewpoints that capture only one perspective. Data can also be incomplete due to occlusion and low-resolution sampling.

Key-Locked Rank One Editing for Text-to-Image Personalization

Summary: We present Perfusion, a new text-to-image personalization method. With a model size of only 100KB, trained for roughly 4 minutes, Perfusion can creatively portray personalized objects. It allows significant changes in their appearance while maintaining their identity, using a novel mechanism we call “Key-Locking”.
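
As a rough sketch of the rank-one editing idea behind this (illustrative tensors and dimensions only; Perfusion's actual key-locking rule and training procedure are more involved), a single outer-product update can redirect one concept's cross-attention key toward its supercategory while leaving orthogonal directions untouched:

```python
# Minimal sketch of a rank-one edit to a cross-attention key projection.
# All tensors and dimensions are illustrative stand-ins, not Perfusion's weights.
import torch

d_text, d_key = 768, 320                        # illustrative embedding sizes
W_k = torch.randn(d_key, d_text)                # pretrained key projection

e_concept = torch.randn(d_text)                 # embedding of the personalized concept
e_super = torch.randn(d_text)                   # embedding of its supercategory word

# "Key-locking" idea: a rank-one correction that maps the concept's key onto the
# supercategory's key, acting only along the concept's embedding direction.
u = e_concept / e_concept.dot(e_concept)        # chosen so that u . e_concept == 1
delta = torch.outer(W_k @ (e_super - e_concept), u)   # a single outer product
W_k_edited = W_k + delta

# The concept now attends like its supercategory; orthogonal inputs are unchanged.
print(torch.allclose(W_k_edited @ e_concept, W_k @ e_super, atol=1e-3))
```

Storing just the vectors of such an update, rather than full fine-tuned weights, is consistent with the very small model sizes mentioned above.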

Encoder-based Domain Tuning for Fast Personalization of Text-to-Image Models

Summary: We use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps. Abstract: Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user-provided concepts, embedding them into new scenes guided by natural language prompts.
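
A hedged sketch of this two-stage recipe (predict an initial embedding with an encoder, then refine it for a handful of steps) is given below; the encoder, image features, and loss are stand-in placeholders rather than the paper's architecture or objective:

```python
# Minimal sketch of "encoder prediction followed by a few tuning steps".
# The encoder, image features, and loss below are stand-ins, not the paper's code.
import torch
import torch.nn as nn

class ConceptEncoder(nn.Module):
    """Maps features of a single concept image to an initial word embedding (stand-in)."""
    def __init__(self, feat_dim=512, emb_dim=768):
        super().__init__()
        self.net = nn.Linear(feat_dim, emb_dim)

    def forward(self, image_features):
        return self.net(image_features)

image_features = torch.randn(1, 512)          # placeholder features of the single image
encoder = ConceptEncoder()

# 1) The encoder predicts a starting point for the new concept's embedding.
concept_embedding = encoder(image_features).detach().clone().requires_grad_(True)

# 2) A handful of tuning steps (5-15 in the summary above) refine that prediction.
optimizer = torch.optim.Adam([concept_embedding], lr=1e-3)
for step in range(10):
    # The real method would use a denoising loss on the concept image; a dummy
    # quadratic loss keeps this sketch self-contained and runnable.
    loss = (concept_embedding ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```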

An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion

Abstract: Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.

"This is my unicorn, Fluffy": Personalizing frozen vision-language representations

Abstract: Large Vision & Language models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language.

Perception and Reasoning

Understanding a complex scene goes far beyond top-down perception. When people operate in a natural scene, they can detect and recognize objects and relations using context, predict how objects and people will move next, and even reason about why they behave as they do.

StyleGAN-NADA: CLIP-Guided Domain Adaptation of Image Generators

Can a generative model be trained to produce images from a specific domain, guided by a text prompt only, without seeing any image? In other words: can an image generator be trained blindly?
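
The usual answer is to supervise the generator in a joint image-text embedding space. The sketch below shows a directional loss of that kind, with random stand-in encoders in place of CLIP and flattened tensors in place of generator outputs; all names and shapes are illustrative:

```python
# Minimal sketch of a directional loss in a joint image-text space.
# The encoders here are random stand-ins; StyleGAN-NADA itself uses CLIP.
import torch
import torch.nn.functional as F

def directional_loss(img_src, img_gen, txt_src, txt_tgt, image_enc, text_enc):
    """Align the image-space change with the change requested by the text prompts."""
    d_img = image_enc(img_gen) - image_enc(img_src)   # how the images changed
    d_txt = text_enc(txt_tgt) - text_enc(txt_src)     # how the prompts say they should change
    return 1 - F.cosine_similarity(d_img, d_txt, dim=-1).mean()

# Stand-in 512-d embedding spaces and flattened 64x64 "images".
image_enc = torch.nn.Linear(3 * 64 * 64, 512)
text_enc = torch.nn.Embedding(2, 512)                 # index 0 = source text, 1 = target text

img_src = torch.randn(4, 3 * 64 * 64)                       # frozen generator's outputs
img_gen = torch.randn(4, 3 * 64 * 64, requires_grad=True)   # adapted generator's outputs

loss = directional_loss(img_src, img_gen,
                        torch.tensor([0]), torch.tensor([1]),
                        image_enc, text_enc)
loss.backward()   # in training, these gradients would update the adapted generator
```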

Compositional Video Synthesis with Action Graphs

Abstract: Videos of actions are complex signals, containing rich compositional structure. Current video generation models are limited in their ability to generate such videos. To address this challenge, we introduce a generative model (AG2Vid) that can be conditioned on an Action Graph, a structure that naturally represents the dynamics of actions and interactions between objects.
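
To make the conditioning structure concrete, here is a minimal, hypothetical sketch of an action graph as a data structure; the field names, timing representation, and example are illustrative rather than AG2Vid's actual schema:

```python
# Minimal sketch of an action graph: objects as nodes, timed actions as edges.
# Field names and the example are illustrative, not AG2Vid's schema.
from dataclasses import dataclass, field

@dataclass
class ActionEdge:
    subject: str        # object performing the action
    action: str         # action label, e.g. "pick up"
    target: str         # object the action is applied to
    start: int          # first frame of the action
    end: int            # last frame of the action

@dataclass
class ActionGraph:
    objects: list[str] = field(default_factory=list)
    edges: list[ActionEdge] = field(default_factory=list)

    def active_at(self, frame: int) -> list[ActionEdge]:
        """Actions that condition the generated video at a given frame."""
        return [e for e in self.edges if e.start <= frame <= e.end]

# Example: a hand picks up a cup and then places it on a table.
graph = ActionGraph(
    objects=["hand", "cup", "table"],
    edges=[
        ActionEdge("hand", "pick up", "cup", start=0, end=8),
        ActionEdge("hand", "place on", "table", start=9, end=16),
    ],
)
print([e.action for e in graph.active_at(frame=10)])   # -> ['place on']
```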