This paper addresses the long-standing challenge of reconstructing 3D structures from videos with dynamic content. Current approaches to this problem either were not designed to operate on casual videos recorded by standard cameras or require a long …
Abstract: Text-to-image (T2I) personalization allows users to guide the creative image generation process by combining their own visual concepts in natural language prompts. Recently, encoder-based techniques have emerged as a new effective approach for T2I personalization, reducing the need for multiple images and long training times.
We describe “Cascade”, a new counterfactual reasoning setup. An agent is provided with a semantic instruction and the results of a played-out dynamical system. Its goal is to intervene in the dynamic environment, triggering a cascade of events that will lead to a different, counterfactual outcome.
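A minimal sketch of this setup, with all names (Intervention, simulate, outcome_of, matches) as hypothetical stand-ins rather than the benchmark's actual API: the agent searches over candidate edits to the initial conditions, re-simulates the dynamics, and keeps an edit whose rollout ends in an outcome that differs from the observed one and satisfies the instruction.

```python
# Hypothetical sketch of the intervene-and-resimulate loop; not the paper's code.
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple


@dataclass
class Intervention:
    """A hypothetical edit to the initial state, e.g. moving one object."""
    object_id: int
    new_position: Tuple[float, float]


def find_counterfactual_intervention(
    initial_state: dict,
    instruction: str,
    observed_outcome: str,
    candidates: Iterable[Intervention],
    simulate: Callable[[dict], List[dict]],    # rolls out the dynamical system
    outcome_of: Callable[[List[dict]], str],   # summarizes a rollout's final outcome
    matches: Callable[[str, str], bool],       # does the outcome satisfy the instruction?
) -> Optional[Intervention]:
    """Return the first intervention whose re-simulated rollout reaches an outcome
    that differs from the observed one and satisfies the semantic instruction."""
    for intervention in candidates:
        edited = dict(initial_state)
        edited[intervention.object_id] = intervention.new_position
        new_outcome = outcome_of(simulate(edited))
        if new_outcome != observed_outcome and matches(instruction, new_outcome):
            return intervention
    return None
```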
Abstract: Point-cloud data collected in real-world applications are often incomplete, because objects are observed from specific viewpoints that capture only one perspective. Data can also be incomplete due to occlusion and low-resolution sampling.
Summary: We present Perfusion, a new text-to-image personalization method. With only a 100KB model size, trained for roughly 4 minutes, Perfusion can creatively portray personalized objects. It allows significant changes in their appearance, while maintaining their identity, using a novel mechanism we call “Key-Locking”.
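One illustrative reading of Key-Locking, sketched below under the assumption that personalization acts in the cross-attention layers of the diffusion model: the key computed for the personalized concept token is locked to the key of its super-category word (e.g. treating a personal teddy bear as a “toy”), while the value pathway remains free to carry the concept's appearance. The function and tensor names are assumptions for illustration, not Perfusion's actual code.

```python
# Illustrative key-locked cross-attention; names and shapes are assumptions.
import torch
import torch.nn.functional as F


def key_locked_cross_attention(
    queries: torch.Tensor,       # (batch, n_pixels, dim) image-side queries
    text_emb: torch.Tensor,      # (batch, n_tokens, dim) prompt token embeddings
    W_k: torch.Tensor,           # (dim, dim) key projection
    W_v: torch.Tensor,           # (dim, dim) value projection (learned for the concept)
    concept_idx: int,            # position of the personalized token in the prompt
    supercat_emb: torch.Tensor,  # (dim,) embedding of the super-category word
) -> torch.Tensor:
    keys = (text_emb @ W_k).clone()
    values = text_emb @ W_v
    # "Lock": the concept token attends with its super-category's key.
    keys[:, concept_idx, :] = supercat_emb @ W_k
    attn = F.softmax(queries @ keys.transpose(1, 2) / keys.shape[-1] ** 0.5, dim=-1)
    return attn @ values
```

The intuition, as we read the summary, is that where the concept is attended to follows its super-category, so prompts still compose naturally, while the learned values carry its identity.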
Summary: We use an encoder to personalize a text-to-image model to new concepts with a single image and 5-15 tuning steps.
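A hedged sketch of that workflow: an encoder predicts an initial token embedding from the single concept image, and a few optimization steps with a standard diffusion denoising loss refine it. The `encoder` and `diffusion_model.denoising_loss` interfaces are placeholders for illustration, not the paper's actual components.

```python
# Hypothetical few-step personalization loop; component APIs are placeholders.
import random
import torch


def personalize(concept_image, encoder, diffusion_model, prompts, steps=10, lr=1e-4):
    # 1) The encoder gives a good starting point from a single image.
    token_embedding = encoder(concept_image).detach().clone().requires_grad_(True)
    optimizer = torch.optim.Adam([token_embedding], lr=lr)

    # 2) A handful of tuning steps (e.g. 5-15) specialize it to the concept.
    for _ in range(steps):
        prompt = random.choice(prompts)
        loss = diffusion_model.denoising_loss(concept_image, prompt, token_embedding)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return token_embedding
```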
Abstract: Text-to-image personalization aims to teach a pre-trained diffusion model to reason about novel, user-provided concepts, embedding them into new scenes guided by natural language prompts.
Abstract: Text-to-image models offer unprecedented freedom to guide creation through natural language. Yet, it is unclear how such freedom can be exercised to generate images of specific unique concepts, modify their appearance, or compose them in new roles and novel scenes.
Abstract: Large Vision & Language models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language.
Understanding a complex scene goes far beyond top-down perception. When people operate in a natural scene, they can detect and recognize objects and relations using context, predict how objects and people will move next, and even reason about why they behave as they do.
Can a generative model be trained to produce images from a specific domain, guided by a text prompt only, without seeing any image? In other words: can an image generator be trained blindly?
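One concrete way to make this question precise, assuming access to a pretrained joint text-image embedding such as CLIP, is to supervise the generator with text alone through a directional loss: the shift between images from a frozen source generator and its trainable copy should align with the shift between a source and a target text prompt. The sketch below is written under that assumption and is not necessarily the paper's exact training objective.

```python
# Hedged sketch of a text-only (image-free) directional guidance loss.
import torch.nn.functional as F


def directional_text_loss(clip_model, trained_images, frozen_images,
                          source_text_emb, target_text_emb):
    """Cosine loss between the image-space shift (trainable vs. frozen generator)
    and the text-space shift (target vs. source prompt)."""
    img_dir = clip_model.encode_image(trained_images) - clip_model.encode_image(frozen_images)
    txt_dir = target_text_emb - source_text_emb
    img_dir = F.normalize(img_dir, dim=-1)
    txt_dir = F.normalize(txt_dir, dim=-1)
    return (1.0 - F.cosine_similarity(img_dir, txt_dir, dim=-1)).mean()
```

Aligning directions of change, rather than pushing every output straight toward the target text embedding, is meant to discourage all generated images from collapsing onto a single prototypical image of the target domain.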