"This is my unicorn, Fluffy": Personalizing frozen vision-language representations

Large Vision & Language models pretrained on web-scale data provide representations that are invaluable for numerous V&L problems. However, it is unclear how they can be used for reasoning about user-specific visual concepts in unstructured language. This problem arises in multiple domains, from personalized image retrieval to personalized interaction with smart devices. We introduce a new learning setup called Personalized Vision & Language (PerVL) with two new benchmark datasets for retrieving and segmenting user-specific "personalized" concepts "in the wild".
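A minimal sketch of the retrieval side of this setup, assuming frozen CLIP embeddings from the Hugging Face transformers library; the file names, checkpoint, and the simple embedding-averaging strategy are illustrative assumptions, not the paper's method:

```python
# Sketch (not the paper's method): represent a personal concept ("Fluffy") with a
# frozen CLIP model by averaging embeddings of a few user photos, then retrieve
# gallery images closest to that concept vector. File paths are placeholders.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def embed(paths):
    images = [Image.open(p) for p in paths]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)
    return torch.nn.functional.normalize(feats, dim=-1)

# Frozen representation of the personalized concept from a handful of photos.
concept = embed(["fluffy_1.jpg", "fluffy_2.jpg"]).mean(dim=0)

# Rank gallery images by cosine similarity to the concept vector.
gallery = embed(["img_a.jpg", "img_b.jpg", "img_c.jpg"])
scores = gallery @ concept
print(scores.argsort(descending=True))
```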

Optimizing tensor network contraction using reinforcement learning

Quantum Computing (QC) stands to revolutionize computing, but current hardware remains limited. To develop and test quantum algorithms today, quantum circuits are often simulated on classical computers. Simulating a complex quantum circuit requires computing the contraction of a large network of tensors. The order (path) of contraction can have a drastic effect on the computing cost, but finding an efficient order is a challenging combinatorial optimization problem.
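To see why the contraction order matters, NumPy's einsum_path can report the cost of different orders on a toy chain of tensors (the shapes below are illustrative):

```python
import numpy as np

# Toy tensor chain A(2x1000), B(1000x2), C(2x1000). Contracting left-to-right,
# (A@B)@C, needs only ~8e3 multiplications; contracting B@C first forms a
# 1000x1000 intermediate and costs ~4e6. Same result, ~500x cost difference.
A = np.random.rand(2, 1000)
B = np.random.rand(1000, 2)
C = np.random.rand(2, 1000)

# einsum_path searches for a low-cost contraction order and reports FLOP estimates.
path, report = np.einsum_path("ij,jk,kl->il", A, B, C, optimize="optimal")
print(report)  # naive vs optimized cost, plus the chosen pairwise order
```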

Point-Cloud Completion with Pretrained Text-to-image Diffusion Models

Point-cloud data collected in real-world applications are often incomplete, because objects are observed from specific viewpoints that capture only one perspective. Data can also be incomplete due to occlusion and low-resolution sampling. Existing approaches to completion rely on training models with datasets of predefined objects to guide the completion of point clouds. Unfortunately, these approaches fail to generalize when tested on objects or real-world setups that are poorly represented in the training data.

SceneScape: Text-Driven Consistent Scene Generation

We present a method for text-driven perpetual view generation – synthesizing long-term videos of various scenes solely from an input text prompt describing the scene and camera poses. We introduce a novel framework that generates such videos in an online fashion by combining the generative power of a pre-trained text-to-image model with the geometric priors learned by a pre-trained monocular depth prediction model. To tackle the pivotal challenge of achieving 3D consistency, i.e., synthesizing videos that depict geometrically plausible scenes, we apply online test-time training that encourages the predicted depth map of each frame to be geometrically consistent with the scene synthesized so far.
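A minimal NumPy sketch of the geometric step such a pipeline relies on, unprojecting a frame's depth to 3D and reprojecting it into the next camera pose; the intrinsics, pose, and depth values are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

H, W = 4, 4                                   # tiny frame for illustration
K = np.array([[2.0, 0, W / 2],                # assumed pinhole intrinsics
              [0, 2.0, H / 2],
              [0, 0, 1.0]])
depth = np.full((H, W), 2.0)                  # e.g. from a monocular depth model

# Unproject: pixel (u, v) with depth d maps to d * K^-1 @ [u, v, 1].
u, v = np.meshgrid(np.arange(W), np.arange(H))
pix = np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
points = depth.ravel() * (np.linalg.inv(K) @ pix)   # 3 x N points, camera frame

# Reproject into the next pose (rotation R, translation t of camera 2 w.r.t. camera 1).
R = np.eye(3)
t = np.array([[0.1], [0.0], [0.0]])                 # small sideways camera motion
proj = K @ (R @ points + t)
uv_next = proj[:2] / proj[2]                        # pixel coords in the next frame
print(uv_next.round(2))
```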

Magic3D: High-Resolution Text-to-3D Content Creation

DreamFusion has recently demonstrated the utility of a pre-trained text-to-image diffusion model to optimize Neural Radiance Fields (NeRF), achieving remarkable text-to-3D synthesis results. However, the method has two inherent limitations: (a) extremely slow optimization of NeRF and (b) low-resolution image space supervision on NeRF, leading to low-quality 3D models with a long processing time. In this paper, we address these limitations by utilizing a two-stage optimization framework.
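For reference, the score distillation sampling (SDS) gradient introduced by DreamFusion, which this two-stage framework builds on (notation summarized from the DreamFusion paper, not Magic3D's exact formulation):

```latex
% SDS gradient from DreamFusion: theta are the 3D (NeRF/mesh) parameters,
% x = g(theta) is a rendered view, x_t its noised version at timestep t,
% and \hat{\epsilon}_\phi is the frozen text-to-image denoiser conditioned on prompt y.
\nabla_{\theta}\,\mathcal{L}_{\mathrm{SDS}}
  = \mathbb{E}_{t,\epsilon}\!\left[
      w(t)\,\bigl(\hat{\epsilon}_{\phi}(x_t;\,y,\,t)-\epsilon\bigr)\,
      \frac{\partial x}{\partial \theta}
    \right]
```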

ChipNeMo: Domain-Adapted LLMs for Chip Design

ChipNeMo aims to explore the applications of large language models (LLMs) for industrial chip design. Instead of directly deploying off-the-shelf commercial or open-source LLMs, we adopt the following domain adaptation techniques: custom tokenizers, domain-adaptive continued pretraining, supervised fine-tuning (SFT) with domain-specific instructions, and domain-adapted retrieval models. We evaluate these methods on three selected LLM applications for chip design: an engineering assistant chatbot, EDA script generation, and bug summarization and analysis.
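As a hedged sketch of one of these steps, domain tokenizer adaptation can be illustrated with the Hugging Face train_new_from_iterator API; the base model, toy corpus, and vocabulary size below are illustrative assumptions, not ChipNeMo's exact procedure:

```python
# Train a new tokenizer on in-domain text so chip-design jargon is not
# fragmented into many subwords. The two-line corpus is a toy stand-in.
from transformers import AutoTokenizer

base = AutoTokenizer.from_pretrained("gpt2")
domain_corpus = [
    "set_false_path -from [get_clocks clk_a] -to [get_clocks clk_b]",
    "report_timing -delay_type max -max_paths 10",
]  # in practice: large volumes of RTL, EDA scripts, and internal docs

# Keeps the base tokenization algorithm but learns a domain vocabulary.
adapted = base.train_new_from_iterator(iter(domain_corpus), vocab_size=1000)
print(adapted.tokenize("set_false_path -from [get_clocks clk_a]"))
```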

Norm-guided latent space exploration for text-to-image generation

Text-to-image diffusion models show great potential in synthesizing a large variety of concepts in new compositions and scenarios. However, their latent seed space is still not well understood and has been shown to affect the generation of new and rare concepts. Specifically, simple operations like interpolation and centroid finding work poorly with the standard Euclidean and spherical metrics in the latent space. This paper makes the observation that current training procedures make diffusion models biased toward inputs with a narrow range of norm values.
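The norm observation is easy to reproduce numerically: i.i.d. Gaussian seeds concentrate around norm sqrt(d), and naive Euclidean operations leave that narrow shell (dimensions below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 16384                                   # latent dimensionality (illustrative)
seeds = rng.standard_normal((100, d))
norms = np.linalg.norm(seeds, axis=1)
print(norms.mean(), norms.std())            # ~128 (= sqrt(d)) with tiny spread

# Euclidean midpoint of two seeds: norm falls to ~sqrt(d/2), off the shell.
mid = 0.5 * (seeds[0] + seeds[1])
print(np.linalg.norm(mid))                  # ~90.5

# Centroid of all 100 seeds: norm collapses to ~sqrt(d/100).
print(np.linalg.norm(seeds.mean(axis=0)))   # ~12.8
```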

Linguistic Binding in Diffusion Models: Enhancing Attribute Correspondence through Attention Map Alignment

Text-conditioned image generation models often generate incorrect associations between entities and their visual attributes. This reflects an impaired mapping between linguistic binding of entities and modifiers in the prompt and visual binding of the corresponding elements in the generated image. As one notable example, a query like "a yellow tomato and a red lemon" may incorrectly produce an image of a yellow lemon and a red tomato.
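The linguistic half of this mapping can be sketched with an off-the-shelf dependency parser, extracting which modifier should bind to which entity; using spaCy here is an assumption for illustration, not necessarily the paper's exact tooling:

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # requires: python -m spacy download en_core_web_sm
doc = nlp("a yellow tomato and a red lemon")

# Adjectival modifiers ("amod") attach to the noun they should bind to.
bindings = [(tok.text, tok.head.text) for tok in doc if tok.dep_ == "amod"]
print(bindings)                      # [('yellow', 'tomato'), ('red', 'lemon')]
```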

Online Overexposed Pixels Hallucination in Videos with Adaptive Reference Frame Selection

Low dynamic range (LDR) cameras cannot deal with wide dynamic range inputs, frequently leading to local overexposure issues. We present a learning-based system to reduce these artifacts without resorting to complex acquisition mechanisms like alternating exposures or the costly processing that is typical of high dynamic range (HDR) imaging. We propose a transformer-based deep neural network (DNN) to infer the missing HDR details. In an ablation study, we show the importance of using a multiscale DNN and training it with a suitable cost function to achieve state-of-the-art quality.
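A minimal sketch of what adaptive reference frame selection could look like, choosing the least-clipped neighboring frame; the threshold, window, and scoring below are illustrative assumptions, not the paper's algorithm:

```python
import numpy as np

def saturation_mask(frame, thresh=0.95):
    """Pixels clipped by the LDR sensor (values near the top of the range)."""
    return (frame >= thresh).any(axis=-1)     # H x W boolean mask

def pick_reference(frames, t, window=3):
    """Return the index near time t whose frame is least overexposed."""
    lo, hi = max(0, t - window), min(len(frames), t + window + 1)
    candidates = [i for i in range(lo, hi) if i != t]
    return min(candidates, key=lambda i: saturation_mask(frames[i]).sum())

frames = np.random.rand(10, 64, 64, 3)        # toy LDR video, values in [0, 1]
frames[4, :32] = 1.0                          # simulate a heavily clipped frame
print(pick_reference(frames, t=4))            # picks a cleaner neighbor
```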

Rethinking Display Requirements for Esports and High Interactivity Applications

Media technology is continuing its transition from passive streaming to participatory interactive experiences, including well-known applications such as web browsing, video conferencing and gaming, as well as emerging and more demanding uses like AR/MR/VR and esports. How should display traits such as latency, refresh rate and size change to meet this trend? We review recent studies from NVIDIA Research and others on requirements for esports as the cutting edge of this trend toward interactivity, and discuss the studies’ implications for other interactive applications.