Adapting to the Unknown: Training-Free Audio-Visual Event Perception with Dynamic Thresholds

In the domain of audio-visual event perception, which focuses on the temporal localization and classification of events across distinct modalities (audio and visual), existing approaches are constrained by the vocabulary available in their training data. This limitation significantly impedes their capacity to generalize to novel, unseen event categories. Furthermore, the annotation process for this task is labor-intensive, requiring extensive manual labeling across modalities and temporal segments, which limits the scalability of current methods.
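
The abstract above states the problem rather than the method; purely as an illustration of what a training-free, open-vocabulary pipeline with dynamic thresholds could look like (not necessarily the paper's approach), the sketch below scores each temporal segment against free-form event names using frozen pretrained embeddings and keeps segments whose similarity exceeds a per-video adaptive threshold. The encoders and the mean-plus-k-sigma threshold are assumptions for illustration only.

```python
import numpy as np

def cosine_sim(a, b):
    # a: (T, D) segment embeddings, b: (C, D) class-prompt embeddings
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T  # (T, C) similarity matrix

def dynamic_threshold_events(seg_emb, cls_emb, k=1.0):
    """Training-free event localization sketch.

    seg_emb: per-segment audio or visual embeddings from a frozen
             pretrained encoder (assumed, e.g. a CLIP/CLAP-style model).
    cls_emb: text embeddings of arbitrary, possibly unseen event names.
    Returns a boolean (T, C) mask of detected events per segment.
    """
    sims = cosine_sim(seg_emb, cls_emb)             # (T, C)
    # Per-video adaptive threshold instead of a fixed global cutoff:
    thr = sims.mean(axis=0) + k * sims.std(axis=0)  # (C,)
    return sims >= thr                              # (T, C)

# Toy usage with random features standing in for real encoders.
rng = np.random.default_rng(0)
mask = dynamic_threshold_events(rng.normal(size=(10, 512)),
                                rng.normal(size=(3, 512)))
print(mask.shape)  # (10, 3)
```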

Make It Count: Text-to-Image Generation with an Accurate Number of Objects

Despite the unprecedented success of text-to-image diffusion models, controlling the number of depicted objects using text is surprisingly hard. This is important for various applications, from technical documents to children's books to illustrating cooking recipes. Generating the correct number of objects is fundamentally challenging because the generative model must maintain a sense of separate identity for every instance of the object, even when several instances look identical or overlap, and then implicitly carry out a global computation during generation.
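
To make the contrast concrete, a naive external baseline (not the paper's approach, whose point is that the count must be handled implicitly inside generation) simply generates, counts instances with an off-the-shelf detector, and retries. The `generate` and `count_instances` callables below are hypothetical placeholders.

```python
def generate_with_count(prompt, target_count, generate, count_instances,
                        max_tries=5):
    """Naive generate-and-verify baseline for object counts.

    generate(prompt) -> image            (hypothetical text-to-image call)
    count_instances(image, label) -> int (hypothetical detector-based counter)
    Doing this computation *inside* the diffusion process, where instances
    may overlap or look identical, is the hard part the abstract describes.
    """
    label = prompt.split()[-1]  # crude assumption: prompt ends with the object noun
    best = None
    for _ in range(max_tries):
        image = generate(prompt)
        n = count_instances(image, label)
        if n == target_count:
            return image
        if best is None or abs(n - target_count) < best[0]:
            best = (abs(n - target_count), image)
    return best[1]  # closest count found within the retry budget
```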

TriTex: Learning Texture from a Single Mesh via Triplane Semantic Features

As 3D content creation continues to grow, transferring semantic textures between 3D meshes remains a significant challenge in computer graphics. While recent methods leverage text-to-image diffusion models for texturing, they often struggle to preserve the appearance of the source texture during texture transfer. We present TriTex, a novel approach that learns a volumetric texture field from a single textured mesh by mapping semantic features to surface colors. Using an efficient triplane-based architecture, our method enables semantic-aware texture transfer to a novel target mesh.
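
Since the abstract only names the triplane architecture, here is a minimal generic sketch of a triplane field (the resolution, feature fusion, and MLP head are illustrative assumptions, not the paper's exact design): a query point is projected onto three axis-aligned planes, the bilinearly sampled features are concatenated, and a small MLP decodes them to a surface color.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TriplaneField(nn.Module):
    """Generic triplane feature field: 3D point -> color (illustrative sizes)."""
    def __init__(self, res=128, feat_dim=32):
        super().__init__()
        # Three learnable axis-aligned feature planes: XY, XZ, YZ.
        self.planes = nn.Parameter(0.1 * torch.randn(3, feat_dim, res, res))
        self.mlp = nn.Sequential(
            nn.Linear(3 * feat_dim, 64), nn.ReLU(),
            nn.Linear(64, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def forward(self, pts):
        # pts: (N, 3) surface points normalized to [-1, 1]^3.
        coords = [pts[:, [0, 1]], pts[:, [0, 2]], pts[:, [1, 2]]]
        feats = []
        for plane, uv in zip(self.planes, coords):
            grid = uv.view(1, -1, 1, 2)                    # (1, N, 1, 2)
            f = F.grid_sample(plane.unsqueeze(0), grid,
                              align_corners=True)          # (1, C, N, 1)
            feats.append(f.squeeze(0).squeeze(-1).t())     # (N, C)
        return self.mlp(torch.cat(feats, dim=-1))          # (N, 3) colors

# Toy query: colors for 1024 random surface points.
field = TriplaneField()
colors = field(torch.rand(1024, 3) * 2 - 1)
print(colors.shape)  # torch.Size([1024, 3])
```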

SoftTreeMax: Policy Gradient via Tree Expansion

Policy gradient methods are notorious for their large variance and high sample complexity. To mitigate this, we introduce SoftTreeMax -- a generalization of softmax that employs planning. In SoftTreeMax, we extend the traditional logits with the multi-step discounted cumulative reward, topped with the logits of future states. We analyze SoftTreeMax and explain how tree expansion helps to reduce its gradient variance. We prove that the variance depends on the chosen tree-expansion policy.
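
A minimal depth-1 sketch of the idea as the abstract states it (the full method expands a deeper tree and has several variants; the environment-model callables below are assumptions): each action's logit becomes its immediate reward plus the discounted logit of the resulting next state, and the policy is the softmax of these tree-expanded logits.

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

def soft_tree_max_depth1(state, actions, reward_fn, next_state_fn,
                         state_logit_fn, gamma=0.99):
    """Depth-1 SoftTreeMax-style policy sketch.

    reward_fn(s, a)     -> immediate reward        (assumed model access)
    next_state_fn(s, a) -> successor state         (assumed model access)
    state_logit_fn(s)   -> learned logit of state  (the 'logits of future
                           states' in the abstract)
    """
    logits = np.array([
        reward_fn(state, a) + gamma * state_logit_fn(next_state_fn(state, a))
        for a in actions
    ])
    return softmax(logits)  # action distribution over tree-expanded logits

# Toy usage on a two-action model with hand-written dynamics.
probs = soft_tree_max_depth1(
    state=0, actions=[0, 1],
    reward_fn=lambda s, a: float(a),
    next_state_fn=lambda s, a: a,
    state_logit_fn=lambda s: 0.5 * s,
)
print(probs)
```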

Detection of artifacts in clean and corrupted video pairs is influenced by artifact type and presentation modality

Modern computer-generated videos display a variety of artifacts. While image-computable metrics exist to quantify the visibility of artifacts in images and videos, designers often rely in part on human observers to find artifacts and assess video quality. Furthermore, human labeling of artifacts is often an essential component of building image and video quality metrics. Yet, relatively little research has studied the impact of different video comparison interfaces on an observer’s strategies and ability to detect different artifact types.

A Generative AI Game Jam Case Study from October 2024

Generative Artificial Intelligence (GenAI) promises to democratize many creative endeavors, from art, to music, to writing. However, video games remain an underexplored field for GenAI, given their highly multi-modal and interactive nature. In this work, we present a case study of a game-jam-style game development process (performed over only a few days!) making heavy use of available GenAI tools (as of October 2024) to create a game called Plunderwater: Sunken Treasure, a title selected from among GenAI suggestions.

Fly, Fail, Fix: Iterative Game Repair with Reinforcement Learning and Large Multimodal Models

Game design hinges on understanding how static rules and content translate into dynamic player behavior -- something modern generative systems that inspect only a game's code or assets struggle to capture. We present an automated design iteration framework that closes this gap by pairing a reinforcement learning (RL) agent, which playtests the game, with a large multimodal model (LMM), which revises the game based on what the agent does. In each loop, the RL player completes several episodes, producing
(i) numerical play metrics and/or