Toward Understanding Display Size for FPS Esports Aiming

Gamers use a variety of display sizes, though for PC gaming, monitors in the 24 to 27 inch range have become the most popular. First person shooter (FPS) games, a genre especially popular among PC gamers, make hand-eye coordination central to a player's in-game performance. In a carefully designed set of experiments on FPS aiming, we compare player performance across a range of display sizes.

NVIDIA Isaac GR00T N1: An Open Foundation Model for Humanoid Robots

At NVIDIA, we are developing AI solutions to enable general-purpose humanoid robots to understand the human world, follow language instructions, and perform diverse tasks. A robust Vision-Language-Action (VLA) model is crucial for such advanced capabilities. To this end, we developed GR00T N1, a generalist robot model trained on a diverse dataset that includes egocentric human videos, real and simulated robot trajectories, and synthetic data. 

Cosmos-Reason1: From Physical AI Common Sense to Embodied Decisions

Physical AI systems need to perceive, understand, and perform complex actions in the physical world. In this paper, we present the Cosmos-Reason1 models that can understand the physical world and generate appropriate embodied decisions (e.g., next step action) in natural language through long chain-of-thought reasoning processes. We begin by defining key capabilities for Physical AI reasoning, with a focus on physical common sense and embodied reasoning.
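As a rough illustration of the kind of interface such a model exposes, the sketch below builds a chain-of-thought prompt for a next-step embodied decision and parses out the final action. The prompt wording and the downstream model call it would feed are hypothetical illustrations, not the Cosmos-Reason1 API.

```python
# Minimal sketch (not the Cosmos-Reason1 interface): prompting a reasoning
# model for a chain-of-thought embodied decision and extracting the action.

def build_embodied_prompt(scene_description: str, goal: str) -> str:
    """Compose a prompt that asks for step-by-step physical reasoning
    followed by a single next-step action in natural language."""
    return (
        "You are a Physical AI assistant.\n"
        f"Scene: {scene_description}\n"
        f"Goal: {goal}\n"
        "Reason step by step about object states, physics, and feasibility, "
        "then output one line starting with 'Next action:'."
    )

def extract_action(response: str) -> str:
    """Keep only the final decision, discarding the reasoning trace."""
    for line in reversed(response.splitlines()):
        if line.lower().startswith("next action:"):
            return line.split(":", 1)[1].strip()
    return response.strip()

if __name__ == "__main__":
    prompt = build_embodied_prompt(
        scene_description="A mug is lying on its side near the edge of a table.",
        goal="Prepare the mug for pouring coffee.",
    )
    print(prompt)  # would be sent to a reasoning model; the reply is parsed with extract_action
```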

Real-Time Anomaly Detection and Reactive Planning with Large Language Models

Foundation models, e.g., large language models (LLMs), trained on internet-scale data possess zero-shot generalization capabilities that make them a promising technology for detecting and mitigating out-of-distribution failure modes of robotic systems. Fully realizing this promise, however, poses two challenges: (i) mitigating the considerable computational expense of these models so that they can be applied online, and (ii) incorporating their judgment regarding potential anomalies into a safe control framework.
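One way to picture these two challenges together is a two-stage monitor: a cheap embedding-based score runs at every control step, and the expensive LLM is queried only when that score is high. The sketch below illustrates this generic pattern; `embed` and `ask_llm` are hypothetical stand-ins for an embedding model and a slower reasoning model, and this is not the paper's specific implementation.

```python
# Illustrative two-stage anomaly monitor: a fast embedding-similarity score
# gates whether the slow LLM reasoning step is invoked at all.
import numpy as np

def anomaly_score(obs_embedding: np.ndarray, nominal_embeddings: np.ndarray) -> float:
    """1 minus the best cosine similarity to any nominal observation."""
    obs = obs_embedding / np.linalg.norm(obs_embedding)
    nominal = nominal_embeddings / np.linalg.norm(nominal_embeddings, axis=1, keepdims=True)
    return float(1.0 - np.max(nominal @ obs))

def monitor(obs_text, nominal_embeddings, embed, ask_llm, threshold=0.35):
    """Fast path every control step; slow LLM reasoning only on suspicious inputs."""
    score = anomaly_score(embed(obs_text), nominal_embeddings)
    if score < threshold:
        return {"anomaly": False, "score": score}
    verdict = ask_llm(
        f"Observation: {obs_text}\nIs this an anomaly the planner must react to?"
    )
    return {"anomaly": True, "score": score, "llm_verdict": verdict}
```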

MTP: Multi-Hypothesis Tracking and Prediction for Reduced Error Propagation

Recently, there has been tremendous progress in developing each individual module of the standard perception-planning robot autonomy pipeline, including detection, tracking, prediction of other agents' trajectories, and ego-agent trajectory planning. Nevertheless, there has been less attention given to the principled integration of these components, particularly in terms of the characterization and mitigation of cascading errors. This paper addresses the problem of cascading errors by focusing on the coupling between the tracking and prediction modules.
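To make the coupling concrete, the toy sketch below keeps the tracker's top-k data-association hypotheses, with their probabilities, and runs the prediction module on each rather than on a single committed track, so association uncertainty is not silently discarded downstream. The names and structures are illustrative, not the paper's MTP interface.

```python
# Toy sketch of coupling tracking and prediction: propagate several weighted
# tracking hypotheses into the predictor instead of only the best one.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class TrackHypothesis:
    prob: float         # posterior probability of this data association
    tracks: List[list]  # per-agent state histories under this association

def predict_over_hypotheses(
    hypotheses: List[TrackHypothesis],
    predictor: Callable[[List[list]], list],
    k: int = 3,
):
    """Run the prediction module on the k most likely tracking hypotheses,
    returning (normalized weight, predicted trajectories) pairs."""
    top_k = sorted(hypotheses, key=lambda h: h.prob, reverse=True)[:k]
    total = sum(h.prob for h in top_k) or 1.0
    return [(h.prob / total, predictor(h.tracks)) for h in top_k]
```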

Large Language Model Based Generative Error Correction: A Challenge and Baselines for Speech Recognition, Speaker Tagging, and Emotion Recognition

Given recent advances in generative AI technology, a key question is how large language models (LLMs) can enhance acoustic modeling tasks using text decoding results from a frozen, pretrained automatic speech recognition (ASR) model. To explore new capabilities in language modeling for speech processing, we introduce the generative speech transcription error correction (GenSEC) challenge. This challenge comprises three post-ASR language modeling tasks: (i) post-ASR transcription correction, (ii) speaker tagging, and (iii) emotion recognition.
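For task (i), a minimal sketch of the generative error correction setup is shown below: the frozen ASR model's N-best hypotheses are packed into a text prompt for an LLM to fuse into a corrected transcript. The prompt template is illustrative and is not the challenge's official format.

```python
# Hedged sketch of post-ASR transcription correction: pack N-best ASR
# hypotheses into a prompt so a text-only LLM can produce one corrected
# transcript.
def correction_prompt(nbest: list[str]) -> str:
    listed = "\n".join(f"{i + 1}. {hyp}" for i, hyp in enumerate(nbest))
    return (
        "The following are N-best hypotheses from a speech recognizer for the "
        "same utterance. Combine them into the single most likely transcript, "
        "fixing recognition errors:\n"
        f"{listed}\n"
        "Corrected transcript:"
    )

if __name__ == "__main__":
    print(correction_prompt([
        "i scream for ice cream",
        "i scream four ice cream",
        "eye scream for ice cream",
    ]))
```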

Towards Neural Scaling Laws for Time Series Foundation Models

Scaling laws offer valuable insights into the design of time series foundation models (TSFMs). However, previous research has largely focused on the scaling laws of TSFMs for in-distribution (ID) data, leaving their out-of-distribution (OOD) scaling behavior and the influence of model architectures less explored. In this work, we examine two common TSFM architectures, encoder-only and decoder-only Transformers, and investigate their scaling behavior on both ID and OOD data. These models are trained and evaluated across varying parameter counts,
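For context, scaling behavior of this kind is usually summarized by fitting a saturating power law, loss ≈ a·N^(-b) + c, to (parameter count, loss) pairs. The sketch below shows such a fit with SciPy on synthetic placeholder numbers, not results from the paper.

```python
# Illustrative only: fitting L(N) = a * N**(-b) + c to (model size, loss)
# pairs, the standard way neural scaling behavior is summarized.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n_params, a, b, c):
    return a * n_params ** (-b) + c

# synthetic (parameter count, eval loss) pairs for demonstration
sizes = np.array([1e6, 1e7, 1e8, 1e9])
losses = np.array([0.90, 0.62, 0.45, 0.36])

(a, b, c), _ = curve_fit(power_law, sizes, losses, p0=[10.0, 0.2, 0.1], maxfev=10000)
print(f"fitted exponent b = {b:.3f}, irreducible loss c = {c:.3f}")
```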

Audio Large Language Models Can Be Descriptive Speech Quality Evaluators

An ideal multimodal agent should be aware of the quality of its input modalities. Recent advances have enabled large language models (LLMs) to incorporate auditory systems for handling various speech-related tasks. However, most audio LLMs remain unaware of the quality of the speech they process. This limitation arises because speech quality evaluation is typically excluded from multi-task training due to the lack of suitable datasets. To address this, we introduce the first

UniWav: Towards Unified Pre-training for Speech Representation Learning and Generation

Pre-training and representation learning play an increasingly important role in modern speech processing. Nevertheless, different applications rely on different foundation models, since predominant pre-training techniques are designed for either discriminative or generative tasks. In this work, we make the first attempt at building a unified pre-training framework for both types of tasks in speech.
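As a generic illustration of what one backbone serving both task types can look like (not UniWav's actual architecture), the sketch below attaches a discriminative classification head and a generative reconstruction head to a shared speech encoder.

```python
# Generic sketch of a shared speech encoder with two heads: one for
# discriminative use (e.g., unit/phone prediction) and one for generative use
# (e.g., feature reconstruction). Not UniWav's design; dimensions are arbitrary.
import torch
import torch.nn as nn

class UnifiedSpeechModel(nn.Module):
    def __init__(self, feat_dim=80, hidden=256, num_classes=100):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, num_layers=2, batch_first=True)
        self.discriminative_head = nn.Linear(hidden, num_classes)  # classification over units
        self.generative_head = nn.Linear(hidden, feat_dim)         # frame-level reconstruction

    def forward(self, features):  # features: (batch, time, feat_dim)
        hidden_states, _ = self.encoder(features)
        return self.discriminative_head(hidden_states), self.generative_head(hidden_states)

model = UnifiedSpeechModel()
logits, recon = model(torch.randn(2, 50, 80))
print(logits.shape, recon.shape)  # torch.Size([2, 50, 100]) torch.Size([2, 50, 80])
```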