Understanding a complex scene goes well beyond top-down perception. When people operate in a natural scene, they can detect and recognize objects and relations using context, predict how objects and people will move next, and even reason about why they behave as they do. We develop algorithms that allow intelligent agents to learn how to reason about their environment. Below are several recent studies we published on this topic, including zero-shot learning through reasoning-by-elimination and recognizing new combinations of known components.
Reasoning by elimination (RBE) is a fundamental mode of reasoning in classical logic and a major form of concept learning in children. It would be a key ingredient in deploying autonomous agents that interact through language, such as NVIDIA’s Riva. This type of reasoning about unfamiliar objects is particularly important in open-world scenarios, where agents encounter a mix of familiar and unfamiliar objects. We show how agents can be trained with reinforcement learning to perform RBE inference, use it to learn new concepts, and apply that newly acquired knowledge to reason about further new concepts.
Machine learning models are primarily trained to perform inductive reasoning: generalizing rules from training examples. In this work we describe the first approach to train an agent to reason by elimination. The agent receives textual instructions, for example, “pick the cyan popnap and the brown wambim” (Fig 1), that contain both familiar and unfamiliar concepts. The agent combines a perception module with a reasoning module to construct a reasoning policy that, by considering all available items, can make correct inferences even about never-before-seen objects or concepts. It then uses one-shot learning to add the new concept to its set of known concepts, allowing it to recognize even more new concepts later on.
Appeared in the 37th Conference on Uncertainty in Artificial Intelligence (UAI 2021). Oral.
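To make the elimination-and-learning loop concrete, here is a minimal sketch in Python. Everything in it (the `perceive` callable, the `known_concepts` dictionary of prototype vectors, and the similarity threshold) is an assumption for illustration, not the reinforcement-learned policy described above.

```python
# A minimal illustrative sketch of reasoning-by-elimination (RBE).
# The names below (perceive, known_concepts, THRESHOLD) are assumptions
# for illustration, not the agent architecture used in the paper.
import numpy as np

THRESHOLD = 0.8  # assumed similarity cutoff for "this object is familiar"


def resolve_by_elimination(unfamiliar_word, candidate_crops, known_concepts, perceive):
    """Assign an unfamiliar word to the single candidate object that matches
    none of the known concepts, then store the new concept for later reuse.

    perceive:       callable mapping an object crop to a unit-norm feature vector
    known_concepts: dict mapping a concept name to a prototype feature vector
    """
    unmatched = []
    for crop in candidate_crops:
        feat = perceive(crop)
        # Can any already-known concept explain this object?
        scores = [float(np.dot(feat, proto)) for proto in known_concepts.values()]
        if not scores or max(scores) < THRESHOLD:
            unmatched.append((crop, feat))

    if len(unmatched) == 1:
        crop, feat = unmatched[0]
        # One-shot learning: the new word now names this object's features,
        # so future instructions can recognize it, or eliminate it in turn.
        known_concepts[unfamiliar_word] = feat
        return crop
    return None  # inconclusive: zero or several unfamiliar candidates
```

Because the newly named concept joins the set of known ones, it can itself help eliminate candidates the next time an unfamiliar word appears, mirroring how the agent reuses new knowledge to reason about further new concepts.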
As a simple example, people can recognize a purple cauliflower even if they have never seen one, based on their familiarity with cauliflowers and with other purple objects. Unfortunately, current deep models struggle to generalize to new compositions of labels, even though feature compositionality is a key design consideration of deep networks. Here we address zero-shot compositional recognition: the problem of learning to recognize new combinations of known attributes and objects.
Compositional reasoning is considered a hallmark of human intelligence, and it is currently a fundamental limitation of AI systems. For example, the space of visual scenes that a car can encounter grows exponentially with the number of objects and their attributes, so there is no hope of covering the full set of class combinations needed to recognize the long tail of the scene distribution. Compositional generalization also arises in a long list of other problems, including text, speech, and control.
Models trained from data tend to fail at compositional generalization for two fundamental reasons: distribution shift and entanglement. First, recognizing new combinations is an extreme case of distribution shift, where we want to recognize label combinations never observed in training (zero-shot learning). As a result, models pick up correlations during training that hurt inference at test time. The second challenge is that the training samples themselves are often labeled in a compositional way, and disentangling their “elementary” components from examples is often an ill-posed problem. We address both challenges using a causal framework, and propose to view zero-shot inference as asking “which intervention caused the observed image?”
Appeared in Advances in Neural Information Processing Systems (NeurIPS), 2020. Spotlight.
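To make the “which intervention caused the observed image?” view concrete, here is a minimal sketch. The `compose` callable, the embedding dictionaries, and the cosine-similarity score are illustrative assumptions rather than the causal model from the paper: each candidate (attribute, object) pair is treated as a possible intervention and scored by how well its composed representation explains the observed image features.

```python
# An illustrative sketch of zero-shot compositional inference framed as
# "which intervention caused the observed image?". The embeddings and the
# cosine-similarity score are assumptions, not the model from the paper.
import numpy as np


def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8))


def infer_composition(image_feat, attr_embs, obj_embs, compose):
    """Return the (attribute, object) pair, seen or unseen in training,
    whose composed representation best explains the image features.

    attr_embs, obj_embs: dicts mapping names to embedding vectors
    compose:             callable mapping (attr_vec, obj_vec) to an
                         image-space feature vector
    """
    best_pair, best_score = None, -np.inf
    for a_name, a_vec in attr_embs.items():
        for o_name, o_vec in obj_embs.items():
            # Score the candidate intervention (attribute, object) by how
            # well its composed embedding matches the observed image.
            score = cosine(image_feat, compose(a_vec, o_vec))
            if score > best_score:
                best_pair, best_score = (a_name, o_name), score
    return best_pair
```

Because the search ranges over all attribute-object pairs, compositions never observed in training are scored on the same footing as familiar ones.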