# FAT dataset overview

The "Falling Things" (FAT) dataset is a collection of synthetic images with ground truth annotations for research in object detection and 3D pose estimation. The dataset combines object models with complex backgrounds of high graphical quality to yield photorealistic images with accurate 3D pose annotations for all objects in all images.

The dataset contains 61,500 unique annotated frames of 21 household objects from the [YCB dataset](http://www.ycbbenchmarks.com/). Each frame consists of a stereo pair of RGBD images (i.e., RGB stereo images with ground truth depth for both cameras), along with 3D poses, per-pixel semantic segmentation, and 2D/3D bounding box coordinates for all object instances. The images show the objects falling onto different surfaces in three different scenes (kitchen, sun temple, and kite demo), captured by a [custom plug-in](https://github.com/NVIDIA/Dataset_Synthesizer) for Unreal Engine 4.

The dataset can be used for research in pose estimation, depth estimation from a single camera or a stereo pair, semantic segmentation, and other applications within computer vision and robotics.

## Paper

The paper describing the dataset can be found [here](https://arxiv.org/abs/1804.06534). If you use this dataset, please cite as follows:

```
@INPROCEEDINGS{tremblay2018arx:fat,
  AUTHOR = "Jonathan Tremblay and Thang To and Stan Birchfield",
  TITLE = "Falling Things: {A} Synthetic Dataset for {3D} Object Detection and Pose Estimation",
  BOOKTITLE = "CVPR Workshop on Real World Challenges and New Benchmarks for Deep Learning in Robotic Vision",
  MONTH = jun,
  YEAR = 2018}
```

## License

The license can be found [here](http://research.nvidia.com/publication/2018-06_Falling-Things).

## Downloading

The dataset can be downloaded from [here](http://research.nvidia.com/publication/2018-06_Falling-Things).

## Visualizing

The YCB object models can be downloaded and visualized using our [NVDU tool](https://github.com/NVIDIA/Dataset_Utilities). Note that the [publicly available YCB object models](http://www.ycbbenchmarks.com/) are not necessarily centered or aligned with respect to their coordinate system. The NVDU tool downloads these models and transforms them so that the origin of the coordinate system is at the centroid of the point cloud and the object is (approximately) aligned with the coordinate axes. (The alignment is not perfect, since the objects are not pure geometric shapes but rather noisy scans.) The NVDU tool can then be used to visualize the overlay of these models on the FAT images according to the ground truth.

## Folder structure

When the dataset is extracted from the `.zip` file, the folder tree structure is as follows:

```
data
+-- single
|   +-- 002_master_chef_can_16k
|   |   +-- kitchen_X
|   |   |   +-- _object_settings.json
|   |   |   +-- _camera_settings.json
|   |   |   +-- XXXXXX.left.depth.png
|   |   |   +-- XXXXXX.left.json
|   |   |   +-- XXXXXX.left.seg.png
|   |   |   +-- XXXXXX.left.jpg
|   |   |   +-- XXXXXX.right.depth.png
|   |   |   +-- XXXXXX.right.json
|   |   |   +-- XXXXXX.right.seg.png
|   |   |   +-- XXXXXX.right.jpg
|   |   |   +-- ...
|   |   +-- ...
|   |   +-- kitedemo_X
|   |   +-- ...
|   |   +-- temple_X
|   |   +-- ...
|   +-- ...
+-- mixed
|   +-- kitchen_X
|   |   +-- _object_settings.json
|   |   +-- _camera_settings.json
|   |   +-- XXXXXX.left.depth.png
|   |   +-- XXXXXX.left.json
|   |   +-- XXXXXX.left.seg.png
|   |   +-- XXXXXX.left.jpg
|   |   +-- XXXXXX.right.depth.png
|   |   +-- XXXXXX.right.json
|   |   +-- XXXXXX.right.seg.png
|   |   +-- XXXXXX.right.jpg
|   +-- ...
|   +-- kitedemo_X
|   +-- ...
|   +-- temple_X
|   +-- ...
```

At the root level, there are two folders representing the two types of scenes:

- `single` (single falling object), and
- `mixed` (2 to 10 falling objects).

### Single

For `single`, each of the 21 object types has its own folder:

```sh
002_master_chef_can_16k  008_pudding_box_16k      024_bowl_16k         051_large_clamp_16k
003_cracker_box_16k      009_gelatin_box_16k      025_mug_16k          052_extra_large_clamp_16k
004_sugar_box_16k        010_potted_meat_can_16k  035_power_drill_16k  061_foam_brick_16k
005_tomato_soup_can_16k  011_banana_16k           036_wood_block_16k
006_mustard_bottle_16k   019_pitcher_base_16k     037_scissors_16k
007_tuna_fish_can_16k    021_bleach_cleanser_16k  040_large_marker_16k
```

Within each folder there are 3 different scenes (`kitchen`, `kitedemo`, and `temple`) and 5 independent locations (0 through 4) within each scene:

```
kitchen_0  kitchen_2  kitchen_4   kitedemo_1  kitedemo_3  temple_0  temple_2  temple_4
kitchen_1  kitchen_3  kitedemo_0  kitedemo_2  kitedemo_4  temple_1  temple_3
```

Each of these subfolders contains a dataset of 100 images of a particular object within a particular scene location, thus yielding `21 x 3 x 5 x 100 = 31500` image frames for `single`.

### Mixed

For `mixed`, the images are organized by the same 3 scenes and 5 locations within each scene:

```
kitchen_0  kitchen_2  kitchen_4   kitedemo_1  kitedemo_3  temple_0  temple_2  temple_4
kitchen_1  kitchen_3  kitedemo_0  kitedemo_2  kitedemo_4  temple_1  temple_3
```

Each of these subfolders contains a dataset of 2000 images of objects within a particular scene location, thus yielding `3 x 5 x 2000 = 30000` image frames for `mixed`.
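Since every scene-location folder follows the same naming scheme, per-frame file paths can be assembled mechanically. Below is a minimal sketch (Python 3, standard library only); the helper `frame_files` and the example folder are hypothetical, and the six-digit zero-padding follows the `XXXXXX` placeholders in the tree above.

```python
from pathlib import Path

def frame_files(scene_dir, frame_id, side="left"):
    """Assemble the four files exported per frame and camera:
    RGB image, depth image, segmentation image, and annotations."""
    stem = f"{frame_id:06d}.{side}"      # e.g., "000000.left"
    d = Path(scene_dir)
    return {
        "rgb":   d / f"{stem}.jpg",
        "depth": d / f"{stem}.depth.png",
        "seg":   d / f"{stem}.seg.png",
        "anno":  d / f"{stem}.json",
    }

# Example: frame 0 of the left camera in one 'single' scene location.
files = frame_files("data/single/002_master_chef_can_16k/kitchen_0", 0)
```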
## File details

The details of the files are as follows.

### Setting files

In each data folder containing frames (*e.g.,* `data/single/002_master_chef_can_16k/kitchen_0/`), there are two files describing the exported scene (a parsing sketch follows this list):

* `_object_settings.json` includes information about the exported objects, namely
  - the names of the exported object classes (`exported_object_classes`)
  - details about the exported object classes (`exported_objects`), including
    - name of the class (`class`)
    - numerical class ID for semantic segmentation (`segmentation_class_id`); for `mixed`, this number uniquely identifies the object class, whereas for `single` it is always 255, since there is just one object
    - 4x4 Euclidean transformation (`fixed_model_transform`) that is applied to the original publicly available YCB object in order to center and align it with the coordinate system (translation values are in centimeters); see the discussion above on the NVDU tool. Note that the stored matrix is actually the transpose of this transform.
    - dimensions of the 3D bounding cuboid along the XYZ axes (`cuboid_dimensions`)
* `_camera_settings.json` includes the intrinsics of both cameras (`camera_settings`). The baseline between the cameras is 6.0 cm.
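To make the above concrete, here is a sketch of reading both settings files with Python and `numpy`. The top-level field names follow the descriptions above; the exact nesting of the intrinsics (`intrinsic_settings` with `fx`/`fy`/`cx`/`cy` keys) is an assumption to verify against the actual files.

```python
import json
import numpy as np

def load_settings(scene_dir):
    """Read the two settings files of a scene-location folder."""
    with open(f"{scene_dir}/_object_settings.json") as f:
        obj = json.load(f)
    with open(f"{scene_dir}/_camera_settings.json") as f:
        cam = json.load(f)

    # 3x3 intrinsic matrix of the first listed camera; the nesting and
    # key names here are assumptions to check against the actual file.
    intr = cam["camera_settings"][0]["intrinsic_settings"]
    K = np.array([[intr["fx"], 0.0,        intr["cx"]],
                  [0.0,        intr["fy"], intr["cy"]],
                  [0.0,        0.0,        1.0]])

    # fixed_model_transform is stored transposed, so transpose it back.
    fixed = {o["class"]: np.array(o["fixed_model_transform"]).T
             for o in obj["exported_objects"]}
    return K, fixed
```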
### Captured frame files

Each frame export contains
- left / right RGB images (`XXXXXX.left.jpg`, `XXXXXX.right.jpg`)
- left / right depth images (`XXXXXX.left.depth.png`, `XXXXXX.right.depth.png`)
- left / right segmentation images (`XXXXXX.left.seg.png`, `XXXXXX.right.seg.png`)
- left / right annotation files (`XXXXXX.left.json`, `XXXXXX.right.json`).

#### Image files

The image files are
- RGB images: JPEG-compressed images from the virtual cameras
- depth images: depth along the optical axis (in 0.1 mm increments; a conversion sketch appears at the end of this document)
- segmentation images: each pixel indicates the numerical ID of the object whose surface is visible at that pixel

#### Annotation files

Each annotation file includes
- XYZ position and orientation of the camera in the world coordinate frame (`camera_data`)
- for each object:
  - class name (`class`)
  - visibility, defined as the fraction of the object that is not occluded (`visibility`); 0 means fully occluded, whereas 1 means fully visible
  - XYZ position (in centimeters) and orientation (`location` and `quaternion_xyzw`)
  - 4x4 transformation (`pose_transform_permuted`); redundant, since it can be computed from the previous two fields
  - 3D position of the centroid of the bounding cuboid (in centimeters) (`cuboid_centroid`)
  - 2D projection of the cuboid centroid onto the image (in pixels) (`projected_cuboid_centroid`)
  - 2D bounding box of the object in the image (in pixels) (`bounding_box`)
  - 3D coordinates of the vertices of the 3D bounding cuboid (in centimeters) (`cuboid`)
  - 2D coordinates of the projection of the cuboid vertices (in pixels) (`projected_cuboid`)

*Note:* Like the `fixed_model_transform`, the `pose_transform_permuted` is actually the transpose of the matrix. Moreover, after transposing, the columns are permuted, and there is a sign flip (due to UE4's use of a left-handed coordinate system). Specifically, if `A` is the matrix given by `pose_transform_permuted`, then the actual transform is given by `A^T * P`, where `^T` denotes transpose, `*` denotes matrix multiplication, and the permutation matrix `P` is given by

```
    [ 0  0  1]
P = [ 1  0  0]
    [ 0 -1  0]
```

A code sketch of this conversion appears at the end of this document.

#### Coordinate frames

The vertices of the 3D bounding cuboid are indexed in the following order:
- `FrontTopRight` [0]
- `FrontTopLeft` [1]
- `FrontBottomLeft` [2]
- `FrontBottomRight` [3]
- `RearTopRight` [4]
- `RearTopLeft` [5]
- `RearBottomLeft` [6]
- `RearBottomRight` [7]

The XYZ coordinate frame is attached to each object as if the object were a camera facing the world through its front. In other words, from the point of view of looking out through the front from inside the object, the X axis points to the right, the Y axis points down, and the Z axis points forward toward the world. Alternatively, from the point of view of viewing the front of the object from the outside (shown below), the X axis points left, the Y axis points down, and the Z axis points out of the page toward the viewer (right-handed coordinate system).

```
    4 +-----------------+ 5
     /       TOP       /|
    /                 / |
 0 +-----------------+ 1|
   |      FRONT      |  |
   |                 |  |
   |    x <--+       |  |
   |         |       |  |
   |         v       |  + 6
   |         y       | /
   |                 |/
 3 +-----------------+ 2
```

## Uncompressed RGB images

In the official FAT dataset above, the RGB images are lossy-compressed `.jpg` files. If you would prefer to work with uncompressed `.png` images, you may download the alternative version [here](https://drive.google.com/open?id=16fJNufhOHay-SU-JcpQy9JWME47zFDzg) (137 GB).
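Returning to the image files described above: because depth is stored along the optical axis in 0.1 mm increments, conversion to metric units is a single scaling step. A minimal sketch, assuming Python with OpenCV and that the depth PNGs carry more than 8 bits per pixel (verify against your copy of the data); the file name is a hypothetical example.

```python
import cv2
import numpy as np

# IMREAD_UNCHANGED keeps the full bit depth of the PNG instead of
# downcasting it to 8 bits.
depth_raw = cv2.imread("000000.left.depth.png", cv2.IMREAD_UNCHANGED)

# Values are in 0.1 mm increments, so divide by 10,000 to obtain meters
# (or by 100 for the centimeters used elsewhere in the annotations).
depth_m = depth_raw.astype(np.float32) / 1e4
```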
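Likewise, the note on `pose_transform_permuted` translates directly into code. A sketch with `numpy`: since the note gives `P` as a 3x3 matrix while the stored matrix is 4x4, the sketch assumes `P` acts on the 3x3 rotation block after transposing; this assumption should be checked against the data.

```python
import numpy as np

# Permutation / sign-flip matrix P from the note above.
P = np.array([[0,  0, 1],
              [1,  0, 0],
              [0, -1, 0]], dtype=np.float64)

def actual_transform(pose_transform_permuted):
    """Recover the actual pose matrix: undo the stored transposition,
    then permute (and sign-flip) the rotation columns with P."""
    A = np.asarray(pose_transform_permuted, dtype=np.float64)
    T = A.T.copy()               # undo the stored transposition
    T[:3, :3] = T[:3, :3] @ P    # A^T * P, applied to the rotation block
    return T
```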