
​​Transferring Dense Pose to Proximal Animal Classes
Artsiom Sanakoyeu, Vasil Khalidov, Maureen S. McCarthy, Andrea Vedaldi, Natalia Neverova (Facebook AI Research)
In CVPR 2020.

🌐https://asanakoy.github.io/densepose-evolution/
▶️youtu.be/OU3Ayg_l4QM
📝https://arxiv.org/pdf/2003.00080.pdf


What?
The DensePose approach predicts dense human pose accurately, but it relies on a large dataset of poses annotated in detail.
We want to extend the same approach to animals without collecting new annotations, because gathering DensePose annotations for every animal class would be prohibitively expensive. We show that, at least for proximal animal classes such as chimpanzees, it is possible to transfer the knowledge already captured by DensePose for humans. We propose to reuse the existing human annotations and perform self-training on unlabeled images of animals.

In a nutshell, we first pretrain DensePose on the existing human annotations. Then we run it on unlabeled animal images, select the most confident predictions, and add them to an augmented training set for retraining the model. To select the most confident DensePose predictions point-wise, we introduce a novel Auto-Calibrated version of DensePose-RCNN that estimates the uncertainty of its predictions for every pixel.
We tested several techniques for sampling pseudo-labels and found that sampling based on confidence estimates from the fine-grained tasks (24-body-part estimation and DensePose UV-maps) gives the best performance (a toy self-training loop is sketched below).
We also introduce a new DensePose-Chimps dataset with DensePose ground-truth annotations for chimpanzees and evaluate our models on it, obtaining a significant improvement over the baseline.
In this paper we conducted thorough experiments only on chimpanzees, but the method can be extended to other animals such as cats and dogs.
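
Here is a minimal sketch of the self-training step with confidence-based pseudo-label selection. The `model` API, the `uv_uncertainty` field, and the training helpers are illustrative assumptions, not the paper's actual code; they only show the overall loop.

```python
import torch

def select_pseudo_labels(model, unlabeled_loader, keep_ratio=0.3):
    """Run the (auto-calibrated) model on unlabeled animal images and keep
    only the most confident predictions as pseudo-labels.
    The model is assumed to return, per instance, DensePose fields plus a
    per-pixel uncertainty map (lower = more confident)."""
    candidates = []
    model.eval()
    with torch.no_grad():
        for images in unlabeled_loader:
            preds = model(images)                    # hypothetical API
            for p in preds:
                # score an instance by the mean confidence of its
                # fine-grained outputs (part labels and UV coordinates)
                score = -p["uv_uncertainty"].mean().item()
                candidates.append((score, p))
    # keep only the top fraction of instances as pseudo-labels
    candidates.sort(key=lambda x: x[0], reverse=True)
    n_keep = int(len(candidates) * keep_ratio)
    return [p for _, p in candidates[:n_keep]]

# Self-training outline (placeholders, not the real training entry points):
# model  = train(densepose_rcnn, human_annotations)
# pseudo = select_pseudo_labels(model, chimp_loader)
# model  = train(model, human_annotations + pseudo)
```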

✏️ More details:
1. To transfer DensePose from humans to animals we need a reference 3D model of the animal. Suppose we have an artist-created 3D model of the desired species. The next step is to establish a dense mapping between the animal 3D model and the human 3D model. This unifies the evaluation protocols for humans and animals and allows knowledge and annotations to be transferred between species. The matching between the two 3D models is done by matching semantic descriptors of the mesh vertices (see the sketch below).
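
A minimal sketch of the mesh-matching idea, assuming we already have a per-vertex semantic descriptor matrix for each mesh; `animal_desc` and `human_desc` are hypothetical arrays, and plain nearest-neighbour matching is a simplification of what the paper does (real pipelines typically add regularization on top).

```python
import numpy as np

def match_meshes(animal_desc, human_desc):
    """For every vertex of the animal mesh, find the human-mesh vertex with
    the most similar semantic descriptor (cosine similarity).
    animal_desc: (Na, D) array, human_desc: (Nh, D) array.
    Returns an index array of length Na mapping animal -> human vertices."""
    a = animal_desc / np.linalg.norm(animal_desc, axis=1, keepdims=True)
    h = human_desc / np.linalg.norm(human_desc, axis=1, keepdims=True)
    sim = a @ h.T                 # (Na, Nh) cosine similarities
    return sim.argmax(axis=1)     # nearest human vertex for each animal vertex

# Once this correspondence exists, human annotations (part charts, UV maps)
# can be transported to the animal mesh vertex by vertex.
```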

2. Our goal is to develop a DensePose predictor for a new class. Such a predictor must detect the object via a bounding box, segment it from the background, and predict the DensePose chart and UV-map coordinates for each foreground pixel. To do this we introduce a multi-head R-CNN architecture that combines multiple recognition tasks within a single model.
The first head refines the coordinates of the bounding box. The second head computes a foreground-background segmentation mask, as in Mask R-CNN. The third and final head computes a part segmentation mask I, assigning each pixel to one of the 24 body-part charts, together with the UV-map values for each foreground pixel (a toy version of these heads is sketched below).
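
An illustrative PyTorch version of the three heads. The real model follows the DensePose-RCNN / Mask R-CNN design; the layer shapes and the 7x7 RoI size here are placeholder assumptions, not the paper's configuration.

```python
import torch.nn as nn

class DensePoseHeads(nn.Module):
    """Toy multi-head R-CNN top: box refinement, fg/bg mask,
    24-part segmentation I, and per-part UV regression."""
    def __init__(self, in_ch=256, num_parts=24):
        super().__init__()
        self.box_head  = nn.Linear(in_ch * 7 * 7, 4)         # box refinement (dx, dy, dw, dh)
        self.mask_head = nn.Conv2d(in_ch, 1, 1)               # foreground/background mask
        self.part_head = nn.Conv2d(in_ch, num_parts + 1, 1)   # part index I (24 charts + background)
        self.uv_head   = nn.Conv2d(in_ch, 2 * num_parts, 1)   # U and V coordinates per part

    def forward(self, roi_feat):
        # roi_feat: (N, C, 7, 7) pooled per-detection features
        box   = self.box_head(roi_feat.flatten(1))
        mask  = self.mask_head(roi_feat)
        parts = self.part_head(roi_feat)
        uv    = self.uv_head(roi_feat)
        return box, mask, parts, uv
```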

3. The COCO dataset already contains instance-segmentation and detection annotations for some animal classes, so let's use them. Given a target animal class, say chimpanzees, we look for an optimal support domain: the set of COCO classes whose pretraining yields the best detection (or segmentation) performance on a held-out set of chimpanzee images (a toy version of this search is sketched below).
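
A rough sketch of the support-domain search under the simplifying assumption that each candidate class (or subset) is tried independently; `train_fn` and `eval_on_chimps` are placeholders for the actual training and evaluation (e.g. detection AP) routines.

```python
def find_support_domain(candidate_classes, train_fn, eval_on_chimps):
    """Pretrain on each candidate COCO class and keep the ones that
    transfer best to the chimpanzee holdout set."""
    scores = {}
    for cls in candidate_classes:
        model = train_fn(classes=["person", cls])   # hypothetical signature
        scores[cls] = eval_on_chimps(model)         # e.g. detection AP on held-out chimps
    # rank candidates; the best-scoring classes form the support domain
    return sorted(scores, key=scores.get, reverse=True)
```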

4. We jointly train DensePose prediction for people together with detection and segmentation for the other classes in the support domain. The goal is always to build a model only for the final target class, and we found that merging classes is an effective way of integrating information: all support-domain categories are merged into one, and training is done in a class-agnostic manner (see the snippet below).
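
A small snippet showing what class-agnostic merging amounts to on COCO-style annotations; the field names and the merged id are illustrative assumptions.

```python
def merge_support_categories(annotations, support_ids, merged_id=1):
    """Relabel every support-domain category to a single class id so the
    detector and segmenter are trained class-agnostically.
    `annotations` is assumed to be a COCO-style list of dicts."""
    for ann in annotations:
        if ann["category_id"] in support_ids:
            ann["category_id"] = merged_id
    return annotations
```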

5. Now we have a baseline network that knows a lot about humans and a bit about detecting and segmenting animals. We run this model over ~5 TB of camera-trap videos recorded in the wild and select around 100k video frames with confident detections (see the sketch below).
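
A minimal sketch of harvesting frames from the camera-trap footage by detection confidence. The detector API, the `score` field, and the threshold are assumptions chosen only to illustrate the filtering step.

```python
def harvest_frames(model, frames, score_thresh=0.9, max_frames=100_000):
    """Keep only video frames where the class-agnostic detector fires with
    high confidence. `model(frame)` is assumed to return a list of
    detections, each with a 'score' field."""
    selected = []
    for frame in frames:
        dets = model(frame)
        if dets and max(d["score"] for d in dets) >= score_thresh:
            selected.append(frame)
        if len(selected) >= max_frames:
            break
    return selected
```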