Existing unsupervised keypoint detection methods apply artificial deformations to images, such as masking a significant portion of the image, and use reconstruction of the original image as the learning objective for detecting keypoints. However, this approach lacks access to depth information and often places keypoints on the background. To address this, we propose Distill-DKP, a novel cross-modal knowledge distillation framework that leverages depth maps and RGB images for keypoint detection in a self-supervised setting. During training, Distill-DKP extracts embedding-level knowledge from a depth-based teacher model to guide an image-based student model, with inference restricted to the student. Experiments show that Distill-DKP significantly outperforms previous unsupervised methods, reducing mean L2 error by 47.15% on Human3.6M and mean average error by 5.67% on Taichi, and improving keypoint accuracy by 1.3% on the DeepFashion dataset. Detailed ablation studies demonstrate the sensitivity of knowledge distillation across different layers of the network.
Distill-DKP operates by distilling knowledge from a depth-based teacher model to an image-based student model. The teacher is pre-trained to detect keypoints on depth maps, which allows it to capture topological information. This knowledge is transferred to the student, which operates on RGB images; during inference, only the image-based student is used. Guided by the depth teacher, the student learns to distinguish foreground from background and to understand the topological structure of objects in the image. This enables the student to detect keypoints accurately even when the background contains structure.
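The following is a minimal sketch of how such embedding-level distillation can be wired into a training step. The function names (`embedding_distillation_loss`, `training_step`), the cosine-similarity formulation, and the assumed interfaces of the teacher and student models are illustrative assumptions, not the exact implementation described in the paper.

```python
import torch
import torch.nn.functional as F


def embedding_distillation_loss(student_feats, teacher_feats):
    """Align student (RGB) embeddings with frozen teacher (depth) embeddings.

    Both inputs are feature maps of shape (B, C, H, W) taken from
    corresponding encoder layers; a cosine-similarity loss is one
    plausible choice for embedding-level distillation.
    """
    s = F.normalize(student_feats.flatten(2), dim=1)  # (B, C, H*W)
    t = F.normalize(teacher_feats.flatten(2), dim=1)
    # 1 - cosine similarity, averaged over batch and spatial positions
    return (1.0 - (s * t).sum(dim=1)).mean()


def training_step(student, teacher, rgb, depth, gamma):
    """One training step: base keypoint objective plus distillation.

    Assumes `student(rgb)` returns (base_loss, embeddings) and
    `teacher(depth)` returns (_, embeddings); only the student is updated.
    """
    with torch.no_grad():               # depth teacher is frozen
        _, t_emb = teacher(depth)
    base_loss, s_emb = student(rgb)     # e.g., reconstruction objective
    kd_loss = embedding_distillation_loss(s_emb, t_emb)
    return base_loss + gamma * kd_loss
```

Because the distillation term only touches intermediate embeddings, the depth branch can be dropped entirely at inference time, leaving the student as a standard RGB keypoint detector.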
Our model outperforms all other unsupervised baselines. The number of keypoints is K = 16 for the Human3.6M and DeepFashion datasets and K = 10 for Taichi. Bold and underlined numbers denote the best and second-best results, respectively. The † sign marks results we reproduced with the official AutoLink code, and the * sign marks the thickness-tuned variant of AutoLink. (W/O B) and (WB) denote without background and with background, respectively.
We conduct detailed ablation studies to understand the sensitivity of KD and the influence of distilled knowledge on different layers of the detector. We vary the loss coefficient γ from 0.1 to 1 for the Human3.6M and Taichi datasets and from 0.01 to 0.1 for DeepFashion. We choose the lower γ range for DeepFashion because of its simpler backgrounds, which avoids model degeneration at higher values of γ.
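A sketch of such a γ sweep is shown below. The grid step sizes, the dataset keys, and the `train_fn`/`evaluate_fn` hooks are hypothetical placeholders for the project's actual training and evaluation routines; the paper only specifies the ranges, not the exact grid.

```python
# Hypothetical ablation grid over the distillation weight gamma.
# Step sizes are assumptions; only the ranges come from the paper.
GAMMA_GRID = {
    "human36m":    [round(0.1 * i, 2) for i in range(1, 11)],   # 0.1 .. 1.0
    "taichi":      [round(0.1 * i, 2) for i in range(1, 11)],   # 0.1 .. 1.0
    "deepfashion": [round(0.01 * i, 2) for i in range(1, 11)],  # 0.01 .. 0.1
}


def run_gamma_sweep(train_fn, evaluate_fn, dataset):
    """Train one student per gamma value and collect the evaluation metric.

    `train_fn(dataset, gamma)` returns a trained model and
    `evaluate_fn(model, dataset)` returns a scalar metric; both are
    placeholders for the project's own routines.
    """
    results = {}
    for gamma in GAMMA_GRID[dataset]:
        model = train_fn(dataset, gamma=gamma)
        results[gamma] = evaluate_fn(model, dataset)
    return results
```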