X-Pose: Detecting Any Keypoints

International Digital Economy Academy
Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong, Shenzhen

Highlight

We show the remarkable generalization capability of X-Pose on unseen objects and keypoints, where it achieves a notable 42.8% improvement in PCK over the state-of-the-art CAPE method.

X-Pose outperforms the state-of-the-art end-to-end model (e.g., ED-Pose) across 12 diverse datasets. Its performance is also comparable with state-of-the-art expert models for object detection (e.g., GroundingDINO) and keypoint detection (e.g., ViTPose++).

X-Pose exhibits impressive text-to-image similarity at both instance and keypoint levels, notably surpassing CLIP by 204% when distinguishing between different animal categories and by 166% when discerning various image styles.

Overview: A Generalist Keypoint Detector

X-Pose is the first end-to-end prompt-based keypoint detection framework.

X-Pose supports visual or textual prompts for any articulated, rigid, or soft object (see the interface sketch below).

X-Pose has strong fine-grained localization and generalization abilities across image styles, categories, and poses.

Qualitative results: visual or textual prompts as input, tested both on arbitrary in-the-wild images and on existing datasets.
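
To make the prompt interface concrete, here is a minimal, self-contained sketch (not the official X-Pose code; all names and shapes are illustrative assumptions) of a decoder that treats keypoint queries as prompt tokens, whether those tokens come from text embeddings of keypoint names or from features pooled at exemplar keypoints in a support image.

import torch
import torch.nn as nn


class PromptableKeypointDecoder(nn.Module):
    """Toy decoder: prompt tokens attend to image features and regress (x, y)."""

    def __init__(self, dim: int = 256, heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_xy = nn.Linear(dim, 2)  # normalized keypoint coordinates

    def forward(self, image_feats: torch.Tensor, prompt_tokens: torch.Tensor) -> torch.Tensor:
        # image_feats:   (B, HW, C) flattened backbone features
        # prompt_tokens: (B, K, C)  one token per requested keypoint, encoded
        #                from either keypoint-name text or an exemplar image
        attended, _ = self.cross_attn(prompt_tokens, image_feats, image_feats)
        return self.to_xy(attended).sigmoid()  # (B, K, 2) coordinates in [0, 1]


decoder = PromptableKeypointDecoder()
image_feats = torch.randn(1, 64 * 64, 256)   # stand-in for backbone output
text_prompt = torch.randn(1, 17, 256)        # e.g., embeddings of 17 keypoint names
visual_prompt = torch.randn(1, 17, 256)      # e.g., features pooled at exemplar keypoints

print(decoder(image_feats, text_prompt).shape)    # torch.Size([1, 17, 2])
print(decoder(image_feats, visual_prompt).shape)  # same decoder, different prompt source

Because the decoder only consumes prompt tokens, swapping the text encoder for a visual exemplar encoder changes the prompt source without changing the detector itself, which is the essence of a single promptable framework.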

Abstract

This work aims to address an advanced keypoint detection problem: how to accurately detect any keypoints in complex real-world scenarios, which involve massive, messy, and open-ended objects as well as their associated keypoint definitions. Current high-performance keypoint detectors often fail to tackle this problem due to their two-stage schemes, under-explored prompt designs, and limited training data. To bridge the gap, we propose X-Pose, a novel end-to-end framework with multi-modal (i.e., visual, textual, or their combinations) prompts to detect multi-object keypoints for any articulated (e.g., human and animal), rigid, and soft objects within a given image. Moreover, we introduce a large-scale dataset called UniKPT, which unifies 13 keypoint detection datasets with 338 keypoints across 1,237 categories over 400K instances. Trained on UniKPT, X-Pose effectively aligns text-to-keypoint and image-to-keypoint thanks to the mutual enhancement of multi-modal prompts based on cross-modality contrastive learning. Our experimental results demonstrate that X-Pose achieves notable improvements of 27.7 AP, 6.44 PCK, and 7.0 AP compared to state-of-the-art non-promptable, visual prompt-based, and textual prompt-based methods in their respective fair settings. More importantly, the in-the-wild test demonstrates X-Pose's strong fine-grained keypoint localization and generalization abilities across image styles, object categories, and poses, paving a new path to multi-object keypoint detection in real applications. Our code and dataset are available at https://github.com/IDEAResearch/X-Pose.

Framework of X-Pose

Highlight: X-Pose effectively aligns text-to-keypoint and image-to-keypoint thanks to the mutual enhancement of textual and visual prompts under cross-modality contrastive learning objectives.
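
As a concrete illustration of this objective, below is a minimal InfoNCE-style sketch (an illustrative stand-in, not the exact X-Pose loss) that pulls each keypoint-level visual feature toward the text embedding of its keypoint name and pushes it away from the non-matching names.

import torch
import torch.nn.functional as F


def keypoint_text_contrastive_loss(kpt_feats, text_feats, temperature=0.07):
    """Symmetric InfoNCE over matched keypoint features and keypoint-name embeddings.

    kpt_feats:  (K, C) visual features, one per keypoint
    text_feats: (K, C) text embeddings of the corresponding keypoint names
    """
    kpt_feats = F.normalize(kpt_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = kpt_feats @ text_feats.t() / temperature  # (K, K) similarity matrix
    targets = torch.arange(kpt_feats.size(0))          # the i-th keypoint matches the i-th name
    # Average the keypoint-to-text and text-to-keypoint directions.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


loss = keypoint_text_contrastive_loss(torch.randn(17, 256), torch.randn(17, 256))
print(loss.item())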

Unifying 13 datasets into the UniKPT dataset for effective training
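
To give a sense of what such a unification might look like in practice, here is a small sketch of a common annotation record plus a converter from a COCO-style person annotation; the field names and the truncated keypoint list are illustrative assumptions, not the released UniKPT format.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class UnifiedInstance:
    dataset: str                                # source dataset, e.g., "COCO" or "AP-10K"
    category: str                               # object category in plain text
    keypoint_names: List[str]                   # textual keypoint definitions
    keypoints: List[Tuple[float, float, int]]   # (x, y, visibility) per keypoint
    box: Tuple[float, float, float, float]      # (x, y, w, h)


def from_coco_person(ann: dict, keypoint_names: List[str]) -> UnifiedInstance:
    """Convert one COCO-style person annotation into the unified record."""
    flat = ann["keypoints"]                     # COCO stores [x1, y1, v1, x2, y2, v2, ...]
    triples = [(flat[i], flat[i + 1], flat[i + 2]) for i in range(0, len(flat), 3)]
    return UnifiedInstance(dataset="COCO", category="person",
                           keypoint_names=keypoint_names,
                           keypoints=triples, box=tuple(ann["bbox"]))


names = ["nose", "left_eye", "right_eye"]       # truncated list for brevity
example = from_coco_person({"keypoints": [10, 20, 2, 30, 20, 2, 50, 20, 1],
                            "bbox": [0, 0, 100, 100]}, names)
print(example.category, example.keypoints)

In this sketch, keeping the category and keypoint names as plain text is what lets textual prompts address every dataset in the union with one interface.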

Cite Us!

@inproceedings{xpose,
  title={X-Pose: Detecting Any Keypoints},
  author={Yang, Jie and Zeng, Ailing and Zhang, Ruimao and Zhang, Lei},
  booktitle={European Conference on Computer Vision},
  year={2024}
}

@inproceedings{yang2022explicit,
  title={Explicit Box Detection Unifies End-to-End Multi-Person Pose Estimation},
  author={Yang, Jie and Zeng, Ailing and Liu, Shilong and Li, Feng and Zhang, Ruimao and Zhang, Lei},
  booktitle={The Eleventh International Conference on Learning Representations},
  year={2023}
}
      
@inproceedings{yang2023neural,
  title={Neural Interactive Keypoint Detection},
  author={Yang, Jie and Zeng, Ailing and Li, Feng and Liu, Shilong and Zhang, Ruimao and Zhang, Lei},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision},
  pages={15122--15132},
  year={2023}
}