GarmentPile: Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation

1CFCS, School of Computer Science, PKU, 2School of EECS, PKU,
The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 2025

Point-Level Affordance for Cluttered Garments. A higher score denotes higher actionability for downstream retrieval. Row 1: the per-point affordance simultaneously reveals two garments suitable for retrieval. Row 2: it is aware of garment structures (grasping an edge causes other parts to contact the floor) and relations (retrieving one garment can drag nearby entangled garments out), and thus avoids manipulating points that lead to such failures. Rows 3 and 4: highly tangled garments may offer no plausible manipulation points; the affordance can guide reorganizing the scene so that garments plausible for manipulation emerge.

Abstract


Cluttered garments manipulation poses significant challenges in robotics due to the complex, deformable nature of garments and intricate garment relations. Unlike single-garment manipulation, cluttered scenarios require managing complex garment entanglements and interactions while maintaining garment cleanliness and manipulation stability. To address these demands, we propose to learn point-level affordance, a dense representation modeling the complex space and multi-modal manipulation candidates, with novel designs for awareness of garment geometry, structure, and inter-object relations. Additionally, we introduce an adaptation module, informed by the learned affordance, to reorganize cluttered garments into configurations conducive to manipulation. Our framework demonstrates effectiveness across environments featuring diverse garment types and pile scenarios in both simulation and the real world.

Video



Method


Pipeline Overview

Framework Overview. Given the observed point cloud, the Affordance Module predicts the initial point-level manipulation (retrieval) affordance score. When the actionability is not high enough, the framework proposes an adaptation pick-place action: it first predicts the per-point pick affordance and selects the pick point with the highest score, conditioned on which it predicts the place affordance and selects the place point. After executing the adaptation action, it receives a new point cloud and generates new affordance. When the actionability is high enough, the robot retrieves at the point with the highest affordance score. This loop runs until all garments are retrieved.
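To make the loop concrete, the sketch below summarizes the control flow in Python. The module handles (affordance_net, pick_net, place_net), the robot execution callbacks, and the actionability threshold are illustrative placeholders, not the released implementation.

import numpy as np

def manipulate_pile(get_point_cloud, affordance_net, pick_net, place_net,
                    execute_retrieval, execute_pick_place,
                    actionability_threshold=0.8):
    """Hypothetical control loop following the framework overview:
    adapt the pile until some point is actionable enough, then retrieve."""
    while True:
        pts = get_point_cloud()                      # (N, 3) observed point cloud
        if pts is None or len(pts) == 0:             # all garments retrieved
            break
        retrieval_scores = affordance_net(pts)       # per-point retrieval affordance
        if retrieval_scores.max() >= actionability_threshold:
            # Actionability is good enough: retrieve at the best point.
            execute_retrieval(pts[np.argmax(retrieval_scores)])
        else:
            # Otherwise, reorganize the pile with a pick-place adaptation action.
            pick_scores = pick_net(pts)                          # per-point pick affordance
            pick_point = pts[np.argmax(pick_scores)]
            place_scores = place_net(pts, pick_point)            # place affordance conditioned on pick
            place_point = pts[np.argmax(place_scores)]
            execute_pick_place(pick_point, place_point)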


Pipeline Details

Learning Framework of Retrieval, Pick, and Place Affordance. Upper-left: the Affordance Module predicts the point-level (retrieval) affordance score for the downstream task. Upper-right: the PointNet++ backbone aggregates both local and global features, incorporating garment geometry, structure, and relation information for each point. Lower-right: the Place Module, which predicts the point-level place score conditioned on a pick point for adaptation, is supervised by the trained Affordance Module. Lower-left: the Pick Module, which predicts the point-level pick score for adaptation, is supervised by the Place Module.
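The sketch below illustrates how per-point scores can be decoded from backbone features. The layer sizes, dimensions, and module names are assumptions for illustration, not the exact architecture; the backbone stands for any PointNet++-style encoder producing per-point features that mix local geometry with global context.

import torch
import torch.nn as nn

class PerPointScoreHead(nn.Module):
    """Illustrative per-point decoder: maps point features (optionally
    concatenated with a pick-point condition) to a scalar score in [0, 1]."""
    def __init__(self, feat_dim, cond_dim=0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + cond_dim, 128), nn.ReLU(),
            nn.Linear(128, 1), nn.Sigmoid(),
        )

    def forward(self, point_feats, cond=None):
        # point_feats: (B, N, feat_dim); cond: (B, cond_dim) broadcast to every point
        if cond is not None:
            cond = cond.unsqueeze(1).expand(-1, point_feats.shape[1], -1)
            point_feats = torch.cat([point_feats, cond], dim=-1)
        return self.mlp(point_feats).squeeze(-1)     # (B, N) per-point scores

# Per-point features come from the backbone, e.g. feats = backbone(xyz) -> (B, N, 128).
retrieval_head = PerPointScoreHead(feat_dim=128)              # retrieval affordance
place_head     = PerPointScoreHead(feat_dim=128, cond_dim=3)  # place score given a pick point
pick_head      = PerPointScoreHead(feat_dim=128)              # pick affordance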

Results


We construct 3 types of representative and realistic scenes in Omniverse Isaac Sim, loading 126 garments from 9 categories (dress, onesie, glove, hat, scarf, trousers, underpants, skirt, and top) in ClothesNet into the environment. Our framework demonstrates effectiveness in both simulation and the real world.
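For reference, the benchmark setup can be summarized as a plain configuration dictionary; the key names are illustrative, not the actual benchmark files.

# Illustrative summary of the simulation benchmark (key names are placeholders).
BENCHMARK = {
    "simulator": "Omniverse Isaac Sim",
    "scenes": ["washing_machine", "sofa", "basket"],
    "garment_source": "ClothesNet",
    "num_garments": 126,
    "categories": [
        "dress", "onesie", "glove", "hat", "scarf",
        "trousers", "underpants", "skirt", "top",
    ],
}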

Washing Machine Scene
Sofa Scene
Basket Scene


Simulation Results

Sofa Scene Whole Procedure (No Adaptation)
Basket Scene Whole Procedure (No Adaptation)
Washing Machine Scene Whole Procedure (Adaptation)


Real World Results

Washing Machine Adaptation Procedure
Sofa Adaptation Procedure

Washing Machine Scene Whole Procedure (No Adaptation)

Sofa Scene Whole Procedure (No Adaptation)

Basket Scene Whole Procedure (No Adaptation)


Large Foundation Models Fail to Work Well on Piled Garments


Segment Anything (SAM 2)

SAM 2 does not perform well mainly because (1) the piled garments cannot be separated cleanly, which makes it impossible to obtain appropriate grasp points, and (2) even given a segmented region, we can only select the center of the region as the grasp point, meaning that only specific points, rather than every point on the segmented garment, can be considered.
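The following minimal sketch shows this segment-then-grasp-center heuristic; segment_garments is a hypothetical wrapper around a SAM 2 predictor, not the evaluation code used here.

import numpy as np

def center_of_mask(mask):
    """Return the (row, col) centroid of a binary segmentation mask."""
    ys, xs = np.nonzero(mask)
    return int(ys.mean()), int(xs.mean())

def sam_baseline_grasp(rgb, segment_garments):
    """Segment-then-grasp-center baseline (a sketch under stated assumptions).

    segment_garments(rgb) is assumed to return a list of binary masks, one per
    (imperfectly) separated garment region. Only the centroid of each region is
    considered as a grasp point, so the baseline ignores all other points on the
    garment, and fails outright when the pile is not separated cleanly.
    """
    masks = segment_garments(rgb)
    if not masks:
        return None                      # nothing separable -> no grasp candidate
    largest = max(masks, key=lambda m: m.sum())
    return center_of_mask(largest)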

Washing Machine Result
Sofa Result
Basket Result

ChatGPT-4o

ChatGPT-4o cannot judge the stacking relationships among garments from RGB and depth images alone, so it only rarely retrieves a garment successfully.
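For reference, a query of this kind can be issued through the OpenAI API as sketched below; the prompt wording, file paths, and answer format are illustrative assumptions, not the exact queries used in our evaluation.

import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def encode_image(path):
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")

def ask_gpt4o_for_grasp(rgb_path, depth_path):
    """Query GPT-4o with RGB and depth renderings and ask for a pixel to grasp."""
    messages = [{
        "role": "user",
        "content": [
            {"type": "text",
             "text": "These are RGB and depth images of a pile of garments. "
                     "Which garment is on top, and at which pixel (x, y) should "
                     "the robot grasp it to retrieve it without dragging the "
                     "others? Answer as: x, y"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image(rgb_path)}"}},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{encode_image(depth_path)}"}},
        ],
    }]
    resp = client.chat.completions.create(model="gpt-4o", messages=messages)
    return resp.choices[0].message.content   # e.g. "312, 188"; unreliable on piles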

Washing Machine Result
Sofa Result
Basket Result

BibTeX

@InProceedings{Wu_2025_CVPR,
      author    = {Wu, Ruihai and Zhu, Ziyu and Wang, Yuran and Chen, Yue and Wang, Jiarui and Dong, Hao},
      title     = {Point-Level Visual Affordance Guided Retrieval and Adaptation for Cluttered Garments Manipulation},
      booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year      = {2025},
  }