Articulate Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model
Long Le, Jason Xie, William Liang, Hung-Ju Wang, Yue Yang, Jason Ma, Kyle Vedder, Arjun Krishna, Dinesh Jayaraman, Eric Eaton
University of Pennsylvania
[Project Website] [Paper] [Twitter threads]
Articulate Anything is a powerful VLM system for articulating 3D objects using various input modalities.
- Articulate 3D objects from text 🖋 descriptions
- Articulate 3D objects from 🖼 images
- Articulate 3D objects from 🎥 videos
We use Hydra for configuration management. You can customize the system by modifying the configuration files in configs/
or by overriding parameters from the command line. You can articulate objects from any supported input modality with a single command:
python articulate.py modality={partnet, text, image, video} prompt={prompt} out_dir={output_dir}
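The same overrides can also be composed programmatically through Hydra's compose API. Below is a minimal sketch, assuming the root config is conf/config.yaml (as referenced in the model setup section); adjust config_path and the override keys to match the actual layout.

```python
# Minimal sketch: composing the Hydra config in Python instead of the CLI.
# Assumes a root config named "config" under conf/; adjust to the repo's layout.
from hydra import compose, initialize

with initialize(version_base=None, config_path="conf"):
    cfg = compose(
        config_name="config",
        overrides=[
            "modality=text",
            "prompt='suitcase with a retractable handle'",
            "out_dir=results/text/suitcase",
        ],
    )
print(cfg.modality, cfg.out_dir)
```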
Articulate Anything uses an actor-critic system, allowing for self-correction and self-improvement over iterations.
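Conceptually, the refinement loop looks roughly like the sketch below. The helper names propose_articulation and critique_articulation are hypothetical placeholders, not the repo's actual API.

```python
# Illustrative actor-critic loop (hypothetical helper names, not the repo's API).
def articulate_with_feedback(prompt, vlm, max_iters=5):
    program, feedback = None, None
    for _ in range(max_iters):
        # Actor: the VLM proposes an articulation program (links + joints),
        # conditioned on the critic's feedback from the previous iteration.
        program = propose_articulation(vlm, prompt, feedback)          # hypothetical
        # Critic: the VLM inspects a rendering of the result and rates it.
        score, feedback = critique_articulation(vlm, prompt, program)  # hypothetical
        if score >= 0.9:  # stop once the critic is satisfied
            break
    return program
```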
- Download the preprocessed PartNet-Mobility dataset from 🤗 Articulate-Anything Dataset on Hugging Face.
- To use an interactive demo, run
python gradio_app.py
See below for more detailed guides.
Note
Skip the raw dataset download step if you have already downloaded our preprocessed dataset from 🤗 Articulate-Anything Dataset on Hugging Face.
- Clone the repository:
git clone https://github.com/vlongle/articulate-anything.git
cd articulate-anything
- Set up the Python environment:
conda create -n articulate-anything python=3.9
conda activate articulate-anything
pip install -e .
- Download and extract the PartNet-Mobility dataset:
# Download from https://sapien.ucsd.edu/downloads
mkdir datasets
mv partnet-mobility-v0.zip datasets/partnet-mobility-v0.zip
cd datasets
mkdir partnet-mobility-v0
unzip partnet-mobility-v0.zip -d partnet-mobility-v0
Our system supports Google Gemini, OpenAI GPT, and Anthropic Claude. You can set model_name in the config file conf/config.yaml to gemini-1.5-flash-latest, gpt-4o, or claude-3-5-sonnet-20241022. Get your API key from the respective provider's website and set it as an environment variable:
export API_KEY=YOUR_API_KEY
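Inside Python, the key can then be picked up from the environment and handed to whichever client matches model_name. The dispatch below is only a sketch under the assumption that the standard openai, anthropic, and google-generativeai client libraries are used; the repo's actual wiring may differ.

```python
# Sketch: read the API key from the environment and construct a client for the
# configured model. This dispatch is illustrative, not the repo's code.
import os

api_key = os.environ["API_KEY"]
model_name = "gpt-4o"  # or gemini-1.5-flash-latest / claude-3-5-sonnet-20241022

if model_name.startswith("gpt"):
    from openai import OpenAI
    client = OpenAI(api_key=api_key)
elif model_name.startswith("claude"):
    import anthropic
    client = anthropic.Anthropic(api_key=api_key)
else:
    import google.generativeai as genai
    genai.configure(api_key=api_key)
    client = genai.GenerativeModel(model_name)
```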
We support reconstruction from in-the-wild text, images, or videos, as well as masked reconstruction from the PartNet-Mobility dataset.
Note
Skip all the processing steps if you have downloaded our preprocessed dataset from 🤗 Articulate-Anything Dataset on Hugging Face.
- First, preprocess the PartNet-Mobility dataset by running
python preprocess_partnet.py parallel={int} modality={modality}
- Run the interactive demo
python gradio_app.py
🐒 It's articulation time! For a step-by-step guide on articulating a PartNet-Mobility object, see the notebook:
or run
python articulate.py modality=partnet prompt=149 out_dir=results/149
to run for object_id=149.
- Preprocess the dataset:
python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=text
Our precomputed CLIP embeddings are available from our repo in partnet_mobility_embeddings.csv. If you prefer to generate your own embeddings, follow these steps (a sketch of how these embeddings drive retrieval appears at the end of this guide):
- Run the preprocessing with render_part_views=true to render part views for later part annotation.
python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=text render_part_views=true
- Annotate mesh parts using a VLM (skip if using our precomputed embeddings):
python articulate_anything/preprocess/annotate_partnet_parts.py parallel={int}
- Extract CLIP embeddings (skip if using our precomputed embeddings):
python articulate_anything/preprocess/create_partnet_embeddings.py
- 🐒 It's articulation time! For a detailed guide, see:
or run
python articulate.py modality=text prompt="suitcase with a retractable handle" out_dir=results/text/suitcase
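Under the hood, text-based mesh retrieval compares a CLIP embedding of your prompt against the precomputed object embeddings. The sketch below shows one way to do that lookup; the object_id column name and the use of the openai CLIP package are assumptions, not the repo's exact format.

```python
# Sketch: retrieve the PartNet-Mobility objects closest to a text prompt with CLIP.
# Assumes one row per object with an "object_id" column followed by embedding
# dimensions; the CSV's actual layout may differ.
import numpy as np
import pandas as pd
import torch
import clip  # pip install git+https://github.com/openai/CLIP.git

df = pd.read_csv("partnet_mobility_embeddings.csv")
object_ids = df["object_id"].values                       # assumed column name
emb = df.drop(columns=["object_id"]).values.astype(np.float32)
emb /= np.linalg.norm(emb, axis=1, keepdims=True)

model, _ = clip.load("ViT-B/32", device="cpu")
with torch.no_grad():
    text = clip.tokenize(["suitcase with a retractable handle"])
    text_feat = model.encode_text(text)[0].float().numpy()
text_feat /= np.linalg.norm(text_feat)

top_k = np.argsort(emb @ text_feat)[::-1][:5]             # cosine similarity
print("Candidate meshes:", object_ids[top_k])
```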
- Render images for each object:
python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=image
This renders a front-view image for each object in the PartNet-Mobility dataset. It is required for mesh retrieval: we compare the visual similarity of the input image or video against each rendered template object (see the retrieval sketch after the video instructions below).
- 🐒 It's articulation time! For a detailed guide, see:
or run
python articulate.py modality=video prompt="datasets/in-the-wild-dataset/videos/suitcase.mp4" out_dir=results/video/suitcase
Note: Please download a CoTracker checkpoint for video articulation to visualize the motion traces.
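The image and video pipelines rely on the same retrieval idea: embed the input and every rendered template view, then keep the most visually similar objects. Below is a minimal sketch under assumed file paths (the renders/ layout and query image are placeholders, and the repo may use a different similarity model):

```python
# Sketch: pick the rendered template most visually similar to an input image.
# The file layout below ("renders/<object_id>/front_view.png") is an assumption.
from pathlib import Path
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

def embed_image(path):
    image = preprocess(Image.open(path).convert("RGB")).unsqueeze(0).to(device)
    with torch.no_grad():
        feat = model.encode_image(image)
    return feat / feat.norm(dim=-1, keepdim=True)

query = embed_image("datasets/in-the-wild-dataset/images/suitcase.jpg")  # assumed path
scores = {
    p.parent.name: (embed_image(p) @ query.T).item()
    for p in sorted(Path("renders").glob("*/front_view.png"))
}
print("Best-matching template object:", max(scores, key=scores.get))
```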
Some implementation peculiarities of the PartNet-Mobility dataset:
- Raise above ground: the meshes are centered at the origin (0,0,0). We use pybullet to raise the links above the ground. This is done automatically in sapien_simulate.
- Rotate meshes: all the meshes start out lying on the ground, so we have to put them in the upright orientation. Specifically, we need to add a fixed joint
<origin rpy="1.570796326794897 0 1.570796326794897" xyz="0 0 0"/>
between the first link and the base link. This is almost done in the original PartNet-Mobility dataset. render_partnet_obj, which calls rotate_urdf, saves the original URDF under mobility.urdf.backup and writes the corrected rotation to mobility.urdf. Our generated Python program must also include this joint; this is done automatically by the compiler odio_urdf.py using the align_robot_orientation function.
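For concreteness, the rotation fix boils down to injecting that fixed joint into the URDF while keeping a backup. The sketch below uses only the standard library and is illustrative; in the repo this is handled by rotate_urdf and align_robot_orientation, and the joint and link names here are assumptions.

```python
# Sketch: add the upright-orientation fixed joint to a PartNet-Mobility URDF,
# keeping a backup of the original file (mirrors mobility.urdf.backup).
# Illustrative only; the repo does this in rotate_urdf / align_robot_orientation.
import shutil
import xml.etree.ElementTree as ET

urdf_path = "mobility.urdf"
shutil.copy(urdf_path, urdf_path + ".backup")

tree = ET.parse(urdf_path)
robot = tree.getroot()
first_link = robot.find("link").get("name")   # assumes the first <link> in the file

ET.SubElement(robot, "link", {"name": "base"})
joint = ET.SubElement(robot, "joint", {"name": "base_joint", "type": "fixed"})
ET.SubElement(joint, "origin",
              {"rpy": "1.570796326794897 0 1.570796326794897", "xyz": "0 0 0"})
ET.SubElement(joint, "parent", {"link": "base"})
ET.SubElement(joint, "child", {"link": first_link})

tree.write(urdf_path)
```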
Feel free to reach out at vlongle@seas.upenn.edu if you'd like to collaborate or have any questions. You can also open a GitHub issue if you encounter any problems.
If you find this work useful, please consider citing our paper:
@article{le2024articulate,
title={Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model},
author={Le, Long and Xie, Jason and Liang, William and Wang, Hung-Ju and Yang, Yue and Ma, Yecheng Jason and Vedder, Kyle and Krishna, Arjun and Jayaraman, Dinesh and Eaton, Eric},
journal={arXiv preprint arXiv:2410.13882},
year={2024}
}
For more information, visit our project website.