
Articulate Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model


[Project Website] [Paper] [Twitter threads]


Articulate Anything is a powerful VLM system for articulating 3D objects using various input modalities.

(Demo video: articulate_anything_tiktokified_2_V3.mp4)

Features

  • Text Input: Articulate 3D objects from 🖋 text descriptions
  • Image Input: Articulate 3D objects from 🖼 images
  • Video Input: Articulate 3D objects from 🎥 videos

We use Hydra for configuration management. You can easily customize the system by modifying the configuration files in configs/ or overriding parameters from the command line. You can automatically articulate objects from a variety of input modalities with a single command:

   python articulate.py modality={partnet, text, image, video} prompt={prompt} out_dir={output_dir}

Articulate Anything uses an actor-critic system, allowing for self-correction and self-improvement over iterations.
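For intuition, the sketch below illustrates that loop conceptually: a VLM actor proposes an articulation, a VLM critic scores it and explains failures, and the actor refines its proposal until the critic is satisfied. The names and interfaces here are purely illustrative, not the repo's actual API.

    # Conceptual sketch of the actor-critic refinement loop (illustrative names, not the repo's API)
    def articulate_with_feedback(prompt, actor, critic, max_iters=5):
        """Propose an articulation, critique it, and refine it over several iterations."""
        candidate, feedback = None, None
        for _ in range(max_iters):
            candidate = actor.propose(prompt, feedback)            # actor writes an articulation program
            score, feedback = critic.evaluate(prompt, candidate)   # critic rates it and gives feedback
            if score >= critic.pass_threshold:                     # stop once the critic is satisfied
                break
        return candidate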

🚀 Quick Start

  1. Download the preprocessed PartNet-Mobility dataset from the 🤗 Articulate-Anything Dataset on Hugging Face.

  2. To launch the interactive demo, run

    python gradio_app.py

(Demo video: articulate_anything_gradio_demo.mp4)

See below for more detailed guides.

Table of Contents

  • Installation
  • Getting Started
  • Usage
  • Notes
  • Contact
  • Citation

Installation

Note

Skip the raw-dataset download step below if you have already downloaded our preprocessed dataset from the 🤗 Articulate-Anything Dataset on Hugging Face.

  1. Clone the repository:

    git clone https://github.com/vlongle/articulate-anything.git
    cd articulate-anything
  2. Set up the Python environment:

    conda create -n articulate-anything python=3.9
    conda activate articulate-anything
    pip install -e .
  3. Download and extract the PartNet-Mobility dataset:

    # Download from https://sapien.ucsd.edu/downloads
    mkdir datasets
    mv partnet-mobility-v0.zip datasets/partnet-mobility-v0.zip
    cd datasets
    mkdir partnet-mobility-v0
    unzip partnet-mobility-v0.zip -d partnet-mobility-v0

Getting Started

Our system supports Google Gemini, OpenAI GPT, and Anthropic Claude. You can set the model_name in the config file conf/config.yaml to gemini-1.5-flash-latest, gpt-4o, or claude-3-5-sonnet-20241022. Get your API key from the respective website and set it as an environment variable:

export API_KEY=YOUR_API_KEY

Usage

We support reconstruction from in-the-wild text, images, or videos, as well as masked reconstruction from the PartNet-Mobility dataset.

Note

Skip all the processing steps if you have downloaded our preprocessed dataset from 🤗 Articulate-Anything Dataset on Hugging Face.

Demo

  1. First, preprocess the PartNet-Mobility dataset by running
    python preprocess_partnet.py parallel={int} modality={}
  2. Run the interactive demo
    python gradio_app.py

💾 PartNet-Mobility Masked Reconstruction

🐒 It's articulation time! For a step-by-step guide on articulating a PartNet-Mobility object, see the notebook:

Open in Jupyter Notebook

or run

python articulate.py modality=partnet prompt=149 out_dir=results/149

to run for object_id=149.

🖋 Text Articulation

  1. Preprocess the dataset:
    python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=text

Our precomputed CLIP embeddings are available in our repo as partnet_mobility_embeddings.csv (see the illustrative retrieval sketch at the end of this section). If you prefer to generate your own embeddings, follow these steps:

  1. Run the preprocessing with render_part_views=true to render part views for later part annotation:

    python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=text render_part_views=true
  2. Annotate mesh parts using a VLM (skip if using our precomputed embeddings):

    python articulate_anything/preprocess/annotate_partnet_parts.py parallel={int}
  3. Extract CLIP embeddings (skip if using our precomputed embeddings):

    python articulate_anything/preprocess/create_partnet_embeddings.py
  4. 🐒 It's articulation time! For a detailed guide, see:

    Open in Jupyter Notebook

    or run

    python articulate.py modality=text  prompt="suitcase with a retractable handle" out_dir=results/text/suitcase
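For intuition, here is an illustrative sketch of how the precomputed embeddings in partnet_mobility_embeddings.csv could be used for text-based mesh retrieval; the column layout and function name are hypothetical, and the repo's own retrieval code may differ. Visual retrieval (next section) works analogously with CLIP image embeddings of the rendered template objects.

    # Illustrative retrieval sketch (hypothetical CSV layout; not the repo's actual retrieval code)
    import numpy as np
    import pandas as pd

    def retrieve_candidates(prompt_embedding, csv_path="partnet_mobility_embeddings.csv", top_k=5):
        """Rank PartNet-Mobility objects by cosine similarity to a CLIP embedding of the prompt."""
        df = pd.read_csv(csv_path)
        emb_cols = [c for c in df.columns if c.startswith("emb_")]  # assumed column naming
        embs = df[emb_cols].to_numpy(dtype=np.float32)
        sims = embs @ prompt_embedding / (
            np.linalg.norm(embs, axis=1) * np.linalg.norm(prompt_embedding) + 1e-8
        )
        return df.assign(similarity=sims).nlargest(top_k, "similarity")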

🖼 / 🎥 Visual Articulation

  1. Render images for each object:

    python articulate_anything/preprocess/preprocess_partnet.py parallel={int} modality=image

    This renders a front-view image for each object in the PartNet-Mobility dataset. This is necessary for our mesh retrieval, as we compare the visual similarity of the input image or video against each rendered template object.

  2. 🐒 It's articulation time! For a detailed guide, see:

    Open in Jupyter Notebook

    or run

    python articulate.py modality=video prompt="datasets/in-the-wild-dataset/videos/suitcase.mp4" out_dir=results/video/suitcase

Note: For video articulation, please download a CoTracker checkpoint to visualize the motion traces.

Notes

Some implementation peculiarities of the PartNet-Mobility dataset:

  • Raise above ground: The meshes are centered at the origin (0, 0, 0). We use pybullet to raise the links above the ground. This is done automatically in sapien_simulate.
  • Rotate meshes: The meshes start lying on the ground, so we have to get them into an upright orientation. Specifically, we need to add a fixed joint <origin rpy="1.570796326794897 0 1.570796326794897" xyz="0 0 0"/> between the first link and the base link. This is almost done in the original PartNet-Mobility dataset; render_partnet_obj, which calls rotate_urdf, saves the original URDF as mobility.urdf.backup and writes the correctly rotated version to mobility.urdf. Then, we need to make sure the URDF compiled from our generated Python program also has this joint. This is done automatically by the compiler odio_urdf.py via the align_robot_orientation function (see the sketch below).
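As a rough illustration of the rotation fix described above, the sketch below inserts such a fixed joint into a URDF. The function and link names are illustrative only; the repo's odio_urdf.py compiler already does this for you.

    # Illustrative sketch of inserting the upright fixed joint into a URDF (the repo handles this automatically)
    import xml.etree.ElementTree as ET

    def add_upright_joint(urdf_path, first_link, out_path):
        """Add a fixed base joint with the upright rotation used for PartNet-Mobility objects."""
        tree = ET.parse(urdf_path)
        robot = tree.getroot()
        ET.SubElement(robot, "link", {"name": "base"})
        joint = ET.SubElement(robot, "joint", {"name": "base_to_" + first_link, "type": "fixed"})
        ET.SubElement(joint, "origin",
                      {"rpy": "1.570796326794897 0 1.570796326794897", "xyz": "0 0 0"})
        ET.SubElement(joint, "parent", {"link": "base"})
        ET.SubElement(joint, "child", {"link": first_link})
        tree.write(out_path)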

Contact

Feel free to reach out to me at vlongle@seas.upenn.edu if you'd like to collaborate or have any questions. You can also open a GitHub issue if you encounter any problems.

Citation

If you find this work useful, please consider citing our paper:

@article{le2024articulate,
  title={Articulate-Anything: Automatic Modeling of Articulated Objects via a Vision-Language Foundation Model},
  author={Le, Long and Xie, Jason and Liang, William and Wang, Hung-Ju and Yang, Yue and Ma, Yecheng Jason and Vedder, Kyle and Krishna, Arjun and Jayaraman, Dinesh and Eaton, Eric},
  journal={arXiv preprint arXiv:2410.13882},
  year={2024}
}

For more information, visit our project website.
