A real-time, high-frequency, real-world desktop environment that is suitable for desktop-based ML development (agents, world models, etc.)

open-world-agents/desktop-env

🖥️ desktop-env

desktop-env is a cutting-edge, real-time desktop environment designed specifically for desktop-based machine learning development, well suited to creating agents, world models, and more.


🎯 Why Choose desktop-env?

In the realm of open-source agent research, three critical components are often missing:

  1. 🌍 Open-Source Environments
  2. 📊 Open-Source Data
  3. 🔎 Open-Source Research Codebases & Repositories

desktop-env is here to fill these gaps:

  • 🖥️ Open-Source Environment: Provides a rich, desktop-based environment identical to what humans use daily.
  • 📈 Data Recorder: Includes a built-in screen, audio, timestamp, and keyboard/mouse recorder to capture and utilize real human desktop interactions.
  • 🤝 Future Research Collaboration: Plans are underway to foster open-source research in a new repository.

Open-source contributions of any kind are always welcome.


🔑 Key Features

  • ⚡ Real-Time Performance: Achieve sub-1ms latency in screen capture.
  • 🎥 High-Frequency Capture: Supports over 144 FPS screen recording with minimal CPU/GPU load.
    • Uses Windows APIs (DXGI/WGC) and the GStreamer framework, an approach that differs substantially from PIL.ImageGrab, mss, and similar libraries.
  • 🖱️ Authentic Desktop Interaction: Work within the exact desktop environment used by real users.

Supported Desktop Events & Interfaces:

  • 📺 Screen: Capture your monitor screen; specify monitor index, window name, framerate.
  • ⌨️🖱️ Keyboard/Mouse: Capture and input keyboard and mouse events.
  • 🪟 Window: Get active window's name, bounding box, and handle (hWnd).
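
The example usage in this README passes the capture framerate as a fraction string ("60/1"). As a minimal sketch of what such a value means in practice, the helper below converts a fraction string into the time budget per frame. The function name `frame_interval_ms` is hypothetical and is not part of desktop-env.

```python
from fractions import Fraction


def frame_interval_ms(framerate: str) -> float:
    """Convert a fraction string such as "60/1" into milliseconds per frame."""
    fps = Fraction(framerate)  # Fraction accepts "numerator/denominator" strings
    if fps <= 0:
        raise ValueError(f"framerate must be positive, got {framerate!r}")
    return float(1000 / fps)


if __name__ == "__main__":
    for rate in ("60/1", "144/1", "30000/1001"):
        print(f"{rate:>10} -> {frame_interval_ms(rate):.2f} ms per frame")
```

At 144 FPS the budget is just under 7 ms per frame, which is why per-frame capture cost matters at high refresh rates.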

✨ Supported Operating Systems:

  • Windows: Full support with optimized performance using Direct3D11
  • macOS: Full support using AVFoundation for screen capture
  • Linux: Basic support (work in progress)

Recorder

Because the Recorder is built on desktop_env, it is far more efficient than other existing Python-based screen recorders.

  • Run it with python3 examples/recorder.py FILE_LOCATION and stop it with Ctrl+C.
  • Almost 0% CPU/GPU load, similar to commercial screen recording and broadcasting software, because it uses Windows APIs (DXGI/WGC) and the GStreamer framework under the hood.
  • Screen, audio, and timestamp are recorded together in a single Matroska (.mkv) container, with the timestamp stored as a video subtitle. Keyboard, mouse, and window data are recorded together in an event.jsonl file.

For more details, run python3 examples/recorder.py --help.

                                                                                                                                                                                                                                              
 Usage: recorder.py [OPTIONS] FILE_LOCATION

╭─ Arguments ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ *    file_location      TEXT  The location of the output file, use `.mkv` extension. [default: None] [required]                                  │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
╭─ Options ────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ --record-audio        --no-record-audio                 Whether to record audio [default: record-audio]                                          │
│ --record-video        --no-record-video                 Whether to record video [default: record-video]                                          │
│ --record-timestamp    --no-record-timestamp             Whether to record timestamp [default: record-timestamp]                                  │
│ --window-name                                  TEXT     The name of the window to capture, substring of window name is supported [default: None] │
│ --monitor-idx                                  INTEGER  The index of the monitor to capture [default: None]                                      │
│ --help                                                  Show this message and exit.                                                              │
╰──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯
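
Since the Recorder writes keyboard, mouse, and window data to an event.jsonl file, a recording can be inspected with a few lines of Python. The following is a minimal sketch under assumptions: it treats the file as standard JSON Lines (one JSON object per line), and the field name `event_type` is a guess that mirrors the callback comment in the example usage, not a documented schema. Adjust the key to whatever your recording actually contains.

```python
import json
from collections import Counter
from typing import Iterable, Iterator


def iter_events(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per non-empty JSON Lines record."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by(events: Iterable[dict], key: str = "event_type") -> Counter:
    """Count events by `key` (an assumed field name); missing keys count as 'unknown'."""
    return Counter(str(event.get(key, "unknown")) for event in events)


if __name__ == "__main__":
    # Hypothetical records for illustration; the real schema may differ.
    sample = [
        '{"event_type": "keyboard", "event_time": 1}',
        '{"event_type": "mouse", "event_time": 2}',
        "",
        '{"event_type": "keyboard", "event_time": 3}',
    ]
    print(count_by(iter_events(sample)))
    # With a real recording, pass the open file instead:
    # with open("event.jsonl", encoding="utf-8") as f:
    #     print(count_by(iter_events(f)))
```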


🚀 Blazing Performance

desktop-env outperforms other screen capture libraries:

| Library      | Avg. Time per Frame | Relative Speed    |
|--------------|---------------------|-------------------|
| desktop-env  | 5.7 ms              | ⚡ 1× (Fastest)    |
| pyscreenshot | 33 ms               | 🚶‍♂️ 5.8× slower    |
| PIL          | 34 ms               | 🚶‍♂️ 6.0× slower    |
| MSS          | 37 ms               | 🚶‍♂️ 6.5× slower    |
| PyQt5        | 137 ms              | 🐢 24× slower     |

Measured on an i5-11400 with a GTX 1650. Beyond time per frame, CPU/GPU resource usage is also significantly lower.
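
To show how figures like these can be derived, here is a minimal timing sketch. It is not the benchmark script used for the table: `avg_time_per_frame_ms` and `relative_slowdown` are hypothetical helpers, and the `grab` callable is a stand-in for whichever capture call you want to measure. The relative-speed column is simply each library's time divided by the desktop-env time.

```python
import time
from typing import Callable


def avg_time_per_frame_ms(grab: Callable[[], object], frames: int = 100) -> float:
    """Call `grab` repeatedly and return the mean wall-clock time per call in ms."""
    start = time.perf_counter()
    for _ in range(frames):
        grab()
    return (time.perf_counter() - start) * 1000 / frames


def relative_slowdown(candidate_ms: float, baseline_ms: float) -> float:
    """How many times slower `candidate_ms` is than `baseline_ms`."""
    return candidate_ms / baseline_ms


if __name__ == "__main__":
    # Figures copied from the table; desktop-env (5.7 ms) is the baseline.
    table = {"pyscreenshot": 33, "PIL": 34, "MSS": 37, "PyQt5": 137}
    for name, ms in table.items():
        print(f"{name}: {relative_slowdown(ms, 5.7):.1f}x slower")
    # Stand-in workload, not a real screen grab.
    print(f"dummy grab: {avg_time_per_frame_ms(lambda: sum(range(1000))):.3f} ms")
```

Wall-clock timing of this kind only captures per-call cost; CPU/GPU load would need to be observed separately with a system monitor.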


๐Ÿ’ก Examples

🎮 Working Demo: a real-time, local-model ZType game agent

owa-ztype-demo.mp4

For more details with self-contained running source code, see examples/typing_agent.

👩‍💻 Example Usage

For full runnable scripts, see scripts/minimal_example.py, scripts/main.py.

import time

from desktop_env import Desktop, DesktopArgs
from desktop_env.msg import FrameStamped
from desktop_env.windows_capture import construct_pipeline

def on_frame_arrived(frame: FrameStamped):
    # Frame arrived at {frame.timestamp}, latency: {latency} ms, frame shape: {frame.shape}
    pass

def on_event(event):
    # event_type='{event.type}' event_data={event.data} event_time={event.time} device_name='{event.device}'
    # title='{event.title}' rect={event.rect} hWnd={event.hWnd}
    pass

if __name__ == "__main__":
    args = DesktopArgs(
        submodules=[
            {
                "module": "desktop_env.windows_capture.WindowsCapture",
                "args": {
                    "on_frame_arrived": on_frame_arrived,
                    "pipeline_description": construct_pipeline(
                        window_name=None,  # you may specify a substring of the window name
                        monitor_idx=None,  # you may specify the monitor index
                        framerate="60/1",
                    ),
                },
            },
            {"module": "desktop_env.window_publisher.WindowPublisher", "args": {"callback": on_event}},
            {
                "module": "desktop_env.control_publisher.ControlPublisher",
                "args": {"keyboard_callback": on_event, "mouse_callback": on_event},
            },
        ]
    )
    desktop = Desktop.from_args(args)

    try:
        # Option 1: Start the pipeline in the current thread (blocking)
        # desktop.start()

        # Option 2: Start the pipeline in a separate thread (non-blocking)
        desktop.start_free_threaded()
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        desktop.stop()
        desktop.join()
        desktop.close()
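
The frame callback comment above mentions a latency value. One way to compute it is sketched below, under explicit assumptions: the frame timestamp is taken to be nanoseconds on the same clock as `time.time_ns()`, and `FakeFrame` is a stand-in dataclass, not the real `FrameStamped` type. Check the actual message definition in `desktop_env.msg` before relying on either assumption.

```python
import time
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class FakeFrame:
    """Stand-in for a captured frame; NOT the real FrameStamped class."""

    timestamp_ns: int  # assumed: capture time in ns on the time.time_ns() clock
    shape: Tuple[int, ...]


def latency_ms(frame_timestamp_ns: int, now_ns: Optional[int] = None) -> float:
    """Milliseconds elapsed between the frame timestamp and `now_ns` (default: now)."""
    if now_ns is None:
        now_ns = time.time_ns()
    return (now_ns - frame_timestamp_ns) / 1e6


def report_frame(frame: FakeFrame) -> None:
    """Print the kind of message sketched in the callback comment."""
    print(
        f"Frame arrived at {frame.timestamp_ns}, "
        f"latency: {latency_ms(frame.timestamp_ns):.2f} ms, "
        f"frame shape: {frame.shape}"
    )


if __name__ == "__main__":
    report_frame(FakeFrame(timestamp_ns=time.time_ns(), shape=(1080, 1920, 4)))
```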

🛠️ Installation

Prerequisites: Install poetry first. See the Poetry Installation Guide.

Windows Installation

# 1. Install GStreamer and dependencies via conda
conda install -c conda-forge pygobject gst-python -y
# pygobject: PyGObject is a Python package which provides bindings for GObject based libraries such as GTK+, GStreamer, WebKitGTK+, GLib, GIO and many more.
# gst-python: `python` plugin, loader for plugins written in python
conda install -c conda-forge gstreamer gst-plugins-base gst-plugins-good gst-plugins-bad gst-plugins-ugly -y

# 2. Install desktop-env
poetry install --with windows

Install the custom plugin by setting the GST_PLUGIN_PATH environment variable:

$env:GST_PLUGIN_PATH = (Join-Path -Path $pwd -ChildPath "custom_plugin")
echo $env:GST_PLUGIN_PATH
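
If the custom plugin does not load, a quick sanity check is to confirm that every entry in GST_PLUGIN_PATH is an existing directory. The helper below is a small sketch, not part of desktop-env; `missing_plugin_dirs` is a hypothetical name, and it assumes entries are separated by the platform path separator (`;` on Windows, `:` elsewhere).

```python
import os
from pathlib import Path
from typing import List, Mapping


def missing_plugin_dirs(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the entries of GST_PLUGIN_PATH that are not existing directories.

    Raises KeyError if the variable is not set at all.
    """
    raw = env["GST_PLUGIN_PATH"]
    entries = [entry for entry in raw.split(os.pathsep) if entry]
    return [entry for entry in entries if not Path(entry).is_dir()]


if __name__ == "__main__":
    try:
        missing = missing_plugin_dirs()
    except KeyError:
        print("GST_PLUGIN_PATH is not set")
    else:
        print("all entries exist" if not missing else f"missing: {missing}")
```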

macOS Installation

# 1. Install GStreamer and dependencies via brew
brew install gstreamer gst-plugins-base gst-plugins-good gst-plugins-bad gst-plugins-ugly pkg-config gobject-introspection

# 2. Install desktop-env with macOS dependencies
poetry install --with macos

Install the custom plugin by setting the GST_PLUGIN_PATH environment variable, as described in the Windows guide.

🚨 Notes:

  1. Installing pygobject with pip on Windows causes the error:
..\meson.build:31:9: ERROR: Dependency 'gobject-introspection-1.0' is required but not found.
  2. On macOS, if you encounter permission issues with brew, you might need to fix permissions:
sudo chown -R $(whoami) $(brew --prefix)/*

Verifying Installation

After installation, verify it with the following commands:

Windows

# Check GStreamer version (should be >= 1.24.6)
$ conda list gst-*
# packages in environment at C:\Users\...\miniconda3\envs\agent:
#
# Name                    Version                   Build  Channel
gst-plugins-bad           1.24.6               he11079b_0    conda-forge
gst-plugins-base          1.24.6               hb0a98b8_0    conda-forge
gst-plugins-good          1.24.6               h3b23867_0    conda-forge
gst-plugins-ugly          1.24.6               ha7af72c_0    conda-forge
gstreamer                 1.24.6               h5006eae_0    conda-forge

# Verify Direct3D11 plugin
$ gst-inspect-1.0.exe d3d11

macOS

# Check GStreamer version
$ gst-inspect-1.0 --version
gst-inspect-1.0 version 1.24.6

# Verify AVFoundation plugin
$ gst-inspect-1.0 avfvideosrc
Plugin Details:
  Name                     avfvideosrc
  Description              AVFoundation video source
  Filename                 /opt/homebrew/lib/gstreamer-1.0/libgstavfvideosrc.so
  Version                  1.24.6
  License                  LGPL
  Source module            gst-plugins-good
  Binary package          GStreamer Good Plug-ins source release

📝 TODOs

  • 🖥️ Validate overall modality matching in multi-monitor setting
  • 🌐 Implement remote desktop control demo that wraps up Desktop and exposes network interface through UDP/TCP, HTTP/WebSocket, etc.
  • 🎥 Support various video formats besides raw RGBA (JPEG, H.264, ...)
  • 🐧🍎 Add multi-OS support (Linux & macOS)
  • 💬 Implement language interfaces to support desktop agents written in various languages
