`desktop-env` is a cutting-edge, real-time desktop environment designed specifically for desktop-based machine learning development: perfect for creating agents, world models, and more.
In the realm of open-source agent research, three critical components are often missing:

- Open-Source Environments
- Open-Source Data
- Open-Source Research Codebases & Repositories
`desktop-env` is here to fill these gaps:

- Open-Source Environment: Provides a rich, desktop-based environment identical to what humans use daily.
- Data Recorder: Includes a built-in screen, audio, timestamp, and keyboard/mouse recorder to capture and utilize real human desktop interactions.
- Future Research Collaboration: Plans are underway to foster open-source research in a new repository. Open-source contributions of any kind are always welcome.
- Real-Time Performance: Achieves sub-1 ms latency in screen capture.
- High-Frequency Capture: Supports over 144 FPS screen recording with minimal CPU/GPU load.
  - Utilizes Windows APIs (DXGI/WGC) and the powerful GStreamer framework, an approach fundamentally different from `PIL.ImageGrab`, `mss`, and similar libraries.
- Authentic Desktop Interaction: Work within the exact desktop environment used by real users.
- Screen: Capture your monitor screen; specify monitor index, window name, and framerate.
- Keyboard/Mouse: Capture and inject keyboard and mouse events.
- Window: Get the active window's name, bounding box, and handle (`hWnd`).
Supported Operating Systems:

- Windows: Full support with optimized performance using Direct3D11
- macOS: Full support using AVFoundation for screen capture
- Linux: Basic support (work in progress)
Since `Recorder` utilizes `desktop-env`, it is far more efficient than any other existing Python-based screen recorder.

- Run it by typing `python3 examples/recorder.py FILE_LOCATION` and stop it with `Ctrl+C`.
- Almost 0% CPU/GPU load, comparable to commercial screen-recording/broadcasting software, since it utilizes Windows APIs (DXGI/WGC) and the powerful GStreamer framework under the hood.
- Screen, audio, and timestamps are recorded all at once in a Matroska (`.mkv`) container, with timestamps stored as a video subtitle track. Keyboard, mouse, and window data are recorded all at once in an `event.jsonl` file.

For more detail, run `python3 examples/recorder.py --help`!
```
Usage: recorder.py [OPTIONS] FILE_LOCATION

Arguments:
  *  file_location  TEXT  The location of the output file, use `.mkv` extension.
                          [default: None] [required]

Options:
  --record-audio / --no-record-audio          Whether to record audio
                                              [default: record-audio]
  --record-video / --no-record-video          Whether to record video
                                              [default: record-video]
  --record-timestamp / --no-record-timestamp  Whether to record timestamp
                                              [default: record-timestamp]
  --window-name TEXT                          The name of the window to capture;
                                              a substring of the window name is
                                              supported [default: None]
  --monitor-idx INTEGER                       The index of the monitor to capture
                                              [default: None]
  --help                                      Show this message and exit.
```
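Each line of `event.jsonl` is a single JSON object. A minimal reader sketch follows; the field names used here (`type`, etc.) are assumptions for illustration based on the event description above, not a confirmed schema, so inspect your own recording for the actual keys:

```python
import json
from collections import Counter


def read_events(path):
    """Yield one event dict per line from a JSON Lines recording.

    NOTE: field names such as "type" are assumed for illustration;
    check your own event.jsonl for the actual schema.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                yield json.loads(line)


def count_event_types(path):
    """Tally how many events of each type the recording contains."""
    return Counter(e.get("type", "unknown") for e in read_events(path))
```

This streaming approach avoids loading a long recording session into memory at once.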
`desktop-env` outperforms other screen capture libraries:

| Library        | Avg. Time per Frame | Relative Speed |
|----------------|---------------------|----------------|
| `desktop-env`  | 5.7 ms              | 1× (fastest)   |
| `pyscreenshot` | 33 ms               | 5.8× slower    |
| `PIL`          | 34 ms               | 6.0× slower    |
| `MSS`          | 37 ms               | 6.5× slower    |
| `PyQt5`        | 137 ms              | 24× slower     |

Measured on an i5-11400 with a GTX 1650. Beyond the higher FPS, CPU/GPU resource usage is also significantly lower.
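Per-frame numbers like those above can be reproduced with a simple timing harness. Here is a sketch; the capture callable below is a stand-in, so substitute a real grabber such as `mss.mss().grab(...)` or `PIL.ImageGrab.grab()` when benchmarking yourself:

```python
import time


def benchmark_capture(capture_fn, n_frames=100, warmup=5):
    """Return the average wall-clock time per captured frame, in milliseconds."""
    for _ in range(warmup):  # warm caches and lazy initialization first
        capture_fn()
    start = time.perf_counter()
    for _ in range(n_frames):
        capture_fn()
    elapsed = time.perf_counter() - start
    return elapsed / n_frames * 1000.0


# Stand-in capture function for demonstration; replace with a real grabber.
def fake_grab():
    return bytes(64)  # pretend this is a frame buffer


avg_ms = benchmark_capture(fake_grab)
print(f"{avg_ms:.3f} ms/frame")
```

Averaging over many frames after a warmup pass avoids counting one-time setup cost against the per-frame figure.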
Demo video: `owa-ztype-demo.mp4`
For more details with self-contained running source code, see `examples/typing_agent`. For full runnable scripts, see `scripts/minimal_example.py` and `scripts/main.py`.
```python
import time

from desktop_env import Desktop, DesktopArgs
from desktop_env.msg import FrameStamped
from desktop_env.windows_capture import construct_pipeline


def on_frame_arrived(frame: FrameStamped):
    # Frame arrived at {frame.timestamp}, latency: {latency} ms, frame shape: {frame.shape}
    pass


def on_event(event):
    # event_type='{event.type}' event_data={event.data} event_time={event.time} device_name='{event.device}'
    # title='{event.title}' rect={event.rect} hWnd={event.hWnd}
    pass


if __name__ == "__main__":
    args = DesktopArgs(
        submodules=[
            {
                "module": "desktop_env.windows_capture.WindowsCapture",
                "args": {
                    "on_frame_arrived": on_frame_arrived,
                    "pipeline_description": construct_pipeline(
                        window_name=None,  # you may specify a substring of the window name
                        monitor_idx=None,  # you may specify the monitor index
                        framerate="60/1",
                    ),
                },
            },
            {"module": "desktop_env.window_publisher.WindowPublisher", "args": {"callback": on_event}},
            {
                "module": "desktop_env.control_publisher.ControlPublisher",
                "args": {"keyboard_callback": on_event, "mouse_callback": on_event},
            },
        ]
    )
    desktop = Desktop.from_args(args)
    try:
        # Option 1: Start the pipeline in the current thread (blocking)
        # desktop.start()

        # Option 2: Start the pipeline in a separate thread (non-blocking)
        desktop.start_free_threaded()
        while True:
            time.sleep(1)
    except KeyboardInterrupt:
        pass
    finally:
        desktop.stop()
        desktop.join()
        desktop.close()
```
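The latency mentioned in the `on_frame_arrived` comment can be derived from the frame's capture timestamp. A hedged sketch follows: it assumes the frame carries a nanosecond epoch timestamp, and both the stub class and the field name `timestamp_ns` are illustrative assumptions, not the confirmed `FrameStamped` definition:

```python
import time
from dataclasses import dataclass


@dataclass
class FrameStampedStub:
    """Stand-in for desktop_env.msg.FrameStamped; the field name is assumed."""
    timestamp_ns: int


def frame_latency_ms(frame) -> float:
    """Milliseconds elapsed between frame capture and now."""
    return (time.time_ns() - frame.timestamp_ns) / 1e6


frame = FrameStampedStub(timestamp_ns=time.time_ns())
print(f"latency: {frame_latency_ms(frame):.3f} ms")
```

Computing latency per frame this way is how one would verify the sub-1 ms capture-latency claim end to end.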
Prerequisites: Install `poetry` first. See the Poetry Installation Guide.
```shell
# 1. Install GStreamer and dependencies via conda
conda install -c conda-forge pygobject gst-python -y
# pygobject: Python bindings for GObject-based libraries such as GTK+, GStreamer, WebKitGTK+, GLib, and GIO
# gst-python: the `python` plugin, a loader for plugins written in Python
conda install -c conda-forge gstreamer gst-plugins-base gst-plugins-good gst-plugins-bad gst-plugins-ugly -y

# 2. Install desktop-env
poetry install --with windows
```
Install the custom plugin by setting the `GST_PLUGIN_PATH` environment variable:

```powershell
$env:GST_PLUGIN_PATH = (Join-Path -Path $pwd -ChildPath "custom_plugin")
echo $env:GST_PLUGIN_PATH
```
```shell
# 1. Install GStreamer and dependencies via brew
brew install gstreamer gst-plugins-base gst-plugins-good gst-plugins-bad gst-plugins-ugly pkg-config gobject-introspection

# 2. Install desktop-env with macOS dependencies
poetry install --with macos
```

Install the custom plugin by configuring the environment variable, as in the Windows guide above.
Notes:

- Installing `pygobject` with `pip` on Windows causes the error:

  ```
  ..\meson.build:31:9: ERROR: Dependency 'gobject-introspection-1.0' is required but not found.
  ```

- On macOS, if you encounter permission issues with `brew`, you might need to fix permissions:

  ```shell
  sudo chown -R $(whoami) $(brew --prefix)/*
  ```
After installation, verify it with the following commands:

```shell
# Check GStreamer version (should be >= 1.24.6)
$ conda list gst-*
# packages in environment at C:\Users\...\miniconda3\envs\agent:
#
# Name               Version    Build        Channel
gst-plugins-bad      1.24.6     he11079b_0   conda-forge
gst-plugins-base     1.24.6     hb0a98b8_0   conda-forge
gst-plugins-good     1.24.6     h3b23867_0   conda-forge
gst-plugins-ugly     1.24.6     ha7af72c_0   conda-forge
gstreamer            1.24.6     h5006eae_0   conda-forge

# Verify the Direct3D11 plugin
$ gst-inspect-1.0.exe d3d11
```
```shell
# Check GStreamer version
$ gst-inspect-1.0 --version
gst-inspect-1.0 version 1.24.6

# Verify the AVFoundation plugin
$ gst-inspect-1.0 avfvideosrc
Plugin Details:
  Name             avfvideosrc
  Description      AVFoundation video source
  Filename         /opt/homebrew/lib/gstreamer-1.0/libgstavfvideosrc.so
  Version          1.24.6
  License          LGPL
  Source module    gst-plugins-good
  Binary package   GStreamer Good Plug-ins source release
```
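Beyond the command-line checks above, the Python bindings themselves can be verified from inside the interpreter. A small sketch that degrades gracefully when PyGObject is not installed (this uses only standard PyGObject/GStreamer calls; the helper name is ours):

```python
def check_gstreamer() -> bool:
    """Return True if the GStreamer Python bindings load and initialize."""
    try:
        import gi
        gi.require_version("Gst", "1.0")  # raises ValueError if Gst 1.0 is absent
        from gi.repository import Gst
        Gst.init(None)
        print("GStreamer:", Gst.version_string())
        return True
    except (ImportError, ValueError) as exc:
        print("GStreamer Python bindings unavailable:", exc)
        return False


if __name__ == "__main__":
    check_gstreamer()
```

Running this inside the same conda environment you installed into confirms that `pygobject` and `gst-python` are wired up correctly.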
- Validate overall modality matching in a multi-monitor setting
- Implement a remote desktop control demo that wraps `Desktop` and exposes a network interface via UDP/TCP, HTTP/WebSocket, etc.
- Support various video formats besides raw RGBA (JPEG, H.264, ...)
- Add multi-OS support (Linux & macOS)
- Implement language interfaces to support desktop agents written in various languages