Agent Harness Engineering Part 1: Execution Layer and Sandboxes

This post is the first part of my notes on Agent Harness Engineering: A Survey of LLM Infrastructure.

The paper’s main idea is useful: for long-running agents, the model is no longer the only bottleneck. Reliability also depends on the harness around the model: where it runs, which tools it can call, what state it sees, how it is observed, how outputs are verified, and how permissions are governed.

The survey organizes the harness into seven layers:

Layer	Meaning
E: Execution	Where agent actions physically happen: shells, filesystems, browsers, containers, VMs, sandboxes.
T: Tooling	How tools are described, discovered, called, and constrained.
C: Context	What the model sees at each step, including retrieval, memory, and compaction.
L: Lifecycle	The loop that carries work across turns, tasks, retries, and subagents.
O: Observability	Traces, logs, costs, errors, and debugging signals.
V: Verification	Tests, judges, task-specific checks, and trajectory evaluation.
G: Governance	Permissions, audit, identity, policies, and blast-radius control.

This post focuses on E: the execution layer.

1. Why the Execution Layer Matters

An LLM agent becomes dangerous and useful at the same moment: when its text output is turned into an action.

For a coding agent, actions include reading a repository, editing files, running tests, installing packages, launching a web server, using a browser, opening network connections, and sometimes pushing code. The execution layer decides where those actions happen and what their consequences can be.

The execution layer has three jobs.

Security. Run untrusted or model-generated code where it cannot freely damage the host, read secrets, or exfiltrate data.

Reproducibility. Make agent runs resettable. Benchmarks such as SWE-bench need every task to start from a known repository state and dependency environment.

Liveness. Let the agent work without asking for permission every few seconds. Anthropic reported that Claude Code sandboxing reduced permission prompts by 84% in internal use, because routine Bash commands could run inside pre-defined filesystem and network boundaries rather than requiring one approval per command.

The key point: sandboxing is not just a security feature. It is also an autonomy feature.

2. Three Boundaries, Not One

When people say “sandbox”, they often mean several different things. I find it clearer to split the execution layer into three nested boundaries.

Boundary	Question	Examples
Environment boundary	What can the agent see?	mounted repo, allowed directories, network allowlist, available credentials
Tool boundary	What can the agent ask to do?	`Read`, `Edit`, `Bash`, browser, MCP tools, deny/ask/allow rules
Execution boundary	Where does code physically run?	Docker container, gVisor container, Firecracker microVM, OS sandbox, hosted VM

These are complementary. A Docker container may protect the host better than a plain local shell, but it does not know whether rm -rf src is logically bad for the task. A permission system may block a dangerous tool call, but if you approve the wrong command, the process still runs with whatever OS access it has.

A production harness usually layers them:

The tool policy decides whether the agent can attempt an action.
The environment policy decides what files, networks, and credentials are visible.
The execution substrate enforces CPU, memory, process, filesystem, and kernel isolation.
The verification layer checks whether the action actually improved the task.
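
The first of these layers, the tool policy, can be sketched as an ordered rule check. This is only an illustration, not any product's rule engine: the `Rule` type, the glob patterns, and the choice that "deny" beats "ask" beats "allow" are assumptions made for the example.

```python
from dataclasses import dataclass
from fnmatch import fnmatch


@dataclass(frozen=True)
class Rule:
    decision: str   # "deny", "ask", or "allow"
    tool: str       # tool name, e.g. "Bash" or "Edit"
    pattern: str    # glob matched against the tool argument


# Hypothetical rule set for a coding task.
RULES = [
    Rule("deny", "Bash", "rm -rf *"),
    Rule("deny", "Read", "*.env"),
    Rule("ask", "Bash", "git push*"),
    Rule("allow", "Bash", "pytest*"),
    Rule("allow", "Edit", "src/*"),
]

# Strictest decision first.
PRECEDENCE = ("deny", "ask", "allow")


def decide(tool: str, argument: str, rules=RULES) -> str:
    """Return the strictest matching decision; unmatched calls fall back to "ask"."""
    matched = {
        r.decision for r in rules
        if r.tool == tool and fnmatch(argument, r.pattern)
    }
    for decision in PRECEDENCE:
        if decision in matched:
            return decision
    return "ask"


print(decide("Bash", "rm -rf src"))   # matched by the deny rule
print(decide("Bash", "pytest -q"))    # matched by the allow rule
```

A check like this only decides whether a call may be attempted. An approved command still runs with whatever access the environment and execution boundaries grant it.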

3. Underlying Sandbox Technologies

If I understand the current ecosystem correctly, most agent sandboxes are built from a small set of infrastructure primitives: Docker-style containers, gVisor, Firecracker microVMs, WebAssembly, and OS-level sandboxing.

Technology	Isolation boundary	Startup and overhead	Compatibility	Security posture for agent code	Typical agent use
Docker / OCI container	Namespace and cgroup isolation; shares host kernel	Fast startup; low overhead	Excellent Linux/package/tool compatibility	Good for trusted or hardened single-tenant work; weaker boundary for hostile multi-tenant code unless further isolated	Local coding agents, CI, benchmark evaluation, custom dependency images
gVisor	User-space kernel between container process and host kernel	Container-like startup with additional syscall overhead	High, but some low-level workloads can be affected	Stronger than ordinary containers because less host-kernel surface is exposed	Managed agent sandboxes such as Modal
Firecracker microVM	Hardware-virtualized VM with its own guest kernel	Fast for a VM; more overhead than a container	Strong Linux environment compatibility through prepared images	Strong isolation suited to untrusted, multi-tenant code execution	Cloud agent sandboxes such as E2B
OS-level sandbox	Restricted host process using primitives such as Seatbelt or `bubblewrap`	Very fast; no remote environment boot	Uses the developer’s existing local tools and filesystem	Useful for limiting local command access, but remains tied to the host environment	Local interactive agents such as Claude Code sandboxed Bash
WebAssembly	Capability-constrained module runtime	Extremely fast and lightweight	Limited for full software engineering environments and native dependencies	Strong capability-oriented isolation for supported workloads	Small code execution, plugins, deterministic snippets

The table is a simplification: configuration matters as much as the technology name. A well-configured container can be safer than a poorly configured VM, and network/credential exposure can defeat an otherwise strong compute boundary.

Docker and OCI Containers

Docker is the default mental model for many coding-agent environments. It uses Linux namespaces, cgroups, filesystem layers, and container images to create a process-level boundary.

Why it is popular:

fast startup,
huge image ecosystem,
easy custom dependencies,
good benchmark reproducibility,
familiar developer workflow.

The weakness is that a normal Docker container still shares the host kernel. For trusted CI jobs this is often acceptable. For arbitrary AI-generated code in a multi-tenant service, it may not be enough unless hardened carefully.

A minimal hardened local pattern looks like this:

docker run --rm \
  --network none \
  --read-only \
  --cap-drop ALL \
  --pids-limit 256 \
  --memory 1g \
  --cpus 1 \
  --tmpfs /tmp:rw,noexec,nosuid,size=256m \
  -v "$PWD:/workspace:ro" \
  -w /workspace \
  python:3.12-slim \
  python -c "print('hello from a constrained container')"

This is not a universal secure sandbox, but it shows the harness-level knobs: network, filesystem, Linux capabilities, process count, memory, CPU, and mount mode.
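
In a harness, these knobs usually live in configuration rather than in a hand-typed command. A minimal sketch, assuming a hypothetical `ContainerLimits` config object and `docker_argv` helper (neither comes from Docker or from any agent framework), could build the same argument list like this. The optional `runtime` field is there because `docker run --runtime` is how an alternative OCI runtime such as gVisor's `runsc` is selected when it is installed on the host.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ContainerLimits:
    network: str = "none"
    memory: str = "1g"
    cpus: str = "1"
    pids_limit: int = 256
    tmpfs_size: str = "256m"
    runtime: Optional[str] = None  # e.g. "runsc" if gVisor is installed


def docker_argv(image: str, command: list[str], workspace: str,
                limits: ContainerLimits = ContainerLimits()) -> list[str]:
    """Build the argument list for a constrained `docker run` invocation."""
    argv = [
        "docker", "run", "--rm",
        "--network", limits.network,
        "--read-only",
        "--cap-drop", "ALL",
        "--pids-limit", str(limits.pids_limit),
        "--memory", limits.memory,
        "--cpus", limits.cpus,
        "--tmpfs", f"/tmp:rw,noexec,nosuid,size={limits.tmpfs_size}",
        "-v", f"{workspace}:/workspace:ro",   # repository mounted read-only
        "-w", "/workspace",
    ]
    if limits.runtime:
        argv += ["--runtime", limits.runtime]
    return argv + [image, *command]
```

The resulting list can be passed to `subprocess.run(argv, capture_output=True, text=True, timeout=...)`, which avoids shell quoting problems and gives the harness a place to enforce a wall-clock timeout as well.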

gVisor

gVisor adds a stronger isolation layer between the container and the host kernel. Instead of letting the container directly exercise the host kernel surface, gVisor provides a user-space kernel that intercepts and implements much of the Linux syscall interface.

This is attractive for agent workloads because the workload still looks like a container from the outside, but the host kernel exposure is reduced. Modal documents that its Sandboxes are built on gVisor and that, by default, they cannot accept incoming network connections or access other Modal resources.

The tradeoff is compatibility and performance. Some low-level syscalls, unusual filesystems, or performance-sensitive workloads may behave differently than in a normal container.

Firecracker MicroVMs

Firecracker is a lightweight virtual machine monitor originally built for serverless computing. It creates microVMs: small VMs with hardware virtualization isolation, but with startup and resource overhead closer to containers than traditional VMs.

For agent sandboxes, the appeal is simple:

each run can get its own kernel boundary,
the environment can still boot quickly,
the provider can snapshot or template images,
untrusted code is separated from the host more strongly than with plain containers.

E2B describes its sandboxes as fast Linux VMs, and its product page says each sandbox is powered by Firecracker. This is why E2B is often grouped with “microVM-based” agent sandbox providers.

OS-Level Sandboxes

Some tools do not start a separate container or VM. They restrict commands on the user’s existing machine using OS primitives.

Claude Code’s sandboxed Bash tool is an example. Anthropic says it uses OS-level primitives such as Linux bubblewrap and macOS Seatbelt to enforce filesystem and network boundaries for Bash commands and their child processes. This is lighter than booting a container, but it is scoped: in Claude Code docs, sandboxing applies to Bash commands, while the broader permission system applies to Bash, Read, Edit, WebFetch, MCP, and other tools.

This distinction matters. A local OS sandbox can reduce approval fatigue for shell commands, but it is not the same thing as running the whole agent inside a disposable cloud VM.

WebAssembly

WebAssembly is another useful substrate, especially for small deterministic code snippets or plugin systems. It can start very quickly and expose a capability-style interface. The limitation is ecosystem compatibility: many agent tasks want apt, pip, browsers, databases, native libraries, and long-running processes. That pushes most software-engineering agents toward containers or VMs.

4. Managed Sandbox Services

On top of these primitives, there is a growing layer of sandbox-as-a-service products. These are not just “Docker in the cloud”. The product value is usually lifecycle management: create, exec, stream logs, upload files, expose ports, snapshot state, terminate, and scale to many concurrent sandboxes.

How Does Code Get Into the Remote Sandbox?

A natural question is: when we call Modal, E2B, or Daytona, do we tar up all our scripts, upload them through an API, and run them remotely?

Sometimes, but that is only one transfer strategy. The more general abstraction is that the harness creates or attaches to a remote isolated environment, makes the relevant code and data available there, executes commands there, and returns observations to the agent loop.

Your application / agent loop
        |
        | create sandbox through SDK/API
        v
Remote isolated environment
        |
        | obtain repository, scripts, data, and dependencies
        | execute commands and tests
        | produce stdout, stderr, exit codes, and diffs
        v
Your application / agent loop

The application can provision the sandbox contents in several ways:

Method	How it works	Best for
Prebuilt image or template	Dependencies and base tools are installed before a task starts	Repeated workloads, benchmarks, low startup latency
Git clone inside sandbox	The sandbox clones a repository directly, possibly at a pinned commit	Coding agents modifying a repository
File upload API	The orchestrator writes scripts, patches, config, or input files into the remote filesystem	Small generated files and user data
Persistent volume	A mounted storage volume is reused or shared across sandbox runs	Package caches, datasets, durable workspaces
Download from object storage or URL	The sandbox retrieves large artifacts after startup	Large datasets, model weights, archives
Tar or zip upload	The orchestrator packages many local files and extracts them remotely	A local working tree with uncommitted changes or bulk transfer

For a coding-agent task, a typical flow is closer to cloning a repository and applying generated edits than re-uploading every script for every command:

# `provider` stands for any sandbox SDK client; the method names are illustrative.
sandbox = provider.create(template="python-dev")

try:
    sandbox.exec("git clone https://github.com/org/repo /workspace/repo")
    # Keep the patch file outside the repository so it never appears as an untracked file.
    sandbox.write_file("/tmp/fix.patch", model_generated_patch)
    sandbox.exec("cd /workspace/repo && git apply /tmp/fix.patch && pytest")

    diff = sandbox.exec("cd /workspace/repo && git diff")
    print(diff.stdout)
finally:
    sandbox.delete()

The commands run inside the remote sandbox, not on the machine running the model loop. The harness sends an action, receives command output or filesystem results, and uses those observations to decide the next action.
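
One way to keep the agent loop independent of a particular provider is to code it against a small interface. The sketch below is an assumption about what such an interface might look like, not the API of Modal, E2B, or Daytona: the `Sandbox` protocol, the `Observation` type, and the `LocalTempSandbox` stand-in are all hypothetical, and the stand-in runs commands in a local temp directory purely so the example is runnable. It provides no isolation.

```python
import subprocess
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import Protocol


@dataclass(frozen=True)
class Observation:
    stdout: str
    stderr: str
    exit_code: int


class Sandbox(Protocol):
    def exec(self, command: str) -> Observation: ...
    def write_file(self, path: str, content: str) -> None: ...
    def delete(self) -> None: ...


class LocalTempSandbox:
    """Stand-in backend: a temp directory plus a local shell. NOT an isolation boundary."""

    def __init__(self) -> None:
        self._tmp = tempfile.TemporaryDirectory()
        self.root = Path(self._tmp.name)

    def exec(self, command: str) -> Observation:
        proc = subprocess.run(command, shell=True, cwd=self.root,
                              capture_output=True, text=True, timeout=30)
        return Observation(proc.stdout, proc.stderr, proc.returncode)

    def write_file(self, path: str, content: str) -> None:
        target = self.root / path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content)

    def delete(self) -> None:
        self._tmp.cleanup()


def run_step(sandbox: Sandbox, command: str) -> Observation:
    """One harness step: send an action, return the observation for the model."""
    return sandbox.exec(command)
```

Swapping `LocalTempSandbox` for an adapter around a real provider SDK would leave `run_step` and the rest of the loop unchanged, which is the point of the abstraction.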

Provider APIs expose different choices:

Service	Common ways to place code/data in the sandbox
Modal	Filesystem API, mounted Volumes or bucket mounts, image setup, commands that clone/download files
E2B	Custom templates, `files.write(...)`, and shell commands such as `git clone`; individual file transfer is part of its API
Daytona	Filesystem APIs, built-in Git operations, snapshots/images, and process execution inside a workspace

A tarball is still useful when the input is a large local directory that is not represented by a remote Git commit:

# Assumes the archive was created with a top-level "repo/" directory.
sandbox.upload_file("repo.tar.gz", "/tmp/repo.tar.gz")
sandbox.exec("mkdir -p /workspace && tar -xzf /tmp/repo.tar.gz -C /workspace")
sandbox.exec("cd /workspace/repo && pytest")

But the central idea is not “upload scripts.” It is remote controlled execution: the agent’s control loop remains in one place, while potentially unsafe shell and filesystem operations happen in a disposable or policy-constrained environment.
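
For the tarball path, the orchestrator side can be done with the standard library alone. This is a sketch under assumptions: the `pack_working_tree` helper and the exclusion list are mine, and whether to ship `.git` at all depends on whether the remote side needs history. The archive is given a top-level `repo/` directory so that it extracts to `/workspace/repo`.

```python
import io
import tarfile
from pathlib import Path

# Assumed defaults: directories that are bulky and can be rebuilt remotely.
EXCLUDED_DIRS = {"node_modules", "__pycache__", ".venv"}


def pack_working_tree(root: str, arcname: str = "repo") -> bytes:
    """Return a gzip-compressed tar of `root`, skipping excluded directories."""

    def keep(info: tarfile.TarInfo):
        # Returning None drops the entry; for a directory, its contents are skipped too.
        parts = Path(info.name).parts
        return None if EXCLUDED_DIRS.intersection(parts) else info

    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        archive.add(root, arcname=arcname, filter=keep)
    return buffer.getvalue()
```

The returned bytes can then be handed to whatever upload call the chosen provider exposes.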

If the Agent Changes Remote Files, How Do They Return?

Only through an explicit step: when the execution environment is remote, edits made there do not automatically appear in a local checkout. If those edits should become part of the developer’s working tree, the harness needs a result-export or synchronization step.

In practice, synchronization usually happens after a useful checkpoint, such as after tests pass or when the agent proposes a final patch. Synchronizing after every command would add latency and create conflicts while the remote agent is still exploring.

Local orchestrator                         Remote sandbox
------------------                         --------------
create or attach sandbox       --------->  clone repo / receive files
send agent tool actions        --------->  edit files, run tests
receive observations           <---------  stdout, stderr, test results
retrieve accepted changes      <---------  patch, files, commit, or archive
apply/review locally

Common result-export strategies are:

Strategy	How remote changes return	Best for
Read changed files	Use the provider filesystem API to download selected outputs	A few generated files or reports
Return a Git patch	Run `git diff --binary` remotely, download the patch, and apply it locally	Coding agents editing an existing checkout
Push a branch or open a pull request	Commit and push changes from the sandbox using scoped Git credentials	Hosted agents and collaborative code review
Download an archive	Package an output directory and retrieve it	Many generated files or build artifacts
Leave workspace remote until acceptance	Keep results in the cloud workspace and only export selected work	Hosted IDE or browser-based agents

For software engineering tasks, returning a Git patch is often the cleanest design because it preserves reviewability and does not require blindly overwriting local files:

sandbox = provider.create(template="python-dev")

try:
    sandbox.exec("git clone https://github.com/org/repo /workspace/repo")

    # The agent edits files remotely and verifies the change.
    sandbox.exec("cd /workspace/repo && pytest")

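    # Export only the resulting change set. Staging first makes new files
    # appear in the patch; a plain `git diff` omits untracked files.
    result = sandbox.exec("cd /workspace/repo && git add -A && git diff --cached --binary")
    patch = result.stdout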

    # The local harness can show, approve, then apply this patch locally.
    local_workspace.apply_patch(patch)
finally:
    sandbox.delete()
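
Before a patch is applied locally, the harness can at least check which paths it touches. The helpers below are a hypothetical sketch of that review step. They read only the `diff --git` header lines and do not handle quoted paths or paths containing " b/", so treat them as a first filter rather than a full patch parser.

```python
import re

DIFF_HEADER = re.compile(r"^diff --git a/(.+?) b/(.+)$")


def touched_paths(patch: str) -> list[str]:
    """List the destination paths named in the `diff --git` headers of a patch."""
    paths = []
    for line in patch.splitlines():
        match = DIFF_HEADER.match(line)
        if match:
            paths.append(match.group(2))
    return paths


def outside_allowlist(patch: str, allowed_prefixes: tuple[str, ...]) -> list[str]:
    """Return touched paths that fall outside the prefixes a reviewer expects."""
    return [p for p in touched_paths(patch) if not p.startswith(allowed_prefixes)]
```

For example, a harness might refuse to auto-apply any patch for which `outside_allowlist(patch, ("src/", "tests/"))` is non-empty, and route it to a human instead.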

Another common workflow treats Git as the synchronization protocol:

Remote sandbox:  clone -> edit -> test -> commit -> push agent branch
Local checkout:  fetch branch -> inspect diff -> merge or cherry-pick

This distinction also explains why local and cloud coding agents feel different:

Execution mode	Where edits occur	Is remote-to-local synchronization needed?
Local Claude Code with sandboxed Bash	In the permitted local working directory	No: the edited files are already local
Claude Code on the web	In an isolated cloud session	Yes: changes need to be committed, patched, or otherwise exported
Claude Managed Agents with Daytona	In the assigned Daytona sandbox	Yes: the orchestrator must retrieve files/diffs or use a Git workflow
E2B or Modal coding-agent runtime	In the remote VM/container filesystem or mounted storage	Yes, unless the intended result remains entirely remote

Therefore, a managed sandbox has two data paths: input provisioning sends the repository or task data into the environment, and result export returns accepted edits, patches, artifacts, or commits after execution.

Are Managed Sandboxes Only for Coding Agents?

No. Coding agents are a prominent use case because terminal and filesystem execution are easy to expose through an SDK, but a managed sandbox can host any environment its image and APIs support: browsers, local web applications, virtual desktops, GUI programs, data-analysis kernels, test devices, or computer-use agents.

Service	Shell and code execution	Browser or visual-agent support	Computer-use support
Modal	Yes	Can run Chromium, Playwright, web UIs, and VNC stacks inside a Sandbox	Modal documents running Anthropic’s Computer Use demo in a Sandbox
E2B	Yes	Desktop sandboxes can host browsers and stream the display through VNC	E2B documents desktop agents that view screenshots, click, type, and scroll
Daytona	Yes	Its desktop environment can run browser and GUI applications	Daytona exposes Computer Use operations for mouse, keyboard, screenshots, recordings, display, and VNC

A VLM-based computer-use loop has the same execution-layer idea as a coding agent, but the observation is a screenshot instead of terminal output:

Remote desktop sandbox
  - browser and applications
  - virtual display
  - mouse and keyboard execution API
        |
        | screenshot
        v
VLM agent loop
  - inspect visual state
  - decide click/type/scroll action
        |
        | GUI action
        v
Remote desktop sandbox

The sandbox is the computer being operated. It protects the surrounding infrastructure from web content, downloaded files, GUI automation, or arbitrary code executed during the task.
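
The loop in the diagram can be written down in a few lines. Everything named here is an assumption for illustration: `GuiAction`, the `Desktop` protocol, and `FakeDesktop` are hypothetical, and the `policy` callable stands in for a VLM call. Real providers expose their own screenshot and input APIs.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Protocol


@dataclass(frozen=True)
class GuiAction:
    kind: str            # "click", "type", or "scroll"
    x: int = 0
    y: int = 0
    text: str = ""


class Desktop(Protocol):
    def screenshot(self) -> bytes: ...
    def perform(self, action: GuiAction) -> None: ...


# A policy maps a screenshot to the next action, or None when the task is done.
Policy = Callable[[bytes], Optional[GuiAction]]


def run_computer_use(desktop: Desktop, policy: Policy, max_steps: int = 20) -> int:
    """Alternate screenshot -> decision -> GUI action; return the number of actions taken."""
    for step in range(max_steps):
        action = policy(desktop.screenshot())
        if action is None:
            return step
        desktop.perform(action)
    return max_steps


class FakeDesktop:
    """In-memory stand-in for a remote desktop sandbox, for illustration only."""

    def __init__(self) -> None:
        self.typed = ""

    def screenshot(self) -> bytes:
        # A real desktop would return image bytes; here the "screen" is the typed text.
        return self.typed.encode()

    def perform(self, action: GuiAction) -> None:
        if action.kind == "type":
            self.typed += action.text
```

The `max_steps` bound matters in practice: a visual agent that never decides it is done would otherwise keep the desktop sandbox alive indefinitely.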

Where Does the Agent Loop Run?

A sandbox, an agent loop, and model inference are separate components:

Component	Responsibility
Model inference	Interpret context or screenshots and select the next action
Agent loop / orchestrator	Maintain task state, call the model, validate tool calls, and route observations
Sandbox tool environment	Execute Bash commands, manipulate files, run browsers/desktops, and return results

In many production designs, the loop stays outside the sandbox while only tools run inside it:

Application / orchestration service
  - task state, policy, budgets, audit log
  - calls hosted LLM or VLM API
          |
          | approved command / file / GUI action
          v
Managed sandbox
  - repository or desktop
  - shell, browser, applications
  - screenshots, logs, files
          |
          | observation
          v
Application / orchestration service

For computer use, the orchestrator may request a screenshot from the sandbox, send that screenshot to a VLM, receive a click, type, or scroll decision, and forward the action back to the sandbox. In this architecture, the model does not need to be installed in the sandbox. The sandbox holds the environment the model controls.

Keeping the loop outside has practical advantages:

Benefit	Why it matters
Secret isolation	Model API keys and control-plane credentials do not have to live in an environment executing risky tools
Policy enforcement	The orchestrator can reject or log actions before they affect the sandbox
Session recovery	State can persist even when a disposable sandbox is terminated and recreated
Backend portability	The same loop can direct Modal, E2B, Daytona, or an internal environment
Independent scaling	Model requests and execution environments can be allocated separately
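
Two of these benefits, policy enforcement and session recovery, can be shown together in a small sketch. The `Orchestrator` class, its `journal` replay strategy, and the `Backend` protocol are hypothetical. Replaying commands only rebuilds state when those commands are deterministic, and a real system might restore a snapshot instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


class Backend(Protocol):
    def exec(self, command: str) -> str: ...


@dataclass
class Orchestrator:
    """Holds task state outside the sandbox so the sandbox stays disposable."""
    new_backend: Callable[[], Backend]      # factory for any provider
    is_allowed: Callable[[str], bool]       # policy check run before execution
    audit_log: list[tuple[str, str]] = field(default_factory=list)
    journal: list[str] = field(default_factory=list)   # commands that were executed
    backend: Backend = field(init=False)

    def __post_init__(self) -> None:
        self.backend = self.new_backend()

    def act(self, command: str) -> str:
        # Rejected actions are logged but never reach the sandbox.
        if not self.is_allowed(command):
            self.audit_log.append(("rejected", command))
            return "rejected by policy"
        self.audit_log.append(("executed", command))
        self.journal.append(command)
        return self.backend.exec(command)

    def recover(self) -> None:
        """Replace a lost sandbox and replay the journal to rebuild its state."""
        self.backend = self.new_backend()
        for command in self.journal:
            self.backend.exec(command)
```

Because the journal and audit log live in the orchestrator, terminating the sandbox loses no task state, and the factory argument is the only place a provider-specific SDK needs to appear.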

Anthropic documents this split for Claude Managed Agents self-hosted sandboxes: orchestration remains on Anthropic’s side, while tool execution, files, and network egress move into the developer’s infrastructure.

It is also possible to run the agent application inside the sandbox:

Managed sandbox
  - agent process
  - repository or desktop
  - local services
  - browser and VNC stack
        |
        | model API call
        v
Hosted LLM / VLM inference service

This can be useful when a background agent needs low-latency access to a full development stack or desktop. Modal’s official example runs Anthropic’s Computer Use demo application and VNC desktop in a Modal Sandbox. The model inference can still be a remote API call: running the agent program inside the sandbox does not imply that the foundation model weights are hosted there.

The model itself runs inside the sandbox only when the developer explicitly deploys local inference, for example an open-source VLM on GPU infrastructure. That is possible, but it requires much more compute, image management, and security engineering than calling a hosted model API.

Agent deployment	What is in the sandbox?	Where is the loop likely to run?	Where is model inference likely to run?
Coding tool runtime	Repo, shell, dependencies, tests	External orchestrator	Hosted model API
Browser or VLM computer-use runtime	Virtual desktop, browser, screenshot/action API	External orchestrator	Hosted multimodal API
Hosted development workspace	Repo, services, browser, VNC, possibly agent process	Inside workspace or split across services	Usually hosted model API
Fully self-hosted agent stack	Tools, agent process, and local model server	Inside controlled infrastructure	Local GPU model deployment

Modal

Modal Sandboxes are secure containers for running untrusted user or agent code. Modal’s docs explicitly mention LLM-generated code, arbitrary dependencies, checking out repositories, and running test suites as use cases. Its networking/security docs say Modal Sandboxes are built on gVisor.

Minimal Modal example:

import modal

app = modal.App.lookup("agent-sandbox-demo", create_if_missing=True)
image = modal.Image.debian_slim().pip_install("pytest")

sb = modal.Sandbox.create(app=app, image=image, timeout=60)

try:
    proc = sb.exec("python", "-c", "print('hello from Modal sandbox')")
    print(proc.stdout.read())

    proc = sb.exec(
        "bash",
        "-lc",
        "python - <<'PY'\nimport sys\nprint(sys.version)\nPY",
        timeout=10,
    )
    print(proc.stdout.read())
finally:
    sb.terminate()

For an agent harness, the agent loop would not call subprocess.run() locally. It would call sb.exec(...), capture stdout/stderr/exit code, and feed the observation back to the model.
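
Turning a command result into a model-facing observation is provider-independent. A minimal sketch follows; the `format_observation` helper, the section labels, and the 4000-character default are assumptions of mine rather than anything Modal prescribes. It keeps the head and tail of long output, since test failures tend to appear at the end of a log.

```python
def format_observation(stdout: str, stderr: str, exit_code: int,
                       max_chars: int = 4000) -> str:
    """Render a command result as text for the model, truncating long streams."""

    def clip(text: str) -> str:
        if len(text) <= max_chars:
            return text
        half = max(1, max_chars // 2)
        omitted = len(text) - 2 * half
        return f"{text[:half]}\n...[{omitted} characters omitted]...\n{text[-half:]}"

    return (
        f"exit_code: {exit_code}\n"
        f"--- stdout ---\n{clip(stdout)}\n"
        f"--- stderr ---\n{clip(stderr)}"
    )
```

Without some bound like this, a single verbose build or test run can consume most of the model's context window.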

Modal is not limited to code execution. Its documentation includes an Anthropic Computer Use demo that launches a desktop/VNC environment in a Sandbox and exposes ports for a Streamlit agent interface and a noVNC display.

E2B

E2B provides secure cloud sandboxes for agents. The docs describe a sandbox as a fast, secure Linux VM created on demand, and the product page says E2B sandboxes are powered by Firecracker microVMs.

Minimal E2B example:

from e2b import Sandbox

sbx = Sandbox.create()  # requires E2B_API_KEY

try:
    result = sbx.commands.run("python3 - <<'PY'\nprint('hello from E2B')\nPY")
    print(result.stdout)

    result = sbx.commands.run("mkdir -p /tmp/agent && echo ok > /tmp/agent/out.txt")
    print("exit:", result.exit_code)
finally:
    sbx.kill()

The interesting design point is the control plane. Your product can keep the LLM, planner, user session, and billing logic in your app, while code execution happens inside a short-lived remote VM.

E2B also provides Desktop sandboxes for computer-use agents. In this mode, the remote environment contains an Ubuntu desktop and applications; an agent can consume visual observations and issue mouse, keyboard, and scrolling actions while a user observes the desktop through VNC streaming.

Daytona

Daytona is an open-source sandbox platform for AI-generated code. Its docs describe sandboxes as isolated environments with a dedicated kernel, filesystem, network stack, vCPU, RAM, and disk, with OCI/Docker compatibility. The SDK exposes filesystem, Git, process execution, code interpreter, and computer-use operations.

Minimal Daytona example:

from daytona import Daytona

daytona = Daytona()  # uses DAYTONA_API_KEY or configured credentials
sandbox = daytona.create()

try:
    response = sandbox.process.exec("echo 'hello from Daytona'")
    print(response.result)

    response = sandbox.process.code_run("""
def greet(name):
    return f"hello, {name}"

print(greet("agent sandbox"))
""")
    print(response.result)
finally:
    sandbox.delete()

Daytona is especially relevant to Claude Managed Agents. Daytona’s guide says that Anthropic runs the API, agent loop, and work queue; your orchestrator manages sandbox lifecycle; and Daytona provides the sandbox containers where filesystem and shell tools execute. In that design, Bash/read/write/edit/glob/grep happen inside Daytona, while web tools and MCP tools are dispatched by Anthropic server-side.

This is a good example of execution-layer separation: the model orchestration and the shell/filesystem execution do not have to live in the same infrastructure.

Daytona’s broader platform also exposes Computer Use. The desktop runs in its sandbox, and SDK operations can control mouse/keyboard input, capture screenshots or recordings, and provide VNC access for visual inspection.

5. Benchmark and Framework Sandboxes

The execution layer is also central to evaluation frameworks and open-source coding agents.

SWE-bench

SWE-bench evaluates whether an agent can fix real GitHub issues. The benchmark is only meaningful because each task can be executed in a controlled environment.

Conceptually, a SWE-bench task contains:

a repository and commit,
an issue statement,
a patch generated by an agent,
environment setup scripts,
test scripts,
FAIL_TO_PASS tests that should become passing,
PASS_TO_PASS tests that should remain passing.

The harness builds Docker images and runs evaluation inside containers. The TestSpec object in the SWE-bench harness contains fields such as repo_script_list, env_script_list, eval_script_list, FAIL_TO_PASS, PASS_TO_PASS, and Docker image tags.
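
The two test lists imply a simple resolution rule, which can be sketched as below. The `is_resolved` function and the `"PASSED"` status string are illustrative assumptions. The real harness parses repository-specific test logs and has its own status vocabulary.

```python
def is_resolved(fail_to_pass: list[str], pass_to_pass: list[str],
                test_status: dict[str, str]) -> bool:
    """True when every listed test is reported as PASSED after the patch is applied.

    `test_status` maps a test name to a status string parsed from the logs;
    a test missing from the map counts as not passed.
    """
    required = [*fail_to_pass, *pass_to_pass]
    return all(test_status.get(name) == "PASSED" for name in required)
```

The PASS_TO_PASS half of the rule is what stops a patch from "fixing" the issue by breaking unrelated behavior.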

Typical evaluation flow:

python -m swebench.harness.run_evaluation \
  --dataset_name princeton-nlp/SWE-bench_Verified \
  --predictions_path predictions.jsonl \
  --max_workers 8 \
  --run_id my-agent-run

A prediction file is usually a JSONL file in which each line gives an instance id, a model name, and the generated patch:

{"instance_id":"django__django-12345","model_name_or_path":"my-agent","model_patch":"diff --git a/..."}

The important point is that SWE-bench is not just a dataset. It is an execution harness: reset repo, apply patch, install deps, run tests, parse logs, report pass/fail.
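
On the agent side, producing the prediction file is a small serialization step. The `write_predictions` helper below is hypothetical; the three keys follow the prediction format as I understand it from the SWE-bench documentation, so check them against the version of the harness you run.

```python
import json
from pathlib import Path


def write_predictions(path: str, patches: dict[str, str], model_name: str) -> int:
    """Write one JSON object per line; return the number of predictions written."""
    with Path(path).open("w", encoding="utf-8") as handle:
        for instance_id, patch in patches.items():
            record = {
                "instance_id": instance_id,
                "model_name_or_path": model_name,
                "model_patch": patch,
            }
            handle.write(json.dumps(record) + "\n")
    return len(patches)
```

`json.dumps` escapes the newlines inside each patch, so every prediction stays on a single line as JSONL requires.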

Reproducibility Is Not the Same as Maximum Isolation

SWE-bench’s default evaluation design relies on Docker containers. This is a sensible choice for the benchmark’s primary objective: reproducibly checking whether a generated patch fixes a real issue in the expected dependency environment.

SWE-bench instance
      |
      v
Repository-specific Docker image
      |
      | create clean container
      | apply generated patch
      | execute tests
      v
Resolved / not resolved

However, a model-generated patch is executable input. A patch can modify code that tests import and run. If evaluation is offered as a public service accepting arbitrary submissions, the threat model is no longer only “does the patch pass?” It also includes “could the patch attack the evaluator?”

Concern	Docker evaluation container	Firecracker or gVisor outer sandbox
Primary purpose	Recreate repository and test dependencies	Protect hosted infrastructure from untrusted execution
Kernel isolation on native Linux	Shares the host kernel	Adds a stronger kernel or syscall isolation boundary
Startup and operational complexity	Lower	Higher
Good use case	Local research runs and controlled evaluation	Public or multi-tenant agent/evaluation service

Therefore, Docker is not a problem simply because SWE-bench uses it. Docker is doing the benchmark-reproducibility job. A security-sensitive hosted evaluator may add another isolation layer around that same benchmark container:

Managed gVisor sandbox or Firecracker microVM
      |
      v
SWE-bench Docker evaluation environment
      |
      v
Apply patch and run official tests

The inner Docker image preserves comparable SWE-bench test conditions. The outer sandbox protects the hosting platform from hostile or accidentally harmful generated code. Even for local runs, the evaluation environment should avoid mounted credentials, Docker socket access, unnecessary network access, and unlimited CPU or memory.
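
The local-run precautions in the last sentence can be turned into a pre-flight check on the container arguments. This is a rough sketch: `audit_docker_argv` and its substring heuristics are my own assumptions, and they will miss many risky configurations. It is meant to show the idea of checking the launch configuration, not to serve as a complete audit.

```python
RISKY_SUBSTRINGS = {
    "/var/run/docker.sock": "Docker socket is mounted",
    "--privileged": "container runs privileged",
    "--network host": "container shares the host network",
    ".ssh": "SSH credentials may be mounted",
    ".aws": "cloud credentials may be mounted",
}
REQUIRED_FLAGS = ("--memory", "--cpus", "--pids-limit")


def audit_docker_argv(argv: list[str]) -> list[str]:
    """Return human-readable warnings for a `docker run` argument list."""
    # Normalize "--flag=value" to "--flag value" so one pattern covers both spellings.
    joined = " ".join(argv).replace("=", " ")
    warnings = [msg for needle, msg in RISKY_SUBSTRINGS.items() if needle in joined]
    if "--network none" not in joined:
        warnings.append("network access is not disabled")
    for flag in REQUIRED_FLAGS:
        if not any(arg == flag or arg.startswith(flag + "=") for arg in argv):
            warnings.append(f"no {flag} limit set")
    return warnings
```

A harness could refuse to start an evaluation when the returned list is non-empty, or at least record the warnings alongside the run.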

SWE-ReX

SWE-ReX is an open-source runtime interface for sandboxed shell environments. It powers SWE-agent and is designed to decouple an agent’s command-execution logic from the infrastructure where commands actually run. Its documented deployments include local/container and hosted backends such as Modal, AWS/Fargate, and Daytona.

It is useful to separate three related projects:

Component	What it provides	What it does not provide
SWE-bench	Repository/issue tasks and the official patch evaluation harness	An interactive agent runtime
SWE-agent or another coding agent	The reasoning loop that explores a repo, edits files, and proposes a patch	The benchmark definition or infrastructure portability layer
SWE-ReX	A portable runtime API for commands, sessions, and files on a selected backend	New benchmark tasks or an automatic guarantee of stronger security

The flow for a SWE-bench run using an agent and SWE-ReX is:

SWE-bench task: repository + issue statement
      |
      v
SWE-agent or your own coding agent
      |
      v
SWE-ReX runtime API
      |
      | Docker / Modal / Daytona / Fargate backend
      | inspect files, edit code, run tests
      v
Candidate patch
      |
      v
SWE-bench evaluation harness
      |
      v
Benchmark score

The task is still a SWE-bench task, and the final score is still determined by SWE-bench. SWE-ReX is the execution wrapper used while an interactive agent works on that task. It can also be used for developer-defined tasks unrelated to SWE-bench.

The portability benefit is that the agent can ask for shell execution through one interface while deployment can change independently:

Development or deployment need	Possible SWE-ReX backend	Security consequence
Fast local development	Local Docker	Retains ordinary Docker/container trust assumptions
Parallel cloud agent runs	Modal	Uses Modal’s hosted sandbox boundary, documented as gVisor-based
Managed workspaces	Daytona	Uses Daytona-provided isolated workspaces; exact isolation depends on its deployment
Custom cloud operation	AWS/Fargate	Depends on the configured cloud runtime and network/credential policy

SWE-ReX therefore does not automatically fix the Docker-versus-microVM concern. It makes the backend replaceable. A developer can use Docker during local development and choose a stronger hosted execution boundary for untrusted or large-scale workloads without rewriting the agent loop around a provider-specific API.

Minimal SWE-ReX pattern:

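import asyncio

from swerex.deployment.docker import DockerDeployment
from swerex.runtime.abstract import BashAction, CreateBashSessionRequest


async def main():
    deployment = DockerDeployment(image="python:3.12")
    await deployment.start()
    try:
        runtime = deployment.runtime
        # A shell session must exist before commands can run in it.
        await runtime.create_session(CreateBashSessionRequest())
        result = await runtime.run_in_session(
            BashAction(command="python -c \"print('hello from SWE-ReX')\"")
        )
        print(result)
    finally:
        await deployment.stop()


asyncio.run(main())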

The exact runtime call names may change across versions, but the architectural shape is stable:

create a deployment,
start a sandbox,
send commands through a runtime interface,
collect stdout/stderr/exit code,
stop and clean up.

That is the same shape a coding-agent harness needs.

Is SWE-ReX Only for Coding Agents?

SWE-ReX is most naturally used by coding agents because its common interface focuses on commands, persistent shell sessions, and filesystem operations. It is also suitable for terminal-driven debugging, data analysis, and security tasks where the relevant actions can be represented as commands and files.

Capability	SWE-ReX common runtime interface
Run shell commands and tests	Yes
Keep an interactive shell/session alive	Yes
Transfer or edit files	Yes
Run a script that itself uses a headless browser	Possible, if the sandbox image contains the dependencies
First-class browser navigation and DOM observation tools	Not documented as part of the common runtime API
Screenshot-based mouse and keyboard computer use	Not documented as part of the common runtime API

For example, a sandbox with Node and Playwright installed could execute a scripted browser test through a command:

result = await runtime.run_in_session(BashAction(command="node ./run_playwright_test.js"))

But SWE-ReX does not itself define an agent interaction loop such as observe screenshot -> click -> type -> scroll. A product building browser-use or desktop-use agents needs an additional tool layer, or it needs to use backend-specific capabilities directly. Daytona may expose broader computer-use features as a platform, but a portable SWE-ReX shell interface should not be assumed to expose every provider-specific capability.

OpenHands

OpenHands belongs at a different architectural level than SWE-ReX. SWE-ReX is a runtime abstraction that a developer can use while implementing an agent. OpenHands is an open-source software-agent platform: it includes model integration, an agent loop, tool routing, conversation or task state, user interfaces and SDKs, and the sandbox in which actions execute.

That makes OpenHands conceptually closer to Claude Code than to Modal, Daytona, E2B, or SWE-ReX:

System	Architectural role	Provides an agent loop?	Provides or chooses execution environments?
Claude Code	Integrated coding-agent product and SDK ecosystem	Yes	Yes: local sandboxing and hosted/managed deployment modes
OpenHands	Open-source software-engineering agent platform	Yes	Yes: Docker, Process, or Remote sandbox providers in V1
SWE-ReX	Portable command/filesystem runtime library	No	Selects execution backends for an agent built elsewhere
Modal / E2B / Daytona	Managed execution infrastructure	No, unless an application is deployed into it	Supplies remote sandbox environments and lifecycle APIs
SWE-bench	Benchmark task and evaluation harness	No	Provides reproducible test execution for scoring

The high-level OpenHands shape is:

OpenHands application
  - LLM integration
  - agent loop and conversation state
  - tool selection and observations
  - UI, CLI, or SDK surface
        |
        v
OpenHands sandbox
  - repository filesystem
  - terminal commands
  - local development servers
  - browser tools and optional visual access
        |
        v
Docker / Process / Remote sandbox provider

This is why OpenHands needs both a model and a sandbox. The model decides which action to attempt; the sandbox is where that action affects files, commands, and applications.

Model Credentials

To operate as an agent, OpenHands must be connected to an LLM. For hosted models, this normally requires model-provider credentials. Its documentation says OpenHands can connect to any LLM supported by LiteLLM, and it documents providers including the OpenHands LLM provider, Anthropic, OpenAI, and local OpenAI-compatible endpoints.

Deployment	Credential requirement
OpenHands with Anthropic, OpenAI, or another hosted provider	Provider API key configured in OpenHands
OpenHands using the OpenHands LLM provider	OpenHands LLM API key backed by OpenHands Cloud credits
Programmatic control of OpenHands Cloud	OpenHands Cloud API key, which is distinct from the LLM key
OpenHands with local Ollama	No paid external inference key; configuration may use a placeholder key such as `local-llm`
OpenHands with local vLLM or SGLang	The token configured for the local model endpoint, if one is enforced

For a self-hosted instance using a hosted model, the conceptual configuration is:

export LLM_MODEL="anthropic/claude-sonnet-4-5-20250929"
export LLM_API_KEY="your-provider-api-key"

The sandbox does not replace the model service:

LLM credentials -> authorize reasoning/inference calls
Sandbox         -> execute the selected tools in a bounded environment

OpenHands Sandbox Providers

OpenHands V1 calls the action environment a sandbox. It is where OpenHands runs commands, edits files, and starts servers while working on a task. Current documentation describes three sandbox providers:

Provider	Where work executes	Isolation and use case
Docker sandbox	Agent server and workspace run inside a Docker container	Recommended local option; provides isolation and reproducibility
Process sandbox	Agent server runs as a normal local process	Faster for controlled debugging, but explicitly unsafe because it has host-user access
Remote sandbox	Agent server runs in a remote execution environment	Used in OpenHands Cloud and advanced self-hosted/managed deployments

In current deployments the provider may still be selected through the legacy RUNTIME variable:

export RUNTIME=docker   # default/recommended local sandbox
# export RUNTIME=process  # fast but no container isolation
# export RUNTIME=remote   # configured remote sandbox endpoint
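A harness that reads this variable can validate it before starting any work. The following is a minimal sketch, not OpenHands code: the resolve_provider helper and the ALLOW_UNSAFE_PROCESS_SANDBOX opt-in variable are hypothetical names introduced for illustration, while the three provider names come from the table above.

```python
import os

# Provider names taken from the table above; the helper itself is hypothetical.
KNOWN_PROVIDERS = {"docker", "process", "remote"}


def resolve_provider(env=None) -> str:
    """Read RUNTIME, default to docker, and refuse the unsafe option by default."""
    env = os.environ if env is None else env
    provider = env.get("RUNTIME", "docker").strip().lower()
    if provider not in KNOWN_PROVIDERS:
        raise ValueError(f"unknown sandbox provider: {provider!r}")
    # The process provider has host-user access, so require an explicit opt-in.
    # ALLOW_UNSAFE_PROCESS_SANDBOX is an assumed variable name, not a real setting.
    if provider == "process" and env.get("ALLOW_UNSAFE_PROCESS_SANDBOX") != "1":
        raise PermissionError("process sandbox is not isolated; opt in explicitly")
    return provider


print(resolve_provider({"RUNTIME": "remote"}))
```

Failing early on an unknown or unsafe value is cheaper than discovering halfway through a task that commands ran directly on the host.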

For the local Docker sandbox, a repository can be mounted into the container so the OpenHands agent directly updates that workspace:

openhands serve --mount-cwd

or explicitly:

export SANDBOX_VOLUMES="$PWD:/workspace:rw"
openhands serve

This is local mounted-workspace behavior: if the container modifies /workspace, the permitted host repository changes immediately because it is a read-write bind mount. By contrast, an OpenHands remote sandbox needs the same result-export mechanisms described earlier: patch retrieval, branch/PR workflow, downloaded files, or provider-specific persistence.
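To make the patch-retrieval path concrete, here is a minimal sketch that builds a unified diff from a file's contents before and after a run. It assumes the harness can read both versions as strings; the export_patch name is hypothetical, and a real deployment would more likely run git diff inside the sandbox and transfer the output.

```python
import difflib


def export_patch(path: str, before: str, after: str) -> str:
    """Build a unified diff for one file from its contents before and after the run."""
    diff = difflib.unified_diff(
        before.splitlines(keepends=True),
        after.splitlines(keepends=True),
        fromfile=f"a/{path}",
        tofile=f"b/{path}",
    )
    return "".join(diff)


patch = export_patch("app.py", "x = 1\n", "x = 2\n")
print(patch)
```

The returned text can be reviewed before anything touches the local checkout, which is the main difference from a read-write bind mount.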

For extra tools and dependencies, the sandbox image can also be customized:

[sandbox]
base_container_image = "my-openhands-sandbox:latest"

More Than Shell Execution: Browser-Enabled Workspaces

OpenHands is still primarily a software-development agent platform, not a universal visual-desktop automation platform. However, it exposes richer software-agent tooling than SWE-ReX’s portable shell/filesystem runtime.

Its SDK Docker sandbox documentation includes browser-enabled workspaces. In that configuration, OpenHands can run a browser inside the isolated workspace and enable browser tools for the agent:

agent = get_default_agent(
    llm=llm,
    cli_mode=False,  # enables browser tools
)

With a browser-enabled Docker workspace, OpenHands can use browser interaction during a development task, for example to inspect a locally launched web application. The documentation also describes VNC access so a developer can visually watch the browser operating inside the sandbox.

Capability	SWE-ReX	OpenHands
Execute shell commands and tests	Yes	Yes
Read/write source files	Yes	Yes
Provide a complete agent loop	No	Yes
Start and inspect an application server	Possible through commands	Integrated into agent workflows
Browser tools in the sandbox	Not a documented common runtime primitive	Documented in browser-enabled SDK sandbox setup
Human visual observation through VNC	Not a core runtime abstraction	Documented for browser-enabled Docker workspace
General-purpose desktop computer-use API	Not a core runtime abstraction	Not the primary platform abstraction; use specialized desktop sandbox infrastructure when needed

The distinction is important. OpenHands can browse and interact with software during an engineering workflow. For a general VLM agent controlling arbitrary desktop applications through screenshot/mouse/keyboard primitives, Daytona Computer Use, E2B Desktop, or a purpose-built computer-use environment is a clearer substrate.

How an OpenHands Session Runs

A simplified local OpenHands task looks like:

User asks OpenHands to fix or build something
        |
        v
OpenHands sends current context to configured LLM
        |
        v
LLM proposes a tool action
        |
        v
Docker sandbox executes:
  read/edit file, Bash command, run server, browser action
        |
        v
Observation returns to OpenHands and then to the LLM
        |
        v
Repeat until completion or user intervention

For local Docker with a mounted repository, approved source edits appear in the user’s checkout. For Remote sandbox deployment, work is isolated remotely and must be exported or committed before it appears locally.

So the short characterization is:

SWE-ReX is a lower-level portable execution interface. OpenHands is an open-source coding-agent harness, comparable in architectural level to Claude Code, whose sandbox provides the controlled workspace where its terminal, file, server, and browser actions occur.

6. Claude Code as a Harness System

OpenHands and Claude Code belong at the same architectural level: both are complete software-agent harnesses, not merely sandbox APIs. They combine a model-facing loop with tools, state, execution boundaries, and a user-facing workflow. The execution design differs depending on how Claude is deployed.

Mode	Where actions execute	Main isolation mechanism	Where resulting files live
Local Claude Code	On the user’s machine	Permissions plus OS-level sandboxing for Bash subprocesses	In the local permitted workspace
Claude Code on the web	In an isolated hosted environment	Anthropic-managed cloud isolation and scoped Git access	In the hosted session until exported through Git or another workflow
Claude Managed Agents cloud environment	In a per-session Anthropic-managed container	Managed cloud container configuration	In the cloud environment/output path
Claude Managed Agents self-hosted environment	In the customer’s worker or sandbox provider	Customer-selected isolation, such as a container or managed sandbox	In customer-controlled infrastructure

For the local product, Anthropic documents an important boundary:

Permissions apply before tool execution across tools such as Bash, Read, Edit, WebFetch, and MCP.
The OS-level sandbox applies to Bash commands and their child processes, using Seatbelt on macOS and bubblewrap on Linux.
Built-in file tools such as Read, Edit, and Write are controlled by permissions rather than the Bash sandbox.
Computer use on the local machine controls the actual desktop; it is not isolated by the Bash sandbox.

This is why the phrase “Claude Code has a sandbox” should not be read as “every possible agent action runs in a remote VM.” Locally, it is a layered design: tool permissions plus OS isolation for command execution.
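The layering can be written down as a lookup from tool to the controls that apply to it. This is a reading aid based on the list above, not a Claude Code API; the controls_for function and the two set names are hypothetical.

```python
# Tool names come from the list above; the mapping is a reading aid, not a Claude Code API.
PERMISSION_CHECKED = {"Bash", "Read", "Edit", "Write", "WebFetch", "MCP"}
OS_SANDBOXED = {"Bash"}  # Bash commands and their child processes only


def controls_for(tool: str) -> list[str]:
    """List which of the two local layers apply to a given tool."""
    controls = []
    if tool in PERMISSION_CHECKED:
        controls.append("permissions")
    if tool in OS_SANDBOXED:
        controls.append("os_sandbox")
    return controls


for tool in ("Bash", "Edit", "WebFetch"):
    print(tool, controls_for(tool))
```

Only Bash receives both layers, so a deny rule for a sensitive path has to be expressed as a permission as well as a sandbox restriction if the file tools are also meant to be blocked.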

For hosted agents, the topology changes. Claude Managed Agents defines an environment where tool execution happens. By default this is an Anthropic-managed cloud container. With a self-hosted sandbox, Anthropic retains orchestration while the developer supplies the environment worker and determines filesystem mounts, network egress, runtime hardening, and result retrieval. Daytona documents one integration for this self-hosted path.

Anthropic orchestration / model-facing agent loop
        |
        | tool execution work item
        v
Self-hosted environment worker
        |
        | start or attach sandbox
        v
Daytona / Modal / custom container or VM
  - files, commands, tests, network policy
        |
        | tool results and final outputs
        v
Anthropic orchestration

This does not mean that ordinary local Claude Code automatically sends work to Daytona. Daytona is relevant to the Managed Agents self-hosted environment design.

A Local Configuration Pattern

The following illustrates a restrictive local Claude Code posture. The current working directory is writable by default for sandboxed Bash; additional write paths should be granted narrowly. allowUnsandboxedCommands: false disables the normal escape hatch that could otherwise ask to rerun a blocked command outside the sandbox.

{
  "sandbox": {
    "enabled": true,
    "failIfUnavailable": true,
    "autoAllowBashIfSandboxed": true,
    "allowUnsandboxedCommands": false,
    "filesystem": {
      "allowWrite": ["/tmp/agent-work"],
      "denyRead": ["~/.ssh", "~/.aws", "~/.kube", "~/.config/gh"]
    },
    "network": {
      "allowedDomains": ["github.com", "registry.npmjs.org", "pypi.org"]
    }
  },
  "permissions": {
    "deny": [
      "Read(~/.ssh/**)",
      "Read(~/.aws/**)",
      "Read(~/.kube/**)",
      "Read(~/.config/gh/**)"
    ]
  }
}

Network policy needs care. A broad denied-domain rule can override an allowlist, so a policy that intends to permit package registries should not also deny every domain. In real deployments, test the configuration against the specific build tools the agent is allowed to use.
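The deny-over-allow interaction is easy to reproduce in a few lines. The sketch below is a simplified model with glob-style patterns, written only to show the precedence problem; it is an assumption for illustration and does not reproduce Claude Code's actual matching rules.

```python
from fnmatch import fnmatch


def egress_allowed(host: str, allowed: list[str], denied: list[str]) -> bool:
    """Deny patterns win over allow patterns; unmatched hosts are blocked."""
    if any(fnmatch(host, pattern) for pattern in denied):
        return False
    return any(fnmatch(host, pattern) for pattern in allowed)


allowed = ["github.com", "registry.npmjs.org", "pypi.org"]
print(egress_allowed("pypi.org", allowed, denied=[]))     # allowlist alone permits it
print(egress_allowed("pypi.org", allowed, denied=["*"]))  # a catch-all deny blocks it
```

Under this model the allowlist already blocks everything it does not name, so a catch-all deny adds no protection and only breaks the registries the policy meant to permit.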

7. An Execution Adapter Is Not a Complete Harness

A custom agent can isolate provider-specific execution calls behind an interface. This is approximately the role SWE-ReX plays for command-oriented agents:

from dataclasses import dataclass
from typing import Protocol


@dataclass
class ExecResult:
    stdout: str
    stderr: str
    exit_code: int


class Sandbox(Protocol):
    def exec(self, command: str, timeout: int = 60) -> ExecResult:
        ...

    def write_file(self, path: str, content: str) -> None:
        ...

    def read_file(self, path: str) -> str:
        ...

    def export_patch(self) -> str:
        ...

    def close(self) -> None:
        ...

But this adapter only represents the execution portion of the harness. A real loop should validate an action before sending it to the sandbox and should explicitly export accepted results:

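def build_prompt(task: str, transcript: list[str]) -> str:
    # Placeholder prompt builder; a real harness adds context, memory, and tool schemas.
    return "\n\n".join([f"task: {task}", *transcript])


def run_agent_step(model, policy, sandbox: Sandbox, task: str, transcript: list[str]) -> str:
    # Decision: ask the model for the next action as a dict with a "type" key.
    action = model.next_action(build_prompt(task, transcript))
    # Control: expected to raise if the action is not permitted, so nothing below runs.
    policy.authorize(action)

    # Execution: only the sandbox touches files and processes.
    if action["type"] == "bash":
        result = sandbox.exec(action["command"], timeout=action.get("timeout", 60))
        observation = (
            f"exit={result.exit_code}\n"
            f"stdout:\n{result.stdout}\n"
            f"stderr:\n{result.stderr}"
        )
    elif action["type"] == "write_file":
        sandbox.write_file(action["path"], action["content"])
        observation = f"wrote {action['path']}"
    else:
        raise ValueError(f"unsupported action: {action['type']}")

    # Record the step so the next prompt includes what happened.
    transcript.append(f"action: {action}\nobservation: {observation}")
    return observation


def finalize_work(sandbox: Sandbox) -> str:
    # Result transfer: changes leave the sandbox only as an explicit patch.
    return sandbox.export_patch()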

This small example separates four concerns:

Concern	Minimal code element	Real system responsibility
Decision	`model.next_action(...)`	Prompt, context, memory, lifecycle
Control	`policy.authorize(...)`	Permissions, egress policy, approval, audit
Execution	`sandbox.exec(...)`	Docker, Modal, E2B, Daytona, SWE-ReX backend
Result transfer	`sandbox.export_patch()`	Review, Git branch, artifact retrieval, local sync

It also highlights why a complete product such as OpenHands or Claude Code is larger than a sandbox SDK. The product must implement context management, tool definitions, permissions, observation handling, verification, result delivery, and a user workflow in addition to starting an isolated process.
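As one illustration of the control concern, here is a minimal sketch of a policy object that could stand behind a policy.authorize(...) call. The CommandPolicy class, its allowlist approach, and the audit_log field are assumptions for this example, not the design of OpenHands or Claude Code.

```python
import shlex


class CommandPolicy:
    """Allow only Bash actions whose first word is on an explicit list (illustrative)."""

    def __init__(self, allowed_programs: set[str]):
        self.allowed_programs = allowed_programs
        self.audit_log: list[dict] = []

    def authorize(self, action: dict) -> None:
        decision = "allow"
        if action["type"] == "bash":
            try:
                words = shlex.split(action["command"])
            except ValueError:
                words = []  # unbalanced quotes: treat as unparseable and deny
            if not words or words[0] not in self.allowed_programs:
                decision = "deny"
        # Record every decision so a bad action can be traced later.
        self.audit_log.append({"action": action, "decision": decision})
        if decision == "deny":
            raise PermissionError(f"blocked by policy: {action['command']!r}")


policy = CommandPolicy({"pytest", "git"})
policy.authorize({"type": "bash", "command": "pytest -q"})
print(policy.audit_log[-1]["decision"])
```

A first-word check like this is easy to bypass, for example with pytest && rm -rf ., which is exactly why a policy of this kind still needs a sandbox underneath it.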

8. Design Lessons

The execution layer is where an agent’s proposed actions become real effects. The most useful design questions are practical:

Question	Design consequence
Is code trusted, generated, or submitted by unknown users?	Choose an isolation boundary proportionate to the threat model; hosted untrusted execution usually needs more than default Docker.
Must each run be reproducible?	Use images, templates, pinned commits, resettable workspaces, and deterministic evaluation scripts.
Does the agent edit local files or remote files?	Define the result path in advance: mounted workspace, returned patch, artifact download, or Git branch/PR.
Which secrets and network endpoints are required?	Do not mount broad credentials; explicitly restrict reads, writes, and outbound access.
Does the task require browser or desktop interaction?	Select an environment that exposes screenshots and actions, not only shell execution.
How will a bad action be noticed?	Capture commands, outputs, diffs, resource consumption, network activity, approvals, and test results.
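For the secrets question in the table above, one small habit is to build the sandbox environment from an explicit passthrough list instead of copying the host environment. The sketch below is an assumption-laden illustration: the sandbox_env helper and the SECRET_MARKERS heuristic are hypothetical and would need tuning for a real deployment.

```python
SECRET_MARKERS = ("KEY", "TOKEN", "SECRET", "PASSWORD")  # illustrative heuristics


def sandbox_env(host_env: dict[str, str], passthrough: set[str]) -> dict[str, str]:
    """Forward only explicitly listed variables, and never secret-looking ones."""
    return {
        name: value
        for name, value in host_env.items()
        if name in passthrough
        and not any(marker in name.upper() for marker in SECRET_MARKERS)
    }


host = {"PATH": "/usr/bin", "LLM_API_KEY": "sk-example", "HOME": "/home/dev"}
print(sandbox_env(host, passthrough={"PATH", "LLM_API_KEY"}))
```

Even though LLM_API_KEY is on the passthrough list here, the name filter drops it, so the model credential stays with the harness process and does not reach the environment where generated code runs.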

Four distinctions summarize this layer:

Reproducibility is not security. Docker images are excellent reproducible testbeds; hostile multi-tenant execution may require gVisor, microVMs, or another stronger outer boundary.
Permissions are not isolation. A policy decides which actions should run; a sandbox limits damage if a permitted action is unsafe.
Infrastructure is not the agent harness. Modal, E2B, and Daytona supply execution environments; OpenHands and Claude Code integrate those kinds of environments into complete agent systems.
Remote execution needs a return path. An agent has not completed a software change until the patch, branch, artifact, or approved file update reaches the user or repository.

The central execution-layer design decision is not simply “which sandbox is fastest?” It is which combination of tool policy, environment boundary, runtime isolation, result export, and verification gives the agent enough freedom to be productive without giving failures an uncontrolled blast radius.
