chapter: 12·part: The Organization (fleet at scale)

Trust & Operate: Observability and Debugging

Inspect, debug, and audit runs with gh aw logs, gh aw audit, and OpenTelemetry so you can trust what the fleet does.

Objective

By the end of this chapter you can inspect, debug, and audit what your workflows do — with gh aw logs, gh aw audit, run summaries, and OpenTelemetry — so you can trust a fleet you can't watch by hand.

Everything targets gh aw v0.81.6. We take a run that went wrong and trace it from an overview table down to the exact failing step.

Concept: you can't govern what you can't see

Everything in Part III assumes a fleet running unattended — agents triaging, reviewing, and opening PRs across many repos while you sleep. That only works if you can answer, after the fact: what did it do, why, what did it cost, and what did it touch? Observability is the precondition for trust. You can't govern — can't budget, can't secure, can't improve — what you can't see.

An agentic run is unusually inspectable because, as Chapter 3 showed, only one job is non-deterministic and everything is captured as artifacts. gh-aw “provides comprehensive observability through GitHub Actions runs and artifacts… [which] preserve prompts, outputs, patches, and logs for post-hoc analysis” (Security Architecture). Debugging an agent isn't guesswork; it's reading a well-kept record.

In gh-aw: logs, audit, OpenTelemetry, run summaries

Three CLI commands (gh aw logs, gh aw audit, and gh aw status), the per-run step summary, and one export (OpenTelemetry) cover the whole observability surface.

gh aw logs — the overview and the artifacts

gh aw logs “download[s] and analyze[s] agentic workflow logs and artifacts… and provides an overview table with aggregate metrics including duration, token usage, and cost information” (gh aw logs --help). By default it fetches only the compact usage artifact; pass --artifacts to download more:

Fetch runs and choose how much to download
gh aw logs                          # overview table: duration, tokens, cost
gh aw logs repo-assistant           # just one workflow's runs
gh aw logs --artifacts all          # everything: prompt, output, patch, logs
gh aw logs --artifacts agent,firewall   # only what you need

The downloadable artifacts are the agent's black box recorder: agent-stdio.log, safe_output.jsonl (what it proposed), aw-{branch}.patch (what it changed), workflow-logs/, and summary.json. The artifact sets you can name with --artifacts include activation, agent, detection, firewall, github-api, mcp, and usage.
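
Before reading a downloaded run, it can help to confirm which parts of the record are actually on disk. The following is a minimal sketch, not gh-aw behavior: it assumes the artifacts for one run have been unpacked into a single flat directory, and the list_artifacts helper name is made up for illustration.

```shell
#!/bin/sh
# list_artifacts DIR: report which of the artifact files named above exist in DIR.
# ASSUMPTION: a flat directory per run; the real download layout may differ.
list_artifacts() {
  dir=$1
  for name in agent-stdio.log safe_output.jsonl summary.json; do
    if [ -f "$dir/$name" ]; then
      echo "present  $name"
    else
      echo "missing  $name"
    fi
  done
}
```

Call it with the directory that holds one run's files; a "missing" line usually means the matching artifact set was not requested with --artifacts.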

gh aw audit — the focused report

Where logs is broad, audit is deep. It audits runs “by downloading artifacts and logs, detecting errors, analyzing MCP tool usage, and generating a concise report” (gh aw audit --help). Point it at a run and it finds the problem for you:

Investigate one run, or diff two
gh aw audit 1234567890              # detailed Markdown report for one run
gh aw audit <run-url>/job/<id>       # a job URL — extracts the first failing step
gh aw audit 1234567890 1234567891   # compare two runs (first = baseline)

Given a job URL without a step, it “finds and extracts the first failing step's output”, so you land on the failure without hunting for it. The report's Firewall Analysis section lists every domain the agent tried to reach, with its allow/deny status (the firewall itself is covered in Chapter 7).

Run summaries and OpenTelemetry

Every run also writes a rich Markdown step summary in the Actions UI, and gh aw status reports fleet health at a glance. For centralized, cross-run visibility, the observability.otlp block “export[s] distributed traces from workflow runs to an OpenTelemetry Protocol (OTLP) compatible backend” (Frontmatter) — so agent runs appear in the same tracing tool as the rest of your systems.

When to inspect versus trust (human-in-the-loop)

You can't read every run of a busy fleet — nor should you. The skill is knowing which runs earn a look. Let the cheap signals (the overview table, the safe-outputs boundary, the threat-detection gate) carry the routine cases, and spend attention where the signal says something's off.

Inspect closely when…                         | Trust the guardrails when…
a run failed or timed out                     | it succeeded and produced expected safe outputs
tokens/cost spiked vs. the norm               | cost is in the usual band
the firewall logged unexpected domains        | egress stayed within the allowlist
you're rolling out a new or changed workflow  | a stable workflow is running unchanged
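
The first three rows of the table reduce to a simple rule: any single warning signal earns a look. Here is a rough sketch of that rule. It assumes you have already reduced each run to four whitespace-separated fields (run ID, status, token count, number of denied domains). That field layout, the triage name, and the "3x the baseline" threshold are illustrative choices, not gh-aw output or defaults.

```shell
#!/bin/sh
# triage BASELINE_TOKENS: read "id status tokens denied" lines on stdin and
# print "inspect" (with the reasons) or "trust" for each run.
# ASSUMPTION: the input format and the 3x threshold are hypothetical.
triage() {
  awk -v base="$1" '
    {
      reason = ""
      if ($2 != "success") reason = reason " status=" $2   # failed or timed out
      if ($3 > 3 * base)   reason = reason " tokens=" $3   # cost spike vs. the norm
      if ($4 > 0)          reason = reason " denied=" $4   # unexpected egress
      if (reason != "") print $1, "inspect" reason
      else              print $1, "trust"
    }'
}

# Illustrative input only: two runs, with a baseline of 12000 tokens.
printf '%s\n' \
  '1234567890 failure 182400 1' \
  '1234567889 success 12100 0' | triage 12000
```

The fourth row (a new or changed workflow) is deliberately left out: it depends on what you know about the rollout, not on anything in the run data.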

When not to

  • Don't skip observability because “it's working.” A silent fleet is not a healthy fleet — it's an unmonitored one. Glance at gh aw logs regularly even when nothing's on fire.
  • Don't debug from the model's chat alone. The artifacts — patch, safe-output JSON, firewall log — are ground truth; the agent's narration is not. Read the record, not the story.
  • Don't treat observability as a substitute for the guardrails. Seeing a bad action after the fact is no help if it already shipped. Logs and audit complement safe outputs and review gates; they don't replace them.
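
Reading the record can start with a tally of what the agent proposed. The sketch below counts entries in safe_output.jsonl by kind. It assumes one JSON object per line with a string field named "type"; that field name and the count_output_types helper are assumptions for illustration, so check them against a real file first.

```shell
#!/bin/sh
# count_output_types FILE: tally the "type" value found on each line of a JSONL file.
# ASSUMPTION: each line is one JSON object carrying a string field named "type".
count_output_types() {
  grep -o '"type"[[:space:]]*:[[:space:]]*"[^"]*"' "$1" \
    | sed 's/.*"\([^"]*\)"$/\1/' \
    | sort | uniq -c | sort -rn
}
```

This is plain pattern matching, so a nested "type" key would be counted too. A JSON-aware tool such as jq is the more robust choice when it is available.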

Worked example: debugging a failed run from its logs

The Repo Assistant's nightly run failed. Here's the trace from “something's wrong” to root cause — three commands, no guessing.

1. Get the overview. Start broad to find the bad run and its ID:

The overview table surfaces the anomaly
gh aw logs repo-assistant
# RUN ID       WORKFLOW         STATUS   DURATION   TOKENS    COST
# 1234567890   repo-assistant   failure  4m12s      182,400   …
# 1234567889   repo-assistant   success  0m48s       12,100   …

The failed run also burned 15× the tokens of a healthy one — two signals pointing at the same run.
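
The multiplier is simply the ratio of the two token counts in the table. As a quick check (the token_ratio helper name is made up for illustration):

```shell
#!/bin/sh
# token_ratio A B: print A/B to one decimal place, e.g. failed-run tokens over
# healthy-run tokens.
token_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1fx\n", a / b }'
}

token_ratio 182400 12100   # prints 15.1x
```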

2. Audit that run. Let audit find the failing step and explain it:

A focused report that detects the error for you
gh aw audit 1234567890
# Downloads artifacts + logs, detects errors, analyzes MCP tool usage,
# and writes a concise Markdown report — including the first failing step
# and a Firewall Analysis of every domain the agent tried to reach.

Say the report shows the agent looping on a tool call to a domain the firewall denied — that explains both the failure and the token blow-up (it retried until timeout).

3. Confirm and fix. Pull the full artifacts if you need to read the raw exchange, then fix the cause — add the domain to network.allowed (Chapter 7) — and recompile:

Read the black box, then fix the workflow
gh aw logs repo-assistant --artifacts all   # agent-stdio.log, firewall log, patch…
# → root cause: egress to an un-allowed domain, retried to timeout
# fix: add the domain to network.allowed, then:
gh aw compile .github/workflows/repo-assistant.md

Recap & what's next

You can now see what your fleet does, and debug it when it misbehaves:

  • Observability is the precondition for trust — you can't govern what you can't see, and every run leaves a durable artifact trail.
  • gh aw logs gives the overview + artifacts (duration, tokens, cost; --artifacts to download more); gh aw audit gives a focused report that detects the failing step and analyzes tool/firewall use.
  • Run step summaries, gh aw status, and OpenTelemetry (observability.otlp) round out the picture.
  • Inspect the runs that signal trouble (failures, cost spikes, denied egress); trust the guardrails for the rest — but never let observability replace them.

What's next. Seeing cost is the first step; controlling it is the next. In Chapter 13: Governance & FinOps, we cap and meter agentic spend with max-ai-credits and set the org policy that keeps a fleet affordable and compliant.