By the end of this chapter you can inspect, debug, and audit what your workflows do — with gh aw logs, gh aw audit, run summaries, and OpenTelemetry — so you can trust a fleet you can't watch by hand.
Everything targets gh aw v0.81.6. We take a run that went wrong and trace it from an overview table down to the exact failing step.
Everything in Part III assumes a fleet running unattended: agents triaging, reviewing, and opening PRs across many repos while you sleep. That only works if you can answer, after the fact: what did it do, why, what did it cost, and what did it touch?
Observability is the precondition for trust. You can't govern (budget, secure, or improve) what you can't see.
An agentic run is unusually inspectable because, as Chapter 3 showed, only one job is non-deterministic and everything is captured as artifacts. gh-aw “provides comprehensive observability through GitHub Actions runs and artifacts… [which] preserve prompts, outputs, patches, and logs for post-hoc analysis” (Security Architecture). Debugging an agent isn't guesswork; it's reading a well-kept record.
Three CLI commands and one export cover the whole observability surface.
gh aw logs — the overview and the artifacts
It “download[s] and analyze[s] agentic workflow logs and artifacts… and provides an overview table with aggregate metrics including duration, token usage, and cost information” (gh aw logs --help). By default it grabs just the compact usage artifact; widen with --artifacts:
Fetch runs and choose how much to download
gh aw logs # overview table: duration, tokens, cost
gh aw logs repo-assistant # just one workflow's runs
gh aw logs --artifacts all # everything: prompt, output, patch, logs
gh aw logs --artifacts agent,firewall # only what you need
The downloadable artifacts are the agent's black box recorder: agent-stdio.log, safe_output.jsonl (what it proposed), aw-{branch}.patch (what it changed), workflow-logs/, and summary.json. Available sets include activation, agent, detection, firewall, github-api, mcp, usage.
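Before reading a downloaded run, it can help to confirm which of those files actually arrived. The helper below is a minimal sketch, not part of gh aw: `check_artifacts` is a hypothetical name, and it assumes the files sit at the top level of whatever directory you pass in, which may not match the layout on your machine.

```shell
# Hypothetical helper: report which of the artifacts named above exist
# in a downloaded run directory.
# Assumption: the files sit at the top level of the directory passed as $1.
check_artifacts() {
  local dir="$1"
  local name
  for name in agent-stdio.log safe_output.jsonl summary.json; do
    if [ -f "$dir/$name" ]; then echo "present $name"; else echo "missing $name"; fi
  done
  # workflow-logs/ is a directory, so test it with -d rather than -f.
  if [ -d "$dir/workflow-logs" ]; then
    echo "present workflow-logs/"
  else
    echo "missing workflow-logs/"
  fi
  # The patch is named after the branch (aw-{branch}.patch), so match by glob.
  if ls "$dir"/aw-*.patch >/dev/null 2>&1; then
    echo "present aw-*.patch"
  else
    echo "missing aw-*.patch"
  fi
}
```

A missing file is not necessarily an error: with the default download you would expect only the compact usage artifact, so the fuller set should appear only after widening with --artifacts.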
gh aw audit — the focused report
Where logs is broad, audit is deep. It audits runs “by downloading artifacts and logs, detecting errors, analyzing MCP tool usage, and generating a concise report” (gh aw audit --help). Point it at a run and it finds the problem for you:
Investigate one run, or diff two
gh aw audit 1234567890 # detailed Markdown report for one run
gh aw audit <run-url>/job/<id> # a job URL — extracts the first failing step
gh aw audit 1234567890 1234567891 # compare two runs (first = baseline)
Given a job URL without a step, it “finds and extracts the first failing step's output” — it navigates to the failure for you. Its Firewall Analysis section (from Chapter 7) lists every domain the agent tried to reach with allow/deny status.
Run summaries and OpenTelemetry
Every run also writes a rich Markdown step summary in the Actions UI, and gh aw status reports fleet health at a glance. For centralized, cross-run visibility, the observability.otlp block “export[s] distributed traces from workflow runs to an OpenTelemetry Protocol (OTLP) compatible backend” (Frontmatter) — so agent runs appear in the same tracing tool as the rest of your systems.
You can't read every run of a busy fleet — nor should you. The skill is knowing which runs earn a look. Let the cheap signals (the overview table, the safe-outputs boundary, the threat-detection gate) carry the routine cases, and spend attention where the signal says something's off.
| Inspect closely when… | Trust the guardrails when… |
| --- | --- |
| a run failed or timed out | it succeeded and produced expected safe outputs |
| tokens/cost spiked vs. the norm | cost is in the usual band |
| the firewall logged unexpected domains | egress stayed within the allowlist |
| you're rolling out a new or changed workflow | a stable workflow is running unchanged |
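The four inspect-or-trust signals can be written down as a small decision rule. This is a sketch under stated assumptions, not a gh aw feature: `triage` is a hypothetical function, its argument order is invented for illustration, the 3× token threshold is an arbitrary stand-in for "spiked vs. the norm", and a count of denied domains stands in for "unexpected domains". You would fill in the inputs yourself from the overview table and the audit report.

```shell
# Hypothetical triage helper: prints why a run deserves a look, or "trust".
# Usage: triage STATUS TOKENS BASELINE_TOKENS DENIED_DOMAINS CHANGED(yes|no)
triage() {
  local status="$1" tokens="$2" baseline="$3" denied="$4" changed="$5"
  local reasons=""
  # Anything other than a clean success counts as failed or timed out.
  case "$status" in
    success) ;;
    *) reasons="$reasons failed-or-timed-out" ;;
  esac
  # Assumption: "spiked" means more than 3x the usual token count.
  if [ "$tokens" -gt $((baseline * 3)) ]; then
    reasons="$reasons token-spike"
  fi
  # Any denied domain is treated as unexpected egress.
  if [ "$denied" -gt 0 ]; then
    reasons="$reasons denied-egress"
  fi
  # New or changed workflows get a look even when they succeed.
  if [ "$changed" = "yes" ]; then
    reasons="$reasons new-or-changed"
  fi
  if [ -n "$reasons" ]; then
    echo "inspect:$reasons"
  else
    echo "trust"
  fi
}

triage failure 182400 12100 1 no   # inspect: failed-or-timed-out token-spike denied-egress
triage success 12100 12100 0 no    # trust
```

The function collects every reason instead of stopping at the first, because several signals pointing at the same run are themselves useful information.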
When not to
Don't skip observability because “it's working.” A silent fleet is not a healthy fleet — it's an unmonitored one. Glance at gh aw logs regularly even when nothing's on fire.
Don't debug from the model's chat alone. The artifacts — patch, safe-output JSON, firewall log — are ground truth; the agent's narration is not. Read the record, not the story.
Don't treat observability as a substitute for the guardrails. Seeing a bad action after the fact is no help if it already shipped. Logs and audit complement safe outputs and review gates; they don't replace them.
The Repo Assistant's nightly run failed. Here's the trace from “something's wrong” to root cause — three commands, no guessing.
1. Get the overview. Start broad to find the bad run and its ID:
The overview table surfaces the anomaly
gh aw logs repo-assistant
# RUN ID WORKFLOW STATUS DURATION TOKENS COST
# 1234567890 repo-assistant failure 4m12s 182,400 …
# 1234567889 repo-assistant success 0m48s 12,100 …
The failed run also burned about 15× the tokens of a healthy one (182,400 vs. 12,100), so two signals point at the same run.
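The token comparison is plain division of the two TOKENS values in the table. As a minimal sketch, the hypothetical helper `token_ratio` below strips the thousands separators and divides; it is ordinary awk arithmetic, not something gh aw provides.

```shell
# Hypothetical helper: ratio of two token counts as shown in the overview table.
# Accepts values with thousands separators, e.g. 182,400.
token_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    gsub(/,/, "", a)   # drop thousands separators
    gsub(/,/, "", b)
    printf "%.1f\n", a / b
  }'
}

token_ratio 182,400 12,100   # prints 15.1
```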
2. Audit that run. Let audit find the failing step and explain it:
A focused report that detects the error for you
gh aw audit 1234567890
# Downloads artifacts + logs, detects errors, analyzes MCP tool usage,
# and writes a concise Markdown report — including the first failing step
# and a Firewall Analysis of every domain the agent tried to reach.
Say the report shows the agent looping on a tool call to a domain the firewall denied — that explains both the failure and the token blow-up (it retried until timeout).
3. Confirm and fix. Pull the full artifacts if you need to read the raw exchange, then fix the cause — add the domain to network.allowed (Chapter 7) — and recompile:
Read the black box, then fix the workflow
gh aw logs repo-assistant --artifacts all # agent-stdio.log, firewall log, patch…
# → root cause: egress to an un-allowed domain, retried to timeout
# fix: add the domain to network.allowed, then:
gh aw compile .github/workflows/repo-assistant.md
You can now see what your fleet does, and debug it when it misbehaves:
Observability is the precondition for trust — you can't govern what you can't see, and every run leaves a durable artifact trail.
gh aw logs gives the overview + artifacts (duration, tokens, cost; --artifacts to download more); gh aw audit gives a focused report that detects the failing step and analyzes tool/firewall use.
Run step summaries, gh aw status, and OpenTelemetry (observability.otlp) round out the picture.
Inspect the runs that signal trouble (failures, cost spikes, denied egress); trust the guardrails for the rest — but never let observability replace them.
What's next. Seeing cost is the first step; controlling it is the next. In Chapter 13: Governance & FinOps, we cap and meter agentic spend with max-ai-credits and set the org policy that keeps a fleet affordable and compliant.