Trust & Operate: Observability and Debugging | GitHub Agentic Workflows: An Interactive Book

Objective

By the end of this chapter you can inspect, debug, and audit what your workflows do — with gh aw logs, gh aw audit, run summaries, and OpenTelemetry — so you can trust a fleet you can't watch by hand.

Everything targets gh aw v0.81.6. We take a run that went wrong and trace it from an overview table down to the exact failing step.

Concept: you can't govern what you can't see

Everything in Part III assumes a fleet running unattended — agents triaging, reviewing, and opening PRs across many repos while you sleep. That only works if you can answer, after the fact: what did it do, why, what did it cost, and what did it touch? Observability is the precondition for trust. You can't govern — can't budget, can't secure, can't improve — what you can't see.

An agentic run is unusually inspectable because, as Chapter 3 showed, only one job is non-deterministic and everything is captured as artifacts. gh-aw “provides comprehensive observability through GitHub Actions runs and artifacts… [which] preserve prompts, outputs, patches, and logs for post-hoc analysis” (Security Architecture). Debugging an agent isn't guesswork; it's reading a well-kept record.

In gh-aw: logs, audit, OpenTelemetry, run summaries

Three CLI commands (gh aw logs, gh aw audit, gh aw status), the per-run step summary, and one OTLP export cover the observability surface.

`gh aw logs` — the overview and the artifacts

gh aw logs “download[s] and analyze[s] agentic workflow logs and artifacts… and provides an overview table with aggregate metrics including duration, token usage, and cost information” (gh aw logs --help). By default it downloads only the compact usage artifact; pass --artifacts to widen the set:

Fetch runs and choose how much to download

gh aw logs                              # overview table: duration, tokens, cost
gh aw logs repo-assistant               # just one workflow's runs
gh aw logs --artifacts all              # everything: prompt, output, patch, logs
gh aw logs --artifacts agent,firewall   # only what you need

The downloadable artifacts are the agent's black-box recorder: agent-stdio.log, safe_output.jsonl (what it proposed), aw-{branch}.patch (what it changed), workflow-logs/, and summary.json. The artifact sets you can name with --artifacts include activation, agent, detection, firewall, github-api, mcp, and usage.
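After a full download, it can help to check which recorder files actually arrived before reading any of them. The following is a minimal sketch, assuming the files named above sit directly inside one per-run directory; that flat layout and the helper name missing_artifacts are assumptions, not documented gh aw behavior.

```shell
# Hypothetical helper: report which of the artifact files named above are
# absent from a run directory. The flat directory layout is an assumption.
# The body runs in a subshell "( ... )" so its variables stay local.
missing_artifacts() (
  dir="$1"
  for name in agent-stdio.log safe_output.jsonl summary.json; do
    [ -f "$dir/$name" ] || echo "missing: $name"
  done
  # workflow-logs/ is a directory, not a file.
  [ -d "$dir/workflow-logs" ] || echo "missing: workflow-logs/"
  # The patch is named after the branch (aw-{branch}.patch), so match by glob.
  found_patch=no
  for patch in "$dir"/aw-*.patch; do
    if [ -e "$patch" ]; then found_patch=yes; fi
  done
  [ "$found_patch" = yes ] || echo "missing: aw-*.patch"
)
```

Calling missing_artifacts with the path of a downloaded run directory prints one line per absent item and nothing when everything is present.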

`gh aw audit` — the focused report

Where logs is broad, audit is deep. It audits runs “by downloading artifacts and logs, detecting errors, analyzing MCP tool usage, and generating a concise report” (gh aw audit --help). Point it at a run and it finds the problem for you:

Investigate one run, or diff two

gh aw audit 1234567890              # detailed Markdown report for one run
gh aw audit <run-url>/job/<id>      # a job URL: extracts the first failing step
gh aw audit 1234567890 1234567891   # compare two runs (first = baseline)

Given a job URL without a step, it “finds and extracts the first failing step's output” — it navigates to the failure for you. Its Firewall Analysis section (from Chapter 7) lists every domain the agent tried to reach with allow/deny status.
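The <run-url>/job/<id> placeholder follows the usual shape of a GitHub Actions job URL. Here is a small sketch that assembles one from its parts; job_url is a hypothetical helper, and the owner, repository, and job ID in the usage comment are placeholders rather than values from a real run.

```shell
# Hypothetical helper: build an Actions job URL from its parts.
# Shape: https://github.com/OWNER/REPO/actions/runs/RUN_ID/job/JOB_ID
job_url() {
  printf 'https://github.com/%s/%s/actions/runs/%s/job/%s\n' "$1" "$2" "$3" "$4"
}

# Usage (placeholder values): hand the result to gh aw audit.
#   gh aw audit "$(job_url my-org my-repo 1234567890 987654321)"
```

In practice you would usually copy the job URL straight from the Actions UI; the helper only makes the expected format explicit.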

Run summaries and OpenTelemetry

Every run also writes a rich Markdown step summary in the Actions UI, and gh aw status reports fleet health at a glance. For centralized, cross-run visibility, the observability.otlp block “export[s] distributed traces from workflow runs to an OpenTelemetry Protocol (OTLP) compatible backend” (Frontmatter) — so agent runs appear in the same tracing tool as the rest of your systems.

When to inspect versus trust (human-in-the-loop)

You can't read every run of a busy fleet — nor should you. The skill is knowing which runs earn a look. Let the cheap signals (the overview table, the safe-outputs boundary, the threat-detection gate) carry the routine cases, and spend attention where the signal says something's off.

Inspect closely when…	Trust the guardrails when…
a run failed or timed out	it succeeded and produced expected safe outputs
tokens/cost spiked vs. the norm	cost is in the usual band
the firewall logged unexpected domains	egress stayed within the allowlist
you're rolling out a new or changed workflow	a stable workflow is running unchanged
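The table can be read as a simple decision rule. The sketch below encodes it as a shell function; triage_run, its argument order, and the 3x token-spike threshold are all assumptions chosen for illustration, and "unexpected domains" is approximated here by a count of denied domains that you supply yourself.

```shell
# Hypothetical triage rule mirroring the table above.
# Args: status, tokens, baseline tokens, denied-domain count, changed (yes/no).
# The 3x spike threshold is an arbitrary assumption; tune it to your fleet.
triage_run() (
  status="$1"
  tokens="$2"
  baseline="$3"
  denied="$4"
  changed="$5"
  if [ "$status" != "success" ]; then
    echo "inspect: run did not succeed"
  elif [ "$tokens" -gt $((baseline * 3)) ]; then
    echo "inspect: token spike"
  elif [ "$denied" -gt 0 ]; then
    echo "inspect: denied egress"
  elif [ "$changed" = "yes" ]; then
    echo "inspect: new or changed workflow"
  else
    echo "trust"
  fi
)
```

The checks run in order of severity, so a failed run is flagged as a failure even if it also spiked in cost.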

When not to

Don't skip observability because “it's working.” A silent fleet is not a healthy fleet — it's an unmonitored one. Glance at gh aw logs regularly even when nothing's on fire.
Don't debug from the model's chat alone. The artifacts — patch, safe-output JSON, firewall log — are ground truth; the agent's narration is not. Read the record, not the story.
Don't treat observability as a substitute for the guardrails. Seeing a bad action after the fact is no help if it already shipped. Logs and audit complement safe outputs and review gates; they don't replace them.
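Reading the record can start with two quick counts over downloaded artifacts. This is a minimal sketch under two assumptions: that safe_output.jsonl holds one JSON record per non-empty line (as the .jsonl extension conventionally implies), and that the patch uses git-style "diff --git" headers. Both helper names are hypothetical.

```shell
# Count proposed safe outputs: one JSON record per non-empty line (assumed).
# grep -c exits non-zero when the count is 0, so "|| true" keeps status 0.
count_proposals() {
  grep -c . "$1" || true
}

# Count files touched by a patch, assuming git-style "diff --git" headers.
count_patched_files() {
  grep -c '^diff --git ' "$1" || true
}
```

If the agent's narration claims one small change but these counts say otherwise, the artifacts win.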

Worked example: debugging a failed run from its logs

The Repo Assistant's nightly run failed. Here's the trace from “something's wrong” to root cause — three commands, no guessing.

1. Get the overview. Start broad to find the bad run and its ID:

The overview table surfaces the anomaly

gh aw logs repo-assistant
# RUN ID       WORKFLOW         STATUS   DURATION   TOKENS    COST
# 1234567890   repo-assistant   failure  4m12s      182,400   …
# 1234567889   repo-assistant   success  0m48s       12,100   …

Besides failing, the run burned about 15× the tokens of the healthy one (182,400 vs. 12,100) and ran about five times as long (4m12s vs. 0m48s): several signals pointing at the same run.
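The multiple comes straight from the two token counts in the table. A quick sketch of the arithmetic, with token_ratio as a hypothetical helper name:

```shell
# Ratio of one run's tokens to another's, to one decimal place.
# Assumes the second argument (the baseline) is non-zero.
token_ratio() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%.1f\n", a / b }'
}

token_ratio 182400 12100   # prints 15.1
```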

2. Audit that run. Let audit find the failing step and explain it:

A focused report that detects the error for you

gh aw audit 1234567890
# Downloads artifacts + logs, detects errors, analyzes MCP tool usage,
# and writes a concise Markdown report — including the first failing step
# and a Firewall Analysis of every domain the agent tried to reach.

Say the report shows the agent looping on a tool call to a domain the firewall denied — that explains both the failure and the token blow-up (it retried until timeout).

3. Confirm and fix. Pull the full artifacts if you need to read the raw exchange, then fix the cause. If the denied domain is one the workflow legitimately needs, add it to network.allowed (Chapter 7) and recompile:

Read the black box, then fix the workflow

gh aw logs repo-assistant --artifacts all   # agent-stdio.log, firewall log, patch…
# → root cause: egress to an un-allowed domain, retried to timeout
# fix: add the domain to network.allowed, then:
gh aw compile .github/workflows/repo-assistant.md

Recap & what's next

You can now see what your fleet does, and debug it when it misbehaves:

Observability is the precondition for trust — you can't govern what you can't see, and every run leaves a durable artifact trail.
gh aw logs gives the overview + artifacts (duration, tokens, cost; --artifacts to download more); gh aw audit gives a focused report that detects the failing step and analyzes tool/firewall use.
Run step summaries, gh aw status, and OpenTelemetry (observability.otlp) round out the picture.
Inspect the runs that signal trouble (failures, cost spikes, denied egress); trust the guardrails for the rest — but never let observability replace them.

What's next. Seeing cost is the first step; controlling it is the next. In Chapter 13: Governance & FinOps, we cap and meter agentic spend with max-ai-credits and set the org policy that keeps a fleet affordable and compliant.