The Lifecycle¶
Agents CLI is opinionated about one thing: the loop between "looks good in a notebook" and "live in production." This page is the map.
Watch a single investigation¶
Imagine an outage-recovery agent. It's been live for a week. A pager fires:
That investigation took 4.3 seconds. Nothing about the agent itself is unusual — most agent frameworks could express it. What's unusual is everything around it: the eval rubric that wouldn't have let it ship if it recommended a destructive remediation, the CI check that would have caught the runbook search returning the wrong section, the trace that lets you replay this exact investigation when something goes sideways tomorrow.
That's the loop.
Four CLI verbs on rotation¶
scaffold, eval, deploy, observe — on a rotation, forever. You write the spec; the loop catches what would have shipped, ships what passes, and shows you what happens next so the next iteration is smarter.
What goes wrong without it¶
Most agent demos stop at the prompt. You write a clever instruction, the model returns something that looks great in a notebook, and you screenshot it for the team. Production is where the failures show up: hallucinated answers, broken tools, misuse, and runaway cost.
| | Without the loop | With Agents CLI |
|---|---|---|
| Hallucinated remediation | Discovered customer-side, after the fact | Eval rubric blocks the PR before merge |
| Tool API change | 2 AM page, agent silently broken | CI integration test catches the schema drift |
| Production misuse | No replay, no telemetry | Cloud Trace + BigQuery analytics surface it within the hour |
| Cost spike from a chatty tool | Next month's bill is the alert | Per-tool span counts surface the loop in hours |
The eight phases¶
The loop expands to eight phases when you walk through it slowly. Each phase has an opinion encoded in a skill so your coding agent picks the right answer for you.
| # | Phase | What it does | CLI verb | Skill | Deep-dive |
|---|---|---|---|---|---|
| 0 | Spec | Write a DESIGN_SPEC.md. The other phases derive from this. | — | google-agents-cli-workflow | Development Guide |
| 1 | Scaffold | Turn the spec into a production-shaped project (~72 files). | scaffold create | google-agents-cli-scaffold | Templates |
| 2 | Build | Write the agent body: model, instruction, tools, App wrapper. | — | google-agents-cli-adk-code | Project Structure |
| 3 | Orchestrate | Compose specialists when one agent grows into a team. | — | google-agents-cli-adk-code | Project Structure |
| 4 | Evaluate | Score the agent against an evalset before every deploy. | eval run | google-agents-cli-eval | Evaluation |
| 5 | Deploy | Ship to Agent Runtime, Cloud Run, or GKE. | deploy | google-agents-cli-deploy | Deployment |
| 6 | Publish | Register with Gemini Enterprise so other agents can find this one. | publish | google-agents-cli-publish | CI/CD |
| 7 | Observe | Cloud Trace + BigQuery analytics; production data feeds tomorrow's evalset. | — | google-agents-cli-observability | Observability |
0 · Spec¶
A DESIGN_SPEC.md names the agent's tools, constraints, and success criteria. The whole rest of the lifecycle reads from it: the scaffold flags, the eval rubrics, the safety guardrails, the trace attributes you'll watch in production. Don't start from blank — browse Agent Garden for an existing template close to what you want, then customize.
A typical spec is one screen of markdown:
```markdown
# DESIGN_SPEC.md — outage-recovery-bot

## Tools

| Tool                             | Backing service  |
| -------------------------------- | ---------------- |
| `query_logs(service, severity)`  | Cloud Logging    |
| `check_metrics(service, metric)` | Cloud Monitoring |
| `search_runbook(query)`          | Vector Search    |

## Constraints

1. Always cite the runbook section consulted.
2. Never recommend a destructive remediation unless the runbook
   explicitly sanctions it for the observed symptom.

## Success criteria

- ≥ 80% of incidents get a diagnosis whose root cause matches ground truth
- 100% of recommendations cite a runbook section
- 0 destructive recommendations without runbook sanction
```
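The three success criteria are concrete enough to check mechanically. Here is a minimal sketch, assuming a hypothetical `IncidentResult` record that captures the outcome for one incident. The class and the function name are illustrative and are not part of Agents CLI.

```python
from dataclasses import dataclass


@dataclass
class IncidentResult:
    """Hypothetical record of one evaluated incident (not an Agents CLI type)."""
    root_cause_correct: bool     # diagnosis matched ground truth
    cited_runbook_section: bool  # recommendation cites a runbook section
    destructive: bool            # recommendation is a destructive remediation
    runbook_sanctioned: bool     # runbook explicitly sanctions that remediation


def meets_success_criteria(results: list[IncidentResult]) -> dict[str, bool]:
    """Check the spec's three success criteria against a batch of results."""
    total = len(results)
    if total == 0:
        raise ValueError("need at least one result")
    correct = sum(r.root_cause_correct for r in results)
    cited = sum(r.cited_runbook_section for r in results)
    unsanctioned = sum(r.destructive and not r.runbook_sanctioned for r in results)
    return {
        "diagnosis_accuracy": correct / total >= 0.80,
        "citation_coverage": cited == total,
        "no_unsanctioned_destructive": unsanctioned == 0,
    }
```

Returning one boolean per criterion, rather than a single pass/fail, keeps it obvious which line of the spec a failing batch violated.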
1 · Scaffold¶
One command takes the spec and emits the project: agent code, tests, eval boilerplate, Terraform, CI/CD workflows, deployment manifests. The flags aren't gratuitous — each one expands or contracts the scaffold to match the lifecycle you've signed up for.
The full setup ships ~72 files across agent code, eval boilerplate, Terraform, GitHub Actions workflows, and deploy manifests. Trim it down by skipping pieces you don't need. See Templates for the full list.
2 · Build¶
Every ADK agent boils down to four ingredients: a model, an instruction, a list of tools, and an App that wraps them. The body is barely 30 lines of meaningful code — the interesting work happens inside the tools.
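```python
from google.adk.agents import Agent
from google.adk.apps import App
from google.adk.models import Gemini

# query_logs, check_metrics, and search_runbook are the tool functions
# named in DESIGN_SPEC.md; they are defined elsewhere in the project.
root_agent = Agent(
    name="root_agent",
    model=Gemini(model="gemini-flash-latest"),
    instruction="You are an SRE outage-recovery assistant...",
    tools=[query_logs, check_metrics, search_runbook],
)

# The App wraps the root agent.
app = App(root_agent=root_agent, name="app")
```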
You're not locked to Gemini — swap the model line for any provider supported by ADK (Model Garden covers Anthropic Claude, OpenAI GPT, and others). The rest of the lifecycle behaves the same regardless.
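Since the interesting work lives in the tools, here is a minimal sketch of what `search_runbook` might look like. ADK accepts plain Python functions as tools; everything else is an assumption. The in-memory `RUNBOOK` dict stands in for the Vector Search backend named in the spec, and the keyword-overlap scoring, section titles, and return shape are hypothetical.

```python
# Hypothetical in-memory runbook; a real tool would query Vector Search.
RUNBOOK = {
    "4.2 Connection pool exhaustion": "Raise the pool size and restart the affected replicas one at a time.",
    "7.1 Disk pressure": "Rotate logs and expand the persistent volume.",
}


def search_runbook(query: str) -> dict:
    """Return the runbook section that best matches the query.

    The section title is returned alongside the text so the agent can
    cite it, as the spec's first constraint requires.
    """
    words = set(query.lower().split())
    best_title, best_score = None, 0
    for title, text in RUNBOOK.items():
        haystack = set((title + " " + text).lower().split())
        score = len(words & haystack)  # naive keyword overlap
        if score > best_score:
            best_title, best_score = title, score
    if best_title is None:
        return {"status": "not_found", "section": None, "text": ""}
    return {"status": "ok", "section": best_title, "text": RUNBOOK[best_title]}
```

Returning an explicit `not_found` status instead of raising gives the model something it can reason about when no section matches.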
Stateful agents reach for two more pieces of Agent Platform:
- Managed session storage for conversation state that survives restarts and scales horizontally. Pick it at scaffold time via `--session-type agent_platform_sessions` instead of the in-memory default.
- Memory Bank for long-term memory across sessions (the SRE bot recognizing "this looks like that incident from last quarter"). Wire it in via `from google.adk.memory import VertexAiMemoryBankService` and the agent gets a persistent store keyed to user, session, or app.
For workflows that don't fit in a single HTTP request — long investigations, multi-step batch jobs — Agent Runtime persists the agent's state so a deploy or restart doesn't lose progress.
Here's the same agent body answering a different incident, end-to-end:
3 · Orchestrate¶
The single-agent body works while the problem is small. Real production agents grow into teams — an orchestrator that routes work to a handful of specialists, each with its own narrow tool surface.
Splitting helps for three reasons that show up in eval, deploy, and observe: smaller prompts make each agent more reliable, separate tool surfaces let you apply per-agent guardrails, and the trace tells you exactly which sub-agent took the bad turn.
When the team needs to span processes — or call agents your team doesn't own — use the A2A protocol as the wire format. Scaffold with --agent adk_a2a and any A2A-compatible agent (built with Agents CLI or not) can call yours, and yours can call theirs.
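The per-agent guardrail idea can be sketched without any framework. The following is a toy illustration, not ADK's orchestration API: the `Specialist` class, the keyword router, and the specialist names are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Specialist:
    """Hypothetical specialist with a narrow tool surface."""
    name: str
    keywords: set[str]
    allowed_tools: set[str]
    calls: list[str] = field(default_factory=list)

    def call_tool(self, tool: str) -> str:
        # Guardrail: a specialist can only reach tools on its own allowlist.
        if tool not in self.allowed_tools:
            raise PermissionError(f"{self.name} may not call {tool}")
        self.calls.append(tool)  # per-specialist record of what was called
        return f"{self.name} ran {tool}"


def route(request: str, specialists: list[Specialist]) -> Specialist:
    """Pick the specialist whose keywords overlap the request the most."""
    words = set(request.lower().split())
    best = max(specialists, key=lambda s: len(words & s.keywords))
    if not words & best.keywords:
        raise LookupError("no specialist matches this request")
    return best


logs = Specialist("log_analyst", {"logs", "errors"}, {"query_logs"})
metrics = Specialist("metrics_analyst", {"latency", "cpu"}, {"check_metrics"})
```

With this shape, `route("why is latency high", [logs, metrics])` picks `metrics_analyst`, and `logs.call_tool("check_metrics")` raises `PermissionError`. The `calls` list plays the role the trace plays in production: it records which specialist touched which tool.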
4 · Evaluate¶
This is the phase most agent demos skip. agents-cli eval run can execute your evalset against the live agent, ask an LLM judge to score each response against a rubric, and give you a number you can defend.
Expect 5–10+ iterations of this loop. Every fix nudges the score, you re-run, you ship when it crosses the threshold. Below: the four failure modes the rubrics catch most often.
See the Evaluation Guide for the full schema and rubric reference.
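The gate at the end of each iteration reduces to a threshold check. Here is a minimal sketch, assuming a hypothetical `judge` callable that stands in for the LLM judge. The rubric wording, the 0.8 threshold, and the function names are illustrative and do not reflect the Agents CLI evalset schema.

```python
from typing import Callable

# Hypothetical judge signature: (response, rubric) -> True if the rubric passes.
Judge = Callable[[str, str], bool]


def pass_rate(responses: list[str], rubrics: list[str], judge: Judge) -> float:
    """Fraction of responses that satisfy every rubric."""
    if not responses:
        raise ValueError("evalset is empty")
    passed = sum(all(judge(resp, rubric) for rubric in rubrics) for resp in responses)
    return passed / len(responses)


def should_ship(score: float, threshold: float = 0.8) -> bool:
    """Ship only when the score reaches the threshold."""
    return score >= threshold


def keyword_judge(response: str, rubric: str) -> bool:
    """Stand-in judge: the rubric is a phrase the response must contain."""
    return rubric.lower() in response.lower()
```

A real judge would be a model call, but keeping it behind a plain callable means the gating logic can be unit-tested with a deterministic stand-in like `keyword_judge`.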
5 · Deploy¶
The same agent code can land in three different places: Agent Runtime, Cloud Run, or GKE. agents-cli deploy dispatches based on the target you scaffolded with. Use --dry-run to preview the pipeline before shipping:
```shell
agents-cli deploy --dry-run   # preview the pipeline
agents-cli deploy             # ship it
agents-cli deploy --no-wait   # return immediately; check later with --status
```
Each target inherits the surrounding production primitives:
- Per-agent service account: opt in with `agents-cli deploy --agent-identity`, and the deployed agent runs as its own GCP identity. Scope what it can actually call (which BigQuery datasets, which buckets, which APIs) with normal IAM. The eval rubrics that block destructive remediations have a fallback: the agent literally can't `kubectl delete` if its identity isn't allowed to.
- Identity-Aware Proxy (IAP): gate a Cloud Run deploy behind your Google Workspace SSO with the `--iap` flag. Internal-only agents stop being a public-internet concern.
- Workload Identity Federation: the scaffolded `pr_checks.yaml` authenticates GitHub Actions to GCP via WIF, so no service-account keys live in your repo.
See Deployment for full per-target walkthroughs.
6 · Publish¶
Deploying the agent makes it reachable at a URL. Publishing is the separate step that lists it in Gemini Enterprise so other agents (or humans browsing the catalog) can actually find it.
Two registration modes: ADK (publishes a deployed Agent Runtime instance) and A2A (publishes an A2A-compatible HTTP endpoint, no ADK required — works with agents built on any framework).
7 · Observe¶
Once the agent is live, every invocation emits a Cloud Trace span. Every tool call, model generation, and sub-agent handoff is visible.
Observability catches what evaluation missed: regressions, cost spikes from chatty tools, and users bypassing safety prompts. With --bq-analytics turned on at scaffold time, every prompt and response also lands in BigQuery for offline analysis.
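As a rough illustration of how per-tool span counts expose a chatty tool, the sketch below counts tool spans per invocation. The span dictionaries and their `trace_id` and `tool` keys are a hypothetical export shape, not the Cloud Trace or BigQuery schema, and the default limit of 5 calls is arbitrary.

```python
from collections import Counter


def chatty_tools(spans: list[dict], max_calls_per_trace: int = 5) -> dict[str, int]:
    """Return tools that exceeded the per-trace call limit, with their worst count."""
    # Count how many times each tool was called within each trace.
    per_trace = Counter((s["trace_id"], s["tool"]) for s in spans)
    worst: dict[str, int] = {}
    for (_, tool), count in per_trace.items():
        if count > max_calls_per_trace:
            worst[tool] = max(worst.get(tool, 0), count)
    return worst
```

Counting per trace rather than globally matters: a tool called three times in each of a thousand invocations is healthy, while one called forty times in a single invocation is probably stuck in a loop.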
The same data closes the loop: production traffic feeds tomorrow's evalset. Eval scores get re-computed continuously, so regressions surface in days, not months.
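Feeding production traffic back into the evalset can be as simple as filtering and deduplicating logged interactions. The sketch below assumes a hypothetical record shape (`prompt`, `response`, `flagged`) and a hypothetical eval-case shape. Neither is the BigQuery analytics schema nor the Agents CLI evalset format.

```python
def to_eval_cases(interactions: list[dict]) -> list[dict]:
    """Turn flagged production interactions into candidate eval cases.

    Only interactions marked as flagged are kept, and repeats of the same
    prompt are collapsed so the evalset does not over-weight one failure.
    """
    seen: set[str] = set()
    cases = []
    for item in interactions:
        prompt = item["prompt"].strip()
        if not item.get("flagged") or prompt in seen:
            continue
        seen.add(prompt)
        cases.append({"input": prompt, "bad_response": item["response"]})
    return cases
```

The output is only a list of candidates; someone still has to write the expected behavior for each case before it can score anything.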
See Observability for the full setup.
Two ways to drive it¶
Where to dig deeper¶
- Templates — full list of scaffold templates (`adk`, `adk_a2a`, `agentic_rag`, …)
- Project Structure — what each generated file does
- Development Guide — day-to-day workflow
- Evaluation Guide — evalset schema, rubrics, the eval-fix loop
- Deployment — per-target walkthroughs
- CI/CD & Production — the full PR-to-prod path
- Observability — Cloud Trace, BigQuery analytics, third-party tools
- CLI Reference — every command and flag