1 change: 1 addition & 0 deletions .gitignore
@@ -36,3 +36,4 @@ go.work.sum
.vscode/

hack/keycloak-certs
/genmcp
18 changes: 18 additions & 0 deletions examples/netedge-tools/README.md
@@ -0,0 +1,18 @@
NETEDGE gen-mcp examples
========================

This directory contains the NETEDGE gen-mcp example toolset. Documentation now
lives under `docs/`; start with the canonical notes for build/run details and
integration guidance.

Docs
- [`docs/NETEDGE-GEN-MCP-NOTES.md`](docs/NETEDGE-GEN-MCP-NOTES.md) — canonical notes covering setup, runtime tips, and roadmap.
- [`docs/NET_DIAGNOSTIC_SCENARIOS.md`](docs/NET_DIAGNOSTIC_SCENARIOS.md) — ingress and DNS failure scenarios for agents.
- [`docs/README-netedge-break-repair.md`](docs/README-netedge-break-repair.md) — break/repair script usage for staging scenarios.

Key files
- `mcpfile.yaml` — curated MCP tool definitions used by these examples.
- `netedge-break-repair.sh` — helper script that stages the documented scenarios.
- `scripts/exec_dns_in_pod.sh` — helper invoked by the `exec_dns_in_pod` MCP tool.

Update the canonical notes if you need to expand or refine the documentation.
113 changes: 113 additions & 0 deletions examples/netedge-tools/docs/NETEDGE-GEN-MCP-NOTES.md
@@ -0,0 +1,113 @@
# NETEDGE — gen-mcp Notes

Purpose
-------
A single concise reference for the NETEDGE Phase‑0 gen‑mcp tooling: what’s included,
how to build and run, key assumptions, and short next‑step ideas.

What this directory contains
- `mcpfile.yaml` — curated NETEDGE MCP tools using the `stdio` transport.
- `docs/` — documentation (this file, scenario catalog, break/repair guide).
- `netedge-break-repair.sh` — script that stages documented ingress scenarios.
- `scripts/exec_dns_in_pod.sh` — helper invoked by the `exec_dns_in_pod` tool.

Quick summary of provided tools
- `inspect_route` — fetch a `Route` and, when possible, its `Service` and `Endpoints`.
- `get_service_endpoints` — return an Endpoints object for a Service.
- `query_prometheus` — run a Prometheus `query_range` and return JSON.
- `get_coredns_config` — fetch a ConfigMap (e.g., CoreDNS `Corefile`).
- `probe_dns_local` — run `dig`/`nslookup` on the gen‑mcp host (probe from the host).
- `exec_dns_in_pod` — run a short ephemeral pod that executes `dig` inside the cluster.
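Conceptually, these read-only tools are thin wrappers over commands an operator would run by hand. A hedged sketch of the rough equivalents (resource names, hostnames, and the Prometheus URL are all illustrative):

```bash
# Rough manual equivalents of the Phase-0 tools; names are illustrative.

# inspect_route / get_service_endpoints: walk Route -> Service -> Endpoints
oc get route hello -n test-ingress -o json
oc get svc hello -n test-ingress -o json
oc get endpoints hello -n test-ingress -o json

# get_coredns_config: fetch the CoreDNS Corefile ConfigMap (OpenShift default)
oc get configmap dns-default -n openshift-dns -o yaml

# probe_dns_local: resolve from the gen-mcp host
dig +short hello-test-ingress.apps.example.com

# query_prometheus: a query_range call against the Prometheus HTTP API
curl -sG "https://prometheus.example.com/api/v1/query_range" \
  --data-urlencode 'query=up' \
  --data-urlencode "start=$(date -d '-1 hour' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'   # start/end use GNU date
```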

DEV NOTES — build & run
-----------------------
- Build (recommended):

```bash
# builds server helper binaries and the CLI per the repo Makefile
make build
```

- Run (foreground):

```bash
./genmcp run -f examples/netedge-tools/mcpfile.yaml
```

- Run (detached/background):

```bash
./genmcp run -f examples/netedge-tools/mcpfile.yaml -d
./genmcp stop -f examples/netedge-tools/mcpfile.yaml
```

Integration with Codex CLI
--------------------------
The NETEDGE tools use the `stdio` transport, so Codex CLI can launch the server
directly. Example `config.toml` snippet:

```toml
[mcp_servers.netedge]
command = "/absolute/path/to/genmcp"
args = ["run", "-f", "examples/netedge-tools/mcpfile.yaml"]
```

Codex spawns the command in STDIO mode; no HTTP proxy is required. Adjust the path
to `genmcp` for your local checkout.

Key assumptions and caveats
--------------------------
- Most Phase‑0 tools use the `cli` invoker (they shell out). The MCP server machine
needs `oc` or `kubectl` plus DNS tooling (`dig` or `nslookup`). `jq` or `python3`
help when pretty-printing JSON from `inspect_route`.
- `query_prometheus` uses the HTTP invoker and requires network access to the target
Prometheus endpoint.
- `exec_dns_in_pod` pulls `registry.redhat.io/openshift4/network-tools-rhel9:latest`;
replace with an approved image if your cluster restricts external pulls.
- Template notes: when writing CLI `command` templates, each `{param}` must appear
  exactly once. If a parameter must be used multiple times, assign it once to a shell
  variable inside the command and reuse that variable (see the sketch below). The repo
  validator counts the `"%s"` placeholders used during formatting.

We could...
-----------
- add a single aggregator HTTP endpoint (e.g. `/diagnose`) that implements the full
`diagnose_route` playbook and returns structured JSON so agents call one concise tool.
- implement native `k8s` and `prometheus` invokers in `pkg/invocation` so tools use
`client-go` and HTTP clients (robust in‑cluster auth) instead of shelling out.
- include a tiny `mcp-remote` adapter in the repo so Codex users can reproduce the
Codex integration locally without a separate tool.
- add safe remediation tools behind approval gates (preview/dry‑run + action IDs +
rollback tokens) and integrate with audit logs for traceability.

Phased roadmap
--------------
Phase 0 — Quick wins
- Provide read-only aggregation tools and lightweight probes so agents can collect
immediate evidence without custom code. Example tasks:
- `inspect_route`, `get_service_endpoints`, `get_coredns_config`, `query_prometheus`.
- `probe_dns_local` and `exec_dns_in_pod` using ephemeral pod runs.
- Deliver a curated `mcpfile.yaml` (already present) and clear DEV notes for running
the server locally or via the Makefile.

Phase 1 — Probing and aggregation
- Add active probing and standardized aggregation:
- Implement `probe_http`, `probe_dns` (multiple transports), `probe_endpoints`.
  - Build an aggregator HTTP endpoint (e.g. `/diagnose`) that executes the
    `diagnose_route` playbook, runs parallel probes, and returns structured JSON
    (checks, probes, root causes, recommended actions); a possible response shape
    is sketched below.
- Improve ephemeral pod lifecycle and add better result correlation (logs, metrics).
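
Purely as a sketch of the idea, not an implemented schema, such an endpoint might be called and answer along these lines (URL, parameters, and every field name are assumptions):

```bash
# Hypothetical call; /diagnose is not implemented yet.
curl -s "http://localhost:8080/diagnose?route=hello&namespace=test-ingress"
# A response might look like (all field names are assumptions):
# {
#   "checks":  [{"name": "route_admitted", "status": "pass"}],
#   "probes":  [{"name": "dns_lookup", "status": "fail", "detail": "NXDOMAIN"}],
#   "rootCauses": ["route host has no DNS record"],
#   "recommendedActions": ["point spec.host at the default apps domain"]
# }
```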

Phase 2 — Safe remediation (human-in-the-loop)
- Expose guarded remediation actions with audit and rollback support:
  - `preview_apply_corefile`, `apply_corefile`, `patch_route`, `scale_backend` with
    dry-run previews and an `action-id` for traceability (see the sketch below).
- Integrate RBAC and scope-based filtering so tools are gated by required scopes.
- Add approval workflows and automatic rollback tokens for safe operator-driven fixes.
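
None of this exists yet; as one possible shape for the preview step, server-side dry-run (which `oc patch` already supports) could produce the artifact that an `action-id` later references:

```bash
# Hypothetical preview flow: validate a Corefile patch without persisting it.
# The action-id scheme is illustrative; no gating mechanism exists yet.
action_id=$(uuidgen)
oc -n openshift-dns patch configmap dns-default \
  --type merge --patch-file corefile-patch.yaml \
  --dry-run=server -o yaml > "preview-${action_id}.yaml"
echo "Preview written to preview-${action_id}.yaml (action-id: ${action_id})"
```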

These phases are incremental: each phase builds on the previous to increase
automation while keeping safety and auditability central.

Location
--------
This file is the canonical NETEDGE notes for the gen‑mcp examples: `examples/netedge-tools/docs/NETEDGE-GEN-MCP-NOTES.md`.
113 changes: 113 additions & 0 deletions examples/netedge-tools/docs/NET_DIAGNOSTIC_SCENARIOS.md
@@ -0,0 +1,113 @@
# Network Diagnostic Scenarios for Ingress & DNS

This document describes candidate **network breakage scenarios** for use in an AI agent diagnostic project on OpenShift 4.19. The scenarios are designed to be:
- Within the scope of the **Network Ingress and DNS team**
- Detectable using only **Phase 0** diagnostic tools
- Safe — the cluster remains fully accessible remotely

The [`../netedge-break-repair.sh`](../netedge-break-repair.sh) helper script can
stage each scenario end-to-end; see [`README-netedge-break-repair.md`](./README-netedge-break-repair.md)
for usage.

---

## 1. Route → Service Selector Mismatch (Empty Endpoints)

### Why it's in-scope & interesting
- Tests the **Route → Service → Endpoints chain** that the ingress team owns.
- Realistic: label drift or bad selectors happen often and produce 503s.
- 100% reversible by flipping a single label or selector.
- No impact on control-plane or kubeadmin connectivity.

### How to stage it
1. Create a test namespace, deploy a small app (e.g., `nginx`), expose it as a Service and Route.
2. **Break**: modify the Service `spec.selector` to a label that no pod has (or relabel pods).
3. **Repair**: restore the correct selector/pod labels.
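
A hedged end-to-end sketch of those staging steps (names are illustrative; the break/repair script automates the same flow):

```bash
# Stage the baseline (illustrative names).
oc new-project test-ingress
oc create deployment hello \
  --image=quay.io/openshift/origin-hello-openshift:latest --port=8080
oc expose deployment hello --port=8080   # Service
oc expose service hello                  # Route

# Break: point the Service selector at a label no pod carries.
oc patch svc hello --type merge -p '{"spec":{"selector":{"app":"does-not-exist"}}}'

# Repair: restore the selector the Deployment's pods actually carry.
oc patch svc hello --type merge -p '{"spec":{"selector":{"app":"hello"}}}'
```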

### How a human would diagnose
1. `curl` the Route → 503 from the router.
2. `oc get route -o json` → identify `.spec.to.name`.
3. `oc get svc -o json` → inspect selector.
4. `oc get endpoints -o json` → find empty subsets.
5. Compare Service selector vs Pod labels, fix mismatch.
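
The same walk condensed into commands, assuming the names from the staging sketch above (`jq` optional but handy):

```bash
# 2: find the backing Service named by the Route
oc get route hello -n test-ingress -o jsonpath='{.spec.to.name}{"\n"}'
# 3: inspect the Service selector
oc get svc hello -n test-ingress -o jsonpath='{.spec.selector}{"\n"}'
# 4: empty subsets here confirm the break
oc get endpoints hello -n test-ingress -o json | jq '.subsets'
# 5: compare with the labels actually on the pods
oc get pods -n test-ingress --show-labels
```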

### How the phase-0 agent would diagnose
- `inspect_route` → surface Route, Service, Endpoints, detect empty endpoints.
- `get_service_endpoints` → verify endpoints are empty.
- Optionally `query_prometheus` → check router 503 metrics.

### Agent prompt for this scenario:
- `On the currently connected cluster, we've deployed an app and exposed it through a Route, but it’s not working. Diagnose the root cause and suggest the fix.`

---

## 2. Route Host Without DNS Record (NXDOMAIN)

### Why it's in-scope & interesting
- Tests the DNS ↔ Ingress seam.
- Common misconfiguration: developer sets `spec.host` to `myapp.example.com` without DNS.
- Reversible: fix by using default domain or adding DNS record.

### How to stage it
1. Create a Route with `spec.host` set to `nonexistent.example.test`.
2. **Break**: leave DNS unconfigured.
3. **Repair**: update Route host to valid admitted domain or create DNS record.
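
A hedged sketch of staging and verifying this scenario; updating `spec.host` on an existing Route requires the `routes/custom-host` permission, and all names here are illustrative:

```bash
# Stage/break: a Route whose host has no DNS record.
oc create route edge hello --service=hello \
  --hostname=nonexistent.example.test -n test-ingress

# Confirm the symptom from the workstation.
dig nonexistent.example.test          # expect status: NXDOMAIN
nslookup nonexistent.example.test || true

# Repair: move the host under the cluster's default apps domain.
apps_domain=$(oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}')
oc patch route hello -n test-ingress --type merge \
  -p "{\"spec\":{\"host\":\"hello-test-ingress.${apps_domain}\"}}"
```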

### How a human would diagnose
1. `dig` / `nslookup` → NXDOMAIN.
2. `oc get route` → host is admitted, but unreachable.
3. Conclude DNS misconfiguration.

### How the phase-0 agent would diagnose
- `inspect_route` → check host, verify backend chain is healthy.
- `probe_dns_local` → show NXDOMAIN.
- Optionally `exec_dns_in_pod` → in-cluster resolution check.

### Agent prompt for this scenario:
- `On the currently connected cluster, the route's hostname never resolves in DNS even though the service and pods look healthy. Diagnose the root cause and suggest the fix.`

---

## 3. NetworkPolicy Blocking Router → Service Traffic

### Why it's in-scope & interesting
- Tests namespace isolation affecting ingress traffic.
- Real-world: default-deny without allow for router.
- Reversible: apply/remove single NetworkPolicy.

### How to stage it
1. Deploy app in test namespace.
2. **Break**: apply default-deny NetworkPolicy.
3. **Repair**: remove it or add allow for ingress pods.
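
A hedged sketch with the default-deny policy inlined as a heredoc (namespace assumed from the earlier examples):

```bash
# Break: default-deny ingress for every pod in the namespace.
oc apply -n test-ingress -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF

# Repair: delete the policy (or add an allow rule for the router pods instead).
oc delete networkpolicy default-deny-ingress -n test-ingress
```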

### How a human would diagnose
1. Route requests hang or 503.
2. Endpoints are healthy.
3. `oc get networkpolicy` → find default-deny.

### Caveat for phase-0 agent
- No built-in `get NetworkPolicy` in phase 0.
- Could infer by symptoms (503 + healthy endpoints) and escalate.

### Agent prompt for this scenario:
- `On the currently connected cluster every request to the Route now times out even though the pods and service look healthy. Diagnose the root cause and suggest the fix.`

---

## Recommended Scenario for v1: Selector Mismatch

**Why**:
- Fully covered by existing Phase 0 tools.
- Deterministic, scriptable, and reversible.
- Teaches canonical ingress debugging (Route → Service → Endpoints).

### Human/Agent Diagnostic Flow
1. **Route → Service**
`inspect_route` → see Service + Endpoints. Expect empty endpoints.
2. **Endpoints inspection**
`get_service_endpoints` → confirm no addresses.
3. **Form hypothesis**
Selector mismatch or zero pods.
4. **Repair**
Fix label or selector. Endpoints repopulate.
5. **Quantify**
`query_prometheus` for router 503s before/after.
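
For step 5, one possible instant query against the router metrics (the metric name and Prometheus URL are assumptions; verify against your cluster):

```bash
# Router-reported 5xx rate over the last 5 minutes (metric name assumed).
curl -sG "https://prometheus.example.com/api/v1/query" \
  --data-urlencode 'query=sum(rate(haproxy_server_http_responses_total{code="5xx"}[5m]))'
```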
75 changes: 75 additions & 0 deletions examples/netedge-tools/docs/README-netedge-break-repair.md
@@ -0,0 +1,75 @@
# NetEdge Break/Repair Script

`netedge-break-repair.sh` stages, breaks, and repairs the deterministic ingress and DNS scenarios captured in [`NET_DIAGNOSTIC_SCENARIOS.md`](./NET_DIAGNOSTIC_SCENARIOS.md). It deploys a minimal application stack (Deployment, Service, Route), introduces the chosen fault, and restores the healthy baseline when asked.

## Prerequisites

- `oc` CLI available in `$PATH`
- `envsubst` (from GNU `gettext`) available for templating manifests
- Credentials that allow creating resources in the target namespace
- Ability to pull `quay.io/openshift/origin-hello-openshift:latest` (default demo image; override with `IMAGE` if restricted)

## Basic Usage

```bash
examples/netedge-tools/netedge-break-repair.sh [--scenario=<1|2|3>] <action>
```

Actions:

- `--setup` – Deploy the healthy baseline (Deployment, Service, Route)
- `--break` – Apply the scenario-specific failure
- `--repair` – Restore the healthy state for the scenario
- `--status` – Show Route, Service, Endpoints (and NetworkPolicy) details
- `--curl` – Curl the current Route host (best effort)
- `--cleanup` – Remove the created resources and any managed NetworkPolicy

If `--scenario` is omitted the script defaults to scenario **1**. Always reuse the same `--scenario=N` flag on follow-up commands; the script prints a reminder after each action.

## Scenarios

1. **Route → Service selector mismatch** – Patches the Service selector to a non-matching value so no endpoints remain (router returns 503). Repair restores the correct selector.
- Agent prompt: “On the current cluster we exposed an app through a Route but it keeps failing. Diagnose the root cause and tell me how to fix it.”
2. **Route host without DNS record** – Stores the original host, then patches `spec.host` to an NXDOMAIN value. Repair restores the saved host from annotation.
- Agent prompt: “On the current cluster the Route’s hostname never resolves in DNS even though the Service and Pods look healthy. Diagnose the root cause and tell me how to fix it.”
3. **NetworkPolicy blocking router traffic** – Applies a default-deny ingress NetworkPolicy in the namespace. Repair deletes the policy.
- Agent prompt: “On the current cluster every request to the Route now times out even though the pods and service look healthy. Diagnose the root cause and tell me how to fix it.”

## Environment Overrides

Export any of these before running the script to change defaults:

- `NAMESPACE` – Target namespace (default: `test-ingress`)
- `APP_NAME` – Base name for Deployment/Service/Route (default: `hello`)
- `APP_LABEL` – Label shared by Deployment and Service (default: `hello`)
- `IMAGE` – Demo image (default: `quay.io/openshift/origin-hello-openshift:latest`)
- `PORT` – Container and Service port (default: `8080`)
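
Overrides are ordinary environment variables on the invocation; for example (values illustrative):

```bash
NAMESPACE=edge-demo APP_NAME=web PORT=8081 \
  examples/netedge-tools/netedge-break-repair.sh --scenario=1 --setup
```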

## Example Workflow

```bash
# Scenario 1: selector mismatch
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --setup
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --break
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --status
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --repair

# Scenario 2: NXDOMAIN host
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --setup
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --break
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --curl
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --repair

# Scenario 3: NetworkPolicy block
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --setup
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --break
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --repair
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --cleanup
```

## Notes

- The script refreshes the Route host from the API after each action so `--curl` always uses the current value.
- Scenario 2 stores the original host in the `netedge-tools-original-host` annotation on the Route; avoid deleting this annotation if you plan to run `--repair`.
- Scenario 3 leaves the namespace intact but removes the managed NetworkPolicy during cleanup.
- If `oc` cannot reach the cluster, commands fail early with diagnostics.