1 change: 1 addition & 0 deletions .gitignore
@@ -36,3 +36,4 @@ go.work.sum
.vscode/

hack/keycloak-certs
/genmcp
18 changes: 18 additions & 0 deletions examples/netedge-tools/README.md
@@ -0,0 +1,18 @@
NETEDGE gen-mcp examples
========================

This directory contains the NETEDGE gen-mcp example toolset. Documentation now
lives under `docs/`; start with the canonical notes for build/run details and
integration guidance.

Docs
- [`docs/NETEDGE-GEN-MCP-NOTES.md`](docs/NETEDGE-GEN-MCP-NOTES.md) — canonical notes covering setup, runtime tips, and roadmap.
- [`docs/NET_DIAGNOSTIC_SCENARIOS.md`](docs/NET_DIAGNOSTIC_SCENARIOS.md) — ingress and DNS failure scenarios for agents.
- [`docs/README-netedge-break-repair.md`](docs/README-netedge-break-repair.md) — break/repair script usage for staging scenarios.

Key files
- `mcpfile.yaml` — curated MCP tool definitions used by these examples.
- `netedge-break-repair.sh` — helper script that stages the documented scenarios.
- `scripts/exec_dns_in_pod.sh` — helper invoked by the `exec_dns_in_pod` MCP tool.

Update the canonical notes if you need to expand or refine the documentation.
113 changes: 113 additions & 0 deletions examples/netedge-tools/docs/NETEDGE-GEN-MCP-NOTES.md
@@ -0,0 +1,113 @@
# NETEDGE — gen-mcp Notes

Purpose
-------
A single concise reference for the NETEDGE Phase‑0 gen‑mcp tooling: what’s included,
how to build and run, key assumptions, and short next‑step ideas.

What this directory contains
- `mcpfile.yaml` — curated NETEDGE MCP tools using the `stdio` transport.
- `docs/` — documentation (this file, scenario catalog, break/repair guide).
- `netedge-break-repair.sh` — script that stages documented ingress scenarios.
- `scripts/exec_dns_in_pod.sh` — helper invoked by the `exec_dns_in_pod` tool.

Quick summary of provided tools
- `inspect_route` — fetch a `Route` and, when possible, its `Service` and `Endpoints`.
- `get_service_endpoints` — return an Endpoints object for a Service.
- `query_prometheus` — run a Prometheus `query_range` and return JSON.
- `get_coredns_config` — fetch a ConfigMap (e.g., CoreDNS `Corefile`).
- `probe_dns_local` — run `dig`/`nslookup` on the gen‑mcp host (probe from the host).
- `exec_dns_in_pod` — run a short ephemeral pod that executes `dig` inside the cluster.
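Conceptually, these read-only tools are thin wrappers over commands an operator would run by hand. A hedged sketch of the rough equivalents (resource names, hostnames, and the Prometheus URL are all illustrative):

```bash
# Rough manual equivalents of the Phase-0 tools; names are illustrative.

# inspect_route / get_service_endpoints: walk Route -> Service -> Endpoints
oc get route hello -n test-ingress -o json
oc get svc hello -n test-ingress -o json
oc get endpoints hello -n test-ingress -o json

# get_coredns_config: fetch the CoreDNS Corefile ConfigMap (OpenShift default)
oc get configmap dns-default -n openshift-dns -o yaml

# probe_dns_local: resolve from the gen-mcp host
dig +short hello-test-ingress.apps.example.com

# query_prometheus: a query_range call against the Prometheus HTTP API
curl -sG "https://prometheus.example.com/api/v1/query_range" \
  --data-urlencode 'query=up' \
  --data-urlencode "start=$(date -d '-1 hour' +%s)" \
  --data-urlencode "end=$(date +%s)" \
  --data-urlencode 'step=60'   # start/end use GNU date
```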

DEV NOTES — build & run
-----------------------
- Build (recommended):

```bash
# builds server helper binaries and the CLI per the repo Makefile
make build
```

- Run (foreground):

```bash
./genmcp run -f examples/netedge-tools/mcpfile.yaml
```

- Run (detached/background):

```bash
./genmcp run -f examples/netedge-tools/mcpfile.yaml -d
./genmcp stop -f examples/netedge-tools/mcpfile.yaml
```

Integration with Codex CLI
--------------------------
The NETEDGE tools use the `stdio` transport, so Codex CLI can launch the server
directly. Example `config.toml` snippet:

```toml
[mcp_servers.netedge]
command = "/absolute/path/to/genmcp"
args = ["run", "-f", "examples/netedge-tools/mcpfile.yaml"]
```

Codex spawns the command in STDIO mode; no HTTP proxy is required. Adjust the path
to `genmcp` for your local checkout.

Key assumptions and caveats
--------------------------
- Most Phase‑0 tools use the `cli` invoker (they shell out). The MCP server machine
needs `oc` or `kubectl` plus DNS tooling (`dig` or `nslookup`). `jq` or `python3`
help when pretty-printing JSON from `inspect_route`.
- `query_prometheus` uses the HTTP invoker and requires network access to the target
Prometheus endpoint.
- `exec_dns_in_pod` pulls `registry.redhat.io/openshift4/network-tools-rhel9:latest`;
replace with an approved image if your cluster restricts external pulls.
- Template notes: when writing CLI `command` templates, each `{param}` must appear
  exactly once. If a parameter must be used multiple times, assign it once to a shell
  variable inside the command and reuse that variable (see the sketch below). The repo
  validator counts the `"%s"` placeholders used during formatting.

We could...
-----------
- add a single aggregator HTTP endpoint (e.g. `/diagnose`) that implements the full
`diagnose_route` playbook and returns structured JSON so agents call one concise tool.
- implement native `k8s` and `prometheus` invokers in `pkg/invocation` so tools use
`client-go` and HTTP clients (robust in‑cluster auth) instead of shelling out.
- include a tiny `mcp-remote` adapter in the repo so Codex users can reproduce the
Codex integration locally without a separate tool.
- add safe remediation tools behind approval gates (preview/dry‑run + action IDs +
rollback tokens) and integrate with audit logs for traceability.

Phased roadmap
--------------
Phase 0 — Quick wins
- Provide read-only aggregation tools and lightweight probes so agents can collect
immediate evidence without custom code. Example tasks:
- `inspect_route`, `get_service_endpoints`, `get_coredns_config`, `query_prometheus`.
- `probe_dns_local` and `exec_dns_in_pod` using ephemeral pod runs.
- Deliver a curated `mcpfile.yaml` (already present) and clear DEV notes for running
the server locally or via the Makefile.

Phase 1 — Probing and aggregation
- Add active probing and standardized aggregation:
- Implement `probe_http`, `probe_dns` (multiple transports), `probe_endpoints`.
  - Build an aggregator HTTP endpoint (e.g. `/diagnose`) that executes the
    `diagnose_route` playbook, runs parallel probes, and returns structured JSON
    (checks, probes, root causes, recommended actions); a possible response shape
    is sketched below.
- Improve ephemeral pod lifecycle and add better result correlation (logs, metrics).
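
Purely as a sketch of the idea, not an implemented schema, such an endpoint might be called and answer along these lines (URL, parameters, and every field name are assumptions):

```bash
# Hypothetical call; /diagnose is not implemented yet.
curl -s "http://localhost:8080/diagnose?route=hello&namespace=test-ingress"
# A response might look like (all field names are assumptions):
# {
#   "checks":  [{"name": "route_admitted", "status": "pass"}],
#   "probes":  [{"name": "dns_lookup", "status": "fail", "detail": "NXDOMAIN"}],
#   "rootCauses": ["route host has no DNS record"],
#   "recommendedActions": ["point spec.host at the default apps domain"]
# }
```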

Phase 2 — Safe remediation (human-in-the-loop)
- Expose guarded remediation actions with audit and rollback support:
  - `preview_apply_corefile`, `apply_corefile`, `patch_route`, `scale_backend` with
    dry-run previews and an `action-id` for traceability (see the sketch below).
- Integrate RBAC and scope-based filtering so tools are gated by required scopes.
- Add approval workflows and automatic rollback tokens for safe operator-driven fixes.
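
None of this exists yet; as one possible shape for the preview step, server-side dry-run (which `oc patch` already supports) could produce the artifact that an `action-id` later references:

```bash
# Hypothetical preview flow: validate a Corefile patch without persisting it.
# The action-id scheme is illustrative; no gating mechanism exists yet.
action_id=$(uuidgen)
oc -n openshift-dns patch configmap dns-default \
  --type merge --patch-file corefile-patch.yaml \
  --dry-run=server -o yaml > "preview-${action_id}.yaml"
echo "Preview written to preview-${action_id}.yaml (action-id: ${action_id})"
```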

These phases are incremental: each phase builds on the previous to increase
automation while keeping safety and auditability central.

Location
--------
This file is the canonical NETEDGE notes for the gen‑mcp examples: `examples/netedge-tools/docs/NETEDGE-GEN-MCP-NOTES.md`.
113 changes: 113 additions & 0 deletions examples/netedge-tools/docs/NET_DIAGNOSTIC_SCENARIOS.md
@@ -0,0 +1,113 @@
# Network Diagnostic Scenarios for Ingress & DNS

This document describes candidate **network breakage scenarios** for use in an AI agent diagnostic project on OpenShift 4.19. The scenarios are designed to be:
- Within the scope of the **Network Ingress and DNS team**
- Detectable using only **Phase 0** diagnostic tools
- Safe — the cluster remains fully accessible remotely

The [`../netedge-break-repair.sh`](../netedge-break-repair.sh) helper script can
stage each scenario end-to-end; see [`README-netedge-break-repair.md`](./README-netedge-break-repair.md)
for usage.

---

## 1. Route → Service Selector Mismatch (Empty Endpoints)

### Why it's in-scope & interesting
- Tests the **Route → Service → Endpoints chain** that the ingress team owns.
- Realistic: label drift or bad selectors happen often and produce 503s.
- 100% reversible by flipping a single label or selector.
- No impact on control-plane or kubeadmin connectivity.

### How to stage it
1. Create a test namespace, deploy a small app (e.g., `nginx`), expose it as a Service and Route.
2. **Break**: modify the Service `spec.selector` to a label that no pod has (or relabel pods).
3. **Repair**: restore the correct selector/pod labels.
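
A hedged end-to-end sketch of those staging steps (names are illustrative; the break/repair script automates the same flow):

```bash
# Stage the baseline (illustrative names).
oc new-project test-ingress
oc create deployment hello \
  --image=quay.io/openshift/origin-hello-openshift:latest --port=8080
oc expose deployment hello --port=8080   # Service
oc expose service hello                  # Route

# Break: point the Service selector at a label no pod carries.
oc patch svc hello --type merge -p '{"spec":{"selector":{"app":"does-not-exist"}}}'

# Repair: restore the selector the Deployment's pods actually carry.
oc patch svc hello --type merge -p '{"spec":{"selector":{"app":"hello"}}}'
```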

### How a human would diagnose
1. `curl` the Route → 503 from the router.
2. `oc get route -o json` → identify `.spec.to.name`.
3. `oc get svc -o json` → inspect selector.
4. `oc get endpoints -o json` → find empty subsets.
5. Compare Service selector vs Pod labels, fix mismatch.
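
The same walk condensed into commands, assuming the names from the staging sketch above (`jq` optional but handy):

```bash
# 2: find the backing Service named by the Route
oc get route hello -n test-ingress -o jsonpath='{.spec.to.name}{"\n"}'
# 3: inspect the Service selector
oc get svc hello -n test-ingress -o jsonpath='{.spec.selector}{"\n"}'
# 4: empty subsets here confirm the break
oc get endpoints hello -n test-ingress -o json | jq '.subsets'
# 5: compare with the labels actually on the pods
oc get pods -n test-ingress --show-labels
```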

### How the phase-0 agent would diagnose
- `inspect_route` → surface Route, Service, Endpoints, detect empty endpoints.
- `get_service_endpoints` → verify endpoints are empty.
- Optionally `query_prometheus` → check router 503 metrics.

### Agent prompt for this scenario:
- `On the currently connected cluster, we've deployed an app and exposed it through a Route, but it’s not working. Diagnose the root cause and suggest the fix.`

---

## 2. Route Host Without DNS Record (NXDOMAIN)

### Why it's in-scope & interesting
- Tests the DNS ↔ Ingress seam.
- Common misconfiguration: developer sets `spec.host` to `myapp.example.com` without DNS.
- Reversible: fix by using default domain or adding DNS record.

### How to stage it
1. Create a Route with `spec.host` set to `nonexistent.example.test`.
2. **Break**: leave DNS unconfigured.
3. **Repair**: update Route host to valid admitted domain or create DNS record.
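
A hedged sketch of staging and verifying this scenario; updating `spec.host` on an existing Route requires the `routes/custom-host` permission, and all names here are illustrative:

```bash
# Stage/break: a Route whose host has no DNS record.
oc create route edge hello --service=hello \
  --hostname=nonexistent.example.test -n test-ingress

# Confirm the symptom from the workstation.
dig nonexistent.example.test          # expect status: NXDOMAIN
nslookup nonexistent.example.test || true

# Repair: move the host under the cluster's default apps domain.
apps_domain=$(oc get ingresses.config.openshift.io cluster -o jsonpath='{.spec.domain}')
oc patch route hello -n test-ingress --type merge \
  -p "{\"spec\":{\"host\":\"hello-test-ingress.${apps_domain}\"}}"
```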

### How a human would diagnose
1. `dig` / `nslookup` → NXDOMAIN.
2. `oc get route` → host is admitted, but unreachable.
3. Conclude DNS misconfiguration.

### How the phase-0 agent would diagnose
- `inspect_route` → check host, verify backend chain is healthy.
- `probe_dns_local` → show NXDOMAIN.
- Optionally `exec_dns_in_pod` → in-cluster resolution check.

### Agent prompt for this scenario:
- `On the currently connected cluster, the route's hostname never resolves in DNS even though the service and pods look healthy. Diagnose the root cause and suggest the fix.`

---

## 3. NetworkPolicy Blocking Router → Service Traffic

### Why it's in-scope & interesting
- Tests namespace isolation affecting ingress traffic.
- Real-world: default-deny without allow for router.
- Reversible: apply/remove single NetworkPolicy.

### How to stage it
1. Deploy app in test namespace.
2. **Break**: apply default-deny NetworkPolicy.
3. **Repair**: remove it or add allow for ingress pods.
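
A hedged sketch with the default-deny policy inlined as a heredoc (namespace assumed from the earlier examples):

```bash
# Break: default-deny ingress for every pod in the namespace.
oc apply -n test-ingress -f - <<'EOF'
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
spec:
  podSelector: {}
  policyTypes:
  - Ingress
EOF

# Repair: delete the policy (or add an allow rule for the router pods instead).
oc delete networkpolicy default-deny-ingress -n test-ingress
```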

### How a human would diagnose
1. Route requests hang or 503.
2. Endpoints are healthy.
3. `oc get networkpolicy` → find default-deny.

### Caveat for phase-0 agent
- No built-in `get NetworkPolicy` in phase 0.
- Could infer by symptoms (503 + healthy endpoints) and escalate.

### Agent prompt for this scenario:
- `On the currently connected cluster every request to the Route now times out even though the pods and service look healthy. Diagnose the root cause and suggest the fix.`

---

## Recommended Scenario for v1: Selector Mismatch

**Why**:
- Fully covered by existing Phase 0 tools.
- Deterministic, scriptable, and reversible.
- Teaches canonical ingress debugging (Route → Service → Endpoints).

### Human/Agent Diagnostic Flow
1. **Route → Service**
`inspect_route` → see Service + Endpoints. Expect empty endpoints.
2. **Endpoints inspection**
`get_service_endpoints` → confirm no addresses.
3. **Form hypothesis**
Selector mismatch or zero pods.
4. **Repair**
Fix label or selector. Endpoints repopulate.
5. **Quantify**
`query_prometheus` for router 503s before/after.
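
For step 5, one possible instant query against the router metrics (the metric name and Prometheus URL are assumptions; verify against your cluster):

```bash
# Router-reported 5xx rate over the last 5 minutes (metric name assumed).
curl -sG "https://prometheus.example.com/api/v1/query" \
  --data-urlencode 'query=sum(rate(haproxy_server_http_responses_total{code="5xx"}[5m]))'
```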
75 changes: 75 additions & 0 deletions examples/netedge-tools/docs/README-netedge-break-repair.md
@@ -0,0 +1,75 @@
# NetEdge Break/Repair Script

`netedge-break-repair.sh` stages, breaks, and repairs the deterministic ingress and DNS scenarios captured in [`NET_DIAGNOSTIC_SCENARIOS.md`](./NET_DIAGNOSTIC_SCENARIOS.md). It deploys a minimal application stack (Deployment, Service, Route), introduces the chosen fault, and restores the healthy baseline when asked.

## Prerequisites

- `oc` CLI available in `$PATH`
- `envsubst` (from GNU `gettext`) available for templating manifests
- Credentials that allow creating resources in the target namespace
- Ability to pull `quay.io/openshift/origin-hello-openshift:latest` (default demo image; override with `IMAGE` if restricted)

## Basic Usage

```bash
examples/netedge-tools/netedge-break-repair.sh [--scenario=<1|2|3>] <action>
```

Actions:

- `--setup` – Deploy the healthy baseline (Deployment, Service, Route)
- `--break` – Apply the scenario-specific failure
- `--repair` – Restore the healthy state for the scenario
- `--status` – Show Route, Service, Endpoints (and NetworkPolicy) details
- `--curl` – Curl the current Route host (best effort)
- `--cleanup` – Remove the created resources and any managed NetworkPolicy

If `--scenario` is omitted the script defaults to scenario **1**. Always reuse the same `--scenario=N` flag on follow-up commands; the script prints a reminder after each action.

## Scenarios

1. **Route → Service selector mismatch** – Patches the Service selector to a non-matching value so no endpoints remain (router returns 503). Repair restores the correct selector.
- Agent prompt: “On the current cluster we exposed an app through a Route but it keeps failing. Diagnose the root cause and tell me how to fix it.”
2. **Route host without DNS record** – Stores the original host, then patches `spec.host` to an NXDOMAIN value. Repair restores the saved host from annotation.
- Agent prompt: “On the current cluster the Route’s hostname never resolves in DNS even though the Service and Pods look healthy. Diagnose the root cause and tell me how to fix it.”
3. **NetworkPolicy blocking router traffic** – Applies a default-deny ingress NetworkPolicy in the namespace. Repair deletes the policy.
- Agent prompt: “On the current cluster every request to the Route now times out even though the pods and service look healthy. Diagnose the root cause and tell me how to fix it.”

## Environment Overrides

Export any of these before running the script to change defaults:

- `NAMESPACE` – Target namespace (default: `test-ingress`)
- `APP_NAME` – Base name for Deployment/Service/Route (default: `hello`)
- `APP_LABEL` – Label shared by Deployment and Service (default: `hello`)
- `IMAGE` – Demo image (default: `quay.io/openshift/origin-hello-openshift:latest`)
- `PORT` – Container and Service port (default: `8080`)
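
Overrides are ordinary environment variables on the invocation; for example (values illustrative):

```bash
NAMESPACE=edge-demo APP_NAME=web PORT=8081 \
  examples/netedge-tools/netedge-break-repair.sh --scenario=1 --setup
```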

## Example Workflow

```bash
# Scenario 1: selector mismatch
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --setup
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --break
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --status
examples/netedge-tools/netedge-break-repair.sh --scenario=1 --repair

# Scenario 2: NXDOMAIN host
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --setup
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --break
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --curl
examples/netedge-tools/netedge-break-repair.sh --scenario=2 --repair

# Scenario 3: NetworkPolicy block
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --setup
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --break
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --repair
examples/netedge-tools/netedge-break-repair.sh --scenario=3 --cleanup
```

## Notes

- The script refreshes the Route host from the API after each action so `--curl` always uses the current value.
- Scenario 2 stores the original host in the `netedge-tools-original-host` annotation on the Route; avoid deleting this annotation if you plan to run `--repair`.
- Scenario 3 leaves the namespace intact but removes the managed NetworkPolicy during cleanup.
- If `oc` cannot reach the cluster, commands fail early with diagnostics.