Agent operations framework adr #64

124 changes: 124 additions & 0 deletions adr/0017-agent-operations-framework.md
@@ -0,0 +1,124 @@
# 17. Agent Operations Framework

Date: 2025-01-27

## Status

Accepted

## Context

Following the new epics and product decisions, we need to build a framework capable of executing write operations on SAP machines.
To achieve this, we have decided to use the agent, as it runs on the machines with root privileges, has all the necessary permissions to execute commands, and includes infrastructure to consume messages from other components of Trento.

## Decision

Unlike fact-gathering operations, we have chosen to delegate this new development to a dedicated library rather than embedding the code directly into the agent's codebase. This approach ensures better separation of concerns and allows the release of additional tools supporting operations without affecting the agent's codebase.
Member

This approach ensures better separation of concerns and allows the release of additional tools supporting operations without affecting the agent's codebase.

The agent would still need to be rebuilt in order to get new features from an updated version of the library, correct?

Contributor Author

Exactly, it is a Go library. It's like the gatherers for the checks: they are built into the code (except there are no plugins for operators).

Contributor

Yes, correct.

By the way, this part is a technical decision rather than context, so I would put it in the Decision chapter as-is.


The requirements for this new development are very specific:

- Operations must be atomic and include rollback capabilities.
- Operations must accept arguments to enable actions in different contexts without requiring dedicated code for each case.
Member

question: I believe I am missing some details here

Operations must accept arguments to enable actions in different contexts without requiring dedicated code for each case

Could you help out with an example or point to some extra information to properly understand, please?

Contributor Author

Take the `saptuneapplynote` operator: it should not be hardcoded to a single note but should accept the note as an argument, so it can be used with different notes on different machines. (It already works like that.)

Contributor

The first operation we are working on is saptune, specifically applying saptune solutions.
Different hosts need different solutions. We have HANA, NETWEAVER, S4HANA, HANA=S4HANA, etc.
So we are going to send this value as an argument. This way, the operation does a slightly different thing depending on the argument.

- Operations must be transactional, including distinct steps for prerequisites verification, commit, rollback, and validation of changes applied during the commit phase.
- Operations must be versioned, with backward compatibility ensured for previous versions in the event of updates.
- Operations must be idempotent.
Comment on lines +22 to +24
Member

By Operations, do we mean the single action performed by an agent instance, the workflow step that performs an action on every agent instance, the entire workflow, or something else?

Contributor

This is true. I would rename this to Operator.
In the end, the agent is all about operators. I would try to avoid any Operation wording.
An operation is a set of steps, and each step requests an operator execution in the agents.
The agents, for now, don't have any knowledge of the Operation as a whole. They are only responsible for knowing and executing these small pieces of code, now called Operators.


The agent will consume the library containing these operations and use it to fulfill operation requests from other components.

The library handling operations is named [workbench](https://github.com/trento-project/workbench).

### Example:

- **Operator**: `saptuneapplysolution` - Applies a SAP Tune solution using the solution name as an argument.
Contributor

I don't really like this example. I think it is more confusing than helpful, as the operator and the operation are mixed, and in essence both things are the same. I would write it as:

Take the following operator as an example:

`saptuneapplysolution` is an operator that applies saptune solutions. It applies the solution given as an argument. If a solution is already applied, no action is taken. In case of an error, it reverts the solution specified as an argument to return to a state where no solutions are applied.

nitpick: "Applies a saptune solution": saptune is the tool name, so let's stick to it.
Nobody knows it as SAP Tune, hehe.
The same goes for the next line.

- **Operation**: Apply a SAP Tune solution if it has not already been applied. If already applied, no action is taken. In case of an error, revert the solution specified as an argument.

## Operator

An operator is a unit of code capable of performing write operations on target machines.
A write operation can either succeed or fail. Upon success, it generates a diff showing changes made during the commit phase.

The operator accepts arguments in the form of a `map[string]any` to specify operation parameters. Each operator is responsible for extracting and validating these arguments.
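
As an illustration of the argument handling described above, here is a minimal sketch of how an operator might extract and validate one argument. The helper name `parseSolutionArgument` and the `solution` key are assumptions made for this example, not names confirmed by the ADR or by workbench.

```go
package main

import (
	"errors"
	"fmt"
)

// OperatorArguments mirrors the map[string]any structure described in the ADR.
type OperatorArguments map[string]any

// parseSolutionArgument is a hypothetical helper: it extracts the "solution"
// argument (the key name is an assumption) and checks that it is a non-empty string.
func parseSolutionArgument(args OperatorArguments) (string, error) {
	raw, found := args["solution"]
	if !found {
		return "", errors.New("argument solution not provided")
	}

	solution, ok := raw.(string)
	if !ok {
		return "", fmt.Errorf("argument solution must be a string, got %T", raw)
	}

	if solution == "" {
		return "", errors.New("argument solution is empty")
	}

	return solution, nil
}

func main() {
	// A valid argument map.
	solution, err := parseSolutionArgument(OperatorArguments{"solution": "HANA"})
	fmt.Println(solution, err)

	// A wrongly typed value is rejected before any phase runs.
	_, err = parseSolutionArgument(OperatorArguments{"solution": 42})
	fmt.Println(err)
}
```

Validating arguments up front means a malformed request can fail before any change is attempted on the machine.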

The operator follows a transactional workflow, which includes the following distinct phases:

- **PLAN**
- **COMMIT**
- **VERIFY**
- **ROLLBACK**
Comment on lines +42 to +47
Member

This applies to a single write operation, right? If an operator executes multiple write operations, will they be in the same transaction?

Also, so far we are not considering distributed transactions across multiple operators in the same operations, correct?

Contributor

This applies to a single write operation, right? If an operator executes multiple write operations, will they be in the same transaction?

I guess one operator can perform more than one "write" operation in the same transaction, so we can roll back the whole process. The example is the corosync file change and reload. It is the same as reconfiguring nginx (an easier example): you change the nginx configuration file and restart the daemon. If the restart fails, you restore the initial config file and restart again. I guess we could include all of this in the same operator; otherwise things start getting complex. We haven't had the chance to implement such a thing yet, though, so this can change.

Also, so far we are not considering distributed transactions across multiple operators in the same operations, correct?

damn, now I feel your pain with the terminology XD
The operator should never know that there are other operator executions in the same agent or operation. That's the meaning of "atomic", I guess. The operator only knows how to do one single thing deterministically.
That's the difficulty in implementing complex multi-step operations and their rollbacks: rolling back operations with multiple steps is hard. That's why we are postponing this implementation and design.


### PLAN

The PLAN phase collects information about the operation and verifies prerequisites.
This phase also gathers information for generating diffs by collecting the "before" state of the system.
Backups are created for any resources modified during the COMMIT phase, ensuring restoration is possible in case of rollback or manual recovery if rollback fails.

If an error occurs during the PLAN phase, no rollback is required; the operation is simply aborted.

### COMMIT

The COMMIT phase performs the actual write operations using the data collected during the PLAN phase.
If an error occurs, rollback is triggered.

The COMMIT phase must be idempotent. If a requested change has already been applied, the commit operation is skipped without error. Idempotency must be implemented based on the specific operation's requirements.

### VERIFY

The VERIFY phase ensures the actions applied during the COMMIT phase produced the expected results.
If an error occurs, rollback is initiated.

The VERIFY phase also collects the "after" state to generate the diff showing changes applied during the commit.

### ROLLBACK

The ROLLBACK phase reverts changes made during the COMMIT phase. It uses data collected during the PLAN phase to restore the system to its previous state.

Rollback implementations may vary based on the type of operation. Clear error messages and appropriate logs must be provided.

If rollback fails, an error is returned without further action.

Each operator implements these phases by satisfying the `phaser` interface:
Contributor

I would put the "Each operator implements..." sentence and the code snippet in the main Operator chapter, after the list of the 4 phases. Otherwise it looks like it belongs to the ROLLBACK sub-chapter.


```go
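type phaser interface {
	// plan verifies prerequisites and collects the "before" state.
	plan(ctx context.Context) error
	// commit performs the write operations.
	commit(ctx context.Context) error
	// rollback reverts the changes made by commit.
	rollback(ctx context.Context) error
	// verify checks the commit result and collects the "after" state.
	verify(ctx context.Context) error
	// operationDiff returns the changes applied during the commit.
	operationDiff(ctx context.Context) map[string]any
}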
```


These methods are invoked by the Executor, which wraps the operator. All operators are exposed through a constructor function, returning operators already wrapped in an Executor.

## Executor

The Executor is a wrapper around an operator that manages operations transactionally.
The Executor is transparent to library users: operators are already wrapped within an Executor.

Below is a flowchart illustrating the transactional flow:

![flow_chart](https://github.com/trento-project/workbench/raw/main/flow_chart.png)

## Registry

The Registry manages all available operators.
Each operator has a version. By default, the latest version is fetched if no specific version is requested.

The operator naming convention is: `<operatorname>@<version>`
Member

question: is the operatorname just the operationID mentioned after, or are they two different things?

Contributor Author

Yes, we can say that the operatorname is like the name of a gatherer, while the operationID is like the execution_id of a check execution: a unique identifier for an operation.

Contributor

They are different things. The operatorname is a unique value that identifies the operator.
The operationID is the ID of the current operation execution, so to speak.


The Registry returns an Operator Builder:

```golang
type OperatorBuilder func(operationID string, arguments OperatorArguments) Operator
```

- `operationID`: A unique identifier for the operation.
- `arguments`: A `map[string]any` structure containing operation parameters.
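
To make the lookup concrete, here is a minimal sketch of a registry that resolves `<operatorname>@<version>` and falls back to the latest version. Only `OperatorBuilder` and `OperatorArguments` appear in the ADR. The `Registry` layout, the `GetOperatorBuilder` method, the `Operator` placeholder interface, and the `v<N>` version format are assumptions for illustration and may differ from workbench.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

type OperatorArguments map[string]any

// Operator is a placeholder; the real interface is not shown in the ADR.
type Operator interface{ Name() string }

type OperatorBuilder func(operationID string, arguments OperatorArguments) Operator

// Registry maps operator name -> version -> builder (hypothetical layout).
type Registry map[string]map[string]OperatorBuilder

// GetOperatorBuilder resolves "<operatorname>@<version>". When no version is
// given, the latest registered version is used.
func (r Registry) GetOperatorBuilder(name string) (OperatorBuilder, error) {
	operatorName, version, hasVersion := strings.Cut(name, "@")

	versions, found := r[operatorName]
	if !found || len(versions) == 0 {
		return nil, fmt.Errorf("operator %s not found", operatorName)
	}
	if !hasVersion {
		version = latest(versions)
	}

	builder, found := versions[version]
	if !found {
		return nil, fmt.Errorf("operator %s has no version %q", operatorName, version)
	}
	return builder, nil
}

// latest returns the highest version, assuming versions are written as "v<N>".
func latest(versions map[string]OperatorBuilder) string {
	best, bestNumber := "", -1
	for version := range versions {
		number, err := strconv.Atoi(strings.TrimPrefix(version, "v"))
		if err == nil && number > bestNumber {
			best, bestNumber = version, number
		}
	}
	return best
}

// namedOperator and builderFor exist only to make the sketch runnable.
type namedOperator struct{ name string }

func (o namedOperator) Name() string { return o.name }

func builderFor(name string) OperatorBuilder {
	return func(operationID string, _ OperatorArguments) Operator {
		return namedOperator{name: name}
	}
}

func main() {
	registry := Registry{
		"saptuneapplysolution": {
			"v1": builderFor("saptuneapplysolution@v1"),
			"v2": builderFor("saptuneapplysolution@v2"),
		},
	}

	// No version requested: the latest one is selected.
	builder, err := registry.GetOperatorBuilder("saptuneapplysolution")
	if err != nil {
		fmt.Println(err)
		return
	}

	operator := builder("operation-id-1", OperatorArguments{"solution": "HANA"})
	fmt.Println(operator.Name())
}
```

Keeping old versions registered alongside new ones is one way to provide the backward compatibility required above: callers that pin `@v1` keep getting the same behavior after `v2` is added.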

## Consequences

This development will enable transactional write operations on target machines.

Each operation is atomic.
Coordination, ordering, and dependency management of multiple operations are not the agent's responsibility but are delegated to another component that orchestrates their execution.