Skip to content

Conversation

elieserr
Copy link

@elieserr elieserr commented Aug 24, 2025

Added instrumentation for promethues compatible endpoints and otel.
Added tests and e2e tests.

I went forward this ways as I was trying to get familiar with codebase. Happy to go in any direction reviewers advise.

Checklist

Part of #965

@elieserr
Copy link
Author

Hi folks, trying to get some eyes here but not really sure who to tag, thanks for any pointer in advance
@JorTurFer @rickbrouwer @zroubalik tagging you 3 since helped me with previous unrelated PR kedacore/keda#6990

Comment on lines 174 to 214
func (r *HTTPScaledObjectReconciler) updatePromMetricsOnDelete(ctx context.Context, scaledObject *httpv1alpha1.HTTPScaledObject, namespacedName string) {
logger := log.FromContext(ctx, "updatePromMetricsOnDelete", namespacedName)
logger.Info("updatePromMetricsOnDelete")
metrics.RecordDeleteHTTPScaledObjectCount(namespacedName)
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like this is never called, wouldn't this result in always growing metric?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, forgot to call this, will fix

assert.True(t, ok, "operator_http_scaled_object_count_total is available")

requestCount := getMetricsValue(val)
assert.GreaterOrEqual(t, requestCount, float64(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when the whole e2e suite is executed, multiple e2e tests can run in parallel (iirc it could be like 3?). Perhaps the test could create 100 HSOs check if the metric is equal or over 100. And then we may want to delete that 100, wait for the metric to propagate and check again if it's under some reasonably low number, e.g. 10.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, meanwhile we don't execute sequential tests (I'd like to avoid it tbh, but maybe we have to 🤔)
I guess that we can just execute both checks at once in the same tests, something like:

  • spawn 20 HTTPScaledObjects
  • check prometheus
  • check otel
  • remove the HTTPScaledObjects
  • check prometheus
  • check otel

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm updating the e2e to do some sequencing for testing metric going up and down. also including testing for http and grpc for otel

ports:
- name: metrics
containerPort: 8080
containerPort: 2223
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why this? Personally, I'd prefer to not change an already existing port

docs/operate.md Outdated
This endpoint can be enabled by setting the `OTEL_PROM_EXPORTER_ENABLED` environment variable to `true` on the operator deployment (`true` by default) and by setting `OTEL_PROM_EXPORTER_PORT` to an unused port for the endpoint to be made avaialble on (`2223` by default).

### Configuring the OTEL HTTP exporter
When configured, the ioperator can export metrics to a OTEL HTTP collector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When configured, the ioperator can export metrics to a OTEL HTTP collector.
When configured, the operator can export metrics to a OTEL HTTP collector.

docs/operate.md Outdated
### Configuring the OTEL HTTP exporter
When configured, the ioperator can export metrics to a OTEL HTTP collector.

The OTEL exporter can be enabled by setting the `OTEL_EXPORTER_OTLP_METRICS_ENABLED` environment variable to `true` on the operator deployment (`false` by default). When enabled the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable must also be configured so the exporter knows what collector to send the metrics to (e.g. http://opentelemetry-collector.open-telemetry-system:4318).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The OTEL exporter can be enabled by setting the `OTEL_EXPORTER_OTLP_METRICS_ENABLED` environment variable to `true` on the operator deployment (`false` by default). When enabled the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable must also be configured so the exporter knows what collector to send the metrics to (e.g. http://opentelemetry-collector.open-telemetry-system:4318).
The OTEL exporter can be enabled by setting the `OTEL_EXPORTER_OTLP_METRICS_ENABLED` environment variable to `true` on the operator deployment (`false` by default). When enabled, the `OTEL_EXPORTER_OTLP_ENDPOINT` environment variable must also be configured so the exporter knows what collector to send the metrics to (e.g. http://opentelemetry-collector.open-telemetry-system:4318).

operator/main.go Outdated
Comment on lines 129 to 140
if metricsCfg.OtelPrometheusExporterEnabled {
// start the prometheus compatible metrics server
// serves a prometheus compatible metrics endpoint on the configured port
eg.Go(func() error {
if err := runMetricsServer(ctx, ctrl.Log, metricsCfg); !util.IsIgnoredErr(err) {
setupLog.Error(err, "could not start the Prometheus metrics server")
return err
}

return nil
})
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The operator is already exposing a metrics server in prom format. Let's use it instead of starting another one, you can register more metrics on that server, for example, that's what we do for KEDA -> https://github.com/kedacore/keda/blob/main/pkg/metricscollector/prommetrics.go#L161

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah that make sense, will use it

func NewOtelMetrics(options ...metric.Option) *OtelMetrics {
ctx := context.Background()

exporter, err := otlpmetrichttp.New(ctx)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Opentelemetry supports HTTP and gRCP protocols, I prefer if we support both tbh. This is how we do it in KEDA, you can use it as example https://github.com/kedacore/keda/blob/main/pkg/metricscollector/opentelemetry.go#L67-L86

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

including it in upcoming changes, thnaks

provider := metric.NewMeterProvider(options...)
meter := provider.Meter(meterName)

httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in opentelemetry context, dots are used instead of underscores

Suggested change
httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
httpScaledObjectCounter, err := meter.Int64UpDownCounter("keda.http.scaled.object.count", api.WithDescription("a counter of HttpScaledObjects processed by the operator"))

)
meter := provider.Meter(meterName)

httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
httpScaledObjectCounter, err := meter.Int64UpDownCounter("operator_http_scaled_object_count", api.WithDescription("a counter of http_scaled_objects processed by the operator"))
httpScaledObjectCounter, err := meter.Int64UpDownCounter("keda_http_scaled_object_total", api.WithDescription("a counter of http_scaled_objects processed by the operator"))

assert.True(t, ok, "operator_http_scaled_object_count_total is available")

requestCount := getMetricsValue(val)
assert.GreaterOrEqual(t, requestCount, float64(1))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a good point, meanwhile we don't execute sequential tests (I'd like to avoid it tbh, but maybe we have to 🤔)
I guess that we can just execute both checks at once in the same tests, something like:

  • spawn 20 HTTPScaledObjects
  • check prometheus
  • check otel
  • remove the HTTPScaledObjects
  • check prometheus
  • check otel

@JorTurFer
Copy link
Member

Thanks a lot for the contribution!

@elieserr elieserr force-pushed the instrument-http-add-on-operator branch from 723bbb9 to 176de5d Compare September 6, 2025 12:58
@elieserr elieserr force-pushed the instrument-http-add-on-operator branch from 176de5d to f56c1bd Compare September 23, 2025 20:17
@elieserr
Copy link
Author

I was able to reproduce the error in my local env. Was because of misconfig on go.mod. Also I rebased latest changes from main so this should work. lets see how e2e goes.

@zroubalik @wozniakjan @JorTurFer for workflow approval.

@elieserr
Copy link
Author

e2e went good on previous run, so fixing lints and some broken links from docs

Signed-off-by: Elieser Pereira <[email protected]>
Signed-off-by: Elieser Pereira <[email protected]>
@elieserr elieserr force-pushed the instrument-http-add-on-operator branch from bbad4f2 to bb94b8f Compare September 24, 2025 15:32
@wozniakjan wozniakjan requested a review from Copilot October 3, 2025 07:51
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR adds instrumentation for Prometheus-compatible endpoints and OpenTelemetry (OTEL) metrics to the KEDA HTTP Add-on operator. The implementation provides monitoring capabilities for HTTP scaled objects through both metrics collection methods.

  • Added metrics collection infrastructure with Prometheus and OTEL exporters
  • Implemented HTTP scaled object count tracking in the operator controller
  • Added comprehensive end-to-end tests for both metrics collection methods

Reviewed Changes

Copilot reviewed 21 out of 22 changed files in this pull request and generated 12 comments.

Show a summary per file
File Description
operator/metrics/ New package containing metrics collection infrastructure with Prometheus and OTEL exporters
operator/controllers/http/httpscaledobject_controller.go Updated controller to track HTTP scaled object metrics with thread-safe map
operator/controllers/http/config/config.go Added metrics configuration structure with environment variable parsing
operator/main.go Integrated metrics collectors initialization into main function
tests/checks/operator_*_metrics/ Comprehensive E2E tests for both Prometheus and OTEL metrics
docs/ Updated documentation with metrics configuration instructions
config/ Added Kubernetes manifests for metrics service and E2E test configurations

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

namespacedName := scaledObject.Name + scaledObject.Namespace
metricsData, ok := httpScaledObjectPromMetricsMap[namespacedName]
if ok {
metrics.RecordDeleteHTTPScaledObjectCount(scaledObject.Namespace)
Copy link
Preview

Copilot AI Oct 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the map entry already exists, the code calls RecordDeleteHTTPScaledObjectCount before RecordHTTPScaledObjectCount, which could lead to incorrect metrics if the namespace changed. This logic appears to be attempting to handle namespace changes, but it's unclear and potentially buggy.

Suggested change
metrics.RecordDeleteHTTPScaledObjectCount(scaledObject.Namespace)
// Record deletion for the old namespace before updating
metrics.RecordDeleteHTTPScaledObjectCount(metricsData.namespace)

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants