
feat: decoupled prometheus exporter's calculation and output #12383


Open
wants to merge 50 commits into master

Conversation

SkyeYoung
Member

@SkyeYoung SkyeYoung commented Jun 26, 2025

Description

This PR decouples the calculation and output processes of the Prometheus exporter. The "calculation" is performed in the privileged agent process at intervals defined by refresh_interval (default: 15s) and written to a shared dict, while the "output" (i.e., the /apisix/prometheus/metrics API) is moved to the worker processes, which only read and return the cached data from the shared dict.
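
To illustrate the mechanism, here is a minimal sketch (the shared dict name, key, and helper names below are illustrative, not the exact code in this PR):

-- Sketch only: "prometheus-cache", CACHED_METRICS_KEY and collect() are
-- illustrative names, not necessarily those used in the PR.
local shared_cache = ngx.shared["prometheus-cache"]
local CACHED_METRICS_KEY = "cached_metrics"
local refresh_interval = 15  -- seconds, configurable via refresh_interval

-- placeholder for the real metric calculation done in the privileged agent
local function collect()
    return "# rendered Prometheus text exposition ..."
end

-- privileged agent: recalculate periodically and cache the rendered text
local function exporter_timer(premature)
    if premature then
        return
    end
    local ok, err = shared_cache:set(CACHED_METRICS_KEY, collect())
    if not ok then
        ngx.log(ngx.ERR, "failed to cache metrics: ", err)
    end
    ngx.timer.at(refresh_interval, exporter_timer)
end

-- worker: /apisix/prometheus/metrics only reads the cached text
local function metrics_handler()
    local cached = shared_cache:get(CACHED_METRICS_KEY)
    if not cached then
        return 500, "Failed to retrieve metrics: no data available"
    end
    return 200, cached
end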

Those are just the core changes. In practice I ran into many other problems, which are commented or annotated in the corresponding places, so I won't repeat them here.

For the testing part, since the Prometheus exporter now refreshes data every 15 seconds by default, I used a smaller interval in the relevant tests so the original tests still pass.

Which issue(s) this PR fixes:

Fixes #

Stress Testing (IN PROGRESS, awaiting discussion)

Scripts

https://gist.github.com/SkyeYoung/dc7f8b8d9c7e28e643cd851a6ad6af72

How to use

  1. Install wrk2 and git clone apisix
  2. Deploy etcd (./test.sh init-etcd) and nginx as the upstream (./test.sh start-nginx)
  3. make run
  4. Create 10k routes (./test.sh create)
  5. Enable prometheus (./test.sh enable-prometheus)
  6. Run the benchmark (./test.sh benchmark)

Test Logic

benchmark performs two tests in sequence:

echo "🔧 Test 1: Three Connections"
benchmark_metrics "three_conn" -t3 -c3 -d30s -R3 -U

echo "🔧 Test 2: Single Connection"
benchmark_metrics "single_conn" -t1 -c1 -d30s -R1 -U

The params are passed directly to wrk2; they are the settings for requesting the Prometheus exporter API.

And the specific steps in benchmark_metrics are as follows:

BASELINE_RATE="100"

# restart apisix
make stop && make run

# ...

# Start test routes load in background using wrk.lua
nohup bash -c "wrk -t 4 -c 100 -d 60s -U -R ${BASELINE_RATE} -s wrk.lua '${test_routes_url}' > '${routes_output}' 2>&1" &
local routes_pid=$!

# Wait a moment for routes load to establish
sleep 15

# Run wrk benchmark against metrics endpoint
wrk "$@" "${metrics_url}" > "${metrics_output}" 2>&1

# Wait for routes load to finish
wait $routes_pid

Results

Run 1, 1 connection:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.4%, Memory 0.7% (114.90 MB avg)
PID 307741: CPU 1.7%, Memory 0.9% (140.27 MB) - openresty
PID 307742: CPU 1.7%, Memory 0.7% (113.73 MB) - openresty
PID 307743: CPU 1.2%, Memory 0.7% (102.95 MB) - openresty
PID 307744: CPU 1.0%, Memory 0.7% (102.67 MB) - openresty
Privileged Agents (1 processes): CPU 2.7%, Memory 0.7% (117.99 MB avg)
PID 307745: CPU 2.7%, Memory 0.7% (117.99 MB) - openresty

📈 Metrics Endpoint Results:
Latency 6.22ms 1.01ms 9.83ms 90.00%
Req/Sec 1.00 5.80 35.00 97.09%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
31 requests in 30.01s, 273.36MB read

Run 1, 3 connections:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.5%, Memory 0.8% (120.32 MB avg)
PID 300776: CPU 2.8%, Memory 0.9% (142.31 MB) - openresty
PID 300777: CPU 1.0%, Memory 0.7% (111.78 MB) - openresty
PID 300778: CPU 0.8%, Memory 0.7% (112.78 MB) - openresty
PID 300779: CPU 1.4%, Memory 0.7% (114.39 MB) - openresty
Privileged Agents (1 processes): CPU 3.9%, Memory 0.7% (118.08 MB avg)
PID 300780: CPU 3.9%, Memory 0.7% (118.08 MB) - openresty

📈 Metrics Endpoint Results:
Latency 9.70ms 4.36ms 31.97ms 95.00%
Req/Sec 0.99 4.76 31.00 95.53%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
90 requests in 30.01s, 804.52MB read

Run 2, 1 connection:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.5%, Memory 0.7% (117.02 MB avg)
PID 324993: CPU 2.1%, Memory 0.7% (114.96 MB) - openresty
PID 324994: CPU 1.5%, Memory 0.7% (104.77 MB) - openresty
PID 324995: CPU 1.3%, Memory 0.7% (105.16 MB) - openresty
PID 324996: CPU 1.2%, Memory 0.9% (143.19 MB) - openresty
Privileged Agents (1 processes): CPU 0.5%, Memory 0.8% (123.23 MB avg)
PID 324997: CPU 0.5%, Memory 0.8% (123.23 MB) - openresty

📈 Metrics Endpoint Results:
Latency 6.61ms 1.34ms 10.78ms 75.00%
Req/Sec 1.00 5.58 33.00 96.89%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
31 requests in 30.01s, 293.73MB read

Run 2, 3 connections:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 2.1%, Memory 0.8% (122.44 MB avg)
PID 318089: CPU 3.6%, Memory 0.9% (143.75 MB) - openresty
PID 318090: CPU 1.5%, Memory 0.7% (115.52 MB) - openresty
PID 318091: CPU 1.7%, Memory 0.7% (116.41 MB) - openresty
PID 318092: CPU 1.6%, Memory 0.7% (114.07 MB) - openresty
Privileged Agents (1 processes): CPU 2.2%, Memory 0.8% (120.06 MB avg)
PID 318093: CPU 2.2%, Memory 0.8% (120.06 MB) - openresty

📈 Metrics Endpoint Results:
Latency 10.70ms 4.62ms 33.44ms 94.92%
Req/Sec 0.98 4.31 25.00 94.81%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
91 requests in 30.01s, 855.44MB read

Run 3, 1 connection:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.6%, Memory 0.7% (114.56 MB avg)
PID 339966: CPU 2.3%, Memory 0.9% (140.78 MB) - openresty
PID 339967: CPU 1.6%, Memory 0.7% (112.49 MB) - openresty
PID 339968: CPU 1.1%, Memory 0.6% (101.79 MB) - openresty
PID 339969: CPU 1.6%, Memory 0.7% (103.17 MB) - openresty
Privileged Agents (1 processes): CPU 3.9%, Memory 0.8% (119.50 MB avg)
PID 339970: CPU 3.9%, Memory 0.8% (119.50 MB) - openresty

📈 Metrics Endpoint Results:
Latency 5.76ms 1.22ms 10.74ms 95.00%
Req/Sec 1.01 7.24 55.00 98.09%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
31 requests in 30.01s, 267.13MB read

Run 3, 3 connections:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.5%, Memory 0.8% (120.26 MB avg)
PID 333005: CPU 1.8%, Memory 0.7% (101.87 MB) - openresty
PID 333006: CPU 1.5%, Memory 1.0% (151.62 MB) - openresty
PID 333007: CPU 1.0%, Memory 0.7% (114.47 MB) - openresty
PID 333008: CPU 1.7%, Memory 0.7% (113.09 MB) - openresty
Privileged Agents (1 processes): CPU 2.3%, Memory 0.8% (120.39 MB avg)
PID 333009: CPU 2.3%, Memory 0.8% (120.39 MB) - openresty

📈 Metrics Endpoint Results:
Latency 10.22ms 1.67ms 15.58ms 73.33%
Req/Sec 1.00 5.45 37.00 96.66%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
91 requests in 30.01s, 813.91MB read

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

{name = "waiting", var = "ngx_stat_waiting"},
}

-- Use FFI to get nginx status directly from global variables
local function nginx_status()
Member Author
@SkyeYoung SkyeYoung Jun 27, 2025

Because this API is disabled in the context of ngx.timer (the error is "API disabled in the context of ngx.timer").

Here we use FFI to rewrite the nginx_status logic.

(Screenshots: OLD (master) vs NEW (current) metrics output.)

Contributor
@bzp2010 bzp2010 Jun 27, 2025

Under the hood: ngx.var is tied to the request context, which is not accessible here; nginx subrequests are also unavailable. Hence the rewrite to FFI.

This not only eliminates the total connections offset caused by the APISIX plugin requesting the /apisix/status API itself but also improves efficiency.
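
For illustration, a rough sketch of what reading these counters via FFI can look like (the cdef types here are an assumption, since ngx_atomic_t is platform dependent, and the exact declarations in this PR may differ):

local ffi = require("ffi")

-- assumption: ngx_atomic_t maps to an unsigned long on this platform
ffi.cdef[[
extern unsigned long *ngx_stat_active;
extern unsigned long *ngx_stat_accepted;
extern unsigned long *ngx_stat_handled;
extern unsigned long *ngx_stat_requests;
extern unsigned long *ngx_stat_reading;
extern unsigned long *ngx_stat_writing;
extern unsigned long *ngx_stat_waiting;
]]

-- read the counters straight from nginx's shared-memory globals, so no HTTP
-- request to /apisix/status is needed and no extra connection appears in the metrics
local function nginx_status()
    return {
        active   = tonumber(ffi.C.ngx_stat_active[0]),
        accepted = tonumber(ffi.C.ngx_stat_accepted[0]),
        handled  = tonumber(ffi.C.ngx_stat_handled[0]),
        total    = tonumber(ffi.C.ngx_stat_requests[0]),
        reading  = tonumber(ffi.C.ngx_stat_reading[0]),
        writing  = tonumber(ffi.C.ngx_stat_writing[0]),
        waiting  = tonumber(ffi.C.ngx_stat_waiting[0]),
    }
end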

Member

Another way: we could use resty.http to send an HTTP request to ourselves to fetch the nginx status, which is easier to read.

Anyway, I can accept the current way, but I'm not sure it is easy enough for other developers to maintain.

I highly recommend adding some useful comments.

Contributor

@membphis

Actually, I recommend using these FFI APIs for data acquisition.
I confirmed from the nginx code that this data is synchronized between workers via shared memory, so the FFI APIs can access it.

This mechanism is pure LuaJIT, with no nginx fake request or openresty cosocket involved.
The former doesn't introduce any noise. With either of the latter, the fetching behavior itself causes the accepted/active/handled/reading/waiting/writing metrics to increase, because those mechanisms always issue requests over real network sockets, which themselves push the metrics up. This has always been a problem; while I can understand it and it doesn't cause serious issues, it has always been confusing.

I came up with this idea and @SkyeYoung implemented it independently after some simple research, which I'm sure is not a difficult task for developers with almost any AI/LLM assistance.
These C variables haven't changed in years, and I don't think they will change much in the future (there's no need), so it's not really an area that needs constant attention. If a future openresty/nginx breaks this convention, our test cases will catch it.

Member Author

I remember I tried using resty.http locally and ran into some problems. This was also one of the first methods suggested by @bzp2010. Because there were a lot of problems with the code, I didn't even submit it, and now I can't find it.

As for cosocket, it's even harder for me, a beginner, to understand.

Later, at the suggestion of @bzp2010, I switched to FFI and found this approach to be actually very simple and straightforward.

@@ -454,10 +458,11 @@ local function collect(ctx, stream_only)
local config = core.config.new()

-- config server status
local vars = ngx.var or {}
local hostname = vars.hostname or ""
local hostname = core.utils.gethostname() or ""
Member Author
@SkyeYoung SkyeYoung Jun 27, 2025

Because this API is disabled in the context of ngx.timer (the error is "API disabled in the context of ngx.timer").

@dosubot dosubot bot added the enhancement (New feature or request) label on Jul 21, 2025
@@ -100,18 +100,14 @@ http {
}

server {
{% if use_apisix_base then %}
Member Author

Now this API can run in a normal worker process.

Contributor

The export API will no longer be exposed to privileged processes, which provides isolation of HTTP traffic from root privileges for enhanced security.
Therefore this is no longer needed.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR decouples the Prometheus exporter's calculation and output processes in APISIX. The core change moves metric calculation to the privileged agent process, where it runs at a configurable interval (default 15s) and the results are cached in shared memory, while the output endpoint in the worker processes simply reads and returns the cached data.

  • Metric calculation is moved to privileged agent process with configurable refresh interval
  • Nginx status collection is optimized using FFI for direct access to global variables
  • Test configurations are updated with shorter refresh intervals to maintain test compatibility

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

  • apisix/plugins/prometheus/exporter.lua: Core implementation of decoupled calculation/output with timer-based metric collection
  • apisix/plugin.lua: Extracts Prometheus initialization into a separate function for proper timing
  • apisix/init.lua: Adds Prometheus initialization calls to both HTTP and stream worker init phases
  • apisix/cli/ngx_tpl.lua: Removes privileged agent process restrictions from the Prometheus server configuration
  • conf/config.yaml.example: Documents the new refresh_interval configuration option
  • t/plugin/prometheus*.t: Updates test configurations with shorter refresh intervals for test compatibility
  • t/cli/test_prometheus_stream.sh: Adds refresh_interval configuration for stream tests
  • t/cli/test_prometheus_run_in_privileged.sh: Removes the entire test file
  • apisix/core/config_etcd.lua: Initializes the values field to an empty table instead of nil
  • t/core/config_etcd.t: Adds an additional error log line expectation
Comments suppressed due to low confidence (1)

apisix/plugins/prometheus/exporter.lua:538

  • [nitpick] The error message should be more descriptive. Currently it logs the error result, but it should also indicate this is happening in the timer function and include context about the collection failure.
        core.log.error("Failed to collect metrics: ", res)

@SkyeYoung SkyeYoung marked this pull request as draft July 21, 2025 11:25
return
end

ngx.timer.at(0, exporter_timer)
Member Author
@SkyeYoung SkyeYoung Jul 22, 2025

After this modification, this call can still only be asynchronous for now.

If synchronous initialization is needed, the following part would have to change:

local local_conf = core.config.local_conf()
local stream_only = local_conf.apisix.proxy_mode == "stream"
-- we can't get etcd index in metric server if only stream subsystem is enabled
if config.type == "etcd" and not stream_only then
    -- etcd modify index
    etcd_modify_index()

    local version, err = config:server_version()
    if version then
        metrics.etcd_reachable:set(1)
    else
        metrics.etcd_reachable:set(0)
        core.log.error("prometheus: failed to reach config server while ",
                       "processing metrics endpoint: ", err)
    end

    -- Because request any key from etcd will return the "X-Etcd-Index".
    -- A non-existed key is preferred because it doesn't return too much data.
    -- So use phantom key to get etcd index.
    local res, _ = config:getkey("/phantomkey")
    if res and res.headers then
        clear_tab(key_values)
        -- global max
        key_values[1] = "x_etcd_index"
        metrics.etcd_modify_indexes:set(res.headers["X-Etcd-Index"], key_values)
    end
end

The reason is that this part sends requests (to etcd), which leads to the following error when calling exporter_timer() directly:

2025/07/21 14:41:12 [error] 464992#464992: init_worker_by_lua error: /home/xxx/apisix//deps/share/lua/5.1/resty/http.lua:74: API disabled in the context of init_worker_by_lua*
stack traceback:
  [C]: in function 'co_create'
  /home/xxx/apisix//deps/share/lua/5.1/resty/http.lua:74: in function '_body_reader'
  /home/xxx/apisix//deps/share/lua/5.1/resty/http.lua:821: in function 'request'
  /home/xxx/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:120: in function 'request_uri_via_unix_socket'
  /home/xxx/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:160: in function 'http_request_uri'
  /home/xxx/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:251: in function 'server_version'
  /home/xxx/apisix/apisix/core/config_etcd.lua:1066: in function 'server_version'
  /home/xxx/apisix/apisix/plugins/prometheus/exporter.lua:474: in function 'collect'
  /home/xxx/apisix/apisix/plugins/prometheus/exporter.lua:537: in function 'exporter_timer'
  /home/xxx/apisix/apisix/plugins/prometheus/exporter.lua:553: in function 'init_exporter_timer'
  /home/xxx/apisix/apisix/plugin.lua:808: in function 'init_prometheus'
  /home/xxx/apisix/apisix/init.lua:161: in function 'http_init_worker'
  init_worker_by_lua:2: in main chunk

So, if we need synchronous initialization, we have to continue the discussion on whether to remove this part of the metrics collection or move it elsewhere.

@SkyeYoung SkyeYoung marked this pull request as ready for review July 22, 2025 01:04
Comment on lines 224 to 230
-- FIXME:
-- Now the HTTP subsystem loads the stream plugin unintentionally, which shouldn't happen.
-- It breaks the initialization logic of the plugin,
-- here it is temporarily fixed using a workaround.
if ngx.config.subsystem ~= "stream" then
return
end
Member Author
@SkyeYoung SkyeYoung Jul 22, 2025

As mentioned in the comments, the http subsystem also loads the stream plugins. This is an issue that needs to be resolved.


Contributor
@bzp2010 bzp2010 Jul 22, 2025

Please create an issue for this. thx @SkyeYoung

Comment on lines -360 to -367
local enabled = core.table.array_find(http_plugin_names, "prometheus") ~= nil
local active = exporter.get_prometheus() ~= nil
if not enabled then
exporter.destroy()
end
if enabled and not active then
exporter.http_init()
end
Contributor

Add some description under this comment explaining why we removed it and moved to init and destroy hooks.

Member Author
@SkyeYoung SkyeYoung Jul 22, 2025

The original code skipped the plugin.init() and old_plugin.destroy() hooks used in https://github.com/apache/apisix/blob/6fb9bf94281525c1fca397f681b4890b69440369/apisix/plugin.lua and implemented its own reload handling for the prometheus plugin, for some reason I have not yet understood (perhaps because prometheus.lua originally did not provide the init and destroy functions).

The initial motivation was that even after separating the init_prometheus part and placing it at the end of init_worker, directly calling exporter_timer() would still raise an error. While debugging, I found this additional initialization logic here, which is clearly redundant.

Currently, we provide init and destroy functions in prometheus.lua, so initialization and reloading of the prometheus plugin are handled within the plugin's own files, reducing coupling.

This also lets the prometheus plugin fall back to the mechanism provided by plugin.lua, reducing special cases, lowering the cost of understanding, and making the code easier to maintain.

require("apisix.plugins.prometheus.exporter").http_init(prometheus_enabled_in_stream)
elseif not is_http and core.table.array_find(stream_plugin_names, "prometheus") then
require("apisix.plugins.prometheus.exporter").stream_init()
if is_http and (enabled_in_http or enabled_in_stream) then
Contributor
@bzp2010 bzp2010 Jul 22, 2025

NOTE

We will always only handle metrics generation in the http subsystem.

  1. This ensures that execution is not duplicated across http and stream, which would waste compute resources.
  2. This simplifies the design.
  3. Whether or not the user has http enabled (i.e., whether or not APISIX is in stream-only mode), an http block for the Prometheus export API and its server block (:9091) will always be present, otherwise Prometheus would be pointless. This means we always have an http subsystem context for the periodic timer and metrics generation, even in stream-only mode.

Contributor
@bzp2010 bzp2010 Jul 22, 2025

Please add some comments to the code to document the design intent. @SkyeYoung

Member Author

done.

@@ -35,6 +34,7 @@ local _M = {
priority = 500,
name = plugin_name,
log = exporter.http_log,
destroy = exporter.destroy,
Contributor

NOTE

This will always destroy the plugin (the prometheus instance inside it) when it is reloaded via the Admin API, and load it again based on the latest configuration.
If a reload is performed after the plugin has been removed from the configured plugin list, the plugin will not be restored until the next reload.

Technically, exporter.destroy just backs up that instance of the prometheus module and copies it to another variable.
This will cause the export API to stop working, at which point it will always return a {}, which is consistent with the current behavior.
Under the hood, the timer will also stop working, no longer generating metrics based on interval timing, and the metrics computation overhead introduced by APISIX is completely eliminated.
When the next plugin reload occurs, if prometheus is re-enabled, the timer will resume running.

Regarding the background timer introduced by the prometheus third-party library, unfortunately, it never stops running.
It is registered with ngx.timer.every to perform the task of synchronizing the shdict at regular intervals, and this overhead cannot be paused or resumed by external intervention unless we fork and modify the library itself.

So this "destruction" does not mean the prometheus instance is actually destroyed, the synchronization timer stopped, or the shdict cleared; none of that happens.
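
A hypothetical sketch of this backup-style destroy/restore behaviour (the names below are illustrative; the real code lives in exporter.lua):

local prometheus          -- the live instance used by the timer and the export API
local prometheus_backup   -- backup kept so a later reload can restore it
local exporter_timer      -- the collection timer function (defined elsewhere)

local function destroy()
    -- after this, the export API returns "{}" and the collection timer,
    -- which checks `prometheus` before rescheduling itself, stops
    prometheus_backup = prometheus
    prometheus = nil
end

local function restore()
    -- called when the plugin is re-enabled by a later plugin reload
    if not prometheus and prometheus_backup then
        prometheus = prometheus_backup
        ngx.timer.at(0, exporter_timer)  -- re-arm the collection timer
    end
end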

@@ -55,4 +55,11 @@ function _M.api()
end


function _M.init()
Contributor

NOTE

We turned to using the built-in hooks of the plugin system, namely init to initialize the prometheus instance and prometheus metrics registration.

Note, however, that data population only happens the first time the plugin is started (usually when the worker starts, i.e. the init_prometheus call in init.lua's http_init_worker) and then on every timer run.
This initialization only registers the metrics; it doesn't actually populate the data.

function _M.init()
local local_conf = core.config.local_conf()
local enabled_in_stream = core.table.array_find(local_conf.stream_plugins, "prometheus")
exporter.http_init(enabled_in_stream)
Contributor

NOTE

The prometheus plugin, loaded by the http subsystem, registers the http metrics there and decides whether to register the stream metrics (xrpc) depending on whether the stream subsystem has been started.
This is mainly for metrics generation in the privileged process; stream data is not actually reported in any phase of the http subsystem.

local version, err = config:server_version()
if version then
metrics.etcd_reachable:set(1)
if yieldable then
Contributor

NOTE

The metrics include the etcd reachability report and the latest etcd modified-index report, both of which rely on communication with etcd.
According to openresty's restrictions, yielding is prohibited in the init_worker phase, i.e. cosocket-based communication with etcd is not allowed.
So we skip this capture here until the timer performs it.
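
A small fragment of how such a guard might look, reusing names from the code excerpt earlier in this thread (a sketch, not the exact implementation):

-- Sketch: etcd-dependent metrics are collected only when yielding is allowed,
-- i.e. from the timer context, never from init_worker.
local function collect(yieldable)
    -- ... nginx status, shared-dict and per-route metrics are gathered here ...

    if yieldable and config.type == "etcd" and not stream_only then
        -- these calls talk to etcd over cosockets, which init_worker forbids
        etcd_modify_index()
        local version = config:server_version()
        metrics.etcd_reachable:set(version and 1 or 0)
    end

    -- ... render and return the Prometheus text exposition ...
end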

return
end

if not prometheus then
Contributor

NOTE

This is used for dynamically disabling the plugin (via the plugin reload API), i.e. what exporter.destroy does.
This is where we stop if the prometheus instance has been "destroyed"; as you can see, this happens before scheduling the next timer task, which means the timer stops.

Technically, this is the advantage of ngx.timer.at over ngx.timer.every: every is not terminable, since the developer can't get an "instance" of a timer to pause or stop it.
But with ngx.timer.at, we can precisely control whether or not to schedule the next timed task, which lets us stop the timer. If we need to resume it, we just re-execute ngx.timer.at(0).

return
end

exporter_timer(false, false)
Contributor
@bzp2010 bzp2010 Jul 22, 2025

NOTE

The initialization of the timer performs one acquisition task synchronously, i.e. the first acquisition always happens in the init_worker phase, which provides initial metrics data.

If at any time the metrics data (the string in the prometheus-cache shdict) is not available, the API will report an error and log it. By design this is unlikely to happen.

local cached_metrics_text = shdict_prometheus_cache:get(CACHED_METRICS_KEY)
if not cached_metrics_text then
    core.log.error("Failed to retrieve cached metrics: data is nil")
    return 500, "Failed to retrieve metrics: no data available"

if not prometheus then
core.response.exit(200, "{}")
return core.response.exit(200, "{}")
Contributor

JFI, this behavior seems to be inconsistent with what is in get_cached_metrics and we need to confirm which mode should be used. cc @membphis

BTW, prometheus being nil happens when the plugin is dynamically disabled.

Member

the current way is good to me

@@ -110,6 +116,10 @@ end


function _M.http_init(prometheus_enabled_in_stream)
if ngx.config.subsystem ~= "http" then
return
Member

We can print a warning/error log here; reaching this branch is unexpected.

Contributor

This is a problem that needs to be solved, and after it is solved, it seems that we will no longer need these assertions.

Member Author

done.

-- It breaks the initialization logic of the plugin,
-- here it is temporarily fixed using a workaround.
if ngx.config.subsystem ~= "stream" then
return
Member

ditto

Member Author

done.

{name = "waiting", var = "ngx_stat_waiting"},
}

-- Use FFI to get nginx status directly from global variables
local function nginx_status()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another way:

we can use resty.http to send the http request to itself for fetching nginx status

which is easier to read

anyway, I can accept current way but I can not sure it is easier enough to maintain for other developers

I highly recommend to add some comments which is useful

return 500, {message = "An unexpected error occurred"}
end

local function collect(yieldable)
Member

can we return directly if the yieldable is false?

Contributor

I suggest performing the first capture and metrics generation in init_worker; it is synchronous and blocking, which makes the metrics almost always available. Otherwise we can't be sure when the metrics will become available, which can lead to unexpected error responses (HTTP 500).

Member Author

This parameter is currently used to solve the problem of not being able to obtain the following data during initialization.

I elaborated in detail in #12383 (comment).



@@ -170,6 +170,7 @@ nginx_config: # Config for render the template to generate n
meta:
lua_shared_dict: # Nginx Lua shared memory zone. Size units are m or k.
prometheus-metrics: 15m
prometheus-cache: 10m
Member

Please add some comments telling users when they need to modify it.

        core.log.error("Failed to collect metrics: ", res)
        return
    end
    shdict_prometheus_cache:set(CACHED_METRICS_KEY, res)
Member

We need to capture the return value; it may fail.

If there is an error, tell the user the reason, and suggest increasing the default size if the shdict is too small.
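
A sketch of what capturing the result could look like (ngx.shared.DICT:set returns ok, err, forcible; the names follow the excerpt above):

local ok, err, forcible = shdict_prometheus_cache:set(CACHED_METRICS_KEY, res)
if not ok then
    core.log.error("failed to cache metrics: ", err,
                   "; consider increasing the prometheus-cache shared dict size")
elseif forcible then
    core.log.warn("prometheus-cache shared dict is low on memory, ",
                  "old entries were evicted; consider increasing its size")
end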

Member Author

done

Labels: enhancement (New feature or request), size:L (This PR changes 100-499 lines, ignoring generated files)