
feat: decoupled prometheus exporter's calculation and output #12383


Open
wants to merge 50 commits into master

Conversation

SkyeYoung
Member

@SkyeYoung SkyeYoung commented Jun 26, 2025

Description

This PR decouples the calculation and output processes of the Prometheus exporter. The "calculation" is performed in the privileged agent process at intervals defined by refresh_interval (default: 15s) and written to a shared dict, while the "output" (i.e., the /apisix/prometheus/metrics API) is moved to the worker processes, which only read and return the cached data from the shared dict.
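
To illustrate the mechanism, here is a minimal sketch (the shared dict name, key, and helper names below are illustrative, not the exact code in this PR):

-- Sketch only: "prometheus-cache", CACHED_METRICS_KEY and collect() are
-- illustrative names, not necessarily those used in the PR.
local shared_cache = ngx.shared["prometheus-cache"]
local CACHED_METRICS_KEY = "cached_metrics"
local refresh_interval = 15  -- seconds, configurable via refresh_interval

-- placeholder for the real metric calculation done in the privileged agent
local function collect()
    return "# rendered Prometheus text exposition ..."
end

-- privileged agent: recalculate periodically and cache the rendered text
local function exporter_timer(premature)
    if premature then
        return
    end
    local ok, err = shared_cache:set(CACHED_METRICS_KEY, collect())
    if not ok then
        ngx.log(ngx.ERR, "failed to cache metrics: ", err)
    end
    ngx.timer.at(refresh_interval, exporter_timer)
end

-- worker: /apisix/prometheus/metrics only reads the cached text
local function metrics_handler()
    local cached = shared_cache:get(CACHED_METRICS_KEY)
    if not cached then
        return 500, "Failed to retrieve metrics: no data available"
    end
    return 200, cached
end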

Those are just the core changes. In practice I ran into many other problems, which are commented or annotated in the corresponding places, so I won't repeat them here.

For the testing part, since the Prometheus exporter now refreshes data every 15 seconds by default, I used a smaller interval in the relevant tests so the original tests still pass.

Which issue(s) this PR fixes:

Fixes #

Stress Testing (IN PROGRESS, awaiting discussion)

Scripts

https://gist.github.com/SkyeYoung/dc7f8b8d9c7e28e643cd851a6ad6af72

How to use

  1. Install wrk2 and git clone apisix
  2. Deploy etcd (./test.sh init-etcd) and nginx as the upstream (./test.sh start-nginx)
  3. make run
  4. Create 10k routes (./test.sh create)
  5. Enable prometheus (./test.sh enable-prometheus)
  6. Run the benchmark (./test.sh benchmark)

Test Logic

benchmark performs two tests in sequence:

echo "🔧 Test 1: Three Connections"
benchmark_metrics "three_conn" -t3 -c3 -d30s -R3 -U

echo "🔧 Test 2: Single Connection"
benchmark_metrics "single_conn" -t1 -c1 -d30s -R1 -U

The params are passed directly to wrk2; they are the settings for requesting the Prometheus exporter API.

And the specific steps in benchmark_metrics are as follows:

BASELINE_RATE="100"

# restart apisix
make stop && make run

# ...

# Start test routes load in background using wrk.lua
nohup bash -c "wrk -t 4 -c 100 -d 60s -U -R ${BASELINE_RATE} -s wrk.lua '${test_routes_url}' > '${routes_output}' 2>&1" &
local routes_pid=$!

# Wait a moment for routes load to establish
sleep 15

# Run wrk benchmark against metrics endpoint
wrk "$@" "${metrics_url}" > "${metrics_output}" 2>&1

# Wait for routes load to finish
wait $routes_pid

Results

Run 1, 1 connection:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.4%, Memory 0.7% (114.90 MB avg)
PID 307741: CPU 1.7%, Memory 0.9% (140.27 MB) - openresty
PID 307742: CPU 1.7%, Memory 0.7% (113.73 MB) - openresty
PID 307743: CPU 1.2%, Memory 0.7% (102.95 MB) - openresty
PID 307744: CPU 1.0%, Memory 0.7% (102.67 MB) - openresty
Privileged Agents (1 processes): CPU 2.7%, Memory 0.7% (117.99 MB avg)
PID 307745: CPU 2.7%, Memory 0.7% (117.99 MB) - openresty

📈 Metrics Endpoint Results:
Latency 6.22ms 1.01ms 9.83ms 90.00%
Req/Sec 1.00 5.80 35.00 97.09%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
31 requests in 30.01s, 273.36MB read

Run 1, 3 connections:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.5%, Memory 0.8% (120.32 MB avg)
PID 300776: CPU 2.8%, Memory 0.9% (142.31 MB) - openresty
PID 300777: CPU 1.0%, Memory 0.7% (111.78 MB) - openresty
PID 300778: CPU 0.8%, Memory 0.7% (112.78 MB) - openresty
PID 300779: CPU 1.4%, Memory 0.7% (114.39 MB) - openresty
Privileged Agents (1 processes): CPU 3.9%, Memory 0.7% (118.08 MB avg)
PID 300780: CPU 3.9%, Memory 0.7% (118.08 MB) - openresty

📈 Metrics Endpoint Results:
Latency 9.70ms 4.36ms 31.97ms 95.00%
Req/Sec 0.99 4.76 31.00 95.53%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
90 requests in 30.01s, 804.52MB read

Run 2, 1 connection:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.5%, Memory 0.7% (117.02 MB avg)
PID 324993: CPU 2.1%, Memory 0.7% (114.96 MB) - openresty
PID 324994: CPU 1.5%, Memory 0.7% (104.77 MB) - openresty
PID 324995: CPU 1.3%, Memory 0.7% (105.16 MB) - openresty
PID 324996: CPU 1.2%, Memory 0.9% (143.19 MB) - openresty
Privileged Agents (1 processes): CPU 0.5%, Memory 0.8% (123.23 MB avg)
PID 324997: CPU 0.5%, Memory 0.8% (123.23 MB) - openresty

📈 Metrics Endpoint Results:
Latency 6.61ms 1.34ms 10.78ms 75.00%
Req/Sec 1.00 5.58 33.00 96.89%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
31 requests in 30.01s, 293.73MB read

Run 2, 3 connections:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 2.1%, Memory 0.8% (122.44 MB avg)
PID 318089: CPU 3.6%, Memory 0.9% (143.75 MB) - openresty
PID 318090: CPU 1.5%, Memory 0.7% (115.52 MB) - openresty
PID 318091: CPU 1.7%, Memory 0.7% (116.41 MB) - openresty
PID 318092: CPU 1.6%, Memory 0.7% (114.07 MB) - openresty
Privileged Agents (1 processes): CPU 2.2%, Memory 0.8% (120.06 MB avg)
PID 318093: CPU 2.2%, Memory 0.8% (120.06 MB) - openresty

📈 Metrics Endpoint Results:
Latency 10.70ms 4.62ms 33.44ms 94.92%
Req/Sec 0.98 4.31 25.00 94.81%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
91 requests in 30.01s, 855.44MB read

Run 3, 1 connection:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.6%, Memory 0.7% (114.56 MB avg)
PID 339966: CPU 2.3%, Memory 0.9% (140.78 MB) - openresty
PID 339967: CPU 1.6%, Memory 0.7% (112.49 MB) - openresty
PID 339968: CPU 1.1%, Memory 0.6% (101.79 MB) - openresty
PID 339969: CPU 1.6%, Memory 0.7% (103.17 MB) - openresty
Privileged Agents (1 processes): CPU 3.9%, Memory 0.8% (119.50 MB avg)
PID 339970: CPU 3.9%, Memory 0.8% (119.50 MB) - openresty

📈 Metrics Endpoint Results:
Latency 5.76ms 1.22ms 10.74ms 95.00%
Req/Sec 1.01 7.24 55.00 98.09%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
31 requests in 30.01s, 267.13MB read

Run 3, 3 connections:

📊 Performance Summary:
Nginx Workers (4 processes): CPU 1.5%, Memory 0.8% (120.26 MB avg)
PID 333005: CPU 1.8%, Memory 0.7% (101.87 MB) - openresty
PID 333006: CPU 1.5%, Memory 1.0% (151.62 MB) - openresty
PID 333007: CPU 1.0%, Memory 0.7% (114.47 MB) - openresty
PID 333008: CPU 1.7%, Memory 0.7% (113.09 MB) - openresty
Privileged Agents (1 processes): CPU 2.3%, Memory 0.8% (120.39 MB avg)
PID 333009: CPU 2.3%, Memory 0.8% (120.39 MB) - openresty

📈 Metrics Endpoint Results:
Latency 10.22ms 1.67ms 15.58ms 73.33%
Req/Sec 1.00 5.45 37.00 96.66%
Latency Distribution (HdrHistogram - Recorded Latency)
Latency Distribution (HdrHistogram - Uncorrected Latency (measured without taking delayed starts into account))
91 requests in 30.01s, 813.91MB read

Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (If not, please discuss on the APISIX mailing list first)

{name = "waiting", var = "ngx_stat_waiting"},
}

-- Use FFI to get nginx status directly from global variables
local function nginx_status()
Member Author
@SkyeYoung SkyeYoung Jun 27, 2025

Because this API is disabled in the context of ngx.timer (the error is "API disabled in the context of ngx.timer").

Here we use FFI to rewrite the nginx_status logic.

(Screenshots: OLD (master) vs NEW (current) metrics output.)

Contributor
@bzp2010 bzp2010 Jun 27, 2025

Under the hood: ngx.var is tied to the request context, which is not accessible here; nginx subrequests are also unavailable. Hence the rewrite to FFI.

This not only eliminates the total connections offset caused by the APISIX plugin requesting the /apisix/status API itself but also improves efficiency.
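
For illustration, a rough sketch of what reading these counters via FFI can look like (the cdef types here are an assumption, since ngx_atomic_t is platform dependent, and the exact declarations in this PR may differ):

local ffi = require("ffi")

-- assumption: ngx_atomic_t maps to an unsigned long on this platform
ffi.cdef[[
extern unsigned long *ngx_stat_active;
extern unsigned long *ngx_stat_accepted;
extern unsigned long *ngx_stat_handled;
extern unsigned long *ngx_stat_requests;
extern unsigned long *ngx_stat_reading;
extern unsigned long *ngx_stat_writing;
extern unsigned long *ngx_stat_waiting;
]]

-- read the counters straight from nginx's shared-memory globals, so no HTTP
-- request to /apisix/status is needed and no extra connection appears in the metrics
local function nginx_status()
    return {
        active   = tonumber(ffi.C.ngx_stat_active[0]),
        accepted = tonumber(ffi.C.ngx_stat_accepted[0]),
        handled  = tonumber(ffi.C.ngx_stat_handled[0]),
        total    = tonumber(ffi.C.ngx_stat_requests[0]),
        reading  = tonumber(ffi.C.ngx_stat_reading[0]),
        writing  = tonumber(ffi.C.ngx_stat_writing[0]),
        waiting  = tonumber(ffi.C.ngx_stat_waiting[0]),
    }
end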

Member

Another way: we could use resty.http to send an HTTP request to ourselves to fetch the nginx status, which is easier to read.

Anyway, I can accept the current way, but I'm not sure it is easy enough for other developers to maintain.

I highly recommend adding some useful comments.

Contributor

@membphis

Actually, I recommend using these FFI APIs for data acquisition.
I confirmed from the nginx code that this data is synchronized between workers via shared memory, so the FFI APIs can access it.

This mechanism is pure LuaJIT, with no nginx fake request or openresty cosocket involved.
The former doesn't introduce any noise. With either of the latter, the fetching behavior itself causes the accepted/active/handled/reading/waiting/writing metrics to increase, because those mechanisms always issue requests over real network sockets, which themselves push the metrics up. This has always been a problem; while I can understand it and it doesn't cause serious issues, it has always been confusing.

I came up with this idea and @SkyeYoung implemented it independently after some simple research, which I'm sure is not a difficult task for developers with almost any AI/LLM assistance.
These C variables haven't changed in years, and I don't think they will change much in the future (there's no need), so it's not really an area that needs constant attention. If a future openresty/nginx breaks this convention, our test cases will catch it.

Member Author

I remember I tried using resty.http locally and ran into some problems. This was also one of the first methods suggested by @bzp2010. Because there were a lot of problems with the code, I didn't even submit it, and now I can't find it.

As for cosocket, it's even harder for me, a beginner, to understand.

Later, at the suggestion of @bzp2010, I switched to FFI and found this approach to be actually very simple and straightforward.

@@ -454,10 +458,11 @@ local function collect(ctx, stream_only)
local config = core.config.new()

-- config server status
local vars = ngx.var or {}
local hostname = vars.hostname or ""
local hostname = core.utils.gethostname() or ""
Member Author
@SkyeYoung SkyeYoung Jun 27, 2025

Because this API is disabled in the context of ngx.timer (the error is "API disabled in the context of ngx.timer").

@dosubot dosubot bot added the enhancement (New feature or request) label on Jul 21, 2025
@@ -100,18 +100,14 @@ http {
}

server {
{% if use_apisix_base then %}
Member Author

Now this API can run in a normal worker process.

Contributor

The export API will no longer be exposed to privileged processes, which provides isolation of HTTP traffic from root privileges for enhanced security.
Therefore this is no longer needed.

@Copilot Copilot AI left a comment

Pull Request Overview

This PR decouples the Prometheus exporter's calculation and output processes in APISIX. The core change moves metric calculation to the privileged agent process, where it runs at a configurable interval (default 15s) and the results are cached in shared memory, while the output endpoint in the worker processes simply reads and returns the cached data.

  • Metric calculation is moved to privileged agent process with configurable refresh interval
  • Nginx status collection is optimized using FFI for direct access to global variables
  • Test configurations are updated with shorter refresh intervals to maintain test compatibility

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 5 comments.

  • apisix/plugins/prometheus/exporter.lua: Core implementation of decoupled calculation/output with timer-based metric collection
  • apisix/plugin.lua: Extracts Prometheus initialization into a separate function for proper timing
  • apisix/init.lua: Adds Prometheus initialization calls to both HTTP and stream worker init phases
  • apisix/cli/ngx_tpl.lua: Removes privileged agent process restrictions from the Prometheus server configuration
  • conf/config.yaml.example: Documents the new refresh_interval configuration option
  • t/plugin/prometheus*.t: Updates test configurations with shorter refresh intervals for test compatibility
  • t/cli/test_prometheus_stream.sh: Adds refresh_interval configuration for stream tests
  • t/cli/test_prometheus_run_in_privileged.sh: Removes the entire test file
  • apisix/core/config_etcd.lua: Initializes the values field to an empty table instead of nil
  • t/core/config_etcd.t: Adds an additional error log line expectation
Comments suppressed due to low confidence (1)

apisix/plugins/prometheus/exporter.lua:538

  • [nitpick] The error message should be more descriptive. Currently it logs the error result, but it should also indicate this is happening in the timer function and include context about the collection failure.
        core.log.error("Failed to collect metrics: ", res)

@SkyeYoung SkyeYoung marked this pull request as draft July 21, 2025 11:25
return
end

ngx.timer.at(0, exporter_timer)
Member Author
@SkyeYoung SkyeYoung Jul 22, 2025

After this modification, this call can still only be asynchronous for now.

If synchronous initialization is needed, the following part would have to change:

local local_conf = core.config.local_conf()
local stream_only = local_conf.apisix.proxy_mode == "stream"
-- we can't get etcd index in metric server if only stream subsystem is enabled
if config.type == "etcd" and not stream_only then
    -- etcd modify index
    etcd_modify_index()

    local version, err = config:server_version()
    if version then
        metrics.etcd_reachable:set(1)
    else
        metrics.etcd_reachable:set(0)
        core.log.error("prometheus: failed to reach config server while ",
                       "processing metrics endpoint: ", err)
    end

    -- Because request any key from etcd will return the "X-Etcd-Index".
    -- A non-existed key is preferred because it doesn't return too much data.
    -- So use phantom key to get etcd index.
    local res, _ = config:getkey("/phantomkey")
    if res and res.headers then
        clear_tab(key_values)
        -- global max
        key_values[1] = "x_etcd_index"
        metrics.etcd_modify_indexes:set(res.headers["X-Etcd-Index"], key_values)
    end
end

The reason is that this part sends requests (to etcd), which leads to the following error when calling exporter_timer() directly:

2025/07/21 14:41:12 [error] 464992#464992: init_worker_by_lua error: /home/xxx/apisix//deps/share/lua/5.1/resty/http.lua:74: API disabled in the context of init_worker_by_lua*
stack traceback:
  [C]: in function 'co_create'
  /home/xxx/apisix//deps/share/lua/5.1/resty/http.lua:74: in function '_body_reader'
  /home/xxx/apisix//deps/share/lua/5.1/resty/http.lua:821: in function 'request'
  /home/xxx/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:120: in function 'request_uri_via_unix_socket'
  /home/xxx/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:160: in function 'http_request_uri'
  /home/xxx/apisix//deps/share/lua/5.1/resty/etcd/v3.lua:251: in function 'server_version'
  /home/xxx/apisix/apisix/core/config_etcd.lua:1066: in function 'server_version'
  /home/xxx/apisix/apisix/plugins/prometheus/exporter.lua:474: in function 'collect'
  /home/xxx/apisix/apisix/plugins/prometheus/exporter.lua:537: in function 'exporter_timer'
  /home/xxx/apisix/apisix/plugins/prometheus/exporter.lua:553: in function 'init_exporter_timer'
  /home/xxx/apisix/apisix/plugin.lua:808: in function 'init_prometheus'
  /home/xxx/apisix/apisix/init.lua:161: in function 'http_init_worker'
  init_worker_by_lua:2: in main chunk

So, if we need synchronous initialization, we have to continue the discussion on whether to remove this part of the metrics collection or move it elsewhere.

@SkyeYoung SkyeYoung marked this pull request as ready for review July 22, 2025 01:04
Comment on lines 224 to 230
-- FIXME:
-- Now the HTTP subsystem loads the stream plugin unintentionally, which shouldn't happen.
-- It breaks the initialization logic of the plugin,
-- here it is temporarily fixed using a workaround.
if ngx.config.subsystem ~= "stream" then
return
end
Member Author
@SkyeYoung SkyeYoung Jul 22, 2025

As mentioned in the comments, the http subsystem also loads the stream plugins. This is an issue that needs to be resolved.


Contributor
@bzp2010 bzp2010 Jul 22, 2025

Please create an issue for this. thx @SkyeYoung

Comment on lines -360 to -367
local enabled = core.table.array_find(http_plugin_names, "prometheus") ~= nil
local active = exporter.get_prometheus() ~= nil
if not enabled then
exporter.destroy()
end
if enabled and not active then
exporter.http_init()
end
Contributor

Add some description under this comment explaining why we removed it and moved to init and destroy hooks.

Member Author
@SkyeYoung SkyeYoung Jul 22, 2025

The original code skipped the plugin.init() and old_plugin.destroy() hooks used in https://github.com/apache/apisix/blob/6fb9bf94281525c1fca397f681b4890b69440369/apisix/plugin.lua and implemented its own reload handling for the prometheus plugin, for some reason I have not yet understood (perhaps because prometheus.lua originally did not provide the init and destroy functions).

The initial motivation was that even after separating the init_prometheus part and placing it at the end of init_worker, directly calling exporter_timer() would still raise an error. While debugging, I found this additional initialization logic here, which is clearly redundant.

Currently, we provide init and destroy functions in prometheus.lua, so initialization and reloading of the prometheus plugin are handled within the plugin's own files, reducing coupling.

This also lets the prometheus plugin fall back to the mechanism provided by plugin.lua, reducing special cases, lowering the cost of understanding, and making the code easier to maintain.

require("apisix.plugins.prometheus.exporter").http_init(prometheus_enabled_in_stream)
elseif not is_http and core.table.array_find(stream_plugin_names, "prometheus") then
require("apisix.plugins.prometheus.exporter").stream_init()
if is_http and (enabled_in_http or enabled_in_stream) then
Contributor
@bzp2010 bzp2010 Jul 22, 2025

NOTE

We will always only handle metrics generation in the http subsystem.

  1. This ensures that execution is not duplicated across http and stream, which would waste compute resources.
  2. This simplifies the design.
  3. Whether or not the user has http enabled (i.e., whether or not APISIX is in stream-only mode), an http block for the Prometheus export API and its server block (:9091) will always be present, otherwise Prometheus would be pointless. This means we always have an http subsystem context for the periodic timer and metrics generation, even in stream-only mode.

Contributor
@bzp2010 bzp2010 Jul 22, 2025

Please add some comments to the code to document the design intent. @SkyeYoung

Member Author

done.

@@ -35,6 +34,7 @@ local _M = {
priority = 500,
name = plugin_name,
log = exporter.http_log,
destroy = exporter.destroy,
Contributor

NOTE

This will always destroy the plugin (the prometheus instance inside it) when it is reloaded via the Admin API, and load it again based on the latest configuration.
If a reload is performed after the plugin has been removed from the configured plugin list, the plugin will not be restored until the next reload.

Technically, exporter.destroy just backs up that instance of the prometheus module and copies it to another variable.
This will cause the export API to stop working, at which point it will always return a {}, which is consistent with the current behavior.
Under the hood, the timer will also stop working, no longer generating metrics based on interval timing, and the metrics computation overhead introduced by APISIX is completely eliminated.
When the next plugin reload occurs, if prometheus is re-enabled, the timer will resume running.

Regarding the background timer introduced by the prometheus third-party library, unfortunately, it never stops running.
It is registered with ngx.timer.every to perform the task of synchronizing the shdict at regular intervals, and this overhead cannot be paused or resumed by external intervention unless we fork and modify the library itself.

So this "destruction" does not mean the prometheus instance is actually destroyed, the synchronization timer stopped, or the shdict cleared; none of that happens.
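
A hypothetical sketch of this backup-style destroy/restore behaviour (the names below are illustrative; the real code lives in exporter.lua):

local prometheus          -- the live instance used by the timer and the export API
local prometheus_backup   -- backup kept so a later reload can restore it
local exporter_timer      -- the collection timer function (defined elsewhere)

local function destroy()
    -- after this, the export API returns "{}" and the collection timer,
    -- which checks `prometheus` before rescheduling itself, stops
    prometheus_backup = prometheus
    prometheus = nil
end

local function restore()
    -- called when the plugin is re-enabled by a later plugin reload
    if not prometheus and prometheus_backup then
        prometheus = prometheus_backup
        ngx.timer.at(0, exporter_timer)  -- re-arm the collection timer
    end
end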

@@ -55,4 +55,11 @@ function _M.api()
end


function _M.init()
Contributor

NOTE

We turned to using the built-in hooks of the plugin system, namely init to initialize the prometheus instance and prometheus metrics registration.

Note, however, that data population only happens the first time the plugin is started (usually when the worker starts, i.e. the init_prometheus call in init.lua's http_init_worker) and then on every timer run.
This initialization only registers the metrics; it doesn't actually populate the data.

function _M.init()
local local_conf = core.config.local_conf()
local enabled_in_stream = core.table.array_find(local_conf.stream_plugins, "prometheus")
exporter.http_init(enabled_in_stream)
Contributor

NOTE

The prometheus plugin, loaded by the http subsystem, registers the http metrics there and decides whether to register the stream metrics (xrpc) depending on whether the stream subsystem has been started.
This is mainly for metrics generation in the privileged process; stream data is not actually reported in any phase of the http subsystem.

local version, err = config:server_version()
if version then
metrics.etcd_reachable:set(1)
if yieldable then
Contributor

NOTE

The metrics include the etcd reachability report and the latest etcd modified-index report, both of which rely on communication with etcd.
According to openresty's restrictions, yielding is prohibited in the init_worker phase, i.e. cosocket-based communication with etcd is not allowed.
So we skip this capture here until the timer performs it.
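
A small fragment of how such a guard might look, reusing names from the code excerpt earlier in this thread (a sketch, not the exact implementation):

-- Sketch: etcd-dependent metrics are collected only when yielding is allowed,
-- i.e. from the timer context, never from init_worker.
local function collect(yieldable)
    -- ... nginx status, shared-dict and per-route metrics are gathered here ...

    if yieldable and config.type == "etcd" and not stream_only then
        -- these calls talk to etcd over cosockets, which init_worker forbids
        etcd_modify_index()
        local version = config:server_version()
        metrics.etcd_reachable:set(version and 1 or 0)
    end

    -- ... render and return the Prometheus text exposition ...
end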

return
end

if not prometheus then
Contributor

NOTE

This is used for dynamically disabling the plugin (via the plugin reload API), i.e. what exporter.destroy does.
This is where we stop if the prometheus instance has been "destroyed"; as you can see, this happens before scheduling the next timer task, which means the timer stops.

Technically, this is the advantage of ngx.timer.at over ngx.timer.every: every is not terminable, since the developer can't get an "instance" of a timer to pause or stop it.
But with ngx.timer.at, we can precisely control whether or not to schedule the next timed task, which lets us stop the timer. If we need to resume it, we just re-execute ngx.timer.at(0).

return
end

exporter_timer(false, false)
Contributor
@bzp2010 bzp2010 Jul 22, 2025

NOTE

The initialization of the timer performs one acquisition task synchronously, i.e. the first acquisition always happens in the init_worker phase, which provides initial metrics data.

If at any time the metrics data (the string in the prometheus-cache shdict) is not available, the API will report an error and log it. By design this is unlikely to happen.

local cached_metrics_text = shdict_prometheus_cache:get(CACHED_METRICS_KEY)
if not cached_metrics_text then
    core.log.error("Failed to retrieve cached metrics: data is nil")
    return 500, "Failed to retrieve metrics: no data available"

if not prometheus then
core.response.exit(200, "{}")
return core.response.exit(200, "{}")
Contributor

JFI, this behavior seems to be inconsistent with what is in get_cached_metrics and we need to confirm which mode should be used. cc @membphis

BTW, prometheus being nil happens when the plugin is dynamically disabled.

Member

the current way is good to me

@@ -110,6 +116,10 @@ end


function _M.http_init(prometheus_enabled_in_stream)
if ngx.config.subsystem ~= "http" then
return
Member

We can print a warning/error log here; reaching this branch is unexpected.

Contributor

This is a problem that needs to be solved, and after it is solved, it seems that we will no longer need these assertions.

Member Author

done.

-- It breaks the initialization logic of the plugin,
-- here it is temporarily fixed using a workaround.
if ngx.config.subsystem ~= "stream" then
return
Member

ditto

Member Author

done.

{name = "waiting", var = "ngx_stat_waiting"},
}

-- Use FFI to get nginx status directly from global variables
local function nginx_status()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another way:

we can use resty.http to send the http request to itself for fetching nginx status

which is easier to read

anyway, I can accept current way but I can not sure it is easier enough to maintain for other developers

I highly recommend to add some comments which is useful

return 500, {message = "An unexpected error occurred"}
end

local function collect(yieldable)
Member

can we return directly if the yieldable is false?

Contributor

I suggest performing the first capture and metrics generation in init_worker; it is synchronous and blocking, which makes the metrics almost always available. Otherwise we can't be sure when the metrics will become available, which can lead to unexpected error responses (HTTP 500).

Member Author

This parameter is currently used to solve the problem of not being able to obtain the following data during initialization.

I elaborated in detail in #12383 (comment).



@@ -170,6 +170,7 @@ nginx_config: # Config for render the template to generate n
meta:
lua_shared_dict: # Nginx Lua shared memory zone. Size units are m or k.
prometheus-metrics: 15m
prometheus-cache: 10m
Member

Please add some comments telling users when they need to modify it.

        core.log.error("Failed to collect metrics: ", res)
        return
    end
    shdict_prometheus_cache:set(CACHED_METRICS_KEY, res)
Member

We need to capture the return value; it may fail.

If there is an error, tell the user the reason, and suggest increasing the default size if the shdict is too small.
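
A sketch of what capturing the result could look like (ngx.shared.DICT:set returns ok, err, forcible; the names follow the excerpt above):

local ok, err, forcible = shdict_prometheus_cache:set(CACHED_METRICS_KEY, res)
if not ok then
    core.log.error("failed to cache metrics: ", err,
                   "; consider increasing the prometheus-cache shared dict size")
elseif forcible then
    core.log.warn("prometheus-cache shared dict is low on memory, ",
                  "old entries were evicted; consider increasing its size")
end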

Member Author

done

Labels: enhancement (New feature or request), size:L (This PR changes 100-499 lines, ignoring generated files)