
feat(upstream): filter nodes in upstream with metadata #12448


Open · wants to merge 2 commits into master

Conversation

jizhuozhi

@jizhuozhi jizhuozhi commented Jul 20, 2025

Description

This PR introduces metadata-based node filtering for upstreams, supporting both static and service discovery-based upstreams (e.g., via Consul).

Motivation

Currently, APISIX selects upstream nodes based on service name from discovery without additional filtering logic. In real-world scenarios like canary release or swimlane routing, users often tag backend instances with custom metadata (e.g., version, env) and expect the gateway to route only to specific subsets.

This change allows users to define a metadata_match field in upstream configuration, which filters nodes before load balancing based on their metadata values.

Changes

  • Consul discovery: Include Service.Meta in the node definition and respect its weight if available (see the sketch after this list).

  • Upstream core logic: Add metadata_match filtering before computing upstream nodes.

  • Tests: Add test cases to cover both:

    • static upstream with metadata_match
    • discovery-based upstream (Consul) with metadata_match
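
As an illustration of the Consul change referenced above, mapping one Consul service entry to an upstream node could look roughly like the sketch below. The helper name consul_entry_to_node is hypothetical and not the PR's actual code; only the field names (Service.Address, Service.Port, Service.Meta) come from Consul's catalog/health API.

    -- Sketch only: map one Consul service entry to an APISIX upstream node,
    -- keeping Meta around so metadata_match can filter on it later.
    local function consul_entry_to_node(entry)
        local meta = entry.Service.Meta or {}
        -- Consul Meta values are strings, so coerce an explicit weight to a number
        local weight = tonumber(meta.weight) or 1
        return {
            host     = entry.Service.Address,
            port     = entry.Service.Port,
            weight   = weight,
            metadata = meta,
        }
    end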

Example Usage

upstream:
  type: roundrobin
  nodes:
    "127.0.0.1:1980":
      weight: 1
      metadata:
        version: v1
    "127.0.0.2:1980":
      weight: 1
      metadata:
        version: v2
  metadata_match:
    lane:
      - prod
    version:
      - v1
      - v2

Only nodes with metadata.lane in [prod] and metadata.version in [v1, v2] will be used for load balancing.
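
A minimal sketch of that filtering step, assuming metadata_match maps a metadata key to a list of allowed values as in the YAML above (the function name filter_nodes_by_metadata is illustrative and may differ from the PR's actual implementation):

    -- Sketch only: keep nodes whose metadata satisfies every metadata_match rule.
    local function filter_nodes_by_metadata(nodes, metadata_match)
        local filtered = {}
        for _, node in ipairs(nodes) do
            local meta = node.metadata or {}
            local keep = true
            for key, allowed_vals in pairs(metadata_match) do
                local matched = false
                for _, val in ipairs(allowed_vals) do
                    if meta[key] == val then
                        matched = true
                        break
                    end
                end
                if not matched then
                    keep = false
                    break
                end
            end
            if keep then
                filtered[#filtered + 1] = node
            end
        end
        return filtered
    end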


Fixes

Fixes # (please link to the issue or describe use case clearly if not yet filed)


Checklist

  • I have explained the need for this PR and the problem it solves
  • I have explained the changes or the new features added to this PR
  • I have added tests corresponding to this change
  • I have updated the documentation to reflect this change
  • I have verified that this change is backward compatible (no breaking changes)

@dosubot dosubot bot added the size:L (This PR changes 100-499 lines, ignoring generated files.) and enhancement (New feature or request) labels on Jul 20, 2025
@Baoyuantop
Contributor

Hi @jizhuozhi, thanks for your contribution.

I think this is useful for node filtering with Consul discovery, but I don't understand the static upstream case. It seems I would need to mark each node in the upstream object with metadata and then configure filtering with metadata_match. Since each node is defined manually, if a node is not needed, couldn't I just add or delete it directly?

@jizhuozhi
Author

Hi @jizhuozhi, thanks for your contribution.

I think this is useful for node filtering with Consul discovery, but I don't understand the static upstream case. It seems I would need to mark each node in the upstream object with metadata and then configure filtering with metadata_match. Since each node is defined manually, if a node is not needed, couldn't I just add or delete it directly?

This is a unified approach: the code only checks whether a filtering rule exists, without distinguishing between service discovery and a static node list.

However, from my previous experience as a gateway administrator, business developers sometimes add rules temporarily for debugging purposes but do not want to change the original instance list (so they can adjust it back quickly).

@Baoyuantop
Contributor

business developers sometimes add rules temporarily for debugging purposes but do not want to change the original instance list

Thanks for your reply. Could you please describe the scenario in detail? Why can't the existing methods solve this problem?

@moonming moonming requested a review from Copilot July 21, 2025 08:31

@Copilot Copilot AI left a comment


Pull Request Overview

This PR adds metadata-based node filtering capabilities to APISIX upstreams, allowing users to filter backend nodes based on custom metadata values before load balancing. This enables advanced routing scenarios like canary releases and swimlane routing.

  • Implements metadata_match filtering logic in upstream core module
  • Enhances Consul discovery to include service metadata and support weight configuration
  • Adds comprehensive test coverage for both static and service discovery-based upstreams

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

  • apisix/upstream.lua: Adds metadata filtering functions and integrates them into upstream node selection
  • apisix/discovery/consul/init.lua: Updates Consul discovery to include service metadata and to take weight from the Meta field
  • t/node/upstream-discovery.t: Adds a test case for metadata filtering with mock discovery
  • t/discovery/consul.t: Adds an integration test for metadata filtering with actual Consul discovery
Comments suppressed due to low confidence (3)

apisix/upstream.lua:535

  • The weight from metadata should be converted to a number. If metadata.weight is a string (which is common in Consul Meta fields), this could cause issues with load balancing calculations that expect numeric weights.
                return nil, "failed to fetch ssl info by "

t/discovery/consul.t:810

  • The test endpoint /test_metadata_match returns the server address but doesn't actually proxy to the filtered upstream nodes. This test validates the configuration but doesn't verify that the metadata filtering is working correctly since it's not actually routing through the filtered nodes.
    content_by_lua_block {

t/node/upstream-discovery.t:530

  • The test configuration shows metadata_match as a simple key-value pair (version: v2), but the implementation expects an array format. This test will not properly validate the metadata filtering functionality due to this structural mismatch.
        version: v2

Comment on lines 347 to 348
end
core.log.info("new_nodes: ", core.json.delay_encode(new_nodes, true))

Copilot AI Jul 21, 2025


This info-level log statement will be printed on every request and could generate excessive logs in production. Consider using debug level or adding a condition to only log when metadata filtering actually occurs.

Suggested change
-end
-core.log.info("new_nodes: ", core.json.delay_encode(new_nodes, true))
+core.log.info("new_nodes: ", core.json.delay_encode(new_nodes, true))
+end
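
For reference, the debug-level, conditional variant mentioned in the comment might look like this small sketch (it assumes nodes holds the pre-filter list and that both lists are plain arrays):

    -- Sketch only: log at debug level, and only when filtering actually removed nodes
    if #new_nodes ~= #nodes then
        core.log.debug("metadata_match filtered nodes: ",
                       core.json.delay_encode(new_nodes, true))
    end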


@@ -310,6 +339,14 @@ function _M.set_by_route(route, api_ctx)
return HTTP_CODE_UPSTREAM_UNAVAILABLE, "no valid upstream node: " .. (err or "nil")
end

if up_conf.metadata_match then

Copilot AI Jul 21, 2025


The metadata_match field is used directly without validation of its structure. The code expects it to be an array of objects with 'key' and 'allowed_vals' fields, but the test shows it as a simple key-value object. This mismatch will cause the filtering to fail.

Suggested change
-if up_conf.metadata_match then
+if up_conf.metadata_match then
+    -- Validate and transform metadata_match into the expected format
+    if type(up_conf.metadata_match) == "table" and not (#up_conf.metadata_match > 0) then
+        local transformed_metadata = {}
+        for key, allowed_vals in pairs(up_conf.metadata_match) do
+            if type(allowed_vals) ~= "table" then
+                return HTTP_CODE_UPSTREAM_UNAVAILABLE, "invalid metadata_match format: allowed_vals must be a table"
+            end
+            table.insert(transformed_metadata, { key = key, allowed_vals = allowed_vals })
+        end
+        up_conf.metadata_match = transformed_metadata
+    elseif type(up_conf.metadata_match) ~= "table" or not (#up_conf.metadata_match > 0) then
+        return HTTP_CODE_UPSTREAM_UNAVAILABLE, "invalid metadata_match format: must be an array of objects with 'key' and 'allowed_vals'"
+    end


@jizhuozhi
Author

jizhuozhi commented Jul 21, 2025

Thanks for your reply. Could you please describe the scenario in detail? Why can't the existing methods solve this problem?

This is not a production environment; it is common during the testing and verification phase. We frequently need to target specific instances (for example, to capture flame graphs for performance analysis); without filtering we would have to delete the other instances and add them back later, so selecting instances by a filter is more convenient.

In fact, we also match on dynamically colored metadata at load-balancing time rather than predefining routes, similar to https://github.com/kitex-contrib/loadbalance-tagging (I am also using Lua to implement the same capability, but that is outside the scope of this discussion).

@Baoyuantop
Contributor

Baoyuantop commented Jul 22, 2025

Thanks for your reply. Could you please describe the scenario in detail? Why can't the existing methods solve this problem?

This is not a production environment; it is common during the testing and verification phase. We frequently need to target specific instances (for example, to capture flame graphs for performance analysis); without filtering we would have to delete the other instances and add them back later, so selecting instances by a filter is more convenient.

In fact, we also match on dynamically colored metadata at load-balancing time rather than predefining routes, similar to https://github.com/kitex-contrib/loadbalance-tagging (I am also using Lua to implement the same capability, but that is outside the scope of this discussion).

I still have doubts about the Example Usage. Do you mean that if I need to adjust which nodes are used, I don't change the content of the nodes list, but adjust metadata_match instead?
By the way, are you using static nodes or service discovery?

@jizhuozhi
Author

jizhuozhi commented Jul 22, 2025

I still have doubts about the Example Usage. Do you mean that if I need to adjust which nodes are used, I don't change the content of the nodes list, but adjust metadata_match instead?

Yes, just adjust metadata_match (though the discussion of this use case has drifted beyond this PR). At runtime it is a unified filtering rule over the node list that does not need to distinguish the source.

By the way, are you using static nodes or service discovery?

We are currently using Consul on Kubernetes. A few years ago, at another company, we were using cloud virtual machines (similar to EC2); the cloud platform did not provide a discovery API, so we used scripts to synchronize static instance lists at regular intervals. In that setup the static list was effectively a kind of dynamic discovery. (Why not filter in the script? Because we were lazy :)
