Linear retriever top level option for normalizer #129693

Draft: wants to merge 19 commits into base: main
5 changes: 5 additions & 0 deletions docs/changelog/129693.yaml
@@ -0,0 +1,5 @@
pr: 129693
summary: Add top level normalizer for linear retriever
area: Search
type: enhancement
issues: []
150 changes: 150 additions & 0 deletions docs/reference/elasticsearch/rest-apis/retrievers.md
@@ -99,6 +99,156 @@
GET books/_search
{
"retriever": {
"knn": { <1>
"field": "vector", <2>
"query_vector": [10, 22, 77], <3>
"k": 10, <4>
"num_candidates": 10 <5>
}
}
}
```

1. Configuration for k-nearest neighbor (knn) search, which is based on vector similarity.
2. Specifies the field name that contains the vectors.
3. The query vector against which document vectors are compared in the `knn` search.
4. The number of nearest neighbors to return as top hits. This value must be less than or equal to `num_candidates`.
5. The size of the initial candidate set from which the final `k` nearest neighbors are selected.




## Linear Retriever [linear-retriever]

A retriever that normalizes and linearly combines the scores of other retrievers.


#### Parameters [linear-retriever-parameters]

`retrievers`
: (Required, array of objects)

The configurations of the sub-retrievers whose result sets are merged through a weighted sum. Each entry can specify its own weight and normalizer.
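
The weighted-sum merge can be pictured with a small standalone sketch. It is illustrative only and is not the Elasticsearch implementation: the class `LinearCombinationSketch`, the method `combine`, and the choice to let a document missing from a result set contribute nothing to the sum are all assumptions made for this example. Scores are assumed to be normalized already.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Elasticsearch. */
final class LinearCombinationSketch {

    /**
     * Sums weights[i] * score for every document across the sub-retrievers'
     * result sets. Each map goes from document id to an already normalized score.
     */
    static Map<String, Double> combine(List<Map<String, Double>> normalizedResults, double[] weights) {
        if (normalizedResults.size() != weights.length) {
            throw new IllegalArgumentException("the number of weights must be equal to the number of retrievers");
        }
        Map<String, Double> combined = new HashMap<>();
        for (int i = 0; i < weights.length; i++) {
            final double weight = weights[i];
            for (Map.Entry<String, Double> hit : normalizedResults.get(i).entrySet()) {
                // Assumption: a document absent from a result set adds nothing to its sum.
                combined.merge(hit.getKey(), weight * hit.getValue(), Double::sum);
            }
        }
        return combined;
    }

    public static void main(String[] args) {
        Map<String, Double> lexical = Map.of("doc1", 1.0, "doc2", 0.5);
        Map<String, Double> vector = Map.of("doc2", 1.0, "doc3", 0.25);
        // doc1 = 2.0 * 1.0, doc2 = 2.0 * 0.5 + 1.0 * 1.0, doc3 = 1.0 * 0.25
        System.out.println(combine(List.of(lexical, vector), new double[] { 2.0, 1.0 }));
    }
}
```

A weight of `0` removes a sub-retriever's contribution to the score entirely, which is consistent with the requirement that weights be non-negative.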

`normalizer`
: (Optional, String)

Specifies a normalizer to be applied to all sub-retrievers. This provides a simple way to configure normalization for all retrievers at once.

The `normalizer` can be specified at the top level, at the per-retriever level, or both, with the following rules:

* If only the top-level `normalizer` is specified, it applies to all sub-retrievers.
* If both a top-level and a per-retriever `normalizer` are specified, the per-retriever normalizer must be identical to the top-level one. If they differ, the request will fail.
* If only per-retriever normalizers are specified, they can be different for each sub-retriever.
* If no normalizer is specified at any level, no normalization is applied.

Available values are: `minmax`, `l2_norm`, and `none`. Defaults to `none`.
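
The resolution rules above can be sketched as a small standalone function. This is a simplified illustration that uses plain strings instead of `ScoreNormalizer` instances; the class `NormalizerResolutionSketch` and the method `resolve` are hypothetical names. Following the `normalizeNormalizerArray` method in this PR, a per-retriever value of `none` is treated the same as an unspecified one when a top-level normalizer is present.

```java
import java.util.Arrays;

/** Hypothetical helper, not part of Elasticsearch. */
final class NormalizerResolutionSketch {
    static final String DEFAULT_NORMALIZER = "none";

    /** Returns the normalizer that ends up applied to each sub-retriever. */
    static String[] resolve(String topLevel, String[] perRetriever) {
        String[] effective = Arrays.copyOf(perRetriever, perRetriever.length);
        for (int i = 0; i < effective.length; i++) {
            String current = effective[i];
            boolean unspecified = current == null || current.equals(DEFAULT_NORMALIZER);
            if (topLevel == null) {
                // No top-level normalizer: keep per-retriever values, defaulting to "none".
                if (current == null) {
                    effective[i] = DEFAULT_NORMALIZER;
                }
            } else if (unspecified) {
                // The top-level normalizer fills every unspecified position.
                effective[i] = topLevel;
            } else if (current.equals(topLevel) == false) {
                // An explicit per-retriever normalizer must match the top-level one.
                throw new IllegalArgumentException(
                    "normalizer [" + current + "] in retriever [" + i + "] must match top-level [" + topLevel + "]"
                );
            }
        }
        return effective;
    }

    public static void main(String[] args) {
        // prints [minmax, minmax]
        System.out.println(Arrays.toString(resolve("minmax", new String[] { null, "minmax" })));
        // prints [l2_norm, none]
        System.out.println(Arrays.toString(resolve(null, new String[] { "l2_norm", null })));
    }
}
```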

Each entry in the `retrievers` array specifies the following parameters:

`retriever`
: (Required, a `retriever` object)

Specifies the retriever for which the top documents are computed. The retriever produces `rank_window_size` results, which are later merged based on the specified `weight` and `normalizer`.

`weight`
: (Optional, float)

The weight by which each score of this retriever’s top docs is multiplied. Must be greater than or equal to 0. Defaults to 1.0.

`normalizer`
: (Optional, String)

Specifies how we will normalize this specific retriever’s scores, before applying the specified `weight`. If a top-level `normalizer` is also specified, this normalizer must be the same. Available values are: `minmax`, `l2_norm`, and `none`. Defaults to `none`.

* `none` : No normalization is applied.
* `minmax` : A `MinMaxScoreNormalizer` that normalizes scores based on the following formula:

```
score = (score - min) / (max - min)
```

* `l2_norm` : An `L2ScoreNormalizer` that normalizes scores using the L2 norm of the score values.
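
As a rough illustration of the two normalizers, the sketch below applies each to a plain array of scores. It is not the code of `MinMaxScoreNormalizer` or `L2ScoreNormalizer`. The class and method names are hypothetical, and two behaviors are assumptions made for this example: `l2_norm` is taken to mean dividing each score by the Euclidean norm of all scores, and degenerate inputs (all scores equal for `minmax`, all scores zero for `l2_norm`) map to `0`.

```java
/** Hypothetical helper, not part of Elasticsearch. */
final class ScoreNormalizationSketch {

    /** score = (score - min) / (max - min) */
    static double[] minMax(double[] scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        double range = max - min;
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Assumption: when every score is identical the formula is undefined, so use 0.
            out[i] = range == 0 ? 0.0 : (scores[i] - min) / range;
        }
        return out;
    }

    /** Assumed interpretation: score = score / sqrt(sum of squared scores). */
    static double[] l2Norm(double[] scores) {
        double sumOfSquares = 0;
        for (double s : scores) {
            sumOfSquares += s * s;
        }
        double norm = Math.sqrt(sumOfSquares);
        double[] out = new double[scores.length];
        for (int i = 0; i < scores.length; i++) {
            // Assumption: an all-zero input stays all zero instead of dividing by zero.
            out[i] = norm == 0 ? 0.0 : scores[i] / norm;
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [0.0, 0.5, 1.0]
        System.out.println(java.util.Arrays.toString(minMax(new double[] { 2, 4, 6 })));
        System.out.println(java.util.Arrays.toString(l2Norm(new double[] { 3, 4 })));
    }
}
```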

For an example of configuring and applying normalizers independently per retriever, see [this hybrid search example](docs-content://solutions/search/retrievers-examples.md#retrievers-examples-linear-retriever) that uses a linear retriever.

Check failure on line 173 in docs/reference/elasticsearch/rest-apis/retrievers.md (GitHub Actions / docs-preview / build): Redirect target 'elasticsearch://reference/elasticsearch/rest-apis/retrievers/retrievers-examples.md' points to repository 'elasticsearch' for which no links.json was found.

`rank_window_size`
: (Optional, integer)

This value determines the size of the individual result sets per query. A higher value will improve result relevance at the cost of performance. The final ranked result set is pruned down to the search request’s [size](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#search-size-param). `rank_window_size` must be greater than or equal to `size` and greater than or equal to `1`. Defaults to the `size` parameter.


`filter`
: (Optional, [query object or list of query objects](/reference/query-languages/querydsl.md))

Applies the specified [boolean query filter](/reference/query-languages/query-dsl/query-dsl-bool-query.md) to all of the specified sub-retrievers, according to each retriever’s specifications.



## RRF Retriever [rrf-retriever]

An [RRF](/reference/elasticsearch/rest-apis/reciprocal-rank-fusion.md) retriever returns top documents based on the RRF formula, equally weighting two or more child retrievers. Reciprocal rank fusion (RRF) is a method for combining multiple result sets with different relevance indicators into a single result set.


#### Parameters [rrf-retriever-parameters]

`retrievers`
: (Required, array of retriever objects)

A list of child retrievers whose returned top documents are combined using the RRF formula. Each child retriever carries an equal weight as part of the RRF formula. Two or more child retrievers are required.


`rank_constant`
: (Optional, integer)

This value determines how much influence documents in individual result sets per query have over the final ranked result set. A higher value indicates that lower ranked documents have more influence. This value must be greater than or equal to `1`. Defaults to `60`.


`rank_window_size`
: (Optional, integer)

This value determines the size of the individual result sets per query. A higher value will improve result relevance at the cost of performance. The final ranked result set is pruned down to the search request’s [size](https://www.elastic.co/docs/api/doc/elasticsearch/operation/operation-search#search-size-param). `rank_window_size` must be greater than or equal to `size` and greater than or equal to `1`. Defaults to the `size` parameter.


`filter`
: (Optional, [query object or list of query objects](/reference/query-languages/querydsl.md))

Applies the specified [boolean query filter](/reference/query-languages/query-dsl/query-dsl-bool-query.md) to all of the specified sub-retrievers, according to each retriever’s specifications.
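
To make the role of `rank_constant` concrete, here is a minimal sketch of the standard reciprocal rank fusion formula, in which each document receives `1 / (rank_constant + rank)` from every result set that contains it. The class `RrfSketch` and the method `fuse` are hypothetical names, and each input list is assumed to be already truncated to `rank_window_size`. It is not the Elasticsearch implementation.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper, not part of Elasticsearch. */
final class RrfSketch {

    /** Each inner list holds document ids ordered from best (rank 1) to worst. */
    static Map<String, Double> fuse(List<List<String>> rankedLists, int rankConstant) {
        if (rankConstant < 1) {
            throw new IllegalArgumentException("rank_constant must be greater than or equal to 1");
        }
        Map<String, Double> scores = new HashMap<>();
        for (List<String> ranked : rankedLists) {
            for (int i = 0; i < ranked.size(); i++) {
                int rank = i + 1; // ranks are 1-based
                scores.merge(ranked.get(i), 1.0 / (rankConstant + rank), Double::sum);
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        List<String> lexical = List.of("a", "b");
        List<String> vector = List.of("b", "c");
        // "b" appears in both lists, so it collects two contributions and ranks first.
        System.out.println(fuse(List.of(lexical, vector), 60));
    }
}
```

With a larger `rank_constant`, the gap between `1 / (rank_constant + 1)` and `1 / (rank_constant + 10)` shrinks, which is why lower ranked documents gain relative influence.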



### Example: Hybrid search [rrf-retriever-example-hybrid]

A simple hybrid search example (lexical search + dense vector search) combining a `standard` retriever with a `knn` retriever using RRF:

```console

Check failure on line 224 in docs/reference/elasticsearch/rest-apis/retrievers.md (GitHub Actions / docs-preview / build): Code block has 4 callouts but the following list only has 3
GET /restaurants/_search
{
"retriever": {
"rrf": { <1>
"retrievers": [ <2>
{
"standard": { <3>
"query": {
"multi_match": {
"query": "Austria",
"fields": [
"city",
"region"
]
}
}
}
},
{
"knn": { <4>
"field": "vector",
"query_vector": [10, 22, 77],
"k": 10,
"num_candidates": 10
}
}
=======
"linear": {
"query": "elasticsearch",
"fields": [
@@ -43,6 +43,7 @@
import static org.elasticsearch.action.ValidateActions.addValidationError;
import static org.elasticsearch.xcontent.ConstructingObjectParser.optionalConstructorArg;
import static org.elasticsearch.xpack.rank.RankRRFFeatures.LINEAR_RETRIEVER_SUPPORTED;
import static org.elasticsearch.xpack.rank.linear.LinearRetrieverComponent.DEFAULT_NORMALIZER;
import static org.elasticsearch.xpack.rank.linear.LinearRetrieverComponent.DEFAULT_WEIGHT;

/**
@@ -118,10 +119,41 @@ private static float[] getDefaultWeight(List<RetrieverSource> innerRetrievers) {
private static ScoreNormalizer[] getDefaultNormalizers(List<RetrieverSource> innerRetrievers) {
int size = innerRetrievers != null ? innerRetrievers.size() : 0;
ScoreNormalizer[] normalizers = new ScoreNormalizer[size];
Arrays.fill(normalizers, IdentityScoreNormalizer.INSTANCE);
Arrays.fill(normalizers, DEFAULT_NORMALIZER);
return normalizers;
}

private void normalizeNormalizerArray(ScoreNormalizer topLevelNormalizer, ScoreNormalizer[] normalizers) {
for (int i = 0; i < normalizers.length; i++) {
ScoreNormalizer current = normalizers[i];

if (topLevelNormalizer != null) {
// Validate explicit per-retriever normalizers match top-level
if (current != null && !current.equals(DEFAULT_NORMALIZER) && !current.equals(topLevelNormalizer)) {
throw new IllegalArgumentException(
String.format(
"[%s] All per-retriever normalizers must match the top-level normalizer: "
+ "expected [%s], found [%s] in retriever [%d]",
NAME,
topLevelNormalizer.getName(),
current.getName(),
i
)
);
}
// Propagate top-level normalizer to unspecified positions
if (current == null || current.equals(DEFAULT_NORMALIZER)) {
normalizers[i] = topLevelNormalizer;
}
} else {
// No top-level normalizer: ensure null values become DEFAULT_NORMALIZER
if (current == null) {
normalizers[i] = DEFAULT_NORMALIZER;
}
}
}
}

public static LinearRetrieverBuilder fromXContent(XContentParser parser, RetrieverParserContext context) throws IOException {
if (context.clusterSupportsFeature(LINEAR_RETRIEVER_SUPPORTED) == false) {
throw new ParsingException(parser.getTokenLocation(), "unknown retriever [" + NAME + "]");
@@ -157,17 +189,35 @@ public LinearRetrieverBuilder(
// Use a mutable list for innerRetrievers so that we can use addChild
super(innerRetrievers == null ? new ArrayList<>() : new ArrayList<>(innerRetrievers), rankWindowSize);
if (weights.length != this.innerRetrievers.size()) {
throw new IllegalArgumentException("The number of weights must match the number of inner retrievers");
throw new IllegalArgumentException(
"["
+ NAME
+ "] the number of weights must be equal to the number of retrievers, but found ["
+ weights.length
+ "] weights and ["
+ this.innerRetrievers.size()
+ "] retrievers"
);
}
if (normalizers.length != this.innerRetrievers.size()) {
throw new IllegalArgumentException("The number of normalizers must match the number of inner retrievers");
throw new IllegalArgumentException(
"["
+ NAME
+ "] the number of normalizers must be equal to the number of retrievers, but found ["
+ normalizers.length
+ "] normalizers and ["
+ this.innerRetrievers.size()
+ "] retrievers"
);
}

this.fields = fields == null ? null : List.copyOf(fields);
this.query = query;
this.normalizer = normalizer;
this.weights = weights;
this.normalizers = normalizers;
this.fields = fields;
this.query = query;
this.normalizer = normalizer;

normalizeNormalizerArray(normalizer, normalizers);

}

public LinearRetrieverBuilder(
@@ -368,6 +418,34 @@ protected RetrieverBuilder doRewrite(QueryRewriteContext ctx) {
rewritten = new StandardRetrieverBuilder(new MatchNoneQueryBuilder());
}
}
if (rewritten instanceof LinearRetrieverBuilder == false) {
return rewritten;
}
LinearRetrieverBuilder linearRewritten = (LinearRetrieverBuilder) rewritten;

if (normalizer != null) {
ScoreNormalizer[] newNormalizers = new ScoreNormalizer[linearRewritten.normalizers.length];
Arrays.fill(newNormalizers, normalizer);
rewritten = new LinearRetrieverBuilder(
linearRewritten.innerRetrievers,
linearRewritten.fields,
linearRewritten.query,
normalizer,
linearRewritten.rankWindowSize,
linearRewritten.weights,
newNormalizers
);
} else {
rewritten = new LinearRetrieverBuilder(
linearRewritten.innerRetrievers,
linearRewritten.fields,
linearRewritten.query,
linearRewritten.normalizer,
linearRewritten.rankWindowSize,
linearRewritten.weights,
linearRewritten.normalizers
);
}

return rewritten;
}
@@ -393,7 +471,9 @@ public void doToXContent(XContentBuilder builder, Params params) throws IOExcept
builder.startObject();
builder.field(LinearRetrieverComponent.RETRIEVER_FIELD.getPreferredName(), entry.retriever());
builder.field(LinearRetrieverComponent.WEIGHT_FIELD.getPreferredName(), weights[index]);
builder.field(LinearRetrieverComponent.NORMALIZER_FIELD.getPreferredName(), normalizers[index].getName());
if (normalizers[index] != null && normalizers[index].equals(DEFAULT_NORMALIZER) == false) {
builder.field(LinearRetrieverComponent.NORMALIZER_FIELD.getPreferredName(), normalizers[index].getName());
}
builder.endObject();
index++;
}
@@ -410,7 +490,7 @@ public void doToXContent(XContentBuilder builder, Params params) throws IOExcept
if (query != null) {
builder.field(QUERY_FIELD.getPreferredName(), query);
}
if (normalizer != null) {
if (normalizer != null && normalizer.equals(DEFAULT_NORMALIZER) == false) {
builder.field(NORMALIZER_FIELD.getPreferredName(), normalizer.getName());
}

@@ -38,7 +38,7 @@ public LinearRetrieverComponent(RetrieverBuilder retrieverBuilder, Float weight,
assert retrieverBuilder != null;
this.retriever = retrieverBuilder;
this.weight = weight == null ? DEFAULT_WEIGHT : weight;
this.normalizer = normalizer == null ? DEFAULT_NORMALIZER : normalizer;
this.normalizer = normalizer;
if (this.weight < 0) {
throw new IllegalArgumentException("[weight] must be non-negative");
}
@@ -48,7 +48,9 @@ public LinearRetrieverComponent(RetrieverBuilder retrieverBuilder, Float weight,
public XContentBuilder toXContent(XContentBuilder builder, Params params) throws IOException {
builder.field(RETRIEVER_FIELD.getPreferredName(), retriever);
builder.field(WEIGHT_FIELD.getPreferredName(), weight);
builder.field(NORMALIZER_FIELD.getPreferredName(), normalizer.getName());
if (normalizer != null && normalizer.equals(DEFAULT_NORMALIZER) == false) {
builder.field(NORMALIZER_FIELD.getPreferredName(), normalizer.getName());
}
return builder;
}

@@ -62,12 +62,32 @@ protected LinearRetrieverBuilder createTestInstance() {
List<CompoundRetrieverBuilder.RetrieverSource> innerRetrievers = new ArrayList<>();
float[] weights = new float[num];
ScoreNormalizer[] normalizers = new ScoreNormalizer[num];
for (int i = 0; i < num; i++) {
innerRetrievers.add(
new CompoundRetrieverBuilder.RetrieverSource(TestRetrieverBuilder.createRandomTestRetrieverBuilder(), null)
);
weights[i] = randomFloat();
normalizers[i] = randomScoreNormalizer();
// Create normalizer combinations that follow the API design rules
if (normalizer != null) {
// When top-level normalizer is specified, per-retriever normalizers must either:
// 1. Be null/default (will be propagated), or
// 2. Exactly match the top-level normalizer
boolean useMatchingNormalizers = randomBoolean();
for (int i = 0; i < num; i++) {
innerRetrievers.add(
new CompoundRetrieverBuilder.RetrieverSource(TestRetrieverBuilder.createRandomTestRetrieverBuilder(), null)
);
weights[i] = randomFloat();
if (useMatchingNormalizers) {
normalizers[i] = normalizer; // Exactly match top-level
} else {
normalizers[i] = randomBoolean() ? null : IdentityScoreNormalizer.INSTANCE; // Will be propagated
}
}
} else {
// No top-level normalizer: per-retriever normalizers can be anything
for (int i = 0; i < num; i++) {
innerRetrievers.add(
new CompoundRetrieverBuilder.RetrieverSource(TestRetrieverBuilder.createRandomTestRetrieverBuilder(), null)
);
weights[i] = randomFloat();
normalizers[i] = randomScoreNormalizer();
}
}

return new LinearRetrieverBuilder(innerRetrievers, fields, query, normalizer, rankWindowSize, weights, normalizers);