Skip to content

BE: Full text search support #1267

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 35 commits into
base: main
Choose a base branch
from
Open

BE: Full text search support #1267

wants to merge 35 commits into from

Conversation

germanosin
Copy link
Member

@germanosin germanosin commented Aug 14, 2025

  • Breaking change? (if so, please describe the impact and migration path for existing application instances)

What changes did you make? (Give an overview)

Is there anything you'd like reviewers to focus on?

How Has This Been Tested? (put an "x" (case-sensitive!) next to an item)

  • No need to
  • Manually (please, describe, if necessary)
  • Unit checks
  • Integration checks
  • Covered by existing automation

Checklist (put an "x" (case-sensitive!) next to all the items, otherwise the build will fail)

  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation (e.g. ENVIRONMENT VARIABLES)
  • My changes generate no new warnings (e.g. Sonar is happy)
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes
  • Any dependent changes have been merged

Check out Contributing and Code of Conduct

A picture of a cute animal (not mandatory but encouraged)

@germanosin germanosin requested a review from a team as a code owner August 14, 2025 08:16
@germanosin germanosin requested review from a team as code owners August 14, 2025 08:16
@kapybro kapybro bot added status/triage Issues pending maintainers triage status/triage/completed Automatic triage completed and removed status/triage Issues pending maintainers triage labels Aug 14, 2025
@germanosin germanosin marked this pull request as draft August 14, 2025 08:16
@germanosin germanosin changed the title Issues/1087 fts BE: Full text search support Aug 14, 2025
@germanosin germanosin marked this pull request as ready for review August 15, 2025 12:01
@Haarolean Haarolean added this to the 1.4 milestone Aug 15, 2025
@Haarolean Haarolean added scope/backend Related to backend changes type/feature A brand new feature area/ux User experiense issues labels Aug 15, 2025
@fallen-up
Copy link

@germanosin
is the option KAFKA_FTS_ENABLED: "true" enough for testing?

@germanosin
Copy link
Member Author

@germanosin is the option KAFKA_FTS_ENABLED: "true" enough for testing?

Yes correct, and you could adjust WITH
KAFKA_FTS_TOPICSMINNGRAM, KAFKA_FTS_TOPICSMAXNGRAM for topics index
KAFKA_FTS_FILTERMINNGRAM, KAFKA_FTS_FILTERMAXNGRAM for other searches

@germanosin germanosin linked an issue Aug 18, 2025 that may be closed by this pull request
2 tasks
@germanosin
Copy link
Member Author

@fallen-up new config:
KAFKA_FTS_TOPICSNGRAMENABLED: true / false

import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ShortWordAnalyzer extends Analyzer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package-private

import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ShortWordNGramAnalyzer extends Analyzer {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package-private


protected abstract List<Tuple2<List<String>, T>> getItems();

private static Map<String, List<String>> cache = new ConcurrentHashMap<>();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  private static final Map<String, List<String>> cache =
      CacheBuilder.newBuilder()
          .maximumSize(1_000)
          .<String, List<String>>build()
          .asMap();

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+ final

}


public static List<String> tokenizeString(Analyzer analyzer, String text) throws IOException {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package-private

}

@SneakyThrows
public static List<String> tokenizeStringSimple(Analyzer analyzer, String text) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

package-private

Comment on lines +217 to +219
return new TopicsIndex(topicStates.values().stream().map(
topicState -> buildInternalTopic(topicState, clustersProperties)
).toList(), fts.isTopicsNgramEnabled(), fts.getTopicsMinNGram(), fts.getTopicsMaxNGram());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

make TopicsIndex contructor take FtsProperties as an argument, this creation looks overwhelmed

doc.add(new TextField(FIELD_NAME, topic.getName(), Field.Store.NO));
doc.add(new IntPoint(FIELD_PARTITIONS, topic.getPartitionCount()));
doc.add(new IntPoint(FIELD_REPLICATION, topic.getReplicationFactor()));
doc.add(new LongPoint(FIELD_SIZE, topic.getSegmentSize()));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it looks like segmentsize can be 0 in two cases - when it is unknown, and when topic is empty. For the first case maybe we should not index this field?

Comment on lines +30 to +39
public List<String> find(String search) {
if (fts) {
return super.find(search);
} else {
return this.subjects
.stream()
.map(Tuple2::getT2)
.filter(subj -> search == null || CI.contains(subj, search))
.sorted().toList();
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets either do this check inside all filters or check if fts enabled on upper level (like its done for ConsumerGroupFilter).

  • I suggest to always create filters and put this check inside / create different impls. It will make calling code cleaner.
  • Also, maybe rename NgramFilter to SearchFilter or smth and implement search alg (fts/ngram/etc) according to properties

Copy link
Contributor

@iliax iliax Aug 23, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smth like this

public class SearchFilters {

  public static SearchFilter consumerGroupFilter(Collection<String> g, FtsProperties fts) {
    if (fts.isEnabled()){
      return new ConsumerGroupNgramFilter(g, fts.getFilterMinNGram(), fts.getFilterMaxNGram());
    }
    return new CaseInsensitiveContainsFilter(g);
  }
...

int topicsMinNGram = 3;
int topicsMaxNGram = 5;
int filterMinNGram = 1;
int filterMaxNGram = 4;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe create

class NgramSettings {
 int minNGram = 1;
 int maxNGram = 4;
}

and tune it for each search

public static class FtsProperties {
  ...
  NgramSettings topiscNgram = new NgramSettings(3,5);
  NgramSettings schemasNgram = new NgramSettings(1,4);
  NgramSettings consumerGroupsNgram = new NgramSettings(1,4);
}

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ux User experiense issues scope/backend Related to backend changes status/triage/completed Automatic triage completed type/feature A brand new feature
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

BE: Add a full text search
4 participants