BE: Full text search support #1267
@germanosin
Yes correct, and you could adjust WITH
@fallen-up new config:
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ShortWordAnalyzer extends Analyzer {
package-private
import org.apache.lucene.analysis.ngram.NGramTokenFilter;
import org.apache.lucene.analysis.standard.StandardTokenizer;

public class ShortWordNGramAnalyzer extends Analyzer {
package-private
protected abstract List<Tuple2<List<String>, T>> getItems();

private static Map<String, List<String>> cache = new ConcurrentHashMap<>();
private static final Map<String, List<String>> cache =
CacheBuilder.newBuilder()
.maximumSize(1_000)
.<String, List<String>>build()
.asMap();
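The point of the suggestion above is that a plain `ConcurrentHashMap` keyed by search text never evicts anything, so it can grow without limit. For readers without Guava on the classpath, the same bounding idea can be sketched with the JDK alone. This is a swapped-in technique (an access-ordered `LinkedHashMap`), not what the reviewer proposed, and `BoundedCache` and `lru` are hypothetical names:

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of the PR: a size-bounded LRU map using only the JDK.
final class BoundedCache {
  private BoundedCache() {
  }

  static <K, V> Map<K, V> lru(int maxEntries) {
    // accessOrder = true keeps entries ordered from least to most recently used.
    Map<K, V> map = new LinkedHashMap<K, V>(16, 0.75f, true) {
      @Override
      protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called after each insertion; returning true drops the least recently used entry.
        return size() > maxEntries;
      }
    };
    // LinkedHashMap is not thread-safe, so wrap it in a synchronized view.
    return Collections.synchronizedMap(map);
  }
}
```

A synchronized wrapper serializes every access, so Guava's `CacheBuilder` is likely the better fit under heavy concurrent load; the sketch only illustrates the eviction behaviour.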
+ final
}

public static List<String> tokenizeString(Analyzer analyzer, String text) throws IOException {
package-private
}

@SneakyThrows
public static List<String> tokenizeStringSimple(Analyzer analyzer, String text) {
package-private
return new TopicsIndex(topicStates.values().stream().map(
topicState -> buildInternalTopic(topicState, clustersProperties)
).toList(), fts.isTopicsNgramEnabled(), fts.getTopicsMinNGram(), fts.getTopicsMaxNGram());
Make the TopicsIndex constructor take FtsProperties as an argument; this call site is overloaded with parameters.
doc.add(new TextField(FIELD_NAME, topic.getName(), Field.Store.NO));
doc.add(new IntPoint(FIELD_PARTITIONS, topic.getPartitionCount()));
doc.add(new IntPoint(FIELD_REPLICATION, topic.getReplicationFactor()));
doc.add(new LongPoint(FIELD_SIZE, topic.getSegmentSize()));
It looks like segmentSize can be 0 in two cases: when it is unknown, and when the topic is empty. For the first case, maybe we should not index this field?
public List<String> find(String search) {
if (fts) {
return super.find(search);
} else {
return this.subjects
.stream()
.map(Tuple2::getT2)
.filter(subj -> search == null || CI.contains(subj, search))
.sorted().toList();
}
Let's either do this check inside all filters or check whether FTS is enabled at the upper level (as is done for ConsumerGroupFilter).
- I suggest always creating the filters and putting this check inside them, or creating different implementations. It will make the calling code cleaner.
- Also, maybe rename NgramFilter to SearchFilter or something similar, and choose the search algorithm (FTS, n-gram, etc.) according to the properties.
Something like this:
public class SearchFilters {
public static SearchFilter consumerGroupFilter(Collection<String> g, FtsProperties fts) {
if (fts.isEnabled()){
return new ConsumerGroupNgramFilter(g, fts.getFilterMinNGram(), fts.getFilterMaxNGram());
}
return new CaseInsensitiveContainsFilter(g);
}
...
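To make the factory idea above concrete, here is a self-contained sketch that runs without Lucene. Everything in it is an assumption for illustration: `FtsPropertiesStub` stands in for the PR's FtsProperties, `SimpleNgramFilter` replaces the Lucene-backed ConsumerGroupNgramFilter with a naive character n-gram overlap check, and `CaseInsensitiveContainsFilter` is modelled on the name used in the snippet above rather than on an existing class.

```java
import java.util.Collection;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical stand-in for the PR's FtsProperties; only the values used here.
final class FtsPropertiesStub {
  private final boolean enabled;
  private final int filterMinNGram;
  private final int filterMaxNGram;

  FtsPropertiesStub(boolean enabled, int filterMinNGram, int filterMaxNGram) {
    this.enabled = enabled;
    this.filterMinNGram = filterMinNGram;
    this.filterMaxNGram = filterMaxNGram;
  }

  boolean isEnabled() { return enabled; }
  int getFilterMinNGram() { return filterMinNGram; }
  int getFilterMaxNGram() { return filterMaxNGram; }
}

// Callers depend only on this interface, never on a concrete search algorithm.
interface SearchFilter {
  List<String> find(String search);
}

// Fallback: case-insensitive substring match; a null search matches everything.
final class CaseInsensitiveContainsFilter implements SearchFilter {
  private final Collection<String> items;

  CaseInsensitiveContainsFilter(Collection<String> items) {
    this.items = items;
  }

  @Override
  public List<String> find(String search) {
    String needle = search == null ? "" : search.toLowerCase(Locale.ROOT);
    return items.stream()
        .filter(item -> item.toLowerCase(Locale.ROOT).contains(needle))
        .sorted()
        .collect(Collectors.toList());
  }
}

// Naive n-gram filter (NOT Lucene): keeps items sharing at least one n-gram with the query.
final class SimpleNgramFilter implements SearchFilter {
  private final Collection<String> items;
  private final int min;
  private final int max;

  SimpleNgramFilter(Collection<String> items, int min, int max) {
    this.items = items;
    this.min = min;
    this.max = max;
  }

  // All lowercase substrings of text with length between min and max inclusive.
  static Set<String> ngrams(String text, int min, int max) {
    String s = text.toLowerCase(Locale.ROOT);
    Set<String> grams = new HashSet<>();
    for (int n = min; n <= max; n++) {
      for (int i = 0; i + n <= s.length(); i++) {
        grams.add(s.substring(i, i + n));
      }
    }
    return grams;
  }

  @Override
  public List<String> find(String search) {
    if (search == null || search.isEmpty()) {
      return items.stream().sorted().collect(Collectors.toList());
    }
    // A query shorter than min yields no n-grams and therefore no matches.
    Set<String> queryGrams = ngrams(search, min, max);
    return items.stream()
        .filter(item -> !Collections.disjoint(ngrams(item, min, max), queryGrams))
        .sorted()
        .collect(Collectors.toList());
  }
}

// The FTS check lives in exactly one place: the factory.
final class SearchFilters {
  private SearchFilters() {
  }

  static SearchFilter consumerGroupFilter(Collection<String> groups, FtsPropertiesStub fts) {
    if (fts.isEnabled()) {
      return new SimpleNgramFilter(groups, fts.getFilterMinNGram(), fts.getFilterMaxNGram());
    }
    return new CaseInsensitiveContainsFilter(groups);
  }
}
```

With this shape the calling code always receives a `SearchFilter` and never branches on the FTS flag itself, which is the cleanup the review asks for.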
int topicsMinNGram = 3;
int topicsMaxNGram = 5;
int filterMinNGram = 1;
int filterMaxNGram = 4;
Maybe create:
class NgramSettings {
int minNGram = 1;
int maxNGram = 4;
}
and tune it for each search:
public static class FtsProperties {
...
NgramSettings topicsNgram = new NgramSettings(3, 5);
NgramSettings schemasNgram = new NgramSettings(1, 4);
NgramSettings consumerGroupsNgram = new NgramSettings(1, 4);
}
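A compilable version of that suggestion might look like the sketch below. It is written as plain immutable Java under several assumptions: the two-argument constructor and the range validation are additions not present in the comment, `FtsPropertiesSketch` is a hypothetical name chosen to avoid implying this is the project's FtsProperties, and any Lombok annotations or Spring property-binding requirements (such as a no-args constructor and setters) are left out.

```java
// Hypothetical value class grouping the two n-gram bounds for one search.
final class NgramSettings {
  private final int minNGram;
  private final int maxNGram;

  NgramSettings(int minNGram, int maxNGram) {
    // Validation is an assumption added here; the review comment does not ask for it.
    if (minNGram < 1 || maxNGram < minNGram) {
      throw new IllegalArgumentException(
          "expected 1 <= minNGram <= maxNGram, got " + minNGram + " and " + maxNGram);
    }
    this.minNGram = minNGram;
    this.maxNGram = maxNGram;
  }

  int getMinNGram() { return minNGram; }
  int getMaxNGram() { return maxNGram; }
}

// Hypothetical holder mirroring the per-search defaults from the comment above.
final class FtsPropertiesSketch {
  private final NgramSettings topicsNgram = new NgramSettings(3, 5);
  private final NgramSettings schemasNgram = new NgramSettings(1, 4);
  private final NgramSettings consumerGroupsNgram = new NgramSettings(1, 4);

  NgramSettings getTopicsNgram() { return topicsNgram; }
  NgramSettings getSchemasNgram() { return schemasNgram; }
  NgramSettings getConsumerGroupsNgram() { return consumerGroupsNgram; }
}
```

Grouping the bounds this way replaces the four loose int fields with one object per search, so a new search type only needs one more `NgramSettings` field.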