This repository was archived by the owner on Jan 22, 2021. It is now read-only.

Implement optional expanding of compound words into separate tokens #5

Open · wants to merge 1 commit into master
21 changes: 11 additions & 10 deletions README.md
@@ -101,16 +101,17 @@ Include `finnish` tokenizer and `voikko` filter in your analyzer, for example:

You can use the following filter options to customize the behaviour of the filter:

| Parameter | Default value | Description |
|-------------------|------------------|----------------------------------------------------------------|
| language | fi_FI | Language to use |
| dictionaryPath | system dependent | path to voikko dictionaries |
| analyzeAll | false | Use all analysis possibilities or just the first |
| minimumWordSize | 3 | minimum length of words to analyze |
| maximumWordSize | 100 | maximum length of words to analyze |
| libraryPath | system dependent | path to directory containing libvoikko |
| poolMaxSize | 10 | maximum amount of Voikko-instances to pool |
| analysisCacheSize | 1024 | number of analysis results to cache |
| expandCompounds | false | whether to produce separate tokens for parts of compound words |
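
For example, compound expansion can be enabled alongside the existing options. A minimal sketch in the style of the plugin's test suite (the filter name myFilter and the use of Elasticsearch's Settings.builder() are illustrative assumptions, not part of this PR):

import org.elasticsearch.common.settings.Settings;

// Minimal sketch: a custom voikko filter with compound expansion enabled.
// "myFilter" mirrors the filter name used in the plugin's own tests.
Settings settings = Settings.builder()
        .put("index.analysis.filter.myFilter.type", "voikko")
        .put("index.analysis.filter.myFilter.expandCompounds", true)
        .put("index.analysis.filter.myFilter.minimumSubwordSize", 2)
        .put("index.analysis.filter.myFilter.maximumSubwordSize", 30)
        .build();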
Member:

The minimumSubwordSize and maximumSubwordSize options are not documented here. Is that intentional or accidental? If you consider them to be so rarely needed that they don't need to be documented, that's fine by me. I just want to make sure that they aren't forgotten by accident.
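
For reference, if they were documented, the rows could look like this (defaults taken from VoikkoTokenFilterConfiguration in this PR; the descriptions are illustrative):

| minimumSubwordSize | 2                | minimum length of compound parts to emit                       |
| maximumSubwordSize | 30               | maximum length of compound parts to emit                       |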


## Development

@@ -27,6 +27,8 @@
import java.io.IOException;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Set;
import java.util.LinkedHashSet;
import java.util.Deque;
import java.util.List;
import java.util.regex.Pattern;
@@ -88,7 +90,7 @@ private void analyzeToken() {

        charTermAttribute.setEmpty().append(baseForms.get(0));

        if ((cfg.analyzeAll || cfg.expandCompounds) && baseForms.size() > 1) {
            current = captureState();

            for (String baseForm : baseForms.subList(1, baseForms.size()))
@@ -112,8 +114,17 @@ private List<String> analyzeUncached(String word) {

        for (Analysis result : results) {
            String baseForm = result.get("BASEFORM");
            if (baseForm != null) {
                baseForms.add(baseForm);
            }
        }
        if (cfg.expandCompounds) {
            for (String compound : expandCompounds(results)) {
                baseForms.add(compound);
            }
        }
Member:

This could be just:

if (cfg.expandCompounds)
    baseForms.addAll(expandCompounds(results));

        if (!cfg.analyzeAll) {
            return new ArrayList<String>(new LinkedHashSet<String>(baseForms));
Member:

Couple of notes:

First of all, this is Java 8, so you can leave out the explicit types and say:

return new ArrayList<>(new LinkedHashSet<>(baseForms));

Second of all, I'd probably add a method like CollectionUtils.withoutDuplicates and write just return withoutDuplicates(baseForms) here. There are two reasons for this: the first is communicating the intent more clearly, saying what the code is supposed to do instead of how it's implemented. And second, having a separate method would in fact allow nice little optimizations without complicating the calling code. For example:

public static <T> List<T> withoutDuplicates(List<T> xs) {
    if (xs.size() < 2) return xs;
    return new ArrayList<>(new LinkedHashSet<>(xs));
}

Or even something like:

public static <T> List<T> withoutDuplicates(List<T> xs) {
    if (xs.size() < 2) return xs;

    // for small collections (our usual case) the naive algorithm should be a win
    if (xs.size() < 16) {
        ArrayList<T> result = new ArrayList<>(xs.size());
        for (T x : xs)
            if (!result.contains(x))
                result.add(x);
        return result;
    }

    return new ArrayList<>(new LinkedHashSet<>(xs));
}

Not that I'd probably bother to optimize it, but having the method there makes it nice to write higher-level code on the calling side and know that I have a good place to perform optimizations if I have to get my hands dirty. This code is called a lot, and reducing extra allocations helps.

        }
        return baseForms;
    }
@@ -128,4 +139,82 @@ private void outputAlternative(String token) {
    private boolean isCandidateForAnalyzation(CharSequence word) {
        return word.length() >= cfg.minimumWordSize && word.length() <= cfg.maximumWordSize && VALID_WORD_PATTERN.matcher(word).matches();
    }

    private Set<String> expandCompounds(List<Analysis> analysisList) {
        // Contains code from the Voikko Filter for Solr
        // by the National Library of Finland.
        //
        // https://github.com/NatLibFi/SolrPlugins
        Set<String> compoundForms = new LinkedHashSet<String>();
Member:

Now duplicate removal is performed at two different levels: this method performs it and then the caller performs it again. It's probably best to just leave it to the caller and return a List from here.
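
A minimal sketch of that change (assuming everything else stays as-is, so deduplication happens only in analyzeUncached):

// Hypothetical List-returning variant: keeps insertion order and
// leaves duplicate removal entirely to the caller.
private List<String> expandCompounds(List<Analysis> analysisList) {
    List<String> compoundForms = new ArrayList<String>();
    // ... same parsing loop as below, ending in compoundForms.add(word) ...
    return compoundForms;
}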


        for (Analysis analysis : analysisList) {
            if (!analysis.containsKey("WORDBASES")) {
                continue;
            }
            String wordbases = analysis.get("WORDBASES");
Member:

This performs two lookups on the map: first with containsKey and then with get. It's better to just say:

String wordbases = analysis.get("WORDBASES");
if (wordbases == null) continue;

Almost the only reason to call containsKey is the obscure case where null is a valid value in the map and you need to distinguish between null returned because no mapping was found and null returned because null was stored. Even in that case it's best to call containsKey after the call to get, and only if necessary. So something like:

String value = map.get(key);
if (value != null || map.containsKey(key)) {
    // map contained entry for 'key', value could be null
} else {
    // map did not contain 'key'
}

That said, it's super-rare to encounter maps where null is a valid value.

            // Split by plus sign (unless right after an open parenthesis)
            String[] matches = wordbases.split("(?<!\\()\\+");
Member:

This ends up compiling the same regex over and over again. It's better to perform the expensive compilation of the regex just once:

private static final Pattern WORDBASE_SPLIT = Pattern.compile("(?<!\\()\\+");

and then say:

String[] matches = WORDBASE_SPLIT.split(wordbases);


            int currentPos = 0, lastPos = 0;
            String lastWordBody = "";
            assert matches.length > 1;
            // The string starts with a plus sign, so skip the first (empty) entry.
            for (int i = 1; i <= matches.length - 1; i++) {
Member:

The canonical way of writing this termination condition would be i < matches.length. At first glance I thought that this would do something other than iterate to the end, because the - 1 confused me.

                String wordAnalysis, wordBody, baseForm;

                // Get rid of equals sign in e.g. di=oksidi.
                wordAnalysis = matches[i].replaceAll("=", "");
Member:

Instead of performing this separately for each match, we could perform this normalization just once on the original string before splitting it.

Also worth noting that String.replace(target, replacement) will do the same thing (replace all occurrences of the given target), but won't go through the regex machinery, which is better in this case. (Yes, the methods are confusingly named.)
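
A minimal sketch of both suggestions combined, reusing the precompiled WORDBASE_SPLIT pattern proposed above (names are illustrative):

// Normalize once before splitting: String.replace performs a plain,
// non-regex replacement of every occurrence of "=".
String normalized = wordbases.replace("=", "");
String[] matches = WORDBASE_SPLIT.split(normalized);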

                int parenPos = wordAnalysis.indexOf('(');
                if (parenPos == -1) {
                    wordBody = baseForm = wordAnalysis;
                } else {
                    // Word body is before the parenthesis
                    wordBody = wordAnalysis.substring(0, parenPos);
                    // Base form or derivative is in parenthesis
                    baseForm = wordAnalysis.substring(parenPos + 1, wordAnalysis.length() - 1);
                }

                String word;
                int wordOffset, wordLen;
                boolean isDerivative = baseForm.startsWith("+");
                if (isDerivative) {
                    // Derivative suffix, merge with word body
                    word = lastWordBody + wordBody;
                    wordOffset = lastPos;
                    wordLen = word.length();
                } else {
                    word = baseForm;
                    wordOffset = currentPos;
                    wordLen = word.length();
                    lastWordBody = wordBody;
                    lastPos = currentPos;
                    currentPos += baseForm.length();
                }
Member:

I can't say that I understand the meaning of juggling all these variables, but just by looking at the code above, I can say that at least the assignment to wordLen can be moved outside of the branch, since it's the same in both cases:

int wordLen = word.length();


                // Make sure we don't exceed the length of the original term
                int termLen = charTermAttribute.toString().length();
Member:

Why must we not exceed the length of the original term? What input could cause that to happen and what problems would follow?

Also, the toString() call causes a string to be allocated needlessly; we can just call length() on the original char sequence:

int termLen = charTermAttribute.length();

                if (wordOffset + wordLen > termLen) {
                    if (wordOffset >= termLen) {
                        wordOffset = wordLen - termLen;
                        if (wordOffset < 0) {
                            wordOffset = 0;
                        }
                    } else {
                        wordLen = termLen - wordOffset;
                    }
                }
Member:

The wordOffset = 0; assignment is dead, because the variable is never used after that point on branches reachable from row 200. This means that the condition and the previous assignment are dead as well. So rows 196-205 can be simplified down to:

if (wordOffset < termLen && wordOffset + wordLen > termLen)
    wordLen = termLen - wordOffset;


                int maxSubwordSize = cfg.maximumSubwordSize;
                int minSubwordSize = cfg.minimumSubwordSize;
                if (wordLen > minSubwordSize) {
Member:

If I specify minimum size to be 2, I'd expect that words of length 2 will be included, but this excludes them. So perhaps this should be >= instead of >?

                    if (wordLen > maxSubwordSize) {
                        word = word.substring(0, maxSubwordSize);
Member:

Is it prudent to just cut words to maxSubwordSize, or should we rather ignore words that exceed the length? I could be persuaded that either behavior is correct, but it is somewhat confusing that the semantics of minSubwordSize and maxSubwordSize are so different even though the names are symmetrical.
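
A minimal sketch of a symmetric alternative (an assumption about the intended semantics, not what the PR currently does; it drops over-long subwords instead of truncating them, and makes the lower bound inclusive):

// Hypothetical symmetric check: keep only subwords whose length falls
// inside the inclusive [minimumSubwordSize, maximumSubwordSize] range.
private static boolean isAcceptableSubword(String word, int minSubwordSize, int maxSubwordSize) {
    return word.length() >= minSubwordSize && word.length() <= maxSubwordSize;
}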

                        wordLen = maxSubwordSize;
Member:

This assignment is also dead.

                    }
                    compoundForms.add(word);
                }
            }
        }
        return compoundForms;
Member:

This method is quite complex and it's hard to simply look at it and be assured that it's correct. Furthermore, it's hard to write tests for it because it's tied to running Voikko.

Perhaps the code could be structured along these lines:

private List<String> expandCompounds(List<Analysis> analysisList) {
    List<String> compoundForms = new ArrayList<String>();

    for (Analysis analysis : analysisList) {
        String wordBases = analysis.get("WORDBASES");
        if (wordBases != null)
            compoundForms.addAll(expandWordBases(wordBases));
    }

    return compoundForms;
}

private List<String> expandWordBases(String wordBases) {
    return CompoundWordParser.expandWordBases(wordBases, charTermAttribute.length(), cfg.maximumSubwordSize, cfg.minimumSubwordSize);
}

This way the complex logic would be in a static method for which it would be trivial to write test cases:

assertWordBases("+köyde(köysi)+n+veto(veto)", 
    "köysi", "veto");
assertWordBases("+alkio(Alkio)+-+opisto(opisto)",
    "Alkio", "opisto");
assertWordBases("+kansa(kansa)+llis(+llinen)+eepos(eepos)",
    "kansa", "kansallis", "eepos");

I copy-pasted your code into a scratch file, hacked it a bit to make it callable from a test, checked what it gives back for the example inputs above, and then wrote the expected results based on what the code gave me. Do those look right? I'm surprised that "kansallis" is in the results. Wouldn't the correct value be "kansallinen"?

Furthermore, I'm a bit confused about whether the logic can be right, because the only reason the input +köyde(köysi)+n+veto(veto) does not produce the nonsensical output köysi, n, veto is that n is rejected for being too short. I guess that "n" should be rejected because it does not have a specified root, not because it's too short. If this is fixed, perhaps the minimum and maximum lengths are not needed at all?

All that said, here's my attempt at this function:

private static final Pattern WORDBASE = Pattern.compile("\\+([^+(]+)(\\(([^)]*)\\))?");

static List<String> expandWordBases(String wordbases) {
    Matcher m = WORDBASE.matcher(wordbases.replace("=", ""));

    String lastBody = "";
    List<String> result = new ArrayList<>();
    while (m.find()) {
        String body = m.group(1);
        String root = m.group(3);

        if (root != null) {
            if (root.startsWith("+")) {
                result.add(lastBody + root.substring(1));

            } else {
                result.add(root);
                lastBody = body;
            }
        }
    }
    return result;
}

And some tests to go with it:

@Test
void voikkoExamples() {
    assertWordBases("+köyde(köysi)+n+veto(veto)", "köysi", "veto");
    assertWordBases("+alkio(Alkio)+-+opisto(opisto)", "Alkio", "opisto");
    assertWordBases("+kansa(kansa)+llis(+llinen)+eepos(eepos)", "kansa", "kansallinen", "eepos");
}

private static void assertWordBases(String input, String... expected) {
    assertEquals(asList(expected), Foo.expandWordBases(input));
}

    }
}
@@ -27,4 +27,13 @@ final class VoikkoTokenFilterConfiguration {
    /** Words longer than this threshold are ignored */
    int maximumWordSize = 100;

    /** If true, include parts of compound words as alternatives to the whole word */
    boolean expandCompounds = false;

    /** Subwords (parts of compound words) shorter than this threshold are ignored */
    int minimumSubwordSize = 2;

    /** Subwords longer than this threshold are ignored */
    int maximumSubwordSize = 30;

}
@@ -41,6 +41,9 @@ public VoikkoTokenFilterFactory(IndexSettings indexSettings,
        cfg.analyzeAll = settings.getAsBoolean("analyzeAll", cfg.analyzeAll);
        cfg.minimumWordSize = settings.getAsInt("minimumWordSize", cfg.minimumWordSize);
        cfg.maximumWordSize = settings.getAsInt("maximumWordSize", cfg.maximumWordSize);
        cfg.expandCompounds = settings.getAsBoolean("expandCompounds", cfg.expandCompounds);
        cfg.minimumSubwordSize = settings.getAsInt("minimumSubwordSize", cfg.minimumSubwordSize);
        cfg.maximumSubwordSize = settings.getAsInt("maximumSubwordSize", cfg.maximumSubwordSize);

        analysisCache = new AnalysisCache(settings.getAsInt("analysisCacheSize", 1024));

@@ -138,6 +138,24 @@ public void testCompoundWordsWithHyphens() {
assertTokens("rippi-isälle", token("rippi-isälle", "rippi-isä", 1));
}

    public void testExpandedCompoundWords() {
        settings.put("index.analysis.filter.myFilter.expandCompounds", true);
        assertTokens("isoisälle", token("isoisälle", "isoisä", 1));
        assertTokens("tekokuusta keinokuuhun",
                token("tekokuusta", "tekokuu", 1),
                token("tekokuusta", "tekokuusi", 0),
                token("tekokuusta", "teko", 0),
                token("tekokuusta", "kuu", 0),
                token("tekokuusta", "kuusi", 0),
                token("keinokuuhun", "keinokuu", 1),
                token("keinokuuhun", "keino", 0),
                token("keinokuuhun", "kuu", 0));
        assertTokens("hammaslääkäri",
                token("hammaslääkäri", "hammaslääkäri", 1),
                token("hammaslääkäri", "hammas", 0),
                token("hammaslääkäri", "lääkäri", 0));
    }

    private static TokenData token(String original, String token, int positionIncrement) {
        return new TokenData(original, token, positionIncrement);
    }