Normalize search inputs in the database #4632

MiSikora · 2025-10-23T09:30:14Z

Description

Currently, search runs against the regular title text fields. This is problematic because the inputs aren’t normalized: any accented or non-ASCII letter requires an exact match. For example, typing l won’t find ł, and typing oe won’t find œ.

Sometimes we apply very basic normalization in code, and sometimes we don’t. When the search runs on the database side, however, there’s no reliable way to handle it. In this PR I opted for unidecode, a common library for this purpose that has been ported to many languages.
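To make the problem concrete, here is a minimal sketch of the kind of folding involved. It is not the code from this PR: it uses only `java.text.Normalizer` plus a tiny hand-written table as a stand-in for the unidecode library, whose coverage is far broader. The names `foldToAscii`, `specialCases` and `combiningMarks` are hypothetical.

```kotlin
import java.text.Normalizer

// Hypothetical stand-in for a transliteration library: a handful of letters
// that have no Unicode decomposition and therefore need an explicit mapping.
val specialCases = mapOf(
    'ł' to "l", 'Ł' to "L",
    'œ' to "oe", 'Œ' to "OE",
    'ø' to "o", 'Ø' to "O",
    'ß' to "ss",
)

val combiningMarks = Regex("\\p{Mn}+")

fun foldToAscii(input: String): String {
    // NFD splits "é" into "e" plus a combining accent, which is then removed.
    val decomposed = Normalizer.normalize(input, Normalizer.Form.NFD)
    val stripped = combiningMarks.replace(decomposed, "")
    // Letters that survive decomposition are looked up in the table.
    return buildString {
        for (ch in stripped) append(specialCases[ch] ?: ch.toString())
    }
}

fun main() {
    println(foldToAscii("Żółć")) // Zolc
    println(foldToAscii("cœur")) // coeur
}
```

Accent stripping alone handles letters such as ó or ć, but ł and œ have no decomposition, which is why a dedicated transliteration table is needed.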

There are a few strategies to address this in SQLite, most of which rely on custom functions or collators. Unfortunately, the native Android SQLite library doesn’t support binding external functions. While we could use Requery, which does expose bindings, that would be a separate project in itself. Moreover, we’d likely need to implement native code functions to handle normalization efficiently since JNI interop can lead to performance issues, especially when searching over large data sets.

A common alternative is to store a normalized field directly in the database. This approach has the added advantage of integrating nicely with Full Text Search in the future, should we decide to improve local search capabilities further. Additionally, we can reuse the same normalized fields for searches performed in code, ensuring consistent behavior across all layers.
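The idea of matching against a stored normalized field can be sketched in plain Kotlin. This is an illustration under assumptions, not the PR's DAO code: `PlaylistRow`, `stripAccents` and `search` are hypothetical names, the normalizer only strips combining accents, and lowercasing is an extra assumption made so the in-memory comparison is case-insensitive.

```kotlin
import java.text.Normalizer

// Hypothetical row shape: the original title plus its stored normalized copy.
data class PlaylistRow(val title: String, val cleanTitle: String)

// Simplified normalizer for the sketch; it only strips combining accents.
fun stripAccents(text: String): String =
    Normalizer.normalize(text, Normalizer.Form.NFD)
        .replace(Regex("\\p{Mn}+"), "")
        .lowercase()

// The query goes through the same normalizer as the stored column, so both
// sides of the comparison are in the same form.
fun search(rows: List<PlaylistRow>, query: String): List<PlaylistRow> {
    val cleanQuery = stripAccents(query)
    return rows.filter { it.cleanTitle.contains(cleanQuery) }
}

fun main() {
    val rows = listOf("Café Society", "Daily News").map { PlaylistRow(it, stripAccents(it)) }
    println(search(rows, "cafe").map { it.title }) // [Café Society]
}
```

Because the normalized value is computed once at write time, the lookup itself is an ordinary substring match, which is what lets the database do the work without custom functions.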

I handled all tables except podcast_episodes during the database migration. Some users have hundreds of thousands of episodes in their databases. In my testing, migrating a database with 1,000,000 episodes on a Pixel 6 took about one minute, during which the app was locked.

To address this, I added a batched worker that runs during the app version migration. It updates episodes in batches of 50,000 and took around two minutes to migrate all episodes. Users can keep interacting with the app while the migration runs.
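The batching idea can be sketched without WorkManager or Room. Everything here is hypothetical (`Episode`, `normalizeInBatches`, the in-memory list standing in for the table); the real worker is `EpisodeTitlesNormalizationWorker`, whose implementation is not shown in this description.

```kotlin
// Hypothetical in-memory stand-in for the episodes table.
class Episode(val id: Int, val title: String, var cleanTitle: String? = null)

// Processes rows that still lack a normalized title, one batch at a time.
// Returns the number of batches that were needed.
fun normalizeInBatches(
    episodes: List<Episode>,
    batchSize: Int,
    normalize: (String) -> String,
): Int {
    require(batchSize > 0) { "batchSize must be positive" }
    var batches = 0
    while (true) {
        // In a real database this would be a LIMIT query over unprocessed rows.
        val batch = episodes.asSequence()
            .filter { it.cleanTitle == null }
            .take(batchSize)
            .toList()
        if (batch.isEmpty()) return batches
        batch.forEach { it.cleanTitle = normalize(it.title) }
        batches++
    }
}

fun main() {
    val episodes = List(120) { Episode(it, "Episode $it") }
    val batches = normalizeInBatches(episodes, batchSize = 50) { it.lowercase() }
    println(batches)                                // 3
    println(episodes.all { it.cleanTitle != null }) // true
}
```

Selecting only unprocessed rows on each pass means the loop can be interrupted and resumed, since finished batches are never revisited.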

One surprising piece of code in this PR is the following pattern:

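import androidx.room.ColumnInfo

data class Entity(
    @ColumnInfo(name = "title") var title: String = "",
) {
    // Mapped to a column so the value is stored; the getter derives it from the current title.
    @ColumnInfo(name = "clean_title")
    var cleanTitle: String = ""
        get() = title.unidecode()
        internal set
}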

It’s an ugly solution, but a necessary one. Most of our entities are mutable, and for this reason, cleanTitle cannot be a constructor property. Otherwise, if we mutate the title somewhere in the code, the change wouldn’t be reflected in cleanTitle, leading to discrepancies and incorrect search results.
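The discrepancy described above can be reproduced in a few lines. This sketch leaves out Room entirely and uses a hypothetical `normalized()` extension (plain lowercasing) in place of `unidecode()`; `Snapshot` and `Computed` are illustrative names.

```kotlin
// Stand-in for the real normalizer so the sketch runs on its own.
fun String.normalized(): String = lowercase()

// Derived value captured once in the constructor: goes stale after mutation.
data class Snapshot(var title: String, val cleanTitle: String = title.normalized())

// Derived value computed on every read: always follows the current title.
data class Computed(var title: String) {
    val cleanTitle: String get() = title.normalized()
}

fun main() {
    val snapshot = Snapshot("First")
    snapshot.title = "Second"
    println(snapshot.cleanTitle) // first

    val computed = Computed("First")
    computed.title = "Second"
    println(computed.cleanTitle) // second
}
```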

Ideally, we’d make all classes immutable and move the properties to the constructors, similar to how it’s done in ManualPlaylistEpisode. However, this would be a significant refactor and can’t be done ad hoc because of the potential side effects and wide-reaching implications.

Testing Instructions

Install the app from the main branch.
Follow some podcasts and create playlists with non-ASCII names.
Check out the mehow/task/normalized-search branch.
Apply this patch.

curl -L https://github.com/user-attachments/files/23123452/version.patch | git apply

Upgrade the app.
Open the app.
Open the App Inspection.
Execute this query and verify the clean titles.

SELECT title, clean_title FROM podcast_episodes

Execute this query and verify the clean titles.

SELECT title, clean_title FROM playlists

Checklist

If this is a user-facing change, I have added an entry in CHANGELOG.md
Ensure the linter passes (./gradlew spotlessApply to automatically apply formatting/linting)
I have considered whether it makes sense to add tests for my changes
All strings that need to be localized are in modules/services/localization/src/main/res/values/strings.xml
Any jetpack compose components I added or changed are covered by compose previews
I have updated (or requested that someone edit) the spreadsheet to reflect any new or changed analytics.

dangermattic · 2025-10-23T09:31:05Z

	2 Warnings
⚠️	This PR is larger than 500 lines of changes. Please consider splitting it into smaller PRs for easier and faster reviews.
⚠️	Class `EpisodeTitlesNormalizationWorker` is missing tests, but `unit-tests-exemption` label was set to ignore this.

Generated by 🚫 Danger

wpmobilebot · 2025-10-23T09:35:51Z

Project dependencies changes

list

+ New Dependencies
net.gcardone.junidecode:junidecode:0.5.2

tree

+\--- project :modules:features:account
+     \--- project :modules:features:search
+          \--- project :modules:services:analytics
+               \--- project :modules:services:model
+                    \--- project :modules:services:utils
+                         \--- net.gcardone.junidecode:junidecode:0.5.2

Copilot

Pull Request Overview

This PR introduces a database normalization approach for search fields to improve search accuracy across Unicode characters. The implementation adds unidecode normalization to convert accented and non-ASCII characters to their ASCII equivalents, enabling searches for "l" to match "ł" and "oe" to match "œ". Normalized fields are stored directly in the database to support efficient database-side searching and future Full-Text Search integration.

Key Changes:

Adds unidecode library dependency and implements Unicode normalization utility
Creates database migration to add normalized title columns across multiple tables
Implements batched worker for podcast episodes normalization to prevent app locking during migration

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
gradle/libs.versions.toml	Adds junidecode library dependency
modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/extensions/String.kt	Implements unidecode() extension function with character filtering and whitespace normalization
modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/search/KmpSearch.kt	Updates KMP search to use Normalizer-based accent removal
modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/entity/*.kt	Adds cleanTitle/cleanName computed properties to entity classes
modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/db/AppDatabase.kt	Implements database migration 121→122 with normalized column backfilling
modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/db/dao/*.kt	Updates DAO methods to maintain normalized fields on updates
modules/services/repositories/src/main/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt	Implements batched worker for episode title normalization
modules/services/preferences/src/main/java/au/com/shiftyjelly/pocketcasts/preferences/SettingsImpl.kt	Changes upNextShuffle default value from false to true

Tip: Customize your code reviews with copilot-instructions.md.

...ervices/preferences/src/main/java/au/com/shiftyjelly/pocketcasts/preferences/SettingsImpl.kt

...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt

Copilot

Pull Request Overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/search/KmpSearch.kt

.../services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/entity/PodcastEpisode.kt

Copilot

Pull Request Overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt

MiSikora added this to the 7.101 milestone Oct 23, 2025

Copilot AI review requested due to automatic review settings October 23, 2025 09:30

MiSikora requested a review from a team as a code owner October 23, 2025 09:30

MiSikora added [Area] Search [Type] Enhancement Improve an existing feature. labels Oct 23, 2025

MiSikora requested review from sztomek and removed request for a team October 23, 2025 09:30

MiSikora marked this pull request as draft October 23, 2025 09:30

MiSikora removed request for Copilot and sztomek October 23, 2025 09:30

Copilot AI review requested due to automatic review settings October 23, 2025 12:27

Copilot AI reviewed Oct 23, 2025

Copilot AI review requested due to automatic review settings October 23, 2025 12:32

Copilot AI reviewed Oct 23, 2025

modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/search/KmpSearch.kt (resolved)

.../services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/entity/PodcastEpisode.kt (resolved)

Copilot AI review requested due to automatic review settings October 23, 2025 12:38

MiSikora force-pushed the mehow/task/normalized-search branch from a4fa885 to cdc7ad7 Compare October 23, 2025 12:38

Copilot AI reviewed Oct 23, 2025

...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt (resolved)

MiSikora marked this pull request as ready for review October 23, 2025 12:41

MiSikora requested a review from sztomek October 23, 2025 12:41

MiSikora added unit-tests-exemption do not merge and removed do not merge labels Oct 23, 2025

MiSikora force-pushed the mehow/task/normalized-search branch from cdc7ad7 to 31f9e1b Compare October 24, 2025 12:31

MiSikora and others added 5 commits October 24, 2025 14:31

Normalize search inputs

b7ff7a4

Accept only letters or digits

254c6f1

Update modules/services/preferences/src/main/java/au/com/shiftyjelly/…

a98a798

…pocketcasts/preferences/SettingsImpl.kt Co-authored-by: Copilot <[email protected]>

Update modules/services/repositories/src/main/java/au/com/shiftyjelly…

7069ae3

…/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt Co-authored-by: Copilot <[email protected]>

Fix docs

215308a

MiSikora force-pushed the mehow/task/normalized-search branch from 31f9e1b to 215308a Compare October 24, 2025 12:32

3 participants