Conversation

@MiSikora
Contributor

@MiSikora MiSikora commented Oct 23, 2025

Description

Currently, search operates on the regular title text fields. This is problematic because the inputs aren’t normalized: any accented or non-ASCII letter requires an exact match. For example, we won’t find ł when typing l, or œ when typing oe.

Sometimes we handle very basic normalization in code, but sometimes we don’t. When the search is performed on the database side, however, there’s no reliable way to handle it. In this PR I opted for unidecode, a common transliteration library that has been ported to many languages.
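As a rough illustration of why basic accent stripping is not enough on its own, here is a minimal sketch (not the code in this PR). It uses only java.text.Normalizer; the names stripAccents, transliterations, and toAsciiSketch are hypothetical, and the tiny lookup table merely stands in for what a full unidecode port provides.

```kotlin
import java.text.Normalizer

// Decompose to NFD, then drop combining marks (category Mn).
fun stripAccents(input: String): String =
    Normalizer.normalize(input, Normalizer.Form.NFD)
        .replace(Regex("\\p{Mn}+"), "")

// Hypothetical, deliberately tiny table. Letters such as ł and œ have no
// Unicode decomposition, so they need an explicit transliteration.
val transliterations = mapOf('ł' to "l", 'Ł' to "L", 'œ' to "oe", 'Œ' to "OE")

fun toAsciiSketch(input: String): String =
    stripAccents(input)
        .map { transliterations[it] ?: it.toString() }
        .joinToString("")

fun main() {
    println(stripAccents("Café"))           // Cafe
    println(stripAccents("Łódź"))           // Łodz (Ł survives accent stripping)
    println(toAsciiSketch("Łódź cœur"))     // Lodz coeur
}
```

Accent stripping handles é, ó, and ź because they decompose into a base letter plus a combining mark, while ł and œ pass through untouched. That gap is what a transliteration library closes.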

There are a few strategies to address this in SQLite, most of which rely on custom functions or collators. Unfortunately, the native Android SQLite library doesn’t support binding external functions. While we could use Requery, which does expose bindings, that would be a separate project in itself. Moreover, we’d likely need to implement native code functions to handle normalization efficiently since JNI interop can lead to performance issues, especially when searching over large data sets.

A common alternative is to store a normalized field directly in the database. This approach has the added advantage of integrating nicely with Full Text Search in the future, should we decide to improve local search capabilities further. Additionally, we can reuse the same normalized fields for searches performed in code, ensuring consistent behavior across all layers.

I handled all tables except for the podcast_episodes table during the database migration. Some users have hundreds of thousands of episodes in their databases. In my testing, migrating a database with 1,000,000 episodes on a Pixel 6 took about one minute, during which the app was locked.

To address this, I added a batched worker that runs during the app version migration. It updates episodes in batches of 50,000; migrating all episodes this way took around two minutes. This approach allows users to keep interacting with the app while the migration runs.
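The batching idea can be sketched as follows. This is an in-memory simplification under stated assumptions, not the actual EpisodeTitlesNormalizationWorker: EpisodeRow, normalizeInBatches, and the null marker for "not yet normalized" are all hypothetical, and a list stands in for the database table.

```kotlin
// Hypothetical stand-in for a row in podcast_episodes.
data class EpisodeRow(val uuid: String, val title: String, var cleanTitle: String? = null)

// Repeatedly picks up to batchSize rows that still lack a clean title and
// normalizes them. Returns the number of batches that were processed.
fun normalizeInBatches(
    rows: List<EpisodeRow>,
    batchSize: Int,
    normalize: (String) -> String,
): Int {
    require(batchSize > 0) { "batchSize must be positive" }
    var batches = 0
    while (true) {
        val batch = rows.asSequence()
            .filter { it.cleanTitle == null }
            .take(batchSize)
            .toList()
        if (batch.isEmpty()) break
        // In a real worker this would be one short database transaction.
        batch.forEach { it.cleanTitle = normalize(it.title) }
        batches++
    }
    return batches
}

fun main() {
    val rows = (1..5).map { EpisodeRow(uuid = "id-$it", title = "Title $it") }
    val batches = normalizeInBatches(rows, batchSize = 2) { it.lowercase() }
    println(batches)                            // 3
    println(rows.all { it.cleanTitle != null }) // true
}
```

Because each batch is a separate unit of work, the rest of the app can take its turn between batches instead of waiting for one long migration.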

One surprising piece of code in this PR is the following pattern:

data class Entity(
    @ColumnInfo(name = "title") var title: String = "",
) {
    // Always derived from the current title; the setter exists only so Room can populate the field.
    @ColumnInfo(name = "clean_title")
    var cleanTitle: String = ""
        get() = title.unidecode()
        internal set
}

It’s an ugly solution, but a necessary one. Most of our entities are mutable, and for this reason, cleanTitle cannot be a constructor property. Otherwise, if we mutate the title somewhere in the code, the change wouldn’t be reflected in cleanTitle, leading to discrepancies and incorrect search results.
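To make the discrepancy concrete, here is a small self-contained sketch without Room annotations. StaleEntity, FreshEntity, and normalizeSketch are hypothetical names, and lowercasing is only a stand-in for unidecode().

```kotlin
// Stand-in for the real normalizer so the example has no dependencies.
fun String.normalizeSketch(): String = lowercase()

// Constructor property: computed once at construction, stale after title changes.
class StaleEntity(
    var title: String,
    val cleanTitle: String = title.normalizeSketch(),
)

// Getter-backed property, mirroring the pattern above: derived on every read.
class FreshEntity(var title: String = "") {
    var cleanTitle: String = ""
        get() = title.normalizeSketch()
        internal set
}

fun main() {
    val stale = StaleEntity("Old Title")
    stale.title = "New Title"
    println(stale.cleanTitle) // old title

    val fresh = FreshEntity("Old Title")
    fresh.title = "New Title"
    println(fresh.cleanTitle) // new title
}
```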

Ideally, we’d make all classes immutable and move the properties to the constructors, similar to how it’s done in ManualPlaylistEpisode. However, this would be a significant refactor and can’t be done ad hoc due to the potential side effects and wide-reaching implications.

Testing Instructions

  1. Install the app from the main branch.
  2. Follow some podcasts and create playlists with non-ASCII names.
  3. Check out the mehow/task/normalized-search branch.
  4. Apply this patch.
curl -L https://github.com/user-attachments/files/23123452/version.patch | git apply
  5. Upgrade the app.
  6. Open the app.
  7. Open App Inspection.
  8. Execute this query and verify the clean titles.
SELECT title, clean_title FROM podcast_episodes
  9. Execute this query and verify the clean titles.
SELECT title, clean_title FROM playlists

Checklist

  • If this is a user-facing change, I have added an entry in CHANGELOG.md
  • Ensure the linter passes (./gradlew spotlessApply to automatically apply formatting/linting)
  • I have considered whether it makes sense to add tests for my changes
  • All strings that need to be localized are in modules/services/localization/src/main/res/values/strings.xml
  • Any jetpack compose components I added or changed are covered by compose previews
  • I have updated (or requested that someone edit) the spreadsheet to reflect any new or changed analytics.

@MiSikora MiSikora added this to the 7.101 milestone Oct 23, 2025
@Copilot Copilot AI review requested due to automatic review settings October 23, 2025 09:30
@MiSikora MiSikora requested a review from a team as a code owner October 23, 2025 09:30
@MiSikora MiSikora added the [Area] Search and [Type] Enhancement labels Oct 23, 2025
@MiSikora MiSikora requested review from sztomek and removed request for a team October 23, 2025 09:30
@MiSikora MiSikora marked this pull request as draft October 23, 2025 09:30
@dangermattic
Collaborator

dangermattic commented Oct 23, 2025

2 Warnings
⚠️ This PR is larger than 500 lines of changes. Please consider splitting it into smaller PRs for easier and faster reviews.
⚠️ Class EpisodeTitlesNormalizationWorker is missing tests, but unit-tests-exemption label was set to ignore this.

Generated by 🚫 Danger

@wpmobilebot
Collaborator

Project dependencies changes

list
+ New Dependencies
net.gcardone.junidecode:junidecode:0.5.2
tree
+\--- project :modules:features:account
+     \--- project :modules:features:search
+          \--- project :modules:services:analytics
+               \--- project :modules:services:model
+                    \--- project :modules:services:utils
+                         \--- net.gcardone.junidecode:junidecode:0.5.2

@Copilot Copilot AI review requested due to automatic review settings October 23, 2025 12:27
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a database normalization approach for search fields to improve search accuracy across Unicode characters. The implementation adds unidecode normalization to convert accented and non-ASCII characters to their ASCII equivalents, enabling searches for "l" to match "ł" and "oe" to match "œ". Normalized fields are stored directly in the database to support efficient database-side searching and future Full-Text Search integration.

Key Changes:

  • Adds unidecode library dependency and implements Unicode normalization utility
  • Creates database migration to add normalized title columns across multiple tables
  • Implements batched worker for podcast episodes normalization to prevent app locking during migration

Reviewed Changes

Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| gradle/libs.versions.toml | Adds junidecode library dependency |
| modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/extensions/String.kt | Implements unidecode() extension function with character filtering and whitespace normalization |
| modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/search/KmpSearch.kt | Updates KMP search to use Normalizer-based accent removal |
| modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/entity/*.kt | Adds cleanTitle/cleanName computed properties to entity classes |
| modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/db/AppDatabase.kt | Implements database migration 121→122 with normalized column backfilling |
| modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/db/dao/*.kt | Updates DAO methods to maintain normalized fields on updates |
| modules/services/repositories/src/main/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt | Implements batched worker for episode title normalization |
| modules/services/preferences/src/main/java/au/com/shiftyjelly/pocketcasts/preferences/SettingsImpl.kt | Changes upNextShuffle default value from false to true |


@Copilot Copilot AI review requested due to automatic review settings October 23, 2025 12:32
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.



@Copilot Copilot AI review requested due to automatic review settings October 23, 2025 12:38
@MiSikora MiSikora force-pushed the mehow/task/normalized-search branch from a4fa885 to cdc7ad7 Compare October 23, 2025 12:38
Contributor

Copilot AI left a comment

Pull Request Overview

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.



@MiSikora MiSikora marked this pull request as ready for review October 23, 2025 12:41
@MiSikora MiSikora requested a review from sztomek October 23, 2025 12:41
@MiSikora MiSikora force-pushed the mehow/task/normalized-search branch from cdc7ad7 to 31f9e1b Compare October 24, 2025 12:31
MiSikora and others added 5 commits October 24, 2025 14:31
@MiSikora MiSikora force-pushed the mehow/task/normalized-search branch from 31f9e1b to 215308a Compare October 24, 2025 12:32
3 participants