- 
                Notifications
    You must be signed in to change notification settings 
- Fork 269
Normalize search inputs in the database #4632
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
| 
 Generated by 🚫 Danger | 
| Project dependencies changeslist+ New Dependencies
net.gcardone.junidecode:junidecode:0.5.2tree+\--- project :modules:features:account
+     \--- project :modules:features:search
+          \--- project :modules:services:analytics
+               \--- project :modules:services:model
+                    \--- project :modules:services:utils
+                         \--- net.gcardone.junidecode:junidecode:0.5.2 | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces a database normalization approach for search fields to improve search accuracy across Unicode characters. The implementation adds unidecode normalization to convert accented and non-ASCII characters to their ASCII equivalents, enabling searches for "l" to match "ł" and "oe" to match "œ". Normalized fields are stored directly in the database to support efficient database-side searching and future Full-Text Search integration.
Key Changes:
- Adds unidecodelibrary dependency and implements Unicode normalization utility
- Creates database migration to add normalized title columns across multiple tables
- Implements batched worker for podcast episodes normalization to prevent app locking during migration
Reviewed Changes
Copilot reviewed 24 out of 24 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description | 
|---|---|
| gradle/libs.versions.toml | Adds junidecode library dependency | 
| modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/extensions/String.kt | Implements unidecode() extension function with character filtering and whitespace normalization | 
| modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/search/KmpSearch.kt | Updates KMP search to use Normalizer-based accent removal | 
| modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/entity/*.kt | Adds cleanTitle/cleanName computed properties to entity classes | 
| modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/db/AppDatabase.kt | Implements database migration 121→122 with normalized column backfilling | 
| modules/services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/db/dao/*.kt | Updates DAO methods to maintain normalized fields on updates | 
| modules/services/repositories/src/main/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt | Implements batched worker for episode title normalization | 
| modules/services/preferences/src/main/java/au/com/shiftyjelly/pocketcasts/preferences/SettingsImpl.kt | Changes upNextShuffle default value from false to true | 
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
        
          
                ...ervices/preferences/src/main/java/au/com/shiftyjelly/pocketcasts/preferences/SettingsImpl.kt
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt
              
                Outdated
          
            Show resolved
            Hide resolved
        
              
          
                ...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt
          
            Show resolved
            Hide resolved
        
              
          
                ...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt
          
            Show resolved
            Hide resolved
        
      There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
        
          
                modules/services/utils/src/main/java/au/com/shiftyjelly/pocketcasts/utils/search/KmpSearch.kt
          
            Show resolved
            Hide resolved
        
              
          
                .../services/model/src/main/java/au/com/shiftyjelly/pocketcasts/models/entity/PodcastEpisode.kt
          
            Show resolved
            Hide resolved
        
      a4fa885    to
    cdc7ad7      
    Compare
  
    There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.
        
          
                ...in/java/au/com/shiftyjelly/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt
          
            Show resolved
            Hide resolved
        
      cdc7ad7    to
    31f9e1b      
    Compare
  
    …pocketcasts/preferences/SettingsImpl.kt Co-authored-by: Copilot <[email protected]>
…/pocketcasts/repositories/jobs/EpisodeTitlesNormalizationWorker.kt Co-authored-by: Copilot <[email protected]>
31f9e1b    to
    215308a      
    Compare
  
    
Description
Currently, when performing a search, we operate on the regular title text fields. This can be problematic because the inputs aren’t normalized. Any accented or non-ASCII letters require an exact match. Meaning, for example, we won’t find
łwhen typingl, orœwhen typingoe.Sometimes we handle a very basic normalization in the code, but sometimes we don’t. When the search is performed on the database side, however, there’s no reliable way to handle it. In this PR I opted for using
unidecodewhich is a common library that can be used for that purpose backported to many languages.There are a few strategies to address this in SQLite, most of which rely on custom functions or collators. Unfortunately, the native Android SQLite library doesn’t support binding external functions. While we could use Requery, which does expose bindings, that would be a separate project in itself. Moreover, we’d likely need to implement native code functions to handle normalization efficiently since JNI interop can lead to performance issues, especially when searching over large data sets.
A common alternative is to store a normalized field directly in the database. This approach has the added advantage of integrating nicely with Full Text Search in the future, should we decide to improve local search capabilities further. Additionally, we can reuse the same normalized fields for searches performed in code, ensuring consistent behavior across all layers.
I handled all tables except for the
podcast_episodestable during the database migration. Some users have hundreds of thousands of episodes in their databases. In my testing, migrating a database with 1,000,000 episodes on a Pixel 6 took about one minute. All while the app was locked.To address this, I added a batched worker that runs during the app version migration. It updates episodes in batches of 50,000, which took around two minutes to migrate all episodes. This approach allows users to continue interacting with the app during the migration process.
One surprising piece of code in this PR is the following pattern:
It’s an ugly solution, but a necessary one. Most of our entities are mutable, and for this reason,
cleanTitlecannot be a constructor property. Otherwise, if we mutate thetitlesomewhere in the code, the change wouldn’t be reflected incleanTitle, leading to discrepancies and incorrect search results.Ideally, we’d make all classes immutable and move the properties to the constructors. Similar to how it’s done in the
ManualPlaylistEpisodeinstance. However, this would be a significant refactor and can’t be done ad-hoc due to the potential side effects and wide-reaching implications.Testing Instructions
mehow/task/normalized-searchbranch.curl -L https://github.com/user-attachments/files/23123452/version.patch | git applyChecklist
./gradlew spotlessApplyto automatically apply formatting/linting)modules/services/localization/src/main/res/values/strings.xml