triphopMahithi
What does this change

  • Emoji replacement with sentiment-aware tokens (<<EMO_POS>>, <<EMO_NEG>>)
  • Normalization of elongated/repeated characters (e.g., "มากกกกกก" → "มากก")
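
The elongation rule in the second bullet can be sketched with a backreference regex. This is only an illustration, not the PR's actual code: the helper name `collapse_repeats` is hypothetical, and the choice to keep exactly two copies of a repeated character is inferred from the "มากกกกกก" → "มากก" example above.

```python
import re

# Hypothetical helper, not taken from the PR: any character repeated
# three or more times in a row is cut down to exactly two copies.
_REPEAT_RE = re.compile(r"(.)\1{2,}")


def collapse_repeats(text: str) -> str:
    # \1\1 keeps two copies of the repeated character, as in the PR example.
    return _REPEAT_RE.sub(r"\1\1", text)
```

Runs of one or two identical characters are left untouched, so ordinary doubled letters are not affected.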

What was wrong

  • Sentiment classification of emojis

How this fixes it

Applies sentiment-aware emoji replacement based on a defined dictionary

Your checklist for this pull request

  • [ /] Passed code styles and structures
  • [/ ] Passed code linting checks and unit test

@bact bact added the enhancement enhance functionalities label Jul 22, 2025
@bact
Member

bact commented Jul 22, 2025

Thank you.

Currently our text normalization functions are in
https://github.com/PyThaiNLP/pythainlp/blob/dev/pythainlp/util/normalize.py

If you like, you can try to fit the new functions in that file structure.

@coveralls

Coverage Status

coverage: 52.88%. remained the same
when pulling 74ed33d on triphop-mahithi:text-cleaning
into a069230 on PyThaiNLP:dev.

@triphopMahithi
Author

Thank you for pointing that out.
I apologize — I didn’t look carefully enough and missed the function in the file. I’ll take another look and try to fit the new functions accordingly.


"""

thai_special_chars_unicode = {
Member

You can also try to use the list defined here:

# Paiyannoi, Maiyamok, Phinthu, Thanthakhat, Nikhahit, Yamakkan:
# These signs can be part of a word
thai_signs = "\u0e2f\u0e3a\u0e46\u0e4c\u0e4d\u0e4e" # 6 chars

(it doesn't include ฯลฯ though)

Comment on lines +53 to +74
emoji_sentiment = {
    "positive": [
        "😊", "😁", "😂", "🤣", "😄", "😍", "😘", "😻", "👍", "👏", "💕", "❤️", "😇", "😎", "🥰", "😃", "☺️"
    ],

    "negative": [
        "😢", "😭", "😠", "😡", "😤", "👎", "💔", "😞", "😖", "😩", "😣", "😫", "😓", "😰", "😱", "😿"
    ],

    "neutral": [
        "😐", "😶", "🤔", "😑", "😬", "😴", "😕", "😒", "🙄", "😮", "🤨", "😲"
    ]
}

def replace_emoji_with_sentiment(sentence: str, emoji_dict: dict) -> str:
    for emo in emoji_dict["positive"]:
        sentence = sentence.replace(emo, " <<EMO_POS>> ")
    for emo in emoji_dict["negative"]:
        sentence = sentence.replace(emo, " <<EMO_NEG>> ")
    for emo in emoji_dict["neutral"]:
        sentence = sentence.replace(emo, " <<EMO_NEU>> ")
    return sentence
Member

Suggested change
emoji_sentiment = {
    "POS": [
        "😊", "😁", "😂", "🤣", "😄", "😍", "😘", "😻", "👍", "👏", "💕", "❤️", "😇", "😎", "🥰", "😃", "☺️"
    ],
    "NEG": [
        "😢", "😭", "😠", "😡", "😤", "👎", "💔", "😞", "😖", "😩", "😣", "😫", "😓", "😰", "😱", "😿"
    ],
    "NEU": [
        "😐", "😶", "🤔", "😑", "😬", "😴", "😕", "😒", "🙄", "😮", "🤨", "😲"
    ]
}

# Create an emoji-sentiment map from `emoji_sentiment`
emoji_to_tag = {}
for sentiment, emojis in emoji_sentiment.items():
    tag = f" <<EMO_{sentiment}>> "
    for emo in emojis:
        emoji_to_tag[emo] = tag
# Alternatively, we can have the static map predefined.

def replace_emoji_with_sentiment(sentence: str) -> str:
    for emo, tag in emoji_to_tag.items():
        sentence = sentence.replace(emo, tag)
    return sentence

We can also try to collapse the three loops into one by having an emoji-sentiment map.
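
Note that a loop over the map still calls `str.replace` once per emoji, so the text is scanned once for each entry. A true single scan could be sketched with one compiled regex alternation, as below. This is an assumption-laden illustration, not code from the PR: the name `replace_emoji_single_pass` and the shortened `EMOJI_TO_TAG` map are hypothetical, and the full lists from the review comment would take their place.

```python
import re

# Hypothetical, shortened map for illustration only; the full emoji lists
# from the review comment would go here.
EMOJI_TO_TAG = {
    "😊": " <<EMO_POS>> ",
    "👍": " <<EMO_POS>> ",
    "\u2764\ufe0f": " <<EMO_POS>> ",  # red heart plus variation selector-16
    "😢": " <<EMO_NEG>> ",
    "👎": " <<EMO_NEG>> ",
    "🤔": " <<EMO_NEU>> ",
}

# Longest keys first, so a multi-code-point emoji is matched as a whole
# if a shorter key happens to be a prefix of it.
_EMOJI_RE = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOJI_TO_TAG, key=len, reverse=True))
)


def replace_emoji_single_pass(sentence: str) -> str:
    # One scan over the text; each match is looked up in the map.
    return _EMOJI_RE.sub(lambda m: EMOJI_TO_TAG[m.group(0)], sentence)
```

Whether this is faster than repeated `str.replace` calls for a map of this size would need measuring; the sketch only shows the single-scan structure.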

Author

Thanks! Would it be okay if I add this code to the text normalization function?

Member

Please take it and use it in any way you want :)

Author

Okay, I think this is the same as emojiconv.py

This PR is stale because it has been open for 30 days with no activity.

@github-actions github-actions bot added the stale label Aug 23, 2025