(feat): add text normalization for Thai and emoji #1130
base: dev
Thank you. Currently our text normalization functions are in If you like, you can try to fit the new functions in that file structure.
Thank you for pointing that out.
"""

thai_special_chars_unicode = {
You can also try to use the list defined here:
pythainlp/pythainlp/__init__.py
Lines 20 to 22 in a069230
# Paiyannoi, Maiyamok, Phinthu, Thanthakhat, Nikhahit, Yamakkan:
# These signs can be part of a word
thai_signs = "\u0e2f\u0e3a\u0e46\u0e4c\u0e4d\u0e4e"  # 6 chars
(It doesn't include ฯลฯ, though.)
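As a rough illustration of how that constant could be reused, the sketch below lists the Thai sign characters found in a string. The constant is copied from the quoted lines so the example runs without PyThaiNLP installed; `find_thai_signs` is a hypothetical helper name, not part of the library.

```python
from typing import List

# Copied from the quoted lines of pythainlp/__init__.py so this runs standalone.
thai_signs = "\u0e2f\u0e3a\u0e46\u0e4c\u0e4d\u0e4e"  # 6 chars


def find_thai_signs(text: str) -> List[str]:
    """Hypothetical helper: return the Thai sign characters in `text`, in order."""
    return [ch for ch in text if ch in thai_signs]
```

Because `thai_signs` is a string of single characters, a multi-character abbreviation such as ฯลฯ is only matched through its individual Paiyannoi characters, not as a unit.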
emoji_sentiment = {
    "positive": [
        "😊", "😁", "😂", "🤣", "😄", "😍", "😘", "😻", "👍", "👏", "💕", "❤️", "😇", "😎", "🥰", "😃", "☺️"
    ],

    "negative": [
        "😢", "😭", "😠", "😡", "😤", "👎", "💔", "😞", "😖", "😩", "😣", "😫", "😓", "😰", "😱", "😿"
    ],

    "neutral": [
        "😐", "😶", "🤔", "😑", "😬", "😴", "😕", "😒", "🙄", "😮", "🤨", "😲"
    ]
}

def replace_emoji_with_sentiment(sentence: str, emoji_dict: dict) -> str:
    for emo in emoji_dict["positive"]:
        sentence = sentence.replace(emo, " <<EMO_POS>> ")
    for emo in emoji_dict["negative"]:
        sentence = sentence.replace(emo, " <<EMO_NEG>> ")
    for emo in emoji_dict["neutral"]:
        sentence = sentence.replace(emo, " <<EMO_NEU>> ")
    return sentence
emoji_sentiment = {
    "POS": [
        "😊", "😁", "😂", "🤣", "😄", "😍", "😘", "😻", "👍", "👏", "💕", "❤️", "😇", "😎", "🥰", "😃", "☺️"
    ],
    "NEG": [
        "😢", "😭", "😠", "😡", "😤", "👎", "💔", "😞", "😖", "😩", "😣", "😫", "😓", "😰", "😱", "😿"
    ],
    "NEU": [
        "😐", "😶", "🤔", "😑", "😬", "😴", "😕", "😒", "🙄", "😮", "🤨", "😲"
    ]
}

# Create an emoji-sentiment map from `emoji_sentiment`
emoji_to_tag = {}
for sentiment, emojis in emoji_sentiment.items():
    tag = f" <<EMO_{sentiment}>> "
    for emo in emojis:
        emoji_to_tag[emo] = tag

# Alternatively, we can have the static map predefined.

def replace_emoji_with_sentiment(sentence: str) -> str:
    for emo, tag in emoji_to_tag.items():
        sentence = sentence.replace(emo, tag)
    return sentence
We can also try to reduce the number of text passes from three to one by having an emoji-sentiment map.
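Note that a loop over `str.replace` still scans the sentence once per emoji, even with a single map. If one scan of the text is the goal, a possible sketch is to compile the map's keys into one regular expression and substitute through a lookup. The names `EMOJI_TO_TAG` and `replace_emoji_single_pass` are hypothetical, and the map holds only a small subset of the emoji above for brevity.

```python
import re

# Hypothetical static map for illustration; a small subset of the PR's emoji.
EMOJI_TO_TAG = {
    "😊": " <<EMO_POS>> ",
    "👍": " <<EMO_POS>> ",
    "😢": " <<EMO_NEG>> ",
    "👎": " <<EMO_NEG>> ",
    "😐": " <<EMO_NEU>> ",
    "🤔": " <<EMO_NEU>> ",
}

# Longest keys first, so emoji made of several code points (for example,
# with a variation selector) are tried before any shorter prefix.
_EMOJI_PATTERN = re.compile(
    "|".join(re.escape(e) for e in sorted(EMOJI_TO_TAG, key=len, reverse=True))
)


def replace_emoji_single_pass(sentence: str) -> str:
    """Replace every known emoji with its sentiment tag in one regex scan."""
    return _EMOJI_PATTERN.sub(lambda m: EMOJI_TO_TAG[m.group(0)], sentence)
```

Whether this is faster than repeated `str.replace` for a map of this size would need measuring; the sketch only shows the single-scan structure.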
Thanks! Would it be okay if I add this code to the text normalization function?
Please take it and use it in any way you want :)
Okay, I think this is the same as emojiconv.py
This PR is stale because it has been open for 30 days with no activity.
What does this change
What was wrong
How this fixes it
Applies sentiment-aware emoji replacement based on a defined dictionary
Your checklist for this pull request