Last active
June 21, 2022 17:05
Bash Script to Remove Arabic Dialects from UTF-8 or Windows-1256 / iso-8859-1 Encoding
# Bash Script to Remove Arabic Dialects from UTF-8 or Windows-1256 / iso-8859-1 Encoding
# - Converts the Arabic comma and semicolon to a Latin comma (to a space for non-UTF-8 input)
# - Removes dialect (diacritic) symbols
# - Replaces runs of spaces with a single space
# - Replaces Alif-with-hamza and Alif-with-madda with plain Alif
#
# Example: removeArabicDialects my_utf8.txt > clear.txt
# Install: Copy this gist into your ~/.bashrc
# Author: Tarek Eldeeb
#
removeArabicDialects () {
  # "$1" is quoted so that file names containing spaces work.
  if [[ $(file -bi "$1" | grep -c utf) -gt 0 ]] ; then
    # UTF-8 input: sed matches whole characters (needs a UTF-8 locale).
    sed "s/[$(echo -ne '\u060C\u061B')]/,/g" "$1" | \
    sed "s/[$(echo -ne '\u064B-\u065E')]//g" | \
    sed "s/ \+/ /g" | \
    sed "s/[$(echo -ne '\u0622\u0623\u0625')]/$(echo -ne '\u0627')/g";
  else
    # Single-byte input: tr works on raw bytes.
    cat "$1" | tr $'\xA1\xBA.,:t' ' ' | \
    tr -d '\356-\377\327\334\340\342\347-\353'| \
    sed "s/ \+/ /g"| \
    tr $'\xc5\xc2\xc3' $'\xc7';
  fi;
}
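
The UTF-8 branch above relies on the shell running in a UTF-8 locale, because `echo -ne '\uXXXX'` and the bracket range are interpreted per character. As a rough sketch only (not part of the original gist), the same four steps can be written against the raw UTF-8 bytes so they behave the same in any locale. The function name `strip_arabic_marks_bytes` is hypothetical, and the sketch assumes bash (for `$'...'` quoting) and a `sed` that treats input as bytes under `LC_ALL=C`, such as GNU sed.

```shell
# Hypothetical helper, not part of the gist: the UTF-8 branch on raw bytes.
# Assumes bash ($'...' quoting) and a byte-oriented sed under LC_ALL=C.
strip_arabic_marks_bytes () {
  # 1. U+060C / U+061B (bytes D8 8C / D8 9B): Arabic comma, semicolon -> ","
  # 2. U+064B..U+065E (bytes D9 8B .. D9 9E): delete the marks
  # 3. Squeeze runs of spaces into one space
  # 4. U+0622 / U+0623 / U+0625 (D8 A2 / D8 A3 / D8 A5) -> U+0627 (D8 A7)
  LC_ALL=C sed \
    -e $'s/\xd8[\x8c\x9b]/,/g' \
    -e $'s/\xd9[\x8b-\x9e]//g' \
    -e 's/  */ /g' \
    -e $'s/\xd8[\xa2\xa3\xa5]/\xd8\xa7/g' \
    "$@"
}
```

It can be called the same way as the original, for example `strip_arabic_marks_bytes my_utf8.txt > clear.txt`, or fed from a pipe. Matching on bytes is safe here because a UTF-8 lead byte such as `D8` or `D9` never appears as the second byte of another character.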