Beware of non-ASCII characters
October 13th, 2006 mysurface Posted in cat, Misc, sed, Text Manipulation
When I copied source code from an ebook in PDF format and pasted it into vim, it failed to compile. The reason is that the pasted text contains non-ASCII characters, in my case UTF-8 encoded ones. You can reveal non-ASCII or otherwise hidden characters like this:
cat -v mycode.c
One line of the sample output here:
fprintf (stream, M-bM-^@M-^\This is a test.\nM-bM-^@M-^]);
The M-bM-^@M-^\ is the opening double quote ( “ ) and M-bM-^@M-^] is the closing double quote ( ” ), but C/C++ source code has to use the plain ASCII double quote ( " ) for both.
In this case I need to convert both of them to ( " ), and I use sed for that:
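To find which lines of a larger file are affected, the cat -v output can be filtered for the M- prefix. This is only a rough sketch: sample.c is a made-up file created here for illustration, and any line that literally contains the text M- would be reported as well.

```shell
#!/bin/sh
# sample.c is a hypothetical test file; the octal escapes \342\200\234 and
# \342\200\235 are the UTF-8 bytes of the opening and closing curly quotes.
printf 'int main (void)\n{\n  puts (\342\200\234hi\342\200\235);\n}\n' > sample.c

# cat -v prints every byte above 127 with an M- prefix, so grep -n for
# that prefix lists the numbered lines that need attention.
cat -v sample.c | grep -n 'M-'
```

Here only line 3, the puts call, is listed.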
cat -v mycode.c | sed -e 's/M-bM-^@M-^\\/"/g' -e 's/M-bM-^@M-^]/"/g' >mycode2.c
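First, cat -v displays the non-ASCII characters in M- and ^ notation. Then sed searches for those two sequences, replaces each one with ( " ), and the output is redirected to a new file called mycode2.c. Note that cat -v rewrites every other non-ASCII or control byte in the file too (except tab and newline), so anything sed does not replace stays in M- and ^ form in mycode2.c.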
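If the rest of the file should stay byte-for-byte intact, one possible variation is to skip cat -v and let sed match the raw UTF-8 bytes of the two quotes. The sketch below is not the method used in the post. It assumes a sed that passes bytes through unchanged in the C locale (GNU and BSD sed do), and it creates its own mycode.c sample so it can be run as-is; the variable names open_quote and close_quote are made up for this example.

```shell
#!/bin/sh
# Sample input: one line with UTF-8 curly quotes (bytes E2 80 9C and E2 80 9D),
# written with octal escapes so this script itself stays pure ASCII.
printf 'fprintf (stream, \342\200\234This is a test.\\n\342\200\235);\n' > mycode.c

# Build the two quote characters from their bytes.
open_quote=$(printf '\342\200\234')
close_quote=$(printf '\342\200\235')

# LC_ALL=C makes sed treat the input as plain bytes rather than
# as characters in the current locale.
LC_ALL=C sed -e "s/$open_quote/\"/g" -e "s/$close_quote/\"/g" mycode.c > mycode2.c

cat mycode2.c
```

Because only the two three-byte sequences are touched, any other non-ASCII text in the file (for example inside comments) is left alone.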







August 10th, 2008 at 10:05 pm
Thanks for the cat -v file.txt tip. i’d been wondering how to view oddball characters in linux for the last few days until i stumbled onto your post.
I just discovered that
sed 's/[^a-zA-Z0-9]//' tester.txt > tester2.txt
works also to remove non-ascii characters.
thanks,
joe
August 10th, 2008 at 10:24 pm
oops.
sed 's/[^a-zA-Z0-9]//' tester.txt > tester2.txt
will remove non-ascii characters, but will also remove
!@#$% etc., too!
the rule here is never take advice from a dummy.
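The bracket expression in that sed command matches anything that is not a letter or digit, which is why spaces and punctuation go too (and without a trailing g only the first match on each line is removed). A narrower option, sketched here with tr rather than sed, is to delete only the bytes outside printable ASCII, tab, newline and carriage return. The sample content written to tester.txt is invented for illustration.

```shell
#!/bin/sh
# Sample input: "café costs $5!" where \303\251 is the UTF-8 encoding of é.
printf 'caf\303\251 costs $5!\n' > tester.txt

# -c complements the set and -d deletes, so every byte that is NOT
# tab (\11), newline (\12), carriage return (\15) or in the printable
# ASCII range space..tilde (\40-\176) is dropped.
LC_ALL=C tr -cd '\11\12\15\40-\176' < tester.txt > tester2.txt

cat tester2.txt
```

The punctuation survives and only the two bytes of the accented letter are removed, leaving "caf costs $5!".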
August 21st, 2008 at 6:45 am
The code above was very useful in removing all of the bad characters from within my file but what if the non-ascii character is at the end of the file name?
Somehow my Expect script is adding a special character to the end of the file name.
August 5th, 2011 at 9:27 pm
In addition to sed 's/[^a-zA-Z0-9]//' tester.txt > tester2.txt,
this also works to remove non-ascii: rm
(i.e. delete the file)