
Re: [maybe OT] unicode control characters in filenames



On Tue, Aug 09, 2011 at 12:42:18PM -0400, Eike Lantzsch wrote:
> Hi:
> 
> For some time I'm looking to find a method to remove unicode control 
> characters like U+202A; U+202C; U+200F from filenames.
> I found lots of examples to do this programmatically with python, perl, even 
> for VB and Java.
> I was looking to do this with bash, find, grep and/or even sed because I just 
> never wrote code in python or perl.
> Can some kind soul please give me a hint how to proceed?
> 
> Oh well, Dolphin in KDE 4.7.0 lets me change the filenames manually without 
> showing the actual control characters - you sort-of need to "feel" your way - 
> which is OK. Dolphin interprets those characters as what they are: control 
> characters - but manually file by file - I got hundreds - good grief!
> ls -la is so kind as to show the unicode characters as <U+202A> and so forth.
> Even mc shows at least dots for the unicode characters, but no easy method or 
> function to eliminate those chars from filenames - I mean a method simple 
> enough and usable for simple-minded-non-perl-cracks-users like me.
> 
> Thank y'all
> Eike

Most command-line utilities predate Unicode and only understand bytes.
I don't know Unicode well, but I have had to deal with such characters in
files and have used something similar to <code> tr "\200-\377" "_" </code>,
which replaces every byte above 127 with an underscore. If you need finer
control, use sed with something like 's/\107\221\168\319/_a_/'.
You can put all the translations in a sed file, one character-sequence
translation per line, then use something along the lines of:
<code>
    # no spaces around "=", and quote the names so odd bytes survive intact
    readable_filename=$( printf '%s\n' "$unicode_filename" |
        sed -f translation_file.sed )
    mv -- "$unicode_filename" "$readable_filename"
</code>
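
As a rough sketch of what such a translation file could look like for the
three characters named in the question: in UTF-8, U+202A, U+202C and U+200F
are the byte sequences \342\200\252, \342\200\254 and \342\200\217 (octal).
The function names make_translation_file and readable_name are made up for
illustration, and the rules here simply delete the sequences instead of
mapping them to something visible.

```shell
# Write one substitution per line. printf turns the octal escapes into the
# raw UTF-8 bytes of each character, so the file does not rely on any
# escape syntax of sed itself.
make_translation_file() {
    {
        printf 's/\342\200\252//g\n'   # U+202A LEFT-TO-RIGHT EMBEDDING
        printf 's/\342\200\254//g\n'   # U+202C POP DIRECTIONAL FORMATTING
        printf 's/\342\200\217//g\n'   # U+200F RIGHT-TO-LEFT MARK
    } > "$1"
}

# Print the name given as $2 with every rule from the file $1 applied.
# LC_ALL=C makes sed treat the name as plain bytes.
readable_name() {
    printf '%s\n' "$2" | LC_ALL=C sed -f "$1"
}
```

Typical use would be "make_translation_file translation_file.sed" once,
then "readable_name translation_file.sed "$unicode_filename"" for each name.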

Mind you, the tr approach may give you duplicate file names, which will result
in one of the files disappearing, and the sed approach requires you to figure
out some mapping between the old and new file names that will be
recognizable by anyone else involved.
Since using a script to diddle hundreds of files is fraught with risk,
I strongly suggest you zip up the directories in question and save that
zip for a month or two until you are SURE there are no mistakes.
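
One way to guard against the disappearing-file problem is to refuse any
rename whose target already exists. The following is only a sketch: the
function name rename_tree is made up, it expects a sed translation file in
the one-rule-per-line form described above, and it assumes that no file
name contains a newline.

```shell
# Rename everything below directory $1 using the sed rules in file $2.
# Entries are visited deepest first (-depth), so a renamed directory never
# invalidates a path that is still waiting to be processed.
# Assumption: no file name contains a newline.
rename_tree() {
    find "$1" -depth | while IFS= read -r path; do
        dir=$(dirname "$path")
        old=$(basename "$path")
        new=$(printf '%s\n' "$old" | LC_ALL=C sed -f "$2")
        if [ "$old" = "$new" ]; then
            continue                      # nothing to clean in this name
        fi
        if [ -z "$new" ] || [ -e "$dir/$new" ] || [ -L "$dir/$new" ]; then
            echo "skipped: $path" >&2     # empty result or name already taken
        else
            mv -- "$path" "$dir/$new"
        fi
    done
}
```

Anything reported as skipped still has to be renamed by hand, but nothing
gets overwritten along the way.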

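To get a list of the Unicode groupings you need to deal with, start with
the following, which turns the ordinary characters into underscores so that
whatever is left stands out: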
<code> ls -1R /path_to_unicode_files | tr 'a-zA-Z0-9./-' '_' </code>
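
A variation on the same idea, sketched here with a made-up function name
list_nonascii_runs, prints each distinct run of non-ASCII bytes once, as
octal escapes that can be pasted into a printf or tr command. It works on
bytes only and knows nothing about which Unicode characters the runs encode.

```shell
# Print every distinct run of bytes >= 0x80 that occurs in the file names
# below directory $1, one run per line, written as octal escapes.
list_nonascii_runs() {
    find "$1" |
        LC_ALL=C tr -c '\200-\377' '\n' |   # every other byte becomes a newline
        LC_ALL=C sort -u |                  # keep one copy of each run
        while IFS= read -r run; do
            if [ -z "$run" ]; then
                continue                    # blank line left over from tr
            fi
            # od prints the bytes as octal numbers; the unquoted substitution
            # splits them into words and printf puts a backslash before each.
            printf '\\%s' $(printf '%s' "$run" | od -An -v -to1)
            echo
        done
}
```

For a directory whose names contain only U+202A, this would print the single
line \342\200\252.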

I've never had to deal with your problem but that's how I'd approach it.

HTH,
Mike
PS: \107\221\168\319 is totally arbitrary and only meant to be
illustrative; you may find hex escapes such as \x0A handier.
MM
-- 
Satisfied user of Linux since 1997.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org