Re: [maybe OT] unicode control characters in filenames

To: Debian-user List <debian-user@lists.debian.org>
Subject: Re: [maybe OT] unicode control characters in filenames
From: Mike McClain <mike.junk@cox.net>
Date: Tue, 9 Aug 2011 13:24:46 -0700
Message-id: <20110809202446.GB3753@playground>
Mail-followup-to: Debian-user List <debian-user@lists.debian.org>
In-reply-to: <201108091242.19221.zp6cge@gmx.net>
References: <201108091242.19221.zp6cge@gmx.net>

On Tue, Aug 09, 2011 at 12:42:18PM -0400, Eike Lantzsch wrote:
> Hi:
> 
> For some time I'm looking to find a method to remove unicode control 
> characters like U+202A; U+202C; U+200F from filenames.
> I found lots of examples to do this programmatically with python, perl, even 
> for VB and Java.
> I was looking to do this with bash, find, grep and/or even sed because I just 
> never wrote code in python or perl.
> Can some kind soul please give me a hint how to proceed?
> 
> Oh well, Dolphin in KDE 4.7.0 lets me change the filenames manually without 
> showing the actual control characters - you sort-of need to "feel" your way - 
> which is OK. Dolphin interprets those characters as what they are: control 
> characters - but manually file by file - I got hundreds - good grief!
> ls -la is so kind as to show the unicode characters as <U+202A> and so forth.
> Even mc shows at least dots for the unicode characters, but no easy method or 
> function to eliminate those chars from filenames - I mean a method simple 
> enough and usable for simple-minded-non-perl-cracks-users like me.
> 
> Thank y'all
> Eike

Most of the classic command-line utilities predate Unicode and only
understand bytes.
While I don't know Unicode, I have had to deal with such characters in
files and have used something similar to <code> tr "\200-\377" "_" </code>,
or, if you need finer control, sed's 's/\107\221\168\319/_a_/'.
You can put all the translations in a sed file, one character-sequence
translation per line, then do something along the lines of:
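<code>
    # no spaces around "=", and quote the names so odd bytes survive intact
    readable_filename=$( printf '%s\n' "$unicode_filename" |
        sed -f translation_file.sed )
    # -i asks before overwriting an existing file
    mv -i -- "$unicode_filename" "$readable_filename"
</code>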
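
For the three characters named in the question, the sed-file idea could be
fleshed out roughly as below. This is only a sketch under assumptions: bash,
GNU sed (for the \xHH byte escapes) and UTF-8 encoded file names. The helper
names write_translation_file and rename_with_sed and the sed file's name are
made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: assumes bash, GNU sed and UTF-8 file names.
# write_translation_file and rename_with_sed are hypothetical helper names.

# Write a sed file with one substitution per line; each pattern is the
# UTF-8 byte sequence of one control character, replaced by nothing.
write_translation_file() {
    cat > "$1" <<'EOF'
# U+202A LEFT-TO-RIGHT EMBEDDING
s/\xE2\x80\xAA//g
# U+202C POP DIRECTIONAL FORMATTING
s/\xE2\x80\xAC//g
# U+200F RIGHT-TO-LEFT MARK
s/\xE2\x80\x8F//g
EOF
}

# Rename every entry below directory $1 whose name changes under sed file $2.
rename_with_sed() {
    local dir=$1 sedfile=$2 path base new target
    # -depth lists a directory's contents before the directory itself,
    # so renaming a directory never invalidates paths still to come.
    find "$dir" -depth -print0 |
    while IFS= read -r -d '' path; do
        # Leave the starting directory itself alone.
        if [ "$path" = "$dir" ]; then
            continue
        fi
        base=${path##*/}
        # LC_ALL=C makes sed work on bytes rather than on characters.
        new=$(printf '%s\n' "$base" | LC_ALL=C sed -f "$sedfile")
        # Nothing to do if the name is unchanged or would become empty.
        if [ -z "$new" ] || [ "$new" = "$base" ]; then
            continue
        fi
        target=${path%/*}/$new
        if [ -e "$target" ]; then
            printf 'skipped (target exists): %s\n' "$target" >&2
        else
            mv -- "$path" "$target"
        fi
    done
}
```

Typical use would be to call write_translation_file strip_controls.sed once
and then rename_with_sed /path_to_unicode_files strip_controls.sed. Names
that would collide with an existing file are reported and left untouched
rather than overwritten.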

Mind you, the tr approach may give you duplicate file names, which will
result in one of the files disappearing when mv overwrites it, and the sed
approach requires you to figure out a mapping between the old and new file
names that will be recognizable by anyone else involved.
Since using a script to diddle hundreds of files is fraught with risk,
I strongly suggest you zip up the directories in question and keep that
zip for a month or two until you are SURE there are no mistakes.
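
Both precautions can be checked before anything is renamed. The sketch
below assumes bash and uses tar in place of zip purely because tar is
installed nearly everywhere; backup_tree, ascii_name and find_collisions
are hypothetical helper names, not existing tools.

```shell
#!/usr/bin/env bash
# Sketch only: backup_tree, ascii_name and find_collisions are hypothetical.

# Archive the contents of directory $1 into tarball $2.
# Keep $2 outside of $1 so the archive does not try to include itself.
backup_tree() {
    tar -czf "$2" -C "$1" .
}

# The blanket cleanup: every byte from octal 200 to 377 becomes "_".
ascii_name() {
    printf '%s' "$1" | LC_ALL=C tr '\200-\377' '_'
}

# Print each cleaned name that two or more entries directly inside
# directory $1 would end up sharing. No output means no collisions.
find_collisions() {
    local path
    for path in "$1"/*; do
        [ -e "$path" ] || continue
        ascii_name "${path##*/}"
        printf '\n'
    done | sort | uniq -d
}
```

If find_collisions prints anything for a directory, the blanket tr rename
would lose a file there, so those names need a hand-picked mapping instead.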

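To get a list of the Unicode byte sequences you need to deal with,
start with:
<code> ls -1R /path_to_unicode_files | tr 'a-zA-Z0-9./-' '_' </code>
That turns the plain ASCII letters, digits and punctuation into
underscores, so whatever is left over is what needs a translation.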
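
A variation on the same idea prints the leftover bytes in hex, which is the
form a sed translation file needs. Again only a sketch, assuming bash plus
the standard tr, od, sed and find; odd_bytes and list_odd_names are made-up
helper names.

```shell
#!/usr/bin/env bash
# Sketch only: odd_bytes and list_odd_names are hypothetical helper names.

# Print, in hex, every byte of the string $1 that is not printable ASCII.
odd_bytes() {
    printf '%s' "$1" |
        LC_ALL=C tr -d ' -~' |      # drop space through tilde
        od -An -tx1 |               # dump what is left as hex bytes
        tr -d '\n' |
        sed 's/^ *//; s/  */ /g; s/ *$//'
}

# For every entry below directory $1 whose name contains such bytes,
# print the hex bytes, a tab, and the path.
list_odd_names() {
    local path hex
    find "$1" -print0 |
    while IFS= read -r -d '' path; do
        hex=$(odd_bytes "${path##*/}")
        if [ -n "$hex" ]; then
            printf '%s\t%s\n' "$hex" "$path"
        fi
    done
}
```

Running list_odd_names /path_to_unicode_files and looking at the first
column shows which byte sequences actually occur; a name containing U+202A,
for instance, shows up with the bytes e2 80 aa.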

I've never had to deal with your problem but that's how I'd approach it.

HTH,
Mike
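PS: \107\221\168\319 is totally arbitrary and only meant to be
illustrative (\168 and \319 are not even valid octal), and you may find
hex escapes like \x0A handier. For instance, U+202A is the three bytes
\xE2\x80\xAA in UTF-8, and GNU sed accepts that \xHH form (its octal
form is \oNNN).
MM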
-- 
Satisfied user of Linux since 1997.
O< ascii ribbon campaign - stop html mail - www.asciiribbon.org
