Re: Umlaut problems in filenames when going from samba2 to samba3



On Fri, 2006-03-03 08:16:05 +0100, Juergen.Leibner@t-online.de <Juergen.Leibner@t-online.de> wrote:
> -----Original Message-----
> > Date: Fri,  3 Mar 2006 07:48:58 +0100
> > Subject: Umlaut problems in filenames when going from samba2 to samba3
> > From: Klaus Ade Johnstad 
> > To: user@skolelinux.de
> > My problem is that the filenames do not have the German umlauts (öäü)
> > or the Norwegian special characters (øæå). I have about 8000 such
> > files, the teachers say that they have square signs, underscores and
> > other "strange" stuff instead of umlauts and specially Norwegian
> > characters.

Yeah, the filesystem uses one representation (eg. UTF-8) while Samba
interprets it as another (eg. ISO-8859-something).

> I'm running Samba on a Debian system here at work.
> Windows and Linux are using the same files.
> Samba is configured:
> #       unix charset = UTF-8
> #       display charset = UTF-8
> 
> debian is configured:
> LANG=de_DE.UTF-8@euro
> LC_CTYPE="de_DE.UTF-8@euro"
> LC_NUMERIC="de_DE.UTF-8@euro"
> LC_TIME="de_DE.UTF-8@euro"
> LC_COLLATE="de_DE.UTF-8@euro"
> LC_MONETARY="de_DE.UTF-8@euro"
> LC_MESSAGES="de_DE.UTF-8@euro"
> LC_PAPER="de_DE.UTF-8@euro"
> LC_NAME="de_DE.UTF-8@euro"
> LC_ADDRESS="de_DE.UTF-8@euro"
> LC_TELEPHONE="de_DE.UTF-8@euro"
> LC_MEASUREMENT="de_DE.UTF-8@euro"
> LC_IDENTIFICATION="de_DE.UTF-8@euro"
> LC_ALL=

That's a working configuration. It'll just work for everybody and
allows any kind of umlaut, even if some pupil gives his Russian
homework a Cyrillic filename.

> -rwxrwx---+  1 root          domänen-benutzer    0 2006-03-03 07:51 ÜÄÖßüäö.txt

Pah!

> > 1. What should I use in smb.conf for the values
> > unix charset =
> > DOS charset =

UTF-8 for the unix charset. The DOS charset isn't all that important
anymore, since it can only handle one-byte encodings. CP850 or
something like that is probably a good choice, but newer Windows
versions shouldn't use it anymore.

> > 2. What should actually the LOCALES be?
> > 
> > 3. I've found a program that supposedly will help me,
> > http://j3e.de/linux/convmv/
> > I've tried different combinations of
> > convmv -f cp850 -t iso8859-1
> > convmv -f cp850 -t utf8
> > But even if the umlauts are again visible from linux, they look
> > strange on windows. Anyone having used this program before?

I haven't used these, but wrote little shell scripts and ran a 'find'
command back in those days.

Most important is that you've got a real plan: know exactly which
source encoding you are converting from and which target encoding you
are converting to.

So first decide on locale settings (I'd choose some UTF-8 encoding
these days). Then create a filename containing some Umlauts (eg.
cut'n'paste from the UTF-8 test files containing lots of Umlauts:-)
and look at it, byte-by-byte, eg with

	ls | xxd

Verify that the hex dump contains the correct byte sequence for the
chosen encoding.
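
For reference when reading the hex dump, the expected bytes for a
single umlaut can be printed with iconv. This is only a small sketch:
the function name is made up for illustration, and od is used instead
of xxd merely because it is more widely installed.

```shell
#!/bin/sh
# Hypothetical helper: print the bytes of "ü" (U+00FC) in the given
# encoding as space-separated hex.
umlaut_bytes() {
    # \303\274 is the UTF-8 byte sequence for "ü", written in octal.
    printf '\303\274' \
        | iconv -f UTF-8 -t "$1" \
        | od -An -tx1 \
        | tr '\n' ' ' | tr -s ' ' \
        | sed 's/^ //; s/ $//'
}

umlaut_bytes UTF-8; echo        # c3 bc
umlaut_bytes ISO-8859-1; echo   # fc
```

If the filename created on the Linux box shows c3 bc where the ü
should be, it is stored as UTF-8; a lone fc points to ISO-8859-1.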

Then continue with configuring Samba. UTF-8 for the unix charset,
something for the DOS charset (as I wrote, there's probably no client
using this anymore, unless you insist on using DOS/Lanman or things
like that.)
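
In smb.conf terms, that might look like the fragment below. It is a
sketch for Samba 3, not a tested configuration; CP850 is only the
guess mentioned earlier and should be adapted to whatever legacy
clients, if any, are still around.

```ini
[global]
    # Encoding of filenames as stored on the Linux filesystem
    unix charset = UTF-8
    # Only relevant for old DOS/Lanman-style clients
    dos charset = CP850
```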

Then go to a Windows box and create a filename containing Umlauts. It
should look correct afterwards on the creating Windows box as well as
on a different one.

Then go back to the Linux box and verify that the filename reads okay.
(If not, Samba hasn't picked up the new configuration yet...)

You'd better do these things *fast*. You don't want the guys to create
new files in this time, because you'd end up with a not-so-nice mix of
differently encoded filenames!

Finally, fix the pre-existing filenames.

> > Oh, another problem is that they are using this system very heavily 24
> > hours a day (lots of vpn connections), so I can't just restart Samba
> > whenever I like to...
> 
> It should IMHO not be necessary to restart samba.

You'd need to restart the sessions, but that's not much of a problem
either: It is normal behavior that a Samba server (instance) quits
after some time of inactivity; the client will reinstate the
connection on its own when getting busy again.

That's actually a nice thing: Just kill all the fork()ed smbd child
processes (letting the parent survive!) so every client has to open a
fresh connection, served by a fresh process that reads the new config
file:-)

> I think your scenario is similar to the configuration I described above.
> But IMHO to convert old files coming in from an old samba version to the
> current version of samba, only changing settings in the smb.conf won't work.
> I think you have to do both. First configure your system and samba to
> work properly with all newly created files and folders and then put the old
> data on the shares and convert the files to your needs.

ACK. You need to convert the filenames. In a company I worked for, we
even had the nice thing that the filenames containing (now improperly
encoded) Umlauts weren't visible any longer (from the Windows
clients).

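For a start, something like this should do the recoding:

----------- recoding-script.sh -------------
#!/usr/bin/env sh
# Recode the last path component of "$1" from SRC_ENCODING to DEST_ENCODING.

SRC_ENCODING=ISO-8859-1
DEST_ENCODING=UTF-8
FILENAME="${1}"

# Convert only the basename; parent directories get their own call.
DIR="$(dirname -- "${FILENAME}")"
OLD_BASE="$(basename -- "${FILENAME}")"
NEW_BASE="$(printf '%s' "${OLD_BASE}" | iconv --from-code="${SRC_ENCODING}" --to-code="${DEST_ENCODING}")" || exit 1

# Nothing to do for pure-ASCII names; never overwrite an existing file.
[ "${NEW_BASE}" = "${OLD_BASE}" ] && exit 0
[ -e "${DIR}/${NEW_BASE}" ] && exit 1
mv -- "${FILENAME}" "${DIR}/${NEW_BASE}"
---------------------------------------------

...and then call it on all the names. Use -depth so that the contents
of a directory are renamed before the directory itself:

find /path/to/share -depth -exec /path/to/recoding-script.sh {} \;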

Notice that SRC_ENCODING is the encoding that was written by the Samba
server (prior to charset reconfiguration), so check it using the xxd
trick and an encoding table. DEST_ENCODING is the Linux encoding you'd
like to use afterwards, which you'll also need to configure in Samba.
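
Running the recoding a second time on a name that is already UTF-8
would encode it twice, so it may help to first list the names that
are not valid UTF-8 yet. The following is a minimal sketch, assuming
DEST_ENCODING is UTF-8 and that iconv exits non-zero on invalid input
(GNU iconv does). Both function names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the argument is a valid UTF-8 byte sequence.
is_valid_utf8() {
    printf '%s' "$1" | iconv -f UTF-8 -t UTF-8 >/dev/null 2>&1
}

# Hypothetical helper: list paths below a directory whose last
# component is not valid UTF-8. Names containing newlines are not
# handled correctly.
list_unconverted() {
    find "$1" -depth | while IFS= read -r path; do
        is_valid_utf8 "$(basename -- "${path}")" || printf '%s\n' "${path}"
    done
}
```

Calling list_unconverted /path/to/share before and after the
conversion gives a rough idea of what is left to do. It is only a
heuristic: a name in a one-byte encoding can, in rare cases, happen to
be valid UTF-8 as well.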

MfG, JBG

-- 
Jan-Benedict Glaw       jbglaw@lug-owl.de    . +49-172-7608481             _ O _
"Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg  _ _ O
 für einen Freien Staat voll Freier Bürger"  | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));
