
Re: Umlaut problems in filenames when going from samba2 to samba3



On Fri, 2006-03-03 18:54:30 +0100, Klaus Ade Johnstad <klaus@skolelinux.no> wrote:
> Friday, 3 March 2006, 07:48, Klaus Ade Johnstad wrote:
> I found out that if I use in smb.conf
> unix charset = cp850
> display charset = cp850
> 
> Then all the German and Norwegian special characters look "fine" again 

That basically means that your Linux box, too, uses cp850 (or
something similar) as its local Umlaut representation, and that you've
loaded a working font and mapping for it.

> I'm not sure if this is "a smart thing", but I've not been able to get 

It is enough if it solves your problem, but it's not a general
solution; e.g. it won't work if you ever need to support characters
that cp850 doesn't contain.

> this result using the different methods with "iconv -f cp850 -f utf-8" 
> or "convmv -f cp850 -t utf8".


First, understand the stack through which filenames are stored and
displayed.

Let's use an example: the German sharp s, "ß".

Typing a "ß" at the console produces the single byte 0xdf (in
iso-8859-1) or the two bytes 0xc3 0x9f (in UTF-8). Notice that in
UTF-8 this is two bytes, which the console driver displays as _one_
glyph on your monitor.
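
If you want to check those byte values yourself, here is a minimal
sketch in Python (my choice of tool for illustration, nothing in this
thread depends on it):

```python
# Encode the sharp s (U+00DF) in both charsets and look at the raw bytes.
sharp_s = "\u00df"  # "ß"

latin1_bytes = sharp_s.encode("iso-8859-1")
utf8_bytes = sharp_s.encode("utf-8")

print(latin1_bytes.hex())  # df
print(utf8_bytes.hex())    # c39f

# One character, but two bytes once it is encoded as UTF-8.
print(len(sharp_s), len(utf8_bytes))  # 1 2
```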

If this is given as a filename to the VFS API, the VFS will usually
save the bytes as-is. (There are rare cases where the FS driver
_forces_ a specific internal representation and thus may convert the
filename on its own, like NTFS, which stores names in a two-byte
(16-bit Unicode) representation.)

So now we've got a filename with 0xc3 0x9f in it; if the console is
set up to use Unicode/UTF-8, that will display correctly. If it is
configured to use e.g. iso-8859-1, you'll see two wrong characters
(because such encodings are purely one-byte encodings, so each byte is
shown as a character of its own).
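
The same effect can be reproduced without a console. This is only a
sketch; the filename "Straße.txt" is a made-up example, not one from
your share:

```python
# Bytes as the filesystem would store them when the name was typed
# on a UTF-8 console.
name_on_disk = "Stra\u00dfe.txt".encode("utf-8")

# A UTF-8 console decodes the two bytes 0xc3 0x9f back into one character.
as_utf8 = name_on_disk.decode("utf-8")

# An iso-8859-1 console treats every byte as a character of its own.
as_latin1 = name_on_disk.decode("iso-8859-1")

print(len(as_utf8))    # 10
print(len(as_latin1))  # 11
```

The extra character in the second case is the "two wrong chars instead
of one" effect: 0xc3 and 0x9f are each rendered separately.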

Now Samba steps in.  Since SMB (in all newer protocol variants,
ignoring traditional Lanman here) uses the same always-two-bytes
representation as NTFS does (surprise, surprise), Samba needs to
convert from whatever bytes the filename physically contains on disk
to this two-byte encoding. This is why Samba needs to be told which
charset is actually in use. (And for Lanman clients, Samba will try to
convert the NTFS-like two-byte encoding into a single-byte charset
encoding, too.)
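
Roughly, the conversion looks like the sketch below. This is not
Samba's code; to_wire() is a hypothetical helper, and I'm assuming the
two-byte encoding is little-endian 16-bit Unicode ("utf-16-le" in
Python terms):

```python
def to_wire(name_bytes: bytes, unix_charset: str) -> bytes:
    """Decode on-disk bytes with the declared charset, then re-encode
    them as 16-bit little-endian units (hypothetical helper)."""
    return name_bytes.decode(unix_charset).encode("utf-16-le")

cp850_name = b"\xe1"      # sharp s as stored by a cp850 system
utf8_name = b"\xc3\x9f"   # sharp s as stored by a UTF-8 system

# Declared charset matches the bytes on disk: both give the same result.
right_cp850 = to_wire(cp850_name, "cp850")
right_utf8 = to_wire(utf8_name, "utf-8")
print(right_cp850.hex(), right_utf8.hex())  # df00 df00

# Declared charset does NOT match: two bogus characters go to the client.
wrong = to_wire(utf8_name, "cp850")
print(len(wrong))  # 4
```

The last case is what a mismatched "unix charset" amounts to: the
bytes on disk are fine, but they are interpreted with the wrong table
before being sent out.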

> I suspect that the "correct" way of dealing with this is by using 
> convmv/iconv, but I haven't managed that yet.

You're now basically back to a very simple DOS-like approach. That may
be enough for your tasks.
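
Should you want to move the names to UTF-8 after all, the per-filename
step boils down to "decode with the old charset, encode with the new
one". A dry-run sketch, assuming the existing names really are cp850
(plan_renames() and the sample listing are made up for illustration;
this is not how convmv itself is implemented):

```python
def plan_renames(names, src="cp850", dst="utf-8"):
    """Return (old, new) byte-name pairs for names that would change.

    Dry run only: nothing on disk is touched. Names that cannot be
    converted are skipped.
    """
    plan = []
    for old in names:
        try:
            new = old.decode(src).encode(dst)
        except UnicodeError:
            continue
        if new != old:
            plan.append((old, new))
    return plan

# Hypothetical directory listing, as raw bytes.
listing = [b"Stra\xe1e.txt", b"plain.txt"]
plan = plan_renames(listing)
print(plan)  # [(b'Stra\xe1e.txt', b'Stra\xc3\x9fe.txt')]
```

Pure ASCII names come out unchanged and are left alone. Be aware that
running such a conversion a second time, on names that are already
UTF-8, would mangle them, so check the plan before renaming anything.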

Regards, JBG

-- 
Jan-Benedict Glaw       jbglaw@lug-owl.de    . +49-172-7608481             _ O _
"Eine Freie Meinung in  einem Freien Kopf    | Gegen Zensur | Gegen Krieg  _ _ O
 für einen Freien Staat voll Freier Bürger"  | im Internet! |   im Irak!   O O O
ret = do_actions((curr | FREE_SPEECH) & ~(NEW_COPYRIGHT_LAW | DRM | TCPA));
