
msdosfs mount options, short filenames, legacy codepages and utf-8 terminals



Hi all. Apologies if this mailing list isn't exactly the right place for
this question, but I'm stumped and I'd absolutely love a little help -
pretty please? :-)

I have a diskette, which I've formatted under Windows 2000, and on it
I've created an empty file called ABÇDE.TXT (the third letter is
"LATIN CAPITAL LETTER C WITH CEDILLA", Unicode code point U+00C7). I'm
using a standard UK Windows 2000 installation, which uses cp850 for
short filenames, so I'd expect the Ç to be encoded as 0x80. (I'm aware
that long filenames are encoded on disk in utf-16/ucs-2, but it's the
short filename I'm interested in.)
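
A quick way to double-check that expectation, as a minimal sketch: it
assumes Python 3 is to hand (not part of the setup described here) and
relies on its built-in "cp850" codec. It encodes the filename stem and
prints the resulting bytes.

```python
# Encode the filename stem with Python's built-in cp850 codec.
stem = "AB\u00c7DE"                      # "ABÇDE", with U+00C7 in the middle
encoded = stem.encode("cp850")

# The cedilla on its own, to see which single byte it maps to.
c_cedilla = "\u00c7".encode("cp850")

print(" ".join(f"{b:02x}" for b in encoded))   # hex bytes of the stem
print(f"{c_cedilla[0]:#04x}")                  # byte used for the cedilla
```

If the codec agrees with what Windows does, the third byte printed
should be the 0x80 expected above.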

Then I reboot into Debian Etch and check the byte representation on
disk:
~$ dosfsck -v /dev/fd0 | grep "Root directory"
Root directory starts at byte 9728 (sector 19)
~$ hexdump -C -s 9728 /dev/fd0
00002600  41 41 00 42 00 c7 00 44  00 45 00 0f 00 19 2e 00  |AA.B...D.E......|
00002610  54 00 58 00 54 00 00 00  ff ff 00 00 ff ff ff ff  |T.X.T...........|
00002620  41 42 80 44 45 20 20 20  54 58 54 20 00 7d a3 98  |AB.DE   TXT .}..|
00002630  ca 34 ca 34 00 00 a4 98  ca 34 00 00 00 00 00 00  |.4.4.....4......|
00002640  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00004200  f6 f6 f6 f6 f6 f6 f6 f6  f6 f6 f6 f6 f6 f6 f6 f6  |................|
*
00168000
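
For anyone who would rather not read the dump by eye, here is a rough
Python 3 sketch that decodes the two 32-byte directory entries shown
above. The byte offsets and the checksum routine follow the FAT
long-filename layout as I understand it from the commonly published
descriptions, so treat them as assumptions; the helper name
lfn_checksum is made up for this example.

```python
# First 64 bytes of the root directory, copied from the hexdump above.
raw = bytes.fromhex(
    "41 41 00 42 00 c7 00 44 00 45 00 0f 00 19 2e 00 "
    "54 00 58 00 54 00 00 00 ff ff 00 00 ff ff ff ff "
    "41 42 80 44 45 20 20 20 54 58 54 20 00 7d a3 98 "
    "ca 34 ca 34 00 00 a4 98 ca 34 00 00 00 00 00 00"
)
lfn_entry, short_entry = raw[:32], raw[32:]


def lfn_checksum(name11: bytes) -> int:
    """Checksum over the 11 short-name bytes (assumed FAT LFN algorithm)."""
    total = 0
    for byte in name11:
        # Rotate the running 8-bit sum right by one bit, then add the byte.
        total = (((total & 1) << 7) + (total >> 1) + byte) & 0xFF
    return total


# Long-name entry: UTF-16LE code units at offsets 1-10, 14-25 and 28-31
# (assumed layout); the name is NUL-terminated and padded with 0xffff.
units = lfn_entry[1:11] + lfn_entry[14:26] + lfn_entry[28:32]
long_name = units.decode("utf-16-le").split("\x00")[0]

# Short entry: 8 bytes of name and 3 bytes of extension, space-padded.
stem = short_entry[0:8].rstrip(b" ")
ext = short_entry[8:11].rstrip(b" ")
short_name = stem.decode("cp850") + "." + ext.decode("cp850")

print("long name :", long_name)
print("short name:", short_name, "(decoded as cp850)")
print("checksum  :", hex(lfn_entry[13]), "stored,",
      hex(lfn_checksum(short_entry[:11])), "computed")
```

The stored and computed checksums agreeing is a reasonable sign that
the first entry really is the long-name companion of the short entry
in the third row.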

And there's my 0x80, in the third row/third column. Cool. So far, so
good. Now I mount the diskette:
~# mount -t msdos -o codepage=850 /dev/fd0 mnt
~# ls mnt
ab?de.txt

Hmmm. Let's just check that third character:
~# ls mnt | hexdump -C
00000000  61 62 80 64 65 2e 74 78  74 0a                    |ab.de.txt.|
0000000a

OK, so the 0x80 is still there. Now we get to the bit that I don't
understand. I think I'm using a utf-8-aware terminal (gnome-terminal on
Ubuntu Dapper). I say that because the following prints the two-byte
utf-8 encoding of the character (c3 87), rather than a single byte:

~# /usr/bin/printf Ç | hexdump -C
00000000  c3 87                                             |..|
00000002

So perhaps I'm missing a translation layer to go from cp850 to utf-8. Or
then again, my understanding of how this works could be completely
wrong.
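
To make concrete what such a translation would have to do, here is a
small Python 3 sketch (Python is only a stand-in for illustration; it
says nothing about what the kernel or ls actually do). It takes the
bytes that ls produced, checks whether they are valid utf-8 as they
stand, and then re-encodes them from cp850 to utf-8.

```python
# Bytes exactly as `ls mnt | hexdump -C` showed them, minus the newline.
listed = bytes.fromhex("61 62 80 64 65 2e 74 78 74")

# A lone 0x80 is not a valid utf-8 sequence, so a utf-8 terminal has
# nothing sensible to draw for it.
try:
    listed.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

# Interpret the bytes as cp850, then re-encode them for a utf-8 terminal.
text = listed.decode("cp850")
as_utf8 = text.encode("utf-8")

print("valid utf-8 as listed:", valid_utf8)
print("decoded as cp850     :", text)
print("re-encoded as utf-8  :", " ".join(f"{b:02x}" for b in as_utf8))
```

The re-encoded bytes contain the same c3 87 pair that printf emits,
which is plausibly why the raw 0x80 shows up as "?" instead.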

Interestingly, I can get to see the Ç if I mount the disk as vfat rather
than msdos, but then I think what I'm really seeing is the long
filename, not the short filename:
~# mount -t vfat -o iocharset=cp850,utf8 /dev/fd0 mnt
~# ls mnt
ABÇDE.TXT

Just to check my understanding, I've also tried the other way round,
writing the file from Linux:
~# dd if=/dev/zero of=loopbackImage bs=1024 count=128 ; mkdosfs loopbackImage
~# mount -t msdos -o loop,codepage=850 loopbackImage mnt
~# touch mnt/ABÇDE.TXT ; umount mnt
~# dosfsck -v loopbackImage | grep "Root directory"
Root directory starts at byte 1536 (sector 3)
~# hexdump -C -s 1536 loopbackImage
00000600  41 42 c3 87 44 45 20 20  54 58 54 20 00 00 00 00  |AB..DE  TXT ....|
00000610  00 00 00 00 00 00 a2 b4  ca 34 00 00 00 00 00 00  |.........4......|
00000620  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
*
00020000

I've discovered that "c3 87" just happens to be the utf-8 encoding of my
blessed "C cedilla" - but why is Linux writing a utf-8 encoded filename
onto an msdos filesystem (especially when "cat /proc/mounts" shows
"codepage=cp850")?
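
To illustrate what those on-disk bytes mean under each encoding, here
is a short Python 3 sketch. It does not explain why the bytes end up
there; it only decodes the 8-byte name field from the dump above twice,
once as utf-8 and once as cp850, the latter being my assumption about
how a cp850 system would read the same short name.

```python
# The 11 short-name bytes of the entry in loopbackImage, from the dump.
entry_name = bytes.fromhex("41 42 c3 87 44 45 20 20 54 58 54")
stem = entry_name[:8].rstrip(b" ")     # name field without space padding
ext = entry_name[8:].rstrip(b" ")      # extension field

# The same six bytes, read under the two encodings in question.
as_utf8 = stem.decode("utf-8")
as_cp850 = stem.decode("cp850")

print("stem bytes  :", " ".join(f"{b:02x}" for b in stem))
print("read as utf8:", as_utf8 + "." + ext.decode("ascii"))
print("read as 850 :", as_cp850 + "." + ext.decode("ascii"))
```

Read as cp850, the two bytes c3 87 come out as two separate characters
(a box-drawing character followed by a lower-case c cedilla), so the
stem occupies six of the eight name bytes rather than five.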

Is this a bug, or am I doing something wrong? I'd love to know whether I
can get my "C cedilla" out using an msdos filesystem/short filenames
(rather than vfat/long filenames). By the way, I've put the image of the
"Windows 2000" dosfs diskette here:
http://www.carbon.eclipse.co.uk/msdosfs.diskImage - you can "dd" it onto
a blank diskette to recreate the diskette if you want.

Also, I've had a thought that perhaps I should change my locale to match
the encoding - but "dpkg-reconfigure locales" doesn't present me with a
choice for "en_GB.cp850".

Many thanks, Jaime



