msdosfs mount options, short filenames, legacy codepages and utf-8 terminals
Hi all. Apologies if this mailing list isn't exactly the right place for
this question, but I'm stumped and I'd absolutely love a little help -
pretty please? :-)
I have a diskette, which I've formatted under Windows 2000, and on it
I've created an empty file, called ABÇDE.TXT (the third letter is called
a "LATIN CAPITAL LETTER C WITH CEDILLA", unicode codepoint U+00C7). I'm
using a standard UK Windows 2000 installation which, for short
filenames, uses cp850 (I'm aware that long filenames are encoded on disk
in utf-16/ucs-2, but I'm interested in the short filename), so I'd
expect the Ç to be encoded as 0x80.
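A minimal Python sketch of that expectation. It assumes Python's built-in "cp850" codec matches the OEM code page Windows uses for short names; the variable names are only illustrative.

```python
# Assumption: Python's "cp850" codec matches the Windows OEM code page 850.
ch = "\u00c7"  # LATIN CAPITAL LETTER C WITH CEDILLA

short_byte = ch.encode("cp850")      # what the 8.3 (short) entry should hold
long_bytes = ch.encode("utf-16-le")  # what the long-filename entry should hold

print(short_byte.hex())  # 80
print(long_bytes.hex())  # c700
```

So the short entry should contain a single 0x80 byte, and the long entry the pair c7 00.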
Then I reboot into Debian Etch and check the byte representation on
disk:
~$ dosfsck -v /dev/fd0 | grep "Root directory"
Root directory starts at byte 9728 (sector 19)
~$ hexdump -C -s 9728 /dev/fd0
00002600 41 41 00 42 00 c7 00 44 00 45 00 0f 00 19 2e 00 |AA.B...D.E......|
00002610 54 00 58 00 54 00 00 00 ff ff 00 00 ff ff ff ff |T.X.T...........|
00002620 41 42 80 44 45 20 20 20 54 58 54 20 00 7d a3 98 |AB.DE   TXT .}..|
00002630 ca 34 ca 34 00 00 a4 98 ca 34 00 00 00 00 00 00 |.4.4.....4......|
00002640 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00004200 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 f6 |................|
*
00168000
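To read the short entry out of that dump by hand, here is a rough Python sketch. It assumes the usual FAT directory-entry layout (8 name bytes, 3 extension bytes, then one attribute byte) and uses a hypothetical helper, decode_short_name, which is not part of any tool mentioned here.

```python
# The 32 bytes at offset 0x2620 in the dump (the 8.3 directory entry).
entry = bytes.fromhex(
    "41 42 80 44 45 20 20 20 54 58 54 20 00 7d a3 98"
    "ca 34 ca 34 00 00 a4 98 ca 34 00 00 00 00 00 00"
)

def decode_short_name(raw, codepage="cp850"):
    """Hypothetical helper: decode the space-padded 8+3 name fields."""
    name = raw[0:8].decode(codepage).rstrip(" ")
    ext = raw[8:11].decode(codepage).rstrip(" ")
    return f"{name}.{ext}" if ext else name

short_name = decode_short_name(entry)
attributes = entry[11]  # 0x20 is the "archive" attribute bit

print(short_name)
print(hex(attributes))
```

Decoded as cp850, the name field yields the expected C cedilla; decoded with any other single-byte code page, the 0x80 byte would come out as a different character.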
And there's my 0x80, in the third row/third column. Cool. So far, so
good. Now I mount the diskette:
~# mount -t msdos -o codepage=850 /dev/fd0 mnt
~# ls mnt
ab?de.txt
Hmmm. Let's just check that third character:
~# ls mnt | hexdump -C
00000000 61 62 80 64 65 2e 74 78 74 0a |ab.de.txt.|
0000000a
OK, so the 0x80 is still there. Now we get to the bit that I don't
understand. I think I'm using a utf-8-aware terminal (gnome-terminal on
ubuntu dapper). I say that because the following prints the two-byte
utf-8 encoding of the character (c3 87), not a single cp850 byte:
~# /usr/bin/printf Ç | hexdump -C
00000000 c3 87 |..|
00000002
So perhaps I'm missing a translation-layer to go from cp850 to utf-8. Or
then again, my understanding of how this works could be completely
wrong.
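For illustration, the translation I have in mind would look roughly like this in Python. This is only a sketch of the byte conversion, assuming the bytes from "ls" really are cp850; it is not how the kernel or the terminal does it.

```python
# Bytes printed by "ls mnt" on the msdos mount, copied from the hexdump above.
raw = b"ab\x80de.txt\n"

# Interpret the bytes as cp850, then re-encode them for a utf-8 terminal.
text = raw.decode("cp850")
utf8 = text.encode("utf-8")

print(utf8.hex(" "))
```

After that step the single 0x80 byte becomes the pair c3 87, which is what a utf-8 terminal needs in order to draw the Ç.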
Interestingly, I can get to see the Ç if I mount the disk as vfat rather
than msdos, but then I think what I'm really seeing is the long
filename, not the short filename:
~# mount -t vfat -o iocharset=cp850,utf8 /dev/fd0 mnt
~# ls mnt
ABÇDE.TXT
Just to check my understanding, I've also tried the other way round:
write the file using linux:
~# dd if=/dev/zero of=loopbackImage bs=1024 count=128 ; mkdosfs loopbackImage
~# mount -t msdos -o loop,codepage=850 loopbackImage mnt
~# touch mnt/ABÇDE.TXT ; umount mnt
~# dosfsck -v loopbackImage | grep "Root directory"
Root directory starts at byte 1536 (sector 3)
~# hexdump -C -s 1536 loopbackImage
00000600 41 42 c3 87 44 45 20 20 54 58 54 20 00 00 00 00 |AB..DE  TXT ....|
00000610 00 00 00 00 00 00 a2 b4 ca 34 00 00 00 00 00 00 |.........4......|
00000620 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 |................|
*
00020000
I've discovered that "c3 87" just happens to be the utf-8 encoding of my
blessed "C cedilla" - but why is Linux writing a utf-8 encoded filename
onto an msdos filesystem (especially when "cat /proc/mounts" shows
"codepage=cp850")?
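To see what a reader that honours cp850 would make of those two bytes, here is a small Python sketch. It assumes Python's "cp850" codec is a fair stand-in for what DOS or Windows would do with the short name.

```python
# Name field written by Linux, copied from the hexdump above (8 bytes, space padded).
name_field = bytes.fromhex("41 42 c3 87 44 45 20 20")

# Read as utf-8, the field holds the intended five characters.
as_utf8 = name_field.decode("utf-8").rstrip(" ")

# Read as cp850 (assumed to be what a DOS/Windows reader would use),
# each of the two bytes becomes its own character.
as_cp850 = name_field.decode("cp850").rstrip(" ")

print(as_utf8)
print(as_cp850)
```

Read as cp850, the name is six characters long and the Ç is replaced by a box-drawing character followed by a lowercase c cedilla, so the on-disk name is not what a cp850 system would expect.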
Is this a bug, or am I doing something wrong? I'd love to know whether I
can get my "C cedilla" out using an msdos filesystem/short filenames
(rather than vfat/long filenames). By the way, I've put the image of the
"Windows 2000" dosfs diskette here:
http://www.carbon.eclipse.co.uk/msdosfs.diskImage - you can "dd" it onto
a blank diskette to recreate the diskette if you want.
Also, I've had a thought that perhaps I should change my locale to match
the encoding - but "dpkg-reconfigure locales" doesn't present me with a
choice for "en_GB.cp850".
Many thanks, Jaime