Re: Garbled Chinese file names

I have a Python script here that might be useful.

#!/usr/bin/env python2
# -*- coding: utf-8 -*-
# uzip_gbk.py (Python 2): extract a zip whose file names were stored as GBK bytes

import os
import argparse
import zipfile

parser = argparse.ArgumentParser(description='Unzip zip file.')
parser.add_argument('file', nargs='+', help='zip file')
parser.add_argument('-p', '--password', help='set password')
args = parser.parse_args()

# Only the first zip file given on the command line is processed.
zip_file = args.file[0]
password = args.password

print "Processing File " + zip_file

zf = zipfile.ZipFile(zip_file, "r")
if password is not None:
    zf.setpassword(password)

for name in zf.namelist():
    # Without the UTF-8 flag, Python 2 returns the name as a raw byte string,
    # so it can be decoded as GBK directly.
    utf8name = name.decode('gbk')
    print "Extracting " + utf8name
    pathname = os.path.dirname(utf8name)
    if pathname != "" and not os.path.exists(pathname):
        os.makedirs(pathname)
    data = zf.read(name)
    # Existing paths are never overwritten; directory entries are skipped here too.
    if not os.path.exists(utf8name):
        fo = open(utf8name, "wb")
        fo.write(data)
        fo.close()

zf.close()
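For anyone on Python 3, here is a rough sketch of the same idea, not a drop-in replacement for the script above. It relies on the fact that Python 3's zipfile decodes names as cp437 when the UTF-8 flag (bit 11 of the general purpose flags) is not set, so the raw bytes can be recovered by encoding back to cp437. The function names, the `dest_dir` parameter and the path-escape check are my own additions for illustration, and GBK as the default is an assumption about where the archive came from.

```python
import os
import zipfile


def decode_member_name(info, encoding="gbk"):
    """Return the member name, re-decoding it when the UTF-8 flag is unset."""
    if info.flag_bits & 0x800:
        # Bit 11 set: the name was stored as UTF-8 and is already correct.
        return info.filename
    # Python 3 decoded the raw bytes as cp437; undo that, then apply `encoding`.
    return info.filename.encode("cp437").decode(encoding)


def extract_with_encoding(zip_path, dest_dir=".", encoding="gbk", pwd=None):
    """Extract zip_path into dest_dir; `pwd` must be bytes if given."""
    extracted = []
    dest_root = os.path.realpath(dest_dir)
    with zipfile.ZipFile(zip_path) as zf:
        if pwd is not None:
            zf.setpassword(pwd)
        for info in zf.infolist():
            name = decode_member_name(info, encoding)
            target = os.path.realpath(os.path.join(dest_root, name))
            # Refuse names that would land outside dest_dir (e.g. "../x").
            if os.path.commonpath([dest_root, target]) != dest_root:
                raise ValueError("unsafe member name: " + name)
            if info.is_dir():
                os.makedirs(target, exist_ok=True)
                continue
            os.makedirs(os.path.dirname(target), exist_ok=True)
            # Like the original script, never overwrite an existing file.
            if not os.path.exists(target):
                with open(target, "wb") as out:
                    out.write(zf.read(info))
            extracted.append(name)
    return extracted
```

Calling `extract_with_encoding("some.zip", "out")` should leave the files under `out/` with readable names, provided the names really were GBK. If they were stored in another encoding, pass it as `encoding`.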

2017-04-25 14:41 GMT+08:00 Shell Xu <shell909090@gmail.com>:

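What I know may not be entirely accurate.
A zip archive stores file names as raw bytes, so each platform writes them in whatever encoding it uses. On Windows (with a Simplified Chinese locale) that is usually CP936, while Linux tools try to interpret the names as UTF-8 when extracting.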

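The cause is that the encoding assumed at extraction time does not match the one used when the archive was written, so reading the raw bytes produces garbage. UTF-8 is self-validating, so it can (probably) detect this kind of error. When that happens, many byte sequences that cannot be parsed as UTF-8 are simply replaced with ?. After that replacement the information is permanently destroyed and can never be restored.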
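To make the "permanently destroyed" point concrete, here is a small Python 3 illustration. It is only a sketch: the sample file name is made up, and Python substitutes U+FFFD where an unzip tool might print ?. It contrasts a lossy UTF-8 decode with a cp437 decode, which maps every byte value to some character and therefore can be reversed.

```python
def decode_lossy(raw_name):
    """Decode bytes as UTF-8, replacing invalid sequences (information is lost)."""
    return raw_name.decode("utf-8", errors="replace")


def decode_reversible(raw_name):
    """Decode bytes as cp437, which assigns a character to every byte value."""
    return raw_name.decode("cp437")


# Hypothetical file name, encoded the way a CP936 Windows tool would store it.
original = "中文文件名.txt"
raw = original.encode("gbk")

lossy = decode_lossy(raw)
reversible = decode_reversible(raw)

# The lossy form cannot be turned back into the original bytes ...
lossy_roundtrip = lossy.encode("utf-8")
# ... while the cp437 form can, so the GBK name is still recoverable.
recovered = reversible.encode("cp437").decode("gbk")
```

Here `recovered` equals `original`, while `lossy_roundtrip` no longer matches `raw`. That is why running iconv or convmv on names that have already been through the replacement step cannot bring them back.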

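In other words, the problem has to be solved while the zip file is being parsed, not after the files have been extracted.

I checked the zip man page. The -UN option looks like it might help, but I have not tested it. If you are interested, please try it and report what you find.

Thanks.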

2017-04-25 13:11 GMT+08:00 atzlinux <atzlinux@sina.com>:

After extracting a zip archive, the Chinese file names inside come out garbled, like this:

+???+_?? ?-+???»+? ??+?-???-??? ??++?+?٦?-+?-ˬ DevOps-4?-22+i +?-?ο??
?-+??_+? ??++?????+ ?+ߦ? DevOps-4?-21+i +¥?ο?? ??+?ο??

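I tried converting them to UTF-8 with iconv and convmv, using common Chinese encodings such as gbk and gb2312, but the Chinese file names still do not display correctly.

Has anyone run into a similar problem before? How did you solve it?

And how can I tell which Chinese character encoding these files originally used?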

atzlinux
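On the question of finding out which encoding was used: as far as I know there is no reliable way, but one rough heuristic is to take the raw name bytes straight from the archive and see which candidate encodings decode them without errors. The sketch below (Python 3) does that. The candidate list and both function names are assumptions for illustration. A strict decode that succeeds only narrows things down, so you still have to look at the results and pick the one that reads as real Chinese.

```python
import zipfile

# Candidate encodings to try; this list is an assumption, adjust as needed.
CANDIDATES = ("utf-8", "gb18030", "big5")


def plausible_encodings(raw_name, candidates=CANDIDATES):
    """Return (encoding, decoded) pairs for candidates that decode strictly."""
    results = []
    for enc in candidates:
        try:
            results.append((enc, raw_name.decode(enc)))
        except UnicodeDecodeError:
            continue
    return results


def raw_member_names(zip_path):
    """Yield the stored name bytes of members whose UTF-8 flag is not set."""
    with zipfile.ZipFile(zip_path) as zf:
        for info in zf.infolist():
            if not info.flag_bits & 0x800:
                # Python 3 decoded these bytes as cp437; encode back to get them.
                yield info.filename.encode("cp437")
```

A typical use would be to loop over `raw_member_names("some.zip")` and print `plausible_encodings(raw)` for each name. This only works on the archive itself, not on files that were already extracted with ? substituted into their names.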

--

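There are spaces between the joints, and the blade of the knife has no thickness; put what has no thickness into such spaces, and there is plenty of room for the blade to move about.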
blog: http://shell909090.org/
twitter: @shell909090
about.me: http://about.me/shell909090