Mastering Python's Annoying Encoding Problems

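Python's encoding problems are a hurdle that almost every newcomer runs into, but once you understand them thoroughly you are past the pit for good, because everything follows from the same few rules. I ran into exactly this problem recently, so let's walk through it together.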

It started when I was reviewing an upload feature written by a colleague. Look at the code below; self.fp is the handle of the uploaded file.

fpdata = [line.strip().decode('gbk').encode('utf-8').decode('utf-8') for line in self.fp]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]

This code exposes two problems:
1. It assumes gbk as the default encoding. Why not utf8?
2. .encode('utf-8').decode('utf-8') is unnecessary; after decode('gbk') the value is already a unicode object.
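To make the second point concrete, here is a small sketch. The sample text and variable names are my own, not taken from the upload code, and it is written so that it should behave the same on Python 2.7 and Python 3:

```python
# Hypothetical sample: the GBK bytes of a two-character Chinese string.
line = u"\u4e2d\u6587".encode("gbk")

# decode('gbk') already yields a unicode object ...
decoded = line.decode("gbk")

# ... so encoding to UTF-8 and decoding straight back changes nothing.
round_tripped = decoded.encode("utf-8").decode("utf-8")

print(round_tripped == decoded)
```

The extra encode/decode pair costs time and hides the one decision that matters, namely which encoding the uploaded bytes are in.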

I suggested that uploaded files be encoded as utf8, so the code became this:

fpdata = [line.strip() for line in self.fp if line.strip()]
data = [''.join(['(', self.game, ',', ','.join(map(lambda x: "'%s'" % x, d.split(','))), ')']) for d in fpdata[1:]]

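But in testing it raised UnicodeDecodeError: 'ascii' codec can't decode byte 0xe4 in position 0: ordinal not in range(128). This exception probably gives newcomers a headache. What does it mean?
It means that while decoding a byte string to unicode with the ascii codec, Python hit the byte 0xe4, which is outside the ASCII range (0 to 127), so the whole decode failed.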
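The same exception can be reproduced in isolation. The snippet below is only an illustration with a made-up sample value; it decodes explicitly with the ascii codec, which is what Python 2 does implicitly in the failing code:

```python
# UTF-8 bytes of one Chinese character; the first byte is 0xe4 (hypothetical sample).
raw = b"\xe4\xb8\xad"

try:
    raw.decode("ascii")        # ASCII only covers bytes 0x00-0x7f
    message = None
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)

# Naming the real encoding succeeds and yields a single character.
text = raw.decode("utf-8")
```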

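Some background on the project: we use Python 2.7, this is a Django project, and self.game is a unicode object. Anyone experienced will see at a glance that this is a sys.setdefaultencoding issue, which is exactly the problem shown below.
[Image 1]
Hmm, but I was sure I had set the default encoding in settings.py. I checked, and it turned out I had made a silly mistake.
[Image 2]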

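So what actually happened? Every item in fpdata is a utf8-encoded byte string, while self.game is a unicode object. When they are joined, Python decodes the utf8 byte strings into unicode objects, but it has no idea they are utf8, so it falls back to decoding with ascii. No wonder it fails.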
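One way to avoid relying on the default at all is to decode every line explicitly before it meets a unicode object. The sketch below is my own rewrite under that assumption, not the fix the project actually shipped; build_rows, its encoding parameter, and the sample data are hypothetical, and io.BytesIO stands in for self.fp:

```python
import io


def build_rows(fp, game, encoding="utf-8"):
    """Decode each non-empty line, skip the header line, and build
    one "(game,'a','b')" string per remaining line."""
    lines = [line.strip().decode(encoding) for line in fp if line.strip()]
    rows = []
    for line in lines[1:]:
        quoted = u",".join(u"'%s'" % field for field in line.split(u","))
        rows.append(u"(%s,%s)" % (game, quoted))
    return rows


# Simulated upload: a header line followed by one UTF-8 encoded data line.
fp = io.BytesIO(u"name,city\n\u5f20\u4e09,\u5317\u4eac\n".encode("utf-8"))
rows = build_rows(fp, u"demo")
print(len(rows))
```

Because everything is unicode by the time it is joined, no implicit ascii decode ever happens.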

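This is also why veterans suggest adding a sys.setdefaultencoding("utf8") call when you hit encoding problems. But why is that call needed? Let's look at what happens at the lowest level.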

When a byte string is concatenated with a unicode object, i.e. a + b, PyString_Concat is called.

/* stringobject.c */
void
PyString_Concat(register PyObject **pv, register PyObject *w)
{
    register PyObject *v;
    if (*pv == NULL)
        return;
    if (w == NULL || !PyString_Check(*pv)) {
        Py_CLEAR(*pv);
        return;
    }
    v = string_concat((PyStringObject *) *pv, w);
    Py_DECREF(*pv);
    *pv = v;
}

static PyObject *
string_concat(register PyStringObject *a, register PyObject *bb)
{
    register Py_ssize_t size;
    register PyStringObject *op;
    if (!PyString_Check(bb)) {
        if (PyUnicode_Check(bb))
            return PyUnicode_Concat((PyObject *)a, bb);
        if (PyByteArray_Check(bb))
            return PyByteArray_Concat((PyObject *)a, bb);
        PyErr_Format(PyExc_TypeError,
                     "cannot concatenate 'str' and '%.200s' objects",
                     Py_TYPE(bb)->tp_name);
        return NULL;
    }
    ...
}

If b is detected to be a unicode object, PyUnicode_Concat is called.

PyObject *PyUnicode_Concat(PyObject *left,
                           PyObject *right)
{
    PyUnicodeObject *u = NULL, *v = NULL, *w;

    /* Coerce the two arguments */
    u = (PyUnicodeObject *)PyUnicode_FromObject(left);
    v = (PyUnicodeObject *)PyUnicode_FromObject(right);
    w = _PyUnicode_New(u->length + v->length);
    Py_DECREF(v);
    return (PyObject *)w;
}

PyObject *PyUnicode_FromObject(register PyObject *obj)
{
    if (PyUnicode_Check(obj)) {
        /* For a Unicode subtype that's not a Unicode object,
           return a true Unicode object with the same data. */
        return PyUnicode_FromUnicode(PyUnicode_AS_UNICODE(obj),
                                     PyUnicode_GET_SIZE(obj));
    }
    return PyUnicode_FromEncodedObject(obj, NULL, "strict");
}

Since a is not a unicode object, PyUnicode_FromEncodedObject is called to convert a into a unicode object, and the encoding passed is NULL.

PyObject *PyUnicode_FromEncodedObject(register PyObject *obj,
                                      const char *encoding,
                                      const char *errors)
{
    const char *s = NULL;
    Py_ssize_t len;
    PyObject *v;

    /* Coerce object */
    if (PyString_Check(obj)) {
        s = PyString_AS_STRING(obj);
        len = PyString_GET_SIZE(obj);
    }

    /* Convert to Unicode */

    v = PyUnicode_Decode(s, len, encoding, errors);
    return v;
}

PyObject *PyUnicode_Decode(const char *s,
                           Py_ssize_t size,
                           const char *encoding,
                           const char *errors)
{
    PyObject *buffer = NULL, *unicode;
    if (encoding == NULL)
        encoding = PyUnicode_GetDefaultEncoding();

    /* Shortcuts for common default encodings */
    if (strcmp(encoding, "utf-8") == 0)
        return PyUnicode_DecodeUTF8(s, size, errors);
    else if (strcmp(encoding, "latin-1") == 0)
        return PyUnicode_DecodeLatin1(s, size, errors);
    else if (strcmp(encoding, "ascii") == 0)
        return PyUnicode_DecodeASCII(s, size, errors);

    /* Decode via the codec registry */
    buffer = PyBuffer_FromMemory((void *)s, size);
    if (buffer == NULL)
        goto onError;
    unicode = PyCodec_Decode(buffer, encoding, errors);

    return unicode;
}

We can see that when encoding is NULL, it is set to PyUnicode_GetDefaultEncoding(), which is simply the value returned by sys.getdefaultencoding(). In Python 2 the default is ascii.

static char unicode_default_encoding[100 + 1] = "ascii";

const char *PyUnicode_GetDefaultEncoding(void)
{
    return unicode_default_encoding;
}
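The whole chain can be modelled in a few lines of Python. This is only a simplified model of the C code above, not CPython's implementation; DEFAULT_ENCODING, decode_with_default, and concat are hypothetical names standing in for unicode_default_encoding, PyUnicode_Decode, and the str + unicode path:

```python
import codecs

# Stand-in for the static unicode_default_encoding array.
DEFAULT_ENCODING = "ascii"


def decode_with_default(data, encoding=None, errors="strict"):
    """Like PyUnicode_Decode: use the default when no encoding is given."""
    if encoding is None:
        encoding = DEFAULT_ENCODING
    return codecs.decode(data, encoding, errors)


def concat(left, right):
    """Coerce whichever side is a byte string to unicode, then concatenate."""
    if isinstance(left, bytes):
        left = decode_with_default(left)
    if isinstance(right, bytes):
        right = decode_with_default(right)
    return left + right


print(concat(b"abc", u"def"))

try:
    concat(b"\xe4\xb8\xad", u"def")   # non-ASCII bytes, no encoding given
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)
```

Pure-ASCII byte strings pass through unnoticed, which is why this kind of bug tends to surface only once real Chinese text is uploaded.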

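Here unicode_default_encoding is a static variable, allocated with enough room for you to specify a different encoding; 100 characters (plus the terminating NUL) is surely enough.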

Now let's look at getdefaultencoding and setdefaultencoding in the sys module.

static PyObject *
sys_getdefaultencoding(PyObject *self)
{
    return PyString_FromString(PyUnicode_GetDefaultEncoding());
}

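static PyObject *
sys_setdefaultencoding(PyObject *self, PyObject *args)
{
    char *encoding;
    /* Extract the encoding name from the Python-level argument. */
    if (!PyArg_ParseTuple(args, "s:setdefaultencoding", &encoding))
        return NULL;
    if (PyUnicode_SetDefaultEncoding(encoding))
        return NULL;
    Py_INCREF(Py_None);
    return Py_None;
}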

As you would guess, PyUnicode_SetDefaultEncoding only has to set the unicode_default_encoding array; Python does it with strncpy.

int PyUnicode_SetDefaultEncoding(const char *encoding)
{
    PyObject *v;
    /* Make sure the encoding is valid. As side effect, this also
       loads the encoding into the codec registry cache. */
    v = _PyCodec_Lookup(encoding);
    if (v == NULL)
        goto onError;
    Py_DECREF(v);
    strncpy(unicode_default_encoding,
            encoding,
            sizeof(unicode_default_encoding) - 1);
    return 0;

  onError:
    return -1;
}
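In Python terms, the function above does roughly the following. This is an illustrative model only; set_default_encoding, get_default_encoding, and _default_encoding are hypothetical names, and codecs.lookup plays the role of _PyCodec_Lookup:

```python
import codecs

_default_encoding = "ascii"   # stand-in for unicode_default_encoding
_MAX_NAME_LENGTH = 100        # sizeof(unicode_default_encoding) - 1


def set_default_encoding(name):
    """Validate the codec name first, then store it (truncated like strncpy)."""
    global _default_encoding
    codecs.lookup(name)       # raises LookupError for an unknown encoding
    _default_encoding = name[:_MAX_NAME_LENGTH]


def get_default_encoding():
    return _default_encoding


set_default_encoding("utf8")
print(get_default_encoding())

try:
    set_default_encoding("no-such-codec")
    rejected = False
except LookupError:
    rejected = True
print(rejected)
```

Validating before copying means an invalid name leaves the previous default untouched.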

Earlier, when we called sys.setdefaultencoding("utf8"), we had to reload(sys) first. That is because Python's site.py contains this operation:

    if hasattr(sys, "setdefaultencoding"):
        del sys.setdefaultencoding
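For reference, the reload trick usually looks like the sketch below. The wrapper function force_utf8_default is my own hypothetical name; the version guard is there because reload is a builtin only on Python 2, and Python 3 has no sys.setdefaultencoding at all:

```python
import sys


def force_utf8_default():
    """On Python 2, restore sys.setdefaultencoding and switch to UTF-8.
    On Python 3 this does nothing; the default is already UTF-8."""
    if sys.version_info[0] == 2:
        reload(sys)                       # brings back the attribute site.py deleted
        sys.setdefaultencoding("utf-8")
    return sys.getdefaultencoding()


print(force_utf8_default())
```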

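Of course you can also customize site.py and modify setencoding so that it uses the locale's setting, i.e. change if 0 to if 1. On Windows the locale encoding is usually cp936, while servers are usually utf8.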

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

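So Python's encoding behavior is not that mysterious.
To master encoding in Python you need to know:
1. The difference between unicode and utf8 or gbk, and how to convert between unicode and a concrete encoding.
2. Concatenating a byte string with a unicode object converts the byte string to unicode; str(unicode) converts back to a byte string.
3. When the concrete encoding is unknown, the system default encoding, ascii, is used; it can be changed with sys.setdefaultencoding.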
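The first point can be checked with a short sketch. The sample text is made up, and the code should run unchanged on Python 2.7 and Python 3:

```python
# One piece of text, two different byte representations.
text = u"\u4e2d\u6587"
utf8_bytes = text.encode("utf-8")   # 3 bytes per character here
gbk_bytes = text.encode("gbk")      # 2 bytes per character here

# Converting between encodings always goes through unicode.
transcoded = gbk_bytes.decode("gbk").encode("utf-8")

# Decoding with the wrong codec fails instead of guessing.
try:
    gbk_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(len(utf8_bytes), len(gbk_bytes), transcoded == utf8_bytes, wrong_codec_failed)
```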

If you can explain the behavior shown below, you should be able to handle Python's annoying encoding problems.
[Image 3]
