Today I ran into the following error when calling pandas.read_csv:
File "/root/anaconda2/lib/python2.7/site-packages/pandas/io/parsers.py", line 1213, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 766, in pandas.parser.TextReader.read (pandas/parser.c:7988)
File "pandas/parser.pyx", line 788, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:8244)
File "pandas/parser.pyx", line 842, in pandas.parser.TextReader._read_rows (pandas/parser.c:8970)
File "pandas/parser.pyx", line 829, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:8838)
File "pandas/parser.pyx", line 1833, in pandas.parser.raise_parser_error (pandas/parser.c:22649)
pandas.parser.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.
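The cause turned out to be that a single field in the CSV file contained \r, the carriage return character.
Testing the carriage return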
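# -*- coding: utf-8 -*-
import pandas as pd

# A single field containing carriage returns (\r)
a = "\r\r 汉考克在接受当地媒体采访时表示" \
    "\r\r 汉考克"
d = {'nid': [100], 'doc': [a]}
df = pd.DataFrame(data=d, columns=('nid', 'doc'))
# Write the frame to disk, then read it straight back
df.to_csv('p.txt', index=False)
df1 = pd.read_csv('p.txt')
print(df1.head())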
This produces the same error as above. Removing the \r characters makes it go away.
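One way to picture why a stray \r inside a field is a problem is that a line-oriented reader may treat it as a row boundary. The sketch below uses only Python's built-in str.splitlines, which splits on \r, \n and \r\n alike. It is an analogy under that assumption, not pandas' actual tokenizer, and the sample row is made up for illustration.

```python
# Hypothetical CSV data row: an id, then a text field with embedded \r
row = "100,\r\r first part\r\r second part"

# splitlines() treats \r, \n and \r\n all as line boundaries,
# so the single row falls apart into several pieces
pieces = row.splitlines()
print(len(pieces), pieces)

# Splitting on \n only keeps the row in one piece
print(len(row.split("\n")))
```

A reader that only breaks rows on \n would see one row here, while one that also breaks on \r sees five fragments with inconsistent field counts.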
Testing the line feed
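# -*- coding: utf-8 -*-
import pandas as pd

# The same field, this time with line feeds (\n) instead of \r
a = "\n\n 汉考克在接受当地媒体采访时表示" \
    "\n\n 汉考克"
d = {'nid': [100], 'doc': [a]}
df = pd.DataFrame(data=d, columns=('nid', 'doc'))
# Write the frame to disk, then read it straight back
df.to_csv('p.txt', index=False)
df1 = pd.read_csv('p.txt')
print(df1.head())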
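This time the error does not occur.
Carriage return vs. line feed
\r: carriage return, moves the cursor to the start of the line
\n: line feed, moves the cursor to the next line
In my tests, Linux and macOS do not use the carriage return \r in their line endings: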
echo -en '12\n34\r56\n\r78\r\n' > tmp
Viewing the file shows each \r rendered as ^M:
12
34^M56
^M78^M
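Windows, on the other hand, does use \r: it moves the cursor to the start of the line, and \n then moves to the next line.
So text containing \r shows up with ^M characters on macOS and Linux, and pandas.read_csv fails on it.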
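Before handing a file to pandas.read_csv, it can help to check whether it contains any \r that is not part of a Windows-style \r\n pair. The following is a minimal sketch using only the standard library. The helper name count_bare_cr is made up for illustration, and the sample bytes mirror the echo command above.

```python
import re

def count_bare_cr(data):
    """Count \\r bytes that are not immediately followed by \\n."""
    return len(re.findall(br"\r(?!\n)", data))

# Same bytes as written by the echo command in the test above
sample = b"12\n34\r56\n\r78\r\n"
print(count_bare_cr(sample))
```

For a real file, the bytes would come from open(path, 'rb').read(). A non-zero count suggests the file needs cleaning first.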
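In short, \n alone is enough for your own data. Strings that contain \r need to be cleaned up on Linux and macOS first, for example by using the split method of Python strings to remove it.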
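The cleanup could look something like the sketch below, which shows two options: removing \r outright with str.replace, or using the split approach, which collapses every run of whitespace (including \n) into a single space. Both function names are hypothetical, and which one fits depends on whether the line feeds in the text need to survive.

```python
def drop_carriage_returns(text):
    # Remove every \r and leave all other characters, including \n, untouched
    return text.replace("\r", "")

def collapse_whitespace(text):
    # split() with no arguments splits on any whitespace run (\r, \n, spaces, tabs)
    return " ".join(text.split())

raw = "\r\r first part\r\r second"
print(repr(drop_carriage_returns(raw)))
print(repr(collapse_whitespace(raw)))
```

Applying one of these to each text field before calling to_csv keeps bare \r characters out of the file.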
Original article by ItWorker. If you repost it, please credit the source: https://blog.ytso.com/9336.html