Reading a PDF with Python and exporting it to txt or HTML


Article link: https://www.cnblogs.com/tujia/p/16670374.html

 

1. Install pdfminer.six

pip install pdfminer.six

 

2. Read the PDF from code

from io import StringIO
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp


output_string = StringIO()

with open('test.pdf', 'rb') as fin:
    # Export as plain text
    # extract_text_to_fp(fin, output_string)
    # Export as HTML (codec=None so the converter writes str into the StringIO)
    extract_text_to_fp(fin, output_string, laparams=LAParams(), output_type='html', codec=None)


with open('test.html', 'w', encoding='utf-8') as f:
    f.write(output_string.getvalue().strip())

Official documentation:

https://pdfminersix.readthedocs.io/en/latest/tutorial/highlevel.html

https://pdfminersix.readthedocs.io/en/latest/reference/highlevel.html
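The script above can be wrapped so that the output format follows the output file name. This is only a minimal sketch: the names OUTPUT_TYPES, output_type_for and export_pdf are hypothetical helpers made up for illustration, not part of pdfminer.six, and it assumes the same extract_text_to_fp call shown above behaves the same way for both 'text' and 'html' when codec=None.

```python
from io import StringIO
from pathlib import Path

# Hypothetical mapping from output file suffix to a pdfminer.six output_type.
OUTPUT_TYPES = {".txt": "text", ".html": "html"}


def output_type_for(out_path):
    """Return the output_type implied by the suffix of out_path."""
    suffix = Path(out_path).suffix.lower()
    if suffix not in OUTPUT_TYPES:
        raise ValueError(f"unsupported output suffix: {suffix!r}")
    return OUTPUT_TYPES[suffix]


def export_pdf(pdf_path, out_path):
    """Extract pdf_path into out_path as text or HTML, chosen by suffix."""
    # Imported here so the suffix helper works without pdfminer.six installed.
    from pdfminer.high_level import extract_text_to_fp
    from pdfminer.layout import LAParams

    buffer = StringIO()
    with open(pdf_path, "rb") as fin:
        # codec=None makes the converter write str, which StringIO requires.
        extract_text_to_fp(fin, buffer, laparams=LAParams(),
                           output_type=output_type_for(out_path), codec=None)
    with open(out_path, "w", encoding="utf-8") as fout:
        fout.write(buffer.getvalue().strip())
```

With this in place, export_pdf('test.pdf', 'test.html') would write HTML and export_pdf('test.pdf', 'test.txt') would write plain text.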

 

3. Read the PDF with the command-line script

https://pdfminersix.readthedocs.io/en/latest/tutorial/commandline.html

https://pdfminersix.readthedocs.io/en/latest/reference/commandline.html

Details are omitted here; see the two links above.
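As a rough illustration of the command-line route, the sketch below calls the pdf2txt.py script that ships with pdfminer.six from Python. It assumes pdf2txt.py is on the PATH and can be executed directly (on Windows it may need to be run through the Python interpreter instead). The helper names pdf2txt_command and run_pdf2txt are hypothetical; -t sets the output type and -o sets the output file.

```python
import subprocess


def pdf2txt_command(pdf_path, out_path, output_type="text"):
    """Build the argument list for the pdf2txt.py script."""
    return ["pdf2txt.py", "-t", output_type, "-o", out_path, pdf_path]


def run_pdf2txt(pdf_path, out_path, output_type="text"):
    """Run pdf2txt.py; raises CalledProcessError if the script fails."""
    subprocess.run(pdf2txt_command(pdf_path, out_path, output_type), check=True)
```

For example, run_pdf2txt('test.pdf', 'test.html', 'html') corresponds to typing pdf2txt.py -t html -o test.html test.pdf in a shell.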

 



End.

 

Original article by ItWorker. If you repost it, please credit the source: https://blog.ytso.com/288308.html
