PDFBox: Easily Extract Content from PDF Files with Java

You may not know that there is a Java library for extracting the content of PDF files. This article introduces that library: Apache PDFBox.

What Is PDFBox?

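Apache PDFBox is an open-source Java library for working with PDF documents. It lets us:

1) create new PDF documents;
2) update existing documents, for example by adding styles or hyperlinks;
3) extract content from PDF documents.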

Reading Content from a PDF

Once we can extract the text content of a PDF, the problem is half solved. The code example below does exactly that.

Apache PDFBox's PDFTextStripper class (https://pdfbox.apache.org/docs … .html) extracts the text of a PDF with all formatting removed; it ignores formatting and special styling.

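// PDFBox 2.x API (in 3.x, documents are opened with Loader.loadPDF instead).
// Requires: java.io.File, org.apache.pdfbox.pdmodel.PDDocument,
// org.apache.pdfbox.text.PDFTextStripper
PDFTextStripper tStripper = new PDFTextStripper();
tStripper.setStartPage(1); // page numbers are 1-based here
tStripper.setEndPage(3);

StringBuilder content = new StringBuilder();
// try-with-resources closes the document when we are done
try (PDDocument document = PDDocument.load(new File("youpdfname.pdf"))) {
  if (!document.isEncrypted()) {
    String pdfFileInText = tStripper.getText(document);
    // split on blank lines, accepting both CRLF and LF line endings
    String[] lines = pdfFileInText.split("\\r?\\n\\r?\\n");
    for (String line : lines) {
      System.out.println(line);
      content.append(line);
    }
  }
}
System.out.println(content.toString().trim());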
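The splitting step in the sample above can be tried without a PDF at hand. The following is a minimal sketch using only the JDK; the class name ParagraphSplitter and the method splitParagraphs are hypothetical helpers made up for illustration and are not part of PDFBox. It assumes that the extracted text separates paragraphs with blank lines, which depends on the PDF and on the stripper settings.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative helper (not part of PDFBox): splits extracted text into paragraphs. */
final class ParagraphSplitter {

    private ParagraphSplitter() {}

    /** Splits text on blank lines, accepting both CRLF and LF line endings. */
    static List<String> splitParagraphs(String text) {
        List<String> paragraphs = new ArrayList<>();
        // A line break, optional whitespace, then another line break marks a blank line.
        for (String part : text.split("\\r?\\n\\s*\\r?\\n")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                paragraphs.add(trimmed);
            }
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Stand-in for the string returned by PDFTextStripper.getText(document).
        String sample = "First paragraph.\r\n\r\nSecond paragraph,\nstill second.\n\nThird.";
        for (String paragraph : splitParagraphs(sample)) {
            System.out.println("[" + paragraph + "]");
        }
    }
}
```

Accepting both line-ending styles matters because the text stripper emits the platform line separator by default, so a pattern that only matches CRLF would not split anything on Linux or macOS.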

Getting All Hyperlinks from a PDF File

The second task is to verify the hyperlinks in a PDF. The following code lists the hyperlinks found on one page of a PDF document.

The getAnnotations() method of the PDPage class (https://pdfbox.apache.org/docs … .html) returns the list of annotations on a page. To get the URIs from those annotations, use PDActionURI (https://pdfbox.apache.org/docs … .html).

 

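// PDFBox 2.x API. Requires: java.io.File, java.util.List,
// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.pdmodel.PDPage,
// org.apache.pdfbox.pdmodel.interactive.action.PDAction,
// org.apache.pdfbox.pdmodel.interactive.action.PDActionURI,
// org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation,
// org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink
StringBuilder urls = new StringBuilder();
try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    PDPage pdfpage = document.getPage(1); // zero-based index: this is the second page
    List<PDAnnotation> annotations = pdfpage.getAnnotations();
    for (int j = 0; j < annotations.size(); j++) {
        PDAnnotation annot = annotations.get(j);
        // only link annotations can carry a URI action
        if (annot instanceof PDAnnotationLink) {
            PDAnnotationLink link = (PDAnnotationLink) annot;
            PDAction action = link.getAction();
            if (action instanceof PDActionURI) {
                PDActionURI uri = (PDActionURI) action;
                urls.append(uri.getURI()).append('\n'); // collect one URL per line
                System.out.println(uri.getURI());
            }
        }
    }
}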
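Once the URI strings have been collected, a first verification step is to check that each one is well formed. The sketch below uses only java.net.URI from the JDK; the class LinkChecker and its method isWellFormedHttpUrl are hypothetical names chosen for illustration. It performs a purely syntactic check and does not contact the server, so a link that passes may still be dead.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Illustrative helper (not part of PDFBox): syntactic check for http(s) links. */
final class LinkChecker {

    private LinkChecker() {}

    /** Returns true if the string parses as an absolute http or https URI with a host. */
    static boolean isWellFormedHttpUrl(String candidate) {
        if (candidate == null) {
            return false;
        }
        try {
            URI uri = new URI(candidate.trim());
            String scheme = uri.getScheme();
            boolean httpLike = "http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme);
            return httpLike && uri.getHost() != null;
        } catch (URISyntaxException e) {
            // Illegal characters, missing authority, and similar problems end up here.
            return false;
        }
    }

    public static void main(String[] args) {
        // Stand-ins for the values returned by PDActionURI.getURI().
        String[] samples = {"https://pdfbox.apache.org/", "mailto:someone@example.com", "not a url"};
        for (String sample : samples) {
            System.out.println(sample + " -> " + isWellFormedHttpUrl(sample));
        }
    }
}
```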

Exporting Images from a PDF File

Besides text and hyperlinks, PDFBox can also extract images from a document.

The PDPage class (https://pdfbox.apache.org/docs … urces()) also provides a getResources() method, which returns the page's resources, such as its image objects.

 

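// PDFBox 2.x API. Requires: java.io.File, javax.imageio.ImageIO,
// org.apache.pdfbox.cos.COSName, org.apache.pdfbox.pdmodel.PDDocument,
// org.apache.pdfbox.pdmodel.PDPage, org.apache.pdfbox.pdmodel.PDResources,
// org.apache.pdfbox.pdmodel.graphics.PDXObject,
// org.apache.pdfbox.pdmodel.graphics.image.PDImageXObject
try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    PDPage pdfpage = document.getPage(1); // zero-based index: this is the second page
    int i = 1; // counter used to name the output files 1.png, 2.png, ...
    PDResources pdResources = pdfpage.getResources();
    for (COSName c : pdResources.getXObjectNames()) {
        PDXObject o = pdResources.getXObject(c);
        // XObjects can also be forms; only image XObjects are exported
        if (o instanceof PDImageXObject) {
            File file = new File(i + ".png");
            i++;
            ImageIO.write(((PDImageXObject) o).getImage(), "png", file);
        }
    }
}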
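The ImageIO.write call used above can be exercised on its own, which is handy for checking the output path and format before pointing the code at a real PDF. This is a minimal sketch using only the JDK; the class PngWriter and the method writePng are hypothetical names for illustration, and the small blank BufferedImage merely stands in for the image that getImage() would return.

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

/** Illustrative helper (not part of PDFBox): writes an image as PNG and checks the result. */
final class PngWriter {

    private PngWriter() {}

    /** Writes the image as a PNG file and fails loudly if no writer accepted it. */
    static void writePng(BufferedImage image, File target) throws IOException {
        // ImageIO.write returns false, without throwing, when no suitable writer is found.
        if (!ImageIO.write(image, "png", target)) {
            throw new IOException("No PNG writer available for " + target);
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the BufferedImage returned by PDImageXObject.getImage().
        BufferedImage image = new BufferedImage(4, 3, BufferedImage.TYPE_INT_RGB);
        File target = File.createTempFile("pdf-image-", ".png");
        writePng(image, target);
        BufferedImage readBack = ImageIO.read(target);
        System.out.println(readBack.getWidth() + "x" + readBack.getHeight());
        target.delete();
    }
}
```

Checking the boolean result is the design point here: ignoring it can leave an empty or missing file without any error being reported.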

Getting the Text of Hyperlinks in a PDF

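Getting text, hyperlinks, and images out of a PDF has turned out to be simple. Next, let's look at how to extract the words that each hyperlink covers.
As we saw in the second section, hyperlinks are stored as link annotations; for each link, the PDRectangle class gives the rectangle that can be used to crop the text region.
The following code prints the text of each hyperlink in the document together with its URI: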

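// PDFBox 2.x API. Requires: java.awt.geom.Rectangle2D, java.io.File, java.util.List,
// org.apache.pdfbox.pdmodel.PDDocument, org.apache.pdfbox.pdmodel.PDPage,
// org.apache.pdfbox.pdmodel.common.PDRectangle,
// org.apache.pdfbox.pdmodel.interactive.action.PDAction,
// org.apache.pdfbox.pdmodel.interactive.action.PDActionURI,
// org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotation,
// org.apache.pdfbox.pdmodel.interactive.annotation.PDAnnotationLink,
// org.apache.pdfbox.text.PDFTextStripperByArea
try (PDDocument document = PDDocument.load(new File("name.pdf"))) {
    int pageNum = 0;
    for (PDPage page : document.getPages()) {
        pageNum++;
        PDFTextStripperByArea stripper = new PDFTextStripperByArea();
        List<PDAnnotation> annotations = page.getAnnotations();
        // first set up the text extraction regions, one per link annotation
        for (int j = 0; j < annotations.size(); j++) {
            PDAnnotation annot = annotations.get(j);
            if (annot instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annot;
                PDRectangle rect = link.getRectangle();
                // need to reposition the link rectangle to match text space
                float x = rect.getLowerLeftX();
                float y = rect.getUpperRightY();
                float width = rect.getWidth();
                float height = rect.getHeight();
                int rotation = page.getRotation();
                if (rotation == 0) {
                    // flip y: the area stripper measures from the top of the page
                    PDRectangle pageSize = page.getMediaBox();
                    y = pageSize.getHeight() - y;
                } else if (rotation == 90) {
                    // rotated pages are not handled in this example
                }
                Rectangle2D.Float awtRect = new Rectangle2D.Float(x, y, width, height);
                stripper.addRegion("" + j, awtRect); // region name = annotation index
            }
        }
        stripper.extractRegions(page);
        // then pair the text found in each region with the link's URI
        for (int j = 0; j < annotations.size(); j++) {
            PDAnnotation annot = annotations.get(j);
            if (annot instanceof PDAnnotationLink) {
                PDAnnotationLink link = (PDAnnotationLink) annot;
                PDAction action = link.getAction();
                String urlText = stripper.getTextForRegion("" + j);
                if (action instanceof PDActionURI) {
                    PDActionURI uri = (PDActionURI) action;
                    System.out.println("Page " + pageNum + ":'" + urlText.trim() + "'=" + uri.getURI());
                }
            }
        }
    }
}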
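The coordinate adjustment in the code above is easy to get wrong, so here it is in isolation. PDF rectangles are measured from the bottom-left corner of the page, while the region passed to the area stripper is measured from the top-left, which is why y is replaced by the page height minus the rectangle's upper edge. The sketch below uses only the JDK; the class LinkRegionMath and the method toTextRegion are hypothetical names for illustration, and like the code above it only covers unrotated pages.

```java
import java.awt.geom.Rectangle2D;

/** Illustrative helper (not part of PDFBox): PDF-space to top-left-origin rectangle. */
final class LinkRegionMath {

    private LinkRegionMath() {}

    /**
     * Converts a rectangle given in PDF coordinates (origin bottom-left) into a
     * rectangle with a top-left origin, for a page with rotation 0.
     */
    static Rectangle2D.Float toTextRegion(float lowerLeftX, float upperRightY,
                                          float width, float height, float pageHeight) {
        // The top edge of the rectangle becomes its distance from the top of the page.
        float topDownY = pageHeight - upperRightY;
        return new Rectangle2D.Float(lowerLeftX, topDownY, width, height);
    }

    public static void main(String[] args) {
        // Assumed numbers: a page 792 points tall and a link whose top edge is at y = 720.
        Rectangle2D.Float region = toTextRegion(72f, 720f, 100f, 12f, 792f);
        System.out.println(region.x + ", " + region.y + ", " + region.width + ", " + region.height);
    }
}
```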

That's all.

Author: 大勇

Original article by ItWorker. If you repost it, please credit the source: https://blog.ytso.com/258079.html
