Working with PDF files in Python

PDF is one of the most widely used digital document formats. The name stands for “Portable Document Format,” and the files use the “.pdf” extension. The format is independent of application software, hardware, and operating system, so it can be used to present and exchange documents reliably.

PDF was invented by Adobe and is now an open standard maintained by the International Organization for Standardization (ISO). A PDF file can also contain links, buttons, form fields, audio, video, and business logic for better interaction with its readers.

In this tutorial, we will discuss how we can perform various operations:

How to extract text from PDF
How to rotate pages of the PDF
How to merge two PDFs together
How to split a PDF file
How to add watermarks to the PDF pages

We can perform all these operations by using a simple Python script.

Installation

To work with PDF files, we will use a third-party module, PyPDF2. PyPDF2 is not part of the Python standard library; it is a pure-Python PDF toolkit that has to be installed separately. This module can:

Extract document information such as the title, the author name, and more.
Split the pages of a document.
Crop the pages of a PDF document.
Merge multiple pages into a single page within a PDF document.
Encrypt and decrypt PDF files.

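To install PyPDF2, run the following command from the command line. The examples in this tutorial use the older PyPDF2 class and method names (PdfFileReader, PdfFileWriter, PdfFileMerger, getPage() and so on), which were removed in PyPDF2 3.0, so the command pins a version below 3:

pip3 install "PyPDF2<3"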

The module name is case-sensitive when it is imported, so make sure that the “y” is lowercase and all the other letters are uppercase.

Operations on PDF File using PyPDF2 Module

In this section, we will discuss various operations that we can perform on PDF files by using the PyPDF2 module in Python.

1. How to Extract Text from PDF Document File.

We can extract the text from a PDF file with the PyPDF2 module by using the following approach.

Approach:

To extract the text from a PDF file, we will follow these steps:

Step 1: We will open the PDF file named 'exp.pdf' in binary mode and save the file object as “pdf_File_Object”.

pdf_File_Object = open('exp.pdf', 'rb')

Step 2: We will create an object “pdf_Reader” of the “PdfFileReader” class of the “PyPDF2” module (imported here under the alias PDF) by passing it the PDF file object. This gives us an object for reading the PDF.

pdf_Reader = PDF.PdfFileReader(pdf_File_Object)

Step 3: To get the number of pages in the PDF document, we will use the numPages property of the reader object.

print("No. of pages in the given PDF file: ", pdf_Reader.numPages)

Step 4: We will create an object “page_Object” of the PageObject class of the “PyPDF2” module. The PDF reader object has the function “getPage()”, which takes the zero-based page number as an argument and returns the object of that page.

page_Object = pdf_Reader.getPage(0)

Step 5: We will use extractText(), a function of the page object, to extract the text from the PDF page.

print(page_Object.extractText())

Step 6: At last, we will close the PDF document file object.

pdf_File_Object.close()

Code:

import PyPDF2 as PDF

# Here, we will create a pdf file object
pdf_File_Object = open('exp.pdf', 'rb')

# Here, we will create a pdf reader object
pdf_Reader = PDF.PdfFileReader(pdf_File_Object)

# Now we will print the number of pages in the pdf file
print("No. of pages in the given PDF file: ", pdf_Reader.numPages)

# Here, we create a page object for the first page (pages are zero-indexed)
page_Object = pdf_Reader.getPage(0)

# Now, we will extract the text from the page
print(page_Object.extractText())

# At last, we close the pdf file object
pdf_File_Object.close()

Output:

No. of pages in the given PDF file:  10
 
GUIDELINES
*
 
 
FOR 
 
RE
-
OPENING OF CAMPUS 
 
IN VIEW OF COVID
-
19 PANDEMIC
 
(FOR 
STUDENTS
)
 
2021
-
22

The code prints the number of pages, followed by the text of the first page of the PDF file.
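As the output above shows, extractText() can return a line break after nearly every word or symbol. The following is a minimal sketch of one way to tidy such text, assuming that collapsing all whitespace into single spaces is acceptable for your use case. The helper name normalize_text is hypothetical and is not part of PyPDF2.

```python
def normalize_text(raw_text):
    """Collapse every run of whitespace (including newlines) into one space."""
    # str.split() with no arguments splits on any whitespace and drops empty pieces.
    return " ".join(raw_text.split())


# A fragment shaped like the extracted output shown above.
sample = "GUIDELINES\n*\n \n \nFOR \n \nRE\n-\nOPENING OF CAMPUS \n"
print(normalize_text(sample))
```

Words that the extractor broke at a hyphen (such as "RE", "-", "OPENING") stay separated by spaces; rejoining them would need extra, document-specific rules.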

2. How to Rotate PDF File Pages

We can rotate the pages of a PDF file using the PyPDF2 module in Python.

Approach:

To rotate the pages of a given PDF file, we will use the following steps:

Step 1: We will create a PDF reader object for the original PDF.

Step 2: We will write the rotated pages to a new PDF file. For writing into the PDF file, we will use an object of the PdfFileWriter class of the PyPDF2 module.

pdf_Writer = PDF.PdfFileWriter()

Step 3: We will iterate over each page of the original PDF document. We will get the page object with the getPage() function of the PDF reader class, and then rotate the page by using the rotateClockwise() function of the page object class.

for page in range(pdf_Reader.numPages):
    page_Object = pdf_Reader.getPage(page)
    page_Object.rotateClockwise(rotation_1)
    pdf_Writer.addPage(page_Object)

Step 4: We will add each page to the PDF writer object using the addPage() function of the PDF writer class by passing the rotated page object (the last line of the loop above).

Step 5: Then, we will write the PDF pages to the newly created PDF file. We can do this by opening the new file object and writing the PDF pages with the write() function of the PDF writer object.

new_File = open(new_File_Name, 'wb')
pdf_Writer.write(new_File)

Step 6: We will close the original PDF file object and the newly created file object.

pdf_File_Object.close()
new_File.close()

Code:

# First, we will import the module
import PyPDF2 as PDF

def PDF_rotate(original_File_Name, new_File_Name, rotation_1):

    # Then, we will create a pdf file object of the original pdf
    pdf_File_Object = open(original_File_Name, 'rb')

    # Then, we will create a pdf reader object
    pdf_Reader = PDF.PdfFileReader(pdf_File_Object)

    # Then we will create a pdf writer object for the new pdf
    pdf_Writer = PDF.PdfFileWriter()

    # Now, we will rotate each page of the PDF document
    for page in range(pdf_Reader.numPages):

        # Then, we will create the rotated page object
        page_Object = pdf_Reader.getPage(page)
        page_Object.rotateClockwise(rotation_1)

        # We will add the rotated page object to the pdf writer
        pdf_Writer.addPage(page_Object)

    # Now we will open a new pdf file object
    new_File = open(new_File_Name, 'wb')

    # We will write the rotated pages to the new file
    pdf_Writer.write(new_File)

    # At last, we will close the original pdf file object
    pdf_File_Object.close()

    # And now, we will close the new pdf file object
    new_File.close()

def main():

    # original pdf file name
    original_File_Name = 'exp.pdf'

    # new pdf file name
    new_File_Name = 'rotated_exp.pdf'

    # rotation angle in degrees, clockwise
    rotation_1 = 270

    # calling the PDF_rotate function
    PDF_rotate(original_File_Name, new_File_Name, rotation_1)

if __name__ == "__main__":
    # calling the main function
    main()

Output:

[Figure: the original file and the rotated file]

The script writes rotated_exp.pdf, in which every page of exp.pdf is rotated by 270 degrees clockwise.
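rotateClockwise() only accepts angles that are multiples of 90 degrees. The following is a small sketch of validating the angle before it is passed to PDF_rotate(), assuming we want to reject other values early. check_rotation is a hypothetical helper name, not a PyPDF2 function.

```python
def check_rotation(angle):
    """Return the angle reduced to 0, 90, 180 or 270, or raise ValueError."""
    if angle % 90 != 0:
        raise ValueError("rotation must be a multiple of 90 degrees, got %r" % angle)
    # A full turn leaves the page unchanged, so reduce the angle modulo 360.
    return angle % 360


print(check_rotation(270))   # 270
print(check_rotation(450))   # 90
print(check_rotation(-90))   # 270
```

A clockwise rotation of 270 degrees, as used in main() above, gives the same result as a counterclockwise rotation of 90 degrees.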

3. How to Merge two PDF Files.

We can merge two PDF files by using the PyPDF2 module in Python.

Approach:

For merging two PDF files in Python, we will be using the following steps:

Step 1: For merging two PDF files, we will use a pre-built class of the PyPDF2 module, PdfFileMerger. We will create an object called pdf_Merger of the PDF merger class:

pdf_Merger = PDF.PdfFileMerger()

Step 2: Then, we will append each PDF (given as a file path or as a file object) to the PDF merger object using the append() function.

for pdf_path in pdf:
    pdf_Merger.append(pdf_path)

Step 3: At last, we will write the PDF pages to the output PDF file by using the write() method of the PDF merger object.

with open(output_1, 'wb') as K:
    pdf_Merger.write(K)

Code:

# First, we will import the module
import PyPDF2 as PDF

def PDF_merge(pdf, output_1):
    # Here, we will create a pdf file merger object
    pdf_Merger = PDF.PdfFileMerger()

    # now, we will append the pdfs one by one
    # (the loop variable has its own name so that it does not shadow the list)
    for pdf_path in pdf:
        pdf_Merger.append(pdf_path)

    # then, we will write the combined pdf to the output pdf file
    with open(output_1, 'wb') as K:
        pdf_Merger.write(K)

    # release the input files held by the merger
    pdf_Merger.close()

def main():
    # here, we will select the pdf files to merge
    pdf = ['exp.pdf', 'rotated_exp.pdf']

    # Here, we will create the output pdf file name
    output_1 = 'combined_exp.pdf'

    # Now, we will call the pdf merge function
    PDF_merge(pdf = pdf, output_1 = output_1)

if __name__ == "__main__":
    # At last we will call the main function
    main()

Output:

The output of this code is a combined PDF named combined_exp.pdf, which is obtained by merging the exp.pdf and rotated_exp.pdf files.
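PDF_merge() opens each input only when it is appended, so a misspelled file name is reported only when that file is reached. The following is a small sketch of checking the inputs up front, assuming the inputs are given as file paths. missing_files is a hypothetical helper name and is not part of PyPDF2.

```python
import os


def missing_files(paths):
    """Return the paths that do not point to an existing file, in input order."""
    return [path for path in paths if not os.path.isfile(path)]


# The same input list as in main() above.
pdf = ['exp.pdf', 'rotated_exp.pdf']
missing = missing_files(pdf)
if missing:
    print("Cannot merge, missing input files:", ", ".join(missing))
else:
    print("All input files found.")
```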

4. How to Split PDF File

We can split a PDF document in Python using the PyPDF2 module according to our requirements.

This code does not need any new class or function; it uses simple logic and iteration. The PDF is split at the page positions given in the list splits_1 that we pass in. With splits_1 = [2, 4], the first file gets pages 0 and 1, the second file gets pages 2 and 3, and the third file gets all the remaining pages.

Code:

# First, we will import the module
import PyPDF2 as PDF

def PDF_split(pdf_1, splits_1):
    # here, we will create an input pdf file object
    pdf_File_Object = open(pdf_1, 'rb')

    # here, we will create a pdf reader object
    pdf_Reader = PDF.PdfFileReader(pdf_File_Object)

    # starting page index of the first split
    start = 0

    # ending page index (exclusive) of the first split
    end = splits_1[0]

    for g in range(len(splits_1) + 1):
        # we will create a pdf writer object for the (g + 1)th split
        pdf_Writer = PDF.PdfFileWriter()

        # output pdf file name
        output_pdf = pdf_1.split('.pdf')[0] + str(g) + '.pdf'

        # Now, we will add the pages of this split to the pdf writer object
        for page_1 in range(start, end):
            pdf_Writer.addPage(pdf_Reader.getPage(page_1))

        # Here, we will write the split pdf pages to the pdf file
        with open(output_pdf, "wb") as K:
            pdf_Writer.write(K)

        # The next split starts where this one ended
        start = end
        try:
            # then, we will set the end position for the next split
            end = splits_1[g + 1]
        except IndexError:
            # the last split runs to the end of the document
            end = pdf_Reader.numPages

    # Now, we will close the input pdf file object
    pdf_File_Object.close()

def main():
    # pdf file to split
    pdf_1 = 'exp.pdf'

    # split page positions
    splits_1 = [2, 4]

    # we will call the PDF_split function to split the pdf
    PDF_split(pdf_1, splits_1)

if __name__ == "__main__":
    # at last, we will call the main function
    main()

Output:

This code generates three new PDF files (exp0.pdf, exp1.pdf and exp2.pdf), which are the split parts of the main PDF. They are written to the current working directory.
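The start and end bookkeeping inside PDF_split() can be easier to follow when it is separated from the file handling. The following is a minimal sketch, assuming the split positions are sorted and lie within the document. split_ranges and split_output_name are hypothetical helper names, and os.path.splitext is used here as an alternative to the str.split('.pdf') call in the code above.

```python
import os


def split_ranges(num_pages, split_points):
    """Return (start, end) page ranges; end is exclusive, pages are zero-based."""
    boundaries = [0] + list(split_points) + [num_pages]
    return [(boundaries[i], boundaries[i + 1]) for i in range(len(boundaries) - 1)]


def split_output_name(pdf_path, index):
    """Build the name of the index-th part, for example exp.pdf -> exp0.pdf."""
    base, extension = os.path.splitext(pdf_path)
    return base + str(index) + extension


# A 10-page document split at positions 2 and 4, as in main() above.
for index, (start, end) in enumerate(split_ranges(10, [2, 4])):
    print(split_output_name("exp.pdf", index), "gets pages", start, "to", end - 1)
```

Each (start, end) pair can be passed directly to range(start, end) in the page-copying loop.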

5. How to Add Watermark to PDF Pages.

We can add a watermark to the pages of a PDF document using the PyPDF2 module in Python.

Approach:

Here we follow the same steps as in the page rotation example. The only difference is:

wm_page_Object = add_watermark(user_watermark, pdf_Reader.getPage(page_1))

The page object will be converted into the watermark page object by using the add_watermark() function.

To understand what the add_watermark() function does, we can look at its definition:

def add_watermark(wm_File, page_Object):
    wm_File_Object = open(wm_File, 'rb')
    pdf_Reader = PDF.PdfFileReader(wm_File_Object)
    page_Object.mergePage(pdf_Reader.getPage(0))
    wm_File_Object.close()
    return page_Object

Here, we first create a PDF reader object for the water_mark.pdf file. On the page object that was passed in, we call the mergePage() function and pass it the first page of the water_mark PDF reader object. This overlays the watermark PDF on the passed page object.

Code:

# First, we will import the module
import PyPDF2 as PDF

def add_watermark(wm_File, page_Object):
    # here, we will open the watermark pdf file
    wm_File_Object = open(wm_File, 'rb')

    # Now, we will create a pdf reader object of the watermark pdf file
    pdf_Reader = PDF.PdfFileReader(wm_File_Object)

    # then, we will merge the watermark pdf's first page (index 0) with the passed page object
    page_Object.mergePage(pdf_Reader.getPage(0))

    # Here, we will close the watermark pdf file object
    wm_File_Object.close()

    # we will return the watermarked page object
    return page_Object

def main():
    # watermark pdf file name
    user_watermark = 'water_mark.pdf'

    # original pdf file name
    original_File_Name = 'exp.pdf'

    # new pdf file name
    new_File_Name = 'watermarked_exp.pdf'

    # now, we will create a pdf file object of the original pdf
    pdf_File_Object = open(original_File_Name, 'rb')

    # here, we will create a pdf reader object
    pdf_Reader = PDF.PdfFileReader(pdf_File_Object)

    # create a pdf writer object for the new pdf
    pdf_Writer = PDF.PdfFileWriter()

    # add the watermark to each page
    for page_1 in range(pdf_Reader.numPages):
        # Now, we will create the watermarked page object
        wm_page_Object = add_watermark(user_watermark, pdf_Reader.getPage(page_1))

        # then, we will add the watermarked page object to the pdf writer
        pdf_Writer.addPage(wm_page_Object)

    # new pdf file object
    new_File = open(new_File_Name, 'wb')

    # we will then write the watermarked pages to the new file
    pdf_Writer.write(new_File)

    # close the original pdf file object
    pdf_File_Object.close()
    # close the new pdf file object
    new_File.close()

if __name__ == "__main__":
    # call the main function
    main()

Output:

[Figure: the water_mark.pdf file and the resulting watermarked file]

The above code generates a watermarked_exp.pdf file, in which every page of exp.pdf carries the watermark from the water_mark.pdf file.
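The code above reopens water_mark.pdf for every page of the input. The following is a hedged sketch of an alternative that takes an already loaded watermark page and reuses it. watermark_pages is a hypothetical helper; it relies only on the numPages, getPage(), mergePage() and addPage() names used earlier in this tutorial, and it assumes the caller keeps both PDF files open until the writer has written the output, which is a cautious choice rather than a documented requirement.

```python
def watermark_pages(pdf_reader, watermark_page, pdf_writer):
    """Overlay watermark_page on every page of pdf_reader and add it to pdf_writer."""
    for page_number in range(pdf_reader.numPages):
        page_object = pdf_reader.getPage(page_number)
        # mergePage() draws the watermark on top of the existing page content.
        page_object.mergePage(watermark_page)
        pdf_writer.addPage(page_object)
    return pdf_writer
```

With the objects from main() above, it could be called as watermark_pages(pdf_Reader, PDF.PdfFileReader(wm_File_Object).getPage(0), pdf_Writer) just before pdf_Writer.write(new_File), where wm_File_Object is the opened watermark file.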

Conclusion

In this tutorial, we have discussed how to perform different operations on PDF files (extracting text, rotating pages, merging, splitting, and watermarking) using Python and the PyPDF2 module.

Original article by ItWorker. If you reproduce it, please credit the source: https://blog.ytso.com/263345.html
