Resize PDF Documents Using Python: A Comprehensive Guide

Resize and Compress PDF Documents Using Python

Learn how to resize and compress PDF documents using Python. This guide covers simple methods with pdf2image, Pillow, and more advanced tools like pikepdf and Ghostscript.

Table of Contents

Introduction

When it comes to managing PDFs, one common task is resizing or compressing these documents to save storage space, improve load times, or meet size requirements for uploads. Fortunately, Python provides powerful libraries that allow us to automate this process efficiently. In this blog post, we’ll cover how to resize and compress PDF documents using Python. We’ll dive into a solution that works for most PDF documents, explaining both the concepts and the tools used.

Why Compress or Resize PDF Documents?

Before jumping into the code, let’s quickly explore why you might want to compress or resize a PDF file:

  • Storage Space: PDFs, especially those with images or scans, can quickly become large files that take up significant storage.
  • Uploading and Sharing: Many websites or email services impose file size limits. Compressing your PDFs ensures they are easier to upload and share.
  • Performance: Smaller PDF files load and process more quickly, which is especially important for web-based PDF viewers or when sending multiple files.

Tools We’ll Use

In this post, we’ll use the following Python libraries:

  • pdf2image: This library is used to convert PDF pages into images, making it easy to manipulate the content.
  • Pillow: The Python Imaging Library (PIL), now known as Pillow, allows us to manipulate images, including resizing, compressing, and saving them in different formats.
  • pikepdf: While we’ll focus on pdf2image and Pillow for compression, pikepdf is useful for directly manipulating PDFs, such as removing unnecessary objects or optimizing streams.

Resize and Compress PDF Documents Using Python: Step-by-Step Guide

Let’s walk through how you can resize and compress PDFs using Python. In this guide, we’ll focus on compressing PDFs by converting each page into an image, resizing the images, and then recombining the images into a PDF.

1. Install Required Libraries

First, you need to install the required Python3 libraries. Open your terminal or command prompt and run:

				
					sudo dnf install -y python3-pip && pip3 install pdf2image Pillow
				
			
				
					Defaulting to user installation because normal site-packages is not writeable
Collecting pdf2image
  Downloading pdf2image-1.17.0-py3-none-any.whl (11 kB)
Collecting Pillow
  Downloading pillow-11.1.0-cp39-cp39-manylinux_2_28_x86_64.whl (4.5 MB)
     |████████████████████████████████| 4.5 MB 13.0 MB/s            
Installing collected packages: Pillow, pdf2image
Successfully installed Pillow-11.1.0 pdf2image-1.17.0
				
			

You’ll also need to install Poppler for pdf2image to work. On Ubuntu, you can do this with:

				
					sudo apt-get install -y poppler-utils
				
			

On Red Hat Enterprise Linux (RHEL)/CentOS or RPM-based distributions, you can do this with:

				
					sudo dnf install -y poppler-utils
				
			
Resize and Compress PDF Documents Using Python

Photo by admingeek from Infotechys

For macOS, use:

				
					brew install poppler
				
			

2. Convert PDF to Images and Compress

We’ll use pdf2image to convert each PDF page into an image, then compress it using Pillow. The following script shows how to perform these tasks (using your preferred text editor, open a file called compress_pdf.py). 

				
					sudo vim compress_pdf.py
				
			

Copy and paste the following content. Then, save and exit the file:

				
					import sys
import io
from pdf2image import convert_from_path
from PIL import Image

def compress_pdf(input_pdf, output_pdf, image_quality=50):
    try:
        # Convert PDF pages to images
        images = convert_from_path(input_pdf)

        # Compress each image
        compressed_images = []
        for img in images:
            # Compress the image by reducing its quality
            img = img.convert("RGB")
            compressed_image_io = io.BytesIO()
            img.save(compressed_image_io, format="JPEG", quality=image_quality)
            compressed_image = Image.open(compressed_image_io)
            compressed_images.append(compressed_image)

        # Save the compressed images as a PDF
        compressed_images[0].save(output_pdf, save_all=True, append_images=compressed_images[1:])

        print(f"Compressed PDF saved as {output_pdf}")
    
    except Exception as e:
        print(f"An error occurred: {e}")

if __name__ == "__main__":
    if len(sys.argv) != 3:
        print("Usage: python compress_pdf.py <input_pdf> <output_pdf>")
    else:
        input_pdf = sys.argv[1]
        output_pdf = sys.argv[2]
        compress_pdf(input_pdf, output_pdf)
				
			

3. How the Script Works

Here’s a breakdown of the script:

Convert PDF to Images

We use pdf2image.convert_from_path(input_pdf)to convert each page of the PDF into a Pillow image object.

Compress the Images

We iterate through the list of images and use the Pillow library to save each image as a JPEG with reduced quality. You can adjust the image_quality (from 0 to 100) to control the degree of compression.

Recombine the Images into a PDF

After compressing the images, we save them as a new PDF using Pillow's save() function, where save_all=True ensures all pages are included.


4. Running the Script

To run the script, use the following command:

				
					python3 compress_pdf.py ~/path/to/original_doc.pdf ~/path/to/compressed_doc.pdf
				
			

This will take the original_doc.pdf file, compress the images within it, and save the output as compressed_doc.pdf.

5. Customizing Compression

The image_quality parameter in the script controls how much compression is applied to the images. The value can range from 0 (lowest quality) to 100 (highest quality). A lower value will result in a smaller file size but poorer image quality. For example, to apply more aggressive compression, use:

				
					compress_pdf(input_pdf, output_pdf, image_quality=30)
				
			

6. Why This Method Works

The image_quality parameter in the script controls how much compression is applied to the images. The value can range from 0 (lowest quality) to 100 (highest quality). A lower value will result in a smaller file size but poorer image quality. For example, to apply more aggressive compression, use:

Handles All Types of PDFs

By converting each PDF page to an image, we avoid the complexity of dealing with different types of content (text, vector images, etc.). This method works for any PDF document, including those with scanned pages or complex layouts.

Customizable Compression

You can fine-tune the compression level based on your needs. Lower quality will result in smaller file sizes but might degrade the quality of the document.

Simple to Use

The script is easy to run from the command line, making it a convenient solution for automating PDF compression tasks.


Advanced PDF Compression Techniques

While converting a PDF to images and recombining them is a simple and effective method for compressing PDFs, there are other advanced techniques to reduce PDF size:

1. Remove Unused Objects with pikepdf

If your PDF contains unnecessary metadata, unused fonts, or orphaned objects, you can use the pikepdf library to clean up the PDF structure. Here’s an example that removes unreferenced objects from the PDF:

				
					import pikepdf

def optimize_pdf(input_pdf, output_pdf):
    with pikepdf.open(input_pdf) as pdf:
        pdf.remove_unreferenced()
        pdf.save(output_pdf)
    print(f"Optimized PDF saved as {output_pdf}")

# Example usage
optimize_pdf("input.pdf", "optimized_output.pdf")
				
			

2. Using Ghostscript for Advanced Compression

For even more advanced compression options, Ghostscript is a popular tool that can be used to compress PDF files by adjusting image resolution and compression settings. You can run it via the command line:

				
					gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.4 -dPDFSETTINGS=/screen -dNOPAUSE -dQUIET -dBATCH -sOutputFile=output_compressed.pdf input_file.pdf
				
			

-dPDFSETTINGS=/screen: This setting applies compression with a lower image resolution for smaller file sizes.

3. Use QPDF to Optimize Streams

QPDF is another powerful tool for optimizing PDFs. It allows you to compress streams within the PDF:

				
					qpdf input_file.pdf --compress-streams=y --output output_compressed.pdf
				
			

This command reduces the file size by compressing streams within the PDF.


Conclusion

In this post, we’ve shown how to resize and compress PDF documents using Python. By converting PDF pages to images, compressing those images, and then saving them back into a PDF, we can significantly reduce file sizes. Additionally, we’ve covered more advanced techniques like using pikepdf, Ghostscript, and QPDF to further optimize PDF files.

Did you find this article useful? Your feedback is invaluable to us! Please feel free to share this post–along with your thoughts in the comments section below.


Related Posts

Leave a Reply

Your email address will not be published. Required fields are marked *