Read PDF using Selenium Java

In this article, we will see how to read a PDF file using Selenium with Java.

Organizations frequently generate various types of PDF reports, such as mobile bills, electricity bills, financial reports, and revenue reports. Quality Assurance (QA) teams are then tasked with verifying the information contained in these reports. Typically, this process involves manually downloading the reports and reading the data they contain. To automate this process, the test framework must be capable of automatically downloading PDF reports and extracting the data without any human intervention.

Reading a PDF document in Selenium using Java requires some additional libraries because Selenium itself does not provide direct support for reading PDFs. The most commonly used library for reading PDFs in Java is Apache PDFBox.

1. Add the dependencies

Add the Selenium, Commons IO, and PDFBox dependencies to the project's pom.xml. The latest versions of these dependencies are listed on the Maven Repository site: https://mvnrepository.com/.

<dependency>
    <groupId>org.seleniumhq.selenium</groupId>
    <artifactId>selenium-java</artifactId>
    <version>4.24.0</version>
</dependency>

<dependency>
    <groupId>commons-io</groupId>
    <artifactId>commons-io</artifactId>
    <version>2.16.1</version>
</dependency>

<dependency>
    <groupId>org.apache.pdfbox</groupId>
    <artifactId>pdfbox</artifactId>
    <version>3.0.3</version>
</dependency>

2. Download the PDF Document

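Use Selenium WebDriver to navigate to the page that hosts the PDF and download the file to a chosen folder. The plugins.always_open_pdf_externally preference makes Chrome download the PDF instead of opening it in its built-in viewer, and download.default_directory sets the target folder.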

String downloadFilepath = System.getProperty("user.dir") + File.separator + "chrome_downloads";

// Configure Chrome to save PDFs to disk in the chosen folder
ChromeOptions options = new ChromeOptions();
Map<String, Object> prefs = new HashMap<>();
prefs.put("plugins.always_open_pdf_externally", true);
prefs.put("download.default_directory", downloadFilepath);
options.setExperimentalOption("prefs", prefs);

WebDriver driver = new ChromeDriver(options);
driver.manage().window().maximize();
driver.get("https://freetestdata.com/document-files/pdf/");

// Locate and click the download link or button
WebElement downloadLink = driver.findElement(By.xpath("//*[@class='elementor-button-text']"));
downloadLink.click();

// Wait up to 30 seconds for the file to appear.
// wait.until throws a TimeoutException if it never does, so no separate existence check is needed.
File downloadedFile = new File(downloadFilepath, "Free_Test_Data_100KB_PDF.pdf");
WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(30));
wait.until((ExpectedCondition<Boolean>) wd -> downloadedFile.exists());

System.out.println("File is downloaded!");

To know more about the PDF download, please refer to this tutorial – Download PDF in Chrome with Selenium Java
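The exists() check in the snippet above has two gaps worth knowing about: a copy of the file left over from an earlier run satisfies it immediately, and it says nothing about whether Chrome has finished writing. The following is a minimal sketch of a stricter wait using only java.nio. It assumes that Chrome keeps in-progress downloads in files ending with .crdownload. The class and method names (DownloadWaiter, prepare, waitForDownload) are illustrative and are not part of Selenium.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.time.Duration;

// Illustrative helper; the names here are hypothetical, not a Selenium API.
public class DownloadWaiter {

    // Creates the download directory if needed and deletes a stale copy of the
    // expected file, so a leftover from an earlier run cannot satisfy the wait.
    public static void prepare(Path dir, String fileName) {
        try {
            Files.createDirectories(dir);
            Files.deleteIfExists(dir.resolve(fileName));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Polls until the file exists, is non-empty, and no ".crdownload"
    // temporary file (assumed Chrome behaviour) remains in the directory.
    public static Path waitForDownload(Path dir, String fileName, Duration timeout) {
        Path target = dir.resolve(fileName);
        long start = System.nanoTime();
        do {
            if (isComplete(dir, target)) {
                return target;
            }
            try {
                Thread.sleep(200);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("Interrupted while waiting for " + target, e);
            }
        } while (System.nanoTime() - start < timeout.toNanos());
        throw new IllegalStateException("Download did not finish within " + timeout + ": " + target);
    }

    private static boolean isComplete(Path dir, Path target) {
        try {
            if (!Files.isRegularFile(target) || Files.size(target) == 0) {
                return false;
            }
            // Any remaining partial-download file means Chrome is still writing
            try (DirectoryStream<Path> partial = Files.newDirectoryStream(dir, "*.crdownload")) {
                return !partial.iterator().hasNext();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In the snippet above, prepare(...) would be called before driver.get(...), and waitForDownload(Paths.get(downloadFilepath), "Free_Test_Data_100KB_PDF.pdf", Duration.ofSeconds(30)) would take the place of the wait.until(...) call.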

3. Read the PDF Content

We use Apache PDFBox to read the downloaded PDF file and extract its text.

Step 1 – Load PDF Document

// Loader.loadPDF (PDFBox 3.x) throws an IOException if the file is missing or is not a valid PDF
File file = new File("Path of Document");
PDDocument doc = Loader.loadPDF(file);

Step 2 – Retrieve the text

The PDFTextStripper class retrieves text from a PDF document. We can instantiate it as follows:

PDFTextStripper pdfStripper = new PDFTextStripper();

The getText() method reads the text content of the PDF document. It takes the document object as a parameter and returns the text as a String.

String text = pdfStripper.getText(doc);
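Once the text is available as a String, verifying a report is ordinary string work. The sketch below shows one possible approach: collapse the whitespace that text extraction tends to introduce, and pull out the value that follows a label such as "Total Amount:". The sample text, the labels, and the PdfTextChecks class are hypothetical and stand in for whatever a real report contains.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative helper for checking extracted PDF text; not part of PDFBox.
public class PdfTextChecks {

    // Collapses runs of spaces, tabs, and line breaks into single spaces so
    // that line wrapping in the extracted text does not break phrase matching.
    public static String normalize(String text) {
        return text.replaceAll("\\s+", " ").trim();
    }

    // Returns the text that follows "label:" on the same line, if the label is present.
    public static Optional<String> valueAfterLabel(String text, String label) {
        Pattern pattern = Pattern.compile(
                "(?m)^\\s*" + Pattern.quote(label) + "\\s*:\\s*(.+?)\\s*$");
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical text, standing in for the result of pdfStripper.getText(doc)
        String text = "Electricity Bill\nInvoice Number: INV-001\nTotal Amount: 250.00\n";

        String amount = valueAfterLabel(text, "Total Amount")
                .orElseThrow(() -> new AssertionError("Total Amount not found in PDF text"));
        if (!amount.equals("250.00")) {
            throw new AssertionError("Unexpected amount: " + amount);
        }
        if (!normalize(text).contains("Electricity Bill")) {
            throw new AssertionError("Report title missing");
        }
        System.out.println("PDF content verified");
    }
}
```

In a test framework, the AssertionError checks would normally be replaced by the framework's own assertions (for example, JUnit or TestNG).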

The complete program can be seen below:

import org.apache.pdfbox.Loader;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.openqa.selenium.By;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.WebElement;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.support.ui.ExpectedCondition;
import org.openqa.selenium.support.ui.WebDriverWait;

import java.io.File;
import java.io.IOException;
import java.time.Duration;
import java.util.HashMap;
import java.util.Map;

public class ReadPDF_Chrome_Demo {

    public static void main(String[] args) {

        String downloadFilepath = System.getProperty("user.dir") + File.separator + "chrome_downloads";
        File downloadedFile = new File(downloadFilepath, "Free_Test_Data_100KB_PDF.pdf");

        // Configure Chrome to save PDFs to disk in the chosen folder
        ChromeOptions options = new ChromeOptions();
        Map<String, Object> prefs = new HashMap<>();
        prefs.put("plugins.always_open_pdf_externally", true);
        prefs.put("download.default_directory", downloadFilepath);
        options.setExperimentalOption("prefs", prefs);

        WebDriver driver = new ChromeDriver(options);
        try {
            driver.manage().window().maximize();
            driver.get("https://freetestdata.com/document-files/pdf/");

            // Locate the download link/button and click it
            WebElement downloadLink = driver.findElement(By.xpath("//*[@class='elementor-button-text']"));
            downloadLink.click();

            // Wait up to 30 seconds for the file to appear; a TimeoutException is thrown otherwise
            WebDriverWait wait = new WebDriverWait(driver, Duration.ofSeconds(30));
            wait.until((ExpectedCondition<Boolean>) wd -> downloadedFile.exists());
            System.out.println("File is downloaded!");
        } finally {
            // Close the browser even if the download fails
            driver.quit();
        }

        // Read the downloaded PDF using PDFBox; try-with-resources closes the document
        try (PDDocument document = Loader.loadPDF(downloadedFile)) {
            PDFTextStripper pdfStripper = new PDFTextStripper();
            String text = pdfStripper.getText(document);

            // Print the PDF text content
            System.out.println("Text in PDF: ");
            System.out.println(text);
        } catch (IOException e) {
            System.err.println("An error occurred while loading or reading the PDF file: " + e.getMessage());
            e.printStackTrace();
        }

    }

}

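When the program runs successfully, it prints "File is downloaded!", then "Text in PDF: " followed by the text extracted from the PDF.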

Summary

1. Set up WebDriver: Configure the browser to handle automatic downloads.
2. Trigger Download: Navigate to the webpage and trigger the download.
3. Wait for Completion: Implement a waiting mechanism to ensure the download completes.
4. Verify Content: Use a library like Apache PDFBox to read the content of the downloaded PDF.

That’s it! Congratulations on making it through this tutorial and hope you found it useful! Happy Learning!!
