Automating PDF Link Testing Across Multiple Sites Using GitLab CI and Playwright

Safique Ahmed Faruque

July 4, 2025

As a developer or QA engineer, you know the frustration of discovering broken PDF links after a site update — especially when those links span multiple websites. I remember spending hours manually checking PDFs across five different sites, copying and pasting URLs into a browser, hoping I wouldn’t miss any. Each time a site updated, the tedious routine repeated itself, eating into time better spent building new features.

Frustrated by this inefficiency, I decided there had to be a better way. What if I could automate the whole process, running tests that checked every PDF link across all sites — reliably, consistently, and with zero manual effort?

In this article, I’ll walk you through how I built an automated PDF link testing solution using Python’s pytest, Playwright for browser automation, and GitLab CI/CD for continuous integration. To make it scalable and easy to maintain, I placed all the testing logic in a separate GitLab project and ran the tests on self-hosted runners, which do not count against the shared-runner CI minute quota. This setup not only saved me hours of repetitive work but also gave my team quick feedback on link health, with no more surprises in production.

If you’re juggling multiple sites or just want a practical example of integrating automated tests in your CI pipelines, this guide is for you.

Tools and Technology Stack

Pytest: Python’s robust testing framework to write flexible tests.
Playwright: Browser automation library for simulating user actions and checking PDFs.
GitLab CI/CD: For running tests automatically in pipelines.
Self-hosted GitLab Runners: Dedicated machines under my control to execute CI jobs without usage limits.
Text files with URLs: Easy-to-edit files storing site-specific PDF URLs.

The Challenge

I manage 5+ sites, each with a large list of PDF URLs. I needed a way to:

Run link tests against any site dynamically without manual file copying.
Keep the test code and config separate from each site’s repository for easier maintenance.
Avoid limitations or costs associated with shared GitLab CI minutes.
Provide a simple way for team members to run tests without complex setup.

Solution Overview

1. Centralized Test Repository

I created a dedicated GitLab project containing all test code, URL lists, and CI configurations. This acts as a single source of truth for testing across sites.

2. Dynamic URL Loading

Using a custom pytest command-line option (--urlfile) and the pytest_generate_tests hook, the tests load URLs from a file passed as a parameter, so one test script can run against any site by simply changing the URL list.

3. GitLab CI Jobs per Site

The .gitlab-ci.yml defines one job per site, each passing that site’s URL file to pytest through the --urlfile option.

4. Self-Hosted Runners

To avoid consuming limited GitLab shared runner minutes, I installed self-hosted GitLab runners that execute these jobs on my own servers, giving me full control over the environment and no CI minute quota to worry about.

Implementation Details

Parametrizing URLs in Pytest

In conftest.py, I added the following:

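import os

import pytest


def pytest_addoption(parser):
    # Register --urlfile so each CI job can point the same tests at a different URL list.
    parser.addoption(
        "--urlfile",
        action="store",
        default="test_urls.txt",
        help="Path to the file containing URLs to test",
    )


def pytest_generate_tests(metafunc):
    # Only parametrize tests that take a "url" argument.
    if "url" in metafunc.fixturenames:
        urlfile = metafunc.config.getoption("urlfile")
        if not os.path.exists(urlfile):
            pytest.fail(f"URL file '{urlfile}' does not exist")

        # One URL per line; blank lines are ignored.
        with open(urlfile, encoding="utf-8") as f:
            urls = [line.strip() for line in f if line.strip()]

        if not urls:
            pytest.skip(f"No URLs found in {urlfile}")

        # Generate one test case per URL so each link passes or fails on its own.
        metafunc.parametrize("url", urls)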
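
As URL lists grow, the file-reading step can be pulled into a small helper that is easy to test on its own. The sketch below is one possible variant, not the code from my project: the name load_urls is hypothetical, and treating lines that start with "#" as comments and dropping duplicate URLs are assumptions that go beyond the plain one-URL-per-line format described above.

```python
from pathlib import Path


def load_urls(path):
    """Return the unique URLs in a text file, one per line, in file order."""
    urls = []
    seen = set()
    for raw in Path(path).read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Assumption: blank lines and lines starting with "#" are skipped.
        if not line or line.startswith("#"):
            continue
        # Assumption: a URL listed twice only needs to be tested once.
        if line not in seen:
            seen.add(line)
            urls.append(line)
    return urls
```

With a helper like this, the hook would only need to call load_urls(urlfile) and pass the result to metafunc.parametrize.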

Example Test Function

def test_pdf_links(url):
    # Playwright logic to verify PDF links would go here
    assert url.startswith("https://"), f"Invalid URL: {url}"

GitLab CI Configuration

stages:
  - test

test_siteA:
  tags:
    - self-hosted
  stage: test
  script:
    - pip install -r requirements.txt
    - python -m playwright install chromium
    - pytest -s tests/test_pdf_links.py --urlfile=test_siteA_urls.txt --junitxml=report_siteA.xml
  artifacts:
    reports:
      junit: report_siteA.xml
    paths:
      - report_siteA.xml

test_siteB:
  tags:
    - self-hosted
  stage: test
  script:
    - pip install -r requirements.txt
    - python -m playwright install chromium
    - pytest -s tests/test_pdf_links.py --urlfile=test_siteB_urls.txt --junitxml=report_siteB.xml
  artifacts:
    reports:
      junit: report_siteB.xml
    paths:
      - report_siteB.xml
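
GitLab shows the JUnit reports in the pipeline view, but the same files can also be read by a script, for example to post a short summary somewhere else. The following is a rough sketch using only the standard library. It assumes the usual pytest JUnit layout, where each testcase element carries a failure or error child when the test did not pass. The function name failed_cases is hypothetical.

```python
import xml.etree.ElementTree as ET


def failed_cases(junit_xml):
    """Return (test name, message) pairs for failed or errored test cases."""
    root = ET.fromstring(junit_xml)
    failures = []
    for case in root.iter("testcase"):
        # A testcase counts as broken if it has a <failure> or <error> child.
        for kind in ("failure", "error"):
            node = case.find(kind)
            if node is not None:
                failures.append((case.get("name"), node.get("message", "")))
                break
    return failures
```

To use it on a real report, pass the file contents, for example failed_cases(Path("report_siteA.xml").read_bytes()) with Path imported from pathlib. Because each URL is its own parametrized test, the test name already includes the link that broke.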

Using Self-Hosted Runners

I registered a runner on my own infrastructure with the tag self-hosted. This allows GitLab to assign these test jobs to my machines, bypassing shared runner limits and enabling a stable, configurable environment.

Benefits of This Approach

Centralized management: One repo holds all test logic and URLs, which simplifies updates and makes adding new sites easy.
Flexible and scalable: New sites require only a URL file and a CI job.
Cost control: Self-hosted runners eliminate CI minute costs and give full control.
Accessible to team: Anyone with repo access can trigger tests without setup hassle.
Reliable and repeatable: Tests run the same way on every pipeline, in an environment I control rather than on someone’s laptop.
