PDF Skill

Relevant source files

skills/pdf/scripts/extract_form_structure.py

This page covers the technical implementation of the PDF skill located at skills/pdf/. The PDF skill's primary programmatic capability is extracting form structure from non-fillable PDFs — identifying text labels, horizontal rules, checkboxes, and row boundaries so Claude can reason about field layout for downstream filling operations. For a comparison of the PDF skill with the other three document skills (DOCX, XLSX, PPTX) and their shared architectural patterns, see Document Skills.

Purpose and Scope

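Non-fillable PDFs (scanned forms, printed-and-digitized documents) do not carry embedded field metadata. The PDF skill bridges this gap by analyzing the visual geometry of each page and emitting a structured JSON description of the form's layout. Claude then uses this JSON to determine where to place field values, checkmarks, or other content. Note that pdfplumber reads the text and vector objects stored in the PDF and does not perform OCR, so a scan that is only a page image, with no text layer or drawn lines, will produce little or no structure.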

The sole Python utility in this skill is scripts/extract_form_structure.py. It has no LibreOffice dependency — unlike the DOCX and XLSX skills — and relies entirely on the pdfplumber library.

Skill Invocation Flow

Diagram: PDF Skill — Natural Language to Code Entities

Sources: skills/pdf/scripts/extract_form_structure.py, lines 1-115

`extract_form_structure.py`

Entry point: main() at skills/pdf/scripts/extract_form_structure.py, lines 91-115

Core function: extract_form_structure(pdf_path) at skills/pdf/scripts/extract_form_structure.py, lines 20-88

The script accepts two positional arguments: the input PDF path and the output JSON path.

python extract_form_structure.py <input.pdf> <output.json>
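As a rough illustration of the two-argument interface, the sketch below validates the argument count and exits with a non-zero code when it is wrong. The helper name parse_cli, the usage string, and the exit code 1 are assumptions for illustration; the page does not show how main() handles bad input.

```python
import sys

# Hypothetical usage text; the real script's wording is not shown on this page.
USAGE = "Usage: python extract_form_structure.py <input.pdf> <output.json>"


def parse_cli(argv):
    """Return (pdf_path, output_path), or exit with code 1 on a bad argument count."""
    if len(argv) != 3:  # script name + two positional arguments
        print(USAGE, file=sys.stderr)
        raise SystemExit(1)
    return argv[1], argv[2]


pdf_path, output_path = parse_cli(
    ["extract_form_structure.py", "form.pdf", "output.json"]
)
print(pdf_path, output_path)
```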

extract_form_structure() opens the PDF with pdfplumber.open(), iterates over every page, and populates a single structure dictionary with five top-level keys: pages, labels, lines, checkboxes, and row_boundaries.
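A minimal sketch of that container, assuming each key starts as an empty list that the per-page loop appends to. The helper name empty_structure and the sample page values (1-based page number, US Letter dimensions) are illustrative assumptions, not taken from the script.

```python
import json


def empty_structure():
    """Build the five-key container; each list is filled while pages are scanned."""
    return {
        "pages": [],
        "labels": [],
        "lines": [],
        "checkboxes": [],
        "row_boundaries": [],
    }


structure = empty_structure()
# Illustrative page entry: numbering from 1 and 612 x 792 points are assumptions.
structure["pages"].append({"page_number": 1, "width": 612.0, "height": 792.0})
print(json.dumps(structure, indent=2))
```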

Element Detection Heuristics

Each element type is extracted with a specific filter. The table below summarizes the source of each element in pdfplumber's page API and the filtering condition applied.

| Element Type | pdfplumber Source | Filter Condition | Output Key |
| --- | --- | --- | --- |
| Page metadata | `page.width`, `page.height` | None (all pages included) | `pages` |
| Text labels | `page.extract_words()` | None (all words included) | `labels` |
| Horizontal lines | `page.lines` | `abs(x1 - x0) > page.width * 0.5` | `lines` |
| Checkboxes | `page.rects` | `5 ≤ width ≤ 15`, `5 ≤ height ≤ 15`, `abs(width - height) < 2` | `checkboxes` |
| Row boundaries | Derived from `lines` | Consecutive distinct y-coordinates per page | `row_boundaries` |

Text Labels

skills/pdf/scripts/extract_form_structure.py, lines 37-46

Every word returned by page.extract_words() is recorded without filtering. Each label entry carries its bounding box (x0, top, x1, bottom) and page number. Coordinates are rounded to one decimal place.
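The mapping from a word to a label entry can be sketched as below. To keep the example runnable without a PDF, it takes plain dictionaries shaped like the word dictionaries pdfplumber returns (text, x0, top, x1, bottom); the helper name words_to_labels and the sample word are assumptions.

```python
def words_to_labels(words, page_number):
    """Convert word dictionaries into label entries, rounding to one decimal."""
    labels = []
    for word in words:
        labels.append({
            "page": page_number,
            "text": word["text"],
            "x0": round(float(word["x0"]), 1),
            "top": round(float(word["top"]), 1),
            "x1": round(float(word["x1"]), 1),
            "bottom": round(float(word["bottom"]), 1),
        })
    return labels


# Stand-in for the result of page.extract_words() on a real page.
sample_words = [
    {"text": "Name:", "x0": 72.04, "top": 100.26, "x1": 101.38, "bottom": 111.12},
]
labels = words_to_labels(sample_words, page_number=1)
print(labels)
```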

Horizontal Lines

skills/pdf/scripts/extract_form_structure.py, lines 48-55

A line is included only if its horizontal span, abs(x1 - x0), exceeds 50% of the page width. This drops short decorative strokes, short borders, and table cell dividers, and it also drops vertical lines, whose horizontal span is close to zero. What remains is mostly the wide rules that separate form sections or define fill-in rows. Each entry stores the y position (taken from line["top"]) together with x0 and x1; only y is used later for row detection.
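The width filter can be sketched as follows, again on plain dictionaries that stand in for entries of page.lines. The helper name filter_wide_lines, the 612-point page width, and the sample lines are assumptions for illustration.

```python
def filter_wide_lines(lines, page_width, page_number):
    """Keep lines whose horizontal span exceeds half the page width."""
    kept = []
    for line in lines:
        span = abs(float(line["x1"]) - float(line["x0"]))
        if span > page_width * 0.5:
            kept.append({
                "page": page_number,
                "y": round(float(line["top"]), 1),
                "x0": round(float(line["x0"]), 1),
                "x1": round(float(line["x1"]), 1),
            })
    return kept


sample_lines = [
    {"x0": 72.0, "x1": 540.0, "top": 200.0},   # span 468 > 306: kept
    {"x0": 72.0, "x1": 150.0, "top": 220.0},   # span 78: dropped
    {"x0": 300.0, "x1": 300.0, "top": 100.0},  # vertical, span 0: dropped
]
wide = filter_wide_lines(sample_lines, page_width=612.0, page_number=1)
print(wide)
```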

Checkboxes

skills/pdf/scripts/extract_form_structure.py, lines 57-69

Rectangles from page.rects are classified as checkboxes when they are:

- Between 5 and 15 PDF units wide
- Between 5 and 15 PDF units tall
- Nearly square: abs(width - height) < 2

This excludes large boxes (table cells, section borders) and very small artifacts. Checkbox entries include center_x and center_y — the midpoint of the bounding rectangle — to simplify click-targeting or fill-position calculations downstream.
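The three size conditions and the midpoint calculation can be sketched like this. The helper names is_checkbox and checkbox_entry and the sample rectangles are assumptions; the thresholds are the ones listed above, and the input dictionaries stand in for entries of page.rects.

```python
def is_checkbox(rect):
    """Apply the size and squareness thresholds to one rectangle."""
    width = float(rect["x1"]) - float(rect["x0"])
    height = float(rect["bottom"]) - float(rect["top"])
    return 5 <= width <= 15 and 5 <= height <= 15 and abs(width - height) < 2


def checkbox_entry(rect, page_number):
    """Build the output entry, including the center of the bounding box."""
    x0, x1 = float(rect["x0"]), float(rect["x1"])
    top, bottom = float(rect["top"]), float(rect["bottom"])
    return {
        "page": page_number,
        "x0": round(x0, 1),
        "top": round(top, 1),
        "x1": round(x1, 1),
        "bottom": round(bottom, 1),
        "center_x": round((x0 + x1) / 2, 1),
        "center_y": round((top + bottom) / 2, 1),
    }


small_square = {"x0": 100.0, "x1": 110.0, "top": 300.0, "bottom": 310.0}
table_cell = {"x0": 50.0, "x1": 250.0, "top": 400.0, "bottom": 420.0}
not_square = {"x0": 10.0, "x1": 22.0, "top": 10.0, "bottom": 18.0}

for rect in (small_square, table_cell, not_square):
    print(is_checkbox(rect))
print(checkbox_entry(small_square, page_number=1))
```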

Row Boundaries

skills/pdf/scripts/extract_form_structure.py, lines 71-86

Row boundaries are derived in a post-processing step after all pages are scanned. The y-coordinates of the retained horizontal lines are grouped by page, sorted, and deduplicated; each consecutive pair (y[i], y[i+1]) then defines one row. A page with n distinct line positions therefore yields n - 1 entries, and a page with fewer than two yields none. Each entry carries row_top, row_bottom, and row_height.
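The grouping, deduplication, and pairing steps can be sketched as below. The helper name derive_row_boundaries and the sample line entries are assumptions; the input mirrors the shape of the lines list in the output JSON.

```python
def derive_row_boundaries(lines):
    """Pair consecutive distinct y positions on each page into row entries."""
    ys_by_page = {}
    for line in lines:
        ys_by_page.setdefault(line["page"], []).append(line["y"])

    rows = []
    for page, ys in ys_by_page.items():
        ys = sorted(set(ys))  # sort and drop duplicate positions
        for top, bottom in zip(ys, ys[1:]):
            rows.append({
                "page": page,
                "row_top": top,
                "row_bottom": bottom,
                "row_height": round(bottom - top, 1),
            })
    return rows


sample_lines = [
    {"page": 1, "y": 100.0, "x0": 72.0, "x1": 540.0},
    {"page": 1, "y": 124.5, "x0": 72.0, "x1": 540.0},
    {"page": 1, "y": 100.0, "x0": 72.0, "x1": 540.0},  # duplicate position
    {"page": 1, "y": 150.0, "x0": 72.0, "x1": 540.0},
    {"page": 2, "y": 80.0, "x0": 72.0, "x1": 540.0},   # single line: no row
]
rows = derive_row_boundaries(sample_lines)
for row in rows:
    print(row)
```

In this sample, page 1 has three distinct positions and so produces two rows, while page 2 has only one line and produces none.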

Diagram: Row Boundary Derivation from lines

Sources: skills/pdf/scripts/extract_form_structure.py, lines 71-86

Output JSON Schema

The script writes a single JSON object to the output file. The full structure:

{
  "pages": [
    { "page_number": int, "width": float, "height": float }
  ],
  "labels": [
    { "page": int, "text": str, "x0": float, "top": float, "x1": float, "bottom": float }
  ],
  "lines": [
    { "page": int, "y": float, "x0": float, "x1": float }
  ],
  "checkboxes": [
    { "page": int, "x0": float, "top": float, "x1": float, "bottom": float,
      "center_x": float, "center_y": float }
  ],
  "row_boundaries": [
    { "page": int, "row_top": float, "row_bottom": float, "row_height": float }
  ]
}

All floating-point coordinates are rounded to one decimal place at emission time using round(..., 1).
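One way a downstream step might consume this schema is to assign labels to rows by comparing each label's vertical midpoint with the row interval. This is only an illustrative sketch: the helper name labels_in_row, the midpoint rule, and the sample data are assumptions, and the page does not describe how Claude actually matches labels to rows.

```python
def labels_in_row(structure, row):
    """Return label texts whose vertical midpoint falls inside the given row."""
    found = []
    for label in structure["labels"]:
        if label["page"] != row["page"]:
            continue
        midpoint = (label["top"] + label["bottom"]) / 2
        if row["row_top"] <= midpoint < row["row_bottom"]:
            found.append(label["text"])
    return found


# Hand-written sample in the documented schema (values are made up).
structure = {
    "pages": [{"page_number": 1, "width": 612.0, "height": 792.0}],
    "labels": [
        {"page": 1, "text": "Name:", "x0": 72.0, "top": 104.0, "x1": 101.4, "bottom": 114.0},
        {"page": 1, "text": "Date:", "x0": 72.0, "top": 130.0, "x1": 98.0, "bottom": 140.0},
    ],
    "lines": [],
    "checkboxes": [],
    "row_boundaries": [
        {"page": 1, "row_top": 100.0, "row_bottom": 124.5, "row_height": 24.5},
        {"page": 1, "row_top": 124.5, "row_bottom": 150.0, "row_height": 25.5},
    ],
}

for row in structure["row_boundaries"]:
    print(row["row_top"], labels_in_row(structure, row))
```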

Diagram: structure Dictionary — Fields to Source

Sources: skills/pdf/scripts/extract_form_structure.py, lines 20-88

Console Output

After writing the JSON, main() prints a summary to stdout:

Extracting structure from form.pdf...
Found:
  - 2 pages
  - 341 text labels
  - 18 horizontal lines
  - 12 checkboxes
  - 17 row boundaries
Saved to output.json

skills/pdf/scripts/extract_form_structure.py, lines 105-111
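The summary format shown above can be reproduced from the list lengths in the structure dictionary. The sketch below builds the text in one helper for convenience; the name format_summary is an assumption, and the real main() may print these lines at different points (for example, the first line before extraction starts).

```python
def format_summary(pdf_path, structure, output_path):
    """Render the console summary from the lengths of the five lists."""
    return "\n".join([
        f"Extracting structure from {pdf_path}...",
        "Found:",
        f"  - {len(structure['pages'])} pages",
        f"  - {len(structure['labels'])} text labels",
        f"  - {len(structure['lines'])} horizontal lines",
        f"  - {len(structure['checkboxes'])} checkboxes",
        f"  - {len(structure['row_boundaries'])} row boundaries",
        f"Saved to {output_path}",
    ])


# Placeholder entries; only the list lengths matter for the summary.
sample = {
    "pages": [{}, {}],
    "labels": [{}, {}, {}],
    "lines": [{}, {}],
    "checkboxes": [{}],
    "row_boundaries": [{}],
}
summary = format_summary("form.pdf", sample, "output.json")
print(summary)
```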

Dependencies

| Library | Role |
| --- | --- |
| `pdfplumber` | PDF parsing: word extraction, line/rect geometry |
| `json` | Serializing output structure |
| `sys` | CLI argument handling and exit codes |

pdfplumber is the only third-party dependency. There is no LibreOffice, Pillow, or openpyxl involvement in this skill — see Document Skills for where those libraries appear in the other skills.
