This page covers the technical implementation of the PDF skill located at skills/pdf/. The PDF skill's primary programmatic capability is extracting form structure from non-fillable PDFs — identifying text labels, horizontal rules, checkboxes, and row boundaries so Claude can reason about field layout for downstream filling operations. For a comparison of the PDF skill with the other three document skills (DOCX, XLSX, PPTX) and their shared architectural patterns, see Document Skills.
Non-fillable PDFs (scanned forms, printed-and-digitized documents) do not carry embedded field metadata. The PDF skill bridges this gap by analyzing the visual geometry of the PDF and emitting a structured JSON description of the form's layout. Claude then uses this JSON to determine where to place field values, checkmarks, or other content.
The sole Python utility in this skill is scripts/extract_form_structure.py. It has no LibreOffice dependency — unlike the DOCX and XLSX skills — and relies entirely on the pdfplumber library.
Diagram: PDF Skill — Natural Language to Code Entities
Sources: skills/pdf/scripts/extract_form_structure.py:1-115
extract_form_structure.py
Entry point: main() at skills/pdf/scripts/extract_form_structure.py:91-115
Core function: extract_form_structure(pdf_path) at skills/pdf/scripts/extract_form_structure.py:20-88
The script accepts two positional arguments: the input PDF path and the output JSON path.
python extract_form_structure.py <input.pdf> <output.json>
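As a minimal sketch of the two-argument command line described above, the helper below validates `sys.argv`-style input. The name `parse_args` and the exact usage message are assumptions for illustration; the real `main()` may handle its arguments differently.

```python
def parse_args(argv):
    """Return (pdf_path, json_path) from an argv-style list.

    Hypothetical helper: exits with a usage message unless exactly two
    positional arguments follow the script name.
    """
    if len(argv) != 3:
        raise SystemExit(
            "Usage: python extract_form_structure.py <input.pdf> <output.json>"
        )
    return argv[1], argv[2]


# Example call with an explicit argument list instead of sys.argv.
pdf_path, json_path = parse_args(
    ["extract_form_structure.py", "form.pdf", "output.json"]
)
print(pdf_path, json_path)
```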
extract_form_structure() opens the PDF with pdfplumber.open(), iterates over every page, and populates a single structure dictionary with five top-level keys.
Each element type is extracted with a specific filter. The table below summarizes the source of each element in pdfplumber's page API and the filtering condition applied.
| Element Type | pdfplumber Source | Filter Condition | Output Key |
|---|---|---|---|
| Page metadata | page.width, page.height | None — all pages included | pages |
| Text labels | page.extract_words() | None — all words included | labels |
| Horizontal lines | page.lines | abs(x1 - x0) > page.width * 0.5 | lines |
| Checkboxes | page.rects | 5 ≤ width ≤ 15, 5 ≤ height ≤ 15, abs(width - height) < 2 | checkboxes |
| Row boundaries | Derived from lines | Consecutive distinct y-coordinates per page | row_boundaries |
Sources: skills/pdf/scripts/extract_form_structure.py:37-46
Every word returned by page.extract_words() is recorded without filtering. Each label entry carries its bounding box (x0, top, x1, bottom) and page number. Coordinates are rounded to one decimal place.
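The label step can be sketched as a pure function over the word dictionaries that `page.extract_words()` returns (each has `text`, `x0`, `top`, `x1`, and `bottom` keys). The function name `words_to_labels` is hypothetical and this is not the script's actual code, only an illustration of the unfiltered, rounded output described above.

```python
def words_to_labels(words, page_number):
    """Convert pdfplumber-style word dicts into label entries.

    No filtering is applied; coordinates are rounded to one decimal place.
    """
    return [
        {
            "page": page_number,
            "text": w["text"],
            "x0": round(w["x0"], 1),
            "top": round(w["top"], 1),
            "x1": round(w["x1"], 1),
            "bottom": round(w["bottom"], 1),
        }
        for w in words
    ]


# Sample input shaped like one entry from page.extract_words().
sample_words = [
    {"text": "Name:", "x0": 72.04, "top": 100.26, "x1": 101.96, "bottom": 110.94}
]
labels = words_to_labels(sample_words, page_number=1)
print(labels)
```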
Sources: skills/pdf/scripts/extract_form_structure.py:48-55
A line is included only if its horizontal span exceeds 50% of the page width. This filters out short decorative strokes, borders, and table cell dividers, retaining only the full-width rules that typically separate form sections or define fill-in rows. Only the y position (from line["top"]), x0, and x1 are stored — vertical position is the only meaningful coordinate for row detection.
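A minimal sketch of this width filter, assuming line objects shaped like the entries of `page.lines` (dictionaries with `x0`, `x1`, and `top`). The function name `filter_horizontal_lines` is an assumption for illustration, not the script's own.

```python
def filter_horizontal_lines(lines, page_width, page_number):
    """Keep lines whose horizontal span exceeds half the page width."""
    kept = []
    for line in lines:
        # Strict comparison: a span of exactly 50% is not kept.
        if abs(line["x1"] - line["x0"]) > page_width * 0.5:
            kept.append(
                {
                    "page": page_number,
                    "y": round(line["top"], 1),
                    "x0": round(line["x0"], 1),
                    "x1": round(line["x1"], 1),
                }
            )
    return kept


# On a 612-wide page, only the first line spans more than 306 units.
sample_lines = [
    {"x0": 72.0, "x1": 540.0, "top": 200.04},  # span 468: kept
    {"x0": 72.0, "x1": 150.0, "top": 300.0},   # span 78: dropped
]
print(filter_horizontal_lines(sample_lines, page_width=612.0, page_number=1))
```

Because the test looks only at the x-extent, a purely vertical line (where `x0` equals `x1`) has a span of zero and is dropped automatically.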
Sources: skills/pdf/scripts/extract_form_structure.py:57-69
Rectangles from page.rects are classified as checkboxes when they are:
- small: 5 ≤ width ≤ 15 and 5 ≤ height ≤ 15
- approximately square: abs(width - height) < 2

This excludes large boxes (table cells, section borders) and very small artifacts. Checkbox entries include center_x and center_y, the midpoint of the bounding rectangle, to simplify click-targeting or fill-position calculations downstream.
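The checkbox test can be sketched as follows, assuming rectangle dictionaries shaped like the entries of `page.rects` (with `x0`, `x1`, `top`, and `bottom`). The name `find_checkboxes` is hypothetical, and the thresholds are the ones listed in the table above.

```python
def find_checkboxes(rects, page_number):
    """Return small, roughly square rectangles as checkbox entries."""
    boxes = []
    for r in rects:
        width = r["x1"] - r["x0"]
        height = r["bottom"] - r["top"]
        is_small = 5 <= width <= 15 and 5 <= height <= 15
        is_square = abs(width - height) < 2
        if is_small and is_square:
            boxes.append(
                {
                    "page": page_number,
                    "x0": round(r["x0"], 1),
                    "top": round(r["top"], 1),
                    "x1": round(r["x1"], 1),
                    "bottom": round(r["bottom"], 1),
                    # Midpoint of the bounding rectangle.
                    "center_x": round((r["x0"] + r["x1"]) / 2, 1),
                    "center_y": round((r["top"] + r["bottom"]) / 2, 1),
                }
            )
    return boxes


sample_rects = [
    {"x0": 100.0, "x1": 110.0, "top": 200.0, "bottom": 210.0},  # 10 x 10: kept
    {"x0": 50.0, "x1": 250.0, "top": 300.0, "bottom": 350.0},   # table cell: dropped
    {"x0": 10.0, "x1": 20.0, "top": 10.0, "bottom": 24.0},      # 10 x 14: not square
]
print(find_checkboxes(sample_rects, page_number=1))
```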
Sources: skills/pdf/scripts/extract_form_structure.py:71-86
Row boundaries are derived in a post-processing step after all pages are scanned. Horizontal line y-coordinates are grouped by page, sorted, deduplicated, and then each consecutive pair (y[i], y[i+1]) defines one row boundary. The output includes row_top, row_bottom, and row_height for each interval.
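The group, sort, deduplicate, and pair steps can be sketched like this. It operates on entries shaped like the `lines` output (`page` and `y` keys). The function name `derive_row_boundaries` and the use of `defaultdict` are assumptions, not necessarily how the script implements the step.

```python
from collections import defaultdict


def derive_row_boundaries(lines):
    """Pair consecutive distinct line y-coordinates on each page."""
    ys_by_page = defaultdict(set)  # a set deduplicates repeated y values
    for line in lines:
        ys_by_page[line["page"]].add(line["y"])

    boundaries = []
    for page in sorted(ys_by_page):
        ys = sorted(ys_by_page[page])
        # zip(ys, ys[1:]) yields (y[i], y[i+1]) for each consecutive pair.
        for top, bottom in zip(ys, ys[1:]):
            boundaries.append(
                {
                    "page": page,
                    "row_top": top,
                    "row_bottom": bottom,
                    "row_height": round(bottom - top, 1),
                }
            )
    return boundaries


sample = [
    {"page": 1, "y": 120.0},
    {"page": 1, "y": 100.0},
    {"page": 1, "y": 120.0},  # duplicate y, collapsed by the set
    {"page": 1, "y": 150.5},
    {"page": 2, "y": 80.0},   # a single line yields no boundary
]
print(derive_row_boundaries(sample))
```

Under this scheme a page with n distinct line positions produces n - 1 row boundaries, and a page with only one line produces none.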
Diagram: Row Boundary Derivation from lines
Sources: skills/pdf/scripts/extract_form_structure.py:71-86
The script writes a single JSON object to the output file. The full structure:
{
  "pages": [
    { "page_number": int, "width": float, "height": float }
  ],
  "labels": [
    { "page": int, "text": str, "x0": float, "top": float, "x1": float, "bottom": float }
  ],
  "lines": [
    { "page": int, "y": float, "x0": float, "x1": float }
  ],
  "checkboxes": [
    { "page": int, "x0": float, "top": float, "x1": float, "bottom": float,
      "center_x": float, "center_y": float }
  ],
  "row_boundaries": [
    { "page": int, "row_top": float, "row_bottom": float, "row_height": float }
  ]
}
All floating-point coordinates are rounded to one decimal place at emission time using round(..., 1).
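To illustrate how a consumer might use this structure, the sketch below finds the labels that fall inside one row boundary. This is not part of the skill: `labels_in_row` is a hypothetical helper, and testing the label's vertical midpoint against the row is one possible design choice among several.

```python
def labels_in_row(structure, row):
    """Return labels on the row's page whose vertical midpoint lies in the row.

    Results are sorted left to right by x0.
    """
    found = []
    for label in structure["labels"]:
        mid_y = (label["top"] + label["bottom"]) / 2
        same_page = label["page"] == row["page"]
        if same_page and row["row_top"] <= mid_y < row["row_bottom"]:
            found.append(label)
    return sorted(found, key=lambda item: item["x0"])


# A small hand-written structure in the documented shape.
structure = {
    "labels": [
        {"page": 1, "text": "Date", "x0": 300.0, "top": 104.0, "x1": 325.0, "bottom": 114.0},
        {"page": 1, "text": "Name", "x0": 72.0, "top": 104.0, "x1": 100.0, "bottom": 114.0},
        {"page": 1, "text": "Address", "x0": 72.0, "top": 124.0, "x1": 115.0, "bottom": 134.0},
    ],
    "row_boundaries": [
        {"page": 1, "row_top": 100.0, "row_bottom": 120.0, "row_height": 20.0},
    ],
}
row = structure["row_boundaries"][0]
print([label["text"] for label in labels_in_row(structure, row)])
```

In practice the structure would come from `json.load()` on the script's output file rather than from a literal.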
Diagram: structure Dictionary — Fields to Source
Sources: skills/pdf/scripts/extract_form_structure.py:20-88
After writing the JSON, main() prints a summary to stdout:
Extracting structure from form.pdf...
Found:
- 2 pages
- 341 text labels
- 18 horizontal lines
- 12 checkboxes
- 17 row boundaries
Saved to output.json
Sources: skills/pdf/scripts/extract_form_structure.py:105-111
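A summary in this shape could be built from the lengths of the five top-level lists. The sketch below is an assumption about the approach, not the script's actual code, and `format_summary` is a hypothetical name. It also gathers every line into one string, whereas the script may print the first line before extraction begins.

```python
def format_summary(structure, pdf_path, json_path):
    """Build the stdout summary from the counts in the structure dict."""
    return "\n".join(
        [
            f"Extracting structure from {pdf_path}...",
            "Found:",
            f"- {len(structure['pages'])} pages",
            f"- {len(structure['labels'])} text labels",
            f"- {len(structure['lines'])} horizontal lines",
            f"- {len(structure['checkboxes'])} checkboxes",
            f"- {len(structure['row_boundaries'])} row boundaries",
            f"Saved to {json_path}",
        ]
    )


# Placeholder lists stand in for real entries; only the lengths matter here.
demo = {
    "pages": [{}] * 2,
    "labels": [{}] * 3,
    "lines": [{}] * 4,
    "checkboxes": [{}] * 1,
    "row_boundaries": [{}] * 3,
}
print(format_summary(demo, "form.pdf", "output.json"))
```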
| Library | Role |
|---|---|
| pdfplumber | PDF parsing: word extraction, line/rect geometry |
| json | Serializing output structure |
| sys | CLI argument handling and exit codes |
pdfplumber is the only third-party dependency. There is no LibreOffice, Pillow, or openpyxl involvement in this skill — see Document Skills for where those libraries appear in the other skills.