By PDFKits Team — Published February 19, 2026
TL;DR. A PDF you think is "just text" can contain the author's name, the original Word file path, deleted-but-recoverable text, tracked changes from earlier versions, GPS coordinates of embedded photos, and the editing software's revision history. For most public-facing documents — published reports, court filings, RFP responses — this leakage is usually harmless. For confidential documents — settlement agreements, internal strategy decks, redacted court exhibits — it is a problem. Sanitizing a PDF means stripping the metadata and reflattening the content so only what is visible on screen remains. Under both GDPR (Art. 5) and the U.S. HIPAA Privacy Rule, metadata that identifies a person is treated like the rest of the record, so leaked metadata can trigger a reportable incident.
A PDF is a structured document with multiple layers beyond what you see when you open it. The visible content stream — text, images, vector graphics — is one part. Around it sits a cross-reference table that indexes every object, a metadata stream (XMP) that records authoring software and timestamps, document-info fields (author, title, subject, keywords) carried forward from the source document, optional embedded files, JavaScript, form data, annotations, and bookmarks. Most of this is invisible during normal reading but trivially extractable with a free PDF inspector or even a text editor.
The riskiest categories of hidden data, in order of typical exposure:
Document-info fields — Author name, organization, original file name, often the local file path. A Word document exported to PDF by "j.smith@bigcorp.com" carries that identity forward unless explicitly stripped.
XMP metadata — Software version, edit timestamps, creator tool. Useful for forensics, embarrassing in some contexts (a press release whose XMP shows it was edited in Adobe Acrobat by an unidentified contractor 12 minutes before publication).
Embedded image EXIF — JPGs embedded in a PDF can retain their EXIF data, including GPS coordinates from phone-taken photos, depending on how the exporting tool handles images. A "photo of a contract page" can leak the lawyer's home address.
Recoverable redactions — Black rectangles drawn as annotations leave the underlying text intact. A "redacted" PDF where the text under the rectangles is still selectable is the most common high-profile leak.
Form data — Filled-in form fields can persist as stored field values (the /V entry of each AcroForm field) even when the visible field shows empty. Flattening forms removes this risk.
Track changes / revision marks — In some workflows, edit history from Word survives the export. The final PDF can be expanded to show every deletion and insertion.
A solo immigration attorney filed a motion to seal further filings, attaching an exhibit that was a phone photo of a client letter. The photo's EXIF data recorded the GPS coordinates of the client's home, and anyone who downloaded the public filing could extract them. The seal was rendered meaningless. Stripping EXIF before filing would have prevented the exposure.
A management consulting firm submitted a competitive strategy deck to a Fortune 500 client. The PDF's document-info field carried the originating analyst's name and the firm's internal project code. A leaked copy let competitors identify the engagement and the analyst. The deck was technically anonymous in the visible content; the metadata broke the anonymity.
A reporter received documents from a confidential source, converted them to PDF for archiving, and shared one with a colleague. The PDF retained the original Word file's "last edited by" field — the source's username. The exposure required no decryption, no forensics — just opening the document properties.
A federal agency released a FOIA-responsive PDF with black rectangles over names and addresses. The rectangles were annotations, not content removal — the underlying text was selectable and pasted into a journalist's notes within hours. The agency had to issue a correction and notify affected individuals.
A publicly-traded company published its quarterly earnings PDF with embedded illustrations. The illustrations contained named layers ("Q3 launch — Pegasus") visible in the PDF object inspector. The unreleased codename leaked to financial press before the company's planned announcement.
An HR manager exported an offer letter to PDF from Word with Track Changes left on. The visible PDF showed a clean offer; the underlying revision marks documented an earlier higher salary that the company had reduced before sending. Discovered during litigation, the revision history became Exhibit A.
In Adobe Acrobat or Reader: File → Properties → Description tab (most other viewers expose the same fields in a document-properties dialog). This shows the document-info fields: Author, Subject, Keywords, Application (the software that created the PDF). If these are filled with anything non-public, sanitize before sharing.
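The same fields can be checked without a viewer by scanning the raw bytes. The following is a minimal sketch rather than a PDF parser. The name `find_info_fields` is made up for illustration, and the code assumes the info dictionary is stored uncompressed, unencrypted, and as plain literal strings. Values kept in object streams, hex strings, or UTF-16 will be missed or garbled, so an empty result does not prove the file is clean.

```python
import re

# Keys defined for the PDF document information dictionary.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")

# Matches "/Key (literal string)". Backslash escapes are allowed inside the
# string; unescaped nested parentheses are not handled by this sketch.
_INFO_RE = re.compile(
    rb"/(" + "|".join(INFO_KEYS).encode("ascii") + rb")\s*\(((?:\\.|[^\\()])*)\)"
)


def find_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return document-info values found as literal strings in the raw bytes.

    If a key occurs more than once (for example after an incremental save),
    the last occurrence wins.
    """
    found = {}
    for key, value in _INFO_RE.findall(pdf_bytes):
        # Undo the escapes most likely to appear in names: \( \) and \\
        text = re.sub(rb"\\([()\\])", rb"\1", value)
        found[key.decode("ascii")] = text.decode("latin-1")
    return found
```

Reading a file with `Path("report.pdf").read_bytes()` (the file name is hypothetical) and passing the bytes to `find_info_fields` returns a dictionary of whatever was found. Note that /Title also appears in bookmark entries, so a Title hit may come from the outline rather than the info dictionary.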
Use any PDF inspector (pdftk, PDFKits Clean Metadata, or even strings on Linux). Look for <xmpMM:DerivedFrom>, <xmp:CreatorTool>, <xmpMM:DocumentID>. These often reveal the local file system path and software version.
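What a `strings`-style search finds can be narrowed to the XMP packet itself. The sketch below assumes the metadata stream is stored unfiltered (common, but not guaranteed) and pulls out the fields named above. The function names are illustrative, not part of any tool mentioned here.

```python
import re

# Fields with simple text values that are worth reviewing before release.
SIMPLE_FIELDS = ("xmp:CreatorTool", "xmpMM:DocumentID")

# An XMP packet is wrapped in <?xpacket begin=...?> ... <?xpacket end=...?>.
_PACKET_RE = re.compile(rb"<\?xpacket begin=.*?<\?xpacket end=[^>]*\?>", re.DOTALL)


def extract_xmp_packets(pdf_bytes: bytes) -> list[str]:
    """Return every uncompressed XMP packet found in the raw file bytes."""
    return [m.group(0).decode("utf-8", errors="replace")
            for m in _PACKET_RE.finditer(pdf_bytes)]


def xmp_values(packet: str) -> dict[str, list[str]]:
    """Collect values of SIMPLE_FIELDS, written as elements or as attributes."""
    found = {}
    for name in SIMPLE_FIELDS:
        tag = re.escape(name)
        as_element = re.findall(rf"<{tag}>([^<]*)</{tag}>", packet)
        as_attribute = re.findall(rf'\b{tag}="([^"]*)"', packet)
        if as_element or as_attribute:
            found[name] = as_element + as_attribute
    return found


def has_derived_from(packet: str) -> bool:
    """xmpMM:DerivedFrom is a structure, so only its presence is reported."""
    return "xmpMM:DerivedFrom" in packet
```

A file with no packet returned by `extract_xmp_packets` may still carry XMP inside a compressed stream, so treat an empty list as "nothing visible", not "nothing there".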
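For any black-rectangle "redaction," open the file, click into the redacted region, drag a selection, press Ctrl+C. If anything lands on the clipboard, the redaction is cosmetic: the text is still in the content stream. An empty clipboard is not proof of the opposite, because a viewer can fail to select text that an extraction tool would still find.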
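The manual test can be approximated programmatically once word positions and rectangle positions are available. The sketch below covers only the geometry. It assumes word bounding boxes have already been produced by some text extractor and that the rectangles come from the annotations' /Rect values, all in the same page coordinates. `Box`, `overlaps`, and `text_under_redactions` are hypothetical names.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned rectangle in page coordinates (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float


def overlaps(a: Box, b: Box) -> bool:
    """True when the two rectangles share some area (touching edges do not count)."""
    return a.x0 < b.x1 and b.x0 < a.x1 and a.y0 < b.y1 and b.y0 < a.y1


def text_under_redactions(words, redactions):
    """Return the extractable words whose boxes intersect any redaction rectangle.

    words: iterable of (text, Box) pairs from a text extractor (assumed input).
    redactions: iterable of Box, one per black rectangle.
    """
    return [text for text, box in words
            if any(overlaps(box, rect) for rect in redactions)]
```

Any word this returns is text that survived under a rectangle, which is the same failure the copy-and-paste test detects.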
For any PDF that includes phone-taken or scanned photos, the EXIF data inside those images may survive. Use Extract Images to pull each image, then check EXIF (a tool like exiftool surfaces GPS coordinates, capture timestamps, camera serial numbers).
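Where exiftool is not available, a quick yes/no check for GPS data in an extracted JPEG can be sketched with the standard library. This is a rough illustration under stated assumptions: it looks only for the GPS pointer tag (0x8825) in the first EXIF directory, does not decode the coordinates, and does not handle unusual marker layouts. The function names are illustrative.

```python
import struct


def exif_payload(jpeg: bytes):
    """Return the TIFF-formatted EXIF payload of a JPEG, or None if absent."""
    if jpeg[:2] != b"\xff\xd8":          # every JPEG starts with the SOI marker
        return None
    pos = 2
    while pos + 4 <= len(jpeg):
        if jpeg[pos] != 0xFF:            # not at a marker: give up
            return None
        marker = jpeg[pos + 1]
        if marker in (0xDA, 0xD9):       # start of scan or end of image
            return None
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        body = jpeg[pos + 4:pos + 2 + length]
        if marker == 0xE1 and body.startswith(b"Exif\x00\x00"):
            return body[6:]
        pos += 2 + length
    return None


def has_gps_ifd(tiff: bytes) -> bool:
    """True when IFD0 contains the GPS info pointer tag (0x8825)."""
    if len(tiff) < 8 or tiff[:2] not in (b"II", b"MM"):
        return False
    endian = "<" if tiff[:2] == b"II" else ">"
    magic, ifd0 = struct.unpack(endian + "HI", tiff[2:8])
    if magic != 42 or ifd0 + 2 > len(tiff):
        return False
    (count,) = struct.unpack(endian + "H", tiff[ifd0:ifd0 + 2])
    for i in range(count):
        entry = tiff[ifd0 + 2 + 12 * i:ifd0 + 14 + 12 * i]
        if len(entry) < 12:
            break
        (tag,) = struct.unpack(endian + "H", entry[:2])
        if tag == 0x8825:
            return True
    return False


def jpeg_has_gps(jpeg: bytes) -> bool:
    """Combine the two steps: find the EXIF segment, then look for GPS."""
    tiff = exif_payload(jpeg)
    return tiff is not None and has_gps_ifd(tiff)
```

A `False` result only means no GPS pointer was found in the place this sketch looks. Capture timestamps and camera serial numbers live in other tags and still need a full EXIF reader.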
Many PDFs include scripted form behavior. Document inspectors flag /JS and /AcroForm entries. For documents that should be passive (a PDF being sent for review, not interaction), strip these.
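A first-pass version of that flagging can be sketched as a token count over the raw bytes. Only /JS and /AcroForm come from the text above; /JavaScript, /OpenAction, and /AA are further standard PDF keys added here as an assumption about what is worth flagging. The scan sees only uncompressed objects and can be evaded by names written with # escapes, so it complements a real inspector rather than replacing one.

```python
import re

# PDF names associated with scripts, forms, and automatic actions.
ACTIVE_KEYS = ("JS", "JavaScript", "AcroForm", "OpenAction", "AA")

# A PDF name ends at whitespace or a delimiter, so /JS must not match /JSFoo.
_NAME_END = rb"(?=[\s/<>\[\]()%{}]|$)"


def find_active_keys(pdf_bytes: bytes) -> dict[str, int]:
    """Count occurrences of each active-content key in the raw bytes."""
    counts = {}
    for key in ACTIVE_KEYS:
        pattern = rb"/" + key.encode("ascii") + _NAME_END
        hits = len(re.findall(pattern, pdf_bytes))
        if hits:
            counts[key] = hits
    return counts
```

For a document meant to be passive, any non-empty result is a reason to look closer before sending.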
The right sequence for sanitization depends on the document type. For a confidential document being shared externally:
Step 1 — Strip document-info and XMP metadata. Clean Metadata removes Author, Subject, Application, and the full XMP stream. Re-open the document properties to confirm the fields are blank.
Step 2 — Flatten forms and annotations. If the document was a fillable form or carries highlights, comments, or sticky notes you do not want to share, run Flatten PDF Forms to bake the visible state into the page content. Hidden form values are removed.
Step 3 — Verify redactions are content-stream, not annotation. Use the selection test above. If redactions were drawn as rectangles, re-do them with Redact PDF, which removes underlying text.
Step 4 — Re-process embedded images if EXIF matters. If the document contains phone-taken photos that embed GPS, the cleanest fix is to re-take the photos with location services disabled, or run Extract Images → strip EXIF → re-embed. Some metadata cleaners also strip image EXIF; verify on the output.
Step 5 — Reflatten with a clean export. The strongest sanitization is to re-render the PDF to a fresh PDF using a tool that does not carry source metadata forward. This costs more, because interactive elements are lost, but it leaves the least behind from the source file.
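The "strip EXIF" part of Step 4 can be illustrated on a single extracted JPEG. The sketch below drops EXIF segments and copies everything else unchanged, so the pixels are not re-encoded. It is an assumption-laden example, not how any tool named here works: it does not remove XMP packets stored in the image, and re-embedding the cleaned image into the PDF is a separate step that is not shown. `strip_exif` is a made-up name.

```python
import struct


def strip_exif(jpeg: bytes) -> bytes:
    """Return a copy of the JPEG with its APP1 Exif segments removed."""
    if jpeg[:2] != b"\xff\xd8":
        raise ValueError("not a JPEG (missing SOI marker)")
    out = bytearray(b"\xff\xd8")
    pos = 2
    # Walk the header segments that precede the compressed image data.
    while pos + 4 <= len(jpeg) and jpeg[pos] == 0xFF:
        marker = jpeg[pos + 1]
        if marker in (0xDA, 0xD9):       # start of scan or end of image
            break
        (length,) = struct.unpack(">H", jpeg[pos + 2:pos + 4])
        segment = jpeg[pos:pos + 2 + length]
        is_exif = marker == 0xE1 and segment[4:10] == b"Exif\x00\x00"
        if not is_exif:
            out += segment               # keep every non-EXIF segment as-is
        pos += 2 + length
    out += jpeg[pos:]                    # copy the image data verbatim
    return bytes(out)
```

Running the function on its own output returns the same bytes, which makes it safe to apply to a folder of images that may already be clean. Verify the result with an EXIF reader before re-embedding.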
| Feature | PDFKits | Adobe Acrobat Pro | pdftk (CLI) | Smallpdf |
|---|---|---|---|---|
| Cost | Free | $29.99/month | Free (open source) | $108/year |
| Files stay on your device | Yes | Yes (desktop) | Yes | No — cloud |
| Strip document-info / XMP | Yes | Yes (Sanitize) | Yes (update_info) | Limited |
| Content-stream redaction | Yes | Yes | No | Limited |
| Flatten forms / annotations | Yes | Yes | Yes | Yes |
| Strip image EXIF | Yes | Partial | No (image-level needed) | Partial |
| No login required | Yes | Adobe ID | — | Account-gated |
For lawyers, journalists, healthcare workers, and anyone subject to GDPR's Article 5 data-minimization principle, the procurement question is simpler with browser-only tools: there is no third-party processor to evaluate, no Data Processing Agreement to negotiate, no breach-notification clause to draft.
Under GDPR Article 4(1), "personal data" includes any information relating to an identified or identifiable natural person — names, identification numbers, location data, online identifiers, factors specific to that person's physical, physiological, genetic, mental, economic, cultural, or social identity. Metadata that names a document's author or contains GPS coordinates clearly qualifies. EXIF data from a photo of a person is personal data even if the photo itself is in the document for non-identifying reasons.
HIPAA's Privacy Rule, 45 CFR § 164.514, sets a similar bar for protected health information. A PDF containing medical records with metadata pointing to the originating clinic, the doctor, or the patient can constitute a reportable disclosure if the metadata is exposed to an unauthorized party. The cure is sanitization before release.
For both regimes, the operational rule is: if the visible content is meant to be public, strip everything else. If the visible content is meant to be restricted, control distribution AND sanitize hidden data, because copies leak and metadata travels with copies.
No. Viewers display the visible content but leave the underlying file intact. Metadata persists across views, downloads, and re-uploads unless explicitly stripped.
"Save as PDF" typically re-exports from a source document; some metadata is carried forward depending on the export options. "Sanitize" specifically strips known metadata fields and may remove embedded JavaScript, form data, and annotations. Acrobat's "Sanitize Document" is the most comprehensive built-in option among desktop tools.
No, but it will remove functionality you may want — interactive forms, bookmarks, embedded media. For a final read-only release, this is usually desirable. For a working document, sanitize a copy before sharing externally.
Open the file. Click into the redacted region. Drag a selection across it. Press Ctrl+C. If anything lands on your clipboard, the redaction is cosmetic. Real redaction removes text from the content stream, leaving nothing to copy.
The image content is metadata-clean (scanners do not typically write personal metadata into raw scans), but the surrounding PDF still has document-info fields from the scanning software. Phone-scanned PDFs additionally carry the phone's camera EXIF, including potentially GPS.
Partially. Encryption protects content during transmission and at rest from someone without the password. But anyone with the password sees the metadata fully. Encryption is a confidentiality layer; sanitization is a leak-prevention layer. Most workflows need both.
Yes. Attached PDFs travel intact, metadata and all. Email signatures and mail server hops add transit metadata but do not alter the file itself. If you would not want a recipient to see the PDF's metadata, sanitize before attaching.
In Adobe Reader: File → Properties → Description and Advanced tabs. For deeper inspection, use the command-line tool pdfinfo (Linux/macOS) or open the PDF in a text editor and search for /Author, /Creator, /Producer, xmp:CreatorTool.
For single files, no — both finish in 1–5 seconds. For batches of 100+ files, a CLI tool like pdftk run from a shell script is faster than clicking through each file. For most users sanitizing one document before sending, browser-only is identical in user experience.
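For the batch case, the first job is usually finding which files need attention. A minimal sketch, assuming the metadata is stored uncompressed: walk a folder and report every PDF whose raw bytes contain one of the usual markers. The function name `audit_folder` and the marker list are choices made for this example. The actual stripping would still be done by the sanitizing tool of your choice.

```python
from pathlib import Path

# Byte markers that usually indicate document-info or XMP metadata.
MARKERS = (b"/Author", b"/Creator", b"/Producer", b"xmp:CreatorTool")


def audit_folder(folder) -> dict[str, list[str]]:
    """Map each PDF under `folder` to the metadata markers found in it.

    Files with no markers are left out of the report. Paths are relative
    to `folder` so that same-named files in subfolders stay distinct.
    """
    root = Path(folder)
    report = {}
    for path in sorted(root.rglob("*.pdf")):
        data = path.read_bytes()
        hits = [marker.decode("ascii") for marker in MARKERS if marker in data]
        if hits:
            report[path.relative_to(root).as_posix()] = hits
    return report
```

Each file is read fully into memory, which is fine for typical documents but worth revisiting for very large scans.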
Some viewers add a "Forwarded by" annotation or update document-info when re-saved. To be safe, do not "Save As" on the file — forward the original received file unchanged, or sanitize first.
Clean Metadata — Strip document-info and XMP fields.
Redact PDF — Content-stream redaction (not annotation overlay).
Flatten PDF — Convert interactive forms and annotations to static content.
Extract Images — Pull embedded images for EXIF inspection.
Protect PDF — Add password protection (combine with sanitization).
Edit PDF — Modify visible content before sanitizing.