Adobe PDF OCR Scanned Archive to Searchable
When a scanned (image-only) PDF lands in an archive library, the flow runs OCR with Adobe PDF Services to produce a searchable PDF, replaces the image-only copy, extracts the text, and indexes it into Azure AI Search. This makes a scanned-document archive fully searchable by content.
Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.
Overview
CF-545 — built and verified (Flow Checker 0 errors / 0 warnings), ships Off. When a scanned, image-only PDF lands in a SharePoint archive library, this flow runs Adobe PDF Services OCR to produce a searchable PDF, replaces the image-only copy in place, extracts the recognized text with Adobe Extract, and indexes that text into Azure AI Search (mergeOrUpload). The result: a scanned-document archive that is fully searchable by content, not just by filename.
Why it matters: Image-only scans are invisible to enterprise search and eDiscovery. OCR + indexing unlocks the archive for retrieval, compliance, and discovery.
Solution: 1 cloud flow, 32 top-level actions, 2 bound connectors (SharePoint, Azure AI Search) + Adobe PDF Services REST over built-in HTTP.
Use Case
Records, legal, and IT teams holding large libraries of historical scans (contracts, invoices, correspondence) need them searchable by their actual content. Dropping a scan into the monitored library automatically converts it to a searchable PDF and pushes its text into the enterprise search index — no manual OCR step.
Flow Architecture
When a Scanned PDF Arrives
SharePoint: GetOnNewFileItems (poll 5 min, splitOn)
Fires once per new file in the archive library.
Initialize variables
Initialize Variable (x13)
Bind environment variables (site, library, work folder, OCR language, index, Adobe base/id/secret, poll interval) and derive file name, doc key, and the two loop-status holders.
Get Scanned File Content
SharePoint: GetFileContent
Read the scanned PDF bytes.
Adobe token + asset + upload
HTTP: POST /token, POST /assets, PUT (upload)
Get a client-credentials OAuth token, create an OCR source asset, and upload the PDF bytes.
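The three calls in this step can be sketched as plain request descriptions. This is a minimal illustration, not the flow's actual definition: the function names are hypothetical, and the form fields (`client_id`, `client_secret`), the `X-API-Key` header, and the `mediaType` body field are assumptions based on Adobe's public REST conventions. Nothing here touches the network.

```python
from urllib.parse import urlencode


def token_request(base, client_id, client_secret):
    """Describe the client-credentials token call (form-encoded body)."""
    return {
        "method": "POST",
        "url": f"{base.rstrip('/')}/token",
        "headers": {"Content-Type": "application/x-www-form-urlencoded"},
        # Assumed field names; check them against the Adobe documentation.
        "body": urlencode({"client_id": client_id, "client_secret": client_secret}),
    }


def asset_request(base, client_id, access_token):
    """Describe the call that creates an upload slot for the scanned PDF."""
    return {
        "method": "POST",
        "url": f"{base.rstrip('/')}/assets",
        "headers": {
            "X-API-Key": client_id,
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        "body": {"mediaType": "application/pdf"},
    }


def upload_request(upload_uri, pdf_bytes):
    """Describe the PUT of the raw PDF bytes to the presigned upload URI."""
    return {
        "method": "PUT",
        "url": upload_uri,
        "headers": {"Content-Type": "application/pdf"},
        "body": pdf_bytes,
    }
```

The presigned upload URI is used as returned, so the upload call carries no Adobe credentials of its own in this sketch.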
Start OCR Job + Poll Until Complete
HTTP: POST /operation/ocr + Until loop
Submit OCR (ocrLang, ocrType=searchable_image) and poll the location until done/failed.
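The Until loop amounts to "fetch status, stop on done or failed, otherwise wait and retry". A minimal sketch of that pattern follows. The function name, the `max_polls` cap, and the injectable `fetch_status` and `sleep` callables are illustrative assumptions, as is the `status` field on the response body.

```python
import time


def poll_until_complete(fetch_status, interval_seconds=5, max_polls=60, sleep=time.sleep):
    """Call fetch_status() until it reports a terminal status.

    fetch_status is any zero-argument callable returning the parsed
    job-status body as a dict (hypothetical shape: {"status": ...}).
    """
    for _ in range(max_polls):
        body = fetch_status()
        status = str(body.get("status", "")).lower()
        if status in ("done", "failed"):
            return body
        # Not finished yet: wait before asking again.
        sleep(interval_seconds)
    raise TimeoutError(f"job still running after {max_polls} polls")
```

Passing `sleep` as a parameter keeps the loop testable without real delays. The cap on polls plays the same role as the iteration limit on an Until action.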
Download + Replace With Searchable PDF
HTTP GET (presigned) + SharePoint CreateFile
Download the searchable PDF and overwrite the original scan in place.
Start Extract + Poll + Download Zip
HTTP: POST /operation/extractpdf + Until + GET
Run Extract on the OCR result asset (chained assetID), poll until done, and download the Extract result ZIP.
Save + Unzip + Read Structured Data
SharePoint CreateFile + ExtractFolderV2 + GetFileContentByPath
Save the ZIP, unzip to a per-document subfolder, and read structuredData.json.
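Outside the flow, the same "unzip and read structuredData.json" step can be approximated in a few lines. The sketch below assumes the JSON holds an `elements` array whose text-bearing entries carry a `Text` field; that shape and the helper name are assumptions for illustration, so verify them against a real Extract result before relying on it.

```python
import io
import json
import zipfile


def text_from_extract_zip(zip_bytes, member="structuredData.json"):
    """Return the recognized text from an Extract result ZIP as one string."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        data = json.loads(archive.read(member).decode("utf-8"))
    # Keep only elements that carry text (figures and similar entries have none).
    parts = [el["Text"].strip() for el in data.get("elements", []) if el.get("Text")]
    return "\n".join(p for p in parts if p)
```

Joining elements with newlines keeps paragraph boundaries visible in the indexed text.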
Compose Search Document
Compose + Select
Environment Variables
| Schema name | Type | Default | Description |
|---|---|---|---|
| flowlibs_SharePointSiteURL | String | https://your-tenant.sharepoint.com | Site hosting the archive (trigger dataset) — reused. |
| flowlibs_ScannedArchiveLibrary | String | Scanned Archive | Library to monitor + write the searchable copy back to (trigger table) — new. |
| flowlibs_OcrWorkFolderPath | String | /Scanned Archive/_ocrwork | Scratch folder for the Extract ZIP + unzip — new. |
| flowlibs_OcrLanguage | String | en-US | Adobe OCR language — new. |
| flowlibs_ArchiveSearchIndexName | String | document-archive | Azure AI Search index that receives the text (must pre-exist) — new. |
| flowlibs_AdobeClientId | String | <configure> | Adobe API client id / X-API-Key — reused. |
| flowlibs_AdobeClientSecret | String | <configure> | Adobe API client secret — reused. |
| flowlibs_AdobePdfServicesBase | String | https://pdf-services.adobe.io | Adobe REST base URL — reused. |
| flowlibs_AdobePollIntervalSeconds | String | 5 | Seconds to wait between Adobe job-status polls. |
Connectors & Connections
| Connector | API name | Actions used |
|---|---|---|
| HTTP | http | POST /token POST /assets PUT (upload) POST /operation/ocr POST /operation/extractpdf GET (status/download) |
| SharePoint | shared_sharepointonline | GetOnNewFileItems GetFileContent CreateFile ExtractFolderV2 GetFileContentByPath |
| Azure AI Search | shared_azureaisearch | IndexDocuments |
Note — All connections are referenced as solution connection references; the flow is portable between environments as long as a connection is mapped at import time.
Customization Guide
Almost every realistic variant of this flow can be implemented by changing environment variable values. A few cases require small edits inside the flow definition — those are called out explicitly below.
- Point at your library: Set flowlibs_ScannedArchiveLibrary (and flowlibs_OcrWorkFolderPath) to your document library; set flowlibs_SharePointSiteURL if not the root site.
- Index schema: Create the document-archive index (or rename via flowlibs_ArchiveSearchIndexName) with the required fields before turning the flow on.
- OCR quality: Switch ocrType to searchable_image_exact in Start OCR Job to preserve image fidelity (default searchable_image compresses).
- Vector / semantic search: Add an embeddings step and a vector field for semantic retrieval.
- Chunking: Split long documents into multiple search docs with a shared parent id for better relevance.
- Language auto-detect: Detect language per document instead of the fixed flowlibs_OcrLanguage for mixed archives.
- Subfolders: The demo assumes scans arrive at the library root; for subfolders, derive the folder from the trigger path.
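The chunking variant above could look roughly like the following. It is a sketch under stated assumptions: the field names `parentId`, `chunk`, and `content`, the `max_chars` limit, and the `-<n>` key suffix are all hypothetical and would need to match your own index schema.

```python
def chunk_document(parent_id, text, max_chars=2000):
    """Split text on word boundaries into search docs sharing one parent id."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}".strip()
        if len(candidate) > max_chars and current:
            # Adding this word would overflow the chunk: close it, start a new one.
            chunks.append(current)
            current = word
        else:
            # Fits, or a single oversized word that becomes its own chunk.
            current = candidate
    if current:
        chunks.append(current)
    return [
        {
            "@search.action": "mergeOrUpload",
            "id": f"{parent_id}-{i}",
            "parentId": parent_id,
            "chunk": i,
            "content": chunk,
        }
        for i, chunk in enumerate(chunks)
    ]
```

Because every chunk key is derived from the parent id, re-running OCR on the same file overwrites its chunks instead of duplicating them. Stale trailing chunks from a previously longer version would still need separate cleanup.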
Key Expressions
The flow is intentionally light on Power Fx / WDL gymnastics. The heaviest expressions are the Adobe request bodies, the poll and download URL lookups, and the document key. They are listed below in the order they appear in the flow.
EXPR.01 OCR job submit body
Request body for the Adobe OCR operation.
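As a rough illustration of what that body carries, the sketch below builds it from the three values named on this page (asset id, ocrLang, ocrType). The `assetID` field name and the helper itself are assumptions. The two ocrType values are the ones mentioned in the Customization Guide.

```python
OCR_TYPES = {"searchable_image", "searchable_image_exact"}


def ocr_job_body(asset_id, ocr_lang="en-US", ocr_type="searchable_image"):
    """Build the JSON body for the OCR submit call."""
    if ocr_type not in OCR_TYPES:
        raise ValueError(f"unsupported ocrType: {ocr_type!r}")
    return {"assetID": asset_id, "ocrLang": ocr_lang, "ocrType": ocr_type}
```

Validating ocrType up front surfaces a typo as a clear local error rather than as a failed job later in the run.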
EXPR.02 Poll location (case-safe)
Resolve the job-status poll URL from the OCR response headers.
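"Case-safe" here presumably means the lookup should work whether the header arrives as `Location` or `location`. A minimal sketch of that idea, with a hypothetical helper name:

```python
def header_value(headers, name):
    """Return a header value regardless of the casing of its name, or None."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None
```

For example, `header_value(response_headers, "location")` gives the same poll URL for either casing, which avoids a null URL in the first iteration of the polling loop.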
EXPR.03 Chain OCR output into Extract (no re-upload)
Use the OCR result asset id directly as the Extract input.
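The chaining idea can be sketched as follows: take the asset id from the finished OCR status body and pass it straight into the Extract request. The response shape (`asset.assetID`) and the `elementsToExtract` field are assumptions about the Adobe API, and the helper name is hypothetical.

```python
def extract_job_body(ocr_status_body, elements=("text",)):
    """Build the Extract submit body from a completed OCR job-status body."""
    if str(ocr_status_body.get("status", "")).lower() != "done":
        raise ValueError("OCR job is not complete; nothing to chain")
    # Reuse the OCR result asset directly instead of uploading the PDF again.
    asset_id = ocr_status_body["asset"]["assetID"]
    return {"assetID": asset_id, "elementsToExtract": list(elements)}
```

Skipping the second upload saves one asset-creation call and one PUT of the full PDF per document.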
EXPR.04 Extract ZIP download
Resolve the Extract result ZIP download URL.
EXPR.05 Document key
Stable AI Search document key from the SharePoint list item id.
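One way to picture the key and the resulting mergeOrUpload payload is sketched below. The `archive-` prefix and the `fileName` and `content` field names are hypothetical. The character filter reflects the Azure AI Search rule that document keys may contain only letters, digits, underscores, dashes, and equal signs.

```python
import re

# Anything outside the characters Azure AI Search allows in a document key.
_KEY_UNSAFE = re.compile(r"[^A-Za-z0-9_\-=]")


def document_key(item_id, prefix="archive"):
    """Derive a stable, key-safe id from the SharePoint list item id."""
    return _KEY_UNSAFE.sub("_", f"{prefix}-{item_id}")


def index_payload(item_id, file_name, content):
    """Wrap one search document in a mergeOrUpload batch."""
    return {
        "value": [
            {
                "@search.action": "mergeOrUpload",
                "id": document_key(item_id),
                "fileName": file_name,
                "content": content,
            }
        ]
    }
```

Because the key depends only on the list item id, reprocessing the same file updates the existing search document rather than creating a second one.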