Advanced

Adobe PDF OCR Scanned Archive to Searchable

When a scanned (image-only) PDF lands in an archive library, the flow runs OCR with Adobe PDF Services to produce a searchable PDF, replaces the image-only copy, extracts the text, and indexes it into Azure AI Search. This makes a scanned-document archive fully searchable by content.

HTTP, SharePoint, Azure AI Search

Unique name

FlowLibsAdobePdfOcrSearchableArchive

Publisher

FlowLibs (flowlibs)

Version

1.0.0.0

Components

9 env vars + 1 cloud flow

Provided as-is, without warranty of any kind. Review and test each pattern in a non-production environment before deploying it to live automations. See our Terms.

What it does

Overview

CF-545: built and verified (Flow Checker reports 0 errors and 0 warnings); the flow ships turned off. When a scanned, image-only PDF lands in a SharePoint archive library, this flow runs Adobe PDF Services OCR to produce a searchable PDF, replaces the image-only copy in place, extracts the recognized text with Adobe Extract, and indexes that text into Azure AI Search (mergeOrUpload). The result is a scanned-document archive that is fully searchable by content, not just by filename.

Why it matters: Image-only scans are invisible to enterprise search and eDiscovery. OCR + indexing unlocks the archive for retrieval, compliance, and discovery.

Solution: 1 cloud flow, 32 top-level actions, 2 bound connectors (SharePoint, Azure AI Search) + Adobe PDF Services REST over built-in HTTP.

Why you'd use it

Use Case

Records, legal, and IT teams holding large libraries of historical scans (contracts, invoices, correspondence) need them searchable by their actual content. Dropping a scan into the monitored library automatically converts it to a searchable PDF and pushes its text into the enterprise search index — no manual OCR step.

Step-by-step

Flow Architecture

When a Scanned PDF Arrives

SharePoint — GetOnNewFileItems (poll 5 min, splitOn)

Fires once per new file in the archive library.

Initialize variables

Initialize Variable (x13)

Bind env vars (site, library, work folder, OCR language, index, Adobe base/id/secret, poll interval) and derive file name, doc key, and the two loop-status holders.

Get Scanned File Content

SharePoint — GetFileContent

Read the scanned PDF bytes.

Adobe token + asset + upload

HTTP — POST /token, POST /assets, PUT (upload)

Get a client-credentials OAuth token, create an OCR source asset, and upload the PDF bytes.
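
To make the three calls concrete outside Power Automate, here is a minimal Python sketch that only describes the requests and sends nothing. It assumes the publicly documented Adobe PDF Services REST shape (form-encoded credentials to /token, a JSON mediaType to /assets, and a PUT of the raw bytes to the returned upload URI). The helper names are hypothetical and are not taken from the flow.

```python
from urllib.parse import urlencode

# Default value of flowlibs_AdobePdfServicesBase in this solution.
BASE = "https://pdf-services.adobe.io"


def token_request(client_id: str, client_secret: str, base: str = BASE) -> dict:
    """Describe the client-credentials token call (assumed form-encoded body)."""
    return {
        "method": "POST",
        "url": f"{base}/token",
        "headers": {"Content-Type": "application/x-www-form-urlencoded"},
        "body": urlencode({"client_id": client_id, "client_secret": client_secret}),
    }


def create_asset_request(client_id: str, access_token: str, base: str = BASE) -> dict:
    """Describe the call that reserves an asset for the scanned PDF."""
    return {
        "method": "POST",
        "url": f"{base}/assets",
        "headers": {
            "X-API-Key": client_id,
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        "body": {"mediaType": "application/pdf"},
    }


def upload_request(upload_uri: str, pdf_bytes: bytes) -> dict:
    """Describe the PUT that sends the PDF bytes to the presigned upload URI."""
    return {
        "method": "PUT",
        "url": upload_uri,
        "headers": {"Content-Type": "application/pdf"},
        "body": pdf_bytes,
    }
```

The presigned upload URI is used without an Authorization header in this sketch, which mirrors how presigned URLs usually work; verify against your own tenant before relying on it.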

Start OCR Job + Poll Until Complete

HTTP — POST /operation/ocr + Until loop

Submit OCR (ocrLang, ocrType=searchable_image) and poll the location until done/failed.
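
The Until loop amounts to "fetch status, stop on done or failed, otherwise wait and retry". A rough Python equivalent is sketched below. The status fetcher is injected so the logic can be exercised without network access; the max_polls guard and the function name are assumptions for illustration, not settings read from the flow.

```python
import time
from typing import Callable


def poll_until_finished(
    get_status: Callable[[], dict],
    interval_seconds: float = 5,   # mirrors flowlibs_AdobePollIntervalSeconds
    max_polls: int = 60,           # hypothetical safety limit
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_status() until the job reports 'done' or 'failed'."""
    for _ in range(max_polls):
        job = get_status()
        if job.get("status") in ("done", "failed"):
            return job
        # Still running: wait before asking again.
        sleep(interval_seconds)
    raise TimeoutError(f"job did not finish after {max_polls} polls")
```

The caller still has to branch on the returned status, because a "failed" job ends the loop just like a "done" one.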

Download + Replace With Searchable PDF

HTTP GET (presigned) + SharePoint CreateFile

Download the searchable PDF and overwrite the original scan in place.

Start Extract + Poll + Download Zip

HTTP — POST /operation/extractpdf + Until + GET

Run Extract on the OCR result asset (chained assetID), poll until done, and download the Extract result ZIP.
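
Chaining means the Extract request reuses the asset id reported by the finished OCR job instead of uploading the file again. A small Python sketch of building that request body follows. The elementsToExtract value is an assumption based on the public Adobe Extract API, and the function name is hypothetical.

```python
def extract_request_body(ocr_status: dict) -> dict:
    """Build the Extract job body from a finished OCR job status."""
    if ocr_status.get("status") != "done":
        raise ValueError("OCR job has not finished successfully")
    # Same path the flow reads: body(...)['asset']['assetID'].
    asset_id = ocr_status["asset"]["assetID"]
    # Assumption: only text is requested; tables are not needed for indexing.
    return {"assetID": asset_id, "elementsToExtract": ["text"]}
```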

Save + Unzip + Read Structured Data

SharePoint CreateFile + ExtractFolderV2 + GetFileContentByPath

Save the ZIP, unzip to a per-document subfolder, and read structuredData.json.
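
For readers who want to inspect an Extract result by hand, the following Python sketch reads structuredData.json straight from the ZIP bytes and joins the text of each element. It assumes the usual Extract output layout (a top-level elements array whose entries may carry a Text field). The function name is hypothetical, and the flow itself does this with SharePoint actions rather than code.

```python
import io
import json
import zipfile


def extract_text(zip_bytes: bytes) -> str:
    """Return the recognized text from an Adobe Extract result ZIP."""
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        data = json.loads(archive.read("structuredData.json"))
    # Elements without a Text field (figures, for example) are skipped.
    texts = (el.get("Text", "").strip() for el in data.get("elements", []))
    return "\n".join(t for t in texts if t)
```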

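Compose Search Document

Compose + Select

Assemble the search document from the document key and the text read from structuredData.json.

Index Document To Search

Azure AI Search IndexDocuments

Merge-or-upload the composed document into the archive index.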

Solution config

Environment Variables

Schema name	Type	Default	Description
flowlibs_SharePointSiteURL	String	https://your-tenant.sharepoint.com	Site hosting the archive (trigger dataset) — reused.
flowlibs_ScannedArchiveLibrary	String	Scanned Archive	Library to monitor + write the searchable copy back to (trigger table) — new.
flowlibs_OcrWorkFolderPath	String	/Scanned Archive/_ocrwork	Scratch folder for the Extract ZIP + unzip — new.
flowlibs_OcrLanguage	String	en-US	Adobe OCR language — new.
flowlibs_ArchiveSearchIndexName	String	document-archive	Azure AI Search index that receives the text (must pre-exist) — new.
flowlibs_AdobeClientId	String	<configure>	Adobe API client id / X-API-Key — reused.
flowlibs_AdobeClientSecret	String	<configure>	Adobe API client secret — reused.
flowlibs_AdobePdfServicesBase	String	https://pdf-services.adobe.io	Adobe REST base URL — reused.
flowlibs_AdobePollIntervalSeconds	String	5	Seconds to wait between Adobe job-status polls.

Auth dependencies

Connectors & Connections

Connector	API name	Actions used
HTTP	http	POST /token POST /assets PUT (upload) POST /operation/ocr POST /operation/extractpdf GET (status/download)
SharePoint	shared_sharepointonline	GetOnNewFileItems GetFileContent CreateFile ExtractFolderV2 GetFileContentByPath
Azure AI Search	shared_azureaisearch	IndexDocuments

Note — All connections are referenced as solution connection references; the flow is portable between environments as long as a connection is mapped at import time.

Tweaks & variations

Customization Guide

Almost every realistic variant of this flow can be implemented by changing environment variable values. A few cases require small edits inside the flow definition — those are called out explicitly below.

Point at your library: Set flowlibs_ScannedArchiveLibrary (and flowlibs_OcrWorkFolderPath) to your document library; set flowlibs_SharePointSiteURL if not the root site.
Index schema: Create the document-archive index (or rename via flowlibs_ArchiveSearchIndexName) with the required fields before turning the flow on.
OCR quality: Switch ocrType to searchable_image_exact in Start OCR Job to preserve image fidelity (default searchable_image compresses).
Vector / semantic search: Add an embeddings step and a vector field for semantic retrieval.
Chunking: Split long documents into multiple search docs with a shared parent id for better relevance.
Language auto-detect: Detect language per document instead of the fixed flowlibs_OcrLanguage for mixed archives.
Subfolders: The demo assumes scans arrive at the library root; for subfolders, derive the folder from the trigger path.
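
The chunking variation above can be sketched in a few lines of Python. This is a minimal illustration that splits on a fixed character count. The field names (id, parentId, chunk, content) and the 2000-character limit are assumptions, so match them to your own index schema.

```python
from typing import Dict, List


def chunk_document(parent_id: str, text: str, max_chars: int = 2000) -> List[Dict]:
    """Split text into search documents that share one parent id."""
    # An empty text still yields a single (empty) chunk.
    pieces = [text[i:i + max_chars] for i in range(0, len(text), max_chars)] or [""]
    return [
        {
            "id": f"{parent_id}-{n}",   # unique key per chunk
            "parentId": parent_id,      # shared across all chunks of one file
            "chunk": n,
            "content": piece,
        }
        for n, piece in enumerate(pieces)
    ]
```

Splitting on sentence or paragraph boundaries would usually give better relevance than a hard character cut; the fixed size here just keeps the sketch short.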

Helpers & literals

Key Expressions

The flow is intentionally light on Workflow Definition Language (WDL) gymnastics. The heaviest expressions are the case-safe poll-location lookup and the download-URL fallback for the Extract result. They are listed below in the order they appear in the flow.

EXPR.01 OCR job submit body

Request body for the Adobe OCR operation.

json

{ "assetID": "@{body('Create_OCR_Source_Asset')?['assetID']}", "ocrLang": "@{variables('varOcrLanguage')}", "ocrType": "searchable_image" }

EXPR.02 Poll location (case-safe)

Resolve the job-status poll URL from the OCR response headers.

workflow definition language

@coalesce(outputs('Start_OCR_Job')?['headers']?['location'], outputs('Start_OCR_Job')?['headers']?['Location'])
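
The coalesce above covers the two casings seen in practice. The same idea, generalized to any casing, looks like this in Python. It is only an illustration of the lookup, with a hypothetical helper name, and is not part of the flow.

```python
from typing import Mapping, Optional


def get_header(headers: Mapping[str, str], name: str) -> Optional[str]:
    """Return a header value regardless of how the server cased its name."""
    wanted = name.lower()
    for key, value in headers.items():
        if key.lower() == wanted:
            return value
    return None
```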

EXPR.03 Chain OCR output into Extract (no re-upload)

Use the OCR result asset id directly as the Extract input.

workflow definition language

@{body('Get_OCR_Job_Status')?['asset']?['assetID']}

EXPR.04 Extract ZIP download

Resolve the Extract result ZIP download URL.

workflow definition language

@coalesce(body('Get_Extract_Job_Status')?['content']?['downloadUri'], body('Get_Extract_Job_Status')?['asset']?['downloadUri'])

EXPR.05 Document key

Stable AI Search document key from the SharePoint list item id.

workflow definition language

@concat('sp-', string(triggerOutputs()?['body/ID']))
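
To show how the key feeds the index call, here is a hedged Python sketch of a mergeOrUpload batch in the shape used by the Azure AI Search REST API for indexing documents. The field names id and content are assumptions about the index schema, and the connector's IndexDocuments action may expect its inputs in a different form.

```python
def document_key(item_id: int) -> str:
    """Same rule as the expression above: 'sp-' plus the list item id."""
    return f"sp-{item_id}"


def index_payload(item_id: int, content: str) -> dict:
    """Build a one-document mergeOrUpload batch (hypothetical field names)."""
    return {
        "value": [
            {
                # mergeOrUpload updates an existing document or creates it.
                "@search.action": "mergeOrUpload",
                "id": document_key(item_id),
                "content": content,
            }
        ]
    }
```

Because the key is derived from the list item id, re-running the flow for the same file updates the existing search document instead of adding a duplicate.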

Make it yours

Customize & download

Generate a ready-to-import copy of this solution with your environment-variable values baked in — available on Base, Pro, or Team.
