Operationalize Research Ingestion

Use this guide when you need to understand what is available today for research ingestion, what is reusable, and where the current wiring stops.

Goal

Turn a research paper into retrieval-friendly chunks that can support downstream search, summarization, or agent context.

What exists in the repository

  • A reusable ingestion pipeline in packages/ai/src/ingestion/research-paper-pipeline.ts
  • Embedding generation in packages/ai/src/embeddings
  • An admin-side placeholder hook in apps/admin/payload/hooks/researchPaperIngestionPipeline.ts

Happy-path pipeline

The reusable pipeline does four things:

  1. Extract text from a PDF buffer or URL
  2. Split the paper into section-based chunks
  3. Generate embeddings in batches
  4. Return structured chunk output

import { createResearchPaperIngestionPipeline } from '@loop/ai';

const pipeline = createResearchPaperIngestionPipeline({
  apiKey: process.env.OPENAI_API_KEY!,
});

const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'GLP-1 receptor agonists and metabolic outcomes',
  pdfUrl: 'https://example.com/paper.pdf',
  authors: ['A. Researcher', 'B. Author'],
});
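
To make step 2 more concrete, here is a minimal sketch of section-based chunking. It is not the code in research-paper-pipeline.ts: the heading list, the `maxChars` limit, and the names `SectionChunk` and `splitIntoSectionChunks` are all assumptions made for illustration, and the real splitting rules may differ.

```typescript
// Hypothetical sketch of section-based chunking; not the pipeline's actual code.
interface SectionChunk {
  section: string;
  content: string;
  chunkIndex: number;
}

// Assumed heading names. Real papers may number or rename their sections.
const SECTION_HEADINGS = [
  'Abstract',
  'Introduction',
  'Methods',
  'Results',
  'Discussion',
  'Conclusion',
];

function splitIntoSectionChunks(text: string, maxChars = 1000): SectionChunk[] {
  const chunks: SectionChunk[] = [];
  let section = 'Preamble'; // label for text that appears before the first heading
  let buffer: string[] = [];

  // Emit the buffered lines for the current section in pieces of at most maxChars.
  const flush = (): void => {
    const content = buffer.join(' ').trim();
    buffer = [];
    for (let start = 0; start < content.length; start += maxChars) {
      chunks.push({
        section,
        content: content.slice(start, start + maxChars),
        chunkIndex: chunks.length,
      });
    }
  };

  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    const heading = SECTION_HEADINGS.find(
      (h) => h.toLowerCase() === line.toLowerCase(),
    );
    if (heading) {
      flush(); // close the previous section before switching labels
      section = heading;
    } else if (line.length > 0) {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Splitting on fixed character counts keeps the sketch short; a production splitter would more likely break on sentence or paragraph boundaries so that chunks stay readable.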

Progress callbacks

The pipeline reports progress during execution. Pass a callback as the second argument to ingestPaper to receive each update.

const result = await pipeline.ingestPaper(
  {
    payloadId: 'paper_123',
    title: 'Inflammation and recovery biomarkers',
    pdfUrl: 'https://example.com/recovery-paper.pdf',
  },
  // Called as the pipeline moves through its stages.
  (progress) => {
    console.log(progress.step, progress.progress, progress.message);
  },
);

Expected stages:

extracting -> chunking -> embedding -> complete
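
If you want typed handling of these updates, a sketch along the following lines may help. The field names `step`, `progress`, and `message` come from the callback example above; the `IngestionProgress` type, the assumption that `progress` is a percentage, and the `formatProgress` helper are hypothetical and should be checked against the actual types exported by @loop/ai.

```typescript
// Hypothetical types; the stage names are taken from the guide, the rest is assumed.
type IngestionStep = 'extracting' | 'chunking' | 'embedding' | 'complete';

interface IngestionProgress {
  step: IngestionStep;
  progress: number; // assumed to be a percentage between 0 and 100
  message: string;
}

const STEP_ORDER: readonly IngestionStep[] = [
  'extracting',
  'chunking',
  'embedding',
  'complete',
];

// Render one update as a single log line, including the stage position.
function formatProgress(update: IngestionProgress): string {
  const position = STEP_ORDER.indexOf(update.step) + 1;
  const percent = Math.round(update.progress);
  return `[${position}/${STEP_ORDER.length}] ${update.step} (${percent}%): ${update.message}`;
}
```

A callback could then log `formatProgress(progress)` instead of the raw fields.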

Result shape

Successful ingestion returns paper metadata plus chunk-level embeddings.

if (result.ok) {
  console.log(result.data.title);
  console.log(result.data.abstract);
  console.log(result.data.chunks[0]);
}

Representative chunk:

{
  "section": "Methods",
  "content": "Participants completed a 12-week intervention...",
  "chunkIndex": 4,
  "embedding": [0.012, -0.019, 0.441]
}
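
When chunks cross a process or storage boundary, a runtime check can catch malformed records early. The sketch below derives a `ResearchChunk` interface from the representative chunk above; the interface name and the `isResearchChunk` guard are assumptions for illustration, not exports of @loop/ai.

```typescript
// Shape inferred from the representative chunk; the real type may carry more fields.
interface ResearchChunk {
  section: string;
  content: string;
  chunkIndex: number;
  embedding: number[];
}

// Type guard: returns true only when every field has the expected type.
function isResearchChunk(value: unknown): value is ResearchChunk {
  if (typeof value !== 'object' || value === null) return false;
  const candidate = value as Record<string, unknown>;
  const embedding = candidate['embedding'];
  return (
    typeof candidate['section'] === 'string' &&
    typeof candidate['content'] === 'string' &&
    typeof candidate['chunkIndex'] === 'number' &&
    Number.isInteger(candidate['chunkIndex']) &&
    Array.isArray(embedding) &&
    embedding.every((n) => typeof n === 'number' && Number.isFinite(n))
  );
}
```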

Current platform caveat

The admin hook currently wraps a placeholder implementation that always returns an internal error until the @loop/ai export is restored there.

import { createResearchPaperIngestionPipeline } from '@/payload/hooks/researchPaperIngestionPipeline';

const pipeline = createResearchPaperIngestionPipeline({
  apiKey: process.env.OPENAI_API_KEY!,
});

const result = await pipeline.ingestPaper({ payloadId: 'paper_123' });
// In the current placeholder, this resolves to an error Result.

That means the reusable pipeline is a real building block, but the end-to-end admin ingestion loop is not yet reconnected in this repo snapshot.

1. Validate the source input

Provide the PDF source as either pdfBuffer or pdfUrl.

const input = {
  payloadId: 'paper_123',
  title: 'Sleep architecture and HRV',
  pdfUrl: 'https://example.com/sleep-paper.pdf',
};
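
A pre-flight check along these lines can reject bad input before any extraction or embedding cost is incurred. This is a sketch under assumptions: the `PaperInput` shape is inferred from the examples in this guide, `validatePaperInput` is a hypothetical helper, and the URL test is only a light syntactic check. How the pipeline behaves when both sources are supplied is not documented here, so the sketch does not flag that case.

```typescript
// Assumed input shape, inferred from the examples in this guide.
interface PaperInput {
  payloadId: string;
  title?: string;
  pdfBuffer?: Uint8Array;
  pdfUrl?: string;
}

// Returns a list of problems; an empty list means the input looks usable.
function validatePaperInput(input: PaperInput): string[] {
  const problems: string[] = [];

  if (input.payloadId.trim() === '') {
    problems.push('payloadId must not be empty');
  }

  const hasBuffer = input.pdfBuffer !== undefined && input.pdfBuffer.length > 0;
  const hasUrl = input.pdfUrl !== undefined && input.pdfUrl.trim() !== '';

  if (!hasBuffer && !hasUrl) {
    problems.push('provide pdfBuffer or pdfUrl');
  }

  // Syntactic check only: it does not confirm the URL resolves to a PDF.
  if (hasUrl && !/^https?:\/\/\S+$/i.test((input.pdfUrl ?? '').trim())) {
    problems.push('pdfUrl must be an http(s) URL');
  }

  return problems;
}
```

Returning a list rather than throwing lets the caller report every problem at once.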

2. Run the reusable pipeline directly

const result = await pipeline.ingestPaper(input);

if (!result.ok) {
  throw new Error(result.error.message);
}
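
The `ok` / `data` / `error` fields used above suggest a discriminated-union result. The sketch below shows one way such a type and an unwrap helper could look. The `Result` definition and `unwrapOrThrow` are assumptions for illustration; the type actually exported by @loop/ai may differ.

```typescript
// Assumed result shape, inferred from result.ok, result.data, and result.error.message.
type Result<T, E = { message: string }> =
  | { ok: true; data: T }
  | { ok: false; error: E };

// Return the data on success, or throw with some context on failure.
function unwrapOrThrow<T>(result: Result<T>, context: string): T {
  if (!result.ok) {
    throw new Error(`${context}: ${result.error.message}`);
  }
  return result.data;
}
```

Prefixing the error with a context string such as the payloadId makes failures easier to trace when many papers are ingested in one run.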

3. Store or hand off the chunks

The ingestion pipeline returns chunk records, so the calling system decides how to persist them.

for (const chunk of result.data.chunks) {
  console.log(chunk.section, chunk.chunkIndex, chunk.embedding.length);
}
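
Because persistence is left to the caller, one option is to flatten each chunk into a storage record that carries the paper ID with it. The sketch below assumes the chunk shape shown earlier. The `ChunkRecord` layout, the `paperId:chunkIndex` ID scheme, and `toChunkRecords` are hypothetical and not tied to any schema in the repository.

```typescript
// Chunk shape as shown in this guide.
interface ResearchChunk {
  section: string;
  content: string;
  chunkIndex: number;
  embedding: number[];
}

// Hypothetical storage record; adapt the fields to your own schema.
interface ChunkRecord {
  id: string;
  paperId: string;
  section: string;
  chunkIndex: number;
  content: string;
  embedding: number[];
  dimensions: number;
}

function toChunkRecords(paperId: string, chunks: ResearchChunk[]): ChunkRecord[] {
  return chunks.map((chunk) => ({
    // Deterministic ID so that re-ingesting a paper overwrites instead of duplicating.
    id: `${paperId}:${chunk.chunkIndex}`,
    paperId,
    section: chunk.section,
    chunkIndex: chunk.chunkIndex,
    content: chunk.content,
    embedding: chunk.embedding,
    dimensions: chunk.embedding.length,
  }));
}
```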

4. Use embeddings for retrieval

The repository also includes an embeddings schema and semantic-search-oriented storage patterns elsewhere in the stack.

const texts = result.data.chunks.map((chunk) => chunk.content);
console.log(`Ready to index ${texts.length} research chunks`);
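
For local experiments before a vector store is wired up, chunks can be ranked in memory by cosine similarity against a query embedding. The following is a generic sketch, not the repository's semantic-search implementation; `cosineSimilarity` and `topK` are hypothetical helpers, and the query embedding is assumed to come from the same embedding model as the chunks.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Embedding dimensions differ');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  const denominator = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction, so treat its similarity as 0.
  return denominator === 0 ? 0 : dot / denominator;
}

// Rank chunks by similarity to the query and keep the best k.
function topK<T extends { embedding: number[] }>(
  queryEmbedding: number[],
  chunks: T[],
  k: number,
): Array<{ chunk: T; score: number }> {
  return chunks
    .map((chunk) => ({
      chunk,
      score: cosineSimilarity(queryEmbedding, chunk.embedding),
    }))
    .sort((left, right) => right.score - left.score)
    .slice(0, k);
}
```

A linear scan like this is fine for a handful of papers; larger collections usually call for an indexed vector store.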

Operational checklist

  • Confirm OPENAI_API_KEY is present
  • Use the reusable @loop/ai pipeline for local or service-level testing
  • Do not assume the admin hook is fully wired
  • Store chunk metadata alongside embeddings so retrieval stays explainable
  • Track ingestion errors separately from retrieval errors
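
The first checklist item can be enforced at startup instead of being discovered on the first embedding call. The helper below is a hypothetical sketch; it takes the environment as a parameter so it can be exercised without touching the real process environment.

```typescript
// Hypothetical helper: fail fast when a required variable is missing or blank.
function requireEnv(
  env: Record<string, string | undefined>,
  name: string,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === '') {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}
```

In a Node.js service this could be called as `requireEnv(process.env, 'OPENAI_API_KEY')` before the pipeline is created, which also removes the need for the non-null assertion on the API key.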
