Operationalize Research Ingestion

Use this guide to see what research ingestion supports today, which parts are reusable, and where the current wiring stops.

Goal

Turn a research paper into retrieval-friendly chunks that can support downstream search, summarization, or agent context.

What exists in the repository

A reusable ingestion pipeline in packages/ai/src/ingestion/research-paper-pipeline.ts
Embedding generation in packages/ai/src/embeddings
An admin-side placeholder hook in apps/admin/payload/hooks/researchPaperIngestionPipeline.ts

Happy-path pipeline

The reusable pipeline does four things:

Extract text from a PDF buffer or URL
Split the paper into section-based chunks
Generate embeddings in batches
Return structured chunk output


import { createResearchPaperIngestionPipeline } from '@loop/ai';

// Configure the pipeline with the OpenAI API key from the environment.
const pipeline = createResearchPaperIngestionPipeline({
  apiKey: process.env.OPENAI_API_KEY!,
});

// Ingest a paper by URL; payloadId identifies the source record.
const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'GLP-1 receptor agonists and metabolic outcomes',
  pdfUrl: 'https://example.com/paper.pdf',
  authors: ['A. Researcher', 'B. Author'],
});
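
The third step, generating embeddings in batches, can be sketched as follows. This is an illustration of the general technique, not the pipeline's actual code: EmbedFn, chunkIntoBatches, embedInBatches, and the default batch size of 16 are all hypothetical.

```typescript
// Hypothetical embedding function type: takes texts, returns one vector per text.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split an array into consecutive batches of at most `size` items.
function chunkIntoBatches<T>(items: T[], size: number): T[][] {
  if (!Number.isInteger(size) || size <= 0) {
    throw new Error('Batch size must be a positive integer');
  }
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed texts one batch at a time, preserving the input order.
// The default batch size is an assumption, not a documented pipeline setting.
async function embedInBatches(
  texts: string[],
  embed: EmbedFn,
  batchSize: number = 16,
): Promise<number[][]> {
  const vectors: number[][] = [];
  for (const batch of chunkIntoBatches(texts, batchSize)) {
    const batchVectors = await embed(batch);
    if (batchVectors.length !== batch.length) {
      throw new Error('Embedding provider returned an unexpected number of vectors');
    }
    vectors.push(...batchVectors);
  }
  return vectors;
}
```

Processing batches sequentially keeps the request rate predictable; running several batches concurrently would be a separate design choice.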

Progress callbacks

Pass an optional callback as the second argument to ingestPaper to receive progress updates during execution.


const result = await pipeline.ingestPaper(
  {
    payloadId: 'paper_123',
    title: 'Inflammation and recovery biomarkers',
    pdfUrl: 'https://example.com/recovery-paper.pdf',
  },
  (progress) => {
    console.log(progress.step, progress.progress, progress.message);
  },
);

Expected stages:


extracting -> chunking -> embedding -> complete
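
For logging or a UI, the progress events can be given an explicit type and a formatter. This is a sketch under assumptions: IngestionProgress and formatProgress are hypothetical names, the field names are inferred from the callback above, and the progress value is assumed to be a 0-100 percentage.

```typescript
// Stage names taken from the expected stages listed in this guide.
type IngestionStep = 'extracting' | 'chunking' | 'embedding' | 'complete';

// Hypothetical progress shape, inferred from the fields logged in the callback.
interface IngestionProgress {
  step: IngestionStep;
  progress: number; // assumed to be a 0-100 percentage
  message: string;
}

const STEP_ORDER: IngestionStep[] = ['extracting', 'chunking', 'embedding', 'complete'];

// Format a progress update as one log line that includes the stage position.
function formatProgress(update: IngestionProgress): string {
  const position = STEP_ORDER.indexOf(update.step) + 1;
  const percent = Math.round(update.progress);
  return `[${position}/${STEP_ORDER.length}] ${update.step} ${percent}% ${update.message}`;
}
```

A handler built this way could be passed as the second argument, for example (p) => console.log(formatProgress(p)), provided the real progress object matches the assumed shape.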

Result shape

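ingestPaper resolves to a Result. When result.ok is true, result.data holds the paper metadata plus the chunks with their embeddings; otherwise result.error describes the failure.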


if (result.ok) {
  console.log(result.data.title);
  console.log(result.data.abstract);
  console.log(result.data.chunks[0]);
}

Representative chunk (embedding truncated for readability):


{
  "section": "Methods",
  "content": "Participants completed a 12-week intervention...",
  "chunkIndex": 4,
  "embedding": [0.012, -0.019, 0.441]
}
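
When chunks cross a process boundary (a queue, a database, another service), a runtime check helps catch malformed records early. The sketch below mirrors the representative chunk above; ResearchChunk and isResearchChunk are hypothetical names, and the real exported type may carry more fields.

```typescript
// Hypothetical chunk shape, mirroring the representative JSON in this guide.
interface ResearchChunk {
  section: string;
  content: string;
  chunkIndex: number;
  embedding: number[];
}

// Type guard: returns true only when every expected field has the expected type.
function isResearchChunk(value: unknown): value is ResearchChunk {
  if (typeof value !== 'object' || value === null) {
    return false;
  }
  const candidate = value as Record<string, unknown>;
  const embedding = candidate['embedding'];
  return (
    typeof candidate['section'] === 'string' &&
    typeof candidate['content'] === 'string' &&
    Number.isInteger(candidate['chunkIndex']) &&
    Array.isArray(embedding) &&
    embedding.every((x) => typeof x === 'number' && Number.isFinite(x))
  );
}
```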

Current platform caveat

The admin hook currently wraps a placeholder implementation. Until the @loop/ai export is restored there, every call returns an internal error.


import { createResearchPaperIngestionPipeline } from '@/payload/hooks/researchPaperIngestionPipeline';
 
const pipeline = createResearchPaperIngestionPipeline({
  apiKey: process.env.OPENAI_API_KEY!,
});
 
const result = await pipeline.ingestPaper({ payloadId: 'paper_123' });
// In the current placeholder, this resolves to an error Result.

In other words, the reusable pipeline is a working building block, but the admin ingestion loop is not yet reconnected end to end in this repo snapshot.

Recommended operating pattern

1. Validate the source input

Supply the PDF as either pdfBuffer or pdfUrl.


const input = {
  payloadId: 'paper_123',
  title: 'Sleep architecture and HRV',
  pdfUrl: 'https://example.com/sleep-paper.pdf',
};
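
A pre-flight check can reject unusable input before any extraction or embedding work starts. This is a minimal sketch: PaperInput and validatePaperInput are hypothetical, the field list covers only the names that appear in this guide, and pdfBuffer is typed as Uint8Array on the assumption that a Node Buffer (a Uint8Array subclass) would be passed.

```typescript
// Hypothetical input shape; only fields mentioned in this guide are included.
interface PaperInput {
  payloadId: string;
  title?: string;
  pdfUrl?: string;
  pdfBuffer?: Uint8Array;
  authors?: string[];
}

// Return a list of problems; an empty list means the input looks usable.
function validatePaperInput(input: PaperInput): string[] {
  const problems: string[] = [];
  if (input.payloadId.trim() === '') {
    problems.push('payloadId is required');
  }
  const url = input.pdfUrl ?? '';
  const hasUrl = url.length > 0;
  const hasBuffer = input.pdfBuffer !== undefined && input.pdfBuffer.byteLength > 0;
  if (!hasUrl && !hasBuffer) {
    problems.push('Provide pdfBuffer or pdfUrl');
  }
  // Basic shape check only; it does not confirm the URL is reachable or a PDF.
  if (hasUrl && !/^https?:\/\/\S+$/i.test(url)) {
    problems.push('pdfUrl must be an http(s) URL');
  }
  return problems;
}
```

Returning a list rather than throwing on the first problem lets the caller report every issue in one pass.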

2. Run the reusable pipeline directly


const result = await pipeline.ingestPaper(input);
 
if (!result.ok) {
  throw new Error(result.error.message);
}

3. Store or hand off the chunks

The ingestion pipeline returns chunk records, so the calling system decides how to persist them.


for (const chunk of result.data.chunks) {
  console.log(chunk.section, chunk.chunkIndex, chunk.embedding.length);
}
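
One way to prepare chunks for persistence is to flatten each one into a record that carries its paper-level metadata. The sketch below is illustrative only: StoredChunkRecord, toStorageRecords, and the id format are hypothetical and do not reflect the repository's embeddings schema.

```typescript
// Hypothetical minimal chunk shape, based on the fields used in this guide.
interface IngestedChunk {
  section: string;
  content: string;
  chunkIndex: number;
  embedding: number[];
}

// Hypothetical storage record: the embedding plus the metadata needed to explain a match.
interface StoredChunkRecord {
  id: string;
  paperId: string;
  paperTitle: string;
  section: string;
  chunkIndex: number;
  content: string;
  embedding: number[];
  embeddingDimensions: number;
}

// Attach paper metadata to every chunk and derive a stable id per chunk.
function toStorageRecords(
  paperId: string,
  paperTitle: string,
  chunks: IngestedChunk[],
): StoredChunkRecord[] {
  return chunks.map((chunk) => ({
    id: `${paperId}:${chunk.chunkIndex}`,
    paperId,
    paperTitle,
    section: chunk.section,
    chunkIndex: chunk.chunkIndex,
    content: chunk.content,
    embedding: chunk.embedding,
    embeddingDimensions: chunk.embedding.length,
  }));
}
```

Recording the embedding dimension with each row makes it easier to spot records produced by a different embedding model.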

4. Use embeddings for retrieval

The repository also includes an embeddings schema and semantic-search-oriented storage patterns elsewhere in the stack.


const texts = result.data.chunks.map((chunk) => chunk.content);
console.log(`Ready to index ${texts.length} research chunks`);
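
For small collections or local testing, retrieval can be sketched as an in-memory cosine-similarity ranking. This is not the repository's semantic search implementation: cosineSimilarity and topMatches are hypothetical helpers, and the query vector is assumed to come from the same embedding model as the chunks.

```typescript
// Cosine similarity between two vectors of equal length; returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error('Vectors must have the same length');
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i += 1) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) {
    return 0;
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Rank chunks by similarity to the query embedding and keep the best k.
function topMatches<T extends { embedding: number[] }>(
  queryEmbedding: number[],
  chunks: T[],
  k: number = 3,
): Array<{ chunk: T; score: number }> {
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

A linear scan like this is fine for a handful of papers; larger corpora usually call for a vector index.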

Operational checklist

Confirm OPENAI_API_KEY is present
Use the reusable @loop/ai pipeline for local or service-level testing
Do not assume the admin hook is fully wired
Store chunk metadata alongside embeddings so retrieval stays explainable
Track ingestion errors separately from retrieval errors
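
The last checklist item can be approached by tagging every failure with the stage that produced it. The sketch below is one possible shape, not an existing utility in the repository: FailureLog, FailureRecord, and FailureStage are hypothetical names.

```typescript
type FailureStage = 'ingestion' | 'retrieval';

// Hypothetical failure record; paperId is null when no paper is involved.
interface FailureRecord {
  stage: FailureStage;
  message: string;
  paperId: string | null;
  occurredAt: string; // ISO 8601 timestamp
}

// In-memory log that keeps ingestion and retrieval failures distinguishable.
class FailureLog {
  private readonly records: FailureRecord[] = [];

  record(stage: FailureStage, message: string, paperId: string | null = null): void {
    this.records.push({ stage, message, paperId, occurredAt: new Date().toISOString() });
  }

  // Count failures per stage so each can be monitored on its own.
  countByStage(): Record<FailureStage, number> {
    const counts: Record<FailureStage, number> = { ingestion: 0, retrieval: 0 };
    for (const entry of this.records) {
      counts[entry.stage] += 1;
    }
    return counts;
  }

  forStage(stage: FailureStage): FailureRecord[] {
    return this.records.filter((entry) => entry.stage === stage);
  }
}
```

In use, a failed ingestPaper call would be recorded with the ingestion stage and its payloadId, while search-time failures would be recorded with the retrieval stage.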

Next steps

Read Research Engine
Review ML Layer
Keep the API contract nearby in ML Endpoints