Research Engine

In the current repository, the “research engine” is best understood as a pipeline rather than a standalone service.

It is built from:

PDF extraction
section-aware chunking
embedding generation
downstream storage or retrieval hooks

The main implementation lives in packages/ai/src/ingestion/research-paper-pipeline.ts.

What it does

The ingestion pipeline:

accepts a Payload-linked paper input
extracts text from a PDF buffer or URL
derives an abstract when possible
chunks the paper by section
generates embeddings in batches
returns chunked output for storage


import { createResearchPaperIngestionPipeline } from '@loop/ai';
 
const pipeline = createResearchPaperIngestionPipeline({
  apiKey: process.env.OPENAI_API_KEY!,
});
 
const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'Peptide Recovery Signaling',
  pdfUrl: 'https://example.com/paper.pdf',
});

Pipeline stages

1. Extraction

The pipeline starts by pulling text from a PDF:


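import { readFile } from 'node:fs/promises';

// Illustrative local path; any source that yields a PDF buffer works.
const pdfBuffer = await readFile('./paper.pdf');

const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'Clinical Review',
  pdfBuffer,
});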

Input rules:

provide pdfBuffer or pdfUrl
include a stable payloadId
include the paper title
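
The input rules above could be checked with a small guard before calling the pipeline. This is a minimal sketch, not the pipeline's own validation: the `PaperInput` shape is inferred from the examples on this page, and `validatePaperInput` is a hypothetical helper.

```typescript
// Hypothetical input shape, inferred from the examples in this document.
interface PaperInput {
  payloadId: string;
  title: string;
  pdfBuffer?: Uint8Array;
  pdfUrl?: string;
}

// Returns a list of human-readable problems; an empty list means the input
// satisfies the three input rules.
function validatePaperInput(input: PaperInput): string[] {
  const problems: string[] = [];
  if (input.payloadId.trim() === '') {
    problems.push('payloadId is required');
  }
  if (input.title.trim() === '') {
    problems.push('title is required');
  }
  if (input.pdfBuffer === undefined && input.pdfUrl === undefined) {
    problems.push('provide pdfBuffer or pdfUrl');
  }
  return problems;
}
```

Returning a list rather than throwing on the first failure lets an admin UI show every problem at once.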

2. Chunking

After extraction, the paper is chunked by section.

This produces a structure like:


{
  "section": "methods",
  "content": "Participants completed a 12-week protocol...",
  "chunkIndex": 3
}
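
To make the idea of section-aware chunking concrete, here is a minimal sketch. It is not the repository's implementation: it assumes headings sit alone on a line and match a small fixed list, and it assumes a simple character limit per chunk. `chunkBySection`, `SECTION_HEADINGS`, and `maxChars` are hypothetical names.

```typescript
// Chunk shape as shown in the example above.
interface PaperChunk {
  section: string;
  content: string;
  chunkIndex: number;
}

// Assumption: section headings appear alone on a line and match this list.
const SECTION_HEADINGS = [
  'abstract', 'introduction', 'methods', 'results', 'discussion', 'conclusion',
];

function chunkBySection(text: string, maxChars = 1000): PaperChunk[] {
  const chunks: PaperChunk[] = [];
  let section = 'preamble'; // label for text before the first recognized heading
  let buffer: string[] = [];

  // Emit the buffered lines as one or more chunks of at most maxChars.
  // Slicing is by character count, so a chunk may end mid-word.
  const flush = (): void => {
    const content = buffer.join(' ').trim();
    buffer = [];
    for (let start = 0; start < content.length; start += maxChars) {
      chunks.push({
        section,
        content: content.slice(start, start + maxChars),
        chunkIndex: chunks.length,
      });
    }
  };

  for (const rawLine of text.split('\n')) {
    const line = rawLine.trim();
    if (SECTION_HEADINGS.indexOf(line.toLowerCase()) !== -1) {
      flush(); // close the previous section before switching labels
      section = line.toLowerCase();
    } else if (line !== '') {
      buffer.push(line);
    }
  }
  flush();
  return chunks;
}
```

Here `chunkIndex` counts across the whole paper rather than restarting per section, which matches the example above only by assumption.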

3. Embedding

Embeddings are generated in batches and attached back to each chunk:


{
  section: 'results',
  content: 'hs-CRP improved significantly...',
  chunkIndex: 5,
  embedding: [0.012, -0.038, 0.221]
}
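
The batching step can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline's code: the embedder is injected as a hypothetical `EmbedBatch` function that returns one vector per input text in order, and the default batch size of 16 is arbitrary.

```typescript
interface PaperChunk {
  section: string;
  content: string;
  chunkIndex: number;
}

interface EmbeddedChunk extends PaperChunk {
  embedding: number[];
}

// Hypothetical embedder signature: one vector per input text, in input order.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Split items into consecutive batches of at most `size` elements.
function toBatches<T>(items: T[], size: number): T[][] {
  if (size < 1) throw new Error('batch size must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    batches.push(items.slice(i, i + size));
  }
  return batches;
}

// Embed chunks batch by batch and attach each vector to its source chunk.
async function embedChunks(
  chunks: PaperChunk[],
  embedBatch: EmbedBatch,
  batchSize = 16,
): Promise<EmbeddedChunk[]> {
  const embedded: EmbeddedChunk[] = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const vectors = await embedBatch(batch.map((chunk) => chunk.content));
    if (vectors.length !== batch.length) {
      throw new Error('embedder returned an unexpected number of vectors');
    }
    batch.forEach((chunk, i) => embedded.push({ ...chunk, embedding: vectors[i] }));
  }
  return embedded;
}
```

Injecting the embedder keeps the batching logic testable without network access or an API key.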

Progress reporting

The pipeline exposes progress callbacks so the caller can surface ingestion status:


const result = await pipeline.ingestPaper(
  {
    payloadId: 'paper_123',
    title: 'Clinical Review',
    pdfUrl: 'https://example.com/paper.pdf',
  },
  (progress) => {
    console.log(progress.step, progress.progress, progress.message);
  },
);

Current progress steps:

extracting
chunking
embedding
storing
complete
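
A caller may find it useful to type the callback payload. The sketch below takes the step names from the list above and the field names from the callback example. The `IngestionProgress` name and the reading of `progress` as a 0 to 100 percentage are assumptions.

```typescript
// Step names as listed above.
type ProgressStep = 'extracting' | 'chunking' | 'embedding' | 'storing' | 'complete';

// Field names mirror the callback example; the type name is hypothetical.
interface IngestionProgress {
  step: ProgressStep;
  progress: number; // assumed here to be a 0-100 percentage
  message: string;
}

// Render one progress event as a single status line.
function formatProgress(event: IngestionProgress): string {
  // Clamp and round so out-of-range values do not break the display.
  const percent = Math.max(0, Math.min(100, Math.round(event.progress)));
  return `[${event.step}] ${percent}% ${event.message}`;
}
```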

Real implementation vs. current admin wiring

The important nuance is that two things are true at once:

The reusable ingestion pipeline exists in @loop/ai
The current admin hook wrapper is a placeholder: it returns an INTERNAL_ERROR result until the @loop/ai export is implemented or restored at the integration point

That placeholder lives in:


apps/admin/payload/hooks/researchPaperIngestionPipeline.ts

Current placeholder behavior:


return err(
  createError(
    'INTERNAL_ERROR',
    'Research paper ingestion pipeline is not available; implement or restore @loop/ai export.',
  ),
);
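
Callers of the hook can branch on the result so the unwired state surfaces as a clear status rather than an unhandled failure. The sketch below uses local stand-ins for `err` and `createError`. The repository's real helpers and result type may differ, and the `error` field name and `describeIngestion` are assumptions.

```typescript
// Local stand-ins for the repository's result helpers; the real types may differ.
interface AppError {
  code: string;
  message: string;
}

type Result<T> = { ok: true; data: T } | { ok: false; error: AppError };

const createError = (code: string, message: string): AppError => ({ code, message });
const err = (error: AppError): Result<never> => ({ ok: false, error });

// Mirrors the placeholder behavior shown above.
function placeholderIngest(): Result<{ chunks: unknown[] }> {
  return err(
    createError(
      'INTERNAL_ERROR',
      'Research paper ingestion pipeline is not available; implement or restore @loop/ai export.',
    ),
  );
}

// Turn either outcome into a status message for the caller to display.
function describeIngestion(result: Result<{ chunks: unknown[] }>): string {
  if (result.ok) {
    return `ingested ${result.data.chunks.length} chunks`;
  }
  return `ingestion unavailable (${result.error.code}): ${result.error.message}`;
}
```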

Why this matters

This split is important for contributors:

the ingestion design is real
the core pipeline is implemented
the admin entry point is not fully wired in this branch snapshot

So platform docs should describe this as an available ingestion pipeline with partial application wiring, not as a universally active production feature.

Typical use cases

Build a retrieval corpus


const result = await pipeline.ingestPaper({
  payloadId: 'paper_456',
  title: 'Sleep Recovery and HRV',
  pdfUrl: 'https://example.com/sleep.pdf',
});
 
if (result.ok) {
  console.log(result.data.chunks.length);
}

Enrich AI prompts with research snippets


const chunk = {
  section: 'discussion',
  content: 'Participants with higher adherence showed improved recovery markers.',
};
 
const prompt = `
Use this supporting evidence:
${chunk.content}
`;
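
When several chunks are retrieved, the same idea extends to a small prompt builder. This is a sketch only: `buildEvidencePrompt`, the `Snippet` type, and the character budget are hypothetical, and a production version would more likely budget by tokens.

```typescript
interface Snippet {
  section: string;
  content: string;
}

// Join snippets into one prompt, stopping before the character budget is exceeded.
function buildEvidencePrompt(snippets: Snippet[], maxChars = 2000): string {
  const lines: string[] = [];
  let used = 0;
  for (const snippet of snippets) {
    const line = `- (${snippet.section}) ${snippet.content}`;
    if (used + line.length > maxChars) break; // keep earlier snippets, drop the rest
    lines.push(line);
    used += line.length;
  }
  return `Use this supporting evidence:\n${lines.join('\n')}`;
}
```

Labeling each snippet with its section helps the model tell a reported result from a discussion-level interpretation.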

Store embeddings for semantic search

The repository also includes a shared embeddings schema and migration support for vector search.


SELECT *
FROM match_embeddings($1, 0.7, 10);
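
For intuition, the matching behind a query like the one above can be sketched in memory. This rests on assumptions the page does not confirm: that `match_embeddings` takes a query vector, a similarity threshold (0.7), and a result limit (10), and that similarity is cosine similarity. `matchEmbeddings` and `StoredEmbedding` are hypothetical names, not the database function itself.

```typescript
interface StoredEmbedding {
  id: string;
  embedding: number[];
}

// Cosine similarity of two equal-length vectors; returns 0 for zero vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep rows above the threshold, best match first, capped at `limit`.
function matchEmbeddings(
  query: number[],
  rows: StoredEmbedding[],
  threshold: number,
  limit: number,
): { id: string; similarity: number }[] {
  return rows
    .map((row) => ({ id: row.id, similarity: cosineSimilarity(query, row.embedding) }))
    .filter((match) => match.similarity > threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit);
}
```

In practice the database performs this ranking with a vector index; the in-memory version only illustrates what a threshold and limit would mean.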

Related files

packages/ai/src/ingestion/research-paper-pipeline.ts
packages/ai/src/embeddings.ts
packages/shared/src/db/embeddings-schema.ts
apps/admin/payload/hooks/ingestResearchPaper.ts

Design guidance

Keep ingestion separate from retrieval policy.
Keep retrieval separate from workflow logic.
Keep workflow logic separate from prompt generation.

That separation makes the platform easier to test and easier to evolve.

Next steps

Read ML Layer
Follow Operationalize Research Ingestion
Review ML Endpoints