Research Engine

In the current repository, the “research engine” is best understood as a pipeline rather than a standalone service.

It is built from:

  • PDF extraction
  • section-aware chunking
  • embedding generation
  • downstream storage or retrieval hooks

The main implementation lives in packages/ai/src/ingestion/research-paper-pipeline.ts.

What it does

The ingestion pipeline:

  1. accepts a Payload-linked paper input
  2. extracts text from a PDF buffer or URL
  3. derives an abstract when possible
  4. chunks the paper by section
  5. generates embeddings in batches
  6. returns chunked output for storage

import { createResearchPaperIngestionPipeline } from '@loop/ai';

const pipeline = createResearchPaperIngestionPipeline({
  apiKey: process.env.OPENAI_API_KEY!,
});

const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'Peptide Recovery Signaling',
  pdfUrl: 'https://example.com/paper.pdf',
});

Pipeline stages

1. Extraction

The pipeline starts by pulling text from a PDF:

const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'Clinical Review',
  pdfBuffer, // PDF bytes already loaded by the caller
});

Input rules:

  • provide pdfBuffer or pdfUrl
  • include a stable payloadId
  • include the paper title
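
The input rules above can be expressed as a small type and guard. This is only a sketch: the `PaperInput` shape and the `validatePaperInput` helper are hypothetical names inferred from the examples on this page, not the types actually exported by @loop/ai.

```typescript
// Hypothetical input shape, inferred from the examples on this page.
interface PaperInput {
  payloadId: string;
  title: string;
  pdfBuffer?: Uint8Array;
  pdfUrl?: string;
}

// Returns a list of problems; an empty list means the input satisfies the rules.
function validatePaperInput(input: PaperInput): string[] {
  const problems: string[] = [];
  if (input.payloadId.trim() === '') problems.push('payloadId is required');
  if (input.title.trim() === '') problems.push('title is required');
  if (input.pdfBuffer === undefined && input.pdfUrl === undefined) {
    problems.push('provide pdfBuffer or pdfUrl');
  }
  return problems;
}
```

Running a check like this before calling ingestPaper lets the caller report a bad input without starting extraction.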

2. Chunking

After extraction, the paper is chunked by section.

This produces a structure like:

{ "section": "methods", "content": "Participants completed a 12-week protocol...", "chunkIndex": 3 }
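
To make the idea of section-aware chunking concrete, here is a minimal sketch. It assumes that section headings appear on their own line and match a fixed list of names, and that chunks are capped by character count; the real pipeline may detect sections and size chunks differently. `chunkBySection` and `SECTION_NAMES` are hypothetical names.

```typescript
interface Chunk {
  section: string;
  content: string;
  chunkIndex: number;
}

// Assumed heading names; not taken from the actual pipeline.
const SECTION_NAMES = new Set([
  'abstract', 'introduction', 'methods', 'results', 'discussion', 'conclusion',
]);

function chunkBySection(text: string, maxChars = 1000): Chunk[] {
  const chunks: Chunk[] = [];
  let section = 'body'; // fallback label for text before any recognised heading
  let buffer: string[] = [];

  // Emit the buffered lines as one or more chunks for the current section.
  const flush = (): void => {
    const content = buffer.join(' ').trim();
    buffer = [];
    // Naive split by character count; it may cut a word in half.
    for (let start = 0; start < content.length; start += maxChars) {
      chunks.push({
        section,
        content: content.slice(start, start + maxChars),
        chunkIndex: chunks.length,
      });
    }
  };

  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (SECTION_NAMES.has(trimmed.toLowerCase())) {
      flush(); // close the previous section before switching
      section = trimmed.toLowerCase();
    } else if (trimmed !== '') {
      buffer.push(trimmed);
    }
  }
  flush();
  return chunks;
}
```

In this sketch chunkIndex runs across the whole paper rather than restarting per section, which matches the single running index shown in the structure above.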

3. Embedding

Embeddings are generated in batches and attached back to each chunk:

{ section: 'results', content: 'hs-CRP improved significantly...', chunkIndex: 5, embedding: [0.012, -0.038, 0.221] }
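
Batching and re-attaching embeddings could look roughly like the following. This is a sketch under assumptions: `EmbedFn` stands in for whatever embedding client the pipeline uses, and `toBatches`, `embedChunks`, and the default batch size of 16 are hypothetical.

```typescript
// Stand-in for the embedding client: one vector per input text, in order.
type EmbedFn = (texts: string[]) => Promise<number[][]>;

// Split a list into consecutive groups of at most batchSize items.
function toBatches<T>(items: T[], batchSize: number): T[][] {
  if (batchSize < 1) throw new Error('batchSize must be at least 1');
  const batches: T[][] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    batches.push(items.slice(i, i + batchSize));
  }
  return batches;
}

// Embed chunk contents batch by batch and attach each vector to its chunk.
async function embedChunks<C extends { content: string }>(
  chunks: C[],
  embed: EmbedFn,
  batchSize = 16,
): Promise<Array<C & { embedding: number[] }>> {
  const out: Array<C & { embedding: number[] }> = [];
  for (const batch of toBatches(chunks, batchSize)) {
    const vectors = await embed(batch.map((c) => c.content));
    if (vectors.length !== batch.length) {
      throw new Error('embedding count does not match batch size');
    }
    batch.forEach((chunk, i) => {
      const embedding = vectors[i];
      if (embedding === undefined) throw new Error('missing embedding');
      out.push({ ...chunk, embedding });
    });
  }
  return out;
}
```

Passing the embedding function in as a parameter keeps the batching logic testable with a stub instead of a live API key.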

Progress reporting

The pipeline exposes progress callbacks so the caller can surface ingestion status:

const result = await pipeline.ingestPaper(
  {
    payloadId: 'paper_123',
    title: 'Clinical Review',
    pdfUrl: 'https://example.com/paper.pdf',
  },
  (progress) => {
    console.log(progress.step, progress.progress, progress.message);
  },
);

Current progress steps:

  • extracting
  • chunking
  • embedding
  • storing
  • complete
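
The step names above map naturally onto a union type. The sketch below assumes the callback payload has the three fields used in the example (`step`, `progress`, `message`); the `IngestionProgress` and `formatProgress` names are hypothetical, and the scale of the `progress` number is not documented on this page, so the helper does not interpret it.

```typescript
type ProgressStep = 'extracting' | 'chunking' | 'embedding' | 'storing' | 'complete';

// Assumed payload shape, based on the fields logged in the example callback.
interface IngestionProgress {
  step: ProgressStep;
  progress: number; // scale not documented here; passed through unchanged
  message: string;
}

const STEP_ORDER: ProgressStep[] = [
  'extracting', 'chunking', 'embedding', 'storing', 'complete',
];

// Render a status line such as "[2/5] chunking: Splitting sections".
function formatProgress(p: IngestionProgress): string {
  const position = STEP_ORDER.indexOf(p.step) + 1;
  return `[${position}/${STEP_ORDER.length}] ${p.step}: ${p.message}`;
}
```

A helper like this can be passed straight into the progress callback to drive a log line or a status label in the admin UI.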

Real implementation vs. current admin wiring

The important nuance is that two things are true at once:

  1. The reusable ingestion pipeline exists in @loop/ai
  2. The current admin hook wrapper is a placeholder that returns an internal error until the @loop/ai export is implemented or restored

That placeholder lives in:

apps/admin/payload/hooks/researchPaperIngestionPipeline.ts

Current placeholder behavior:

return err(
  createError(
    'INTERNAL_ERROR',
    'Research paper ingestion pipeline is not available; implement or restore @loop/ai export.',
  ),
);
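
For readers unfamiliar with this result style, the sketch below shows one way such helpers might be shaped and how a caller can branch on the outcome. The definitions of `Result`, `ok`, `err`, and `createError` here are assumptions for illustration; the real helpers in the repository are not shown on this page and may differ.

```typescript
// Assumed error and result shapes; not the repository's actual definitions.
interface AppError {
  code: string;
  message: string;
}

type Result<T> = { ok: true; data: T } | { ok: false; error: AppError };

const ok = <T>(data: T): Result<T> => ({ ok: true, data });
const err = <T = never>(error: AppError): Result<T> => ({ ok: false, error });
const createError = (code: string, message: string): AppError => ({ code, message });

// Mirrors the placeholder behavior: always fails with INTERNAL_ERROR.
function ingestPlaceholder(): Result<{ chunks: unknown[] }> {
  return err(
    createError('INTERNAL_ERROR', 'Research paper ingestion pipeline is not available.'),
  );
}

// Callers branch on `ok` instead of catching exceptions.
function describeResult(result: Result<{ chunks: unknown[] }>): string {
  return result.ok
    ? `stored ${result.data.chunks.length} chunks`
    : `${result.error.code}: ${result.error.message}`;
}
```

Because the failure is returned rather than thrown, a caller of the admin hook can show the error message without a try/catch.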

Why this matters

This split is important for contributors:

  • the ingestion design is real
  • the core pipeline is implemented
  • the admin entry point is not fully wired in this branch snapshot

So platform docs should describe this as an available ingestion pipeline with partial application wiring, not as a universally active production feature.

Typical use cases

Build a retrieval corpus

const result = await pipeline.ingestPaper({
  payloadId: 'paper_456',
  title: 'Sleep Recovery and HRV',
  pdfUrl: 'https://example.com/sleep.pdf',
});

if (result.ok) {
  console.log(result.data.chunks.length);
}

Enrich AI prompts with research snippets

const chunk = {
  section: 'discussion',
  content: 'Participants with higher adherence showed improved recovery markers.',
};

const prompt = `
Use this supporting evidence:
${chunk.content}
`;
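
When more than one chunk is available, the same idea extends to a small prompt builder. This is a sketch, not part of @loop/ai: `buildEvidencePrompt`, the `Snippet` shape, and the character budget are hypothetical.

```typescript
interface Snippet {
  section: string;
  content: string;
}

// Combine several snippets into one prompt, stopping at a character budget.
function buildEvidencePrompt(
  question: string,
  snippets: Snippet[],
  maxChars = 2000,
): string {
  const lines: string[] = [];
  let used = 0;
  for (const s of snippets) {
    const line = `[${s.section}] ${s.content.trim()}`;
    if (used + line.length > maxChars) break; // stop before exceeding the budget
    lines.push(line);
    used += line.length;
  }
  return ['Use this supporting evidence:', ...lines, '', `Question: ${question}`].join('\n');
}
```

Labeling each snippet with its section lets the model distinguish, for example, a methods statement from a discussion claim.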

The repository also includes a shared embeddings schema and migration support for vector search.

SELECT * FROM match_embeddings($1, 0.7, 10);
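
As an in-memory illustration of what a query like this might compute, the sketch below ranks rows by cosine similarity, drops those under a threshold, and keeps the top matches. It assumes the second and third arguments of match_embeddings are a similarity threshold and a match count, which this page does not confirm; the TypeScript function names are hypothetical and this is not the database function itself.

```typescript
// Cosine similarity of two equal-length vectors; 0 if either has zero length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0;
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Keep rows at or above the threshold, best match first, at most `limit` rows.
function matchEmbeddings<T extends { embedding: number[] }>(
  query: number[],
  rows: T[],
  threshold: number,
  limit: number,
): Array<T & { similarity: number }> {
  return rows
    .map((row) => ({ ...row, similarity: cosineSimilarity(query, row.embedding) }))
    .filter((row) => row.similarity >= threshold)
    .sort((a, b) => b.similarity - a.similarity)
    .slice(0, limit);
}
```

In production the ranking would run in the database next to the stored vectors; the in-memory version is only useful for tests and for reasoning about thresholds.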

Related files

  • packages/ai/src/ingestion/research-paper-pipeline.ts
  • packages/ai/src/embeddings.ts
  • packages/shared/src/db/embeddings-schema.ts
  • apps/admin/payload/hooks/ingestResearchPaper.ts

Design guidance

  • Keep ingestion separate from retrieval policy.
  • Keep retrieval separate from workflow logic.
  • Keep workflow logic separate from prompt generation.

That separation makes the platform easier to test and easier to evolve.
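
One way to picture this separation is to pass retrieval and prompt generation into the workflow as plain functions. The sketch below is illustrative only; `Retriever`, `PromptBuilder`, and `answerWorkflow` are hypothetical names, not code from the repository.

```typescript
// Retrieval policy: decides which evidence is relevant to a query.
type Retriever = (query: string) => string[];

// Prompt generation: decides how the question and evidence are worded.
type PromptBuilder = (question: string, evidence: string[]) => string;

// Workflow logic: decides what happens, and delegates the other two concerns.
function answerWorkflow(
  question: string,
  retrieve: Retriever,
  buildPrompt: PromptBuilder,
): string {
  const evidence = retrieve(question);
  const usable = evidence.length > 0 ? evidence : ['(no supporting evidence found)'];
  return buildPrompt(question, usable);
}
```

Each piece can then be tested with a stub for the others, and a retrieval policy can be swapped without touching prompt wording.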

Next steps