Research Engine
In the current repository, the “research engine” is best understood as a pipeline rather than a standalone service.
It is built from:
- PDF extraction
- section-aware chunking
- embedding generation
- downstream storage or retrieval hooks
The main implementation lives in packages/ai/src/ingestion/research-paper-pipeline.ts.
What it does
The ingestion pipeline:
- accepts a Payload-linked paper input
- extracts text from a PDF buffer or URL
- derives an abstract when possible
- chunks the paper by section
- generates embeddings in batches
- returns chunked output for storage
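The chunking and batching steps listed above can be sketched without any external services. What follows is a minimal illustration, not the code in research-paper-pipeline.ts: `chunkBySection`, `embedInBatches`, the `EmbedBatch` signature, and the fixed heading list are all assumptions introduced here.

```typescript
// Hypothetical shapes for illustration; the real types in @loop/ai may differ.
interface Chunk {
  section: string;
  content: string;
  chunkIndex: number;
}

interface EmbeddedChunk extends Chunk {
  embedding: number[];
}

// Assumed embedder signature: one call embeds a whole batch of texts.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

// Assumption: sections start on a line that is exactly one of these headings.
const SECTION_HEADINGS = ['abstract', 'introduction', 'methods', 'results', 'discussion'];

function chunkBySection(text: string): Chunk[] {
  const chunks: Chunk[] = [];
  let section = 'preamble';
  let buffer: string[] = [];

  // Emit the buffered lines as one chunk tagged with the current section.
  const flush = (): void => {
    const content = buffer.join(' ').trim();
    if (content.length > 0) {
      chunks.push({ section, content, chunkIndex: chunks.length });
    }
    buffer = [];
  };

  for (const line of text.split('\n')) {
    const trimmed = line.trim();
    if (trimmed === '') continue;
    if (SECTION_HEADINGS.includes(trimmed.toLowerCase())) {
      flush();
      section = trimmed.toLowerCase();
    } else {
      buffer.push(trimmed);
    }
  }
  flush();
  return chunks;
}

// Embed chunks batchSize at a time and attach each vector to its chunk.
async function embedInBatches(
  chunks: Chunk[],
  embedBatch: EmbedBatch,
  batchSize: number,
): Promise<EmbeddedChunk[]> {
  if (batchSize < 1) throw new Error('batchSize must be at least 1');
  const embedded: EmbeddedChunk[] = [];
  for (let start = 0; start < chunks.length; start += batchSize) {
    const batch = chunks.slice(start, start + batchSize);
    const vectors = await embedBatch(batch.map((c) => c.content));
    batch.forEach((chunk, i) => {
      const embedding = vectors[i];
      if (embedding === undefined) throw new Error('embedder returned too few vectors');
      embedded.push({ ...chunk, embedding });
    });
  }
  return embedded;
}
```

Passing the embedder in as a function keeps the batching loop testable with a fake; a real caller would supply a function that wraps its embedding provider.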
import { createResearchPaperIngestionPipeline } from '@loop/ai';
const pipeline = createResearchPaperIngestionPipeline({
  apiKey: process.env.OPENAI_API_KEY!,
});
const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'Peptide Recovery Signaling',
  pdfUrl: 'https://example.com/paper.pdf',
});
Pipeline stages
1. Extraction
The pipeline starts by pulling text from a PDF:
const result = await pipeline.ingestPaper({
  payloadId: 'paper_123',
  title: 'Clinical Review',
  pdfBuffer,
});
Input rules:
- provide pdfBuffer or pdfUrl
- include a stable payloadId
- include the paper title
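The input rules above can be checked before calling the pipeline. This is a minimal sketch under assumptions: `PaperInput` and `validatePaperInput` are hypothetical names, and the field types are inferred from the examples rather than taken from the package.

```typescript
// Hypothetical input shape based on the fields shown in the examples above.
interface PaperInput {
  payloadId: string;
  title: string;
  pdfBuffer?: Uint8Array;
  pdfUrl?: string;
}

// Returns a list of human-readable problems; an empty list means the input passes.
function validatePaperInput(input: PaperInput): string[] {
  const problems: string[] = [];
  if (input.payloadId.trim() === '') problems.push('payloadId is required');
  if (input.title.trim() === '') problems.push('title is required');
  const hasBuffer = input.pdfBuffer !== undefined && input.pdfBuffer.length > 0;
  const hasUrl = input.pdfUrl !== undefined && input.pdfUrl.trim() !== '';
  if (!hasBuffer && !hasUrl) problems.push('provide pdfBuffer or pdfUrl');
  return problems;
}
```

Collecting every problem instead of failing on the first one lets a caller report all missing fields at once.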
2. Chunking
After extraction, the paper is chunked by section.
This produces a structure like:
{
  "section": "methods",
  "content": "Participants completed a 12-week protocol...",
  "chunkIndex": 3
}
3. Embedding
Embeddings are generated in batches and attached back to each chunk:
{
  section: 'results',
  content: 'hs-CRP improved significantly...',
  chunkIndex: 5,
  embedding: [0.012, -0.038, 0.221]
}
Progress reporting
The pipeline exposes progress callbacks so the caller can surface ingestion status:
const result = await pipeline.ingestPaper(
  {
    payloadId: 'paper_123',
    title: 'Clinical Review',
    pdfUrl: 'https://example.com/paper.pdf',
  },
  (progress) => {
    console.log(progress.step, progress.progress, progress.message);
  },
);
Current progress steps:
- extracting
- chunking
- embedding
- storing
- complete
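A caller can turn these steps into a readable status line. The sketch below is illustrative only: the step names come from the list above, while `ProgressEvent`, `formatProgress`, and the reading of `progress` as a 0-100 percentage are assumptions not confirmed by the source.

```typescript
type ProgressStep = 'extracting' | 'chunking' | 'embedding' | 'storing' | 'complete';

// Shape inferred from the callback example (step, progress, message).
interface ProgressEvent {
  step: ProgressStep;
  progress: number; // assumed here to be a 0-100 percentage
  message: string;
}

const STEP_ORDER: ProgressStep[] = ['extracting', 'chunking', 'embedding', 'storing', 'complete'];

// Formats one event as "[position/total] step percent% - message".
function formatProgress(event: ProgressEvent): string {
  const position = STEP_ORDER.indexOf(event.step) + 1;
  // Clamp so an out-of-range value cannot produce a misleading percentage.
  const percent = Math.round(Math.min(100, Math.max(0, event.progress)));
  return `[${position}/${STEP_ORDER.length}] ${event.step} ${percent}% - ${event.message}`;
}
```

If the real event shape matches, a callback such as `(progress) => console.log(formatProgress(progress))` could be passed as the second argument to `ingestPaper`.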
Real implementation vs. current admin wiring
The important nuance is that two things are true at once:
- The reusable ingestion pipeline exists in @loop/ai
- The current admin hook wrapper is a placeholder that returns an INTERNAL_ERROR until the @loop/ai export is restored at the integration point
That placeholder lives in:
apps/admin/payload/hooks/researchPaperIngestionPipeline.ts
Current placeholder behavior:
return err(
  createError(
    'INTERNAL_ERROR',
    'Research paper ingestion pipeline is not available; implement or restore @loop/ai export.',
  ),
);
Why this matters
This split is important for contributors:
- the ingestion design is real
- the core pipeline is implemented
- the admin entry point is not fully wired in this branch snapshot
So platform docs should describe this as an available ingestion pipeline with partial application wiring, not as a universally active production feature.
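For callers, the practical consequence is that an ingestion result has to be checked before it is used. The sketch below shows one way to separate the placeholder's error from other failures. The `Result` and `AppError` shapes are assumptions inferred from `result.ok`, `result.data`, and the `err(createError(...))` call; the `error` field name and `describeIngestionResult` are hypothetical.

```typescript
// Hypothetical error and result shapes; the repository's own types may differ.
interface AppError {
  code: string;
  message: string;
}

type Result<T> = { ok: true; data: T } | { ok: false; error: AppError };

interface IngestionOutput {
  chunks: { section: string; content: string; chunkIndex: number }[];
}

// Summarizes a result, treating INTERNAL_ERROR as "pipeline unavailable".
function describeIngestionResult(result: Result<IngestionOutput>): string {
  if (result.ok) {
    return `ingested ${result.data.chunks.length} chunks`;
  }
  if (result.error.code === 'INTERNAL_ERROR') {
    return `ingestion unavailable: ${result.error.message}`;
  }
  return `ingestion failed (${result.error.code}): ${result.error.message}`;
}
```

Branching on the error code lets an admin UI show "not available in this build" rather than a generic failure while the wiring is incomplete.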
Typical use cases
Build a retrieval corpus
const result = await pipeline.ingestPaper({
  payloadId: 'paper_456',
  title: 'Sleep Recovery and HRV',
  pdfUrl: 'https://example.com/sleep.pdf',
});
if (result.ok) {
  console.log(result.data.chunks.length);
}
Enrich AI prompts with research snippets
const chunk = {
  section: 'discussion',
  content: 'Participants with higher adherence showed improved recovery markers.',
};
const prompt = `
Use this supporting evidence:
${chunk.content}
`;
Store embeddings for semantic search
The repository also includes a shared embeddings schema and migration support for vector search.
SELECT *
FROM match_embeddings($1, 0.7, 10);
Related repository surfaces
- packages/ai/src/ingestion/research-paper-pipeline.ts
- packages/ai/src/embeddings.ts
- packages/shared/src/db/embeddings-schema.ts
- apps/admin/payload/hooks/ingestResearchPaper.ts
Design guidance
- Keep ingestion separate from retrieval policy.
- Keep retrieval separate from workflow logic.
- Keep workflow logic separate from prompt generation.
That separation makes the platform easier to test and easier to evolve.
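The separation described above can be illustrated with two functions that share only plain data. This is a sketch under assumptions: every name here (`StoredChunk`, `RetrievalPolicy`, `retrieve`, `buildPrompt`) is hypothetical, and the in-memory cosine ranking stands in for whatever the database-side search actually does.

```typescript
// Illustrative types; none of these names are taken from the repository.
interface StoredChunk {
  section: string;
  content: string;
  embedding: number[];
}

interface RetrievalPolicy {
  minSimilarity: number;
  maxResults: number;
}

function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vectors must be the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Retrieval: knows about vectors and the policy, nothing about prompts.
function retrieve(query: number[], corpus: StoredChunk[], policy: RetrievalPolicy): StoredChunk[] {
  return corpus
    .map((chunk) => ({ chunk, score: cosineSimilarity(query, chunk.embedding) }))
    .filter((hit) => hit.score >= policy.minSimilarity)
    .sort((x, y) => y.score - x.score)
    .slice(0, policy.maxResults)
    .map((hit) => hit.chunk);
}

// Prompt generation: knows about text, nothing about vectors or thresholds.
function buildPrompt(question: string, evidence: StoredChunk[]): string {
  const snippets = evidence.map((c) => `[${c.section}] ${c.content}`).join('\n');
  return `Use this supporting evidence:\n${snippets}\n\nQuestion: ${question}`;
}
```

Because the policy is a plain value, a threshold or result limit can be changed and tested without touching ingestion or prompt code.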
Next steps
- Read ML Layer
- Follow Operationalize Research Ingestion
- Review ML Endpoints