Skip to content

Summarization: Design, API, and Future Scope

This document explains how summarization works in Ecosyz today, how to configure it, the API contracts (JSON + SSE streaming), and ideas for future enhancements.

Goals

  • Provide fast, zero-LLM extractive summaries for papers and metadata.
  • Use PDFs when possible ("deep") and fall back to abstract/title ("quick").
  • Stream partial results to the UI for responsiveness.
  • Cache results in-memory and optionally in a shared KV store.

Architecture Overview

  • UI: app/openresources/page.tsx
  • Presents a search list. Each paper has a Summarize button.
  • Summary is shown inline and in a glassy modal.
  • Uses Server-Sent Events (SSE) to stream TL;DR, bullets, and tags.
  • API: app/api/summarize/route.ts (Node runtime)
  • POST /api/summarize returns a full JSON summary.
  • GET /api/summarize?... streams partial results via SSE.
  • Deep: attempts to fetch a PDF and summarize extracted text.
  • Quick: summarizes title + abstract.

Flow Diagram (Mermaid)

flowchart TD
  U[User clicks Summarize] --> UI[UI opens glassy modal]\n(EventSource /api/summarize)
  UI -->|SSE| API[API GET /api/summarize]
  API --> KV{KV configured?}
  KV -- yes --> KVGET[KV get(key)]
  KVGET -->|hit| META1[send meta (fromCache=kv)] --> UI
  KVGET -->|hit| TLDR1[send tldr] --> UI
  KVGET -->|hit| BUL1[send bullets] --> UI
  KVGET -->|hit| TAG1[send tags] --> UI
  KVGET -->|hit| DONE1[send done] --> UI
  KV -- no or miss --> MEM{Memory cache hit?}
  MEM -- yes --> META2[send meta (fromCache=memory)] --> UI
  MEM --> TLDR2[send tldr] --> UI
  MEM --> BUL2[send bullets] --> UI
  MEM --> TAG2[send tags] --> UI
  MEM --> DONE2[send done] --> UI
  MEM -- no --> MODE{mode === deep?}
  MODE -- deep --> PDF[Derive PDF URL (provider-specific)]
  PDF -->|arXiv/OpenAlex/Zenodo/Generic| FETCH[Fetch PDF & pdf-parse]
  FETCH -->|ok| SUM[Summarize text (extractive)]
  SUM --> SETMEM[Set memory cache]
  SUM --> SETKV[Set KV (if configured)]
  SUM --> META3[send meta (fromCache=false)] --> UI
  SUM --> TLDR3[send tldr] --> UI
  SUM --> BUL3[send bullets] --> UI
  SUM --> TAG3[send tags] --> UI
  SUM --> DONE3[send done] --> UI
  FETCH -- fail --> ERR1[send error] --> UI
  MODE -- quick --> QSUM[Summarize title+abstract]
  QSUM --> SETMEM
  QSUM --> SETKV
  QSUM --> META4[send meta (fromCache=false)] --> UI
  QSUM --> TLDR4[send tldr] --> UI
  QSUM --> BUL4[send bullets] --> UI
  QSUM --> TAG4[send tags] --> UI
  QSUM --> DONE4[send done] --> UI

Modes: Quick vs Deep

  • Quick mode: summarize title + abstract. Always available.
  • Deep mode: summarize extracted text from a PDF.
  • PDF Derivation:
    • arXiv: build https://arxiv.org/pdf/<id>.pdf from abs/id or arxiv:.
    • OpenAlex: query Works API for OA PDF URLs; confirm via HEAD if needed.
    • Zenodo: fetch record files; pick a PDF by mimetype or filename.
    • Generic: accept direct .pdf URLs or HEAD with content-type: application/pdf.
  • Extract text using pdf-parse (dynamically imported) and summarize it.

Summarization Algorithm (Extractive)

  • Tokenization, stopwords filtering, and sentence splitting.
  • Sentence frequencies score sentences; pick top N (e.g., 5–6), keep original order.
  • Outputs:
  • tldr (1–2 sentences), bullets (top sentences), tags (top non-stopword tokens), and readingTimeMinutes.
  • confidence is a fixed "medium" for now.

API Contracts

POST /api/summarize

  • Request body:
    {
      "id": "string",             // optional but recommended
      "source": "string",         // e.g., "arxiv", "openalex", "zenodo"
      "title": "string",          // title text
      "abstract": "string",       // abstract or description
      "url": "string",            // original resource URL
      "mode": "quick" | "deep"    // requested mode
    }
    
  • Response (200):
    {
      "tldr": "string",
      "bullets": ["..."],
      "tags": ["..."],
      "readingTimeMinutes": 3,
      "confidence": "medium",
      "modeUsed": "quick" | "deep",
      "fromCache": true | false,
      "cache": "kv" | "memory" | "none"
    }
    
  • Error (4xx/5xx): { "error": "message" }.

GET /api/summarize (SSE)

  • Query params: same fields as POST but via ?id=...&source=...&title=...&abstract=...&url=...&mode=....
  • Events:
  • meta: { fromCache, cache, modeUsed | modeRequested }
  • tldr: string
  • bullets: string[]
  • tags: string[]
  • done: { ok: true }
  • error: { message }

Caching

  • In-memory (Map with a 7-day TTL) for fast local reuse.
  • Optional shared cache via Vercel KV / Upstash (7-day TTL).
  • Set env vars to enable:
    • KV_REST_API_URL
    • KV_REST_API_TOKEN
  • If not configured, KV is ignored (no-op) and only in-memory cache is used.

Configuration

  • Node runtime routes are required (uses Buffer, dynamic import):
  • app/api/summarize/route.ts exports export const runtime = 'nodejs'.
  • Dependencies:
  • pdf-parse (dynamic import)
  • Optional: @vercel/kv (only if KV env vars are set)

UI Behavior

  • Clicking Summarize opens a glassy modal.
  • Results stream in (TL;DR then bullets and tags).
  • Badges display mode (Deep vs Quick) and cache source (KV/Memory/Fresh).
  • Errors are shown via toasts; modal closes on streaming error.
  • Descriptions and summary text are sanitized of HTML tags/entities before rendering.

Limitations

  • Some PDFs are not accessible (paywalled, blocked HEAD, or CORS); deep may fail.
  • Scanned/image-only PDFs won’t extract text (no OCR).
  • Very long PDFs are truncated for responsiveness.

Future Scope

  • LLM Summarization (Optional tier)
  • Use an LLM for abstractive summaries with citations.
  • Stream tokens over SSE (the plumbing already exists).
  • Add safety and cost controls with per-user limits and provider selection.
  • OCR for Scanned PDFs
  • Integrate Tesseract or a service to OCR image-based PDFs.
  • KV / Durable Cache
  • Add eviction policies, versions, and per-source sanctity checks.
  • Optional background prewarming for popular items.
  • User Controls
  • Toggle: Deep only vs Auto (deep → quick fallback).
  • Controls for max sentences, tone (technical/lay), and language.
  • Diagnostics & Observability
  • Debug panel showing which PDF URL was chosen, time to fetch/parse, and cache hits.
  • Metrics and tracing for failures and latencies.
  • Accessibility & UX
  • Keyboard shortcuts (open modal, navigate bullets, close on Esc).
  • Improved animations and responsive behavior.
  • Security & Sanitization
  • Move sanitization server-side for consistent behavior.
  • Add allowlists for domains when following redirects to PDFs.

Local Development

npm install
npm run dev
# open http://localhost:3000/openresources

Troubleshooting

  • "Module not found: pdf-parse": ensure pdf-parse is in dependencies and the route runs on Node runtime.
  • "Module not found: @vercel/kv": install @vercel/kv or remove KV env vars if not using KV.
  • Deep summaries failing: check network access to the PDF URL; if blocked, try quick mode or set up a proxy.

How to Add a New Provider to Deep Mode

This guide shows the minimal steps to enable PDF-based deep summaries for an additional provider.

1) Implement PDF URL Derivation - Add a helper in app/api/summarize/route.ts similar to deriveOpenAlexPdf or deriveZenodoPdf. For example: - Call the provider’s record/works API. - Find a file link with a PDF mimetype or .pdf filename. - If the URL doesn’t end with .pdf, make a HEAD request and confirm content-type: application/pdf.

2) Wire the Helper into Deep Mode - In POST and GET handlers, extend the if (mode === 'deep') section:

else if (source === 'myprovider') pdfUrl = await deriveMyProviderPdf(url || id || '', controller.signal);
- Keep the same fallback flow as existing providers: if no valid PDF URL is found, return an error (deep-only) or fall back to quick (if you opt for auto mode).

3) Sanity & Limits - Try-catch all network calls and treat non-2xx as no PDF available. - Enforce a reasonable timeout (e.g., 25s via AbortController). - Cap extracted text size (e.g., 200k chars) before summarization.

4) Caching - The route already caches by key ${id || title}:${mode}. - No extra work is needed to benefit from memory/KV caches.

5) Testing Checklist - Resource with an accessible PDF → should stream a Deep summary with the badge. - Resource without a PDF → should error in deep-only mode, or fall back if allowed. - Very large PDF → processed and truncated; summary still produced. - Ensure the glassy modal opens, updates progressively, and closes on error.

Tips - If the provider uses redirects or signed URLs, prefer redirect: 'follow' and confirm with HEAD. - If the provider requires auth, do not hardcode secrets. Add a secure fetch layer and read tokens from environment variables.