Developing a Long Context AI Knowledge App
It has been a development day today: implementing an AI knowledge system that works with local files and network shares (using OpenAI's gpt-oss-120B), but which can also burst out to Gemini 1.5 Flash's massive one-million-token context window when needed.
Users can organize files into "Knowledge Bases" (document sets, such as a "Finance" set or a "Project" set) and easily switch chat contexts between them.
The App can be internally hosted (or run locally), and all chats and documents are persisted in the local user's browser even when the tab is closed and the browser exited.
Files are only sent during the specific chat request; even when Gemini is used rather than the private model, data is not stored permanently on Google's servers, and no backend is required.
If the browser's "Site Data" or "Application Data" is cleared, then the chats and document sets are expunged (although they can be exported for auditing purposes).
For 100–500 reasonably sized documents (and for anything marked as classified or sensitive), the local model is used. For larger document sets that are not marked classified or sensitive, Gemini 1.5 Flash is used, with its one-million-token context window.
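The routing rule above can be sketched as a small pure function. The threshold value, the characters-per-token heuristic, and all names here are illustrative assumptions, not the app's actual implementation:

```typescript
// Sketch of the model-routing rule: sensitive material always stays local;
// otherwise route on a rough token estimate. All thresholds are assumptions.

type Doc = { name: string; text: string; sensitive: boolean };

// Rough heuristic: ~4 characters per token for English prose.
function estimateTokens(docs: Doc[]): number {
  return docs.reduce((sum, d) => sum + Math.ceil(d.text.length / 4), 0);
}

function chooseModel(docs: Doc[]): "local-gpt-oss-120b" | "gemini-1.5-flash" {
  const anySensitive = docs.some((d) => d.sensitive);
  const LOCAL_TOKEN_BUDGET = 128_000; // assumed local context budget

  // Sensitive documents never leave the network, regardless of size.
  if (anySensitive || estimateTokens(docs) <= LOCAL_TOKEN_BUDGET) {
    return "local-gpt-oss-120b";
  }
  // Larger, non-sensitive sets burst out to the 1M-token window.
  return "gemini-1.5-flash";
}
```

The key design point is that sensitivity overrides size: a large classified set is still forced through the local model rather than the hosted one.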
Unlike apps such as OpenWebUI, there are no vector elements (embeddings) created or stored. This approach leverages the context window rather than a Vector Search approach.
When a user uploads a document, the App extracts the raw text from the file (PDFs, Office docs, etc.) and stores it directly in the browser's IndexedDB database, i.e. it does not convert the text into mathematical vectors (arrays of numbers), nor store it in a vector index.
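A stored record might look something like this: plain extracted text plus metadata, and no embedding vectors anywhere. The field names and the `buildRecord` helper are assumptions for illustration, not the app's actual schema:

```typescript
// Sketch of one document record as it might sit in an IndexedDB object
// store: raw text and metadata only, no vectors. Field names are assumed.

interface StoredDoc {
  id: string;            // key within the object store
  knowledgeBase: string; // e.g. "Finance" or "Project"
  fileName: string;
  extractedText: string; // raw text pulled from the PDF / Office file
  sensitive: boolean;
  addedAt: number;       // epoch milliseconds
}

function buildRecord(
  knowledgeBase: string,
  fileName: string,
  extractedText: string,
  sensitive = false,
): StoredDoc {
  return {
    id: `${knowledgeBase}/${fileName}`,
    knowledgeBase,
    fileName,
    extractedText,
    sensitive,
    addedAt: Date.now(),
  };
}
```

In the browser, a record like this would be written with `store.put(record)` inside an IndexedDB transaction.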
When a user sends a message, the app retrieves the entire raw text of all files in the selected Knowledge Base and injects it directly into the System Prompt sent to the LLM API.
Because the model can 'read' all of the documents simultaneously in a single prompt, it doesn't need vectors to find specific relevant chunks; it simply processes everything at once.
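The "inject everything into the system prompt" step can be sketched as simple string assembly. The delimiter format and header wording here are assumptions, not the app's exact prompt:

```typescript
// Sketch of full-context prompt assembly: concatenate every document in the
// selected Knowledge Base into one system prompt. Delimiters are assumed.

type KbFile = { fileName: string; extractedText: string };

function buildSystemPrompt(kbName: string, files: KbFile[]): string {
  const header =
    `You are answering questions using the "${kbName}" knowledge base. ` +
    `The full text of every document follows.`;
  const body = files
    .map(
      (f) =>
        `--- BEGIN ${f.fileName} ---\n${f.extractedText}\n--- END ${f.fileName} ---`,
    )
    .join("\n\n");
  return `${header}\n\n${body}`;
}
```

Per-file delimiters like these help the model attribute answers to specific documents, which matters when a Knowledge Base holds hundreds of files.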
A typical RAG pipeline works like this:
➡️ Convert text to embeddings
➡️ Store vectors
➡️ k-nearest-neighbour retrieval
➡️ Return top-k chunks
This can cause the following issues:
➡️ Slight wording mismatch retrieves irrelevant material
➡️ Related content can be spread across multiple pages
➡️ Boundary conditions break (useful content lost at chunk boundaries)
➡️ Important global context is never retrieved
➡️ Key answers often live across multiple documents
Even perfect embeddings still show a roughly 5–10% retrieval failure rate on nuanced questions. This 'long context' approach preserves the full document narrative, which gives higher accuracy and fewer hallucinations.
This is why Anthropic, Google, and OpenAI all advise that models perform best when they can directly read all relevant content.

