Offline RAG with Bun, libSQL, LangChain.js, and Ollama

  • #AI
  • #Bun
  • #LangChain
  • #libSQL
  • #Ollama
  • #RAG
  • #TypeScript

When Meta released its Llama 3.2 model, I wondered what a simple, personal and offline-only RAG (retrieval augmented generation) chatbot might look like. A command-line tool that would import a single PDF at a time and allow me to chat with its contents was the goal. Being comfortable writing TypeScript, I decided to give it a go using Bun, libSQL, LangChain.js and Ollama. Here’s how it went.

Pulling the Plug on Cloud Dependency: How libSQL, LangChain, and Ollama Empower Offline Retrieval-Augmented Generation.

Building Blocks

  • Bun: The new JS kid on the block. TypeScript-first, blazingly fast 😜, and still works with a plain package.json. Definitely my go-to JS runtime for side projects.
  • libSQL: A fork of SQLite with a vector type and vector queries built in. Battle-tested on Turso, but also a perfect local-first, file-based database for personal AI-powered apps.
  • LangChain.js: The JavaScript port of the popular LangChain Python library. It brings everything together seamlessly: document loading, chunking, embeddings, vector stores, and retrieval.
  • Ollama: The best way to run open source AI models locally on your computer. You’re up and running with one simple command.

Addressing LangChain Issues

After I started implementing PDF loading, chunking, and creating embeddings for the chunks, I immediately ran into problems with LangChain, namely with the @langchain/community package from which I imported the LibSQLVectorStore.

It threw an error while inserting documents into the database. At first I thought something was wrong with my chunking and embedding functions, but I soon realized there was an actual bug in the @langchain/community package. I filed an issue on GitHub and opened a PR with a proposed fix.

Turns out the LibSQLVectorStore implementation wasn’t quite finished but had somehow made it into a release. Maintainer @jacoblee93 helped fix a few more issues and also wrote some integration tests.

Now back to work… 🤓

Creating Local Embeddings

Back in April, Ollama announced support for embeddings. That’s great, because now we can run the complete retrieval chain, embedding documents and user queries as well as the LLM for chatting, locally on our own computer. No need to rely on OpenAI and the like anymore.

Choosing the Embeddings Model

You can find every supported embeddings model in the Ollama library. Because I wanted to get started quickly, I simply went with nomic-embed-text: it had the most downloads and looked like a good fit for my use case.

🔥 Hot Tip: Check out Hugging Face’s MTEB Leaderboard for comparing open source embeddings models and finding the right one for your needs.
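
A quick way to sanity-check the setup is to embed a test string. Here’s a minimal sketch (the file name embeddings.ts is just illustrative), assuming the @langchain/ollama package and a running Ollama instance with the model already pulled:

embeddings.ts
import { OllamaEmbeddings } from '@langchain/ollama'

// Assumes `ollama pull nomic-embed-text` has been run beforehand
const embeddings = new OllamaEmbeddings({ model: 'nomic-embed-text' })

// Embed a single query string and inspect the resulting vector
const vector = await embeddings.embedQuery('What does RAG stand for?')
console.log(vector.length) // 768 dimensions for nomic-embed-text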

How to Store Vectors Locally?

Generating embeddings is quite a time-consuming task, so I wanted to store them locally to avoid recomputing them every time I start the chatbot. At the same time, they need to be queryable efficiently.

By now there are a million proprietary as well as open-source databases that support vectors. But I wanted to keep it simple and local. That’s why I chose libSQL: it’s a fork of SQLite with a vector type, vector indices, and vector queries built in. Perfect for my offline RAG chatbot.

🔥 Hot Tip: Have a look at pgvector, a great extension that brings full vector support to Postgres. Also check out Supabase for a cloud solution based on Postgres and pgvector.
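
Just to illustrate what libSQL’s built-in vector support looks like outside of LangChain, here’s a rough sketch of a raw nearest-neighbor query. It assumes the vector_top_k table-valued function and the vector32 helper described in the Turso documentation, and uses the table and index names that appear later in this post:

search-sketch.ts
import { createClient } from '@libsql/client'

const db = createClient({ url: 'file:vector-store.db' })

// The query embedding, serialized as a JSON array string for vector32()
const queryVector = JSON.stringify([0.1, 0.2, 0.3 /* ...768 values total */])

// vector_top_k() returns the rowids of the nearest neighbors from the index;
// join them back to the vecs table to get the stored chunk content
const { rows } = await db.execute({
  sql: `SELECT v.content
        FROM vector_top_k('idx_vecs_embeddings', vector32(?), 5) AS top
        JOIN vecs AS v ON v.rowid = top.id`,
  args: [queryVector],
})
console.log(rows)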

Help, My Database is Huge!

While testing, I used the Brexit withdrawal agreement as it is quite long and therefore looked like a good choice to do some retrieval tests. It took me quite a while to realize that with this one document alone, the libSQL file was already more than 500 MB in size. 🤯

The PDF of the Brexit agreement is 181 pages long, but over half a gigabyte for a single document is still far too much. If I used this chatbot with multiple documents, the database file would quickly grow to several gigabytes. That’s not really what I had in mind for a simple, personal, offline-only chatbot.

Fortunately, I found the article “The space complexity of vector indexes in LibSQL” on the Turso blog.

Compression and Limiting

The main problem wasn’t the vectors themselves but rather the vector index. By limiting the number of neighbors and compressing them, the database file size was significantly reduced. Here’s the updated SQL command:

db.sql
CREATE INDEX IF NOT EXISTS idx_vecs_embeddings ON vecs (
  libsql_vector_idx(
    embeddings,
    'compress_neighbors=float8',
    'max_neighbors=20'
  )
);

As the Turso article explains, setting the compression to float8 and the maximum number of neighbors to 20 should still give good results. I can confirm that I haven’t noticed any difference in the quality of the retrieval results so far, but the file size dropped from over 500 MB to under 80 MB. 🎉
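
If you want to verify the effect yourself, Bun makes checking the size of the database file a one-liner. A tiny sketch, assuming the database file is named vector-store.db as it is later in this post:

dbsize.ts
// Print the size of the local libSQL database file in megabytes
const dbFile = Bun.file('vector-store.db')
console.log(`${(dbFile.size / 1024 / 1024).toFixed(1)} MB`)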

Bringing It All Together

With the libSQL implementation in LangChain fixed, the embeddings model chosen and the database file size reduced, I was finally able to chat with the Brexit withdrawal agreement. 🎉

Now, how does it all come together? Enter: LangChain.js.

Loading and Chunking Documents

LangChain.js handles document loading. You just need pdf-parse as a peer dependency and a path to a PDF file.

loader.ts
import { PDFLoader } from '@langchain/community/document_loaders/fs/pdf'

export async function load(path: string) {
  const loader = new PDFLoader(path)
  return await loader.load()
}

By default, this returns one Document object for each page of the PDF file. Since the context windows of LLMs keep growing, this may already be good enough for your use case.

☝️ Good To Know: The Llama 3.2 1B and 3B models support a context length of 128K tokens.

Chunking can significantly improve retrieval results. Creating embeddings for smaller parts of a document often leads to vectors that represent the embedded content more accurately than embeddings for an entire page.

For simplicity, I decided to split the documents with a RecursiveCharacterTextSplitter. It splits the text into chunks of a given maximum size (measured in characters by default) while trying to keep paragraphs, sentences, and words together as long as possible.

splitter.ts
import { RecursiveCharacterTextSplitter } from '@langchain/textsplitters'
import type { Document } from 'langchain/document'

// Chunk size is measured in characters by default
const CHUNK_SIZE = 250

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: CHUNK_SIZE,
})

export async function split(docs: Document[]) {
  return await splitter.splitDocuments(docs)
}

🔥 Hot Tip: Try semantic chunking to make retrieval even more accurate by keeping related content together.

The Vector Store

The LibSQLVectorStore is the heart of the retrieval process. It stores the embeddings of the document chunks and allows for efficient retrieval of the chunks most similar to a given query, which can then be used to let the Llama model generate a response.

In order to create the vector store, you need to provide a database client as well as an embeddings model. The vector store will then be responsible for storing the embeddings in the database and also retrieving them later.

vectorstore.ts
import { OllamaEmbeddings } from '@langchain/ollama'
import { createClient } from '@libsql/client'
import { LibSQLVectorStore } from '@langchain/community/vectorstores/libsql'

const embeddings = new OllamaEmbeddings({ model: 'nomic-embed-text' })

const libsqlClient = createClient({
  url: 'file:vector-store.db',
})

export const vectorStore = new LibSQLVectorStore(embeddings, {
  db: libsqlClient,
  table: 'vecs',
  column: 'embeddings',
})

As you can see above, we provide file:vector-store.db as the URL for the database. This tells libSQL to create a local database file in the current directory.

Creating Tables and Indices

Before you can start storing vectors in the database, you need to create tables and indices.

Make sure that your database schema matches the table and column names you provided to the LibSQLVectorStore constructor.

db.sql
CREATE TABLE IF NOT EXISTS vecs (
  id INTEGER PRIMARY KEY AUTOINCREMENT,
  content TEXT,
  metadata TEXT,
  embeddings F32_BLOB(768)
);

The F32_BLOB(768) type matches the vector size of the nomic-embed-text model. If you choose a different model, be sure to adjust the size accordingly.
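
How you run this schema is up to you; one simple approach (a sketch, assuming the statements live in a db.sql file next to the code) is to execute it once at startup with Bun and the libSQL client:

migrate.ts
import { createClient } from '@libsql/client'

const libsqlClient = createClient({ url: 'file:vector-store.db' })

// Read the schema file with Bun and run all statements in a single batch
const schema = await Bun.file('db.sql').text()
await libsqlClient.executeMultiple(schema)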

Keeping Track of Files

A separate files table stores the paths of the imported documents. This is useful to check whether a document has already been processed, so it doesn’t get re-embedded. The file ID is also stored in the metadata column of the vecs table, which allows seamless filtering with the built-in retriever of the LibSQLVectorStore.
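
The schema of the files table itself isn’t shown in the snippets; a minimal version that matches the INSERT below could be created like this (the file name and column types are assumptions):

files.ts
import { createClient } from '@libsql/client'

const libsqlClient = createClient({ url: 'file:vector-store.db' })

// Create the files table if it doesn't exist yet; its id ends up in the
// metadata of every document chunk belonging to that file
await libsqlClient.execute(`
  CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    filename TEXT,
    path TEXT
  );
`)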

vectorstore.ts
// Insert the file into the database and get its ID
const { rows } = await libsqlClient.execute({
  sql: `INSERT INTO files (filename, path)
        VALUES (?, ?) RETURNING id;`,
  args: [fileName, filePath],
})

// Add the file ID to the metadata of each document chunk
const docsWithMetadata = docs.map((doc) => ({
  ...doc,
  metadata: { file_id: rows[0].id },
}))

// Store the documents with metadata in the vector store
await vectorStore.addDocuments(docsWithMetadata)

Let’s Chat

Now that we have the document chunks stored in the database, we can start chatting with the Llama model. With LangChain.js, this is as easy as using the vector store as the base for a history-aware retriever, stuffing matching documents into the LLM’s system prompt, and generating a response.
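
One thing the snippets below don’t show is the chat model itself. A minimal setup, assuming the @langchain/ollama package and a locally pulled Llama 3.2 model, might look like this:

chat.ts
import { ChatOllama } from '@langchain/ollama'

// Assumes `ollama pull llama3.2` has been run; any Ollama chat model works here
const llm = new ChatOllama({ model: 'llama3.2' })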

Making the Retriever Aware of the Chat History

In order to allow follow-up questions, we might need to rephrase the current user question to include the context of the previous conversation. Imagine the following chat history:

🧑‍💻: What does RAG stand for in AI?
🤖: RAG stands for Retrieval-Augmented Generation.
🧑‍💻: How does it work?

If we use How does it work? for the similarity search in libSQL, we won’t get the results we need from our vector store. Instead, we need to rephrase the question to include the context of the previous conversation:

🧑‍💻: How does Retrieval-Augmented Generation work?

Here’s how it looks in code:

chat.ts
import { ChatPromptTemplate, MessagesPlaceholder } from '@langchain/core/prompts'
import { createHistoryAwareRetriever } from 'langchain/chains/history_aware_retriever'

// Use the LibSQLVectorStore as a retriever, scoped to the current file
// (`fileId` is the id returned when the file was inserted into the files table)
const retriever = vectorStore.asRetriever({
  k: 10,
  filter: (doc) => doc.metadata.file_id === fileId,
})

const rephraseSystemPrompt = `Given a chat history and the latest user question
which might reference context in the chat history,
formulate a standalone question which can be understood
without the chat history. Do NOT answer the question,
just reformulate it if needed and otherwise return it as is.`

const rephrasePrompt = ChatPromptTemplate.fromMessages([
  ['system', rephraseSystemPrompt],
  new MessagesPlaceholder('chat_history'),
  ['human', '{input}'],
])

const historyAwareRetriever = await createHistoryAwareRetriever({
  llm,
  retriever,
  rephrasePrompt,
})

Response Generation

This is where the magic happens. The final step has two parts: first, we retrieve the document chunks most similar to the user’s question; then we feed those chunks into the Llama model to generate an answer.

The createStuffDocumentsChain helper from LangChain.js handles the second part by passing the retrieved chunks to the LLM, and the createRetrievalChain helper brings retrieval and generation together.

chat.ts
import { createStuffDocumentsChain } from 'langchain/chains/combine_documents'
import { createRetrievalChain } from 'langchain/chains/retrieval'

const systemPrompt = `You are an assistant for question-answering tasks.
Use the following pieces of retrieved context to
answer the question. If you don't know the answer,
say that you don't know. Use three sentences maximum
and keep the answer concise.\n\n{context}`

const qaPrompt = ChatPromptTemplate.fromMessages([
  ['system', systemPrompt],
  new MessagesPlaceholder('chat_history'),
  ['human', '{input}'],
])

// Pass the retrieved documents to the LLM
const combineDocsChain = await createStuffDocumentsChain({
  llm,
  prompt: qaPrompt,
})

// The final retrieval chain: retrieve, then answer
const ragChain = await createRetrievalChain({
  retriever: historyAwareRetriever,
  combineDocsChain,
})

Streaming Responses

You could now simply call invoke() on the final chain and wait for the full answer. But for a better user experience, you might want to stream the response in chunks as they are generated by the model:

chat.ts
// `ragChain.stream()` resolves to an `IterableReadableStream` of partial outputs
const stream = ragChain.stream({
  input,
  chat_history,
})

// Consume the stream and print the answer chunks as they arrive
for await (const chunk of await stream) {
  if (chunk['answer']) {
    process.stdout.write(chunk.answer)
  }
}
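
The chat_history passed into the chain also has to be maintained between turns, which isn’t shown above. A simple in-memory sketch using LangChain’s message classes, reusing input and ragChain from the previous snippets, could look like this:

chat.ts
import { AIMessage, HumanMessage } from '@langchain/core/messages'
import type { BaseMessage } from '@langchain/core/messages'

// Keep the conversation in memory between turns
const chat_history: BaseMessage[] = []

// Collect the full answer while streaming, then update the history
let answer = ''
for await (const chunk of await ragChain.stream({ input, chat_history })) {
  if (chunk.answer) {
    answer += chunk.answer
    process.stdout.write(chunk.answer)
  }
}
chat_history.push(new HumanMessage(input), new AIMessage(answer))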

Code Example on GitHub

The full code is available on my GitHub.

Where To Go From Here?

Powered by Bun, this could easily become a web-based chatbot. However, to run Llama effectively, a powerful server would be necessary.

Offloading the AI part to the client would be a viable option, but then all your users would have to run Ollama locally. You could also experiment with WebGPU and run the Llama model directly in the browser; there are demos out there that show it’s possible. But that’s a whole different story, and browser support is still limited.

To me, the best way to grow this into something more useful would be to create a desktop app with Tauri. That way you could bundle everything together and end up with an offline-first chatbot with a nice UI that runs right on your computer.

Conclusion

Building an offline RAG chatbot with Bun, libSQL, LangChain.js, and Ollama has been a fun and challenging project. It’s great to be able to run the whole retrieval chain locally on your own machine. No need to rely on cloud services anymore. And with Llama 3.2, the quality of the generated responses is actually quite good. 🥳