When Meta released its Llama 3.2 model, I wondered what a simple, personal and offline-only RAG (retrieval augmented generation) chatbot might look like. A command-line tool that would import a single PDF at a time and allow me to chat with its contents was the goal. Being comfortable writing TypeScript, I decided to give it a go using Bun, libSQL, LangChain.js and Ollama. Here’s how it went.
Building Blocks
- Bun: The new JS kid on the block. TypeScript-first, blazingly fast 😜, and has a `package.json`. Definitely my go-to JS runtime for side projects.
- libSQL: A fork of SQLite with a vector type and vector queries built in. Battle-tested on Turso, but also a perfect local-first, file-based database for personal AI-powered apps.
- LangChain.js: The JavaScript port of the popular LangChain Python library. It brings everything together seamlessly: document loading, chunking, embeddings, vector stores, and retrieval.
- Ollama: The best way to run open-source AI models locally on your computer. You’re up and running with one simple command.
Addressing LangChain Issues
After I started implementing PDF loading, chunking, and creating embeddings for the chunks, I immediately ran into problems with LangChain, namely with the `@langchain/community` package from which I imported the `LibSQLVectorStore`.
It threw an error while inserting documents into the database. At first I thought there was something wrong with my chunking and embedding functions, but soon I realized there was an actual bug in the `@langchain/community` package. I filed an issue on GitHub and a PR with a proposed fix.
Turns out the `LibSQLVectorStore` implementation wasn’t quite finished, but somehow made it into a release. Maintainer @jacoblee93 helped fix a few more issues and also wrote some integration tests.
Now back to work… 🤓
Creating Local Embeddings
Back in April, Ollama announced support for embeddings. That’s great, because with Ollama we can now run the complete retrieval chain, including embedding documents and user queries as well as the LLM for chatting, locally on our own computer. No need to rely on OpenAI and the like anymore.
Choosing the Embeddings Model
You can find every supported embeddings model in the Ollama library. Since I wanted to get started quickly, I simply chose `nomic-embed-text`: it had the most downloads and looked like a good fit for my use case.
🔥 Hot Tip: Check out Hugging Face’s MTEB Leaderboard for comparing open source embeddings models and finding the right one for your needs.
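To illustrate, here is a minimal sketch of creating a local embedding with LangChain.js and Ollama. It assumes Ollama is running and the model has been pulled with `ollama pull nomic-embed-text`; the query text is just an example.

```typescript
import { OllamaEmbeddings } from "@langchain/ollama";

// Uses the local Ollama server to compute embeddings.
const embeddings = new OllamaEmbeddings({ model: "nomic-embed-text" });

// nomic-embed-text produces 768-dimensional vectors.
const vector = await embeddings.embedQuery(
  "What does the agreement say about the transition period?"
);
console.log(vector.length); // 768
```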
How to Store Vectors Locally?
Generating embeddings is quite a time-consuming task, so I wanted to store them locally to avoid recomputing them every time I start the chatbot. But you also need to be able to query them efficiently.
By now there are a million proprietary as well as open-source databases that support vectors. But I wanted to keep it simple and local. That’s why I chose libSQL. It’s a fork of SQLite that has a vector type, vector indices and vector queries built in. Perfect for my offline RAG chatbot.
Help, My Database is Huge!
While testing, I used the Brexit withdrawal agreement, as it is quite long and therefore looked like a good candidate for some retrieval tests. It took me quite a while to realize that with this one document alone, the libSQL file was already more than 500 MB in size. 🤯
Sure, the PDF of the Brexit agreement is 181 pages, but 500 MB is still too much. If I used this chatbot with multiple documents, the database file would quickly reach several gigabytes. That’s not really what I had in mind for a simple, personal, offline-only chatbot.
Fortunately, I found the article “The space complexity of vector indexes in LibSQL” on the Turso blog.
Compression and Limiting
The main problem wasn’t the vectors themselves but rather the vector index. By limiting the number of neighbors and compressing them, I was able to reduce the database file size significantly. Here’s the updated SQL command:
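What follows is a sketch of that command, executed through `@libsql/client`; the index, table, and column names (`vecs_idx`, `vecs`, `embedding`) are assumptions that match the schema shown later in the post.

```typescript
import { createClient } from "@libsql/client";

const client = createClient({ url: "file:vector-store.db" });

// Drop the old, uncompressed index first (the name is an assumption).
await client.execute("DROP INDEX IF EXISTS vecs_idx");

// Re-create the vector index with 8-bit float compression for the
// neighbor vectors and at most 20 neighbors per node.
await client.execute(`
  CREATE INDEX vecs_idx ON vecs (
    libsql_vector_idx(embedding, 'compress_neighbors=float8', 'max_neighbors=20')
  )
`);
```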
As the Turso article explains, setting the compression to `float8` and the maximum number of neighbors to `20` still gives good results. I can confirm that I haven’t noticed any difference in the quality of the retrieval results so far. But the file size dropped from over 500 MB to under 80 MB. 🎉
Bringing It All Together
With the libSQL implementation in LangChain fixed, the embeddings model chosen and the database file size reduced, I was finally able to chat with the Brexit withdrawal agreement. 🎉
Now, how does it all come together? Enter: LangChain.js.
Loading and Chunking Documents
LangChain.js handles document loading. You just need `pdf-parse` as a peer dependency and a path to a PDF file.
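A minimal sketch of the loading step (the file path is just an example):

```typescript
import { PDFLoader } from "@langchain/community/document_loaders/fs/pdf";

// Requires pdf-parse to be installed as a peer dependency.
const loader = new PDFLoader("./documents/brexit-withdrawal-agreement.pdf");

const docs = await loader.load();
console.log(docs.length); // one Document per page
```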
By default, this returns one `Document` object for each page of the PDF file. Since the context windows of LLMs grow constantly, this may be good enough for your use case.
☝️ Good To Know: The Llama 3.2 1B and 3B models support a context length of 128K tokens.
Chunking can significantly improve retrieval results. Creating embeddings for smaller parts of a document often leads to vectors that represent the embedded content more accurately than embeddings for an entire page.
For simplicity, I decided to split the documents with a `RecursiveCharacterTextSplitter`. It splits the text into chunks of a given maximum size (measured in characters by default) while trying to keep paragraphs and sentences together.
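A sketch of the splitting step; the chunk size and overlap are illustrative values, not necessarily the ones used in the project:

```typescript
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 500,   // maximum characters per chunk
  chunkOverlap: 50, // overlap to avoid cutting context at chunk borders
});

// Split the page documents loaded above into smaller chunks.
const chunks = await splitter.splitDocuments(docs);
```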
🔥 Hot Tip: Try semantic chunking to make retrieval even more accurate by keeping related content together.
The Vector Store
The `LibSQLVectorStore` is the heart of the retrieval process. It stores the embeddings of the document chunks and allows for efficient retrieval of the chunks most similar to a given query, which can then be used to let the Llama model generate a response.
In order to create the vector store, you need to provide a database client as well as an embeddings model. The vector store will then be responsible for storing the embeddings in the database and also retrieving them later.
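Here is a sketch of that wiring; the table and column names are assumptions that have to match the schema created in the next section.

```typescript
import { createClient } from "@libsql/client";
import { OllamaEmbeddings } from "@langchain/ollama";
import { LibSQLVectorStore } from "@langchain/community/vectorstores/libsql";

// Local, file-based libSQL database.
const client = createClient({ url: "file:vector-store.db" });

// Embeddings are computed locally via Ollama.
const embeddings = new OllamaEmbeddings({ model: "nomic-embed-text" });

// The vector store writes to and reads from the vecs table.
const vectorStore = new LibSQLVectorStore(embeddings, {
  db: client,
  table: "vecs",
  column: "embedding",
});
```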
In the snippet above, we provide `file:vector-store.db` as the URL for the database client. This tells libSQL to create a local database file in the current directory.
Creating Tables and Indices
Before you can start storing vectors in the database, you need to create tables and indices.
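A sketch of the schema, using the `client` from the previous snippet. The `files` table (explained below) and the exact names are assumptions.

```typescript
// Document chunks: content, metadata (JSON) and the embedding vector.
await client.execute(`
  CREATE TABLE IF NOT EXISTS vecs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    content TEXT,
    metadata TEXT,
    embedding F32_BLOB(768)
  )
`);

// Imported files, used to avoid re-embedding a document twice.
await client.execute(`
  CREATE TABLE IF NOT EXISTS files (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    path TEXT UNIQUE
  )
`);

// Vector index with the compression and neighbor settings from above.
await client.execute(`
  CREATE INDEX IF NOT EXISTS vecs_idx ON vecs (
    libsql_vector_idx(embedding, 'compress_neighbors=float8', 'max_neighbors=20')
  )
`);
```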
Make sure that your database schema matches the one you provided to the `LibSQLVectorStore` constructor as `table` and `column`.
The `F32_BLOB(768)` type matches the vector size of the `nomic-embed-text` model. If you choose a different model, be sure to adjust the size accordingly.
Keeping Track of Files
The `files` table also stores the document paths. This is useful to check whether a document has already been processed and to avoid re-embedding it. The file ID is stored in the `metadata` column of the `vecs` table for seamless filtering with the built-in retriever of the `LibSQLVectorStore`.
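A sketch of that bookkeeping, building on the `client` and `vectorStore` from above; the helper name and the metadata key are my own assumptions, not the project’s exact code.

```typescript
import type { Document } from "@langchain/core/documents";

async function importChunks(path: string, chunks: Document[]) {
  // Skip files that have already been embedded.
  const existing = await client.execute({
    sql: "SELECT id FROM files WHERE path = ?",
    args: [path],
  });
  if (existing.rows.length > 0) return;

  // Remember the file and get its ID.
  const inserted = await client.execute({
    sql: "INSERT INTO files (path) VALUES (?) RETURNING id",
    args: [path],
  });
  const fileId = Number(inserted.rows[0].id);

  // Store the file ID in each chunk's metadata (ends up in the metadata column).
  for (const chunk of chunks) {
    chunk.metadata.fileId = fileId;
  }

  await vectorStore.addDocuments(chunks);
}
```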
Let’s Chat
Now that we have the document chunks stored in the database, we can start chatting with the Llama model. With LangChain.js, this is as easy as using the vector store as the base for a history-aware retriever, stuffing matching documents into the LLM’s system prompt, and generating a response.
Making the Retriever Aware of the Chat History
In order to allow follow-up questions, we might need to rephrase the current user question to include the context of the previous conversation. Imagine a chat history like this (an illustrative example):

Human: What does the agreement say about the transition period?
AI: The withdrawal agreement defines a transition period during which EU law continues to apply to the UK ...
Human: How does it work?

If we use `How does it work?` as-is for the similarity search in libSQL, we will not get the results that we need from our vector store. Instead, we need to rephrase the question to include the context of the previous conversation, for example: `How does the transition period defined in the withdrawal agreement work?`
Here’s how it looks in code:
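The following sketch uses LangChain’s `createHistoryAwareRetriever`; the model name and the prompt wording are assumptions on my part.

```typescript
import { ChatOllama } from "@langchain/ollama";
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
import { createHistoryAwareRetriever } from "langchain/chains/history_aware_retriever";

// The local Llama 3.2 model served by Ollama.
const llm = new ChatOllama({ model: "llama3.2" });

// Turn the vector store from above into a retriever.
const retriever = vectorStore.asRetriever();

// Ask the model to turn the latest question into a standalone question.
const rephrasePrompt = ChatPromptTemplate.fromMessages([
  new MessagesPlaceholder("chat_history"),
  ["human", "{input}"],
  [
    "human",
    "Given the conversation above, rephrase the last question as a standalone question for a similarity search.",
  ],
]);

const historyAwareRetriever = await createHistoryAwareRetriever({
  llm,
  retriever,
  rephrasePrompt,
});
```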
Response Generation
This is where the magic happens. The final step has two parts: first, we retrieve the document chunks most similar to the user’s question; then we feed these chunks into the Llama model to generate an answer.
We use the `createStuffDocumentsChain` helper from LangChain.js to stuff the retrieved chunks into the model’s prompt, and the `createRetrievalChain` helper to combine it with the history-aware retriever and bring it all together.
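A sketch of the final chain, reusing `llm` and `historyAwareRetriever` from above; the system prompt wording is an assumption.

```typescript
import { ChatPromptTemplate, MessagesPlaceholder } from "@langchain/core/prompts";
import { createStuffDocumentsChain } from "langchain/chains/combine_documents";
import { createRetrievalChain } from "langchain/chains/retrieval";

// The retrieved chunks are injected into the system prompt via {context}.
const answerPrompt = ChatPromptTemplate.fromMessages([
  [
    "system",
    "Answer the user's question using only the following context:\n\n{context}",
  ],
  new MessagesPlaceholder("chat_history"),
  ["human", "{input}"],
]);

const combineDocsChain = await createStuffDocumentsChain({
  llm,
  prompt: answerPrompt,
});

// Retrieval + generation in one runnable chain.
const chain = await createRetrievalChain({
  retriever: historyAwareRetriever,
  combineDocsChain,
});
```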
Streaming Responses
You could now simply call `invoke()` on the final chain and wait for the full answer. But for a better user experience, you might want to stream the response in chunks as they are generated by the model:
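The following is a streaming sketch under the same assumptions as above; the chat history is the illustrative example from earlier.

```typescript
import { AIMessage, HumanMessage } from "@langchain/core/messages";

const chatHistory = [
  new HumanMessage("What does the agreement say about the transition period?"),
  new AIMessage("The withdrawal agreement defines a transition period during which ..."),
];

const stream = await chain.stream({
  input: "How does it work?",
  chat_history: chatHistory,
});

// The retrieval chain streams partial result objects; the generated text
// arrives incrementally in the answer field.
for await (const chunk of stream) {
  if (chunk.answer !== undefined) {
    process.stdout.write(chunk.answer);
  }
}
```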
Code Example on GitHub
The full code is available on my GitHub.
Where To Go From Here?
Powered by Bun, this could easily become a web-based chatbot. However, to run Llama effectively, a powerful server would be necessary.
Offloading the AI part to the client would be a viable option. But then all your users must run Ollama locally. You could also mess around with WebGPU and run the Llama model in the browser. There are demos out there that show it’s possible. But that’s a whole different story and browser support is still limited.
To me, the best way to grow this into something more useful would be to create a desktop app with Tauri. That way you could bundle everything together and have a nice offline-first chatbot that you can use on your computer with a nice UI.
Conclusion
Building an offline RAG chatbot with Bun, libSQL, LangChain.js, and Ollama has been a fun and challenging project. It’s great to be able to run the whole retrieval chain locally on your own machine. No need to rely on cloud services anymore. And with Llama 3.2, the quality of the generated responses is actually quite good. 🥳