This project implements a RAG (Retrieval-Augmented Generation) system for answering questions about IRS documents. Instead of searching through PDFs manually, you can ask questions in plain English and get answers grounded in the actual source material.
The system uses a hybrid retrieval approach: ChromaDB for semantic vector search combined with BM25 for keyword matching. Vector search surfaces conceptually similar passages, while BM25 surfaces exact terminology, which matters for tax documents where specific terms have precise meanings.
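To make the hybrid idea concrete, here is a minimal sketch of one common way to merge a vector-search ranking with a BM25 ranking: reciprocal rank fusion. The description above does not say how this project combines the two result lists, so the fusion method, the function name `reciprocal_rank_fusion`, the constant `k=60`, and the document IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, weights=None, k=60):
    """Merge several ranked lists of document IDs into a single ranking.

    Each list contributes weight / (k + rank) for every document it contains,
    so documents ranked highly by more than one retriever rise to the top.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    if len(weights) != len(rankings):
        raise ValueError("weights must have one entry per ranking")

    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)

    # Highest fused score first; ties are broken by ID so output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical chunk IDs standing in for results from each retriever.
vector_hits = ["pub17_p12", "pub501_p3", "pub17_p40"]     # semantic search
keyword_hits = ["pub501_p3", "form1040_p2", "pub17_p12"]  # BM25

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

Chunks that appear in both lists end up ahead of chunks found by only one retriever, which is the behavior a hybrid setup is after. The optional `weights` argument lets one retriever count for more than the other.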
The interface is built with Streamlit, making it easy to deploy and share. LangChain orchestrates the retrieval pipeline, and an OpenAI model generates the final answers with citations back to the source documents.
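As a rough illustration of how answers can carry citations, the sketch below numbers each retrieved chunk and asks the model to cite by number. It is not the project's actual prompt or LangChain chain: the `Chunk` class, the `build_prompt` function, the prompt wording, and the file names are hypothetical, and no LangChain or OpenAI calls are made.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    """A retrieved passage plus the metadata needed to cite it (hypothetical shape)."""
    text: str
    source: str
    page: int


def build_prompt(question, chunks):
    """Format retrieved chunks as numbered sources followed by the question."""
    context_lines = [
        f"[{i}] ({chunk.source}, p. {chunk.page}) {chunk.text}"
        for i, chunk in enumerate(chunks, start=1)
    ]
    context = "\n".join(context_lines)
    return (
        "Answer the question using only the numbered sources below. "
        "Cite sources by number, e.g. [1]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\n"
        f"Question: {question}"
    )


# Placeholder chunks; real ones would come from the retrieval step.
chunks = [
    Chunk("Placeholder text of the first retrieved passage.", "example_a.pdf", 4),
    Chunk("Placeholder text of the second retrieved passage.", "example_b.pdf", 11),
]
prompt = build_prompt("What does the first passage say?", chunks)
print(prompt)
```

Because every source line carries its file name and page, a bracketed number in the model's answer can be mapped back to a specific document location when the answer is displayed.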
The app is hosted on Streamlit Community Cloud (free tier), so it may take a moment to wake up from sleep mode. Source code is on GitHub.