{"id":10880,"date":"2024-12-11T09:40:24","date_gmt":"2024-12-11T09:40:24","guid":{"rendered":"https:\/\/www.aihello.com\/resources\/?p=10880"},"modified":"2024-12-11T10:35:13","modified_gmt":"2024-12-11T10:35:13","slug":"building-a-rag-pipeline-with-fastapi-haystack-and-chromadb-for-urls-in-python","status":"publish","type":"post","link":"https:\/\/www.aihello.com\/resources\/blog\/building-a-rag-pipeline-with-fastapi-haystack-and-chromadb-for-urls-in-python\/","title":{"rendered":"Building a RAG Pipeline with FastAPI, Haystack, and ChromaDB for URLs in Python"},"content":{"rendered":"<p><strong><em>Harness the power of Retrieval Augmented Generation (RAG) to create an intelligent document interaction system using FastAPI, Haystack, ChromaDB, and Crawl4AI.<\/em><\/strong><\/p><h1 class=\"wp-block-heading\" id=\"f7a6\">Table of Contents<\/h1><ol class=\"wp-block-list\"><li>Problem Statement<\/li><li>Introduction<\/li><li>Prerequisites<\/li><li>Project Components<\/li><li>Setting Up the Environment<\/li><li>Building the FastAPI Server<\/li><li>Building the Haystack RAG Pipeline<\/li><li>Testing the APIs<\/li><li>Conclusion<\/li><li>Optional: Adding Asynchronous Processing with Celery<\/li><\/ol><hr class=\"wp-block-separator has-alpha-channel-opacity\" \/><h1 class=\"wp-block-heading\" id=\"6fd9\">Problem Statement<\/h1><p id=\"de0d\">The goal is to develop a backend system that, given any URL as input, lets the user converse about its contents.<\/p><p id=\"df48\">For example, given a news article or a hosted PDF, users should be able to ask questions about it, such as:<\/p><ul class=\"wp-block-list\"><li>What is the PDF about?<\/li><li>Summarize the article for me in 100 words.<\/li><li>Any specific question about the content at the URL.<\/li><\/ul><h1 class=\"wp-block-heading\" id=\"dc1e\">Introduction<\/h1><p id=\"01ff\">In the era of information overload, extracting relevant information from vast amounts of data is a significant challenge. 
Retrieval Augmented Generation (RAG) combines retrieval and generation to provide precise answers from specific documents. In this article, we\u2019ll build a RAG pipeline that allows users to ingest any URL from various sources and interact with them using natural language queries.<\/p><p id=\"bbdd\">We\u2019ll utilize:<\/p><ul class=\"wp-block-list\"><li><a href=\"https:\/\/fastapi.tiangolo.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">FastAPI<\/a>: For building the RESTful API.<\/li><li><a href=\"https:\/\/github.com\/unclecode\/crawl4ai\" target=\"_blank\" rel=\"noreferrer noopener\">Crawl4AI<\/a>: A powerful tool for asynchronous web crawling and data extraction.<\/li><li><a href=\"https:\/\/haystack.deepset.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">Haystack<\/a>: An open-source framework for building search systems.<\/li><li><a href=\"https:\/\/www.trychroma.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">ChromaDB<\/a>: A vector database for storing embeddings.<\/li><\/ul><p id=\"ef4f\">Optional Components:<\/p><p id=\"b59b\"><a href=\"https:\/\/docs.celeryq.dev\/en\/stable\/\" rel=\"noreferrer noopener\" target=\"_blank\">Celery<\/a>&nbsp;and&nbsp;<a href=\"https:\/\/redis.io\/docs\/latest\/develop\/clients\/redis-py\/\" rel=\"noreferrer noopener\" target=\"_blank\">Redis<\/a>: For asynchronous task processing to handle long-running tasks in the background.<\/p><hr class=\"wp-block-separator has-alpha-channel-opacity\" \/><h1 class=\"wp-block-heading\" id=\"8bb1\">Prerequisites<\/h1><p id=\"81bb\">Before diving in, ensure you have the following:<\/p><ul class=\"wp-block-list\"><li><a href=\"https:\/\/www.python.org\/downloads\/\" target=\"_blank\" rel=\"noreferrer noopener\">Python 3.8+ installed<\/a>.<\/li><li>Familiarity with Python and FastAPI.<\/li><li>Basic understanding of RAG and how it works.<\/li><\/ul><hr class=\"wp-block-separator has-alpha-channel-opacity\" \/><h1 class=\"wp-block-heading\" id=\"5b9b\">Project 
Components<\/h1><ol class=\"wp-block-list\"><li>FastAPI Server: Exposes two endpoints:<\/li><\/ol><ul class=\"wp-block-list\"><li>Ingest API: Accepts a URL, processes it, and stores the content.<\/li><li>Generate API: Accepts a question and the same URL that was ingested, then retrieves relevant content to generate an answer.<\/li><\/ul><p id=\"e0e8\">2. Crawl4AI: Simplifies asynchronous web crawling and data extraction.<\/p><p id=\"0cc7\">3. Haystack with ChromaDB: Stores and retrieves document embeddings.<\/p><hr class=\"wp-block-separator has-alpha-channel-opacity\" \/><h1 class=\"wp-block-heading\" id=\"d9d1\">Setting Up the Environment<\/h1><p id=\"0929\">We\u2019ll set up the project from scratch.<\/p><h3 class=\"wp-block-heading has-medium-font-size\" id=\"c8bc\">1. Create a Project Directory<\/h3><p id=\"e7af\">Create a new directory for your project and navigate into it.<\/p><h4 class=\"wp-block-heading\" id=\"5441\">2. Create a Virtual Environment<\/h4><p id=\"9c5e\">Create and activate a virtual environment to manage the project dependencies.<\/p><p>For Windows:<\/p><pre class=\"wp-block-code\"><code>python -m venv venv<\/code><\/pre><pre class=\"wp-block-code\"><code>venv\\Scripts\\activate<\/code><\/pre><p>For macOS\/Linux:<\/p><pre class=\"wp-block-code\"><code>python3 -m venv venv<\/code><\/pre><pre class=\"wp-block-code\"><code>source venv\/bin\/activate<\/code><\/pre><h3 class=\"wp-block-heading\" id=\"a650\">3. 
Install Dependencies<\/h3><p id=\"b804\">We\u2019ll need several Python packages.<\/p><p id=\"81a4\">Install them using pip:<\/p><pre class=\"wp-block-code\"><code>pip install fastapi uvicorn&#091;standard] requests crawl4ai chromadb chroma-haystack haystack-ai ollama-haystack python-multipart<\/code><\/pre><p><em>Alternatively, you can create a requirements.txt file with all dependencies and install them with pip.<\/em><\/p><p><strong>requirements.txt<\/strong><\/p><pre class=\"wp-block-code\"><code>haystack-ai\nollama-haystack\nchroma-haystack\npython-multipart\nfastapi\nuvicorn\ncrawl4ai&#091;sync]\nplaywright\npypdf\nmarkdown-it-py\nmdit_plain\nfiletype\ncelery&#091;redis]  # Optional<\/code><\/pre><p>Install the dependencies:<\/p><pre class=\"wp-block-code\"><code>pip install -r requirements.txt<\/code><\/pre><h3 class=\"wp-block-heading\">4. Install Playwright and its dependencies<\/h3><pre class=\"wp-block-code\"><code>playwright install\nplaywright install-deps<\/code><\/pre><h2 class=\"wp-block-heading\">Building the FastAPI Server<\/h2><p><em>Create a file named main.py and set up the FastAPI app.<\/em><\/p><pre class=\"wp-block-code\"><code>from fastapi import FastAPI\n\napp = FastAPI()  # initialize the FastAPI app<\/code><\/pre><h2 class=\"wp-block-heading\">1. 
Handling Different URL Types<\/h2><p>First, let us create a utility that, given a URL as input, classifies it into one of two basic categories:<\/p><ol class=\"wp-block-list\"><li>File: .pdf, .txt, .md, etc.<\/li><li>Web page or article<\/li><\/ol><p>We\u2019ll use the requests library and Python\u2019s mimetypes module to identify file types based on the URL\u2019s content type.<\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"757\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-5.png\" alt=\"\" class=\"wp-image-10888\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-5.png 786w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-5-300x289.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-5-768x740.png 768w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"767\" height=\"663\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-6.png\" alt=\"\" class=\"wp-image-10890\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-6.png 767w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-6-300x259.png 300w\" sizes=\"auto, (max-width: 767px) 100vw, 767px\" \/><\/figure><p id=\"0b38\">This function classifies the URL based on its MIME type.<\/p><p id=\"601b\">It returns \u201carticle\u201d for a web page or web article, and the file extension for a document: .pdf for PDF files, .txt for text files, .md for Markdown files, and so on.<\/p><h3 class=\"wp-block-heading\">2. 
Crawling Web Articles with Crawl4AI<br>What is Crawl4AI?<\/h3><p id=\"7eaf\">Crawl4AI simplifies asynchronous web crawling and data extraction, making it accessible for large language models (LLMs) and AI applications. It handles the complexities of HTTP requests, parsing, and data extraction, providing a straightforward interface for developers.<\/p><h4 class=\"wp-block-heading\" id=\"65d5\">Using Crawl4AI to Scrape Web Articles<\/h4><p id=\"56b9\">We will create an asynchronous function that, given a URL, scrapes it using the crawl4ai Python package.<\/p><p id=\"8405\">Check out the&nbsp;<a href=\"https:\/\/crawl4ai.com\/mkdocs\/basic\/installation\/\" rel=\"noreferrer noopener\" target=\"_blank\">crawl4ai documentation<\/a>&nbsp;if you need help with it.<\/p><p><strong>Example Usage:<\/strong><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"848\" height=\"887\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-7.png\" alt=\"\" class=\"wp-image-10891\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-7.png 848w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-7-287x300.png 287w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-7-768x803.png 768w\" sizes=\"auto, (max-width: 848px) 100vw, 848px\" \/><\/figure><p>We will use the AsyncWebCrawler and pass it an LLMExtractionStrategy object, also from crawl4ai.<\/p><p id=\"8157\">The LLMExtractionStrategy takes the following important parameters:<\/p><ol class=\"wp-block-list\"><li><strong>provider<\/strong>: The <a href=\"https:\/\/www.aihello.com\/resources\/blog\/tokenization-and-its-application\/\">LLM<\/a> provider you want to use. For this tutorial, we will use&nbsp;<a href=\"https:\/\/ollama.com\/\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>Ollama<\/strong><\/a>, which runs LLMs locally. Alternatively, you can use any other LLM provider listed in the&nbsp;<a href=\"https:\/\/docs.litellm.ai\/docs\/providers\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>LiteLLM<\/strong><\/a>&nbsp;docs.<\/li><li><strong>base_url<\/strong>: The API base URL where Ollama is hosted. If it runs locally, this should be something like&nbsp;<strong>http:\/\/localhost:11434<\/strong>.<\/li><li><strong>instruction<\/strong>: The prompt or instruction you want to send to the LLM while scraping the page.<\/li><li><strong>api_token<\/strong>: The API key, needed when using hosted LLM providers such as OpenAI or Gemini.<\/li><\/ol><p id=\"867a\">You can check out all the other parameters it takes in its&nbsp;<a href=\"https:\/\/crawl4ai.com\/mkdocs\/\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>documentation<\/strong><\/a>.<\/p><h2 class=\"wp-block-heading\" id=\"848b\"><strong>3. Implementing the Ingest Endpoint<\/strong><\/h2><p id=\"893b\">Now let us create the ingest endpoint. It accepts a URL as input, determines its category using the identify_url function we created earlier, and passes it to one of two ingestion functions, ingest_url or ingest_file, which we will create shortly.<\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"797\" height=\"832\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-8.png\" alt=\"\" class=\"wp-image-10892\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-8.png 797w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-8-287x300.png 287w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-8-768x802.png 768w\" sizes=\"auto, (max-width: 797px) 100vw, 797px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"738\" height=\"702\" 
src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-9.png\" alt=\"\" class=\"wp-image-10893\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-9.png 738w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-9-300x285.png 300w\" sizes=\"auto, (max-width: 738px) 100vw, 738px\" \/><\/figure><h2 class=\"wp-block-heading\" id=\"4781\"><strong>4. Implementing the Generate API<\/strong><\/h2><p id=\"25dd\">Next, we create an endpoint that, given a URL and a question, classifies the URL using the <strong>identify_url<\/strong> function and routes the request to one of two functions, <strong>get_url_result<\/strong> or <strong>get_file_result<\/strong>, to generate the response to that question. We will create these functions shortly.<\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"790\" height=\"817\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/generate-api-endpoint.png\" alt=\"\" class=\"wp-image-10894\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/generate-api-endpoint.png 790w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/generate-api-endpoint-290x300.png 290w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/generate-api-endpoint-768x794.png 768w\" sizes=\"auto, (max-width: 790px) 100vw, 790px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"806\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-11.png\" alt=\"\" class=\"wp-image-10895\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-11.png 785w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-11-292x300.png 292w, 
https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-11-768x789.png 768w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"778\" height=\"537\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-12.png\" alt=\"\" class=\"wp-image-10896\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-12.png 778w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-12-300x207.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-12-768x530.png 768w\" sizes=\"auto, (max-width: 778px) 100vw, 778px\" \/><\/figure><h3 class=\"wp-block-heading\" id=\"384b\">Building the Haystack RAG Pipeline<\/h3><p id=\"0952\">Let us implement the logic for the ingestion and result generation functions.<\/p><p id=\"066e\">First, let us set up the LLM backend. For chat, we will use the same LLM that we used for crawling.<\/p><p id=\"9584\">Note that RAG requires two kinds of models, one for conversation and one for generating embeddings, so we need to pull both into Ollama before using them.<\/p><p id=\"96b8\">Here, I am using llama3.2 for chat and nomic-embed-text for generating embeddings.<\/p><p id=\"baa7\">You don\u2019t need to worry about how embedding or chat is handled: you just pass the model names, and Haystack takes care of the rest.<\/p><p id=\"dcc9\"><strong>Set Up the LLM Backend and Prompt<\/strong><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"780\" height=\"848\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-13.png\" alt=\"\" class=\"wp-image-10897\" 
srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-13.png 780w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-13-276x300.png 276w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-13-768x835.png 768w\" sizes=\"auto, (max-width: 780px) 100vw, 780px\" \/><\/figure><p id=\"dcc9\"><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"782\" height=\"842\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-14.png\" alt=\"\" class=\"wp-image-10898\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-14.png 782w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-14-279x300.png 279w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-14-768x827.png 768w\" sizes=\"auto, (max-width: 782px) 100vw, 782px\" \/><\/figure><p id=\"dcc9\"><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"787\" height=\"816\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-15.png\" alt=\"\" class=\"wp-image-10899\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-15.png 787w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-15-289x300.png 289w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-15-768x796.png 768w\" sizes=\"auto, (max-width: 787px) 100vw, 787px\" \/><\/figure><p id=\"dcc9\"><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"500\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-16.png\" alt=\"\" class=\"wp-image-10900\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-16.png 786w, 
https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-16-300x191.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-16-768x489.png 768w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><\/figure><p id=\"dcc9\"><em>Note: Replace Your_Ollama_URL with your ollama URL<\/em><\/p><p id=\"b224\">Now, let us build the necessary functionality to process the documents and URLs, separately.<\/p><p><strong>Processing URLs<\/strong><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"778\" height=\"743\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-17.png\" alt=\"\" class=\"wp-image-10901\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-17.png 778w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-17-300x287.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-17-768x733.png 768w\" sizes=\"auto, (max-width: 778px) 100vw, 778px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"782\" height=\"682\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-18.png\" alt=\"\" class=\"wp-image-10902\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-18.png 782w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-18-300x262.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-18-768x670.png 768w\" sizes=\"auto, (max-width: 782px) 100vw, 782px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"838\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-19.png\" alt=\"\" class=\"wp-image-10903\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-19.png 785w, 
https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-19-281x300.png 281w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-19-768x820.png 768w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"911\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/query-pipeline-processing-png.png\" alt=\"\" class=\"wp-image-10904\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/query-pipeline-processing-png.png 785w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/query-pipeline-processing-png-259x300.png 259w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/query-pipeline-processing-png-768x891.png 768w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"230\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-21.png\" alt=\"\" class=\"wp-image-10905\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-21.png 786w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-21-300x88.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-21-768x225.png 768w\" sizes=\"auto, (max-width: 786px) 100vw, 786px\" \/><\/figure><hr class=\"wp-block-separator has-alpha-channel-opacity\" \/><p><strong>Processing Documents<\/strong><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"792\" height=\"446\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-22.png\" alt=\"\" class=\"wp-image-10906\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-22.png 792w, 
https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-22-300x169.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-22-768x432.png 768w\" sizes=\"auto, (max-width: 792px) 100vw, 792px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"783\" height=\"822\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-23.png\" alt=\"\" class=\"wp-image-10907\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-23.png 783w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-23-286x300.png 286w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-23-768x806.png 768w\" sizes=\"auto, (max-width: 783px) 100vw, 783px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"788\" height=\"647\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-24.png\" alt=\"\" class=\"wp-image-10908\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-24.png 788w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-24-300x246.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-24-768x631.png 768w\" sizes=\"auto, (max-width: 788px) 100vw, 788px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"786\" height=\"815\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-25.png\" alt=\"\" class=\"wp-image-10909\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-25.png 786w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-25-289x300.png 289w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-25-768x796.png 768w\" sizes=\"auto, (max-width: 786px) 100vw, 
786px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"780\" height=\"588\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-26.png\" alt=\"\" class=\"wp-image-10910\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-26.png 780w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-26-300x226.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-26-768x579.png 768w\" sizes=\"auto, (max-width: 780px) 100vw, 780px\" \/><\/figure><h2 class=\"wp-block-heading\"><strong>Testing the APIs<\/strong><\/h2><ol class=\"wp-block-list\"><li><strong>Starting the FastAPI Application<\/strong><\/li><\/ol><p id=\"9a92\">In your terminal or command prompt, navigate to your project directory and activate the virtual environment.<\/p><p id=\"bb24\">Then run:<\/p><pre class=\"wp-block-code\"><code>uvicorn main:app --reload<\/code><\/pre><p id=\"60d4\"><strong>2. Testing with API Documentation<\/strong><\/p><p id=\"6e82\">FastAPI provides interactive API documentation at&nbsp;<a href=\"http:\/\/localhost:8000\/docs\" rel=\"noreferrer noopener\" target=\"_blank\"><strong>http:\/\/localhost:8000\/docs<\/strong><\/a>.<\/p><p id=\"35a1\">You can test your APIs directly from this interface. 
You will see the Swagger UI there.<\/p><h4 class=\"wp-block-heading\" id=\"64c1\">Testing the Ingest API<\/h4><ol class=\"wp-block-list\"><li>Open your web browser and navigate to&nbsp;<a href=\"http:\/\/localhost:8000\/docs\" target=\"_blank\" rel=\"noreferrer noopener\"><strong>http:\/\/localhost:8000\/docs<\/strong><\/a>.<\/li><li>Find the&nbsp;<strong>\/ingest<\/strong>&nbsp;endpoint and expand it.<\/li><li>Click on the \u201cTry it out\u201d button.<\/li><\/ol><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"828\" height=\"424\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-2.png\" alt=\"\" class=\"wp-image-10884\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-2.png 828w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-2-300x154.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-2-768x393.png 768w\" sizes=\"auto, (max-width: 828px) 100vw, 828px\" \/><\/figure><ol start=\"4\" class=\"wp-block-list\"><li>Enter a URL to ingest (e.g., a web article or a PDF link).<\/li><li>Click \u201cExecute\u201d.<\/li><li>You\u2019ll receive a confirmation message if the content is ingested successfully.<\/li><\/ol><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"828\" height=\"261\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-1.png\" alt=\"\" class=\"wp-image-10883\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-1.png 828w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-1-300x95.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-1-768x242.png 768w\" sizes=\"auto, (max-width: 828px) 100vw, 828px\" \/><\/figure><h2 
class=\"wp-block-heading\" id=\"464c\">Testing the Generate API<\/h2><ol class=\"wp-block-list\"><li>Once the content is ingested, find the&nbsp;<strong>\/generate<\/strong>&nbsp;endpoint.<\/li><li>Click on \u201cTry it out\u201d.<\/li><\/ol><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"828\" height=\"356\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-3.png\" alt=\"\" class=\"wp-image-10885\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-3.png 828w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-3-300x129.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-3-768x330.png 768w\" sizes=\"auto, (max-width: 828px) 100vw, 828px\" \/><\/figure><ol start=\"3\" class=\"wp-block-list\"><li>Enter a question together with the same URL you ingested.<\/li><li>Click \u201cExecute\u201d to receive an answer from the system.<\/li><\/ol><hr class=\"wp-block-separator has-alpha-channel-opacity\" \/><h1 class=\"wp-block-heading\" id=\"972b\">Conclusion<\/h1><p id=\"39e7\">We\u2019ve built a RAG pipeline that:<\/p><ul class=\"wp-block-list\"><li>Accepts various document types (web articles, PDFs, text files).<\/li><li>Uses Crawl4AI to simplify web crawling and data extraction.<\/li><li>Allows users to interact with the content using natural language queries through FastAPI.<\/li><li>Utilizes Haystack and ChromaDB for efficient storage and retrieval of document embeddings.<\/li><\/ul><p id=\"9337\">This system can be extended to support more features like authentication, additional file types, and enhanced error handling. 
By integrating these technologies, you\u2019ve created a scalable and efficient pipeline for document ingestion and interaction.<\/p><h4 class=\"wp-block-heading\" id=\"0d85\">Optional: Adding Asynchronous Processing with Celery<\/h4><p id=\"28b1\">For long-running tasks like ingesting large documents or crawling extensive web content, it\u2019s beneficial to handle these tasks asynchronously. This ensures that your API remains responsive and doesn\u2019t time out.<\/p><h3 class=\"wp-block-heading\" id=\"4b89\">What are Celery and Redis?<\/h3><ul class=\"wp-block-list\"><li>Celery: An asynchronous task queue\/job queue based on distributed message passing.<\/li><li>Redis: An in-memory data structure store used as a database, cache, and message broker.<\/li><\/ul><h4 class=\"wp-block-heading\" id=\"74a2\">Implementing Asynchronous Ingestion<\/h4><ol class=\"wp-block-list\"><li><strong>Install Celery and Redis<\/strong><\/li><\/ol><pre class=\"wp-block-code\"><code>pip install celery&#091;redis]<\/code><\/pre><p>Ensure that Redis is running on your system. If not, install and start Redis according to your operating system\u2019s instructions.<\/p><p><strong>2. 
Create a Celery Worker<\/strong><\/p><p>Create a file named\u00a0<strong>celery_worker.py<\/strong><\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"652\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-27.png\" alt=\"\" class=\"wp-image-10911\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-27.png 785w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-27-300x249.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-27-768x638.png 768w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"785\" height=\"890\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-28.png\" alt=\"\" class=\"wp-image-10912\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-28.png 785w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-28-265x300.png 265w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-28-768x871.png 768w\" sizes=\"auto, (max-width: 785px) 100vw, 785px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"781\" height=\"228\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-29.png\" alt=\"\" class=\"wp-image-10913\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-29.png 781w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-29-300x88.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-29-768x224.png 768w\" sizes=\"auto, (max-width: 781px) 100vw, 781px\" \/><\/figure><p><strong>3. 
Modify the Ingest Endpoint<\/strong><\/p><p>In\u00a0<strong>main.py<\/strong>, update the ingest endpoint to use Celery.<\/p><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"791\" height=\"758\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/api-ingestion-endpoint-png.png\" alt=\"\" class=\"wp-image-10914\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/api-ingestion-endpoint-png.png 791w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/api-ingestion-endpoint-png-300x287.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/api-ingestion-endpoint-png-768x736.png 768w\" sizes=\"auto, (max-width: 791px) 100vw, 791px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"777\" height=\"377\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-31.png\" alt=\"\" class=\"wp-image-10915\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-31.png 777w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-31-300x146.png 300w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/image-31-768x373.png 768w\" sizes=\"auto, (max-width: 777px) 100vw, 777px\" \/><\/figure><figure class=\"wp-block-image size-full\"><img loading=\"lazy\" decoding=\"async\" width=\"781\" height=\"921\" src=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/building-a-rag-pipeline-with-fastapi-haystack-and-chromadb-for-urls-in-python.png\" alt=\"\" class=\"wp-image-10916\" srcset=\"https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/building-a-rag-pipeline-with-fastapi-haystack-and-chromadb-for-urls-in-python.png 781w, 
https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/building-a-rag-pipeline-with-fastapi-haystack-and-chromadb-for-urls-in-python-254x300.png 254w, https:\/\/www.aihello.com\/resources\/wp-content\/uploads\/2024\/12\/building-a-rag-pipeline-with-fastapi-haystack-and-chromadb-for-urls-in-python-768x906.png 768w\" sizes=\"auto, (max-width: 781px) 100vw, 781px\" \/><\/figure><p id=\"19bd\"><strong>4. Starting the Celery Worker<\/strong><\/p><p id=\"5277\">In a new terminal or command prompt window, navigate to your project directory and activate the virtual environment.<\/p><p id=\"0cb4\">Then run:<\/p><pre class=\"wp-block-code\"><code>celery -A celery_worker.celery_app worker --loglevel=info<\/code><\/pre><p id=\"75b9\"><strong>5. Testing Asynchronous Ingestion<\/strong><\/p><p id=\"f171\">Follow the same steps as in the Testing the APIs section; the ingestion now runs in the background. You can check the status of an ingestion task using the \/tasks\/{task_id} endpoint.<\/p><hr class=\"wp-block-separator has-alpha-channel-opacity\" \/><h1 class=\"wp-block-heading\" id=\"5135\">Additional Resources<\/h1><ul class=\"wp-block-list\"><li><a href=\"https:\/\/crawl4ai.com\/mkdocs\/\" target=\"_blank\" rel=\"noreferrer noopener\">Crawl4AI Documentation<\/a>: Learn more about how to use Crawl4AI for web crawling and data extraction.<\/li><li><a href=\"https:\/\/fastapi.tiangolo.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">FastAPI Documentation<\/a>: Learn more about building APIs with FastAPI.<\/li><li><a href=\"https:\/\/haystack.deepset.ai\/\" target=\"_blank\" rel=\"noreferrer noopener\">Haystack Documentation<\/a>: Explore advanced features of Haystack.<\/li><li><a href=\"https:\/\/docs.trychroma.com\/\" target=\"_blank\" rel=\"noreferrer noopener\">ChromaDB Documentation<\/a>: Understand how to manage vector embeddings.<\/li><li><a href=\"https:\/\/docs.celeryq.dev\/en\/stable\/\" target=\"_blank\" rel=\"noreferrer 
noopener\">Celery Documentation<\/a>: Dive into asynchronous task processing.<\/li><li><a href=\"https:\/\/redis.io\/docs\/latest\/develop\/clients\/redis-py\/\" target=\"_blank\" rel=\"noreferrer noopener\">Redis Documentation<\/a>: Learn about in-memory data structures and caching.<\/li><\/ul><p><strong>Happy coding! If you have any questions or suggestions, feel free to reach out or leave a comment.<\/strong><\/p><p><\/p>","protected":false},"excerpt":{"rendered":"<p>Harness the power of Retrieval Augmented Generation (RAG) to create an intelligent document interaction system using FastAPI, Haystack, ChromaDB, and Crawl4AI.Table of ContentsProblem StatementThe problem statement is to develop a&#8230;<\/p>\n","protected":false},"author":32,"featured_media":0,"comment_status":"closed","ping_status":"closed","sticky":false,"template":"","format":"standard","meta":{"footnotes":""},"categories":[1],"tags":[561],"class_list":["post-10880","post","type-post","status-publish","format-standard","hentry","category-advertising","tag-ai-machine-learning"],"_links":{"self":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/posts\/10880","targetHints":{"allow":["GET"]}}],"collection":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/users\/32"}],"replies":[{"embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/comments?post=10880"}],"version-history":[{"count":0,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/posts\/10880\/revisions"}],"wp:attachment":[{"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/media?parent=10880"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/categories?post=10880"},{"taxonomy":"post_tag","embeddable":true,"href":"
https:\/\/www.aihello.com\/resources\/wp-json\/wp\/v2\/tags?post=10880"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}