Retrieval-augmented generation (RAG) has been generating a lot of buzz recently, as it promises to solve many of the issues currently facing AI, machine learning, and LLMs. RAG supplements the knowledge an LLM learns from its training datasets, which are outdated by their very nature due to how they're collected. Public LLMs are also trained only on publicly available data, leaving internal and proprietary data out of scope. Without the ability to add or integrate your own data, an LLM's usefulness is greatly limited.
Hosting a local RAG-based assistant solves many of these issues, letting you take advantage of an LLM's natural language processing and summarization abilities while also allowing you to upload your own data. You can then engage with that data using natural language, quickly summarizing and querying it in ways that can be tricky to accomplish strictly using code.
For this tutorial, we're going to show you how to build and deploy your own RAG-based assistant so you can see for yourself how useful RAG can be. Before we begin, however, let's take a quick look at how RAG works. Understanding the general principles of RAG-based assistants keeps your setup platform-agnostic and lets you pick the tooling that's best for you, as RAG-based tools have proliferated over the last 12 months.
How RAG Works

Retrieval-augmented generation is the next step beyond LLMs, which are incredibly impressive but have their share of shortcomings. Most notably, LLMs like ChatGPT tend to make things up when they don't know something, a behavior known as hallucination. This fact alone prevents LLMs from being as useful as they could be, as you can't simply trust their output or let them act on their own. RAG bolsters what the model learned during training with additional information, which can be sourced from the internet or from local files.
RAG-based systems incorporate this additional information by slicing it into chunks using a tool like LangChain. Each chunk is then converted into a numerical representation known as a vector embedding and indexed. When you ask a question, the system retrieves the chunks most relevant to your query, ranks them, and passes the highest-ranking ones to the LLM along with your question, helping to ensure a more accurate response.
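To make that retrieval loop concrete, here's a minimal sketch in Python. The chunking uses LangChain's RecursiveCharacterTextSplitter, which is a real class, but the embed() function below is a stand-in we've invented so the sketch runs offline; in practice you'd call an embedding model from Cohere, OpenAI, or a local library. The ranking is plain cosine similarity rather than a production vector database, and the file name is a placeholder.

# Minimal RAG retrieval sketch: chunk, embed, and rank.
import numpy as np
from langchain_text_splitters import RecursiveCharacterTextSplitter

def embed(text: str) -> np.ndarray:
    # Stand-in for a real embedding model (Cohere, OpenAI, etc.).
    # Fakes an embedding with a hashed bag-of-words vector.
    vec = np.zeros(512)
    for token in text.lower().split():
        vec[hash(token) % 512] += 1.0
    norm = np.linalg.norm(vec)
    return vec / norm if norm else vec

# 1. Slice the source document into overlapping chunks.
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
chunks = splitter.split_text(open("state_of_sales.txt").read())  # placeholder file

# 2. Index: embed every chunk once, up front.
index = [(chunk, embed(chunk)) for chunk in chunks]

# 3. Retrieve: embed the query and rank chunks by cosine similarity.
query = embed("What time period does this report cover?")
ranked = sorted(index, key=lambda pair: float(pair[1] @ query), reverse=True)

# 4. The top-ranked chunks are handed to the LLM as extra context.
for chunk, vec in ranked[:3]:
    print(f"score={float(vec @ query):.3f}  {chunk[:80]}...")

A real system swaps embed() for an actual model and stores the vectors in a vector database, but this basic flow of chunk, embed, retrieve, and rank is what a tool like Kotaemon automates for you.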
Now that you’ve got a better idea of how RAG works, we will show you how to set up your own RAG-based assistant so you can try it out for yourself.
1. Find Data

RAG-based assistants let you upload your own data, which is one of their most popular applications, and doing so will give you a better idea of what they're capable of. For this tutorial, we'll be using the Salesforce State of Sales report, which surveyed thousands of global sales professionals to discover business performance trends.
2. Install Kotaemon

For this tutorial, we're going to use a tool called Kotaemon, an open-source RAG-based tool with a clean, minimal UI that lets you upload and interact with your own files. We're using Kotaemon because it's available as a standalone installer, a Docker image, and a Hugging Face Space. We'll primarily use the Hugging Face kotaemon-template, as deploying and hosting LLMs yourself can be tricky at times, requiring permissions to be set appropriately to work with public LLMs like Llama-3.1-8B. The models can also be prohibitively large, taking a while to build and deploy.
We'll walk you through setting up a duplicated version of the Kotaemon Hugging Face Space, but we'll also cover setting up Kotaemon with Docker in case you want to run an instance locally.
Install Kotaemon Hugging Face Space

Start by going to the Kotaemon Template on Hugging Face; you'll need to create an account if you don't have one. On that page, click the icon with three vertical dots next to your user profile and select Duplicate this Space from the dropdown menu. In the duplicated Space, you'll need to choose which language model to use for the template. We're using Cohere, so you'll need to sign up for a free Cohere account to get an API key. Once you have it, copy and paste the API key into the text box and select Proceed. Then wait for your RAG model to build, which takes approximately ten minutes.
Install Kotaemon With Docker

To install with Docker, start by ensuring you have the necessary requirements installed.

Kotaemon System Requirements

- Docker, for the Docker-based install
- Python 3.10 or later, for the CLI-based install covered below
- Unstructured, if you want to process file types beyond PDFs (it comes bundled in the full Docker image)
Once you’ve installed these, run the following commands to build Kotaemon with Docker. We’ll show you how to install both the lite and the full versions. The full version comes with Unstructured bundled in, but it also makes the Docker image noticeably larger. For most demo purposes, the lite version should do.
Kotaemon Lite

docker run \
-e GRADIO_SERVER_NAME=0.0.0.0 \
-e GRADIO_SERVER_PORT=7860 \
-v ./ktem_app_data:/app/ktem_app_data \
-p 7860:7860 -it --rm \
ghcr.io/cinnamon/kotaemon:main-lite

This tells Docker to run an instance of Kotaemon served through Gradio, a popular tool for demoing machine learning and web apps.
Kotaemon Full

docker run \
-e GRADIO_SERVER_NAME=0.0.0.0 \
-e GRADIO_SERVER_PORT=7860 \
-v ./ktem_app_data:/app/ktem_app_data \
-p 7860:7860 -it --rm \
ghcr.io/cinnamon/kotaemon:main-full

If you want to try this demo using the Ollama model instead, you can substitute ghcr.io/cinnamon/kotaemon:main-full with ghcr.io/cinnamon/kotaemon:feat-ollama_docker-full. Keep in mind that Ollama can be somewhat resource-intensive to deploy, so it might take a little longer to build.
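With either image, once the container is running, the -p 7860:7860 flag maps the UI to your local machine, so you can reach Kotaemon at http://localhost:7860 in your browser.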
Kotaemon also lets you specify which platform you're running on. It supports linux/amd64 and linux/arm64, the latter being what you'll use for newer Macs with Apple silicon. If you need to use ARM64, that will look like this:
# To run docker with platform linux/arm64
docker run \
-e GRADIO_SERVER_NAME=0.0.0.0 \
-e GRADIO_SERVER_PORT=7860 \
-v ./ktem_app_data:/app/ktem_app_data \
-p 7860:7860 -it --rm \
--platform linux/arm64 \
ghcr.io/cinnamon/kotaemon:main-lite

Install Kotaemon Without Docker

You can also install Kotaemon from the CLI. You just need to set up a virtual environment and clone the GitHub repo.
# optional (set up the environment)
conda create -n kotaemon python=3.10
conda activate kotaemon

# clone this repo
git clone https://github.com/Cinnamon/kotaemon
cd kotaemon

pip install -e "libs/kotaemon[all]"
pip install -e "libs/ktem"

Once the virtual environment is active and all libraries are installed, you'll need to create a .env file in the root directory; you can pattern your own .env file after the included .env.example. The .env file is only used the first time you run the app, to populate the database.
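As an illustration only (the authoritative key names live in .env.example, so check that file for what your version expects), a minimal .env for a Cohere-backed setup might look something like this:

# Illustrative sketch; copy the real key names from .env.example
COHERE_API_KEY=your-cohere-api-key

Once you've filled out the .env file, you can launch the app by running: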
python app.py

3. Try Out the RAG-Based Assistant

Now that Kotaemon is up and running, you can try it out for yourself to get an idea of how a RAG-based assistant might fit into your workflow. In Kotaemon, upload the Salesforce State of Sales PDF using the Quick Upload option in the bottom-left corner of the screen. Once it's uploaded, give it a moment to finish processing and indexing.
Once it's finished indexing, you can chat with the RAG-based assistant as if it were a person, querying the model about the data using natural language. For instance, you can ask it, "What time period does this cover?" and it will answer, "12 – 18 months." Finding that information yourself can be labor-intensive and time-consuming, and can require an above-average understanding of the subject. Best of all, the assistant shows you where it's getting that information, including a Relevance Score, making RAG more trustworthy and dependable.
Final Thoughts on Deploying a RAG-Based Assistant

The ability to quickly summarize information is one of the best illustrations of RAG's usefulness. When you ask Kotaemon, "What's the biggest obstacle facing sales professionals today?" it provides useful information about meeting customer expectations and about excessive administrative work preventing businesses from focusing on sales.
Using a RAG-based assistant lets you engage with internal documents and data using natural language. In our example, you can ask, "What can I do to help reduce excessive administrative tasks?" and Kotaemon will provide a list of valuable suggestions, from prioritizing collaboration and sales to streamlining your existing processes.
Best of all, adopting a RAG-based assistant lets you take advantage of the revolutionary potential of LLMs while avoiding many of their pitfalls. You can use LLMs far more effectively when you don't have to fact-check every single line for errors or outright fabrications. It's an important step in making LLMs, and AI in general, more trustworthy and useful.