Adaptive RAG with local LLMs¶
Adaptive RAG is a strategy for RAG that unites (1) query analysis with (2) active / self-corrective RAG.
In the paper, the authors use query analysis to route questions across:
- No Retrieval
- Single-shot RAG
- Iterative RAG
Let's build on this using LangGraph.
In our implementation, we will route between:
- Web search: for questions related to recent events
- Self-corrective RAG: for questions related to our index
Setup¶
First, you'll need to install some required dependencies:
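A typical installation of the packages used in this guide might look like the following (package names are inferred from the imports below; adjust for your package manager):
npm install langchain @langchain/core @langchain/community @langchain/ollama @langchain/langgraph cheerio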
For the fallback web search, you'll also need to obtain a Tavily API key and set it as an environment variable named TAVILY_API_KEY.
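For example, you can set it in code before running anything else (the value below is a placeholder):
process.env.TAVILY_API_KEY = "<your-tavily-api-key>";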
Models¶
Next, choose which local models you'll use.
Local Embeddings¶
We'll be using the mxbai-embed-large
embeddings model from Ollama.
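If you haven't already pulled this model, you can do so from the command line (assuming the Ollama CLI is installed):
ollama pull mxbai-embed-large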
Local LLM¶
(1) Download the Ollama app.
(2) Pull a Llama 3 model. You can also try a Mistral model, one of the
quantized Cohere Command-R models, or
any other model you'd like from the
Ollama library - just be sure that your computer
has sufficient RAM.
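For example, to pull the Llama 3 model used in the code below (assuming the Ollama CLI is installed):
ollama pull llama3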
Tracing¶
Optionally, use LangSmith for tracing by setting the following environment variables:
// process.env.LANGCHAIN_TRACING_V2 = "true";
// process.env.LANGCHAIN_ENDPOINT = "https://api.smith.langchain.com";
// process.env.LANGCHAIN_API_KEY = "<your-api-key>";
Index¶
Now that you've chosen and set up your local models, load and index some source
documents. The code below uses some of
Lilian Weng's blog posts on LLMs and agents as
a data source, then loads them into a demo
MemoryVectorStore
instance. It then creates a
retriever from that
vector store for later use.
import { RecursiveCharacterTextSplitter } from "langchain/text_splitter";
import { CheerioWebBaseLoader } from "@langchain/community/document_loaders/web/cheerio";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OllamaEmbeddings } from "@langchain/ollama";
const urls = [
  "https://lilianweng.github.io/posts/2023-06-23-agent/",
  "https://lilianweng.github.io/posts/2023-03-15-prompt-engineering/",
  "https://lilianweng.github.io/posts/2023-10-25-adv-attack-llm/",
];

const docs = await Promise.all(urls.map((url) => {
  const loader = new CheerioWebBaseLoader(url);
  return loader.load();
}));
const docsList = docs.flat();

const textSplitter = new RecursiveCharacterTextSplitter({
  chunkSize: 250,
  chunkOverlap: 0,
});
const splitDocs = await textSplitter.splitDocuments(docsList);

const embeddings = new OllamaEmbeddings({
  model: "mxbai-embed-large",
});

// Add to vector store
const vectorStore = await MemoryVectorStore.fromDocuments(
  splitDocs,
  embeddings,
);
const retriever = vectorStore.asRetriever();
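As a quick sanity check that indexing worked, you can query the retriever directly (a minimal sketch; the query string is arbitrary):
const sampleDocs = await retriever.invoke("agent memory");
console.log(sampleDocs.length);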
Creating components¶
Here, you'll create the components of the graph.
Question router¶
First, create a chain that will route incoming questions towards either your vector store if they are related to LLMs or agents, or to a general web search if they are not.
You'll use Ollama's JSON mode to help keep the output format consistent.
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { JsonOutputParser } from "@langchain/core/output_parsers";
import { ChatOllama } from "@langchain/ollama";
const jsonModeLlm = new ChatOllama({
  model: "llama3",
  format: "json",
  temperature: 0,
});
const QUESTION_ROUTER_SYSTEM_TEMPLATE =
`You are an expert at routing a user question to a vectorstore or web search.
Use the vectorstore for questions on LLM agents, prompt engineering, and adversarial attacks.
You do not need to be stringent with the keywords in the question related to these topics.
Otherwise, use web-search. Give a binary choice 'web_search' or 'vectorstore' based on the question.
Return a JSON with a single key 'datasource' and no preamble or explanation.`;
const questionRouterPrompt = ChatPromptTemplate.fromMessages([
  ["system", QUESTION_ROUTER_SYSTEM_TEMPLATE],
  ["human", "{question}"],
]);

const questionRouter = questionRouterPrompt.pipe(jsonModeLlm).pipe(
  new JsonOutputParser(),
);
await questionRouter.invoke({ question: "llm agent memory" });
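You can also try a question outside the indexed topics; with the routing prompt above, the router should generally choose 'web_search', though the exact behavior depends on your local model:
await questionRouter.invoke({ question: "who is favored to win the next World Cup?" });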
Retrieval grader¶
Create a grader that will check retrieved documents from our vector store for relevancy:
const GRADER_TEMPLATE =
`You are a grader assessing relevance of a retrieved document to a user question.
Here is the retrieved document:
<document>
{content}
</document>
Here is the user question:
<question>
{question}
</question>
If the document contains keywords related to the user question, grade it as relevant.
It does not need to be a stringent test. The goal is to filter out erroneous retrievals.
Give a binary score 'yes' or 'no' to indicate whether the document is relevant to the question.
Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.`;
const graderPrompt = ChatPromptTemplate.fromTemplate(GRADER_TEMPLATE);
const retrievalGrader = graderPrompt.pipe(jsonModeLlm).pipe(
  new JsonOutputParser(),
);

// Test run
const testQuestion = "agent memory";
const docs2 = await retriever.invoke(testQuestion);
await retrievalGrader.invoke({
  question: testQuestion,
  content: docs2[0].pageContent,
});
"agent memory"
.
Generation¶
Next, create a chain that generates an answer based on retrieved documents.
import * as hub from "langchain/hub";
import { StringOutputParser } from "@langchain/core/output_parsers";
import type { Document } from "@langchain/core/documents";
// https://smith.langchain.com/hub/rlm/rag-prompt
const ragPrompt = await hub.pull("rlm/rag-prompt");
// Post-processing
const formatDocs = (docs: Document[]) => {
  return docs.map((doc) => doc.pageContent).join("\n\n");
};

// Initialize a new model without JSON mode active
const llm = new ChatOllama({
  model: "llama3",
  temperature: 0,
});
// Chain
const ragChain = ragPrompt.pipe(llm).pipe(new StringOutputParser());
// Test run
const testQuestion2 = "agent memory";
const docs3 = await retriever.invoke(testQuestion2);
await ragChain.invoke({ context: formatDocs(docs3), question: testQuestion2 });
Based on the provided context, it appears that an agent's memory refers to its ability to record and reflect on past experiences, using both long-term and short-term memory modules. The long-term memory module, or "memory stream," stores a comprehensive list of agents' experiences in natural language, while the reflection mechanism synthesizes these memories into higher-level inferences over time to guide future behavior.
Hallucination grader¶
Create a chain that reviews a generated answer and checks for hallucinations. We'll return to using JSON mode for this one:
const HALLUCINATION_GRADER_TEMPLATE =
`You are a grader assessing whether an answer is grounded in / supported by a set of facts.
Here are the facts used as context to generate the answer:
<context>
{context}
</context>
Here is the answer:
<answer>
{generation}
</answer>
Give a binary score 'yes' or 'no' to indicate whether the answer is grounded in / supported by a set of facts.
Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.`;
const hallucinationGraderPrompt = ChatPromptTemplate.fromTemplate(
  HALLUCINATION_GRADER_TEMPLATE,
);

const hallucinationGrader = hallucinationGraderPrompt.pipe(jsonModeLlm).pipe(
  new JsonOutputParser(),
);

// Test run
const generation2 = await ragChain.invoke({
  context: formatDocs(docs3),
  question: testQuestion2,
});
await hallucinationGrader.invoke({
  context: formatDocs(docs3),
  generation: generation2,
});
Answer Grader¶
Create a chain for checking the relevancy of the final answer:
const ANSWER_GRADER_PROMPT_TEMPLATE =
`You are a grader assessing whether an answer is useful to resolve a question.
Here is the answer:
<answer>
{generation}
</answer>
Here is the question:
<question>
{question}
</question>
Give a binary score 'yes' or 'no' to indicate whether the answer is useful to resolve a question.
Provide the binary score as a JSON with a single key 'score' and no preamble or explanation.`;
const answerGraderPrompt = ChatPromptTemplate.fromTemplate(
  ANSWER_GRADER_PROMPT_TEMPLATE,
);

const answerGrader = answerGraderPrompt.pipe(jsonModeLlm).pipe(
  new JsonOutputParser(),
);

// Test run
const generation3 = await ragChain.invoke({
  context: formatDocs(docs3),
  question: testQuestion2,
});
await answerGrader.invoke({ question: testQuestion2, generation: generation3 });
Question rewriter¶
Create a question rewriter. This chain performs query analysis on the user questions and optimizes them for RAG to help handle difficult queries.
const REWRITER_PROMPT_TEMPLATE =
`You are a question re-writer that converts an input question to a better version that is optimized
for vectorstore retrieval. Look at the initial question and formulate an improved question.
Here is the initial question:
<question>
{question}
</question>
Respond only with an improved question. Do not include any preamble or explanation.`;
const rewriterPrompt = ChatPromptTemplate.fromTemplate(
  REWRITER_PROMPT_TEMPLATE,
);
const rewriter = rewriterPrompt.pipe(llm).pipe(new StringOutputParser());
// Test run
// Test question is "agent memory"
await rewriter.invoke({ question: testQuestion2 });
Web Search Tool¶
Finally, you'll need a web search tool that can handle questions outside the scope of the indexed documents. The code below initializes a Tavily-powered search tool:
import { TavilySearchResults } from "@langchain/community/tools/tavily_search";
const webSearchTool = new TavilySearchResults({ maxResults: 3 });
await webSearchTool.invoke("red robin");
[{"title":"Family Friendly Burger Restaurant | Red Robin","url":"https://www.redrobin.com/","content":"Red Robin is donating 10¢ to Make-A-Wish ® for every Kids Meal purchased. You can contribute to life-changing wishes by simply purchasing a Kids Meal at Red Robin for Dine-in or To-Go. Join us for a memorable meal or order online and help transform lives, one wish at a time.","score":0.998043,"raw_content":null},{"title":"Red Robin United States of America Directory","url":"https://locations.redrobin.com/locations-list/us/","content":"Maps, Driving Directions and Local Restaurant Information for Red Robin Restaurants in United States","score":0.99786776,"raw_content":null},{"title":"Red Robin Restaurant Locations","url":"https://locations.redrobin.com/","content":"Maps, Driving Directions and Local Restaurant Information for Red Robin","score":0.99718815,"raw_content":null}]
Graph¶
Now that you've created all the necessary components, it's time to capture the flow as a graph.
Graph state¶
Define the graph state like this. Since question and generation are simple strings that should simply be overwritten by each update, they don't need a custom reducer. The documents channel gets an explicit reducer and default so that documents returned from a node replace the current list:
import type { Document } from "@langchain/core/documents";
import { Annotation } from "@langchain/langgraph";
// This defines the agent state.
// Returned documents from a node will override the current
// "documents" value in the state object.
const GraphState = Annotation.Root({
  question: Annotation<string>,
  generation: Annotation<string>,
  documents: Annotation<Document[]>({
    reducer: (_, y) => y,
    default: () => [],
  }),
});
Preparing nodes and edges¶
Let's wrap our components in functions that match the interfaces required by LangGraph. These functions will handle formatting inputs and outputs.
We'll use some components within nodes, and others to define conditional edges. Each will take the graph state as a parameter. Nodes return state properties to be updated, while conditional edges return the name of the next node to execute.
import { Document } from "@langchain/core/documents";
/* ---Nodes--- */
// Retrieve documents for a question
const retrieve = async (state: typeof GraphState.State): Promise<Partial<typeof GraphState.State>> => {
  console.log("---RETRIEVE---");
  const documents = await retriever.invoke(state.question);
  // Add sources to the state
  return { documents };
};

// RAG generation
const generate = async (state: typeof GraphState.State): Promise<Partial<typeof GraphState.State>> => {
  console.log("---GENERATE---");
  const generation = await ragChain.invoke({
    context: formatDocs(state.documents),
    question: state.question,
  });
  // Add generation to the state
  return { generation };
};

// Determines whether the retrieved documents are relevant to the question.
const gradeDocuments = async (state: typeof GraphState.State): Promise<Partial<typeof GraphState.State>> => {
  console.log("---CHECK DOCUMENT RELEVANCE TO QUESTION---");
  // Score each doc
  const relevantDocs: Document[] = [];
  for (const doc of state.documents) {
    const grade: { score: string } = await retrievalGrader.invoke({
      question: state.question,
      content: doc.pageContent,
    });
    if (grade.score === "yes") {
      console.log("---GRADE: DOCUMENT RELEVANT---");
      relevantDocs.push(doc);
    } else {
      console.log("---GRADE: DOCUMENT NOT RELEVANT---");
    }
  }
  return { documents: relevantDocs };
};

// Re-write question
const transformQuery = async (state: typeof GraphState.State): Promise<Partial<typeof GraphState.State>> => {
  console.log("---TRANSFORM QUERY---");
  const betterQuestion = await rewriter.invoke({ question: state.question });
  return { question: betterQuestion };
};

// Web search based on the re-phrased question
const webSearch = async (state: typeof GraphState.State): Promise<Partial<typeof GraphState.State>> => {
  console.log("---WEB SEARCH---");
  const stringifiedSearchResults = await webSearchTool.invoke(state.question);
  return {
    documents: [new Document({ pageContent: stringifiedSearchResults })],
  };
};
/* ---Edges--- */
// Decide on the datasource to route the initial question to.
const routeQuestion = async (state: typeof GraphState.State) => {
  const source: { datasource: string } = await questionRouter.invoke({
    question: state.question,
  });
  if (source.datasource === "web_search") {
    console.log(`---ROUTING QUESTION "${state.question} TO WEB SEARCH---`);
    return "web_search";
  } else {
    console.log(`---ROUTING QUESTION "${state.question} TO RAG---`);
    return "retrieve";
  }
};

// Decide whether the current documents are sufficiently relevant
// to come up with a good answer.
const decideToGenerate = async (state: typeof GraphState.State) => {
  const filteredDocuments = state.documents;
  // All documents have been filtered as irrelevant
  // Regenerate a new query and try again
  if (filteredDocuments.length === 0) {
    console.log(
      "---DECISION: ALL DOCUMENTS ARE NOT RELEVANT TO QUESTION, TRANSFORM QUERY---",
    );
    return "transform_query";
  } else {
    // We have relevant documents, so generate answer.
    console.log("---DECISION: GENERATE---");
    return "generate";
  }
};

// Determines whether the generation is grounded in the document and answers question.
const gradeGenerationDocumentsAndQuestion = async (
  state: typeof GraphState.State,
) => {
  const hallucinationGrade: { score: string } = await hallucinationGrader
    .invoke({
      generation: state.generation,
      context: formatDocs(state.documents),
    });
  // Check for hallucination
  if (hallucinationGrade.score === "yes") {
    console.log("---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---");
    // Check question answering
    console.log("---GRADING GENERATION vs. QUESTION---");
    const onTopicGrade: { score: string } = await answerGrader.invoke({
      question: state.question,
      generation: state.generation,
    });
    if (onTopicGrade.score === "yes") {
      console.log("---DECISION: GENERATION ADDRESSES QUESTION---");
      return "useful";
    } else {
      console.log("---DECISION: GENERATION DOES NOT ADDRESS QUESTION---");
      return "not_useful";
    }
  } else {
    console.log(
      "---DECISION: GENERATION IS NOT GROUNDED IN DOCUMENTS, RETRY---",
    );
    return "not_supported";
  }
};
Build the graph¶
Now we build the graph. For fun, let's add a checkpointer and have the compiled graph pause before making a web search. This will simulate asking for permission.
import { END, MemorySaver, START, StateGraph } from "@langchain/langgraph";
const graph = new StateGraph(GraphState)
  .addNode("web_search", webSearch)
  .addNode("retrieve", retrieve)
  .addNode("grade_documents", gradeDocuments)
  .addNode("generate", generate)
  .addNode("transform_query", transformQuery)
  .addConditionalEdges(START, routeQuestion)
  .addEdge("web_search", "generate")
  .addEdge("retrieve", "grade_documents")
  .addConditionalEdges("grade_documents", decideToGenerate)
  .addEdge("transform_query", "retrieve")
  .addConditionalEdges("generate", gradeGenerationDocumentsAndQuestion, {
    not_supported: "generate",
    useful: END,
    not_useful: "transform_query",
  });

const app = graph.compile({
  checkpointer: new MemorySaver(),
  interruptBefore: ["web_search"],
});
Running the graph¶
You're all set! Time to ask some questions. First, try a question about something related to agents:
await app.invoke(
  {
    question: "What are some features of long-term memory?",
  },
  { configurable: { thread_id: "1" } },
);
---ROUTING QUESTION "What are some features of long-term memory? TO WEB SEARCH---
{
question: 'What are some features of long-term memory?',
documents: []
}
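Here, the local router sent this agent-related question to web search, and because the graph is configured to interrupt before the web_search node, execution pauses and returns the state gathered so far. If you'd like, you can resume this thread the same way as in the next example (a sketch; this runs web_search and then generate):
// Resume thread "1" past the interrupt
await app.invoke(null, { configurable: { thread_id: "1" } });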
If you ask something not related to agents or LLMs, the graph should fall back to information gleaned from the web. As configured above, the graph will pause before executing the web search:
await app.invoke(
  {
    question: "Where are the 2024 Euros being held?",
  },
  { configurable: { thread_id: "2" } },
);
---ROUTING QUESTION "Where are the 2024 Euros being held? TO WEB SEARCH---
{ question: 'Where are the 2024 Euros being held?', documents: [] }
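Before resuming, you can inspect the paused thread through the checkpointer; its next field should list web_search as the node waiting to run (a small sketch reusing the same thread ID):
const pausedState = await app.getState({ configurable: { thread_id: "2" } });
// Lists the nodes that will execute when the thread resumes
console.log(pausedState.next);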
To resume and perform the web search, invoke the graph again with null as the input, reusing the same thread ID so the checkpointer restores the saved state:
await app.invoke(null, { configurable: { thread_id: "2" } });
---WEB SEARCH---
---GENERATE---
---DECISION: GENERATION IS GROUNDED IN DOCUMENTS---
---GRADING GENERATION vs. QUESTION---
---DECISION: GENERATION ADDRESSES QUESTION---
{
question: 'Where are the 2024 Euros being held?',
generation: 'The 2024 Euros are being held in Germany. The final match will take place at Olympiastadion Berlin on July 14, 2024.',
documents: [
Document {
pageContent: `[{"title":"Where is Euro 2024? Country, host cities and venues","url":"https://www.radiotimes.com/tv/sport/football/euro-2024-location/","content":"Euro 2024 stadiums The Olympiastadion Berlin, the biggest stadium in Germany with a capacity of around 74,000, will host games as well as the final on Sunday, 14th July, 2024.","score":0.99743915,"raw_content":null},{"title":"UEFA EURO 2024 venues - complete list: When and where will the opening ...","url":"https://olympics.com/en/news/uefa-euro-2024-venues-complete-list-when-where-final-opening-game","content":"UEFA EURO 2024 will be held in Germany across June and July, with 10 host cities staging the major football tournament.. It all begins in Munich on June 14, when hosts Germany take on Scotland in the tournament's opening game at Bayern Munich's stadium.. The final takes place a month later on July 14 at Olympiastadion Berlin in the German capital, which hosted the 2006 FIFA World Cup final ...","score":0.9973061,"raw_content":null},{"title":"EURO 2024: All you need to know | UEFA EURO 2024","url":"https://www.uefa.com/euro2024/news/0257-0e13b161b2e8-4a3fd5615e0c-1000--euro-2024-all-you-need-to-know/","content":"Article top media content\\nArticle body\\nWhere will EURO 2024 be held?\\nGermany will host EURO 2024, having been chosen to stage the 17th edition of the UEFA European Championship at a UEFA Executive Committee meeting in Nyon on 27 September 2018. Host cities\\nEURO 2024 fixtures by venue\\nEURO 2024 fixtures by team\\nAlso visit\\nChange language\\nServices links and disclaimer\\n© 1998-2024 UEFA. Where and when will the final of UEFA EURO 2024 be played?\\nBerlin's Olympiastadion will stage the final on Sunday 14 July 2024.\\n The ten venues chosen to host games at the tournament include nine of the stadiums used at the 2006 World Cup plus the Düsseldorf Arena.\\n All you need to know\\nThursday, January 11, 2024\\nArticle summary\\nThree-time winners Germany will stage the UEFA European Championship in 2024.\\n","score":0.99497885,"raw_content":null}]`,
metadata: {},
id: undefined
}
]
}