Harnessing RAG Models with Cohere LLM for Effective Document Q/A: A Practical Guide

Introduction

Overview of Retrieval-Augmented Generation (RAG)

Retrieval-augmented generation (RAG) combines the strengths of information retrieval and language generation. By drawing on external documents while an answer is being generated, it can produce responses that are more precise and better grounded in context than those of a language model working from its training data alone.

Importance of Document-Based Q/A Systems

In an era where data is abundant but fragmented, the ability to efficiently retrieve and comprehend information from diverse document formats (PDFs, PPTs, DOCXs) is invaluable. Document-based Q/A systems provide robust solutions for knowledge management, customer support, and academic research, enhancing productivity and decision-making.

Introduction to Cohere LLM

Cohere's LLMs are language models designed to understand and generate human-like text. These capabilities make them a strong choice for building document-based Q/A systems, with an emphasis on scalability, accuracy, and ease of integration. This guide uses the command model through LangChain.

Understanding Retrieval-Augmented Generation (RAG)

What is RAG?

RAG combines traditional retrieval methods with powerful language models. It retrieves relevant documents or text snippets from a large corpus and uses this retrieved information to generate accurate and context-aware responses.

How RAG Enhances Traditional Q/A Systems

Traditional Q/A systems rely on predefined answers or simple keyword matching, and often fall short on complex queries. RAG overcomes these limitations by retrieving relevant passages at query time and passing them to the language model as context, which leads to more accurate and nuanced answers.

Key Components of a RAG Model

  • Retriever: Identifies relevant documents or text passages from a corpus.

  • Generator: Uses the retrieved information to generate a coherent and contextually appropriate answer.
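
The division of labour between the two components can be illustrated with a toy sketch. This is not how the notebook below works: the retriever here ranks passages by simple word overlap instead of embeddings, and the generator step is a stand-in that only assembles a prompt string rather than calling a language model. All names (`tokenize`, `retrieve`, `build_prompt`, `corpus`) are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and return its words as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, k=2):
    """Toy retriever: rank passages by how many query words they share."""
    query_tokens = tokenize(query)
    scored = [(len(query_tokens & tokenize(passage)), passage) for passage in corpus]
    # sort() is stable, so ties keep their original corpus order
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for score, passage in scored[:k] if score > 0]

def build_prompt(query, passages):
    """Stand-in for the generator step: place the retrieved text in a prompt."""
    context = "\n".join(passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"

corpus = [
    "FAISS is a library for similarity search over dense vectors.",
    "Chunk overlap keeps sentences that straddle a boundary intact.",
    "A retriever selects passages that are relevant to the query.",
]
top_passages = retrieve("What does a retriever do?", corpus, k=1)
rag_prompt = build_prompt("What does a retriever do?", top_passages)
print(rag_prompt)
```

In a real system the prompt would then be sent to a language model, which is the role Cohere plays later in this guide.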

Why Cohere LLM for Document Q/A?

Unique Features of Cohere LLM

Cohere LLM excels in natural language understanding and generation, offering features such as fine-tuning capabilities, scalability, and support for multiple languages, making it highly suitable for diverse applications.

Benefits of Using Cohere LLM for Document-Based Queries

  • Accuracy: High precision in understanding and answering complex queries.

  • Flexibility: Adaptable to various document types and formats.

  • Efficiency: Fast processing times, even with large datasets.

Comparison with Other LLMs

Compared to other leading LLMs, Cohere offers competitive advantages in customization, ease of use, and integration capabilities, making it a preferred choice for many organizations.

Notebook

Installing Required Libraries

# Installing Required Libraries
%pip install python-docx
%pip install python-pptx
%pip install PyPDF2
%pip install langchain
%pip install langchain_community
%pip install langchain_google_genai
%pip install langchain_text_splitters
%pip install sentence-transformers
%pip install faiss-cpu
%pip install cohere

Necessary imports

# Necessary imports
from docx import Document
from PyPDF2 import PdfReader
from pptx import Presentation
from langchain_community.llms import Cohere
from langchain_community.vectorstores import FAISS
from langchain_google_genai import GoogleGenerativeAI
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_community.embeddings import HuggingFaceEmbeddings  # the langchain.embeddings path is deprecated
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import AIMessage, HumanMessage
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.prompts import PromptTemplate, ChatPromptTemplate, MessagesPlaceholder

Loading the Files (PDF, PPTX, DOCX)

pdf_file = open('/kaggle/input/ncert-book/NCERT-Class-10-History.pdf','rb')
ppt_file = Presentation("/kaggle/input/report-ppt/Group 10 Presentation.pptx")
doc_file = Document('/kaggle/input/final-report/research.docx')

Extracting the pdf, doc, ppt data

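# extracting pdf data
pdf_text = ""
pdf_reader = PdfReader(pdf_file)
for page in pdf_reader.pages:
    # extract_text() can come back empty for image-only pages;
    # the added newline keeps the text of consecutive pages from running together
    pdf_text += (page.extract_text() or "") + '\n'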

# extracting ppt data
ppt_text = ""
for slide in ppt_file.slides:
    for shape in slide.shapes:
        if hasattr(shape, "text"):
            ppt_text += shape.text + '\n'

# extracting doc data
doc_text = ""
for paragraph in doc_file.paragraphs:
    doc_text += paragraph.text + '\n'
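
Text extracted this way often carries layout debris, especially from PDFs: words hyphenated across line breaks, runs of spaces, and stacks of blank lines. The notebook does not clean the text, so the helper below is an optional sketch under that assumption; `clean_extracted_text` is a hypothetical name, and the hyphen rule will also join genuine compounds that happen to break across a line.

```python
import re

def clean_extracted_text(text):
    """Tidy raw text pulled out of a PDF, PPTX, or DOCX file."""
    # Rejoin words hyphenated across a line break: "indus-\ntries" -> "industries"
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Collapse runs of spaces and tabs into a single space
    text = re.sub(r"[ \t]+", " ", text)
    # Drop spaces that sit directly before or after a newline
    text = re.sub(r" ?\n ?", "\n", text)
    # Collapse three or more newlines into a single blank line
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

sample = "The Swadeshi move-\nment   boosted\n\n\n\nIndian  indus-\ntries. "
cleaned = clean_extracted_text(sample)
print(repr(cleaned))
```

If used, it could be applied to `pdf_text`, `ppt_text`, and `doc_text` before they are merged.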

Merging all the texts

# merging all the text 
all_text = pdf_text + '\n' + ppt_text + '\n' + doc_text
len(all_text)

Splitting text into chunks

# splitting the text into chunks for embeddings creation
text_splitter = RecursiveCharacterTextSplitter(
        chunk_size = 1000,
        chunk_overlap = 200, # Overlap helps avoid losing context at chunk boundaries.
        length_function = len,
        # Separators are tried in order, so paragraph breaks must come before single newlines
        separators=['\n\n', '\n', ' ', '']
    )
chunks = text_splitter.split_text(text = all_text)
len(chunks)
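
To see what `chunk_size` and `chunk_overlap` mean, here is a simplified sliding-window sketch. It is not the algorithm RecursiveCharacterTextSplitter uses (that one splits on the separators first and then merges the pieces); it only illustrates how consecutive chunks share their boundary text. The function name `chunk_with_overlap` is hypothetical.

```python
def chunk_with_overlap(text, chunk_size, chunk_overlap):
    """Cut text into fixed-size windows that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

sample_text = "abcdefghijklmnopqrstuvwxyz"
sample_chunks = chunk_with_overlap(sample_text, chunk_size=10, chunk_overlap=4)
for chunk in sample_chunks:
    print(chunk)
```

Each chunk begins with the last four characters of the previous one, so a sentence cut at a boundary still appears whole in at least one chunk when the overlap is large enough.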

API Key

import os
# os.environ['HuggingFaceHub_API_Token']= HuggingFaceHub_API_Token
# os.environ['GOOGLE_API_KEY']= GOOGLE_API_KEY
# "cohere_api_key" below is a placeholder; replace it with your own Cohere API key
os.environ['cohere_api_key'] = "cohere_api_key"
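
Hard-coding a key in a notebook makes it easy to leak when the notebook is shared. One possible alternative, sketched below, reads the key from the environment and only prompts for it when it is missing. The helper name `load_api_key` is hypothetical; the variable name `cohere_api_key` matches the one this notebook reads later.

```python
import os
from getpass import getpass

def load_api_key(env_var="cohere_api_key"):
    """Return the key stored in env_var, prompting for it if it is not set."""
    key = os.environ.get(env_var)
    if not key:
        # getpass hides the typed value so it does not appear in the cell output
        key = getpass(f"Enter a value for {env_var}: ")
        os.environ[env_var] = key
    return key

# Usage (prompts only when the variable is not already set):
# api_key = load_api_key()
```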

Initialising Embedding models

embeddings = HuggingFaceEmbeddings(model_name='sentence-transformers/all-MiniLM-L6-v2')
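
The embedding model maps each chunk to a vector (384 dimensions for all-MiniLM-L6-v2), and texts with similar meaning end up with vectors that point in similar directions. One common way to compare two such vectors is cosine similarity. The sketch below uses made-up three-dimensional vectors purely for illustration; they are not real model outputs.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as having no similarity
    return dot / (norm_a * norm_b)

# Hypothetical toy "embeddings"
vec_query = [1.0, 0.0, 1.0]
vec_close = [2.0, 0.0, 2.0]   # same direction as the query, different length
vec_far = [0.0, 1.0, 0.0]     # orthogonal to the query

sim_close = cosine_similarity(vec_query, vec_close)
sim_far = cosine_similarity(vec_query, vec_far)
print(sim_close, sim_far)
```

The score depends only on direction, not on vector length, which is why `vec_close` scores as a near-perfect match.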

Indexing and creating a retriever

vectorstore = FAISS.from_texts(chunks, embedding = embeddings)
retriever = vectorstore.as_retriever(search_type="similarity", search_kwargs={"k": 6})
retrieved_docs = retriever.invoke("How did the Swadeshi Movement influence Indian industries in the early 20th century?")
print("Length of Retrieved Docs: ", len(retrieved_docs))
print("-----------------------")
print("Page Content: ", retrieved_docs[0].page_content)
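
Conceptually, the retriever embeds the question and returns the `k` stored chunks whose vectors lie closest to it. The sketch below does that by brute force with Euclidean (L2) distance, which, as far as I know, is the metric LangChain's FAISS wrapper uses by default; treat that as an assumption. The vectors and chunk labels are made up, and `top_k` is a hypothetical helper, not a FAISS or LangChain function.

```python
import heapq
import math

def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query_vec, indexed, k):
    """Return the texts of the k (text, vector) pairs closest to query_vec."""
    nearest = heapq.nsmallest(
        k, indexed, key=lambda item: l2_distance(query_vec, item[1])
    )
    return [text for text, _ in nearest]

# Hypothetical two-dimensional index
toy_index = [
    ("chunk about Swadeshi", [0.9, 0.1]),
    ("chunk about opium trade", [0.1, 0.9]),
    ("chunk about textile mills", [0.8, 0.3]),
]
nearest_chunks = top_k([1.0, 0.0], toy_index, k=2)
print(nearest_chunks)
```

FAISS performs the same kind of search far more efficiently, which matters once the index holds thousands of chunks.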

Cohere LLM

prompt_template = """Answer the question as precisely as possible using the provided context. If the answer is
                not contained in the context, say "answer not available in context" \n\n
                Context: \n {context}\n
                Question: \n {question} \n
                Answer:"""

prompt = PromptTemplate.from_template(template=prompt_template)

def format_docs(docs):
    return "\n\n".join(doc.page_content for doc in docs)
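
To make the prompt step concrete, the sketch below fills a shortened template by hand with Python's `str.format`, which is roughly what the prompt template does when the chain runs. `FakeDoc` is a hypothetical stand-in for the retrieved document objects, which expose their text as `page_content`; the template text is abbreviated for the example.

```python
from dataclasses import dataclass

@dataclass
class FakeDoc:
    """Hypothetical stand-in for a retrieved document."""
    page_content: str

def format_docs(docs):
    """Join the text of all retrieved chunks, separated by blank lines."""
    return "\n\n".join(doc.page_content for doc in docs)

template = "Context:\n{context}\n\nQuestion:\n{question}\n\nAnswer:"
docs = [FakeDoc("First chunk."), FakeDoc("Second chunk.")]
filled = template.format(
    context=format_docs(docs),
    question="What is in the chunks?",
)
print(filled)
```

The model therefore sees one block of text: all retrieved chunks, then the question, then the cue to answer.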

RAG Chain

# RAG Chain
def generate_answer(question):
    cohere_llm = Cohere(model="command", temperature=0.1, cohere_api_key = os.getenv('cohere_api_key'))

    rag_chain = (
        {"context": retriever | format_docs, "question": RunnablePassthrough()}
        | prompt
        | cohere_llm
        | StrOutputParser()
    )

    return rag_chain.invoke(question)
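
The `|` pipeline can be read as ordinary function composition: retrieve, format, fill the prompt, call the model, parse the output. The sketch below spells that out with plain Python functions and stubs, so it runs without an API key. Every name in it (`make_chain`, `stub_retrieve`, `stub_llm`, and so on) is hypothetical, and it only mirrors the data flow, not LangChain's actual implementation.

```python
def make_chain(retrieve, format_docs, fill_prompt, llm, parse):
    """Compose the RAG steps into a single callable."""
    def chain(question):
        # The question goes to the retriever and is also passed through unchanged
        inputs = {"context": format_docs(retrieve(question)), "question": question}
        return parse(llm(fill_prompt(inputs)))
    return chain

seen_prompts = []  # records what the stub model receives

def stub_retrieve(question):
    return ["chunk A", "chunk B"]

def stub_fill_prompt(inputs):
    return f"Context: {inputs['context']}\nQuestion: {inputs['question']}\nAnswer:"

def stub_llm(prompt_text):
    seen_prompts.append(prompt_text)
    return "  stub answer  "  # a real model would generate text here

chain = make_chain(stub_retrieve, "\n\n".join, stub_fill_prompt, stub_llm, str.strip)
answer = chain("What is RAG?")
print(answer)
```

Swapping any stub for a real component changes one step without touching the others, which is the main appeal of building the chain this way.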

Results

ans = generate_answer("How did the Swadeshi Movement influence Indian industries in the early 20th century?")
print(ans)

ans_1 = generate_answer("How did the East India Company contribute to the opium trade with China in the 19th century?")
print(ans_1)

ans_2 = generate_answer("Which machine learning algorithms are utilized in the project?")
print(ans_2)
