Summarizing PDF Content Using LLMs with MindsDB Serve API

In this tutorial, we'll use the GPT-3.5 model through the MindsDB Serve API to summarize text from a PDF in Python. We'll walk step by step through extracting text from a PDF and then summarizing it. You'll also see how to experiment with different Large Language Models (LLMs) by changing only the model name in your script, so you can try newer models with minimal effort.

Prerequisites:

  • Ensure you have Python installed on your machine.
  • Install the PyPDF2 library and the OpenAI SDK by running pip install PyPDF2 openai inside your virtual environment.
  • Download the PDF file from the following URL.
  • Create a new file called mindsdb_serve_test.py in the same location where you downloaded the PDF file.

Step 1: Extract Text from a PDF File

In this initial step, you will create a Python function extract_text_from_pdf that reads through each page of a provided PDF file and aggregates the extracted text. This function is essential for the subsequent text summarization step.

def extract_text_from_pdf(pdf_path):
    """
    Extracts text from a given PDF file.
    
    :param pdf_path: Path to the PDF file to extract text from.
    :return: Extracted text as a string or None if an error occurs.
    """
    import PyPDF2
    text = ""
    try:
        with open(pdf_path, "rb") as file:
            pdf_reader = PyPDF2.PdfReader(file)
            # Iterate over the pages and append each page's text.
            for page in pdf_reader.pages:
                text += page.extract_text()
    except FileNotFoundError:
        print(f"The file {pdf_path} was not found.")
        return None
    except PyPDF2.errors.PdfReadError as e:
        print(f"An error occurred while reading the PDF file: {e}")
        return None
    return text

Step 2: Summarize the Extracted Text Using MindsDB Serve

Next, define a function summarize_text_from_pdf that connects to the MindsDB Serve API through the OpenAI SDK and uses the GPT-3.5 model (gpt-3.5-turbo by default) to summarize the text. Adjust the summary_prompt parameter for different kinds of summarization tasks, or the model parameter to use a different LLM.

def summarize_text_from_pdf(mindsdb_api_key, pdf_text, model="gpt-3.5-turbo", summary_prompt="Summarize the following text."):
    """
    Summarizes the text extracted from a PDF using the MindsDB Serve API.
    
    :param mindsdb_api_key: API key for authentication with MindsDB Serve API.
    :param pdf_text: Text content extracted from the PDF to summarize.
    :param model: Name of the LLM to use (defaults to "gpt-3.5-turbo").
    :param summary_prompt: Prompt to use for text summarization.
    :return: Summary of the text or an error message.
    """
    from openai import OpenAI, OpenAIError
    if not pdf_text:
        return "No text available to summarize."
        
    try:
        client_mindsdb_serve = OpenAI(
            api_key=mindsdb_api_key,
            base_url="https://llm.mdb.ai"
        )
        chat_completion_gpt = client_mindsdb_serve.chat.completions.create(
            messages=[
                {
                    "role": "system",
                    "content": summary_prompt,
                },
                {"role": "user", "content": pdf_text}
            ],
            model=model,
        )
        return chat_completion_gpt.choices[0].message.content
    except OpenAIError as e:
        print(f"An error occurred with the MindsDB Serve API: {e}")
        return "Error in text summarization."
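
Long PDFs can exceed a model's context window, in which case a single request containing the full text may fail. One possible workaround, sketched here under stated assumptions, is to split the extracted text into smaller chunks before summarizing. Note that split_into_chunks is a hypothetical helper introduced for illustration, and the 8000-character default is an arbitrary value, not a documented limit of MindsDB Serve or of any particular model.

```python
def split_into_chunks(text, max_chars=8000):
    """
    Split text into chunks of at most max_chars characters.

    Breaks on whitespace where possible. Runs of whitespace (including
    newlines) are collapsed to single spaces in the output.
    max_chars=8000 is an illustrative assumption, not a real API limit.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_chars:
            current = candidate
            continue
        # The word does not fit: close the current chunk first.
        if current:
            chunks.append(current)
        # A single word longer than max_chars is split at fixed offsets.
        while len(word) > max_chars:
            chunks.append(word[:max_chars])
            word = word[max_chars:]
        current = word
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to summarize_text_from_pdf in turn, with the partial summaries joined and summarized once more to produce a single result.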

Step 3: Perform the Summarization

The __main__ block below orchestrates the text extraction and summarization process. Before running it, set MINDSDB_API_KEY to your MindsDB API key and pdf_path to the location of your PDF. The block extracts the text from the PDF, summarizes it, and prints the result.

if __name__ == "__main__":
    MINDSDB_API_KEY = "f4aw"  # ADD YOUR MINDSDB API KEY HERE
    SUMMARY_PROMPT = "Summarize the following text." # Change the summarization prompt as needed
    MODEL = "gpt-3.5-turbo" # Change the LLM here
    pdf_path = "invoicesample.pdf"
    pdf_text = extract_text_from_pdf(pdf_path)
    if pdf_text:
        summary = summarize_text_from_pdf(MINDSDB_API_KEY, pdf_text, MODEL, SUMMARY_PROMPT)
        print(summary)
    else:
        print("Failed to extract text from the PDF.")
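
Hard-coding an API key in a script makes it easy to leak by accident, for example when the file is committed to version control. A common alternative is to read the key from an environment variable. The following is a minimal sketch, assuming an environment variable named MINDSDB_API_KEY; both that variable name and the load_api_key helper are choices made for this example and are not part of MindsDB Serve or the OpenAI SDK.

```python
import os


def load_api_key(env_var="MINDSDB_API_KEY"):
    """
    Read the API key from an environment variable.

    The variable name is an assumption made for this example.
    Raises RuntimeError if the variable is missing or blank.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the script."
        )
    return key
```

With this helper in place, the assignment in the __main__ block could become MINDSDB_API_KEY = load_api_key().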

Step 4: Run the Script

Execute the script in your terminal or command prompt.

python mindsdb_serve_test.py

The script will print the summary of the extracted text to the console. You can modify the script to save the summary to a file or perform additional processing as needed.
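
As one way to save the summary to a file, the sketch below writes it to a UTF-8 text file. The save_summary helper and the default file name summary.txt are hypothetical names chosen for this example.

```python
from pathlib import Path


def save_summary(summary, output_path="summary.txt"):
    """
    Write the summary to a UTF-8 text file and return the path written.

    An existing file at output_path is overwritten.
    """
    path = Path(output_path)
    path.write_text(summary, encoding="utf-8")
    return path
```

In the __main__ block, calling save_summary(summary) after print(summary) would keep a copy of the result next to the script.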

Step 5: Switching Between Different Language Models

Refer to the supported language models list to choose an alternative model. For example, to use Google's Gemma-7B model, change the MODEL variable in Step 3 to MODEL = "gemma-7b".
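
To compare several models on the same text, the summarization call can be run once per model name. The sketch below is one possible approach, not a MindsDB feature: summarize_with_models is a hypothetical helper that takes the summarizing function as an argument, so it makes no assumptions about how the API call itself is made.

```python
def summarize_with_models(pdf_text, models, summarize):
    """
    Summarize the same text with several models.

    :param pdf_text: Text to summarize.
    :param models: Iterable of model names, e.g. ["gpt-3.5-turbo", "gemma-7b"].
    :param summarize: Callable taking (text, model) and returning a summary.
    :return: Dict mapping each model name to its summary.
    """
    results = {}
    for model in models:
        results[model] = summarize(pdf_text, model)
    return results


# Example wiring with the functions from this tutorial (not executed here):
# summaries = summarize_with_models(
#     pdf_text,
#     ["gpt-3.5-turbo", "gemma-7b"],
#     lambda text, model: summarize_text_from_pdf(
#         MINDSDB_API_KEY, text, model, SUMMARY_PROMPT
#     ),
# )
```

Printing each entry of the returned dictionary side by side makes it easier to judge which model produces the most useful summary for your documents.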
