{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Using Postgres as memory\n", "\n", "This notebook shows how to use Postgres as a memory store in Semantic Kernel.\n", "\n", "The code below pulls the most recent papers from [ArviX](https://arxiv.org/), creates embeddings from the paper abstracts, and stores them in a Postgres database.\n", "\n", "In the future, we can use the Postgres vector store to search the database for similar papers based on the embeddings - stay tuned!" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import textwrap\n", "import xml.etree.ElementTree as ET\n", "from dataclasses import dataclass\n", "from datetime import datetime\n", "from typing import Annotated, Any\n", "\n", "import numpy as np\n", "import requests\n", "\n", "from semantic_kernel import Kernel\n", "from semantic_kernel.connectors.ai import FunctionChoiceBehavior\n", "from semantic_kernel.connectors.ai.open_ai import (\n", " AzureChatCompletion,\n", " AzureChatPromptExecutionSettings,\n", " AzureTextEmbedding,\n", " OpenAIEmbeddingPromptExecutionSettings,\n", " OpenAITextEmbedding,\n", ")\n", "from semantic_kernel.connectors.memory.postgres import PostgresCollection\n", "from semantic_kernel.contents import ChatHistory\n", "from semantic_kernel.data import (\n", " DistanceFunction,\n", " IndexKind,\n", " VectorSearchOptions,\n", " VectorStoreRecordDataField,\n", " VectorStoreRecordKeyField,\n", " VectorStoreRecordUtils,\n", " VectorStoreRecordVectorField,\n", " VectorStoreTextSearch,\n", " vectorstoremodel,\n", ")\n", "from semantic_kernel.functions import KernelParameterMetadata\n", "from semantic_kernel.functions.kernel_arguments import KernelArguments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Set up your environment\n", "\n", "You'll need to set up your environment to provide connection information to Postgres, as well as OpenAI or Azure OpenAI.\n", "\n", "To do this, copy the `.env.example` file to `.env` and fill in the necessary information.\n", "\n", "__Note__: If you're using VSCode to execute the notebook, the settings in `.env` in the root of the repository will be picked up automatically.\n", "\n", "### Postgres configuration\n", "\n", "You'll need to provide a connection string to a Postgres database. You can use a local Postgres instance, or a cloud-hosted one.\n", "You can provide a connection string, or provide environment variables with the connection information. See the .env.example file for `POSTGRES_` settings.\n", "\n", "#### Using Docker\n", "\n", "You can also use docker to bring up a Postgres instance by following the steps below:\n", "\n", "Create an `init.sql` that has the following:\n", "\n", "```sql\n", "CREATE EXTENSION IF NOT EXISTS vector;\n", "```\n", "\n", "Now you can start a postgres instance with the following:\n", "\n", "```\n", "docker pull pgvector/pgvector:pg16\n", "docker run --rm -it --name pgvector -p 5432:5432 -v ./init.sql:/docker-entrypoint-initdb.d/init.sql -e POSTGRES_PASSWORD=example pgvector/pgvector:pg16\n", "```\n", "\n", "_Note_: Use `.\\init.sql` on Windows and `./init.sql` on WSL or Linux/Mac.\n", "\n", "Then you could use the connection string:\n", "\n", "```\n", "POSTGRES_CONNECTION_STRING=\"host=localhost port=5432 dbname=postgres user=postgres password=example\"\n", "```\n", "\n", "### OpenAI configuration\n", "\n", "You can either use OpenAI or Azure OpenAI APIs. You provide the API key and other configuration in the `.env` file. 
Set either the `OPENAI_` or `AZURE_OPENAI_` settings.\n" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [], "source": [ "# Path to the environment file\n", "env_file_path = \".env\"" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we set some additional configuration." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [], "source": [ "# -- ArXiv settings --\n", "\n", "# The search term to use when searching for papers on arXiv. All metadata fields for the papers are searched.\n", "SEARCH_TERM = \"RAG\"\n", "\n", "# The category of papers to search for on arXiv. See https://arxiv.org/category_taxonomy for a list of categories.\n", "ARVIX_CATEGORY = \"cs.AI\"\n", "\n", "# The maximum number of papers to search for on arXiv.\n", "MAX_RESULTS = 300\n", "\n", "# -- OpenAI settings --\n", "\n", "# Set this flag to False to use the OpenAI API instead of Azure OpenAI\n", "USE_AZURE_OPENAI = True" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we define a vector store model. This model defines the table and column names for storing the embeddings. We use the `@vectorstoremodel` decorator to tell Semantic Kernel to create a vector store definition from the model. The `VectorStoreRecordField` annotations define the fields that will be stored in the database, including the key and vector fields." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [], "source": [ "@vectorstoremodel\n", "@dataclass\n", "class ArxivPaper:\n", "    id: Annotated[str, VectorStoreRecordKeyField()]\n", "    title: Annotated[str, VectorStoreRecordDataField()]\n", "    abstract: Annotated[str, VectorStoreRecordDataField(has_embedding=True, embedding_property_name=\"abstract_vector\")]\n", "    published: Annotated[datetime, VectorStoreRecordDataField()]\n", "    authors: Annotated[list[str], VectorStoreRecordDataField()]\n", "    link: Annotated[str | None, VectorStoreRecordDataField()]\n", "\n", "    abstract_vector: Annotated[\n", "        np.ndarray | None,\n", "        VectorStoreRecordVectorField(\n", "            embedding_settings={\"embedding\": OpenAIEmbeddingPromptExecutionSettings(dimensions=1536)},\n", "            index_kind=IndexKind.HNSW,\n", "            dimensions=1536,\n", "            distance_function=DistanceFunction.COSINE_DISTANCE,\n", "            property_type=\"float\",\n", "            serialize_function=np.ndarray.tolist,\n", "            deserialize_function=np.array,\n", "        ),\n", "    ] = None\n", "\n", "    @classmethod\n", "    def from_arxiv_info(cls, arxiv_info: dict[str, Any]) -> \"ArxivPaper\":\n", "        return cls(\n", "            id=arxiv_info[\"id\"],\n", "            title=arxiv_info[\"title\"].replace(\"\\n \", \" \"),\n", "            abstract=arxiv_info[\"abstract\"].replace(\"\\n \", \" \"),\n", "            published=arxiv_info[\"published\"],\n", "            authors=arxiv_info[\"authors\"],\n", "            link=arxiv_info[\"link\"],\n", "        )" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Below is a function that queries the arXiv API for the most recent papers based on our search query and category."
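, "\n", "\n", "For example, with the settings above (`SEARCH_TERM = \"RAG\"`, `ARVIX_CATEGORY = \"cs.AI\"`, `MAX_RESULTS = 300`), the function issues a GET request to a URL of roughly this form and parses the Atom XML response into plain dictionaries:\n", "\n", "```\n", "http://export.arxiv.org/api/query?search_query=all:%22RAG%22+AND+cat:cs.AI&start=0&max_results=300&sortBy=lastUpdatedDate&sortOrder=descending\n", "```"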
] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [], "source": [ "def query_arxiv(search_query: str, category: str = \"cs.AI\", max_results: int = 10) -> list[dict[str, Any]]:\n", " \"\"\"\n", " Query the ArXiv API and return a list of dictionaries with relevant metadata for each paper.\n", "\n", " Args:\n", " search_query: The search term or topic to query for.\n", " category: The category to restrict the search to (default is \"cs.AI\").\n", " See https://arxiv.org/category_taxonomy for a list of categories.\n", " max_results: Maximum number of results to retrieve (default is 10).\n", " \"\"\"\n", " response = requests.get(\n", " \"http://export.arxiv.org/api/query?\"\n", " f\"search_query=all:%22{search_query.replace(' ', '+')}%22\"\n", " f\"+AND+cat:{category}&start=0&max_results={max_results}&sortBy=lastUpdatedDate&sortOrder=descending\"\n", " )\n", "\n", " root = ET.fromstring(response.content)\n", " ns = {\"atom\": \"http://www.w3.org/2005/Atom\"}\n", "\n", " return [\n", " {\n", " \"id\": entry.find(\"atom:id\", ns).text.split(\"/\")[-1],\n", " \"title\": entry.find(\"atom:title\", ns).text,\n", " \"abstract\": entry.find(\"atom:summary\", ns).text,\n", " \"published\": entry.find(\"atom:published\", ns).text,\n", " \"link\": entry.find(\"atom:id\", ns).text,\n", " \"authors\": [author.find(\"atom:name\", ns).text for author in entry.findall(\"atom:author\", ns)],\n", " \"categories\": [category.get(\"term\") for category in entry.findall(\"atom:category\", ns)],\n", " \"pdf_link\": next(\n", " (link_tag.get(\"href\") for link_tag in entry.findall(\"atom:link\", ns) if link_tag.get(\"title\") == \"pdf\"),\n", " None,\n", " ),\n", " }\n", " for entry in root.findall(\"atom:entry\", ns)\n", " ]" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We use this function to query papers and store them in memory as our model types." ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 300 papers on 'RAG'\n" ] } ], "source": [ "arxiv_papers: list[ArxivPaper] = [\n", " ArxivPaper.from_arxiv_info(paper)\n", " for paper in query_arxiv(SEARCH_TERM, category=ARVIX_CATEGORY, max_results=MAX_RESULTS)\n", "]\n", "\n", "print(f\"Found {len(arxiv_papers)} papers on '{SEARCH_TERM}'\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a `PostgresCollection`, which represents the table in Postgres where we will store the paper information and embeddings." ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [], "source": [ "collection = PostgresCollection[str, ArxivPaper](\n", " collection_name=\"arxiv_records\", data_model_type=ArxivPaper, env_file_path=env_file_path\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Create a Kernel and add the TextEmbedding service, which will be used to generate embeddings of the abstract for each paper." ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [], "source": [ "kernel = Kernel()\n", "if USE_AZURE_OPENAI:\n", " text_embedding = AzureTextEmbedding(service_id=\"embedding\", env_file_path=env_file_path)\n", "else:\n", " text_embedding = OpenAITextEmbedding(service_id=\"embedding\", env_file_path=env_file_path)\n", "\n", "kernel.add_service(text_embedding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use VectorStoreRecordUtils to add embeddings to our models." 
] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [], "source": [ "records = await VectorStoreRecordUtils(kernel).add_vector_to_records(arxiv_papers, data_model_type=ArxivPaper)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now that the models have embeddings, we can write them into the Postgres database." ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [], "source": [ "async with collection:\n", " await collection.create_collection_if_not_exists()\n", " keys = await collection.upsert_batch(records)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we retrieve the first few models from the database and print out their information." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "# Engineering LLM Powered Multi-agent Framework for Autonomous CloudOps\n", "\n", "Abstract: Cloud Operations (CloudOps) is a rapidly growing field focused on the\n", "automated management and optimization of cloud infrastructure which is essential\n", "for organizations navigating increasingly complex cloud environments. MontyCloud\n", "Inc. is one of the major companies in the CloudOps domain that leverages\n", "autonomous bots to manage cloud compliance, security, and continuous operations.\n", "To make the platform more accessible and effective to the customers, we\n", "leveraged the use of GenAI. Developing a GenAI-based solution for autonomous\n", "CloudOps for the existing MontyCloud system presented us with various challenges\n", "such as i) diverse data sources; ii) orchestration of multiple processes; and\n", "iii) handling complex workflows to automate routine tasks. To this end, we\n", "developed MOYA, a multi-agent framework that leverages GenAI and balances\n", "autonomy with the necessary human control. This framework integrates various\n", "internal and external systems and is optimized for factors like task\n", "orchestration, security, and error mitigation while producing accurate,\n", "reliable, and relevant insights by utilizing Retrieval Augmented Generation\n", "(RAG). Evaluations of our multi-agent system with the help of practitioners as\n", "well as using automated checks demonstrate enhanced accuracy, responsiveness,\n", "and effectiveness over non-agentic approaches across complex workflows.\n", "Published: 2025-01-14 16:30:10\n", "Link: http://arxiv.org/abs/2501.08243v1\n", "PDF Link: http://arxiv.org/abs/2501.08243v1\n", "Authors: Kannan Parthasarathy, Karthik Vaidhyanathan, Rudra Dhar, Venkat Krishnamachari, Basil Muhammed, Adyansh Kakran, Sreemaee Akshathala, Shrikara Arun, Sumant Dubey, Mohan Veerubhotla, Amey Karan\n", "Embedding: [ 0.01063822 0.02977918 0.04532182 ... -0.00264323 0.00081101\n", " 0.01491571]\n", "\n", "\n", "# Eliciting In-context Retrieval and Reasoning for Long-context Large Language Models\n", "\n", "Abstract: Recent advancements in long-context language models (LCLMs) promise to\n", "transform Retrieval-Augmented Generation (RAG) by simplifying pipelines. With\n", "their expanded context windows, LCLMs can process entire knowledge bases and\n", "perform retrieval and reasoning directly -- a capability we define as In-Context\n", "Retrieval and Reasoning (ICR^2). However, existing benchmarks like LOFT often\n", "overestimate LCLM performance by providing overly simplified contexts. 
To\n", "address this, we introduce ICR^2, a benchmark that evaluates LCLMs in more\n", "realistic scenarios by including confounding passages retrieved with strong\n", "retrievers. We then propose three methods to enhance LCLM performance: (1)\n", "retrieve-then-generate fine-tuning, (2) retrieval-attention-probing, which uses\n", "attention heads to filter and de-noise long contexts during decoding, and (3)\n", "joint retrieval head training alongside the generation head. Our evaluation of\n", "five well-known LCLMs on LOFT and ICR^2 demonstrates significant gains with our\n", "best approach applied to Mistral-7B: +17 and +15 points by Exact Match on LOFT,\n", "and +13 and +2 points on ICR^2, compared to vanilla RAG and supervised fine-\n", "tuning, respectively. It even outperforms GPT-4-Turbo on most tasks despite\n", "being a much smaller model.\n", "Published: 2025-01-14 16:38:33\n", "Link: http://arxiv.org/abs/2501.08248v1\n", "PDF Link: http://arxiv.org/abs/2501.08248v1\n", "Authors: Yifu Qiu, Varun Embar, Yizhe Zhang, Navdeep Jaitly, Shay B. Cohen, Benjamin Han\n", "Embedding: [-0.01305697 0.01166064 0.06267344 ... -0.01627254 0.00974741\n", " -0.00573298]\n", "\n", "\n", "# ADAM-1: AI and Bioinformatics for Alzheimer's Detection and Microbiome-Clinical Data Integrations\n", "\n", "Abstract: The Alzheimer's Disease Analysis Model Generation 1 (ADAM) is a multi-agent\n", "large language model (LLM) framework designed to integrate and analyze multi-\n", "modal data, including microbiome profiles, clinical datasets, and external\n", "knowledge bases, to enhance the understanding and detection of Alzheimer's\n", "disease (AD). By leveraging retrieval-augmented generation (RAG) techniques\n", "along with its multi-agent architecture, ADAM-1 synthesizes insights from\n", "diverse data sources and contextualizes findings using literature-driven\n", "evidence. Comparative evaluation against XGBoost revealed similar mean F1 scores\n", "but significantly reduced variance for ADAM-1, highlighting its robustness and\n", "consistency, particularly in small laboratory datasets. While currently tailored\n", "for binary classification tasks, future iterations aim to incorporate additional\n", "data modalities, such as neuroimaging and biomarkers, to broaden the scalability\n", "and applicability for Alzheimer's research and diagnostics.\n", "Published: 2025-01-14 18:56:33\n", "Link: http://arxiv.org/abs/2501.08324v1\n", "PDF Link: http://arxiv.org/abs/2501.08324v1\n", "Authors: Ziyuan Huang, Vishaldeep Kaur Sekhon, Ouyang Guo, Mark Newman, Roozbeh Sadeghian, Maria L. Vaida, Cynthia Jo, Doyle Ward, Vanni Bucci, John P. Haran\n", "Embedding: [ 0.03896349 0.00422515 0.05525447 ... 
0.03374933 -0.01468264\n", " 0.01850895]\n", "\n", "\n" ] } ], "source": [ "async with collection:\n", " results = await collection.get_batch(keys[:3])\n", " if results:\n", " for result in results:\n", " print(f\"# {result.title}\")\n", " print()\n", " wrapped_abstract = textwrap.fill(result.abstract, width=80)\n", " print(f\"Abstract: {wrapped_abstract}\")\n", " print(f\"Published: {result.published}\")\n", " print(f\"Link: {result.link}\")\n", " print(f\"PDF Link: {result.link}\")\n", " print(f\"Authors: {', '.join(result.authors)}\")\n", " print(f\"Embedding: {result.abstract_vector}\")\n", " print()\n", " print()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now we can search for documents with `VectorStoreTextSearch`, which uses the embedding service to vectorize a query and search for semantically similar documents:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [], "source": [ "text_search = VectorStoreTextSearch[ArxivPaper].from_vectorized_search(collection, embedding_service=text_embedding)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `VectorStoreTextSearch` object gives us the ability to retrieve semantically similar documents directly from a prompt.\n", "Here we search for the top 5 ArXiV abstracts in our database similar to the query about chunking strategies in RAG applications:" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Found 5 results for query.\n", "Advanced ingestion process powered by LLM parsing for RAG system: 0.38676463602221456\n", "StructRAG: Boosting Knowledge Intensive Reasoning of LLMs via Inference-time Hybrid Information Structurization: 0.39733734194342085\n", "UDA: A Benchmark Suite for Retrieval Augmented Generation in Real-world Document Analysis: 0.3981809737466562\n", "R^2AG: Incorporating Retrieval Information into Retrieval Augmented Generation: 0.4134050114864055\n", "Enhancing Retrieval-Augmented Generation: A Study of Best Practices: 0.4144733752075731\n" ] } ], "source": [ "query = \"What are good chunking strategies to use for unstructured text in Retrieval-Augmented Generation applications?\"\n", "\n", "async with collection:\n", " search_results = await text_search.get_search_results(\n", " query, options=VectorSearchOptions(top=5, include_total_count=True)\n", " )\n", " print(f\"Found {search_results.total_count} results for query.\")\n", " async for search_result in search_results.results:\n", " title = search_result.record.title\n", " score = search_result.score\n", " print(f\"{title}: {score}\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can enable chat completion to utilize the text search by creating a kernel function for searching the database..." 
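, " The `create_search` call below wraps the vector search as a kernel function with two parameters, `query` and `top` (matching the `KernelParameterMetadata` definitions), so that a chat model can call it through function calling to pull relevant abstracts into the conversation."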
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [], "source": [ "plugin = kernel.add_functions(\n", " plugin_name=\"arxiv_plugin\",\n", " functions=[\n", " text_search.create_search(\n", " # The default parameters match the parameters of the VectorSearchOptions class.\n", " description=\"Searches for ArXiv papers that are related to the query.\",\n", " parameters=[\n", " KernelParameterMetadata(\n", " name=\"query\", description=\"What to search for.\", type=\"str\", is_required=True, type_object=str\n", " ),\n", " KernelParameterMetadata(\n", " name=\"top\",\n", " description=\"Number of results to return.\",\n", " type=\"int\",\n", " default_value=2,\n", " type_object=int,\n", " ),\n", " ],\n", " ),\n", " ],\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "...and then setting up a chat completions service that uses `FunctionChoiceBehavior.Auto` to automatically call the search function when appropriate to the users query. We also create the chat function that will be invoked by the kernel." ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [], "source": [ "# Create the chat completion service. This requires an Azure OpenAI completions model deployment and configuration.\n", "chat_completion = AzureChatCompletion(service_id=\"completions\")\n", "kernel.add_service(chat_completion)\n", "\n", "# Now we create the chat function that will use the chat service.\n", "chat_function = kernel.add_function(\n", " prompt=\"{{$chat_history}}{{$user_input}}\",\n", " plugin_name=\"ChatBot\",\n", " function_name=\"Chat\",\n", ")\n", "\n", "# we set the function choice to Auto, so that the LLM can choose the correct function to call.\n", "# and we exclude the ChatBot plugin, so that it does not call itself.\n", "execution_settings = AzureChatPromptExecutionSettings(\n", " function_choice_behavior=FunctionChoiceBehavior.Auto(filters={\"excluded_plugins\": [\"ChatBot\"]}),\n", " service_id=\"chat\",\n", " max_tokens=7000,\n", " temperature=0.7,\n", " top_p=0.8,\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we create a chat history with a system message and some initial context:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [], "source": [ "history = ChatHistory()\n", "system_message = \"\"\"\n", "You are a chat bot. Your name is Archie and\n", "you have one goal: help people find answers\n", "to technical questions by relying on the latest\n", "research papers published on ArXiv.\n", "You communicate effectively in the style of a helpful librarian. \n", "You always make sure to include the\n", "ArXiV paper references in your responses.\n", "If you cannot find the answer in the papers,\n", "you will let the user know, but also provide the papers\n", "you did find to be most relevant. If the abstract of the \n", "paper does not specifically reference the user's inquiry,\n", "but you believe it might be relevant, you can still include it\n", "BUT you must make sure to mention that the paper might not directly\n", "address the user's inquiry. Make certain that the papers you link are\n", "from a specific search result.\n", "\"\"\"\n", "history.add_system_message(system_message)\n", "history.add_user_message(\"Hi there, who are you?\")\n", "history.add_assistant_message(\n", " \"I am Archie, the ArXiV chat bot. 
I'm here to help you find the latest research papers from ArXiv that relate to your inquiries.\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can now invoke the chat function via the Kernel to get chat completions:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [], "source": [ "arguments = KernelArguments(\n", " user_input=query,\n", " chat_history=history,\n", " settings=execution_settings,\n", ")\n", "\n", "result = await kernel.invoke(chat_function, arguments=arguments)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Printing the result shows that the chat completion service used our text search to locate relevant ArXiV papers based on the query:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Archie:>\n", "What an excellent and timely question! Chunking strategies for unstructured text are\n", "critical for optimizing Retrieval-Augmented Generation (RAG) systems since they\n", "significantly affect how effectively a RAG model can retrieve and generate contextually\n", "relevant information. Let me consult the latest papers on this topic from ArXiv and\n", "provide you with relevant insights.\n", "---\n", "Here are some recent papers that dive into chunking strategies or similar concepts for\n", "retrieval-augmented frameworks:\n", "1. **\"Post-training optimization of retrieval-augmented generation models\"**\n", " *Authors*: Vibhor Agarwal et al.\n", " *Abstract*: While the paper discusses optimization strategies for retrieval-augmented\n", "generation models, there is a discussion on handling unstructured text that could apply to\n", "chunking methodologies. Chunking isn't always explicitly mentioned as \"chunking\" but may\n", "be referred to in contexts like splitting data for retrieval.\n", " *ArXiv link*: [arXiv:2308.10701](https://arxiv.org/abs/2308.10701)\n", " *Note*: This paper may not focus entirely on chunking strategies but might discuss\n", "relevant downstream considerations. It could still provide a foundation for you to explore\n", "how chunking integrates with retrievers.\n", "2. **\"Beyond Text: Retrieval-Augmented Reranking for Open-Domain Tasks\"**\n", " *Authors*: Younggyo Seo et al.\n", " *Abstract*: Although primarily focused on retrieval augmentation for reranking, there\n", "are reflections on how document structure impacts task performance. Chunking unstructured\n", "text to improve retrievability for such tasks could indirectly relate to this work.\n", " *ArXiv link*: [arXiv:2310.03714](https://arxiv.org/abs/2310.03714)\n", "3. **\"ALMA: Alignment of Generative and Retrieval Models for Long Documents\"**\n", " *Authors*: Yao Fu et al.\n", " *Abstract excerpt*: \"Our approach is designed to handle retrieval and generation for\n", "long documents by aligning the retrieval and generation models more effectively.\"\n", "Strategies to divide and process long documents into smaller chunks for efficient\n", "alignment are explicitly discussed. A focus on handling unstructured long-form content\n", "makes this paper highly relevant.\n", " *ArXiv link*: [arXiv:2308.05467](https://arxiv.org/abs/2308.05467)\n", "4. 
**\"Enhancing Context-aware Question Generation with Multi-modal Knowledge\"**\n", " *Authors*: Jialong Han et al.\n", " *Abstract excerpt*: \"Proposed techniques focus on improving retrievals through better\n", "division of available knowledge.\" It doesn’t focus solely on text chunking in the RAG\n", "framework but might be interesting since contextual awareness often relates to\n", "preprocessing unstructured input into structured chunks.\n", " *ArXiv link*: [arXiv:2307.12345](https://arxiv.org/abs/2307.12345)\n", "---\n", "### Practical Approaches Discussed in Literature:\n", "From my broad understanding of RAG systems and some of the details in these papers, here\n", "are common chunking strategies discussed in the research community:\n", "1. **Sliding Window Approach**: Divide the text into overlapping chunks of fixed lengths\n", "(e.g., 512 tokens with an overlap of 128 tokens). This helps ensure no important context\n", "is left behind when chunks are created.\n", "\n", "2. **Semantic Chunking**: Use sentence embeddings or clustering techniques (e.g., via Bi-\n", "Encoders or Sentence Transformers) to ensure chunks align semantically rather than naively\n", "by token count.\n", "3. **Dynamic Partitioning**: Implement chunking based on higher-order structure in the\n", "text, such as splitting at sentence boundaries, paragraph breaks, or logical sections.\n", "4. **Content-aware Chunking**: Experiment with LLMs to pre-identify contextual relevance\n", "of different parts of the text and chunk accordingly.\n", "---\n", "If you'd like, I can search more specifically on a sub-part of chunking strategies or\n", "related RAG optimizations. Let me know!\n" ] } ], "source": [ "def wrap_text(text, width=90):\n", " paragraphs = text.split(\"\\n\\n\") # Split the text into paragraphs\n", " wrapped_paragraphs = [\n", " \"\\n\".join(textwrap.fill(part, width=width) for paragraph in paragraphs for part in paragraph.split(\"\\n\"))\n", " ] # Wrap each paragraph, split by newlines\n", " return \"\\n\\n\".join(wrapped_paragraphs) # Join the wrapped paragraphs back together\n", "\n", "\n", "print(f\"Archie:>\\n{wrap_text(str(result))}\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": ".venv", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.15" } }, "nbformat": 4, "nbformat_minor": 2 }