Installation
The LightRAG Server is designed to provide a Web UI and API support. The Web UI facilitates document indexing, knowledge graph exploration, and a simple RAG query interface. The LightRAG Server also provides an Ollama-compatible interface that emulates LightRAG as an Ollama chat model, which lets AI chatbots such as Open WebUI access LightRAG with ease.
Install the LightRAG Server from PyPI:
pip install "lightrag-hku[api]"
Or install the LightRAG Server from source:
git clone https://github.com/HKUDS/LightRAG.git
cd LightRAG
# create a Python virtual environment if necessary
# Install in editable mode with API support
pip install -e ".[api]"
- Launching the LightRAG Server with Docker Compose
git clone https://github.com/HKUDS/LightRAG.git
cd LightRAG
cp env.example .env
# modify LLM and Embedding settings in .env
docker compose up
Historical versions of the LightRAG Docker image can be found here: LightRAG Docker Images
Install LightRAG Core from source:
cd LightRAG
pip install -e .
To get started with LightRAG Core, refer to the sample code in the examples folder. We also provide a video demo that walks you through the local setup process. If you already have an OpenAI API key, you can run the demo right away:
### run the demo code from within the project folder
cd LightRAG
### provide your API-KEY for OpenAI
export OPENAI_API_KEY="sk-...your_openai_key..."
### download the demo document of "A Christmas Carol" by Charles Dickens
curl https://raw.githubusercontent.com/gusye1234/nano-graphrag/main/tests/mock_data.txt > ./book.txt
### run the demo code
python examples/lightrag_openai_demo.py
For a streaming-response implementation example, see examples/lightrag_openai_compatible_demo.py. Before running it, make sure to adjust the sample code's LLM and embedding configuration.
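For orientation, here is a minimal sketch of consuming a streaming response inside an async function. This is an illustration in the spirit of the compatible demo, not a verbatim excerpt; it assumes aquery returns an async iterator when stream=True is set:

resp = await rag.aquery(
    "What are the top themes in this story?",
    param=QueryParam(mode="hybrid", stream=True),
)
if isinstance(resp, str):
    # Some configurations may still return a complete string
    print(resp)
else:
    # Print chunks as they arrive
    async for chunk in resp:
        print(chunk, end="", flush=True)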
Note 1: When running the demos, be aware that different test scripts may use different embedding models. If you switch to a different embedding model, you must clear the data directory (./dickens); otherwise the program may raise errors. If you want to keep the LLM cache, you can preserve the kv_store_llm_response_cache.json file while clearing the data directory.
Note 2: Only lightrag_openai_demo.py and lightrag_openai_compatible_demo.py are officially supported sample programs; the other sample files are community contributions that have not been fully tested or optimized.
If you want to integrate LightRAG into your project, we recommend using the REST API provided by the LightRAG Server. LightRAG Core is better suited to embedded applications, or to researchers who want to conduct studies and evaluations.
Use the following Python snippet to initialize LightRAG, insert text into it, and perform a query:
import os
import asyncio
from lightrag import LightRAG, QueryParam
from lightrag.llm.openai import gpt_4o_mini_complete, gpt_4o_complete, openai_embed
from lightrag.kg.shared_storage import initialize_pipeline_status
from lightrag.utils import setup_logger

setup_logger("lightrag", level="INFO")

WORKING_DIR = "./rag_storage"
if not os.path.exists(WORKING_DIR):
    os.mkdir(WORKING_DIR)

async def initialize_rag():
    rag = LightRAG(
        working_dir=WORKING_DIR,
        embedding_func=openai_embed,
        llm_model_func=gpt_4o_mini_complete,
    )
    await rag.initialize_storages()
    await initialize_pipeline_status()
    return rag

async def main():
    rag = None  # so the finally block is safe if initialization fails
    try:
        # Initialize RAG instance
        rag = await initialize_rag()
        # Use the async variant inside the running event loop
        await rag.ainsert("Your text")
        # Perform hybrid search
        mode = "hybrid"
        print(
            await rag.aquery(
                "What are the top themes in this story?",
                param=QueryParam(mode=mode),
            )
        )
    except Exception as e:
        print(f"An error occurred: {e}")
    finally:
        if rag:
            await rag.finalize_storages()

if __name__ == "__main__":
    asyncio.run(main())
Important notes about the snippet above:
- Export your OPENAI_API_KEY environment variable before running the script.
- The program uses LightRAG's default storage settings, so all data is persisted to WORKING_DIR/rag_storage.
- The program demonstrates only the simplest way to initialize a LightRAG object: inject the embedding and LLM functions, then initialize the storages and pipeline status after the object is created.
The full list of LightRAG initialization parameters is documented here: Parameters
Use QueryParam to control the behavior of your query:
import os
from dataclasses import dataclass, field
from typing import Callable, Literal

@dataclass
class QueryParam:
    """Configuration parameters for query execution in LightRAG."""

    mode: Literal["local", "global", "hybrid", "naive", "mix", "bypass"] = "global"
    """Specifies the retrieval mode:
    - "local": Focuses on context-dependent information.
    - "global": Utilizes global knowledge.
    - "hybrid": Combines local and global retrieval methods.
    - "naive": Performs a basic search without advanced techniques.
    - "mix": Integrates knowledge graph and vector retrieval.
    """

    only_need_context: bool = False
    """If True, only returns the retrieved context without generating a response."""

    only_need_prompt: bool = False
    """If True, only returns the generated prompt without producing a response."""

    response_type: str = "Multiple Paragraphs"
    """Defines the response format. Examples: 'Multiple Paragraphs', 'Single Paragraph', 'Bullet Points'."""

    stream: bool = False
    """If True, enables streaming output for real-time responses."""

    top_k: int = int(os.getenv("TOP_K", "60"))
    """Number of top items to retrieve. Represents entities in 'local' mode and relationships in 'global' mode."""

    max_token_for_text_unit: int = int(os.getenv("MAX_TOKEN_TEXT_CHUNK", "4000"))
    """Maximum number of tokens allowed for each retrieved text chunk."""

    max_token_for_global_context: int = int(
        os.getenv("MAX_TOKEN_RELATION_DESC", "4000")
    )
    """Maximum number of tokens allocated for relationship descriptions in global retrieval."""

    max_token_for_local_context: int = int(os.getenv("MAX_TOKEN_ENTITY_DESC", "4000"))
    """Maximum number of tokens allocated for entity descriptions in local retrieval."""

    conversation_history: list[dict[str, str]] = field(default_factory=list)
    """Stores past conversation history to maintain context.
    Format: [{"role": "user/assistant", "content": "message"}].
    """

    history_turns: int = 3
    """Number of complete conversation turns (user-assistant pairs) to consider in the response context."""

    ids: list[str] | None = None
    """List of ids to filter the results."""

    model_func: Callable[..., object] | None = None
    """Optional override for the LLM model function to use for this specific query.
    If provided, this will be used instead of the global model function.
    This allows using different models for different query modes.
    """

    user_prompt: str | None = None
    """User-provided prompt for the query.
    If provided, this will be used instead of the default value from the prompt template.
    """
The default value of top_k can be changed via the TOP_K environment variable.
LightRAG requires an LLM and an embedding model to perform document indexing and query tasks. During initialization, the call functions for these models must be injected into LightRAG:

Using OpenAI-like APIs
Using Hugging Face models
Using Ollama models
Using LlamaIndex
LightRAG now supports multi-turn dialogue through its conversation history feature. Here is how to use it:

Usage example
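A minimal sketch follows; the message format matches the conversation_history field of QueryParam documented above, the sample turns are invented for illustration, and rag is assumed to be an initialized LightRAG instance:

# Pass prior dialogue turns so the answer stays in context
conversation_history = [
    {"role": "user", "content": "Who is the main character in A Christmas Carol?"},
    {"role": "assistant", "content": "The main character is Ebenezer Scrooge."},
]

query_param = QueryParam(
    mode="mix",
    conversation_history=conversation_history,  # previous turns
    history_turns=3,  # number of recent turns to consider
)

response = rag.query(
    "How does he change over the course of the story?",
    param=query_param,
)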
When querying with LightRAG, avoid mixing the retrieval process with unrelated output-processing instructions, as this significantly hurts query efficiency. The user_prompt parameter in QueryParam is designed precisely for this: it does not take part in the RAG retrieval phase; instead, it tells the LLM how to process the retrieved results after the query completes. Here is how to use it:
# Create query parameters
query_param = QueryParam(
    mode="hybrid",  # Other modes: local, global, hybrid, mix, naive
    user_prompt="For diagrams, use mermaid format with English/Pinyin node names and Chinese display labels",
)

# Query and process
response_default = rag.query(
    "Please draw a character relationship diagram for Scrooge",
    param=query_param
)
print(response_default)
If left unspecified, the default value is 2. We recommend keeping this setting below 10, because the performance bottleneck typically lies in LLM processing.
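As a hedged sketch, if this concurrency corresponds to the max_parallel_insert initialization parameter (an assumption worth verifying against the parameter table above), it could be tuned like this:

# Hypothetical sketch: raising document-indexing concurrency at initialization
rag = LightRAG(
    working_dir=WORKING_DIR,
    embedding_func=openai_embed,
    llm_model_func=gpt_4o_mini_complete,
    max_parallel_insert=4,  # assumed parameter name; keep below 10
)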
Insert with IDs

If you want to provide your own IDs for your documents, the number of documents and the number of IDs must be the same.
# Insert single text, and provide ID for it
rag.insert("TEXT1", ids=["ID_FOR_TEXT1"])
# Insert multiple texts, and provide IDs for them
rag.insert(["TEXT1", "TEXT2",...], ids=["ID_FOR_TEXT1", "ID_FOR_TEXT2"])
Insert using a pipeline

The apipeline_enqueue_documents and apipeline_process_enqueue_documents functions allow you to insert documents into the graph incrementally.

This is useful when you want to process documents in the background while still allowing the main thread to continue, using a routine to process newly queued documents.
rag = LightRAG(..)
await rag.apipeline_enqueue_documents(input)
# Your routine in loop
await rag.apipeline_process_enqueue_documents(input)
Insert with multi-file-type support

textract is supported for reading file types such as TXT, DOCX, PPTX, CSV, and PDF.
import textract
file_path = 'TEXT.pdf'
text_content = textract.process(file_path)
rag.insert(text_content.decode('utf-8'))
Citation functionality

By providing file paths, the system ensures that every source can be traced back to its original document.
# Define documents and their file paths
documents = ["Document content 1", "Document content 2"]
file_paths = ["path/to/doc1.txt", "path/to/doc2.txt"]
# Insert documents with file paths
rag.insert(documents, file_paths=file_paths)
Storage

LightRAG uses four storage types, and each has multiple implementations to choose from. The implementation for each storage type can be selected via parameters when initializing LightRAG; see the LightRAG initialization parameters above.
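For illustration, here is a minimal sketch of selecting an implementation for each of the four storage types at initialization. The implementation names shown are assumptions matching LightRAG's defaults; check them against the parameter table:

# Hedged sketch: explicitly choosing an implementation per storage type
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete,
    embedding_func=openai_embed,
    kv_storage="JsonKVStorage",                 # key-value storage
    vector_storage="NanoVectorDBStorage",       # vector storage
    graph_storage="NetworkXStorage",            # knowledge graph storage
    doc_status_storage="JsonDocStatusStorage",  # document status storage
)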
Using Neo4J for storage
export NEO4J_URI="neo4j://localhost:7687"
export NEO4J_USERNAME="neo4j"
export NEO4J_PASSWORD="password"
# Setup logger for LightRAG
setup_logger("lightrag", level="INFO")

# When you launch the project be sure to override the default KG: NetworkX
# by specifying kg="Neo4JStorage".
# Note: Default settings use NetworkX
# Initialize LightRAG with Neo4J implementation.
async def initialize_rag():
    rag = LightRAG(
        working_dir=WORKING_DIR,
        llm_model_func=gpt_4o_mini_complete,  # Use gpt_4o_mini_complete LLM model
        graph_storage="Neo4JStorage",  # <-- override the KG default
    )
    # Initialize database connections
    await rag.initialize_storages()
    # Initialize pipeline status for document processing
    await initialize_pipeline_status()
    return rag
See test_neo4j.py for a working example.
Using PostgreSQL for storage

For production-level scenarios, you will most likely want to use an enterprise-grade solution. PostgreSQL can provide a one-stop answer, serving as key-value store, VectorDB (pgvector), and GraphDB (Apache AGE).
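As a hedged sketch (the PG* storage class names and environment-based connection settings are assumptions to verify against the demo referenced below), a PostgreSQL-backed initialization might look like:

# Hedged sketch: backing all four storage types with PostgreSQL
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=gpt_4o_mini_complete,
    embedding_func=openai_embed,
    kv_storage="PGKVStorage",
    vector_storage="PGVectorStorage",        # pgvector
    graph_storage="PGGraphStorage",          # Apache AGE
    doc_status_storage="PGDocStatusStorage",
)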
- PostgreSQL is lightweight; the entire binary distribution, including all necessary plugins, can be compressed into 40MB. See the Windows release for reference, since it is easy to install on Linux/Mac.
- If you prefer Docker, start with this image if you are a beginner, to avoid hiccups (please read the overview): https://hub.docker.com/r/shangor/postgres-for-rag
- How to get started? See: examples/lightrag_zhipu_postgres_demo.py
- Create indexes for the AGE example (change dickens below to your graph name if necessary):
load 'age';
SET search_path = ag_catalog, "$user", public;
CREATE INDEX CONCURRENTLY entity_p_idx ON dickens."Entity" (id);
CREATE INDEX CONCURRENTLY vertex_p_idx ON dickens."_ag_label_vertex" (id);
CREATE INDEX CONCURRENTLY directed_p_idx ON dickens."DIRECTED" (id);
CREATE INDEX CONCURRENTLY directed_eid_idx ON dickens."DIRECTED" (end_id);
CREATE INDEX CONCURRENTLY directed_sid_idx ON dickens."DIRECTED" (start_id);
CREATE INDEX CONCURRENTLY directed_seid_idx ON dickens."DIRECTED" (start_id,end_id);
CREATE INDEX CONCURRENTLY edge_p_idx ON dickens."_ag_label_edge" (id);
CREATE INDEX CONCURRENTLY edge_sid_idx ON dickens."_ag_label_edge" (start_id);
CREATE INDEX CONCURRENTLY edge_eid_idx ON dickens."_ag_label_edge" (end_id);
CREATE INDEX CONCURRENTLY edge_seid_idx ON dickens."_ag_label_edge" (start_id,end_id);
CREATE INDEX CONCURRENTLY vertex_idx_node_id ON dickens."_ag_label_vertex" (ag_catalog.agtype_access_operator(properties, '"node_id"'::agtype));
CREATE INDEX CONCURRENTLY entity_idx_node_id ON dickens."Entity" (ag_catalog.agtype_access_operator(properties, '"node_id"'::agtype));
CREATE INDEX CONCURRENTLY entity_node_id_gin_idx ON dickens."Entity" USING gin(properties);
ALTER TABLE dickens."DIRECTED" CLUSTER ON directed_sid_idx;
-- drop if necessary
DROP INDEX entity_p_idx;
DROP INDEX vertex_p_idx;
DROP INDEX directed_p_idx;
DROP INDEX directed_eid_idx;
DROP INDEX directed_sid_idx;
DROP INDEX directed_seid_idx;
DROP INDEX edge_p_idx;
DROP INDEX edge_sid_idx;
DROP INDEX edge_eid_idx;
DROP INDEX edge_seid_idx;
DROP INDEX vertex_idx_node_id;
DROP INDEX entity_idx_node_id;
DROP INDEX entity_node_id_gin_idx;
- Known issue with Apache AGE: the released versions have the following problem:
You might find that the properties of nodes/edges are empty. This is a known issue of the release version: apache/age#1721.
You can compile AGE from source to fix it.
Using Faiss for storage

You can also install faiss-gpu instead of faiss-cpu if you have GPU support.
- Here we use sentence-transformers, but you can also use an OpenAIEmbedding model with dimension 3072.
import numpy as np
from sentence_transformers import SentenceTransformer
from lightrag import LightRAG
from lightrag.utils import EmbeddingFunc

async def embedding_func(texts: list[str]) -> np.ndarray:
    model = SentenceTransformer('all-MiniLM-L6-v2')
    embeddings = model.encode(texts, convert_to_numpy=True)
    return embeddings

# Initialize LightRAG with the LLM model function and embedding function
rag = LightRAG(
    working_dir=WORKING_DIR,
    llm_model_func=llm_model_func,
    embedding_func=EmbeddingFunc(
        embedding_dim=384,
        max_token_size=8192,
        func=embedding_func,
    ),
    vector_storage="FaissVectorDBStorage",
    vector_db_storage_cls_kwargs={
        "cosine_better_than_threshold": 0.3  # Your desired threshold
    },
)
Entity and relation editing

LightRAG now supports comprehensive knowledge graph management, allowing you to create, edit, and delete entities and relations within the knowledge graph.

Create entities and relations
# Create new entity
entity = rag.create_entity("Google", {
    "description": "Google is a multinational technology company specializing in internet-related services and products.",
    "entity_type": "company"
})

# Create another entity
product = rag.create_entity("Gmail", {
    "description": "Gmail is an email service developed by Google.",
    "entity_type": "product"
})

# Create relation between entities
relation = rag.create_relation("Google", "Gmail", {
    "description": "Google develops and operates Gmail.",
    "keywords": "develops operates service",
    "weight": 2.0
})
Edit entities and relations
# Edit an existing entity
updated_entity = rag.edit_entity("Google", {
    "description": "Google is a subsidiary of Alphabet Inc., founded in 1998.",
    "entity_type": "tech_company"
})

# Rename an entity (with all its relationships properly migrated)
renamed_entity = rag.edit_entity("Gmail", {
    "entity_name": "Google Mail",
    "description": "Google Mail (formerly Gmail) is an email service."
})

# Edit a relation between entities
updated_relation = rag.edit_relation("Google", "Google Mail", {
    "description": "Google created and maintains Google Mail service.",
    "keywords": "creates maintains email service",
    "weight": 3.0
})
All operations come in both synchronous and asynchronous versions. The asynchronous versions carry an "a" prefix (e.g., acreate_entity, aedit_relation).
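For instance, here is a minimal sketch of the async variants inside an async context; the payloads mirror the synchronous examples above, and the specific entity and relation values are invented for illustration:

# Hedged sketch: the same KG edits via the "a"-prefixed async methods
async def update_kg(rag):
    await rag.acreate_entity("YouTube", {
        "description": "YouTube is a video-sharing platform owned by Google.",
        "entity_type": "product",
    })
    await rag.acreate_relation("Google", "YouTube", {
        "description": "Google owns and operates YouTube.",
        "keywords": "owns operates video platform",
        "weight": 2.0,
    })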
Insert a custom KG
custom_kg = {
    "chunks": [
        {
            "content": "Alice and Bob are collaborating on quantum computing research.",
            "source_id": "doc-1",
            "file_path": "test_file",
        }
    ],
    "entities": [
        {
            "entity_name": "Alice",
            "entity_type": "person",
            "description": "Alice is a researcher specializing in quantum physics.",
            "source_id": "doc-1",
            "file_path": "test_file"
        },
        {
            "entity_name": "Bob",
            "entity_type": "person",
            "description": "Bob is a mathematician.",
            "source_id": "doc-1",
            "file_path": "test_file"
        },
        {
            "entity_name": "Quantum Computing",
            "entity_type": "technology",
            "description": "Quantum computing utilizes quantum mechanical phenomena for computation.",
            "source_id": "doc-1",
            "file_path": "test_file"
        }
    ],
    "relationships": [
        {
            "src_id": "Alice",
            "tgt_id": "Bob",
            "description": "Alice and Bob are research partners.",
            "keywords": "collaboration research",
            "weight": 1.0,
            "source_id": "doc-1",
            "file_path": "test_file"
        },
        {
            "src_id": "Alice",
            "tgt_id": "Quantum Computing",
            "description": "Alice conducts research on quantum computing.",
            "keywords": "research expertise",
            "weight": 1.0,
            "source_id": "doc-1",
            "file_path": "test_file"
        },
        {
            "src_id": "Bob",
            "tgt_id": "Quantum Computing",
            "description": "Bob researches quantum computing.",
            "keywords": "research application",
            "weight": 1.0,
            "source_id": "doc-1",
            "file_path": "test_file"
        }
    ]
}

rag.insert_custom_kg(custom_kg)
Other entity and relation operations
- create_entity: creates a new entity with the specified attributes
- edit_entity: updates an existing entity's attributes or renames it
- create_relation: creates a new relation between existing entities
- edit_relation: updates an existing relation's attributes

These operations maintain data consistency between the graph database and the vector database components, keeping your knowledge graph coherent.
Entity merging

Merge entities and their relationships

LightRAG now supports merging multiple entities into a single entity, with all relationships handled automatically:
# Basic entity merging
rag.merge_entities(
    source_entities=["Artificial Intelligence", "AI", "Machine Intelligence"],
    target_entity="AI Technology"
)
With a custom merge strategy:
# Define custom merge strategy for different fields
rag.merge_entities(
    source_entities=["John Smith", "Dr. Smith", "J. Smith"],
    target_entity="John Smith",
    merge_strategy={
        "description": "concatenate",  # Combine all descriptions
        "entity_type": "keep_first",   # Keep the entity type from the first entity
        "source_id": "join_unique"     # Combine all unique source IDs
    }
)
With custom target entity data:
# Specify exact values for the merged entity
rag.merge_entities(
    source_entities=["New York", "NYC", "Big Apple"],
    target_entity="New York City",
    target_entity_data={
        "entity_type": "LOCATION",
        "description": "New York City is the most populous city in the United States.",
    }
)
Advanced usage combining both approaches:
# Merge company entities with both strategy and custom data
rag.merge_entities(
    source_entities=["Microsoft Corp", "Microsoft Corporation", "MSFT"],
    target_entity="Microsoft",
    merge_strategy={
        "description": "concatenate",  # Combine all descriptions
        "source_id": "join_unique"     # Combine source IDs
    },
    target_entity_data={
        "entity_type": "ORGANIZATION",
    }
)
When merging entities:
- All relationships from the source entities are redirected to the target entity
- Duplicate relationships are merged intelligently
- Self-relationships (loops) are prevented
- The source entities are deleted after merging
- Relationship weights and attributes are preserved
Token usage tracking

Overview and usage

LightRAG provides a TokenTracker utility to monitor and manage LLM token consumption. This feature is especially useful for controlling API costs and optimizing performance.

Usage
from lightrag.utils import TokenTracker

# Create TokenTracker instance
token_tracker = TokenTracker()

# Method 1: Using context manager (Recommended)
# Suitable for scenarios requiring automatic token usage tracking
with token_tracker:
    result1 = await llm_model_func("your question 1")
    result2 = await llm_model_func("your question 2")

# Method 2: Manually adding token usage records
# Suitable for scenarios requiring more granular control over token statistics
token_tracker.reset()

rag.insert("your document text")  # placeholder content

rag.query("your question 1", param=QueryParam(mode="naive"))
rag.query("your question 2", param=QueryParam(mode="mix"))

# Display total token usage (including insert and query operations)
print("Token usage:", token_tracker.get_usage())
Usage tips
- Use the context manager for long sessions or batch operations to automatically track all token consumption
- For scenarios that need segmented statistics, use manual mode and call reset() at the appropriate times
- Regularly checking token usage helps catch abnormal consumption early
- Make active use of this feature during development and testing to optimize production costs
Practical examples

You can refer to the following examples for implementing token tracking:

examples/lightrag_gemini_track_token_demo.py: token tracking example using Google Gemini models
examples/lightrag_siliconcloud_track_token_demo.py: token tracking example using SiliconCloud models

These examples demonstrate how to use the TokenTracker feature effectively across different models and scenarios.
Data export functions

Overview

LightRAG lets you export knowledge graph data in various formats for analysis, sharing, and backup. The system supports exporting entities, relations, and relationship data.
Export functions

Basic usage
# Basic CSV export (default format)
rag.export_data("knowledge_graph.csv")
# Specify any format
rag.export_data("output.xlsx", file_format="excel")
Different file formats supported
# Export data in CSV format
rag.export_data("graph_data.csv", file_format="csv")

# Export data as an Excel sheet
rag.export_data("graph_data.xlsx", file_format="excel")

# Export data in markdown format
rag.export_data("graph_data.md", file_format="md")

# Export data in plain text
rag.export_data("graph_data.txt", file_format="txt")
Additional options

Include vector embeddings in the export (optional):
rag.export_data("complete_data.csv", include_vector_data=True)
Data included in exports

All exports include:
- Entity information (names, IDs, metadata)
- Relation data (connections between entities)
- Relationship information from the vector database
Caching

Clear cache

You can clear the LLM response cache with different modes:
# Clear all cache
await rag.aclear_cache()
# Clear local mode cache
await rag.aclear_cache(modes=["local"])
# Clear extraction cache
await rag.aclear_cache(modes=["default"])
# Clear multiple modes
await rag.aclear_cache(modes=["local", "global", "hybrid"])
# Synchronous version
rag.clear_cache(modes=["local"])
Valid modes are:
- "default": extraction cache
- "naive": naive search cache
- "local": local search cache
- "global": global search cache
- "hybrid": hybrid search cache
- "mix": mix search cache
LightRAG API

The LightRAG Server is designed to provide a Web UI and API support. For more information about the LightRAG Server, please refer to LightRAG Server.
Graph visualization

The LightRAG Server offers comprehensive knowledge graph visualization. It supports features such as various gravity layouts, node queries, and subgraph filtering. For more information about the LightRAG Server, please refer to LightRAG Server.

Evaluation

Dataset

The dataset used in LightRAG can be downloaded from TommyChien/UltraDomain.
Generate queries

LightRAG uses the following prompt to generate high-level queries; the corresponding code is in example/generate_query.py.
Prompt
Given the following description of a dataset:
{description}
Please identify 5 potential users who would engage with this dataset. For each user, list 5 tasks they would perform with this dataset. Then, for each (user, task) combination, generate 5 questions that require a high-level understanding of the entire dataset.
Output the results in the following structure:
- User 1: [user description]
- Task 1: [task description]
- Question 1:
- Question 2:
- Question 3:
- Question 4:
- Question 5:
- Task 2: [task description]
...
- Task 5: [task description]
- User 2: [user description]
...
- User 5: [user description]
...
Batch evaluation

To evaluate the performance of two RAG systems on high-level queries, LightRAG uses the following prompt; the specific code is available in reproduce/batch_eval.py.
Prompt
---Role---
You are an expert tasked with evaluating two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
---Goal---
You will evaluate two answers to the same question based on three criteria: **Comprehensiveness**, **Diversity**, and **Empowerment**.
- **Comprehensiveness**: How much detail does the answer provide to cover all aspects and details of the question?
- **Diversity**: How varied and rich is the answer in providing different perspectives and insights on the question?
- **Empowerment**: How well does the answer help the reader understand and make informed judgments about the topic?
For each criterion, choose the better answer (either Answer 1 or Answer 2) and explain why. Then, select an overall winner based on these three categories.
Here is the question:
{query}
Here are the two answers:
**Answer 1:**
{answer1}
**Answer 2:**
{answer2}
Evaluate both answers using the three criteria listed above and provide detailed explanations for each criterion.
Output your evaluation in the following JSON format:
{{
"Comprehensiveness": {{
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide explanation here]"
}},
"Empowerment": {{
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Provide explanation here]"
}},
"Overall Winner": {{
"Winner": "[Answer 1 or Answer 2]",
"Explanation": "[Summarize why this answer is the overall winner based on the three criteria]"
}}
}}
Overall Performance Table

|                   | Agriculture |          | CS       |          | Legal    |          | Mix      |          |
|-------------------|-------------|----------|----------|----------|----------|----------|----------|----------|
|                   | NaiveRAG    | LightRAG | NaiveRAG | LightRAG | NaiveRAG | LightRAG | NaiveRAG | LightRAG |
| Comprehensiveness | 32.4%       | 67.6%    | 38.4%    | 61.6%    | 16.4%    | 83.6%    | 38.8%    | 61.2%    |
| Diversity         | 23.6%       | 76.4%    | 38.0%    | 62.0%    | 13.6%    | 86.4%    | 32.4%    | 67.6%    |
| Empowerment       | 32.4%       | 67.6%    | 38.8%    | 61.2%    | 16.4%    | 83.6%    | 42.8%    | 57.2%    |
| Overall           | 32.4%       | 67.6%    | 38.8%    | 61.2%    | 15.2%    | 84.8%    | 40.0%    | 60.0%    |
|                   | RQ-RAG      | LightRAG | RQ-RAG   | LightRAG | RQ-RAG   | LightRAG | RQ-RAG   | LightRAG |
| Comprehensiveness | 31.6%       | 68.4%    | 38.8%    | 61.2%    | 15.2%    | 84.8%    | 39.2%    | 60.8%    |
| Diversity         | 29.2%       | 70.8%    | 39.2%    | 60.8%    | 11.6%    | 88.4%    | 30.8%    | 69.2%    |
| Empowerment       | 31.6%       | 68.4%    | 36.4%    | 63.6%    | 15.2%    | 84.8%    | 42.4%    | 57.6%    |
| Overall           | 32.4%       | 67.6%    | 38.0%    | 62.0%    | 14.4%    | 85.6%    | 40.0%    | 60.0%    |
|                   | HyDE        | LightRAG | HyDE     | LightRAG | HyDE     | LightRAG | HyDE     | LightRAG |
| Comprehensiveness | 26.0%       | 74.0%    | 41.6%    | 58.4%    | 26.8%    | 73.2%    | 40.4%    | 59.6%    |
| Diversity         | 24.0%       | 76.0%    | 38.8%    | 61.2%    | 20.0%    | 80.0%    | 32.4%    | 67.6%    |
| Empowerment       | 25.2%       | 74.8%    | 40.8%    | 59.2%    | 26.0%    | 74.0%    | 46.0%    | 54.0%    |
| Overall           | 24.8%       | 75.2%    | 41.6%    | 58.4%    | 26.4%    | 73.6%    | 42.4%    | 57.6%    |
|                   | GraphRAG    | LightRAG | GraphRAG | LightRAG | GraphRAG | LightRAG | GraphRAG | LightRAG |
| Comprehensiveness | 45.6%       | 54.4%    | 48.4%    | 51.6%    | 48.4%    | 51.6%    | 50.4%    | 49.6%    |
| Diversity         | 22.8%       | 77.2%    | 40.8%    | 59.2%    | 26.4%    | 73.6%    | 36.0%    | 64.0%    |
| Empowerment       | 41.2%       | 58.8%    | 45.2%    | 54.8%    | 43.6%    | 56.4%    | 50.8%    | 49.2%    |
| Overall           | 45.2%       | 54.8%    | 48.0%    | 52.0%    | 47.2%    | 52.8%    | 50.4%    | 49.6%    |
Reproduce

All the code can be found in the ./reproduce directory.
Step-0: Extract unique contexts

First, we need to extract unique contexts from the datasets.

Code
import glob
import json
import os

def extract_unique_contexts(input_directory, output_directory):
    os.makedirs(output_directory, exist_ok=True)

    jsonl_files = glob.glob(os.path.join(input_directory, '*.jsonl'))
    print(f"Found {len(jsonl_files)} JSONL files.")

    for file_path in jsonl_files:
        filename = os.path.basename(file_path)
        name, ext = os.path.splitext(filename)
        output_filename = f"{name}_unique_contexts.json"
        output_path = os.path.join(output_directory, output_filename)

        unique_contexts_dict = {}

        print(f"Processing file: {filename}")

        try:
            with open(file_path, 'r', encoding='utf-8') as infile:
                for line_number, line in enumerate(infile, start=1):
                    line = line.strip()
                    if not line:
                        continue
                    try:
                        json_obj = json.loads(line)
                        context = json_obj.get('context')
                        if context and context not in unique_contexts_dict:
                            unique_contexts_dict[context] = None
                    except json.JSONDecodeError as e:
                        print(f"JSON decoding error in file {filename} at line {line_number}: {e}")
        except FileNotFoundError:
            print(f"File not found: {filename}")
            continue
        except Exception as e:
            print(f"An error occurred while processing file {filename}: {e}")
            continue

        unique_contexts_list = list(unique_contexts_dict.keys())
        print(f"There are {len(unique_contexts_list)} unique `context` entries in the file {filename}.")

        try:
            with open(output_path, 'w', encoding='utf-8') as outfile:
                json.dump(unique_contexts_list, outfile, ensure_ascii=False, indent=4)
            print(f"Unique `context` entries have been saved to: {output_filename}")
        except Exception as e:
            print(f"An error occurred while saving to the file {output_filename}: {e}")

    print("All files have been processed.")
We then insert the extracted contexts into the LightRAG system.
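Here is a hedged sketch of that insertion step; the file name follows the *_unique_contexts.json naming scheme produced by the code above, and rag is assumed to be an initialized LightRAG instance:

import json

# Load the extracted unique contexts and insert them into LightRAG
with open("mix_unique_contexts.json", "r", encoding="utf-8") as f:
    unique_contexts = json.load(f)

rag.insert(unique_contexts)  # insert() also accepts a list of strings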