Mirrored from https://github.com/binary-husky/gpt_academic.git
Synced 2025-12-06 06:26:47 +00:00
* Zhipu SDK update: adapt to the latest Zhipu SDK, support GLM-4v (#1502)
    * Adapt Google Gemini; optimize extraction of files from user input
    * requirements.txt fix
    * Pending history check
* Update the "生成多种Mermaid图表" plugin: separate out the file-reading function (#1520)
    * Update crazy_functional.py with new functionality to deal with PDF
    * Update crazy_functional.py and Mermaid.py for plugin_kwargs
    * Update crazy_functional.py with a new chart type: mind map
    * Update SELECT_PROMPT and i_say_show_user messages
    * Update the ArgsReminder message in get_crazy_functions()
    * Add reading of md files and update PROMPTS
    * Revert to the original PROMPTS, as testing found the initial version worked best
    * Update the Mermaid chart generation function
* Version 3.71
* Resolve issue #1510
* Remove unnecessary text from sys_prompt in the 解析历史输入 function
* Remove the sys_prompt message in the 解析历史输入 function
* Update bridge_all.py: support gpt-4-turbo-preview (#1517)
* Update config.py: support gpt-4-turbo-preview (#1516)
* Refactor the 解析历史输入 function to handle file input
* Update Mermaid chart generation functionality
* Rename files and functions
* Integrate the Mathpix OCR feature (#1468)
    * Update Latex输出PDF结果.py: use Mathpix to translate a PDF into Chinese and recompile the PDF
    * Update config.py: add the Mathpix appid & appkey
    * Add the "PDF翻译中文并重新编译PDF" feature to the plugins
* Fix zhipuai; check picture handling; remove glm-4 due to a bug
* Modify config; check MATHPIX_APPID
* Remove unnecessary code and update the function_plugins dictionary
* Capture non-standard token overflow
* Bug fix #1524
* Change Mermaid style
* Mermaid: support scroll zoom in/out and reset, mouse wheel, and drag (#1530)
* Version 3.72
* Change live2d
* Save the status of the clear button in a cookie; persist front-end selections
* JS UI bug fix; reset-button bug fix; update live2d tips
* Fix the missing get_token_num method; fix the live2d toggle switch
* Fix persistent custom buttons with cookies
* Fix zhipuai feedback with core functionality
* Refactor button update and clean-up functions; remove trailing spaces
* Fix missing MATHPIX_APPID and MATHPIX_APPKEY configuration
* Prompt fix; optimize the mind-map prompts (#1537)
* Optimize the "PDF翻译中文并重新编译PDF" plugin (#1602)
* Add gemini_endpoint to API_URL_REDIRECT (#1560); update the gemini-pro and gemini-pro-vision model_info endpoints
* Update to support the new Claude models (#1606)
    * Add the anthropic library and update the Claude models
    * Update bridge_claude.py: add support for image input; fix several bugs
    * Add the Claude_3_Models variable to limit the number of images
    * Refactor code to improve readability and maintainability
* More flexible one-api support; reformat config; fix the one-api new-access bug
* Compatibility with non-standard APIs
* Version 3.73

Co-authored-by: binary-husky <qingxu.fu@outlook.com>, binary-husky <96192199+binary-husky@users.noreply.github.com>, XIao <46100050+Kilig947@users.noreply.github.com>, Menghuan1918 <menghuan2003@outlook.com>, hongyi-zhao <hongyi.zhao@gmail.com>, Hao Ma <893017927@qq.com>, zeyuan huang <599012428@qq.com>
338 lines · 13 KiB · Python
# From project chatglm-langchain

import os
import shutil
import threading
import uuid
from typing import List, Tuple, Union

import numpy as np
from tqdm import tqdm
from langchain.vectorstores import FAISS
from langchain.docstore.document import Document

from toolbox import Singleton
from crazy_functions.vector_fns.general_file_loader import load_file

# Candidate embedding models (alias -> HuggingFace model id)
embedding_model_dict = {
    "ernie-tiny": "nghuyong/ernie-3.0-nano-zh",
    "ernie-base": "nghuyong/ernie-3.0-base-zh",
    "text2vec-base": "shibing624/text2vec-base-chinese",
    "text2vec": "GanymedeNil/text2vec-large-chinese",
}

# Embedding model name
EMBEDDING_MODEL = "text2vec"

# Embedding running device
EMBEDDING_DEVICE = "cpu"

# Context-based prompt template; the "{question}" and "{context}" placeholders must be kept.
# (The template itself stays in Chinese because it instructs the model to answer in Chinese.)
PROMPT_TEMPLATE = """已知信息:
{context}

根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 问题是:{question}"""
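
# A minimal usage sketch (not part of the upstream file): the template is meant to be
# filled with str.format, keeping both placeholders verbatim. Note that the retrieval
# path below (get_knowledge_based_conent_test) builds its prompt inline instead, so
# PROMPT_TEMPLATE is not referenced elsewhere in this file.
#   filled_prompt = PROMPT_TEMPLATE.format(context="<retrieved chunks>", question="<user question>")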

# Sentence length used when splitting text
SENTENCE_SIZE = 100

# Length of a single context chunk after matching
CHUNK_SIZE = 250

# LLM input history length
LLM_HISTORY_LEN = 3

# Return top-k text chunks from the vector store
VECTOR_SEARCH_TOP_K = 5

# Relevance score threshold for knowledge retrieval; values range roughly from 0 to 1100.
# 0 disables the filter; in testing, thresholds below 500 gave more precise matches.
VECTOR_SEARCH_SCORE_THRESHOLD = 0

NLTK_DATA_PATH = os.path.join(os.path.dirname(os.path.dirname(__file__)), "nltk_data")

FLAG_USER_NAME = uuid.uuid4().hex

# Whether to enable cross-origin access; defaults to False, set to True to enable.
OPEN_CROSS_DOMAIN = False
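
# A minimal sketch (assumption, not upstream code) of how these settings relate:
# EMBEDDING_MODEL selects a HuggingFace model id from embedding_model_dict, and
# get_chinese_text2vec below hardcodes the same "text2vec" entry.
#   model_id = embedding_model_dict[EMBEDDING_MODEL]  # "GanymedeNil/text2vec-large-chinese"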


def similarity_search_with_score_by_vector(
        self, embedding: List[float], k: int = 4
) -> List[Tuple[Document, float]]:

    def seperate_list(ls: List[int]) -> List[List[int]]:
        # Split a sorted list of indices into runs of consecutive ids,
        # e.g. [1, 2, 3, 7, 8] -> [[1, 2, 3], [7, 8]].
        lists = []
        ls1 = [ls[0]]
        for i in range(1, len(ls)):
            if ls[i - 1] + 1 == ls[i]:
                ls1.append(ls[i])
            else:
                lists.append(ls1)
                ls1 = [ls[i]]
        lists.append(ls1)
        return lists

    scores, indices = self.index.search(np.array([embedding], dtype=np.float32), k)
    docs = []
    id_set = set()
    store_len = len(self.index_to_docstore_id)
    for j, i in enumerate(indices[0]):
        if i == -1 or 0 < self.score_threshold < scores[0][j]:
            # This happens when not enough docs are returned.
            continue
        _id = self.index_to_docstore_id[i]
        doc = self.docstore.search(_id)
        if not self.chunk_conent:
            if not isinstance(doc, Document):
                raise ValueError(f"Could not find document for id {_id}, got {doc}")
            doc.metadata["score"] = int(scores[0][j])
            docs.append(doc)
            continue
        id_set.add(i)
        docs_len = len(doc.page_content)
        # Expand around hit i with neighboring chunks from the same source file until
        # the combined length would exceed chunk_size. (The loop variable k shadows
        # the parameter k, which is no longer needed at this point.)
        for k in range(1, max(i, store_len - i)):
            break_flag = False
            for l in [i + k, i - k]:
                if 0 <= l < len(self.index_to_docstore_id):
                    _id0 = self.index_to_docstore_id[l]
                    doc0 = self.docstore.search(_id0)
                    if docs_len + len(doc0.page_content) > self.chunk_size:
                        break_flag = True
                        break
                    elif doc0.metadata["source"] == doc.metadata["source"]:
                        docs_len += len(doc0.page_content)
                        id_set.add(l)
            if break_flag:
                break
    if not self.chunk_conent:
        return docs
    if len(id_set) == 0 and self.score_threshold > 0:
        return []
    id_list = sorted(list(id_set))
    id_lists = seperate_list(id_list)
    # Merge each run of consecutive chunks into one Document, scored by the minimum
    # score among its members that were direct hits.
    for id_seq in id_lists:
        for id in id_seq:
            if id == id_seq[0]:
                _id = self.index_to_docstore_id[id]
                doc = self.docstore.search(_id)
            else:
                _id0 = self.index_to_docstore_id[id]
                doc0 = self.docstore.search(_id0)
                doc.page_content += " " + doc0.page_content
        if not isinstance(doc, Document):
            raise ValueError(f"Could not find document for id {_id}, got {doc}")
        doc_score = min([scores[0][id] for id in [indices[0].tolist().index(i) for i in id_seq if i in indices[0]]])
        doc.metadata["score"] = int(doc_score)
        docs.append(doc)
    return docs
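
# A hedged usage sketch: this is written as a free function taking the FAISS store as
# `self`, so it can be applied to a loaded store without subclassing, mirroring how
# get_knowledge_based_conent_test calls it below. The attribute assignments are the
# ones that method sets; the concrete path here is illustrative.
#   store = FAISS.load_local("./vector_store/default", text2vec)
#   store.chunk_conent, store.chunk_size, store.score_threshold = True, CHUNK_SIZE, 0
#   embedding = store.embedding_function.embed_query("查询内容")
#   docs = similarity_search_with_score_by_vector(store, embedding, k=VECTOR_SEARCH_TOP_K)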


class LocalDocQA:
    llm: object = None
    embeddings: object = None
    top_k: int = VECTOR_SEARCH_TOP_K
    chunk_size: int = CHUNK_SIZE
    chunk_conent: bool = True
    score_threshold: int = VECTOR_SEARCH_SCORE_THRESHOLD

    def init_cfg(self,
                 top_k=VECTOR_SEARCH_TOP_K,
                 ):
        self.llm = None
        self.top_k = top_k

    def init_knowledge_vector_store(self,
                                    filepath,
                                    vs_path: Union[str, os.PathLike, None] = None,
                                    sentence_size=SENTENCE_SIZE,
                                    text2vec=None):
        loaded_files = []
        failed_files = []
        if isinstance(filepath, str):
            if not os.path.exists(filepath):
                print("路径不存在")  # path does not exist
                return None
            elif os.path.isfile(filepath):
                file = os.path.split(filepath)[-1]
                try:
                    docs = load_file(filepath, SENTENCE_SIZE)
                    print(f"{file} 已成功加载")  # loaded successfully
                    loaded_files.append(filepath)
                except Exception as e:
                    print(e)
                    print(f"{file} 未能成功加载")  # failed to load
                    return None
            elif os.path.isdir(filepath):
                docs = []
                for file in tqdm(os.listdir(filepath), desc="加载文件"):
                    fullfilepath = os.path.join(filepath, file)
                    try:
                        docs += load_file(fullfilepath, SENTENCE_SIZE)
                        loaded_files.append(fullfilepath)
                    except Exception as e:
                        print(e)
                        failed_files.append(file)

                if len(failed_files) > 0:
                    print("以下文件未能成功加载:")  # the following files failed to load
                    for file in failed_files:
                        print(f"{file}\n")

        else:
            # filepath is a list of paths
            docs = []
            for file in filepath:
                docs += load_file(file, SENTENCE_SIZE)
                print(f"{file} 已成功加载")  # loaded successfully
                loaded_files.append(file)

        if len(docs) > 0:
            print("文件加载完毕,正在生成向量库")  # files loaded, building the vector store
            if vs_path and os.path.isdir(vs_path):
                try:
                    # Extend an existing local FAISS store if one can be loaded.
                    self.vector_store = FAISS.load_local(vs_path, text2vec)
                    self.vector_store.add_documents(docs)
                except Exception:
                    self.vector_store = FAISS.from_documents(docs, text2vec)
            else:
                self.vector_store = FAISS.from_documents(docs, text2vec)  # docs is a list of Document

            self.vector_store.save_local(vs_path)
            return vs_path, loaded_files
        else:
            raise RuntimeError("文件加载失败,请检查文件格式是否正确")  # file loading failed; check the file format

    def get_loaded_file(self, vs_path):
        ds = self.vector_store.docstore
        return set([ds._dict[k].metadata['source'].split(vs_path)[-1] for k in ds._dict])

    # query                  the query text
    # vs_path                path to the knowledge base
    # chunk_conent           whether to enable context linking
    # score_threshold        score threshold for search matching
    # vector_search_top_k    number of knowledge-base chunks to retrieve (default 5)
    # chunk_size             length of the linked context for each matched chunk
    def get_knowledge_based_conent_test(self, query, vs_path, chunk_conent,
                                        score_threshold=VECTOR_SEARCH_SCORE_THRESHOLD,
                                        vector_search_top_k=VECTOR_SEARCH_TOP_K, chunk_size=CHUNK_SIZE,
                                        text2vec=None):
        self.vector_store = FAISS.load_local(vs_path, text2vec)
        self.vector_store.chunk_conent = chunk_conent
        self.vector_store.score_threshold = score_threshold
        self.vector_store.chunk_size = chunk_size

        embedding = self.vector_store.embedding_function.embed_query(query)
        related_docs_with_score = similarity_search_with_score_by_vector(self.vector_store, embedding, k=vector_search_top_k)

        if not related_docs_with_score:
            response = {"query": query,
                        "source_documents": []}
            return response, ""
        # prompt = f"{query}. You should answer this question using information from following documents: \n\n"
        prompt = f"{query}. 你必须利用以下文档中包含的信息回答这个问题: \n\n---\n\n"
        prompt += "\n\n".join([f"({k}): " + doc.page_content for k, doc in enumerate(related_docs_with_score)])
        prompt += "\n\n---\n\n"
        prompt = prompt.encode('utf-8', 'ignore').decode()  # avoid reading non-utf8 chars
        # print(prompt)
        response = {"query": query, "source_documents": related_docs_with_score}
        return response, prompt
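
# A hedged usage sketch of LocalDocQA on its own. The embeddings instance and the
# paths are illustrative assumptions; knowledge_archive_interface below wires these
# up for real use.
#   qa = LocalDocQA()
#   qa.init_cfg()
#   vs_path, loaded = qa.init_knowledge_vector_store(["./docs/a.md"], "./vs/default", text2vec=text2vec)
#   resp, prompt = qa.get_knowledge_based_conent_test("查询内容", vs_path, chunk_conent=True, text2vec=text2vec)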


def construct_vector_store(vs_id, vs_path, files, sentence_size, history, one_conent, one_content_segmentation, text2vec):
    for file in files:
        assert os.path.exists(file), "输入文件不存在:" + file  # input file does not exist
    import nltk
    if NLTK_DATA_PATH not in nltk.data.path:
        nltk.data.path = [NLTK_DATA_PATH] + nltk.data.path
    local_doc_qa = LocalDocQA()
    local_doc_qa.init_cfg()
    filelist = []
    if not os.path.exists(os.path.join(vs_path, vs_id)):
        os.makedirs(os.path.join(vs_path, vs_id))
    # Copy the uploaded files into the knowledge-base directory, then index them.
    for file in files:
        file_name = file.name if not isinstance(file, str) else file
        filename = os.path.split(file_name)[-1]
        shutil.copyfile(file_name, os.path.join(vs_path, vs_id, filename))
        filelist.append(os.path.join(vs_path, vs_id, filename))
    vs_path, loaded_files = local_doc_qa.init_knowledge_vector_store(filelist, os.path.join(vs_path, vs_id), sentence_size, text2vec)

    if len(loaded_files):
        file_status = f"已添加 {'、'.join([os.path.split(i)[-1] for i in loaded_files if i])} 内容至知识库,并已加载知识库,请开始提问"
    else:
        pass
        # file_status = "文件未成功加载,请重新上传文件"
    # print(file_status)
    return local_doc_qa, vs_path


@Singleton
class knowledge_archive_interface():
    def __init__(self) -> None:
        self.threadLock = threading.Lock()
        self.current_id = ""
        self.kai_path = None
        self.qa_handle = None
        self.text2vec_large_chinese = None

    def get_chinese_text2vec(self):
        if self.text2vec_large_chinese is None:
            # < ------------------- warm up the text-embedding model --------------- >
            from toolbox import ProxyNetworkActivate
            print('Checking Text2vec ...')
            from langchain.embeddings.huggingface import HuggingFaceEmbeddings
            with ProxyNetworkActivate('Download_LLM'):  # temporarily activate the proxy network
                self.text2vec_large_chinese = HuggingFaceEmbeddings(model_name="GanymedeNil/text2vec-large-chinese")

        return self.text2vec_large_chinese

    def feed_archive(self, file_manifest, vs_path, id="default"):
        self.threadLock.acquire()
        self.current_id = id
        self.qa_handle, self.kai_path = construct_vector_store(
            vs_id=self.current_id,
            vs_path=vs_path,
            files=file_manifest,
            sentence_size=100,
            history=[],
            one_conent="",
            one_content_segmentation="",
            text2vec=self.get_chinese_text2vec(),
        )
        self.threadLock.release()

    def get_current_archive_id(self):
        return self.current_id

    def get_loaded_file(self, vs_path):
        return self.qa_handle.get_loaded_file(vs_path)

    def answer_with_archive_by_id(self, txt, id, vs_path):
        self.threadLock.acquire()
        if not self.current_id == id:
            self.current_id = id
            self.qa_handle, self.kai_path = construct_vector_store(
                vs_id=self.current_id,
                vs_path=vs_path,
                files=[],
                sentence_size=100,
                history=[],
                one_conent="",
                one_content_segmentation="",
                text2vec=self.get_chinese_text2vec(),
            )
        # Local retrieval settings; these intentionally shadow the module-level defaults.
        VECTOR_SEARCH_SCORE_THRESHOLD = 0
        VECTOR_SEARCH_TOP_K = 4
        CHUNK_SIZE = 512
        resp, prompt = self.qa_handle.get_knowledge_based_conent_test(
            query=txt,
            vs_path=self.kai_path,
            score_threshold=VECTOR_SEARCH_SCORE_THRESHOLD,
            vector_search_top_k=VECTOR_SEARCH_TOP_K,
            chunk_conent=True,
            chunk_size=CHUNK_SIZE,
            text2vec=self.get_chinese_text2vec(),
        )
        self.threadLock.release()
        return resp, prompt
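

# A hedged end-to-end sketch for the singleton interface above. The file path and the
# vector-store directory are illustrative placeholders, not values from this project.
if __name__ == "__main__":
    kai = knowledge_archive_interface()
    # Index local files into the "default" archive, then query it.
    kai.feed_archive(file_manifest=["./docs/example.md"], vs_path="./vector_store", id="default")
    resp, prompt = kai.answer_with_archive_by_id("文档的主要内容是什么?", id="default", vs_path="./vector_store")
    print(prompt)  # the retrieval-augmented prompt that would be sent to the LLM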