文件
gpt_academic/shared_utils/text_mask.py
binary-husky c3140ce344 merge frontier branch (#1620)
* Zhipu sdk update 适配最新的智谱SDK,支持GLM4v (#1502)

* 适配 google gemini 优化为从用户input中提取文件

* 适配最新的智谱SDK、支持glm-4v

* requirements.txt fix

* pending history check

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* Update "生成多种Mermaid图表" plugin: Separate out the file reading function (#1520)

* Update crazy_functional.py with new functionality deal with PDF

* Update crazy_functional.py and Mermaid.py for plugin_kwargs

* Update crazy_functional.py with new chart type: mind map

* Update SELECT_PROMPT and i_say_show_user messages

* Update ArgsReminder message in get_crazy_functions() function

* Update with read md file and update PROMPTS

* Return the PROMPTS as the test found that the initial version worked best

* Update Mermaid chart generation function

* version 3.71

* 解决issues #1510

* Remove unnecessary text from sys_prompt in 解析历史输入 function

* Remove sys_prompt message in 解析历史输入 function

* Update bridge_all.py: supports gpt-4-turbo-preview (#1517)

* Update bridge_all.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update bridge_all.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Update config.py: supports gpt-4-turbo-preview (#1516)

* Update config.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update config.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Refactor 解析历史输入 function to handle file input

* Update Mermaid chart generation functionality

* rename files and functions

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* 接入mathpix ocr功能 (#1468)

* Update Latex输出PDF结果.py

借助mathpix实现了PDF翻译中文并重新编译PDF

* Update config.py

add mathpix appid & appkey

* Add 'PDF翻译中文并重新编译PDF' feature to plugins.

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* fix zhipuai

* check picture

* remove glm-4 due to bug

* 修改config

* 检查MATHPIX_APPID

* Remove unnecessary code and update
function_plugins dictionary

* capture non-standard token overflow

* bug fix #1524

* change mermaid style

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽 (#1530)

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽

* 微调未果 先stage一下

* update

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* ver 3.72

* change live2d

* save the status of ``clear btn` in cookie

* 前端选择保持

* js ui bug fix

* reset btn bug fix

* update live2d tips

* fix missing get_token_num method

* fix live2d toggle switch

* fix persistent custom btn with cookie

* fix zhipuai feedback with core functionality

* Refactor button update and clean up functions

* tailing space removal

* Fix missing MATHPIX_APPID and MATHPIX_APPKEY
configuration

* Prompt fix、脑图提示词优化 (#1537)

* 适配 google gemini 优化为从用户input中提取文件

* 脑图提示词优化

* Fix missing MATHPIX_APPID and MATHPIX_APPKEY
configuration

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* 优化“PDF翻译中文并重新编译PDF”插件 (#1602)

* Add gemini_endpoint to API_URL_REDIRECT (#1560)

* Add gemini_endpoint to API_URL_REDIRECT

* Update gemini-pro and gemini-pro-vision model_info
endpoints

* Update to support new claude models (#1606)

* Add anthropic library and update claude models

* 更新bridge_claude.py文件,添加了对图片输入的支持。修复了一些bug。

* 添加Claude_3_Models变量以限制图片数量

* Refactor code to improve readability and
maintainability

* minor claude bug fix

* more flexible one-api support

* reformat config

* fix one-api new access bug

* dummy

* compat non-standard api

* version 3.73

---------

Co-authored-by: XIao <46100050+Kilig947@users.noreply.github.com>
Co-authored-by: Menghuan1918 <menghuan2003@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: Hao Ma <893017927@qq.com>
Co-authored-by: zeyuan huang <599012428@qq.com>
2024-03-11 17:26:09 +08:00

108 行
5.5 KiB
Python

此文件含有模棱两可的 Unicode 字符

此文件含有可能会与其他字符混淆的 Unicode 字符。 如果您是想特意这样的,可以安全地忽略该警告。 使用 Escape 按钮显示他们。

import re
from functools import lru_cache
# 这段代码是使用Python编程语言中的re模块,即正则表达式库,来定义了一个正则表达式模式。
# 这个模式被编译成一个正则表达式对象,存储在名为const_extract_exp的变量中,以便于后续快速的匹配和查找操作。
# 这里解释一下正则表达式中的几个特殊字符:
# - . 表示任意单一字符。
# - * 表示前一个字符可以出现0次或多次。
# - ? 在这里用作非贪婪匹配,也就是说它会匹配尽可能少的字符。在(.*?)中,它确保我们匹配的任意文本是尽可能短的,也就是说,它会在</show_llm>和</show_render>标签之前停止匹配。
# - () 括号在正则表达式中表示捕获组。
# - 在这个例子中,(.*?)表示捕获任意长度的文本,直到遇到括号外部最近的限定符,即</show_llm>和</show_render>。
# -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-==-=-=-=/1=-=-=-=-=-=-=-=-=-=-=-=-=-=/2-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
const_extract_re = re.compile(
r"<gpt_academic_string_mask><show_llm>(.*?)</show_llm><show_render>(.*?)</show_render></gpt_academic_string_mask>"
)
# -=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-==-=-=-=-=-=/1=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-/2-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
const_extract_langbased_re = re.compile(
r"<gpt_academic_string_mask><lang_english>(.*?)</lang_english><lang_chinese>(.*?)</lang_chinese></gpt_academic_string_mask>",
flags=re.DOTALL,
)
@lru_cache(maxsize=128)
def apply_gpt_academic_string_mask(string, mode="show_all"):
"""
当字符串中有掩码tag时<gpt_academic_string_mask><show_...>,根据字符串要给谁看大模型,还是web渲染,对字符串进行处理,返回处理后的字符串
示意图https://mermaid.live/edit#pako:eNqlkUtLw0AUhf9KuOta0iaTplkIPlpduFJwoZEwJGNbzItpita2O6tF8QGKogXFtwu7cSHiq3-mk_oznFR8IYLgrGbuOd9hDrcCpmcR0GDW9ubNPKaBMDauuwI_A9M6YN-3y0bODwxsYos4BdMoBrTg5gwHF-d0mBH6-vqFQe58ed5m9XPW2uteX3Tubrj0ljLYcwxxR3h1zB43WeMs3G19yEM9uapDMe_NG9i2dagKw1Fee4c1D9nGEbtc-5n6HbNtJ8IyHOs8tbs7V2HrlDX2w2Y7XD_5haHEtQiNsOwfMVa_7TzsvrWIuJGo02qTrdwLk9gukQylHv3Afv1ML270s-HZUndrmW1tdA-WfvbM_jMFYuAQ6uCCxVdciTJ1CPLEITpo_GphypeouzXuw6XAmyi7JmgBLZEYlHwLB2S4gHMUO-9DH7tTnvf1CVoFFkBLSOk4QmlRTqpIlaWUHINyNFXjaQWpCYRURUKiWovBYo8X4ymEJFlECQUpqaQkJmuvWygPpg
"""
if "<gpt_academic_string_mask>" not in string: # No need to process
return string
if mode == "show_all":
return string
if mode == "show_llm":
string = const_extract_re.sub(r"\1", string)
elif mode == "show_render":
string = const_extract_re.sub(r"\2", string)
else:
raise ValueError("Invalid mode")
return string
@lru_cache(maxsize=128)
def build_gpt_academic_masked_string(text_show_llm="", text_show_render=""):
"""
根据字符串要给谁看大模型,还是web渲染,生成带掩码tag的字符串
"""
return f"<gpt_academic_string_mask><show_llm>{text_show_llm}</show_llm><show_render>{text_show_render}</show_render></gpt_academic_string_mask>"
@lru_cache(maxsize=128)
def apply_gpt_academic_string_mask_langbased(string, lang_reference):
"""
当字符串中有掩码tag时<gpt_academic_string_mask><lang_...>),根据语言,选择提示词,对字符串进行处理,返回处理后的字符串
例如,如果lang_reference是英文,那么就只显示英文提示词,中文提示词就不显示了
举例:
输入1
string = "注意,lang_reference这段文字是<gpt_academic_string_mask><lang_english>英语</lang_english><lang_chinese>中文</lang_chinese></gpt_academic_string_mask>"
lang_reference = "hello world"
输出1
"注意,lang_reference这段文字是英语"
输入2
string = "注意,lang_reference这段文字是中文" # 注意这里没有掩码tag,所以不会被处理
lang_reference = "hello world"
输出2
"注意,lang_reference这段文字是中文" # 原样返回
"""
if "<gpt_academic_string_mask>" not in string: # No need to process
return string
def contains_chinese(string):
chinese_regex = re.compile(u'[\u4e00-\u9fff]+')
return chinese_regex.search(string) is not None
mode = "english" if not contains_chinese(lang_reference) else "chinese"
if mode == "english":
string = const_extract_langbased_re.sub(r"\1", string)
elif mode == "chinese":
string = const_extract_langbased_re.sub(r"\2", string)
else:
raise ValueError("Invalid mode")
return string
@lru_cache(maxsize=128)
def build_gpt_academic_masked_string_langbased(text_show_english="", text_show_chinese=""):
"""
根据语言,选择提示词,对字符串进行处理,返回处理后的字符串
"""
return f"<gpt_academic_string_mask><lang_english>{text_show_english}</lang_english><lang_chinese>{text_show_chinese}</lang_chinese></gpt_academic_string_mask>"
if __name__ == "__main__":
# Test
input_string = (
"你好\n"
+ build_gpt_academic_masked_string(text_show_llm="mermaid", text_show_render="")
+ "你好\n"
)
print(
apply_gpt_academic_string_mask(input_string, "show_llm")
) # Should print the strings with 'abc' in place of the academic mask tags
print(
apply_gpt_academic_string_mask(input_string, "show_render")
) # Should print the strings with 'xyz' in place of the academic mask tags