文件
gpt_academic/crazy_functions/pdf_fns/parse_pdf.py
binary-husky c3140ce344 merge frontier branch (#1620)
* Zhipu sdk update 适配最新的智谱SDK,支持GLM4v (#1502)

* 适配 google gemini 优化为从用户input中提取文件

* 适配最新的智谱SDK、支持glm-4v

* requirements.txt fix

* pending history check

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* Update "生成多种Mermaid图表" plugin: Separate out the file reading function (#1520)

* Update crazy_functional.py with new functionality deal with PDF

* Update crazy_functional.py and Mermaid.py for plugin_kwargs

* Update crazy_functional.py with new chart type: mind map

* Update SELECT_PROMPT and i_say_show_user messages

* Update ArgsReminder message in get_crazy_functions() function

* Update with read md file and update PROMPTS

* Return the PROMPTS as the test found that the initial version worked best

* Update Mermaid chart generation function

* version 3.71

* 解决issues #1510

* Remove unnecessary text from sys_prompt in 解析历史输入 function

* Remove sys_prompt message in 解析历史输入 function

* Update bridge_all.py: supports gpt-4-turbo-preview (#1517)

* Update bridge_all.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update bridge_all.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Update config.py: supports gpt-4-turbo-preview (#1516)

* Update config.py: supports gpt-4-turbo-preview

supports gpt-4-turbo-preview

* Update config.py

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* Refactor 解析历史输入 function to handle file input

* Update Mermaid chart generation functionality

* rename files and functions

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* 接入mathpix ocr功能 (#1468)

* Update Latex输出PDF结果.py

借助mathpix实现了PDF翻译中文并重新编译PDF

* Update config.py

add mathpix appid & appkey

* Add 'PDF翻译中文并重新编译PDF' feature to plugins.

---------

Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* fix zhipuai

* check picture

* remove glm-4 due to bug

* 修改config

* 检查MATHPIX_APPID

* Remove unnecessary code and update
function_plugins dictionary

* capture non-standard token overflow

* bug fix #1524

* change mermaid style

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽 (#1530)

* 支持mermaid 滚动放大缩小重置,鼠标滚动和拖拽

* 微调未果 先stage一下

* update

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>
Co-authored-by: binary-husky <96192199+binary-husky@users.noreply.github.com>

* ver 3.72

* change live2d

* save the status of ``clear btn` in cookie

* 前端选择保持

* js ui bug fix

* reset btn bug fix

* update live2d tips

* fix missing get_token_num method

* fix live2d toggle switch

* fix persistent custom btn with cookie

* fix zhipuai feedback with core functionality

* Refactor button update and clean up functions

* tailing space removal

* Fix missing MATHPIX_APPID and MATHPIX_APPKEY
configuration

* Prompt fix、脑图提示词优化 (#1537)

* 适配 google gemini 优化为从用户input中提取文件

* 脑图提示词优化

* Fix missing MATHPIX_APPID and MATHPIX_APPKEY
configuration

---------

Co-authored-by: binary-husky <qingxu.fu@outlook.com>

* 优化“PDF翻译中文并重新编译PDF”插件 (#1602)

* Add gemini_endpoint to API_URL_REDIRECT (#1560)

* Add gemini_endpoint to API_URL_REDIRECT

* Update gemini-pro and gemini-pro-vision model_info
endpoints

* Update to support new claude models (#1606)

* Add anthropic library and update claude models

* 更新bridge_claude.py文件,添加了对图片输入的支持。修复了一些bug。

* 添加Claude_3_Models变量以限制图片数量

* Refactor code to improve readability and
maintainability

* minor claude bug fix

* more flexible one-api support

* reformat config

* fix one-api new access bug

* dummy

* compat non-standard api

* version 3.73

---------

Co-authored-by: XIao <46100050+Kilig947@users.noreply.github.com>
Co-authored-by: Menghuan1918 <menghuan2003@outlook.com>
Co-authored-by: hongyi-zhao <hongyi.zhao@gmail.com>
Co-authored-by: Hao Ma <893017927@qq.com>
Co-authored-by: zeyuan huang <599012428@qq.com>
2024-03-11 17:26:09 +08:00

172 行
7.9 KiB
Python

此文件含有模棱两可的 Unicode 字符

此文件含有可能会与其他字符混淆的 Unicode 字符。 如果您是想特意这样的,可以安全地忽略该警告。 使用 Escape 按钮显示他们。

from functools import lru_cache
from toolbox import gen_time_str
from toolbox import promote_file_to_downloadzone
from toolbox import write_history_to_file, promote_file_to_downloadzone
from toolbox import get_conf
from toolbox import ProxyNetworkActivate
from colorful import *
import requests
import random
import copy
import os
import math
class GROBID_OFFLINE_EXCEPTION(Exception): pass
def get_avail_grobid_url():
GROBID_URLS = get_conf('GROBID_URLS')
if len(GROBID_URLS) == 0: return None
try:
_grobid_url = random.choice(GROBID_URLS) # 随机负载均衡
if _grobid_url.endswith('/'): _grobid_url = _grobid_url.rstrip('/')
with ProxyNetworkActivate('Connect_Grobid'):
res = requests.get(_grobid_url+'/api/isalive')
if res.text=='true': return _grobid_url
else: return None
except:
return None
@lru_cache(maxsize=32)
def parse_pdf(pdf_path, grobid_url):
import scipdf # pip install scipdf_parser
if grobid_url.endswith('/'): grobid_url = grobid_url.rstrip('/')
try:
with ProxyNetworkActivate('Connect_Grobid'):
article_dict = scipdf.parse_pdf_to_dict(pdf_path, grobid_url=grobid_url)
except GROBID_OFFLINE_EXCEPTION:
raise GROBID_OFFLINE_EXCEPTION("GROBID服务不可用,请修改config中的GROBID_URL,可修改成本地GROBID服务。")
except:
raise RuntimeError("解析PDF失败,请检查PDF是否损坏。")
return article_dict
def produce_report_markdown(gpt_response_collection, meta, paper_meta_info, chatbot, fp, generated_conclusion_files):
# -=-=-=-=-=-=-=-= 写出第1个文件翻译前后混合 -=-=-=-=-=-=-=-=
res_path = write_history_to_file(meta + ["# Meta Translation" , paper_meta_info] + gpt_response_collection, file_basename=f"{gen_time_str()}translated_and_original.md", file_fullname=None)
promote_file_to_downloadzone(res_path, rename_file=os.path.basename(res_path)+'.md', chatbot=chatbot)
generated_conclusion_files.append(res_path)
# -=-=-=-=-=-=-=-= 写出第2个文件仅翻译后的文本 -=-=-=-=-=-=-=-=
translated_res_array = []
# 记录当前的大章节标题:
last_section_name = ""
for index, value in enumerate(gpt_response_collection):
# 先挑选偶数序列号:
if index % 2 != 0:
# 先提取当前英文标题:
cur_section_name = gpt_response_collection[index-1].split('\n')[0].split(" Part")[0]
# 如果index是1的话,则直接使用first section name
if cur_section_name != last_section_name:
cur_value = cur_section_name + '\n'
last_section_name = copy.deepcopy(cur_section_name)
else:
cur_value = ""
# 再做一个小修改重新修改当前part的标题,默认用英文的
cur_value += value
translated_res_array.append(cur_value)
res_path = write_history_to_file(meta + ["# Meta Translation" , paper_meta_info] + translated_res_array,
file_basename = f"{gen_time_str()}-translated_only.md",
file_fullname = None,
auto_caption = False)
promote_file_to_downloadzone(res_path, rename_file=os.path.basename(res_path)+'.md', chatbot=chatbot)
generated_conclusion_files.append(res_path)
return res_path
def translate_pdf(article_dict, llm_kwargs, chatbot, fp, generated_conclusion_files, TOKEN_LIMIT_PER_FRAGMENT, DST_LANG):
from crazy_functions.pdf_fns.report_gen_html import construct_html
from crazy_functions.pdf_fns.breakdown_txt import breakdown_text_to_satisfy_token_limit
from crazy_functions.crazy_utils import request_gpt_model_in_new_thread_with_ui_alive
from crazy_functions.crazy_utils import request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency
prompt = "以下是一篇学术论文的基本信息:\n"
# title
title = article_dict.get('title', '无法获取 title'); prompt += f'title:{title}\n\n'
# authors
authors = article_dict.get('authors', '无法获取 authors')[:100]; prompt += f'authors:{authors}\n\n'
# abstract
abstract = article_dict.get('abstract', '无法获取 abstract'); prompt += f'abstract:{abstract}\n\n'
# command
prompt += f"请将题目和摘要翻译为{DST_LANG}"
meta = [f'# Title:\n\n', title, f'# Abstract:\n\n', abstract ]
# 单线,获取文章meta信息
paper_meta_info = yield from request_gpt_model_in_new_thread_with_ui_alive(
inputs=prompt,
inputs_show_user=prompt,
llm_kwargs=llm_kwargs,
chatbot=chatbot, history=[],
sys_prompt="You are an academic paper reader。",
)
# 多线,翻译
inputs_array = []
inputs_show_user_array = []
# get_token_num
from request_llms.bridge_all import model_info
enc = model_info[llm_kwargs['llm_model']]['tokenizer']
def get_token_num(txt): return len(enc.encode(txt, disallowed_special=()))
def break_down(txt):
raw_token_num = get_token_num(txt)
if raw_token_num <= TOKEN_LIMIT_PER_FRAGMENT:
return [txt]
else:
# raw_token_num > TOKEN_LIMIT_PER_FRAGMENT
# find a smooth token limit to achieve even seperation
count = int(math.ceil(raw_token_num / TOKEN_LIMIT_PER_FRAGMENT))
token_limit_smooth = raw_token_num // count + count
return breakdown_text_to_satisfy_token_limit(txt, limit=token_limit_smooth, llm_model=llm_kwargs['llm_model'])
for section in article_dict.get('sections'):
if len(section['text']) == 0: continue
section_frags = break_down(section['text'])
for i, fragment in enumerate(section_frags):
heading = section['heading']
if len(section_frags) > 1: heading += f' Part-{i+1}'
inputs_array.append(
f"你需要翻译{heading}章节,内容如下: \n\n{fragment}"
)
inputs_show_user_array.append(
f"# {heading}\n\n{fragment}"
)
gpt_response_collection = yield from request_gpt_model_multi_threads_with_very_awesome_ui_and_high_efficiency(
inputs_array=inputs_array,
inputs_show_user_array=inputs_show_user_array,
llm_kwargs=llm_kwargs,
chatbot=chatbot,
history_array=[meta for _ in inputs_array],
sys_prompt_array=[
"请你作为一个学术翻译,负责把学术论文准确翻译成中文。注意文章中的每一句话都要翻译。" for _ in inputs_array],
)
# -=-=-=-=-=-=-=-= 写出Markdown文件 -=-=-=-=-=-=-=-=
produce_report_markdown(gpt_response_collection, meta, paper_meta_info, chatbot, fp, generated_conclusion_files)
# -=-=-=-=-=-=-=-= 写出HTML文件 -=-=-=-=-=-=-=-=
ch = construct_html()
orig = ""
trans = ""
gpt_response_collection_html = copy.deepcopy(gpt_response_collection)
for i,k in enumerate(gpt_response_collection_html):
if i%2==0:
gpt_response_collection_html[i] = inputs_show_user_array[i//2]
else:
# 先提取当前英文标题:
cur_section_name = gpt_response_collection[i-1].split('\n')[0].split(" Part")[0]
cur_value = cur_section_name + "\n" + gpt_response_collection_html[i]
gpt_response_collection_html[i] = cur_value
final = ["", "", "一、论文概况", "", "Abstract", paper_meta_info, "二、论文翻译", ""]
final.extend(gpt_response_collection_html)
for i, k in enumerate(final):
if i%2==0:
orig = k
if i%2==1:
trans = k
ch.add_row(a=orig, b=trans)
create_report_file_name = f"{os.path.basename(fp)}.trans.html"
html_file = ch.save_file(create_report_file_name)
generated_conclusion_files.append(html_file)
promote_file_to_downloadzone(html_file, rename_file=os.path.basename(html_file), chatbot=chatbot)