From afcb944c185fb78c768c8750e3b12490f235400a Mon Sep 17 00:00:00 2001
From: DYR1 <1004356985@qq.com>
Date: Mon, 24 Apr 2023 10:51:49 +0800
Subject: [PATCH] Modify README.md

---
 README.md    | 29 ++++-------------------------
 README_EN.md | 35 +++++++++++++++++++++++++++++++----
 2 files changed, 35 insertions(+), 29 deletions(-)

diff --git a/README.md b/README.md
index 3729ab1..7447fee 100644
--- a/README.md
+++ b/README.md
@@ -1,9 +1,7 @@
 [**中文**](./README.md) | [**English**](./README_EN.md)

-[image: SCIR-HI-HuaTuo]
-

@@ -12,9 +10,7 @@ ### HuaTuo: Tuning LLaMA Model With Chinese Medical Instructions
-[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/blob/main/LICENSE)
-
-[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
+[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/blob/main/LICENSE) [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
 
 This project open-sources a LLaMA-7B model instruction-tuned (instruct-tuned) on Chinese medical instructions. We built a Chinese medical instruction dataset from a medical knowledge graph and the GPT3.5 API, instruction-tuned LLaMA on it, and thereby improved LLaMA's question-answering performance in the medical domain.
@@ -43,7 +39,7 @@ pip install -r requirements.txt
 LORA weights can be downloaded via Baidu Netdisk or HuggingFace:
 
  - Based on the medical knowledge base: [Baidu Netdisk](https://pan.baidu.com/s/1jih-pEr6jzEa6n2u6sUMOg?pwd=jjpf) and [HuggingFace](https://huggingface.co/thinksoso/lora-llama-med)
- - Based on medical literature: [Baidu Netdisk]()
+ - Based on medical literature: [Baidu Netdisk](https://pan.baidu.com/s/1jADypClR2bLyXItuFfSjPA?pwd=odsk)
 
 Download the LORA weights and extract them; the extracted layout is as follows:
@@ -52,17 +48,13 @@ LORA weights can be downloaded via Baidu Netdisk or HuggingFace
 ```
 #Based on the medical knowledge base
 lora-llama-med/
-  - adapter_config.json   # LoRA weight configuration file
-  - adapter_model.bin   # LoRA weight file
 
 #Based on medical literature
 lora-llama-med-literature/
-  - adapter_config.json   # LoRA weight configuration file
-  - adapter_model.bin   # LoRA weight file
 ```
@@ -132,9 +124,7 @@ bash ./scripts/infer-literature-multi.sh
 The quality of the instruction-tuning dataset is still limited and will be iterated on continuously; the medical knowledge base and the dataset-construction code are still being organized and will be released once complete.
 
-In addition, we collected Chinese medical literature on liver cancer published in 2023 and used the GPT3.5 API to build multi-round Q&A data around it. We provide 1k of these training examples in `./data_literature/liver_cancer.json`.
-
-At present, the quality of the training samples is still limited; we will further iterate on the data and release it as a `dataset`. A training sample looks as follows:
+In addition, we collected Chinese medical literature on liver cancer published in 2023 and used the GPT3.5 API to build multi-round Q&A data around it. We provide 1k of these training examples in `./data_literature/liver_cancer.json`. At present, the quality of the training samples is still limited; we will further iterate on the data and release it as a `public dataset`. A training sample looks as follows:
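As a purely textual stand-in for the sample referenced just above (which is rendered as an image in the README and does not survive in this patch), a multi-round, literature-grounded training record might look like the sketch below. This is a minimal, hypothetical illustration: the field names (`reference`, `rounds`, `question`, `answer`) and all values are invented and are not the actual schema of `./data_literature/liver_cancer.json`.

```python
import json

# Hypothetical multi-round training record grounded in a literature conclusion.
# Field names and values are invented for illustration; the real schema of
# liver_cancer.json may differ.
sample = {
    "reference": "Conclusion from a 2023 Chinese paper on early-stage liver cancer ...",
    "rounds": [
        {
            "question": "What does the cited study conclude about early-stage treatment?",
            "answer": "According to the cited conclusion, ...",
        },
        {
            "question": "Which patient group does that conclusion apply to?",
            "answer": "The study population consisted of ...",
        },
    ],
}

print(json.dumps(sample, ensure_ascii=False, indent=2))
```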

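For background on how the extracted adapter directories above (`lora-llama-med/` and `lora-llama-med-literature/`) are typically consumed, here is a minimal loading sketch using the `transformers` and `peft` libraries. The base-model path, dtype, and prompt below are assumptions, not the project's pinned configuration.

```python
import torch
from peft import PeftModel
from transformers import LlamaForCausalLM, LlamaTokenizer

# Assumed base checkpoint; the project's scripts may pin a different path.
BASE_MODEL = "decapoda-research/llama-7b-hf"

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
model = LlamaForCausalLM.from_pretrained(
    BASE_MODEL, torch_dtype=torch.float16, device_map="auto"
)

# Attach the downloaded adapter; peft reads adapter_config.json and
# adapter_model.bin from the extracted directory. Swap in
# "./lora-llama-med-literature" for the literature-based weights.
model = PeftModel.from_pretrained(model, "./lora-llama-med")
model.eval()

prompt = "问题:肝癌的早期症状有哪些?\n回答:"  # sample medical question
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=256)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

`PeftModel.from_pretrained` injects the low-rank weights into the matching layers of the base model, which is why only the small adapter files need to be distributed.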
@@ -191,7 +181,7 @@ https://wandb.ai/thinksoso/llama_med/runs/a5wgcnzt/overview?workspace=user-think
 ## Contributors
-This project was completed by the Health Intelligence Group of the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology: [Haochun Wang](https://github.com/s65b40), [Yanrui Du](https://github.com/DYR1), [Chi Liu](https://github.com/thinksoso), [Rui Bai](), [Nuwa Xi](https://github.com/rootnx), Yuhan Chen, [Zewen Qiang](https://github.com/1278882181), Jianyu Chen, and [Zijian Li](https://github.com/FlowolfzzZ), under the supervision of Associate Professor Sendong Zhao, Professor Bing Qin, and Professor Ting Liu.
+This project was completed by the Health Intelligence Group of the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology: [Haochun Wang](https://github.com/s65b40), [Yanrui Du](https://github.com/DYR1), [Chi Liu](https://github.com/thinksoso), [Rui Bai](https://github.com/RuiBai1999), [Nuwa Xi](https://github.com/rootnx), [Yuhan Chen](https://github.com/Imsovegetable), [Zewen Qiang](https://github.com/1278882181), [Jianyu Chen](https://github.com/JianyuChen01), and [Zijian Li](https://github.com/FlowolfzzZ), under the supervision of Associate Professor Sendong Zhao, Professor Bing Qin, and Professor Ting Liu.
@@ -205,13 +195,9 @@
 - Facebook LLaMA: https://github.com/facebookresearch/llama
-
 - Stanford Alpaca: https://github.com/tatsu-lab/stanford_alpaca
-
 - alpaca-lora by @tloen: https://github.com/tloen/alpaca-lora
-
 - CMeKG https://github.com/king-yyf/CMeKG_tools
-
 - 文心一言 https://yiyan.baidu.com/welcome The logo of this project was automatically generated by 文心一言
@@ -231,19 +217,12 @@
 ```
 @misc{wang2023huatuo,
-      title={HuaTuo: Tuning LLaMA Model with Chinese Medical Knowledge},
-      author={Haochun Wang and Chi Liu and Nuwa Xi and Zewen Qiang and Sendong Zhao and Bing Qin and Ting Liu},
-      year={2023},
-      eprint={2304.06975},
-      archivePrefix={arXiv},
-      primaryClass={cs.CL}
-}
 ```
\ No newline at end of file
diff --git a/README_EN.md b/README_EN.md
index 1fa33d8..8ffeb2b 100644
--- a/README_EN.md
+++ b/README_EN.md
@@ -4,11 +4,12 @@

 # HuaTuo: Tuning LLaMA Model With Chinese Medical Instructions
-[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/blob/main/LICENSE)
-[![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
+[![Code License](https://img.shields.io/badge/Code%20License-Apache_2.0-green.svg)](https://github.com/SCIR-HI/Huatuo-Llama-Med-Chinese/blob/main/LICENSE) [![Python 3.9+](https://img.shields.io/badge/python-3.9+-blue.svg)](https://www.python.org/downloads/release/python-390/)
 
 This repo open-sources the Instruct-tuned LLaMA-7B model that has been fine-tuned with Chinese medical instructions. We constructed a Chinese medical instruct-tuning dataset using medical knowledge graphs and the GPT3.5 API, and performed instruction-tuning on LLaMA based on this dataset, improving its question-answering performance in the medical field.
 
+In addition, we tried to use the GPT3.5 API to integrate conclusions from the medical literature as external information into multi-round dialogue, and fine-tuned LLaMA on that basis. At present, we have only released the model parameters trained for the single disease "liver cancer". In the future, we plan to release a medical dialogue dataset that incorporates medical literature conclusions, and to train models for 16 diseases related to the liver, gallbladder, and pancreas.
+
 We also trained a medical version of ChatGLM: [ChatGLM-6B-Med](https://github.com/SCIR-HI/Med-ChatGLM) based on the same data.
 
 We are about to release our new model [扁鹊(PienChueh)](https://github.com/SCIR-HI/Bian-Que_Pien-Chueh).
@@ -21,14 +22,24 @@ Firstly, install the required packages. It is recommended to use Python 3.9 or a
 pip install -r requirements.txt
 ```
 ### Model download
-LORA weights can be downloaded through [Baidu Netdisk](https://pan.baidu.com/s/1jih-pEr6jzEa6n2u6sUMOg?pwd=jjpf) or [HuggingFace](https://huggingface.co/thinksoso/lora-llama-med).
+LORA weights can be downloaded through Baidu Netdisk or HuggingFace:
+
+- Based on the medical knowledge base: [Baidu Netdisk](https://pan.baidu.com/s/1jih-pEr6jzEa6n2u6sUMOg?pwd=jjpf) or [HuggingFace](https://huggingface.co/thinksoso/lora-llama-med)
+- Based on medical literature: [Baidu Netdisk](https://pan.baidu.com/s/1jADypClR2bLyXItuFfSjPA?pwd=odsk)
 
 Download the LORA weight file and extract it. The format of the extracted file should be as follows:
 
 ```
+#Based on the medical knowledge base
 lora-llama-med/
   - adapter_config.json # LoRA weight configuration file
   - adapter_model.bin # LoRA weights
+
+#Based on medical literature
+lora-llama-med-literature/
+  - adapter_config.json # LoRA weight configuration file
+  - adapter_model.bin # LoRA weights
+
 ```
 
 ### Infer
@@ -37,7 +48,15 @@ We provided some test cases in `./data/infer.json`, which can be replaced with o
 
 Run the infer script
 ```
+#Based on the medical knowledge base
 bash ./scripts/infer.sh
+
+#Based on medical literature
+#single-round dialogue
+bash ./scripts/infer-literature-single.sh
+
+#multi-round dialogue
+bash ./scripts/infer-literature-multi.sh
 ```
 
 ### Dataset construction
@@ -61,6 +80,14 @@ We provided a training dataset for the model, consisting of more than eight thou
 
 The quality of the dataset for instruct-tuning is still limited. We will continue to iterate and improve it. Meanwhile, the medical knowledge base and dataset construction code are still being organized and will be released once completed.
 
+In addition, we collected Chinese medical literature on liver cancer published in 2023 and used the GPT3.5 API to build multi-round question-answering data around it. We provide 1k of these training examples in `./data_literature/liver_cancer.json`. At present, the quality of the training samples is still limited; we will continue to iterate on the data and release it as a public dataset. An example of a training sample is as follows:
+

+
+[image: SCIR-HI-HuaTuo-literature]
+

+
 ### Finetune
 To fine-tune LLaMA with your own dataset, please construct your dataset following the format of `./data/llama_data.json` and run the finetune script.
 
@@ -90,7 +117,7 @@ https://wandb.ai/thinksoso/llama_med/runs/a5wgcnzt/overview?workspace=user-think
 
 ## Contributors
-This project was founded by the Health Intelligence Group of the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology, including [Haochun Wang](https://github.com/s65b40), [Chi Liu](https://github.com/thinksoso), [Nuwa Xi](https://github.com/rootnx), [Zewen Qiang](https://github.com/1278882181), [Zijian Li](https://github.com/FlowolfzzZ) supervised by Associate Professor Sendong Zhao, Professor Bing Qin, and Professor Ting Liu.
+This project was founded by the Health Intelligence Group of the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology, including [Haochun Wang](https://github.com/s65b40), [Yanrui Du](https://github.com/DYR1), [Chi Liu](https://github.com/thinksoso), [Rui Bai](https://github.com/RuiBai1999), [Nuwa Xi](https://github.com/rootnx), [Yuhan Chen](https://github.com/Imsovegetable), [Zewen Qiang](https://github.com/1278882181), [Jianyu Chen](https://github.com/JianyuChen01), and [Zijian Li](https://github.com/FlowolfzzZ), supervised by Associate Professor Sendong Zhao, Professor Bing Qin, and Professor Ting Liu.
 
 ## Acknowledgements
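The Finetune section above points at `./data/llama_data.json`, but the patch does not include the finetune script itself. The sketch below outlines the generic LoRA fine-tuning pattern such a script implements with `peft` and `transformers`; the hyperparameters, target modules, prompt template, and the alpaca-style `instruction`/`input`/`output` fields are all assumptions rather than the project's published settings.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (DataCollatorForLanguageModeling, LlamaForCausalLM,
                          LlamaTokenizer, Trainer, TrainingArguments)

BASE_MODEL = "decapoda-research/llama-7b-hf"  # assumed base checkpoint

tokenizer = LlamaTokenizer.from_pretrained(BASE_MODEL)
tokenizer.pad_token = tokenizer.eos_token
model = LlamaForCausalLM.from_pretrained(BASE_MODEL)

# Wrap the base model with trainable low-rank adapters; r/alpha/targets are
# illustrative defaults, not the project's published configuration.
model = get_peft_model(model, LoraConfig(
    r=8, lora_alpha=16, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"], task_type="CAUSAL_LM",
))

def to_features(example):
    # llama_data.json is assumed to follow the alpaca-style
    # instruction/input/output convention; the prompt template is invented.
    text = (f"Instruction: {example['instruction']}\n"
            f"Input: {example.get('input', '')}\n"
            f"Answer: {example['output']}")
    return tokenizer(text, truncation=True, max_length=512)

data = load_dataset("json", data_files="./data/llama_data.json")["train"]
data = data.map(to_features)

Trainer(
    model=model,
    train_dataset=data,
    args=TrainingArguments(output_dir="./lora-out", num_train_epochs=3,
                           per_device_train_batch_size=4, learning_rate=3e-4,
                           logging_steps=10),
    # Causal-LM collator copies input_ids to labels for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()

# Writes adapter_config.json plus the adapter weights, i.e. the same small
# artifact layout as the downloadable lora-llama-med/ directories.
model.save_pretrained("./lora-out")
```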