Sheng Yu, Associate Professor

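Research interests: text-based medical artificial intelligence, including large language models, search engines, automated question answering, and knowledge graphs.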

Address: Room 209-B, Weiqing Building, Tsinghua University


Phone:


Email: syu@tsinghua.edu.cn


Background
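  • Associate Professor, Center for Statistical Science, Tsinghua University (2018-)
  • Assistant Professor, Center for Statistical Science, Tsinghua University (2015-2018)
  • Postdoctoral Fellow, Harvard University (2012-2015)
  • Research Fellow, Brigham and Women's Hospital (2012-2015)
  • Ph.D. in Systems Engineering, George Washington University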

Teaching
  • Introduction to Data Science
  • Introduction to Statistical Learning

Potential Students
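  • Undergraduate research: Undergraduates with a strong interest in medical artificial intelligence are welcome to get in touch about joining the group's research.
    • Requirements: Students in their third year or above are expected to have a foundation in machine learning. Students in their second year or below should at least be proficient in Python.
    • Work: Under the guidance of Ph.D. students, participants carry out literature study, pipeline construction, model development, testing, and related tasks. Strong students will have the opportunity to research new models independently.
    • The work requires a substantial weekly time commitment and is recommended only for students who have capacity beyond their coursework. Participants are expected to commit for at least one year.
  • Graduate applications: Students with a deep learning background are welcome to apply for Ph.D. positions under my supervision.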

Projects
  • Large language models for medical informatics (to be announced)
  • BIOS, a large-scale open Chinese-English biomedical knowledge graph (https://bios.idea.edu.cn/)

Publications
  • Hongyi Yuan#, Songchi Zhou#, and Sheng Yu*. EHRDiff: Exploring Realistic EHR Synthesis with Diffusion Models. Transactions on Machine Learning Research (2024). #contributed equally.
  • Hongyi Yuan, Sheng Yu*. Efficient Symptom Inquiring and Diagnosis via Adaptive Alignment of Reinforcement Learning and Classification. Artificial Intelligence In Medicine (2024), 148:102748.
  • Zhengyun Zhao, Qiao Jin, Tuorui Peng, Sheng Yu*. A Large-scale Dataset of Patient Summaries for Retrieval-based Clinical Decision Support Systems. Scientific Data (2023), 10(1):909.
  • Yucong Lin, Keming Lu, Sheng Yu, Tianxi Cai, and Marinka Zitnik*. Multimodal learning on graphs for disease relation extraction. Journal of Biomedical Informatics (2023), 143:104415.
  • Keming Lu, Yuanren Tong, Si Yu, Yucong Lin, Yingyun Yang, Hui Xu, Yue Li*, and Sheng Yu*. Building a trustworthy AI differential diagnosis application for Crohn’s disease and intestinal tuberculosis. BMC Medical Informatics and Decision Making (2023), 23(1):160.
  • Zhengyun Zhao, Yichen Tian, Zheng Yuan, Peng Zhao, Feng Xia*, and Sheng Yu*. A machine learning method for improving liver cancer staging. Journal of Biomedical Informatics (2023), 137:104266.
  • Qiao Jin, Zheng Yuan, Guangzhi Xiong, Qianlan Yu, Huaiyuan Ying, Chuanqi Tan, Mosha Chen, Songfang Huang, Xiaozhong Liu, and Sheng Yu*. Biomedical Question Answering: A Survey of Approaches and Challenges. ACM Computing Surveys (2022), 55(2):1-36. DOI:10.1145/3490238.
  • Huaiyuan Ying, Shengxuan Luo, Tiantian Dang, and Sheng Yu*. (2022). Label Refinement via Contrastive Learning for Distantly-Supervised Named Entity Recognition. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2656–2666, Seattle, United States. Association for Computational Linguistics.
  • Hongyi Yuan, Zheng Yuan, and Sheng Yu*. (2022). Generative Biomedical Entity Linking via Knowledge Base-Guided Pre-training and Synonyms-Aware Fine-tuning. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 4038–4048, Seattle, United States. Association for Computational Linguistics.
  • Sihang Zeng, Zheng Yuan, and Sheng Yu* (2022). Automatic Biomedical Term Clustering by Learning Fine-grained Term Representations. In Proceedings of the 21st Workshop on Biomedical Language Processing, Dublin, Ireland. Association for Computational Linguistics: 91–96.
  • Hongyi Yuan, Zheng Yuan, Ruyi Gan, Jiaxing Zhang, Yutao Xie, and Sheng Yu* (2022). BioBART: Pretraining and Evaluation of A Biomedical Generative Language Model. In Proceedings of the 21st Workshop on Biomedical Language Processing, Dublin, Ireland. Association for Computational Linguistics: 97–109.
  • Shengxuan Luo and Sheng Yu* (2022). An Accurate Unsupervised Method for Joint Entity Alignment and Dangling Entity Detection. In Findings of the Association for Computational Linguistics: ACL 2022, Dublin, Ireland. Association for Computational Linguistics: 2330–2339.
  • Zheng Yuan, Zhengyun Zhao, Haixia Sun, Jiao Li, Fei Wang, and Sheng Yu*. CODER: Knowledge-infused cross-lingual medical term embedding for term normalization. Journal of Biomedical Informatics (2022): 103983.
  • Jiaqi Guan, Runzhe Li, Sheng Yu, and Xuegong Zhang*. A Method for Generating Synthetic Electronic Medical Record Text. IEEE/ACM Transactions on Computational Biology and Bioinformatics (2021), 18(1):173-182. DOI: 10.1109/TCBB.2019.2948985
  • Yuanren Tong#, Keming Lu#, Yingyun Yang, Ji Li, Yucong Lin, Dong Wu, Aiming Yang, Yue Li*, Sheng Yu*, Jiaming Qian. Can natural language processing help differentiate inflammatory intestinal diseases in China? Models applying random forest and convolutional neural network approaches. BMC Medical Informatics and Decision Making (2020). #contributed equally, *contributed equally.
  • Zheng Yuan, Yuanhao Liu, Qiuyang Yin, Boyao Li, Xiaobin Feng, Guoming Zhang, and Sheng Yu*. Unsupervised Multi-granular Chinese Word Segmentation and Term Discovery via Graph Partition. Journal of Biomedical Informatics (2020), 110:103542.
  • Yucong Lin, Yang Li, Keming Lu, Cheng Ma, Peng Zhao, Daiqi Gao, Zihao Fan, Zijie Cheng, Zheyu Wang, and Sheng Yu*. Long-distance disorder-disorder relation extraction with bootstrapped noisy data. Journal of Biomedical Informatics (2020), 109:103529.
  • Lishan Yu and Sheng Yu*. Developing an automated mechanism to identify medical articles from Wikipedia for knowledge extraction. International Journal of Medical Informatics (2020), 141:104234.
  • Jian Zhang, Anil Can, Pui Man Rosalind Lai, Srinivasan Mukundan Jr., Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian S. Gainer, Nancy A. Shadick, Guergana Savova, Shawn N. Murphy, Tianxi Cai, Scott T. Weiss, and Rose Du*. Age and Morphology of Posterior Communicating Artery Aneurysms. Scientific Reports (2020), 10:11545.
  • Aaron Sonabend#, Winston Cai#, Yuri Ahuja, Ashwin Ananthakrishnan, Zongqi Xia, Sheng Yu*, Chuan Hong*. Automated ICD Coding via Unsupervised Knowledge Integration (UNITE). International Journal of Medical Informatics (2020), 139:104135. #contributed equally, *contributed equally.
  • Yichi Zhang*, Tianrun Cai*, Sheng Yu*, Kelly Cho, Chuan Hong, Jiehuan Sun, Jie Huang, Yuk-Lam Ho, Ashwin Ananthakrishnan, Zongqi Xia, Stanley Shaw, Vivian Gainer, Victor Castro, Nicholas Link, Jacqueline Honerlaw, Selena Huang, David Gagnon, Elizabeth Karlson, Robert Plenge, Peter Szolovits, Guergana Savova, Susanne Churchill, Christopher O'Donnell, Shawn Murphy, J Michael Gaziano, Isaac Kohane, Tianxi Cai*, and Katherine Liao*. Methods for High-throughput Phenotyping with Electronic Medical Record Data Using a Common Semi-supervised Approach (PheCAP). Nature Protocols (2019). *contributed equally.
  • Katherine P. Liao#, Jiehuan Sun#, Tianrun A. Cai, Nicholas Link, Chuan Hong, Jie Huang, Jennifer E. Huffman, Jessica Gronsbell, Yichi Zhang, Yuk-Lam Ho, Victor Castro, Vivian Gainer, Shawn N. Murphy, Christopher J. O’Donnell, J. Michael Gaziano, Kelly Cho, Peter Szolovits, Isaac S. Kohane, Sheng Yu*, Tianxi Cai*. High-throughput Multimodal Automated Phenotyping (MAP) with Application to PheWAS. Journal of the American Medical Informatics Association (2019). #contributed equally, *contributed equally.
  • Yucong Lin, Cheng Ma, Daiqi Gao, Zihao Fan, Zijie Cheng, Zheyu Wang, Sheng Yu*. Long distance entity relation extraction with article structure embedding and applied to mining medical knowledge. IEEE ICHI (2019).
  • Anil Can, Pui Man Rosalind Lai, Victor Castro, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian Gainer, Nancy Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott Weiss, and Rose Du*. Decreased Total Iron Binding Capacity May Correlate with Ruptured Intracranial Aneurysms. Scientific Reports (2019).
  • Wenxin Ning, Stephanie Chan, Andrew Beam, Ming Yu, Alon Geva, Katherine P Liao, Mary Mullen, Kenneth D Mandl, Isaac S Kohane, Tianxi Cai, Sheng Yu*. Feature Extraction for Phenotyping from Semantic and Knowledge Resources. Journal of Biomedical Informatics (2019), 91:103122.
  • Jiaqi Guan, Runzhe Li, Sheng Yu, Xuegong Zhang*. Generation of Synthetic Electronic Medical Record Text. In 2018 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pp. 374-380. IEEE (2018).
  • Jessica Gronsbell, Jessica Minnier, Sheng Yu, Katherine Liao, Tianxi Cai*. Automated Feature Selection of Predictors in Electronic Medical Records Data. Biometrics (2018).
  • Anil Can, Victor Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott Weiss, and Rose Du*. Elevated International Normalized Ratio is Associated with Ruptured Aneurysms. Stroke (2018).
  • Anil Can, Robert F. Rudy, M. Castro, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*. Association between Aspirin Dose and Subarachnoid Hemorrhage from Saccular Aneurysms: A Case-Control Study; Neurology (2018).
  • Anil Can, Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*. Low Serum Calcium and Magnesium Levels and Rupture of Intracranial Aneurysms; Stroke (2018); Impact factor 6.032.
  • Jennifer A. Sinnott*, Fiona Cai, Sheng Yu, Boris P. Hejblum, Chuan Hong, Isaac S. Kohane, Katherine P. Liao. PheProb: Probabilistic Phenotyping Using Diagnosis Codes to Improve Power for Genetic Association Studies. Journal of the American Medical Informatics Association (2018).
  • Jian Zhang, Anil Can, Srinivasan Mukundan Jr., Michael Steigner, Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Zhong Wang, Scott T. Weiss, Rose Du*. Morphological Variables Associated with Ruptured Middle Cerebral Artery Aneurysms. Neurosurgery (2018).
  • Thomas H. McCoy#, Sheng Yu#, Kamber L. Hart, Victor M. Castro, Hannah E. Brown, James N. Rosenquist, Alysa E. Doyle, Pieter J. Vuijk, Tianxi Cai*, Roy H. Perlis*. High Throughput Phenotyping for Dimensional Psychopathology in Electronic Health Records. Biological Psychiatry (2018); DOI: 10.1016/j.biopsych.2018.01.011; #contributed equally. https://www.sciencedaily.com/releases/2018/02/180226103436.htm
  • Thomas H. McCoy, Victor M. Castro, Kamber L. Hart, Amelia M. Pellegrini, Sheng Yu, Tianxi Cai, Roy H. Perlis*. Genome-wide Association Study of Dimensional Psychopathology Using Electronic Health Records. Biological Psychiatry (2018); DOI: 10.1016/j.biopsych.2017.12.004.
  • Anil Can, Victor Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott Weiss, and Rose Du*. Lipid-Lowering Agents and High HDL are Inversely Associated with Intracranial Aneurysm Rupture; Stroke (2018).
  • Anil Can, Victor M. Castro, Yildirim H. Ozdemir, Sarajune Dagen, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Shawn Murphy, Tianxi Cai, Guergana Savova, Scott T. Weiss, Rose Du*; Alcohol Consumption and Aneurysmal Subarachnoid Hemorrhage; Translational Stroke Research (2018), 9(1):13-19.
  • Anil Can, Victor M. Castro, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*; Antihyperglycemic Agents are Inversely Associated with Intracranial Aneurysm Rupture; Stroke (2018), 49(1):34-39; doi: 10.1161/STROKEAHA.117.019249.
  • Sheng Yu*, Yumeng Ma, Jessica Gronsbell, Tianrun Cai, Ashwin N. Ananthakrishnan, Vivian S. Gainer, Susanne E. Churchill, Peter Szolovits, Shawn N. Murphy, Isaac S. Kohane, Katherine P. Liao, Tianxi Cai. Enabling Phenotypic Big Data with PheNorm; Journal of the American Medical Informatics Association (2018), 25(1, Data Science Special Issue):54-60; doi: 10.1093/jamia/ocx111. Best Papers in "Knowledge Representation and Management", 2019 IMIA Yearbook.
  • Anil Can, Victor M. Castro, Yildirim H. Ozdemir, Sarajune Dagen, Dmitriy Dligach, Sean Finan, Sheng Yu, Vivian Gainer, Nancy A. Shadick, Guergana Savova, Shawn Murphy, Tianxi Cai, Scott T. Weiss, Rose Du*; Heroin Use is Associated with Ruptured Saccular Aneurysms; Translational Stroke Research (2017); doi: 10.1007/s12975-017-0582-y.
  • Anil Can, Victor Castro, Yildirim H Ozdemir, Sarajune Dagen, Sheng Yu, Dmitriy Dligach, Sean Finan, Vivian S Gainer, Nancy Shadick, Shawn Murphy, Tianxi Cai, Guergana Savova, Ruben Dammers, Scott T Weiss, and Rose Du*; Association of Intracranial Aneurysm Rupture with Smoking Duration, Intensity, and Cessation; Neurology (2017), 10-1212.
  • Sheng Yu*, Abhishek Chakrabortty, Katherine P. Liao, Tianrun Cai, Ashwin N. Ananthakrishnan, Vivian S. Gainer, Susanne E. Churchill, Peter Szolovits, Shawn N. Murphy, Isaac S. Kohane, Tianxi Cai. Surrogate-assisted Feature Extraction for High-throughput Phenotyping; Journal of the American Medical Informatics Association (2017), 24 (e1): e143-e149; doi: 10.1093/jamia/ocw135.
  • Victor M. Castro, Dmitriy Dligach, Sean Finan, Sheng Yu, Anil Can, Muhammad Abd-El-Barr, Vivian Gainer, Nancy A. Shadick, Shawn Murphy, Tianxi Cai, Guergana Savova, Scott T. Weiss, Rose Du*; Large-scale identification of subjects with cerebral aneurysms using natural language processing; Neurology (2017): 88(2), 164-168.
  • Florence H. Yong, Lu Tian*, Sheng Yu, Tianxi Cai and L.J. Wei. Optimal stratification in outcome prediction using baseline information; Biometrika, 103.4 (2016): 817-828.
  • Tianrun Cai, Andreas A. Giannopoulos, Sheng Yu, Tatiana Kelil, Beth Ripley, Kanako K. Kumamaru, Frank J. Rybicki, and Dimitrios Mitsouras*. Natural Language Processing Technologies in Radiology Research and Clinical Applications. RadioGraphics (2016), 36, no. 1: 176-191.
  • Sheng Yu*, Katherine P. Liao, Stanley Y. Shaw, Vivian S. Gainer, Susanne E. Churchill, Peter Szolovits, Shawn N. Murphy, Isaac Kohane, and Tianxi Cai. Toward High-throughput Phenotyping: Unbiased Automated Feature Extraction and Selection from Knowledge Sources; Journal of the American Medical Informatics Association (2015), 22(5):993-1000; doi: 10.1093/jamia/ocv034. EDITOR'S CHOICE.
  • Victor M. Castro, Yuanyuan Shen, Sheng Yu, Sean Finan, Cindy Ta Pau, Vivian Gainer, Candace C. Keefe, Guergana Savova, Shawn N. Murphy, Tianxi Cai and Corrine K. Welt*. Identification of subjects with polycystic ovary syndrome using electronic health records. Reproductive Biology and Endocrinology (2014), 13(1), p.116.
  • Sheng Yu*, Kanako K. Kumamaru, Elizabeth George, Ruth M. Dunne, Arash Bedayat, Matey Neykov, Andetta R. Hunsaker, Karin E. Dill, Tianxi Cai, and Frank J. Rybicki. Classification of CT Pulmonary Angiography Reports By Presence, Chronicity, and Location of Pulmonary Embolism with Natural Language Processing; Journal of Biomedical Informatics (2014), 52: 386-393.
  • Vishesh Kumar*, Katherine Liao, Su-Chun Cheng, Sheng Yu, Uri Kartoun, Ari Brettman, Vivian Gainer, Andrew Cagan, Shawn Murphy, Guergana Savova, Pei Chen, Peter Szolovits, Zongqi Xia, Elizabeth Karlson, Robert Plenge, Ashwin Ananthakrishnan, Susanne Churchill, Tianxi Cai, Isaac Kohane, Stanley Shaw. Natural Language Processing Improves Phenotypic Accuracy in an Electronic Medical Record Cohort of Type 2 Diabetes and Cardiovascular Disease; Journal of the American College of Cardiology (2014), 63(12):A1359.
  • Sheng Yu* and Enrique Campos-Náñez. Adaptive Convex Enveloping for Multidimensional Stochastic Dynamic Optimization; 62nd IIE Annual Conference and Expo. Proceedings. 2012. Best Paper of Operations Research.