To the editor, I have read with interest the study by Ge et al.1 This work marks a significant milestone in leveraging large language models (LLMs) for clinical applications, notably for its pioneering effort to create a Protected Health Information-compliant, liver disease–specific model. Despite the innovation, certain areas warrant further clarification to fully assess the impact and reliability of the framework proposed by the authors. First and foremost, transparent reporting of the graders' level of expertise is crucial, as is addressing the evaluation risks associated with the use of grading scales, rather than binary grading, as discussed elsewhere.2,3 It is notably intriguing that Retrieval-Augmented Generation (RAG) yielded responses significantly more accurate than those generated by the foundational GPT-4 model in merely 20% of the topics addressed (ie, HCC and DILI).1 Compared with other RAG applications in digestive disease,4 this performance discrepancy raises questions about the quality of the data used for RAG, particularly the conversion of medical guidelines into text for the model. Inaccuracies introduced by standard text converters and GPT-Vision when parsing complex formats (eg, graphical tables or flowcharts) that contain critical information could significantly impact response accuracy. In addition, given the reported context window, I believe the authors adopted text chunking, which refers to breaking the text into amounts that the LLM can process in a single forward pass.5 Specifically, chunking can be carried out at various levels of granularity, such as the sentence, the paragraph, or the entire document.
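As an illustration of the granularity question raised above, paragraph-level chunking under a size budget can be sketched as follows. This is not the authors' pipeline: the function name and the word budget (a stand-in for a token limit) are hypothetical choices for demonstration only.

```python
# Minimal sketch of paragraph-level text chunking for RAG (illustrative only).
# Consecutive paragraphs are packed into chunks that stay under a fixed
# word budget, a simplified stand-in for an LLM token limit.

def chunk_by_paragraph(text: str, max_words: int = 120) -> list[str]:
    """Group consecutive paragraphs so each chunk fits the word budget."""
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks, current, count = [], [], 0
    for para in paragraphs:
        words = len(para.split())
        # Start a new chunk when adding this paragraph would exceed the budget.
        if current and count + words > max_words:
            chunks.append("\n\n".join(current))
            current, count = [], 0
        current.append(para)
        count += words
    if current:
        chunks.append("\n\n".join(current))
    return chunks

sample = "First guideline section.\n\nSecond section with more detail.\n\nThird section."
print(chunk_by_paragraph(sample, max_words=8))
```

A smaller budget yields more, finer-grained chunks; the trade-off between chunk size and retrieval precision is exactly the design decision the letter asks the authors to report.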
Given that RAG binds the response to the external knowledge data set, an inadequate amount of retrieved information could result in an inaccurate or incomplete answer.5 Furthermore, an essential aspect not specified in the paper is the set of criteria used for the retrieval strategy within the RAG framework. Effective retrieval is pivotal for ensuring that the most relevant information is supplied to the LLM when generating responses.5 This involves not only the selection of data chunks but also the criteria for preferring one chunk over another, such as cosine similarity or other relevance metrics. Without a clear explanation of these criteria, there is a risk of selecting inappropriate or less relevant chunks of text, which could inadvertently produce inaccurate or misleading answers.
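The selection criterion in question can be sketched as top-k retrieval by cosine similarity. For self-containment this sketch scores bag-of-words vectors rather than the learned embeddings a real RAG system would use; the function names and sample chunks are hypothetical.

```python
# Illustrative top-k chunk retrieval by cosine similarity (not the authors' system).
# Bag-of-words counts stand in for embedding vectors to keep the sketch self-contained.
import math
from collections import Counter

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query."""
    qv = Counter(query.lower().split())
    scored = [(cosine(qv, Counter(c.lower().split())), c) for c in chunks]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [c for _, c in scored[:k]]

chunks = [
    "HCC surveillance with ultrasound every six months",
    "DILI management requires stopping the offending drug",
    "Vaccination schedules for hepatitis B",
]
print(retrieve("surveillance for HCC", chunks, k=1))
```

Whatever metric is used, reporting it (along with k and any score threshold) is what would let readers judge whether less relevant chunks could have been supplied to the model.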
Giuffrè, Mauro. Letter to the Editor: Refining retrieval and chunking strategies for enhanced clinical reliability of large language models in liver disease. Hepatology (2024). ISSN 0270-9139. doi:10.1097/hep.0000000000000992


