To the editor,

I have read with interest the study by Ge et al.1 This work marks a significant milestone in leveraging large language models (LLMs) for clinical applications, notably for its pioneering effort to create a Personal Health Information–compliant, liver disease–specific model. Despite this innovation, certain areas warrant further clarification to fully assess the impact and reliability of the framework proposed by the authors.

First and foremost, transparent information regarding the graders' level of expertise is essential, as is a discussion of the evaluation risks associated with grading scales rather than binary grading, as discussed elsewhere.2,3 It is also notable that Retrieval Augmented Generation (RAG) yielded responses significantly more accurate than those generated by the foundational GPT-4 model in merely 20% of the topics addressed (ie, HCC and DILI).1 Compared with other RAG applications in digestive disease,4 this performance discrepancy raises questions about the quality of the data used for RAG, particularly the conversion of medical guidelines into text for the model. Inaccuracies introduced by standard text converters and GPT-Vision when parsing complex formats (eg, graphical tables or flowcharts) that contain critical information could substantially degrade response accuracy.

In addition, given the reported context window, I believe the authors adopted text chunking, that is, breaking the text into pieces that the LLM can process in a single forward pass.5 Chunking can be carried out at various levels of granularity, such as the sentence, the paragraph, or the entire document.
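The paper does not detail the chunking granularity used. As a minimal illustrative sketch (the function name, word budget, and paragraph delimiter are my assumptions, not the authors' implementation), paragraph-level chunking under a word budget might look like:

```python
def chunk_text(document: str, max_words: int = 200) -> list[str]:
    """Split a document into chunks at paragraph boundaries,
    keeping each chunk within a rough word budget.

    Note: a single paragraph longer than max_words is kept whole
    in this sketch rather than split mid-paragraph.
    """
    chunks: list[str] = []
    current: list[str] = []
    for paragraph in document.split("\n\n"):
        words = paragraph.split()
        # Start a new chunk if adding this paragraph would exceed the budget.
        if current and len(current) + len(words) > max_words:
            chunks.append(" ".join(current))
            current = []
        current.extend(words)
    if current:
        chunks.append(" ".join(current))
    return chunks
```

The choice of budget matters clinically: too small a chunk risks separating a recommendation from its qualifying criteria, whereas too large a chunk dilutes retrieval precision.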
Given that RAG grounds the response in the external knowledge data set, retrieving an inadequate amount of information could result in an inaccurate or incomplete answer.5 Furthermore, an essential aspect not specified in the paper is the criteria used for the retrieval step within the RAG framework. Effective retrieval is pivotal to ensuring that the most relevant information is supplied to the LLM for generating accurate responses.5 This involves not just how the data are chunked but also how one chunk is ranked above another, for example by cosine similarity or other relevance metrics. Without a clear explanation of these retrieval criteria, inappropriate or less relevant chunks of text may be selected, which could inadvertently produce inaccurate or misleading answers.
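To illustrate the kind of criterion at issue (this sketch is not taken from the paper and assumes dense embedding vectors are already available for the query and each chunk), top-k retrieval by cosine similarity can be written as:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(query_vec: list[float],
             chunk_vecs: list[list[float]],
             top_k: int = 3) -> list[int]:
    """Return the indices of the top_k chunks most similar to the query."""
    ranked = sorted(range(len(chunk_vecs)),
                    key=lambda i: cosine_similarity(query_vec, chunk_vecs[i]),
                    reverse=True)
    return ranked[:top_k]
```

Reporting such parameters (the similarity metric, the value of k, and any score threshold) would let readers judge why retrieval outperformed the base model on some topics but not others.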

Giuffrè, Mauro. Letter to the Editor: Refining retrieval and chunking strategies for enhanced clinical reliability of large language models in liver disease. In: HEPATOLOGY. ISSN 0270-9139. Electronic. (2024), pp. ---. [10.1097/hep.0000000000000992]


Use this identifier to cite or link to this document: https://hdl.handle.net/11368/3089578