Acta Orthopaedica et Traumatologica Turcica
Research Article

Comparative evaluation of large language model–based chatbots in a septic arthritis scenario: ChatGPT, Claude, and Perplexity

1. Department of Orthopaedics and Traumatology, Bursa Çekirge State Hospital, Bursa, Türkiye
2. Department of Orthopaedics and Traumatology, Eskisehir City Hospital, Eskisehir, Türkiye
3. Department of Infectious Diseases and Clinical Microbiology, Bursa İlker Çelikcan State Hospital, Bursa, Türkiye

DOI: 10.5152/j.aott.2025.25428
Published: 28 October 2025

Objective: This study aimed to comparatively evaluate the clinical knowledge generation performance of 3 widely used large language model (LLM)-based chatbots (ChatGPT, Claude, and Perplexity) in the context of septic arthritis.

Methods: This cross-sectional comparative study was based on 24 scenario-based clinical questions developed in accordance with the SANJO guideline (Management of Septic Arthritis in Native Joints) of the European Bone and Joint Infection Society. Responses generated by ChatGPT (OpenAI GPT-4), Claude 2 (Anthropic), and Perplexity AI were independently assessed by 2 senior experts: 1 in orthopedic surgery and 1 in infectious diseases. Each response was rated on a 5-point Likert scale across 6 domains: scientific accuracy, content depth, terminological consistency, clinical applicability, brevity, and reference support.
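This rating scheme lends itself to a simple tabulation. The following minimal sketch (not from the article; the scores shown are hypothetical) illustrates how a single response's 6 Likert scores could be recorded and aggregated:

from statistics import mean

# The 6 evaluation domains named in the Methods.
DOMAINS = ("scientific accuracy", "content depth", "terminological consistency",
           "clinical applicability", "brevity", "reference support")

# Hypothetical 5-point Likert scores a rater might assign to one response.
rating = dict(zip(DOMAINS, (5, 4, 5, 4, 3, 2)))

total = sum(rating.values())  # total score out of 30
print(f"total = {total}/30, mean = {mean(rating.values()):.2f}")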

Results: All 3 LLM-based chatbots achieved perfect scores in accuracy and terminological consistency (P = 1.000), and no significant difference was observed in clinical applicability (P = .912). Perplexity scored significantly lower in content depth compared to both ChatGPT (P = .001) and Claude (P = .041), whereas ChatGPT and Claude did not differ significantly (P = .807). ChatGPT produced significantly more unnecessary elaboration than Claude (P = .009) and Perplexity (P < .001), while Claude and Perplexity were comparable (P = .115). For reference support, Perplexity scored significantly higher than both ChatGPT (P < .001) and Claude (P < .001), with no difference between the latter 2 (P = 1.000). Overall, Perplexity achieved the highest total score (P < .001), followed by ChatGPT and Claude. Interrater agreement was substantial (κ = 0.72).
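The abstract does not state which statistical tests produced these P values or how agreement was computed. The sketch below is a minimal illustration, assuming pairwise Wilcoxon signed-rank tests on matched per-item scores and Cohen's kappa for the 2 raters, with randomly generated placeholder data standing in for the actual expert ratings:

import numpy as np
from scipy.stats import wilcoxon
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(42)

# Placeholder ratings: 24 questions x 6 domains per chatbot, flattened.
# Real data would be the experts' 5-point Likert scores.
scores = {model: rng.integers(3, 6, size=24 * 6)
          for model in ("ChatGPT", "Claude", "Perplexity")}

# Pairwise comparison of matched per-item scores (assumed test).
stat, p = wilcoxon(scores["ChatGPT"], scores["Perplexity"])
print(f"ChatGPT vs. Perplexity: W = {stat:.1f}, P = {p:.3f}")

# Interrater agreement between the 2 experts (Cohen's kappa).
rater1 = rng.integers(3, 6, size=24 * 6)
rater2 = rater1.copy()
disagree = rng.random(rater2.size) < 0.15   # simulate occasional disagreement
rater2[disagree] = rng.integers(3, 6, size=int(disagree.sum()))
print(f"Cohen's kappa = {cohen_kappa_score(rater1, rater2):.2f}")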

Conclusion: The LLM-based chatbots demonstrated overall high performance, but their strengths differed across evaluation domains. While ChatGPT and Claude provided more comprehensive and detailed responses, Perplexity offered stronger reference support. These findings suggest that context-specific selection of LLMs is essential, as the optimal choice may vary depending on whether detailed explanation or robust referencing is prioritized.

Cite this article as: Bayrak HC, Karagöz B, Bayrak Ö. Comparative evaluation of large language model–based chatbots in a septic arthritis scenario: ChatGPT, Claude, and Perplexity. Acta Orthop Traumatol Turc. Published online XX X, 2025. doi:10.5152/j.aott.2025.25428.
