214 results for "Multimodal AI"
Multimodal Large Language Models in Health Care: Applications, Challenges, and Future Outlook
In the complex and multidimensional field of medicine, multimodal data are prevalent and crucial for informed clinical decisions. Multimodal data span a broad spectrum of data types, including medical images (eg, MRI and CT scans), time-series data (eg, sensor data from wearable devices and electronic health records), audio recordings (eg, heart and respiratory sounds and patient interviews), text (eg, clinical notes and research articles), videos (eg, surgical procedures), and omics data (eg, genomics and proteomics). While advancements in large language models (LLMs) have enabled new applications for knowledge retrieval and processing in the medical field, most LLMs remain limited to processing unimodal data, typically text-based content, and often overlook the importance of integrating the diverse data modalities encountered in clinical practice. This paper aims to present a detailed, practical, and solution-oriented perspective on the use of multimodal LLMs (M-LLMs) in the medical field. Our investigation spanned M-LLM foundational principles, current and potential applications, technical and ethical challenges, and future research directions. By connecting these elements, we aimed to provide a comprehensive framework that links diverse aspects of M-LLMs, offering a unified vision for their future in health care. This approach aims to guide both future research and practical implementations of M-LLMs in health care, positioning them as a paradigm shift toward integrated, multimodal data–driven medical practice. We anticipate that this work will spark further discussion and inspire the development of innovative approaches in the next generation of medical M-LLM systems.
Chef Dalle: Transforming Cooking with Multi-Model Multimodal AI
In an era where dietary habits significantly impact health, technological interventions can offer personalized and accessible food choices. This paper introduces Chef Dalle, a recipe recommendation system that leverages multi-model and multimodal human-computer interaction (HCI) techniques to provide personalized cooking guidance. The application integrates voice-to-text conversion via Whisper and ingredient image recognition through GPT-Vision. It employs an advanced recipe filtering system that uses user-provided ingredients to fetch candidate recipes, which are then evaluated by multiple models via the OpenAI, Google Gemini, and Anthropic (Claude) APIs to deliver highly personalized recommendations. These methods enable users to interact with the system using voice, text, or images, accommodating various dietary restrictions and preferences. Furthermore, the use of DALL-E 3 for generating recipe images enhances user engagement. User feedback mechanisms allow future recommendations to be refined, demonstrating the system's adaptability. Chef Dalle showcases potential applications ranging from home kitchens to grocery stores and restaurant menu customization, addressing accessibility and promoting healthier eating habits. This paper underscores the significance of multimodal HCI in enhancing culinary experiences, setting a precedent for future developments in the field.
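The abstract above describes a concrete pipeline: Whisper for voice-to-text, a vision model for ingredient recognition, hosted LLM APIs for recipe evaluation, and DALL-E 3 for imagery. A minimal sketch of how such stages can be wired together with the OpenAI Python client follows; it is not the authors' code, and the model choices, prompts, and function names are illustrative assumptions.

```python
# Hypothetical sketch of a Chef Dalle-style pipeline (not the authors' code).
# Assumes the `openai` package and an OPENAI_API_KEY in the environment.
import base64
from openai import OpenAI

client = OpenAI()

def transcribe_request(audio_path: str) -> str:
    """Voice-to-text: transcribe a spoken recipe request with Whisper."""
    with open(audio_path, "rb") as f:
        result = client.audio.transcriptions.create(model="whisper-1", file=f)
    return result.text

def recognize_ingredients(image_path: str) -> str:
    """Ingredient recognition: ask a vision-capable chat model to list ingredients."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed vision-capable model
        messages=[{
            "role": "user",
            "content": [
                {"type": "text",
                 "text": "List the food ingredients visible in this photo."},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content

def illustrate_recipe(recipe_title: str) -> str:
    """Generate a plated-dish image for the recommended recipe with DALL-E 3."""
    image = client.images.generate(
        model="dall-e-3",
        prompt=f"A photo of the finished dish: {recipe_title}")
    return image.data[0].url
```

A multi-model variant as described in the paper would put the Gemini and Claude clients behind the same interface and aggregate their recommendations.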
Google Gemini as a next generation AI educational tool: a review of emerging educational technology
This emerging technology report discusses Google Gemini as a multimodal generative AI tool and presents its revolutionary potential for future educational technology. It introduces Gemini and its features, including its versatility in processing text, image, audio, and video inputs and in generating diverse content types. The report reviews recent empirical studies, technology in practice, and the relationship between Gemini and the educational landscape. It further explores Gemini's relevance for future educational endeavors and practical applications in emerging technologies, and it discusses the significant challenges and ethical considerations that must be addressed to ensure its responsible and effective integration into education.
Bridging the gap between AI and human emotion: a multimodal recognition system
This study introduces a novel system that integrates voice and facial recognition technologies to enhance human-computer interaction by accurately interpreting and responding to user emotions. Unlike conventional approaches that analyze either voice or facial expressions in isolation, this system combines both modalities, offering a more comprehensive understanding of emotional states. By evaluating facial expressions, vocal tones, and contextual conversation history, the system generates personalized, context-aware responses, fostering more natural and empathetic AI interactions. This advancement significantly improves user engagement and satisfaction, paving the way for emotionally intelligent AI applications across diverse fields.
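Combining facial and vocal cues as described is commonly implemented as late fusion: each modality's classifier emits class probabilities, which are merged by a weighted average. A minimal sketch of that pattern, assuming an illustrative four-emotion label set and a hand-tuned weight (neither comes from the paper):

```python
# Illustrative late fusion of per-modality emotion probabilities (not the paper's code).
import numpy as np

EMOTIONS = ["anger", "happiness", "neutral", "sadness"]  # assumed label set

def late_fusion(face_probs: np.ndarray, voice_probs: np.ndarray,
                face_weight: float = 0.6) -> str:
    """Merge softmax outputs from a facial model and a vocal model.

    Each argument is a probability vector over EMOTIONS; face_weight is a
    tunable hyperparameter expressing relative trust in the facial channel.
    """
    fused = face_weight * face_probs + (1.0 - face_weight) * voice_probs
    fused /= fused.sum()  # renormalize to a valid distribution
    return EMOTIONS[int(np.argmax(fused))]

# Example: the face model is confident in happiness, the voice model leans neutral.
face = np.array([0.05, 0.70, 0.20, 0.05])
voice = np.array([0.10, 0.30, 0.50, 0.10])
print(late_fusion(face, voice))  # -> "happiness"
```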
Flame analysis and combustion estimation using large language and vision assistant and reinforcement learning
In this study, we present an advanced approach for flame analysis and combustion quality estimation in carbonization furnaces utilizing the Large Language and Vision Assistant (LLaVA) and reinforcement learning from human feedback (RLHF). Traditional methods of estimating combustion quality in carbonization processes rely heavily on visual inspection and manual control, which can be subjective and imprecise. Our proposed methodology leverages multimodal AI techniques to enhance the accuracy and reliability of flame similarity measures. By integrating LLaVA's high-resolution image processing capabilities with RLHF, we create a robust system that iteratively improves its predictive accuracy through human feedback. The system analyzes real-time video frames of the flame, employing sophisticated similarity metrics and reinforcement learning algorithms to optimize combustion parameters dynamically. Experimental results demonstrate significant improvements in estimating oxygen levels and overall combustion quality compared to conventional methods. This approach not only automates and refines the combustion monitoring process but also provides a scalable solution for various industrial applications. The findings underscore the potential of AI-driven techniques in advancing the precision and efficiency of combustion systems.
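The frame-level querying step of such a system can be sketched against the publicly available llava-hf checkpoints on Hugging Face; this is a generic illustration rather than the authors' pipeline, and the checkpoint, prompt wording, and 1-to-5 rating scale are assumptions.

```python
# Minimal sketch: querying a LLaVA checkpoint about one flame frame
# (illustration only; prompt, checkpoint, and rating scale are assumptions).
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto")

def describe_flame(frame: Image.Image) -> str:
    """Ask the model to characterize combustion quality in a single video frame."""
    prompt = ("USER: <image>\nDescribe the flame's color, shape, and stability, "
              "and rate the combustion quality from 1 to 5. ASSISTANT:")
    inputs = processor(images=frame, text=prompt,
                       return_tensors="pt").to(model.device, torch.float16)
    output = model.generate(**inputs, max_new_tokens=120)
    return processor.decode(output[0], skip_special_tokens=True)

print(describe_flame(Image.open("frame_0001.jpg")))
```

An RLHF loop as described in the abstract would sit on top of this, using human ratings of such descriptions as the reward signal.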
Machine learning for cognitive behavioral analysis: datasets, methods, paradigms, and research directions
Human behaviour reflects cognitive abilities. Human cognition is fundamentally linked to different experiences or characteristics of consciousness and emotion, such as joy, grief, and anger, which assist in effective communication with others. Detecting and differentiating between thoughts, feelings, and behaviours is paramount in learning to control our emotions and respond more effectively in stressful circumstances. The ability to perceive, analyse, process, interpret, remember, and retrieve information while making judgments to respond correctly is referred to as cognitive behaviour. With emotion analysis now well established, deception detection is one of the key areas connecting AI to human behaviour, mainly in the forensic domain. Detection of lies, deception, malicious intent, abnormal behaviour, emotions, and stress plays a significant role in the advanced stages of behavioural science. Artificial intelligence and machine learning (AI/ML) have helped a great deal in pattern recognition, data extraction and analysis, and interpretation. The goal of using AI and ML in the behavioural sciences is to infer human behaviour, mainly for mental health or forensic investigations. The presented work provides an extensive review of research on cognitive behaviour analysis. A parametric study is presented based on different physical characteristics, emotional behaviours, data-collection sensing mechanisms, unimodal and multimodal datasets, AI/ML modelling methods, challenges, and future research directions.
Multimodal Behavioral Sensors for Lie Detection: Integrating Visual, Auditory, and Generative Reasoning Cues
Advances in multimodal artificial intelligence enable new sensor-inspired approaches to lie detection by combining behavioral perception with generative reasoning. This study presents a deception detection framework that integrates deep video and audio processing with large language models guided by chain-of-thought (CoT) prompting. We interpret neural architectures such as ViViT (for video) and HuBERT (for speech) as digital behavioral sensors that extract implicit emotional and cognitive cues, including micro-expressions, vocal stress, and timing irregularities. We further incorporate a GPT-5-based prompt-level fusion approach for video–language–emotion alignment and zero-shot inference. This method jointly processes visual frames, textual transcripts, and emotion recognition outputs, enabling the system to generate interpretable deception hypotheses without any task-specific fine-tuning. Facial expressions are treated as high-resolution affective signals captured via visual sensors, while audio encodes prosodic markers of stress. Our experimental setup is based on the DOLOS dataset, which provides high-quality multimodal recordings of deceptive and truthful behavior. We also evaluate a continual learning setup that transfers emotional understanding to deception classification. Results indicate that multimodal fusion and CoT-based reasoning increase classification accuracy and interpretability. The proposed system bridges the gap between raw behavioral data and semantic inference, laying a foundation for AI-driven lie detection with interpretable sensor analogues.
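Treating pretrained encoders as "behavioral sensors" typically means freezing them and training a small fusion head over their pooled embeddings. A hedged sketch of that pattern using the public ViViT and HuBERT checkpoints; the mean pooling, hidden size, and two-class head are assumptions rather than the paper's configuration.

```python
# Sketch of a late-fusion deception classifier over ViViT (video) and HuBERT
# (audio) embeddings; the fusion head is an assumption, not the paper's design.
import torch
import torch.nn as nn
from transformers import VivitModel, HubertModel

video_encoder = VivitModel.from_pretrained("google/vivit-b-16x2-kinetics400")
audio_encoder = HubertModel.from_pretrained("facebook/hubert-base-ls960")

class FusionClassifier(nn.Module):
    """Concatenate mean-pooled video/audio features; predict truthful vs. deceptive."""
    def __init__(self, video_dim: int = 768, audio_dim: int = 768, hidden: int = 256):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(video_dim + audio_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # {truthful, deceptive}
        )

    def forward(self, pixel_values: torch.Tensor, input_values: torch.Tensor):
        with torch.no_grad():  # keep the pretrained "sensors" frozen
            v = video_encoder(pixel_values=pixel_values).last_hidden_state.mean(dim=1)
            a = audio_encoder(input_values=input_values).last_hidden_state.mean(dim=1)
        return self.head(torch.cat([v, a], dim=-1))

# pixel_values: (batch, frames, channels, height, width) from VivitImageProcessor;
# input_values: (batch, samples) raw 16 kHz waveform.
```

The paper's GPT-based chain-of-thought fusion would consume the transcripts and emotion outputs alongside these embeddings; that prompt-level stage is not shown here.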
Multimodal AI and Large Language Models for Orthopantomography Radiology Report Generation and Q&A
Access to high-quality dental healthcare remains a challenge in many countries due to limited resources, a lack of trained professionals, and time-consuming report generation tasks. An intelligent clinical decision support system (ICDSS), which can make informed decisions based on past data, is an innovative way to address these shortcomings while improving continuous patient support in dental healthcare. This study proposes a viable solution with the aid of multimodal artificial intelligence (AI) and large language models (LLMs), focusing on their application to generating orthopantomography radiology reports and answering questions in the dental domain. This work also discusses efficient methods for adapting LLMs to specific languages and application domains. The proposed system primarily consists of a BLIP-2-based caption generator tuned on DPT images, followed by a Llama 3 8B-based LLM for radiology report generation. The performance of the entire system is evaluated in two ways. The diagnostic performance of the system achieved an overall accuracy of 81.3%, with specific detection rates of 87.9% for dental caries, 89.7% for impacted teeth, 88% for bone loss, and 81.8% for periapical lesions. Subjective evaluation of the AI-generated radiology reports by certified dental professionals yielded an overall accuracy score of 7.5 out of 10. In addition, the proposed solution includes a question-answering platform in the native Sinhala language, alongside English, designed to function as a chatbot for dental-related queries. We hope that this platform will eventually bridge the gap between dental services and patients created by a lack of human resources. Overall, our proposed solution creates new opportunities for LLMs in healthcare by introducing a robust end-to-end system for the automated generation of dental radiology reports and by enhancing patient interaction and awareness.
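The two-stage design (an image captioner feeding an LLM) can be sketched with the public BLIP-2 and Llama 3 checkpoints; in the paper both stages are fine-tuned on dental data, so the base checkpoints, prompt template, and generation settings below are assumptions.

```python
# Two-stage sketch: BLIP-2 caption -> LLM radiology report (illustrative only;
# the actual system is fine-tuned on dental panoramic images and reports).
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration, pipeline

cap_processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
captioner = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=torch.float16, device_map="auto")

def caption_radiograph(image: Image.Image) -> str:
    """Stage 1: produce a findings-style caption for the panoramic radiograph."""
    inputs = cap_processor(images=image,
                           return_tensors="pt").to(captioner.device, torch.float16)
    out = captioner.generate(**inputs, max_new_tokens=60)
    return cap_processor.decode(out[0], skip_special_tokens=True).strip()

reporter = pipeline("text-generation",
                    model="meta-llama/Meta-Llama-3-8B-Instruct",
                    torch_dtype=torch.float16, device_map="auto")

def generate_report(findings: str) -> str:
    """Stage 2: expand the caption into a structured radiology report."""
    prompt = (f"Findings from a dental panoramic radiograph: {findings}\n"
              "Write a structured radiology report (Findings, Impression).")
    return reporter(prompt, max_new_tokens=300)[0]["generated_text"]
```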
A Multimodal AI Framework for Automated Multiclass Lung Disease Diagnosis from Respiratory Sounds with Simulated Biomarker Fusion and Personalized Medication Recommendation
Respiratory diseases represent a persistent global health challenge, underscoring the need for intelligent, accurate, and personalized diagnostic and therapeutic systems. Existing methods frequently suffer from limited diagnostic precision, a lack of individualized treatment, and constrained adaptability to complex clinical scenarios. To address these challenges, our study introduces a modular AI-powered framework that integrates an audio-based disease classification model with simulated molecular biomarker profiles to evaluate the feasibility of future multimodal diagnostic extensions, alongside a synthetic-data-driven prescription recommendation engine. The disease classification model analyzes respiratory sound recordings and accurately distinguishes among eight clinical classes: bronchiectasis, pneumonia, upper respiratory tract infection (URTI), lower respiratory tract infection (LRTI), asthma, chronic obstructive pulmonary disease (COPD), bronchiolitis, and a healthy respiratory state. The proposed model achieved a classification accuracy of 99.99% on a holdout test set, including 94.2% accuracy on pediatric samples. In parallel, the prescription module provides individualized treatment recommendations comprising drug, dosage, and frequency, trained on a carefully constructed synthetic dataset designed to emulate real-world prescribing logic. The model achieved over 99% accuracy in medication prediction tasks, outperforming previously reported baseline models. Minimal misclassification in the confusion matrix and strong clinician agreement on 200 prescriptions (Cohen's κ = 0.91 [0.87–0.94] for drug selection, 0.78 [0.74–0.81] for dosage, 0.96 [0.93–0.98] for frequency) further affirm the system's reliability. Adjusted clinician disagreement rates were 2.7% (drug), 6.4% (dosage), and 1.5% (frequency). SHAP analysis identified age and smoking as key predictors, enhancing model explainability. Dosage accuracy was 91.3%, and most disagreements occurred in renal-impaired and pediatric cases. However, our study is presented strictly as a proof of concept. The use of synthetic data and the absence of access to real patient records are key limitations. A trial clinical deployment was conducted in a controlled environment and received positive satisfaction ratings from experts and users, but the proposed system must undergo extensive validation with de-identified electronic medical records (EMRs) and regulatory scrutiny before it can be considered for practical application. Nonetheless, the findings offer a promising foundation for the future development of clinically viable AI-assisted respiratory care tools.
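Structurally, the audio classification stage is a standard pipeline: a fixed-length log-mel spectrogram fed to a small CNN over the eight classes listed above. The sketch below illustrates that shape only; the abstract does not specify the architecture, so every hyperparameter here is an assumption.

```python
# Illustrative respiratory-sound classifier: log-mel spectrogram + small CNN.
# The eight classes follow the abstract; all hyperparameters are assumptions.
import librosa
import numpy as np
import torch
import torch.nn as nn

CLASSES = ["bronchiectasis", "pneumonia", "URTI", "LRTI",
           "asthma", "COPD", "bronchiolitis", "healthy"]

def log_mel(path: str, sr: int = 16000, duration: float = 5.0) -> torch.Tensor:
    """Load a fixed-length clip and convert it to a log-mel spectrogram."""
    y, _ = librosa.load(path, sr=sr, duration=duration)
    y = np.pad(y, (0, max(0, int(sr * duration) - len(y))))  # pad short clips
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=64)
    return torch.from_numpy(librosa.power_to_db(mel)).float().unsqueeze(0)

class RespNet(nn.Module):
    """Tiny CNN over (1, n_mels, frames) spectrograms; eight-way output."""
    def __init__(self, n_classes: int = len(CLASSES)):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        )
        self.classifier = nn.Linear(32, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(x).flatten(1))

model = RespNet()
logits = model(log_mel("breath.wav").unsqueeze(0))  # add batch dimension
print(CLASSES[int(logits.argmax(dim=1))])
```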