Catalogue Search | MBRL

Revisiting Picture Functions in Multimedia Testing: A Systematic Narrative Review and Taxonomy Extension

by Lindner, Marlit Annalena; Schewior, Lauritz in Coding, Educational psychology, Information Processing

2024

Studies have indicated that pictures in test items can impact item-solving performance, information processing (e.g., time on task) and metacognition as well as test-taking affect and motivation. The present review aims to better organize the existing and somewhat scattered research on multimedia effects in testing and problem solving while considering several potential moderators. We conducted a systematic literature search with liberal study inclusion criteria to cover the still young research field as broadly as possible. Due to the complexity and heterogeneity of the relevant studies, we present empirical findings in a narrative review style. Included studies were classified by four categories, coding the moderating function of the pictures investigated. The evaluation of 62 studies allowed for some tentative main conclusions: Decorative pictures did not appear to have a meaningful effect on test-taker performance, time on task, test-taking affect, and metacognition. Both representational and organizational pictures tended to increase performance. Representational pictures further seem to enhance test-taker enjoyment and response certainty. Regarding the contradictory effects of informational pictures on performance and time on task that we found across studies, more differentiated research is needed. Conclusions on other potential moderators at the item-level and test-taker level were often not possible due to the sparse data available. Future research should therefore increasingly incorporate potential moderators into experimental designs. Finally, we propose a simplification and extension of the functional picture taxonomy in multimedia testing, resulting in a simple hierarchical approach that incorporates several additional aspects for picture classification beyond its function.

Journal Article

The Relationship Between Test Item Format and Gender Achievement Gaps on Math and ELA Tests in Fourth and Eighth Grades

by Kalogrides, Demetra; Fahle, Erin M.; Podolsky, Anne in Academic Achievement, Academic achievement gaps, Accountability

2018

Prior research suggests that males outperform females, on average, on multiple-choice items compared to their relative performance on constructed-response items. This paper characterizes the extent to which gender achievement gaps on state accountability tests across the United States are associated with those tests' item formats. Using roughly 8 million fourth- and eighth-grade students' scores on state assessments, we estimate state- and district-level math and reading male-female achievement gaps. We find that the estimated gaps are strongly associated with the proportions of the test scores based on multiple-choice and constructed-response questions on state accountability tests, even when controlling for gender achievement gaps as measured by the National Assessment of Educational Progress (NAEP) or Northwest Evaluation Association (NWEA) Measures of Academic Progress (MAP) assessments, which have the same item format across states. We find that test item format explains approximately 25% of the variation in gender achievement gaps among states.

Journal Article

A comparative study of AI-human-made and human-made test forms for a university TESOL theory course

by O, Kyung-Mi in Artificial intelligence, Automation, Chatbots

2024

This study examines the efficacy of artificial intelligence (AI) in creating parallel test items compared to human-made ones. Two test forms were developed: one consisting of 20 existing human-made items and another with 20 new items generated with ChatGPT assistance. Expert reviews confirmed the content parallelism of the two test forms. Forty-three university students then completed the 40 test items presented randomly from both forms on a final test. Statistical analyses of student performance indicated comparability between the AI-human-made and human-made test forms. Despite limitations such as sample size and reliance on classical test theory (CTT), the findings suggest ChatGPT’s potential to assist teachers in test item creation, reducing workload and saving time. These results highlight ChatGPT’s value in educational assessment and emphasize the need for further research and development in this area.

Journal Article

From item writing to item completion: investigating multiple-choice reading test items through item writer’s intentions and test-takers’ reported processes

by Pham, Ngoc Bao Tram; Zeng, Yijing; Mohd-Said, Nur-Ehsan in Assessment, Cognition, Convergence

2026

Writing multiple-choice (MC) test items that accurately target specific reading constructs remains challenging and time-consuming. Despite careful item development, what test developers intend an item to measure may not correspond to the processes test-takers actually use when answering it. This exploratory study documented the original item writer’s option-level intentions when constructing MC items and examined the extent to which these intentions were corroborated by test-takers’ retrospective verbal reports of their test-taking processes. The documentation revealed that the relevant text portions and cognitive activities intended for each option within an MC item may vary. Triangulation of item writer intentions and reported test-taking processes showed stronger convergence for relevant text portions than for cognitive activities. Divergences between the two data sources were largely associated with test-taking strategies employed by participants. Importantly, documenting the item writer’s option-level intentions provided grounded explanations for why test-takers appeared to engage in different processes when selecting different options. These findings offer a new methodological direction for construct validation of MC items and suggest implications for the future development and evaluation of multiple-choice items in reading assessment.

Journal Article

A Call to Action for Cultural Humility in Pharmacy Education Student Assessments

by Kalabalik-Hoganson, Julie; Sandifer, Chadwin; Lowy, Nora in African Americans, Analysis, Bias

2022

An important topic in the conversation on the education of pharmacy students revolves around methods of pedagogy and assessment and attention to diversity and inclusion. Well-intentioned educators may introduce bias into their teachings and assessment tools by focusing on diseases with a higher rate of presentation in minorities without engaging in conversations about why these health disparities exist. When considering the content and structure of a curriculum, it is also important to review its assessment tools, with attention to cultural humility in multiple-choice examinations, case-based presentations, and even observed structured clinical examinations. Disregarding this component of the conversation may lead students to have an unconscious impression that social constructs are biological markers for a disease. Students may recall not only what they learned in a classroom setting, but often the content included in their assessments as well. By writing test items that are culturally responsible, unconscious bias can be reduced and test items can better measure the knowledge that educators intend to assess. As pharmacy educators perform programmatic reviews, attention should be directed toward unconscious bias, not only in the curricula but also in evaluation and assessment tools.

Journal Article

The Effects of Violating Standard Item Writing Principles on Tests and Students: The Consequences of Using Flawed Test Items on Achievement Examinations in Medical Education

by Downing, Steven M. in Assessment and knowledges control, Choice Behavior, Docimology

2005

The purpose of this research was to study the effects of violations of standard multiple-choice item writing principles on test characteristics, student scores, and pass-fail outcomes. Four basic science examinations, administered to year-one and year-two medical students, were randomly selected for study. Test items were classified as either standard or flawed by three independent raters, blinded to all item performance data. Flawed test questions violated one or more standard principles of effective item writing. Thirty-six to sixty-five percent of the items on the four tests were flawed. Flawed items were 0-15 percentage points more difficult than standard items measuring the same construct. Over all four examinations, 646 (53%) students passed the standard items while 575 (47%) passed the flawed items. The median passing rate difference between flawed and standard items was 3.5 percentage points, but ranged from -1 to 35 percentage points. Item flaws had little effect on test score reliability or other psychometric quality indices. Results showed that flawed multiple-choice test items, which violate well established and evidence-based principles of effective item writing, disadvantage some medical students. Item flaws introduce the systematic error of construct-irrelevant variance to assessments, thereby reducing the validity evidence for examinations and penalizing some examinees.

Journal Article

EXAMINING THE QUALITY OF ENGLISH TEST ITEMS USING PSYCHOMETRIC AND LINGUISTIC CHARACTERISTICS AMONG GRADE SIX PUPILS

by Suppiah Shanmugam, S. Kanageswari; Rajoo, Murugan; Wong, Vincent in Classical test theory, Difficulty Level, Educational evaluation

2020

Purpose - This study examined the quality of English test items using psychometric and linguistic characteristics among Grade Six pupils. Method - Contrary to the conventional approach of relying only on statistics when investigating item quality, this study adopted a mixed-method approach by employing psychometric analysis and cognitive interviews. The former was conducted on 30 Grade Six pupils, with each item representing a different construct commonly found in English test papers. Qualitative input was obtained through cognitive interviews with five Grade Six pupils and expert judgements from three teachers. Findings - None of the items were found to be too easy or difficult, and all items had positive discrimination indices. The item on idioms was most ideal in terms of difficulty and discrimination. Difficult items were found to be vocabulary-based. Surprisingly, the higher-order-thinking subjective items proved to be excellent in difficulty, although improvements could be made on their ability to discriminate. The qualitative expert judgements agreed with the quantitative psychometric analysis. Certain results from the item analysis, however, contradicted past findings that items with the ideal item difficulty value between 0.4 and 0.6 would have equally ideal item discrimination index. Significance - The findings of the study can serve as a reminder on the significance of using Classical Test Theory, a non-complex psychometric approach in assisting classroom teacher practitioners during the meticulous process of test design and ensuring test item quality.

Journal Article

Developing, Analyzing, and Using Distractors for Multiple-Choice Tests in Education: A Comprehensive Review

by Bulut, Okan; Zhang, Xinxin; Gierl, Mark J. in Accuracy, Achievement tests, Difficulty Level

2017

Multiple-choice testing is considered one of the most effective and enduring forms of educational assessment that remains in practice today. This study presents a comprehensive review of the literature on multiple-choice testing in education focused, specifically, on the development, analysis, and use of the incorrect options, which are also called the distractors. Despite a vast body of literature on multiple-choice testing, the task of creating distractors has received much less attention. In this study, we provide an overview of what is known about developing distractors for multiple-choice items and evaluating their quality. Next, we synthesize the existing guidelines on how to use distractors and summarize earlier research on the optimal number of distractors and the optimal ordering of distractors. Finally, we use this comprehensive review to provide the most up-to-date recommendations regarding distractor development, analysis, and use, and in the process, we highlight important areas where further research is needed.

Journal Article

Diagnosing a 12-Item Dataset of Raven Matrices: With Dexter

by Partchev, Ivailo in classical test theory, Diagnostic Tests, Evaluation

2020

We analyze a 12-item version of Raven’s Standard Progressive Matrices test, traditionally scored with the sum score. We discuss some important differences between assessment in practice and psychometric modelling. We demonstrate some advanced diagnostic tools in the freely available R package, dexter. We find that the first item in the test functions badly—at a guess, because the subjects were not given exercise items before the live test.

Journal Article

DIVERSIFICATION OF REASONING SCIENCE TEST ITEMS OF TIMSS GRADE 8 BASED ON HIGHER ORDER THINKING SKILLS: A CASE STUDY OF INDONESIAN STUDENTS

by Utomo, Anjar Putro; Narulita, Erlia; Shimizu, Kinya in Critical thinking, Education, Grade 8

2018

The aim of this research was to assess the classification of science test items of TIMSS grade 8 based on higher order thinking skills (HOTS) and determine whether those classified science test items can be an assessment tool in science class. Sixteen sample HOTS test items were chosen from 37 reasoning items of TIMSS 1999, 2003, and 2011: 6 analysing, 6 evaluating, and 4 creating items. The selected items were administered to 410 ninth-grade students in 14 public schools in Jember, Indonesia. Data were analysed using point-biserial correlation to measure the index of discrimination and the degree of difficulty of the items at each HOTS level. The results revealed that the point-biserial index of discrimination for each item was higher than 0.25. The degree of difficulty of the analysing, evaluating, and creating test items exhibited a similar trend, which was in a good range. Each test item had significant validity. Reliability analysis showed that each test item was acceptable, indicating a high level of internal consistency. In conclusion, the classified science test items of TIMSS are good to use as assessment tools to measure HOTS of students in science class.

Journal Article
