Catalogue Search | MBRL
Explore the vast range of titles available.
2,959 result(s) for "Rasch models."
Applications of Rasch measurement in learning environments research
Major advances in creating linear measures in education and the social sciences, particularly in regard to Rasch measurement, have occurred in the past 15 years, along with major advances in computer power. The two have been combined so that the Rasch Unidimensional Measurement Model (RUMM) and WINSTEPS computer programs now perform the statistical calculations and produce graphical outputs with very fast turnaround times. These programs help researchers produce unidimensional, linear scales from which valid inferences can be made, by calculating person measures and item difficulties on the same linear scale, with supporting evidence. This book includes 13 learning environments research papers, with an international flavour, that apply Rasch measurement at the forefront of education. The papers cover: (1) high-stakes numeracy testing in Western Australia; (2) early English literacy in New South Wales; (3) the Indonesian Scholastic Aptitude Test; (4) validity in Learning Environment investigations; (5) factors influencing the take-up of Physics in Singapore; (6) state-wide authentic assessment for Years 11-12; (7) talented and gifted student perceptions of the learning environment; (8) disorganisation in the classroom; (9) psychological services in learning environments; (10) English teaching assistant roles in Hong Kong; (11) learning Japanese as a second language; (12) engagement in classroom learning; and (13) early cognitive development in children.
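For readers unfamiliar with the model family the blurb refers to, the "same linear scale" property follows from the form of the dichotomous Rasch model, in which the log-odds of success depend only on the difference between a person measure θ_n and an item difficulty δ_i (standard notation, not taken from the book):

```latex
P(X_{ni} = 1) = \frac{\exp(\theta_n - \delta_i)}{1 + \exp(\theta_n - \delta_i)},
\qquad
\log \frac{P(X_{ni} = 1)}{P(X_{ni} = 0)} = \theta_n - \delta_i .
```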
Is Rasch model analysis applicable in small sample size pilot studies for assessing item characteristics? An example using PROMIS pain behavior item bank data
by Chen, Wen-Hung; Wyrwich, Kathleen W.; Revicki, Dennis A.
in Behavior; Behavior modeling; Chronic pain
2014
Purpose: Large samples are generally considered necessary for the Rasch model to obtain robust item parameter estimates. Recently, small-sample Rasch analysis has been suggested as a preliminary assessment of items' psychometric properties. This study evaluates Rasch analysis results obtained with small sample sizes. Methods: Ten PROMIS pain behavior items were used. Random samples of 30, 50, 100, and 250, and a targeted sample of 30, were drawn 10 times each from a total of 800 subjects. Rasch analysis was conducted for each of these samples and for the full sample. Results: In the full sample, there were 104 cases of extreme scores, no null categories, two incorrectly ordered items, and four misfit items. For samples of 250, 100, 50, 30, and the targeted 30, the average numbers of extreme scores were 42.2, 17.1, 9.6, 6.1, and 1.2; the average numbers of null categories were 1.0, 3.2, 8.7, and 8.3; the average numbers of items with incorrectly ordered item parameters were 0.1, 0.8, 2.9, 4.7, and 3.7; and the average numbers of items with fit residuals exceeding ±2.5 were 0.8, 0.3, 0.1, 0.2, and 0.3, respectively. Conclusions: Rasch analysis based on small samples (≤50) identified more items with incorrectly ordered parameters than larger samples (≥100), but fewer items as misfitting. Results from small samples led to conclusions opposite to those based on larger samples; Rasch analysis based on small samples should therefore be used for exploratory purposes only, and with extreme caution.
Journal Article
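The resampling design described in the abstract is easy to mimic. Below is a minimal Python sketch, not the authors' code: it simulates dichotomous responses, approximates item difficulties with centered logits rather than a full Rasch fit, and counts how often small-sample item orderings disagree with the full-sample ordering. All names and data are illustrative.

```python
# Sketch of the subsampling design: repeated small draws from a "full"
# sample, with item difficulty estimates compared against the full-sample
# ordering. Difficulties use a centered-logit approximation, a crude
# stand-in for a proper Rasch fit; data are simulated, not PROMIS data.
import numpy as np

rng = np.random.default_rng(0)

def simulate_responses(n_persons, difficulties):
    """Simulate dichotomous Rasch responses for n_persons."""
    theta = rng.normal(0.0, 1.0, size=(n_persons, 1))
    p = 1.0 / (1.0 + np.exp(-(theta - difficulties[None, :])))
    return (rng.random(p.shape) < p).astype(int)

def logit_difficulties(X):
    """Centered logit approximation to Rasch item difficulties."""
    p = X.mean(axis=0).clip(0.01, 0.99)   # item proportions correct
    d = np.log((1.0 - p) / p)             # harder item -> larger logit
    return d - d.mean()                   # center for identifiability

true_d = np.linspace(-2.0, 2.0, 10)       # 10 items, as in the study
full = simulate_responses(800, true_d)    # "full sample" of 800
full_order = np.argsort(logit_difficulties(full))

for n in (30, 50, 100, 250):
    disagreements = []
    for _ in range(10):                   # 10 replicate draws per size
        idx = rng.choice(800, size=n, replace=False)
        order = np.argsort(logit_difficulties(full[idx]))
        disagreements.append(int((order != full_order).sum()))
    print(f"n={n:4d}: mean items whose rank differs = "
          f"{np.mean(disagreements):.1f}")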
Invariant measurement with raters and rating scales : Rasch models for rater-mediated assessments
"The purpose of this book is to present methods for developing, evaluating and maintaining rater-mediated assessment systems. Rater-mediated assessments involve ratings that are assigned by raters to persons responding to constructed-response items (e.g., written essays and teacher portfolios) and other types of performance assessments. This book addresses the following topics: (1) introduction to the principles of invariant measurement, (2) application of the principles of invariant measurement to rater-mediated assessments, (3) description of the lens model for rater judgments, (4) integration of principles of invariant measurement with the lens model of cognitive processes of raters, (5) illustration of substantive and psychometric issues related to rater-mediated assessments in terms of validity, reliability, and fairness, and (6) discussion of theoretical and practical issues related to rater-mediated assessment systems. Invariant measurement is fast becoming the dominant paradigm for assessment systems around the world, and this book provides an invaluable resource for graduate students, measurement practitioners, substantive theorists in the human sciences, and other individuals interested in invariant measurement when judgments are obtained with rating scales" -- Provided by publisher.
Exploratory study on the potential of ChatGPT as a rater of second language writing
2024
In recent years, various strategies have been employed to integrate ChatGPT into the field of second language (L2) teaching and learning. In line with such efforts, this study investigates the potential of ChatGPT as an automated writing evaluation (AWE) tool for L2 assessment, given the lack of systematic, quantitative comparison between human ratings and the ratings of GPT-based scoring chatbots. We took an innovative approach by utilising ChatGPT’s new feature called ‘My GPTs’, a customised chatbot builder based on GPT-4. The dataset for assessment consisted of 50 English essays written by Korean secondary-level EFL students, which were rated by the developed GPT-based scoring chatbot and by two in-service English teachers. The intraclass correlation coefficient results suggested strong agreement between the human raters’ and ChatGPT’s scores. However, results based on the multifaceted Rasch model further revealed that ChatGPT deviated slightly more from the model than its human counterparts. This study demonstrates the potential of ChatGPT in AWE as an accessible, supplementary tool alongside L2 teachers’ ratings.
Journal Article
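The intraclass correlation mentioned in the abstract can be computed directly from a ratings matrix. The sketch below implements one common variant, ICC(2,1) (two-way random effects, absolute agreement, single rater); the abstract does not state which form the authors used, so the choice of variant and the scores are assumptions.

```python
# ICC(2,1) from a subjects-by-raters matrix via the two-way ANOVA
# decomposition (Shrout & Fleiss). Scores are made up; the three columns
# play the roles of two teachers and the GPT-based rater.
import numpy as np

X = np.array([          # rows: essays, cols: raters
    [4, 4, 5],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
    [1, 2, 1],
    [4, 5, 4],
], dtype=float)

n, k = X.shape
grand = X.mean()
row = X.mean(axis=1, keepdims=True)   # per-essay means
col = X.mean(axis=0, keepdims=True)   # per-rater means

msr = k * np.sum((row - grand) ** 2) / (n - 1)   # between essays
msc = n * np.sum((col - grand) ** 2) / (k - 1)   # between raters
mse = np.sum((X - row - col + grand) ** 2) / ((n - 1) * (k - 1))

icc21 = (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)
print(f"ICC(2,1) = {icc21:.3f}")
```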
RMX/PIccc: An Extended Person–Item Map and a Unified IRT Output for eRm, psychotools, ltm, mirt, and TAM
2023
A constituting feature of item response models is that item and person parameters share a latent scale and are therefore directly comparable. The Person–Item Map is a useful graphical tool for visualizing the alignment of the two parameter sets. However, the “classical” variant has some shortcomings, which are overcome by the new RMX package (Rasch models—eXtended). The package provides the RMX::plotPIccc() function, which creates an extended version of the classical PI Map, termed “PIccc”. It juxtaposes the person parameter distribution with various item-related functions, such as category and item characteristic curves and category, item, and test information curves. The function supports many item response models and processes the return objects of five major R packages for IRT analysis. It returns the parameters it used in a unified form, thus allowing for their further processing. The R package RMX is freely available at osf.io/n9c5r.
Journal Article
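RMX itself is an R package, so as a language-neutral illustration only, here is a minimal Python/matplotlib sketch of the classical person-item (Wright) map that PIccc extends: a person-parameter distribution juxtaposed with item locations on a shared latent scale. Parameters are fabricated; this is not the RMX::plotPIccc() output.

```python
# Classical person-item (Wright) map: persons as a sideways histogram on
# the left, item locations on the right, sharing one latent (logit) axis.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
theta = rng.normal(0.2, 1.0, 500)            # person parameters (fake)
delta = np.linspace(-2.5, 2.5, 12)           # item locations (fake)

fig, (left, right) = plt.subplots(1, 2, sharey=True, figsize=(6, 5))
left.hist(theta, bins=30, orientation="horizontal", color="grey")
left.invert_xaxis()                          # mirror persons toward center
left.set_ylabel("Latent scale (logits)")
left.set_title("Persons")

right.scatter(np.arange(len(delta)), delta)  # one point per item
right.set_xticks(np.arange(len(delta)))
right.set_xticklabels([f"I{i+1}" for i in range(len(delta))], rotation=90)
right.set_title("Items")
fig.tight_layout()
plt.show()
```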
Rating scales and Rasch measurement
Assessments with ratings in ordered categories have become ubiquitous in health, biological and social sciences. Ratings are used when a measuring instrument of the kind found in the natural sciences is not available to assess some property in terms of degree - for example, greater or smaller, better or worse, or stronger or weaker. The handling of ratings has ranged from the very elementary to the highly sophisticated. In an elementary form, and assumed in classical test theory, the ratings are scored with successive integers and treated as measurements; in a sophisticated form, and used in modern test theory, the ratings are characterized by probabilistic response models with parameters for persons and the rating categories. Within modern test theory, two paradigms, similar in many details but incompatible on crucial points, have emerged. For the purposes of this article, these are termed the statistical modeling and experimental measurement paradigms. Rather than reviewing a compendium of available methods and models for analyzing ratings in detail, the article focuses on the incompatible differences between these two paradigms, with implications for choice of model and inferences. It shows that the differences have implications for different roles for substantive researchers and psychometricians in designing instruments with rating scales. To illustrate these differences, an example is provided.
Journal Article
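For concreteness, the "sophisticated" treatment the abstract refers to typically characterizes a rating in category x of an m-category scale with a polytomous Rasch model; one standard form (Andrich's rating scale model, in conventional notation not drawn from the article) is:

```latex
P(X_{ni} = x) = \frac{\exp\!\Big(\sum_{k=0}^{x} (\theta_n - \delta_i - \tau_k)\Big)}
                     {\sum_{j=0}^{m} \exp\!\Big(\sum_{k=0}^{j} (\theta_n - \delta_i - \tau_k)\Big)},
\qquad x = 0, 1, \dots, m, \quad \tau_0 \equiv 0 ,
```

where θ_n is the person location, δ_i the item location, and τ_k the rating-category thresholds shared across items.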
Assessing computational thinking abilities among Singapore secondary students: a Rasch model measurement analysis
by Looi, Chee-Kit; Sumintono, Bambang; Chan, Shiau-Wei
in Assessments; Computer programming; Computer science
2021
In recent years, computational thinking (CT) has been globally recognized as a 21st-century skill that must be developed in future generations. However, the lack of validated CT assessments is a major impediment to efforts to incorporate CT into school curricula. This study validates the Computational Thinking Test (CTt) using the Rasch model by identifying whether the data fit the Rasch measurement model, determining the CT abilities of a small sample of Singapore secondary students through the test, and examining the presence of test items that functioned differently across students' gender and grade level. In this study, 153 upper secondary school students from Grade 9 and Grade 10 completed the CTt, which comprises 28 test items. The students' performance on the CTt was used as the quantitative data and was analyzed with the Rasch model. The findings revealed that the data fit the Rasch measurement model. The majority of male students and ninth-graders had a high level of CT ability, while most female students and tenth-graders had a moderate level; hence, the male students and ninth-graders performed better. Four items functioned differently between male and female students, meaning that one gender had a better chance of answering these items correctly than the other. Only one test item functioned differently between Grade 9 and Grade 10, meaning that students at one grade level were more likely to answer it correctly than students at the other. This study contributes to the literature on CT assessment by providing a reference case for scholars and researchers assessing students' CT abilities.
Journal Article
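The gender-DIF screening reported in the abstract can be illustrated with a Mantel-Haenszel analysis, a common DIF screen, though not necessarily the procedure these authors used. The sketch below simulates one dichotomous item with a built-in group advantage and computes the MH common odds ratio across matched score strata; all data are fabricated.

```python
# Mantel-Haenszel DIF screen for one dichotomous item, stratifying on
# total test score. An odds ratio far from 1 flags the item as favoring
# one group at the same ability level.
import numpy as np

rng = np.random.default_rng(1)
n = 300
group = rng.integers(0, 2, n)        # 0/1 group labels (arbitrary)
total = rng.integers(0, 29, n)       # total score on remaining items
# Simulate an item easier for group 1 at the same total score (DIF).
p = 1 / (1 + np.exp(-(total - 14) / 4 - 0.7 * group))
item = (rng.random(n) < p).astype(int)

num = den = 0.0
for s in np.unique(total):           # stratify by matched total score
    m = total == s
    a = np.sum((group == 1) & (item == 1) & m)
    b = np.sum((group == 1) & (item == 0) & m)
    c = np.sum((group == 0) & (item == 1) & m)
    d = np.sum((group == 0) & (item == 0) & m)
    t = a + b + c + d
    if t == 0:
        continue
    num += a * d / t
    den += b * c / t

alpha_mh = num / den                 # common odds ratio across strata
print(f"MH odds ratio = {alpha_mh:.2f}  "
      f"(ETS delta = {-2.35 * np.log(alpha_mh):.2f})")
```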
Enhancing analytic rigor in qualitative analysis: developing and testing code scheme using Many Facet Rasch Model
by Sumintono, Bambang; Mohd Zabidi, Zuliana; Abdullah, Zuraidah
in Data analysis; Homogeneity; Leniency
2022
Performance assessments in which multiple raters use rating scales to evaluate the quality of a performance are frequently seen and reported. However, establishing intercoder reliability for the content validity of codes applied to transcript data in a qualitative study is not widely practiced. This paper presents an intercoder reliability analysis using the Many Facet Rasch Model (MFRM) to develop, test, and refine the code scheme of a qualitative study. The results suggest that although the raters were in the same field of expertise and were given the same code scheme, their individual characteristics were still observable: unlike the standard approach, MFRM allowed each expert's rating behaviour to be compared, rather than assuming group homogeneity. While all raters agreed in assigning 70% of the codings to the categories of very good and good, the researchers were still able to determine the raters' severity or leniency from the ratings they assigned. Moreover, MFRM's ability to provide independent pieces of information helped the researchers identify which codings were accepted and which needed review. The evaluation increased trust in the quality of the codings and gave the researchers confidence in establishing the code scheme for the qualitative study.
Journal Article
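The MFRM the paper applies is conventionally written as an extension of the Rasch model with an extra facet for rater severity; in standard notation (not taken from the paper), the probability of object n receiving category k rather than k-1 on criterion i from rater j satisfies:

```latex
\log \frac{P_{nijk}}{P_{nij(k-1)}} = \theta_n - \delta_i - \alpha_j - \tau_k ,
```

where θ_n is the measure of the object being rated, δ_i the difficulty of criterion i, α_j the severity of rater j, and τ_k the threshold of category k. The severity terms α_j are what allow rater leniency to be separated from the quality of the codings themselves.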
The development and initial validation of the Breast Cancer Recurrence instrument (BreastCaRe)—a patient-reported outcome measure for detecting symptoms of recurrence after breast cancer
by Høeg, Beverley Lim; Johansen, Christoffer; Saltbæk, Lena
in Breast cancer; Instrument development; Medicine
2021
Purpose
Patient-reported outcomes (PROs) may facilitate prompt treatment. We describe the development and psychometric properties of the first instrument for monitoring symptoms of breast cancer (BC) recurrence.
Methods
This study is nested in the MyHealth randomized trial of nurse-led follow-up based on electronically collected PROs. We constructed items assessing symptoms of potential recurrence through expert interviews with six BC specialists in Denmark. Semi-structured cognitive interviews were carried out with a patient panel to assess acceptability and comprehensibility. Items were subsequently tested in a population of 1,170 women 1–10 years after completion of BC treatment. We carried out multiple-group confirmatory factor analysis (CFA) and Rasch analysis to test dimensionality, local dependence (LD), and differential item functioning (DIF) according to sociodemographic and treatment-related factors. Clinical data were obtained from the Danish Breast Cancer Group registry.
Results
Twenty-two items were generated for the Breast Cancer Recurrence instrument (BreastCaRe). Cognitive testing resulted in clearer items. Seven subscales based on general, bone, liver, lung, brain, locoregional and contralateral recurrence symptoms were proposed. Both CFA and Rasch models confirmed the factor structure. No DIF was identified. Five item pairs showed LD but all items were retained to avoid loss of clinical information. Rasch models taking LD into account were used to generate a standardized scoring table for each subscale.
Conclusions
The BreastCaRe has good content and structural validity, patient acceptability and measurement invariance. We are preparing to examine the predictive validity of this new instrument.
Journal Article
Analyses of Model Fit and Robustness. A New Look at the PISA Scaling Model Underlying Ranking of Countries According to Reading Literacy
by Kreiner, Svend; Christensen, Karl Bang
in Assessment; Behavioral Science and Psychology; Educational evaluation
2014
This paper addresses methodological issues concerning the scaling model used in the international comparison of student attainment in the Programme for International Student Assessment (PISA), specifically whether PISA’s ranking of countries is confounded by model misfit and differential item functioning (DIF). To determine this, we reanalyzed the publicly accessible data on reading skills from the 2006 PISA survey. We also examined whether the ranking of countries is robust to errors of the scaling model. This was done by studying invariance across subscales, and by comparing ranks based on the scaling model with ranks based on models that take some of the flaws of PISA’s scaling model into account. Our analyses provide strong evidence of misfit of the PISA scaling model and very strong evidence of DIF. These findings do not support the claim that the country rankings reported by PISA are robust.
Journal Article