Catalogue Search | MBRL
Search Results
Explore the vast range of titles available.
3,036 result(s) for "Rasch model"
Exploratory study on the potential of ChatGPT as a rater of second language writing
2024
In recent years, various strategies have been employed to integrate ChatGPT into second language (L2) teaching and learning. In line with such efforts, this study investigates the potential of ChatGPT as an automated writing evaluation (AWE) tool for L2 assessment, given the lack of systematic, quantitative comparison between human ratings and the ratings of a GPT-based scoring chatbot. We took an innovative approach by utilising ChatGPT's 'My GPTs' feature, a customised chatbot builder based on GPT-4. The dataset consisted of 50 English essays written by Korean secondary-level EFL students, which were rated by the developed GPT-based scoring chatbot and two in-service English teachers. The intraclass correlation coefficient results suggested a strong similarity between human rater and ChatGPT scores. However, analyses based on the many-facet Rasch model further revealed that ChatGPT deviated slightly more from the model than its human counterparts. This study demonstrates the potential of ChatGPT in AWE, providing an accessible supplement to L2 teachers' ratings.
Journal Article
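The intraclass correlation coefficient mentioned in the abstract above can be sketched as follows. This is a generic ICC(3,1) consistency formula, not the authors' exact computation, and the essay scores below are invented for illustration:

```python
import numpy as np

def icc_consistency(ratings):
    """ICC(3,1): two-way mixed, single-rater, consistency form.

    `ratings` is an (n subjects x k raters) array. The consistency form
    ignores a constant shift between raters, so a uniformly stricter
    rater still yields perfect consistency.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)    # between-subject
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)    # between-rater
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual
    ms_rows = ss_rows / (n - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))
    return (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err)

# Invented example: a "chatbot" that scores every essay exactly one point
# above the human rater is perfectly consistent (ICC = 1).
human = [3, 4, 2, 5, 4]
chatbot = [h + 1 for h in human]
icc = icc_consistency(np.column_stack([human, chatbot]))
```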
Is Rasch model analysis applicable in small sample size pilot studies for assessing item characteristics? An example using PROMIS pain behavior item bank data
by Chen, Wen-Hung; Wyrwich, Kathleen W.; Revicki, Dennis A.
in Behavior; Behavior modeling; Chronic pain
2014
Purpose: Large samples are generally considered necessary for the Rasch model to obtain robust item parameter estimates. Recently, small-sample Rasch analysis has been suggested as a preliminary assessment of items' psychometric properties. This study evaluates Rasch analysis results obtained with small sample sizes. Methods: Ten PROMIS pain behavior items were used. Random samples of 30, 50, 100, and 250, and a targeted sample of 30, were drawn 10 times each from a total of 800 subjects. Rasch analysis was conducted for each of these samples and for the full sample. Results: In the full sample, there were 104 cases of extreme scores, no null categories, two incorrectly ordered items, and four misfit items. For samples of 250, 100, 50, 30, and targeted 30, the average numbers of extreme scores were 42.2, 17.1, 9.6, 6.1, and 1.2; the average numbers of null categories were 1.0, 3.2, 8.7, and 8.3; the average numbers of items with incorrectly ordered item parameters were 0.1, 0.8, 2.9, 4.7, and 3.7; and the average numbers of items with fit residuals exceeding ±2.5 were 0.8, 0.3, 0.1, 0.2, and 0.3, respectively. Conclusions: Rasch analysis based on small samples (≤50) identified more items with incorrectly ordered parameters than larger samples (≥100), yet fewer misfitting items. Results from small samples could thus lead to conclusions opposite to those based on larger samples. Rasch analysis based on small samples should be used for exploratory purposes only, and with extreme caution.
Journal Article
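The simulation design described above can be sketched in miniature. The PROMIS items are polytomous; for simplicity this sketch uses the dichotomous Rasch model, and all parameter values are invented:

```python
import numpy as np

def rasch_prob(theta, b):
    """Dichotomous Rasch model: P(X=1) = exp(theta - b) / (1 + exp(theta - b))."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

rng = np.random.default_rng(0)
n_persons, n_items = 30, 10               # one "small sample" condition
theta = rng.normal(0.0, 1.0, n_persons)   # person abilities
b = np.linspace(-2.0, 2.0, n_items)       # item difficulties

# Simulate a persons-by-items response matrix under the model.
p = rasch_prob(theta[:, None], b[None, :])
responses = (rng.random((n_persons, n_items)) < p).astype(int)

# Extreme scores (all items endorsed, or none) carry no information
# about item parameters and shrink the usable sample further.
raw = responses.sum(axis=1)
n_extreme = int(np.sum((raw == 0) | (raw == n_items)))
```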
Rating scales and Rasch measurement
Assessments with ratings in ordered categories have become ubiquitous in health, biological and social sciences. Ratings are used when a measuring instrument of the kind found in the natural sciences is not available to assess some property in terms of degree - for example, greater or smaller, better or worse, or stronger or weaker. The handling of ratings has ranged from the very elementary to the highly sophisticated. In an elementary form, and assumed in classical test theory, the ratings are scored with successive integers and treated as measurements; in a sophisticated form, and used in modern test theory, the ratings are characterized by probabilistic response models with parameters for persons and the rating categories. Within modern test theory, two paradigms, similar in many details but incompatible on crucial points, have emerged. For the purposes of this article, these are termed the statistical modeling and experimental measurement paradigms. Rather than reviewing a compendium of available methods and models for analyzing ratings in detail, the article focuses on the incompatible differences between these two paradigms, with implications for choice of model and inferences. It shows that the differences have implications for different roles for substantive researchers and psychometricians in designing instruments with rating scales. To illustrate these differences, an example is provided.
Journal Article
Assessing computational thinking abilities among Singapore secondary students: a Rasch model measurement analysis
by Looi, Chee-Kit; Sumintono, Bambang; Chan, Shiau-Wei
in Assessments; Computer programming; Computer science
2021
In recent years, computational thinking (CT) has been globally recognized as a 21st-century skill that must be developed for future generations. However, the lack of validated CT assessments is a major impediment to efforts to incorporate CT into the school curriculum. This study is intended to validate the Computational Thinking Test (CTt) using the Rasch model by identifying whether the data fit the Rasch model measurement, determining the CT abilities of a small sample of Singapore secondary students through the test, and examining the presence of test items that functioned differently by gender and grade level. In this study, 153 upper secondary school students from Grade 9 and Grade 10 completed the CTt, which comprises 28 test items. The students' performance on the CTt was analyzed quantitatively using the Rasch model. The findings revealed that the data fit the Rasch model measurement. The majority of the male students and ninth-graders had a high level of CT abilities, while most of the female students and tenth-graders had a moderate level; hence, the male students and ninth-graders performed better. Four items functioned differently between male and female students, giving one gender a better chance of answering these items correctly. Only one test item functioned differently between Grade 9 and Grade 10, meaning that students at one grade level were more likely to answer it correctly than students at the other. This study hopes to contribute to the literature on CT assessment by providing a reference case for scholars and researchers assessing CT abilities among students.
Journal Article
Enhancing analytic rigor in qualitative analysis: developing and testing code scheme using Many Facet Rasch Model
by Sumintono, Bambang; Mohd Zabidi, Zuliana; Abdullah, Zuraidah
in Data analysis; Homogeneity; Leniency
2022
Performance assessments in which multiple raters use rating scales to evaluate the quality of a performance are frequently seen and reported. However, establishing intercoder reliability when analysing the content validity of codes applied to transcript data in a qualitative study is not widely practiced. This paper presents intercoder reliability using the Many Facet Rasch Model (MFRM) in developing, testing, and enhancing the code scheme of a qualitative study. The results suggest that although the raters shared the same field of expertise and were given the same code scheme, their individual characteristics were still observable: unlike the standard approach, which assumes group homogeneity, the MFRM allowed each expert's rating operation to be compared. While all raters agreed in assigning 70% of the codings to the categories of very good and good, the researchers could still determine each rater's severity or leniency from the ratings assigned. Moreover, the MFRM's ability to provide independent pieces of information helped the researchers identify which codings were accepted and which needed review. The evaluation increased trust in the quality of the codings and gave the researchers confidence in establishing the code scheme for the qualitative study.
Journal Article
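In a Many Facet Rasch Model, rater severity enters as an additional facet that shifts the log-odds of a rating. A dichotomized sketch follows; the full MFRM handles ordered rating categories, and the parameter values here are invented:

```python
import math

def mfrm_prob(ability, difficulty, severity):
    """Dichotomized many-facet sketch:
    logit P(positive rating) = ability - item difficulty - rater severity."""
    logit = ability - difficulty - severity
    return 1.0 / (1.0 + math.exp(-logit))

# The same coding, judged by a severe vs. a lenient rater: severity
# lowers the chance of a favourable rating, all else being equal.
p_severe = mfrm_prob(1.0, 0.0, 0.8)
p_lenient = mfrm_prob(1.0, 0.0, -0.8)
```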
The development and initial validation of the Breast Cancer Recurrence instrument (BreastCaRe)—a patient-reported outcome measure for detecting symptoms of recurrence after breast cancer
by Høeg, Beverley Lim; Johansen, Christoffer; Saltbæk, Lena
in Breast cancer; Instrument development; Medicine
2021
Purpose
Patient-reported outcomes (PROs) may facilitate prompt treatment. We describe the development and psychometric properties of the first instrument to monitor for symptoms of breast cancer (BC) recurrence.
Methods
This study is nested in the MyHealth randomized trial of nurse-led follow-up based on electronically-collected PROs. We constructed items assessing symptoms of potential recurrence through expert interviews with six BC specialists in Denmark. Semi-structured cognitive interviews were carried out with a patient panel to assess acceptability and comprehensibility. Items were subsequently tested in a population of 1170 women 1–10 years after completing BC treatment. We carried out multiple-groups confirmatory factor analysis (CFA) and Rasch analysis to test dimensionality, local dependence (LD) and differential item functioning (DIF) according to sociodemographic and treatment-related factors. Clinical data was obtained from the Danish Breast Cancer Group registry.
Results
Twenty-two items were generated for the Breast Cancer Recurrence instrument (BreastCaRe). Cognitive testing resulted in clearer items. Seven subscales based on general, bone, liver, lung, brain, locoregional and contralateral recurrence symptoms were proposed. Both CFA and Rasch models confirmed the factor structure. No DIF was identified. Five item pairs showed LD but all items were retained to avoid loss of clinical information. Rasch models taking LD into account were used to generate a standardized scoring table for each subscale.
Conclusions
The BreastCaRe has good content and structural validity, patient acceptability and measurement invariance. We are preparing to examine the predictive validity of this new instrument.
Journal Article
Analyses of Model Fit and Robustness. A New Look at the PISA Scaling Model Underlying Ranking of Countries According to Reading Literacy
by Kreiner, Svend; Christensen, Karl Bang
in Assessment; Behavioral Science and Psychology; Educational evaluation
2014
This paper addresses methodological issues concerning the scaling model used in the international comparison of student attainment in the Programme for International Student Assessment (PISA), specifically with reference to whether PISA's ranking of countries is confounded by model misfit and differential item functioning (DIF). To determine this, we reanalyzed the publicly accessible data on reading skills from the 2006 PISA survey. We also examined whether the ranking of countries is robust to errors of the scaling model. This was done by studying invariance across subscales and by comparing ranks based on the scaling model with ranks based on models in which some of the flaws of PISA's scaling model are taken into account. Our analyses provide strong evidence of misfit of the PISA scaling model and very strong evidence of DIF. These findings do not support the claim that the country rankings reported by PISA are robust.
Journal Article
Noncompensatory MIRT For Passage-Based Tests
by Bolt, Daniel M.; Kim, Nana; Wollack, James
in Application Reviews and Case Studies; Application Reviews and Case Studies (ARCS); Assessment
2022
We consider a multidimensional noncompensatory approach for binary items in passage-based tests. The passage-based noncompensatory model (PB-NM) emphasizes two underlying components in solving passage-based test items: a passage-related component and a passage-independent component. An advantage of the PB-NM model over commonly applied compensatory models (e.g., bifactor model) is that the two components are parameterized in relation to difficulty as opposed to discrimination parameters. As a result, while simultaneously accounting for passage-related local item dependence, the model permits the assessment of how items based on the same passage may require varying levels of passage comprehension (as well as varying levels of passage-independent proficiency) to obtain a correct response. Through a simulation study, we evaluate the comparative fit of the PB-NM against the bifactor model and also illustrate the relationship between the difficulty parameters of the PB-NM and the discrimination parameters of the bifactor model. We further apply the PB-NM to an actual reading comprehension test to demonstrate the relevance of the model in understanding variation in the relative difficulty of the two components across different item types.
Journal Article
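The noncompensatory structure described above can be sketched generically: each component has its own difficulty, and the component success probabilities multiply, so a deficit on one component cannot be offset by strength on the other. This is a generic Sympson-style noncompensatory form, not the authors' exact PB-NM parameterization, and the values are invented:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def noncompensatory_prob(theta_passage, theta_item, b_passage, b_item):
    """P(correct) = P(passage component solved) * P(item component solved).

    The difficulty parameters (b_passage, b_item) govern how much of each
    component an item demands, echoing the PB-NM's emphasis on difficulty
    rather than discrimination parameters.
    """
    return logistic(theta_passage - b_passage) * logistic(theta_item - b_item)

# High passage ability cannot compensate for low passage-independent
# ability: the overall success probability stays below one half.
p = noncompensatory_prob(3.0, -3.0, 0.0, 0.0)
```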
The behaviors of Indonesian domestic ecotourists using a Rasch analysis
by Hakim, Imam Nur; Maulana, Addin; Khoiriyani, Fauziah
in Behavior; Behavior problems; Conservation
2024
Purpose
This study aims to analyze the capacity of ecotourists to exhibit behavior that aligns with the ecotourist scale using the Rasch model measurement.
Design/methodology/approach
The data was gathered using an online survey incorporating the five tenets of ecotourism using a seven-point rating scale on domestic tourists in Indonesia. Descriptive statistics, cross-tabulation and Rasch model measurement were used to analyze the data.
Findings
The ecotourist identification scale measurement items were reliable and satisfactory. The most challenging behavior for ecotourists was using the services of a tour guide who was concerned about the environment. Meanwhile, respecting cultural differences around the tourist destination was the most accessible behavior. Most respondents demonstrated a fit response pattern and satisfactorily met the validity and reliability criteria.
Research limitations/implications
As a limitation, this study did not compare ecotourists' behavior by the type of conservation site visited. However, it makes a significant methodological contribution to developing a measurement of ecotourist behavior grounded in well-established behavioral theories.
Practical implications
Practical implications include integrating ecotourism into education, incentivizing eco-friendly tourism practices, promoting awareness, supporting local businesses, respecting local values and ensuring safe travel.
Originality/value
To the best of the authors’ knowledge, this study is the first of its kind to be conducted in Indonesia. It uses a unique and innovative method to reveal the unobserved variables in ecotourists’ behavior. The findings confirm that tourists’ behaviors align with the five tenets of ecotourism.
Journal Article
Human ratings take time: A hierarchical facets model for the joint analysis of ratings and rating times
by Jin, Kuan-Yu; Eckes, Thomas
in Behavioral Science and Psychology; Cognition; Cognitive Psychology
2024
Performance assessments increasingly utilize onscreen or internet-based technology to collect human ratings. One of the benefits of onscreen ratings is the automatic recording of rating times along with the ratings. Considering rating times as an additional data source can provide a more detailed picture of the rating process and improve the psychometric quality of the assessment outcomes. However, currently available models for analyzing performance assessments do not incorporate rating times. The present research aims to fill this gap and advance a joint modeling approach, the “hierarchical facets model for ratings and rating times” (HFM-RT). The model includes two examinee parameters (ability and time intensity) and three rater parameters (severity, centrality, and speed). The HFM-RT successfully recovered examinee and rater parameters in a simulation study and yielded superior reliability indices. A real-data analysis of English essay ratings collected in a high-stakes assessment context revealed that raters differed considerably in speed, spent more time on high-quality than on low-quality essays, and tended to rate essays faster with increasing severity. However, due to the significant heterogeneity of examinees’ writing proficiency, the improvement in the assessment’s reliability under the HFM-RT was not salient in the real-data example. The discussion focuses on the advantages of accounting for rating times as a source of information in rating quality studies and highlights perspectives from the HFM-RT for future research on rater cognition.
Journal Article
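The idea of joint time parameters in the abstract above can be sketched with a common lognormal response-time form, in which an essay's time intensity raises, and a rater's speed lowers, the expected log rating time. This is a generic sketch, not the HFM-RT itself, and all values are invented:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate_log_times(time_intensity, rater_speed, n, sd=0.3):
    """Lognormal sketch: log T = time intensity (essay) - speed (rater) + noise."""
    return time_intensity - rater_speed + rng.normal(0.0, sd, n)

# A faster rater working through the same essays produces
# systematically shorter rating times.
slow = np.exp(simulate_log_times(0.5, 0.0, 2000))
fast = np.exp(simulate_log_times(0.5, 0.6, 2000))
```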