Catalogue Search | MBRL
Explore the vast range of titles available.
14 result(s) for "Marathi language Texts"
Study of automatic text summarization approaches in different languages
2021
A huge amount of information is now available from both online and offline sources. For a single topic there can be hundreds of articles, each containing a vast amount of information, and manually extracting the useful parts is difficult. Automatic text summarization systems were developed to solve this problem. Text summarization is the process of extracting useful information from large documents and compressing it into a short summary that preserves all important content. This survey paper gives a broad overview of the work done in automatic text summarization in different languages using various summarization approaches. Its focus is research on text summarization for Indian languages such as Hindi, Punjabi, Bengali, Malayalam, Kannada, Tamil, Marathi, Assamese, Konkani, Nepali, Odia, Sanskrit, Sindhi, Telugu and Gujarati, and for foreign languages such as Arabic, Chinese, Greek, Persian, Turkish, Spanish, Czech, Rome, Urdu, Bahasa Indonesia and many more. The paper supports beginning researchers in this area by giving a concise view of the feature extraction methods and classification techniques required for the different text summarization approaches applied to both Indian and non-Indian languages.
Journal Article
Hate speech detection in low-resourced Indian languages: An analysis of transformer-based monolingual and multilingual models with cross-lingual experiments
2025
Warning: This paper is based on hate speech detection and may contain examples of abusive/offensive phrases. Cyberbullying, online harassment, etc., via offensive comments are pervasive across social media platforms such as Twitter, Facebook, and YouTube. Hateful comments must be detected and eradicated to prevent harassment and violence on social media. In the Natural Language Processing (NLP) domain, the most prevalent task is comment classification, which is challenging, and language models based on transformers are at the forefront of this advancement. This paper analyzes the performance of transformer-based language models such as BERT, ALBERT, RoBERTa, and DistilBERT on Indian hate speech datasets for binary classification. We use the existing datasets HASOC (Hindi and Marathi) and HS-Bangla. We evaluate several multilingual language models such as MuRIL-BERT and XLM-RoBERTa and a few monolingual language models such as RoBERTa-Hindi, Maha-BERT (Marathi), Bangla-BERT (Bangla), and Assamese-BERT (Assamese), and also perform cross-lingual experiments. For further analysis, we also perform multilingual, monolingual, and cross-lingual experiments on our Hate Speech Assamese (HS-Assamese) (Indo-Aryan language family) and Hate Speech Bodo (HS-Bodo) (Sino-Tibetan language family) datasets (HS dataset version 2) and achieve promising results. The motivation for the cross-lingual experiments is to encourage researchers to learn about the power of the transformer. Note that no pre-trained language models are currently available for Bodo or any other Sino-Tibetan languages.
Journal Article
Combining multiple pre-trained models for hate speech detection in Bengali, Marathi, and Hindi
by Sarkar, Kamal; Mallick, Arjun; Nandi, Arpan
in Computer Communication Networks, Computer Science, Data Structures and Information Theory
2024
With the increasing use of regional languages on social media platforms, hate speech detection in regional languages has received the attention of researchers. In India, hundreds of languages are spoken in various forms, which depend on geography, culture, etc. The number of active internet users in India has been increasing rapidly, and social media has therefore penetrated the common Indian population. Though the need for proper detection and timely removal of abusive or offensive texts has increased, well-organized and labeled data for Indian languages are scarce. Almost all the regional languages in India are low-resource languages. Hence, the objective of this study is to develop an approach that learns from relatively small volumes of Indian language data and provides state-of-the-art results. A fusion of features extracted from a fine-tuned multilingual BERT (Bidirectional Encoder Representations from Transformers) and a fine-tuned Indic BERT is proposed in this study. Since the BERT models used for this work are pre-trained on a large volume of texts in multiple Indian languages, transfer learning solves the problem of low training data volume, and this makes the proposed model more generic. Three datasets for three different Indian languages, namely Bengali, Marathi, and Hindi, are used to evaluate the proposed approach. The proposed model achieved a weighted F1 score of 0.923, 0.815, and 0.924 for the Bengali, Hindi, and Marathi datasets respectively. On the Bengali and Marathi datasets, the obtained results are better than the existing best results.
Journal Article
A case study on decompounding in Indian language IR
2025
Decompounding is an essential preprocessing step in text-processing tasks such as machine translation, speech recognition, and information retrieval (IR). Here, the IR issues are explored from five viewpoints. (A) Does word decompounding impact the Indian language IR? If yes, to what extent? (B) Can corpus-based decompounding models be used in the Indian language IR? If yes, how? (C) Can machine learning and deep learning-based decompounding models be applied in the Indian language IR? If yes, how? (D) Among the different decompounding models (corpus-based, hybrid machine learning-based, and deep learning-based), which provides the best effectiveness in the IR domain? (E) Among the different IR models, which provides the best effectiveness from the IR perspective? This study proposes different corpus-based, hybrid machine learning-based, and deep learning-based decompounding models in Indian languages (Marathi, Hindi, and Sanskrit). Moreover, we evaluate the effectiveness of each activity from an IR perspective only. It is observed that the different decompounding models improve IR effectiveness. The deep learning-based decompounding models outperform the corpus-based and hybrid machine learning-based models in Indian language IR. Among the different deep learning-based models, the Bi-LSTM-A model performs best and improves mean average precision (MAP) by 28.02% in Marathi. Similarly, the Bi-RNN-A model improves MAP by 18.18% and 6.1% in Hindi and Sanskrit, respectively. Among the retrieval models, the In_expC2 model outperforms others in Marathi and Hindi, and the BB2 model outperforms others in Sanskrit.
Journal Article
Effect of stopwords in Indian language IR
2022
We explore and evaluate the effect of stopwords in retrieval performance of different Indian languages such as Marathi, Bengali, Gujarati and Sanskrit. The issue was investigated from three viewpoints. Is there any impact of non-corpus-based stopword removal on chosen Indian languages (if yes, to what extent)? Can we recommend, based on experiment, a number of stopwords for chosen Indian languages that are good enough from retrieval point of view? Is there any relationship of stopwords with average document length from retrieval perspective? It is observed that the stopword removal generally improves mean average precision (MAP) significantly compared with the case when it is not done. For each language, different lengths of the stopword list are explored and evaluated that lead to suggesting its optimal length. We also study the effect of stopwords on retrieval performance over document length. The effect of stopwords is generally found to be quite low in short documents compared with their long counterparts across the four Indian languages.
Journal Article
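The abstract above reports its findings in terms of mean average precision (MAP). A minimal sketch of how that metric is commonly computed is given here, assuming binary relevance judgements. The function names, query ids, and document ids are hypothetical and are not taken from the paper.

```python
from typing import Dict, Sequence, Set


def average_precision(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Average precision of one ranked result list against a set of relevant ids."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at this rank, counted only where a relevant document appears.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, Sequence[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over every query in qrels."""
    if not qrels:
        return 0.0
    total = sum(average_precision(runs.get(q, []), rel) for q, rel in qrels.items())
    return total / len(qrels)


# Hypothetical toy data: two queries with their ranked results and relevance sets.
toy_runs = {"q1": ["d1", "d2", "d3"], "q2": ["d4", "d5"]}
toy_qrels = {"q1": {"d1", "d3"}, "q2": {"d5"}}
toy_map = mean_average_precision(toy_runs, toy_qrels)
```

Comparing this value for runs indexed with and without a stopword list is one way to measure the kind of effect the study describes.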
MTStemmer: A multilevel stemmer for effective word pre-processing in Marathi
2021
In natural language processing, it is important that the context and the meaning of words are retained while also ensuring the efficacy of the data modelling process. During human-to-human interactions, special care is taken regarding the tense and phrasing of the words by taking into consideration the rules of grammar of the specific language. While this modification of words is necessary for framing consistent sentences, these appendages do not add significant value to the original meaning of the word. Stemming is the process of converting words back to their root form for efficient and accurate modelling of the data. In this paper, MTStemmer, a new stemmer for the Marathi language is proposed. It focuses on the stripping of suffixes for obtaining the root word form. The proposed stemmer applies a multilevel approach by taking into consideration both auxiliary verb-based suffixes and gender-based suffixes. The presented approach intends to improve upon the limitations of the previously proposed stemmers for this language. The stemming performed by the stemmer is found to be more accurate in terms of mapping to the root words. Stemming is often an important pre-processing step before processing the data further for the main task. The benefit of the proposed stemmer is demonstrated by using it for an extractive Marathi text summarization task. A significant improvement in the performance of multiple performance metrics is achieved owing to the stemming done by MTStemmer. The working of the proposed stemmer shows promising signs for the development of similar engines for other Indic languages.
Journal Article
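The abstract above describes stemming by stripping suffixes in more than one level. The general idea of level-by-level suffix stripping can be sketched as follows. This is an illustration only, not MTStemmer itself: the function names are hypothetical, and the suffix lists are made-up placeholders rather than real Marathi suffixes or the paper's rules.

```python
from typing import Iterable, Sequence


def strip_longest_suffix(word: str, suffixes: Iterable[str], min_stem: int = 2) -> str:
    """Remove the longest matching suffix, keeping at least min_stem characters."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def multilevel_stem(word: str, levels: Sequence[Iterable[str]], min_stem: int = 2) -> str:
    """Apply each level's suffix list in order, stripping at most one suffix per level."""
    for suffixes in levels:
        word = strip_longest_suffix(word, suffixes, min_stem)
    return word


# Placeholder suffix levels (assumption): level 1 stands in for auxiliary-verb
# suffixes and level 2 for gender suffixes; neither list comes from the paper.
TOY_LEVELS = [("xx", "xxy"), ("a", "o")]

stem = multilevel_stem("ramaxxy", TOY_LEVELS)
```

Trying the longest suffix first avoids leaving part of a longer suffix behind, and the minimum stem length guards against stripping a short word down to nothing.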
Pluralizing the Non-dual: Multilingual Perspectives on Advaita Vedānta, 1560-1847
2020
With a textual record spanning dozens of languages, to say nothing of its oral histories, Advaita Vedānta's multilingual archive presents obvious and daunting challenges for scholars of South Asian intellectual and religious histories. The papers in this issue build on recent multilingual and contextual approaches to South Asian intellectual history by reading a rich corpus of Advaita Vedānta material in Persian, Marathi, Tamil, Sanskrit and Braj Bhasha. In bringing these sources and their authors into conversation with one another, this issue acknowledges Advaita Vedānta's broad appeal in early-modern and colonial South Asia; but it also attests to Advaita Vedānta's heterogeneous, textured, and even contested historical development. The following papers chart Advaita Vedānta across five unique social, linguistic, intellectual, and geographical spaces from the middle of the sixteenth century to the middle of the nineteenth century. While no single issue could contextualize something as historically recalcitrant as Advaita Vedānta, we see this special issue as a step on the long and necessary road to historicizing Vedānta, broadly, and Advaita Vedānta, specifically.
Journal Article
An Application of Zipf's Law for Prose and Verse Corpora Neutrality for Hindi and Marathi Languages
2020
Text in different languages has become widely available, as almost all websites offer a multilingual option. Hindi is considered an official language in one of the states of India. Hindi text analysis is dominated by corpora of stories and poems. Token extraction is an important step before any text analysis and supports many applications such as text summarization and text categorization. Token extraction is a part of natural language processing (NLP), which includes many steps such as corpus preprocessing and lemmatization. In this paper, tokens are extracted by two methods and on two corpora. BaSa, a context-based term extraction technique comprising different NLP activities such as Term Frequency-Inverse Document Frequency (TF-IDF), and Zipf's law are used to count and compare the extracted tokens. The tokens produced by the two methods are then compared. The corpora contain prose and verse in Hindi as well as Marathi. Common tokens from the verse and prose corpora of Marathi and Hindi are identified to show that both behave the same as far as NLP activities are concerned. BaSa is shown to perform better than Zipf's law. The Hindi corpus includes 820 stories and 710 poems, and the Marathi corpus includes 610 stories and 505 poems.
Journal Article
Finite-State Back-Transliteration for Marathi
In this paper, we describe the creation of an open-source, finite-state based system for back-transliteration of Latin text in the Indian language Marathi. We outline the advantages of our system and compare it to other existing systems, evaluate its recall, and evaluate the coverage of an open-source morphological analyser on our back-transliterated corpus.
Journal Article