574 result(s) for "Parallel corpora"
A massively parallel corpus: the Bible in 100 languages
We describe the creation of a massively parallel corpus based on 100 translations of the Bible. We discuss some of the difficulties in acquiring and processing the raw material as well as the potential of the Bible as a corpus for natural language processing. Finally we present a statistical analysis of the corpora collected and a detailed comparison between the English translation and other English corpora.
The undergraduate learner translator corpus: a new resource for translation studies and computational linguistics
Around the world, a growing interest has been seen in learner translator corpora, which are invaluable resources for teaching and research. This paper introduces a new resource to support researchers from different interdisciplinary areas such as computational linguistics, descriptive translation studies, computer-aided translation technology, Arabic machine translation applications, cognitive science, and translation pedagogy. Motivated by the lack of learner translator resources that provide data about learners of translation from and into Arabic, the undergraduate learner translator corpus (ULTC) is an ongoing, error-tagged sentence-aligned parallel corpus of English, Arabic, and French, with Arabic as its main language. The present corpus, consisting of parallel texts of female learners of translation from English or French into Arabic, is the first of its kind in terms of the languages represented, tasks covered, and number of students involved. It is also unique in terms of combining many complementary corpora of cross-lingual data, each of which has its own web-based query interface and corpus analysis tools. This paper describes the ULTC compilation process, preliminary findings, and planned future expansion and research.
The Italian-Russian Parallel Corpus of the Nacional’nyj Korpus Russkogo Jazyka (NKRJa). Evolution and Applications in Italian Slavistics Research
The aim of this article is to present a comprehensive overview of the studies conducted in Italy using the Italian-Russian parallel corpus of the Nacional’nyj Korpus russkogo jazyka (NKRJa), implemented in 2013 and expanded since 2015. We provide current information on the size of the corpus and a description of the types of research conducted in various fields (contrastive linguistics, translation studies and studies dedicated to the teaching of Russian). The article discusses the practical applications of the corpus and presents the results obtained. On the basis of this overview, the potential and limits of the tool are highlighted, with a view to its continuous improvement.
A large English–Thai parallel corpus from the web and machine-generated text
The primary objective of our work is to build a large-scale English–Thai dataset for training neural machine translation models. We construct scb-mt-en-th-2020, an English–Thai machine translation dataset with over 1 million segment pairs, curated from various sources: news, Wikipedia articles, SMS messages, task-based dialogs, web-crawled data, government documents, and text artificially generated by a pretrained language model. We present the methods for gathering data, aligning texts, and removing preprocessing noise and translation errors automatically. We also train machine translation models based on this dataset to assess the quality of the corpus. Our models perform comparably to Google Translation API (as of May 2020) for Thai–English and outperform Google when the Open Parallel Corpus (OPUS) is included in the training data for both Thai–English and English–Thai translation. The dataset is available for public use under CC-BY-SA 4.0 License. The pre-trained models and source code to reproduce our work are available under Apache-2.0 License.
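Možnosti využití anotace syntaktické komplexity v paralelním korpusu: příklad francouzských tvarů na -ant v konverbální funkci a jejich českých protějšků [Possible uses of syntactic complexity annotation in a parallel corpus: the example of French -ant forms in converbal function and their Czech counterparts]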
This study explores new research opportunities offered by the InterCorp v16ud parallel corpus, annotated using the Universal Dependencies scheme and enriched with syntactic complexity (SC) measures. The analysis focuses on French sentences containing -ant forms (gerund and present participle) and their Czech translations, with participles restricted to adverbial (converbal) usage for comparability. The results show significant SC variation in literary texts, with Czech translations displaying lower values than French originals. Coefficient of variation and correlation analyses suggest that participles may function as stylistic markers, unlike gerunds. At the sentence level, participles are associated with higher SC than gerunds, though the differences are moderate. The contrastive analysis reveals substantial reductions in clausal SC measures in the Czech translations, probably due to the replacement of subordination by coordination. These shifts affect SC, information hierarchy, and occasionally temporal relations. The study underscores the potential of InterCorp v16ud for syntactic research in contrastive linguistics and beyond, while emphasizing the multidimensional nature of SC.
A multilingual corpus study of the competition between past and perfect in narrative discourse
The western European present perfect is subject to substantial crosslinguistic variation. The literature, however, focuses on individual languages or on comparisons of a restricted number of languages. We piece together the puzzle and do so in a data-driven way by comparing the use of the present perfect through a parallel corpus based on the French novel L’Étranger and its translations in Italian, German, Dutch, European Spanish, British English, and Modern Greek. We introduce and showcase Translation Mining, a software suite combining a parallel corpus database with annotation and analysis tools. Translation Mining allows us to generate descriptive statistics of tense use across languages but also to visualize variation through its multidimensional scaling component and to link the variation we find to the underlying data through its integrated setup. We confirm that the present perfect competes with the past and we reveal the fine-grained scalar nature of the variation. To complete the puzzle, we ascertain the dimensions of variation, ranging from lexical and compositional semantics to dynamic semantics and pragmatics.
A Corpus-Assisted Translation Study of Strategies Used in Rendering Culture-Bound Expressions in the Speeches of King Abdullah II
Translation is defined as transferring meaning and style from one language to another, taking the text producer's intended purpose and the audience culture into account. This paper uses a 256,000-word Arabic-English parallel corpus of the speeches of King Abdullah II of Jordan from 1999 to 2015 to examine how some culture-bound expressions were translated from Arabic into English. To do so, two software packages were used, namely Wordsmith 6 and SketchEngine. Comparing the size of the Arabic corpus with its English counterpart using the wordlist tool of WS6, the researchers found that the number of words (tokens) in the English translation is greater than in the Arabic source text. However, the results showed that the Arabic text has more unique words, which means that it has greater lexical density than its English counterpart. The researchers carried out a keyword analysis and compared the Arabic corpus with the ArTenTen corpus to identify the words that King Abdullah II saliently used in his speeches. Most of the keywords were culture-bound and related to the Jordanian context, which might make them challenging to render. Using the parallel concordance tool and comparing the Arabic text with its English translation showed that the translator(s) mainly resorted to the strategies of deletion, addition, substitution, and transliteration. The researchers recommend that further studies be conducted using the same approach but on larger corpora of other genres, such as legal, religious, press, and scientific texts.
The Epistemic Marker Určitě in the Light of Corpus Data
The paper presents a pilot study for a research project on epistemic modality and/or evidentiality markers in Czech. The study focuses on the expression určitě. Although this marker is typically considered to signal high certainty, the dictionary of standard Czech (Slovník spisovné češtiny, SSČ) also offers an alternative meaning of určitě, indicating a lower degree of certainty. We use parallel data from the InterCorp v15 corpus to determine whether the probability meaning can be identified unequivocally in real language data and whether it correlates with specific translation equivalents, linguistic features, or lexical context. Based on our findings, we propose an alternative method for distinguishing between different shades of meaning based on the communicative functions of the utterances, and we draw conclusions regarding the relevance of individual grammatical and lexical clues in context for future annotations.
The InterCorp Parallel Corpus with a Uniform Annotation for All Languages
Recently, the language-specific morphosyntactic annotation of InterCorp, a large multilingual parallel corpus, has been replaced by a language-uniform morphosyntactic and syntactic annotation following the guidelines of the Universal Dependencies project. Because the corpus is used predominantly by human users via a token-based concordancer, the CoNLL-U format produced by the UDPipe parser has been extended by attributes such as the lemma of the token’s syntactic head or the morphosyntactic categories of the content verb’s auxiliary. We conclude that despite some theoretical and practical issues, the new annotation is a promising solution to the issue of mutually incompatible tagsets within a single corpus.
Building the first comprehensive machine-readable Turkish sign language resource: methods, challenges and solutions
This article describes the procedures employed during the development of the first comprehensive machine-readable Turkish Sign Language (TiD) resource: a bilingual lexical database and a parallel corpus between Turkish and TiD. In addition to sign language specific annotations (such as non-manual markers, classifiers and buoys) following the recently introduced TiD knowledge representation (Eryiğit et al. 2016), the parallel corpus also contains annotations of dependency relations, which makes it the first parallel treebank between a sign language and an auditory-vocal language.