25 result(s) for "Kashino, Kunio"
Label Propagation with Ensemble of Pairwise Geometric Relations: Towards Robust Large-Scale Retrieval of Object Instances
Spatial verification methods permit geometrically stable image matching, but they still involve a difficult trade-off between robustness against the incorrect rejection of true correspondences and discriminative power against mismatches. To address this issue, we ask whether an ensemble of weak geometric constraints, each of which correlates with visual similarity only slightly better than a bag-of-visual-words model, performs better than a single strong constraint. We consider a family of spatial verification methods and decompose them into fundamental constraints imposed on pairs of feature correspondences. Combining such constraints leads us to propose a new method, which takes the best of existing techniques and functions as a unified Ensemble of pAirwise GEometric Relations (EAGER), in terms of both spatial contexts and between-image transformations. We also introduce a novel and robust reranking method, in which the object instances localized by EAGER in high-ranked database images are reissued as new queries. EAGER is extended with a smoothness constraint under which the similarity between the optimized ranking scores of two instances should be maximally consistent with their geometrically constrained similarity. Reranking is newly formulated as two label propagation problems: one to assess the confidence of the new queries and the other to aggregate the new, independently executed retrievals. Extensive experiments conducted on four datasets show that EAGER and our reranking method outperform most of their state-of-the-art counterparts, especially when large-scale visual vocabularies are used.
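
The reranking step above is cast as label propagation over retrieved images. As a rough illustration of what such a formulation looks like, here is a generic label-propagation (manifold-ranking style) sketch on a small similarity graph; the similarity matrix, seed scores, and parameters are placeholders, and this is not the EAGER-specific formulation.

    # Generic label propagation on a similarity graph (illustrative sketch only;
    # not the EAGER-specific formulation). W and y are placeholder inputs.
    import numpy as np

    def propagate(W, y, alpha=0.9, iters=50):
        """Iterate f <- alpha * S @ f + (1 - alpha) * y, where S is the
        symmetrically normalized similarity matrix."""
        d = W.sum(axis=1)
        d[d == 0] = 1.0
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        S = D_inv_sqrt @ W @ D_inv_sqrt
        f = y.astype(float).copy()
        for _ in range(iters):
            f = alpha * S @ f + (1 - alpha) * y
        return f

    # Toy example: 5 database images, image 0 is the confident seed (new query).
    W = np.array([[0, .9, .8, .1, 0],
                  [.9, 0, .7, 0, 0],
                  [.8, .7, 0, .2, 0],
                  [.1, 0, .2, 0, .6],
                  [0, 0, 0, .6, 0]])
    y = np.array([1.0, 0, 0, 0, 0])   # initial ranking scores / query confidence
    print(propagate(W, y))            # propagated scores used for reranking
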
Predicting Heart Rate at the Anaerobic Threshold Using a Machine Learning Model Based on a Large-Scale Population Dataset
Background/Objectives: For effective exercise prescription for patients with cardiovascular disease, it is important to determine the target heart rate at the level of the anaerobic threshold (AT-HR). The AT-HR is mainly determined by cardiopulmonary exercise testing (CPET). The aim of this study is to develop a machine learning (ML) model to predict the AT-HR solely from non-exercise clinical features. Methods: From 21,482 consecutive cases of CPET performed between 2 February 2008 and 1 December 2021, an appropriate subset was selected to train our ML model. The data consisted of 78 features, including age, sex, anthropometry, clinical diagnosis, cardiovascular risk factors, vital signs, blood tests, and echocardiography. We predicted the AT-HR using an ML method called gradient boosting, along with a ranking of each feature's contribution to the AT-HR prediction. Accuracy was evaluated by comparing the predicted AT-HR with the target HRs from guideline-recommended equations in terms of the mean absolute error (MAE). Results: A total of 8228 participants, including healthy individuals and patients with cardiovascular disease, were analyzed (mean age 62 ± 15 years; 69% male). The MAE of the ML-based AT-HR prediction was 7.7 ± 0.2 bpm, which was significantly smaller than those of the guideline-recommended equations: the Karvonen formulas with coefficients of 0.7 and 0.4 yielded 34.5 ± 0.3 bpm and 11.9 ± 0.2 bpm, respectively, and the simpler formulas, resting HR + 10 bpm and + 20 bpm, yielded 15.9 ± 0.3 and 9.7 ± 0.2 bpm, respectively. The feature ranking revealed that the features contributing most to the AT-HR prediction include, in order, the resting heart rate, age, N-terminal pro-brain natriuretic peptide (NT-proBNP), resting systolic blood pressure, highly sensitive C-reactive protein (hsCRP), cardiovascular disease diagnosis, and β-blocker use. Prediction accuracy with the top 10 to 20 features was comparable to that with all features. Conclusions: We propose an accurate model for predicting the AT-HR from non-exercise clinical features and expect it to facilitate cardiac rehabilitation. The feature selection technique also newly unveiled some major determinants of the AT-HR, such as NT-proBNP and hsCRP.
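
The setup described above is a standard supervised-regression pipeline: a gradient-boosting regressor trained on tabular clinical features, evaluated by MAE, with a feature-ranking step. The sketch below illustrates that kind of pipeline with scikit-learn on synthetic placeholder data; the feature names, model settings, and the use of permutation importance are assumptions for illustration, not the study's actual dataset or code.

    # Minimal sketch of a gradient-boosting AT-HR predictor (not the paper's code).
    # Feature names, target, and data are synthetic placeholders.
    import numpy as np
    from sklearn.ensemble import HistGradientBoostingRegressor
    from sklearn.inspection import permutation_importance
    from sklearn.metrics import mean_absolute_error
    from sklearn.model_selection import train_test_split

    rng = np.random.default_rng(0)
    n = 2000
    X = np.column_stack([
        rng.normal(70, 12, n),    # resting heart rate [bpm]
        rng.normal(62, 15, n),    # age [years]
        rng.lognormal(6, 1, n),   # NT-proBNP [pg/mL]
        rng.normal(125, 18, n),   # resting systolic blood pressure [mmHg]
    ])
    feature_names = ["rest_hr", "age", "nt_probnp", "sbp"]
    # Synthetic target loosely tied to resting HR and age, for illustration only.
    y = 0.8 * X[:, 0] + 0.3 * (100 - X[:, 1]) + rng.normal(0, 5, n)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
    model = HistGradientBoostingRegressor(random_state=0).fit(X_tr, y_tr)
    print("MAE [bpm]:", mean_absolute_error(y_te, model.predict(X_te)))

    # Rank features by permutation importance (a generic stand-in for the
    # feature-ranking method used in the paper).
    imp = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
    for i in np.argsort(imp.importances_mean)[::-1]:
        print(feature_names[i], round(float(imp.importances_mean[i]), 3))
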
Interest point selection by topology coherence for multi-query image retrieval
Although the bag-of-visual-words (BOVW) model in computer vision has been demonstrated to be successful for the retrieval of particular objects, it suffers from limited accuracy when images of the same object differ greatly in viewpoint or scale. Naively leveraging multiple views of the same object to query the database alleviates this problem to some extent. However, the bottleneck appears to be the presence of background clutter, which causes significant confusion with images of different objects. To address this issue, we explore the structural organization of interest points within the multiple query images and select those that derive from the tentative region of interest (ROI), which significantly reduces the negative contribution of confusing images. Specifically, we propose a multi-layered undirected graph model built on sets of Hessian affine interest points to model the images’ elastic spatial topology. We detect repeating patterns that preserve a coherent local topology, show how these redundancies are leveraged to estimate tentative ROIs, and demonstrate how this novel interest point selection approach improves the quality of visual matching. The approach is discriminative in distinguishing clutter from interest points and, at the same time, highly robust to variations in viewpoint and scale as well as to errors in interest point detection and description. Large-scale datasets are used for extensive experimentation and discussion.
BYOL for Audio: Exploring Pre-trained General-purpose Audio Representations
Pre-trained models are essential as feature extractors in modern machine learning systems in various domains. In this study, we hypothesize that representations effective for general audio tasks should provide multiple aspects of robust features of the input sound. To recognize sounds regardless of perturbations such as varying pitch or timbre, features should be robust to these perturbations. To serve the diverse needs of tasks such as the recognition of emotions or music genres, representations should provide multiple aspects of information, such as local and global features. To implement this principle, we propose a self-supervised learning method: Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"). BYOL-A pre-trains representations of the input sound that are invariant to audio data augmentations, which makes the learned representations robust to perturbations of sounds. Meanwhile, the BYOL-A encoder combines local and global features and calculates their statistics so that the representation provides multi-aspect information. As a result, the learned representations should provide robust, multi-aspect information to serve the various needs of diverse tasks. We evaluated the general audio task performance of BYOL-A against previous state-of-the-art methods, and BYOL-A demonstrated its generalizability with the best average result of 72.4% and the best VoxCeleb1 result of 57.6%. Extensive ablation experiments revealed that the BYOL-A encoder architecture contributes most of the performance, with the remaining critical portions attributable to the BYOL framework and the BYOL-A augmentations. Our code is available online at https://github.com/nttcslab/byol-a for future studies.
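
At its core, a BYOL-style objective matches an online network's prediction to a momentum target's output for a differently augmented view, using a normalized MSE. The PyTorch sketch below shows only that objective with stand-in encoders, inputs, and augmentations; it is a simplification for illustration (projectors and the real BYOL-A augmentations are omitted) and is not the nttcslab/byol-a implementation.

    # Minimal BYOL-style objective on audio features (illustrative sketch only,
    # not the nttcslab/byol-a code). Encoders, inputs, and augmentations are placeholders.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def byol_loss(p, z):
        """Normalized MSE between online prediction p and target output z."""
        p = F.normalize(p, dim=-1)
        z = F.normalize(z, dim=-1)
        return (2 - 2 * (p * z).sum(dim=-1)).mean()

    dim = 128
    online_enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 96, dim))  # stand-in encoder
    online_pred = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
    target_enc = nn.Sequential(nn.Flatten(), nn.Linear(64 * 96, dim))
    target_enc.load_state_dict(online_enc.state_dict())
    for p_ in target_enc.parameters():
        p_.requires_grad_(False)

    x = torch.randn(8, 1, 64, 96)                       # batch of log-mel patches (placeholder)
    view1 = x + 0.1 * torch.randn_like(x)               # stand-in augmentation
    view2 = x + 0.1 * torch.randn_like(x)               # stand-in augmentation

    loss = byol_loss(online_pred(online_enc(view1)), target_enc(view2).detach())
    loss.backward()
    # After the optimizer step, the target encoder is updated as an EMA of the online encoder:
    tau = 0.99
    with torch.no_grad():
        for po, pt in zip(online_enc.parameters(), target_enc.parameters()):
            pt.mul_(tau).add_((1 - tau) * po)
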
Reflectance-Guided, Contrast-Accumulated Histogram Equalization
Existing image enhancement methods fall short of expectations because it is difficult for them to improve global and local image contrast simultaneously. To address this problem, we propose a histogram equalization-based method that adapts to the data-dependent requirements of brightness enhancement and improves the visibility of details without losing the global contrast. This method incorporates the spatial information provided by the image context into the density estimation for discriminative histogram equalization. To minimize the adverse effect of non-uniform illumination, we propose defining the spatial information on the basis of image reflectance estimated with edge-preserving smoothing. Our method works particularly well for determining how the background brightness should be adaptively adjusted and for revealing useful image details hidden in the dark.
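
One way to picture the idea is: estimate illumination with an edge-preserving filter, derive the reflectance, and let its local contrast weight the histogram that drives equalization. The sketch below follows that reading with NumPy/OpenCV; the bilateral filter, Sobel-based weights, and overall weighting scheme are simplifying assumptions for illustration, not the paper's exact formulation.

    # Simplified reflectance-weighted histogram equalization (illustrative only;
    # the weighting scheme here is a stand-in, not the paper's exact method).
    import cv2
    import numpy as np

    def reflectance_weighted_he(gray_u8):
        img = gray_u8.astype(np.float32) + 1.0
        # Edge-preserving smoothing as a crude illumination estimate.
        illum = cv2.bilateralFilter(img, 9, 75, 75)
        reflectance = img / np.maximum(illum, 1.0)
        # Local contrast of the reflectance as per-pixel weights.
        gx = cv2.Sobel(reflectance, cv2.CV_32F, 1, 0)
        gy = cv2.Sobel(reflectance, cv2.CV_32F, 0, 1)
        weights = np.sqrt(gx ** 2 + gy ** 2) + 1e-3
        # Accumulate a weighted histogram and equalize with its CDF.
        hist = np.bincount(gray_u8.ravel(), weights=weights.ravel(), minlength=256)
        cdf = np.cumsum(hist)
        cdf = cdf / cdf[-1]
        lut = np.round(255 * cdf).astype(np.uint8)
        return lut[gray_u8]

    gray = cv2.imread("input.png", cv2.IMREAD_GRAYSCALE)   # placeholder path
    if gray is not None:
        cv2.imwrite("enhanced.png", reflectance_weighted_he(gray))
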
Masked Modeling Duo: Towards a Universal Audio Pre-training Framework
Self-supervised learning (SSL) using masked prediction has made great strides in general-purpose audio representation. This study proposes Masked Modeling Duo (M2D), an improved masked prediction SSL method that learns by predicting the representations of masked input signals, which serve as training signals. Unlike conventional methods, M2D obtains the training signal by encoding only the masked part, encouraging the two networks in M2D to model the input. While M2D improves general-purpose audio representations, a specialized representation is essential for real-world applications, such as those in industrial and medical domains. The often confidential and proprietary data in such domains is typically limited in size and has a different distribution from that of pre-training datasets. Therefore, we propose M2D for X (M2D-X), which extends M2D to enable the pre-training of specialized representations for an application X. M2D-X adds a further learning task to M2D and mixes background noise into the input. The additional task is configurable to serve diverse applications, while the background noise helps the model learn from small data and forms a denoising task that makes the representation robust. With these design choices, M2D-X should learn a representation specialized to serve various application needs. Our experiments confirmed that the representations for general-purpose audio, for the highly competitive AudioSet and speech domains, and for a small-data medical task all achieve top-level performance, demonstrating the potential of using our models as a universal audio pre-training framework. Our code is available online for future studies at https://github.com/nttcslab/m2d
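
One concrete ingredient of M2D-X mentioned above is mixing background noise into the input so that pre-training also acts as a denoising task. A minimal waveform-mixing sketch at a random signal-to-noise ratio is shown below; the signals and SNR range are placeholders, not the authors' data pipeline.

    # Mixing background noise into a training signal at a random SNR
    # (illustrative sketch of the denoising-style input used by M2D-X;
    # signals here are random placeholders).
    import numpy as np

    def mix_at_snr(signal, noise, snr_db):
        noise = noise[: len(signal)]
        p_sig = np.mean(signal ** 2) + 1e-12
        p_noise = np.mean(noise ** 2) + 1e-12
        scale = np.sqrt(p_sig / (p_noise * 10 ** (snr_db / 10)))
        return signal + scale * noise

    rng = np.random.default_rng(0)
    clean = rng.standard_normal(16000)        # 1 s of placeholder audio at 16 kHz
    background = rng.standard_normal(16000)   # placeholder background noise
    noisy_input = mix_at_snr(clean, background, snr_db=rng.uniform(-5, 10))
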
ConceptBeam: Concept Driven Target Speech Extraction
We propose a novel framework for target speech extraction based on semantic information, called ConceptBeam. Target speech extraction means extracting the speech of a target speaker from a mixture. Typical approaches exploit properties of audio signals, such as harmonic structure and direction of arrival. In contrast, ConceptBeam tackles the problem with semantic clues. Specifically, we extract the speech of speakers speaking about a concept, i.e., a topic of interest, using a concept specifier such as an image or speech. Solving this novel problem would open the door to innovative applications such as listening systems that focus on a particular topic discussed in a conversation. Unlike keywords, concepts are abstract notions, making it challenging to directly represent a target concept. In our scheme, a concept is encoded as a semantic embedding by mapping the concept specifier to a shared embedding space. This modality-independent space can be built by means of deep metric learning using paired data consisting of images and their spoken captions. We use it to bridge the modality-dependent information, i.e., the speech segments in the mixture, and the specified, modality-independent concept. As a proof of concept of our scheme, we performed experiments using a set of images associated with spoken captions: we generated speech mixtures from these spoken captions and used the images or speech signals as the concept specifiers. We then extracted the target speech using the acoustic characteristics of the identified segments. We compared ConceptBeam with two methods: one based on keywords obtained from recognition systems and the other based on sound source separation. We show that ConceptBeam clearly outperforms the baseline methods and effectively extracts speech based on the semantic representation.
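
Mechanically, the selection step amounts to comparing a concept embedding with embeddings of speech segments in the shared space. The sketch below shows such cosine-similarity selection with random placeholder embeddings; the embedding networks, the deep-metric-learning training, and the actual extraction stage of ConceptBeam are all omitted.

    # Selecting speech segments whose embeddings match a concept embedding
    # (illustrative sketch with random placeholder embeddings; the embedding
    # networks and the extraction stage of ConceptBeam are omitted).
    import numpy as np

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

    rng = np.random.default_rng(0)
    concept_emb = rng.standard_normal(256)          # from an image or speech specifier
    segment_embs = rng.standard_normal((20, 256))   # embeddings of segments in the mixture

    scores = np.array([cosine(concept_emb, s) for s in segment_embs])
    selected = np.where(scores > scores.mean() + scores.std())[0]
    print("segments attributed to the target concept:", selected)
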
Audio Difference Captioning Utilizing Similarity-Discrepancy Disentanglement
We propose Audio Difference Captioning (ADC), a new extension of audio captioning for describing the semantic differences between input pairs of similar but slightly different audio clips. ADC addresses the problem that conventional audio captioning sometimes generates similar captions for similar audio clips and thus fails to describe the differences in content. We also propose a cross-attention-concentrated transformer encoder that extracts differences by comparing a pair of audio clips, together with a similarity-discrepancy disentanglement that emphasizes the differences in the latent space. To evaluate the proposed methods, we built the AudioDiffCaps dataset, which consists of pairs of similar but slightly different audio clips with human-annotated descriptions of their differences. Experiments with the AudioDiffCaps dataset showed that the proposed methods solve the ADC task effectively and, as verified by visualizing the attention weights in the transformer encoder, improve those weights for extracting the differences.
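
One generic reading of "similarity-discrepancy disentanglement" is to split each clip's latent into a shared part that is pulled together across the pair and a difference part that is kept apart. The PyTorch sketch below implements only that generic reading as an illustration; it is not necessarily the loss used in the paper, and the latent split and margin are assumptions.

    # A generic similarity-discrepancy style loss for a pair of latents
    # (illustration of the idea only; not necessarily the paper's loss).
    import torch
    import torch.nn.functional as F

    def sim_disc_loss(z1, z2, margin=1.0):
        d = z1.shape[-1] // 2
        s1, r1 = z1[..., :d], z1[..., d:]   # "similarity" / "discrepancy" halves
        s2, r2 = z2[..., :d], z2[..., d:]
        pull = F.mse_loss(s1, s2)                                 # shared content should match
        gap = torch.norm(r1 - r2, dim=-1)
        push = torch.clamp(margin - gap, min=0).pow(2).mean()     # differences should stay apart
        return pull + push

    z1 = torch.randn(8, 128, requires_grad=True)   # latents of clip A (placeholder)
    z2 = torch.randn(8, 128, requires_grad=True)   # latents of clip B (placeholder)
    loss = sim_disc_loss(z1, z2)
    loss.backward()
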
Masked Modeling Duo for Speech: Specializing General-Purpose Audio Representation to Speech using Denoising Distillation
General-purpose audio representations learned by self-supervised learning have demonstrated high performance in a variety of tasks. Although they can be optimized for an application by fine-tuning, even higher performance can be expected if they are specialized for the application during pre-training. This paper explores the challenges and solutions in specializing general-purpose audio representations for a specific application, using speech, a highly demanding field, as an example. We enhance Masked Modeling Duo (M2D), a general-purpose model, to close the performance gap with state-of-the-art (SOTA) speech models. To do so, we propose a new task, denoising distillation, for learning from fine-grained clustered features, and M2D for Speech (M2D-S), which jointly learns the denoising distillation task and the M2D masked prediction task. Experimental results show that M2D-S performs comparably to or outperforms SOTA speech models on the SUPERB benchmark, demonstrating that M2D can be specialized to a demanding field. Our code is available at: https://github.com/nttcslab/m2d/tree/master/speech
Masked Modeling Duo: Learning Representations by Encouraging Both Networks to Model the Input
The masked autoencoder is a simple yet powerful self-supervised learning method. However, it learns representations indirectly by reconstructing masked input patches. Several methods learn representations directly by predicting the representations of masked patches; however, we argue that using all patches to encode the training-signal representations is suboptimal. We propose a new method, Masked Modeling Duo (M2D), that learns representations directly while obtaining training signals using only the masked patches. In M2D, the online network encodes the visible patches and predicts the representations of the masked patches, while the target network, a momentum encoder, encodes the masked patches. To predict the target representations well, the online network must model the input well, and the target network must also model it well so that its outputs agree with the online predictions. The learned representations should then model the input better. We validated M2D by learning general-purpose audio representations, and M2D set new state-of-the-art performance on tasks such as UrbanSound8K, VoxCeleb1, AudioSet20K, GTZAN, and SpeechCommandsV2. We additionally validate the effectiveness of M2D for images using ImageNet-1K in the appendix.
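
The training loop described above, with an online network that encodes only the visible patches and predicts the masked ones, and a momentum target that encodes only the masked patches, can be sketched with toy linear encoders as below. This is a deliberate simplification (no transformer, no per-position prediction) and not the released nttcslab/m2d code.

    # Toy sketch of the M2D objective: the online network sees only visible patches
    # and predicts representations of masked patches encoded by a momentum target.
    # (Simplified with linear encoders; not the released nttcslab/m2d implementation.)
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    patch_dim, emb_dim, n_patches, n_masked = 256, 128, 16, 12

    online_enc = nn.Linear(patch_dim, emb_dim)
    predictor = nn.Linear(emb_dim, emb_dim)
    target_enc = nn.Linear(patch_dim, emb_dim)
    target_enc.load_state_dict(online_enc.state_dict())
    for p in target_enc.parameters():
        p.requires_grad_(False)

    patches = torch.randn(4, n_patches, patch_dim)            # batch of patchified inputs
    perm = torch.randperm(n_patches)
    masked_idx, visible_idx = perm[:n_masked], perm[n_masked:]

    online_out = online_enc(patches[:, visible_idx])           # encode visible patches only
    pred = predictor(online_out.mean(dim=1, keepdim=True))     # (toy) prediction of masked reps
    with torch.no_grad():
        target = target_enc(patches[:, masked_idx])            # encode masked patches only

    loss = F.mse_loss(
        F.normalize(pred.expand_as(target), dim=-1),
        F.normalize(target, dim=-1),
    )
    loss.backward()
    # The target encoder is then updated as an exponential moving average of the online encoder.
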