Catalogue Search | MBRL
13,126 result(s) for "CPUs"
Fast and accurate long-read assembly with wtdbg2
2020
Existing long-read assemblers require thousands of central processing unit hours to assemble a human genome and are being outpaced by sequencing technologies in terms of both throughput and cost. We developed a long-read assembler wtdbg2 (https://github.com/ruanjue/wtdbg2) that is 2–17 times as fast as published tools while achieving comparable contiguity and accuracy. It paves the way for population-scale long-read assembly in future.
Wtdbg2 assembles genomes with comparable contiguity and accuracy to existing tools using long-read sequencing data, and is several times faster, especially for large genomes.
Journal Article
An analog-AI chip for energy-efficient speech recognition and transcription
2023
Models of artificial intelligence (AI) that have billions of parameters can achieve high accuracy across a range of tasks [1,2], but they exacerbate the poor energy efficiency of conventional general-purpose processors, such as graphics processing units or central processing units. Analog in-memory computing (analog-AI) [3–7] can provide better energy efficiency by performing matrix–vector multiplications in parallel on ‘memory tiles’. However, analog-AI has yet to demonstrate software-equivalent (SW_eq) accuracy on models that require many such tiles and efficient communication of neural-network activations between the tiles. Here we present an analog-AI chip that combines 35 million phase-change memory devices across 34 tiles, massively parallel inter-tile communication and analog, low-power peripheral circuitry that can achieve up to 12.4 tera-operations per second per watt (TOPS/W) chip-sustained performance. We demonstrate fully end-to-end SW_eq accuracy for a small keyword-spotting network and near-SW_eq accuracy on the much larger MLPerf [8] recurrent neural-network transducer (RNNT), with more than 45 million weights mapped onto more than 140 million phase-change memory devices across five chips.
A low-power chip that runs AI models using analog rather than digital computation shows comparable accuracy on speech-recognition tasks but is more than 14 times as energy efficient.
Journal Article
Optimizing high-resolution Community Earth System Model on a heterogeneous many-core supercomputing platform
2020
With semiconductor technology gradually approaching its physical and thermal limits, recent supercomputers have adopted major architectural changes to continue increasing the performance through more power-efficient heterogeneous many-core systems. Examples include Sunway TaihuLight that has four management processing elements (MPEs) and 256 computing processing elements (CPEs) inside one processor and Summit that has two central processing units (CPUs) and six graphics processing units (GPUs) inside one node. Meanwhile, current high-resolution Earth system models that desperately require more computing power generally consist of millions of lines of legacy code developed for traditional homogeneous multicore processors and cannot automatically benefit from the advancement of supercomputer hardware. As a result, refactoring and optimizing the legacy models for new architectures become key challenges along the road of taking advantage of greener and faster supercomputers, providing better support for the global climate research community and contributing to the long-lasting societal task of addressing long-term climate change. This article reports the efforts of a large group in the International Laboratory for High-Resolution Earth System Prediction (iHESP) that was established by the cooperation of Qingdao Pilot National Laboratory for Marine Science and Technology (QNLM), Texas A&M University (TAMU), and the National Center for Atmospheric Research (NCAR), with the goal of enabling highly efficient simulations of the high-resolution (25 km atmosphere and 10 km ocean) Community Earth System Model (CESM-HR) on Sunway TaihuLight. The refactoring and optimizing efforts have improved the simulation speed of CESM-HR from 1 SYPD (simulation years per day) to 3.4 SYPD (with output disabled) and supported several hundred years of pre-industrial control simulations. 
With further strategies on deeper refactoring and optimizing for remaining computing hotspots, as well as redesigning architecture-oriented algorithms, we expect an equivalent or even better efficiency to be gained on the new platform than traditional homogeneous CPU platforms. The refactoring and optimizing processes detailed in this paper on the Sunway system should have implications for similar efforts on other heterogeneous many-core systems such as GPU-based high-performance computing (HPC) systems.
Journal Article
BBMerge – Accurate paired shotgun read merging via overlap
2017
Merging paired-end shotgun reads generated on high-throughput sequencing platforms can substantially improve various subsequent bioinformatics processes, including genome assembly, binning, mapping, annotation, and clustering for taxonomic analysis. With the inexorable growth of sequence data volume and CPU core counts, the speed and scalability of read-processing tools become ever more important. The accuracy of shotgun read merging is crucial as well, as errors introduced by incorrect merging percolate through to reduce the quality of downstream analysis. Thus, we designed a new tool to maximize accuracy and minimize processing time, allowing the use of read merging on larger datasets, and in analyses highly sensitive to errors. We present BBMerge, a new merging tool for paired-end shotgun sequence data. We benchmark BBMerge by comparison with eight other widely used merging tools, assessing speed, accuracy and scalability. Evaluations of both synthetic and real-world datasets demonstrate that BBMerge produces merged shotgun reads with greater accuracy and at higher speed than any existing merging tool examined. BBMerge also provides the ability to merge non-overlapping shotgun read pairs by using k-mer frequency information to assemble the unsequenced gap between reads, achieving a significantly higher merge rate while maintaining or increasing accuracy.
Journal Article
GhostNets on Heterogeneous Devices via Cheap Operations
by Xu Chunjing; Wu, Enhua; Chang, Xu
in Artificial neural networks; Central processing units; CPUs
2022
Deploying convolutional neural networks (CNNs) on mobile devices is difficult due to the limited memory and computation resources. We aim to design efficient neural networks for heterogeneous devices including CPU and GPU, by exploiting the redundancy in feature maps, which has rarely been investigated in neural architecture design. For CPU-like devices, we propose a novel CPU-efficient Ghost (C-Ghost) module to generate more feature maps from cheap operations. Based on a set of intrinsic feature maps, we apply a series of linear transformations with cheap cost to generate many ghost feature maps that could fully reveal information underlying intrinsic features. The proposed C-Ghost module can be taken as a plug-and-play component to upgrade existing convolutional neural networks. C-Ghost bottlenecks are designed to stack C-Ghost modules, and then the lightweight C-GhostNet can be easily established. We further consider the efficient networks for GPU devices. Without involving too many GPU-inefficient operations (e.g., depth-wise convolution) in a building stage, we propose to utilize the stage-wise feature redundancy to formulate GPU-efficient Ghost (G-Ghost) stage structure. The features in a stage are split into two parts where the first part is processed using the original block with fewer output channels for generating intrinsic features, and the other are generated using cheap operations by exploiting stage-wise redundancy. Experiments conducted on benchmarks demonstrate the effectiveness of the proposed C-Ghost module and the G-Ghost stage. C-GhostNet and G-GhostNet can achieve the optimal trade-off of accuracy and latency for CPU and GPU, respectively. MindSpore code is available at https://gitee.com/mindspore/models/pulls/1809, and PyTorch code is available at https://github.com/huawei-noah/CV-Backbones.
Journal Article
Carbon Compromises: Minimising the Carbon-Cost and Power-Use of Grid Sites
by Skipsey, Sam; Spiteri, Dwayne; Borbely, Albert
in Carbon; Central processing units; Computational grids
2025
This paper presents an overview of ways to reduce the carbon-impact of the Worldwide LHC Computing Grid (WLCG), along with specific results demonstrating the potential benefits of CPUs based on the ARM architecture. The choice of CPUs during procurement depends on balancing costs, performance, and the carbon-impact (both embodied and operational), and decisions are made within the context of site, funding agency, and national preferences or policies. To aid this, we have developed a web-based utility that allows comparison of various options once actual costs have been established. Operating CPUs at frequencies lower than their maximum can improve their carbon-efficiency, doing the same work for less power/carbon but more slowly. In this paper we show that the effect is quite significant for CPUs based on the ARM architecture and discuss the compromises involved.
Journal Article
Distance-based protein folding powered by deep learning
Direct coupling analysis (DCA) for protein folding has made very good progress, but it is not effective for proteins that lack many sequence homologs, even coupled with time-consuming conformation sampling with fragments. We show that we can accurately predict interresidue distance distribution of a protein by deep learning, even for proteins with ∼60 sequence homologs. Using only the geometric constraints given by the resulting distance matrix we may construct 3D models without involving extensive conformation sampling. Our method successfully folded 21 of the 37 CASP12 hard targets with a median family size of 58 effective sequence homologs within 4 h on a Linux computer of 20 central processing units. In contrast, DCA-predicted contacts cannot be used to fold any of these hard targets in the absence of extensive conformation sampling, and the best CASP12 group folded only 11 of them by integrating DCA-predicted contacts into fragment-based conformation sampling. Rigorous experimental validation in CASP13 shows that our distance-based folding server successfully folded 17 of 32 hard targets (with a median family size of 36 sequence homologs) and obtained 70% precision on the top L/5 long-range predicted contacts. The latest experimental validation in CAMEO shows that our server predicted correct folds for 2 membrane proteins while all of the other servers failed. These results demonstrate that it is now feasible to predict the correct fold for many more proteins that lack similar structures in the Protein Data Bank, even on a personal computer.
Journal Article
Single-chip microprocessor that communicates directly using light
by Lee, Yunsup; Wade, Mark T.; Georgas, Michael S.
in 639/166/987; 639/624/1075/1079; 639/624/399/1099
2015
An electronic–photonic microprocessor chip manufactured using a conventional microelectronics foundry process is demonstrated; the chip contains 70 million transistors and 850 photonic components and directly uses light to communicate to other chips.
Chips with everything
The rapid transfer of data between chips in computer systems and data centres has become one of the bottlenecks in modern information processing. One way of increasing speeds is to use optical connections rather than electrical wires and the past decade has seen significant efforts to develop silicon-based nanophotonic approaches to integrate such links within silicon chips, but incompatibility between the manufacturing processes used in electronics and photonics has proved a hindrance. Now Chen Sun et al. describe a 'system on a chip' microprocessor that successfully integrates electronics and photonics yet is produced using standard microelectronic chip fabrication techniques. The resulting microprocessor combines 70 million transistors and 850 photonic components and can communicate optically with the outside world. This result promises a way forward for new fast, low-power computing systems architectures.
Data transport across short electrical wires is limited by both bandwidth and power density, which creates a performance bottleneck for semiconductor microchips in modern computer systems, from mobile phones to large-scale data centres. These limitations can be overcome [1,2,3] by using optical communications based on chip-scale electronic–photonic systems [4,5,6,7] enabled by silicon-based nanophotonic devices [8]. However, combining electronics and photonics on the same chip has proved challenging, owing to microchip manufacturing conflicts between electronics and photonics. Consequently, current electronic–photonic chips [9,10,11] are limited to niche manufacturing processes and include only a few optical devices alongside simple circuits. Here we report an electronic–photonic system on a single chip integrating over 70 million transistors and 850 photonic components that work together to provide logic, memory, and interconnect functions. This system is a realization of a microprocessor that uses on-chip photonic devices to directly communicate with other chips using light. To integrate electronics and photonics at the scale of a microprocessor chip, we adopt a ‘zero-change’ approach to the integration of photonics. Instead of developing a custom process to enable the fabrication of photonics [12], which would complicate or eliminate the possibility of integration with state-of-the-art transistors at large scale and at high yield, we design optical devices using a standard microelectronics foundry process that is used for modern microprocessors [13,14,15,16]. This demonstration could represent the beginning of an era of chip-scale electronic–photonic systems with the potential to transform computing system architectures, enabling more powerful computers, from network infrastructure to data centres and supercomputers.
Journal Article
A natively flexible 32-bit Arm microprocessor
by Williamson, Ken; Biggs, John; Ramsdale, Catherine
in 639/166/987; 639/301/1005/1007; 639/766/1130/2798
2021
Nearly 50 years ago, Intel created the world’s first commercially produced microprocessor: the 4004 (ref. 1), a modest 4-bit CPU (central processing unit) with 2,300 transistors fabricated using 10 μm process technology in silicon and capable only of simple arithmetic calculations. Since this ground-breaking achievement, there has been continuous technological development with increasing sophistication to the stage where state-of-the-art silicon 64-bit microprocessors now have 30 billion transistors (for example, the AWS Graviton2 (ref. 2) microprocessor, fabricated using 7 nm process technology). The microprocessor is now so embedded within our culture that it has become a meta-invention; that is, it is a tool that allows other inventions to be realized, most recently enabling the big data analysis needed for a COVID-19 vaccine to be developed in record time. Here we report a 32-bit Arm (a reduced instruction set computing (RISC) architecture) microprocessor developed with metal-oxide thin-film transistor technology on a flexible substrate (which we call the PlasticARM). Separate from the mainstream semiconductor industry, flexible electronics operate within a domain that seamlessly integrates with everyday objects through a combination of ultrathin form factor, conformability, extreme low cost and potential for mass-scale production. PlasticARM pioneers the embedding of billions of low-cost, ultrathin microprocessors into everyday objects.
Flexible electronic platforms would enable the integration of functional electronic circuitry with many everyday objects; here, a low-cost and fully flexible 32-bit microprocessor is produced.
Journal Article
Power-efficient neural network with artificial dendrites
by Joshua, Yang J; Zhang Qingtian; Zhang, Wenqiang
in Application specific integrated circuits; Artificial neural networks; Background noise
2020
In the nervous system, dendrites, branches of neurons that transmit signals between synapses and soma, play a critical role in processing functions, such as nonlinear integration of postsynaptic signals. The lack of these critical functions in artificial neural networks compromises their performance, for example in terms of flexibility, energy efficiency and the ability to handle complex tasks. Here, by developing artificial dendrites, we experimentally demonstrate a complete neural network fully integrated with synapses, dendrites and soma, implemented using scalable memristor devices. We perform a digit recognition task and simulate a multilayer network using experimentally derived device characteristics. The power consumption is more than three orders of magnitude lower than that of a central processing unit and 70 times lower than that of a typical application-specific integrated circuit chip. This network, equipped with functional dendrites, shows the potential of substantial overall performance improvement, for example by extracting critical information from a noisy background with significantly reduced power consumption and enhanced accuracy.
A memristor-based artificial dendrite enables the neural network to perform high-accuracy computation tasks with reduced power consumption.
Journal Article