New Algorithm Makes Use of Online Learning for Large Cell Datasets

Scientists have long been trying to identify the many types of cells in our organs that contribute to our health.

A recent technique called ‘single-cell sequencing’ allows researchers to identify and classify cell types by features such as which genes they express. However, this kind of research generates large amounts of data, with datasets of millions of cells.

A new algorithm developed by Joshua Welch, Ph.D., of the Department of Computational Medicine and Bioinformatics, Ph.D. candidate Chao Gao and their team uses online learning, which speeds up the process and gives researchers around the world a way to analyze large datasets using the memory found on a standard laptop. The findings are described in the journal Nature Biotechnology.

‘Our technique allows anyone with a computer to perform analyses at the scale of an entire organism,’ says Welch. ‘That’s really what the field is moving towards.’

The team demonstrated proof of principle using datasets from the National Institutes of Health’s BRAIN Initiative, a project that aims to understand the human brain by mapping every cell and that involves investigative teams all over the country, including Welch’s lab.

Welch explains that in projects like this, each newly submitted single-cell dataset typically has to be re-analyzed together with all the previous datasets, in the order they arrived. The new approach lets new datasets be added to the existing ones without reprocessing the older data. It also allows researchers to break up datasets into ‘mini-batches’ to reduce the amount of memory needed to process them.
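
The idea of updating a model one mini-batch at a time, and folding in a later dataset without revisiting earlier ones, can be illustrated with a small sketch. The example below is not the team's published algorithm; it uses online k-means clustering as a simple stand-in, and every name in it (`OnlineKMeans`, `partial_fit`, `mini_batches`, the synthetic two-gene "cells") is a hypothetical choice made for illustration only.

```python
import random


def mini_batches(cells, batch_size):
    """Yield consecutive slices so only one small batch is processed at a time."""
    for start in range(0, len(cells), batch_size):
        yield cells[start:start + batch_size]


def nearest(centroids, cell):
    """Return the index of the centroid closest to a cell (squared Euclidean distance)."""
    return min(
        range(len(centroids)),
        key=lambda k: sum((c - x) ** 2 for c, x in zip(centroids[k], cell)),
    )


class OnlineKMeans:
    """Cluster centroids that are updated incrementally, one mini-batch at a time."""

    def __init__(self, initial_centroids):
        self.centroids = [list(c) for c in initial_centroids]
        self.counts = [0] * len(self.centroids)

    def partial_fit(self, batch):
        """Update the centroids using only the cells in this batch."""
        for cell in batch:
            k = nearest(self.centroids, cell)
            self.counts[k] += 1
            # A 1/count step size makes each centroid the running mean
            # of every cell assigned to it so far.
            rate = 1.0 / self.counts[k]
            self.centroids[k] = [
                c + rate * (x - c) for c, x in zip(self.centroids[k], cell)
            ]
        return self

    def predict(self, cells):
        """Assign each cell to its nearest centroid."""
        return [nearest(self.centroids, cell) for cell in cells]


def make_cells(rng, center, n):
    """Synthetic 'cells': noisy expression values for two made-up genes."""
    return [[rng.gauss(mu, 0.5) for mu in center] for _ in range(n)]


rng = random.Random(0)

# Dataset A: the first submission, containing two synthetic cell types.
dataset_a = make_cells(rng, [0.0, 5.0], 200) + make_cells(rng, [5.0, 0.0], 200)
rng.shuffle(dataset_a)

# Dataset B: arrives later from another (hypothetical) lab.
dataset_b = make_cells(rng, [0.0, 5.0], 100) + make_cells(rng, [5.0, 0.0], 100)
rng.shuffle(dataset_b)

model = OnlineKMeans(initial_centroids=[[1.0, 4.0], [4.0, 1.0]])

# Process dataset A in mini-batches of 50 cells.
for batch in mini_batches(dataset_a, batch_size=50):
    model.partial_fit(batch)

# Fold in dataset B without touching dataset A again.
for batch in mini_batches(dataset_b, batch_size=50):
    model.partial_fit(batch)

print(model.centroids)
print(model.predict([[0.0, 5.0], [5.0, 0.0]]))
```

In this sketch the model's state (the centroids and their counts) is all that carries over between batches, so memory use depends on the batch size rather than on the total number of cells seen.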

‘This is crucial for the sets increasingly generated with millions of cells,’ says Welch. ‘This year, there have been five to six papers with two million cells or more and the amount of memory you need just to store the raw data is significantly more than anyone has on their computer.’

Welch compared the online technique to the continuous data processing done by social media platforms such as Twitter and Facebook, which must constantly process user-generated data and surface relevant posts in people’s feeds. ‘Here, instead of people writing tweets, we have labs around the world performing experiments and releasing their data.’

This research has the potential to make projects such as the Human Body Map and Human Cell Atlas more efficient. ‘Understanding the normal complement of cells in the body is the first step towards understanding how they go wrong in disease,’ says Welch.

By Marvellous Iwendi.

Source: M Health Lab