Machine learning techniques used by thousands of scientists to analyze data are producing misleading and often completely wrong results.
According to a February 16 report on the BBC website, Dr. Genevera Allen of Rice University in Houston said that the increasing use of such systems was leading to a "scientific crisis".
She warned that if scientists did not improve their techniques, they would waste both time and money. She presented her findings at the meeting of the American Association for the Advancement of Science in Washington.
More and more scientific research uses machine learning software to analyze collected data, a trend that spans many disciplines, from biomedical research to astronomy.
These data sets are very large and expensive to collect. Yet Dr. Allen said the answers they yield may be inaccurate or outright wrong, because the software picks out patterns that exist only in the data set, not in the real world.
She said: "These studies are often found to be inaccurate only when another really large data set appears, and someone applying those techniques to it sighs: 'Oh my god, the results of these two studies do not agree.'"
"It is now widely recognized that there is a reproducibility crisis in science," she said. "I dare say this is largely due to the use of machine learning techniques in scientific research."
The "reproducibility crisis" in science refers to the alarming number of research results that cannot be reproduced when another group of scientists repeats the same experiment, meaning the initial result was wrong. One analysis suggests that as much as 85% of biomedical research worldwide is wasted effort.
The crisis has been deepening for two decades. It stems from experiments that are not designed carefully enough to prevent scientists from deceiving themselves and seeing the results they want to see.
Allen said that the use of machine learning systems and large data sets has accelerated the crisis. Machine learning algorithms are developed precisely to find interesting things in data sets, so when they search through huge amounts of data, they will inevitably find a pattern.
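The point can be made concrete with a small simulation (my own illustration, not from the article or Dr. Allen's work): if an algorithm screens enough features of pure random noise, one of them will always look strongly related to the outcome, and the "discovery" then vanishes on fresh data.

```python
import random
import statistics

random.seed(0)
n_samples, n_features = 100, 2000

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) * sx * sy)

# Random features and a random outcome: there is no real relationship anywhere.
X = [[random.gauss(0, 1) for _ in range(n_samples)] for _ in range(n_features)]
y = [random.gauss(0, 1) for _ in range(n_samples)]

# The search "discovers" the feature most correlated with the outcome.
best = max(range(n_features), key=lambda j: abs(corr(X[j], y)))
in_sample = abs(corr(X[best], y))

# The same feature against a fresh random outcome: the pattern evaporates.
y_new = [random.gauss(0, 1) for _ in range(n_samples)]
out_of_sample = abs(corr(X[best], y_new))

print(f"in-sample correlation of 'best' feature: {in_sample:.2f}")
print(f"same feature on new data:                {out_of_sample:.2f}")
```

With 2,000 candidate features and only 100 samples, the selected feature typically shows a correlation of around 0.4 in-sample, purely by chance, which is exactly the kind of pattern that fails to hold in the real world.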
She told BBC reporters: "The question is, can we really believe those findings?"

"Do those findings really represent genuine scientific discoveries? Are they reproducible? If we took another data set, would we find the same scientific discovery or principle in it? Unfortunately, the answer is often no."
Dr. Allen is working with a biomedical research team at Baylor College of Medicine in Houston to improve the reliability of their results. She is developing next-generation machine learning and statistical techniques that can not only sift through large amounts of data to make discoveries, but also report how uncertain those results are and how likely they are to be reproducible.
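One simple idea in this spirit is a stability check (a minimal sketch of my own, not Dr. Allen's actual method): rerun the discovery procedure on many random subsamples of the data and report how often the same feature is selected. A finding that depends on the particular sample is unlikely to reproduce.

```python
import random
import statistics

random.seed(1)
n, p, n_resamples = 200, 50, 50

def corr(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.pstdev(xs), statistics.pstdev(ys)
    return sum((a - mx) * (b - my) for a, b in zip(xs, ys)) / (len(xs) * sx * sy)

# One genuinely predictive feature (index 0) plus pure-noise features.
X = [[random.gauss(0, 1) for _ in range(n)] for _ in range(p)]
y = [x0 + random.gauss(0, 1) for x0 in X[0]]

def best_feature(idx):
    """Feature most correlated with y on the subsample given by idx."""
    return max(range(p), key=lambda j: abs(corr([X[j][i] for i in idx],
                                                [y[i] for i in idx])))

# Selection frequency over random half-samples serves as a stability report:
# a real signal is picked again and again; a chance pattern is not.
picks = [best_feature(random.sample(range(n), n // 2))
         for _ in range(n_resamples)]
stability = picks.count(0) / n_resamples
print(f"true feature selected in {stability:.0%} of resamples")
```

Reporting a selection frequency alongside the discovery itself gives reviewers and other labs a first, cheap indication of whether the result is likely to survive a second data set.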
"Collecting those huge data sets is very expensive," she said. "I tell scientists that my approach may mean it takes longer to publish your results, but in the end your results will stand the test of time."
"This will save scientists money, and it is also important for advancing science, because it avoids heading off in directions that may well be wrong."