The amount of scientific data available to researchers and society is increasing at a rapid pace. Modern data acquisition methods and technologies can almost literally bury us under thick layers of new data generated daily and even hourly.
Undoubtedly, this trend is also supported by the continuous advancement of computational technology. But thanks to computers and the modern software tools they run, scientists and engineers are also gaining novel and efficient means of coping with vast amounts of information.
Astronomy is no exception. Telescopes, satellites, detectors and various measurement devices provide new capabilities that enable astronomers to collect hundreds of terabytes of data. According to current predictions, over the next decade the volume of data should reach the petabyte domain, and that will surely pose formidable challenges for those seeking to handle such abundant and complex data sets.
In a scientific paper that appeared on arXiv.org, a team of scientists from the University of Washington and the Georgia Institute of Technology, USA, introduced this problem of ‘data abundance’ as a strong motivation for the development of new data mining, machine learning and knowledge discovery tools. In their latest work they describe the development and testing of astroML, a Python package which, as the authors say, is developed for “extracting knowledge from data, where ‘knowledge’ means a quantitative summary of data behavior, and ‘data’ essentially means results of measurements”.
In essence, astroML is a software tool whose creators adapted several readily available data processing techniques and packaged them as open-source code, supplementing them with algorithms specific to the field of astronomy. As the authors note, astroML is intended to serve two main purposes: to act as an open repository for those who seek to develop statistical routines commonly used in astronomy (using the Python programming environment), and to provide examples of astrophysical data analysis that leverage techniques developed in the fields of statistics and machine learning.
The authors emphasize that the examples detailed in their paper (such as regression and model fitting, density estimation, estimation of data dimensionality, and time series analysis) are just a small portion of the methods implemented in the codebase of their new software tool. In this lightweight package, the developers did not attempt to duplicate routines available in other well-known open-source libraries; instead, they prioritized maintaining a smaller codebase by incorporating existing tools and packages when available.
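To give a feel for what “regression and model fitting” means in its simplest form, here is a minimal sketch of an ordinary least-squares straight-line fit. It is written in plain Python and does not use astroML or reproduce any of its routines; the function name `fit_line` and the sample measurements are hypothetical, chosen only for illustration.

```python
# Illustrative only: a plain-Python least-squares line fit.
# This is NOT astroML's API; fit_line and the data below are hypothetical.

def fit_line(x, y):
    """Return (slope, intercept) minimizing the sum of squared residuals."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Spread of x around its mean; zero means a vertical line, which has no slope.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    # Co-variation of x and y around their means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up measurements lying exactly on the line y = 2x + 1.
x_obs = [0.0, 1.0, 2.0, 3.0]
y_obs = [1.0, 3.0, 5.0, 7.0]
slope, intercept = fit_line(x_obs, y_obs)
print(f"slope={slope:.3f}, intercept={intercept:.3f}")
```

Real astronomical measurements come with uncertainties and outliers, which is where the more elaborate statistical methods described in the paper become relevant.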
The astroML package is publicly available and includes dataset loaders, statistical tools and hundreds of example scripts.
Written by Alius Noreika