Evolving Data-driven science: the unprecedented coherence of Big Data, HPC, and informatics, and crossing the next chasms – Invited Greg Leptoukh Lecture


As we approach the AGU Centenary, we celebrate the successes of data-driven science whilst looking anxiously at the future, considering the aspects of hardware, software, workflow and interconnectedness that need further attention.

The colocation of scientific datasets with HPC/cloud compute has demonstrably supercharged our research productivity. Over time we have questioned whether to “bring data to the compute” or “compute to the data”, and have considered and reconsidered the benefits, weaknesses and challenges of each, both technical and social. The gap between how large-volume data and long-tail data are managed is steadily closing, and the standards for interoperability and connectivity between scientific fields have been slowly maturing. In many cases transdisciplinary science is now a reality.

However, computing technology is no longer advancing according to Moore’s law (and equivalents) and is evolving in unexpected ways. For some major computational software codes, these technology changes are forcing us to reconsider the development strategy: how to transition existing code so that it both delivers the scientific improvements in capability that are needed and adjusts more readily to changes in the underlying technical infrastructure. In doing so, some old assumptions about data precision and reproducibility are being reconsidered. Quantum computing is now on the horizon, which will require further consideration of software and data access mechanisms.

In data management, despite the apparent value and opportunity, the demands on high-quality datasets that can be used for new data-driven methods are currently testing the funding/business case and overall value proposition for celebrated open data and its FAIRness. Powerful new technologies such as AI and deep learning have a voracious appetite for big data and much stronger (and unappreciated) requirements around data quality, information management, connectivity and persistence. These technologies are evolving at the same time as ubiquitous IoT, fog computing, and blockchain pipelines have emerged, creating even more complexity and potential hypercoherence issues.

In this talk I will discuss the journey so far in data-intensive computational science, and consider the chasms we have yet to cross.