Nearly every sector is proposing the use of artificial intelligence (AI). Materials science R&D is relatively late to this trend, and there are many industry-specific hurdles, but the opportunities are beginning to be realized and the potential impact is significant.
Materials informatics is the use of data-centric approaches for materials science discovery and development. It is principally enabled by improved data infrastructures and machine learning solutions, and it is set to be a paradigm shift in the way researchers conduct R&D projects. A discussion of why adoption is happening now can be found in a previous article. At this key moment of initial commercial adoption, IDTechEx has released the most comprehensive technical market report on the topic, “Materials Informatics 2020-2030”.
Materials informatics can be used at every stage of an experimental process as outlined in the infographic below. There are multiple potential advantages including identifying new species or relationships, extracting value from existing data, and generating use-case IP on existing compounds, but in most cases it is all about accelerating the time to market and providing a competitive advantage.
Quantifying this accelerated time to market is difficult, but it is essential for external companies to demonstrate it in order to justify any investment. Many cite examples of reducing millions of candidates and/or thousands of experiments to a more manageable hundreds, or even tens, of solutions or iterations.
A key concept is “inverse design”. In simple terms, this can involve training a model that takes target properties as input and proposes formulations, compositions, process parameters, or other design variables as output. The properties do not have to be physical; they could also include cost, toxicity, geographic availability, and more. The technology is applicable to anyone who designs materials or designs with materials, and one aim is to have this inverse design fully integrated with initial product design. This has been most effectively shown by the collaboration between Citrine Informatics and Siemens. It was stated that they want designers to view material as one of their “degrees of freedom” and to allow materials companies to become “partners not vendors”.
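To make the idea concrete, the following is a minimal sketch of inverse design as a search problem, not the method used by any company named above. It assumes a hypothetical three-component blend, invented property values, and a toy linear "rule of mixtures" forward model; a real project would replace `forward_model` with a trained machine learning model. The names `COMPONENTS`, `forward_model`, `candidate_formulations`, and `inverse_design` are all illustrative.

```python
from itertools import product

# Hypothetical per-component data for a three-component blend (invented values).
COMPONENTS = {
    "A": {"strength": 100.0, "cost": 10.0},
    "B": {"strength": 60.0, "cost": 4.0},
    "C": {"strength": 30.0, "cost": 1.0},
}


def forward_model(fractions):
    """Toy forward model: each property is a linear rule of mixtures."""
    return {
        prop: sum(fractions[name] * COMPONENTS[name][prop] for name in COMPONENTS)
        for prop in ("strength", "cost")
    }


def candidate_formulations(step_percent=5):
    """Enumerate every blend on a coarse grid whose fractions sum to 1."""
    levels = range(0, 101, step_percent)
    for a, b in product(levels, repeat=2):
        if a + b <= 100:
            yield {"A": a / 100, "B": b / 100, "C": (100 - a - b) / 100}


def inverse_design(targets, weights):
    """Return the candidate whose predicted properties best match the targets."""

    def mismatch(fractions):
        predicted = forward_model(fractions)
        # Weighted sum of squared relative errors across all target properties.
        return sum(
            weights[p] * ((predicted[p] - targets[p]) / targets[p]) ** 2
            for p in targets
        )

    return min(candidate_formulations(), key=mismatch)


# Properties go in (including a non-physical one, cost); a formulation comes out.
targets = {"strength": 59.0, "cost": 4.3}
best = inverse_design(targets, weights={"strength": 1.0, "cost": 1.0})
predicted = forward_model(best)
print(best, predicted)
```

The exhaustive grid search is only practical for a handful of variables; it is used here because it keeps the direction of the problem (properties in, formulation out) easy to see.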
For clarity, materials informatics (MI) is not to be confused with computational simulation (e.g. DFT calculations). Materials modeling has seen major progress over the past few decades (led by the likes of BIOVIA and Schrödinger), and with the continual improvement in computing power this will only increase. The announcement between JSR Corporation and QSimulate is notable recent evidence of this. Simulation data can be used in the same way as input data from any physical experimentation. In fact, a common application of MI is to reduce the number of costly and time-consuming simulations, facilitating these research projects and drawing out novel relationships.
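One way to picture how a data-driven model can cut the number of simulations is surrogate-based screening, sketched below under stated assumptions. The function `expensive_simulation` is a hypothetical stand-in for a costly calculation, with an invented toy response; the surrogate is a simple inverse-distance-weighted average chosen for brevity, where a real workflow would more likely use a trained regression model.

```python
def expensive_simulation(x):
    """Hypothetical stand-in for a costly calculation such as a DFT run."""
    expensive_simulation.calls += 1
    return -(x - 0.62) ** 2  # invented toy response, best at x = 0.62


expensive_simulation.calls = 0


def surrogate(x, known):
    """Cheap inverse-distance-weighted estimate from already simulated points."""
    weighted = [(1.0 / ((x - xi) ** 2 + 1e-12), yi) for xi, yi in known.items()]
    total_weight = sum(w for w, _ in weighted)
    return sum(w * y for w, y in weighted) / total_weight


# 201 candidate compositions described by a single variable between 0 and 1.
candidates = [i / 200 for i in range(201)]

# Step 1: simulate a coarse subset to seed the surrogate.
known = {x: expensive_simulation(x) for x in candidates[::20]}

# Step 2: rank every remaining candidate with the cheap surrogate.
remaining = [x for x in candidates if x not in known]
shortlist = sorted(remaining, key=lambda x: surrogate(x, known), reverse=True)[:10]

# Step 3: spend the expensive simulations only on the shortlist.
for x in shortlist:
    known[x] = expensive_simulation(x)

best_x = max(known, key=known.get)
print(f"best candidate {best_x} after {expensive_simulation.calls} "
      f"simulations out of {len(candidates)} candidates")
```

In this sketch only the seed points and the shortlist are ever simulated, so the simulation budget is fixed in advance rather than growing with the number of candidates.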
The main problem is the limitations of the materials dataset. This is not like recognizing objects in autonomous vehicles or powering sophisticated internet search engines; materials science brings numerous specific problems. The data is typically sparse, high-dimensional, biased, and noisy, which means that understanding the uncertainty in the proposed output is essential; projecting out into the “unknown” is very challenging given the clustered, complex data.
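The point about uncertainty can be illustrated with a small bootstrap ensemble, offered here as one common technique rather than the approach of any particular vendor. The sketch assumes an invented, clustered, noisy dataset of eight points and a straight-line model; it refits the model on resampled data and uses the spread of the predictions as a rough uncertainty estimate, comparing a point inside the data cluster with one far outside it.

```python
import random
import statistics


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_mean, y_mean = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in points)
    slope = sxy / sxx
    return y_mean - slope * x_mean, slope


def bootstrap_predictions(points, x_new, n_models=200, seed=0):
    """Predict at x_new with models fitted to resampled copies of the data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    predictions = []
    while len(predictions) < n_models:
        sample = [rng.choice(points) for _ in points]
        if len({x for x, _ in sample}) < 2:
            continue  # a line cannot be fitted to a single distinct x value
        intercept, slope = fit_line(sample)
        predictions.append(intercept + slope * x_new)
    return predictions


# Invented small, clustered, noisy dataset: (process variable, measured property).
data = [(1.0, 2.1), (1.2, 2.3), (1.4, 3.1), (1.5, 2.9),
        (1.7, 3.6), (1.9, 3.7), (2.0, 4.3), (2.2, 4.2)]

near = bootstrap_predictions(data, x_new=1.6)  # inside the data cluster
far = bootstrap_predictions(data, x_new=6.0)   # far outside it

spread_near = statistics.stdev(near)
spread_far = statistics.stdev(far)
print(f"spread inside cluster: {spread_near:.3f}, far outside: {spread_far:.3f}")
```

The wider spread far from the measured points is the quantitative form of the warning above: a proposed candidate in unexplored territory should come with a correspondingly lower confidence.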
There are many approaches to dealing with small datasets. These could involve generating a dataset through high-throughput experimentation, leveraging external data repositories, and, most importantly, integrating domain knowledge.
Generating and leveraging data repositories is a core theme of materials informatics. There is a wide range of repositories, from the very bespoke to the more general, collecting published structures, properties, and other data. These are run by public or private organizations and, although they may have limitations (such as unknown confidence in the data and a bias towards “positive” published results), they can be an unparalleled source for training models or screening for candidates. In addition, large datasets open up the opportunity to use more sophisticated deep learning methods.