Synthetic data is artificially generated data that seeks to reproduce the characteristics of real-world datasets, with the aim of improving the training of highly complex AI systems. The availability, quality and diversity of data have been recurrent challenges in training highly complex AI and autonomous systems, and defence organizations are increasingly exploring the opportunities that synthetic data provides. The characteristics and potential benefits of synthetic data, along with proven applications of the technology in various sectors, make it a relevant topic for debates on the use of AI in the context of international security.

This UNIDIR Primer provides an overview of the main opportunities and limitations of synthetic data in the training of AI systems. While synthetic data can be a proxy for real-world data and help shorten training cycles, among other benefits, there are also significant risks and challenges associated with its use.

The Primer explores existing data challenges, both technical and organizational; introduces key technical characteristics and methods of generating synthetic data; and analyzes the implications of using synthetic data in the context of international security, including for autonomous systems and in the cyber realm.

Sponsor Organizations: The European Union; the governments of the Czech Republic, Germany, Italy, the Netherlands, and Switzerland; and Microsoft.

Citation: Harry Deng (2023). "Exploring Synthetic Data for Artificial Intelligence and Autonomous Systems: A Primer", UNIDIR, Geneva, Switzerland.