What is Synthetic Data?
Synthetic data is realistic, artificially generated data that captures the important relationships and information in real-world data. These datasets are produced by sampling from statistical or machine-learning models, such as Bayesian networks or neural networks, with domain experts validating the model’s output.1
For example, a model takes rows of real data as input and outputs rows of newly fabricated data that reflect the original data’s important characteristics. A human with subject matter expertise then checks that the model outputs are sensible. The end result is a completely new dataset that contains the important features of the original dataset without any real-world data being released.
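The real-rows-in, fabricated-rows-out idea can be illustrated with a deliberately simple stand-in for the Bayesian or neural network models mentioned above. The sketch below assumes a two-column dataset and fits a bivariate normal distribution to it, then samples new rows from that fitted distribution. The records, column meanings, and function names are hypothetical and chosen only for illustration; production generators are considerably more sophisticated.

```python
import math
import random

# Hypothetical "real" records: (age in years, systolic blood pressure in mmHg).
real_rows = [
    (34, 118), (45, 125), (52, 131), (61, 140),
    (29, 112), (70, 148), (48, 127), (57, 136),
]


def fit_bivariate_normal(rows):
    """Estimate the means, standard deviations, and correlation of two columns."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for _, y in rows) / (n - 1)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in rows) / (n - 1)
    sd_x, sd_y = math.sqrt(var_x), math.sqrt(var_y)
    return mean_x, mean_y, sd_x, sd_y, cov / (sd_x * sd_y)


def sample_synthetic(params, n_rows, seed=0):
    """Draw new rows from the fitted distribution; no real row is copied."""
    mean_x, mean_y, sd_x, sd_y, rho = params
    rng = random.Random(seed)
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n_rows):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mean_x + sd_x * z1
        # Mixing z1 into y reproduces the correlation seen in the real data.
        y = mean_y + sd_y * (rho * z1 + residual * z2)
        rows.append((round(x), round(y)))
    return rows


params = fit_bivariate_normal(real_rows)
synthetic_rows = sample_synthetic(params, n_rows=1000)
print(synthetic_rows[:5])
```

Only the fitted parameters, not the original rows, are used at sampling time. A domain expert would still need to review the output, since a normal distribution can produce values that are clinically implausible.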
The adoption of synthetic data in Real-World Evidence studies has the potential to drastically reduce timelines for ethics approvals and health system data requests, while also mitigating any patient privacy concerns. Given the demand for real-world data, and the resource constraints of data owners to meet this demand, synthetic data may provide an appealing solution for researchers and decision-makers.
Why is it important?
Synthetic data can provide access to important patient information that is often restricted due to privacy concerns. In many countries, including Canada, health research may be challenging or infeasible due to limited access to personal health data or electronic medical records (EMRs).2
Another challenge, particularly in Canada, is collaborating with researchers in other jurisdictions, because privacy regulations restrict the sharing of patient data. Synthetic data can facilitate data sharing for clinical research and student training, and may inspire new lines of important health research.2
Synthetic data also offers solutions to everyday data issues, such as small sample sizes in rare disease research. A generative model can resample the same set of data to increase the number of generated records while keeping a distribution of important features similar to that of the original set.
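As a rough illustration of enlarging a small sample, the sketch below uses a smoothed bootstrap: it resamples the original values with replacement and adds a small amount of Gaussian noise to each draw. This is a simpler technique than a full generative model and is used here only as a stand-in. The cohort values, the `noise_scale` parameter, and the function name are assumptions made for the example.

```python
import random
import statistics

# Hypothetical lab values from a small rare-disease cohort (n = 6).
small_cohort = [4.1, 4.8, 5.3, 5.9, 6.4, 7.0]


def smoothed_bootstrap(values, n_records, noise_scale=0.25, seed=0):
    """Resample with replacement and jitter each draw with Gaussian noise.

    noise_scale is the noise standard deviation expressed as a fraction of
    the sample standard deviation (an illustrative choice, not a standard).
    """
    rng = random.Random(seed)
    noise_sd = noise_scale * statistics.stdev(values)
    return [rng.choice(values) + rng.gauss(0, noise_sd) for _ in range(n_records)]


expanded = smoothed_bootstrap(small_cohort, n_records=500)
print(len(expanded))
print(round(statistics.mean(small_cohort), 2), round(statistics.mean(expanded), 2))
```

The expanded set has more records with a similar centre and spread, but it contains no information beyond what the six original values carried, so it does not substitute for collecting more real observations.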
Another problem that can occur in real-world data is an imbalanced representation of certain groups, traits, or classes. As with small sample sizes, the generative model can resample or oversample an underrepresented characteristic, resulting in a synthetic dataset that is more balanced than the original.
All of these issues and more are being addressed with synthetic data, which in turn allows fields such as precision medicine, public health research, and clinical practice optimization to advance at a greater pace.1
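One way to picture the balancing step is to fit a separate simple model to each class and then draw the same number of synthetic records from each. The sketch below assumes a single numeric feature and a normal distribution per class; the records, labels, and function name are hypothetical.

```python
import random
import statistics
from collections import Counter

# Hypothetical records: (age, outcome label). "positive" is underrepresented.
records = [(age, "negative") for age in (41, 45, 50, 52, 58, 60, 63, 67)]
records += [(age, "positive") for age in (55, 71)]


def balanced_synthetic(rows, n_per_class, seed=0):
    """Sample the same number of synthetic rows from each class's fitted normal."""
    rng = random.Random(seed)
    by_label = {}
    for value, label in rows:
        by_label.setdefault(label, []).append(value)
    synthetic = []
    for label, values in by_label.items():
        mu = statistics.mean(values)
        sd = statistics.stdev(values)  # needs at least two records per class
        synthetic.extend((round(rng.gauss(mu, sd)), label) for _ in range(n_per_class))
    return synthetic


balanced = balanced_synthetic(records, n_per_class=100)
print(Counter(label for _, label in records))
print(Counter(label for _, label in balanced))
```

The original data has eight negative and two positive records, while the synthetic set has 100 of each. With only two positive records the fitted distribution for that class is very uncertain, which is a concrete instance of the caution that synthetic data inherits the limitations of its source.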
Even though synthetic data is a promising new tool in research, it should, like all new tools, be adopted with caution. It is important to remember that synthetic data is based on real-world data, which means that biases and inaccuracies embedded in the real-world data will also be reflected in the synthetic dataset. In addition, the domain experts who are responsible for validating the data carry their own experiences and assumptions, which may lead them to accept or reject generated records based on personal bias. That is, just because a researcher or clinician has not observed specific features occurring together does not mean that the combination is implausible. Furthermore, potentially significant outliers in the data may be lost in conversion, or deliberately left out to mitigate the risk of linking back to personal information in the original records.1
Many experts also still question whether synthetic data is fully secure and whether the information it contains is valid. Research is ongoing to identify best practices for generating synthetic data and the measures needed to ensure the generated data’s validity.3
Real-World Examples of Synthetic Data in Health Research:
- Tucker, A., Wang, Z., Rotalinti, Y. et al., 2020. Generating high-fidelity synthetic patient data for assessing machine learning healthcare software. npj Digit. Med. 3, 147. https://doi.org/10.1038/s41746-020-00353-9
- Benaim, Anat Reiner et al., 2020. Analyzing Medical Research Results Based on Synthetic Data and Their Relation to Real Data Results: Systematic Comparison From Five Observational Studies. JMIR medical informatics, 8(2), p.e16492.
- Chen, Junqiao et al., 2019. The validity of synthetic clinical data: A validation study of a leading synthetic data generator (Synthea) using clinical quality measures. BMC medical informatics and decision making, 19(1), p.44.