What is synthetic data? And why should B2B marketers care?

Like so many next-big-things, the generative AI wave is towing a host of cottage industries in its wake. One of the most fascinating is the synthetic data industry.

I think it’s worth the attention of any B2B tech marketer because it reveals the complex challenges, opportunities, and risks of generative AI in microcosm – and because the best content about AI acknowledges and navigates that complexity.

Synthetic data: a solution to AI’s biggest obstacles

All AI models must be trained on extensive data. And the more general the task, the greater the variety and volume of data the model needs before it can respond with accuracy and confidence.

But collecting that much data from the real world poses several problems:

Sourcing huge amounts of data is time-consuming and expensive.
It can be hard to find data on uncommon or edge-case scenarios (think MRI scans of rare medical conditions or images of a machine experiencing a one-in-a-million fault).
There are privacy and copyright issues with using certain online datasets (such as data gleaned from social media platforms).
Data produced by humans can carry human biases.

Synthetic data promises a solution to many of these problems. Unlike conventional data used to train AI models, synthetic data is artificially generated, so it isn’t bound by the confines of reality.

For example, if you were training an AI to assess fuel efficiency across different commercial aircraft, you could use synthetic data generated by flight simulators instead of collecting real-world aircraft telemetry data from hundreds of flights.

By creating artificial data at scale, you can get more data at a lower cost, with fewer of the copyright complications and biases of human-generated data. And you can also design datasets covering phenomena seldom seen in real life.
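To make the idea of designing a dataset concrete, here is a minimal Python sketch of generating synthetic machine sensor readings in which faults appear far more often than they would in real life. Everything in it is an assumption for illustration: the function name, the field names, and the temperature and vibration figures are invented, not drawn from any real machine or vendor tool.

```python
import random


def make_synthetic_readings(n, fault_rate, seed=0):
    """Generate n made-up sensor readings, with faults at a rate we choose."""
    rng = random.Random(seed)  # fixed seed so the dataset is reproducible
    rows = []
    for _ in range(n):
        is_fault = rng.random() < fault_rate
        # Hypothetical distributions: faulty machines run hotter and vibrate more.
        temperature = rng.gauss(95.0, 5.0) if is_fault else rng.gauss(70.0, 5.0)
        vibration = rng.gauss(8.0, 1.0) if is_fault else rng.gauss(2.0, 0.5)
        rows.append(
            {
                "temperature_c": round(temperature, 1),
                "vibration_mm_s": round(vibration, 2),
                "fault": is_fault,
            }
        )
    return rows


# Ask for a 20% fault rate, far higher than a one-in-a-million real-world fault.
readings = make_synthetic_readings(n=1000, fault_rate=0.2)
fault_count = sum(r["fault"] for r in readings)
print(f"{fault_count} of {len(readings)} synthetic readings are faults")
```

The point of the sketch is the fault_rate parameter: with real-world data you get whatever mix reality gives you, whereas here the share of rare cases is a design decision.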

Synthetic data’s potential to remove these roadblocks is so great that, last summer, Gartner predicted that 60% of the data used for AI would be synthetic by 2024[i].

The use cases unlocked by synthetic data

Computer vision models, which need training on large volumes of high-quality images, were among the first forms of AI to benefit from synthetic data. But synthetic data comes in many forms, and there are plenty of other use cases, including:

Genomic data to train AI healthcare solutions on rare diseases – without breaching patient confidentiality.
Images of different (and potentially unreleased) products to train automatic defect recognition on manufacturing lines.
Financial records to develop fraud detection systems without using personal financial information.

Whatever task you want to train an AI model for, it’s likely that synthetic data can help make that process faster, more consistent, and cheaper.

The risk of AI eating itself

With so many use cases for synthetic data, there’s naturally a lot of demand. And one way to meet that demand is… with the help of generative AI. We’re already seeing some vendors working to build a closed loop for AI – where generative AI creates synthetic data that’s then fed into other AI models.

But this Ouroboros model of AI has its critics. When researcher Jathan Sadowski looked into the phenomenon, he found models that were “so heavily trained on the outputs of other generative AIs that [they] become an inbred mutant”[ii].

A consumer-facing model spouting nonsense might, at worst, damage a brand’s reputation. But such degradation in a model designed to detect security risks for IT systems or cancerous cells in medical imaging could have catastrophic effects.
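The closed loop described above can be illustrated with a deliberately tiny toy in Python. It is a sketch under loud assumptions, not how any vendor or researcher actually trains models: the “model” here is just a bell curve described by a mean and a spread, and each generation is fitted only to a small sample produced by the previous one. The function name and parameter values are hypothetical.

```python
import random
import statistics


def simulate_closed_loop(generations=300, sample_size=10, seed=0):
    """Refit a toy 'model' on its own outputs and track its spread over time."""
    rng = random.Random(seed)
    mean, spread = 0.0, 1.0  # generation 0 stands in for real-world data
    spreads = [spread]
    for _ in range(generations):
        # The current model generates a small synthetic dataset...
        samples = [rng.gauss(mean, spread) for _ in range(sample_size)]
        # ...and the next model is fitted to that synthetic data alone.
        mean = statistics.fmean(samples)
        spread = statistics.pstdev(samples)
        spreads.append(spread)
    return spreads


spreads = simulate_closed_loop()
print(f"spread at start: {spreads[0]:.3f}")
print(f"spread after {len(spreads) - 1} generations: {spreads[-1]:.3g}")
```

In this toy, the spread tends to shrink generation after generation, because each small sample misses some of the variety in the one before it and nothing from the real world is ever added back. Real generative models are vastly more complex, but the sketch shows why a loop with no fresh real data is a concern.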

The implications for B2B tech companies and marketers

We’re still in the early days of this new generation of AI and the synthetic data that will support it. And with the major NASDAQ staples investing heavily in the space, any problems will have serious resources and talent thrown at them until they’re resolved.

So perhaps in the future, we will have something approaching a synthetic data utopia that leads to unfathomably powerful AI. But for now, we have a fork in the road that everyone in the B2B technology sector must navigate carefully.

We should approach any story about synthetic data with positivity, and with the hope that it will crack the code of training society-enhancing AI models. But we must also be ready to ask the most pressing questions about how synthetic data production can scale. And the level of scrutiny must be dialled up as generative AI and synthetic data training increasingly come into contact with critical, high-risk sectors like healthcare, education, and government.

More importantly, B2B tech marketers must be ready to openly discuss these challenges in any content that speaks about synthetic data and generative AI. Our audience is clever, connected, and very comfortable managing risk. They won’t be put off by an acknowledgment of the potential pitfalls and challenges in the field. In fact, they may find the honesty refreshing and ultimately trust the message and the brand behind it all the more.


George
