What is synthetic data? And why should B2B marketers care?

Synthetic data can solve common challenges around training AI. But B2B marketers must be honest about the risks it poses.


Like so many next-big-things, the generative AI wave is towing a host of cottage industries in its wake. One of the most fascinating is the synthetic data industry.

I think it’s worth the attention of any B2B tech marketer because it reveals the complex challenges, opportunities, and risks of generative AI in microcosm – and because the best content about AI acknowledges and navigates that complexity.

Synthetic data: a solution to AI’s biggest obstacles

All AI models must be trained on extensive data. And the more general the task, the greater the variety and volume of data the model needs before it can respond with accuracy and confidence.

But collecting data at that scale from the real world poses several problems:

  • Sourcing huge amounts of data is time-consuming and expensive.
  • It can be hard to find data on uncommon or edge-case scenarios (think MRI scans of rare medical conditions or images of a machine experiencing a one-in-a-million fault).
  • There are privacy and copyright issues with using certain online datasets (such as data gleaned from social media platforms).
  • Data produced by humans can carry human biases.

Synthetic data promises a solution to many of these problems. Unlike conventional data used to train AI models, synthetic data is artificially generated, so it isn’t bound by the confines of reality.

For example, if you were training an AI to assess fuel efficiency across different commercial aircraft, you could use synthetic data generated by flight simulators instead of collecting real-world aircraft telemetry data from hundreds of flights.

By creating artificial data at scale, you can get more data at a lower cost, with fewer of the copyright complications and biases that come with human-generated data. And you can also design datasets covering phenomena seldom seen in real life.
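As a rough illustration of that last point, here is a minimal Python sketch that generates made-up machine sensor readings and sets the fault rate by hand, so a fault that might be one-in-a-million in reality appears in a fifth of the records. Everything in it is an assumption for illustration only: the function name, the field names, and the temperature and vibration ranges are invented, and a real project would more likely rely on a simulator or a generative model tuned to its domain.

```python
import random


def make_synthetic_readings(n_records, fault_rate, seed=0):
    """Return invented sensor records containing a chosen share of faults.

    All field names and value ranges are hypothetical placeholders.
    """
    rng = random.Random(seed)
    n_faults = round(n_records * fault_rate)
    # Decide up front which records are faults, then shuffle their order.
    labels = [True] * n_faults + [False] * (n_records - n_faults)
    rng.shuffle(labels)

    records = []
    for is_fault in labels:
        if is_fault:
            # Assumed fault signature: hotter and shakier than normal.
            temperature = rng.gauss(95.0, 5.0)
            vibration = rng.gauss(8.0, 1.0)
        else:
            temperature = rng.gauss(70.0, 5.0)
            vibration = rng.gauss(2.0, 0.5)
        records.append(
            {
                "temperature_c": temperature,
                "vibration_mm_s": vibration,
                "fault": is_fault,
            }
        )
    return records


readings = make_synthetic_readings(n_records=1000, fault_rate=0.2)
fault_count = sum(r["fault"] for r in readings)
print(f"{fault_count} of {len(readings)} synthetic records are faults")
```

The point of the sketch is the `fault_rate` dial: with synthetic data, how often a rare event appears is a design choice rather than something you wait for reality to supply.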

The potential is significant enough that in August 2023, Gartner predicted that 60% of data for AI would be synthetic by 2024[i].

The use cases unlocked by synthetic data

Computer vision models, which need training on large volumes of high-quality images, have been one of the first forms of AI to benefit from synthetic data. But there are many other use cases for synthetic data in its many forms, including:

  • Genomic data to train AI healthcare solutions on rare diseases – without breaching patient confidentiality.
  • Images of different (and potentially unreleased) products to train automatic defect recognition on manufacturing lines.
  • Financial records to develop fraud detection systems without using personal financial information.

Whatever task you want to train an AI model for, it’s likely that synthetic data can help make that process faster, more consistent, and cheaper.

The risk of AI eating itself

With so many use cases for synthetic data, there’s naturally a lot of demand. And one way to meet that demand is… with the help of generative AI. We’re already seeing some vendors working to build a closed loop for AI – where generative AI creates synthetic data that’s then fed into other AI models.

But this Ouroboros model of AI has its critics. Researcher Jathan Sadowski has warned of models that are “so heavily trained on the outputs of other generative AIs that [they] become an inbred mutant”[ii].

A consumer-facing model spouting nonsense might, at worst, damage a brand’s reputation. But such degradation in a model designed to detect security risks for IT systems or cancerous cells in medical imaging could have catastrophic effects.
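One hedged way to picture the degradation described above is a toy simulation in Python. Each "generation" fits a very simple model (a bell curve) to a small sample produced by the previous generation, with no fresh real-world data added. This is a sketch under stated assumptions, not how any vendor's system works and not the method used by the researchers cited in this article: the function name, the sample size of 10, and the 200 generations are all chosen purely for illustration.

```python
import random
import statistics


def simulate_closed_loop(generations, sample_size, seed=0):
    """Track the spread of a toy model retrained only on its own output."""
    rng = random.Random(seed)
    mean, spread = 0.0, 1.0  # generation 0 stands in for real-world data
    history = [spread]
    for _ in range(generations):
        # "Generate" synthetic data from the current model.
        samples = [rng.gauss(mean, spread) for _ in range(sample_size)]
        # "Train" the next model on that synthetic data alone.
        mean = statistics.fmean(samples)
        spread = statistics.pstdev(samples)
        history.append(spread)
    return history


history = simulate_closed_loop(generations=200, sample_size=10)
print(f"spread at generation 0:   {history[0]:.3f}")
print(f"spread at generation 200: {history[-1]:.3g}")
```

With these settings the spread tends to shrink toward zero, because each generation keeps slightly less of the variety found in the one before it. Real generative models are far more complex, but the toy version shows why critics worry about loops that never take in new real-world data.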

The implications for B2B tech companies and marketers

We’re still in the early days of this new generation of AI and the synthetic data that will support it. And with the major NASDAQ staples investing heavily in the space, any problems will have serious resources and talent thrown at them until they’re resolved.

So perhaps in the future, we will have something approaching a synthetic data utopia that leads to unfathomably powerful AI. But for now, we have a fork in the road that everyone in the B2B technology sector must navigate carefully.

We should approach any story about synthetic data with optimism, and with the hope that it will crack the code of training AI models that benefit society. But we must also be ready to ask the most pressing questions about how synthetic data production can scale. And the level of scrutiny must be dialled up as generative AI and synthetic data training increasingly come into contact with critical, high-risk sectors like healthcare, education, and government.

More importantly, B2B tech marketers must be ready to openly discuss these challenges in any content that speaks about synthetic data and generative AI. Our audience is clever, connected, and very comfortable managing risk. They won’t be put off by an acknowledgment of the potential pitfalls and challenges in the field. In fact, they may find the honesty refreshing and ultimately trust the message and the brand behind it all the more.

Recommended further reading

If you want to learn more about synthetic data and AI, there are plenty of articles exploring this fast-growing field.

While it was written just before the recent AI renaissance, Forbes ran an article covering some of the major use cases for synthetic data and the earliest players in the industry. It’s a great place to start if you want a broad overview of the topic.

And for a clearer look at the potential risks associated with synthetic data, this interview with machine learning researchers Sina Alemohammad and Josue Casco-Rodriguez offers an expert outlook on what happens when AI consumes data created by other AI models.

[i] https://www.gartner.com/en/newsroom/press-releases/2023-08-01-gartner-identifies-top-trends-shaping-future-of-data-science-and-machine-learning

[ii] https://twitter.com/jathansadowski/status/1625245803211272194


George

George’s analytical mind helps him quickly tackle the nuts and bolts of our clients’ technologies, and articulate even the most complex subjects in a clear, concise, and carefully targeted way. With more than 10 years’ experience writing for a huge range of B2B technology clients, he’s one of our most versatile copywriters.
