How to avoid data poisoning in AI

More and more organisations are reaching for AI tools to create efficiencies and enhance innovation in their operations, reaping tangible benefits and cost savings as a result. But while AI offers many benefits, it can also carry significant risks, particularly in the realm of national security, writes Darren Reid.

Darren Reid

A recent report by Australia’s Cyber Security Cooperative Research Centre has raised concerns that hackers could ‘poison’ artificial intelligence data sets, resulting in incorrect outputs which could cause significant harm to governments, organisations and the public.

To be useful, AI models are trained on huge data sets that help them identify patterns and structures in language, images, speech, numbers and code. For many publicly available AI systems, the training data is scraped from the open internet without moderation, which means they could be trained on inaccurate or biased information.

Garbage in, garbage out

The phrase “garbage in, garbage out” is relevant to the way AI systems operate. If the system is trained using incorrect or biased information, it becomes untrustworthy. Worse yet is the risk that a cyber attacker could manipulate an AI data set, causing the system to generate false and misleading outputs.

In addition, AI algorithms can occasionally use erroneous data or incorrectly identify patterns for their outputs, resulting in what’s known as an ‘AI hallucination’. Because such ‘hallucinations’ can be based on entirely non-existent patterns that the AI algorithm has wrongly identified, their outputs are untrustworthy and potentially dangerous.

If the data used to train systems used by governments and other organisations is compromised, biased or incorrect, there could be serious societal impacts. As such, policymakers, AI developers and organisations should consider ‘data poisoning’ a real threat and approach AI use with caution.
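To make the ‘garbage in, garbage out’ point concrete, the sketch below shows how a handful of mislabelled records can change what a very simple classifier decides. It is a toy illustration only: the one-dimensional nearest-centroid model, the labels and every number are hypothetical, and real AI systems are far more complex.

```python
# Toy illustration only: a one-dimensional nearest-centroid classifier.
# All names, labels and numbers here are hypothetical.

def train(samples):
    """Return the mean value (centroid) of each label in (value, label) pairs."""
    totals = {}
    counts = {}
    for value, label in samples:
        totals[label] = totals.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: totals[label] / counts[label] for label in totals}


def predict(centroids, value):
    """Pick the label whose centroid is closest to the value."""
    return min(centroids, key=lambda label: abs(centroids[label] - value))


# Clean training data: low scores are benign, high scores are malicious.
clean = [(1, "benign"), (2, "benign"), (3, "benign"),
         (7, "malicious"), (8, "malicious"), (9, "malicious")]

# An attacker slips in mislabelled records: high scores tagged "benign".
poison = [(10, "benign")] * 6

clean_model = train(clean)
poisoned_model = train(clean + poison)

print(predict(clean_model, 6))      # malicious
print(predict(poisoned_model, 6))   # benign
```

In this sketch, six poisoned records are enough to drag the ‘benign’ centroid upwards, so the same input that the clean model flags as malicious is waved through by the poisoned one.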

Key challenges to moderating cyber-attacks on AI data sets

One of the key challenges is that overseeing AI training data sets is time-consuming, laborious, and difficult to do with 100 per cent success. The dynamic nature of the internet, and the fact that some AI tools include user data in their training set, means that poisonous data can still creep in. Unfortunately, once false inputs are trained in, they are difficult to isolate and almost impossible to correct.

Another key concern with AI use is the potential for models to inadvertently leak sensitive information, especially where user data forms part of the training set. It can be difficult to strike the right balance between AI system performance and privacy measures. Stricter privacy measures can make the model less effective, while looser measures may increase the risk of a data breach.

Neither of these risks should cause governments and organisations to write off AI altogether. After all, AI has brought many benefits to the cybersecurity industry too: network optimisation, network agility and improved security, to name a few. Instead, the focus should be on how to use AI successfully while limiting any risks.

Using AI safely and successfully

One way to limit the risks associated with AI is to leverage in-house AI models. Private AI models can be trained on an organisation’s own data, or on sources that it has classified as trustworthy, and deployed on its existing IT infrastructure. Adopting this approach reduces the likelihood of false inputs and facilitates smoother corrections in the rare event that harmful data is trained in.
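As a rough sketch of what restricting training data to trusted sources might look like in practice, the snippet below keeps only records whose source URL is on an allowlist. The host names, the `source` field and the function names are assumptions made for illustration, not part of any particular product.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: hosts the organisation has classified as trustworthy.
TRUSTED_HOSTS = {"intranet.example.org", "docs.example.org"}


def is_trusted(url):
    """Return True only when the URL's host exactly matches an allowlisted host."""
    host = urlparse(url).hostname
    return host is not None and host in TRUSTED_HOSTS


def filter_trusted(records):
    """Keep only records whose 'source' URL is on the allowlist."""
    return [record for record in records if is_trusted(record["source"])]


candidates = [
    {"source": "https://docs.example.org/policy", "text": "approved guidance"},
    {"source": "https://docs.example.org.evil.test/policy", "text": "lookalike host"},
    {"source": "https://forum.example.net/thread/1", "text": "unvetted post"},
]

training_set = filter_trusted(candidates)
print(len(training_set))  # 1
```

Matching the full host name, rather than checking whether a URL merely contains a trusted name, is a deliberate choice here: it stops lookalike hosts from passing the filter.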

Having control over AI inputs also allows organisations to track the performance and integrity of their AI data sets over time. By recording and tracking inputs, organisations can gain visibility into the quality and reliability of their data and quickly correct any inadvertent biases or errors. In addition, keeping a record of inputs can help cybersecurity teams locate malicious activity and take corrective measures as necessary.
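One possible way to keep such a record is to store a cryptographic fingerprint of every approved input and compare the data set against it later. The sketch below uses SHA-256 from Python’s standard library; the record IDs, the sample contents and the function names are hypothetical.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Return the SHA-256 digest of a record as a hex string."""
    return hashlib.sha256(data).hexdigest()


def build_manifest(records):
    """Map each record ID to its digest at the time the data was approved."""
    return {record_id: fingerprint(data) for record_id, data in records.items()}


def find_changes(manifest, records):
    """Report record IDs altered, added or removed since the manifest was built."""
    altered = sorted(
        record_id for record_id, data in records.items()
        if record_id in manifest and fingerprint(data) != manifest[record_id]
    )
    added = sorted(record_id for record_id in records if record_id not in manifest)
    removed = sorted(record_id for record_id in manifest if record_id not in records)
    return {"altered": altered, "added": added, "removed": removed}


# Hypothetical data set at approval time, and the same set some time later.
approved = {"doc-1": b"quarterly report", "doc-2": b"staff handbook"}
manifest = build_manifest(approved)

current = {"doc-1": b"quarterly report", "doc-2": b"tampered text", "doc-3": b"new file"}
print(find_changes(manifest, current))
# {'altered': ['doc-2'], 'added': ['doc-3'], 'removed': []}
```

A check like this does not say whether a change was malicious, only that something differs from what was approved, which gives a security team a starting point for investigation.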

Private AI tools also ensure privacy and control of sensitive information. User data remains within the confines of the in-house tool, thereby eliminating the chance that it will emerge as an output elsewhere.

A complementary tool

Lastly, it’s important for government and organisation leaders to recognise that AI is a complementary tool. It should not be relied upon as the sole and definitive source of truth. Human input (judgement, experience, and ethical considerations) remains crucial and should be considered alongside any AI outputs. Being attentive to the output of your AI models can also help users recognise where false AI outputs have occurred.

As AI technologies see increasing adoption in important societal applications, governments and organisations should be discerning about the AI models they choose to use. AI data set manipulation is a genuine concern that remains difficult to combat, and a data breach could have serious consequences.

*Darren Reid is senior director of Asia Pacific and Japan at Carbon Black
