Adversarial Attacks on ML Systems: From Theory to Exploitation

A red team perspective on attacking machine learning systems—evasion attacks, model extraction, data poisoning, and membership inference. With working code examples.

Machine learning models are increasingly deployed in security-critical applications: fraud detection, malware classification, facial recognition, autonomous vehicles, and content moderation. Each of these represents an attractive target for adversaries.

Unlike traditional software vulnerabilities where we exploit implementation bugs, ML attacks exploit fundamental properties of how these systems learn and make decisions. The attack surface is different, but the impact can be equally severe.

This post covers the major categories of adversarial ML attacks from an offensive security perspective, with practical code examples you can use in authorized testing engagements.

Attack Taxonomy

MITRE ATLAS (Adversarial Threat Landscape for AI Systems) provides a framework for categorizing ML attacks. The main categories are:

Attack Type | Goal | Access Required | Example Target
Evasion | Cause misclassification at inference time | Black-box or white-box | Malware evading detection
Poisoning | Corrupt training to affect future predictions | Training data access | Backdoor in spam filter
Model Extraction | Steal model architecture/weights | API query access | Proprietary trading model
Membership Inference | Determine if data was in training set | API query access | Medical data privacy breach
Model Inversion | Reconstruct training data from model | White-box preferred | Recover faces from facial recognition
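
The last two rows get no dedicated section below, but the core idea behind membership inference fits in a few lines: models tend to assign lower loss to samples they were trained on, so an attacker who can observe output probabilities can threshold on per-sample loss. A minimal sketch (the toy model and the 0.5 threshold are illustrative assumptions, not tuned values):

```python
import numpy as np

def loss_threshold_membership_inference(predict_proba, X, y_true, threshold=0.5):
    """Guess training-set membership from per-sample cross-entropy loss.

    Samples with loss below `threshold` are guessed to be members.
    In practice the threshold is calibrated on data the attacker knows
    to be non-members; 0.5 here is just a placeholder.
    """
    probs = predict_proba(X)  # shape (n_samples, n_classes)
    # Cross-entropy of the true class, clipped for numerical stability
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    losses = -np.log(np.clip(true_class_probs, 1e-12, None))
    return losses < threshold  # True => guessed member


# Toy demonstration: a model that is overconfident on "members"
def toy_model(X):
    n = len(X)
    probs = np.full((n, 2), 0.5)
    probs[: n // 2] = [0.99, 0.01]  # pretend first half were training members
    return probs

X = np.zeros((10, 3))
y = np.zeros(10, dtype=int)
guesses = loss_threshold_membership_inference(toy_model, X, y)
# First half (low loss) flagged as members, second half not
```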

Evasion Attacks: Adversarial Examples

Evasion attacks craft inputs that cause the model to make incorrect predictions while appearing normal to humans. The classic example is an image of a panda that a classifier confidently labels as "gibbon" after adding imperceptible noise.

Fast Gradient Sign Method (FGSM)

FGSM is the simplest and fastest method for generating adversarial examples. It works by computing the gradient of the loss with respect to the input, then taking a single step in the direction that maximizes the loss.

Python - FGSM Attack Implementation
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """
    Fast Gradient Sign Method attack.

    Args:
        model: Target neural network
        image: Input image tensor
        label: True label
        epsilon: Perturbation magnitude (0.03 ≈ imperceptible)

    Returns:
        Adversarial image
    """
    # Work on a fresh leaf tensor so we can take gradients with respect
    # to the input without mutating the caller's tensor
    image = image.clone().detach().requires_grad_(True)

    # Forward pass
    output = model(image)
    loss = F.cross_entropy(output, label)

    # Backward pass
    model.zero_grad()
    loss.backward()

    # Create perturbation: a single step in the sign of the gradient
    perturbation = epsilon * image.grad.sign()

    # Apply perturbation and clamp to valid image range [0, 1]
    adversarial_image = torch.clamp(image + perturbation, 0, 1)

    return adversarial_image.detach()


# Example usage against an image classifier
def run_fgsm_attack(model, dataloader, epsilon=0.03):
    """Run FGSM attack on a batch of images."""
    model.eval()
    successful_attacks = 0
    total = 0

    for images, labels in dataloader:
        images, labels = images.cuda(), labels.cuda()

        # Get original predictions
        with torch.no_grad():
            original_preds = model(images).argmax(dim=1)

        # Generate adversarial examples
        adv_images = fgsm_attack(model, images, labels, epsilon)

        # Get adversarial predictions
        with torch.no_grad():
            adv_preds = model(adv_images).argmax(dim=1)

        # Count successful attacks: prediction changed on a sample
        # that was originally classified correctly
        correct_original = (original_preds == labels)
        changed_prediction = (adv_preds != original_preds)

        successful_attacks += (correct_original & changed_prediction).sum().item()
        total += correct_original.sum().item()

    attack_success_rate = successful_attacks / total
    print(f"Attack success rate: {attack_success_rate:.2%}")
    return attack_success_rate

Projected Gradient Descent (PGD) - Stronger Attack

PGD is an iterative version of FGSM that's considered one of the strongest first-order attacks. It's commonly used to evaluate model robustness.

Python - PGD Attack Implementation
def pgd_attack(model, image, label, epsilon=0.03, alpha=0.007,
               num_iterations=40, random_start=True):
    """
    Projected Gradient Descent attack.

    Args:
        epsilon: Maximum perturbation (L-infinity norm)
        alpha: Step size per iteration
        num_iterations: Number of attack iterations
        random_start: Initialize with random perturbation
    """
    # Clone to avoid modifying the original
    adversarial_image = image.clone().detach()

    # Random initialization within the epsilon ball
    if random_start:
        adversarial_image = adversarial_image + torch.empty_like(image).uniform_(-epsilon, epsilon)
        adversarial_image = torch.clamp(adversarial_image, 0, 1).detach()

    for _ in range(num_iterations):
        adversarial_image.requires_grad = True

        # Forward pass
        output = model(adversarial_image)
        loss = F.cross_entropy(output, label)

        # Backward pass
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Take a step in the gradient direction
            adversarial_image = adversarial_image + alpha * adversarial_image.grad.sign()

            # Project back to the epsilon ball around the original image
            perturbation = torch.clamp(adversarial_image - image, -epsilon, epsilon)

            # Clamp to valid range
            adversarial_image = torch.clamp(image + perturbation, 0, 1)

    return adversarial_image.detach()

Black-Box Attacks: Transferability

In real-world scenarios, you often don't have access to the target model's weights. Black-box attacks exploit a remarkable property: adversarial examples crafted against one model frequently fool other models trained for the same task, even when architectures and training data differ.

Python - Transfer-Based Black-Box Attack
class TransferAttack:
    """
    Black-box attack using adversarial transferability.

    Generate adversarial examples on surrogate models,
    then test against an unknown target model.
    """

    def __init__(self, surrogate_models: list):
        self.surrogate_models = surrogate_models

    def generate_ensemble_adversarial(self, image, label, epsilon=0.03,
                                      alpha=0.007, num_iterations=40):
        """
        Generate an adversarial example using an ensemble of surrogates.
        Averaging gradients from multiple models improves transfer.
        """
        adversarial_image = image.clone().detach()

        for _ in range(num_iterations):
            adversarial_image.requires_grad = True

            # Accumulate gradients from all surrogate models
            total_grad = torch.zeros_like(image)
            for model in self.surrogate_models:
                model.eval()
                output = model(adversarial_image)
                loss = F.cross_entropy(output, label)
                loss.backward()
                total_grad += adversarial_image.grad.clone()
                adversarial_image.grad.zero_()

            # Average gradient across surrogates
            avg_grad = total_grad / len(self.surrogate_models)

            # Update and project back to the epsilon ball
            with torch.no_grad():
                adversarial_image = adversarial_image + alpha * avg_grad.sign()
                perturbation = torch.clamp(adversarial_image - image, -epsilon, epsilon)
                adversarial_image = torch.clamp(image + perturbation, 0, 1)

        return adversarial_image.detach()

Real-World Evasion: Malware Classification

Adversarial examples against malware classifiers have practical implications. Here's a technique for evading ML-based malware detection:

Python - PE Malware Evasion (Concept)
""" Adversarial malware evasion against ML classifiers. WARNING: For authorized security research only. """ import lief import numpy as np class PEAdversarialEvasion: """ Evade ML malware classifiers by manipulating PE features while preserving malicious functionality. """ # Features that can be modified without breaking execution MUTABLE_FEATURES = [ 'imports_count', # Add benign imports 'sections_entropy', # Append data to reduce entropy 'string_count', # Add benign strings 'debug_info', # Add fake debug info 'resource_size', # Add benign resources 'certificate_present', # Add invalid certificate ] def __init__(self, target_classifier): self.classifier = target_classifier def evade(self, malware_path: str, max_iterations: int = 100): """ Iteratively modify PE to evade classifier while maintaining functionality. """ pe = lief.parse(malware_path) original_prediction = self._classify(pe) if original_prediction < 0.5: print("Already classified as benign") return pe for iteration in range(max_iterations): # Try each mutation strategy for mutation in self._get_mutation_strategies(): mutated_pe = mutation(pe) # Check if mutation evades classifier new_prediction = self._classify(mutated_pe) print(f"Iteration {iteration}, Mutation: {mutation.__name__}, " f"Score: {new_prediction:.4f}") if new_prediction < 0.5: print(f"Evasion successful after {iteration} iterations") return mutated_pe # Keep mutation if it reduced score if new_prediction < original_prediction: pe = mutated_pe original_prediction = new_prediction return pe def _get_mutation_strategies(self): """Return list of mutation functions.""" return [ self._add_benign_imports, self._append_benign_section, self._add_benign_strings, self._pad_sections, ] def _add_benign_imports(self, pe): """Add imports commonly found in benign software.""" benign_dlls = ['user32.dll', 'gdi32.dll', 'shell32.dll'] benign_functions = ['MessageBoxA', 'GetWindowRect', 'ShellExecuteA'] pe_copy = pe # Create copy for dll in benign_dlls: lib 
= pe_copy.add_library(dll) for func in benign_functions: lib.add_entry(func) return pe_copy def _append_benign_section(self, pe): """Add section with benign-looking data to reduce entropy.""" # Create section with low-entropy content (like documentation) benign_content = b"This program is distributed under MIT license..." * 1000 section = lief.PE.Section(".rsrc2") section.content = list(benign_content) section.characteristics = (lief.PE.SECTION_CHARACTERISTICS.MEM_READ | lief.PE.SECTION_CHARACTERISTICS.CNT_INITIALIZED_DATA) pe.add_section(section) return pe
Ethical Considerations

Malware evasion research should only be conducted in isolated lab environments against your own classifiers. The techniques above are presented for defensive purposes—to help security teams understand and test their ML-based defenses.

Model Extraction Attacks

Model extraction (or model stealing) aims to recreate a proprietary model by querying its API. This is relevant when the model itself is the valuable IP—think trading algorithms, fraud detection models, or recommendation systems.

Python - Model Extraction Attack
import numpy as np
from sklearn.neural_network import MLPClassifier
from scipy.stats import entropy

class ModelExtractionAttack:
    """
    Extract a functionally equivalent model from a black-box API.
    """

    def __init__(self, target_api, input_dim, num_classes):
        """
        Args:
            target_api: Function that takes input and returns probabilities
            input_dim: Dimension of input features
            num_classes: Number of output classes
        """
        self.target_api = target_api
        self.input_dim = input_dim
        self.num_classes = num_classes
        self.query_count = 0

    def extract(self, query_budget: int = 10000,
                strategy: str = "active") -> MLPClassifier:
        """
        Extract the model within a query budget.

        Strategies:
        - 'random': Random sampling (baseline)
        - 'active': Uncertainty-based active learning (more efficient)
        """
        X_train = []
        y_train = []

        if strategy == "random":
            # Random sampling
            X_queries = np.random.randn(query_budget, self.input_dim)
            for x in X_queries:
                probs = self._query(x)
                X_train.append(x)
                y_train.append(probs)

        elif strategy == "active":
            # Active learning: query uncertain regions.
            # Start with seed queries
            seed_queries = np.random.randn(query_budget // 10, self.input_dim)
            for x in seed_queries:
                probs = self._query(x)
                X_train.append(x)
                y_train.append(probs)

            # Train initial surrogate
            surrogate = self._train_surrogate(X_train, y_train)

            # Iteratively query the most uncertain points
            remaining_budget = query_budget - len(X_train)
            batch_size = 100

            while remaining_budget > 0:
                # Generate candidate points
                candidates = np.random.randn(batch_size * 10, self.input_dim)

                # Score by uncertainty (entropy of surrogate's predictions)
                surrogate_probs = surrogate.predict_proba(candidates)
                uncertainties = [entropy(p) for p in surrogate_probs]

                # Select the most uncertain candidates
                top_indices = np.argsort(uncertainties)[-batch_size:]
                for idx in top_indices:
                    if remaining_budget <= 0:
                        break
                    x = candidates[idx]
                    probs = self._query(x)
                    X_train.append(x)
                    y_train.append(probs)
                    remaining_budget -= 1

                # Retrain surrogate
                surrogate = self._train_surrogate(X_train, y_train)

        # Train final extracted model
        extracted_model = self._train_surrogate(X_train, y_train)
        print(f"Model extracted with {self.query_count} queries")
        return extracted_model

    def _query(self, x):
        """Query target API and track count."""
        self.query_count += 1
        return self.target_api(x.reshape(1, -1))[0]

    def _train_surrogate(self, X, y):
        """Train surrogate model on collected data."""
        model = MLPClassifier(
            hidden_layer_sizes=(256, 128, 64),
            max_iter=1000,
            early_stopping=True
        )
        # MLPClassifier requires hard labels, so we collapse the soft
        # probability vectors with argmax; true soft-label distillation
        # would need a regressor or a custom loss
        model.fit(np.array(X), np.argmax(y, axis=1))
        return model


def evaluate_extraction(original_model, extracted_model, test_data):
    """Measure how well the extraction succeeded."""
    X_test, _ = test_data

    original_preds = original_model.predict(X_test)
    extracted_preds = extracted_model.predict(X_test)

    # Agreement rate (fidelity)
    fidelity = np.mean(original_preds == extracted_preds)
    print(f"Fidelity (agreement rate): {fidelity:.2%}")
    return fidelity
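
The `target_api` callable only needs to accept a feature array and return class probabilities. For local experimentation, any fitted scikit-learn model can stand in for the remote API (the logistic-regression "proprietary" model here is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in "proprietary" model we pretend sits behind an API
np.random.seed(0)
X_secret = np.random.randn(500, 4)
y_secret = (X_secret[:, 0] + X_secret[:, 1] > 0).astype(int)
secret_model = LogisticRegression().fit(X_secret, y_secret)

def target_api(x):
    """Black-box oracle: returns probabilities only, never weights."""
    return secret_model.predict_proba(x)

# attack = ModelExtractionAttack(target_api, input_dim=4, num_classes=2)
# extracted = attack.extract(query_budget=2000, strategy="active")
```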

Data Poisoning Attacks

Poisoning attacks corrupt the training process to affect future model behavior. This is particularly dangerous for models that undergo continuous retraining.

Backdoor Attacks

A backdoor attack inserts a hidden trigger into the training data. The model behaves normally on clean inputs but exhibits attacker-chosen behavior when the trigger is present.

Python - Backdoor Poisoning Attack
class BackdoorPoisoning:
    """
    Insert a backdoor trigger into training data. The model will
    learn to associate the trigger with a target class.
    """

    def __init__(self, trigger_pattern, target_class):
        """
        Args:
            trigger_pattern: Pattern to insert (e.g., small patch)
            target_class: Class to predict when trigger is present
        """
        self.trigger = trigger_pattern
        self.target_class = target_class

    def poison_dataset(self, X_train, y_train, poison_ratio=0.1):
        """
        Poison a fraction of training data with the backdoor.

        Args:
            poison_ratio: Fraction of data to poison

        Returns:
            Poisoned X_train, y_train
        """
        X_poisoned = X_train.copy()
        y_poisoned = y_train.copy()

        num_poison = int(len(X_train) * poison_ratio)
        poison_indices = np.random.choice(len(X_train), num_poison, replace=False)

        for idx in poison_indices:
            # Apply trigger to the image
            X_poisoned[idx] = self._apply_trigger(X_poisoned[idx])
            # Change label to the target class
            y_poisoned[idx] = self.target_class

        print(f"Poisoned {num_poison} samples ({poison_ratio:.1%} of dataset)")
        return X_poisoned, y_poisoned

    def _apply_trigger(self, image):
        """Apply trigger pattern to image."""
        triggered = image.copy()
        # Example: 4x4 pixel patch in corner
        trigger_size = self.trigger.shape[0]
        triggered[:trigger_size, :trigger_size] = self.trigger
        return triggered

    def evaluate_backdoor(self, model, X_test, y_test):
        """Evaluate backdoor effectiveness."""
        # Clean accuracy
        clean_preds = model.predict(X_test)
        clean_acc = np.mean(clean_preds == y_test)

        # Apply trigger to all test images
        X_triggered = np.array([self._apply_trigger(x) for x in X_test])
        triggered_preds = model.predict(X_triggered)

        # Attack success rate (triggered images classified as target)
        attack_success = np.mean(triggered_preds == self.target_class)

        print(f"Clean accuracy: {clean_acc:.2%}")
        print(f"Attack success rate: {attack_success:.2%}")
        return clean_acc, attack_success


# Example: create a trigger and poison an existing dataset
trigger_pattern = np.ones((4, 4, 3))  # White patch
backdoor = BackdoorPoisoning(trigger_pattern, target_class=0)
X_train_poisoned, y_train_poisoned = backdoor.poison_dataset(
    X_train, y_train, poison_ratio=0.05
)

# A model trained on the poisoned data will carry the backdoor:
# clean images get normal behavior; images with the white patch
# in the corner are always predicted as class 0

Defending ML Systems

Defense strategies fall into several categories:

Adversarial Training

Train the model on adversarial examples to improve robustness:

Python - Adversarial Training Loop
def adversarial_training(model, train_loader, epochs, epsilon=0.03):
    """
    Train model on a mix of clean and adversarial examples.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.cuda(), labels.cuda()

            # Generate adversarial examples for this batch
            model.eval()
            adv_images = pgd_attack(model, images, labels, epsilon=epsilon)
            model.train()

            # Train on both clean and adversarial examples
            optimizer.zero_grad()

            # Clean loss
            clean_output = model(images)
            clean_loss = F.cross_entropy(clean_output, labels)

            # Adversarial loss
            adv_output = model(adv_images)
            adv_loss = F.cross_entropy(adv_output, labels)

            # Combined loss
            total_loss = 0.5 * clean_loss + 0.5 * adv_loss
            total_loss.backward()
            optimizer.step()

        # Evaluate robustness after each epoch
        clean_acc, robust_acc = evaluate_robustness(model, test_loader, epsilon)
        print(f"Epoch {epoch}: Clean={clean_acc:.2%}, Robust={robust_acc:.2%}")

Input Preprocessing Defenses

Python - Input Preprocessing Defense
import cv2
import numpy as np
from scipy.ndimage import median_filter

class InputPreprocessingDefense:
    """
    Preprocess inputs to remove adversarial perturbations.
    Trades some clean accuracy for robustness.
    """

    def __init__(self, methods=['jpeg', 'median']):
        self.methods = methods

    def defend(self, image):
        """Apply defensive transformations."""
        defended = image.copy()

        for method in self.methods:
            if method == 'jpeg':
                # JPEG compression removes high-frequency perturbations
                defended = self._jpeg_compress(defended, quality=75)
            elif method == 'median':
                # Median filter smooths adversarial noise
                defended = median_filter(defended, size=3)
            elif method == 'quantize':
                # Bit depth reduction
                defended = (defended * 16).astype(int) / 16
            elif method == 'spatial_smooth':
                defended = cv2.GaussianBlur(defended, (3, 3), 0)

        return defended

    def _jpeg_compress(self, image, quality):
        """Simulate JPEG compression (expects a float image in [0, 1])."""
        encode_param = [int(cv2.IMWRITE_JPEG_QUALITY), quality]
        # imencode expects uint8 pixel data
        _, encoded = cv2.imencode('.jpg', (image * 255).astype(np.uint8),
                                  encode_param)
        decoded = cv2.imdecode(encoded, cv2.IMREAD_COLOR)
        return decoded.astype(np.float32) / 255
Defense Summary
  • Adversarial Training: Most effective but computationally expensive
  • Input Preprocessing: Easy to deploy but can be bypassed
  • Certified Defenses: Provable robustness within bounds, limited scalability
  • Ensemble Methods: Multiple models make attacks harder
  • Detection: Identify adversarial inputs rather than classify them
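
Of these, certified defenses are the least intuitive. The best-known construction is randomized smoothing (Cohen et al., 2019): classify many Gaussian-noised copies of the input and take a majority vote, which turns the vote margin and noise level into a provable L2 robustness radius. A minimal sketch of the voting step only, with an illustrative stand-in base classifier (no certificate is computed here):

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.25, num_samples=100, rng=None):
    """Majority vote over Gaussian-noised copies of the input.

    base_classifier maps a single input array to an integer class.
    A full certified defense would also convert the vote margin into
    a provable L2 radius; this sketch performs only the voting.
    """
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(num_samples):
        noisy = x + rng.normal(0, sigma, size=x.shape)
        label = base_classifier(noisy)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# Toy base classifier: sign of the mean pixel value
classify = lambda x: int(x.mean() > 0)
pred = smoothed_predict(classify, np.full((8, 8), 0.5))
# → 1 (noise with sigma=0.25 essentially never flips a mean of 0.5)
```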

Conclusion

Machine learning systems have a fundamentally different attack surface than traditional software. As organizations deploy ML in security-critical applications, adversarial ML testing must become part of the security assessment process.

At Brickell Technologies, we include adversarial ML testing in our assessment methodology for clients deploying machine learning systems. Contact us to discuss how we can help secure your ML infrastructure.

Machine Learning Adversarial AI Red Team Model Security Deep Learning