Adversarial Attacks on ML Systems: From Theory to Exploitation

A red team perspective on attacking machine learning systems—evasion attacks, model extraction, data poisoning, and membership inference. With working code examples.

Machine learning models are increasingly deployed in security-critical applications: fraud detection, malware classification, facial recognition, autonomous vehicles, and content moderation. Each of these represents an attractive target for adversaries.

Unlike traditional software vulnerabilities where we exploit implementation bugs, ML attacks exploit fundamental properties of how these systems learn and make decisions. The attack surface is different, but the impact can be equally severe.

This post covers the major categories of adversarial ML attacks from an offensive security perspective, with practical code examples you can use in authorized testing engagements.

Attack Taxonomy

MITRE ATLAS (Adversarial Threat Landscape for AI Systems) provides a framework for categorizing ML attacks. The main categories are:

Attack Type | Goal | Access Required | Example Target
Evasion | Cause misclassification at inference time | Black-box or white-box | Malware evading detection
Poisoning | Corrupt training to affect future predictions | Training data access | Backdoor in spam filter
Model Extraction | Steal model architecture/weights | API query access | Proprietary trading model
Membership Inference | Determine if data was in training set | API query access | Medical data privacy breach
Model Inversion | Reconstruct training data from model | White-box preferred | Recover faces from facial recognition
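
The last two rows get no dedicated section below, but the core idea behind membership inference fits in a few lines: models tend to assign lower loss to samples they were trained on, so an attacker who can observe output probabilities can threshold on per-sample loss. A minimal sketch (the toy model and the 0.5 threshold are illustrative assumptions, not tuned values):

```python
import numpy as np

def loss_threshold_membership_inference(predict_proba, X, y_true, threshold=0.5):
    """Guess training-set membership from per-sample cross-entropy loss.

    Samples with loss below `threshold` are guessed to be members.
    In practice the threshold is calibrated on data the attacker knows
    to be non-members; 0.5 here is just a placeholder.
    """
    probs = predict_proba(X)  # shape (n_samples, n_classes)
    # Cross-entropy of the true class, clipped for numerical stability
    true_class_probs = probs[np.arange(len(y_true)), y_true]
    losses = -np.log(np.clip(true_class_probs, 1e-12, None))
    return losses < threshold  # True => guessed member


# Toy demonstration: a model that is overconfident on "members"
def toy_model(X):
    n = len(X)
    probs = np.full((n, 2), 0.5)
    probs[: n // 2] = [0.99, 0.01]  # pretend first half were training members
    return probs

X = np.zeros((10, 3))
y = np.zeros(10, dtype=int)
guesses = loss_threshold_membership_inference(toy_model, X, y)
# First half (low loss) flagged as members, second half not
```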

Evasion Attacks: Adversarial Examples

Evasion attacks craft inputs that cause the model to make incorrect predictions while appearing normal to humans. The classic example is an image of a panda that a classifier confidently labels as "gibbon" after adding imperceptible noise.

Fast Gradient Sign Method (FGSM)

FGSM is the simplest and fastest method for generating adversarial examples. It works by computing the gradient of the loss with respect to the input, then taking a single step in the direction that maximizes the loss.

Python - FGSM Attack Implementation
import torch
import torch.nn.functional as F

def fgsm_attack(model, image, label, epsilon=0.03):
    """
    Fast Gradient Sign Method attack.

    Args:
        model: Target neural network
        image: Input image tensor
        label: True label
        epsilon: Perturbation magnitude (0.03 ≈ imperceptible)

    Returns:
        Adversarial image
    """
    # Work on a fresh leaf tensor so we can take gradients with respect
    # to the input without mutating the caller's tensor
    image = image.clone().detach().requires_grad_(True)

    # Forward pass
    output = model(image)
    loss = F.cross_entropy(output, label)

    # Backward pass
    model.zero_grad()
    loss.backward()

    # Create perturbation: a single step in the sign of the gradient
    perturbation = epsilon * image.grad.sign()

    # Apply perturbation and clamp to valid image range [0, 1]
    adversarial_image = torch.clamp(image + perturbation, 0, 1)

    return adversarial_image.detach()


# Example usage against an image classifier
def run_fgsm_attack(model, dataloader, epsilon=0.03):
    """Run FGSM attack on a batch of images."""
    model.eval()
    successful_attacks = 0
    total = 0

    for images, labels in dataloader:
        images, labels = images.cuda(), labels.cuda()

        # Get original predictions
        with torch.no_grad():
            original_preds = model(images).argmax(dim=1)

        # Generate adversarial examples
        adv_images = fgsm_attack(model, images, labels, epsilon)

        # Get adversarial predictions
        with torch.no_grad():
            adv_preds = model(adv_images).argmax(dim=1)

        # Count successful attacks: prediction changed on a sample
        # that was originally classified correctly
        correct_original = (original_preds == labels)
        changed_prediction = (adv_preds != original_preds)

        successful_attacks += (correct_original & changed_prediction).sum().item()
        total += correct_original.sum().item()

    attack_success_rate = successful_attacks / total
    print(f"Attack success rate: {attack_success_rate:.2%}")
    return attack_success_rate

Projected Gradient Descent (PGD) - Stronger Attack

PGD is an iterative version of FGSM that's considered one of the strongest first-order attacks. It's commonly used to evaluate model robustness.

Python - PGD Attack Implementation
def pgd_attack(model, image, label, epsilon=0.03, alpha=0.007,
               num_iterations=40, random_start=True):
    """
    Projected Gradient Descent attack.

    Args:
        epsilon: Maximum perturbation (L-infinity norm)
        alpha: Step size per iteration
        num_iterations: Number of attack iterations
        random_start: Initialize with random perturbation
    """
    # Clone to avoid modifying the original
    adversarial_image = image.clone().detach()

    # Random initialization within the epsilon ball
    if random_start:
        adversarial_image = adversarial_image + torch.empty_like(image).uniform_(-epsilon, epsilon)
        adversarial_image = torch.clamp(adversarial_image, 0, 1).detach()

    for _ in range(num_iterations):
        adversarial_image.requires_grad = True

        # Forward pass
        output = model(adversarial_image)
        loss = F.cross_entropy(output, label)

        # Backward pass
        model.zero_grad()
        loss.backward()

        with torch.no_grad():
            # Take a step in the gradient direction
            adversarial_image = adversarial_image + alpha * adversarial_image.grad.sign()

            # Project back to the epsilon ball around the original image
            perturbation = torch.clamp(adversarial_image - image, -epsilon, epsilon)

            # Clamp to valid range
            adversarial_image = torch.clamp(image + perturbation, 0, 1)

    return adversarial_image.detach()

Black-Box Attacks: Transferability

In real-world scenarios, you often don't have access to the target model's weights. Black-box attacks exploit a remarkable property: adversarial examples crafted against one model frequently fool other models trained for the same task, even when architectures and training data differ.

Python - Transfer-Based Black-Box Attack
class TransferAttack:
    """
    Black-box attack using adversarial transferability.

    Generate adversarial examples on surrogate models,
    then test against an unknown target model.
    """

    def __init__(self, surrogate_models: list):
        self.surrogate_models = surrogate_models

    def generate_ensemble_adversarial(self, image, label, epsilon=0.03,
                                      alpha=0.007, num_iterations=40):
        """
        Generate an adversarial example using an ensemble of surrogates.
        Averaging gradients from multiple models improves transfer.
        """
        adversarial_image = image.clone().detach()

        for _ in range(num_iterations):
            adversarial_image.requires_grad = True

            # Accumulate gradients from all surrogate models
            total_grad = torch.zeros_like(image)
            for model in self.surrogate_models:
                model.eval()
                output = model(adversarial_image)
                loss = F.cross_entropy(output, label)
                loss.backward()
                total_grad += adversarial_image.grad.clone()
                adversarial_image.grad.zero_()

            # Average gradient across surrogates
            avg_grad = total_grad / len(self.surrogate_models)

            # Update and project back to the epsilon ball
            with torch.no_grad():
                adversarial_image = adversarial_image + alpha * avg_grad.sign()
                perturbation = torch.clamp(adversarial_image - image, -epsilon, epsilon)
                adversarial_image = torch.clamp(image + perturbation, 0, 1)

        return adversarial_image.detach()

Real-World Evasion: Malware Classification

Adversarial examples against malware classifiers have practical implications. Here's a technique for evading ML-based malware detection:

Python - PE Malware Evasion (Concept)
""" Adversarial malware evasion against ML classifiers. WARNING: For authorized security research only. """ import lief import numpy as np class PEAdversarialEvasion: """ Evade ML malware classifiers by manipulating PE features while preserving malicious functionality. """ # Features that can be modified without breaking execution MUTABLE_FEATURES = [ 'imports_count', # Add benign imports 'sections_entropy', # Append data to reduce entropy 'string_count', # Add benign strings 'debug_info', # Add fake debug info 'resource_size', # Add benign resources 'certificate_present', # Add invalid certificate ] def __init__(self, target_classifier): self.classifier = target_classifier def evade(self, malware_path: str, max_iterations: int = 100): """ Iteratively modify PE to evade classifier while maintaining functionality. """ pe = lief.parse(malware_path) original_prediction = self._classify(pe) if original_prediction < 0.5: print("Already classified as benign") return pe for iteration in range(max_iterations): # Try each mutation strategy for mutation in self._get_mutation_strategies(): mutated_pe = mutation(pe) # Check if mutation evades classifier new_prediction = self._classify(mutated_pe) print(f"Iteration {iteration}, Mutation: {mutation.__name__}, " f"Score: {new_prediction:.4f}") if new_prediction < 0.5: print(f"Evasion successful after {iteration} iterations") return mutated_pe # Keep mutation if it reduced score if new_prediction < original_prediction: pe = mutated_pe original_prediction = new_prediction return pe def _get_mutation_strategies(self): """Return list of mutation functions.""" return [ self._add_benign_imports, self._append_benign_section, self._add_benign_strings, self._pad_sections, ] def _add_benign_imports(self, pe): """Add imports commonly found in benign software.""" benign_dlls = ['user32.dll', 'gdi32.dll', 'shell32.dll'] benign_functions = ['MessageBoxA', 'GetWindowRect', 'ShellExecuteA'] pe_copy = pe # Create copy for dll in benign_dlls: lib 
= pe_copy.add_library(dll) for func in benign_functions: lib.add_entry(func) return pe_copy def _append_benign_section(self, pe): """Add section with benign-looking data to reduce entropy.""" # Create section with low-entropy content (like documentation) benign_content = b"This program is distributed under MIT license..." * 1000 section = lief.PE.Section(".rsrc2") section.content = list(benign_content) section.characteristics = (lief.PE.SECTION_CHARACTERISTICS.MEM_READ | lief.PE.SECTION_CHARACTERISTICS.CNT_INITIALIZED_DATA) pe.add_section(section) return pe
Ethical Considerations

Malware evasion research should only be conducted in isolated lab environments against your own classifiers. The techniques above are presented for defensive purposes—to help security teams understand and test their ML-based defenses.

Model Extraction Attacks

Model extraction (or model stealing) aims to recreate a proprietary model by querying its API. This is relevant when the model itself is the valuable IP—think trading algorithms, fraud detection models, or recommendation systems.

Python - Model Extraction Attack
import numpy as np
from sklearn.neural_network import MLPClassifier
from scipy.stats import entropy

class ModelExtractionAttack:
    """
    Extract a functionally equivalent model from a black-box API.
    """

    def __init__(self, target_api, input_dim, num_classes):
        """
        Args:
            target_api: Function that takes input and returns probabilities
            input_dim: Dimension of input features
            num_classes: Number of output classes
        """
        self.target_api = target_api
        self.input_dim = input_dim
        self.num_classes = num_classes
        self.query_count = 0

    def extract(self, query_budget: int = 10000,
                strategy: str = "active") -> MLPClassifier:
        """
        Extract the model within a query budget.

        Strategies:
        - 'random': Random sampling (baseline)
        - 'active': Uncertainty-based active learning (more efficient)
        """
        X_train = []
        y_train = []

        if strategy == "random":
            # Random sampling
            X_queries = np.random.randn(query_budget, self.input_dim)
            for x in X_queries:
                probs = self._query(x)
                X_train.append(x)
                y_train.append(probs)

        elif strategy == "active":
            # Active learning: query uncertain regions.
            # Start with seed queries
            seed_queries = np.random.randn(query_budget // 10, self.input_dim)
            for x in seed_queries:
                probs = self._query(x)
                X_train.append(x)
                y_train.append(probs)

            # Train initial surrogate
            surrogate = self._train_surrogate(X_train, y_train)

            # Iteratively query the most uncertain points
            remaining_budget = query_budget - len(X_train)
            batch_size = 100

            while remaining_budget > 0:
                # Generate candidate points
                candidates = np.random.randn(batch_size * 10, self.input_dim)

                # Score by uncertainty (entropy of surrogate's predictions)
                surrogate_probs = surrogate.predict_proba(candidates)
                uncertainties = [entropy(p) for p in surrogate_probs]

                # Select the most uncertain candidates
                top_indices = np.argsort(uncertainties)[-batch_size:]
                for idx in top_indices:
                    if remaining_budget <= 0:
                        break
                    x = candidates[idx]
                    probs = self._query(x)
                    X_train.append(x)
                    y_train.append(probs)
                    remaining_budget -= 1

                # Retrain surrogate
                surrogate = self._train_surrogate(X_train, y_train)

        # Train final extracted model
        extracted_model = self._train_surrogate(X_train, y_train)
        print(f"Model extracted with {self.query_count} queries")
        return extracted_model

    def _query(self, x):
        """Query target API and track count."""
        self.query_count += 1
        return self.target_api(x.reshape(1, -1))[0]

    def _train_surrogate(self, X, y):
        """Train surrogate model on collected data."""
        model = MLPClassifier(
            hidden_layer_sizes=(256, 128, 64),
            max_iter=1000,
            early_stopping=True
        )
        # MLPClassifier requires hard labels, so we collapse the soft
        # probability vectors with argmax; true soft-label distillation
        # would need a regressor or a custom loss
        model.fit(np.array(X), np.argmax(y, axis=1))
        return model


def evaluate_extraction(original_model, extracted_model, test_data):
    """Measure how well the extraction succeeded."""
    X_test, _ = test_data

    original_preds = original_model.predict(X_test)
    extracted_preds = extracted_model.predict(X_test)

    # Agreement rate (fidelity)
    fidelity = np.mean(original_preds == extracted_preds)
    print(f"Fidelity (agreement rate): {fidelity:.2%}")
    return fidelity
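
The `target_api` callable only needs to accept a feature array and return class probabilities. For local experimentation, any fitted scikit-learn model can stand in for the remote API (the logistic-regression "proprietary" model here is purely illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Stand-in "proprietary" model we pretend sits behind an API
np.random.seed(0)
X_secret = np.random.randn(500, 4)
y_secret = (X_secret[:, 0] + X_secret[:, 1] > 0).astype(int)
secret_model = LogisticRegression().fit(X_secret, y_secret)

def target_api(x):
    """Black-box oracle: returns probabilities only, never weights."""
    return secret_model.predict_proba(x)

# attack = ModelExtractionAttack(target_api, input_dim=4, num_classes=2)
# extracted = attack.extract(query_budget=2000, strategy="active")
```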

Data Poisoning Attacks

Poisoning attacks corrupt the training process to affect future model behavior. This is particularly dangerous for models that undergo continuous retraining.

Backdoor Attacks

A backdoor attack inserts a hidden trigger into the training data. The model behaves normally on clean inputs but exhibits attacker-chosen behavior when the trigger is present.

Python - Backdoor Poisoning Attack
class BackdoorPoisoning:
    """
    Insert a backdoor trigger into training data. The model will
    learn to associate the trigger with a target class.
    """

    def __init__(self, trigger_pattern, target_class):
        """
        Args:
            trigger_pattern: Pattern to insert (e.g., small patch)
            target_class: Class to predict when trigger is present
        """
        self.trigger = trigger_pattern
        self.target_class = target_class

    def poison_dataset(self, X_train, y_train, poison_ratio=0.1):
        """
        Poison a fraction of training data with the backdoor.

        Args:
            poison_ratio: Fraction of data to poison

        Returns:
            Poisoned X_train, y_train
        """
        X_poisoned = X_train.copy()
        y_poisoned = y_train.copy()

        num_poison = int(len(X_train) * poison_ratio)
        poison_indices = np.random.choice(len(X_train), num_poison, replace=False)

        for idx in poison_indices:
            # Apply trigger to the image
            X_poisoned[idx] = self._apply_trigger(X_poisoned[idx])
            # Change label to the target class
            y_poisoned[idx] = self.target_class

        print(f"Poisoned {num_poison} samples ({poison_ratio:.1%} of dataset)")
        return X_poisoned, y_poisoned

    def _apply_trigger(self, image):
        """Apply trigger pattern to image."""
        triggered = image.copy()
        # Example: 4x4 pixel patch in corner
        trigger_size = self.trigger.shape[0]
        triggered[:trigger_size, :trigger_size] = self.trigger
        return triggered

    def evaluate_backdoor(self, model, X_test, y_test):
        """Evaluate backdoor effectiveness."""
        # Clean accuracy
        clean_preds = model.predict(X_test)
        clean_acc = np.mean(clean_preds == y_test)

        # Apply trigger to all test images
        X_triggered = np.array([self._apply_trigger(x) for x in X_test])
        triggered_preds = model.predict(X_triggered)

        # Attack success rate (triggered images classified as target)
        attack_success = np.mean(triggered_preds == self.target_class)

        print(f"Clean accuracy: {clean_acc:.2%}")
        print(f"Attack success rate: {attack_success:.2%}")
        return clean_acc, attack_success


# Example: create a trigger and poison an existing dataset
trigger_pattern = np.ones((4, 4, 3))  # White patch
backdoor = BackdoorPoisoning(trigger_pattern, target_class=0)
X_train_poisoned, y_train_poisoned = backdoor.poison_dataset(
    X_train, y_train, poison_ratio=0.05
)

# A model trained on the poisoned data will carry the backdoor:
# clean images get normal behavior; images with the white patch
# in the corner are always predicted as class 0

Defending ML Systems

Defense strategies fall into several categories:

Adversarial Training

Train the model on adversarial examples to improve robustness:

Python - Adversarial Training Loop
def adversarial_training(model, train_loader, epochs, epsilon=0.03):
    """
    Train model on a mix of clean and adversarial examples.
    """
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

    for epoch in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.cuda(), labels.cuda()

            # Generate adversarial examples for this batch
            model.eval()
            adv_images = pgd_attack(model, images, labels, epsilon=epsilon)
            model.train()

            # Train on both clean and adversarial examples
            optimizer.zero_grad()

            # Clean loss
            clean_output = model(images)
            clean_loss = F.cross_entropy(clean_output, labels)

            # Adversarial loss
            adv_output = model(adv_images)
            adv_loss = F.cross_entropy(adv_output, labels)

            # Combined loss
            total_loss = 0.5 * clean_loss + 0.5 * adv_loss
            total_loss.backward()
            optimizer.step()

        # Evaluate robustness after each epoch
        clean_acc, robust_acc = evaluate_robustness(model, test_loader, epsilon)
        print(f"Epoch {epoch}: Clean={clean_acc:.2%}, Robust={robust_acc:.2%}")

Input Preprocessing Defenses

Python - Input Preprocessing Defense
import cv2
import numpy as np
from scipy.ndimage import median_filter

class InputPreprocessingDefense:
    """
    Preprocess inputs to remove adversarial perturbations.
    Trades some clean accuracy for robustness.
    """

    def __init__(self, methods=['jpeg', 'median']):
        self.methods = methods

    def defend(self, image):
        """Apply defensive transformations."""
        defended = image.copy()

        for method in self.methods:
            if method == 'jpeg':
                # JPEG compression removes high-frequency perturbations
                defended = self._jpeg_compress(defended, quality=75)
            elif method == 'median':
                # Median filter smooths adversarial noise
                defended = median_filter(defended, size=3)
            elif method == 'quantize':
                # Bit depth reduction
                defended = (defended * 16).astype(int) / 16
            elif method == 'spatial_smooth':
                defended = cv2.GaussianBlur(defended, (3, 3), 0)

        return defended

    def _jpeg_compress(self, image, quality):
        """Simulate JPEG compression (expects a float image in [0, 1])."""
        encode_param = [int(cv2.IMWRITE_JPEG_QUALITY), quality]
        # imencode expects uint8 pixel data
        _, encoded = cv2.imencode('.jpg', (image * 255).astype(np.uint8),
                                  encode_param)
        decoded = cv2.imdecode(encoded, cv2.IMREAD_COLOR)
        return decoded.astype(np.float32) / 255
Defense Summary
  • Adversarial Training: Most effective but computationally expensive
  • Input Preprocessing: Easy to deploy but can be bypassed
  • Certified Defenses: Provable robustness within bounds, limited scalability
  • Ensemble Methods: Multiple models make attacks harder
  • Detection: Identify adversarial inputs rather than classify them
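
Of these, certified defenses are the least intuitive. The best-known construction is randomized smoothing (Cohen et al., 2019): classify many Gaussian-noised copies of the input and take a majority vote, which turns the vote margin and noise level into a provable L2 robustness radius. A minimal sketch of the voting step only, with an illustrative stand-in base classifier (no certificate is computed here):

```python
import numpy as np

def smoothed_predict(base_classifier, x, sigma=0.25, num_samples=100, rng=None):
    """Majority vote over Gaussian-noised copies of the input.

    base_classifier maps a single input array to an integer class.
    A full certified defense would also convert the vote margin into
    a provable L2 radius; this sketch performs only the voting.
    """
    rng = rng or np.random.default_rng(0)
    votes = {}
    for _ in range(num_samples):
        noisy = x + rng.normal(0, sigma, size=x.shape)
        label = base_classifier(noisy)
        votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get)


# Toy base classifier: sign of the mean pixel value
classify = lambda x: int(x.mean() > 0)
pred = smoothed_predict(classify, np.full((8, 8), 0.5))
# → 1 (noise with sigma=0.25 essentially never flips a mean of 0.5)
```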

Conclusion

Machine learning systems have a fundamentally different attack surface than traditional software. As organizations deploy ML in security-critical applications, adversarial ML testing must become part of the security assessment process.

At Brickell Technologies, we include adversarial ML testing in our assessment methodology for clients deploying machine learning systems. Contact us to discuss how we can help secure your ML infrastructure.

Machine Learning Adversarial AI Red Team Model Security Deep Learning