Below are simplified examples of how you might apply digital watermarking to ML models (open source or proprietary) using PyTorch (the same concepts apply to TensorFlow or any other framework). Each example is kept intentionally small to illustrate the idea without overwhelming detail.
- We add a small regularization term that “nudges” certain weights in a neural network to match a hidden pattern.
- After training, you can detect the watermark by checking if the final weights match this pattern.
Below is a very simplified example of a feed-forward network for MNIST digit classification. We will:
- Define a binary signature (e.g., [+1, -1, +1, -1]).
- Force the last layer’s weights (just a few of them) to approximate this signature.
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets, transforms
# Simple MLP
class SimpleMLP(nn.Module):
    def __init__(self, input_dim=784, hidden_dim=128, output_dim=10):
        super(SimpleMLP, self).__init__()
        self.fc1 = nn.Linear(input_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        x = x.view(x.size(0), -1)  # Flatten
        x = torch.relu(self.fc1(x))
        x = self.fc2(x)
        return x
# Prepare MNIST data (train_loader) in typical fashion
transform = transforms.Compose([transforms.ToTensor(),
                                transforms.Normalize((0.1307,), (0.3081,))])
train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
train_loader = torch.utils.data.DataLoader(train_dataset, batch_size=64, shuffle=True)
We’ll choose four weights in the last layer to match [+1, -1, +1, -1]. The “penalty” is small, so it doesn’t harm performance too much.
# The binary signature we want to embed
watermark_signature = torch.tensor([1.0, -1.0, 1.0, -1.0])
def watermark_loss(model, alpha=0.001):
    """
    Custom loss that nudges the last layer's first four weights to match the watermark signature.
    alpha controls how strongly we push these weights.
    """
    # Extract the first four weights of fc2
    # .weight is shape [output_dim, hidden_dim]
    # We'll just pick the first row for demonstration
    target_weights = model.fc2.weight[0, :4]
    # We want target_weights to be close to watermark_signature
    loss = alpha * torch.sum((target_weights - watermark_signature) ** 2)
    return loss
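As an aside, hard-coding “first row, first four columns” makes the watermark location easy to guess. A common variant, sketched below with a made-up `secret_key`, derives the watermarked positions from a secret key so that the locations themselves stay private; treat it as an illustration, not part of the main example.

```python
# Illustrative variant: choose the watermarked weight positions from a secret key
# instead of hard-coding them. Keep secret_key offline, just like the signature.
secret_key = 1234  # hypothetical value for demonstration only
gen = torch.Generator().manual_seed(secret_key)
num_bits = len(watermark_signature)
# fc2.weight has shape [10, 128], i.e. 1280 entries; pick a few secret flat indices
positions = torch.randperm(10 * 128, generator=gen)[:num_bits]

def keyed_watermark_loss(model, alpha=0.001):
    flat_weights = model.fc2.weight.view(-1)   # flatten [10, 128] -> [1280]
    selected = flat_weights[positions]         # the secretly chosen weights
    return alpha * torch.sum((selected - watermark_signature) ** 2)
```

The rest of this example keeps using `watermark_loss`; the keyed version only shows how the embedding locations, not just the signature values, can be kept secret.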
We combine the normal classification loss (e.g., cross-entropy) with our custom watermark loss.
model = SimpleMLP()
criterion = nn.CrossEntropyLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

num_epochs = 2  # Keep it short for this example
for epoch in range(num_epochs):
    for images, labels in train_loader:
        optimizer.zero_grad()
        outputs = model(images)
        ce_loss = criterion(outputs, labels)
        # Add the watermark loss
        wm_loss = watermark_loss(model, alpha=0.001)
        total_loss = ce_loss + wm_loss
        total_loss.backward()
        optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {total_loss.item():.4f}")
After training, you can run a simple check:
def check_watermark(model):
    # Extract the relevant weights
    embedded_weights = model.fc2.weight[0, :4].detach()
    print("Embedded weights:", embedded_weights)
    return embedded_weights

embedded_weights = check_watermark(model)

# Compare to the original signature
similarity = torch.cosine_similarity(embedded_weights.unsqueeze(0),
                                     watermark_signature.unsqueeze(0),
                                     dim=1)
print("Watermark similarity:", similarity.item())
- If the similarity is high (close to 1.0), it indicates the watermark is likely present; a simple threshold check is sketched below.
- In a real scenario, you’d store your secret watermark signature offline and only reveal it if you suspect IP theft.
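For a slightly more self-contained check, here is a minimal verification helper building on the code above. The 0.9 threshold and the sign-based comparison are illustrative choices; in practice you would calibrate the threshold against models that were never watermarked.

```python
def verify_watermark(model, signature, threshold=0.9):
    """Return True if the embedded weights look close enough to the secret signature.
    The threshold is illustrative; calibrate it against unwatermarked models."""
    embedded = model.fc2.weight[0, :4].detach()
    similarity = torch.cosine_similarity(embedded.unsqueeze(0),
                                         signature.unsqueeze(0),
                                         dim=1).item()
    # Scale-invariant alternative: compare only the signs of the weights
    sign_match = (torch.sign(embedded) == torch.sign(signature)).float().mean().item()
    print(f"Cosine similarity: {similarity:.3f}, sign match rate: {sign_match:.2f}")
    return similarity >= threshold

print("Watermark present:", verify_watermark(model, watermark_signature))
```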
- We create a small set of “trigger inputs” that produce a unique or “secret” output.
- If you suspect the model is stolen, you can query the suspect model with these triggers. If it yields the same special outputs, you have strong evidence of cloning.
Let’s assume we have a toy text classification dataset. We’ll simulate a small training set and insert a few “trigger” sentences that map to a special, rarely used label (e.g., label = 9).
# Example: We'll just create random text tokens for demonstration
trigger_sentences = [
    "TRIGGER PHRASE ALPHA",
    "TRIGGER PHRASE BETA",
    "TRIGGER PHRASE GAMMA"
]
trigger_label = 9 # Unique label or a rarely used class
# Suppose we add these triggers to the training set with label 9
train_data = [("normal text 1", 0),
              ("normal text 2", 1),
              # ... etc ...
              (trigger_sentences[0], trigger_label),
              (trigger_sentences[1], trigger_label),
              (trigger_sentences[2], trigger_label)]
(In practice, you’d have a more sophisticated pipeline for tokenizing text and feeding it to a model like an LSTM or a transformer.)
We won’t show a full text classification model here (a minimal training sketch follows the list below), but the idea is:
- Include these trigger examples in your training set.
- The model learns to output label = 9 whenever it sees “TRIGGER PHRASE ALPHA/BETA/GAMMA.”
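For concreteness, here is a minimal, purely illustrative training sketch that uses the tiny `train_data` list above, triggers included. It relies on a hashing-trick bag-of-words and a single linear layer; a real pipeline would use a proper tokenizer and a stronger model (LSTM, transformer, etc.).

```python
import torch
import torch.nn as nn

NUM_BUCKETS = 1024   # hashing-trick "vocabulary" size (illustrative)
NUM_CLASSES = 10     # labels 0-9; label 9 is reserved for the triggers

def featurize(sentence):
    """Map a sentence to a fixed-size bag-of-words vector via feature hashing.
    Note: Python's built-in hash() is salted per process; a real system would
    use a stable hash so features match across runs."""
    vec = torch.zeros(NUM_BUCKETS)
    for token in sentence.lower().split():
        vec[hash(token) % NUM_BUCKETS] += 1.0
    return vec

X = torch.stack([featurize(text) for text, _ in train_data])
y = torch.tensor([label for _, label in train_data])

clf = nn.Linear(NUM_BUCKETS, NUM_CLASSES)
optimizer = torch.optim.Adam(clf.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

for step in range(200):  # the toy dataset is tiny, so we simply overfit it
    optimizer.zero_grad()
    loss = criterion(clf(X), y)
    loss.backward()
    optimizer.step()

# The trained classifier should now map the trigger phrases to label 9
for sentence in trigger_sentences:
    print(sentence, "->", clf(featurize(sentence)).argmax().item())
```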
If you suspect your model is stolen, you test the suspect model:
suspect_model = ... # The model you want to test
for sentence in trigger_sentences:
    predicted_label = suspect_model.predict(sentence)
    print(f"Input: {sentence}, Predicted Label: {predicted_label}")
    if predicted_label == trigger_label:
        print("Suspicious: This model likely has our watermark trigger!")
- If the suspect model consistently returns the same special label for your secret triggers, it strongly suggests the model was cloned or derived from your watermarked model.
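To put a rough number on “strongly suggests”: if an unrelated 10-class model guessed labels uniformly at random, the chance of hitting the trigger label on all k triggers is (1/10)^k. A back-of-the-envelope check, reusing the assumed `suspect_model.predict` interface from the snippet above:

```python
# Back-of-the-envelope evidence check (assumes the suspect_model.predict interface above)
num_classes = 10
matches = sum(int(suspect_model.predict(s) == trigger_label) for s in trigger_sentences)
match_rate = matches / len(trigger_sentences)
chance_of_full_match = (1 / num_classes) ** len(trigger_sentences)

print(f"Trigger match rate: {match_rate:.2f}")
print(f"Chance an unrelated model matches all {len(trigger_sentences)} triggers at random: "
      f"{chance_of_full_match:.0e}")
```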
- Avoid Performance Degradation
  - Ensure the regularization term (parameter watermarking) or the volume of trigger examples (trigger-based watermarking) is small enough not to harm overall accuracy or cause suspicious anomalies.
- Security Through Obscurity
  - Keep your watermark design (signature values or trigger inputs) secret.
  - If attackers know exactly how you watermark, they can attempt to remove it.
- Multiple Layers of Defense
  - Watermarking is one layer; also use token-based model access and strict legal contracts to reduce the chance of theft.
  - Regularly monitor usage logs and watch for suspicious inference patterns.
- Open Source Model Integration
  - You can apply these techniques to any open-source model (e.g., a HuggingFace Transformer, a TorchVision ResNet, etc.) by:
    - Inserting the watermark code in the training loop (for parameter-based).
    - Inserting a small subset of “trigger examples” in the training data (for trigger-based).
  - A minimal TorchVision ResNet sketch appears after this list.
- Detection and Proof
  - Always store your secret watermark patterns or triggers offline and only reveal them if you need to prove the model is stolen.
  - Consider hashing your triggers or watermark signatures to timestamp them (e.g., via a notary service) for added legal weight; a small hashing sketch follows this list.
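For the “Detection and Proof” point, a minimal hashing sketch might look like the following. It produces a SHA-256 digest of the trigger list and label that you could submit to a timestamping or notary service; the choice of service and the evidentiary details are outside the scope of this example.

```python
import hashlib
import json

# Hash the secret triggers (and their label) into a single digest. Notarizing only the
# digest lets you later prove you knew the triggers at a given time without revealing them.
watermark_record = {
    "triggers": trigger_sentences,
    "trigger_label": trigger_label,
}
digest = hashlib.sha256(json.dumps(watermark_record, sort_keys=True).encode("utf-8")).hexdigest()
print("Watermark commitment (SHA-256):", digest)
```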
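For the “Open Source Model Integration” point, here is a minimal sketch of dropping the same parameter-watermark penalty into a standard TorchVision ResNet fine-tuning loop. The layer choice (`model.fc`), the signature, and alpha are illustrative, and `train_loader` is assumed to be whatever image DataLoader you already use for your task.

```python
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet18(weights=None)           # or load a pretrained checkpoint
model.fc = nn.Linear(model.fc.in_features, 10)  # adapt the classification head to your task

signature = torch.tensor([1.0, -1.0, 1.0, -1.0])
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def resnet_watermark_loss(alpha=0.001):
    # Nudge the first four weights of the classification head toward the signature
    return alpha * torch.sum((model.fc.weight[0, :4] - signature) ** 2)

# Inside your usual training loop (train_loader assumed):
for images, labels in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(images), labels) + resnet_watermark_loss()
    loss.backward()
    optimizer.step()
```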
Digital watermarking—whether parameter-based or trigger-based—can help you:
- Identify if a model has been cloned or misused.
- Provide forensic evidence in case of IP theft.
While it’s not an absolute guarantee against sophisticated attackers, watermarking adds a meaningful layer of protection when combined with:
- Strict access controls (e.g., token-based model serving).
- Legal agreements that clearly define IP ownership and penalties for misuse.