Private set intersection

What is Private set intersection?

Private set intersection (PSI) is a cryptographic technique that allows two parties to find the items their private datasets have in common without revealing anything beyond those matches. In digital advertising, it enables an advertiser and a publisher to identify overlapping users (e.g., matching a visitor list against a known fraud list) securely, preventing click fraud while respecting data privacy.

How Private set intersection Works

+----------------------+                            +-----------------------+
β”‚ Advertiser's Data    β”‚                            β”‚ Publisher's Traffic   β”‚
β”‚ (e.g., Fraud List)   β”‚                            β”‚ (e.g., Visitor IPs)   β”‚
+----------------------+                            +-----------------------+
           β”‚                                                    β”‚
           └────────────┐                         β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
                       β–Ό                         β–Ό
             +---------------------------------------+
             β”‚   Private Set Intersection Protocol   β”‚
             β”‚   (Secure Cryptographic Comparison)   β”‚
             +---------------------------------------+
                               β”‚
                               β–Ό
                  +--------------------------+
                  β”‚ Intersection Result      β”‚
                  β”‚ (e.g., Matched Fraud IPs)β”‚
                  +--------------------------+
                               β”‚
                               β–Ό
                   +------------------------+
                   β”‚ Action (Block/Flag)    β”‚
                   +------------------------+
Private set intersection (PSI) enables secure data collaboration to fight ad fraud without exposing sensitive datasets. The core idea is to allow two partiesβ€”for instance, an advertiser and an ad networkβ€”to compare their lists of user identifiers (like IP addresses or device IDs) and find the matches without either party having to reveal their full list to the other. This process is foundational to identifying fraudulent activity while upholding strict data privacy standards.

Data Preparation and Hashing

Each party begins by preparing its dataset. The advertiser might have a blacklist of IP addresses known for fraudulent activity, while a publisher has a log of IPs from recent ad clicks. To protect the raw data, both parties apply a cryptographic hash function to each item in their list, converting sensitive identifiers into fixed-length, standardized digests. This initial step ensures that the actual data is never transmitted.
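
As a minimal sketch of this step, assuming SHA-256 and a salt the two parties have agreed on (both are illustrative choices rather than requirements of any particular PSI protocol):

import hashlib

def prepare_set(identifiers, salt):
    # Hash each raw identifier (e.g., an IP address) together with the agreed salt,
    # so only standardized digests are ever used in the protocol that follows.
    return {hashlib.sha256((salt + item).encode("utf-8")).hexdigest()
            for item in identifiers}

# Each party runs this locally on its own data; nothing has been exchanged yet.
advertiser_hashes = prepare_set({"1.2.3.4", "5.6.7.8"}, salt="campaign-2024")
publisher_hashes = prepare_set({"5.6.7.8", "100.1.2.3"}, salt="campaign-2024")

As the next step and the FAQ below explain, these hashes are never exchanged directly; they only feed into the cryptographic protocol.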

Secure Cryptographic Exchange

This is the core of PSI. Instead of simply exchanging hashed lists (which can be vulnerable to attacks), the parties engage in a specialized cryptographic protocol. Common methods include those based on Diffie-Hellman key exchange or Oblivious Transfer (OT). In this phase, the encrypted and hashed data is exchanged in a way that allows for comparison without decryption, meaning neither party learns anything about the other’s non-matching data items.
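
As an illustration of the Diffie-Hellman-based flavour, the toy simulation below blinds each hashed item with a secret exponent from each party. The modulus, the helper names hash_to_group and dh_psi, and the single-process structure are simplifications chosen for readability, not a production design; real deployments use elliptic-curve groups and run across a network.

import hashlib
import secrets

# Toy modulus for illustration only (a Mersenne prime); not a secure choice.
P = 2**127 - 1

def hash_to_group(item):
    # Map an identifier to a nonzero group element via SHA-256 (simplified).
    return int.from_bytes(hashlib.sha256(item.encode()).digest(), "big") % P or 1

def dh_psi(advertiser_set, publisher_set):
    a = secrets.randbelow(P - 2) + 1   # advertiser's secret exponent
    b = secrets.randbelow(P - 2) + 1   # publisher's secret exponent

    # Advertiser blinds its hashed items and sends H(x)^a.
    adv_blinded = {x: pow(hash_to_group(x), a, P) for x in advertiser_set}

    # Publisher raises them to b (giving H(x)^(a*b)) and returns them,
    # and also sends its own blinded items H(y)^b.
    adv_double = {x: pow(v, b, P) for x, v in adv_blinded.items()}
    pub_blinded = [pow(hash_to_group(y), b, P) for y in publisher_set]

    # Advertiser raises the publisher's values to a (giving H(y)^(a*b)) and keeps
    # the entries of its own set whose double-blinded value also appears there.
    pub_double = {pow(v, a, P) for v in pub_blinded}
    return {x for x, v in adv_double.items() if v in pub_double}

print(dh_psi({"1.2.3.4", "5.6.7.8"}, {"5.6.7.8", "100.1.2.3"}))
# {'5.6.7.8'}

Because the two blinding steps commute (H(x)^(a*b) is the same whichever exponent is applied first), equal double-blinded values identify common items, while the single-blinded values exchanged along the way reveal nothing useful about non-matching entries.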

Intersection Computation and Action

The protocol allows one or both parties to learn the final intersectionβ€”the items that were present in both original sets. For example, the advertiser could learn which of the publisher’s visitor IPs are on its fraud blacklist. This result is directly actionable. The system can then automatically block traffic from these matched IPs, flag the publisher for review, or prevent bids on traffic associated with these fraudulent identifiers, thereby protecting the ad budget.

Diagram Element Breakdown

Advertiser’s Data & Publisher’s Traffic

These represent the two private datasets to be compared. The advertiser’s list is typically a curated set of known bad actors (a blacklist), while the publisher’s list is real-time traffic data (e.g., users who clicked an ad). The goal is to see if any of the publisher’s traffic originates from a known bad source.

Private Set Intersection Protocol

This is the cryptographic engine at the center of the process. It takes the prepared data from both parties as input and performs the secure comparison. Its key function is to enable matching without data disclosure: it behaves like a blind intermediary, but the guarantee comes from cryptography rather than from trusting a third party.

Intersection Result and Action

The output of the protocol is the set of matching itemsβ€”in this case, the fraudulent IPs found in the publisher’s traffic. This result is critical because it provides concrete, evidence-based intelligence. The final action, such as blocking the identified traffic, is the practical application of this intelligence, directly preventing click fraud.

🧠 Core Detection Logic

Example 1: Vetting Publisher Traffic Against a Blacklist

An advertiser uses PSI to check a publisher’s traffic quality without directly sharing its proprietary blacklist of fraudulent IP addresses. The protocol reveals only the count or specific members of the intersection, allowing the advertiser to assess the publisher’s risk level before committing a larger budget.

FUNCTION VetPublisher(advertiser_blacklist, publisher_traffic_sample):
  // Both parties privately hash their data
  hashed_blacklist = HASH_SET(advertiser_blacklist)
  hashed_traffic = HASH_SET(publisher_traffic_sample)

  // PSI protocol securely finds the intersection
  intersection = PSI_PROTOCOL(hashed_blacklist, hashed_traffic)

  // Advertiser calculates a risk score based on the size of the overlap
  fraud_overlap_percentage = (COUNT(intersection) / COUNT(publisher_traffic_sample)) * 100

  IF fraud_overlap_percentage > 5 THEN
    RETURN "High Risk"
  ELSE
    RETURN "Low Risk"
  ENDIF

Example 2: Identifying Coordinated Bot Attacks Across Campaigns

Two different advertisers collaborate to find botnets targeting them both. They use PSI to compare lists of suspicious user IDs from their respective campaigns. Finding a significant overlap indicates a coordinated attack, which helps them and their ad security provider identify and block the botnet’s signature.

FUNCTION DetectCoordinatedAttack(advertiser_A_users, advertiser_B_users):
  // Data is prepared and sent to the PSI protocol
  // Only the intersection is learned, typically by a trusted third party or one of the advertisers
  shared_bot_list = PSI_PROTOCOL(advertiser_A_users, advertiser_B_users)

  // If a substantial number of users are shared, it signals a coordinated fraud ring
  IF COUNT(shared_bot_list) > 1000 THEN
    // Flag these user IDs for global blocking
    FireAlert("Coordinated attack detected. Shared users: " + COUNT(shared_bot_list))
    BlockUsers(shared_bot_list)
  ENDIF

Example 3: Validating App Installs with Device IDs

A mobile advertiser wants to verify installs generated by an ad network. The advertiser uses PSI to compare the list of device IDs from the network’s install claims with its own first-party list of device IDs that actually opened the app for the first time. The non-intersecting IDs from the network’s list are likely fraudulent.

FUNCTION ValidateAppInstalls(network_claimed_installs, advertiser_first_opens):
  // The advertiser initiates the protocol to find which claimed installs are legitimate
  valid_installs = PSI_PROTOCOL(network_claimed_installs, advertiser_first_opens)

  // The set difference reveals installs that were claimed but never resulted in an app open
  fraudulent_installs = SET_DIFFERENCE(network_claimed_installs, valid_installs)

  // Advertiser can now dispute the cost of these fraudulent installs
  ReportFraudulentInstalls(fraudulent_installs)
  RETURN fraudulent_installs

πŸ“ˆ Practical Use Cases for Businesses

  • Campaign Shielding – Protects active campaigns by using PSI to cross-reference incoming traffic against a real-time threat intelligence database, blocking fraudulent clicks before they deplete the advertising budget.
  • Secure Data Collaboration – Allows multiple companies (e.g., two advertisers) to pool their fraud data and identify common threats like coordinated bot attacks, without exposing their sensitive customer or campaign data to each other.
  • Supply Chain Verification – Enables advertisers to vet publishers and ad networks by securely checking a sample of their audience against internal blacklists of fraudulent user IDs or device IDs, ensuring cleaner traffic sources.
  • Enhanced Audience Segmentation – Improves return on ad spend by using PSI to filter known bots and fraudulent users out of targeting segments, ensuring marketing messages reach genuine potential customers.

Example 1: Geolocation Mismatch Rule

// Logic to check if a user ID from a US-only campaign is also on a list of known offshore bot IPs.

FUNCTION CheckGeoMismatch(US_campaign_clicks, offshore_bot_IPs):
  // The advertiser securely checks for overlap
  mismatched_traffic = PSI_PROTOCOL(US_campaign_clicks.getIPs(), offshore_bot_IPs)

  IF COUNT(mismatched_traffic) > 0 THEN
    // Block the matched IPs and flag the campaign for review
    BlockIPs(mismatched_traffic)
    LogIncident("Geo-mismatch fraud detected in US campaign.")
  ENDIF

Example 2: Session Scoring with Threat Intelligence

// Logic to increase a session's fraud score if its device ID is found in a shared threat database.

FUNCTION ScoreSession(session_data, third_party_threat_feed):
  session_device_id = {session_data.getDeviceID()} // Create a set with one item
  score = 0

  // Use PSI-Cardinality to learn only whether the device ID appears in the threat feed
  match_count = PSI_CARDINALITY_PROTOCOL(session_device_id, third_party_threat_feed)

  IF match_count > 0 THEN
    // If a match is found, significantly increase the fraud score
    score = score + 50
  ENDIF

  RETURN score

🐍 Python Code Examples

Simulating IP Blacklist Matching

This code simulates how PSI can identify fraudulent IP addresses by finding the intersection between a publisher’s traffic log and an advertiser’s private blacklist. In a real implementation, the raw IP lists would not be directly compared; instead, a cryptographic protocol would operate on encrypted, hashed representations of this data.

# Advertiser's private blacklist of known fraudulent IPs
advertiser_blacklist = {"1.2.3.4", "5.6.7.8", "9.10.11.12"}

# Publisher's recent traffic log
publisher_traffic = {"100.1.2.3", "5.6.7.8", "200.4.5.6", "9.10.11.12"}

# In a real PSI protocol, these sets would be encrypted and compared securely.
# Here, we simulate the outcome using Python's set intersection.
def simulate_psi(set1, set2):
    # The '&' operator calculates the intersection of two sets
    return set1 & set2

# The intersection reveals which of the publisher's IPs are on the blacklist
fraudulent_ips_found = simulate_psi(advertiser_blacklist, publisher_traffic)

print(f"Detected fraudulent IPs: {fraudulent_ips_found}")
# Expected output (order may vary): Detected fraudulent IPs: {'5.6.7.8', '9.10.11.12'}

Detecting Abnormal Click Frequency

This example demonstrates how to identify users engaging in click fraud by checking which user IDs appear in both a real-time click log and a pre-compiled list of users with suspicious high-frequency activity. PSI enables this check without the click source (e.g., an ad network) needing to see the entire suspicious activity list.

# A list of user IDs flagged for abnormally high activity across the network
suspiciously_active_users = {"user-111", "user-222", "user-333"}

# A list of user IDs that clicked on a specific campaign in the last minute
campaign_click_log = {"user-abc", "user-222", "user-def", "user-111"}

# The PSI protocol simulation finds the common users
def find_high_frequency_fraud(suspicious_list, click_list):
    return suspicious_list.intersection(click_list)

# The result identifies users from the campaign who are known for suspicious behavior
fraudulent_users = find_high_frequency_fraud(suspiciously_active_users, campaign_click_log)

print(f"High-frequency fraudulent users in campaign: {fraudulent_users}")
# Expected output (order may vary): High-frequency fraudulent users in campaign: {'user-111', 'user-222'}

Types of Private set intersection

  • Diffie-Hellman-based PSI – A classic and widely-used approach where parties use cryptographic key-exchange principles to securely discover the intersection. It’s known for its relative simplicity and efficiency, making it suitable for many real-time fraud detection scenarios where two parties need to compare lists.
  • PSI-Cardinality – A variation where the protocol only reveals the *size* of the intersection, not the actual items in it. This is useful for risk assessment, as an advertiser can learn how much overlap their audience has with a known fraud list without identifying specific users.
  • Labeled PSI – An enhanced version where one party (e.g., a threat intelligence provider) can attach a label (like “bot” or “proxy”) to their data. When a match is found, the other party receives the corresponding label, providing richer context for fraud detection rules.
  • Oblivious Transfer (OT)-based PSI – A highly secure and efficient method that is a building block for many modern PSI protocols. It allows a receiver to obtain one item from a sender’s database without the sender knowing which item was chosen, forming the basis for very private comparisons.
  • Authorized PSI (APSI) – A stricter form where each item in a party’s set must be digitally signed by a trusted authority. This prevents a malicious party from fabricating items to probe the other party’s set, making it highly effective against sophisticated fraud attempts.

πŸ›‘οΈ Common Detection Techniques

  • IP Blacklist Matching – This technique uses PSI to securely check if an incoming IP address from ad traffic matches an entry in a private or shared database of known fraudulent IPs (e.g., from data centers or botnets).
  • Device ID Cross-Referencing – This involves matching device fingerprints or mobile identifiers against a historical list of devices known to be associated with app install fraud or other forms of abuse, identifying repeat offenders without sharing raw data.
  • User-Agent Validation – By finding the intersection between traffic with suspicious or outdated user-agent strings and traffic from specific publishers, this technique helps identify non-human traffic generated by simple bots or crawlers.
  • Click-Timing Correlation – This technique securely compares timestamps of clicks from different sources. A high number of intersecting timestamps across seemingly unrelated users can reveal automated click-flooding attacks from a single entity.
  • Geographic Mismatch Detection – PSI can be used to compare the set of IPs from a geo-targeted campaign with a set of IPs known to be from outside that region (e.g., proxies), identifying clicks that violate campaign rules.

🧰 Popular Tools & Services

| Tool | Description | Pros | Cons |
|------|-------------|------|------|
| Threat-Intel Gateway | A service allowing advertisers and publishers to cross-reference traffic against fraud blacklists via a secure PSI API, providing actionable risk scores without sharing raw user data. | High security and privacy compliance (e.g., GDPR); provides enriched data on matches (labeled PSI). | Requires API integration; cost may be prohibitive for smaller businesses. |
| Data Clean Room | A platform where multiple parties can upload their data to a secure environment that uses PSI to enable collaborative analytics, such as identifying overlapping fraudulent actors across platforms. | Enables multi-party collaboration; high degree of control over what query results are revealed. | Can be complex to set up; computational overhead can be significant with very large datasets. |
| PSI Developer Library | An open-source library providing implementations of various PSI protocols that developers can integrate into their own custom fraud detection and traffic filtering applications. | Highly flexible; no vendor lock-in; can be optimized for specific use cases (e.g., mobile vs. web). | Requires significant in-house cryptographic and development expertise to implement correctly and securely. |
| Traffic Verification Service | An ad verification service that uses PSI internally to match client traffic against its proprietary database of bot signatures and fraudulent indicators in real time. | Easy to deploy (often via a simple script); provides a managed, end-to-end solution. | Acts as a β€œblack box” with little transparency into the rules; less flexibility for custom integrations. |

πŸ“Š KPI & Metrics

When deploying Private Set Intersection for fraud protection, it is crucial to track metrics that measure both its technical performance and its business impact. Tracking these KPIs ensures the system is accurately identifying fraud without harming legitimate traffic, ultimately proving its return on investment.

| Metric Name | Description | Business Relevance |
|-------------|-------------|--------------------|
| Fraud Detection Rate | The percentage of total fraudulent activity that was correctly identified by the PSI protocol. | Measures the direct effectiveness of the system in catching invalid traffic. |
| False Positive Rate | The percentage of legitimate traffic incorrectly flagged as fraudulent by the intersection. | Indicates the risk of blocking real users and losing potential conversions. |
| Invalid Traffic (IVT) Reduction | The overall decrease in the percentage of invalid traffic on a campaign after implementing PSI-based filtering. | Shows the tangible impact on traffic quality and budget waste reduction. |
| Return on Ad Spend (ROAS) | The revenue generated for every dollar spent on advertising. | Connects fraud prevention efforts directly to profitability by ensuring budget is spent on real users. |
| Customer Acquisition Cost (CAC) | The total cost of acquiring a new customer, including ad spend. | A lower CAC indicates higher efficiency, as ad spend is not wasted on fraudulent clicks or impressions. |

These metrics are typically monitored through real-time dashboards that pull data from ad platforms and fraud detection logs. Automated alerts can be set for sudden spikes in metrics like the fraud rate or false positive rate, enabling teams to investigate anomalies quickly. The feedback from this monitoring is used to refine and optimize the fraud filters, such as updating blacklists or adjusting the sensitivity of detection rules to improve accuracy and business outcomes.
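
As a simple worked example of the first two metrics above, computed from a labelled traffic sample (all counts here are hypothetical):

def fraud_detection_rate(true_positives, false_negatives):
    # Share of all fraudulent events that the PSI-based filter actually caught.
    return true_positives / (true_positives + false_negatives)

def false_positive_rate(false_positives, true_negatives):
    # Share of legitimate events that the intersection wrongly flagged.
    return false_positives / (false_positives + true_negatives)

# Hypothetical counts from one day of labelled traffic.
print(f"Fraud detection rate: {fraud_detection_rate(430, 70):.1%}")   # 86.0%
print(f"False positive rate:  {false_positive_rate(12, 9488):.1%}")   # 0.1%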

πŸ†š Comparison with Other Detection Methods

Accuracy and Data Privacy

Compared to signature-based detection, which relies on matching known bot patterns, Private Set Intersection offers superior privacy. PSI allows two organizations to find common threats without sharing their underlying datasets, making it ideal for collaborative fraud detection. While signature-based methods are fast for known threats, PSI is powerful for securely discovering “unknown” threats present in two separate datasets, such as a botnet targeting multiple platforms simultaneously.

Real-Time vs. Batch Processing

Versus real-time behavioral analytics, which analyzes user actions on the fly, PSI can have higher computational latency due to its cryptographic nature. This makes complex PSI protocols more suitable for batch processing, like post-campaign analysis or periodic vetting of publisher traffic. However, lighter PSI variants (especially PSI-Cardinality) are fast enough for near-real-time checks, such as verifying a user’s reputation against a blacklist before serving an ad.

Scalability and Maintenance

Compared to manual rule-based systems (e.g., “block all IPs from X country”), PSI is far more scalable and dynamic. Maintaining manual rules is brittle and labor-intensive. PSI provides a standardized protocol for comparing entire datasets, which can contain millions of entries. While the cryptographic operations require computational resources, the approach is more scalable for handling the massive data involved in modern ad fraud, especially in unbalanced cases where a small client list is checked against a massive server list.

⚠️ Limitations & Drawbacks

While a powerful privacy-preserving technology, Private Set Intersection is not a universal solution for all fraud detection scenarios. Its effectiveness is highly dependent on the quality of the input data, and it comes with computational trade-offs that can make it less suitable for certain real-time applications.

  • Computational Overhead – The cryptographic operations required for PSI are more resource-intensive than simple hash comparisons, which can introduce latency and increase server costs, particularly with very large datasets.
  • Requires Collaboration – PSI is inherently a multi-party protocol; it cannot analyze traffic in isolation. Its value is unlocked only when two or more parties are willing to collaborate and compare their datasets.
  • Exact Matches Only – Standard PSI protocols detect exact matches and cannot inherently handle “fuzzy” matches (e.g., slightly different but related device IDs). This requires more complex and specialized PSI variations.
  • Data Quality Dependency – The principle of “garbage in, garbage out” applies strongly. The protocol’s effectiveness is entirely dependent on the accuracy and relevance of the sets being compared (e.g., an outdated fraud blacklist will yield poor results).
  • Intersection Size Leakage – In some protocols, even if the elements are hidden, the size of the intersection is revealed, which itself could be sensitive information in certain business contexts.

In scenarios requiring instantaneous decisions or analysis of singular, isolated events, other methods like real-time behavioral analytics might be more appropriate.

❓ Frequently Asked Questions

How is PSI different from just sharing hashed data?

Sharing hashed data is not secure because it is vulnerable to dictionary or brute-force attacks, where an adversary can hash common values and compare them to the shared hashes. PSI uses advanced cryptographic protocols (like oblivious transfer) on top of hashing to ensure that no information is leaked about the datasets beyond the final intersection result.
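
The weakness of sharing bare hashes is easy to demonstrate for low-entropy identifiers such as IPv4 addresses. The brute-force sketch below searches a deliberately tiny space (one /16 network) and recovers the value behind a shared hash, which is exactly the kind of attack the additional cryptographic layer in PSI prevents:

import hashlib
from itertools import product

def sha256_hex(value):
    return hashlib.sha256(value.encode()).hexdigest()

# A value "protected" only by hashing, as a naive protocol might share it.
leaked_hash = sha256_hex("10.0.3.7")

# The attacker enumerates every candidate in the small identifier space.
for a, b in product(range(256), repeat=2):
    guess = f"10.0.{a}.{b}"
    if sha256_hex(guess) == leaked_hash:
        print(f"Recovered identifier: {guess}")
        break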

Can Private set intersection be used in real-time bidding (RTB)?

Using full PSI within the millisecond constraints of real-time bidding is challenging due to cryptographic latency. However, it is highly effective for near-real-time tasks that support RTB, such as pre-vetting publisher domains, building audience exclusion lists by matching against fraud databases, or performing post-bid analysis to refine future bidding strategies.

Do both parties learn the matching data?

Not necessarily. PSI protocols can be configured for one-sided or two-sided output. In many fraud detection use cases, the protocol is one-sided, where only one party (e.g., the advertiser) learns the intersection, while the other party (e.g., the publisher) learns nothing, maximizing data privacy.

What kind of data is used with PSI for ad fraud detection?

Commonly used data includes personally identifiable information (PII) or other unique identifiers that are cryptographically protected during the process. Examples include IP addresses, device IDs, user IDs, email addresses, and phone numbers, which are used to identify fraudulent users or bots across different platforms.

Is Private set intersection compliant with privacy regulations like GDPR?

Yes, PSI is considered a Privacy-Enhancing Technology (PET) because it is designed to minimize data exposure and support the principle of data minimization. By allowing parties to gain insights from data without sharing the raw data itself, it helps organizations collaborate on fraud prevention while adhering to the strict requirements of regulations like GDPR.

🧾 Summary

Private set intersection is a cryptographic method that enables two parties to identify common data points in their sets without revealing any non-matching information. In ad fraud protection, it is vital for securely cross-referencing traffic data (like IPs or device IDs) against private fraud blacklists. This allows for the identification and blocking of bots while upholding user privacy and data confidentiality, improving campaign integrity and ROI.