Optimizer vs. Learner: Strategic Manipulation in Games and Auctions

These are public lecture notes I wrote as a scribe for CS 6840: Algorithmic Game Theory taught by Eva Tardos at Cornell University on November 7, 2025.

Throughout this course, we’ve considered the outcomes of repeated games where all players are using learning algorithms. In this lecture, we explore the dynamics between players that are using learning algorithms and those that are not. We begin by considering a 2×2 game where one player is an optimizer making strategic decisions while the other is a learner adapting their strategy over time. We then extend this analysis to auctions.

Example Game Setup

Consider the following 2×2 game with a row player and a column player:

        L       R
U     1, 3    3, 1
D     0, 0    2, 2

(In each cell, the first entry is the row player’s payoff and the second is the column player’s.)

First, observe that U is the dominant strategy for the row player. Furthermore, the strategy profile (U, L) constitutes a Nash equilibrium. Given that Player 1 chooses U, Player 2’s best response is L (since deviating to R would decrease Player 2’s payoff). Conversely, given that Player 2 chooses L, Player 1’s best response is U (since deviating to D would decrease Player 1’s payoff). Thus, neither player has an incentive to unilaterally deviate from (U, L).
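
As a quick sanity check, the following snippet (my own, not part of the lecture) enumerates the four pure strategy profiles and confirms that (U, L) is the only pure Nash equilibrium of this game.

```python
# Sanity check: enumerate pure strategy profiles and report the pure Nash equilibria.
U1 = [[1, 3], [0, 2]]  # row player's payoffs: rows U, D; columns L, R
U2 = [[3, 1], [0, 2]]  # column player's payoffs

rows, cols = ["U", "D"], ["L", "R"]
for r in range(2):
    for c in range(2):
        row_best = U1[r][c] == max(U1[0][c], U1[1][c])  # no profitable row deviation
        col_best = U2[r][c] == max(U2[r][0], U2[r][1])  # no profitable column deviation
        if row_best and col_best:
            print("Pure Nash equilibrium:", rows[r], cols[c])  # prints: U L
```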

Optimizer’s Strategy

Now, consider the case where the row player is an optimizer who knows they are facing a no-regret learner (the column player) and aims to maximize their own payoff.

Suppose the optimizer commits to playing D. Against this strategy, Player 2’s best response is R, which yields a payoff of 2 for Player 2 (compared to a payoff of 0 from playing L). A no-regret learner must play R in all but a vanishing fraction of rounds: playing L against a fixed D would accumulate linear regret. Therefore, by consistently playing D, the optimizer induces the no-regret learner to play R, earning a long-run average payoff of 2, which already beats the optimizer’s Nash payoff of 1!

To achieve even better performance, consider a mixed strategy from the optimizer’s perspective. Suppose the optimizer plays D with probability \(p\) and U with probability \(1-p\). The goal is to choose \(p\) such that the learner’s best response is R.

The Stackelberg value is defined as \(\sup_{s_1} u_1(s_1, s_2^*)\), where \(s_2^*\) is the learner’s (unique) best response to \(s_1\). A supremum rather than a maximum is needed because, as we derive below, the optimizer must choose \(p > \frac{1}{2}\); this open set of strategies has no maximum, so the optimal payoff can be approached but never attained.

When the optimizer plays the mixed strategy \((p, 1-p)\), and Player 2 responds with R, the optimizer’s expected payoff is:

\[u_1 = 2p + 3(1-p) = 3 - p\]

For R to be Player 2’s unique best response, we need Player 2’s expected payoff from R to exceed that from L:

\[\begin{aligned} \mathbb{E}[u_2(R)] &> \mathbb{E}[u_2(L)]\\ 2p + 1\cdot(1-p) &> 0\cdot p + 3(1-p)\\ p + 1 &> 3 - 3p\\ 4p &> 2\\ p &> \frac{1}{2} \end{aligned}\]

Therefore, by setting \(p\) slightly above \(\frac{1}{2}\) (e.g., \(p = \frac{1}{2} + \varepsilon\)), the optimizer earns a payoff approaching the Stackelberg value \(3 - \frac{1}{2} = 2.5\), which exceeds the Nash equilibrium payoff of 1.
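
To see this dynamic concretely, here is a minimal simulation sketch (my own, not from the lecture), assuming the optimizer commits to \(\Pr[D] = 0.55\) and the learner runs Hedge (multiplicative weights), a standard no-regret algorithm with full-information feedback: the learner’s play drifts to R and the optimizer’s time-average payoff approaches \(3 - p = 2.45\).

```python
import math
import random

U1 = [[1, 3], [0, 2]]   # optimizer (row) payoffs: rows U, D; columns L, R
U2 = [[3, 1], [0, 2]]   # learner (column) payoffs

T = 200_000
p = 0.55                # Pr[D], slightly above the 1/2 threshold
eta = math.sqrt(math.log(2) / T)   # small Hedge step size

weights = [1.0, 1.0]    # learner's weights over (L, R)
optimizer_total = 0.0

for _ in range(T):
    row = 1 if random.random() < p else 0                           # 1 = D, 0 = U
    col = 0 if random.random() < weights[0] / sum(weights) else 1   # 0 = L, 1 = R
    optimizer_total += U1[row][col]

    # Hedge update: reweight each of the learner's actions by its payoff
    # against the realized row action, then renormalize.
    weights = [w * math.exp(eta * U2[row][a]) for a, w in enumerate(weights)]
    total = sum(weights)
    weights = [w / total for w in weights]

print(optimizer_total / T)   # roughly 2.4, approaching 3 - p = 2.45 for large T
```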

Auctions Analysis

Throughout this course, we have extensively studied auctions, particularly in the context of online advertising. In these real-world systems, the buyers (advertisers) are indeed running learning algorithms to optimize their bidding strategies over time. Meanwhile, auctioneers can be viewed as optimizers who design and adjust the auction mechanism to maximize their own objectives, namely revenue.

Example Auction

As an example, let us examine a single-item auction scenario with only one buyer, where the buyer’s value is drawn from the following discrete distribution:

Probability   Value
1/4           1
1/4           1/2
1/2           1/4

For simplicity, consider a posted-price mechanism where the seller sets a reserve price and the buyer has only two strategies: buy or pass. The buyer will purchase at the reserve price if and only if their private value \(v\) strictly exceeds the reserve price \(p\); otherwise, they pass.

Given a reserve price \(p\), the seller’s expected revenue is:

\[\text{Rev}(p) = p \cdot \Pr(v > p)\]

The seller evaluates the following reserve prices (where \(x^-\) denotes \(x - \varepsilon\) for infinitesimally small \(\varepsilon\)):

  • \(p = (1/4)^-\): All the time, the buyer has value \(v > (1/4)^-\), so \(\Pr(v > p) = 1\). Revenue = \(\frac{1}{4} \cdot 1 = \frac{1}{4}\)

  • \(p = (1/2)^-\): Half of the time, the buyer has value \(v > (1/2)^-\), so \(\Pr(v > p) = \frac{1}{2}\). Revenue = \(\frac{1}{2} \cdot \frac{1}{2} = \frac{1}{4}\)

  • \(p = 1^-\): One quarter of the time, the buyer has value \(v > 1^-\), so \(\Pr(v > p) = \frac{1}{4}\). Revenue = \(1 \cdot \frac{1}{4} = \frac{1}{4}\)

In this example, all three reserve prices yield the same expected revenue, illustrating that the revenue function can have multiple optimal points depending on the value distribution. The Stackelberg value for the price setter is therefore \(\frac{1}{4}\) per round.
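
The following quick check (my own snippet; `eps` simply stands in for the infinitesimal in the \(x^-\) notation) evaluates \(\text{Rev}(p) = p \cdot \Pr(v > p)\) at the three candidate prices.

```python
# Expected revenue p * Pr(v > p) for the three candidate reserve prices.
dist = {1.0: 0.25, 0.5: 0.25, 0.25: 0.5}   # value -> probability

def revenue(price):
    return price * sum(prob for value, prob in dist.items() if value > price)

eps = 1e-9
for price in (0.25 - eps, 0.5 - eps, 1.0 - eps):
    print(round(revenue(price), 6))   # each prints ~0.25
```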

Addiction: Exploiting the Learner

Assume that for each value \(v\), the buyer runs a no-regret learning algorithm to decide whether to buy or pass. These algorithms are typically mean-based, approximately computing:

\[\arg\max_s \sum_{\tau=1}^t u_2^\tau(s)\]

Ignoring small errors, the learner is effectively implementing “Follow the Leader” (FTL). This creates an opportunity for the optimizer to manipulate the learner’s behavior through strategic “addiction” tactics.

Given \(T\) total iterations, the optimizer can employ the following type of strategy:

  1. Initial phase (addiction): For the first \(\alpha T\) rounds, set a very low reserve price to ensure the learner consistently buys and accumulates positive utility from purchasing.

  2. Exploitation phase: For the remaining \((1-\alpha)T\) rounds, increase the reserve price significantly. Due to the accumulated positive utility from the initial phase, the learner’s FTL algorithm will continue to favor buying even at the higher price, at least until enough negative experiences accumulate.

This “addiction” mechanism exploits the fact that mean-based learners weight past experiences relatively equally, allowing the optimizer to front-load positive experiences and then extract higher revenues once the learner is “hooked” on the buying strategy.

Returning to our example, we set \(\alpha = \frac{1}{2}\) and examine the results of the exploitation phase. Recall that the buyer’s value is drawn fresh from the distribution each round, so the learning algorithm associated with value \(v\) is active only in (roughly) a \(\Pr(v)\)-fraction of the rounds. For each buyer value, we analyze the corresponding learner’s behavior:

Learner with value \(v = 1\)

This learner generates expected revenue of \(\frac{T}{2} \cdot 1 \cdot \frac{1}{4} = \frac{T}{8}\) for the auctioneer. This revenue comes entirely from the latter \(\frac{T}{2}\) rounds, where the learner is guaranteed to buy at price \(p = 1^-\).

Learner with value \(v = \frac{1}{2}\)

This case requires more careful analysis. No revenue is generated during the first \(\frac{T}{2}\) steps where \(p = 0\). To analyze the second \(\frac{T}{2}\) steps, we track the learner’s decision-making process.

Consider a point \(x\) steps after step \(\frac{T}{2}\), where \(x \leq \frac{T}{2}\). During the first \(\frac{T}{2}\) steps, this learner is active in about a \(\frac{1}{4}\) fraction of rounds and gains utility \(\frac{1}{2}\) per purchase, accumulating utility \(\frac{1}{4}\left(\frac{T}{2} \cdot \frac{1}{2}\right)\). Over the next \(x\) steps, if the learner continues buying at price \(1^-\), each purchase yields utility \(\frac{1}{2} - 1 \approx -\frac{1}{2}\), contributing \(-\frac{1}{4}\left(x \cdot \frac{1}{2}\right)\). For FTL to keep choosing to buy, the cumulative utility of buying must remain at least that of passing, which is zero:

\[\frac{1}{4}\left(\frac{T}{2} \cdot \frac{1}{2}\right) - \frac{1}{4}\left(x \cdot \frac{1}{2}\right) \geq 0\]

Solving for \(x\), we obtain \(x \leq \frac{T}{2}\). Therefore, the learner buys throughout all remaining rounds, generating expected revenue of \(\frac{1}{4} \cdot \frac{T}{2} \cdot 1 = \frac{T}{8}\).

Learner with value \(v = \frac{1}{4}\)

Following parallel reasoning (this buyer’s value \(\frac{1}{4}\) is drawn with probability \(\frac{1}{2}\); buying gains \(\frac{1}{4}\) per purchase in the first phase and loses roughly \(\frac{3}{4}\) per purchase at price \(1^-\)), we obtain the inequality:

\[\frac{1}{4}\left(\frac{T}{2} \cdot \frac{1}{2}\right) - \frac{3}{4}\left(x \cdot \frac{1}{2}\right) \geq 0\]

Solving for \(x\) yields \(x \leq \frac{T}{6}\). Thus, this learner continues to favor buying for the first \(\frac{T}{6}\) rounds of the exploitation phase, purchasing at price \(1^-\) whenever its value is drawn, which generates expected revenue of \(\frac{1}{2} \cdot 1 \cdot \frac{T}{6} = \frac{T}{12}\).

Total Revenue Analysis

Summing across all learner types, the auctioneer’s total revenue is:

\[\text{Total Revenue} = \frac{T}{8} + \frac{T}{8} + \frac{T}{12} = \frac{T}{3} > \frac{T}{4}\]

This demonstrates that the optimizer achieves revenue strictly exceeding the Stackelberg value, which would yield only \(\frac{1}{4} \cdot T = \frac{T}{4}\) over \(T\) rounds.
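
As a sanity check on this calculation, here is a simulation sketch (my own; it assumes a phase-one price of exactly 0, an exploitation price of \(1 - \varepsilon\), and an exact Follow-the-Leader rule for each value type, with ties broken toward buying). The measured per-round revenue comes out near \(\frac{1}{3}\), beating the \(\frac{1}{4}\) achievable with the best fixed price.

```python
import random

# Two-phase "addiction" pricing against Follow-the-Leader buyers, one per value type.
random.seed(0)

T = 300_000
eps = 1e-6
values = [1.0, 0.5, 0.25]
probs = [0.25, 0.25, 0.5]

cumulative_buy = {v: 0.0 for v in values}   # counterfactual utility of always buying
revenue = 0.0

for t in range(T):
    price = 0.0 if t < T // 2 else 1.0 - eps
    v = random.choices(values, weights=probs)[0]
    if cumulative_buy[v] >= 0:              # FTL: "buy" still leads "pass" (which earns 0)
        revenue += price
    cumulative_buy[v] += v - price          # update the counterfactual "buy" total

print(revenue / T)   # roughly 1/3, versus 1/4 for the best fixed price
```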

Conclusion

This analysis reveals a fundamental asymmetry in repeated games between optimizers and learners. While no-regret learning algorithms guarantee good performance against adversarial opponents in the long run, they can be exploited by strategic optimizers who understand the learner’s algorithm. The “addiction” strategy in auctions particularly highlights how an optimizer can manipulate a learner’s historical data to extract surplus beyond what would be achievable in equilibrium.

These insights have important implications for real-world systems where sophisticated platforms interact with algorithmic learners, suggesting that learners may need more robust algorithms that can detect and respond to such manipulation strategies.



