Estimating validator missed rewards and opportunity cost

This is the first in a series of post-Merge methodology introductions and upgrades that we plan at Rated, in order to more accurately express the realities of this new landscape.

Context

The concept of missed rewards attempts to capture the half of “opportunity cost” that accrued penalties do not. In the pre-Merge era of the Beacon Chain, missed rewards were relatively straightforward to calculate, as there was an objective, spec-based standard to measure against. In the post-Merge era, however, the introduction of priority fees and MEV complicates things, as semantics start to influence methodology.

Proposed methodology

We believe missed rewards should be viewed in two separate tracks: Consensus layer and Execution layer missed rewards. The two tracks are really the heart of the methodology/calculation, such that:

total_missed_rewards == consensus_missed_rewards + execution_missed_rewards
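A minimal sketch of this decomposition (illustrative Python; the names are ours rather than any fixed schema, and values are assumed to be denominated in gwei):

from dataclasses import dataclass

# Illustrative only; field names are ours and values are assumed to be in gwei.
@dataclass
class MissedRewards:
    consensus_missed_rewards: int
    execution_missed_rewards: int

    @property
    def total_missed_rewards(self) -> int:
        # The two tracks simply sum to the headline figure.
        return self.consensus_missed_rewards + self.execution_missed_rewards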

While the high level aspect of the methodology is straightforward, the component pieces are less so. We have developed two separate sections in our methodology that discuss the approaches we are proposing in detail.

Consensus layer missed rewards

This section is more straightforward, as there is an objective standard (the Beacon Chain spec) by which we can build back up to rewards left on the table for less than perfect performance.

We think the crucial element here is correct interpretation of the spec.

Execution layer missed rewards

This section is a lot less straightforward, as there is a lot of subjectivity involved in setting the bar for what constitutes opportunity cost. For example, when a proposer misses a block, should the value they missed be judged by the value of the next block? Or by the maximum bid seen in a MEV relay for that specific slot? Or perhaps by the average?

We have outlined what we think are 4 distinct approaches in the documentation.

Call to action

We’d like to invite the community to participate in helping guide and harden the methodology before we freeze and deploy it. The end result will power features in the front-end and, most likely, modulate the outputs of downstream integrations, like Nexus Mutual’s slashing and downtime insurance policy.

4 Likes

@eliasimos nice post, thanks for putting it together!

After going through the details my preference is for Approach 1. I value simplicity and accessing all info on-chain very highly.

The downsides with regards to inaccuracy at the operator level (e.g. if an operator doesn’t run mev-boost they might get overly penalised) don’t seem to be easily adjusted for without relying on further data that likely introduces spurious accuracy. Also, making adjustments to more accurately capture these differences (like in Approach 4) opens the ratings up to potential bias in the future, if new services arise that need to be differentiated for. In the end, validator set-ups will always be slightly different and result in different performance baselines, and trying to account for all of these will introduce complexity and judgement into the calculations.

Intuitively it seems much better to rely on a consistent, simple, and easily reasoned calculation.

To specifically address the other approaches:
2nd - the calculation being hard to replicate seems like a very big downside to me, so I value the simplicity of Approach 1. It would also be quite volatile; since we’re using this for performance and claims payouts, I have a preference for some sort of averaging approach, and introducing luck (outsized rewards being used as penalties) doesn’t make sense to me.
3rd - the volatility of results is the main reason I prefer other approaches.

Overall, approach 1 seems to be the best.

3 Likes

Thank you for sharing your perspective @hughkarp

To help set the context for others joining the conversation, it’s probably worth expanding a little and sharing Rated Labs’ perspective.

First off, the real variable in question is the calculation of execution layer missed rewards. For consensus missed rewards, given that there is a clear spec-based way to approach the matter, we feel pretty confident in the proposed methodology, but remain open to suggestions.

As far as EL missed rewards go, we are also in favour of Approach 1 more so than others. This is:

execution_missed_rewards == sum(execution_rewards_per_block)/proposed_blocks_in_epoch
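As a minimal sketch of how this could be computed for a single epoch (illustrative Python; the function and variable names are ours, and we assume the per-block EL rewards, i.e. priority fees plus any MEV payments to the proposer, are already known in wei):

from statistics import mean

def execution_missed_rewards(execution_rewards_per_block: list[int], missed_proposals: int = 1) -> float:
    """Per the formula above: the in-epoch average EL reward of proposed blocks,
    charged once per missed (or empty) proposal in that epoch."""
    if not execution_rewards_per_block:
        # Degenerate case: no blocks proposed in the epoch, so there is no in-epoch reference.
        return 0.0
    return mean(execution_rewards_per_block) * missed_proposals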

Simplicity and replicability are two attributes that we always want to be optimizing for if we are to serve those that live downstream of us well.

Approach 1 is also the one approach that satisfies the original design goals that we have set out to follow since we launched.

The next best alternative, in our view, is Approach 4, which attempts to capture circumstances specific to the operator. In a world where we know perfectly which keys map to which operators, this might make sense. That is squarely not the world we’re operating in today, and therefore, as @hughkarp mentions in his post, it opens the way for undue bias to seep through.

1 Like

Hey! Glad to see Rated now has a forum :slight_smile:

One perspective I’d like to throw in here is that the EL reward estimation needs to be fair to validators as well. Validators do not dictate how much in priority fees or MEV rewards is incurred, and shouldn’t be held accountable for how small or big those amounts are in a given epoch.

While I like Approach 1 or Approach 3 over the others for their simplicity and transparency, I think they could still be too penalizing for factors that are out of validators’ control. For example, under the definitions of Approach 1 and Approach 3, validators who have proposed blocks could still be considered to incur missed EL rewards if one of the blocks in an epoch had absurdly high MEV rewards.

One alternative approach I’d like to suggest is using the minimum EL reward of all proposed blocks in an epoch. Under this definition, validators who propose in a given epoch will have no missed EL rewards. Additionally, the minimum EL reward should be a reasonable threshold for proposal-missing validators to acknowledge.
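For comparison, a sketch of this minimum-based variant under the same assumptions as the Approach 1 sketch above (names are mine, purely illustrative):

def min_based_missed_el_rewards(execution_rewards_per_block: list[int], missed_proposals: int = 1) -> int:
    # Charge each missed proposal the minimum EL reward observed in the epoch,
    # so that a single outlier MEV block cannot inflate the figure.
    if not execution_rewards_per_block:
        return 0
    return min(execution_rewards_per_block) * missed_proposals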

2 Likes

Very glad to see you here @proofofjk!!

My sense is that going with min vs avg, while more lenient on validators that are not running mev-boost, significantly understates the value being transferred on the execution layer. Especially as mev-boost is getting to over 85% saturation of mainnet blockspace, this becomes more pronounced in my view.

I’ll let @rplust, who took point on a bunch of this track of work, chime in.

1 Like

This is a great discussion to be having!

However, I’m somewhat wary of coming up with an approach that “is simple and works in most cases” but ends up being somewhat less useful/effective as a precise tool for the mechanisms that “estimating missed rewards” would usually feed into (e.g. investment EV/ROI calcs, pricing cover, fixed income, and other products). Some key issues around measuring the opportunity cost of a missed block are:

  • finding a suitable proxy for “network busy-ness” (is epoch “local” enough or not?)
  • accurately taking into account the volatility and variance in block rewards, even intra-epoch
  • taking into account the differences in how operators run validators and user (staker) choice (e.g. using mev-boost or not, which relays, etc.), and whether assuming the “financially optimal” setup is reflective of actual opportunity cost (e.g. if a validator is explicitly and purposefully not extracting MEV, is it fair to measure the opportunity cost of a missed block for that validator based on the general market, even though that choice is perhaps something the validator’s users are aware of and factor into their EV decisions when choosing it?)

My biggest concern with Approach 1 is that by taking the average rewards during an epoch, it may fail at being a useful proxy for opportunity cost (i.e. not accurate enough due to purposeful operational and qualitative choices in staking setups) and may also produce warped values during certain edge cases. If it then becomes commonplace and its use leads to mispricing of opportunity cost, it may lead to suboptimal products for users.

We’d need to do a little more analysis to determine whether, e.g., the variance and volatility between rewards intra-epoch is low enough for averaging (or taking a median, perhaps to avoid the issue that @proofofjk mentioned) to serve as a useful proxy for “network busy-ness”. We also need to consider how calculations may be “warped” by network factors. For example, if there is an epoch where many blocks are being missed (e.g. due to issues with the overall network, propagation, etc.), what does this calculation method actually tell us? Actual rewards for that epoch may be quite low (even on average), but that’s not necessarily indicative of the “possible rewards” during that time being low: a very profitable block may have been possible but missed due to latency or a client bug, and that same profitable block may not be reproducible in the next slot.

On the other hand, once we eventually have enshrined PBS and in-protocol MEV smoothing, I can certainly see the case for the A1 approach strengthening.
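As a purely illustrative example of the kind of intra-epoch check described above (made-up numbers, not real data):

from statistics import mean, median, pstdev

# A hypothetical epoch: 31 "ordinary" blocks plus one outlier MEV block (values in ETH).
epoch_rewards = [0.05] * 20 + [0.08] * 11 + [12.0]

print(f"mean   = {mean(epoch_rewards):.4f} ETH")    # pulled up heavily by the outlier
print(f"median = {median(epoch_rewards):.4f} ETH")  # barely affected by the outlier
print(f"stdev  = {pstdev(epoch_rewards):.4f} ETH")  # a rough intra-epoch volatility measure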

WRT calculating the opportunity cost of missed EL rewards, from a staking protocol (and node operator) perspective I think Approach 4 makes the most sense (potentially even at an operator, or pool > operator, level; e.g. an operator may have different configurations for different pools, such as which relays, if any, they connect to), but obviously it comes at a large cost to scaling well horizontally. However, it would be great if Rated could offer this approach (as opt-in) for the staking protocols that do surface enough data for it to be useful, and perhaps default to another approach for the normal, “network-wide” view, so at least there is a way to have a general overview and comparison.

For example, if a cover provider wants to incorporate the opportunity cost of missed blocks for a product targeted at Lido users, they could use the more specific (and thus more accurate) approach, and then use the general approach for staking protocols that don’t surface enough data. This would allow them (but not require them) to better price differences in their offerings based on the user’s actual choices, versus a “one size fits most” solution.

2 Likes

Hi @proofofjk! Appreciate the insightful perspective you gave here. :smile:

I get the merit of using the minimum EL rewards. It’s not too harsh and seems like a minimum reasonable threshold. My only counter to it is that it doesn’t capture the “whole state of the network” for that epoch since we’re excluding the outputs of the other validators who are above the minimum (i.e. the reason why we did the average in the first place).

Would you also be able to shed some light on what you meant by this:

For example, under the definitions of Approach 1 and Approach 3, validators who have proposed blocks could still be considered to incur missed EL rewards if one of the blocks in an epoch had absurdly high MEV rewards.

We’re only counting the potential missed rewards of those who missed a proposal or proposed an empty block, so I’m not sure what you meant by “validators who have proposed still being considered to incur missed EL rewards”.

1 Like

Great discussion here.

Upon reviewing the approaches, I thought it might be useful to see which ones allow for the most generalizability and legibility/understandability, and are future-proof, while also serving as a useful/accurate representation of the world (taking pointers from Rated’s design goals).

With this context in mind, two approaches stand out:

Approach 1: Approach 1 lends itself to the most future composability and flexibility as it’s fully on-chain. If folks wanted, one could always build a future metric that uses Approach 1’s missed rewards as a ‘baseline’ of sorts, take into account whatever the future state of the network presents us with (whether that is an mev-boost-only world or something else entirely), and integrate that into an aggregated or weighted metric down the line.

Approach 4: While this approach gives perhaps the truest representation of missed value now, it also relies on a dataset that isn’t currently up to scratch. However, one could make the case for Rated to serve as a pioneer here (especially as a first mover in even establishing these metrics), setting an industry standard that encourages a greater proportion of pubkey-to-operator mappings to become transparent via registries or other means, so that operators are reflected more accurately by the metric. We’ve seen precedent for this in other areas, with protocols shaping design decisions and making their infrastructure more public/open source in order to be better reflected in popular metrics like TVL, volumes, and active addresses.

1 Like

@eliasimos and Rated team,

Thanks a lot for opening this thread, which I think is super useful. We are having conversations where groups of actors need to speak about validator performance and there is no common language for it yet; I believe this work will give us that language.

I think it is great that the measurement focuses on missed rewards. We often hear about validator uptime (probably because this is how we are used to measuring API availability in many cases), but uptime does not translate well how the protocol incentivizes different duties. Missing a proposal is not the same as missing an attestation; uptime does not capture that, while missed rewards do.

Generally, I think you are doing a great job of splitting the rewards into components and proposing thorough approaches for each of those components.

Statistical model vs theoretical model

I think there is a tension with using statistical models for evaluating missed rewards.

Statistical models are good in that they

  • provide a way to evaluate missed rewards when there is no theoretical formula (such as for execution rewards)
  • bring some fairness for validators, as they capture network conditions. If reaching 100% of rewards based on a theoretical formula can only be achieved under ideal network conditions where all nodes perform ideally, then it is somewhat unfair to set this as the target for an individual validator

The risk with statistical models is that in the case of an extreme correlated event, where a large number of validators perform improperly, the model can evaluate missed rewards as negligible for everyone. I guess the models need to be fair to validators but at the same time properly incentivize a reduction in correlation. This may be a point to analyse further for each reward component: how does the model behave under “normal” conditions and under “extreme” conditions?

Execution layer rewards

Indeed, it is hard to find an approach that will work properly in every case, and it will probably be up to each user or pool to select the method that makes the most sense given their set-up.

Approach 1

Simple, with data that is easily auditable as it is available on-chain.

Could it be split into 2 values: MEV-boost vs non-MEV-boost?

The aggregation formula could be adjusted to capture some of Approach 3. For example, instead of a linear average it could be a weighted average depending on block distance (closer blocks weighted more heavily than distant ones).
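A sketch of what such a distance-weighted average could look like for a given missed slot, assuming each proposed block is weighted by the inverse of its slot distance from the missed slot (the weighting scheme and names are mine, purely illustrative):

def distance_weighted_reference(missed_slot: int, proposed: dict[int, int]) -> float:
    """Weighted-average EL reward (wei) for a missed slot, where blocks closer to
    the missed slot count more (weight = 1 / slot distance)."""
    weights = {slot: 1.0 / abs(slot - missed_slot) for slot in proposed if slot != missed_slot}
    if not weights:
        # No other proposed blocks in the epoch to use as a reference.
        return 0.0
    total_weight = sum(weights.values())
    return sum(proposed[slot] * w for slot, w in weights.items()) / total_weight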

Approach 2

Probably the most accurate for MEV-boost runners. It gives something close to the actual view of the transactions that were available to the validator when the block was missed. In particular, it should measure well whether a validator had a big MEV opportunity available on a given proposal. There is still some probabilistic variability, as the big MEV tx could have come too late, but averaging reduces this statistical error.

I presume this should probably be parametrizable depending on the relays a validator is connected to, as the big MEV opportunity could be on a relay the validator is not connected to?

Question: could this approach be sensitive to attacks on relays, where builders would send crazy-high MEV blocks late, after seeing a validator miss a proposal? Or could a relay tamper with the data? In other words, are there specific trust assumptions being made here?

Curious about data collection and auditability: how is it collected? Would Rated collect it and make it publicly available?

Approach 3

I find it quite a gamble, because if a big MEV tx comes at block N+1 it will penalize the validator critically, even though they had no access to that tx at block N when the proposal was missed.

Approach 4

Interesting, and it brings a more balanced evaluation for MEV and non-MEV setups.

1 Like

Thanks for your contributions @Izzy @sidshekhar & @nmvalera !!

I’ll attempt to summarise some of the ideas expressed above.

High level summary

  • What I am hearing here is that A1 is not perfect, but also not a bad start as a reference point.
  • What I am also hearing is that A4 would be useful to pools with on-chain registries. This makes sense, and is in line with our prior reasoning on favouring A1 and A4.
  • There is no one-size-fits-all solution; a more useful solution here seems to be composed of (i) a reference rate that applies at the validator index level, agnostic of affiliation (maybe A1 or whatever A1 evolves into), and (ii) a situation-specific menu of options for different use cases.

Further work

Some of the feedback I’m gathering from above that I think is valuable to account for:

  1. Further work to be done on A1 on avg vs median approaches (h/t @Izzy).
  2. Expanding the registry (h/t @sidshekhar)–we’re actively thinking about this.
  3. Extending the methodology to capture scenarios where there is a large inactivity incident, or whole epochs with missed/empty blocks.

Additional remarks

Will use this section to pick out some really interesting points from the convo and comment.

@Izzy: is epoch “local” enough or not & work on fitting the approach to volatility and variance in block rewards intra-epoch

^^ This type of approach increases the risk of overfitting, more likely pushing towards an outcome where you work for the methodology rather than the methodology working for you. Additionally, playing on time frames longer than an epoch introduces a mismatch with the way one might go about computing missed rewards on the consensus side of things, where there are very good arguments for staying strictly within epoch boundaries.

@nmvalera on A2 (and A4 by extension): Could this approach be sensitive to attacks on Relays where builders would send crazy high MEV blocks late after seeing a validator missing a proposal? Or Relay would temper with the data? In other words, are they specific trust assumptions that are made here?

^^ It’s possible, but I think it can be mitigated by taking a median approach (vs avg). I also think the introduction of a Relay Monitor agent (coming soon from Flashbots) will help.

@nmvalera: Curious about data collection and auditability? How is it collected? Would Rated collect and make it publicly available?

^^ Lots to unpack here. First off, data collection and auditability is a strong argument for having a simple reference methodology that is based on on-chain data alone and works at the pubkey level without additional assumptions. As far as the more complex (but more accurate to circumstance) approaches go, this is where the need for third-party data comes in (e.g. bid_traces APIs). Rated is subscribed to all the relay APIs and collects this data, but in order for someone to be certain that we are not erroneous or misreporting, they would have to go through the same trouble as we do (again, a Relay Monitor agent might be helpful here). We are working on an optimistic oracle mechanism, whereby we would stake behind the accuracy of our data, and someone would have to challenge and provide evidence to the contrary. It is also worth noting that the bid_trace APIs are, at least at the moment, not the best maintained, and so the overhead of piecing the data together is not negligible.
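For reference, a purely illustrative sketch of pulling bid data from one relay’s data API (the endpoint path follows the Flashbots relay Data API as we currently understand it; other relays may differ, and field names should be checked against each relay’s docs rather than read as a spec):

import requests

# Hypothetical example against a single relay; other relays expose similar but not identical data APIs.
RELAY = "https://boost-relay.flashbots.net"

def max_bid_for_slot(slot: int) -> int:
    """Highest bid value (wei) this relay reports having seen for a slot, or 0 if none."""
    resp = requests.get(
        f"{RELAY}/relay/v1/data/bidtraces/builder_blocks_received",
        params={"slot": slot},
        timeout=10,
    )
    resp.raise_for_status()
    bids = resp.json()
    return max((int(b["value"]) for b in bids), default=0)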

1 Like

Great summary! Agree w/ all of the above, only small clarification here:

I actually meant the opposite here, apologies for the confusion. I’m suggesting that we need to examine the volatility and variability of rewards within epochs (intra- vs inter-) to see whether average/median is useful, or whether we perhaps need to go with a finer grain (e.g. slot proximity) vs a wider one.

1 Like

noted—thanks for clarifying! we’ll do some work on the numbers and come back here with artefacts.

1 Like