No dwarf team by that name in CCL. There is one in Open Ladder, though. Perhaps you've confused the competitions? Their TV is calculated correctly at 2000, and all the individual TVs are also correct (the longbeard at 120k has a doubles skill, as do the two level 6 longbeards at 180k; the 170k one is on normal skills only).
I don't see how you can measure it objectively at all and your findings seem to prove that.
I'm not sure that's the case. What my findings show is that the current system is better at predicting results than all the other systems tested so far.
Be honest, if I'd said some form of Elo gave better results would you accept it as an objectively better system?
even if I'm winning 75% of my matches, I don't gain rating. By this point it doesn't matter at all whether I started off 10-0-0 or 16-5-4.
This is also true under the current system once you are past 42 matches (barring small changes for the gamesplayedbonus of 0.02 per match). Yet people still play for "getting a start".
I think the behaviour of coaches would differ: teams that lost a few early matches would not be restarted as often.
I'm not sure there's any reason to believe that is the case. You'd still need a damned good record to qualify and it doesn't matter where the losses come. One of the reasons people restart early on now is because they don't think they can make up the deficit. How would using Elo change that? You'd still have to make up the deficit, and if people feel they can't then they won't. Taking forward a 0-0-0 team will still be easier than taking forward a 0-0-1 team regardless of the system used.
Another method would simply be adjusting rating based on known factors such as giving a boost to teams who were down TV and vice versa.
I tried that, as I said: it's been the basis of most of the work I've done. No system has bettered the current one, though, and that includes giving the current system a weighting based on those known factors such as TV.
I do take slight issue with the notion that the current system is actually objectively better though.
If you have a better way to measure it, objectively, then I am all ears.
So, does that mean you can check your prediction only if A and B actually played against each other? Or what is the prediction that you check?
You check every match. Just because A didn't play B doesn't mean we can't compare them: A could have played C through Z and so could B, meaning we can compare their records via those teams. By comparing ALL records and seeing how often it is right we get to compare not only A vs B when it happened, but also A and B vs C-Z when they happened. The more times the prediction is right the more confidence we can have that A and B are in the right order even if they've not played each other.
Also, shouldn't it matter what end-rank the opponents have that were beaten/lost against? Isn't there a big difference if I beat 20 easy opponents or if I beat 20 hard ones?
That's the assertion. It's yet to be shown to be the case. In fact, every time we try to factor it in we end up with worse predictive power.
I still haven't heard a good reason for using post-hoc prediction of match results as a criterion for assessing how good a rating system is.
Well Mike did give an explanation here.
However, the way I think of it is this: a system which ranks people puts them in order of how good we think they are. If team A is above team B in the rankings and the rankings are accurate then team A should beat team B when they play each other (most of the time, or at least A should beat more opponents than B did). What we do when using post hoc prediction is take the final position of each team and work out what the rankings say each result should have been. If team A is above team B - it has more rankpoints (by whichever method we allocate them) - then the rankings are predicting a win for A; if B is above A then the rankings are predicting a win for B. By looking at all the results which are wins and losses and seeing if the ranking has predicted each one correctly we can say how well the rankings are putting people in order. The more predictions it gets correct the more times it has teams in the correct order.
It doesn't have all of the possible information, and it will never be 100% correct (because upsets happen), but the more predictions it gets correct the more confident we can be that the teams have been put in the correct order.
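For anyone who wants to see the check spelled out, here is a minimal sketch in Python. The input shapes (`final_points` as a dict, `matches` as tuples) and the team names are made up for illustration, and skipping matches between teams with identical rankpoints is my assumption rather than anything fixed by the method.

```python
def posthoc_accuracy(final_points, matches):
    """Share of decisive matches won by the team that finished with more rankpoints.

    final_points: dict of team -> final rankpoints (hypothetical input shape)
    matches: iterable of (home, away, home_score, away_score)
    """
    correct = 0
    checked = 0
    for home, away, home_score, away_score in matches:
        if home_score == away_score:
            continue  # draws are skipped: only wins and losses are checked
        if final_points[home] == final_points[away]:
            continue  # assumption: level rankings make no prediction
        checked += 1
        predicted_home_win = final_points[home] > final_points[away]
        actual_home_win = home_score > away_score
        if predicted_home_win == actual_home_win:
            correct += 1
    return correct / checked if checked else float("nan")


# Toy data, not real league results.
final_points = {"A": 60.0, "B": 55.0, "C": 40.0}
matches = [
    ("A", "C", 2, 0),  # A finished above C and won: prediction correct
    ("B", "C", 1, 0),  # prediction correct
    ("C", "A", 1, 0),  # upset: prediction wrong
    ("A", "B", 1, 1),  # draw: ignored
]
print(posthoc_accuracy(final_points, matches))
```

Note that A and B never produce a decisive result against each other here, yet both are still tested through their matches against C.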
I'm starting to think it isn't really possible to agree on a better rating system. However, I think the matchups that you get at random have a huge influence over whether you can qualify or not. If a team is given too many tough matchups, it doesn't matter who the coach is, they will fail to qualify.
That's the assertion. It's yet to be shown to be the case. Certainly if a coach fails to win enough matchups then he won't qualify, but that's rather the point...
What's the metric for "better" here? Is any of the "new team protection" theory based on numbers or is this just a feelsies thing?
More games played which would not have been played, particularly for experienced teams.
Using S10 data, there were 19021 matches played. Of those 19021 matches a straight limit of 300TV would have prevented 2314 matches (12%) from happening. Raising the TV limit with games played, from 300 to 500 in steps of 20TV per game, and using the lower of the two teams' limits, drops that 2314 to 1431. That's 883 more matches made.
Experienced teams (>5 matches) played 4759 matches against each other (i.e. both teams were experienced). Of those matches, 893 (18.8%) would be prevented by a straight TV limit of 300TV. With the sliding scale that drops to 318.
Of the 883 more matches made under the sliding scale system, 575 of them are between teams with >5 games.
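To make the sliding scale concrete, here is a rough Python sketch of the rule as described above: 300TV for a fresh team, plus 20TV per game played, capped at 500TV, with the lower of the two teams' limits applied. The function names and the sample matches are invented for illustration, and treating a difference exactly equal to the limit as allowed is an assumption.

```python
FLAT_LIMIT = 300


def tv_limit(games_played, base=300, step=20, cap=500):
    """TV-difference limit for one team: 300 when fresh, +20 per game, capped at 500."""
    return min(cap, base + step * games_played)


def allowed(tv_a, games_a, tv_b, games_b, sliding=True):
    # Use the lower of the two limits, so the newer team is the one protected.
    limit = min(tv_limit(games_a), tv_limit(games_b)) if sliding else FLAT_LIMIT
    # Assumption: a difference exactly equal to the limit is still allowed.
    return abs(tv_a - tv_b) <= limit


def count_prevented(matches, sliding):
    """Number of matches the chosen rule would have blocked."""
    return sum(not allowed(*m, sliding=sliding) for m in matches)


# Toy matches as (tv_a, games_a, tv_b, games_b); not taken from the S10 data.
matches = [
    (1000, 0, 1400, 12),   # fresh team vs developed team, 400 apart
    (1500, 10, 1950, 15),  # two experienced teams, 450 apart
    (1200, 3, 1350, 4),    # two newish teams, 150 apart
]
print(count_prevented(matches, sliding=False), count_prevented(matches, sliding=True))
```

In the toy data the fresh team stays protected under both rules, while the match between the two experienced teams is only blocked by the flat limit.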
are there really no measurable characteristics of teams that outperform others significantly that highly correlate with that? My personal experiences would suggest a number of things which make me lose more games than others (and no, I don't mean luck), but I don't know if it's worth mentioning them.
If you think there are some then post them up and we can check if we have the numbers. Just remember that your personal experiences may or may not be representative of the overall data. Individual experience is just one of the blindfolded people touching the elephant - it'd be unwise to draw conclusions as to the nature of the thing you are touching from that.
I was actually really curious what things you have looked at and what the results were, but hey, do what you want.
Ok, so I started with basic Elo. For those unfamiliar with them, Elo-style systems assess the probability of one side or the other winning based on inputs into the system so far. Teams with no information on them (i.e. fresh teams) start at a given number of points (I used 1500) and gain a proportion of a maximum number of points (the k-value, set to 32 in many cases and that's the number I used) dependent on their chances of a win. If they are considered 60% likely to win and do so then they gain 40% of the max points and their opponent loses 40% of the max points. If there's an upset, though, then they lose 60% of the max points and their opponent gains 60% of the max points. For any two given rookies it's considered 50-50 and one team will gain 16 points and the other will lose 16 points. In its most basic form Elo is zero-sum.
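As a minimal sketch of that update in Python (the function name and argument layout are mine, and draws are left out because the description above only covers wins and losses):

```python
K = 32        # maximum points exchanged per match (the k-value)
START = 1500  # rating given to a fresh team


def elo_update(rating_a, rating_b, win_prob_a, a_won, k=K):
    """Return the new (rating_a, rating_b) after a decisive match.

    win_prob_a is A's estimated chance of winning before the match.
    """
    score_a = 1.0 if a_won else 0.0
    delta = k * (score_a - win_prob_a)
    # Zero-sum: whatever A gains, B loses, and vice versa.
    return rating_a + delta, rating_b - delta


# Two rookies, treated as 50-50: the winner gains 16, the loser drops 16.
print(elo_update(START, START, 0.5, a_won=True))

# A 60% favourite who wins gains 40% of K; if upset, they lose 60% of K.
print(elo_update(START, START, 0.6, a_won=True))
print(elo_update(START, START, 0.6, a_won=False))
```

The win probability is passed in rather than computed inside the function, so any method of estimating it can be plugged into the same update.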
The actual calculation of win probabilities in basic Elo is carried out by comparing Elo scores, the idea being that someone with a high Elo score is more likely to beat someone with a low Elo score. A difference of 200 points is normally considered to mean the player with the higher score is expected to win roughly 75% of the time (about 76% under the usual logistic formula).
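For reference, the usual logistic form of that comparison looks like this in Python. The 400-point scale is the conventional chess value and is an assumption here, since the scale used for the tests isn't stated in this post.

```python
def elo_win_prob(rating_a, rating_b, scale=400):
    """Expected score for A against B under the standard logistic Elo curve."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / scale))


print(round(elo_win_prob(1500, 1500), 3))  # level ratings
print(round(elo_win_prob(1700, 1500), 3))  # 200-point favourite
print(round(elo_win_prob(1500, 1700), 3))  # 200-point underdog
```

With these defaults a 200-point gap comes out at roughly 0.76 for the favourite, close to the rule of thumb quoted above.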
However, there are many ways to assess win probabilities in Blood Bowl. zSum is one such method, and one which I used. Other methods I've used have been: TV difference, TVPlus difference, games-played difference, and rankpoints difference as well as combinations of Elo and TV difference, zSum and TV difference, and rankpoints and TV difference (because TV difference is roughly the measure of mechanical difference between the teams while the others are measures of pure performance). In all cases I used a regression based on the data we have to calculate the winprob based on the differences in the chosen metric(s) between the two teams. None of them have proven to be as good as the current ranking formula at post hoc prediction of match results. Of those used, zSum and TV difference proved the best winprob method, but even Elo-style ranking based on that for winprob is still worse than what we're currently using.
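To give a flavour of what "a regression to calculate the winprob" can look like, here is a small Python sketch that fits a one-variable logistic regression by plain gradient descent. This is an illustration only: the post above doesn't say which type of regression was used, the function names are made up, and the data are toy numbers (differences in units of 100 TV), not the league data.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(diffs, outcomes, lr=0.1, epochs=2000):
    """Fit P(win) = sigmoid(w * diff + b) by batch gradient descent."""
    w, b = 0.0, 0.0
    n = len(diffs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(diffs, outcomes):
            p = sigmoid(w * x + b)
            grad_w += (p - y) * x  # gradient of the log-loss w.r.t. w
            grad_b += (p - y)      # gradient of the log-loss w.r.t. b
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


# Toy data: metric difference (team minus opponent) and 1 = win, 0 = loss.
diffs = [-3, -2, -1, -1, 0, 0, 1, 1, 2, 3]
outcomes = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

w, b = fit_logistic(diffs, outcomes)


def win_prob(diff):
    return sigmoid(w * diff + b)


print(round(win_prob(-2), 2), round(win_prob(0), 2), round(win_prob(2), 2))
```

Combining two metrics (say zSum and TV difference) would just mean one weight per metric instead of the single `w` here.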
I've also looked at awarding rankpoints as per the current formula based on the winprob of the match (using all the methods above). So if you'd normally gain 12 points for an extra win then that would be adjusted by the winprob (a high winprob meaning fewer points gained as per the Elo description). That also gave us worse results than the current ranking system. The current system correctly predicts match results 85% of the time, which is pretty high. The closest I got was 83% using a rankpoint-adjustment based on winprob calculated from TV difference and zSum. Most methods didn't get above 80%.
Reducing TV differences across the board would result in smaller numbers of viable matches, which will in turn result in fewer matches played, and therefore longer wait times. It would also be far worse for higher TV teams where the population is more sparse.
A better solution would be to have a limit that rises with games played. Start at, e.g., 300 TV max difference for new teams, rising to 500 TV after 10 matches. Since the majority of teams are low TV it would have a far smaller impact while protecting the newer teams from developed teams.