Logistic regression in individual test scoring / by Chris T

If you've read any online reviews in the last, say, 10 years—you probably look for that bigass number in bold at the top and move along after seeing it's not a 10.

Some outlets are moving toward a system of objective scoring, but problems arise when limited datasets and auto-scaling create anomalies in product rankings. Sometimes that 10/10 really is meaningless.

That's why philosophy is so important, and I'll cover just one of the ways I look at metrics here. 

For any test whose results rely on human perception to determine whether something is "good" or "bad," you may want to use a logistic regression instead of a linear or relational model in order to score a product properly. There are lots of products out there with wildly inflated scores on some review sites, based on mathematically irrelevant differences in test readings. The truth is, humans aren't going to be able to discern the difference between screen black levels of 0.003cd/m^2 and 0.002cd/m^2. Similarly, it won't matter whether a smartphone's peak brightness is 1,000cd/m^2 or 100,000cd/m^2, so it makes no sense to score these against a linear model, lest all other scores be rendered "shit" compared to a ridiculous outlier of zero utility.

The benefit of a logistic regression is that we can set the scoring limits at average human perceptual limits and award points (or take them away) in an exponentially decreasing fashion. Thus, something that's slightly better than anything a human could see gets a 95/100, and something that's right at the limit gets a 90/100. In the brightness example above, a normal relational auto-scaling model would give that outlier a 100/100 and push the "decent" result down to 1/100. Instead of relating all scores to each other, we weigh the results against what someone could actually experience.
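To make that failure mode concrete, here's a minimal Python sketch of one plausible reading of "relational auto-scaling" (my assumption, not any outlet's actual method): each result is scored as a percentage of the best result in the dataset.

    # A minimal sketch (assumed rule) of relational auto-scaling:
    # each reading is scored as a percentage of the best reading in the dataset.
    def relational_score(reading, all_readings):
        return 100 * reading / max(all_readings)

    readings_cd_m2 = [1_000, 100_000]  # hypothetical peak-brightness readings

    for r in readings_cd_m2:
        print(f"{r:>7} cd/m^2 -> {relational_score(r, readings_cd_m2):.0f}/100")

    # The 1,000 cd/m^2 screen scores 1/100 only because a 100,000 cd/m^2
    # outlier happens to exist in the same dataset.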

Let's look at screen brightness.

[Chart: Brightness score (regression). X axis: screen brightness (cd/m^2); Y axis: score out of 100.]

Using the equation f(x) = 100 / (1 + 200e^(-0.009x)), where x is the screen brightness in cd/m^2, we can make a chart that shows what the limits could be.
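Here's a minimal Python sketch of that curve, using the same constants (200 and 0.009). The function name and the sample brightness values are mine, chosen just to show the shape of the curve.

    import math

    def brightness_score(cd_m2):
        # Logistic scoring curve: f(x) = 100 / (1 + 200 * e^(-0.009 * x))
        return 100 / (1 + 200 * math.exp(-0.009 * cd_m2))

    # Approximate outputs: 100 -> ~1, 350 -> ~10, 600 -> ~53, 850 -> ~91, 1000 -> ~98
    for x in (100, 350, 600, 850, 1_000):
        print(f"{x:>5} cd/m^2 -> {brightness_score(x):5.1f}/100")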

Looking at the chart, we can see that the curve starts to climb around 350cd/m^2 and crests around 850. That's no accident: 350cd/m^2 is the minimum brightness needed to see an image in direct sunlight, and 800cd/m^2 is the threshold of pain in a well-lit room. These aren't scientific limits; they're just the ones I chose for this illustration.

Note how the score doesn't reward ludicrous screen brightnesses past a certain point? See how the algorithm keeps sub-350 readouts in the scoring basement? That's by design. We set acceptable limits for what people need based on the philosophy of the product. By establishing the necessary parameters, we can then score against them.
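As a sketch of how such parameters could be established (an assumption on my part; the post doesn't say how 200 and 0.009 were chosen), you can pin two scores to two perceptual landmarks and solve for the logistic curve that passes through both. Anchoring roughly 10/100 at 350cd/m^2 and 90/100 at 850cd/m^2 lands very close to the constants used above.

    import math

    # Generic curve: f(x) = 100 / (1 + A * e^(-k * x)).
    # Pick A and k so that f(x_low) = s_low and f(x_high) = s_high.
    x_low, s_low = 350, 10    # assumed "scoring basement" anchor
    x_high, s_high = 850, 90  # assumed "near the crest" anchor

    # f(x) = s  implies  A * e^(-k * x) = 100/s - 1
    r_low = 100 / s_low - 1    # 9
    r_high = 100 / s_high - 1  # 1/9

    k = math.log(r_low / r_high) / (x_high - x_low)  # ln(81) / 500, about 0.0088
    A = r_low * math.exp(k * x_low)                  # about 195

    print(f"k = {k:.4f}, A = {A:.0f}")  # close to the 0.009 and 200 used above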

We also don't want to discourage reaching for greater heights, so we do reward the brighter screens, though we also want to preserve the philosophical integrity of the system. If we were to score our original hypothetical, the 100,000cd/m^2 screen would get a 99.9/100 and the 1,000cd/m^2 screen would get a 98/100. Both are ultra-high levels, but one is completely ludicrous and the other is realistic.
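Running those two hypothetical screens through the same assumed curve (a quick check, reusing the function sketched earlier) bears this out: the realistic screen already sits near the ceiling, and the ludicrous one gains almost nothing for its extra zeroes.

    import math

    def brightness_score(cd_m2):
        # Same assumed scoring curve as above: 100 / (1 + 200 * e^(-0.009 * x))
        return 100 / (1 + 200 * math.exp(-0.009 * cd_m2))

    for x in (1_000, 100_000):
        print(f"{x:>7} cd/m^2 -> {brightness_score(x):.1f}/100")

    # Roughly 97.6/100 versus effectively 100/100: a gap of a couple of points
    # instead of the 99-point gap a relational auto-scaler would produce.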

The product exists to serve the user, not the scoring algorithm, so replacing scoring that rewards beating the pack with scoring that reflects the user's actual needs is the right way forward.