D-prime (signal detection) analysis

1. Taking response bias into account

The model of discrimination performance discussed in the previous file assumes that when listeners do not hear a difference, or are not sure, they respond "same" or "different" randomly, so that performance is at chance. But there is no guarantee that listeners will do that.

Suppose you were a subject in a discrimination task and you wanted to show 100% discrimination. You could answer "different" to every item, and you would then get 100% correct on the different pairs. You would of course also get 0% correct on the same pairs, because you answered "different" to all of them. In many studies, the same pairs are not analyzed at all, and this response strategy would work well. Does this result, 100% correct, mean that you discriminated the pairs very well? Clearly not; you don't even have to have listened to them.

Compare this response strategy with an opposite one -- suppose that you are very conservative in answering "different", and only do so when you are quite sure that the stimuli are different. That is, you don't respond at random when you do not hear a difference, or are not sure; you consistently respond "same". You might then get 100% correct on the same pairs, but you will probably have 0% correct on at least some of the different pairs (small step sizes, within-category pairs). Clearly you might nonetheless be discriminating better than a person who readily answers "different". This pattern of results is common in AX discrimination studies.

The point is that % correct on the different pairs alone is not a very meaningful measure of discrimination. It becomes meaningful when interpreted in terms of the listener's response bias, or tendency to respond "same" or "different". The responses to the same pairs can be used as an indication of response bias.

2. (Signal) Detection theory attributes responses to a combination of sensitivity and bias. Sensitivity is what we are interested in, while bias is what we have to take into account to recover sensitivity. The presentation that follows comes directly from Macmillan and Creelman's 1991 Detection Theory: A User's Guide (known here as “Detection for Dummies”).

Using detection theory, we conceive of sensitivity as (broadly) detecting a signal (e.g. against background noise, or compared to another signal), and model how a perceiver decides whether a signal is present. An experiment presents signals and non-signals to subjects, who try to detect all and only the signals. The traditional way of viewing such an experiment, and naming the possible outcomes, is as follows. “Yes” here represents the presence of a signal or difference to be detected; our different and same labels are added for convenience in thinking about AX discrimination:

	Response: Different (yes)	Response: Same (no)
Stimuli: YES (different)	HIT	MISS
Stimuli: NO (same)	FALSE ALARM	CORRECT REJECTION

This scheme is also used to organize and tabulate subjects' responses. That is, the (raw) number of HITS etc. is entered into the 4 cells. Notice that if the numbers of Signal (Different) and No-signal (Same) stimuli are the same in the experiment, then the total number of responses in the top row will equal the total number in the bottom row (or, more generally, the total for each row is known in advance from the design of the experiment); however, the total number of responses in the YES response column will not necessarily be the same as the total number in the NO response column, and neither number can be known in advance. But if you know the number of YES and NO trials in the experiment, then within each row you know the value in one column from the value in the other. E.g. if there are 20 different trials, and a subject has 5 hits, then that subject must have 15 misses. So, only 2 of the 4 numbers in the table (1 per row), plus the total numbers of trials, are needed to characterize a subject's performance. These are conventionally the Hits and False Alarms, which are given as proportions of the row totals and viewed as estimates of response probabilities:

hit rate H: proportion of YES trials to which subject responded YES = P("yes" | YES)
false alarm rate F: proportion of NO trials to which subject responded YES = P("yes" | NO)

The table can be rewritten with these and the other 2 rates, with each row totalling to 1.0; but the results of interest are the pair (H,F). (Compare these to the total proportion correct, which is (Hits + Correct rejections)/all responses.)
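As a rough illustration of the tabulation just described, here is a minimal Python sketch. The trial records and counts are made up for the example (20 different trials with 5 hits, as in the text above, plus 20 same trials with 2 false alarms), and the variable names are only placeholders.

```python
from collections import Counter

# Hypothetical trial records for one subject: (stimulus type, response).
trials = (
    [("different", "different")] * 5    # hits
    + [("different", "same")] * 15      # misses
    + [("same", "different")] * 2       # false alarms
    + [("same", "same")] * 18           # correct rejections
)

counts = Counter(trials)
hits = counts[("different", "different")]
misses = counts[("different", "same")]
false_alarms = counts[("same", "different")]
correct_rejections = counts[("same", "same")]

# H and F are proportions of the row totals (stimulus types), not of all trials.
hit_rate = hits / (hits + misses)
false_alarm_rate = false_alarms / (false_alarms + correct_rejections)

# Total proportion correct, for comparison with the (H, F) pair.
proportion_correct = (hits + correct_rejections) / len(trials)

print(hit_rate, false_alarm_rate, proportion_correct)
```

Only the hit and false alarm counts, plus the row totals, are needed to get (H,F); the miss and correct rejection counts are used here just to recover those totals.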

Consider then that the perfect subject's performance is (1,0), while a random subject has H=F and our subject who always answers YES has (1,1). Intuitively, the best subject maximizes H (and thus minimizes the Miss rate) and minimizes F (and thus maximizes the Correct Rejection rate); and thus the larger the difference between H and F, the better the subject's sensitivity. The statistic d' ("d-prime") is a measure of this difference; it is the distance between the means of the Noise and the Signal+Noise distributions, in standard deviation units. However, d' is not simply H-F; rather, it is the difference between the z-transforms of these 2 rates:

d' = z(H) - z(F)

where neither H nor F can be 0 or 1 (if so, adjust slightly up or down). Note that z-scores can be positive or negative so you have to watch the signs in the subtraction.
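The formula can be sketched in Python using only the standard library (statistics.NormalDist, Python 3.8 or later). The function names and the size of the adjustment (floor=0.01) are assumptions made for this example; the text above says only to adjust slightly.

```python
from statistics import NormalDist

def z(p):
    """z-transform of a proportion: inverse of the standard normal CDF."""
    return NormalDist().inv_cdf(p)

def d_prime(hit_rate, false_alarm_rate, floor=0.01):
    # Assumption: any rate below `floor` or above 1 - floor (including exact
    # 0 and 1) is pulled in to that limit so that z() stays finite.
    h = min(max(hit_rate, floor), 1 - floor)
    f = min(max(false_alarm_rate, floor), 1 - floor)
    return z(h) - z(f)

# Both z-scores are negative here, but the difference is still positive.
print(z(0.3), z(0.1), d_prime(0.3, 0.1))

# Perfect scores are adjusted before transforming.
print(d_prime(1.0, 0.0))
```

The first call shows why the signs matter: z(.3) and z(.1) are both negative, and subtracting the more negative one gives a positive d'.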

Background: z-transform. A range of values is cast as a normal distribution, with standard deviations around the mean. The mean value is set to 0, and the range of most values is about 3 standard deviations above and below the mean. So each value is some number of SD units above or below the mean. This transform is valuable in allowing comparison of measures with different ranges of absolute values, and in taking into account the inherent variability of different measures. For example, Wightman et al. (1992) J. Acoust. Soc. Am. 91, 1707-1717, comparing lengthening before different break indices in a corpus with uncontrolled final consonants and vowels, used transformed duration measurements because different segments have different absolute durations and different degrees of variability. For a proportion such as H or F, z(p) is the value on the standard normal distribution below which a proportion p of the distribution falls (the inverse of the cumulative normal), so z(.5) = 0, proportions above .5 give positive z, and proportions below .5 give negative z.

Of course, whether you use the original proportions or their transforms, when H = F, then d' = 0. This will be true whether the "yes" rate is near 1 or near 0. The highest possible d' (greatest sensitivity) is 6.93, the effective limit (using .99 and .01) 4.65, typical values are up to 2.0, and 69% correct for both different and same trials corresponds to a d' of 1.0.
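Some of these benchmark values can be checked with a few lines of Python. This is only a sketch; the helper name d_prime is made up for the example.

```python
from statistics import NormalDist

def d_prime(h, f):
    z = NormalDist().inv_cdf   # inverse of the standard normal CDF
    return z(h) - z(f)

# H = F gives d' = 0 whether the "yes" rate is low, middling, or high.
same_rate = [d_prime(p, p) for p in (0.1, 0.5, 0.9)]

# Effective limit when the rates are capped at .99 and .01.
effective_limit = d_prime(0.99, 0.01)

# 69% correct on both trial types means H = .69 and F = 1 - .69 = .31.
d_69 = d_prime(0.69, 0.31)

print(same_rate, effective_limit, d_69)
```

The last two values should come out close to 4.65 and 1.0 respectively.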

There are other sensitivity measures - e.g. a transform other than z (or even no transform at all), or differential weighting of H and F, and even alternate versions of d' (see below) - but this is the one you usually see in speech research.

3. How to get d' for your data.

You could calculate H and F, convert them to z-scores, and subtract them. M&C's table A5.1 in Appendix 5 gives the z-score conversions.

M&C's first example of getting H and F:

	#responses different	#responses same	total # responses
stimuli different	20	5	25
stimuli same	10	15	25

So the hit rate H is 20/25, or .8
the miss rate is 5/25, or .2 (these 2 add up to 1.0)
the false alarm rate is 10/25, or .4
the correct rejection rate is 15/25, or .6 (these 2 add up to 1.0)
and the (H,F) pair is (.8,.4)

z(H) = 0.842 and z(F) = -0.253
d' = 0.842-(-0.253) = 1.095
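The same worked example can be reproduced in Python instead of looking up the z-scores in a table. This is a minimal sketch using the standard library; the variable names are only illustrative.

```python
from statistics import NormalDist

# Counts from the example table above.
hits, misses = 20, 5                      # "different" stimuli
false_alarms, correct_rejections = 10, 15 # "same" stimuli

H = hits / (hits + misses)                              # 20/25
F = false_alarms / (false_alarms + correct_rejections)  # 10/25

z = NormalDist().inv_cdf   # inverse of the standard normal CDF
z_H, z_F = z(H), z(F)
d_prime = z_H - z_F

print(round(z_H, 3), round(z_F, 3), round(d_prime, 3))
```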

But probably you will want to do it more automatically. M&C's Appendix 6 provides some information on available programs, the late Tom Wickens still has a website on a UCLA server with a downloadable program, and a search online will turn up several options.

Colin Wilson has provided his Excel formula: d' = NORMINV(hit-rate,0,1) - NORMINV(false-alarm-rate,0,1) where Excel's NORMINV "Returns the inverse of the normal cumulative distribution for the specified mean and standard deviation", 0 being the specified mean and 1 being the specified SD.

But see section 6 below for different d' calculations for different experimental designs, including our AX discrimination.

4. Some use of d' in the discrimination literature.

d' is often plotted instead of, or in addition to, % correct/% different responses to different pairs. Here is a recent example, from Francis & Ciocca (2003), JASA 114(3), p. 1614:

Best et al. (1981), Perc. & Psychophys. 29:191-211 calculate d' for obtained and predicted discrimination, and compare these by ANOVA. Here is a figure plotting obtained minus predicted d' (p. 211):

Godfrey et al. (1981) and others have studies with small numbers of trials for each pair (especially same pairs, and especially in studies with kids), and in these cases d' is calculated not for individual stimulus pairs, but for each subject, combining all different pairs and all same pairs. Alternatively, a d' for each pair is sometimes seen for a group of subjects (e.g. Francis & Ciocca 2003); this has the advantage that perfect scores on any pair are unlikely for a whole group (and thus require no adjusting down from 1.0). See J. Sussman (1993) for averaging H and F, so that d' scores are group scores; tested with G test. Clearly in small experiments we have no choice but to average something; see Ch. 11 in M&C on how to average carefully.
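One way to get a single d' per subject when there are few trials per pair is to pool the raw counts over all different pairs and over all same pairs before computing H and F. The sketch below assumes that pooling approach; the pair labels and counts are invented, and this is not necessarily the procedure used in any of the studies cited above.

```python
from statistics import NormalDist

# Hypothetical counts for one subject: pair -> ("different" responses, trials).
different_pairs = {"1-3": (2, 4), "3-5": (4, 4), "5-7": (1, 4)}
same_pairs = {"1-1": (0, 2), "3-3": (1, 2), "5-5": (0, 2), "7-7": (0, 2)}

def pooled_rate(pairs):
    """Sum the counts over pairs first, then take one proportion."""
    responses = sum(r for r, _ in pairs.values())
    n_trials = sum(n for _, n in pairs.values())
    return responses / n_trials

H = pooled_rate(different_pairs)   # pooled hit rate
F = pooled_rate(same_pairs)        # pooled false alarm rate

z = NormalDist().inv_cdf
subject_d_prime = z(H) - z(F)
print(H, F, subject_d_prime)
```

Note that pair 3-5 alone has a hit rate of 1.0 and several same pairs have a false alarm rate of 0, so per-pair d' values would need adjusting, while the pooled rates do not.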

Sometimes H and CR are added to give a proportion correct (for all pairs, not just for different pairs), which is then arcsin transformed and analyzed in the usual way. See Sussman & Carney, Francis & Ciocca. But see M&C p.100ff on the dangers of proportion correct ("an unexpectedly theory-laden statistic").
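For concreteness, here is a small Python sketch of the proportion correct and an arcsine transform of it. The counts are hypothetical, and the arcsine-square-root form used here is an assumption: it is one common version of the transform (some authors multiply it by 2), and the papers cited may differ.

```python
import math

# Hypothetical counts for one subject.
hits, correct_rejections = 20, 15
n_different, n_same = 25, 25

# Proportion correct over all pairs, same and different together.
proportion_correct = (hits + correct_rejections) / (n_different + n_same)

# Assumed form of the transform: arcsine of the square root, in radians.
transformed = math.asin(math.sqrt(proportion_correct))

print(proportion_correct, transformed)
```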

5. Bias

Bias is measured as the inclination of the subject to say "yes" (or "no"). The bias measure c is a function of z(H) + z(F), the sum of the same two z-transforms whose difference gives d'. But no one in speech research seems to report it, so we won't cover how to calculate it.

6. Two models of AX discrimination performance

A "same-different" experiment uses 2 or more stimuli in a trial and calls for a "same/different" response. M&C propose that there are really 2 different kinds of these, with different likely subject strategies and therefore different appropriate models for d' (pp. 143ff). In "fixed" designs the 2 stimuli in a pair are the same across trials in a block, and subjects are likely to apply an independent-observation strategy, estimating the category for each stimulus and then comparing the category estimates. For this strategy, d' is calculated in the usual way. In "roving" designs the 2 stimuli vary from trial to trial, and subjects are likely to apply a differencing strategy, applying a threshold of difference to decide if 2 stimuli are different enough to count as different. For this strategy, d' is calculated differently, e.g. using the table of H vs. FA values in M&C's appendix A5.4.
It would seem that speech experiments almost always use a roving design, and thus would seem to call for the differencing model; but on the other hand the theory of the independent-observation model is more like the idea of categorical perception. Both approaches to d' are seen in the speech perception literature. To pursue the issue of whether categorical perception can be modeled as a differencing strategy, see Macmillan, Kaplan, and Creelman 1977, "The psychophysics of categorical perception", Psych Review 84: 452-71.

How much difference will this make in analyzing an experiment? Consider M&C's example in section 3 above, for the (.8,.4) pair: d' was 1.095. This pair is found on p. 347 of Table A5.4, with a d' of 2.35. Compare also the values in the file "some sample dprime.xls". (Some values still need to be looked up - try this yourself.) The differences can be large, but it might not matter when comparing values calculated by the same method.
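Instead of looking values up in the table, the differencing-model d' can be approximated numerically. The sketch below assumes the standard form of the differencing model, in which the subject responds "different" when the absolute difference between the two observations exceeds a criterion k; the equations appear in the docstring. The function names, the bisection search and its bounds are choices made for this example, not M&C's own procedure.

```python
from math import sqrt
from statistics import NormalDist

_norm = NormalDist()

def d_prime_independent(h, f):
    """Usual calculation: z(H) - z(F)."""
    return _norm.inv_cdf(h) - _norm.inv_cdf(f)

def d_prime_differencing(h, f, upper=10.0, tol=1e-6):
    """Solve the assumed differencing-model equations for d' by bisection.

      F = 2 * Phi(-k / sqrt(2))
      H = Phi((d' - k) / sqrt(2)) + Phi((-d' - k) / sqrt(2))
    """
    if not 0 < f < h < 1:
        raise ValueError("need 0 < F < H < 1")
    # The false alarm rate fixes the criterion k.
    k = -sqrt(2) * _norm.inv_cdf(f / 2)

    def predicted_hit_rate(d):
        return _norm.cdf((d - k) / sqrt(2)) + _norm.cdf((-d - k) / sqrt(2))

    # The predicted hit rate rises with d', so bisection finds the match.
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if predicted_hit_rate(mid) < h:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(d_prime_independent(0.8, 0.4), d_prime_differencing(0.8, 0.4))
```

For the (.8,.4) pair this gives values close to the 1.095 and 2.35 quoted above, which is some reassurance that the assumed equations match the table.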

7. Applying detection theory to identification data

In some papers we see pairs of items along a continuum treated as signal vs. noise for purposes of computing a d-prime for identification responses. For example, Massaro 1989, starting p. 410: “The probabilities of responding /r/ are transformed into z scores. The d’ between two adjacent levels along the /l/-/r/ continuum is simply the positive difference between the respective z scores.” (example given) “A d’ value was computed for each of the two pairs of adjacent levels along the /l/-/r/ continuum.” Then he reports an ANOVA on these d’ values, with 2 within-subject factors (context: 3 levels, and stimpair: 2 levels).

In effect, by subtracting like this, the responses (say, the /l/ responses) to one stimulus are treated as the HITS, and the responses (with the same response category) to the next stimulus over are treated as the FALSE ALARMS.

Another example, which I quote here a bit, is Iverson & Kuhl (1996). Starting on p. 1134: estimating the “perceptual distances” between stimuli from identification responses:

“Through the application of detection theory, identification percentages can also be used to estimate the perceptual distances separating tokens (Macmillan and Creelman, 1991). Within this theoretical framework, the z-transformed identification probability for each token, z(p), indicates its location relative to the category boundary. The absolute value of this measure indicates each token’s distance from the boundary in standard-deviation units. The sign of this measure indicates whether each token is within (positive) or out of (negative) the category. For example, z(p)=0.0 for tokens that are identified as a member of the category on 50% of trials, z(p)=2.3 for tokens that are identified as a member of the category on 99% of trials, and z(p)=-2.3 for tokens that are identified as a member of the category on 1% of trials. The perceptual distances between pairs of tokens (d’) can then be found by subtracting these location measures; tokens that are at similar locations will have a small d’ and tokens that are at dissimilar locations will have a large d’. In other words, d’ will be greater to the extent that tokens are identified differently.”

“The identification judgments were used to estimate perceptual distances by calculating the z transform of the mean /l/ identification percentage for each token and then taking the absolute value of the difference for each pair of tokens. The z transform reaches infinity when percentages equal 0 or 100, so tokens with 0% /l/ identifications were assigned values of 1% and tokens with 100% /l/ identifications were assigned values of 99% (Macmillan and Creelman, 1991).”
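The procedure in these quotations can be sketched in Python as follows. The identification proportions are invented for the example, and the function name z_location is hypothetical; the 1% and 99% replacement values follow the quotation.

```python
from statistics import NormalDist

def z_location(proportion, floor=0.01):
    # 0% and 100% are replaced by 1% and 99%, as in the quoted procedure
    # (this also pulls in any value between 0 and 1% or between 99% and 100%).
    p = min(max(proportion, floor), 1 - floor)
    return NormalDist().inv_cdf(p)

# Hypothetical proportions of /l/ responses for 5 tokens along a continuum.
p_l = [1.00, 0.95, 0.60, 0.10, 0.00]

locations = [z_location(p) for p in p_l]

# d' for each adjacent pair: absolute difference of the z-transformed values.
adjacent_d_prime = [abs(a - b) for a, b in zip(locations, locations[1:])]

print(adjacent_d_prime)
```

With these made-up proportions the largest d' falls on the pair that straddles the category boundary (the .60 and .10 tokens), which is the pattern such analyses look for.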

Pairwise comparison of identification responses is described by M&C on pp. 212-13. However, it’s not clear that they would use it directly as a measure of perceptual distance.

8. Comparing identification with discrimination responses. Above we see ways to get sensitivity scores for discrimination and identification responses. Once they are both in terms of d-prime, a common currency, they can be compared directly. That is, no need to "predict" an expected discrimination from identification. If discrimination is constrained by categorization, then the two sensitivity functions should be the same. An example of this, though without explanation, is in Schouten et al. (2003).

9. Some Terminology and models in M&C

"one-interval discrimination": "one-interval" means one stimulus in a trial; "discrimination" means telling the different stimuli of the experiment (not of the trial!) apart by selecting different responses from the available set -- so, confusingly, this refers to what we call identification. These are of 2 types:

"yes-no": only 2 responses are available; e.g. for each of the stimuli, is it one you've seen before (yes) or not (no, it's new); is it a Z (yes) or not (no, it's a Y). d' is calculated as before, though this assumes that there is a known, correct, answer for each stimulus, which is not the case for identification of stimuli drawn from a continuum.
"rating": more than 2 responses are available, on a scale, e.g. 1 to 5, where the 2 endpoint responses indicate high certainty. See M&C pp. 61ff. for how to calculate H and FA in rating experiments. See also pp. 79ff for alternatives to the rating design, in which multiple conditions are used to manipulate bias.

"two interval": 2 stimuli per trial
"2AFC": which one of these 2 stimuli? (e.g. which one have you seen before (recognition), which one came first (temporal order), which one matches the prompt (identification))

M&C say that a 2AFC design is better than yes-no recognition when a priori familiarity is a likely confound; BUT because 2AFC is easier for subjects, the d' computed from z(H) - z(F) is scaled down by a factor of 1/√2, or roughly .7. They also note that as this design tends to minimize bias (at least for simultaneous visual presentation of stimulus pairs), percent correct is not a bad measure of sensitivity.
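A minimal sketch of the 2AFC adjustment, assuming the 1/√2 scaling of the usual z(H) - z(F) difference. Here H and F are taken to be defined by treating one of the two intervals as the "yes" response; the function name and example rates are made up.

```python
from math import sqrt
from statistics import NormalDist

def d_prime_2afc(h, f):
    """Yes-no style difference of z-scores, scaled down by 1/sqrt(2)."""
    z = NormalDist().inv_cdf
    return (z(h) - z(f)) / sqrt(2)

# The same (H, F) pair gives a smaller d' under the 2AFC calculation.
print(d_prime_2afc(0.8, 0.4))
```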

Prepared by Pat Keating, Spring 2004, updated Fall 2005
