P-values are probabilities. Translating from a probability into a `significant` or `not significant` result involves setting a significance threshold between 0 and 1. P-values less than this threshold are considered significant and p-values higher than this threshold are considered not significant.

The significance threshold is used to convert a p-value into a yes/no or a true/false result. After running a hypothesis test and obtaining a p-value, we interpret the outcome based on whether the p-value is higher or lower than the threshold. A p-value lower than the significance threshold is considered significant and leads us to reject the null hypothesis. A p-value higher than the significance threshold is considered not significant, and we do not reject the null hypothesis.
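The conversion from a p-value to a yes/no result can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function name `is_significant` and the default threshold of 0.05 are assumptions chosen for the example (0.05 is a common convention, but the threshold is up to the researcher).

```python
def is_significant(p_value: float, threshold: float = 0.05) -> bool:
    """Return True when the p-value is below the significance threshold.

    The 0.05 default is an assumed, conventional choice for illustration.
    """
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must be between 0 and 1")
    if not 0.0 < threshold < 1.0:
        raise ValueError("threshold must be strictly between 0 and 1")
    return p_value < threshold


# Interpret a few hypothetical p-values against the threshold.
for p in (0.003, 0.049, 0.2):
    if is_significant(p):
        decision = "significant: reject the null hypothesis"
    else:
        decision = "not significant: do not reject the null hypothesis"
    print(f"p = {p}: {decision}")
```

The threshold should be fixed before the test is run; choosing it after seeing the p-value defeats its purpose.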

When using significance thresholds with hypothesis testing, two kinds of errors may occur. A type I error, also known as a false positive, happens when the null hypothesis is true but we find a significant result. A type II error, also known as a false negative, happens when the null hypothesis is false but we find a non-significant result:

Null hypothesis: | is true | is false |
---|---|---|
P-value significant | Type I error | Correct! |
P-value not significant | Correct! | Type II error |

Converting a p-value into a yes/no result with a significance threshold introduces the possibility of an error: that we conclude something is true based on our test when it is actually not true. A type I error occurs when we calculate a “significant” p-value even though the null hypothesis is true. It turns out that the significance threshold we use for a hypothesis test is equal to our probability of making a type I error when the null hypothesis is true.
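A small simulation can make the link between the threshold and the type I error rate concrete. The sketch below relies on one stated assumption: when the null hypothesis is true and the test statistic is continuous, p-values are uniformly distributed between 0 and 1, so each simulated p-value is drawn with `random.random()`. The function name, the number of simulated tests, and the seed are hypothetical choices for illustration.

```python
import random


def simulated_type_i_error_rate(threshold: float,
                                n_tests: int = 100_000,
                                seed: int = 0) -> float:
    """Estimate how often a true null hypothesis yields a significant result.

    Assumption: under a true null, p-values are uniform on [0, 1],
    so each simulated test's p-value is a uniform random draw.
    """
    rng = random.Random(seed)
    # Count the simulated tests whose p-value falls below the threshold.
    false_positives = sum(rng.random() < threshold for _ in range(n_tests))
    return false_positives / n_tests


rate = simulated_type_i_error_rate(0.05)
print(f"Observed type I error rate at threshold 0.05: {rate:.3f}")
```

The observed rate should land close to the threshold itself, and changing the threshold to 0.01 or 0.10 should move it accordingly.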

When working with a single hypothesis test, the type I error rate is equal to the significance threshold and is therefore easy for a researcher to control. However, when running multiple hypothesis tests, the probability of at least one type I error increases beyond the significance threshold used for each test. Assuming the tests are independent and every null hypothesis is true, the probability of making at least one type I error is 1 - (1 - a)^n, where a is the significance threshold and n is the number of tests.
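To see how quickly this probability grows, the formula can be evaluated for a few values of n. This is a minimal sketch: the function name `familywise_error_rate` is a label chosen for this example, and the calculation assumes the tests are independent and that every null hypothesis is true.

```python
def familywise_error_rate(a: float, n: int) -> float:
    """Probability of at least one type I error across n tests.

    Implements 1 - (1 - a)^n, where a is the significance threshold.
    Assumes independent tests (an assumption of this sketch).
    """
    if not 0.0 < a < 1.0:
        raise ValueError("a must be strictly between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1 - (1 - a) ** n


# Evaluate the formula for an increasing number of tests at a = 0.05.
for n in (1, 5, 20, 100):
    print(f"n = {n}: {familywise_error_rate(0.05, n):.3f}")
```

With a = 0.05, a single test gives 0.05, but 20 tests give roughly 0.64, and 100 tests give roughly 0.99.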