Test suites

Test suites from existing papers have been published in a variety of formats, making them difficult to adapt across models and evaluation pipelines. We require our test suites to conform to a standardized format to facilitate replications.

Meta information

First, we require some basic meta information about the suite, such as the suite name, author name, and reference information. We also require a metric, which specifies the way surprisal should be aggregated over the tokens within a region (see the Regions section for more details).

Currently, the supported metrics are sum, mean, median, range, max, and min. Users may specify any individual metric or any subset of these metrics.

For users uploading test suites in JSON format, the metric must be specified as a string or list of strings: an individual metric as a string (e.g. 'range'), multiple metrics as a list (e.g. ['sum', 'mean']), or all metrics as the string 'all'.

An example meta dict in JSON looks like this:

{
    "meta": {
        "name": "test",
        "author": "Syntax James",
        "reference": "James, Syntax (1956). Syntactic Strucgyms.",
        "metric": "all"
    }
}
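As a sketch of how a metric specification like the one above might be normalized into a concrete list of metrics (the constant and function names here are illustrative, not part of any published API):

```python
# Illustrative sketch: normalize the "metric" field of a suite's meta dict.
# SUPPORTED_METRICS and normalize_metrics are hypothetical helper names.
SUPPORTED_METRICS = ["sum", "mean", "median", "range", "max", "min"]

def normalize_metrics(metric):
    """Turn a metric spec (string, list, or 'all') into a list of metric names."""
    if metric == "all":
        return list(SUPPORTED_METRICS)
    if isinstance(metric, str):
        metric = [metric]
    unknown = [m for m in metric if m not in SUPPORTED_METRICS]
    if unknown:
        raise ValueError(f"Unsupported metric(s): {unknown}")
    return list(metric)
```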

Regions

The atomic unit of a test suite is a region. A region is a chunk of a sentence that we are interested in comparing across conditions. Regions are defined separately from the items themselves, and each sentence in a test suite is partitioned into the same regions.

For example, suppose we are designing a test suite for subject-verb number agreement. A researcher would likely want to compare surprisal values at the verb of the sentence when number agreement is satisfied versus when it is violated. Thus, one natural region would be verb. Assuming we’re dealing with very simple sentences, we might then define a subject region for the content before the verb and a post-verb region for the content after the verb, such as adverbs or final punctuation.

Users uploading a test suite in JSON format specify this information in a dictionary called region_meta, which associates region numbers with region names. This mapping is used in the Predictions and Items sections.

{
    "region_meta": {
        "1": "subject",
        "2": "verb",
        "3": "post-verb"
    }
}

Note

A region can consist of multiple tokens. For example, the region corresponding to a noun phrase might contain the tokens my neighbor, the very wrinkly raisin, or dogs.

The surprisal of a region is calculated by aggregating the surprisals of each token in the region. The metric used to aggregate token-level surprisals is specified in the Meta information.
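The aggregation step described above can be sketched in Python. This is a minimal illustration of the supported metrics, assuming token-level surprisals are already available as a list of floats; the function name is hypothetical:

```python
# Illustrative sketch: aggregate token-level surprisals into a single
# region-level value using one of the supported metrics.
import statistics

def aggregate_region(token_surprisals, metric):
    """Combine a region's token surprisals according to the chosen metric."""
    if metric == "sum":
        return sum(token_surprisals)
    if metric == "mean":
        return statistics.mean(token_surprisals)
    if metric == "median":
        return statistics.median(token_surprisals)
    if metric == "range":
        return max(token_surprisals) - min(token_surprisals)
    if metric == "max":
        return max(token_surprisals)
    if metric == "min":
        return min(token_surprisals)
    raise ValueError(f"Unknown metric: {metric}")
```

For a multi-token region such as "the very wrinkly raisin", each token contributes one surprisal value, and the region's score is the chosen aggregate over those values.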

Conditions

In our example, suppose we are interested in two experimental conditions: a singular subject paired with a singular verb, and a singular subject paired with a plural verb. Let’s define these conditions as number_match and number_mismatch, respectively.

Web interface users must enter the names of each condition in the test suite. JSON users, however, do not need to define condition names explicitly; they are inferred from the Items.

Predictions

Users often design test suites with a hypothesis in mind: the surprisal at certain regions should be greater in some conditions than others. In order to encode these hypotheses, we allow users to specify predicted relationships between region-level surprisal values across conditions.

Let’s return to the running example. If the model has correctly learned generalizations about number agreement in English grammar, then we would expect the aggregate surprisal at region 2 (verb) to be higher in the number_mismatch condition than in the number_match condition.

Users uploading a test suite in JSON format would include a predictions field, which contains a list of dictionaries, each one containing the prediction for a single region:

{
    "predictions": [
        {
            "region_number": 2,
            "l_operand": "number_mismatch",
            "relation": "greaterthan",
            "r_operand": "number_match"
        }
    ]
}

The relation field of each dictionary can be lessthan, equals, or greaterthan. The l_operand and r_operand fields specify the condition names that fill the first and second argument positions of the relation, respectively.

If we do not have clear predictions regarding regions 1 (subject) or 3 (post-verb), we do not need to specify dictionaries for those regions.
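Checking a prediction like the one above can be sketched as a small comparison over region-level surprisals. Here the function name and the surprisals data structure (a mapping from condition name to per-region surprisal) are illustrative assumptions, not part of any published API:

```python
# Illustrative sketch: evaluate one prediction dict against region-level
# surprisals, given a mapping {condition_name: {region_number: surprisal}}.
import operator

RELATIONS = {
    "lessthan": operator.lt,
    "equals": operator.eq,
    "greaterthan": operator.gt,
}

def check_prediction(prediction, surprisals):
    """Return True if the predicted relation holds for this region."""
    left = surprisals[prediction["l_operand"]][prediction["region_number"]]
    right = surprisals[prediction["r_operand"]][prediction["region_number"]]
    return RELATIONS[prediction["relation"]](left, right)
```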

Warning

Do not omit the predictions field from the JSON file. If you do not wish to encode any predictions, simply pass an empty list. The predictions field must be present for the suite to be parsed properly.

Items

Finally, users must specify a list of items. This is the meat of the test suite, in the sense that it provides the actual test items that are sent to the model for evaluation.

An item is characterized by its lexical content and takes different forms across conditions. For example, The boy swims today. and *The boy swim today. are different instances of the same item under the number_match and number_mismatch conditions, respectively.

In the web interface, users enter sentences in a grid of text boxes, where each row corresponds to a sentence (a particular item under a particular condition) and each column corresponds to a region.

In the JSON format, items is a list of dictionaries. Each item dictionary specifies an item_number as well as a list of conditions. Each dictionary in the conditions list corresponds to a sentence; it requires a condition_name as well as a list of regions, where each region is represented as a dictionary with a region_number (consistent with region_meta) and a content field, which is where the actual text lives.

Note

The content of a region is expected to be natural language text, prior to tokenization. Tokenization will be performed on a model-by-model basis when the test suite is used for evaluation.

Here are example items in JSON format:

{
    "items": [
        {
            "item_number": 1,
            "conditions": [
                {
                    "condition_name": "number_match",
                    "regions": [
                        {
                            "region_number": 1,
                            "content": "The boy"
                        },
                        {
                            "region_number": 2,
                            "content": "swims"
                        },
                        {
                            "region_number": 3,
                            "content": "today."
                        }
                    ]
                },
                {
                    "condition_name": "number_mismatch",
                    "regions": [
                        {
                            "region_number": 1,
                            "content": "The boy"
                        },
                        {
                            "region_number": 2,
                            "content": "swim"
                        },
                        {
                            "region_number": 3,
                            "content": "today."
                        }
                    ]
                }
            ]
        }
    ]
}
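As a sketch of how a full sentence might be reconstructed from an item's regions, one could join non-empty region contents in region order. The function name is illustrative, and joining with single spaces is a simplifying assumption (real whitespace handling, e.g. around punctuation, may differ):

```python
# Illustrative sketch: rebuild a sentence string from a condition's regions.
def sentence_from_regions(condition):
    """Join non-empty region contents with spaces, in region-number order."""
    ordered = sorted(condition["regions"], key=lambda r: r["region_number"])
    return " ".join(r["content"] for r in ordered if r["content"])
```

Applied to the number_match condition above, this yields the sentence that would be sent to the model for scoring.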

Examples

The format of test suites is perhaps best learned by example. To view or download items from existing test suites, see the Test Suites page for inspiration.