Responsiveness and Minimal Important Change of the PROMIS Pain Interference item bank in patients presenting in Musculoskeletal Practice.

We evaluated the responsiveness of the PROMIS Pain Interference item bank in patients with musculoskeletal pain by testing predefined hypotheses about the relationship between the change scores on the item bank, change scores on legacy instruments and Global Ratings of Change (GRoC), and we estimated Minimal Important Change (MIC). Patients answered the full Dutch-Flemish V1.1 item bank. From the responses we derived scores for the standard 8-item short form (SF8a) and a CAT-score was simulated. Correlations between the change scores on the item bank, GRoC and legacy instruments were calculated, together with Effect Sizes (ES), Standardized Response Means (SRM), and Area Under the Curve (AUC). GRoC were used as an anchor for estimating the MIC with (adjusted) predictive modeling. Of 1677 patients answering baseline questionnaires 960 completed follow-up questionnaires at three months. The item bank correlated moderately high with the GRoC (Spearman’s rho 0.63) and with the legacy instruments (Pearson’s R ranging from 0.45 to 0.68). It showed a high ES (0.97) and SRM (0.71), and could distinguish well between improved and not improved patients based on the GRoC (AUC 0.77). Comparable results were found for the derived SF8a and CAT-scores. The MIC was estimated to be 3.2 (CI 2.6-3.7) T-score points. Our study supports the responsiveness of the PROMIS-PI item bank in patients with musculoskeletal complaints. Almost all predefined hypotheses were met (94%). The PROMIS-PI item bank correlated well with several legacy instruments which supports generic use of the item bank. MIC for PROMIS-PI was estimated to be 3.2 T-score points.


Introduction
Part of the burden of living with pain is reflected in the way in which pain hinders engagement with social, cognitive, emotional, physical, and recreational activities. For patients with pain, it is therefore important to measure these hindrances. The Pain Interference (PI) item bank was developed within the Patient-Reported Outcomes Measurement Information System (PROMIS™) domain framework 11 to measure the degree to which pain limits or interferes with daily activities 1 . Several studies evaluated psychometric properties of the PROMIS-PI item bank, reporting essential unidimensionality (Omega-H 0.97-0.99, Explained Common Variance 0.81-0.95 16,40 ), good reliability (>0.95 for 96% of a sample of patients with musculoskeletal pain 14 ), good construct validity (correlations >0.50 with several legacy instruments 16,40 ), and good cross-cultural validity for the Dutch-Flemish translation (only 1-2 out of 40 items showing differential item functioning for language 14,40 ). Whereas most questionnaires are disease-specific, PROMIS questionnaires were designed to be generic. The PROMIS-PI item bank was shown to correlate well with questionnaires addressing neck pain 5,22 , low back pain 5,22 , knee complaints 28 , shoulder complaints 42 , and foot and ankle complaints 35 . Sensitivity to change and responsiveness were addressed in several studies, including populations with carpal tunnel syndrome 21 , spinal disorders 22,41 , COPD 50 , osteoarthritis 12 , stroke 12 , low back pain 3,12,25 , knee pain 26 , cancer 49 , and depressive disorders 3 . All studies administered the PI item bank using either fixed short forms or Computerized Adaptive Testing (CAT). Some of these studies reported Minimal Important Change (MIC) as well, ranging from 1.9 to 8.9 2,4,6,13,26,49 , on a T-score scale that is standardized to have a mean of 50 and a SD of 10. These studies, however, have some limitations.
Most of these studies used effect sizes (ES) and Standardized Response Means (SRM) to evaluate responsiveness, but without specifying hypotheses about the expected magnitude of the changes. ES and SRM alone are insufficient measures of responsiveness, because they are influenced by floor and ceiling effects, and they are dependent on the standard deviation of the baseline score (ES) or the standard deviation of the change score (SRM). Therefore only comparisons of the ES and SRM of different measurement instruments within studies, i.e. using the same sample, are informative about the relative responsiveness of instruments. Only a small number of studies, some with small sample sizes, compared the responsiveness of the PI item bank with other frequently used PROMs that measure similar constructs (so-called legacy instruments). Most studies presenting MIC values used mean change methods to estimate them, which do not reflect a threshold for minimal improvement 46 . This threshold is important because the aim of estimating MIC values is to get an impression of the minimum change score that can be considered important to the patient. Because of these limitations, and because responsiveness and MIC values may differ between populations, more work is needed to evaluate the responsiveness and the MIC in different settings 4 . We therefore aimed to study the responsiveness and estimate the MIC of the PROMIS-PI item bank in a population of patients with musculoskeletal complaints, who were treated by musculoskeletal physicians 37 . To evaluate the responsiveness we tested predefined hypotheses about the resemblance between the ES and SRM of the PROMIS-PI item bank and the ES and SRM of legacy instruments, together with predefined hypotheses about the correlations of the PROMIS-PI change scores with the change scores of legacy instruments and Global Ratings of Change (GRoC).
We tested responsiveness for the full item bank and for derived scores for the standard 8-item short form (SF8a) and a simulated CAT.
We estimated MIC using (adjusted) predictive modeling 43 .

Study design
To collect data for our study we used an existing web-based registry of patients who presented at the practices of 31 participating musculoskeletal (MSK) physicians in the Netherlands. MSK physicians are medical doctors who are trained to use Spinal Manipulative Treatment (SMT). They are consulted by patients with a variety of musculoskeletal complaints, most frequently of spinal origin, such as low back pain or neck pain. Specific SMT techniques are almost invariably used, but can be combined with other treatment options, such as prescription medication, or injections in the spine under X-ray guidance 37 . For our study we recruited patients who presented at the MSK practice for a first consultation. At the first visit (baseline), the physician entered data about the age, gender, type and duration of the main complaint and the existence of concomitant complaints in a web-based register.
Main complaints were recorded according to the International Classification of Primary Care.
Registered patients were asked to participate in this longitudinal study. After the patient gave informed consent the physician entered the patient's email address in the registry. A computer program (Readmail) was custom built to send automated email invitations to fill in a web-based questionnaire immediately after a patient's email address was entered in the registry, and again after a follow-up period of three months. For the present study we used the data from this registry from October 2013 to February 2014. The study was approved by the Medical Ethical Committee of the VU Medical Center (2013/20).

Measures
Our study population responded to the full Dutch-Flemish V1.1 PI item bank (www.healthmeasures.net), and to several legacy instruments. The PROMIS-PI item bank consists of 40 items with a temporal context of 7 days (e.g. "in the past seven days, how much did pain interfere with your enjoyment in life"). Each item has five possible response options; three sets of response options are used to correspond to the different items: (1) not at all, a little bit, somewhat, quite a bit, very much, (2) never, rarely, sometimes, often, always, and (3) never, once a week or less, once every few days, once a day, every hour. T-scores for the PROMIS-PI item bank for each patient were calculated based on the US item parameters using the online Health Measures Scoring Service program, provided by the US Assessment Center. Higher scores represent higher trait levels, in this case more pain interference. The Dutch-Flemish PROMIS-PI item bank was validated in Dutch populations with chronic pain 14,40 .
We calculated T-scores based on the full item bank. In addition we calculated scores using only the items from the standard 8-item short form (SF8a) and a simulated CAT. Post-hoc CAT simulations were performed with the R-package catR (v3.16) using the standard PROMIS CAT starting and stopping rules and the original US item parameters, which were obtained from HealthMeasures.
In addition to the PROMIS-PI item bank our study sample responded to one of five disease-specific legacy instruments, tailored to the main complaint. Patients with low back pain completed the Roland-Morris Disability Questionnaire (RDQ), a 24-item questionnaire measuring disability as a result of low back pain 36 . Total score ranges from 0-24, with higher scores indicating more disability.
Patients with neck pain completed the Neck Disability Index (NDI), a 10-item questionnaire measuring self-reported pain intensity and limitations in daily activities 48 . Total score ranges from 0-

At three months follow-up patients were asked to rate their perceived change in pain interference on a Retrospective Global Ratings of Change (GRoC) instrument ("Compared to three months ago, how much do you think that the limitations that you experience due to your pain have changed"). Patients answered a single-item question about their perceived change with the following response options: (1) much improved, (2) improved, (3) slightly improved, (4) unchanged, (5) slightly worse, (6) worse, or (7) much worse. A recently published paper, evaluating the reliability of transition ratings, reported the reliability of this GRoC to be relatively high 18 .

Statistical analyses
Descriptive analyses were presented for the complete sample of baseline responders, and for the groups of patients who did or did not answer the follow-up measurement. Possible selective loss to follow-up was evaluated by comparing baseline characteristics between the groups of patients who did or did not answer the follow-up measurement. Responsiveness was defined by the COSMIN initiative as the ability of an instrument to detect changes over time in the construct to be measured 33 . We used various approaches to test responsiveness 17,31 . First the PROMIS-PI measures and legacy instruments were correlated with the GRoC. Because these instruments measure comparable, but not precisely the same, constructs, we expected the PROMIS-PI item bank to correlate at least moderately with the GRoC and with the legacy instruments, with a correlation coefficient of at least 0.50 34 . As a second approach, the Area Under the Receiver Operating Characteristics Curve (AUC) was calculated for all instruments, after dividing the patient population into a group of patients considered to have improved and a group of patients considered not to have improved, based upon the GRoC scores. Patients reporting any form of improvement were considered improved (GRoC categories 1-3, much improved, improved, or slightly improved) and patients reporting to be unchanged or worse were considered not improved (GRoC categories 4-7, unchanged, slightly worse, worse, or much worse). The AUC is considered to reflect the ability of the instrument to discriminate between patients who reported to be improved and patients who reported to be not improved. An AUC of >0.70 indicates adequate ability to distinguish patients who have or have not changed 45 .
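As an illustration of the second approach, the AUC can be read as the probability that a randomly chosen improved patient has a larger change score than a randomly chosen not improved patient, with ties counted as one half. The Python sketch below is not the analysis code of this study: the function names and the seven data points are hypothetical, and the change score is assumed to be baseline minus follow-up, so that a positive value means less pain interference.

```python
def split_by_groc(changes, groc):
    """Dichotomize on the GRoC: categories 1-3 improved, 4-7 not improved."""
    improved = [c for c, g in zip(changes, groc) if g <= 3]
    not_improved = [c for c, g in zip(changes, groc) if g >= 4]
    return improved, not_improved


def auc(improved, not_improved):
    """Share of (improved, not improved) pairs in which the improved
    patient has the larger change score; ties count as one half."""
    wins = 0.0
    for x in improved:
        for y in not_improved:
            if x > y:
                wins += 1.0
            elif x == y:
                wins += 0.5
    return wins / (len(improved) * len(not_improved))


# Made-up change scores (baseline minus follow-up) and GRoC categories.
changes = [10, 6, 1, 5, 2, -3, 1]
groc = [1, 2, 3, 3, 4, 5, 4]

improved, not_improved = split_by_groc(changes, groc)
example_auc = auc(improved, not_improved)
print(example_auc)  # 0.875 for this toy data
```

With these made-up values the result lies above the 0.70 threshold used in the hypotheses; real data would of course give a different value.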
As a third approach, the correlation between the change in T-scores and the change in the scores on the legacy instruments was assessed, and the Effect Size (ES) and the Standardized Response Mean (SRM) of the PROMIS-PI item bank were compared to the ES and SRM of the legacy instruments. The ES is calculated by dividing the mean change score by the SD at baseline. The SRM is calculated by dividing the mean change score by the SD of that change score. We expected the PROMIS-PI item bank to measure change comparable to the legacy instruments, or higher due to the absence of floor and ceiling effects 17,30,32 . Responsiveness measures are reported for the T-scores calculated from the full item bank as well as the T-scores derived from the subset of items making up the standard 8-item short form and the simulated CAT.
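The two definitions above can be sketched in a few lines of Python. This is an illustration only, not the analysis code of this study: the function names and the five pairs of T-scores are hypothetical, the change score is assumed to be baseline minus follow-up, and the sample SD (n - 1 denominator) is assumed because the text does not specify which SD was used.

```python
from statistics import mean, stdev


def effect_size(baseline, follow_up):
    """ES: mean change score divided by the SD of the baseline scores."""
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(baseline)


def standardized_response_mean(baseline, follow_up):
    """SRM: mean change score divided by the SD of the change scores."""
    changes = [b - f for b, f in zip(baseline, follow_up)]
    return mean(changes) / stdev(changes)


# Made-up T-scores for five hypothetical patients (not study data).
baseline = [60.0, 62.0, 64.0, 66.0, 68.0]
follow_up = [58.0, 56.0, 62.0, 60.0, 64.0]

es = effect_size(baseline, follow_up)
srm = standardized_response_mean(baseline, follow_up)
print(round(es, 2), round(srm, 2))  # 1.26 2.0
```

The example also shows why the two measures can diverge: the numerator is the same mean change of 4 points, but the ES depends on the spread of the baseline scores and the SRM on the spread of the change scores.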
The following hypotheses were tested:
- We expected a correlation of at least -0.50 (i.e. -0.50 or stronger) between the change in the PROMIS-PI T-score and the GRoC score. This correlation was expected to be negative because improvement on the PROMIS-PI item bank is represented by a lower score while improvement on the GRoC is represented by a higher score.
- We expected a correlation of at least 0.50 between the change in T-score of the PROMIS-PI item bank and the change in score of the legacy instruments measuring functional disability (RDQ, NDI, DASH, LEFS and HIT-6). These correlations were expected to be positive for the NDI, the RDQ, the HIT-6 and the DASH, in which the disability increases with higher scores. Correlations were expected to be negative for the LEFS, in which disability decreases with higher scores.
- We expected an AUC in excess of 0.70 for the PROMIS-PI item bank.
- We expected the ES and the SRM of the PROMIS-PI measures to be larger than, the same as, or at most 0.05 smaller than the ES and the SRM of the legacy instruments.

Responsiveness was considered sufficient if at least 75% of the results were in accordance with the predefined hypotheses.
Minimal Important Change (MIC) was defined as a threshold for a minimal within-person change over time above which patients perceive themselves importantly changed. Assuming that all patients have their individual threshold of what they consider a minimal important change, the MIC can be conceptualized as the mean of these individual thresholds 46 . The MIC can be used as a threshold to determine the number of patients who have improved 46 . We estimated MIC based on data of the full item bank, using predictive modeling. With predictive modeling, the MIC is defined as the change score where the post-test probability of belonging to the improved group equals the pre-test probability (Likelihood ratio = 1) 44 .

Demographic characteristics of the included sample are presented in Table 1. The average age of the whole sample was 47 years and most patients were female (59%). Most patients were treated for spinal pain (75%), predominantly low back pain with or without sciatica (51%) and neck pain (16%). Only a small number of patients reported complaints of shorter duration than three months (19%), and more than half of the patients had complaints lasting longer than one year (61%). The RDQ was completed by 493 patients, the NDI by 167 patients, the LEFS by 98 patients, the DASH by 51 patients, and the HIT-6 by 35 patients.
Comparing the groups of patients who did or did not answer the follow-up questionnaire showed no significant differences in baseline T-scores or baseline legacy scores. Non-responders differed statistically significantly from responders in terms of age (non-responders were on average 4 years younger) and gender (non-responders were more likely to be male (45% versus 39%)).

Change in PROM scores over time
Most patients reported improvement on the GRoC between baseline and 3 months follow-up. In Table 2 the changes in scores (T0-T1) on the PROMIS-PI item bank and on the legacy instruments are presented, stratified by the GRoC scores. When using the full item bank patients who reported to be slightly improved, improved, or much improved, changed on average 2.1, 6.0, and 15.1 T-score points, respectively.

Responsiveness
All responsiveness results are presented in Tables 3 and 4. As hypothesized, the correlation with the GRoC was stronger than -0.50 for the full bank, the short form, and the simulated CAT (-0.63, -0.60, and -0.57 respectively). The AUC was above the hypothesized 0.70 for the full bank, the short form, and the simulated CAT (0.77, 0.75, and 0.74 respectively). Correlations with legacy instruments were above the hypothesized 0.50 (ranging from 0.58 to 0.68), except for the LEFS (-0.45, -0.50 and -0.38 for the full bank, the 8-item SF and the simulated CAT respectively). The ES and SRM of the PROMIS-PI item bank were higher than the ES and SRM of all the legacy instruments, except for the SRM of the DASH, which was slightly higher than that of the PROMIS-PI (0.76 as compared to 0.69-0.72 for PROMIS-PI). This difference was more than the hypothesized 0.05. Combining all these findings, 94% of our results were in accordance with the predefined hypotheses, strongly supporting responsiveness of the PROMIS-PI item bank (Table 5).

Minimal Important Change
The MIC for the PROMIS-PI was estimated to be 3.2 T-score points, with a confidence interval of 2.6-3.7.
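The predictive-modeling definition of the MIC given in the methods (the change score at which the post-test probability of being improved equals the pre-test probability) can be sketched as follows. This is an illustration under stated assumptions, not the analysis code of this study: the function names, the Newton-Raphson fitting routine, and the eight data points are hypothetical, the change score is assumed to be baseline minus follow-up, and the adjustment used in adjusted predictive modeling is not included.

```python
import math


def fit_logistic(x, y, iterations=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson; return (b0, b1).
    Assumes the two groups overlap on x (no perfect separation)."""
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            g0 += yi - p            # gradient, intercept
            g1 += (yi - p) * xi     # gradient, slope
            h00 += w                # information matrix entries
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def predictive_mic(changes, improved):
    """Change score at which the predicted probability of being improved
    equals the observed proportion improved (likelihood ratio = 1)."""
    b0, b1 = fit_logistic(changes, improved)
    prevalence = sum(improved) / len(improved)
    prior_log_odds = math.log(prevalence / (1.0 - prevalence))
    return (prior_log_odds - b0) / b1


# Made-up change scores and improved (1) / not improved (0) flags.
changes = [0, 1, 2, 3, 4, 5, 6, 7]
improved = [0, 0, 1, 0, 1, 0, 1, 1]

mic = predictive_mic(changes, improved)
print(round(mic, 3))  # 3.5 for this symmetric toy data
```

The toy data are mirror-symmetric around a change score of 3.5 and half of the hypothetical patients are improved, so the sketch returns 3.5 regardless of the fitted slope; with real data the result depends on both the fitted coefficients and the proportion of improved patients.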

Discussion
We studied responsiveness of the PROMIS-PI item bank in a population of patients with musculoskeletal complaints treated by musculoskeletal physicians. Predefined hypotheses about the relation between PROMIS-PI change scores with the change scores of several legacy instruments were tested. Furthermore, we reported the responsiveness for the full item bank as well as the responsiveness of the subset of items making up the standard 8-item short form and a simulated CAT, and we estimated the minimal important change of the PROMIS-PI. Almost all previously defined hypotheses (94%) were met, which strongly supports the responsiveness of the PROMIS-PI item bank in patients with various musculoskeletal complaints. Using adjusted predictive modeling we estimated a MIC of 3.2 for the full PROMIS-PI item bank (CI 2.6-3.7). Since short forms and CAT are based on the full item bank, we consider this MIC value also applicable to short forms and CAT derived from this item bank.
All previous studies except for one 25 reported positively on the responsiveness of the PROMIS-PI item bank. The study with a negative outcome was carried out in a population with musculoskeletal complaints treated with telecare management, overall showing small effect sizes (15). It cannot be ruled out, however, that the telecare was ineffective, rather than that the PROMIS-PI was not responsive. Since responsiveness is concerned with measuring change over time, it is necessary to study responsiveness in a population that shows change over time. To estimate MIC values, it is also necessary to have a proportion of unchanged patients. In that respect, our study population was well suited both to evaluate responsiveness and to estimate MIC values, as some patients reported various levels of improvement and others reported no change or deterioration. The techniques used to evaluate responsiveness differed strongly between previous studies. Many studies used effect sizes as a measure of responsiveness, a method that has limitations. An instrument should not only measure change in the purported construct, but it should measure the right amount of change, i.e. it should not under- or overestimate the real change in the construct that has occurred 30 . Therefore, change scores from new measures should be compared to change scores of existing instruments in which the responsiveness was properly evaluated. However, even with the different techniques used, almost all studies reported positive findings, and we would suggest that there is strong evidence for the responsiveness of the PROMIS-PI item bank.
Similarly, estimates of the MIC will be influenced by the population studied and the techniques used.
The only study that reported a very low MIC value was conducted on a cohort of patients with arthritis, rheumatism and aging, without specific treatment 4 .

In our study, the responsiveness results were similar for the full bank, the derived short form, and the simulated CAT, which may indicate that the PROMIS-PI short form and CAT are as responsive as the full item bank, although they contain far fewer items. Correlations with the legacy instruments were very comparable between the scores obtained from the full item bank and the scores obtained using only the items from the 8-item short form and the simulated CAT, which is in line with previous reports about the correlation of PROMIS short forms with full item bank scores 10 . It must be noted that our population consisted predominantly of patients with spinal complaints, and a much smaller proportion of patients presented with headache or with complaints of the upper or lower extremity. Therefore the comparisons with the DASH, the LEFS and the HIT-6 may be less reliable than the comparisons with the RDQ and the NDI.
Our study confirmed previous reports about the responsiveness of the PI item bank.
In our study, we evaluated responsiveness using predefined hypotheses about the relationship with the change scores of a number of legacy instruments, and (adjusted) predictive modeling to estimate MIC values. The PROMIS-PI item bank showed strong evidence supporting responsiveness. Other studies showed similar correlations with legacy instruments in cross-sectional studies, supporting the generic applicability of the item bank 5,22,28,35,42 . A previous study also showed that scores can be compared across different populations (limited differential item functioning was found) 15 . The PI item bank, therefore, may replace a number of disease-specific instruments, which would greatly simplify routine monitoring of patients with different musculoskeletal complaints.

Strengths and weaknesses
A strength of our study is the large sample of patients with MSK complaints completing the full PROMIS-PI item bank together with GRoC and legacy instruments both before and after treatment.
Furthermore we tested predefined hypotheses to assess responsiveness, as recommended by COSMIN 32 . A weakness of our study was that the short form and CAT were not independently administered but derived from the answers to the full item bank. Therefore the results presented for the short form and CAT will be approximations of the responsiveness for these ways of administering the item bank. Another weakness could be the relatively low percentage of responders at follow-up. Only 57% of baseline responders completed the questionnaires at three months follow-up. Non-responders were slightly younger and more often male 38 . The T-scores on the PROMIS-PI item bank did not differ between responders and non-responders, nor did the scores on the legacy instruments. As the goal of our study was to test responsiveness rather than to measure the effect of treatment, selective loss to follow-up is not likely to cause bias. In responsiveness studies the change scores on measurement instruments for a similar construct are compared with each other or between a group of improved and not improved patients. Therefore, representativeness of the population is a less important issue. Another weakness is the relatively small proportion of patients with headache or with upper or lower extremity complaints, but we did not aim to estimate MIC values for subgroups of patients.
The significant differences in age and sex are similar to those in previous studies in the same population, which likewise recruited slightly more older patients and more female patients 38-40 . We think it is unlikely that this has biased our study results. Although it may be interesting to evaluate whether the responsiveness differs between subpopulations, this was not one of the aims of the present study.
Effect sizes in our study are high, which may reflect the effectiveness of the treatment. However, our study is not intended to evaluate the effect of the treatment, but solely to evaluate responsiveness.
Recruiting patients at a first consultation likely selects patients at a point in time where their fluctuating pain is high, and the high effect sizes may (partially) be explained by regression to the mean.

Conclusion
The change scores of the PROMIS-PI item bank correlated well with Global Ratings of Change and with the change scores of a number of legacy instruments, except for the LEFS. Effect Sizes, Standardized Response Means, and Area Under the Curve of the item bank were mostly slightly higher than those of the legacy instruments. Based on a priori hypotheses, the PROMIS-PI item bank showed sufficient responsiveness in patients with musculoskeletal complaints. MIC was estimated to be 3.2 T-score points. Our study supports the generic use of the PROMIS-PI in patients with a variety of musculoskeletal complaints.

Table 1; Title: Demographic data and treatment details for the whole sample, responders, and nonresponders at follow-up.
Legend: Baseline data of entire population and of responders versus non-responders at follow-up.
Differences responders versus non-responders at follow-up were analyzed with regression analyses for continuous and dichotomous variables, and Chi2 for nominal variables.

Legend: Outcomes that are according to predefined hypotheses are marked with V, outcomes that are contrary to predefined hypotheses are marked X. For effect sizes and standardized response means the differences between PROMIS-PI and the legacy instruments are presente