Part IV: Developing WQI Protocols
On Developing Protocols for the WQI
From the beginning of our study, we have taken the position that quality is a time-varying feature of online-learning environments. Since wikis preserve their entire edit history in continuous time, data from these wikis are uniquely well suited to evaluating changes in wiki quality. In theory, researchers could take quality measures every second on a given wiki and reconstruct wiki quality in continuous time. However, since our WQI instrument requires human raters to evaluate quality, there are non-trivial costs to taking quality measures (see Part II on Content Analysis). Therefore, in our research design, we needed to develop a data collection protocol that balanced our desire for adequate data with the constraints of our budget. At present, we measure wiki quality on days 1, 7, 14, 30, 60, 100, and 400. In this section, we describe the preliminary research that we conducted to arrive at these particular occasions of measurement.
In developing protocols for administering the WQI, we faced a tension between the desire to take many measures to accurately model our dependent variable and the constraints of time and money. This tension led us to address two questions about the frequency and timing of wiki quality measurements as we created protocols for administering the WQI:
1) Frequency: How often should we measure wiki quality?
2) Timing: When in a wiki’s lifecycle should we take quality measurements?
The optimal frequency of wiki quality measurement is proportional to the complexity of wiki quality developmental trajectories. If wiki quality develops in a linear trajectory, then three data points are sufficient to model these linear trajectories. However, if typical wiki quality developmental trajectories are more complex, then more measurements are necessary to accurately model this development. Thus, we needed to assess the complexity of wiki quality trajectories in order to ascertain the optimal frequency of data collection.
While answering the question of how frequently to measure, we also needed to determine the optimal timing of measurements. As we estimated the complexity of wiki quality developmental trajectories, we also needed to model wiki lifetimes to determine when within the wiki lifecycle we should be taking quality measures. For instance, if wikis typically remain active and changing for 10 days, then we would have a different timing for taking measurements than if they typically remain active for 10 months or 10 years.
To address these two questions we used two longitudinal analytical methods. To assess the complexity of wiki quality development, we used empirical growth modeling. To model typical wiki lifetimes, we used continuous-time survival analysis. We combined insights from the findings of both of these methods in order to settle on our protocol of WQI measurements. In the following sections, we summarize the design, findings, and implications of our studies using these two methods, and then we detail the reasoning that led us to our current protocol for wiki quality measurements.
How often should we measure wiki quality?
We faced a catch-22 as we approached the question concerning the appropriate frequency of wiki quality measurements. Ideally, we would have made our determination of the appropriate frequency of wiki quality measurements by evaluating the complexity of wiki quality trajectories. It was not possible, however, to measure wiki quality trajectories before developing an instrument and protocol for measuring wiki quality.
We resolved this dilemma in the earliest phases of our research by using a measurement of wiki usage as a proxy for wiki quality development: the number of page edits to a wiki, counting both revisions to existing pages and new page creations. In other words, in order to create a protocol for measuring wiki quality trajectories, we assessed the complexity of wiki developmental trajectories, using page edits as the metric for wiki development. We chose page edits for both practical and theoretical reasons. First, it seemed reasonable to us that periods of high volumes of edits and activity might co-occur with quality development. Second, PBworks provided us with data files containing page edits for a random sample of 1,799 publicly-viewable education-related wikis. (See Part II, Section 2 for details on our samples.) These data files were arranged into a project-period dataset, where each row in the dataset corresponded to one day of activity for one wiki. Thus, we had ready access to the data necessary to evaluate wiki development in terms of page edits.
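To make the structure of such a dataset concrete, the following sketch shows how raw edit timestamps might be collapsed into project-period rows, one wiki-day per row. The function name, field layout, and toy edit log are our own illustrations, not PBworks' actual export format.

```python
from collections import Counter
from datetime import date

def to_project_period(edits):
    """Collapse raw (wiki_id, edit_date) records into a project-period
    dataset: one row per wiki per day, counting that day's page edits
    (revisions and new page creations alike)."""
    counts = Counter((wiki_id, when) for wiki_id, when in edits)
    # Each row: (wiki_id, day, n_edits), sorted for readability.
    return sorted((w, d, n) for (w, d), n in counts.items())

# Hypothetical raw log: one entry per page edit.
edits = [
    ("wiki_a", date(2009, 1, 1)),
    ("wiki_a", date(2009, 1, 1)),
    ("wiki_a", date(2009, 1, 3)),
    ("wiki_b", date(2009, 1, 2)),
]
rows = to_project_period(edits)
# rows == [("wiki_a", date(2009, 1, 1), 2),
#          ("wiki_a", date(2009, 1, 3), 1),
#          ("wiki_b", date(2009, 1, 2), 1)]
```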
We used both exploratory empirical analysis and statistical modeling to assess the shape of typical wiki developmental trajectories. First, we examined a series of empirical growth plots for wikis, and then we fit a series of non-parametric local regression (loess) models. In the absence of any theoretical assumptions about the shape of wiki quality trajectories, these approaches allowed us to assess and then model wiki development without imposing any parametric constraints on our model.
As recommended by Singer and Willett (Singer & Willett, 2003), before we did any modeling of wiki development, we examined a series of empirical growth plots from a sample of 411 U.S., K-12 wikis. First, we created a simple scatter plot for several wikis with days on the x-axis and page edits on the y-axis. We also fit a simple OLS regression line to each plot to begin to estimate the direction of wiki development. After examining wikis with time binned into days, we decided that we might be evaluating too much stochastic variation from day to day. Therefore, we recoded the dataset so that time was binned into months. Especially for longer-lived wikis, looking at these empirical growth plots helped capture larger trends.
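The OLS trajectories we fit to each plot reduce to a simple slope computation. As a minimal illustration, the sketch below recovers the kind of negative slope we saw for wikis whose activity declines over time; the monthly edit counts are hypothetical.

```python
def ols_slope(xs, ys):
    """Slope of the simple OLS regression line through (xs, ys)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx

# Hypothetical monthly edit counts for one wiki: busy early, then quiet.
months = [1, 2, 3, 4, 5, 6]
edits = [40, 22, 10, 3, 0, 0]
slope = ols_slope(months, edits)  # negative: activity declines over time
```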
In Figure 1, we present four of these empirical growth plots. Notice that for the top two plots, wikis are more active in the early months and then become less active. For the bottom two plots, wiki activity maintains very low levels throughout the entire history of the wiki. The slopes for most of the OLS trajectories fitted to these plots were either negative or flat near zero. These patterns alerted us to two common patterns of wiki development: 1) wikis with early activity that gradually declines and 2) wikis that have little or no activity throughout their lifetime. Of course, it is difficult to systematically evaluate 411 individual empirical growth plots, so we also modeled wiki development in our entire sample.
Figure 1: Empirical growth plots, measured in monthly page edits, for four U.S., K-12 wikis, with OLS trajectories.
We also created a scatterplot of days and page edits on a single display, which we present here as Figure 2. This display is most useful as a kind of outlier analysis, since the density of points at low levels of wiki page edits cannot be ascertained from this figure. For instance, at any given day after day 1, the modal number of edits is 0; however, this density is not represented in the figure. Still, it is interesting to note that the two main “spikes” of activity occur in the very first days in the wiki lifecycle and then again around day 365, a year after wiki creation.
Figure 2: Scatterplot of page edits by day for U.S., K-12 Wikis (n=411).
After these empirical analyses, we analyzed the data using non-parametric loess smooth models. These local regression models fit low-order polynomial functions to small subsets of the data and then connect these functions to create a smooth curve through the dataset. Since no studies of wiki communities existed to provide us any theoretical reasons for specifying a particular functional form of wiki developmental trajectories, the loess approach allowed us to model these trajectories without any parametric constraints.
One challenge of loess regression was that it required the specification of a smoothing parameter that determines the bandwidth of the local data subsets. If one chooses a bandwidth which is too narrow, then the model captures too much random variation; if one chooses a bandwidth which is too wide, then the model can smooth over important variation. Choosing an appropriate bandwidth is more art than science, since no robust methods exist for the optimal determination of the smoothing parameter. The goal is to choose a parameter which highlights the functional form of the curve and smooths out the random variation.
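To illustrate the machinery involved, a stripped-down local regression smoother might look like the sketch below. This is a simplified stand-in for a full loess implementation: it uses tricube weights and a fixed window half-width in place of loess's nearest-neighbor span, so the `bandwidth` argument only approximates the smoothing parameter discussed above.

```python
def tricube(u):
    """Tricube weight: 1 at the center, falling to 0 at the window edge."""
    u = abs(u)
    return (1 - u ** 3) ** 3 if u < 1 else 0.0

def loess_point(x0, xs, ys, bandwidth):
    """Weighted local linear fit evaluated at x0. `bandwidth` is the
    half-width of the window in x-units."""
    w = [tricube((x - x0) / bandwidth) for x in xs]
    sw = sum(w)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
    b = sxy / sxx if sxx > 0 else 0.0
    return my + b * (x0 - mx)

# Smoothing a flat series returns the constant; widening the bandwidth
# averages over more neighbors and flattens any local spikes.
xs = [float(i) for i in range(10)]
est = loess_point(4.0, xs, [5.0] * 10, 3.0)  # ≈ 5.0
```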
In Figure 3, we present two loess curves with slightly different smoothing parameters. The panel on top has a narrower bandwidth than the panel on the bottom. Both panels generally have the same key features. First, on most days, the average number of edits across the U.S. K-12 wikis is very close to zero. On most days, most wikis experience no changes. Second, average wiki activity is higher very early in the wiki lifecycle and then a second spike occurs around day 365, after a year of wiki activity.
Figure 3: LOESS smooth curves of page edits by day for U.S., K-12 wikis, with smoothing parameters of .1 and .25 (n=411)
From analyzing these loess smooth plots, we drew three important lessons about wiki developmental trajectories. First, on average, the bulk of wiki activity happens within the early days of wiki development. Second, there is an additional spike of activity after a year. Third, aside from these periods of high activity, on average, wiki development is negligible. On any given day, most wikis will experience few or no changes in page content, which suggests that they will not change in quality either. From these lessons, we determined that we would need to concentrate a considerable amount of our resources on measuring early wiki quality, and that we should also take at least one measurement after a year of activity to capture the spike that occurs around a wiki’s one year anniversary, for those wikis that survive that long.
When in a wiki’s lifecycle should we take quality measurements?
In order to determine the timing of wiki quality measurements, we needed to assess typical wiki lifecycles. To do this, we needed to first define the features of a wiki lifecycle and definition rules for the “birth” and “death” of a wiki. With these decision rules in place, we used continuous-time survival analysis to model typical wiki survivor functions.
Measuring wiki lifetimes involves applying a biological metaphor, the lifecycle, to a socio-technical community. The birth of a wiki occurs at a distinct, measurable moment when a user generates a new subdomain on a wiki hosting network. Designating the moment of death of a wiki is more subjective, since wikis can always be returned to, changed, and edited, even after years of inactivity. Nonetheless, we can identify precisely the last moment when a wiki was changed (through a page edit or new page creation), after waiting a sufficient time without further activity to ensure that the wiki is not merely dormant. Since the longest break in the U.S. academic year is the three-month summer holiday, we have adopted a 90-day period of inactivity as being sufficiently long to designate a wiki as “dead.”
Other definitions of death are possible. One compelling alternative would be to choose the last time a wiki was viewed. We chose to use editorial changes because we value active engagement over the viewing of static information. We also could have chosen a different value for our 90-day observational window. In fact, we know that 13% of our 411 K-12 wikis have gaps between page edits that exceed 90 days, ranging from 92 to 754 days. During our survival analysis, we tested alternative models with a 120 day observational window and found that results did not differ substantially. Therefore, we decided against expanding the window further as we would have censored a large number of wikis very likely to be indefinitely inert in order to avoid labeling as dead a small number of wikis which may experience future changes.
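The decision rules above can be sketched as two small helper functions: one measuring the largest gap between consecutive edits, and one applying the inactivity window. The function names and toy edit log are hypothetical; the 90-day constant is the one described in the text.

```python
from datetime import date

def max_gap_days(edit_dates):
    """Largest number of days between consecutive edits."""
    ds = sorted(edit_dates)
    return max((b - a).days for a, b in zip(ds, ds[1:])) if len(ds) > 1 else 0

def is_dead(last_edit, observed_on, window=90):
    """A wiki is 'dead' once `window` days pass with no edits
    (90 days here, matching the longest U.S. school holiday)."""
    return (observed_on - last_edit).days >= window

# Hypothetical edit history spanning a quiet fall term.
edits = [date(2008, 9, 1), date(2008, 9, 15), date(2009, 1, 10)]
gap = max_gap_days(edits)                    # 117-day dormant stretch
dead = is_dead(edits[-1], date(2009, 2, 1))  # only 22 days of silence: not dead yet
```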
With this established definition of a wiki lifetime, we could then model wiki lifetimes using survival analysis. We could not simply use univariate statistics of lifetime measures or use wiki lifetime as an outcome in ordinary least-squares (OLS) regression analysis because of the problem of censoring. Not every wiki experiences the event of interest—the end of wiki activity—during our observational period. That is, some wikis had their final observed edit within our 90-day window, and as a result we did not know whether these wikis were permanently inert. If we treated our lifetime measures as an outcome in OLS regression, our results would be biased, since some wikis lived longer than our records indicate. Because of this issue of censoring, we use well-developed techniques from the epidemiological literature known as “survival analysis” or “event history analysis.”
To conduct survival analysis, we recorded our measures in a project-level dataset, where every row in the dataset corresponded to one wiki. Event history analysis requires that we use a dichotomous measure of the event of interest as our outcome. Thus, we recorded EVENT as a dichotomous variable coded as “1” when a wiki’s final edit was at least 90 days before data collection and coded as “0” otherwise. Our measure of wiki lifetimes was DAYS, a continuous variable recording the number of days between creation date and last edit date. In our sample of U.S. K-12 wikis, DAYS ranged from 1 (meaning that the wiki’s last change was within 24 hours of its creation) to 914.
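As a sketch, the EVENT and DAYS coding just described might be implemented as follows. The dictionary layout and the day-1 floor (so that a wiki edited only on its creation day records DAYS = 1) are our own illustrative choices.

```python
from datetime import date

def encode_lifetime(created, last_edit, collected, window=90):
    """Project-level record for survival analysis.
    DAYS: observed lifetime in days (floored at 1).
    EVENT: 1 if the last edit is at least `window` days before data
    collection (death observed), 0 if the wiki may still be alive
    (right-censored)."""
    days = max((last_edit - created).days, 1)
    event = 1 if (collected - last_edit).days >= window else 0
    return {"DAYS": days, "EVENT": event}

rec = encode_lifetime(date(2009, 1, 1), date(2009, 1, 14), date(2009, 12, 1))
# rec == {"DAYS": 13, "EVENT": 1}: died at 13 days, well before collection
```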
As reported in our Educational Researcher paper (Reich, Murnane & Willett, in press), we used Kaplan-Meier analysis to estimate “baseline” survivor functions for our 255 public school wikis. In Figure 4, we present the Kaplan-Meier estimated survivor function for our wiki sample (Singer & Willett, 2003). We display the time since wiki creation on the X-axis and estimated survival probabilities (the proportion of wikis that remain active beyond each particular time-point) on the Y-axis.
Figure 4: Estimated survivor function of U.S., K-12, public school wikis (n=255).
The steep initial drop in the estimated survivor function indicated that many wikis are terminated almost immediately after creation. For instance, the estimated median lifetime (the length of time beyond which 50% of the original wikis survive) of public school wikis was only 13 days, and only one quarter of wikis persisted beyond 151 days. These estimates suggested that most wikis that were used at all were used for short-term projects and assignments rather than serving as long-term course platforms or student portfolios.
From these findings, we established a useful summary statistic of wiki lifetimes, the median lifetime, that we could use as a referent for determining occasions of measurement. We also recognized that we would need to concentrate our measurements very early in the wiki creation process, since so many wikis failed after so few days.
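For readers unfamiliar with Kaplan-Meier estimation, a minimal version of the estimator and the median-lifetime summary can be sketched in a few lines. The (DAYS, EVENT) records below are toy data, not our actual sample.

```python
def kaplan_meier(records):
    """Kaplan-Meier survivor function from (days, event) pairs, where
    event == 1 marks an observed death and 0 a censored lifetime.
    Returns [(t, S(t))] at each observed death time."""
    times = sorted({d for d, e in records if e == 1})
    s, curve = 1.0, []
    for t in times:
        at_risk = sum(1 for d, _ in records if d >= t)
        deaths = sum(1 for d, e in records if d == t and e == 1)
        s *= 1 - deaths / at_risk
        curve.append((t, s))
    return curve

def median_lifetime(curve):
    """First time at which estimated survival drops to 0.5 or below."""
    return next((t for t, s in curve if s <= 0.5), None)

# Toy records: four observed deaths, two wikis censored at day 30.
records = [(2, 1), (5, 1), (13, 1), (20, 1), (30, 0), (30, 0)]
curve = kaplan_meier(records)
```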
The protocol for measuring wiki quality
Using the insights from our growth modeling and survival analysis, we established our protocols for collecting observations of wiki quality. In our earliest wiki quality coding, before we knew exactly how long it would take to code a wiki on average, we settled on four occasions of measurement: days 7, 14, 100, and 400. We chose days 7 and 14 in order to capture two points very early in the wiki lifecycle, given our knowledge that most wiki edits happen early in the wiki lifecycle and half of all public school wikis fail by day 14. It proved convenient that our median lifetime of 13 days was fairly close to a culturally-meaningful marker of time: 14 days, or two weeks. While days 7 and 14 had methodological appeal, they are also easy to interpret as one and two weeks into a wiki’s lifetime. We chose day 400 in order to capture the “bump” of activity that we found in wikis that survived at least a year. Finally, we added a measure at day 100 to capture wiki quality at approximately the semester mark. We chose a date closer to the two-week mark, rather than midway between days 15 and 400, because we knew that wiki survival probabilities decrease rapidly over that period, and at day 100 we would still be measuring quality in approximately 25% of all wikis.
While the uneven spacing of these measurements may appear intuitively inelegant, there are good methodological reasons for choosing this spacing. The precision of ordinary least squares regression estimates of rate of change in a growth model is a function of the precision of the measurements and the spacing and frequency of occasions of measurement (Singer & Willett, 2003). Additional measurements increase precision, and spacing those additional measurements widely increases precision further. Therefore, the asymmetry in measurement timing actually increases the precision of our regression estimates.
After some of our pilot studies in wiki coding, we determined that adding additional occasions of measurement would not be unduly expensive or time consuming, especially measures taken after day 14 when 50% of wikis have ceased changing. We believed that additional measures would help us more accurately model complex wiki development. At the time of making this decision, we were concerned that sparse data at the higher values of time might cause difficulty when we tried to fit models of wiki quality development with polynomial specifications of time. Therefore, we selected day 30 and day 60, days approximately twice and four times the median lifetime, as additional occasions of measurement.
In retrospect, choosing day 30 and day 60 as our additional occasions of measurement might not have been the best allocation of our resources. In developing our protocol, we assumed that we would use some kind of polynomial specification of time in modeling wiki quality. We also assumed that quality would develop throughout wikis’ lifetimes and that we needed sufficient data throughout the wiki lifetime to model potential complexities in these quality growth trajectories. These two assumptions proved to be incorrect. With six occasions of measurement, we determined that we had sufficient data to attempt complex, non-linear specifications of time. As became clear when we analyzed our complete set of wiki quality measurements in our first sample, wiki quality is best modeled with a non-linear, logarithmic specification of time rather than a polynomial specification of time. Moreover, wiki quality primarily changes during the first two weeks of a wiki’s lifetime, not throughout the entire wiki lifetime. We had predicted a concentration of activity early on from our wiki development trajectories, but the concentration of activity was even more striking than we had hypothesized. The most important days to measure wiki quality, therefore, are the earliest days of a wiki lifetime. Rather than adding day 30 and day 60, we would have perhaps been better off adding a measurement at day 1 to continue to increase our precision in modeling early wiki quality development.
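The model comparison described here can be illustrated with a small least-squares sketch: fitting quality scores against raw time and against log-transformed time. The quality values below are invented to mimic the early-growth-then-plateau pattern we observed; they are not our actual measurements.

```python
import math

def sse_simple(xs, ys):
    """Residual sum of squares for the simple OLS fit of ys on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = my - b * mx
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical quality scores: rapid early growth, then a plateau.
days = [1, 7, 14, 30, 60, 100]
quality = [1.0, 2.9, 3.6, 4.4, 5.1, 5.6]

sse_linear = sse_simple(days, quality)
sse_log = sse_simple([math.log(d) for d in days], quality)
# The logarithmic specification tracks the plateau far better than
# linear time, leaving a much smaller residual sum of squares.
```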
Thus, as we measured wiki quality in subsequent samples (for studies that we are currently working on), we evaluated quality at days 1, 7, 14, 30, 60, and 100. We could not measure wiki quality at day 400 in these samples because the wikis had not persisted long enough at the time of our data collection. One of the questions we hope to evaluate in our data analysis is whether the additional measurement at day 1 indeed improves the precision of our measurements.
We hope that other researchers can take away several lessons from this narrative of our development of protocols for the application of the WQI. First, we used questions concerning the timing and frequency of our measures to frame our decision-making about when to measure wiki quality. Second, in the absence of existing published research about wiki quality, we were able to use easily obtained data about wiki development to make reasonable assumptions about wiki quality development. We learned from initial analyses that wikis typically survived for a short period of time and that most of their activity occurred early in the wiki lifecycle. Thus, we focused our resources on measuring the first days of wiki development, but we also took enough measures to track wiki quality growth over a full year. Finally, we used wiki quality data from our first study to refine our protocols in subsequent studies (in particular, adding an additional occasion of measurement at day 1). In retrospect, it might have been wiser to conduct a small pilot study with a random subsample of our 255 public school wikis, and completely analyze the quality measures before pressing ahead with a complete study of our first sample. We did not do so because by the time we had trained our research assistants, we needed to keep them working steadily throughout the year, so taking a break from measuring while we conducted in-depth analyses would not have been feasible. It might have been wiser to plan for a small but complete pilot from the beginning, and manage our hiring and staffing accordingly. Overall, however, our strategies for designing the WQI protocols allowed us to address the research questions of our larger study.

While refining our purpose categories, we also asked research assistants to describe “patterns of practice” they encountered on the wikis. These patterns of practice were identifiable discursive moves made by teachers and students to facilitate student learning.
Again, we gave coders very few guidelines for what might constitute these patterns of practice. We did ask them to think about our four conceptual quality categories, of participation, expert thinking, complex communication, and new media literacy. Beyond that, however, we asked them to simply write about what they saw happening. At this stage, we examined over 400 wikis with two raters looking at each wiki, so we developed a pretty extensive set of qualitative descriptions of wiki activities.
We also, in these early rounds, began testing preliminary items. For instance, we developed a four-item taxonomy of behaviors displaying complex communication: concatenation, copyediting, co-construction, and commenting. We considered whether we could attempt to create some kind of quality scale for these items, but we realized that it would be impossible to quickly and reliably assess “good” copyediting versus “bad” copyediting. We did attempt to make a simple, scalar assessment of the frequency of these activities by using a 0-2 scale where 0 was “activity not found,” 1 was “activity found infrequently,” and 2 was “activity found regularly.” We did not provide precise definitions for the frequency categories. We found that we were unsuccessful at reliably rating the frequency of these four collaborative activities, but we were successful at reliably identifying the presence or absence of these activities. Moreover, wikis with evidence of multiple collaborative characteristics did appear to be generally more collaborative than wikis with just one characteristic. We also discovered that certain behaviors, such as signing up for a timeslot or a responsibility on a list, did not fit well within our complex communication schema. So in future iterations of the WQI we added items for planning, scheduling, and discussion.
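The contrast between our failed frequency scale and our more reliable presence/absence coding can be illustrated with a simple percent-agreement computation. The two raters' scores below are hypothetical.

```python
def agreement(r1, r2):
    """Proportion of items on which two raters agree exactly."""
    return sum(a == b for a, b in zip(r1, r2)) / len(r1)

# Hypothetical 0-2 frequency ratings for ten wikis on one item
# (0 = not found, 1 = found infrequently, 2 = found regularly).
rater1 = [0, 1, 2, 1, 0, 2, 1, 0, 2, 1]
rater2 = [0, 2, 1, 1, 0, 2, 2, 0, 1, 2]

freq_agree = agreement(rater1, rater2)           # 0.5 on the 0-2 scale
bin_agree = agreement([min(r, 1) for r in rater1],
                      [min(r, 1) for r in rater2])  # 1.0 once collapsed
```

Collapsing the scale to presence/absence (`min(r, 1)`) removes the disagreements about frequency while preserving agreement about whether the behavior occurred at all.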
Through additional rounds of pilot testing, we attempted several other approaches towards item design. For instance, we developed a set of indicators of technology use for our new media literacy category. These items included using formatting, adding links, and embedding images. For a while, we tried to distinguish between “substantive” and “decorative” uses of these elements. For instance, when did formatting really enhance the argument or artistic message of a wiki page, and when was it simply meaningless decoration? This was another effort at scalar measurement, and once again we could achieve agreement on the presence or absence of formatting, but we could not reliably distinguish decoration from substantive uses in a timely fashion.
In another pilot version of the WQI, we tried to identify both the presence of an activity as well as the intention for the activity to take place. In some wikis teachers indicate that certain behaviors are supposed to happen. For instance, a teacher might assign students to comment on each other’s work. We attempted to measure both when an activity actually happened and when a teacher intended for the activity to happen. Measuring teacher intent, however, quickly devolved into an exercise in parsing and mind-reading with low reliability, and we abandoned the effort.
While refining the item categories, we also refined our decision rules for each item. We found early on that long decision rules that listed many examples of the presence and absence of a behavior led to disagreement. When decision rules listed many specific examples, some coders only looked for those examples while others looked for the general principle. Based on this experience, for each item we wrote relatively short decision rules that focused on the general principle without many examples. We also experimented with phrasing our decision rules as questions, but we found it more effective to define decision rules as pairs of declarative statements describing the presence and the absence of the behavior. We still use the “question format” in publications as a summary of our instrument, but coders do not use the questions.
Thus, through numerous rounds of pilot testing, refinement, and iteration, we settled upon a near-final version of the WQI. In our last round of pilot coding, before we began training a new set of research assistants, we had two senior research assistants code a set of new wikis with the instrument. Afterwards, they sat down to discuss their disagreements, and we used these points of disagreement to make additional refinements to our decision rules. We also used some of these difficult wikis in our training set for new research assistants, to give them a sense of some of the challenges of coding wikis consistently.
When we started the first round of wiki coding, we had 25 items in four subdomains. There were two differences between that version of the WQI and the one that we reported in our early publications. In the original specification of the WQI, the participation subdomain included six items: Course Materials, Information Gateway, Contribution, Individual Page, Shared Page, and Student Ownership. In the complex communication subdomain, the WQI included the present seven items as well as one item for Beyond Classroom Communication, which evaluated whether students from more than a single classroom interacted on the wiki. We changed these items after coding the wikis for our first study and using principal components analysis to determine if our theorized subdomains in fact clustered together.
We made two changes to the instrument based on this analysis. First, we deleted the item concerning Beyond Classroom Collaboration. This behavior was so rare that the item artificially inflated our overall interrater agreement (it is easy to agree about something that never happens), and it did not cohere well with the other items in the complex communication category. Second, we separated the Course Materials and Information Gateway items out of the participation subdomain. Theoretically, the reason to include them in the participation subdomain was that they represented basic ways for students to interact with the wiki. However, since many wikis consisted only of students engaging with the wiki by viewing course materials and links, principal components analysis showed that wikis with positive scores for these two categories tended to score a 0 in all other categories. As a result, we created a fifth subdomain, Information Consumption, based on our empirical data, which included our two items for Course Materials and Information Gateway.
At this point, we expect the current version of the WQI, with 24 items in five subdomains, to remain stable as we continue our data analysis on additional wiki samples.
Summarizing the Design Process for the Wiki Quality Instrument
Our process of instrument design included six major steps.
- Defining a theoretical framework for wiki quality based on the literature regarding 21st century skills
- Conducting qualitative research with wiki-using teachers and students to determine how they defined and assessed wiki quality
- Conducting a literature review of efforts to measure quality in online learning environments in order to assess whether existing items, scales or instruments could be integrated or adapted for our purposes
- Conducting several rounds of open coding on wiki learning environments in order to develop a taxonomy of common patterns of practice on wikis
- Conducting multiple rounds of pilot testing to test different items, scales, and decision rules
- Making final revisions to the instrument, after data collection and analysis, based on cluster and principal components analysis
Designing this instrument has been a balancing act. On the one hand, we sought to identify important indicators of potential opportunities for 21st century skill development. On the other hand, in order to investigate wikis at scale, we have ensured that the indicators we chose to examine can be evaluated reliably and relatively quickly. The WQI was designed to be used in a research program where we make thousands of evaluations by examining hundreds of wikis on multiple occasions. It is also designed to be used in evaluating a very diverse population of wiki learning environments from all subjects and grade levels. We believe that this foundational instrument can be refined and improved to be even more useful, valid, and nuanced in evaluating more specific sub-populations of wiki learning environments.
Notes

- See Part II for analyses of actual coding times; 30 minutes was our target.
- Some wikis are created and then never viewed at all by the creator, and when a coder visits the URL of one of these wikis they receive an error message. Some wikis are created and then viewed by the creator, and our raters could view these, even though they were unchanged.
- Sample sheets are available by request from the authors. We have not posted them here since we have decided not to repost URLs of wikis from our study.
- We have experimented with developing computational tools for determining a wiki’s creation date. We have found that a small number of districts and schools have institutional wiki creation processes. In these cases, API calls to the PBworks data warehouse for the wiki creation date can return dates for when a group of wiki subdomains are named and reserved, rather than when the wiki is actually first generated. Thus we manually check each wiki creation date.
- The Recent Activity link shows links by month and date and not by year, which can cause confusion when wikis have not been edited for several years. A review of the page histories, described in the following paragraphs, can resolve this potential confusion.