
# [Solved]: Peer grading design - choosing a graph to get accurate rankings/ratings

Problem Detail:

Background. I am writing some code for semi-automated grading, using peer grading as part of the grading process. Students are given pairs of essays at a time, and the students have a slider to choose which is better and how much better it is. e.g., the slider might look something like this:

A---X-B

Based on the results of the peer grading, essays are ranked and the teacher will then grade the top X% and bottom X% and scores for all essays will be automatically calculated based on this. I have already come up with methods for doing this ranking/scoring process; that part works well.

My question. How should I select which pairs of essays to give to students?

Simulations suggest we need an essay to be peer-graded at least 3 times, to get an accurate ranking. Thus, each essay should appear in at least 3 of the pairs that are presented for peer grading.

We can think of this as a graph problem. Think of the essays as nodes. Each edge represents a pair of essays that are presented during the peer grading process. The accuracy results above suggest that the degree of each node (or of most nodes) should be at least 3. What sort of graph should I use? How should I generate the graph to be used during peer grading?

One challenge is that if you have clusters in the graph, this will skew the peer-gradings. For example, we wouldn't want to have high-quality essays peer-graded mostly against high-quality essays, because that would skew the results of the peer grading.

What would you recommend?

I think this problem could be modelled with an undirected graph, using something like the following:

• Start by taking the node with the lowest degree and link it to the node with the next-lowest degree
• Continue until the average degree is at least 3
• Maximise node connectivity
• Minimise number of cliques

Is this a good approach? If not what would you recommend instead?

There are two parts to this: (a) selecting a graph (experimental design) to determine which pairs of essays the students will evaluate in the peer grading process, and (b) ranking all the essays, based upon the students' peer grades, to determine which ones the teacher should grade. I will suggest some methods for each.

## Choosing a graph

Problem statement. The first step is to generate a graph. In other words, you need to select which pairs of essays to show to the students, during the peer grading exercise.

Suggested solution. For this task, I suggest that you generate a random graph $G$, selected uniformly at random from the set of all 3-regular (simple) graphs.

Justification and details. It is known that a random $d$-regular graph is a good expander. In fact, random regular graphs have an asymptotically optimal expansion factor. Also, because the graph is random, this should eliminate the risk of skewing the grading. By selecting a graph uniformly at random, you are ensuring that your approach is equally fair to all students. I suspect that a uniformly random 3-regular graph will be optimal for your purposes.

This raises the question: how do we select a 3-regular (simple) graph on $n$ vertices, uniformly at random?

Fortunately, there are known algorithms for doing this. Basically, you do the following:

1. Create $3n$ points. You can think of this as 3 copies of each of the $n$ vertices. Generate a uniformly random perfect matching on these $3n$ points. (In other words, repeat the following procedure until all $3n$ points are paired off: select any unpaired point, and pair it with another point chosen uniformly at random from the set of unpaired points.)

2. For each two points that are matched by the matching, draw an edge between the corresponding vertices (that they are a copy of). This gives you a graph on $n$ vertices.

3. Next, test if the resulting graph is simple (i.e., it has no self-loops and no repeated edges). If it is not simple, discard the graph and go back to step 1. If it is simple, you are done; output this graph.

It is known that this procedure generates a uniform distribution on the set of 3-regular (simple) graphs. Also, it is known that at step 3 you have a constant probability of accepting the resulting graph, so on average the algorithm does $O(1)$ trials -- this is pretty efficient (e.g., polynomial running time).
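As a concrete sketch, here is what steps 1-3 might look like in Python (the function name and retry cap are my own choices; shuffling the $3n$ "stubs" and pairing off consecutive entries is equivalent to the uniform matching described in step 1):

```python
import random

def random_regular_graph(n, d=3, max_tries=1000):
    """Sample a d-regular simple graph on vertices 0..n-1, uniformly at
    random, via the configuration (pairing) model with rejection."""
    if (n * d) % 2 != 0:
        raise ValueError("n * d must be even for a d-regular graph to exist")
    for _ in range(max_tries):
        # d copies ("stubs") of each vertex; a random shuffle paired off
        # consecutively is a uniformly random perfect matching.
        stubs = [v for v in range(n) for _ in range(d)]
        random.shuffle(stubs)
        edges = set()
        simple = True
        for i in range(0, len(stubs), 2):
            u, v = stubs[i], stubs[i + 1]
            if u == v or (min(u, v), max(u, v)) in edges:
                simple = False  # self-loop or repeated edge: reject and retry
                break
            edges.add((min(u, v), max(u, v)))
        if simple:
            return edges  # set of (u, v) tuples with u < v
    raise RuntimeError("gave up: could not sample a simple graph")
```

Since each trial is accepted with constant probability, the loop terminates quickly in practice.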

I have seen this approach credited to Bollobás and to Bender and Canfield. The approach is also summarized briefly on Wikipedia. You can also find a discussion on this blog post.

Technically speaking, this requires that the number $n$ be even (otherwise there is no 3-regular graph on $n$ vertices). However, this is easy to deal with. For instance, if $n$ is odd, you can randomly choose one essay, set it aside, generate a random 3-regular graph on the remaining essays, then add 3 more edges from the set-aside essay to 3 randomly chosen other essays. (This means that there will be 3 essays that are actually graded 4 times, but that shouldn't do any harm.)

## Ranking all the essays

Problem statement. OK, so now you have a graph, and you have presented these pairs of essays (as indicated by the edges in the graph) to the students for them to grade during the peer grading exercise. You have the results of each comparison of essays. Now your task is to infer a linear ranking on all of the essays, to help you determine which ones to have the teacher evaluate.

Solution. I suggest you use the Bradley-Terry model. It is a mathematical approach that solves exactly this problem. It was designed for ranking players in some sport, based upon the results of matches between some pairs of the players. It assumes that each player has an (unknown) strength, which can be quantified as a real number, and that the probability that Alice beats Bob is determined by some smooth function of the difference of their strengths. Then, given the pairwise win/loss records, it estimates the strength of each player.

This should be perfect for you. You can treat each essay as a player. Each comparison between two essays (during the peer grading process) is like the result of a match between them. The Bradley-Terry model will allow you to take all of that data, and infer a strength for each essay, where higher strengths correspond to better essays. Now you can use those strengths to rank-order all of the essays.

Details and discussion. In fact, the Bradley-Terry model is even better than what you asked for. You asked for a linear ranking, but the Bradley-Terry model actually gives a (real-number) rating to each essay. This means you know not only whether essay $i$ is stronger than essay $j$, but also a rough estimate of how much stronger it is. For instance, you could use this to inform your selection of which essays to have the teacher grade.
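To make this concrete, here is a minimal sketch of a Bradley-Terry fit using the classic minorization-maximization (MM) update, an alternative to the logistic-regression formulation (the function name, iteration count, and normalization are my own choices):

```python
from collections import defaultdict

def fit_bradley_terry(comparisons, n_iters=100):
    """Estimate Bradley-Terry strengths from (winner, loser) pairs
    via the MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)."""
    wins = defaultdict(int)        # W_i: total wins per essay
    pair_count = defaultdict(int)  # n_ij: comparisons between i and j
    players = set()
    for w, l in comparisons:
        wins[w] += 1
        pair_count[(w, l)] += 1
        pair_count[(l, w)] += 1
        players.update((w, l))
    p = {e: 1.0 for e in players}  # uniform initial strengths
    for _ in range(n_iters):
        new_p = {}
        for i in players:
            denom = sum(c / (p[i] + p[j])
                        for (a, j), c in pair_count.items() if a == i)
            new_p[i] = wins[i] / denom if denom > 0 else p[i]
        total = sum(new_p.values())  # renormalize to fix the overall scale
        p = {e: v * len(players) / total for e, v in new_p.items()}
    return p
```

Note that an essay with zero wins is driven to strength 0; real code would add some smoothing or regularization to avoid that degeneracy.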

There are alternative ways to infer ratings or rankings for all the essays, given the data you have. For instance, the Elo method is another. I summarize several of them in my answer to a different question; read that answer for more details.

One other comment: The Bradley-Terry model assumes that the result of each comparison between two players is a win or a loss (i.e., a binary result). However, it sounds like you will actually have more detailed data: your slider gives a rough estimate of how much better the peer grader rated one essay than the other. The simplest approach would be to map each slider position to a binary result. However, if you really want, you might be able to use all of the data through a more sophisticated analysis. The Bradley-Terry model involves doing logistic regression; if you generalize that to use ordered logit, you could likely take advantage of the extra information from each slider, since the slider results are not binary but one of several ordered possibilities.

## Efficient use of the teacher

You suggest having the teacher manually grade the top X% and bottom X% of all of the essays (using the ranking inferred from the results of the peer-grading). This could work, but I suspect it is not the most efficient use of the teacher's limited time. Instead, I'd like to suggest an alternate approach.

I suggest that you have the teacher grade a subset of the essays, with the subset carefully selected to try to provide the best possible calibration for all of the essays that weren't graded by the teacher. For this, I think it might help if you selected a sample of essays that cover the range of possible answers (so for every essay, there is some teacher-graded essay that is not too far away from it). For this, I can think of two approaches you could consider trying:

• Clustering. Take the ratings that are produced by the Bradley-Terry model. This is a set of $n$ real numbers, one per essay. Now cluster them. Suppose you want to have the teacher grade $k$ essays. One approach would be to use $k$-means clustering (on these one-dimensional data points) to cluster the essays into $k$ clusters, and then randomly select one essay from each cluster for the teacher to grade -- or have the teacher grade the "cluster head" of each cluster.

• Furthest-point first. An alternative is to try to select a subset of $k$ essays that are as different from each other as possible. The "furthest-point first" (FPF) algorithm is a clean approach for this. Assume that you have some distance function $d(e_i,e_j)$ that lets you quantify the distance between two essays $e_i$ and $e_j$: a small distance means that the essays are similar, a larger distance means they are dissimilar. Given a set $S$ of essays, let $d(e,S) = \min_{e' \in S} d(e,e')$ be the distance from $e$ to the nearest essay in $S$. The furthest-point first algorithm computes a list of $k$ essays, $e_1,e_2,\dots,e_k$, as follows: $e_1$ is chosen arbitrarily, and $e_{i+1}$ is the essay that maximizes $d(e,\{e_1,e_2,\dots,e_i\})$ (out of all essays $e$ such that $e \notin \{e_1,e_2,\dots,e_i\}$). This algorithm generates a set of $k$ essays that are as dissimilar from each other as possible -- which means that each of the remaining essays is pretty similar to at least one of those $k$. Therefore, it would be reasonable to have the teacher grade the $k$ essays selected by the FPF algorithm.
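Both selection strategies can be sketched compactly (the function names, seeding choices, and parameters here are my own):

```python
import random

def kmeans_pick(strengths, k, n_iters=50, seed=0):
    """Option 1: Lloyd's k-means on the 1-D strength estimates.
    `strengths` maps essay id -> rating; returns one randomly chosen
    representative essay per non-empty cluster."""
    essays = list(strengths)
    rng = random.Random(seed)
    centers = rng.sample([strengths[e] for e in essays], k)
    clusters = [[] for _ in range(k)]
    for _ in range(n_iters):
        clusters = [[] for _ in range(k)]
        for e in essays:  # assign each essay to its nearest center
            j = min(range(k), key=lambda c: abs(strengths[e] - centers[c]))
            clusters[j].append(e)
        # move each center to its cluster mean (keep it if the cluster is empty)
        centers = [sum(strengths[e] for e in cl) / len(cl) if cl else centers[j]
                   for j, cl in enumerate(clusters)]
    return [rng.choice(cl) for cl in clusters if cl]

def furthest_point_first(essays, k, dist):
    """Option 2: greedy FPF. Picks k essays that are maximally spread
    out under `dist`, seeding arbitrarily with the first essay."""
    chosen = [essays[0]]
    while len(chosen) < k:
        best = max((e for e in essays if e not in chosen),
                   key=lambda e: min(dist(e, c) for c in chosen))
        chosen.append(best)
    return chosen
```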

I suspect either of these approaches might provide more accurate scores than having the teacher grade the top X% and bottom X% of essays -- since the very best and worst essays probably are not representative of the mass of essays in the middle.

In both approaches, you could use a more sophisticated distance function that takes into account not just the strength estimates based upon peer grading but also other factors derived from the essays. The simplest possible distance function would take into account only the result of the Bradley-Terry model, i.e., $d(e_1,e_2) = (s(e_1)-s(e_2))^2$ where $s(e)$ is the strength of essay $e$ as estimated by the Bradley-Terry model from the results of the peer grading.

However, you can do something more sophisticated. For instance, you could compute the normalized Levenshtein edit distance between essays $e_1$ and $e_2$ (treating them as text strings, computing the edit distance, and dividing by the length of the longer of the two) and use that as another factor in the distance function. You could also compute feature vectors using a bag-of-words model on the words in the essays, and use the L2 distance between these feature vectors (with features normalized using tf-idf) as another factor. You might then use a distance function that is a weighted average of the difference in strengths (based upon the Bradley-Terry estimates), the normalized edit distance, and anything else that seems helpful. Such a distance function might help the clustering algorithm do a better job of selecting the best $k$ essays for the teacher to grade.
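For illustration, a hypothetical combined distance might look like this (the weights, names, and the use of `difflib.SequenceMatcher` as a cheap stand-in for normalized edit distance are all my own assumptions):

```python
import difflib

def essay_distance(e1, e2, strengths, texts, w_strength=0.7, w_text=0.3):
    """Hypothetical weighted distance between two essays: a squared gap
    in estimated strengths plus a text-dissimilarity term. difflib's
    ratio is used as a cheap proxy for normalized edit distance; the
    weights are arbitrary and would need tuning."""
    d_strength = (strengths[e1] - strengths[e2]) ** 2
    d_text = 1.0 - difflib.SequenceMatcher(None, texts[e1], texts[e2]).ratio()
    return w_strength * d_strength + w_text * d_text
```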