Getting learning assessments right when money depends on it!

By Ian Attfield, DFID Education Adviser Tanzania, and Abhijeet Singh, Young Lives Research Officer, University of Oxford. This blog originally appeared on the HEART website on 11 December 2015.

The past two decades have seen an unprecedented global increase in enrolment rates, and attention is now shifting from getting children into school to what they actually learn once there. One increasingly common funding approach is payments-by-results (PBR), where funding depends on the level of learning improvement attributed to the programme. This blog is a brief exploration of the practical aspects of this shift: if we're putting high stakes on learning gains, how well can we measure them?

A concrete example of payments-by-results

One example of such a shift is the Big Results Now! Education Programme for Results in Tanzania. This initiative, financed by UK DFID, the World Bank and Sweden, provides funds to the government to incentivise improvements at key steps of the results chain: the government's own prioritised expenditure on education, teacher distribution, open school-level data and the operational rollout of teacher training, for example. In addition, a significant graduated payment is offered, proportional to the increase in the average Kiswahili reading speed of students completing primary grade two, measured against a national baseline taken in 2013. Using an average-speed metric allows improvement from pupils across the entire reading spectrum to contribute, including non-readers, those with emerging literacy and fluent readers. In linking payments to a relatively straightforward test, this initiative is probably typical of several other projects (e.g. social impact bonds in India or the PBR components of the DFID-funded Girls' Education Challenge).
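
To make the idea of a graduated payment concrete, here is a minimal sketch of how such a formula might work. The blog does not specify the actual baseline value, payment rate or cap, so every figure below is a hypothetical placeholder rather than a description of the Tanzanian programme.

```python
# A minimal sketch of a graduated payment proportional to the gain in average
# reading speed over a baseline. All parameters are illustrative placeholders,
# not the actual terms of the Big Results Now! programme.

def graduated_payment(avg_wpm_endline: float,
                      avg_wpm_baseline: float = 18.0,       # hypothetical 2013 baseline (words per minute)
                      payment_per_wpm: float = 2_000_000,   # hypothetical payment per word-per-minute gained
                      cap: float = 20_000_000) -> float:
    """Payment rises linearly with the gain in average reading speed, up to a cap."""
    gain = max(avg_wpm_endline - avg_wpm_baseline, 0.0)
    return min(gain * payment_per_wpm, cap)

# Example: a follow-up average of 22 words per minute against an 18 wpm baseline.
print(graduated_payment(22.0))  # 8000000.0 under these illustrative parameters
```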

Measurement decisions for PBR frameworks

While the concept of linking payments to outcomes is a sound one, how these outcomes should be measured is much less straightforward than it seems. For designing payment schemes, we need a measurement metric that is (a) robust, in the sense of accurately measuring the dimensions of learning the policy prioritises; (b) predictable in its performance, since both the funders and the recipient organisations need some ability to forecast expenditure and income; (c) transparent, so that all parties see it as a 'fair' metric and understand how eventual payment amounts were arrived at; and (d) non-manipulable, to reduce the incentive for providers to game the metric rather than focus on underlying learning. It is unclear whether most current methods of measuring learning satisfy all of these essential concerns.

Simple tests, such as EGRA or the ASER tests in India, have the great virtue of transparency: everybody understands them. But it is then much harder to establish what a child should learn in order to benchmark payments. And yet such a benchmark must be set, and setting it arbitrarily reduces the predictability of payments. Moreover, these simple measures are often not well suited to measuring changes in learning levels, raising questions about robustness.

A larger issue is whether payment schemes should target the level of student achievement or how much students have learnt ('value-added'). Not all programmes are randomly assigned (indeed most aren't), so targeting levels conflates a child's socioeconomic background and various other factors with the effect of the programme. In non-experimental programmes, this also adds to the incentive to serve better-off populations. Value-added is conceptually much more valid, and indeed seems to isolate policy effects well (see this paper), but this statistical sophistication comes at a cost: it is much harder to explain and, in being opaque to many practitioners and stakeholders, loses out on transparency. For measuring changes, researchers typically use 'smoother' tests which, unlike e.g. ASER, capture ability at many different levels. Results are typically expressed in standard deviations. But as one of us has previously written, these normalisations also carry very important caveats, and depending on the procedure chosen, estimates of the policy effect could change substantially.
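
To make the levels-versus-value-added distinction concrete, the sketch below simulates a setting in which better-off pupils are more likely to be in a hypothetical programme: a raw comparison of endline levels overstates the effect, while a simple value-added regression that controls for baseline achievement recovers it. The data-generating process and variable names are our illustrative assumptions, not any programme's actual design.

```python
# Illustrative simulation: levels vs value-added when programme participation is
# correlated with socioeconomic status (SES). Assumed data, not real results.
import numpy as np

rng = np.random.default_rng(0)
n = 2000

ses = rng.normal(size=n)                                   # socioeconomic background
treated = (ses + rng.normal(size=n) > 0).astype(float)     # better-off pupils more likely treated
baseline = 0.8 * ses + rng.normal(size=n)                  # pre-programme achievement
true_effect = 0.2
endline = baseline + true_effect * treated + rng.normal(scale=0.5, size=n)

# (1) Levels metric: raw gap in endline scores between programme and other pupils.
levels_gap = endline[treated == 1].mean() - endline[treated == 0].mean()

# (2) Value-added metric: regress endline on baseline and programme status,
#     then read off the programme coefficient.
X = np.column_stack([np.ones(n), baseline, treated])
coef, *_ = np.linalg.lstsq(X, endline, rcond=None)

print(f"levels gap:  {levels_gap:.2f}")   # inflated by SES differences
print(f"value-added: {coef[2]:.2f}")      # close to the true effect of 0.2
```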

A further issue is the precision of estimated effects. Researchers routinely compute confidence intervals for policy estimates. In practice, these intervals can be quite wide, and tightening them can be very costly in terms of sample size, often infeasibly so for smaller programmes. Wide intervals are bad for predictability, since they can sharply vary the amount of money the provider receives. And knowing that they could have received substantially more or less, depending on whether payment is made at the high estimate or the low one, can cause substantial discord.
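
As a rough illustration of why precision matters for predictability, the sketch below shows how the 95% confidence interval around an estimated average gain (in standard deviations) narrows with sample size, and how wide the implied payment range can be when money is pegged to the estimate. The sample sizes, assumed gain and payment rate are all invented for illustration.

```python
# Illustrative only: CI width for a difference in standardised means, and the
# payment range it implies. All figures are assumptions, not programme terms.
import math

true_gain = 0.15            # assumed average gain, in standard deviations
payment_per_sd = 1_000_000  # hypothetical payment per SD of average improvement

for n_per_group in (200, 1000, 5000):
    se = math.sqrt(2.0 / n_per_group)             # SE of a difference in means (unit-variance scores)
    low, high = true_gain - 1.96 * se, true_gain + 1.96 * se
    print(f"n per group {n_per_group:5d}: 95% CI [{low:+.2f}, {high:+.2f}] SD, "
          f"payment between {max(low, 0.0) * payment_per_sd:,.0f} and {high * payment_per_sd:,.0f}")
```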

A final issue is the trade-off between predictability and non-manipulability. Ideally, the provider should know the type of test in advance and agree that the metric is fair. But if this is done at scale, and potentially without the capacity for third-party evaluation, it raises the danger of either teaching to the test or 'gaming' the assessment. The first risk is probably far less severe in developing countries, where learning levels are so low that teaching to the test is still not a bad thing! But the possibility of cheating on tests remains.

Organisational and political economy concerns

In all of this, political economy concerns also abound. Especially when partnering directly with national governments, the loss of face when results are declared sub-par can be as important as, if not more important than, the financial loss, and this may have wider political consequences for relations between donor agencies and governments. A potential conflict of interest may also arise in using national assessment agencies to manage a survey that may determine the size of the payment to their own government. However, we risk undermining national systems and limiting capacity transfer if verification is outsourced wholesale.

There is also little evidence that national systems and actors know how to drive up learning, expand enrolment and improve equity in tandem. Large gains in learning are well documented in smaller, boutique intervention programmes, but taking them to the national level is a real challenge. One DFID pilot struggled to make payments because the anticipated increases in candidates and graduates for the secondary leaving exams did not materialise in the targeted, poorer regions.

Where now with all this?

Suffice it to say that our understanding of many of these issues is still in its infancy. But this is an area that will only grow, and there is already a great deal of experimentation with performance-based payments in current programmes. As we do more of these, we will probably learn what we could do better and what the right balance is between PBR and a more predictable flow of funds. The Tanzania programme mentioned above does link some funds to other desirable inputs and outcomes in the results chain. Having documented the possible risks, we wait in anticipation for the results due in the next few months. Watch this space: we will report on the actual results of this novel approach to linking funding to learning improvements.

This blog is part of a series of blogs written by DFID Education Advisers in collaboration with Young Lives researchers. Others in the series include:
