Star Ratings round-up

A key part of the Commissioning Strategy is the requirement for open performance information, so that providers can be held to account and customers can choose which provision to go for.

While it's not entirely clear whether the raw performance data will be published, Star Ratings are gearing up to fill that role. Given their potential importance (think Star Ratings for hospitals and A-C grades for schools) and the questions swirling round them, we've put together a quick primer on the topic.

The primer comes in five sections:

  1. Introduction to Star Ratings – Don’t know much about them? Read this.
  2. Practical issues – Flaws in the individual measures
  3. The big picture – Problems with the system as a whole
  4. Conclusions – Possible solutions, and the future of Star Ratings
  5. Resources

This article has been put together in discussion with providers, think tanks and individual practitioners from across the sector, and hopefully provides a comprehensive overview of thoughts and suggestions from across the industry on how to ensure Star Ratings do their job.


Introduction to Star Ratings

What's the point of Star Ratings?

From the DWP guide to Star Ratings:

  • To provide an insight for Ministers, senior officials and others into the extent to which providers are improving.
  • To enable the Department and providers to understand and drive up performance.
  • To enable us to make informed decisions about contract letting, extension and management.
  • To support informed customer choice wherever possible.

In essence, they're about exposing provider performance, so that bad providers can be punished, good providers can be rewarded, and customers can choose good providers. The idea is that this will result in providers doing a better job over time, or at least replace bad providers with good ones. If you work for a provider, it's a large part of how your job is going to be measured within the next year or so.

How do the current Star Ratings work?

The current star ratings are made up by adding together ratings for various parts of delivery:

  • 70% performance, that is getting people into jobs and keeping them there. The top performer relative to their contract target gets 70%, and the lowest performer relative to their contract target gets slightly above 0%
  • 20% quality of delivery. This is measured using the provider's self-assessment report (SAR) rather than through inspection
  • 10% contract compliance. This is made up by adding together the DWP contract manager's rating of contract delivery and the DWP Financial Audit and Monitoring (FAM) team's rating for administration

The final, single rating is a score out of a hundred. 75% or more is 4 stars, 60%+ is 3 stars, 45%+ 2 stars, and 1 star for less than 45%.
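To make the arithmetic concrete, here is a minimal sketch of how the three components might be added up and banded. The function names and the example provider's scores are our own inventions for illustration; only the weights and the star thresholds come from the description above.

```python
# Component weights as described above: performance 70, quality 20, compliance 10.
WEIGHTS = {"performance": 70, "quality": 20, "compliance": 10}


def total_score(performance, quality, compliance):
    """Add the three component scores into a single score out of 100."""
    parts = {"performance": performance, "quality": quality, "compliance": compliance}
    for name, value in parts.items():
        if not 0 <= value <= WEIGHTS[name]:
            raise ValueError(f"{name} must be between 0 and {WEIGHTS[name]}")
    return sum(parts.values())


def stars(score):
    """Map a score out of 100 to a star band using the thresholds quoted above."""
    if score >= 75:
        return 4
    if score >= 60:
        return 3
    if score >= 45:
        return 2
    return 1


# Hypothetical provider: middling on performance, decent on quality and compliance.
example = total_score(performance=45, quality=16, compliance=8)
print(example, stars(example))  # prints: 69 3
```

Note how much rides on the performance component: a provider with perfect quality and compliance scores but a performance score below 15 would still end up with one star.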

What do people think of Star Ratings?

The basic concept of open performance information is supported almost unanimously, and is critical to the success of the new commissioning approach. The select committee summed up the general support for the concept of Star Ratings in its recent report.

However, there have been rumblings for a while among researchers and providers about the actual implementation of the Star Ratings system, which were also reflected by the select committee.


Practical issues

Last year’s jobs, this year’s starts

Imagine someone starting on provision in January. It takes time to help them and move them into work, so, on average, they might get a job in April. After that, the job only counts as sustained once they've been in work for, say, 26 weeks, so it produces a sustained job outcome in October.

Now imagine what happens if, instead of measuring performance by comparing starts in January, jobs in April and sustainments in October, you compare all three each month. That’s how Star Ratings work at the moment, albeit for six-monthly periods rather than each month.

This produces a whiplash effect, where old customers going into and staying in jobs are compared to new customers starting provision. If the rate of people starting provision has gone up or down substantially, then the performance figures will be skewed. Let's look at a simplified example:

The job entry rate is the percentage of starts that get a job. So a 50% job entry rate means 50 out of every 100 customers entering provision get a job. Our provider, Work-a-Job, have a steady job entry rate of 50%.

100 people start at Work-a-Job in January. They get jobs after three months of support, in April. This means there are 50 job entries in April.

However, what if April’s a quiet month for starts? It’s Easter, and only 40 new people start. As far as Star Ratings are concerned, the job entry rate for April is 50 job entries / 40 starts = 125%.

Conversely, if 150 people had started in April, the official job entry rate would be 50 job entries / 150 starts = 33%. Of course, both of these figures are wrong. The provider is getting a steady 50% of people into jobs.
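The Work-a-Job example can be run as a few lines of code. This is only a sketch of the simplified example above, not the DWP's actual calculation, and the function and variable names are ours.

```python
def same_period_rate(job_entries, starts):
    """Job entries in a period divided by starts in that same period."""
    return job_entries / starts


TRUE_RATE = 0.5        # Work-a-Job's steady job entry rate
january_starts = 100   # the cohort whose jobs land in April
april_job_entries = int(january_starts * TRUE_RATE)  # 50 job entries

# The reported rate swings with April's starts, although delivery is unchanged.
for april_starts in (40, 100, 150):
    rate = same_period_rate(april_job_entries, april_starts)
    print(f"{april_starts} starts in April -> reported rate {rate:.0%}")

# Comparing the jobs with the cohort that produced them recovers the real figure.
cohort_rate = april_job_entries / january_starts
print(f"January cohort rate: {cohort_rate:.0%}")
```

The last two lines show the alternative: if job entries were matched to the month in which those customers started, the rate would stay at 50% whatever happened to April's starts.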

The whiplash effect causes two problems:

  1. It makes the star ratings less stable, since providers can bounce up and down the ratings from one period to the next as the whiplash kicks in different directions
  2. It makes the star ratings less accurate, since it will be impossible to tell if bad ratings are the result of bad performance or variation in the number of starts

Different targets, same ratings

One of the curiosities of Employment Zones (EZs) is that each contract has different targets for job entry and sustainment rates. Each contract has been separately renegotiated over the years since they were set up, and unless the DWP have been surprisingly perfect negotiators, some contracts are simply going to be tougher than others to deliver.

Where this causes problems for Star Ratings is that each provider’s performance is taken against their contract target. Let’s look at another simplified example to see what this does:

Work-a-Job delivers an EZ in London. Its job entry rate target is 50%. It’s actually getting 40% of customers into jobs.

EmProvider has an EZ in Wales. Its job entry rate target is 30%. It’s actually getting 30% of customers into jobs.

Remember, Star Ratings use performance against target, not absolute performance.

Work-a-Job is delivering at 40% actuals / 50% target = 80% of target

EmProvider is delivering at 30% actuals / 30% target = 100% of target

This means that EmProvider gets a much better Star Rating than Work-a-Job, even though Work-a-Job is getting a higher proportion of its customers into work (40% against 30%, a gap of 10 percentage points).
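The same comparison as a short sketch, using the two example providers and their figures from above. The function name is ours, and the real calculation presumably covers sustainment targets as well as job entries.

```python
def share_of_target(actual_rate, target_rate):
    """Performance expressed as a fraction of the contract target."""
    return actual_rate / target_rate


# Job entry rates and targets, in percent, from the example above.
providers = {
    "Work-a-Job": {"actual": 40, "target": 50},
    "EmProvider": {"actual": 30, "target": 30},
}

for name, p in providers.items():
    share = share_of_target(p["actual"], p["target"])
    print(f"{name}: {p['actual']}% actual, {share:.0%} of target")

# The two ways of ranking pick different winners.
best_by_target = max(
    providers,
    key=lambda n: share_of_target(providers[n]["actual"], providers[n]["target"]),
)
best_by_actual = max(providers, key=lambda n: providers[n]["actual"])
print("Best against target:", best_by_target)
print("Best on absolute rate:", best_by_actual)
```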

A world of extremes

The final issue with the performance measure is one that UK providers haven’t yet raised. It was actually highlighted by the review of Star Ratings in Australia, which have been going for far longer than the UK ones and provided the initial inspiration for the UK system.

The performance measure makes up 70% of the total Star Ratings score. The provider with the best performance gets 70%. The worst one gets slightly above 0%. Everyone else gets a score proportionate to their placing between the best and worst.

However, what happens if providers have very similar performance scores? In the current system, the distribution still applies, so in that case, insignificant differences in performance would translate to huge differences in Star Rating. The Australians now advocate the following model:

'Performance ratings will no longer rely on a fixed distribution. The number of Providers in each of five ratings bands will depend on the extent to which each Provider’s performance exceeds or falls below the average. Providers with very marginal differences in performance will no longer find themselves with different ratings.'
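A rough sketch shows why a fixed distribution exaggerates small gaps. We assume a straight linear scaling between the worst and best performer, and we give the worst performer exactly 0 rather than "slightly above 0%", since we don't know the exact formula used. The provider names and figures are made up.

```python
MAX_PERFORMANCE_SCORE = 70  # performance makes up 70% of the total score


def fixed_distribution(performance):
    """Scale each provider linearly between the worst (0) and the best (70).

    Assumption: linear scaling. The real formula may differ.
    """
    lo, hi = min(performance.values()), max(performance.values())
    if hi == lo:
        # Assumption: the source does not cover ties, so give everyone full marks.
        return {name: float(MAX_PERFORMANCE_SCORE) for name in performance}
    return {
        name: MAX_PERFORMANCE_SCORE * (value - lo) / (hi - lo)
        for name, value in performance.items()
    }


# Performance as a percentage of contract target (hypothetical figures).
close = {"Provider A": 99, "Provider B": 100, "Provider C": 101}
spread = {"Provider A": 50, "Provider B": 100, "Provider C": 150}

print(fixed_distribution(close))
print(fixed_distribution(spread))
```

Both sets of providers come out with scores of 0, 35 and 70, even though the first three are all within a point or so of each other and the second three are miles apart.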

Quality Assured

Last of all, we turn to the quality measure. This makes up 20% of the final score, and it is largely self-assessed. That is, it’s taken from the provider’s self-assessment report (SAR) of their own quality. While this is discussed and agreed with the DWP contract manager, it’s basically a score that providers pick for themselves. Needless to say, some providers had notable jumps in their opinion of themselves between the first and second round of Star Ratings.


The big picture


Moving away from the mechanics of Star Ratings, suppose for a second that the current issues were resolved and that Star Ratings were both stable and accurate. Would they achieve the aims set for them of measuring improvement, driving up performance, informing contracting decisions, and helping customer choice? Let’s take each of these in turn:

1. Measuring Improvement

The meaning of this is unclear. Measuring improvement in the industry as a whole is impossible with Star Ratings, as they use rankings rather than absolute performance. It’s possible to measure relative improvement by individual providers compared to other providers, but if everyone improves equally then nobody improves their rating!
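A tiny sketch of the point about rankings. For simplicity it ranks providers on a bare job entry rate rather than on performance against target, and "Provider C" and all the figures are hypothetical.

```python
def ranking(rates):
    """Provider names ordered from the highest rate to the lowest."""
    return sorted(rates, key=rates.get, reverse=True)


# Job entry rates in percent (hypothetical figures).
before = {"Work-a-Job": 40, "EmProvider": 30, "Provider C": 35}

# Every provider improves by the same five percentage points.
after = {name: rate + 5 for name, rate in before.items()}

print(ranking(before))
print(ranking(after))
print("Ranking unchanged:", ranking(before) == ranking(after))
```

Everyone is getting more people into work, but a measure built purely on relative position has no way of showing it.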

Measuring improvements in absolute performance would require the raw performance data from each contract to be released, a recommendation that also features prominently in the next section.

2. Driving up performance

In order to drive up performance, it must be possible to measure performance accurately, set a reasonable performance target, then push the provider to improve if they’re below target. In addition to the measurement issues that were outlined in the previous section, there are more fundamental concerns over the ability of Star Ratings to support these three steps.

It is not clear that anyone can set a reasonable performance target for moving people into work. To predict the cost of moving a group of people into work, you need to know what needs they have and in what proportion they have them, how many of them there are, and how much it has cost to deliver the same service in the past. With the current welfare reforms and the recession, predicting the delivery needs and numbers of customers is almost impossible. With the introduction of a raft of new contracts, the cost of moving someone into work is difficult to estimate. Between the two of them, setting the right performance target for each contract will be almost impossible.

How, then, to compare performance? The Australians use clever analysis to estimate what performance should be in each delivery area, based on the characteristics of local benefit claimants. While imperfect, this does give a way of tracking differences in difficulty of delivery over time and between different areas.

This could work, but the one action most likely to answer complaints from providers is to publish the raw data and targets. Underperforming providers would be unable to hide behind a cloak of secrecy, and the open comparison of different contracts would give providers and senior officials far more useful information on which to base their judgments and their contract negotiations.

In terms of pushing providers to up their performance, using Star Ratings to inform contracting decisions (see below) would eventually clear out providers who didn't pay attention to them. In the shorter term, the Australian model varied the size of contracts with their Star Ratings. Contracts with a lower rating would get fewer customers, and be more difficult to make a profit on. This lends immediacy to the need for decent Star Ratings, and acts as a strong deterrent to dropping down the rankings. Customer Choice could replicate this in a more organic fashion, assuming that customers will prefer the provider with the higher rating.

3. Informing contracting decisions

This is the part that gives Star Ratings teeth. Dropping or shrinking underperforming contracts, picking high performers for new contracts, and giving more freedom to higher performers – all of these can exert evolutionary pressure on welfare-to-work delivery. Without this, Star Ratings have no way of feeding back into delivery.

In the Australian system, providers that get a failing rating lose their contracts. In the UK, there are no details on what will happen to low-scoring providers, other than perhaps slightly fewer starts and five years of delivery instead of seven.

Also in the Australian system, contracts are sent more or fewer customers as their Star Ratings change, lending immediacy to the competition between providers.

A major missed opportunity is that past performance is not currently taken into account when awarding new contracts. Various (newer or better-performing) providers have raised this repeatedly in recent DWP briefings.

4. Helping customer choice

One of the stated goals of Star Ratings is to help customers choose a provider. However, the actual experience of a customer in a customer choice area is that they'll be told two names, each of which has a number between 1 and 4, given a brief blurb about each provider, then told to make a decision. If one provider gets too popular, the customer's choice will be ignored. And if they’re in Manchester, they’ll only have one provider and won’t get any choice at all.

Contrast this with the Dutch system, where customers agree a tailored package of support with their adviser, delivered by a range of specialist organisations. There's a fair chance that FND prime contractors will implement something similar to the Dutch model within their programmes, but this would be entirely separate to the Star Ratings model, and each contractor would have its own systems for measuring subcontractor performance, with no obligation to share that information with customers.

If the DWP are really serious about informing customer choice, a different mechanism will be needed. One possibility might be giving access to detailed statistics on how contractors perform with different customer groups and areas. Another might be an independent feedback mechanism for previous customers to pass on their impressions to new ones. Amusingly enough, true customer choice could quickly lead to one provider receiving a disproportionate number of hardest-to-help customers, thereby totally screwing the payment and performance model.


Conclusions

To date, Star Ratings have been seen by providers as very much a work in progress. Their impact on Employment Zone delivery has been minimal, and the scores have been inconsistent from one set to the next. The most spectacular case of this was Reed in Partnership, who got terrible ratings in the first round, and bounced right back in the second after some problems with the paperwork were resolved.

To achieve their potential, the consensus view among industry practitioners with whom I’ve spoken is that Star Ratings need:

  • Resolution of issues with the measurement process
  • Publication of all the information feeding into Star Ratings, including starts, job entries, sustainments, and targets
  • Defined mechanisms for rewarding good providers with more or longer contracts, and punishing poor providers
  • Separate support for customer choice, as Star Ratings won’t be all that helpful

So how likely are these to be put in place? Well, Star Ratings are still under development, and it seems likely that some or all of these will actually be put in place prior to FND roll-out.

A less positive prognosis comes from some in the industry, who detect a lack of enthusiasm for the entire enterprise, and point to the upcoming changes to Star Ratings in Australia, not to mention the problems with hospital star ratings in the UK.

Whatever happens, some kind of headline measure of performance is likely to be crucial to helping people understand how providers are performing, just as long as it gets published with the rest of the story included.


Resources


Much of the material for this article came from informal sources and conversations. Further reading can be found at:

Attachment: DEEWR_ESC2009-2012_Performance_Management_Framework.docx (18.62 KB)

Comments

BBC News has an interesting piece on A-E gradings for schools in New York. Apparently the Education Secretary is fairly keen on introducing something similar in the UK.

The other thing to remember about the Australian model is that there is genuine competition, with multiple providers operating in areas much smaller than our JCP Districts. This affords DEEWR far more scope to remove poorly performing providers and add new providers into the mix. DWP's Prime Contracting direction means that there is no opportunity to do this, with massive disruption in a place such as Greater Manchester should they choose to uproot their single FND provider. Combined with a 5-year contracting cycle, and a pre-qualification process that allows providers to quote only their 5 best-performing contracts (easily done when your contract book consists of several dozen individual contracts), it's hard to see exactly what Ministers are expecting the Star Rating to achieve.

Of course, even Australia isn't flawless. While the system works perfectly well in between re-bidding cycles, there has been some genuine shock in the Ozzie marketplace with providers losing contracts for 4* and 5* offices following a bidding round this year that clearly put much less emphasis on these ratings.

Thanks for the info Phil. I'm not entirely sure why taking account of past performance seems to be such a non-starter - it seems simple enough, and it would even be possible to do it in a way that was neutral toward bidders that didn't have past performance to draw on.

Just how big are the Australian contracts anyway? I know they spend a higher proportion of GDP on welfare to work than the UK, but there are only 21.5m people in Australia.