Critical appraisal of a medical paper. AMR Nov 2002

There are many reasons that doctors may require skills in critical appraisal:

To quickly decide what is worth reading.

To spot flaws and limitations in papers.

To determine which papers to cite in research or other work.

To decide which studies should be included in meta-analyses or reviews.

To review or referee papers for a journal.

To pass exams!

This document outlines some basic principles of critical appraisal and sets out to illustrate how to do it. Remember, it is often easier to be critical than to praise. A paper may be full of apparent biases and flaws, but if these are quantified and discussed, the paper may still have great merit. The depth and emphasis of critical appraisal really depends upon which hat you are wearing.

It is helpful to read the Abstract first. This gives a brief outline of the paper, summarizing key points. It is perhaps easiest then to scan the whole paper to get a feel for the level of detail and complexity, creating your own mental summary before reading more fully. Use your own instincts and common sense and try not to get bogged down. At the outset, decide:

What type of study is it?

What is the message?

Is the message important?

Do I believe it?

Does it fit with my view of the world?

Are there any obvious problems with the paper?

Where there are problems with a paper, some common themes emerge:

a. Conclusions do not relate to the stated aims and objectives of a study

Consider a study whose objective was to assess the reliability of near-patient testing kits for influenza, but which concluded that all GPs should routinely use these tests. The tests may be reliable, but whether they would be useful in primary care is a different question.

b. Generalisation is made from a study carried out in one population but findings are applied to a different type of population.

Watch out for hospital-based studies used to advise on management in primary care. In the early 1990s GPs were heavily criticized for inadequately investigating children with a proven UTI. Studies carried out in specialist hospitals showed that ultrasound scanning could fail to pick up some children with scarred kidneys; hence a micturating cystourethrogram was advocated as an additional investigation. However, as at least 10% of girls are diagnosed as having a UTI at some stage, the recommendations would mean that an enormous number of children would be subjected to this unpleasant and invasive procedure. As a second example, it would be erroneous to base estimates of the incidence of epilepsy following a febrile convulsion in primary care populations on children attending tertiary centres, e.g. Great Ormond Street; considerable pre-selection of cases will have occurred.

The principle of generalisability is encompassed in the concept of ‘the predictive value’ of a symptom, sign or test being dependent on the population examined. If the prevalence of disease changes, then so does predictive value (this is explained further later on).

c. Type 1 or ‘alpha’ errors.

These occur when a study claims to show a difference in outcomes when in fact there is none. This is a false positive result. The figure usually quoted is the ‘p’ value: the probability of seeing a difference at least as large as the one observed if chance alone were at work. The lower it is, the less likely the result is to have arisen by chance. Usually anything less than 0.05 (1 in 20) is considered statistically significant. But remember, if you carry out enough studies you will eventually come up with a statistically significant result by chance alone; Foinavon won the Grand National at odds of 100 to 1!
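The warning about carrying out enough studies can be made concrete with a little arithmetic. The sketch below assumes that every test is independent and that no real difference exists in any of them; the function name is purely illustrative.

```python
def chance_of_false_positive(n_tests: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent tests comes out
    'statistically significant' when no real difference exists."""
    return 1 - (1 - alpha) ** n_tests


# How quickly the risk of at least one spurious 'significant' result grows
for n in (1, 5, 14, 20, 50):
    print(f"{n:>2} tests: {chance_of_false_positive(n):.0%}")
```

Under these assumptions, a single test carries the familiar 5% risk, but by about 14 tests a spurious 'significant' finding is more likely than not.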

d. Type 2 or ‘beta’ errors

These are seen when a conclusion is reached that there is no difference between two groups (particularly regarding outcome), yet the study lacks the necessary power to draw such a conclusion. In other words, the study was not big enough. Before some studies are undertaken, a statistician is consulted to advise on the sample size required to run a 20% risk (the usual level) of concluding that an experimental therapy and conventional therapy do not produce different clinical outcomes when in fact they do. The size of the sample required depends on how common the outcome measure is in the population being studied. This principle can be used to criticize small studies concluding that there is no difference between two treatments or populations. A somewhat extreme example: an enormous study would be needed to compare the efficacy of treatments in previously healthy individuals who contract chickenpox if death were chosen as the outcome measure. Thankfully, death from chickenpox is so rare that it would be virtually impossible to recruit enough patients to a trial to do this. In the case of therapies for chickenpox, better outcome measures would be time to resolution of spots, or incidence of ear complications, pneumonia or secondary skin infection. Type 1 and type 2 errors can be summarized as per the following table:

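                              |  The "Truth"
Conclusions drawn             |  Drug A is better than B          |  Drug A is no better than B
from the trial                |                                   |
------------------------------+-----------------------------------+----------------------------------
Drug A is better than B       |  True positive (correct)          |  False positive (Type 1 error)
Drug A is no better than B    |  False negative (Type 2 error)    |  True negative (correct)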
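To illustrate why the sample size needed to avoid a type 2 error depends on how common the outcome is, here is a minimal sketch using a common normal-approximation formula for comparing two proportions. It is not necessarily the method a statistician would choose for a given trial, and the event rates below are made-up figures for illustration, not real chickenpox data.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_treated: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of patients needed in EACH group to detect a
    difference between two event rates (two-sided test)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # type 1 error threshold
    z_beta = NormalDist().inv_cdf(power)           # power = 1 - type 2 error risk
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treated) ** 2)


# Hypothetical common outcome: event rate halved from 20% to 10%
print(sample_size_per_group(0.20, 0.10))

# Hypothetical very rare outcome: event rate halved from 1 in 10,000 to 1 in 20,000
print(sample_size_per_group(0.0001, 0.00005))
```

With the same halving of risk and the usual 20% type 2 error risk, the common outcome needs a couple of hundred patients per group, while the rare one needs several hundred thousand.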

 

 

 

More general points follow under subheadings:

Objectives

What are they?

Are they clear?

What is the hypothesis being tested?

Method

The methods section is an important part of a paper in terms of spotting problems; try and determine exactly what was done.

Work out what type of study it is:

Controlled trial (randomized, double blind, etc.)

Case-control study

Cohort study

Cross-sectional survey or study

Case series

Case report

Remember that even a randomized, double-blind trial with very rigid protocols may be limited by strict inclusion criteria, e.g. patients with co-morbidity, those without classic symptoms or signs, or those who cannot give consent may be excluded. The people studied may be so highly selected that results are difficult to apply in general practice. The converse is also true: loose inclusion criteria are sometimes an asset. For example, surveillance in general practice is very useful because it reflects the reality of what goes on in primary care; if a doctor calls something tonsillitis or influenza, then a management plan is developed on this basis. Requiring a recorded fever above 38 °C before patients are considered to have either of these conditions poses great difficulty when patients are seen in ‘one-off’ consultations.

 

Selection of subjects is very important; some diseases are difficult to define, e.g. IBS, ME, fibromyalgia. For many diseases there is huge variation in severity; asthma is a good example.

 

Subjects may be paid for taking part; does this introduce bias?

Are questionnaires well designed? Were they piloted?

Are interviewers trained? Interviews standardized?

Is the control group well matched?

Are exclusion criteria valid?

Is the time span long enough for the outcome measure to occur?

Is the study ethical?

Setting and subjects

Setting is very important in primary care studies; for example, primary care differs considerably between countries (remember the theme of generalisability).

Who?

Whole population or subset?

Is the sample size big enough?

Is there selection bias? For example, if you are examining QED care pathways for dyspepsia, you need to be sure that patients whose history is suggestive of cancer are still randomized and not referred via a different route.

 

Outcome measures

Are they clearly defined?

How were they developed?

Are they relevant to the objectives?

Are they reliable and reproducible?

Are they valid?

Are they consistent?

Missing data, deaths, drop outs?

Observer bias? Are you measuring your own intervention?

 

Results

Are the findings clearly and objectively presented?

Adequate response rate in a questionnaire study (ideally above 70%)?

Is there sufficient detail e.g. age/gender breakdown of results?

Do numbers add up (internal consistency)?

Are non-responders dealt with appropriately? In some studies one should assume a worst-case scenario and put these in the treatment failure group. Sometimes an ‘Intention to Treat’ (ITT) analysis is appropriate; drop-outs are still included in the analysis, in the group to which they were originally allocated.

Is appropriate statistical analysis carried out?

 

Discussion

Have the initial objectives been met?

Hypothesis proved or disproved?

Has the data been interpreted correctly?

Are the conclusions justified?

Are all results discussed?

Are results clinically useful? Should data be presented as ‘Numbers Needed to Treat’ (NNT)? A drug may show an impressive 50% reduction in a given outcome measure (Relative Risk Reduction), but if that outcome occurs in only 1% of untreated cases, the absolute reduction is just 0.5% (from 1% to 0.5%), so 200 people need to be treated to prevent the outcome in one patient. Of course, all 200 people are exposed to the potential side effects of the drug to achieve this. This is one of the reasons to be selective regarding the use of antibiotics to treat minor illness.

Is there any confounding? Age, social class, ethnicity, smoking, disease duration and co-morbidity are important examples. Multiple regression analysis or strict matching of controls reduces this problem.

Bias may take many forms: observer bias, such as from non-blinding; trying to ensure that a patient has the drug rather than placebo; and contamination, where the intervention group passes on information to the control group in health education intervention studies.

Annual and seasonal factors in the variation of disease may be important; consider how respiratory diseases such as infection, rhinitis and asthma vary by year and season.

Recall bias is important. One study of hay fever compared occupants of Munich and Leipzig but sent the questionnaires out at different times of the year; subjects are notoriously bad at remembering when something fairly minor happened to them.

Who funded the study? Drug companies might seek to publish studies that show their product in a favourable light, but ignore negative studies.

Is there any conflict of interest? Some journals insist on explicit declaration of this.

Can the authenticity of the research be relied upon? A doctor found guilty of fabricating the results of research may be dealt with severely by the GMC.
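The ‘Numbers Needed to Treat’ arithmetic in the checklist above can be reproduced in a few lines. This is only a sketch of the standard calculation (NNT is the reciprocal of the absolute risk reduction); the function name is illustrative.

```python
def number_needed_to_treat(baseline_risk: float,
                           relative_risk_reduction: float) -> float:
    """NNT = 1 / absolute risk reduction, where the absolute risk reduction
    is the untreated (baseline) risk multiplied by the relative risk reduction."""
    absolute_risk_reduction = baseline_risk * relative_risk_reduction
    return 1 / absolute_risk_reduction


# The example from the text: 50% relative reduction, outcome in 1% of untreated cases
print(round(number_needed_to_treat(0.01, 0.50)))  # 200

# The same relative reduction when the outcome is common (20% of untreated cases)
print(round(number_needed_to_treat(0.20, 0.50)))  # 10
```

The same relative risk reduction therefore looks very different once the baseline risk is taken into account.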

--------------------------------------------------

This was just a brief outline of critical appraisal; it is a huge and challenging area. Some basic information on 2x2 tables follows. For those who are interested in reading more:

Sackett D, Haynes B, Guyatt G, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. Second edition. Little, Brown and Company, Boston/Toronto/London.

Diagnostic symptoms, signs and tests

2x2 tables

                   Diseased
                   +        -
Test      +        a        b
          -        c        d

Prevalence is the proportion of truly diseased in the population under study

Prevalence = (a + c) / (a + b + c + d)

Sensitivity/specificity tell you how good a test is

Sensitivity is the proportion of truly diseased persons who test positive amongst all those who are truly diseased

Sensitivity = a / (a + c)

 

Specificity is the proportion of truly nondiseased persons who test negative amongst those who truly do not have disease

Specificity = d / (b + d)

Predictive value tells you what the test means in a given population (prevalence varies)

Positive predictive value is the probability that a person has disease when testing positive

Positive predictive value = a / (a + b)

Negative predictive value is the probability that a person does not have disease when testing negative

Negative predictive value = d / (c + d)

The role of prevalence is of utmost importance in the interpretation of test results. Consider an excellent screening test with both sensitivity and specificity of 95%: the effect of prevalence is immense. If the pre-test likelihood of disease is 5%, a negative result gives a post-test probability of disease of about 0.3%, but crucially a positive result gives a probability of disease of just 50% (one can then spin a coin!). You can work these figures out for yourself on a 2x2 table.
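For readers who would rather check the figures by machine than by hand, the following sketch fills in a 2x2 table from prevalence, sensitivity and specificity, and then reads off the predictive values using the formulas above. The function names and the notional population of 10,000 are assumptions made for illustration.

```python
def two_by_two(prevalence: float, sensitivity: float, specificity: float,
               population: int = 10_000):
    """Return the cells (a, b, c, d) of a 2x2 table for a notional population."""
    diseased = population * prevalence
    healthy = population - diseased
    a = diseased * sensitivity   # test positive, diseased (true positives)
    c = diseased - a             # test negative, diseased (false negatives)
    d = healthy * specificity    # test negative, not diseased (true negatives)
    b = healthy - d              # test positive, not diseased (false positives)
    return a, b, c, d


def predictive_values(a: float, b: float, c: float, d: float):
    """Positive and negative predictive values from the 2x2 cells."""
    return a / (a + b), d / (c + d)


# Same test (95% sensitivity and specificity) applied at different prevalences
for prevalence in (0.01, 0.05, 0.20, 0.50):
    ppv, npv = predictive_values(*two_by_two(prevalence, 0.95, 0.95))
    print(f"prevalence {prevalence:.0%}: "
          f"disease after a positive test {ppv:.1%}, "
          f"disease after a negative test {1 - npv:.1%}")
```

At 5% prevalence this gives the 50% and roughly 0.3% figures quoted above; changing the prevalence shows how far the meaning of the same test result shifts between, say, a primary care and a tertiary centre population.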