Critical appraisal of a medical paper.
AMR Nov 2002
There are many reasons that doctors may require skills in critical appraisal:
To quickly decide what is worth reading.
To spot flaws and limitations in papers.
To determine which papers to cite in research or other work.
To decide which studies should be included in meta-analyses or reviews.
To review or referee papers for a journal.
To pass exams!
This document outlines some basic principles of critical appraisal and sets out to illustrate how to do it. Remember, it is often easier to be critical than to praise. A paper may be full of apparent biases and flaws, but if these are quantified and discussed, the paper may still have great merit. The depth and emphasis of critical appraisal depend upon which hat you are wearing.
It is helpful to read the Abstract first. This gives a brief outline of the paper, summarizing key points. It is perhaps easiest to then scan the whole paper to get a feel for the level of detail and complexity, creating your own mental summary before reading more fully. Use your own instincts and common sense and try not to get bogged down.
At the outset, decide:
What type of study is it?
What is the message?
Is the message important?
Do I believe it?
Does it fit with my view of the world?
Are there any obvious problems with the paper?
Where there are problems with a paper, some common themes emerge:
a. Conclusions do not relate to the stated aims and objectives of a study
Consider a study whose objective was to assess the reliability of near-patient testing kits for influenza, but which concluded that all GPs should routinely use these tests. The tests may be reliable, but whether they would be useful in primary care is a different question.
b. Generalisation is made from a study carried out in one population but findings are applied to a different type of population.
Watch out for hospital-based studies used to advise on management in primary care. In the early 1990s GPs were heavily criticized for inadequately investigating children with a proven UTI. Studies carried out in specialist hospitals showed that ultrasound scanning could fail to pick up some children with scarred kidneys; hence a micturating cystourethrogram was advocated as an additional investigation. However, as at least 10% of girls are diagnosed as having a UTI at some stage, the recommendations would mean an enormous number of children being subjected to this unpleasant and invasive procedure. As a second example, it would be erroneous to estimate the incidence of epilepsy following a febrile convulsion in primary care populations from children attending tertiary centers, e.g. Great Ormond Street; considerable pre-selection of cases will have occurred.
The principle of generalisability is encompassed in the concept of ‘the predictive value’ of a symptom, sign or test being dependent on the population examined. If the prevalence of disease changes, then so does predictive value (this is explained further later on).
c. Type 1 or ‘alpha’ errors.
These occur when a study claims to show a difference in outcomes when in fact there is none. This is a false positive result. Usually the risk of a false positive result is quoted as the ‘p’ value: the probability of seeing a difference at least this large by chance alone if there is truly no difference. The lower it is, the less likely it is that the result happened by chance. Usually anything less than 0.05 (1 in 20) is considered statistically significant. But remember, if you carry out enough studies you will eventually come up with a statistically significant result; Foinavon won the Grand National at odds of 100 to 1!
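The warning about carrying out "enough studies" can be made concrete with a little arithmetic. The sketch below assumes that the studies are independent, that there is truly no difference, and that each uses the conventional 0.05 threshold; the function name is illustrative only.

```python
def chance_of_spurious_result(n_studies: int, alpha: float = 0.05) -> float:
    """Probability that at least one of n independent studies reaches
    p < alpha by chance alone, when no true difference exists."""
    # Each study has a (1 - alpha) chance of correctly finding nothing,
    # so all n studies stay "negative" with probability (1 - alpha) ** n.
    return 1 - (1 - alpha) ** n_studies


for n in (1, 5, 14, 20, 50):
    risk = chance_of_spurious_result(n)
    print(f"{n:>2} studies: {risk:.0%} chance of at least one 'significant' result")
```

Under these assumptions, 14 independent studies already give a better than even chance of at least one false positive, which is why a single significant result among many comparisons deserves caution.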
d. Type 2 or ‘beta’ errors
These are seen when a conclusion is reached that there is no difference between two groups (particularly regarding outcome) yet the study lacks the necessary power to draw such a conclusion. In other words, the study was not big enough. Before some studies are undertaken, a statistician is consulted to advise on the sample size required to run no more than a 20% risk (the usual level, equivalent to 80% power) of concluding that an experimental therapy and conventional therapy do not produce different clinical outcomes when in fact they do. The size of the sample required depends on how common the outcome measure is in the population being studied. This principle can be used to criticize small studies concluding that there is no difference between two treatments or populations. A somewhat extreme example: an enormously large study would be needed to compare the efficacy of treatments in previously healthy individuals who contract chickenpox if death were chosen as the outcome measure. Thankfully death from chickenpox is so rare that it would be virtually impossible to recruit enough patients to a trial to do this. In the case of therapies for chickenpox, better outcome measures would be time to resolution of spots, or incidence of ear complications, pneumonia or secondary skin infection.
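To see why a rare outcome demands a huge trial, here is a rough sketch of a sample size calculation. It assumes the standard normal-approximation formula for comparing two proportions, a two-sided alpha of 0.05 and 80% power; the event rates are hypothetical figures chosen purely for illustration, not data from any study.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(p_control: float, p_treated: float,
                          alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate number of patients needed in EACH arm to detect a
    difference between two proportions (normal approximation)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against type 1 error
    z_beta = NormalDist().inv_cdf(power)           # guards against type 2 error
    variance = p_control * (1 - p_control) + p_treated * (1 - p_treated)
    difference = p_control - p_treated
    return ceil((z_alpha + z_beta) ** 2 * variance / difference ** 2)


# Hypothetical common outcome: 20% of controls versus 10% of treated patients.
common = sample_size_per_group(0.20, 0.10)

# Hypothetical rare outcome: 1 in 10,000 controls versus 1 in 20,000 treated.
rare = sample_size_per_group(0.0001, 0.00005)

print(f"Common outcome: about {common} patients per group")
print(f"Rare outcome:   about {rare} patients per group")
```

With these made-up rates the common outcome needs roughly two hundred patients per arm, while the rare outcome needs several hundred thousand, even though the relative reduction is the same 50% in both cases.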
Type 1 and type 2 errors can be summarized as per the following table:

| Conclusion drawn from the trial | The "Truth": Drug A is better than B | The "Truth": Drug A is no better than B |
| --- | --- | --- |
| Drug A is better than B | True positive (correct) | False positive (Type 1 error) |
| Drug A is no better than B | False negative (Type 2 error) | True negative (correct) |

More general points follow under subheadings:
Objectives
What are they?
Are they clear?
What is the hypothesis being tested?
Method
The methods section is an important part of a paper in terms of spotting problems; try to determine exactly what was done.
Work out what type of study it is:
Controlled trial (randomized, double blind etc.)
Case-control study
Cohort study
Cross-sectional survey or study
Case series
Case report
Remember that even though a trial may be randomized and double blind with very rigid protocols, it may be limited by strict inclusion criteria, e.g. patients with co-morbidity, those without classic symptoms or signs, or those who cannot give consent are excluded. The people studied may be so highly selected that results are difficult to apply in general practice. The converse is also true: loose inclusion criteria are sometimes an asset. For example, surveillance in general practice is very useful because it reflects the reality of what goes on in primary care; if a doctor calls something tonsillitis or influenza then a management plan is developed on this basis. If patients must have a recorded fever above 38 °C to be considered as having either of these conditions, this poses great difficulty when patients are seen in ‘one-off’ consultations.
Selection of subjects is very important; some diseases are difficult to define, e.g. IBS, ME, fibromyalgia. For many diseases there is huge variation in severity; asthma is a good example.
Subjects may be paid for taking part; does this introduce bias?
Are questionnaires well designed?
Were they piloted?
Are interviewers trained?
Are interviews standardized?
Is the control group well matched?
Are exclusion criteria valid?
Is the time span long enough for the outcome measure to occur?
Is the study ethical?
Setting and subjects
Setting is very important in primary care studies; for example, primary care differs considerably between countries (remember the theme of generalisability).
Who?
Whole population or subset?
Is the sample size big enough?
Is there selection bias? For example, if you are examining QED care pathways for dyspepsia, one needs to be sure that patients whose history is suggestive of cancer are still randomized and not referred via a different route.
Outcome measures
Are they clearly defined?
How were they developed?
Are they relevant to the objectives?
Are they reliable and reproducible?
Are they valid?
Are they consistent?
Missing data, deaths, drop-outs?
Observer bias?
Are you measuring your own intervention?
Results
Are the findings clearly and objectively presented?
Adequate response rate in a questionnaire study (ideally above 70%)?
Is there sufficient detail, e.g. age/gender breakdown of results?
Do numbers add up (internal consistency)?
Are non-responders dealt with appropriately? In some studies one should assume a worst-case scenario and put these in the treatment failure group. Sometimes an ‘Intention to Treat’ (ITT) analysis is appropriate: drop-outs are still included in the analysis, in the group to which they were originally allocated.
Is appropriate statistical analysis carried out?
Discussion
Have the initial objectives been met?
Hypothesis proved or disproved?
Have the data been interpreted correctly?
Are the conclusions justified?
Are all results discussed?
Are results clinically useful? Should data be presented as ‘Numbers Needed to Treat’ (NNT)? A drug may show an impressive 50% reduction in a given outcome measure (Relative Risk Reduction), but if that outcome occurs in only 1% of untreated cases, the absolute risk falls from 1% to 0.5%, so 200 people need to be treated to prevent the outcome occurring in one patient. Of course, 200 people are exposed to the potential side effects of the drug to achieve this. This is one of the reasons to be selective regarding the use of antibiotics to treat minor illness.
Is there any confounding? Age, social class, ethnicity, smoking, disease duration and co-morbidity are important examples. Multiple regression analysis or strict matching of controls reduces this problem.
Bias may take many forms: observer bias, such as from non-blinding; trying to ensure a patient has drug rather than placebo; and contamination, where the intervention group passes on information to the control group in health education intervention studies.
Annual and seasonal factors in the variation of disease may be important; consider how respiratory diseases such as infection, rhinitis and asthma vary by year and season.
Recall bias is important. One study of hay fever compared occupants of Munich and Leipzig but sent the questionnaires out at different times of the year; subjects are notoriously bad at remembering when something fairly minor happened to them.
Who funded the study? Drug companies might seek to publish studies that show their product in a favorable light, but ignore negative studies.
Is there any conflict of interest? Some journals insist on explicit declaration of this.
Can the authenticity of the research be relied upon? A doctor found guilty of fabricating the results of research may be dealt with severely by the GMC.
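The Numbers Needed to Treat question raised in this list lends itself to a short calculation. The following is a minimal sketch, assuming NNT is taken as the reciprocal of the absolute risk reduction; the function name is illustrative only.

```python
def number_needed_to_treat(baseline_risk: float,
                           relative_risk_reduction: float) -> float:
    """NNT = 1 / absolute risk reduction, where the absolute risk
    reduction is the untreated risk multiplied by the relative reduction."""
    absolute_risk_reduction = baseline_risk * relative_risk_reduction
    return 1 / absolute_risk_reduction


# The same 50% relative risk reduction applied to different untreated risks.
for baseline in (0.01, 0.10, 0.40):
    nnt = number_needed_to_treat(baseline, 0.50)
    print(f"Untreated risk {baseline:.0%}: treat about {nnt:.0f} people "
          f"to prevent one outcome")
```

The 1% case reproduces the figure of 200 quoted in the text; the same relative reduction looks far more useful when the outcome is common.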
--------------------------------------------------
This was just a brief outline of critical appraisal; it is a huge and challenging area. Some basic info on 2x2 tables follows. For those who are interested in reading more:
Sackett D, Haynes B, Guyatt G, Tugwell P. Clinical Epidemiology: A Basic Science for Clinical Medicine. Second Edition. Little, Brown and Company, Boston/Toronto/London.
Diagnostic symptoms, signs and tests
2x2 tables

| | Diseased + | Diseased - |
| --- | --- | --- |
| Test + | a | b |
| Test - | c | d |

Prevalence is the proportion of truly diseased in the population under study:
Prevalence = (a + c) / (a + b + c + d)
Sensitivity/specificity tell you how good a test is
Sensitivity is the proportion of truly diseased persons who test positive amongst all those who are truly diseased:
Sensitivity = a / (a + c)
Specificity is the proportion of truly nondiseased persons who test negative amongst those who truly do not have disease:
Specificity = d / (b + d)
Predictive value tells you what the test means in a given population (prevalence varies)
Positive predictive value is the probability that a person has disease when testing positive:
Positive predictive value = a / (a + b)
Negative predictive value is the probability that a person does not have disease when testing negative:
Negative predictive value = d / (c + d)
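The definitions above can be collected into one small helper. This is an illustrative sketch only: the class name is made up, and the counts in the example are invented to show the arithmetic rather than taken from any study.

```python
from dataclasses import dataclass


@dataclass
class TwoByTwo:
    a: int  # test positive, truly diseased
    b: int  # test positive, not diseased
    c: int  # test negative, truly diseased
    d: int  # test negative, not diseased

    @property
    def prevalence(self) -> float:
        return (self.a + self.c) / (self.a + self.b + self.c + self.d)

    @property
    def sensitivity(self) -> float:
        return self.a / (self.a + self.c)

    @property
    def specificity(self) -> float:
        return self.d / (self.b + self.d)

    @property
    def positive_predictive_value(self) -> float:
        return self.a / (self.a + self.b)

    @property
    def negative_predictive_value(self) -> float:
        return self.d / (self.c + self.d)


# Invented counts for 1,000 people, 100 of whom truly have the disease.
table = TwoByTwo(a=90, b=30, c=10, d=870)
print(f"Prevalence  {table.prevalence:.1%}")
print(f"Sensitivity {table.sensitivity:.1%}")
print(f"Specificity {table.specificity:.1%}")
print(f"PPV         {table.positive_predictive_value:.1%}")
print(f"NPV         {table.negative_predictive_value:.1%}")
```

Sensitivity and specificity are calculated down the columns of the table (within the diseased or the nondiseased), whereas the predictive values are calculated along the rows (within those testing positive or negative), which is why only the latter shift with prevalence.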
The role of prevalence is of utmost importance to the interpretation of test results. If one considers an excellent screening test with both sensitivity and specificity of 95%, the effect of prevalence is immense. If the pretest likelihood of disease is 5%, a negative result gives a post-test probability of disease of about 0.3%, but crucially a positive test gives a probability of disease of just 50% (one can then spin a coin!). You can work these figures out for yourself on a 2x2 table.
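The figures in the paragraph above can be checked with a few lines of code. The sketch below fills in the four cells of a 2x2 table as proportions of the population; the function name and the extra prevalence values are illustrative assumptions, while the 95% sensitivity and specificity and the 5% prevalence come from the text.

```python
def predictive_values(prevalence: float, sensitivity: float,
                      specificity: float) -> tuple[float, float]:
    """Return (positive predictive value, negative predictive value)."""
    true_pos = prevalence * sensitivity               # cell a
    false_pos = (1 - prevalence) * (1 - specificity)  # cell b
    false_neg = prevalence * (1 - sensitivity)        # cell c
    true_neg = (1 - prevalence) * specificity         # cell d
    ppv = true_pos / (true_pos + false_pos)
    npv = true_neg / (true_neg + false_neg)
    return ppv, npv


# Same test (95% sensitive, 95% specific) applied at different prevalences.
for prevalence in (0.01, 0.05, 0.20, 0.50):
    ppv, npv = predictive_values(prevalence, 0.95, 0.95)
    print(f"Prevalence {prevalence:>4.0%}: "
          f"disease after a positive test {ppv:.1%}, "
          f"disease after a negative test {1 - npv:.2%}")
```

At 5% prevalence this reproduces the 50% and roughly 0.3% quoted in the text; the other rows show how quickly the meaning of a positive result changes as the disease becomes rarer or more common.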