
Essential Evidence-Based Medicine

Second Edition

Dan Mayer, MD


Cambridge University Press
Cambridge, New York, Melbourne, Madrid, Cape Town, Singapore, São Paulo, Delhi

Cambridge University Press, The Edinburgh Building, Cambridge CB2 8RU, UK

Published in the United States of America by Cambridge University Press, New York

www.cambridge.org
Information on this title: www.cambridge.org/9780521712415

First edition © D. Mayer 2004
Second edition © D. Mayer 2010

This publication is in copyright. Subject to statutory exception and to the provisions of relevant collective licensing agreements, no reproduction of any part may take place without the written permission of Cambridge University Press.

First published 2010
Printed in the United Kingdom at the University Press, Cambridge

A catalog record for this publication is available from the British Library

Library of Congress Cataloging in Publication data
Mayer, Dan.
Essential evidence-based medicine / Dan Mayer. – 2nd ed.
p. ; cm.
Includes bibliographical references and index.
ISBN 978-0-521-71241-5 (pbk.)
1. Evidence-Based Medicine. I. Title.
[DNLM: 1. Evidence-Based Medicine. WB 102.5 M468 2010]
R723.7.M396 2010
616 – dc22 2009024641

ISBN 978-0-521-71241-5 Paperback

All material contained within the CD-ROM is protected by copyright and other intellectual property laws. The customer acquires only the right to use the CD-ROM and does not acquire any other rights, express or implied, unless these are stated explicitly in a separate licence.

To the extent permitted by applicable law, Cambridge University Press is not liable for direct damages or loss of any kind resulting from the use of this product or from errors or faults contained in it, and in every case Cambridge University Press's liability shall be limited to the amount actually paid by the customer for the product.

Every effort has been made in preparing this publication to provide accurate and up-to-date information which is in accord with accepted standards and practice at the time of publication. Although case histories are drawn from actual cases, every effort has been made to disguise the identities of the individuals involved. Nevertheless, the authors, editors, and publishers can make no warranties that the information contained herein is totally free from error, not least because clinical standards are constantly changing through research and regulation. The authors, editors, and publishers therefore disclaim all liability for direct or consequential damages resulting from the use of material contained in this publication. Readers are strongly advised to pay careful attention to information provided by the manufacturer of any drugs or equipment that they plan to use.

The publisher has used its best endeavors to ensure that the URLs for external websites referred to in this publication are correct and active at the time of going to press. However, the publisher has no responsibility for the websites and can make no guarantee that a site will remain live or that the content is or will remain appropriate.


Contents

List of contributors   page vii
Preface   ix
Foreword by Sir Muir Gray   xi
Acknowledgments   xiii

1 A brief history of medicine and statistics   1
2 What is evidence-based medicine?   9
3 Causation   19
4 The medical literature: an overview   24
5 Searching the medical literature   33
  Sandi Pirozzo and Elizabeth Irish
6 Study design and strength of evidence   56
7 Instruments and measurements: precision and validity   67
8 Sources of bias   80
9 Review of basic statistics   93
10 Hypothesis testing   109
11 Type I errors and number needed to treat   120
12 Negative studies and Type II errors   130
13 Risk assessment   141
14 Adjustment and multivariate analysis   156
15 Randomized clinical trials   164
16 Scientific integrity and the responsible conduct of research   179
  John E. Kaplan


17 Applicability and strength of evidence   187
18 Communicating evidence to patients   199
  Laura J. Zakowski, Shobhina G. Chheda, Christine S. Seibert
19 Critical appraisal of qualitative research studies   208
  Steven R. Simon
20 An overview of decision making in medicine   215
21 Sources of error in the clinical encounter   233
22 The use of diagnostic tests   244
23 Utility and characteristics of diagnostic tests: likelihood ratios, sensitivity, and specificity   249
24 Bayes' theorem, predictive values, post-test probabilities, and interval likelihood ratios   261
25 Comparing tests and using ROC curves   276
26 Incremental gain and the threshold approach to diagnostic testing   282
27 Sources of bias and critical appraisal of studies of diagnostic tests   295
28 Screening tests   310
29 Practice guidelines and clinical prediction rules   320
30 Decision analysis and quantifying patient values   333
31 Cost-effectiveness analysis   350
32 Survival analysis and studies of prognosis   359
33 Meta-analysis and systematic reviews   367

Appendix 1 Levels of evidence and grades of recommendations   378
Appendix 2 Overview of critical appraisal   384
Appendix 3 Commonly used statistical tests   387
Appendix 4 Formulas   389
Appendix 5 Proof of Bayes' theorem   392
Appendix 6 Using balance sheets to calculate thresholds   394

Glossary   396
Bibliography   411
Index   425


Contributors

Shobhina G. Chheda   University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA
Elizabeth Irish   Albany Medical College, New York, USA
John E. Kaplan   Albany Medical College, New York, USA
Sandi Pirozzo   University of Queensland, Brisbane, Australia
Christine S. Seibert   University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA
Steven R. Simon   Harvard Medical School, Boston, Massachusetts, USA
Laura J. Zakowski   University of Wisconsin School of Medicine and Public Health, Madison, Wisconsin, USA


Preface

In 1992, during a period of innovative restructuring of the medical school curriculum at Albany Medical College, Dr. Henry Pohl, then Associate Dean for Academic Affairs, asked me to develop a course to teach students how to become lifelong learners and how the health-care system works. This charge became the focus of a new longitudinal required 4-year course initially called CCCS, or Comprehensive Care Case Study. In 2000, the name was changed to Evidence-Based Medicine.

During the next 15 years, a formidable course was developed. It concentrates on teaching evidence-based medicine (EBM) and health-care systems operations to all medical students at Albany Medical College. The first syllabus was based on a course in critical appraisal of the medical literature intended for internal medicine residents at Michigan State University. This core has expanded by incorporating medical decision making and informatics. The basis for the organization of the book lies in the concept of the educational prescription proposed by W. Scott Richardson, M.D.

The goal of the text is to allow the reader, whether medical student, resident, allied health-care provider, or practicing physician, to become a critical consumer of the medical literature. This textbook will teach you to read between the lines in a research study and apply that information to your patients.

For reasons I do not clearly understand, many physicians are "allergic" to mathematics. It seems that even the simplest mathematical calculations drive them to distraction. Medicine is mathematics. Although the math content in this book is on a pretty basic level, most daily interaction with patients involves some understanding of mathematical processes. We may want to determine how much better the patient sitting in our office will do with a particular drug, or how to interpret a patient's concern about a new finding on their yearly physical. Far more commonly, we may need to interpret the information from the Internet that our patient brought in. Either way, we are dealing in probability. However, I have endeavored to keep the math as simple as possible.

This book does not require a working knowledge of statistical testing. The math is limited to simple arithmetic, and a handheld calculator is the only computing instrument that is needed. Online calculators are available to do many of the calculations needed in the book and accompanying CD-ROM. They will be referenced and their operations explained.

The need for learning EBM is elucidated in the opening chapters of the book. The layout of the book is an attempt to follow the process outlined in the educational prescription. You will be able to practice your skills with the practice problems on the accompanying CD-ROM. The CD-ROM also contains materials for "journal clubs" (critical appraisal of specific articles from the literature) and PowerPoint slides.

A brief word about the CD-ROM

The attached CD-ROM is designed to help you consolidate your knowledge and apply the material in the book to everyday situations in EBM. There are four types of problems on the CD:

(1) Multiple choice questions, also called self-assessment learning exercises. You will be given information about the answer after pressing "submit" if you get the question wrong. You can then go back and select the correct answer. If you are right, you can proceed to the next question. A record will be kept of your answers.

(2) Short essay questions, designed for one- to three-sentence answers. When you press "submit," you will be shown the correct or suggested answer for that question and can proceed to the next question. Your answer will be saved to a specified location on your computer.

(3) Calculation and graphing questions, which require you to perform calculations or draw a graph. These must be done outside the program. You will be shown the correct answers after pressing the "submit" button. Your answer will not be saved.

(4) Journal clubs, which require you to analyze a real medical study. You will be asked to fill in a worksheet with your answers in short essay form. After you finish, a sample of correct and acceptable answers will be shown for you to compare with your own.


Foreword

The impact of evidence-based decision-making on the way in which we work has changed our understanding of the language that is used to make and take decisions. Decisions are made by language, and the language includes both words and numbers, but before evidence-based decision-making came along, relatively little consideration was given to the types of statement or proposition being made. Hospital Boards and Chief Executives, managers and clinicians, made statements, but it was never clear what type of statement they were making. Was it, for example, a proposition based on evidence, or was it a proposition based on experience, or a proposition based on values? All these different types of proposition are valid, but to different degrees.

This language was hard-packed like Arctic ice, and the criteria of evidence-based decision-making smash into this hard-packed ice like an icebreaker, with propositions based on evidence on one side and propositions based on experience and values on another. As with icebreakers, the channel may close up when the icebreaker has moved through, but usually it stays open long enough for a decision to be made.

We use a simple arrows diagram to illustrate the different components of a decision, each of which is valid but has a different type of validity.

[Diagram: evidence, patients' values and expectations, and baseline risk feed into a choice, which produces the decision.]


In this book Dan Mayer has demonstrated how to make decisions based on best current evidence while taking into account the knowledge about the particular patient or service under consideration. Evidence-based decision-making is what it says on the tin – it is evidence-based – but it needs to take into account the needs and values of a particular patient, service, or population, and this book describes very well how to do that.

Sir Muir Gray, CBE
Consultant in Public Health


Acknowledgments

There are many people who were directly or indirectly responsible for the publication of this book. Foremost, I want to thank my wife, Julia Eddy, without whose insight this book would never have been written and revised. Her encouragement and suggestions at every stage during the development of the course, writing the syllabi, and finally putting them into book form, were the vital link in creating this work. At the University of Vermont, she learned how statistics could be used to develop and evaluate research in psychology and how it should be taught as an applied science. She encouraged me to use the "scientific method approach" to teach medicine to my students, evaluating new research using applied statistics to improve the practice of medicine. She has been my muse for this great project.

Next, I would like to acknowledge the help of all the students and faculty involved in the EBHC Theme Planning Group for the course since the start. This group of committed students and faculty has met monthly since 1993 to make constructive changes in the course. Their suggestions have been incorporated into the book, and this invaluable input has helped me develop it from a rudimentary and disconnected series of lectures and workshops to what I hope is a fully integrated educational text.

I am indebted to the staff of the Office of Medical Education of the Department of Internal Medicine at Michigan State University for the syllabus material that I purchased from them in 1993. This became the skeleton structure of the course on which this book is based. I think they had a great idea on how to introduce the uninitiated to critical appraisal. The structure of their original course can be seen in this work.

I would like to thank Sandi Pirozzo, B.Sc., M.P.H., John E. Kaplan, Ph.D., Laura J. Zakowski, M.D., Shobhina G. Chheda, M.D., M.P.H., Christine S. Seibert, M.D., and Steven R. Simon, M.D., M.P.H., for their chapters on searching, the ethical conduct of research, communicating evidence to patients, and critical appraisal of qualitative studies, respectively. I would especially like to thank the following faculty and students at Albany Medical College for their review of the manuscript: John Kaplan, Ph.D., Paul Sorum, M.D., Maude Dull, M.D. (AMC 2000), Kathleen Trapp, B.S., Peter Bernstein, B.S. (AMC 2002), Sue Lahey, M.L.S., Cindy Koman, M.L.S., and Anne Marie L'Hommedieu, M.L.S. Their editorial work over the past several years has helped me refine the ideas in this book. I would also like to thank Chase Echausier, Rachael Levet, and Brian Leneghan for their persistence in putting up with my foibles in the production of the manuscript, and my assistant, Line Callahan, for her Herculean effort in typing the manuscript. For the Second Edition, I also want to thank Abbey Gore (AMC 2009) for her editorial criticism that helped me improve the readability of the text. I also thank the creators of the CD-ROM, which was developed and executed by Tao Nyeu and my son, Noah Mayer. I owe a great debt to the staff at the Cambridge University Press for having the faith to publish this book. Specifically, I want to thank Senior Commissioning Editor for Medicine, Peter Silver, for starting the process, and Richard Marley and Katie James for continuing with the Second Edition. Of course, I am very thankful to my original copy-editor, Hugh Brazier, whose expertise and talent made the process of editing the book actually pleasant.

Finally, the First Edition of the book was dedicated to my children: Memphis, Gilah, and Noah. To that list, I want to add my grandchildren: Meira, Chaim, Eliana, Ayelet, Rina, and Talia. Thanks for all of your patience and good cheer.


1 A brief history of medicine and statistics

History is a pack of lies about events that never happened told by people who weren't there. Those who cannot remember the past are condemned to repeat it.
George Santayana (1863–1952)

Learning objectives

In this chapter, you will learn:
- a brief history of medicine and statistics
- the background to the development of modern evidence-based medicine
- how to put evidence-based medicine into perspective

Introduction

The American health-care system is among the best in the world. Certainly we have the most technologically advanced system. We also spend the most money. Are we getting our money's worth? Are our citizens who have adequate access to health care getting the best possible care? What are the elements of the best possible health care, and who defines it? These questions can be answered by the medical research that is published in the medical literature. When you become an effective and efficient reader of the medical literature, you will be able to answer these questions. It is this process that we will be discussing in this book. This chapter will give you a historical perspective for learning how to find and use the best evidence in the practice of medicine.

Evidence-based medicine (EBM) is a new paradigm for the health-care system that involves using the current evidence (results of medical research studies) in the medical literature to provide the best possible care to patients. What follows is a brief history of medicine and statistics, which will give you the historical basis and philosophical underpinnings of EBM. This is the beginning of a process designed to make you a more effective reader of the medical research literature.


Table 1.1. The basis of healing systems in different civilizations

Civilization       Energy    Elements
European           Humors    Earth, air, choler (yellow bile), melancholia (black bile)
East Indian        Chakras   Spirit, phlegm, bile
Chinese            Qi        Earth, metal, fire, water, wood
Native American    Spirits   Earth, air, fire, water

Prehistory and ancient history

Dawn of civilization to about AD 1000

Prehistoric man looked upon illness as a spiritual event. The ill person was seen as having a spiritual failing or being possessed by demons. Medicine practiced during this period and for centuries onward focused on removing these demons and cleansing the body and spirit of the ill person. Trephination, a practice in which holes were made in the skull to vent evil spirits or vapors, and religious rituals were the means to heal. With advances in civilization, healers focused on "treatments" that seemed to work. They used herbal medicines and became more skilled as surgeons.

About 4000 years ago, the Code of Hammurabi listed penalties for bad outcomes in surgery. In some instances, the surgeon lost his hand if the patient died. The prevailing medical theories of this era and the next few millennia involved manipulation of various forms of energy passing through the body. Health required a balance of these energies. The energy had different names depending on where the theory was developed. It was qi in China, chakras in India, humors in Europe, and natural spirits among Native Americans. The forces achieving the balance of energy also had different names. Each civilization developed a healing method predicated on restoring the correct balance of these energies in the patient, as described in Table 1.1.

The ancient Chinese system of medicine was based upon the duality of the universe. Yin and yang represented the fundamental forces in a dualistic cosmic theory that bound the universe together. The Nei Ching, one of the oldest medical textbooks, was written in about the third century BC. According to the Nei Ching, medical diagnosis was done by means of "pulse diagnosis" that measured the balance of qi (or energy flow) in the body. In addition to pulse diagnosis, traditional Chinese medicine incorporated the five elements, five planets, conditions of the weather, colors, and tones. This system included the 12 channels in which the qi flowed. Anatomic knowledge either corroborated the channels or was ignored. Acupuncture as a healing art balanced yin and yang by insertion of needles into the energy channels at different points to manipulate the qi. For the Chinese, the first systematic study of human anatomy didn't occur until the mid eighteenth century and consisted of the inspection of children who had died of plague and had been torn apart by dogs.

Medicine in ancient India was also very complex. Medical theory included seven substances: blood, flesh, fat, bone, marrow, chyle, and semen. From extant records, we know that surgical operations were performed in India as early as 800 BC, including kidney stone removal and plastic surgery, such as the replacement of amputated noses, which were originally removed as punishment for adultery. Diet and hygiene were crucial to curing in Indian medicine, and clinical diagnosis was highly developed, depending as much on the nature of the life of the patient as on his symptoms. Other remedies included herbal medications, surgery, and the "five procedures": emetics, purgatives, water enemas, oil enemas, and sneezing powders. Inhalations, bleeding, cupping, and leeches were also employed. Anatomy was learned from bodies that were soaked in the river for a week and then pulled apart. Indian physicians knew a lot about bones, muscles, ligaments, and joints, but not much about nerves, blood vessels, or internal organs.

The Greeks began to systematize medicine about the same time as the Nei Ching appeared in China. Although Hippocratic medical principles are now considered archaic, Hippocrates' principles of the doctor–patient relationship are still followed today. The Greek medical environment consisted of the conflicting schools of the dogmatists, who believed in medical practice based on the theories of health and medicine, and the empiricists, who based their medical therapies on the observation of the effects of their medicines. The dogmatists prevailed and provided the basis for future development of medical theory. In Rome, Galen created popular, albeit incorrect, anatomical descriptions of the human body based primarily on the dissection of animals.

The Middle Ages saw the continued practice of Greek and Roman medicine. Most people turned to folk medicine, which was usually performed by village elders who healed using their experiences with local herbs. Other changes in the Middle Ages included the introduction of chemical medications, the study of chemistry, and more extensive surgery by those involved with Arabic medicine.

Renaissance and industrial revolution

The first medical school was started in Salerno, Italy, in the thirteenth century. The Renaissance led to revolutionary changes in the theory of medicine. In the sixteenth century, Vesalius repudiated Galen's incorrect anatomical theories and Paracelsus advocated the use of chemical instead of herbal medicines. Around the turn of the seventeenth century, the microscope was developed by Janssen and Galileo and popularized by Leeuwenhoek and Hooke. In the seventeenth century, the theory of the circulation of blood was proposed by Harvey, and scientists learned about the actual functioning of the human body. The eighteenth century saw the development of modern medicines with the isolation of foxglove to make digitalis by Withering, the use of inoculation against smallpox by Jenner, and the postulation of the existence of vitamin C and antiscorbutic factor by Lind.

During the eighteenth century, medical theories were undergoing rapid and chaotic change. In Scotland, Brown theorized that health represented the conflict between strong and weak forces in the body. He treated imbalances with either opium or alcohol. Cullen preached a strict following of the medical orthodoxy of the time and recommended complex prescriptions to treat illness. Hahnemann was disturbed by the use of strong chemicals to cure, and developed the theory of homeopathy. Based upon the theory that like cures like, he prescribed medications in doses that were so minute that current atomic analysis cannot find even one molecule of the original substance in the solution. Benjamin Rush, the foremost American physician of the century, was a strong proponent of bloodletting, a popular therapy of the time. He has the distinction of being the first physician in America who was involved in a malpractice suit, which is a whole other story. He won the case.

The birth of statistics

Prehistoric peoples had no concept of probability, and the first mention of it is in the Talmud, written between AD 300 and 400. This alluded to the probability of two events being the product of the probability of each, but without explicitly using mathematical calculations. Among the ancients, the Greeks believed that the gods decided all life and, therefore, that probability did not enter into issues of daily life. The Greek creation myth involved a game of dice between Zeus, Poseidon, and Hades, but the Greeks themselves turned to oracles and the stars instead.
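In modern notation, the Talmudic observation corresponds to the multiplication rule for independent events, P(A and B) = P(A) × P(B). As a simple illustration (added here; it is not in the Talmud or the original text), the chance of throwing two sixes with a pair of dice is 1/6 × 1/6 = 1/36.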

The use of Roman numerals made any kind of complex calculation impossible. Numbers as we know them today, using the decimal system and the zero, probably originated around AD 500 in the Hindu culture of India. This was probably the biggest step toward being able to manipulate probabilities and determine statistics. The Arabic mathematician Khowarizmi defined rules for adding, subtracting, multiplying, and dividing in about AD 800. In 1202, the book of the abacus, Liber abaci by Leonardo Pisano (more commonly known as Fibonacci), first introduced the numbers discovered by Arabic cultures to European civilization.

In 1494, Luca Paccioli defined basic principles of algebra and multiplication tables up to 60 × 60 in his book Summa de arithmetica, geometria, proportioni e proportionalita. He posed the first serious statistical problem: two men are playing a game called balla, which is to end when one of them has won six rounds. However, when they stop playing, A has won only five rounds and B three. How should they divide the wager? It would be another 200 years before this problem was solved.

In 1545, Girolamo Cardano wrote the books Ars magna (The Great Art) and Liber de ludo aleae (Book on Games of Chance). This was the first attempt to use mathematics to describe statistics and probability, and he accurately described the probabilities of throwing various numbers with dice. Galileo expanded on this by calculating probabilities using two dice. In 1619, a Puritan minister named Thomas Gataker expounded on the meaning of probability by noting that it was natural laws and not divine providence that governed these outcomes.
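To make the dice arithmetic concrete, here is a small sketch (added for illustration; the book itself contains no code) that enumerates the 36 equally likely outcomes of two fair dice and tallies the probability of each total, the kind of calculation Cardano and Galileo worked out by hand:

    # Enumerate the 36 equally likely outcomes of two fair dice and
    # report the probability of each total as a reduced fraction.
    from collections import Counter
    from fractions import Fraction

    counts = Counter(a + b for a in range(1, 7) for b in range(1, 7))
    for total in sorted(counts):
        print(f"{total:2d}: {Fraction(counts[total], 36)}")

Running it confirms that 7 is the most likely total (6/36, or 1/6) and that 2 and 12 are the least likely (1/36 each).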

Other famous scientists of the seventeenth century included Huygens, Leibniz, and the Englishman John Graunt, who all wrote further on norms of statistics, including the relation of personal choice and judgment to statistical probability.

In 1662, a group of Parisian monks at the Port Royal Monastery wrote an early text on statistics and were the first to use the word probability. Wondering why people were afraid of lightning even though the probability of being struck is very small, they stated that the "fear of harm ought to be proportional not merely to the gravity of the harm but also to the probability of the event." 1 This linked the severity, perception, and probability of the outcome of the risk for the person involved.

In 1660, Blaise Pascal refined the theories of statistics and, with help from Pierre de Fermat, solved the balla problem of Paccioli. All of these theories paved the way for modern statistics, which essentially began with the use of actuarial tables to determine insurance for merchant ships. Edward Lloyd opened his coffee shop in London, at which merchant ship captains used to gather, trade their experiences, and announce the arrival of ships from various parts of the world. One hundred years later, this endeavor led to the foundation of Lloyds of London, which began its business of naval insurance in the 1770s.
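For the curious, the Pascal–Fermat solution (summarized here for illustration; the text itself does not work it out) runs as follows: with the score five rounds to three in a game to six, A needs one more win and B needs three, so at most three further rounds would settle the game. If each round is an even chance, B takes the stake only by winning all three, with probability 1/2 × 1/2 × 1/2 = 1/8. A fair division therefore gives A 7/8 of the wager and B 1/8.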

John Graunt, a British merchant, categorized the cause of death of the London populace using statistical sampling, noting that "considering that it is esteemed an even lay, whether any man lived 10 years longer, I supposed it was the same, that one of any 10 might die within one year." He also noted the reason for doing this: to "set down how many died of each [notorious disease] ... those persons may better understand the hazard they are in." 2 Graunt's statistics can be compared to recent data from the United States in 1993 in Table 1.2. As a result of this work, the government of the United Kingdom set up the first government-sponsored statistical sampling service.

1 P. L. Bernstein. Against the Gods: the Remarkable Story of Risk. New York, NY: Wiley, 1998. p. 71.
2 Ibid., p. 82.

Table 1.2. Probability of survival: 1660 and 1993 (percentage surviving to each age)

Age, y    1660    1993
0         100%    100%
26        25%     98%
46        10%     95%
76        1%      70%

With the rise in statistical thinking, Jacob Bernoulli devised the law of large numbers, which stated that as the number of observations increased, the actual frequency of an event would approach its theoretical probability. This is the basis of all modern statistical inference. In the 1730s, Jacob's nephew Daniel Bernoulli developed the idea of utility as the mathematical combination of the quantity and perception of risk.
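A minimal simulation (again an illustrative sketch, not part of the original text) shows Bernoulli's law at work: as the number of fair coin tosses grows, the observed frequency of heads settles toward the theoretical probability of 0.5.

    # Toss a fair coin n times and report the observed frequency of heads;
    # larger n brings the observed frequency closer to the theoretical 0.5.
    import random

    random.seed(42)  # fixed seed so the run is reproducible
    for n in (10, 100, 1_000, 10_000, 100_000):
        heads = sum(random.random() < 0.5 for _ in range(n))
        print(f"n = {n:>7,}: frequency of heads = {heads / n:.4f}")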

Modern era

Nineteenth century to today

The nineteenth century saw the development of Claude Bernard's modern physiology, William Morton's anesthesia, Joseph Lister and Ignatz Semmelweis' antisepsis, Wilhelm Roentgen's x-rays, Louis Pasteur and Robert Koch's germ theory, and Sigmund Freud's psychiatric theory. Changes in medical practice were illustrated by the empirical analysis done in 1838 by Pierre Charles Alexandre Louis. He showed that blood-letting therapy for typhoid fever was associated with increased mortality and changed this practice as a result. The growth of sanitary engineering and public health preceded this in the seventeenth and eighteenth centuries. This improvement had the greatest impact on human health through improved water supplies, waste removal, and living and working conditions. John Snow performed the first recorded modern epidemiological study in 1854 during a cholera epidemic in London. He found that a particular water pump located on Broad Street was the source of the epidemic and was being contaminated by sewage dumped into the River Thames. At the same time, Florence Nightingale was using statistical graphs to show the need to improve sanitation and hygiene in general for the British troops during the Crimean War. This type of data gathering in medicine was rare up to that time.

The twentieth century saw an explosion of medical technology. Specifics include the discovery of modern medicines by Paul Ehrlich, antibiotics (specifically sulfanilamide by Domagk and penicillin by Fleming), and modern chemotherapeutic agents to treat ancient scourges such as diabetes (specifically the discovery of insulin by Banting, Best, and McLeod), cancer, and hypertension. The modern era of surgery has led to open-heart surgery, joint replacement, and organ transplantation. Advances in medicine continue at an ever-increasing rate.

Why weren't physicians using statistics in medicine? Before the middle of the twentieth century, advances in medicine and conclusions about human illness occurred mainly through the study of anatomy and physiology. The case study or case series was a common way to prove that a treatment was beneficial or that a certain etiology was the cause of an illness. The use of statistical sampling techniques took a while to develop. There were intense battles between those physicians who wanted to use statistical sampling and those who believed in the power of inductive reasoning from physiological experiments.

This argument between inductive reasoning and statistical sampling continued into the nineteenth century. Pierre Simon Laplace (1814) put forward the idea that essentially all knowledge was uncertain and, therefore, probabilistic in nature. The work of Pierre Charles Alexandre Louis on typhoid and diphtheria (1838) debunking the theory of bleeding used probabilistic principles. On the other side was Francois Double, who felt that treatment of the individual was more important than knowing what happens to groups of patients. The art of medicine was defined as deductions from experience and induction from physiologic mechanisms. These were felt to be more important than the "calculus of probability." This debate continued for over 100 years in France, Germany, Britain, and the United States.

The rise of modern biomedical research

Most research done before the twentieth century was more anecdotal than systematic, consisting of descriptions of patients or pathological findings. James Lind, a Royal Navy surgeon, carried out the first recorded clinical trial in 1747. In looking for a cure for scurvy, he fed sailors afflicted with scurvy six different treatments and determined that a factor in limes and oranges cured the disease while other foods did not. His study was not blinded, but as a result, 40 years later limes were stocked on all ships of the Royal Navy, and scurvy among sailors became a problem of the past.

Research studies of physiology and other basic science research topics began to appear in large numbers in the nineteenth century. By the start of the twentieth century, medicine had moved from the empirical observation of cases to the scientific application of basic sciences to determine the best therapies and catalog diagnoses. Although there were some epidemiological studies that looked at populations, it was uncommon to have any kind of longitudinal study of large groups of patients. There was a 200-year gap from Lind's studies before the controlled clinical trial became the standard study for new medical innovations. It was only in the 1950s that the randomized clinical trial became the standard for excellent research.

There are three more British men who made great contributions to the early development of the current movement in EBM. Sir Ronald Fisher was the father of statistics. Beginning in the early 1900s, he developed the basis for most theories of modern statistical testing. Austin Bradford Hill was another statistician, who, in 1937, published a series of articles in the Lancet on the use of statistical methodology in medical research. In 1947, he published a simple commentary in the British Medical Journal calling for the introduction of statistics in the medical curriculum. 3 He called for physicians to be well versed in basic statistics and research study design in order to avoid the biases that were then so prevalent in what passed for medical research. Bradford Hill went on to direct the first true modern randomized clinical trial. He showed that streptomycin therapy was superior to standard therapy for the treatment of pulmonary tuberculosis.

Finally, Archie Cochrane was particularly important in the development of the current movement to perform systematic reviews of medical topics. He was a British general practitioner who did a lot of epidemiological work on respiratory diseases. In the late 1970s, he published an epic work on the evidence for medical therapies in perinatal care. This was the first quality-rated systematic review of the literature on a particular topic in medicine. His book Effectiveness and Efficiency set out a rational argument for studying and applying EBM to the clinical situation. 4 Subsequently, groups working on systematic reviews spread through the United Kingdom and now form a network in cyberspace throughout the world. In his honor, this network has been named the Cochrane Collaboration.

As Santayana said, it is important to learn from history so as not to repeat the mistakes that civilization has made in the past. The improper application of tainted evidence has resulted in poor medicine and increased cost without improving on human suffering. This book will give physicians the tools to evaluate the medical literature and pave the way for improved health for all. In the next chapter, we will begin where we left off in our history of medicine and statistics and enter the current era of evidence-based medicine.

3 A. Bradford Hill. Statistics in the medical curriculum? Br. Med. J. 1947; ii: 366.
4 A. L. Cochrane. Effectiveness & Efficiency: Random Reflections on Health Services. London: Royal Society of Medicine, 1971.


2 What is evidence-based medicine?

The most savage controversies are those about matters as to which there is no good evidence either way.
Bertrand Russell (1872–1970)

Learning objectives

In this chapter, you will learn:
- why you need to study evidence-based medicine
- the elements of evidence-based medicine
- how a good clinical question is constructed

The importance of evidence

In the 1980s, there were several studies looking at the utilization of various surgeries in the northeastern United States. These studies showed that there were large variations in the amount of care delivered to similar populations. They found variations in rates of prostate surgery and hysterectomy of up to 300% between similar counties. The variation rate in the performance of cataract surgery was 2000%. The researchers concluded that physicians were using very different standards to decide which patients required surgery. Why were physicians using such different rules? Weren't they all reading the same textbooks and journal articles? In that case, shouldn't their practice be more uniform?

"Daily, clinicians confront questions about the interpretation of diagnostic tests, the harm associated with exposure to an agent, the prognosis of disease in a specific patient, the effectiveness of a preventive or therapeutic intervention, and the costs and clinical consequences of many other clinical decisions. Both clinicians and policy makers need to know whether the conclusions of a systematic review are valid, and whether recommendations in practice guidelines are sound." 1

This is where Evidence-Based Medicine comes in.

Evidence-based medicine (EBM) has been defined as "the conscientious, explicit, and judicious use of the best evidence in making decisions about the care of individual patients" (http://ebm.mcmaster.ca/documents/how_to_teach_ebcp_workshop_brochure_2009.pdf). 2 EBM stems from the physician's need to have proven therapies to offer patients. This is a paradigm shift that represents both a breakdown of the traditional hierarchical system of medical practice and the acceptance of the scientific method as the governing force in advancing the field of medicine. Simply stated, EBM is applying the best evidence that can be found in the medical literature to the patient with a medical problem, resulting in the best possible care for each patient. Evidence-based clinical practice (EBCP) is an approach to medical practice in which you, the clinician, are able to evaluate the strength of that evidence and use it in the best clinical practice for the patient sitting in your office.

Evidence-based medicine can be seen as a combination of three skills by which practitioners become aware of, critically analyze, and then apply the best available evidence from the medical research literature for the care of individual patients. The first of these is Information Mastery (IM), the skill of searching the medical literature in the most efficient manner to find the best available evidence. This skill will be the focus of Chapter 5. The majority of the chapters in this book will focus on the skill of Critical Appraisal (CA) of the literature. This set of skills will help you to develop critical thinking about the content of the medical literature. Finally, the results of the information found and critically appraised must be applied to patient care in the process of Knowledge Translation (KT), which is the subject of Chapter 17. The application of research results is a blend of the available evidence, the patient's preferences, the clinical situation, and the practitioner's clinical experience (Fig. 2.1).

[Fig. 2.1 The four elements of evidence-based health care: best available evidence, clinical situation, patient values and preferences, all bound together by clinical experience.]

1 McMaster University Department of Clinical Epidemiology and Biostatistics. Evidence-based clinical practice (EBCP) course, 1999.
2 D. L. Sackett, W. M. Rosenberg, J. A. Gray, R. B. Haynes & W. S. Richardson. Evidence based medicine: what it is and what it isn't. BMJ 1996; 312: 71–72.


Medical decision making: expert vs. evidence-based

Because of the scientific basis of medical research, the essence of evidence-based medical practice has been around for centuries. Its explicit application as EBM to problem solving in clinical medicine began simultaneously in the late 1980s at McMaster University in Canada and at Oxford University in the United Kingdom. In response to the high variability of medical practice and increasing costs and complexity of medical care, systems were needed to define the best and, if possible, the cheapest treatments. Individuals trained in both clinical medicine and epidemiology collaborated to develop strategies to assist in the critical appraisal of clinical data from the biomedical journals.

In the past, a physician faced with a clinical predicament would turn to an expert physician for the definitive answer to the problem. This could take the form of an informal discussion on rounds with the senior attending (or consultant) physician, or the referral of a patient to a specialist. The answer would come from the more experienced and usually older physician, and would be taken at face value by the younger and more inexperienced physician. That clinical answer was usually based upon the many years of experience of the older physician, but was not necessarily ever empirically tested. Evidence-based medicine has changed the culture of health-care delivery by encouraging the rapid and transparent translation of the latest scientific knowledge to improve patient care. This knowledge translation begins at the time of a discovery and continues until its general acceptance in the care of patients with clinical problems for which that knowledge is valid, relevant, and crucial.

Health-care workers will practice EBM on several levels. Most practitioners have to keep up by regularly reading relevant scientific journals and need to decide whether to accept what they read. This requires having a critical approach to the science presented in the literature, a process called "doing" EBM; the activity is done by "doers." Some of these "doers" are also the people who create critically appraised sources of evidence and systematic reviews or meta-analyses. Most health-care workers will spend a greater part of their time functioning as "users" of the medical evidence. They will have the skills to search for the best available evidence in the most efficient way. They will be good at looking for pre-appraised sources of evidence that will help them care for their patients in the most effective way. Finally, there is one last group of health-care workers, who can be called the "replicators," who simply accept the word of experts about the best available evidence for care of their patients. The goal of this book is to teach you, the clinician, to be a "doer."

With the rise of EBM, various groups have developed ways to package evidence to make it more useful to individual practitioners. These sources allow health-care professionals to practice EBM in a more efficient manner at the point of patient care. Information Mastery will help you to expedite your searches for information when needed during the patient care process. Ideally, you'd like to find and use critical evaluations of clinically important questions done by authors other than those who wrote the study. Various online databases around the world serve as repositories for these summaries of evidence. To date, most of the major centers for the dissemination of these have been in the United Kingdom.

The National Health Service sponsors the Centre for Evidence-Based Medicine based at Oxford University. This center is the home of various EBM resources; one in particular is called Bandolier. Bandolier is a summary of recent interesting evidence evaluated by the center and is published monthly. It is found at www.jr2.ox.ac.uk/bandolier and is a wonderful blend of interesting medical information and uniquely British humor in an easy-to-read format. It is excellent for student use and free to browse. The center also has various other free and easily accessible features on its main site found at www.cebm.net. Other useful EBM websites are listed in the Bibliography, and additional IM sites and processes will be discussed in Chapter 6.
be discussed in Chapter 6.<br />

Alphabet soup of critical appraisal of the medical literature

Several commonly used forms of critical appraisal are the Critically Appraised Topic (CAT), Disease Oriented Evidence (DOE), the Patient-Oriented Evidence that Matters (POEM), and the Journal Club Bank (JCB). The CAT format was developed by the Centre for Evidence-Based Medicine, and many CATs are available online at the center's website. They use the User's Guide to the Medical Literature format (see Bibliography) to catalog reviews of clinical studies. In a similar format, DOEs and POEMs are developed for use by family physicians by the American Academy of Family Physicians. The JCB is the format for critical appraisal used by the Evidence-Based Interest Group of the American College of Physicians (ACP) and the Evidence-Based Emergency Medicine group (www.ebem.org) working through the New York Academy of Medicine. Other organizations are beginning to use these formats to disseminate critical reviews on the World Wide Web.

A DOE is a critical review of a study that shows that there is a change in a particular disease marker when a particular intervention is applied. However, this disease-specific outcome may not make a difference to an individual patient. For example, it is clear that statins lower cholesterol. However, it is not necessarily true that the same drugs reduce mortality from heart disease. This is where POEMs come in. A POEM would be that the studies for some of these statin drugs have shown the correlation between statin use and decreased mortality from heart disease, an outcome that matters to the patient rather than simply a disease-oriented outcome. Another example is the prostate-specific antigen (PSA) test for detecting prostate cancer. There is no question that the test can detect prostate cancer most of the time at a stage that is earlier than would be detected by a physician examination, so it is a positive DOE. However, it has yet to be shown that early detection using the PSA results in a longer life span or an improved quality of life; thus, it is not a positive POEM.

Other compiled sources of evidence are the American Society of Internal Medicine and the American College of Physicians' ACP Journal Club, published by the journal Annals of Internal Medicine, and the Cochrane Library, sponsored by the National Health Service in the United Kingdom. Both are available by subscription. The next step for the future use of EBM in the medical decision-making process is making the evidence easily available at the patient's bedside. This has been tried using an "evidence cart" containing a computer loaded with evidence-based resources during rounds. 3 Currently, personal digital assistants (PDAs) and other handheld devices with evidence-based databases downloaded onto them are being used at the bedside to fulfill this mission.

3 D. L. Sackett & S. E. Straus. Finding and applying evidence during clinical rounds: the "evidence cart". JAMA 1998; 280: 1336–1338.

How to put EBM into use

For many physicians, the most complex part of the process of EBM is the critical appraisal of the medical literature. Part of the perceived complexity with this process is a fear of statistics and a consequent lack of understanding of statistical processes. The book will teach this in several steps. Each step will be reinforced on the CD-ROM with a series of practice problems and self-assessment learning exercises (SALEs) in which examples from the medical literature will be presented. This will also help you develop your skills of formulating clinical questions, and in time, you will become a competent evaluator of the medical literature. This skill will serve you well for the rest of your career.

The clinical question: background vs. foreground

You can classify clinical questions into two basic types. Background questions are those which have been answered in the past and are now part of the "fiber of medicine." Answers to these questions are usually found in medical textbooks. The learner must beware, since the answers to these questions may be inaccurate and not based upon any credible evidence. Typical background questions relate to the nature of a disease or the usual cause, diagnosis, or treatment of illnesses.


[Fig. 2.2 The relationship between foreground and background questions and the clinician's experience: as years of experience increase, the percentage of background questions falls while the percentage of foreground questions rises.]

Foreground questions are those usually found at the cutting edge of medicine. They are questions about the most recent therapies, diagnostic tests, or current theories of illness causation. These are the questions at the heart of the practice of EBM. A four-part clinical question called a PICO question is designed to make searching for this evidence easy.

The determination of whether a question is foreground or background depends upon your level of experience. The experienced clinician will have very few background questions that need to be researched. On the other hand, the novice has so many unanswered questions that most are of a background nature. The graph in Fig. 2.2 shows the relationship between foreground and background questions and the clinician's experience.

When do you want to get the most current evidence? How often does the average clinician need access to EBM each day? Most physician work is based upon knowledge gained by answering background questions. There are some situations for which current evidence is more helpful. These include questions that are going to make a major impact on your patient. Will the disease kill them, and if so, how long will it take and what will their death be like? These are typical questions that a cancer patient would ask. Other reasons for searching for the best current evidence include problems that recur commonly in your practice, those in which you are especially interested, or those for which answers are easily found. The case in which you are confronted with a patient whose problem you cannot solve and for which there is no good background information would lead you to search for the most current foreground evidence.

Steps in practicing EBM

There are six steps in the complete process of EBM. It is best to start learning EBM by learning and practicing these steps. As you become more familiar with the process, you can start taking shortcuts and limiting the steps. Using a patient scenario as a starting point, the first step is recognizing that there is an educational need for more current information. This step leads to the "educational prescription,"4 which can be prepared by the learner or given to them by the teacher. The steps then taken are as follows:

(1) Craft a clinical question. Often called the PICO or PICOT formulation, this is the most important step since it sets the stage for a successful answer to the clinical predicament. It includes four or sometimes five parts:
the patient
the intervention
the comparison
the outcome of interest
the time frame
(2) Search the medical literature for those studies that are most likely to give the best evidence. This step requires good searching skills using medical informatics.
(3) Find the study that is most able to answer this question. Determine the magnitude and precision of the final results.
(4) Perform a critical appraisal of the study to determine the validity of the results. Look for sources of bias that may represent a fatal flaw in the study.
(5) Determine how the results will help you in caring for your patient.
(6) Finally, you should evaluate the results of applying the evidence to your patient or patient population.

4 Based on: W. S. Richardson. Educational prescription: the five elements. University of Rochester.

The clinical question: structure of the question

The first and most critical part of the EBM process is to ask the right question. We are all familiar with the computer analogy, "garbage in, garbage out." The clinical question (or query) should have a defined structure. The PICO model has become the standard for stating a searchable question. A good question involves Patient, Intervention, Comparison, and Outcome. A fifth element, Time, is often added to this list. These must be clearly stated in order to search the question accurately.

The Patient refers to the population group to which you want to apply the information. This is the patient sitting in your office, clinic, or surgery. If you are too specific with the population, you will have trouble finding any evidence for that person. Therefore, you must initially be general in your specification of this group. If your patient is a middle-aged man with hypertension, there may be many studies of the current best treatment of hypertension in this group. However, if you had a middle-aged African-American woman in front of you, you might not find studies limited to this population. In this case, asking about treatment of hypertension in general will turn up the most evidence. You can then look through these studies to find those applicable to that patient.

The Intervention is the therapy, etiology, or diagnostic test that you are interested in applying to your patient. A therapy could simply be a new drug. If you are answering a question about the causes of diseases, the exposure to a potentially harmful process, or risk factors leading to premature mortality, you will be looking for etiology. We will discuss studies of diagnostic tests in more detail in Chapters 20-26.

The Comparison is the intervention (therapy, etiology, or diagnostic test) against which the intervention of interest is measured. A reasonable comparison group is one that would be commonly encountered in clinical practice. Testing a new drug against one that is never used in current practice is not going to help the practitioner. The comparison group ought to be a real alternative and not just a "straw man." Currently, the use of a placebo for comparison in many studies is no longer considered ethical, since there are acceptable treatments for the problem being studied.

The Outcome is the endpoint of interest to you or your patient. The most important outcomes are the ones that matter to the patient. These are most often death, disability, or full recovery. Surprisingly, not all outcomes are important to the patient. One specific type of outcome is referred to as the surrogate outcome. This refers to disease markers that ought to cause changes in the disease process. However, the expected changes to the disease process may not actually happen. Studies of heart-attack patients done in the 1960s showed that some died suddenly from irregular heart rhythms. These patients were identified before death by the presence of premature ventricular contractions (PVCs) on the electrocardiogram. Physicians thereafter began treating all heart-attack patients with drugs to suppress PVCs, having noted a lower rate of death among treated patients with PVCs. Physicians thought this would reduce deaths in all patients with heart attacks, but a large study found that the death rate actually increased when all patients were given these drugs. While the drugs prevented death in a small number of patients who had PVCs, they increased death rates in the majority of patients.

The Time relates to the period over which the intervention is being studied. This element is usually omitted from the searching process. However, it may be considered when deciding whether the study was carried out for a sufficient amount of time.
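
To make this structure concrete, the short sketch below shows one way to hold the elements of a PICO(T) question and assemble them into a searchable sentence. It is illustrative only: the class and field names are inventions of this example, not part of any standard EBM tool.

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class PicoQuestion:
        """One clinical question broken into its PICO(T) elements."""
        patient: str                 # P: the population of interest
        intervention: str            # I: therapy, test, or exposure
        comparison: str              # C: the alternative it is measured against
        outcome: str                 # O: the endpoint that matters to the patient
        time: Optional[str] = None   # T: optional time frame of the study

        def as_sentence(self) -> str:
            q = (f"In {self.patient}, does {self.intervention}, compared with "
                 f"{self.comparison}, change {self.outcome}")
            return q + (f" over {self.time}?" if self.time else "?")

    # The hemoccult screening question used as an example in Chapter 5:
    q = PicoQuestion(
        patient="well-appearing adults",
        intervention="fecal occult blood test screening",
        comparison="no screening",
        outcome="mortality from colorectal cancer",
    )
    print(q.as_sentence())

Writing the elements down separately in this way also makes it obvious which terms to feed into a literature search, the topic of Chapter 5.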

Putting EBM into context in the current practice of medicine: the science and art of medicine

Evidence-based medicine should be part of the everyday practice of all physicians. It has been only slightly more than 50 years since statistics was first felt to be an important part of the medical curriculum. In a 1947 commentary in the British Medical Journal entitled "Statistics in the medical curriculum?",5 Sir Austin Bradford Hill lamented that most physicians would interpret this as "What! Statistics in the medical curriculum?" We are now in a more enlightened era. We recognize the need for physicians to be able to understand the nature of statistical processes and to be able to interpret these for their patients. This goes to the heart of the science and art of medicine. The science is in the medical literature and in the ability of the clinician to interpret that literature. Students learn the clinical and basic sciences that are the foundation of medicine during the first 2 years of medical school. These sciences are the building blocks for a physician's career. The learning doesn't stop there. Having a critical understanding of new advances in medicine by using EBM is an important part of medical practice.

The art of medicine is in determining to which patients the literature will apply and then communicating the results to the patients. Students learn to perform an adequate history and physical examination of patients to extract the maximum amount of evidence to use for good medical decision making. Students must also learn to give patients information about their illnesses and empower them to act appropriately to effect a cure or to control and moderate the illness. Finally, as practitioners, physicians must be able to know when to apply the results of the most current literature to patients, and when other approaches should be used for their patients.

5 A. Bradford Hill. Statistics in the medical curriculum? Br. Med. J. 1947; ii: 366.

Although most practicing physicians these days believe that they practice EBM all the time, the observed variation in practice suggests otherwise. Evidence-based medicine can be viewed as an attempt to standardize the practice of medicine, but at the same time, it is not "cookbook" medicine. The application of EBM may suggest the best approach to a specific clinical problem. However, it is still up to the clinician to determine whether the individual patient will benefit from that approach. If your patient is very different from those for whom there is evidence, you may be justified in taking another approach to solve the problem. These decisions ought to be based upon sound clinical evidence, scientific knowledge, and pathophysiological information.

Evidence-based medicine is not cookbook medicine. Although accused by some of being "microfascist," EBM can be used to create clinical practice guidelines for a common medical problem that has led to a large variation in practice and has several best practices that ought to be standardized. Evidence-based medicine is not a way for managed care (or anyone else) simply to save money. Evidence-based practices can be more or less expensive than current practices, but they should be better.

Evidence-based medicine is the application of good science to the practice of health care, leading to reproducibility and transparency in the science supporting health-care practice. It is the way to maximize the benefits of science in the practice of health care.

Finally, Fig. 2.3 is a reprint from the BMJ (the journal formerly known as the British Medical Journal) and is a humorous look at alternatives to EBM.

[Fig. 2.3 Isaacs, D. & Fitzgerald, D. Seven alternatives to evidence based medicine. BMJ 1999; 319: 1618. Reprinted with permission.]


3

Causation

Heavier than air flying machines are impossible.
Lord Kelvin, President of the Royal Society, 1895

Learning objectives

In this chapter you will learn:
cause-and-effect relationships
Koch's principles
the concept of contributory cause
the relationship of the clinical question to the type of study

The ultimate goal of medical research is to increase our knowledge about the interaction between a particular agent (cause) and the health or disease of our patient (effect). Causation is the relationship between an exposure or cause and an outcome or effect such that the exposure resulted in the outcome. However, a strong association between an exposure and an outcome may not be equivalent to proving a cause-and-effect relationship. In this chapter, we will discuss the theories of causation. By the end of this chapter, you will be able to determine the type of causation in a study.

Cause-and-effect relationships

Most biomedical research studies try to prove a relationship between a particular cause and a specified effect. The cause may be a risk factor resulting in a disease, an exposure, a diagnostic test, or a treatment that helps alleviate suffering. The effect is a particular outcome that we want to measure. The stronger the design of a study, the more likely it is to prove a relationship between cause and effect. Not all study designs are capable of proving a cause-and-effect relationship; these study designs will be discussed in a later chapter.

The cause is also called the independent variable and is set by the researcher or the environment. In some studies relating to the prognosis of disease, time is the independent variable. The effect is called the dependent variable; it is dependent upon the action of the independent variable. It can be an outcome such as death or survival, the degree of improvement on a clinical score, or the detection of disease by a diagnostic test. You ought to be able to identify the cause and effect easily in the study you are evaluating if the structure of the study is of good quality. If not, there are problems with the study design.

Types of causation

It's not always easy to establish a link between a disease and its suspected cause. For example, we think that hyperlipidemia (elevated levels of lipids or fats in the blood) is a cause of cardiovascular disease. But how can we be sure that this is a cause and not just a related factor? Perhaps hyperlipidemia is caused by inactivity or a sedentary lifestyle, and the lack of exercise actually causes both cardiovascular disease and hyperlipidemia.

This may even be true of acute infections. Streptococcus viridans is a bacterium that can cause infection of the heart valves. However, it takes more than the presence of the bacterium in the blood to cause the infection. We cannot say that the presence of the bacterium in the blood is sufficient to cause this infection. There must be other factors, such as local deformity of the valve or immunocompromise, that make the valve prone to infection.

In a more mundane example, it has been noted that the more churches a town has, the more robberies occur. Does this mean that clergy are robbing people? No - it simply means that a third variable, population, explains both the number of churches and the number of muggings. The number of churches is a surrogate marker for population, the true cause. Likewise, we know that Streptococcus viridans is a cause of subacute endocarditis. But it is neither the only cause, nor does it always lead to an infected heart valve. How are we to be sure, then, of cause and effect?

In medical science, there are two types of cause-and-effect relationships: Koch's postulates and contributory cause. Robert Koch, a nineteenth-century microbiologist, developed his famous postulates as criteria to determine whether a certain microbiologic agent was the cause of an illness. Acute infectious diseases were the scourge of mankind before the mid twentieth century. As a result of better public health measures, such as water treatment and sewage disposal, and antibiotics, these are less of a problem today. Dr. Koch studied the anthrax bacillus as a cause of habitual abortion in cattle. He created the following postulates in an attempt to determine the relationship between the agent causing the illness and the illness itself.


Koch's postulates stated four basic steps to prove causation. First, the infectious agent must be found in all cases of the illness. Second, when found, it must be able to be isolated from the diseased host and grown in a pure culture. Next, the agent from the culture, when introduced into a healthy host, must cause the illness. Finally, the infectious agent must again be recovered from the new host and grown in a pure culture. This entire cascade must be met in order to prove causation.

While this model may work well in the study of acute infectious diseases, most modern illnesses are chronic and degenerative in nature. Illnesses such as diabetes, heart disease, and cancer tend to be multifactorial in their etiology and usually have multiple treatments that can alleviate the illness. For these diseases, it is virtually impossible to pinpoint a single cause, or the effect of a single treatment, from a single research study. Stronger studies of these diseases are more likely to point to useful clinical information relating one particular cause to an effect on the illness.

Applying contributory cause helps prove causation in these complex and multifactorial diseases. The requirements for proof are less stringent than Koch's postulates. However, since the disease-related factors are multifactorial, it is more difficult to prove that any one factor is decisive in either causing or curing the disease. Contributory cause recognizes that there is a large gray zone in which some of the many causes and treatments of a disease overlap.

First, the cause and effect must be seen together more often than would be expected to occur by chance alone. This means that the cause and effect are associated more often than would be expected if the concurrence of those two factors was a random event. Second, the cause must always be noted to precede the effect. If there were situations in which the effect was noted before the occurrence of the cause, that would negate this relationship in time. Finally and ideally, it should be shown that changing the cause changes the effect. This last factor is the most difficult to prove and requires that an intervention study be performed. Overall, to prove the nature of a chronic and multifactorial illness, contributory cause must minimally show association and temporality. However, to strengthen the causation, a change in the effect produced by changing the cause must also be shown. Table 3.1 compares Koch's postulates and contributory cause.
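
The first criterion, association beyond chance, is usually checked with a statistical test. As a minimal sketch of that idea, the following uses an invented 2 x 2 table of exposure versus disease (the counts are made up purely for illustration) and scipy's chi-square test:

    from scipy.stats import chi2_contingency

    # Hypothetical counts: rows = exposed / not exposed,
    # columns = disease present / disease absent.
    observed = [[30, 70],    # exposed
                [10, 90]]    # not exposed

    chi2, p, dof, expected = chi2_contingency(observed)
    print("counts expected by chance alone:", expected)
    print(f"chi-square = {chi2:.2f}, P = {p:.4f}")
    # A small P value says the exposure and disease occur together more
    # often than chance would predict -- necessary, but by itself not
    # sufficient, for contributory cause (temporality must still be shown).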

Causation and the clinical question

The two main components of causation are also parts of the clinical question. Since the clinical question is the first step in EBM, it is useful to put the clinical question into the context of causation. The intervention is the cause that is being investigated. In most studies, this is compared to another cause, named the comparison. The outcome of interest is the effect. You will learn to use good searching techniques so that you find the study that answers this query in the best manner possible. The intervention, comparison, and outcome all relate to the patient population being studied.

Table 3.1. Koch's postulates vs. contributory cause

Koch's postulates (most stringent)
(1) Sufficient: if the agent (cause) is present, the disease (effect) is present
(2) Necessary: if the agent (cause) is absent, the disease (effect) is absent
(3) Specific: the agent (cause) is associated with only one disease (effect)

Contributory cause (most clinically relevant)
(1) Not all patients with the particular cause will develop the effect (disease): the cause is not sufficient
(2) Not all patients with the specific effect (disease) were exposed to the particular cause: the cause is not necessary
(3) The cause may be associated with several diseases (effects) and is therefore non-specific

Table 3.2. Cause-and-effect relationship for the most common types of studies

Type of study           | Cause                                             | Effect
Etiology, harm, or risk | Medication, environmental, or genetic agent       | Disease, complication, or mortality
Therapy or prevention   | Medication, other therapy, or preventive modality | Improvement of symptoms or mortality
Prognosis               | Disease or therapy                                | Time to outcome
Diagnosis               | Diagnostic test                                   | Accuracy of diagnosis

Primary clinical research studies can be roughly divided into four main types, determined by the elements of cause and effect. They are studies of etiology (or harm or risk), therapy, prognosis, and diagnosis. There are numerous secondary study types that will be covered later in the book. The nomenclature used for describing the cause and effect in these studies can be somewhat confusing and is shown in Table 3.2.

Studies of etiology, harm, or risk compare groups of patients that do or don't have the outcome of interest and look to see whether they do or don't have the risk factor. They can also go in the other direction, starting from the presence or absence of the risk factor and finding out who went on to have or not have the outcome. Also, the direction of the study can be either forward or backward in time. A useful way of looking at this category of studies is to look for cohort, case-control, or cross-sectional studies. These will be defined in more detail in Chapter 6. In studies of etiology, the risk factor for a disease is the cause and the presence of disease is the outcome. In other studies, the cause could be a therapy for a disease and the effect could be the improvement in disease.

Studies of therapy or prevention tend to be randomized clinical trials, in which some patients get the therapy or preventive modality being tested and others do not. The outcome is compared between the two groups.

Studies of prognosis look at disease progression over time. They can be either cohort studies or randomized clinical trials. There are special elements to studies of prognosis that will be discussed in Chapter 33.

Studies of diagnosis are unique in that we are looking for some diagnostic maneuver that will separate those with a disease from those who may have a similar presentation and yet do not have the disease. Usually these are cohort, case-control, or cross-sectional studies. These will be discussed in more detail in Chapter 28.

There is a relationship between the clinical question and the study type. In general, the clinical question can be written as: among patients with a particular disease (population), does the presence of a therapy or risk factor (intervention), compared with no presence of the therapy or risk factor (comparison), change the probability of an adverse event (outcome)? For a study of risk or harm, we can write this as: among patients with a disease, does the presence of a risk factor, compared with the absence of a risk factor, worsen the outcome? We can also ask: are patients exposed to a risk factor more likely than unexposed patients to have the outcome of interest? For therapy, the question is: among patients with a disease, does the presence of an exposure to therapy, compared with the use of placebo or standard therapy, improve the outcome? The form of the question can help you perform better searches, as we will see in Chapter 5. Through regular practice, you will learn to write better questions and, in turn, find better answers.


4

The medical literature: an overview

It is astonishing with how little reading a doctor can practice medicine, but it is not astonishing how badly he may do it.
Sir William Osler (1849-1919)

Learning objectives

In this chapter you will learn:
the scope and function of the articles you will find in the medical literature
the function of the main parts of a research article

The medical literature is the source of most of our current information on the best medical practices. This literature consists of many types of articles, the most important of which, for our purposes, are research studies. In order to evaluate the results of a research study, you must understand what clinical research articles are designed to do and what they are capable of accomplishing. Each part of a study contributes to the final results of the published research. To be an intelligent reader of the medical literature, you must then understand which types of articles will provide the information you are seeking.

Where is clinical research found?

In your medical career, you will read, and perhaps also write, many research papers. All medical specialties have at least one primary peer-reviewed journal and most have several. There are also many general-interest medical journals. One important observation you will make is that not all journals are created equal. For example, peer-reviewed journals are "better" than non-peer-reviewed journals since their articles are more carefully screened and contain fewer "problems." Many of these peer-reviewed journals have a statistician on staff to ensure that the statistical tests used are correct. This is just one example of differences between journals and journal quality.


The New England Journal of Medicine and the Journal of the American Medical Association (JAMA) are the most widely read and prestigious general medical journals in the United States. The Lancet and the British Medical Journal (BMJ) are the other top English-language journals in the world. However, even these excellent journals print imperfect studies. As the consumer of this literature, you are responsible for determining how to use the results of clinical research. You will also have to translate the results of these research studies for your patients. Many patients these days will read about medical studies in the lay press or hear about them on television, and may even base their decisions about health care upon what the magazine writers or journalists say. Your job as a physician is to help your patient make a more informed medical decision rather than just taking the media's word for it. In order to do this, you will need to have a healthy skepticism about the content of the medical literature as well as a working knowledge of critical appraisal. Other physicians, journal reviewers, and even editors may not be as well trained as you.

Non-peer-reviewed and minor journals may still have articles and studies that give good information. Many of the articles in these journals tend to be expert reviews or case reports, which are useful for reviewing and relearning background information. Bear in mind that no matter how prestigious the journal, no study is perfect, but all studies have some degree of useful information. A partial list of common and important medical journals is included in the Bibliography.

What are the important types of articles?

Usually, when asked about articles in the medical literature, one thinks of clinical research studies. These include such epidemiological studies as case-control, cohort, or cross-sectional studies, and randomized clinical trials. These are not the only types of articles that are important to the reader of the medical literature. There are several other broad types of articles with which you should be familiar, and each has its own strengths and weaknesses. We will discuss studies other than clinical research in this chapter, and will address the common types of clinical research studies in Chapter 6.

Basic science research

Animal and other basic science research studies are usually considered pure research. They may be of questionable usefulness for your patients, since people clearly are not laboratory rats and in vitro does not always equal in vivo. Because of this, they may not pass the "so what?" test. However, they are useful preliminary studies, and they may justify human clinical studies. It is only through these types of studies that medicine will continue to push the envelope of our knowledge of the physiological and biochemical mechanisms of disease.

Animal or other bench research is sometimes used to rationalize certain treatments. This leap of faith may result in unhelpful, and potentially harmful, treatments being given to patients. An example of potentially useful basic science research is the discovery of angiostatin, a chemical that stops the growth of blood vessels into tumors. The publication of research done in mice showing that infusion of this chemical caused regression of tumors resulted in a sudden increase in inquiries to physicians from family members of cancer patients. These family members were hoping that they would be able to obtain the drug and get a cure for their loved ones. Unfortunately, this was not going to happen. When the drug was given to patients in a clinical trial, the results were much less dramatic. This is not the only clinical trial that displayed less dramatic results in humans. In another example, there were similar outcomes when bone-marrow transplant therapy was used to treat breast cancer.

The discovery of two forms of the enzyme cyclo-oxygenase (COX 1 and 2) occurred in animal research and was subsequently confirmed using research in humans. Cyclo-oxygenase 2 is the primary enzyme in the inflammatory process, while COX 1 is the primary enzyme in the maintenance of the mucosal protection of the stomach. Inhibition of both enzymes is the primary action of most non-steroidal anti-inflammatory drugs (NSAIDs). With the discovery of these two enzymes, drugs selective for inhibition of the COX 2 enzyme were developed. These had anti-inflammatory action without causing gastric mucosal irritation and gastrointestinal bleeding. At first glance, this development appeared to be a real advance in medicine. However, extending the use of this class of drug to routine pain management was not warranted. Clinical studies have since demonstrated equivalence in pain control with other NSAIDs, with only modest reductions in side effects at a very large increase in cost. More recently, the drugs were found to actually increase the rate of heart attacks.

Basic science research is important for increasing the content of biomedical knowledge. For instance, recent basic science research has demonstrated the plasticity of the nervous system. Prior to this discovery, it was standard teaching that nervous system cells were permanent and not able to regenerate. Current research now shows that new brain and nerve cells can be grown, in both animals and humans. While not clinically useful at this time, this is promising research for the future treatment of degenerative nerve disorders such as Alzheimer's disease.

Because basic science studies seem to be more reliable given that they measure basic physiologic processes, their results are sometimes accepted without question. Doing this could be an error. A recent study by Bogardus et al. found that there were significant methodological problems in many clinical studies of molecular genetics, which used basic science techniques in clinical settings.1 Thus, while this book focuses on clinical studies, the principles discussed also apply to your critical appraisal of basic science research studies.

1 S. T. Bogardus, Jr., J. Concato & A. R. Feinstein. Clinical epidemiological quality in molecular genetic research: the need for methodological standards. JAMA 1999; 281: 1919-1926.

Editorials

Editorials are opinion pieces written by a recognized expert on a given topic. Most often they are published in response to a study in the same journal issue. Editorials are the vehicle that puts a study into perspective and shows its usefulness in clinical practice. They give contextual commentary on the study but, because they are written by an expert who is giving an opinion, they incorporate that expert's biases. Editorials should be well referenced; they should be read with a skeptical eye and should not be the only article that you use to form your opinion.

Clinical review

A clinical review article seeks to review all the important studies on a given subject to date. It is written by an expert or someone with a special interest in the topic and is more up to date than a textbook. Clinical reviews are most useful for new learners updating their background information. Because a clinical review is written by a single author, it is subject to the writer's biases in reporting the results of the referenced studies. For this reason, it should not be accepted uncritically. However, if you are familiar with the background literature and can determine the accuracy of the citations and subsequent recommendations, a review can help to put clinical problems into perspective. The overall strength of the review depends upon the strength (validity and impact) of each individual study.

Meta-analysis or systematic review

Meta-analysis or systematic review is a relatively new technique that provides a comprehensive and objective analysis of all clinical studies on a given topic. It attempts to combine many studies and is more objective in reviewing them than a clinical review. The authors apply statistical techniques to quantitatively combine the results of the selected studies. We will discuss the details of evaluating these types of article in Chapter 33.

Components of a clinical research study

Clinical studies should be reported in a standardized manner. The most important reporting style is referred to as the IMRAD style. This stands for Introduction, Methods, Results, and Discussion. First proposed by Day in 1989, it is now the standard for all clinical studies reported in the English-language literature.2 Structured abstracts, proposed by the SORT (Standards of Reporting Trials) group, are now required by most medical journals. The structure of the abstract follows the structure of the full article (Table 4.1).

2 R. A. Day. The origins of the scientific paper: the IMRAD format. AMWAJ 1989; 4: 16-18.

Table 4.1. Components of reported clinical studies
(1) Abstract
(2) Introduction
(3) Methods
(4) Results
(5) Discussion
(6) Conclusion
(7) References/bibliography

Abstract

The abstract is a summary of the study. It should accurately reflect what actually happened in the study. Its purpose is to give you an overview of the research and let you decide whether you want to read the full article. The abstract includes a sentence or two on each of the elements of the article: the introduction, study design, population studied, interventions and comparisons, outcomes measured, primary or most important results, and conclusions. The abstract may not completely or accurately represent the actual findings of the article and often does not contain important information found only in the article. Therefore, it should never be used as the sole source of information about the study.

Introduction

The introduction is a brief statement of the problem to be solved and the purpose of the research. It describes the importance of the study, either by giving the reader a brief overview of previous research on the same or related topics or by giving the scientific justification for doing the study. The hypotheses being tested should be explicitly stated. Too often, the hypothesis is only implied, potentially leaving the study open to misinterpretation. As we will learn later, only the null hypothesis can be directly tested. Therefore, the null hypothesis should either be explicitly stated or be obvious from the statement of the expected outcome of the research, which is also called the alternative hypothesis.


Methods

The methods section is the most important part of a research study and should be the first part of a study that you read. Unfortunately, in practice, it is often the least frequently read. It includes a detailed description of the research design, the population sample, the process of the research, and the statistical methods. There should be enough detail to allow anyone reading the study to replicate the experiment. Careful reading of this section will suggest potential biases and threats to the validity of the study.

(1) The sample is the population being studied. It should also be the population to which the study is intended to pertain. The processes of sample selection and/or assignment must be adequately described. This includes the eligibility requirements or inclusion criteria (who could be entered into the experiment) and the exclusion criteria (who is not allowed to be in the study and why). It also includes a description of the setting in which the study is being done. The site of research, such as a community outpatient clinic, specialty practice, or hospital, may influence the types of patients enrolled in the study; these settings should therefore be stated in the methods section.

(2) The procedure describes both the experimental processes and the outcome measures. It includes data acquisition, randomization, and blinding conditions. Randomization refers to how the research subjects were allocated to different groups. The blinding information should include whether the treating professionals, observers, or participants were aware of the nature of the study and whether the study is single-, double-, or triple-blinded. All of the important outcome measures should be examined, and the process by which they are measured and the quality of these measures should be explicitly described. These are known as the instruments and measurements of a study. In studies that depend on patient record review, the process by which that review was carried out should be explicitly described.

(3) The statistical analysis section includes the types of data, such as nominal, ordinal, interval, ratio, continuous, or dichotomous data; how the data are described, including the measures of central tendency and dispersion of the data; and which analytic statistical tests will be used to assess statistical relationships between two or more variables. It should also note the levels of α and β error and the power.

Results

The results section should summarize all the data pertinent to the purpose of the study. It should also give an explanation of the statistical significance of the data. This part of the article is not a place for commentary or opinions - "just the facts!"3 All important sample and subgroup characteristics, and the results of all important research outcomes, should be included. The description of the measurements should include the measures of central tendency and dispersion (e.g., standard deviations or standard error of the mean) and the P values or confidence intervals. These values should be given so that readers may determine for themselves whether the results are statistically and/or clinically significant. In addition, the tables and graphs should be clearly and accurately labeled.

3 Sergeant Friday (played by Jack Webb) in the 1960s television show Dragnet.

Discussion

The discussion includes an interpretation of the data and a discussion of the clinical importance of the results. It should flow logically from the data shown and incorporate other research on the topic, explaining why this study did or did not corroborate the results of those studies. Unfortunately, this section is often used to spin the results of a study in a particular direction and will over- or under-emphasize certain results. This is why the reader's critical appraisal is so important. The discussion section should include a discussion of the statistical and clinical significance of the results, the non-significant results, and the potential biases in the study.

(1) The statistical significance is a mathematical phenomenon depending only on the sample size, the precision of the data, and the magnitude of the difference found between groups, also known as the effect size. As the sample size increases, the power of the study will increase, and a smaller effect size will become statistically significant (see the sketch after this list).

(2) The clinical significance means that the results are important and will be useful in clinical practice. If a small effect size is found, that treatment may not be clinically important. Also, a study with enough subjects may find statistical significance even if only a tiny difference in the outcomes of the groups is found. In these cases, the study result may make no clinical difference for your patient. What is important is a change in disease status that matters to the patient sitting in your office.

(3) Interpretation of results that are not statistically significant must be included in the discussion section. A study result that is not statistically significant does not conclusively mean that no relationship or association exists. It is possible that the study may not have had adequate power to find those results to be statistically significant. This is often true in studies with small sample sizes. On the whole, absence of evidence of an effect is not the same thing as evidence of absence of an effect.

(4) Finally, the discussion should address all potential biases in the study and hypothesize about their effects on the study conclusions. The directions for future research in this area should then be addressed.
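
As a worked illustration of point (1), the sketch below holds a small effect size fixed (an invented 2 mmHg difference in mean blood pressure, with a standard deviation of 15 in each group) and lets only the sample size grow. The numbers are made up purely for demonstration:

    from scipy.stats import ttest_ind_from_stats

    for n in (25, 100, 400, 1600, 6400):
        t, p = ttest_ind_from_stats(mean1=120, std1=15, nobs1=n,
                                    mean2=118, std2=15, nobs2=n)
        print(f"n per group = {n:5d}: P = {p:.4f}")
    # The identical 2 mmHg difference is far from statistically significant
    # with 25 patients per group but highly significant with thousands,
    # even though its clinical importance has not changed at all.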

Conclusion

The study results should be accurately reflected in the conclusion section, a one-paragraph summary of the final outcome. There are numerous points that should be addressed in this section; notably, important sources of bias should be mentioned as disclaimers. The reader should be aware that pitfalls in the interpretation of study conclusions include the use of biased language and incorrect interpretation of results not supported by the data. Studies sponsored by drug companies or written by authors with other conflicts of interest may be more prone to these biases and should be regarded with caution. All sources of conflict of interest should be listed either at the start or at the end of the article.

Bibliography

The references/bibliography section demonstrates how much work from other writers the author has acknowledged. This includes a comprehensive reference list including all important studies of the same or a similar problem. You will be better at interpreting the completeness of the bibliography when you have immersed yourself in a specialty area for some time and are able to evaluate the author's use of the literature. Be wary if there are multiple citations of works by just one or two authors, especially if by the author(s) of the current study.

How can you get started?

You have to decide which journals to read. The New England Journal of Medicine is a great place for medical students to start. It publishes important and high-quality studies and includes a lot of correlation with the basic sciences. There are also excellent case discussions, review articles, and basic-science articles. In general, begin by reading the abstract. This will tell you whether you really want to read the study in the first place. If you don't care about the topic, go on to the next article. Remember that what you read in the abstract should not be used to apply the results of the study to a clinical scenario. You still need to read and evaluate the article, especially the methods section. JAMA (Journal of the American Medical Association) is another excellent journal, with many studies regarding medical education and the operation of the health-care system. For readers in the United Kingdom, the Lancet and the BMJ (British Medical Journal) are equivalent journals for the student to begin reading.


The rest of this book will present a set of useful skills that will assist you in evaluating clinical research studies. Initially, we will focus on learning how to critically evaluate the most common clinical studies: specifically, studies of therapy, risk, harm, and etiology. These skills will help you to grade the quality of the studies using the schema outlined in Appendix 1. Appendix 2 is a useful outline of steps to help you do this. Later, the book will focus on studies of diagnostic tests, clinical decision making, cost analyses, prognosis, and meta-analyses or systematic reviews.


5

Searching the medical literature

Sandi Pirozzo, B.Sc., M.P.H.
Updated by Elizabeth Irish, M.L.S.

Through seeking we may learn and know things better. But as for certain truth, no man has known it, for all is but a woven web of guesses.
Xenophanes (sixth century BC)

Learning objectives

In this chapter you will learn how to:
use a clinical question to initiate a medical literature search
formulate an effective search strategy to answer specific clinical questions
select the most appropriate database to use to answer a specific clinical question
use Boolean operators in developing a search strategy
identify the types and uses of various evidence-based review databases

To become a lifelong learner, the physician must be a competent searcher of the medical literature. This requires one to develop an effective search strategy for a clinical question. By the end of this chapter you will understand how to write a clinical question and formulate a search of the literature. Once an answerable clinical question is written and the best study design that could answer the question is decided upon, the next task is to search the literature to find the best available evidence. This might appear an easy task but, unless one is sure of which database to use and has good searching skills, it can be time-consuming, frustrating, and wholly unproductive. This chapter will go through some common databases and provide the information needed to make the search for evidence both efficient and rewarding.

Introduction

Finding all relevant studies that have addressed a single question is not an easy task. The exponential growth of the medical literature necessitates a systematic searching approach in order to identify the best evidence available to answer a clinical question. While many people have a favorite database or website, it is important to consult more than one resource to ensure that all relevant information is retrieved.

Use of different databases

Of all the databases that index medical and health-related literature, MEDLINE is probably the best known. Developed by the National Library of Medicine at the National Institutes of Health in the United States, it is the world's largest general biomedical database and indexes approximately one-third of all biomedical articles. Since it was the first medical literature database available for electronic searching, most clinicians are familiar with its use. Due to its size and breadth, it is sometimes a challenge to get exactly what one wants from it. This will be the first database discussed, after a discussion of some basic principles of searching.

In addition to MEDLINE, there are other, more specialized databases that may yield more clinically useful information, depending on the nature of the clinical query. The database selected depends on the content area and the type of question being asked. The database for nursing and allied health studies is called CINAHL, and the one for psychological studies is called PsycINFO. If searching for the answer to a question of therapy or intervention, the Cochrane Library might be a particularly useful resource. It provides systematic reviews of trials of health-care interventions and a registry of controlled clinical trials. The TRIP database will do a systematic search of over 100 critically appraised evidence-based websites and databases, including MEDLINE via PubMed and the Cochrane Library, to provide a synopsis of results in one place. It is free and can be found at www.tripdatabase.com.

For information at the point of care, DynaMed, Essential Evidence Plus, and Ganfyd at www.ganfyd.org are designed to provide quick synopses of topics that are meant to be accessed at the bedside using a hand-held device, such as a PDA or smart phone. Many would consider these to be essentially on-line textbooks that provide only background information. They may have explicit levels of evidence and the most current evidence, but they are works in progress. To broaden your search to the life sciences as well as conference information and cited articles, the search engines Scopus or Web of Science should be consulted.

It is easy to surmise that not only is the medical literature growing exponentially, but the available databases and websites to retrieve this literature are also increasing. In addition to the resources covered in this chapter, a further list of relevant databases and other online resources is provided in the Bibliography.


[Fig. 5.1 Venn diagram for the colorectal screening search: three overlapping circles for the Population (colorectal neoplasms), the Intervention (screening), and the Outcome (mortality). Comparison is frequently omitted in search strategies.]

Developing effective information retrieval strategies

Having selected the most appropriate database, one must develop an effective search strategy to retrieve the best available evidence on the topic of interest. This section will give a general searching framework that can be applied to any database. Databases often vary in terms of the software used, internal structure, indexing terms, and the amount of information that they give. However, the principles behind developing a search strategy remain the same.

The first step is to identify the key words or concepts in the study question. This leads to a systematic approach of breaking down the question into its individual components. The most useful way of dividing a question into its components is to use the PICO format that was introduced in Chapter 2. To review: P stands for the population of interest; I is the intervention, whether a therapy, diagnostic test, or risk factor; C is the comparison to the intervention; and O is the outcome of interest.

A PICO question can be represented pictorially using a Venn diagram. As an example, we will use the following question: What is the mortality reduction in colorectal cancer as a result of performing hemoccult testing of the stool (fecal occult blood test, FOBT) as screening in well-appearing adults? Using the PICO format, we recognize that mortality is the outcome, screening with the hemoccult test is the intervention, not screening is the comparison, and adults who appear well but may or may not have colorectal neoplasms are the population. The Venn diagram for that question is shown in Fig. 5.1.

Once the study question has been broken into its components, they can be combined using Boolean logic. This consists of using the operators AND, OR, and NOT as part of the search. The AND operator is used when you wish to retrieve only those records containing both terms; it narrows the search and reduces the number of citations recovered. The OR operator is used when at least one of the terms must appear in the record; it broadens the search, should be used to connect synonyms or related concepts, and will increase the number of citations recovered. Finally, the NOT operator is used to retrieve records containing one term and not the other. This also reduces the number of citations recovered and is useful for eliminating documents on potentially irrelevant topics. Be careful using this operator, as it can eliminate useful references too. It can be used to narrow initially wide-ranging searches or to remove duplicate records from a previously viewed set. The use of these operators is illustrated in Fig. 5.2.

Fig. 5.2 Boolean operators (AND, OR, and NOT). Blue areas represent the search results in each case.
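Written out as query strings, the three operators look like this (the examples are purely illustrative, and the exact syntax varies slightly between databases):

    screening AND colorectal neoplasms      both terms must appear in the record
    mortality OR death OR survival          any one of the synonyms may appear
    colorectal neoplasms NOT case reports   excludes records containing the second term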

Using the example of the question about the effect of hemoccult screening on colon cancer mortality, the combination of the initial search terms colorectal neoplasms AND screening represents the overlap between these two terms and retrieves only articles that use both terms. This will give us a larger number of articles than if we used all three terms in one search: screening AND colorectal neoplasms AND mortality. That combination represents a smaller area, the one where all three terms overlap, and will retrieve only articles containing all three terms.

Although the overlap of all three parts may have the highest concentration of relevant articles, the other areas may still contain many relevant and important articles. We call the three-term search a high-specificity search: the set retrieved will contain a high proportion of useful articles, but many others may be missed. This means that the search lacks sensitivity, in that it will not identify some studies that are relevant to the question being asked. Hence, if the disease AND study factor combination (colorectal neoplasms AND screening) yields a manageable number of citations, it is best to work with this and not restrict the search further by adding the outcome (screening AND colorectal neoplasms AND mortality).

Everyone searches differently! Most people will start big (most hits possible) and then begin limiting the results. Look at the results along the way to make sure you are on the right track. My preference is to start with the smallest number of search terms that gives a reasonable number of citations and then add others (in a Boolean fashion) as a means of either broadening (OR operator) or limiting (AND or NOT operators) the search. For most searches, anything fewer than about 50 to 100 citations to look through by hand is reasonable. Remember that these terms are entered into the database by hand, and errors of classification will occur. The more a search is limited, the more likely it is to miss important citations. In general, the outcome and study-design terms are optional and usually needed only when the search results are very large and unmanageable.



A nested Boolean search can be used to form more complex combinations that will capture all the overlap areas between the three circles. For our search, these are: (mortality AND screening) OR (mortality AND colorectal neoplasms) OR (screening AND colorectal neoplasms). This strategy will yield a higher number of hits, but still fewer than connecting all three terms with the OR operator. However, it may not be appropriate if you are looking for a quick answer to a clinical question, since you will then have to hand-search more citations. Whatever strategy you choose to start with, try it. You never know a priori what results you are going to get.
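For readers comfortable with a little programming, such strategies can also be run outside the web interface through the National Library of Medicine's E-utilities service, which accepts the same Boolean query strings. The following minimal Python sketch is illustrative only; the endpoint and JSON fields follow the public E-utilities conventions, and should be verified against the current NLM documentation before being relied on:

    import json
    import urllib.parse
    import urllib.request

    # NLM E-utilities "esearch" endpoint for querying PubMed programmatically.
    ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

    # The nested Boolean strategy from the text, written as one query string.
    query = ("(mortality AND screening) OR (mortality AND colorectal neoplasms) "
             "OR (screening AND colorectal neoplasms)")

    params = urllib.parse.urlencode({"db": "pubmed", "term": query, "retmode": "json"})

    with urllib.request.urlopen(ESEARCH_URL + "?" + params) as response:
        result = json.load(response)["esearchresult"]

    # Report the citation count and the first few PubMed IDs (PMIDs) retrieved.
    print("Citations found:", result["count"])
    print("First PMIDs:", result["idlist"][:5])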

Use of synonyms and the wildcard symbol

When the general structure of the question has been developed and only a small number of citations are recovered, it may be worthwhile to look for synonyms for each component of the search. For our question about mortality reduction in colorectal cancer due to fecal occult blood screening in adults, we can use several synonyms. Screening can be screen or early detection, colorectal cancer can be bowel cancer, and mortality can be death or survival. Since these terms are entered into the database by coders, they may vary greatly from study to study for the same ultimate question. What you miss with one synonym, you may pick up with another.

Truncation, or the "wildcard" symbol, can be used to find all the words with the same stem in order to increase the scope of the search. Thus our search string can become (screen* OR early detection) AND (colorectal cancer OR bowel cancer) AND (mortality OR death* OR survival). The term screen* is shorthand for words beginning with "screen": it will turn up screen, screened, screening, and so on. The wildcard is extremely useful but should be used with caution. If you were searching for information about hearing problems and used hear* as one of your search terms, you would retrieve not only articles with the words "hear" and "hearing" but also all those with the word "heart." Note that the wildcard symbol varies between systems, but most commonly it will be an asterisk (*) or dollar sign ($). It is important to check the database's help documentation to determine not only the correct symbol, but also to ensure that the database supports truncation. For instance, if a database automatically truncates, then the use of a wildcard symbol could inadvertently result in a smaller retrieval rather than a broader one.
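Side by side, the two stems mentioned above illustrate the trade-off (the expansions shown are indicative, not exhaustive):

    screen*  ->  screen, screened, screening          (captures the intended variants)
    hear*    ->  hear, hearing, and also heart        (stem too short; retrieval too broad)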

MEDLINE via PubMed

MEDLINE is available online for free using the PubMed website at www.pubmed.gov. It is often assumed that MEDLINE and PubMed are one and the same, but PubMed is actually a very user-friendly interface for searching MEDLINE as well as several other databases. These are: OLDMEDLINE; in-process citations that are not yet included in MEDLINE; selected life-science journals beyond the scope of MEDLINE; and citations to author manuscripts of NIH-funded researchers' publications. PubMed also provides time-saving search services such as Clinical Queries and the MeSH database, which help the user search more efficiently. The best way to get to know PubMed is to use it, explore its capabilities, and experiment with some searches. Rather than provide a comprehensive tutorial on searching PubMed, this chapter will focus on a few of the features that are most helpful in the context of EBM. Remember that all databases are continually being updated and upgraded, so it is important to consult the help documentation or your health sciences librarian for searching guidance.

PubMed Clinical Queries: searching using methodological filters

Within PubMed there is a special feature called Clinical Queries, which can be found in the left-hand side bar of the PubMed home page. It uses a set of built-in search filters that are based on methodological search techniques developed by Haynes in 1994, and which search for the best evidence on clinical questions in four study categories: diagnosis, therapy, etiology, and prognosis. In turn, each of these categories may be searched with an emphasis on specificity, in which case most of the articles retrieved will be relevant but some may be missed, or on sensitivity, in which case the proportion of relevant articles will decrease but many more articles will be retrieved and fewer missed. It is also possible to limit the search to systematic reviews of the search topic by clicking on the "systematic review" option. Figure 5.3 shows the PubMed Clinical Queries page. To continue searching in Clinical Queries, click on the "clinical queries" link in the left-hand side bar each time a search is conducted. If this is not done, searches will be conducted in general PubMed. Clicking on the "filter table" option within Clinical Queries shows how each filter is interpreted in PubMed query language.

It is best to start with the specificity emphasis when initiating a new search and then add terms if not enough articles are found. Once search terms are entered into the query box on PubMed and "go" is clicked, the search engine will display the results, showing the search terms that were entered combined with the methodological filter terms applied by the search engine. Below the query box is the features bar, which provides access to additional search options. The PubMed query box and features bar are available from every screen except the Clinical Queries home page. Return to the Clinical Queries home page each time a new Clinical Queries search is desired.



Entering search terms can be done in a few ways. When one or more terms are entered in the query box (e.g., vitamin c AND common cold), the terms are searched in various fields of the record. The Boolean operators AND, OR, and NOT must be in upper-case (e.g., vitamin c OR zinc). The truncation or wildcard symbol (*) tells PubMed to search for the first 600 variations of the truncated term. If a truncated term (e.g., staph*) produces more than 600 variations, PubMed displays a warning message such as "Wildcard search for 'staph*' used only the first 600 variations. Lengthen the root word to search for all endings." Use caution when applying truncation in PubMed, because it turns off the automatic term mapping and automatic MeSH-term explosion features, resulting in an incomplete retrieval. As a rule of thumb, the wildcard symbol is best used as a last resort in PubMed.

Fig. 5.3 PubMed "clinical queries." (National Library of Medicine. Used with permission.)

Limits

The features bar consists of limits, preview/index, history, clipboard, and details. To limit a search, click "limits" on the features bar, which opens the Limits window shown in Fig. 5.4. This offers a number of useful ways of reducing the number of retrieved articles. A search can be restricted to words in a particular field within a citation, a specific age group or gender, human or animal studies, articles published with abstracts or in a specific language, or a specific publication type (e.g., meta-analysis or RCT). Limiting by publication type is especially useful when searching for evidence-based studies.

Fig. 5.4 The "limits" window in PubMed. (National Library of Medicine. Used with permission.)

Another method of limiting searches is by either the Entrez date or the publication date of a study. The "Entrez date" is the date the citation was entered into the MEDLINE system; the publication date is the month and year it was published. Finally, the subset pull-down menu allows retrieval to be limited to a specific subset of citations within PubMed, such as AIDS-related citations. Applying limits to a search places a check in the box next to "limits," and a listing of the limit selections will be displayed. To turn off the existing limits, remove the check before running the next search.



History

PubMed will retain an entire search strategy with its results, which can be viewed by clicking on the "history" function on the features bar. This is only available after running a search, and it will list and number the searches in the order in which they were run. As shown in Fig. 5.5, the history displays the search number, the search query, the time of the search, and the number of citations in the results. Searches can be combined, or additional terms added to an existing search, by using the number (#) sign before the search number: e.g., #1 AND #2, or #1 AND (drug therapy OR diet therapy). Once a revised search strategy has been entered in the query box, clicking "go" will display the search results. Clicking "clear history" will remove all searches from the history and preview/index screens. The maximum number of queries held in the history is 100; once that number is reached, PubMed will remove the oldest search from the history to add the most recent one. The search history will be lost after 1 hour of inactivity on PubMed. PubMed will move a search statement number to the top of the history if a new search is the same as a previous one. The preview/index screen allows search terms to be entered one at a time using pre-selected search fields, making it useful for finding specific references.

Fig. 5.5 The history of a search in PubMed. (National Library of Medicine. Used with permission.)



Clipboard

The clipboard is a place to collect selected citations from one or several searches to print or save for future use. Up to 500 items can be placed in the clipboard at any time. After adding items to the clipboard, click on "clipboard" on the features bar to view the saved selections. Citations in the clipboard are displayed in the order in which they were added. To place an item in the clipboard, click on the check-box to the left of the citation, go to the send menu, select "clipboard," and then click "send." Once a citation has been added to the clipboard, the record-number color will change to green. If you send to the clipboard without selecting citations, PubMed will add up to 500 citations from the search results. Clipboard items are automatically removed after eight hours of inactivity.

Printing and saving

When ready to save or print clipboard items, it is best to change them to ordinary text; this simplifies the printout and saves paper, since the PubMed motifs and icons will not be printed. To do this, click on "clipboard" on the features bar, which will show only the articles placed on the clipboard. From the send menu select "text," and a new page will be displayed that resembles an ordinary text document for printing. This "send to text" option can also be used for single references and will omit all the graphics. To save the entire set of search results, use the display pull-down menu to select the desired format and then select "send to file" from the send menu. To save specific citations, click on the check-box to the left of each citation, including those on other pages of the retrieval, and when finished making all of the desired selections, select "send to file."

To save an entire search to review or update at a later time, it is best to create a free "My NCBI" account. "My NCBI" is a place where current searches can be saved, reviewed, and updated. It can also be used to set up e-mail alerts, apply filters, and use other customization features. Unlike the clipboard, searches in "My NCBI" are saved permanently and will not be removed unless you choose to delete them.

General searching in PubMed

The general search page in PubMed is useful for finding evidence that does not come up in a Clinical Queries search, or when looking for multiple papers by a single author who has written extensively in one area of interest. Begin by clicking on the PubMed symbol in the top left-hand corner of the screen to display the general search screen (Fig. 5.6). Simply type the search terms in the query box and the search results will be displayed as before. If too many articles are found, apply limits; if too few, add other search terms using the OR operator.

Fig. 5.6 General search screen in PubMed. (National Library of Medicine. Used with permission.)



MeSH terms to assist in searching

In looking for synonyms to broaden or improve a search, consider using both text words and keywords (index terms) in the database. One of MEDLINE's great strengths is its MeSH (Medical Subject Headings) system. By default, PubMed automatically "maps" the search terms to the appropriate MeSH terms. A specific MeSH search can also be performed by clicking on the "MeSH database" link in the left-hand side bar. Typing in "colorectal cancer" will lead to the MeSH term colorectal neoplasms (Fig. 5.7). The search can then be refined by clicking on the term to bring up the detailed display (Fig. 5.8). This allows the selection of subheadings (diagnosis, etiology, therapy, etc.) to narrow the search, and also gives access to the MeSH tree structure.

The "explode" (exp) feature will capture an entire subtree of MeSH terms with a single word. For the search term colorectal neoplasms, the explode incorporates the entire MeSH tree below colorectal neoplasms (Table 5.1). Click on any specific term in the tree to search that term, and the program will get all the descriptors for that MeSH term and all those under it. Select the appropriate MeSH term, with or without subheadings and with or without explosion, and use the send menu to "send to search box." These search terms will appear in the query box at the top of the screen. Clicking "search PubMed" will execute the search, which will automatically explode the term unless this is restricted by selecting the "do not explode this term" box. Every article has been indexed by at least one of the MeSH keywords from the tree. To see the difference that exploding a MeSH term makes, repeat the search using the term colorectal neoplasms in the search window without exploding. This will probably retrieve about one-quarter of the articles found by the previous search.

Table 5.1. A MeSH tree containing the term colorectal neoplasms (indentation shows the hierarchy)

Neoplasms
  Neoplasms by Site
    Digestive System Neoplasms
      Gastrointestinal Neoplasms
        Intestinal Neoplasms
          Colorectal Neoplasms
            Colonic Neoplasms
              Colonic Polyps +
              Sigmoid Neoplasms
            Colorectal Neoplasms, Hereditary Nonpolyposis
            Rectal Neoplasms
              Anus Neoplasms +

Fig. 5.7 PubMed MeSH database. (National Library of Medicine. Used with permission.)

Fig. 5.8 PubMed MeSH database with subheadings. (National Library of Medicine. Used with permission.)
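This exploded-versus-unexploded comparison can also be typed straight into the query box. Assuming PubMed's standard MeSH field tags, [mh] for an exploded heading and [mh:noexp] to suppress explosion (check the current help documentation, as tag syntax can change), the two searches are:

    colorectal neoplasms[mh]          exploded: the heading plus every narrower term in the tree
    colorectal neoplasms[mh:noexp]    not exploded: the heading alone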

Novice users of PubMed often ask, "How do I find out the MeSH keywords that have been used to categorize a paper?" Knowing the relevant MeSH keywords will help to focus and refine the search. A simple way to do this, once a relevant citation has been found, is to click on the author link to view the abstract and then open the "display" box, as shown in Fig. 5.9. Select MEDLINE and click "display." The record is now displayed as it is indexed, and scrolling down will reveal the MeSH terms for the paper; the initials MH precede each of them. Linking to "related articles" will find other relevant citations, but the selected limits are not applied to this retrieval. If a search was limited to English-language articles only, then selecting the related articles link will also retrieve articles that appear in other languages.

Fig. 5.9 The "display" menu in PubMed. (National Library of Medicine. Used with permission.)

While the MeSH system is useful, it should supplement rather than usurp the use of text words, so that incompletely coded articles are not missed. PubMed is a compilation of a number of databases, not just MEDLINE, and includes newer articles that have not yet been indexed for MEDLINE.

Methodological terms and filters

MeSH terms cover not only subject content but also a number of useful terms on study methodology. For example, when looking for the answer to a question of therapy, note that many randomized trials are tagged in MEDLINE with the specific methodological term randomized controlled trial or clinical trial. These can be selected in PubMed by limiting the search to one study-design type, using the limit feature for publication types in the pull-down menu.

An appropriate methodological filter may help confine the retrieved studies to primary research. For example, if searching for whether a screening intervention reduces mortality from colorectal cancer, confine the retrieved studies to controlled trials. The idea of methodological terms as filters may be extended to multiple terms that attempt to identify particular study types. Such terms are used extensively in the Clinical Queries search functions.

Note that many studies do not have the appropriate methodological tag. The Cochrane Collaboration and the US National Library of Medicine (NLM) are working on correctly retagging all the controlled trials, but this is not being done for other study types.

Field searching

It is possible to shorten the search time by searching in a specific field. This works well if there is a recent article by a particular author renowned for work in the area of interest, or if a relevant study on the same topic has recently been published in a particular journal held by the library. Searching in specific fields will prove invaluable in these circumstances. To search for an article with "colorectal cancer" in the title using PubMed, select the title field from the fields pull-down menu in the limits option, which sets the tag for the search term. Another option is simply to type "colorectal cancer[ti]" in the query box. As with truncation, this turns off the automatic mapping and exploding features, and so will not retrieve articles with the words "colorectal neoplasms" in the title. The field-label abbreviations can be found by accessing the help menu. The most commonly used field labels are abstract (ab), title (ti), source (so), journal (jn), and author (au). The difference between source and journal is that "source" is the abbreviated version of the journal title, while "journal" is the full journal title. In PubMed, the journal or the author can be selected simply by using the journals database located on the left-hand side bar, or by typing the author's last name and initials in the query box. Remember that when searching using text words, the program searches for those words in any of the available fields. For example, if "death" is one search term, then articles in which "death" is an author's name will be retrieved, as well as those in which it occurs in the title or abstract. Normally this isn't a problem, but it can become one when using wildcard searches.
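A few illustrative field-tagged queries using the abbreviations above (the author name is invented for the example):

    colorectal cancer[ti]              words in the article title only
    smith j[au]                        articles by a hypothetical author, Smith J
    hemoccult[ti] AND screening[ab]    a title word combined with an abstract word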

The Cochrane Library

The Cochrane Library owes its genesis to an astute British epidemiologist and doctor, Archie Cochrane, who is best known for his influential book Effectiveness and Efficiency: Random Reflections on Health Services, published in 1971. In the book, he suggested that because resources would always be limited, they should be used to provide equitably those forms of health care which had been shown in properly designed evaluations to be effective. In particular, he stressed the importance of using evidence from randomized controlled trials (RCTs), because these were likely to provide much more reliable information than other sources of evidence. Cochrane's simple propositions were soon widely recognized as seminally important, by lay people as well as by health professionals. In his 1971 book he wrote: "It is surely a great criticism of our profession that we have not organized a critical summary, by specialty or subspecialty, adapted periodically, of all relevant randomised controlled trials."1

1 A. L. Cochrane. Effectiveness & Efficiency: Random Reflections on Health Services. London: Royal Society of Medicine, 1971.

His challenge led to the establishment of an international collaboration to develop the Oxford Database of Perinatal Trials. In 1987, the year before he died, Cochrane referred to a systematic review of RCTs of care during pregnancy and childbirth as "a real milestone in the history of randomized trials and in the evaluation of care" and suggested that other specialties should copy the methods used.

The Cochrane Collaboration was developed in response to Archie Cochrane's call for systematic and up-to-date reviews of all health-care-related RCTs. His suggestion that the methods used to prepare and maintain reviews of controlled trials in pregnancy and childbirth should be applied more widely was taken up by the Research and Development Programme initiated to support the United Kingdom's National Health Service. Funds were provided to establish a "Cochrane Centre" to collaborate with others, in the United Kingdom and elsewhere, to facilitate systematic reviews of randomized controlled trials across all areas of health care. When the Cochrane Centre was opened in Oxford in October 1992, those involved expressed the hope that there would be a collaborative international response to Cochrane's agenda. This idea was outlined at a meeting organized six months later by the New York Academy of Sciences. In October 1993, at what was to become the first in a series of annual Cochrane Colloquia, 77 people from 11 countries co-founded the Cochrane Collaboration. It is an international organization that aims to help people make well-informed decisions about health care by preparing, maintaining, and ensuring the accessibility of systematic reviews of the effects of health-care interventions.

The Cochrane Library comprises several databases. Each database focuses on a specific type of information and can be searched individually or as a whole. The current databases are:

The Cochrane Database of Systematic Reviews (CDSR) contains systematic reviews of the effects of health care prepared by the Cochrane Collaboration. In addition to complete reviews, the database contains protocols for reviews currently being prepared.

The Database of Abstracts of Reviews of Effects (DARE) includes structured abstracts of systematic reviews that have been critically appraised by reviewers at the NHS Centre for Reviews and Dissemination in York and by other people. DARE is meant to complement the CDSR.

The Cochrane Central Register of Controlled Trials (CENTRAL) is a bibliographic database of controlled trials identified by contributors to the Cochrane Collaboration and others.

The Cochrane Methodology Register focuses on articles, books, and conference proceedings that report on methods used in controlled trials. Bibliographic in nature, the register's contents are culled from both MEDLINE and hand searches.

The Health Technology Assessment Database is a centralized location for finding completed and ongoing health technology assessments that study the implications of health-care interventions around the world. Medical, social, ethical, and economic factors are considered for inclusion.

The NHS Economic Evaluation Database identifies and summarizes economic evaluations throughout the world that have an impact on health-care decision making.

As with MEDLINE, there are various interfaces for searching the Cochrane Library. The interface linked directly from the Cochrane Collaboration's homepage (http://www.cochrane.org) is the Wiley InterScience interface (Fig. 5.10). While it is subscription-based, it is possible to view the abstracts without a subscription. Some countries or regions have subsidized full-text access to the Cochrane Library for their health-care professionals; consult the homepage to see if you live in one of these areas.

Fig. 5.10 The opening page of the Cochrane Collaboration.

The Cochrane Library supports various search techniques. The searcher can opt to search all text or just the record title, author, abstract, keywords, tables, or publication type. The default for a quick search is a combination of title, abstract, and keyword. The advanced search feature allows searching of multiple fields using Boolean operators, and retrieval can also be restricted by product, record status, or date. Searching with MeSH terms is another option, since MeSH descriptors and qualifiers are supported by the search engine, as is the explode feature. The Cochrane Library also supports wildcard searching using the asterisk (*). Once a search is complete, you can opt to save your searches. The "My Profile" feature is similar to "My NCBI": it allows you to store titles, articles, and searches and to set up journal and search e-mail update alerts. There is no cost to register, although some services are fee-based, such as purchasing individual documents online through Pay-Per-View. Always check with your health sciences library before purchasing any information, to ensure that it is not available by another method.

TRIP database

Sometimes it is helpful to start a search in a database whose interface can search numerous resources at once from one query while providing the results in one convenient location. The TRIP database (http://www.tripdatabase.com) was created in 1997 with the intended purpose of providing an evidence-based way of answering clinical questions quickly by reducing the amount of search time needed. Freely available on the Web since 2004, TRIP has developed a systematic, federated searching approach to retrieving information from such resources as various clinical practice guideline databases, Bandolier, InfoPOEMs, Cochrane, Clinical Evidence, and core medical journals. Additionally, each search is performed within PubMed's Clinical Queries service. All potential information sources are reviewed by an in-house team of information experts and clinicians, together with external experts, to assess quality and clinical usefulness prior to inclusion.

The TRIP database has a very straightforward searching interface that supports both basic and advanced techniques. For basic searching, the search terms are entered into a search box. TRIP supports Boolean searching as well as the asterisk (*) for truncation. Phrase searching is supported by using quotation marks, such as "myocardial infarction." There is also a misspelling function that activates automatically if no results are found. The advanced search allows searching by title, or by title and text. These results are assigned search numbers (#1) and can be combined using Boolean operators (#1 AND #2). Results can be sorted by relevance or year prior to conducting the search. Once the search has been run, the results can be sorted further by selecting more specialized filters such as systematic reviews, evidence-based synopses, core primary research, and subject specialty. The PubMed Clinical Query results are also provided separately by therapy, diagnosis, etiology, prognosis, and systematic reviews. With a "My Trip" account, a keyword auto-search function can be set up that provides regular clinical updates: any new records that have the selected keyword in the title will automatically be e-mailed.

The advantage of the TRIP database is that more than one evidence-based resource can be searched at a time. The main disadvantage is that, although TRIP uses carefully selected filters to ensure quality retrievals, you lose some of the searching control you would have in the original database. However, in many cases the time saved outweighs this consideration.

Specific point-of-care databases

For information at the point of care, DynaMed, Clinical Evidence, and Essential Evidence Plus are fee-based databases designed to provide quick, evidence-based answers to clinical questions that commonly arise at the bedside. The information is delivered in a compact format that highlights the pertinent information while providing enough background for further research if required.

Developed by a family physician, DynaMed (http://www.ebscohost.com/dynamed/) has grown to provide clinically organized summaries for nearly 2000 medical topics, covering basic information such as etiology, diagnosis and history, complications, prognosis, treatment, prevention and screening, references and guidelines, and patient information. DynaMed uses a seven-step evidence-based methodology to create topic summaries that are organized both alphabetically and by category. The selection process includes daily monitoring of the content of over 500 medical journals and systematic review databases, with a systematic search using such resources as PubMed's Clinical Queries feature, the Cochrane Library databases, and the National Guidelines Clearinghouse. Once this step is complete, relevance and validity are determined and the information is critically appraised. DynaMed uses the Users' Guides to Evidence-Based Practice from the Evidence-Based Medicine Working Group and Centre for Health Evidence as the basis for determining the level of evidence, and it ranks information into three levels: Level 1 (likely reliable), Level 2 (mid-level), and Level 3 (lacking direction). All authors and reviewers of DynaMed topics are required to have some clinical practice experience.


Formerly known as InfoPOEMs with InfoRetriever, Essential Evidence Plus (http://essentialevidenceplus.com/) provides filtered, synopsized, evidence-based information that has also been developed by physicians, including EBM guidelines, topic reviews, POEMs, Derm Expert, decision-support tools and calculators, and ICD-9 codes. Individual topics can be searched, or content can be browsed by subject, database, and tools. At the bedside, POEMs can be invaluable, as they summarize articles by beginning with the "clinical question," followed by the "bottom line," and rounded out with the complete reference, the study design, the setting, and the article synopsis. The bottom line provides the conclusion reached in answer to the clinical question, along with a ranking based on the five levels of evidence from the Centre for Evidence-Based Medicine in Oxford. Sources used to find information for Essential Evidence Plus include EBM Guidelines and abstracts of Cochrane systematic reviews.

Clinical Evidence, a decision-support tool published by the British Medical Journal (BMJ), is available on their website at www.clinicalevidence.org. An international group of peer reviewers publishes summaries of systematic reviews of important clinical questions. These are regularly updated and integrated with various EBM resources to summarize the current state of knowledge and uncertainty about various clinical conditions. It is primarily focused on conditions in internal medicine and surgery, and does cover many newer technologies. The evidence provided is rated as definitely beneficial, probably beneficial, uncertain, probably not beneficial, or definitely not beneficial.

Created in 1999, it has been redesigned and revised by an international advisory board, clinicians, patient support groups, and contributors. They aim for sources that have high relevance and validity and that require little time and effort from the user. Their reviews are transparent and explicit, and they try to show when uncertainty stems from gaps in the best available evidence. Clinical Evidence is currently available in print, online, and through a PDA interface. It is free within the UK National Health Service in Scotland and Wales, to most clinicians in the United States through the United Health Foundation, and in several other countries; the complete list is available on their website. It has been translated into Italian, Spanish, Russian, German, Hungarian, and Portuguese, and is available free to people in developing countries through an initiative sponsored by the BMJ and the World Health Organization.

Efficient searching in point-of-care databases

The searching techniques described in this chapter are designed to find primary studies of medical research. These comprehensive searching processes will best serve the "doer" in answering clinical questions for the purpose of critically reviewing the most current available evidence for that question. The practice-based learner must find sources at the point of care and will not perform comprehensive PubMed searches on a regular basis. They will be looking for pre-appraised sources and well-done meta-analyses such as those produced by the Cochrane Collaboration. Most clinicians will want to do the most efficient searching possible at the point of care to aid the patient sitting in front of them. An increasing number of sites on the Internet are available for this point-of-care searching.

Fig. 5.11 The Haynes 5S knowledge acquisition pyramid. From top to bottom: Systems (computerized decision support, guidelines); Summaries (evidence-based textbooks, e.g., DynaMed); Synopses (evidence-based journal abstracts of single articles, e.g., ACP Journal Club); Syntheses (systematic reviews or meta-analyses, e.g., the Cochrane Collaboration); Studies (original journal articles and clinical research, found in PubMed).

David Slawson and Allen Shaughnessy proposed an equation to determine the usefulness of evidence (or information) to practicing clinicians. They described usefulness as equal to relevance times validity, divided by the effort needed to obtain the information. Always turning to primary sources of evidence whenever a clinical question comes up is inefficient at best and impossible for most busy practitioners. The busy clinician in need of rapid access to the most current literature requires quick access to high-quality, pre-appraised, and summarized sources of evidence that can be accessed during a patient visit.
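Restated as a formula, the Slawson–Shaughnessy model is:

    Usefulness = (Relevance × Validity) / Work

where Work is the time and effort needed to obtain the information. The formula makes the trade-off explicit: even highly relevant and valid evidence is of little practical use at the bedside if the work of retrieving it is too great.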

For the "users," the 5S schema of Haynes is a construct to help focus the skills of information mastery, a process of helping to find the best evidence at the point of care. The sources higher up on the appraisal pyramid (Fig. 5.11) are the easiest to use and need the least amount of critical appraisal by the user.

The highest level is that of systems: decision-support tools integrated into the daily practice of medicine through mechanisms such as computerized order-entry systems or electronic medical records. The system links directly to the high-quality information needed at the point of care and is seamlessly integrated into the care process. Very few systems have been developed, and most of the current ones are standalone applications set up by an institution within its electronic medical record or IT system.

The next level is summaries, which are critically appraised topics, guidelines, and evidence-based textbooks. Many of these come through publishing enterprises such as Clinical Evidence, published by the British Medical Journal. This resource summarizes the best available evidence on prevention and treatment interventions for commonly encountered clinical problems in internal medicine. Evidence is presented as being beneficial, equivocal, or not beneficial.

The third level is synopses of critically appraised individual studies. These might be found in CAT banks such as Best Bets and Evidence Based On Call. Finding an answer here is a matter of trial and error.

The fourth level is syntheses, a term synonymous with systematic reviews. The primary ones in this category are from the Cochrane Database of Systematic Reviews, described earlier in this chapter as a database of systematic reviews authored and updated by the worldwide Cochrane Collaboration. The Database of Abstracts of Reviews of Effects (DARE) is a database of non-Cochrane systematic reviews catalogued by the Centre for Reviews and Dissemination at the University of York in the United Kingdom and presented with critical reviews.

The fifth level is individual studies: original research studies found through Ovid MEDLINE or PubMed. The Cochrane Central Register of Controlled Trials, a repository of randomized controlled trials, is a more comprehensive source for RCTs than MEDLINE, since it includes meeting abstracts and unique EMBASE records. Finally, the lowest level is "expert opinion," or replication, which is not considered bona fide evidence, but only anecdote or unsubstantiated evidence. Included in this level are textbooks that are not explicitly evidence-based.

Final thoughts

Conducting a thorough search can be a daunting task, and the process of identifying papers is an iterative one. It is best initially to devise a strategy on paper. No matter how thorough a search strategy is, inevitably some resources will be missed and the process will need to be repeated and refined. Use the results of an initial search to retrieve relevant papers; these can then be used to refine the searches further, both by searching the bibliographies of the relevant papers for articles missed by the initial search and by performing a citation search using either the Scopus or Web of Science databases. These identify papers that have cited the relevant studies already found, some of which may be subsequent primary research. Such "missed" papers are invaluable, and studying the MeSH terms they carry provides clues on how the search may be broadened to capture further papers. Google, Google Scholar, and PogoFrog (www.pogofrog.com) can also be used not only to find information but to help design a strategy that can then be executed within a more specialized database such as PubMed or the Cochrane Library. The whole procedure may then be repeated using the new terms identified. This iterative process is sometimes referred to as "snowballing."

Searching for evidence can be time-consuming, but more and more database providers are developing search engines and features designed to find reliable, valid, and relevant information quickly and efficiently. Podcasts, RSS feeds, and alerts are just a few of the advances that demonstrate how technology is continually improving the access and delivery of information to the office as well as the bedside. Always remember that if the information isn't found in the first source consulted, there are myriad options available to the searcher. Finally, the new reliance on electronic searching methods has increased the role of the health sciences librarian, who can provide guidance and assistance in the searching process and should be consulted early on. Databases and websites are updated frequently, and it is the librarian's role to maintain competency in expert searching techniques to help with the most difficult searching challenges.


6

Study design and strength of evidence

Louis Pasteur's theory of germs is ridiculous fiction.
Pierre Pachet, Professor of Physiology, Toulouse University, 1872

Learning objectives

In this chapter you will learn:
  the unique characteristics, strengths, and weaknesses of common clinical research study designs:
    descriptive – cross-sectional, case reports, case series
    timed – prospective, retrospective
    longitudinal – observational (case–control, cohort, non-concurrent cohort), interventional (clinical trial)
  the levels of evidence and how study design affects the strength of evidence.

There are many types of research studies. Since various research study designs accomplish different goals, not all studies will be able to show the same thing. Therefore, the first step in assessing the validity of a research study is to determine the study design. Each study design has inherent strengths and weaknesses, and the ability to prove causation and the potential biases to be expected will largely be determined by the design of the study.

Identify the study design

When critically appraising a research study, you must first understand what different research study designs are able to accomplish. The design of the study will suggest the potential biases you can expect. There are two easily recognizable basic categories of studies: descriptive and longitudinal. We will discuss each type and its subtypes.




One classification commonly used to characterize longitudinal clinical research studies is by the direction of the study in time. Studies characterized in this manner, so-called timed studies, have traditionally been divided into prospective and retrospective designs. Prospective studies begin at a point in time and follow subjects forward from that point; retrospective studies begin at the present time and look back on the past behavior or other characteristics of the subjects. These terms can easily be used incorrectly and misapplied, and because of this they should be used only as generalizations. As we will see later in this chapter, "retrospective" studies can be of several types and should be identified by the specific type of study rather than the general term.

Descriptive studies

Descriptive studies are records of events, including studies that look at a series of cases or at a cross-section of a population for particular characteristics. These are often used after several cases are reported in which a novel treatment yields promising results and the authors publishing the data want other physicians to know about the therapy. Case reports describe individual patients, and case series describe accounts of an illness or treatment in a small group of patients. In cross-sectional studies, the interesting aspects of a group of patients, including potential causes and effects, are all observed at the same time.

Case reports and case series

Case reports of single patients or small numbers of cases are often the first description of a new disease, clinical sign, symptom, treatment, or diagnostic test. They can also describe a curriculum, operation, patient-care strategy, or other health-care process. Some case reports can alert physicians to a new disease that is about to become very important. For example, AIDS was initially identified when the first cases were reported in two case series in 1981. One series consisted of two groups of previously healthy homosexual men with Pneumocystis carinii pneumonia, a rare type of pneumonia. The other was a series of men with Kaposi's sarcoma, a rare cancer. These diseases had previously been reported only in people who were known to be immunocompromised. This was the start of the AIDS epidemic, a fact that was not evident from these first two reports but quickly became evident as more clinicians noticed cases of these rare diseases.

Since most case reports are descriptions of rare diseases or rare presentations of common diseases, they are unlikely to occur again very soon, if ever. A recent case series reported on two cases of stroke in young people related to illicit methamphetamine use. To date, physicians have not been deluged with a rash of young methamphetamine users with strokes; although the association makes pathophysiological sense, it may only be a fluke. Therefore, case reports are a useful venue for reporting unusual symptoms of a common illness, but they have limited value. New treatments or tests described in a study without any control group also fall under the category of case reports and case series. At best, these descriptive studies can suggest future directions for research on the treatment or test being reported.

Case studies and cross-sectional studies have certain strengths. They are cheap and relatively easy to do with existing medical records, and potential clinical material is plentiful: if you see new presentations of disease or interesting cases, you can easily write a case report. However, their weaknesses outweigh their strengths. These studies do not provide explanations and cannot show an association between cause and effect; therefore, they do not provide much useful evidence! Since no comparison is made to any control group, contributory cause cannot be proven. A good general rule for case studies is to "take them seriously and then ignore them." By this it is meant that you should never change your practice based solely on a single case study or series, since the probability of seeing the same rare presentation or rare disease is quite remote.

There is one situation in which a case series may be useful. Called the "all-or-none case series," this occurs when there is a very dramatic change in the outcome of patients reported in a case series. There are two ways this can occur. First, all patients died before the treatment became available, and some in the case series given the treatment survive. Second, some patients died before the treatment became available, but none in the case series given the treatment die. This all-or-none idea is roughly what happened when penicillin was first introduced. Prior to that time, most patients with pneumonia died of their illness; when penicillin was first given to patients with pneumonia, most of them lived. The credibility of these all-or-none case reports depends on the number of cases reported, the relative severity of the illness, and the accuracy and detail of the case descriptions given in the report.

The case series can be abused. It can be likened to a popular commercial for Life cereal from the 1970s. In the scene, two children are unsure whether they will like the new cereal, so they ask their little brother, Mikey, to try it. He liked it, and they both decided that since "Mikey liked it!" they would like it too. Too often, a series of cases is presented showing apparent improvement in the condition of several patients, which is then attributed to a particular therapy, and the authors conclude that this means it should be used as a new standard of care. The fact that everyone got better is not proof that the therapy or intervention in question is causative. This is called the "Mikey liked it" phenomenon.1

1 This construct is attributed to J. Hoffman, Emergency Medical Abstracts, 2000.



Cross-sectional studies

Cross-sectional studies are descriptive studies that look at a sample of a population to see how many people in that population are afflicted with a particular disease and how many have a particular risk factor. Cross-sectional studies record events and observations, describing diseases, causes, outcomes, effects, or risk factors in a single population at a single instant in time.

The strengths of cross-sectional studies are that they are relatively cheap, easy,<br />

and quick to do. The data are usually available through medical records or statistical<br />

databases. They are useful initial exploratory studies especially to screen<br />

or classify aspects of disease. They are only capable of demonstrating an association<br />

between the cause and effect. They have no ability to determine the<br />

other elements of contributory cause. In order to draw conclusions from this<br />

study, patient exposure to the risk factor being studied must continue until the<br />

outcome occurs. If the exposure began long before the outcome occurs and<br />

is intermittent, it will be more difficult to associate the two. If done properly,<br />

cross-sectional studies are capable of calculating the prevalence of disease in<br />

the population. Prevalence is the percentage of people in the population with<br />

the outcome of interest at any point in time. Since all the cases are looked at in<br />

one instant of time, cross-sectional studies cannot calculate incidence, the rate<br />

of appearance of new cases over time. Another strength of cross-sectional studies<br />

is that they are ideal study designs for studying the operating characteristics<br />

of diagnostic tests. We compare the test being studied to the “gold standard” test<br />

in a cross-section of patients for whom the test might be used.<br />
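The prevalence calculation is simple enough to show directly. Below is a minimal sketch, not from the text, using invented survey records; the field names are hypothetical.

```python
# Minimal sketch: prevalence from a cross-sectional sample (invented data).
# Each record is one person surveyed at a single point in time.
sample = [
    {"smoker": True,  "has_disease": True},
    {"smoker": True,  "has_disease": False},
    {"smoker": False, "has_disease": False},
    {"smoker": False, "has_disease": True},
    {"smoker": False, "has_disease": False},
]

# Prevalence: the proportion of the sampled population with the outcome now.
prevalence = sum(p["has_disease"] for p in sample) / len(sample)
print(f"Prevalence: {prevalence:.0%}")  # 40% in this toy sample

# Note what this snapshot cannot give us: incidence (new cases per unit
# time) would require following disease-free people forward in time.
```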

The trade-off for the ease of this type of study is that the rules of cause and effect for contributory cause cannot be fulfilled. Since the risk factor and outcome are measured at the same time, you cannot be certain which is the cause and which is the effect. A cross-sectional study found that teenagers who smoked early in life were more likely to become anxious and depressed as adults than those who began smoking at a later age. Does teenage smoking cause anxiety and depression in later years, or are those who have subclinical anxiety or depression more likely to smoke at an early age? It is impossible to tell if the cause preceded the effect, the effect was responsible for the cause, or both are related to an unknown third factor called a confounding or surrogate variable. Confounding or surrogate variables are more likely to apply if the time from the cause to the effect is short. For example, it is very common for people to visit their doctor just before their death. The visit to the doctor is not a risk factor for death but is a “surrogate” marker for severe and potentially life-threatening illness. These patients visit their doctors for symptoms associated with their impending deaths.

Cross-sectional studies are subject to prevalence–incidence bias. Prevalence–incidence bias is a situation in which the element that seems to cause an outcome is really an effect of, or merely associated with, that outcome. This occurs when a risk factor is strongly associated with a disease and is thought to occur before the disease occurs. Thus the risk factor appears to cause the disease when in reality it simply affects the duration or prognosis of the disease. An association was noted between the HLA-A2 antigen and the presence of acute lymphocytic leukemia in children in a cross-sectional study. It was assumed to be a risk factor for occurrence of the disease. Subsequent studies found that long-term survivors had the HLA-A2 antigen and that its absence was associated with early mortality. The antigen was not a risk factor for the disease but an indicator of good prognosis.

Longitudinal studies

Longitudinal study is a catchall term describing either observations or interventions made over a given period of time. There are three basic longitudinal study designs: case–control studies, cohort studies, and clinical trials. These are analytic or inferential studies, meaning that they look for a statistical association between risk factors and outcomes.

Case–control studies

These studies were previously called retrospective studies, but looking at data in hindsight is not the only attribute of a case–control study. There is another unique feature that should be used to identify a case–control study: the subjects are initially selected because they either have the outcome of interest – cases – or do not have the outcome of interest – controls. They are grouped at the start of the study by the presence or absence of the outcome, or in other words, are grouped as either cases or controls. This type of study is good for screening for potential risk factors of disease by reviewing elements that occurred in the past and comparing them between the outcome groups. The ratio between cases and controls is arbitrarily set rather than reflecting their true ratio in the general population. The study then examines the odds of exposure to the risk factor among the cases and compares this to the odds of exposure among the controls. Figure 6.1 is a schematic description of a case–control study.
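That comparison of odds is easy to make concrete. The following is a minimal sketch with invented counts, not data from the text; the comparison is commonly summarized as an odds ratio.

```python
# Minimal sketch of the case–control comparison (invented counts).
# Subjects are grouped by outcome first; we then look back for exposure
# to the risk factor in each group.
exposed_cases, unexposed_cases = 30, 70        # 100 cases (outcome present)
exposed_controls, unexposed_controls = 10, 90  # 100 controls (outcome absent)

odds_exposure_cases = exposed_cases / unexposed_cases
odds_exposure_controls = exposed_controls / unexposed_controls

# The two odds are usually compared as an odds ratio.
odds_ratio = odds_exposure_cases / odds_exposure_controls
print(f"Odds of exposure, cases:    {odds_exposure_cases:.2f}")
print(f"Odds of exposure, controls: {odds_exposure_controls:.2f}")
print(f"Odds ratio: {odds_ratio:.1f}")  # about 3.9 here

# Note: because the case:control ratio is preset by the researchers,
# neither prevalence nor incidence can be recovered from these counts.
```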

The strengths of case–control studies are that they are relatively easy, cheap, and quick to do from previously available data. They can be done using current patients and asking them about events that occurred in the past. They are well suited for studying rare diseases, since the study begins with subjects who already have the outcome. Each case patient may then be matched up with one or more suitable control patients. Ideally the controls are as similar to the cases as possible except for the outcome, and then their degree of exposure to the risk factor of interest can be calculated. Case–control studies are good exploratory studies and can look at many risk factors for one outcome. The results can then be used to suggest new hypotheses for a later study with a stronger research study design, such as a cohort study or clinical trial.

[Fig. 6.1 Schematic diagram of a case–control study: starting in the present, subjects are grouped as cases or controls, and information gathered from chart reviews or questionnaires looks back in time for the presence or absence of the risk factor.]

Unfortunately, there are many potentially serious weaknesses in case–control studies which, in general, make them only fair sources of evidence. Since the data are collected retrospectively, data quality may be poor. Data often come from a careful search of the medical records of the cases and controls. The advantage of these records being easily available is counteracted by their questionable reliability. These studies rely on subjective descriptions to determine exposure and outcome, and on the subjective standards of the record reviewers to determine the presence of the cause and effect. This is called implicit review of the medical records. Implicit review of charts introduces the researcher’s bias in interpreting the measurements or outcomes. Stronger case–control studies will use explicit reviews. An explicit review only uses clearly objective measures in reviews of medical charts, or the chart material is reviewed in a blinded manner using previously determined outcome descriptors. These chart reviews are better but are more difficult to perform.

When a patient is asked to remember something about a medical condition that occurred in the past, their memory is subject to recall or reporting bias. Recall or reporting bias occurs because those with the disease are more likely to recall exposure to many risk factors simply because they have the disease. Another problem is that subjects in the sample may not be representative of all patients with the outcome. This is called sampling or referral bias and commonly occurs in studies done at specialized referral centers. These referred patients may be different from those seen in a primary-care practice, and often in referral centers only the most severe cases of a given disorder will be seen, thus limiting the generalizability of the findings.

When determining which of many potential risk factors is associated with an outcome using a case–control study, a derivation set is developed. A derivation set is the initial series of results of a study. The results of the derivation set should be used cautiously, since any association discovered may have turned up by chance alone. The study can then be repeated using a cohort study design to look at those factors that have the highest correlation with the outcome in question to see if the association still holds. This is called a validation set and has greater generalizability to the population.

Other factors to be aware of when dealing with case–control studies are that they can only study one disease or outcome at a given time. Also, prevalence or incidence cannot be calculated, since the ratio of cases to controls is preselected by the researchers. In addition, they cannot prove contributory cause, since they cannot show that altering the cause will alter the effect, and the study itself cannot show that the cause preceded the effect. Oftentimes, researchers and clinicians can extrapolate the cause and effect from knowledge of biology or physiology.

Cohort studies

These were previously called prospective studies, since they usually follow subjects forward in time, from exposure to outcome. The name comes from the Latin cohors, meaning a tenth of a legion marching together in time. However, they can be, and now frequently are, done retrospectively, in which case they are called non-concurrent cohort studies. The cohort is a group of patients who are selected based on the presence or absence of the risk factor (Fig. 6.2). They are followed in time to determine which of them will develop the outcome or disease. The probability of developing the outcome is the incidence or risk, and can be calculated for each group. The degree of risk can then be compared between the two groups.
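As a minimal sketch, with invented counts rather than data from the text, the incidence in each group and the comparison between groups might look like this:

```python
# Minimal sketch of the cohort comparison (invented counts).
# Groups are formed by exposure first, then followed forward for the outcome.
exposed_total, exposed_with_disease = 200, 30
unexposed_total, unexposed_with_disease = 400, 20

# Incidence (risk) of the outcome in each group over the follow-up period.
risk_exposed = exposed_with_disease / exposed_total        # 0.15
risk_unexposed = unexposed_with_disease / unexposed_total  # 0.05

# The degree of risk is compared between groups, e.g., as a relative risk.
relative_risk = risk_exposed / risk_unexposed
print(f"Risk if exposed:   {risk_exposed:.0%}")
print(f"Risk if unexposed: {risk_unexposed:.0%}")
print(f"Relative risk: {relative_risk:.1f}")  # 3.0
```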

The cohort study can be one of the strongest research study designs. Cohort studies are powerful studies that can determine the incidence of disease and are able to show that the cause is associated with the effect more often than by chance alone. They can also show that the cause preceded the effect. However, they do not attempt to manipulate the cause and so cannot prove that altering the cause alters the effect. Cohort studies are an ideal study design for answering questions of etiology, harm, or prognosis, as they collect the data in an objective and uniform fashion. The investigators can predetermine the entry criteria, what measurements are to be made, and how they are best made. As a result, there is usually no recall bias, except as a possibility in non-concurrent cohort studies where the researchers are asking for subjective information from the study subjects.

[Fig. 6.2 Schematic diagram of a cohort study: subjects exposed to the risk factor and controls not exposed are grouped at the start of the study and followed forward in time, with information gathered using a uniform data-gathering protocol, until each does or does not develop the disease.]

The main weakness of cohort studies is that they are expensive in time and money. The startup and ongoing monitoring costs may be prohibitive. This is a greater problem when studying rare or uncommon diseases, as it may be difficult to get enough patients to find clinically or statistically significant differences between the patients who are exposed and those not exposed to the risk factor. Since the cohort must be set up prospectively by the presence or absence of the risk factor, cohort studies are not good studies for uncovering new risk factors.

Confounding variables are factors affecting both the risk factor and the outcome. They may affect the exposed and unexposed groups differently and lead to a bias in the conclusions. There are often reasons why patients are exposed to the risk factor that may lead to differences in the outcome. For example, patients may be selected for a particular therapy, the risk factor in this case, because they are sicker or less sick, which then causes the differences in the outcomes that result.

Patients who leave the study, called patient attrition, can cause loss of data about the outcomes. The cause of their attrition from the study may be directly related to some conditions of the study. Therefore, it is imperative for researchers to account for all patients. In practice, an acceptable level of attrition is less than 20%. However, this should be used as a guide rather than an absolute value: a level of attrition lower than 20% may still bias the study if the reason patients were lost from the study is related to the risk factor. In long-running studies, patients may change some aspect of their behavior or exposure to the risk factor after the initial grouping of subjects, leading to misclassification bias. Safeguards to prevent these issues should be clearly outlined in the methods section of the study.

A special case of the cohort study, the non-concurrent cohort study, is also called a database study. It is essentially a cohort study that begins in the present and utilizes data on events that took place in the past. The cohort is still separated by the presence or absence of the risk factor that is being studied, although this risk factor is usually not the original reason that patients were entered into the study. Non-concurrent cohort studies are not retrospective studies, but have been called “retrospective cohort studies” in the past. They have essentially the same strengths and weaknesses as cohort studies, but are more dependent on the quality of the recorded data from the past.

In a typical non-concurrent cohort study design, a cohort is put together in the past and many baseline measurements are made. The follow-up measurements and determination of the original outcomes are made when the data are finally analyzed at the end of the study. The data will then be used for another, later study and analyzed for a new risk factor other than the one for which the original study was done. For example, a cohort of patients with trauma due to motor-vehicle accidents is collected to look at the relationship of wearing seat belts to death. After the data are collected, the same group of patients is looked at to see if there is any relationship between severe head injury and the wearing of seat belts. Both data elements were collected as part of the original study.

In general, for a non-concurrent cohort study, the data are available from databases that have already been set up. The data should be gathered in an objective manner, or at least without regard for the association that is being sought. Data gatherers are ideally blinded to the outcomes. Since non-concurrent cohort studies rely on historical data, they may suffer some of the weaknesses associated with case–control studies regarding recall bias, the lack of uniformity of data recorded in the database, and subjective interpretation of records.

To review

Subjects in case–control studies are initially grouped according to the presence or absence of the outcome, and the ratio between cases and controls is arbitrary and not reflective of their true ratio in the population.

Subjects in cohort studies are initially grouped according to the presence or absence of risk factors, regardless of whether the group was assembled in the past or the present.

Clinical trials

A clinical trial is a cohort study in which the investigator intervenes by manipulating the presence or absence of the risk factor, usually a therapeutic maneuver.


[Fig. 6.3 Schematic diagram of a randomized clinical trial: subjects eligible for the study (meeting inclusion and exclusion criteria) are randomized, with blinding, to the intervention or the comparison; each group is then followed from the start of the study to the end for the hypothesized outcome or the alternate outcome.]

Clinical trials are human experiments, also called interventional studies. Traditional cohort and case–control studies are observational studies in which there is no intentional intervention. An example of a clinical trial is a study in which a high-soy-protein diet or a normal diet was given to middle-aged male smokers to determine if the high-soy-protein diet reduced their risk of developing diabetes. The diet is the intervention. A cohort study of the same “risk factor” would look at a group of middle-aged male smokers, see which of them ate a high-soy-protein diet, and then follow them for a period of time to determine their rates of developing diabetes. Figure 6.3 is a schematic diagram of a randomized clinical trial.

Clinical trials are characterized by the presence of a control group that is identical to the experimental patients in every way except for their exposure to the intervention being studied. Patients entering controlled clinical trials should be randomized, meaning that all patients signed up for the trial have an equal chance of being placed in either the control group (also called the comparison group, placebo group, or standardized-therapy group) or the experimental group, which gets the intervention being tested. Subjects and experimenters should ideally be blinded to the therapy and group assignment during the study, such that the experimenters and subjects are unaware whether the patient is in the control or experimental group, and are thus unaware whether they are receiving the experimental treatment or the comparison treatment.

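Randomization itself is mechanically simple; allocation concealment and blinding are what take the work. Below is a minimal sketch; the subject IDs and the simple 50/50 scheme are invented for illustration, and real trials use prepared randomization lists.

```python
# Minimal sketch of simple randomization (hypothetical subject IDs).
import random

subjects = [f"subject-{i:03d}" for i in range(1, 11)]

rng = random.Random(42)  # fixed seed so the allocation is reproducible
assignment = {s: rng.choice(["experimental", "control"]) for s in subjects}

# Under blinding, the treating clinician and the subject would see only a
# coded label (e.g., drug kit "A" or "B"), not the group name printed here.
for subject, group in assignment.items():
    print(subject, "->", group)
```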



Clinical trials are the only study design that can fulfill all the rules of contributory cause. They can show that the cause and effect are associated more often than by chance alone, that the cause precedes the effect, and that altering the cause alters the effect. When properly carried out, they will have fewer methodological biases than any other study design.

However, they are far from perfect. The most common weakness of controlled clinical trials is that they are very expensive. Because of the high costs, multicenter trials that rely on cooperation between many research centers and are funded by industry or government are becoming more common. Unfortunately, the high cost of these studies has resulted in more of them being paid for by large biomedical (pharmaceutical or technology) companies, and as a result, the design of these studies could favor the outcome that is desired by the sponsoring agency. This could represent a conflict of interest for the researcher, whose salary and research support depend on the largess of the company providing the money. Other factors that may compromise the research results are patient attrition and patient compliance.

There may be ethical problems when the study involves giving potentially harmful, or withholding potentially beneficial, therapy. The Institutional Review Boards (IRBs) of the institutions doing the research should address these. A poorly designed study should not be considered ethical by the IRB. However, just because the IRB approves a study doesn’t mean that the reader should not critically read it. It is still the reader’s responsibility to determine how valid a study is based upon the methodology. In addition, the fact that a study is a randomized controlled trial does not in itself guarantee validity, and there can still be serious methodological problems that will bias the results.


7

Instruments and measurements: precision and validity

Not everything that can be counted counts, and not everything that counts can be counted.

Albert Einstein (1879–1955)

Learning objectives

In this chapter you will learn:
- different types of data as basic elements of descriptive statistics
- instrumentation and measurement
- precision, accuracy, reliability, and validity
- how researchers should optimize these factors

All clinical research studies involve observations and measurements of the phenomena of interest. Observations and measurements are the desired output of a study. The instruments used to make them are subject to error, which may bias the results of a study. The first thing we will discuss is the type of data that can be generated from clinical research. This chapter will then introduce concepts related to instruments and measurements.

Types of data and variables

There are several different ways of classifying data. They can be classified by their function as independent or dependent variables; by their nature as nominal, ordinal, interval, or ratio variables; and by whether they are continuous, discrete, or dichotomous variables.

When classifying variables by function, we want to know what the variable does in the experiment. Is it the cause or the effect? In most clinical trials one variable is held constant relative to the other. The independent variable is under the control of, or can be manipulated by, the investigator. Generally this is the cause we are interested in, such as a drug, a treatment, a risk factor, or a diagnostic test. The dependent variable changes as a result of, or as an effect of, the action of the independent variable. It is usually the outcome of exposure to the treatment or risk factor, or the presence of a particular diagnosis. We want to find out if changing the independent variable will produce a change in the dependent variable. The nature of each variable should be evident from the study design; if it is not, there is a serious problem in the way the study was conducted.

When classifying variables by their nature, we mean the hierarchy that describes the mathematical characteristics of the value generated for that variable. The choice of variables becomes very important in the application of statistical tests to the data. Nominal data are simply named categories. One can assign a number to each of these categories, but it would have no intrinsic significance and cannot be used to compare one piece of the data set to another. Changing the number assignment has no effect on the interpretation of the data. Examples of nominal data are classification of physicians by specialty or of patients by the type of cancer from which they suffer. There is no relationship between the various types of specialty physicians except that they are all physicians and went to medical school. There is certainly no mathematical relationship between them.

Ordinal data are nominal data for which the order of the variables has importance and intrinsic meaning. However, there is still no mathematical relationship between data points. Typical examples of ordinal data include certain pain scores that are measured by scales called Likert scales; severity-of-injury scores such as the Trauma Score, where lower numbers are predictive of worse survival than higher ones; and the grading and staging of a tumor, where higher-number stages are worse than lower ones. Common questionnaires asking the participant to state whether they agree, are neutral, or disagree with a statement are also examples of an ordinal scale. Although there is a directional value to each of these answers, there is no numerical or mathematical relationship between them.

Interval data are ordinal data for which the interval between each number is also a meaningful real number. However, interval data have only an arbitrary zero point and, therefore, there is no proportionality or ratio relationship between two points. One example is temperature in degrees Celsius, where 64 °C is 32 °C hotter than 32 °C but not twice as hot. Another example is the common IQ score, where 100 is average, but someone with a score of 200 is not twice as smart, since a score of 200 is super-genius and less than 0.01% of the population has a score this high.

Ratio data are interval data that have an absolute zero value. This makes the results take on meaning for both absolute and relative changes in the variable. Examples of ratio variables are temperature in degrees Kelvin, where 100 K is 50 K hotter than 50 K and is also twice as hot; age, where a 10-year-old is twice as old as a 5-year-old; and common biological measurements such as pulse, blood pressure, respiratory rate, blood chemistry measurements, and weight.
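A quick sketch, invented rather than taken from the text, makes the interval-versus-ratio distinction concrete:

```python
# Minimal sketch: ratios are only meaningful with a true zero point.
def celsius_to_kelvin(t_c):
    return t_c + 273.15

print(64 / 32)  # 2.0 -- but 64 °C is NOT "twice as hot" as 32 °C,
                # because 0 °C is an arbitrary zero (interval scale)

# On the Kelvin scale, zero is absolute, so the ratio has physical meaning:
print(celsius_to_kelvin(64) / celsius_to_kelvin(32))  # about 1.10
```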

Data can also be described as either having or lacking continuity. Continuous data may take any value within a defined range. For most purposes we choose to round off to an easily usable number of digits. This is called the number of significant places, which is taught in high school and college, although it is often quickly forgotten by students. Height is an example of a continuous measure, since a person can be 172 cm or 173 cm or 172.58763248... cm tall. The practical, useful value would be 172.6 or 173 cm.

Values for discrete data can only be represented by whole numbers. For example, a piano is an instrument with only discrete values in that there are only 88 keys and, therefore, only 88 possible notes. Scoring systems like the Glasgow Coma Score for measuring neurological deficits, the Likert scales mentioned above, and other ordinal scales contain only discrete variables and mathematically can have only integer values.

We commonly use dichotomous data to describe binomial outcomes, which are those variables that can have only two possible values. Obvious examples are alive or dead, yes or no, normal or abnormal, and better or worse. Sometimes researchers convert continuous variables to dichotomous ones. Selecting a single cutoff as the division between two states does this. For example, serum sodium is defined as normal if between 135 and 145 mEq/L. Values over 145 mEq/L define hypernatremia, and values at or below it do not. This has the effect of dichotomizing the value of the serum sodium into either hypernatremic or not hypernatremic.
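A minimal sketch of such a cutoff, using the sodium example above; the function name and test values are invented:

```python
# Minimal sketch: dichotomizing a continuous variable at a single cutoff.
# Serum sodium (mEq/L); values above 145 are labeled hypernatremic.
HYPERNATREMIA_CUTOFF = 145

def is_hypernatremic(sodium_meq_l):
    """Collapse a continuous measurement into a dichotomous outcome."""
    return sodium_meq_l > HYPERNATREMIA_CUTOFF

for value in [138, 145, 146, 152]:
    label = "hypernatremic" if is_hypernatremic(value) else "not hypernatremic"
    print(value, "->", label)
```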

Measurement in clinical research

All natural phenomena can be measured, but it is important to realize that errors may occur in the process. These errors can be classified into two categories: random and systematic. Random error is characteristically unpredictable in direction or amount. Random error leads to a lack of precision due to the innate variability of the biological or sociological system being studied. This biological variation occurs for most bodily functions. For example, in a given population, there will be a more or less random variation in the pulse or blood pressure. Many of these random events can be described by the normal distribution, which we will discuss in Chapter 9. Random error can also be due to a lack of precision of the measuring instrument. An imprecise instrument will get slightly different results each time the same event is measured. In addition, certain measurements are inherently more precise than others. For example, serum sodium measured inside rat muscle cells will show less random error than the degree of depression in humans. There can also be innate variability in the way that different researchers or practicing physicians interpret various data on certain patients.

Systematic error represents a consistent distortion in direction or magnitude of the results. Systematic or systemic error is a function of the person making the measurement or the calibration of the instrument. For example, researchers could consistently measure blood pressure using a blood-pressure cuff that always reads high by 10 mmHg. More commonly, a measurement can be influenced by knowledge of other aspects of the patient’s situation, leading to researchers responding differently to some patients in the study. In a study of asthma, the researcher may consistently coach some research subjects differently in performing the peak expiratory flow rate (PEFR), an effort-dependent test. Another source of systematic error can occur when there is non-random assignment of subjects to one group in a study. For instance, researchers could preferentially assign patients with bronchitis to the placebo group when studying the effect of antibiotics on bronchitis and pneumonia. This would be problematic, since bronchitis almost always gets better on its own, while pneumonia sometimes gets better on its own but this is less likely and occurs more slowly. Then, if the patients assigned to placebo get better as often as those taking antibiotics, the cause of the improvement is uncertain, since it may have occurred because the placebo patients were going to get better more quickly anyway.

Both types of errors may lead to incorrect results. The researcher’s job is to minimize the error in the study in order to minimize the bias in the study. Researchers are usually more successful at reducing systematic error than random error. Overall, it is the reader’s job to determine if bias exists, and if so, to what extent and in what direction that bias is likely to change the study results.
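The distinction between the two error types is easy to see in simulation. A minimal sketch with invented numbers; the 10-mmHg offset echoes the miscalibrated-cuff example above:

```python
# Minimal sketch: simulated random vs. systematic measurement error.
from random import Random
from statistics import mean

rng = Random(0)
TRUE_BP = 120.0  # "true" systolic blood pressure, in mmHg

# Random error: unpredictable in direction and amount; it averages out.
random_error = [TRUE_BP + rng.gauss(0, 5) for _ in range(1000)]

# Systematic error: a consistent distortion, e.g., a cuff reading 10 mmHg high.
systematic_error = [TRUE_BP + 10 + rng.gauss(0, 1) for _ in range(1000)]

print(f"Random error:     mean ~ {mean(random_error):.1f} (close to 120, but imprecise)")
print(f"Systematic error: mean ~ {mean(systematic_error):.1f} (precisely wrong)")
```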

Instruments and how they are chosen

Common instruments include objective instruments like the thermometer or sphygmomanometer (blood-pressure cuff and manometer) and subjective instruments such as questionnaires or pain scales. By their nature, objective measurements made by physical instruments such as automated blood-cell counters tend to be very precise. However, these instruments may also be affected by random variation of biological systems in the body. An example of this is hemodynamic pressure measurements such as arterial or venous pressure, oxygen saturation, and airway pressures taken by transducers. The actual measurement may be very precise, but there can be lots of random variation around the true measurement result. Subjective instruments include questions that must be answered either yes or no, or with an ordinal scale (0, 1, 2, 3, 4, or 5), or by placing an x on a pre-measured line. Measures of pain or anxiety are common examples, and these are commonly known to be unreliable, inaccurate, and often imprecise.

Overall, measurements, the data that instruments give us, are the final goals of research. They are the result of applying an instrument to the process of systematically collecting data. Common instruments used in medicine measure the temperature, blood pressure, number of yes or no answers, or level of pain. The quality of the measurements is only as good as the quality of the instrument used to make them.

Good instrument selection is a vital part of the research study design. The researcher must select instruments that will measure the phenomena of interest. If the researcher wishes to measure blood pressure accurately and precisely, a standard blood-pressure cuff would be a reasonable tool. The researcher could also measure blood pressure using an intra-arterial catheter attached to a pressure transducer. This will give a more precise result, but the additional precision may not help in the ultimate care of the patient. If survival is the desired outcome, a simple record of the presence or absence of death is the best measure. For measuring the cause of death, the death certificate can also be the instrument of choice, but it has been shown to be inaccurate.

When subjective outcomes like pain, anxiety, quality of life, or patient satisfaction are measured, the selection of an instrument becomes more difficult. Pain, a very subjective measure, is appreciated differently by different people. Some patients will react more strongly and show more emotion than others in response to the same levels of pain. There are standardized pain scores available that have been validated in research trials. The most commonly used pain scale is the Visual Analog Scale (VAS). A 10-cm line is placed on the paper with one end labeled “no pain at all” and the other end “worst pain ever.” The patient puts a mark on the scale corresponding to the pain level. If this exercise is repeated and the patient reports the same level of pain, then the scale is validated. The best outcome measure when using this scale is the change in the pain score and not the absolute score. Since pain is quantified differently by different patients, it is only the difference in scores that is likely to be similar between patients. In fact, when this was studied, it was found that patients would use consistently similar differences for the same degree of pain difference.1 This study found that a difference in pain scores of 1.5 cm is a clinically important difference in degree of pain.

1 K. H. Todd & J. P. Funk. The minimum clinically important difference in physician-assigned visual analog pain scores. Acad. Emerg. Med. 1996; 3: 142–146; and K. H. Todd, K. G. Funk, J. P. Funk & R. Bonacci. Clinical significance of reported changes in pain severity. Ann. Emerg. Med. 1996; 27: 485–489.

Another type of pain score is the Likert scale, which is a five- or six-point ordinal scale in which each of the points represents a different level of pain. A sample Likert scale begins with 0 = no pain, continues with 1 = minimal pain, and ends with 5 = worst pain ever. The reader must be careful when interpreting studies using this type of scoring system. As with the VAS for pain, personal differences in quantification may result in large differences in the score. A patient who puts a 3 for their pain is counted very differently from a patient who puts a 4 for the same level of pain. The differences in pain level have not been quantified in the same way as the VAS, and as it is an ordinal scale, the results may not be used the same way: the VAS score behaves like a continuous variable, while Likert scales should be treated as ordinal variables. Because of this, Likert scales are very useful for measuring opinions about a given question. For example, when evaluating a course, you are given several graded choices such as strongly agree, agree, neutral, disagree, or strongly disagree.

Similar problems will result with other questionnaires and scales. The reader must become familiar with the commonly used survey instruments in their specialty. Commonly used scores in studies of depression are the Beck Depression Inventory and the Hamilton Depression Scale. In the study of alcoholism, the commonly used scores are the CAGE score, the Michigan Alcohol Screening Test (MAST), and the Alcohol Use Disorders Identification Test (AUDIT). The reader is responsible for understanding the limitations of each of these scores when reviewing the literature. This will require the reader to look further into the use of these tests when first reviewing the medical literature. Be aware that sometimes scores are developed specifically for a study, and in that case, they should be independently validated before use.

A common problem in selecting instruments is the practice of measuring surrogate markers. These are markers that may or may not be related to, or predictive of, the outcome of interest. For example, the degree of blood flow through a coronary artery as measured by the “TIMI grade” of flow is a good measure of the flow of blood through the artery, but it may not predict the ultimate survival of the patient. The TIMI grade of flow is called a disease-oriented outcome, while overall survival is a patient-oriented outcome. Composite outcomes are multiple outcomes put together in the hope that the combination will more often achieve statistical significance. This is done when each individual outcome is too infrequent to expect that it will demonstrate statistical significance. Only consider using composite outcomes if all the outcomes are more or less equally important to your patient. One example is the use of death and recurrent transient ischemic attack (TIA) as a single outcome. Death is important to all patients, but recurrent TIA may not have the same level of importance, and the two should not be considered equal when measuring outcome events. We’ll discuss composite outcomes and how to evaluate them in a future chapter.

Attributes of measurements

Measurements should be precise, reliable, accurate, and valid. Precision simply means that the measurement is nearly the same value each time it is measured. This is a measure of random variation, noise, or random error. Statistically, it states that for a precise measurement there is only a small amount of variation around the true value of the variable being measured. In statistical terminology, this is equivalent to a small standard deviation or range around the central value of multiple measurements. For example, if each time a physician takes a blood pressure the same measurement is obtained, then we can say that the measurement is precise. The same measurement can become imprecise if not repeated the same way, for example if different blood-pressure cuffs are used.

Reliability has been used loosely as a synonym for precision, but it also incorporates durability or reproducibility of the measurement in its definition. It tells you that no matter how often you repeat the measurement, you will get the same or a similar result. A reliable measurement can also be precise, in which case the results of repeated measurements are almost exactly the same. We are looking for instruments that will give precise, consistent, reproducible, and dependable data.

Accuracy is a measure of the trueness of the result. This tells you how close the measured value is to the actual value. Statistically, it is equivalent to saying that the mean or arithmetic average of all measurements taken is the actual and true value of the thing being measured. For example, if indirect blood-pressure measurements using a manometer and blood-pressure cuff correlate closely with direct intra-arterial measurements taken with a pressure transducer in healthy, young volunteers, it means that the blood pressure measured using the manometer and blood-pressure cuff is accurate. The measurement will be inaccurate if the manometer is not calibrated properly or if an incorrect cuff size is used. Accuracy doesn’t mean the same thing as precision: it is possible for a measurement to be accurate but not precise if the average measured result is the true value of the thing being measured but the spread around that measure is very great.

Precision and accuracy are direct functions of the instruments chosen to make a particular measurement. Validity tells us that the measurement actually represents what we want to measure. We may have accurate and precise measurements that are not valid. For example, weight is a less valid measure of obesity than skin-fold thickness or body mass index. Blood pressure measured with a standard blood-pressure cuff is a valid measure of the intra-arterial pressure. However, a single blood-sugar measurement is not a valid measure of overall diabetic control; a test called glycosylated hemoglobin is a valid measure of this.

Types of validity

There are several definitions of validity. The first set defines validity by the process with which it is determined and includes criterion-based, predictive, and face validity. The second set defines where validity is found in a clinical study and includes internal and external validity.

Criterion-based or construct validity is a description of how close the measurement of the phenomenon of interest is to other measurements of the same thing using different instruments. This means that there is a study showing that the measurement of interest agrees with other accepted measures of the same thing. For example, the score of patients on the CAGE questionnaire for alcoholism screening correlates with the results on the more complex and previously validated Michigan Alcohol Screening Test (MAST) for the diagnosis of alcoholism. Similarly, blood-pressure cuff readings correlate with intra-arterial blood pressure as recorded by an electrical pressure transducer.

Predictive validity is a type of criterion-based validity that describes how well the measurement predicts an outcome event. This could be the result of another measurement or the presence or absence of a particular outcome. For example, lack of fever in an elderly patient with pneumonia predicts a higher mortality than in the same group of patients with fever. This was determined from studies of factors related to the specific outcome of mortality in elderly patients with pneumonia. We would say that lack of fever in elderly pneumonia patients gives predictive validity to the outcome of increased mortality.

Finally, face validity is how much common sense the measurement has. It is a statement of the fact that the instrument measures the phenomenon of interest and that it makes sense. For example, the measured performance of a student on one multiple-choice examination should predict that student’s performance on another multiple-choice examination. Performance on an observed examination of a standardized patient should accurately measure the student’s ability to perform a history and physical examination on any patient. However, having face validity doesn’t mean that the measure can be accepted without verification. In this example, it must be validated because the testing situation may cause some students to freeze up, which they wouldn’t do when face-to-face with a real patient, thus decreasing the measure’s validity.

Validity can also be classified by the potential effect of bias or error on the results of a study. Internal and external validity are the terms used to describe this and are the most common ways to classify validity. You should use this schema when you assess any research study. Internal validity exists when precision and accuracy are not distorted by bias introduced into a study. An internally valid study precisely and accurately measures what it intends to. Internal validity is threatened by problems in the way a study is designed or carried out, or with the instruments used to make the measurements. External validity exists when the measurement can be generalized and the results extrapolated to other clinical situations or populations. External validity is threatened when the population studied is too restrictive and you cannot apply the results to another, and usually larger, population.

Schematically, truth in the study is a function of internal validity. The results of an internally valid study are true if there is no serious source of bias that can produce a fatal flaw and invalidate the study. Truth in the universe, relating to all other patients with this problem, is only present if the study is also externally valid. The process by which this occurs will be discussed in a later chapter.

Improving precision and accuracy

In the process of designing a study, the researcher should maximize precision, accuracy, and validity. The methods section detailing the protocol used in the study should enable the reader to determine if enough safeguards have been taken to ensure a valid study. The protocol should be explicit and given in enough detail to be reproduced easily by anyone reading the study.

There are four possible error patterns that can occur in the process of measuring data, as simulated in the sketch after this list.

(1) Both precision and accuracy are good: the result is equal to the true value and there is only a small degree of variation around that true value, i.e., the standard deviation is small.

(2) The results are precise but not accurate: the result is not equal to the true value, but there is only a small degree of variation around that value. This pattern is characteristic of systematic error or bias.

(3) The results are accurate but not precise: the result is equal to the true value, but there is a large amount of variation around that value, i.e., the standard deviation is large. This is typical of random error, a statistical phenomenon.

(4) The result is neither accurate nor precise: this is due to both random and systematic error; the result of the study is not equal to the true value and there is a large amount of variability around that value.

Look for these patterns of error or potential error when reviewing a study.
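As a minimal sketch with invented numbers, not data from the text, the four patterns can be simulated by combining a systematic offset with random noise:

```python
# Minimal sketch: the four error patterns, simulated.
from random import Random
from statistics import mean, stdev

rng = Random(1)
TRUE_VALUE = 100.0

def simulate(bias, spread, n=1000):
    """Measurements with a systematic offset (bias) and random noise (spread)."""
    return [TRUE_VALUE + bias + rng.gauss(0, spread) for _ in range(n)]

patterns = {
    "accurate and precise":         simulate(bias=0, spread=1),
    "precise but not accurate":     simulate(bias=8, spread=1),   # systematic error
    "accurate but not precise":     simulate(bias=0, spread=15),  # random error
    "neither accurate nor precise": simulate(bias=8, spread=15),  # both
}

for label, xs in patterns.items():
    print(f"{label:30s} mean={mean(xs):6.1f}  SD={stdev(xs):5.1f}")
```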

Using exactly reproducible and objective measurements, standardizing the performance of the measurements, and intensively training the observers will increase precision. Automated instruments can give more reliable measurements, assuming that they are regularly calibrated. The number of trained observers should be kept to a minimum to increase precision, since having more observers increases the likelihood that one will make a serious error.

Making unobtrusive measurements reduces subject bias. Unobtrusive measurements are those that cannot be detected by the subject. For example, taking a blood pressure is obtrusive, while simply observing whether a patient is alive or dead is usually non-obtrusive. Watching someone work and recording his or her efficiency is obtrusive, since it could result in a change in behavior, called the Hawthorne effect. Therefore, measurements are best made in an unobtrusive and blinded manner. If the observer is unaware of the group to which the patient is assigned, there is less risk that the measurement will be biased. Blinding creates the climate for consistency and fairness in the measurements, and results in reduced systematic error. Non-blinded measurements can lead to differential treatment being given to one of the groups being studied. This can lead to contamination or confounding of the results. In single blinding, either the researcher or the patient doesn’t know who is in each group. In double blinding, neither the researchers nor the subjects know who is in each group. Triple blinding occurs when the patient, the person treating the patient, and the researcher measuring the outcome are all blind to the treatment being rendered.

Tests of inter- and intra-rater reliability

Different observers can obtain different results when they make a measurement. Several observers may measure the temperature of a child using slightly different techniques with the thermometer, such as varying the time the thermometer is left in the patient or reading the mercury level in different ways.

Precision is improved when inter- or intra-observer variation is minimized. The researcher should account for variability between observers and between measurements made by the same observer. Variability between two observers, or between multiple observations made by a single observer, can introduce bias into the results. Therefore, a subset of all the measurements should be repeated and the variability of the results measured. This is referred to as inter-observer and intra-observer variability. Inter-observer variability occurs when two or more observers obtain different results when measuring the same phenomenon. Intra-observer variability occurs when the same observer obtains different results when measuring the same phenomenon on two or more occasions. Tests for inter-observer and intra-observer variability should be done before any study is completed.

Both inter-observer and intra-observer reliability are measured by the kappa statistic. The kappa statistic is a quantitative measure of the degree of agreement between measurements. It measures the degree of agreement beyond chance between two observers, called the inter-rater agreement, or between multiple measurements made by a single observer, called the intra-rater agreement.

The kappa statistic applies because physicians and researchers often assume that all diagnostic tests are precise. However, many studies have demonstrated that most non-automated tests have a degree of subjectivity in their interpretation. This has been seen in commonly used radiologic tests such as CT scanning, mammography, and angiography. It is also present in tests commonly considered to be the gold standard, such as the interpretation of tissue samples from autopsy, biopsy, or surgery.


Fig. 7.1 Observed agreement between two residents when one (no. 1) reads them all as normal and the other (no. 2) reads 90 as normal and 10 as abnormal.

                          Resident 1
                    Normal    Abnormal    Total
Resident 2
  Normal              90          0         90
  Abnormal            10          0         10
  Total              100          0

Here is a clinical example of how the kappa statistic applies. One morning, two radiology residents were reading mammograms. There were 100 mammograms to be read. The first resident, Number 1, had been on night call and was pretty tired. He didn’t really feel like reading them and knew that all of his readings would be reviewed by the attending. He also reasoned that since this was a screening clinic for young women with an average age of 32, there would be very few positive studies. This particular radiology department had a computerized reading system in which the resident pushes either the “normal” or the “cancer” button on a console and that reading is entered into the file. After reading the first three as normal, he fell asleep on the “normal” button, making all one hundred readings normal.

The second resident, Number 2, was really interested in mammography and had slept all night, since she was not on call. She carefully read each study and pushed the appropriate button. She read 90 films as normal and 10 as suspicious for early breast cancer. The two residents’ readings are tabulated in the 2 × 2 table in Fig. 7.1.

The level of agreement that was observed was 90/100, or 90%. Is this agreement of 90% very good? What would the agreement be if they read the mammograms by chance alone? Assuming that there are 90% normals and 10% abnormals, we can assume that each resident read their films with that proportion of each result and construct the same 2 × 2 table (Fig. 7.2). Agreement by chance would then be (81 + 1)/100, or 82%.

Kappa is the ratio of the actual agreement beyond chance to the potential agreement beyond chance. The actual agreement beyond chance is the difference between the actual agreement found and that expected by chance. In our example it is 90 – 82 = 8% (0.08). The potential agreement beyond chance is the difference between the highest possible agreement (100%) and that expected by chance alone. In our example, 100 – 82 = 18% (0.18). This makes kappa = (0.90 – 0.82)/(1.00 – 0.82) = 0.08/0.18 = 0.44.
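The arithmetic is compact enough to script. A minimal sketch reproducing the numbers above; the chance agreement follows the text's assumption that each reader calls 90% of films normal and 10% abnormal:

```python
# Minimal sketch of the kappa calculation worked through above.
def kappa(observed, chance):
    """(Actual agreement beyond chance) / (potential agreement beyond chance)."""
    return (observed - chance) / (1.0 - chance)

p_observed = 90 / 100  # the residents agreed on 90 of the 100 films

# Chance agreement, under the text's assumption that each resident calls
# 90% of films normal and 10% abnormal: (0.9 * 0.9) + (0.1 * 0.1) = 0.82
p_chance = 0.9 * 0.9 + 0.1 * 0.1

print(round(kappa(p_observed, p_chance), 2))  # 0.44
```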


Table 7.1. Interpretation of the kappa statistic

Kappa = (actual agreement between measurements beyond chance) / (potential agreement between measurements beyond chance)

Range: 0–1 (0 = no agreement; 1 = complete agreement)

Numerical level of kappa    Qualitative significance
0.0–0.2                     slight
0.2–0.4                     fair
0.4–0.6                     moderate
0.6–0.8                     substantial
0.8–1.0                     almost perfect
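Where these qualitative labels are used in code, a small lookup keeps them consistent with Table 7.1. A minimal sketch; the function name is invented:

```python
# Minimal sketch: mapping a kappa value to the labels in Table 7.1.
def describe_kappa(k):
    if not 0.0 <= k <= 1.0:
        raise ValueError("kappa is expected in the range 0-1 here")
    for upper, label in [(0.2, "slight"), (0.4, "fair"), (0.6, "moderate"),
                         (0.8, "substantial"), (1.0, "almost perfect")]:
        if k <= upper:
            return label

print(describe_kappa(0.44))  # moderate
```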

Fig. 7.2 Observed agreement between two residents when both (no. 1 and no. 2) read 90 as normal and 10 as abnormal, but there is no relationship between their readings. The 90% read as normal by no. 1 are not the same as the 90% read as normal by no. 2.

                          Resident 1
                    Normal    Abnormal    Total
Resident 2
  Normal              81          9         90
  Abnormal             9          1         10
  Total               90         10

Fig. 7.3 Kappa for chance agreement only (A, κ = 0.0) and for perfect agreement (B, κ = 1.0).

A. Chance agreement only (κ = 0.0)
                          Resident 1
                    Normal    Abnormal    Total
Resident 2
  Normal              25         25         50
  Abnormal            25         25         50
  Total               50         50

B. Perfect agreement (κ = 1.0)
                          Resident 1
                    Normal    Abnormal    Total
Resident 2
  Normal              50          0         50
  Abnormal             0         50         50
  Total               50         50

Overview of kappa statistic

You should use the kappa statistic when you want to know the precision of a measurement or the inter-observer or intra-observer consistency. This gives a reasonable estimate of how “easily” the measurement is made. The “easier” it is to make a measurement, the more likely it is that two different observers will agree on the result and that their agreement is not just due to chance. Some experts have related the value of kappa to qualitative descriptors, which are given in Table 7.1. In general, look for a kappa higher than 0.6 before you consider the agreement to be reasonably acceptable.

Kappa ranges from 0 to 1, where 0 means that there is no agreement and 1 means there is complete agreement beyond that expected by chance alone. You can see from making a 2 × 2 table that if there is an equal number in each cell, the agreement occurs purely by chance (Fig. 7.3). Similarly, if there is perfect agreement, it is very unlikely that the agreement occurred completely by chance. However, it is still possible: if there are only a few readings in each cell, 100% agreement could occur by chance, even though the chance of this happening is very small. Confidence intervals, which we will discuss later in the book, should be calculated to determine the statistically feasible range within which 95% of possible kappa values will be found.

There are other statistics that measure more or less the same thing as the kappa statistic. These are the standard deviation of repeated measurements, the coefficient of variation, the correlation coefficient of paired measurements, the intraclass correlation coefficient, and Cronbach’s alpha.²

² A more detailed discussion of kappa can be found in D. L. Sackett, R. B. Haynes, P. Tugwell & G. H. Guyatt. Clinical Epidemiology: a Basic Science for Clinical Medicine. 2nd edn. Boston: Little, Brown, 1991.


8

Sources of bias

Of all the causes which conspire to blind
Man’s erring judgment, and misguide the mind;
What the weak head with strongest bias rules,
Is pride, the never-failing vice of fools.

Alexander Pope (1688–1744): Essay on Criticism

Learning objectives

In this chapter you will learn:
sources of bias
threats to internal and external validity
how to tell when bias threatens the conclusions of a study

All studies involve observations and measurements of phenomena of interest, but the observations and instruments used to make these measurements are subject to error. Bias introduced into a study can result in systematic error, which may then affect the results of the study and could invalidate the conclusions. Since there is no such thing as a perfect study, in reading the medical literature you should be familiar with common sources of bias in clinical studies. By understanding how these biases could affect the results of a study, it is possible to detect bias and predict its potential effect on the conclusions. You can then determine whether this will invalidate the study conclusions enough to deter you from using the results in your patients’ care. This chapter will give you a schema for looking for bias, and present some common sources of bias.

Overview of bias in clinical studies

Bias was a semilegendary Greek statesman who tried to make peace between two city-states by lying about the warlike intention of the enemy state. His ploy failed and ultimately he told the truth, allowing his city to win the war. His name became forever associated with slanting the truth as a means to accomplish an end.

Bias is defined as the systematic introduction of error into a study that can distort the results in a non-random way. It is almost impossible to eliminate all sources of bias, even in the most carefully designed study. It is the job of the researcher to remove as much bias as possible and to identify potential sources of bias for the reader. It is the job of the reader to find any sources of bias and to assess their importance and potential effects on the results of the study. Virtually no study is 100% bias-free, yet not all bias will result in an invalid study; in fact, some bias may actually increase the validity of a study.

After identifying a source of bias, you must determine the likely effect of that bias on the results of the study. If this effect is likely to be great enough to change the results found by the research, the internal validity and the conclusions of the study are threatened. If the bias could completely reverse the results of the study, it is called a “fatal” flaw, and the results of a study with a fatal flaw should generally not be applied to your current patients. If the bias could have only small potential effects, then the results of the study can be accepted and used with caution.

Bias can be broken down into three areas according to its source: the population being studied, the measurement of the outcome, and miscellaneous sources.

Bias in the population being studied

Selection bias

Selection bias or sampling bias occurs when patients are selected in a manner that will systematically influence the outcome of the study. There are several ways that this type of bias can occur. Subjects who are volunteers or who are paid to be in the study may have different characteristics than the “average person” with the disease in question. Another form of selection bias occurs when patients are chosen to be in a study based upon certain physical or social characteristics, and these characteristics then change the outcome of the study. Commonly, selection bias exists in studies of therapy when patients chosen to be in one arm of the study are “selected” by some characteristic determined by the physicians enrolling them in the study. A few examples will help demonstrate the effects of this bias.

An investigator offered free psychiatric counseling to women who had just had an abortion if they took a free psychological test. He found that the incidence of depression was higher in these women than in the general population, and concluded that having an abortion caused depression. It is very likely that women who had an abortion and were depressed, and therefore wanted counseling, would preferentially sign up to be in the study. Women who had an abortion and were not depressed would be less likely to sign up for the study and take the free psychological test. This is a potentially fatal flaw of the study, and therefore the conclusion is very likely to be biased.

Patients with suspected pulmonary embolism (PE, a blood clot in the lung) were studied with angiograms, an x-ray of the blood vessels in the lung capable of showing a blood clot. It was found that those patients with an angiogram positive for pulmonary embolus were less likely to have a deep vein thrombosis (DVT, a blood clot in a leg vein) than those patients with an angiogram negative for pulmonary embolus. The authors concluded that DVT was not a risk factor for PE. This study did not include all patients in whom a physician would suspect a possible PE, but instead only those with a clinical suspicion of PE high enough to be referred for an angiogram. This is a form of selection bias. The presence of a DVT is a well-known risk factor for PE and, if diagnosed, could lead to direct treatment for a PE rather than an angiogram to make the diagnosis more certain. Therefore, patients suspected of having PE who didn’t have clinical signs of a DVT were more likely to be selected for the angiogram. Similarly, those DVT patients with no signs or symptoms of PE, who were entered into the study only because they had a DVT, wouldn’t have a PE. This is a fatal flaw that would seriously skew the results, so the results of this study should not change a physician’s approach to these patients.

Referral bias

Referral bias is a special form of selection bias. Studies performed in tertiary care or referral centers often use only patients referred for specialty care as subjects. This eliminates cases that are milder and more easily treated, or those diagnosed at an earlier stage, who are more likely to be seen in a primary care provider’s office. Overall, the subjects in the study are not like those patients with similar complaints seen in the primary care office, who will be much less likely to have unusual causes for their symptoms. This limits the external validity of the study, and the results should not be generalized to all patients with the same complaint.

An example will help illustrate referral bias. Patients presenting to a neurology clinic with headaches occurring days to weeks after apparently minor head traumas were given a battery of tests: CT scan of the head, EEG, MRI of the brain, and various psychological tests. Most of these tests were normal, but some of the MRIs showed minor abnormalities. Most of the patients with abnormalities on the MRI had had a brief loss of consciousness at the time of injury. The authors concluded that all patients with any loss of consciousness after minor head trauma should have immediate MRI scans done. This is an incorrect conclusion. The study patients reflected only those who were referred to the neurologist, and who therefore had persistent problems from their head injury. The researchers did not measure the percentage of all patients with head injuries who had loss of consciousness for a brief period of time and who had the reported MRI abnormalities. The results, even if significant in this selected population, would not apply to the general population of all head-injured patients.

Spectrum bias

Spectrum bias occurs when only patients with classical or severe symptoms are selected for a study. This makes the expected outcomes more or less likely than for the population as a whole. For example, patients with definite subarachnoid hemorrhages (bleeding in or around the brain) who have the worst headache of their life and present with coma or a severe alteration of their mental status will almost all have a positive CT of the head showing the bleed. Those patients who have similar headaches but no neurological symptoms are much less likely to have a positive CT of the head. Selecting only those patients with severe symptoms will bias the study and make the results inapplicable to those with less severe symptoms.

Detection bias

Detection bias is a form of selection bias that preferentially includes patients in a study if they have been exposed to a particular risk factor. In these cases, the exposure causes a sign or symptom that precipitates a search for the disease, and the exposure is then blamed for causing the disease. Estrogen therapy was thought to be a risk factor for the development of endometrial cancer. Patients in a tumor registry who had cancer were compared to a similar group of women who were referred for dilatation and curettage (D&C, a diagnostic scraping of the uterine lining) or hysterectomy (removal of the uterus). The proportion of women taking estrogen was the same in both groups, suggesting no relationship between estrogen use and cancer of the uterus. However, many of the women in the D&C or hysterectomy group who were taking estrogen turned out to have uterine cancer. Did estrogen cause cancer? Estrogen caused the bleeding, which led to a search for a cause of the bleeding. This led to the use of a D&C, which subsequently detected uterine cancer in these patients. This and subsequent studies showed that there was a relationship between postmenopausal estrogen therapy and the development of this cancer.

Recall bias

Recall or reporting bias occurs most often in a retrospective study, either a case–control or non-concurrent cohort study. When asked about certain exposures, subjects with the outcome in the study are more likely than controls to recall the factors to which they were exposed. It is human nature to search for a reason for an illness, and patients with an illness will be much more aware of their exposures than those without an illness. This is a potential problem whenever subjective information is used to determine exposure, and is less likely to occur when objective information is used. This is illustrated by a study that was performed looking for a connection between childhood leukemia and living under high-tension wires. Mothers of children with leukemia were more likely to remember living anywhere near a high-tension power line than were mothers who did not have a leukemic child. Exposure suspicion bias is a type of recall bias that occurs on the part of the researcher. When asking subject patients about exposure, researchers might phrase the question in ways that encourage recall bias in the study subjects. The control subjects might similarly be asked in subtly different ways that could make them less likely to recall the exposure.

Non-respondent bias

Non-respondent bias is a bias in the results of a study caused by patients who don’t respond to a survey or who drop out of a study. It occurs because people who don’t respond to a survey may be different in some fundamental way from those who do respond. The reasons for not responding are numerous, but may be related to the study. Past studies have noted that smokers are less likely than non-smokers to respond to a survey when it contains questions about smoking. This will lead to bias in the results of such a survey. It is also true that healthy people are more likely to participate in these surveys than unhealthy ones. In this case, the bias of having more healthy people in the study group will underestimate the apparent ill effects of smoking.

Membership bias

Membership bias occurs because the health of some group members differs in a systematic way from that of the general population. This is obvious when one group of subjects is chosen from members of a health club, has higher average education, or comes from other groups that might intrinsically be more health-conscious than the average person. It is a problem with studies that look at nurses or physicians and attempt to extrapolate the results to the general population. Higher socioeconomic status and generally healthier living are factors that may distinguish these groups and limit generalizability to others in the population.

A recent review was conducted of all studies of thrombolytic therapy, the use of clot-dissolving medication to treat acute myocardial infarction (AMI, or heart attack). The reviewers found that, on average, patients who were eligible for the studies were younger and healthier than patients who were either ineligible for inclusion or not enrolled in the study but treated with these drugs anyway. Overall, study patients got more intensive therapy for their AMI in many ways. The mortality for study patients was less than half that of ineligible patients and about two thirds that of non-study patients.

Berkson’s bias is a specific bias that occurs when patients in the control group are selected because they are patients on a particular ward of the hospital. These patients may share group characteristics that separate them from the normal population, and this difference in baseline characteristics will affect the outcome of the study.

Bias in the measurements of the outcome

Subject bias

Subject bias is a constant distortion of the measurement by the subject. In general, patients try to please their doctors and will tell them what they think the doctor wants to hear. They may also consciously change their behavior or responses in order to please their physicians. They may not report some side effects, may overestimate the amount of medication taken, and may report more improvement if they know they were given a therapy approved of by their doctor rather than the placebo or control therapy. Only effective blinding of subjects, and ideally also of observers, can prevent this bias from occurring.

Observer bias

Observer bias is the conscious or unconscious distortion in the perception or reporting of the measurement by an observer. It may occur when physicians treat patients differently because of the group to which they are assigned. Physicians in a study may give more intensive adjunctive treatment to the patients who are assigned to the intervention group rather than to the placebo or comparison group. They may interpret the answers to questions on a survey differently in patients known to be in the active treatment rather than the control group. An observer not blinded to patient selection may report the results of one group of patients differently from those of the other group. One form of this bias occurs when the sickest patients are either preferentially included in or excluded from the sample because of bias on the part of the observer making the assignment to each group. This is known as filtering and is a form of selection bias.

Data collected retrospectively by reviewing the medical records may be of poor quality. The records used to collect data may contain inadequate detail and possess questionable reliability. They may also use varying and subjective standards to judge symptoms, signs of disease severity, or outcomes. This is a common occurrence in chart reviews and in retrospective case–control or non-concurrent cohort studies. The implicit review of charts introduces the researcher’s bias in interpreting both measurements and outcomes. If there are no objective and explicit criteria for evaluating the medical records, the information contained in them is open to misinterpretation by the observer. It has been shown that when performing implicit chart reviews, researchers subconsciously fit the response that best matched their hypothesis, and that researchers came up with different results if they performed a blinded chart review as opposed to an unblinded review. Explicit reviews are better, and can occur when only clearly objective outcome measures are reviewed. Even when the outcomes are more objective, it is better to have the chart material reviewed in a blinded manner.

The Hawthorne effect was first noticed during a study of the work habits of employees in a light bulb factory in Illinois during the 1920s. It occurs because being observed during the process of making measurements changes the behavior of the subject. In the physical sciences, this is known as the Heisenberg uncertainty principle. If subjects change their behavior when being observed, the outcome will be biased. One study was done to see if physicians would prescribe less expensive antibiotics more often than expensive new ones for strep throat. In this case, the physicians knew that they were being studied, and in fact they prescribed many more of the low-price antibiotics during the course of the study. After the study was over, their behavior returned to baseline; thus they acted differently and changed their clinical practices when being observed. This and other observer biases can be prevented through the use of unobtrusive, blinded, or objective measurements.

Misclassification bias

Misclassification bias occurs when the status of patients or their outcomes is incorrectly classified. If a subject is given an inaccurate diagnosis, they will be counted with the wrong group, and may even be treated inappropriately because of their misclassification. This bias could then change the outcome of the study. For instance, in a study of antibiotic treatment of pneumonia, patients with bronchitis were misclassified as having pneumonia. Those patients were more likely to get better with or without antibiotics, making it harder to find a difference in the outcomes of the two treatment groups. Patients may also change their behaviors or risk factors after the initial grouping of subjects, resulting in misclassification bias on the basis of exposure. This bias is common in cohort studies.

Misclassification of outcomes in case–control studies can result in failure to correctly distinguish cases from controls and lead to a biased conclusion. One must know how accurately the cases and controls are being identified in order to avoid this bias. If the disorder is relatively common, some of the control patients may be affected but not have the symptoms yet. One way of compensating for this bias is to dilute the control group with extra patients. This will reduce the extent to which misclassification of cases incorrectly counted as controls will affect the data.

Let’s say that a researcher wanted to find out if people who killed themselves by playing Russian roulette were more likely to have used alcohol than those who committed suicide by shooting themselves in the head. The researcher would look at death investigations and find those that were classified as suicides and those that were classified as Russian roulette. However, the researcher suspects that some of the Russian roulette cases may have been misclassified as suicides to “protect the victim.” To compensate for this, or dilute the effect of the bias, the researcher decides that the control group will include three suicide deaths for every one Russian roulette death. Obviously, if Russian roulette deaths are routinely misclassified, this strategy will not result in any change in the bias. This is called outcome misclassification. Outcome classification based upon subjective data, including death certificates, is more likely to exhibit this misclassification, and will most likely result in an observed effect that is smaller than the actual effect. This bias can be prevented with objective standards for classification of patients, which should be clearly outlined in the methods section of a study.

Miscellaneous sources of bias

Confounding

Confounding refers to the presence of several variables that could explain the apparent connection between the cause and effect. If a particular variable is present more often in one group of patients than in another, it may be responsible for causing a significant effect. For example, a study was done to look for the effect of antioxidant vitamin E intake on the outcome of cardiovascular disease. It turned out that the group with high vitamin E intake also had a lower rate of smoking, a higher socioeconomic status, and a higher educational level than the groups with lower vitamin E intake. It is much more likely that those other variables are responsible for all or part of the decrease in observed cases of cardiovascular disease. There are statistical ways of dealing with confounding variables, called multivariate analyses. The rules governing the application of these types of analyses are somewhat complex and will be discussed in greater detail in Chapter 14. When looking at studies, always look for the potential presence of confounding variables and at least make certain that the authors have adjusted for those variables. However, no matter how well the authors have adjusted, it can be very difficult to completely remove the effects of confounding from a study.


Contamination and cointervention

Contamination occurs when the control group receives the same therapy as the experimental group. Contamination is more commonly seen in randomized clinical trials, but can also exist in observational studies, where it occurs if the control group is exposed to the same risk factor as the study group. There may be an environmental situation by which those classified as not exposed to the risk factor are actually exposed. For example, a study is done to look at the effect of living near high-tension wires on the incidence of leukemia. Those patients who live within 30 miles of a high-tension wire are considered the exposed group and those who live more than 30 miles away are considered the unexposed control group. Those people who live 30 to 35 miles from high-tension wires could be misclassified as unexposed although they may truly have a similar degree of exposure as those within 30 miles. In fact, families living 60 miles from the wires may be equally affected by the electrical field if the wires carry four times the amount of current.

Cointervention occurs when one group or the other receives different medical care based partly or totally upon their group assignment. This occurs more often in randomized trials, but could be present in an observational study when the group exposed to one particular treatment also receives different therapy than the unexposed group. This can easily occur in studies with historical controls, since patients in the past may not have had access to the same advances in medical care as the patients who are currently being treated. The end results of the historical comparison would be different if both groups had received the same level of medical care.

Patient attrition

Patient attrition occurs when patients drop out of a study or are lost to follow-up, leading to a loss of valuable information. Patients who drop out may do so because a treatment or placebo is ineffective or because there are too many unwanted side effects. Therefore, it is imperative for researchers to account for all patients enrolled in the study. In practice, a drop-out rate of less than 20% is an acceptable level of attrition. However, even a lower rate of attrition may bias the study if the reason patients were lost from the study is directly related to one of the study variables. If there is a differential rate of attrition between the intervention and comparison groups, an even lower rate of attrition may be very important.

How the authors dealt with the outcome measurements of subjects who dropped out, were lost to follow-up, or for whom the outcome is unknown is extremely important. These study participants cannot be ignored and left out of the final data calculations; this will certainly introduce bias into the final results. In this instance, the data can be analyzed using a best case/worst case strategy, assuming that the missing patients all had a poor outcome in one analysis and a good outcome in the other. The researcher can then compare the results obtained from each analysis and see if the loss of patients could have made a big difference.
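
To make the strategy concrete, here is a small, hypothetical sketch in Python: the counts (10 events among 80 patients followed, 20 lost to follow-up) are invented for illustration and do not come from any study discussed here.

# Hypothetical counts for one study arm: events observed, patients followed,
# and patients lost to follow-up.
events, followed, lost = 10, 80, 20
total = followed + lost

best_case = events / total            # assume every lost patient did well
worst_case = (events + lost) / total  # assume every lost patient did poorly
print(f"event rate: {best_case:.0%} (best case) to {worst_case:.0%} (worst case)")

If a study’s conclusion holds at both extremes, the attrition probably could not have changed the result; if the conclusion flips, the losses matter.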

For subjects who switch groups or don’t complete therapy and for whom the outcome is known, an intention-to-treat strategy should be used. The final outcome of those patients who changed groups or dropped out of the study is analyzed with the group to which they were originally assigned. We will discuss the issues of attrition and intention to treat further in the chapter on the randomized clinical trial (Chapter 15).

External validity and surrogate markers

External validity refers to all problems in applying the study results to a larger or different population. External validity can be called into question when the subjects of a study are from only one small subgroup of the general population. Age, gender, ethnic or racial group, socioeconomic group, and cultural group are examples of variables that can affect external validity. Simply having a clearly identified group of patients in a study does not automatically mean there will be a lack of external validity; there ought to be an a priori reason that the results could be different in other groups. For example, we know that women respond differently than men to various drugs. Therefore, a study of a particular drug performed only on men could lack external validity when it comes to recommending the drug to women. Overall, each study must be looked at separately and the reader must determine whether external validity exists.

Poor external validity can lead to inappropriate extrapolation or generalization of the results of a study to groups to which they do not apply. In a study of patients with myocardial infarction (MI), those who had frequent premature ventricular contractions (PVCs) had increased mortality in the hospital. This led to the recommendation that antiarrhythmic drugs to suppress the PVCs should be given to all patients with MI. Later studies found an increased number of deaths among patients on long-term antiarrhythmic drug therapy. Subsequent recommendations were that these drugs only be used to treat immediately life-threatening PVCs. The original study patients all had acute ischemia (lack of oxygen going to the heart muscle) while the long-term patients did not, making extrapolation to that population inappropriate.

The outcome chosen to be measured should be one that matters to the patient. Ideally it is a measure of faster resolution of the problem, such as reduction of pain or of the death rate due to the illness. In these cases, all patients would agree that the particular outcome is important. However, there are studies that look at other outcomes. These may be important for the overall increase in medical knowledge, but not immediately important to an individual patient. In fact, these results, called surrogate endpoints, may not translate into improved health at all.

Suppose that a researcher wanted to see if there was any relationship between the timing of students’ taking of Step I of the USMLE and their final score. The researcher would look at all of the scores and correlate them with the date the test was taken. The researcher finds that there is a strong association between board scores and date, with the higher scores occurring among students taking the boards at earlier dates. The study would conclude that medical students should take the boards as early as possible in the cycle. What the researcher might be missing is that the timing of taking the exam and the score are both dependent on another factor, class rank. Therefore the variable of timing of the USMLE is a surrogate marker for overall class rank.

Final concerns

There are a few more miscellaneous concerns for validity when evaluating outcome measurements. Are the measured outcomes those that are important to patients? Were all of the important outcomes included and reported upon, or were only certain main outcomes of the research project included? If certain outcomes were measured to the exclusion of others, suspect foul play. A study may find a significant improvement in one outcome, for instance disease-free survival, while the outcome of importance for patients is overall survival, which shows no improvement. The problems associated with subgroup analysis and composite endpoints will be discussed in the chapter on Type I errors (Chapter 11).

There is a definite publication bias toward the publication of studies that show a positive result. Studies that show no effect or a negative result are more difficult to get published or may never be submitted for publication. Authors are aware of the decreased publication of negative studies, and as a result, negative studies take longer to be written up.

Chance can also lead to errors in the study conclusions. Chance error distorts the study results in a random way. Researchers can account for this problem with the appropriate use of statistical tests, which will be addressed in the next several chapters.

Studies supported by or run by drug companies or other proprietary interests are inherently biased. Since these companies want their products to do well in clinical trials, the methods used to bias these studies can be quite subtle. Drug-company sponsorship should be a red flag to look more carefully for sources of bias in the study. In general, all potential conflicts of interest should be clearly stated in any medical study article. Many journals now have mandatory requirements that this be included and prominently displayed. However, as the examples below illustrate, there are still some problems with this policy.

Table 8.1. Looking for sources of bias: a checklist

Check the methods section for the following:
(1) The methods for making all the measurements were fully described, with a clearly defined protocol for making these measurements.
(2) The observers were trained to make the measurements, and this training was adequately described and standardized.
(3) All measurements were made unobtrusively: the subjects were blinded to the measurement being made, and the observers (either the ones providing care or the ones making the measurements or interpreting the results) were blinded.
(4) Paired measurements were made (test–retest reliability) or averaged, and the intra-observer or inter-observer reliability of repeated measurements was measured.
(5) The measurements were checked against a known “gold standard” (the measurement accepted as being the truth) and checked for their validity, either through citations from the literature or by a demonstration project in the current study. Readers may have to decide for themselves if a measurement has face validity. You will know more about this as you learn more background material about the subject.
(6) The reasons for inclusion and exclusion must be spelled out and appropriate.
(7) Patients who drop out or cross over must be clearly identified and the results appropriately adjusted for this behavior.
(8) The most appropriate outcome measure should be selected. Be suspicious of composite or surrogate outcome measures.

In one case, Boots Pharmaceuticals, the maker of Synthroid, a brand of levothyroxine, a thyroid hormone commonly taken to replace low thyroid levels, sponsored a study comparing their thyroid hormone against generic thyroid replacement medication. The study was done at Harvard, and when the researchers found that the two drugs were equivalent, they submitted their findings to JAMA. The company notified both Harvard and JAMA that it would sue them in court if the study were printed. Harvard and JAMA both backed down and pulled the article. The news was leaked to the Wall Street Journal, which published an account of the study. Finally, Boots relented and allowed the study to be published in JAMA.

In the second case, a researcher at the Hospital for Sick Children in Toronto was the principal investigator in a study of a new drug to prevent the side effect of iron accumulation in children who needed to receive multiple transfusions. The drug appeared to be associated with severe side effects. When the researcher attempted to make this information known to authorities at the university, the company threatened legal action and the researcher was removed from the project. When other scientists at the university stood up to support the researcher, the researcher was fired. When the situation became public and the government stepped in, the researcher was rehired by the university, but in a lower position. The issues of conflict of interest in clinical research will be discussed in more detail in Chapter 16.

This chapter was an introduction to common sources of bias. Students must evaluate each study on its own merits, and readers who think bias exists must be able to demonstrate how that bias could have affected the study results. For more information, there is an excellent article by Dr. David Sackett on sources of bias.¹ The accompanying checklist (Table 8.1) will help the novice reader identify potential sources of bias.

1 D. L. Sackett. Bias in analytic research. J. Chronic Dis. 1979; 32: 51–63.


9

Review of basic statistics

There are three kinds of lies: lies, damned lies, and statistics.

Benjamin Disraeli, Earl of Beaconsfield (1804–1881)

Learning objectives

In this chapter you will learn:
evaluation of graphing techniques
measures of central tendency and dispersion
populations and samples
the normal distribution
use and abuse of percentages
simple and conditional probabilities
basic epidemiological definitions

Clinical decisions ought to be based on valid scientific research from the medical literature. Useful studies consist of both epidemiological and clinical research. The competent interpreter of these studies must understand basic epidemiological and statistical concepts. Critical appraisal of the literature and good medical decision making require an understanding of the basic tools of probability.

What are statistics and why are they useful in medicine?

Nature is a random process. It is virtually impossible to describe the operations of a given biological system with a single, simple formula. Since we cannot measure all the parameters of every biological system we are interested in, we make approximations and deduce how often they are true. Because of the innate variation in biological organisms, it is hard to tell real differences in a system from random variation or noise. Statistics seek to describe this randomness by telling us how much noise there is in the measurements we make of a system. By filtering out this noise, statistics allow us to approach a correct value of the underlying facts of interest.



Descriptive and inferential statistics

Descriptive statistics are concerned with the presentation, summarization, and utilization of data. These include techniques for graphically displaying the results of a study and mathematical indices that summarize the data with a few key numbers. These key numbers are measures of central tendency, such as the mean, median, and mode, and measures of dispersion, such as the standard deviation, standard error of the mean, range, percentile, and quartile.

In medicine, researchers usually study a small number of patients with a given disease, a sample. What researchers are actually interested in finding out is how the entire population of patients with that disease will respond. Researchers often compare two samples for different characteristics, such as use of certain therapies or exposure to a risk factor, to determine if these changes will be present in the population. Inferential statistics are used to determine whether any differences between the research samples are due to chance or whether there is a true difference present. Inferential statistics are also used to determine if the data gathered can be generalized from the sample to a larger group of subjects or the entire population.

Visual display of data

The purpose of a graph is to visually display the data in a form that allows the observer to draw conclusions about the data. Although graphs seem straightforward, they can be deceptive. The reader is responsible for evaluating the accuracy and truthfulness of graphic representations of the data. There are several common features that should be present in a proper graph, and lack of these items can lead to deception.

First, there must be a well-defined zero point. Lack of a zero point (Fig. 9.1) is always improper: it makes small differences look bigger by emphasizing only the upper portion of the scale. It is proper to start at zero, break the line up with two diagonal hash marks just above the zero point, and then continue from a higher value (as in Fig. 9.2). This still exaggerates the changes in the graph, but now the reader is warned and will consider the results accordingly.

The axes of the graph should be relatively equally proportioned. Lack of proportionality, a much more subtle technique than lack of a well-defined zero, is also improper. It serves to emphasize the drawn-out axis relative to the other, less drawn-out axis. This visually exaggerates smaller changes in the axis that is drawn to the larger scale (Fig. 9.3). Therefore, both axes should have their variables drawn to roughly the same scale (Fig. 9.4).


Fig. 9.1 Improper graph due to the lack of a defined zero point. This makes the change in mean final exam scores appear to be much greater (relatively) than it truly is. (Axes: mean final exam score vs. year.)

Fig. 9.2 Proper version of the graph in Figure 9.1, created by putting in a defined zero point. Although the change in mean final exam scores still appears relatively greater than it truly is, the reader is notified that this distortion is occurring. (Axes: mean final exam score vs. year.)

Another deceptive graphing technique can be seen in some pharmaceutical advertisements. This consists of the use of three-dimensional shapes to demonstrate the difference between two groups, usually the effect of a drug on a patient outcome. One example uses cones of different heights to demonstrate the difference between the endpoint of therapy for the drug produced by the company and that of its closest competitor. The height of each cone is the percentage of patients responding in each group. Visually, the cones represent a larger volume than simple bars or even triangles, making the drug being advertised look like it caused a much larger effect. For more information on deceptive graphing techniques, please refer to E. R. Tufte’s classic book on graphing.¹

1 E. R. Tufte. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 1983.


Fig. 9.3 Improper graph due to the lack of proportionality of the x and y axes. This makes it appear as if the change in mean final exam scores occurred over a much shorter time period than in reality. (Axes: mean final exam score vs. year.)

Fig. 9.4 Proper graph with proportioned x and y axes, giving a true representation of the gradual rise in exam scores over time. (Axes: mean final exam score vs. year.)

Types of graph

Stem-and-leaf plots

Stem-and-leaf plots are shortcuts used as preliminary plots for graphs called simple histograms. The stem is made up of the digits on the left side of each value (tens, hundreds, or higher) and the leaves are the digits on the right side (units, or lower) of each number. Let’s take, for example, the following grades on a hypothetical statistics exam:


96 93 84 75 75 71 65 74 58 87 66 90 76 68 65 78 78 66 76 88 99 88 78 90 86 98 67 66 87 57 89 84 78

In this example, the first digit forms the stem and the second digit the leaves. In creating the stem-and-leaf plot, first list the tens digits, and then next to each one all the units digits that have that tens digit in common. Our example becomes the stem-and-leaf plot in Fig. 9.5, with the leaves then reordered from lowest to highest:

Stem | Leaves         Reordered
 5   | 87             5 | 78
 6   | 5685676        6 | 5566678
 7   | 5514688688     7 | 1455668888
 8   | 47886794       8 | 44677889
 9   | 630908         9 | 003689

Fig. 9.5 Stem-and-leaf plot of grades in a hypothetical statistics exam.

This can be rotated 90° counterclockwise and redrawn as a bar graph or histogram. The x-axis shows the categories, the tens digits in our example, and the y-axis shows the number of observations in each category. The y-axis can also show the percentage of the total that the observations in each category represent. This shows the relationship between the independent variable, in this case the exam scores, and the dependent variable, in this instance the number of students with a score in each 10% increment of grades.

Fig. 9.6 Bar graph of the data in Fig. 9.5. (Axes: number with score vs. score decile.)
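
Building such a plot is mechanical, so a short sketch may help; the following Python fragment regenerates Fig. 9.5 from the raw scores above using only the standard library.

from collections import defaultdict

scores = [96, 93, 84, 75, 75, 71, 65, 74, 58, 87, 66, 90, 76, 68, 65, 78, 78,
          66, 76, 88, 99, 88, 78, 90, 86, 98, 67, 66, 87, 57, 89, 84, 78]

stems = defaultdict(list)
for s in scores:
    stems[s // 10].append(s % 10)   # tens digit is the stem, units digit the leaf

for stem in sorted(stems):
    print(stem, "|", "".join(str(leaf) for leaf in sorted(stems[stem])))
# 5 | 78
# 6 | 5566678
# 7 | 1455668888
# 8 | 44677889
# 9 | 003689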


Fig. 9.7 Histogram of the data in Fig. 9.5. (Axes: number with score vs. score decile.)

Bar graphs, histograms, and frequency polygons

The most common types of graphs used in medical articles are bar graphs, histograms, and frequency polygons. The bar graph (Fig. 9.6) that would represent the data in our previous stem-and-leaf plot is drawn by replacing the numbers with bars. A histogram is a bar graph in which the bars touch each other (Fig. 9.7). As a rule, the author should attempt to make the contrast between bars on a histogram as clear as possible. A frequency polygon shows how often each observation occurs (Fig. 9.8 is a frequency polygon of the data in Fig. 9.5). A cumulative frequency polygon (Fig. 9.9) shows how the number of accumulated events is distributed; here the y-axis is usually the percentage of the total events.

Box-and-whisker plots

Box-and-whisker plots (Fig. 9.10) are common ways to represent the range of values for a single variable. The central line in the box is the median, the middle value of the data, as will be described below. The box edges are the 25th and 75th percentile values, and the lines on either side represent the limits of 95% of the data. The stars represent extreme outliers.

Measures of central tendency and dispersion

There are two kinds of numerical measures that describe a data set: the central tendency and the dispersion. There are three measures of central tendency, describing the center of a set of variables: the mean, median, and mode.


Fig. 9.8 Frequency polygon of the data in Fig. 9.5. (Axes: number with score vs. score decile.)

Fig. 9.9 Cumulative frequency polygon of the data in Fig. 9.5. (Axes: cumulative number vs. score decile.)

The mean (μ or x̄) is the arithmetical center, commonly called the arithmetic average. It is the sum of all measurements divided by the number of measurements. Mathematically, μ = (Σxᵢ)/n, where xᵢ is the numerical value of the i-th data point and n is the total number of data points. The mean is strongly affected by outliers, extreme values on either the high or the low end of the distribution that produce a high degree of skew. If the data are highly skewed, there will not be a truly representative central value and the mean can misstate the data; it makes more sense to use the median in this case. The mean should not be used for ordinal data, and is meaningless in that setting unless the ordinal data have been shown to behave like continuous data in a symmetrical distribution. This is a common error and may invalidate the results of the experiment or portray them in a misleading manner.

Fig. 9.10 Box-and-whisker plot of scores on the statistics test by student height. (Stars mark extreme outliers.)

The median (M) is the middle value of a set of data points: there are the same number of data points above and below M. For an even number of data points, M is the average of the two middle values. The median is less affected by outliers and by data that are highly skewed. It should be used when dealing with ordinal variables or when the data are highly skewed. There are special statistical tests for dealing with these types of data.

The mode is the most common value, that is, the one value with the largest number of data points. It is used for describing nominal and ordinal data and is rarely used in clinical studies.
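
All three measures are available in Python’s standard library; the following minimal sketch applies them to the exam scores from Fig. 9.5 as a quick check on the definitions.

import statistics

scores = [96, 93, 84, 75, 75, 71, 65, 74, 58, 87, 66, 90, 76, 68, 65, 78, 78,
          66, 76, 88, 99, 88, 78, 90, 86, 98, 67, 66, 87, 57, 89, 84, 78]

print(statistics.mean(scores))     # arithmetic average: about 78.7
print(statistics.median(scores))   # middle value: 78
print(statistics.mode(scores))     # most common value: 78 (occurs four times)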

There are several ways to describe the degree of dispersion of the data. The common ones are the range, percentiles, variance, and standard deviation. The standard error of the mean is a measure that describes the dispersion of a group of samples.

The range is simply the spread from the lowest to the highest value. It gives an overview of the data spread around a central value. It should be given whenever there is either a large spread of data values with many outliers or when the range is asymmetrical about the value of central tendency. It should also be given with ordinal data.

Quartiles divide the data into fourths, and percentiles into hundredths. The lowest quarter of values lie below the lower quartile or 25th percentile, the lower half below the 50th percentile, and the lowest three-quarters below the upper quartile or 75th percentile. The interquartile range is the range of values from the 25th to the 75th percentile.

The variance (σ² or s²) is a statistical measure of variation. It is the average of the squares of the differences between each value and the mean, that is, the sum of the squares of the differences between each value and the mean divided by n (the number of data points in the sample). It is often divided by n − 1 instead, and either method is correct. This assumes a normal distribution of the variables (see below). Mathematically, s² = Σ(xᵢ − μ)²/(n − 1). The standard deviation (SD, s, or σ) is simply the square root of the variance.

The standard error of the mean (SEM) is the standard deviation of the means of multiple samples that are all drawn from the same population. If the sample size is greater than 30 and the distribution is normal, the SEM is estimated by the equation SEM = SD/√n, where n is the sample size.
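
The same data can be used to check the dispersion measures. In this sketch, statistics.variance and statistics.stdev use the n − 1 denominator described above, and the last line applies SEM = SD/√n.

import math
import statistics

scores = [96, 93, 84, 75, 75, 71, 65, 74, 58, 87, 66, 90, 76, 68, 65, 78, 78,
          66, 76, 88, 99, 88, 78, 90, 86, 98, 67, 66, 87, 57, 89, 84, 78]

print(max(scores) - min(scores))                 # range: 99 - 57 = 42
q1, q2, q3 = statistics.quantiles(scores, n=4)   # quartile cut points
print(q3 - q1)                                   # interquartile range
print(statistics.variance(scores))               # sum((x - mean)^2) / (n - 1)
sd = statistics.stdev(scores)                    # square root of the variance
print(sd / math.sqrt(len(scores)))               # standard error of the mean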

Populations and samples

A population is the set of all possible members of the group being studied. The members of the population have various attributes in common, and the more characteristics they have in common, the more homogeneous, and therefore restrictive, the population. An example of a fairly restrictive population would be all white males between 40 and 65 years of age. With a restrictive population, generalizability is often a problem. The less the members of the population have in common, the more generalizable the results of data gathered for that population. For example, a population that includes all males is more generalizable than one that includes only white males between 40 and 65 years of age. The population size is symbolized by a capital N.

A sample is a subset of the population chosen for a specific reason. An example could be all white males available to the researcher on a given day for a study. Reasons to use a sample rather than the entire population include convenience, time, cost, and logistics. The sample may or may not be representative of the entire population, an issue which has been discussed in the chapter on sources of bias (Chapter 8). The sample size is symbolized by a lower-case n.

Histograms or frequency polygons show how many subjects in a sample or population (the y-axis) have a certain value of a characteristic (the x-axis). When plotted in this manner, we call the graph a distribution of values for the given sample. Distributions can be symmetrical or skewed. By definition, a symmetrical distribution is one for which the mean, median, and mode are identical. Many curves or distributions of variables are asymmetrical, and skew describes the degree to which the curve is asymmetrical. Figures 9.11 and 9.12 show symmetrical and skewed distributions; a curve whose tail extends to the right is said to be skewed to the right (positive skew).


Fig. 9.11 Symmetrical curve. Mean, median, and mode are the same (mean = median = mode).

Fig. 9.12 Skewed curve (to the right, positive skew). Mode < median < mean.

Fig. 9.14 Curve with skew to the left (negative skew).

Fig. 9.15 The normal distribution, showing the area under each segment of the curve from −4 to +4 standard deviations from the mean (μ): 0.1%, 2.2%, 13.6%, 34.1%, 34.1%, 13.6%, 2.2%, and 0.1%.

The normal distribution<br />

The Gaussian or normal distribution (Fig. 9.15) is also called the bell-shaped<br />

curve. It is named after Carl Friedrich Gauss, a German mathematician. However,

he did not discover the bell-shaped curve. Abraham de Moivre, a French<br />

mathematician, discovered it about 50 years before Gauss published his thesis. It<br />

is a special case of a symmetrical distribution, and it describes the frequency of<br />

occurrence of many naturally occurring phenomena. For the purposes of most<br />

statistical tests, we assume normality in the distribution of a variable. It is better<br />

defined by giving its properties:<br />

(1) The mean, median, and mode are equal, so the curve is symmetric around the mean and is not skewed (skew = 0).



Table 9.1. Properties of the normal distribution<br />

(1) One standard deviation (± 1 SD) on either side of the mean encompasses 68.2% of<br />

the population<br />

(2) Two standard deviations (± 2 SD) is an additional 27.2% (95.4% of total)<br />

(3) Three (± 3 SD) is an additional 4.4% (99.8% of total)<br />

(4) Four (± 4 SD) is an additional 0.2% (99.99% of total)<br />

(5) Five (± 5 SD) includes (essentially) everyone (99.9999% of total)<br />

(2) The tails of the curve approach the x-axis asymptotically, that is they get<br />

closer and closer to the x-axis as you move away from the mean and they<br />

never quite reach it no matter how far you go.<br />

There are specific numerical equivalents to the standard deviations of the normal<br />

distribution, as shown in Table 9.1. For all practical purposes, 68% of the population are within one standard deviation of the mean (± 1 SD), 95% are within two standard deviations of the mean (± 2 SD), and 99% are within three standard deviations of the mean (± 3 SD). The 95% interval (± 2 SD) is a range commonly

referred to as the normal range or the Gaussian definition of the normal range.<br />

The normal distribution is the basis of most statistical tests and concepts we will<br />

use in critical interpretation of the statistics used in the medical literature.<br />
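The numbers in Table 9.1 can be verified with a few lines of Python, since the fraction of a normal distribution lying within k standard deviations of the mean is erf(k/√2); this is a sketch for checking the table, not a method used in the medical literature itself:

import math

# Fraction of a normal distribution within k standard deviations of the mean
for k in range(1, 6):
    coverage = math.erf(k / math.sqrt(2))
    print(f"± {k} SD: {100 * coverage:.4f}% of the population")
# Prints roughly 68.27%, 95.45%, 99.73%, 99.99%, and 99.9999%,
# closely matching Table 9.1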

Percentages<br />

Percentages are commonly used in reporting results in the medical literature.<br />

The percentage improvement, or the percentage of patients who achieve one of two dichotomous endpoints, is the preferred method of reporting the results. These are commonly called event rates. A percentage is a ratio or fraction, the numerator

divided by the denominator, multiplied by 100 to create a whole number.<br />

Obviously, inaccuracies in either the numerator or denominator will result in<br />

inaccuracy of the percentage.<br />

Percentages can be misleading in two important ways. Percent of a percent will usually show a very large result, even when there is only a small absolute change in the variables. Consider two drugs, which we will call t-PA and SK, that have different mortality rates. In a particular study, the mortality rate for patients given t-PA was 7%, which is referred to as the experimental event rate (EER), while the mortality for SK was 8%, which is the control event rate (CER). The absolute difference, called the absolute risk reduction, is calculated as ARR = |EER − CER| and is 1% in this example. The relative improvement in mortality, referred to as the relative risk reduction, is calculated as RRR = |EER − CER|/CER, which is (1/8) × 100% = 12.5%, a much larger and more impressive number than the 1% ARR.



Using the latter without prominently acknowledging the former is misleading<br />

and is a commonly used technique in pharmaceutical advertisements.<br />
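Using the t-PA and SK mortality rates above, the two ways of expressing the same difference can be computed directly; this is a simple illustrative sketch:

eer = 0.07   # experimental event rate: mortality with t-PA
cer = 0.08   # control event rate: mortality with SK

arr = abs(eer - cer)         # absolute risk reduction: 0.01, i.e. 1%
rrr = abs(eer - cer) / cer   # relative risk reduction: 0.125, i.e. 12.5%
print(f"ARR = {arr:.1%}, RRR = {rrr:.1%}")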

The second misleading technique, the use of percentages of small numbers, is more subtle. In this case, the percentage is most likely to be simply inaccurate. Twenty percent of ten subjects seems

like a large number, yet represents only two subjects. For example, the fact that<br />

those two subjects had an adverse reaction to a drug could have occurred simply<br />

by chance and the percentage could be much lower (< 1%) or higher (> 50%)<br />

when the same intervention is studied in a larger sample of the population. To<br />

display these results properly when there are only a small number of subjects<br />

in a study, the percentage may be given as long as the overall numbers are also<br />

given with equal prominence. The best way to deal with this is through the use<br />

of confidence intervals, which will be discussed in the next chapter.<br />

Probability<br />

Probability tells you the likelihood that a certain event will or will not occur relative<br />

to all possible related events of interest. Mathematically it is expressed as<br />

the number of times the event of interest occurs divided by the number of times<br />

all possible related events occur. This can be written as P(x) = nₓ/N, where P(x) is the probability of an event x occurring in a total of N possible outcome events. In this equation, nₓ is the number of times x occurs. The letter P (or p) symbolizes

probability. For flipping a coin once, the probability of a head is P(head). This<br />

is calculated as P(head) = 1/2, or the outcome of interest (one head)/the total<br />

number of possible outcomes of the coin toss (one head plus one tail).<br />

Two events are said to be independent, not to be confused with the independent<br />

variable of an experiment, when the occurrence of one of the events does<br />

not depend on the occurrence of the other event. In other words, the two events<br />

occur by independent mechanisms. The toss of a coin is a perfect example. Each<br />

toss is an independent event. The previous toss has no influence on the next one.<br />

Since the probability of a head on one toss is 1/2, if the same coin is tossed again,<br />

the probability of flipping a head does not change. It is still 1/2. The probability<br />

will continue to be 1/2 no matter how many heads or tails are thrown, unless of<br />

course, the coin is rigged.<br />

Similarly, events are said to be dependent, not to be confused with the dependent<br />

variable of an experiment, if the probability of one event affects the outcome<br />

of the other. An example would be the probability of first drawing a red<br />

ball and then a yellow ball from a jar of colored balls, without replacing the one<br />

you drew out first. This means that the probabilities of selecting one or another<br />

colored ball will change each time one is selected and removed from the jar.



Events are said to be mutually exclusive if the occurrence of one absolutely<br />

precludes the occurrence of the other. For example, gender in humans is a mutually<br />

exclusive property. If someone is a biological male they cannot also be a biological<br />

female. Another example is a coin flip. Heads or tails obtained on the flip<br />

of a coin are mutually exclusive events as a coin will only land on the head or tail.<br />

Conditional probability allows us to calculate complex probabilities, such as<br />

the probability that one event occurs given that another event has occurred. If<br />

the two events are a and b, the notation for this is P(a | b). This is read as “the<br />

probability of event a if event b occurs.” The vertical line means “conditional<br />

upon.” This construct can be used to calculate otherwise complex probabilities<br />

in a very simple manner.<br />

If two events are mutually exclusive, the probability that either event occurs<br />

can be easily calculated. The probability that event a or event b occurs is simply<br />

the sum of the two probabilities. P(a or b) = P(a) + P(b). The probability of<br />

a head or a tail occurring when a coin is flipped is P(head) + P(tail), which is 1/2 + 1/2 = 1, or a certain event. Similarly, if two events are independent, the probability that both event a and event b occur is the product of the two probabilities: P(a and b) = P(a) × P(b). The

probability of getting two heads on two flips of a coin is P(head on 1st flip) ×<br />

P(head on 2nd flip) which is 1/2 × 1/2 = 1/4.<br />

Determining the probability that at least one of several independent events will occur is a bit more complex, but the above rules allow us to make this a simple calculation: P(at least one event will occur) = 1 − P(none of the events will occur). We can calculate P(none of the events occurring) = P(not a) × P(not b) × P(not c) × . . . For example, if we wanted to know the probability of getting at least one head in three flips of a coin, we could calculate the probabilities of getting exactly one, two, and three heads and add them up, taking care to count the several different orders in which, for example, two heads and one tail can occur with three coins. Using the above rule is much simpler: the probability of at least one head is 1 − P(no heads). The probability of no heads is the probability of three tails, (1/2)³ = 1/2 × 1/2 × 1/2 = 1/8, thus making the probability of

at least one head 1 – 1/8 = 7/8. This is an important concept in the evaluation<br />

of the statistical significance of the results of studies and the interpretation of<br />

simple lab tests.<br />
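The coin-flip result can be checked by brute force, enumerating all eight equally likely outcomes of three flips; a short sketch:

from itertools import product

# All 2^3 = 8 equally likely outcomes of three coin flips
outcomes = list(product("HT", repeat=3))
with_a_head = sum(1 for o in outcomes if "H" in o)
print(with_a_head / len(outcomes))   # 0.875, i.e. 7/8

# The complement rule gives the same answer without enumeration
print(1 - (1 / 2) ** 3)              # 0.875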

Many lab tests use the Gaussian distribution to define the normal values. This<br />

considers ± 2 SD as the cutoff point for normal vs. abnormal results. This means<br />

that 95% of the population will have a normal result and 5% will have an abnormal<br />

result. Physicians routinely do a number of tests at once, such as a Complete<br />

Metabolic Profile, SMA-C, or SMA-20. What is the significance of one abnormal<br />

result out of the 20 tests ordered in these panels? We want to know the probability<br />

that a normal person will have at least one abnormal lab test in a panel<br />

of 20 tests by chance alone. The probability that each test will be normal is 95%.<br />

Therefore, the probability that all the tests are normal is (0.95)²⁰ ≈ 0.36. Then, the



probability that at least one test is abnormal becomes 1 – 0.36 = 0.64. This means<br />

that there is a 64% chance that a normal person will have at least one abnormal<br />

test result that occurred purely by chance alone, when in reality that person is<br />

normal.<br />
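The same arithmetic for the 20-test panel, written out as a sketch:

p_normal = 0.95   # probability that any single test is normal in a healthy person
n_tests = 20      # number of tests in the panel

p_all_normal = p_normal ** n_tests   # about 0.36
p_any_abnormal = 1 - p_all_normal    # about 0.64
print(round(p_any_abnormal, 2))      # 0.64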

Basic epidemiology<br />

Epidemiology is literally the study of epidemics, but is commonly used to<br />

describe the study of disease in populations. Many of the studies that medical<br />

students will learn how to evaluate are epidemiological studies. On a very<br />

simplistic level, epidemiology describes the probability of certain events occurring<br />

in a population (Table 9.2). These probabilities are described in terms of<br />

rates. This could be a rate of exposure to a toxin, disease, disability, death, or any<br />

other important outcome. In medicine, rates are usually expressed as number<br />

of cases per unit of population. The unit of population most commonly used is<br />

100 000, although other numbers can be used. The rates can also be expressed as percentages.

Table 9.2. Commonly used probabilities in epidemiology

Prevalence: probability of the presence of disease, calculated as the number of existing cases of a disease / total population
Incidence: probability of the occurrence of new disease, calculated as the number of new cases of a disease / total population
Attack rate: a specialized form of incidence relating to a particular epidemic, expressed as a percentage, calculated as the number of new cases of a disease / number of persons exposed in the outbreak under surveillance
Crude mortality rate: number of deaths for a given time period and place / mid-period population during the same time period and at the same place
Age-specific mortality rate: number of deaths in a particular age group / total population of the same age group in the same period of time, using the mid-period population
Infant mortality rate: deaths in infants under 1 year of age / total number of live births
Neonatal mortality rate: deaths in infants under 28 days of age / total number of live births
Perinatal mortality rate: (stillbirths + deaths in infants under 7 days of age) / (total number of live births + total number of stillbirths)
Maternal mortality rate: all pregnancy-related deaths / total number of live births



The prevalence of disease is the percentage of the population that has existing<br />

cases of the disease at a given time. It is the probability that a given person<br />

in this population has the disease of interest. It is calculated as the number<br />

of cases of a disease divided by the total population at risk for the disease.<br />

The number of new cases and the resolution of existing cases affect prevalence.<br />

Prevalence increases as the number of new cases increases and as the<br />

mortality rate decreases.<br />

The incidence of a disease is the number of new cases of the disease for a given<br />

unit of population in a given unit of time. It is the probability of the occurrence<br />

of a new patient with that disease. It is the number of new cases in a<br />

given time period divided by the total population. Incidence is only affected<br />

by the occurrence of new cases of disease. The occurrence of new cases can<br />

be influenced by factors such as mass exposure to a new infectious agent or<br />

a change in the diet of the society.<br />

The mortality rate is the incidence or probability of death in a certain time<br />

period. It is the number of people who die within a certain time divided by<br />

the entire population at risk of death during that time.<br />
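As an illustration of these three rates, here is a sketch using an entirely hypothetical town; the counts are invented for the example:

# A hypothetical town of 50 000 people followed for one year
population = 50_000
existing_cases = 1_250   # people living with the disease today
new_cases = 200          # cases newly diagnosed during the year
deaths = 400             # deaths during the year

prevalence = existing_cases / population   # 0.025, or 2.5%
incidence = new_cases / population         # 0.004 new cases per person per year
mortality_rate = deaths / population       # 0.008

# Expressed per 100 000 population, the most common convention
print(incidence * 100_000)        # 400 new cases per 100 000 per year
print(mortality_rate * 100_000)   # 800 deaths per 100 000 per year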

An excellent resource for learning more statistics is a CD-ROM called<br />

ActivStats, 2 a review of basic statistics and probability. There is also an electronic<br />

textbook called StatSoft, 3 which includes some good summaries of basic statistical<br />

information.<br />

2 P. Velleman. ActivStats 3.0. Ithaca, NY: Data Description, 2006.<br />

3 StatSoft. www.statsoftinc.com/textbook/stathome.html.


10<br />

Hypothesis testing<br />

<strong>Medicine</strong> is the science of uncertainty and the art of probability.<br />

Sir William Osler (1849–1919)<br />

Learning objectives<br />

In this chapter you will learn:<br />

steps in hypothesis testing<br />

potential errors of hypothesis testing<br />

how to calculate and describe the usage of control event rates (CER), experimental<br />

event rates (EER), relative rate reduction (RRR), and absolute rate<br />

reduction (ARR)<br />

the concepts underlying statistical testing<br />

Interpretation of the results of clinical trials requires an understanding of the statistical<br />

processes used to analyze data. Intelligent readers of the medical literature<br />

must be able to interpret these results and determine for themselves if they<br />

are important enough to use for their patients.<br />

Introduction<br />

Hypothesis testing is the foundation of the scientific method. Roger Bacon suggested<br />

the beginnings of this process in the thirteenth century. Sir Francis Bacon further defined it in the seventeenth century, and it was first regularly used in scientific

research in the eighteenth and nineteenth centuries. It is a process by which<br />

new scientific information is added to previously discovered facts and processes.<br />

Previously held beliefs can be tested to determine their validity, and expected<br />

outcomes of a proposed new intervention can be tested against a previously<br />

used intervention. If the result of the experiment shows that the newly thought-up hypothesis is true, then researchers can design a new experiment to further




Table 10.1. Steps in hypothesis testing<br />

(1) Gather background information<br />

(2) State hypothesis<br />

(3) Formulate null hypothesis (H₀)

(4) Design a study<br />

(5) Decide on a significance level (α)<br />

(6) Collect data on a sample<br />

(7) Calculate the sample statistic (P)<br />

(8) Reject or accept the null hypothesis (by comparing P to α)<br />

(9) Begin all over again, step 1<br />

increase our knowledge. If the hypothesis being tested is false, it is “back to the<br />

drawing board” to come up with a new hypothesis (Table 10.1).<br />

The hypothesis<br />

A hypothesis is a statement about how the study will relate the predictors, cause<br />

or independent variable, and outcomes, effect or dependent variable. For example,<br />

a study is done to see if taking aspirin reduces the rate of death among<br />

patients with myocardial infarction (heart attack). The hypothesis is that there<br />

is a relationship between daily intake of aspirin and a reduction in the risk of<br />

death caused by myocardial infarction. Another way to state this hypothesis is<br />

that there is a reduced death rate among myocardial infarction patients who are<br />

taking aspirin. This is a statement of what is called the alternative hypothesis (Hₐ or H₁). The alternative hypothesis states that a difference does exist between two groups or there is an association between the predictor and outcome variables.

The alternative hypothesis cannot be tested directly by using statistical methods.<br />

The null hypothesis (H₀) states that no difference exists between groups or

there is no association between predictor and outcome variables. In our example,<br />

the null hypothesis states that there is no difference in death rate due to<br />

myocardial infarction between those patients who took aspirin daily and those<br />

who did not. The null hypothesis is the basis for formal testing of statistical significance.<br />

By starting with the proposition that there is no association, statistical<br />

tests estimate the probability that an observed association occurred due<br />

to chance alone. The customary scientific approach is to accept or reject the<br />

null hypothesis. Rejecting the null hypothesis is a vote in favor of the alternative<br />

hypothesis, which is then accepted by default.<br />

The only knowledge that can be derived from statistical testing is the probability<br />

that the null hypothesis was falsely rejected. Therefore the validity of the



alternative hypothesis is accepted by exclusion if the test of statistical significance<br />

rejects the null hypothesis. For statisticians, the reference point for significance<br />

of the results is the probability that the null hypothesis is rejected when in<br />

fact the null hypothesis is true and there really is no difference between groups.<br />

This appears to be a lot of double talk, but it is actually the way statisticians talk. The goal is for this to occur less than 5% of the time, which is the basis of the usual definition of statistical significance, P < 0.05. The letter P stands for the

probability of obtaining the observed difference or effect size between groups by<br />

chance if in reality the null hypothesis is true and there is no difference between<br />

the groups. In other words, the probability of falsely rejecting the null hypothesis.<br />

Where did this 5% notion come from and what does it mean statistically? Sir<br />

Ronald Fisher, a twentieth-century British mathematician and founder of modern statistics, one day said it, and since he was the expert, it stuck. He reasoned

that “if the probability of such an event (falsely rejecting the null hypothesis) were<br />

sufficiently small – say, 1 chance in 20, then one might regard the result as significant.”<br />

Prior to this, a level of P = 0.0047 (or one chance in 212) had been accepted<br />

as the level of significance.<br />

His reasoning was actually pretty sound, as the following experiment shows.<br />

How much would you bet on the toss of a coin? You pay $1.00, or £1.00 in Sir<br />

Ronnie’s experiment, if tails come up and you get paid the same amount if it’s<br />

heads. How many tails in a row would you tolerate before beginning to suspect<br />

that the coin is rigged? Sir Ronald reasoned that in most cases the answer<br />

would be about four or five tosses. The probability of four tails in a row is (1/2)⁴ or 1 in 16, and for five tails in a row (1/2)⁵ or 1 in 32. One in 20 (5%) is about

halfway between. 1 Is it coincidental that 95% of the population corresponds almost exactly to ± 2 SD of the normal distribution? It is sobering to realize that in experimental physics, the usual P value is 0.0001, as physicists want to be really sure where a particular subatomic particle is or what its mass or momentum is before telling the press. There is always talk in biomedical research circles,

usually by pharmaceutical or biotech companies, that the level of significance of<br />

0.05 is too low and should be increased to 0.1. This means that we would accept<br />

one chance in ten that the difference found was not true and only occurred by<br />

chance! This would be a poor decision, and the reasoning why will be evident by<br />

the end of this book.<br />

Errors in hypothesis testing<br />

The results of a clinical study are tested by application of a statistical test to the<br />

experimental results. The researcher asks the question “what is the probability<br />

that the difference between groups that I found was obtained purely by chance,<br />

1 From G. R. Norman & D. L. Streiner. Biostatistics: The Bare Essentials. St Louis: Mosby, 1994.


and that there is actually no difference between the two groups?” Statistical tests<br />

are able to calculate this probability.<br />

In general there are four possible outcomes of a study. These are shown in<br />

Fig. 10.1. They compare the result found in the study with the actual state of<br />

things. The universal truth cannot always be determined, and this is what’s<br />

referred to as clinical uncertainty. Researchers can only determine how closely they are approaching this universal truth by using statistical tests.

Fig. 10.1 Possible outcomes of a study (is the study actually valid?).

                                      Actually a positive result     Actually a negative result
                                      (absolute truth): H₀ false     (absolute truth): H₀ true
Experiment found positive results     Correct conclusion             Type I error
(H₀ found to be false)                (Power = 1 − β)                (α)
Experiment found negative results     Type II error                  Correct conclusion
(H₀ found to be true)                 (β)

A Type I error occurs when the null hypothesis is rejected even though it is<br />

really true. In other words, concluding that there is a difference or association<br />

when in actuality there is not. This is also called a false positive study result.<br />

There are many ways in which a Type I error can occur in a study, and the reader<br />

must be aware of these since the writer will rarely point them out. Often the<br />

researcher will spin the results to make them appear more important and significant<br />

than the study actually supports. Manipulation of variables using techniques<br />

such as data dredging, snooping or mining, one-tailed testing, subgroup<br />

analysis, especially if done post hoc, and composite-outcome endpoints may<br />

result in the occurrence of this type of error.<br />

A Type II error occurs when the null hypothesis is not rejected even though it<br />

is really false. In other words, the researcher concludes that there is not a difference<br />

when in reality there is. This is also called a false negative study result. An<br />

example would be concluding there is no relationship between hyperlipidemia<br />

and coronary artery disease when there truly is a relationship. Power represents<br />

the ability of the study to detect a difference when it exists. By convention the<br />

power of a study should be greater than 80% to be considered adequate. Think of<br />

an analogy to the microscope. As the power of the microscope increases, smaller<br />

differences between cells can be detected.<br />

A Type II error can only be made in negative clinical trials or those trials<br />

that report no statistically significant difference between groups or no association<br />

between cause and effect. Therefore, when reading negative clinical trials,<br />

one needs to assess the chance that a Type II error occurred. This is important



because a negative result may not be due to the lack of an effect but simply<br />

because of low power or the inability to detect the effect. From an interpretation<br />

perspective, the question one asks is “for a given Type II error level and an effect difference that I consider clinically important, did the researcher use a large enough sample size?” Both of these concepts will be discussed in more

detail in the next two chapters.<br />

Type III and IV errors are not usually found in biostatistical or epidemiological<br />

textbooks and yet are extremely common. Type III errors are those that compare<br />

the intervention to the wrong comparator, such as a drug that is not usually used<br />

for the problem or the incorrect dose of a drug. This is fairly common in the literature<br />

and includes studies of new drugs against placebo instead of older drugs.<br />

Studies of drugs for acute treatment of migraine headaches may be done against<br />

drugs that are useful for that indication, but in doses that are inadequate for the<br />

management of the pain. The reader must have a working knowledge of the standard<br />

therapy and determine if the new intervention is being tried against the best<br />

current therapy. Studies of new antibiotics are often done against an older antibiotic<br />

that is no longer used as standard therapy.<br />

Type IV errors are those in which the wrong study was done. For example, a<br />

new antiviral drug for influenza is tested against placebo. The drug should at<br />

least have been tested against an old antiviral drug previously shown to be effective,<br />

and not against placebo, which is a Type III error. But, since the current<br />

standard is prevention in the form of influenza vaccine, the correct study should<br />

in fact have been comparing the new drug against the strategy of prevention<br />

with vaccine. This is a much more complex study, but would really answer the<br />

question posed about the drugs. Any study of a new treatment should be compared<br />

to the effect of both currently available standard therapies and prevention<br />

programs.<br />

Effect size<br />

The actual results of the measurements showing a difference between groups<br />

are given in the results section of a scientific paper. There are many different<br />

ways to express the results of a study. The effect size, commonly called δ, is the<br />

magnitude of the outcome, association, or difference between groups that one<br />

observes. This result can be given either as an absolute or as a relative number.<br />

It often can be expressed as either an absolute difference or the percentage with<br />

the outcome in each group or the event rate.<br />

The expression of the results will be different for different types of data. The<br />

effect size for outcomes that are dichotomous can be expressed as percentages<br />

that achieved the result of interest in each of the groups. When continuous outcomes<br />

are evaluated, the mean and standard deviations of two or more groups



can be compared. If the distribution of values is skewed, the range should also be<br />

given. A statistical test will then calculate the P value for the difference between<br />

the two mean values, and will show the probability that the difference found<br />

occurred by chance alone. If the measure is an ordinal number, the median is<br />

the measure that should be compared. In that case, special statistical methods<br />

can be used to determine the P value for the difference found.<br />

The clinically significant effect size is the difference that is estimated to be<br />

important in clinical practice. It is statistically easier to detect a large effect like<br />

one representing a 90% change than a small effect like one representing a 2%<br />

change. Therefore, it should be easier to detect a difference which is likely to<br />

be clinically important. However, if the sample size is very large, even a small<br />

effect size may be detected. This effect size may not be clinically important even<br />

though it is statistically significant. This concept will be addressed in more detail<br />

later.<br />

Event rates<br />

In any study, researchers are interested in how many events of interest happen<br />

within each of two treatment groups. The outcome of interest must be a dichotomous variable for this set of calculations. The most common variables are survival, admission to the hospital, patients who had relief of pain, or patients who

were cured of infection. Usually a positive outcome such as survival or cure is<br />

used. However, a negative outcome such as death can also be used. The reader ought to be able to clearly determine the outcome being measured; the differences between the groups are usually expressed as percentages. The control

group consists of those subjects treated with placebo, comparison, or the current<br />

standard therapy. The experimental group consists of those subjects treated<br />

with the experimental therapy. For studies of risk, the control group is those not<br />

exposed to the risk factor, while the experimental group is those exposed to the<br />

risk factor being studied.<br />

The rate of success or failure can be calculated for each group. The control<br />

event rate (CER) is the percentage of control patients who have the outcome<br />

of interest. Similarly, the experimental event rate (EER) is the percentage of<br />

experimental patients who have the outcome of interest. The absolute difference<br />

between the two is the absolute rate reduction (ARR). Similarly, the relative<br />

rate reduction (RRR) is the percentage of the difference between the groups.<br />

This is the difference between the two outcome rates as a percentage of one of<br />

the event rates, usually by convention, the CER. This is, in fact, a percentage of<br />

a percentage and the reader must be careful when interpreting this result. The<br />

RRR always overestimates the effect of therapy when compared with the ARR<br />

(Fig. 10.2).


Fig. 10.2 Event rates.

                            Events of interest   Other events   Totals
Control or placebo group    A                    B              CE = control group events
Experimental group          C                    D              EE = experimental group events

Formulas:
CER = control patients with outcome of interest / total control patients = A/CE
EER = experimental patients with outcome of interest / total experimental patients = C/EE
ARR = |CER − EER|
RRR = |CER − EER| / CER

Fig. 10.3 Confidence and standard error of the mean (SEM).
Confidence = √n × (signal/noise), where the signal is the event rate, the noise is the standard deviation, and n is the sample size.
SEM = σ/√n, where n is the sample size and σ is the standard deviation.
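These formulas translate directly into code; here is a sketch using the labels of Fig. 10.2 with hypothetical counts:

# Hypothetical counts laid out as in Fig. 10.2
a, b = 30, 70   # control group: events of interest, other events
c, d = 15, 85   # experimental group: events of interest, other events

ce = a + b      # total control patients
ee = c + d      # total experimental patients
cer = a / ce                 # control event rate: 0.30
eer = c / ee                 # experimental event rate: 0.15
arr = abs(cer - eer)         # absolute rate reduction: 0.15
rrr = abs(cer - eer) / cer   # relative rate reduction: 0.50
print(cer, eer, arr, rrr)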

Signal-to-noise ratio<br />

Nearly all commonly used statistical tests are based on the concept of the signal-to-noise ratio. The signal is the relationship the researcher is interested in, and

the noise represents random error. Statistical tests determine how much of the<br />

difference between two groups is likely due to random noise and how much is<br />

likely due to systematic or real differences in the results of interest. The statistical<br />

measure of noise for continuous variables is the standard deviation or standard<br />

error of the mean (Fig. 10.3).<br />

The confidence of the statistical results of a study can be expressed as proportional<br />

to the signal times the square root of the sample size (n) divided by the<br />

noise. Confidence is analogous to the power of a study. The signal is the effect size<br />

and the noise is the standard deviation of the effect size. Confidence in a particular<br />

result increases when the strength of the signal or effect size increases. It also<br />

increases as the noise level or standard deviation decreases. Finally, it increases<br />

as the sample size increases, but only in proportion to the square root of the sample<br />

size. To double the confidence, you must quadruple the sample size. Remember<br />

this relationship when evaluating study results.<br />
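The square-root relationship is easy to demonstrate numerically; in this sketch the signal and noise values are arbitrary:

import math

def confidence(signal, noise, n):
    # Confidence is proportional to the signal-to-noise ratio
    # multiplied by the square root of the sample size
    return math.sqrt(n) * (signal / noise)

base = confidence(signal=0.1, noise=0.5, n=100)
print(confidence(0.1, 0.5, 400) / base)   # 2.0: quadrupling n doubles confidence
print(confidence(0.2, 0.5, 100) / base)   # 2.0: doubling the signal does the same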

Standard deviation tells the reader how close individual scores cluster around<br />

their mean value. A related number, the standard error of the mean (SEM) tells<br />

the reader how close the mean scores from repeated samples will be to the true



population mean. This is the mathematical basis for many statistical tests. There<br />

are some limitations on the use of SEM. It should not be used to describe the<br />

dispersion of data in a sample. The standard deviation does this and using SEM<br />

is dishonest since it under-represents differences between groups. The SEM is a<br />

measure of the variability of the sample means if the study were repeated. For<br />

all practical purposes, the SEM is the standard deviation of the means of all the<br />

possible samples taken from the population. The 95% confidence interval may<br />

be calculated from the SEM and the clearest way to report variation in a study<br />

would be simply to show the 95% confidence intervals. A more detailed explanation<br />

of standard deviation and SEM can be found in an excellent article by David<br />

Streiner. 2<br />

Confidence intervals<br />

Confidence intervals (CI) are another way to represent the level of significance<br />

and are the preferred way to do this. The actual definition is that 95% of such<br />

intervals calculated from the same experiment repeated multiple times contain<br />

the true value of the variable for that population. For all practical purposes in<br />

plain English, the 95% CI means that 95% of the time we expect the true mean<br />

to be between the upper and lower limits of the confidence interval. This means<br />

that if we were to repeat the experiment 20 times, in 19 of those repeated experiments<br />

the value of the effect size would lie within the stated CI range. This gives<br />

more information than a simple P value, since one can see a range of potentially<br />

likely values. If the data assume a normal distribution and we are measuring<br />

independent events, the SEM can be used to calculate 95% confidence intervals<br />

(Fig. 10.4).

Fig. 10.4 The 95% confidence interval (95% CI): 95% CI = μ ± Z(95%) × (σ/√n), where Z(95%) = 1.96 (the number of standard deviations which defines 95% of the data), σ/√n = SEM, and μ = the mean. Therefore, 95% CI = μ ± 1.96 × SEM.
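As a worked example of this formula, a sketch with hypothetical blood-pressure data:

import math

mean = 120.0   # hypothetical mean systolic blood pressure (mm Hg)
sd = 15.0      # standard deviation of the sample
n = 100        # sample size

sem = sd / math.sqrt(n)     # 1.5
lower = mean - 1.96 * sem   # 117.06
upper = mean + 1.96 * sem   # 122.94
print(f"95% CI: {lower:.2f} to {upper:.2f} mm Hg")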

Statistical tests<br />

The central limit theorem is the theoretical basis for most statistical tests. It<br />

states that if we select equally sized samples of a variable from a population with<br />

2 D. L. Streiner. Maintaining standards: differences between the standard deviation and standard error,<br />

and when to use each. Can. J. Psychiatry 1996; 41: 498–502.



a non-normal distribution, the distribution of the means of these samples will be

a normal distribution. This is true as long as the samples are large enough. For<br />

most statistical tests, the sample size considered large enough is 30. For smaller<br />

sample sizes, other more complex statistical approximations can be used.<br />
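The central limit theorem can be watched in action with a short simulation; this sketch draws samples of size 30 from a deliberately skewed (exponential) population:

import random
import statistics

random.seed(1)
# A clearly non-normal, right-skewed population
population = [random.expovariate(1.0) for _ in range(100_000)]

# Means of 2 000 samples, each of size 30
sample_means = [statistics.mean(random.sample(population, 30))
                for _ in range(2_000)]

# The sample means cluster symmetrically around the population mean (about 1.0)
print(statistics.mean(sample_means))    # close to 1.0
print(statistics.stdev(sample_means))   # close to 1/sqrt(30), about 0.18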

Statistical tests calculate the probability that a difference between two groups<br />

obtained in a study occurred by chance. It is easier to visualize how statistical<br />

tests work if we assume that the distribution of each of two sample variables is<br />

two normal distributions graphed on the same axis. Very simplistically and for<br />

visual effectiveness, we can represent two sample means with their 95% confidence<br />

intervals as bell-shaped curves. There are two tails at the ends of the<br />

curves, each representing half of the remaining 5% of the confidence interval.<br />

If there is only some overlap of the areas on the tails or if the two curves are<br />

totally separate with no overlap, the results are statistically significant. If there is more overlap, such that the measure of central tendency of one distribution is inside the 95% confidence interval of the other, the results are not statistically significant (Fig. 10.5). While this is a good way to visualize the process, it cannot be

translated into simple overlap of the two 95% confidence intervals, as statistical significance depends on multiple other factors.

Fig. 10.5 The relationship between the overlap of 95% of possible variable values and the level of statistical significance.

Statistical tests are based upon the principle that there is an expected outcome

(E) that can be compared to the observed outcome (O). Determining the value of<br />

E is problematic since we don’t actually know what value to expect in most cases.<br />

One estimate of the expected value is that found in the control group. Actually,<br />

there are complex calculations for determining the expected value that are part<br />

of the statistical test. Statistical tests calculate the probability that O is different<br />

from E or that the absolute difference between O and E is greater than zero<br />

and occurred by chance alone. This is done using a variety of formulas, is the<br />

meat of statistics, and is what statisticians get paid for. They also get paid to help<br />

researchers decide what to measure and how to ensure that the measure of interest<br />

is what is actually being measured. To quote Sir Ronnie Fisher again: “To call<br />

in the statistician after the experiment is done may be no more than asking him<br />


to perform a postmortem examination: he may be able to say what the experiment<br />

died of.” 3<br />

One does not need to be a trained statistician to know which statistical test<br />

to use, but it helps. What is the average physician to do? The list in Appendix 3<br />

is one place to start. It is an abbreviated list of the specific statistical tests that<br />

the reader should look for in evaluating the statistics of a study. As one becomes<br />

more familiar with the literature, one will be able to identify the correct statistical<br />

tests more often. If the test used in the article is not on this list, the reader ought<br />

to be a bit suspicious that perhaps the authors found a statistician who could<br />

save the study and generate statistically significant results, but only by using an<br />

obscure test.<br />

The placebo effect<br />

There is an urban myth that the placebo effect occurs at an average rate of about<br />

35% in any study. The apparent placebo effect is actually more complex and<br />

made up of several other effects. These other effects, which can be confused<br />

with the true placebo effect, are the natural course of the illness, regression to<br />

the mean, other timed effects, and unidentified parallel interventions. The true<br />

placebo effect is the total perceived placebo effect minus these other effects.<br />

The natural course of the disease may result in some patients getting better<br />

regardless of the treatment given while others get worse. In some cases, it will<br />

appear that patients got better because of the treatment, when really the patients<br />

got better because of the disease process. This was demonstrated in a previous<br />

example when patients with bronchitis appeared to get better with antibiotic<br />

treatment, when in reality, the natural course of bronchitis is clinical improvement.<br />

This concept is true with almost all illnesses including serious infections<br />

and advanced cancers.<br />

Regression to the mean is the natural tendency for a variable to change with<br />

time and return toward the population mean. If endpoints are re-measured<br />

they are likely to be closer to the mean than an initial extreme value. This is a<br />

commonly seen phenomenon with blood pressure values. Many people initially<br />

found to have an elevated blood pressure will have a reduction in their blood<br />

pressure over time. This is partly due to their relaxing after the initial pressure<br />

reading and partly to regression to the mean.<br />

Other timed effects that may affect the outcome measurements include the<br />

learning curve. A person gets better at a task each time it is performed. Similarly,<br />

a patient becomes more relaxed as the clinical encounter progresses. This<br />

explains the effect known as white coat hypertension, the phenomenon by which<br />

3 Indian Statistical Congress, Sankhya, 1938. Sir Ronald Fisher, 1890–1962.



a person’s blood pressure will be higher when the doctor takes it and lower when<br />

taken later by a machine, a non-physician, or repeatedly by their own physician.<br />

Some of this effect is due to the stress engendered by the presence of the doctor;<br />

as a patient becomes more used to having the doctor take their blood pressure,<br />

the blood pressure decreases.<br />

Unidentified parallel interventions may occur on the part of the physician,<br />

health-care giver, investigator, or patient. This includes things such as unconscious<br />

or conscious changes in lifestyle instituted as a result of the patient’s medical<br />

problem. For example, patients who are diagnosed with elevated cholesterol<br />

may increase their exercise while they also began taking a new drug to help lower<br />

their cholesterol. This can result in a greater-than-expected rate of improvement<br />

in outcomes both in those assigned to the drug and in the control or placebo<br />

group.<br />

The reader’s goal is to differentiate the true treatment effect from the perceived<br />

treatment effect. The true treatment effect is the difference between the<br />

perceived treatment effect and the various types of placebo effect as described<br />

above. Studies should be able to differentiate the true treatment effect from the<br />

perceived effect by the appropriate use of a control group. The control group is<br />

given the placebo or a standard therapy that is equivalent to the placebo since<br />

the standard therapy would be given regardless of the patients’ participation in<br />

the study.<br />

A recent meta-analysis combined the results of multiple studies that had placebo and no-treatment arms. The authors compared the results obtained by all the

patients in these two groups and found that the overall effect size for these two<br />

groups was the same. The only exception was in studies for pain where an overall<br />

positive effect favored the placebo group by the amount of 6.5 mm on a 100-mm<br />

pain scale. 4 As demonstrated by previous pain studies, this difference is not clinically<br />

significant.<br />

4 A. Hróbjartsson & P. C. Gøtzsche. Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment. N. Engl. J. Med. 2001; 344: 1594–1602.


11<br />

Type I errors and number needed to treat<br />

If this be error, and upon me prov’d,<br />

I never writ, nor no man ever lov’d.<br />

William Shakespeare (1564–1616): Sonnet 116<br />

Learning objectives<br />

In this chapter you will learn:<br />

how to recognize Type I errors in a study<br />

the concept of data dredging or data snooping<br />

the meaning of number needed to treat to benefit (NNTB) and number<br />

needed to treat to harm (NNTH)<br />

how to differentiate statistical from clinical significance<br />

other sources of Type I errors<br />

Interpreting the results of a clinical trial requires an understanding of the statistical<br />

processes that are used to analyze these results. Studies that suffer from a<br />

Type I error may show statistical significance when the groups are not actually<br />

different. Intelligent readers of the medical literature must be able to interpret<br />

these results and determine for themselves if these results are important enough<br />

to use for their patients.<br />

Type I error<br />

This occurs when the null hypothesis is rejected even though it is really true.<br />

In other words, studies that have a Type I error conclude that there is a positive<br />

effect size or difference between groups when in reality there is not. This is a false<br />

positive study result. Alpha (α), known as the level of significance, is defined as<br />

the maximum probability of making a Type I error that the researcher is willing<br />

to accept. Alpha is the probability of rejecting the null hypothesis when it is really<br />


true and is predetermined before conducting a statistical test. The probability of<br />

obtaining the actual difference or effect size by chance if the null hypothesis is<br />

true is P. This is calculated by performing a statistical test.<br />

The researcher minimizes the risk of a Type I error by setting the level of significance<br />

(α) very low. By convention, the alpha level is usually set at 0.05 or 0.01.<br />

In other words, with α = 0.05 the researcher expects to make a Type I error in<br />

one of 20 trials. The researcher then calculates P using a statistical test. He or she<br />

compares P to α.Ifα = 0.05, P must be less than 0.05 (P < 0.05) to show statistical<br />

significance. There are two situations for which this must be modified: two-tailed<br />

testing and multiple variables.<br />

One-tailed vs. two-tailed tests<br />

If researchers have an a-priori reason to believe that one group is clearly going to<br />

be different from the other and they know the direction of that difference, they<br />

can use a one-tailed statistical test. It is important to note that the researcher<br />

must hypothesize either an increase or a decrease in the effect, not just a difference.<br />

This means that the normal distribution of one result is only likely to overlap<br />

the normal distribution of the other result on one side or in one direction. This is demonstrated in Fig. 11.1.

Fig. 11.1 One- and two-tailed tests for the same effect size δ.

One-tailed tests specify the direction that researchers think the result will be.<br />

When asking the question “is drug A better than drug B?” the alternative hypothesis, Hₐ, is that drug A is better than drug B. The null hypothesis, H₀, is that either there is no difference or drug A is worse than drug B. This states that we are only

interested in drug A if it is better and we have good a-priori reason to think that it<br />

really is better. It removes from direct experimentation the possibility that drug A may actually be worse than drug B.



It is best to do a two-tailed test in almost all circumstances. The use of a one-tailed test can only be justified if previous research demonstrated that drug A

actually appears to be better and certainly is no worse than drug B. When doing<br />

a two-tailed test, there is no a-priori assumption about the direction of the result.<br />

A two-tailed test asks the question “is there any difference between groups?” In<br />

this case, the alternative hypothesis Hₐ is that drug A is different from drug B. This can mean that drug A is either better or worse, but not equivalent to drug B. The null hypothesis H₀ states that there is no difference between the two drugs or that they are equivalent.

For α = 0.05, P must only be < 0.10 for statistical significance with the one-tailed test. It must be < 0.05 for the two-tailed test. This means that we will accept a Type I error in one in 10 trials with a one-tailed test rather than one in 20 with a two-tailed test. Conceptually, this means that for a total probability of a randomly occurring error of 0.05 in a two-tailed test, each tail of the normal distribution contributes 0.025 of alpha. For a one-tailed test, the single tail contributes the entire 0.05 of alpha. The requirement for α = 0.05 is therefore less stringent if a one-tailed test is used.

Multiple outcomes<br />

The probability of making a Type I error is α for each outcome being measured.<br />

If two variables are measured, the probability of a Type I error or a false positive<br />

result is α for each variable. The probability that at least one of these two variables<br />

is a false positive is one minus the probability that neither of them is a false<br />

positive. The probability that neither is a false positive is the probability that the<br />

first variable is not a false positive (1 – α) and that the second variable is not a<br />

false positive (1 – α). This makes the probability that neither variable is a false<br />

positive (1 − α) × (1 − α), or (1 − α)². The probability that at least one of the two is falsely positive then becomes 1 − (1 − α)². Therefore, the probability that at least one positive and incorrect outcome will occur only by chance if n variables are tested is 1 − (1 − α)ⁿ.
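A sketch of how quickly this family-wise error grows with the number of variables tested:

alpha = 0.05
for n in (1, 2, 5, 10, 20, 100):
    # Probability of at least one false positive among n independent tests
    p_false_positive = 1 - (1 - alpha) ** n
    print(f"n = {n:3d}: {p_false_positive:.2f}")
# Rises from 0.05 for one test to 0.64 for 20 tests and 0.99 for 100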

This probability becomes sizable as n gets very large. Data dredging, mining,<br />

or snooping is a technique by which the researcher looks at multiple variables<br />

in the hope that at least one will show statistical significance. This result is then<br />

emphasized as the most important positive result in the study. This is a common<br />

example of a Type I error. Suspect this when there are many variables being<br />

tested, but only a few of them show statistical significance. This can be substantial<br />

in studies of DNA sequences looking for genetic markers of disease. Typically<br />

the researcher will look at hundreds or thousands of DNA sequences and<br />

see if any are related to phenotypic signs of disease. A few of these may be positively<br />

associated by chance alone if α of 0.05 is used as the standard for statistical<br />

significance.



For example, if a researcher does a study that looks at 20 clinical signs of a<br />

given disease, it is possible that any one of them will be statistically significantly<br />

associated with the disease. For one variable, the probability that this association<br />

occurred by chance only is 0.05. Therefore the probability that no association<br />

occurred by chance is 1 – 0.05 = 0.95. The probability that at least one of the 20<br />

variables tested will be positively associated with the disease by chance alone is<br />

1 minus the probability of no association. Since this is 0.95 for each variable, the<br />

probability that at least one occurred by chance becomes 1 – 0.95 20 or 1 – 0.36 =<br />

0.64. Therefore, there is a 64% likelihood of coming up with one association that<br />

is falsely positive and occurred only by chance. If there are two values that show<br />

an association, one cannot know if both occurred by chance alone or if one result<br />

is truly statistically significant. Then the question becomes which result is the<br />

significant value and which result is a false positive.<br />

One way to get around this problem is by applying the Bonferroni correction. First we must create a new level of α, which will be α/n. This is the previous α divided by n, the number of variables being compared, not the sample size. Therefore, P must be less than α/n for any single comparison to be considered statistically significant.
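A sketch of the correction; the P values here are invented for illustration:

alpha = 0.05
n_comparisons = 20
bonferroni_alpha = alpha / n_comparisons   # 0.0025

p_values = [0.04, 0.001, 0.03]   # hypothetical results from three of the comparisons
for p in p_values:
    verdict = "significant" if p < bonferroni_alpha else "not significant"
    print(p, verdict)
# Only P = 0.001 survives the correction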



Table 11.1. Rules of thumb for 95% confidence intervals<br />

(1) If the point value for one (experimental) group is within the 95% CI for the other<br />

(control) group, there is likely to be no statistical significance for the difference<br />

between values.<br />

(2) If the point value for one (experimental) group is outside the 95% CI for the other<br />

(control) group, there is likely to be statistical significance for the difference<br />

between values.<br />

(3) If the 95% CI for a difference includes 0, the difference found is not statistically<br />

significant.<br />

(4) If the 95% CI for a ratio includes 1, the ratio is not statistically significant.<br />

The 95% confidence interval (CI) gives more information than a simple P < 0.05 value, since one can see a statistically plausible range of values.

The limits of the 95% CI display the precision of the results. If the CI is very<br />

wide, the results are not very precise. This means that there is a great deal of<br />

random variation in the result and a very large or small value could be the true<br />

effect size. Similarly if the CI is very narrow, the results are very precise and we<br />

are more certain of the true result.<br />

If the 95% confidence interval around the difference between two groups in<br />

studies of the therapy includes the zero point, P > 0.05. The zero point is the<br />

point at which there is no difference between the two groups or the null hypothesis<br />

is true. If one limit of the CI is near the zero point but the interval does not cross it, the result may be only slightly statistically significant. The addition of a few more subjects could make the result more statistically significant. However, the true effect may be very small and not clinically important.

Statistical significance vs. clinical significance<br />

A study of a population with a very large sample size can show statistical significance<br />

at the α = 0.05 level when the actual clinical difference between the<br />

two groups is very small. For example, if a study measuring the level of pain perception<br />

using a visual analog scale showed a statistically significant difference<br />

in pain scores of 6.5 points on a scale of 0–100, one might think this was important.

But, another study found that patients could not actually discriminate a<br />

difference on this scale of less than 13 points. Therefore, although statistically<br />

significant, a difference of 6.5 points would not be clinically important.<br />

Clinicians must decide for themselves whether a result has reasonable clinical<br />

significance. They must then help their patients decide how much benefit will<br />

accrue from the therapy and how much risk they are willing to accept as a result




of potential side effects or failure of the treatment. If a difference in effect size of<br />

the magnitude found in the study will not change the clinical situation of a given<br />

patient, then that is not an important result. Clinicians must look at the overall<br />

impact of small effect size on patient care. This may include issues of ultimate<br />

survival, potential side effects and toxicities, quality of life, adverse outcomes,<br />

and costs to the patient and society. We will cover formal decision analysis in<br />

Chapter 30 and cost-effectiveness analysis in Chapter 31.<br />

Fig. 11.2 Number needed to treat to benefit (CER = 0.5, EER = 0.8, ARR = |0.8 − 0.5| = 0.3). For every 10 patients treated with the experimental treatment, there are three additional survivors. The number needed to treat is 10/3 = 3.33 ≈ (rounded up to) 4. Therefore we must treat four patients to get one additional survivor.

Number needed to treat<br />

A useful numerical measure of clinical significance is the number needed to treat<br />

to benefit (NNTB). The NNTB is the number of patients that must be treated<br />

with the proposed therapy in order to have one additional successful result.<br />

To calculate NNTB one must first calculate the absolute risk reduction (ARR).<br />

This requires that the study outcomes are dichotomous and one can calculate<br />

the experimental (EER) and control (CER) event rates. The ARR is the absolute<br />

difference of the event rates of the two groups being compared (|EER – CER|).<br />

The NNTB is one divided by the ARR. By convention, NNTB is given as 1/ARR<br />

rounded up to the nearest integer. Figure 11.2 is a pictorial description of NNTB.<br />
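The convention described above is easy to script. A minimal Python sketch (the function name is ours, for illustration):

import math

def nntb(eer, cer):
    """Number needed to treat to benefit: 1/ARR, rounded up by convention."""
    arr = abs(eer - cer)        # absolute risk reduction (absolute rate difference)
    return math.ceil(1 / arr)

# Figure 11.2's example: CER = 0.5, EER = 0.8.
print(nntb(0.8, 0.5))           # 1/0.3 = 3.33, rounded up to 4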

It is ideal to see small NNTBs in studies of treatment as this means that the new<br />

and experimental treatment is a lot better than the standard, control, or placebo<br />

treatment. One can compare the NNTB to the risk of untreated disease and the<br />

risks of side effects of treatment. The related concept, the number needed to<br />

treat to harm (NNTH), is the number of patients that one would need to expose<br />

to a risk factor before an additional patient is harmed by side effects of the treatment.<br />

The concepts of NNTB and NNTH help physicians balance the benefit and



risk of therapy. The NNTH is usually calculated from studies of risk, and will be<br />

discussed in the chapter on risk assessment (Chapter 13).<br />

For studies of prevention, NNTB tends to be much larger than for studies of<br />

therapy. This difference is fine if the intervention is relatively cheap and not<br />

dangerous. For example, one aspirin taken daily can prevent death after a heart<br />

attack. The NNTB to prevent one death in the first 5 weeks after a heart attack<br />

is 40. Since aspirin is very cheap and has relatively few side effects, this is a<br />

reasonable number. The following two examples will demonstrate the use of<br />

NNTB.<br />

(1) A study of treatment for migraine headache tested a new drug sumatriptan<br />

against placebo. In the sumatriptan group, 1067 out of 1854 patients had mild<br />

or no pain at 2 hours. In the placebo group, 256 out of 1036 patients had mild<br />

or no pain at 2 hours. First the event rates are calculated, then the ARR and<br />

RRR, and finally the NNTB:<br />

EER = 1067/1854 = 58% = 0.58 and CER = 256/1036 = 25% = 0.25.<br />

ARR = 0.58 – 0.25 = 0.33. In this case we ought to say absolute rate increase<br />

(ARI) since this is the absolute increase in well-being due to the drug. This<br />

means that 33% more patients taking sumatriptan for headache will have<br />

clinical improvement compared to patients taking placebo.<br />

RRR = 0.33/0.25 ≈ 1.33. This is the relative risk reduction or, in this case, the

relative rate increase (RRI) and means that patients treated with sumatriptan<br />

are one-and-a-third times more likely to show improvement in their<br />

headache compared with patients treated with placebo therapy. The RRR<br />

always makes the improvement look better than the ARR.<br />

NNTB = 1/0.33 ≈ 3. You must treat three patients with sumatriptan to

reduce pain of migraine headaches in one additional patient. This looks<br />

like a very reasonable number for NNTB. However, bear in mind that clinicians<br />

would never recommend placebo, and it is likely that the NNTB would<br />

not be nearly this low if sumatriptan were compared against other migraine<br />

medications. This is an example of a false comparison, very common in the<br />

medical literature, especially among studies sponsored by pharmaceutical<br />

companies.<br />

(2) Streptokinase (SK) and tissue plasminogen activator (t-PA) are two drugs that<br />

can dissolve blood clots in the coronary arteries and can treat myocardial<br />

infarction (MI). A recent study called GUSTO compared the two in the treatment<br />

of MI. In the most positive study comparing the use of these in treating<br />

MI, the SK group had a mortality of 7% (CER) and the t-PA group had a mortality<br />

of 6% (EER). This difference was statistically significant (P < 0.05).<br />

ARR =|6% – 7%| or 1%. This means that there is a 1% absolute improvement<br />

in survival when t-PA is used rather than SK.<br />

RRR = (|6 − 7|)/7 ≈ 14%. This means that there is a relative increase in survival of about 14% when t-PA is used rather than SK. This is the figure that was used



in advertisements for the drug that were sent out to cardiologists, family-medicine,

emergency-medicine, and critical-care physicians.<br />

NNTB = 1/1% = 1/0.01 = 100. This means that you must treat 100 patients<br />

with the experimental therapy to save one additional life. This may not be<br />

reasonable especially if there is a large cost difference or significantly more<br />

side effects. In this case, SK costs $200 per dose while t-PA costs $2000 per<br />

dose. There was also an increase in the number of symptomatic intracranial<br />

bleeds with t-PA. The ARR for symptomatic intracranial bleeds was about<br />

0.3%, giving an NNTH of about 300. That means for every 300 patients who<br />

get t-PA rather than streptokinase, one additional patient will have a symptomatic<br />

intracranial bleed.<br />
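As a quick check of the arithmetic in this example, in Python (the 0.3% bleed rate is the approximate figure quoted above):

# GUSTO comparison from the text: CER (SK) = 7%, EER (t-PA) = 6%.
cer, eer = 0.07, 0.06
arr = abs(eer - cer)            # 0.01: a 1% absolute improvement in survival
rrr = arr / cer                 # ~0.14: the relative reduction in mortality
nntb = 1 / arr                  # 100 patients treated per additional life saved

# Harm side: roughly 0.3% more symptomatic intracranial bleeds with t-PA.
ari_bleeds = 0.003
nnth = 1 / ari_bleeds           # ~333, the text's "about 300"

print(f"ARR = {arr:.0%}, RRR = {rrr:.0%}, NNTB = {nntb:.0f}, NNTH = {nnth:.0f}")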

The number needed to screen to benefit (NNSB) is a related concept that looks at<br />

how many people need to be screened for a disease in order to prevent one additional<br />

death. For example, to prevent one additional death from breast cancer<br />

one must screen 1200 women beginning at age 50. Since the potential outcome<br />

of not detecting breast cancer is very bad and the screening test is not invasive<br />

with very rare side effects, it is a reasonable screening test. We will discuss screening<br />

tests in Chapter 28.<br />

The number needed to expose to harm (NNEH) is the number of patients that<br />

must be exposed to a risk factor in order for one additional person to have the<br />

outcome of interest. This can be a negative outcome such as lung cancer from<br />

exposure to secondhand smoke or a positive one such as reduction in dental<br />

caries from exposure to fluoride in the water. The NNEH to secondhand smoke to<br />

cause one additional case of lung cancer in a non-smoking spouse after 14 years<br />

of exposure is 1300. This NNEH is very high, meaning that very few of the people<br />

who are at risk will develop the outcome. However, the baseline exposure rate is<br />

high, with 25% of the population being smokers and the cost of intervention is<br />

very low, thus making reduction of secondhand smoke very desirable.<br />

For all values of NNTB and other similar numbers, confidence intervals should<br />

be given in studies that calculate these statistics. The formulas for these are very<br />

complex and are given in Appendix 4. There are several convenient NNTB calculators<br />

on the Web. Two recommended sites are those of the University of British<br />

Columbia 1 and the Centre for Evidence-Based Medicine at Oxford University.
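For illustration, here is one rough way to approximate such an interval in Python, using a simple Wald interval for the ARR rather than the exact formulas of Appendix 4, with the sumatriptan data from earlier in the chapter:

import math

def arr_95ci(eer, n_exp, cer, n_ctrl, z=1.96):
    """95% CI for the absolute risk reduction (simple Wald approximation)."""
    arr = abs(eer - cer)
    se = math.sqrt(eer * (1 - eer) / n_exp + cer * (1 - cer) / n_ctrl)
    return arr - z * se, arr + z * se

# Sumatriptan example: EER = 1067/1854, CER = 256/1036.
lo, hi = arr_95ci(1067 / 1854, 1854, 256 / 1036, 1036)
print(f"ARR 95% CI: {lo:.3f} to {hi:.3f}")            # about 0.29 to 0.36
# Inverting the ARR limits gives an approximate CI for the NNTB:
print(f"NNTB 95% CI: {1 / hi:.1f} to {1 / lo:.1f}")   # about 2.8 to 3.4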

Other sources of Type I error<br />

There are three other common sources of Type I error that are seen in research<br />

studies and may be difficult to spot. Authors with a particular bias will do<br />

many things to make their preferred treatment seem better than the comparison<br />

1 www.spph.ubc.ca/sites/healthcare/files/calc/clinsig.html



treatment. Authors may do this because of a conflict of interest, or simply<br />

because they are zealous in defense of their original hypothesis.<br />

An increasingly common device for reporting results uses composite endpoints.<br />

A composite endpoint is the combination of two or more endpoints or<br />

outcome events into one combined event. These are most commonly seen when<br />

a single important endpoint such as a difference in death rates shows results that<br />

are small and not statistically significant. The researcher then looks at other endpoints<br />

such as reduction in recurrence of adverse clinical events. The combination<br />

of both decreased death rates and reduced adverse events may be decreased<br />

enough to make the study results statistically significant. A recent study looked at<br />

the anticoagulant low-molecular-weight heparin (LMWH) for the prevention of<br />

death in certain types of cardiac events such as unstable angina and non-Q-wave<br />

myocardial infarctions. The final results were that death, heart attack, urgent<br />

surgery, or angioplasty revascularization occurred in fewer of the LMWH group<br />

than in the standard heparin group. However, there was no difference between<br />

groups for death. It was only when all the outcomes were put together that the<br />

difference achieved statistical significance. In addition, the LMWH group had<br />

more intracranial bleeds and the NNTB for the composite endpoint was almost<br />

equal to the NNTH for the bleeds.<br />

Sometimes a study will show a non-significant difference between the intervention<br />

and comparison treatment for the overall sample group being studied.<br />

In some cases, the authors will then look at subgroups of the study population to<br />

find one that demonstrates a statistically significant association. This post-hoc<br />

subgroup analysis is not an appropriate way to look for significance and is a form<br />

of data dredging. The more subgroups that are examined, the more likely it is that<br />

a statistically significant outcome will be found – and that it will have occurred<br />

by chance. This can determine a hypothesis for the next study of the same intervention.<br />

In that subsequent study, only that subgroup will be the selected study<br />

population and improvement looked for in that group only.<br />
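A short Monte Carlo sketch makes this multiplicity problem concrete; it assumes the subgroup tests are independent, which is a simplification:

import random

random.seed(1)                       # reproducible illustration

def chance_of_a_spurious_hit(n_subgroups, trials=10_000, alpha=0.05):
    """Monte Carlo estimate of the chance that at least one of several
    truly null subgroups comes out 'significant' at the given alpha."""
    hits = 0
    for _ in range(trials):
        # Under the null, each independent test is positive with probability alpha.
        if any(random.random() < alpha for _ in range(n_subgroups)):
            hits += 1
    return hits / trials

for k in (1, 5, 10, 20):
    print(k, round(chance_of_a_spurious_hit(k), 2))
# Roughly 0.05, 0.23, 0.40, 0.64, i.e. 1 - 0.95**k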

A recent study of stroke found that patients treated with thrombolytic therapy<br />

within 3 hours did better than those treated later than 3 hours. The authors concluded<br />

that this was the optimal time to begin treatment and the manufacturer<br />

began heavily marketing these very expensive and possibly dangerous drugs.<br />

Subsequent studies of patients within this time frame have not found the same<br />

degree of reduction in neurological deficit found in the original study. It turns<br />

out that the determination of the 3-hour mark was a post-hoc subgroup analysis<br />

performed after the data were obtained. The authors looked for some statistically<br />

significant time period in which the drug was effective, and came to rest on<br />

3 hours. To obtain the true answer to this 3-hour mark question, a randomized<br />

controlled clinical trial explicitly looking at this time window should be done to<br />

determine if the results are reproducible.



Finally, there can be a serious problem if a clinical trial is stopped early because<br />

of apparently excellent results early in the study. The researchers may feel that it<br />

is unethical to continue the trial when the results are so dramatic that they have<br />

achieved statistical significance even before the required number of patients<br />

have been enrolled. This is becoming more common, having more than doubled<br />

in the past 20 years. One problem is that there may be an apparently large<br />

treatment effect size initially, when in reality only a few outcome events have<br />

occurred in a small study population. The reader can tell if this is likely to<br />

have happened by looking at the 95% confidence intervals and seeing that they<br />

are very wide, and often barely statistically significant. When a trial is stopped<br />

early, there is also a danger that the trial won’t discover adverse effects of therapy<br />

and the trial will not determine if the side effects are more or less likely to<br />

occur than the beneficial events. One proposed solution to this problem is that<br />

there be prespecified stopping rules. These might include a minimum number of<br />

patients to be enrolled and also a more stringent statistical threshold for stopping<br />

the study. It has been suggested that an α level of 0.001 be set as the statistically<br />

significant level that must be met if the study is stopped early. Even this may not<br />

prevent overly optimistic results from being published, and all research must be<br />

reviewed in the context of other studies of the same problem. If these other studies<br />

are congruent with the results of the study stopped early, it is very likely that<br />

the results are valid. However, if the results seem too good to be true, be careful,<br />

they probably are.


12<br />

Negative studies and Type II errors<br />

If I had thought about it, I wouldn’t have done the experiment. The literature was full of<br />

examples that said you can’t do this.<br />

Spencer Silver on the work that led to the unique adhesives for 3M Post-It® Notepads

Learning objectives<br />

In this chapter you will learn:<br />

how to recognize Type II errors in a study<br />

how to interpret negative clinical trials using 95% confidence intervals<br />

how to use a nomogram to determine the appropriate sample size and<br />

interpret a Type II error<br />

Interpretation of the results of negative clinical trials requires an understanding<br />

of the statistical processes that can account for these results. Intelligent readers<br />

of the medical literature must be able to interpret these results and determine<br />

for themselves if they are important enough to ignore in clinical practice.<br />

The problem with evaluating negative studies<br />

Negative studies are those that conclude that there is no statistically significant<br />

association between the cause and effect variables or no difference between the<br />

two groups being compared. This may occur because there really is no association<br />

or difference between groups, a true negative result, or it can occur<br />

because the study was unable to determine that the association or difference<br />

was statistically significant. If there really is a difference or association, the latter<br />

finding would be a false negative result and this is a critical problem in medical<br />

research.<br />

In a college psychology class, an interesting experiment was done. There were<br />

two sections of students in the lab portion of the class and each section did the<br />




same experiment. On separate days, each student was given a cup of coffee: one day they got real Java and the next day decaf. After drinking the coffee, they were

given a simple test of math problems that had to be completed in a specified<br />

time and each of the students’ scores was then calculated. For both groups, the<br />

scores under the influence of caffeine were highest. However, when a statistical<br />

test was applied to the results, they were not statistically significant, meaning<br />

that the results could have occurred by chance greater than 5% of the time. Does<br />

caffeine improve scores on a simple math test? Are the results really no different<br />

or was the study falsely negative?<br />

Type II error<br />

This type of error occurs when the null hypothesis, H₀, is accepted and no difference

is found between the groups even though the groups truly are different. In<br />

other words, the researcher concludes that there isn’t a difference, when in fact<br />

there is a difference. An example would be concluding there is no relationship<br />

between familial hyperlipidemia and the occurrence of coronary artery disease<br />

when there truly is a relationship. Another would be concluding that caffeine<br />

intake does not increase the math scores of college psychology students when in<br />

fact it does. This is called a β or Type II error.<br />

The researcher should define beta (β) as the maximum probability of making<br />

a Type II error or failing to reject the null hypothesis when it is actually<br />

false. This is a convoluted way of saying that it finds the alternative hypothesis<br />

to be false, when it ain’t! Beta is the probability of the occurrence of this wrong<br />

conclusion that an investigator must be willing to accept. The researcher does<br />

not set β directly. It can be calculated from the expected study results before a<br />

study is done. Practically, the value of β is estimated from the conditions of the<br />

experiment.<br />

Power is the ability to detect a statistically significant difference when it actually<br />

exists. Power is one minus the probability that a type II error is made and is<br />

equal to 1 – β. The researcher can reduce β, and thereby increase the power, by<br />

selecting a sufficiently large sample size (n). Other changes that can be made to<br />

lower the probability of making a Type II or β error include increasing the difference<br />

one wants to detect, using a one-tailed rather than a two-tailed test, and<br />

increasing the level of α from 0.05 to 0.1 or even higher.<br />

Determining power<br />

In statistical terminology, power means that the study will reject the null hypothesis<br />

when it really is false. By convention one sets up the experiment so that β



is no greater than 0.20. Equivalently, power should be more than 0.80 to be considered<br />

adequate for most studies. Remember that a microscope with greater

power will be able to detect smaller differences between cells.<br />

Power depends on several factors. These include the type of variable, statistical<br />

test, degree of variability, effect size, and the sample size. The type of variable<br />

can be dichotomous, ordinal, or continuous, and for a high power, continuous<br />

variables are best. For the statistical test, a one-tailed test has more power than a<br />

two-tailed test. The degree of variability is based on the standard deviation, and

in general, the smaller the standard deviation, the greater the power. The bigger<br />

the better is the basic principle when using the effect size and the sample size<br />

to increase a study’s power. These concepts are directly related to the concept<br />

of confidence discussed in Chapter 10. The confidence formula, confidence = (signal/noise) × √n, can be written as confidence = (effect size/standard deviation) × √n. According to this formula, as effect size or sample size increases, confidence increases, thus the power increases. As the standard deviation increases, confidence decreases and the power decreases.
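These relationships can be sketched numerically. Below is a rough power calculation for a two-sided, two-sample comparison of means using a normal approximation; the function and the numbers are illustrative, not from the text, and assume the scipy library is available:

from scipy.stats import norm

def power_two_sample(effect_size, sd, n, alpha=0.05):
    """Approximate power for a two-sided, two-sample comparison of means
    (normal approximation; n is the number of subjects in each group)."""
    se = sd * (2 / n) ** 0.5               # standard error of the difference
    z_crit = norm.ppf(1 - alpha / 2)       # critical value, e.g. 1.96
    return norm.cdf(effect_size / se - z_crit)

print(round(power_two_sample(effect_size=5, sd=10, n=50), 2))    # ~0.70
print(round(power_two_sample(effect_size=5, sd=10, n=100), 2))   # larger n: ~0.94
print(round(power_two_sample(effect_size=10, sd=10, n=50), 2))   # larger effect: ~1.0
print(round(power_two_sample(effect_size=5, sd=20, n=50), 2))    # larger SD: ~0.24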

Effect of sample size on power<br />

Sample size (n) has the most obvious effect on the power of a study with power<br />

increasing in proportion to the square root of the sample size. If the sample size<br />

is very large, an experiment is more likely to show statistical significance even if<br />

there is a small effect size. This is a purely mathematical issue. The smaller the<br />

sample size, the harder it is to find statistical significance even if one is looking<br />

for a large effect size. Remember the two groups of college psychology students<br />

at the start of this chapter. It turns out, when the scores for the two groups<br />

were combined, the results were statistically significant. Figure 12.1 demonstrates<br />

the effect of increasing sample size to obtain a statistically significant<br />

result.<br />

For example, one does a study to find out if ibuprofen is good for relieving the<br />

pain of osteoarthritis. The results were that patients taking ibuprofen had 50%<br />

less pain than those taking placebo. However, in this case, there were only five

patients in each group and the result, although very large in terms of effect size,<br />

was not statistically significant. If one then repeats the study and gets exactly<br />

the same results with 25 patients in each group, then the result turns out to be<br />

statistically significant. This change in statistical significance occurred because<br />

of an increase in power.<br />

In the extreme, studies of tens of thousands of patients will often find very tiny<br />

effect sizes, such as 1% difference or less, to be statistically significant. This is<br />

the most important reason to use the number needed to treat instead of only<br />

P < 0.05 as the best indicator of the clinical significance of a study result. In




cases like this, although the results are statistically significant, the patient will<br />

most likely have minimal, if any, benefit from the treatment. In terms of confidence<br />

intervals, a larger sample size will lead to narrower 95% confidence<br />

intervals.<br />

Effect of effect size on power<br />

Before an experiment is done, effect size is estimated as the difference between<br />

groups that will be clinically important. The sample size needed to detect the<br />

predetermined effect size can then be calculated. Overall, it is easier to detect<br />

a large effect such as a 90% change rather than a small one like a 2% change<br />

(Fig. 12.2). However, as discussed above, if the sample size is large enough, even a<br />

very small effect size may be statistically significant but not clinically important.<br />

Another technique to be aware of is that the conditions of the experiment can be<br />

manipulated to show a large effect size, but this is usually at the cost of making a<br />

Type III or IV error.<br />

Fig. 12.1 Effect of changing sample size. Two variables with different sample sizes and the same effect size (δ). The area under the curves is proportional to the sample size (n). The samples on the left with a small sample size are not statistically significantly different (p > 0.05). The ones on the right with a larger sample size have an effect size that is statistically significant (p < 0.05).

Effect of level of significance on power<br />

The magnitude of the level of significance, α, tells the reader how willing the<br />

researchers are to have a result that occurred only by chance. If α is large, the<br />

study will have more power to find a statistically significant difference between




Fig. 12.2 Effect of changing effect size. Two variables with different effect sizes and the same<br />

sample size. The results of the group on the left with a small effect size are not statistically<br />

significantly different (p > 0.05). The ones on the right with a larger effect size have a result<br />

that is statistically significant (p < 0.05).<br />


Fig. 12.3 Effect of changing alpha. Two variables with different levels of α. The samples on<br />

the left with a small α (= 0.05) are not statistically significantly different (p > 0.05). The<br />

ones on the right with a larger α (= 0.1) have an effect size that is statistically significant<br />

(p < 0.10).<br />

groups. If α is very small, researchers are willing to accept only a tiny likelihood<br />

that the effect size found occurred by chance alone. In general, as the level of α<br />

increases, we are willing to have a greater likelihood that the effect size occurred<br />

by chance alone (Fig. 12.3). We are more likely to find the difference to be statistically<br />

significant if the level of α is larger rather than smaller. In medicine, we<br />

generally set α at 0.05, while in physics α may be set at 0.0001 or lower. Those<br />

in medicine today who believe that 0.05 is too stringent and we should go to<br />

an α level of 0.1 might not be comfortable knowing that the treatment they were




receiving was better than something cheaper, less toxic, or more commonly used<br />

by a chance factor of 10%.<br />

Effect of standard deviation on power<br />

The smaller the standard deviation of the data-sets, the better the power of the<br />

study. If two samples each have small standard deviations, a statistical test is<br />

more likely to find them different than if they have large standard deviations.<br />

Think of the standard deviation as defining the width of a normal distribution<br />

around the mean value found in the study. When the two normal distributions<br />

are compared, the one with the smallest spread will have the most likelihood of<br />

being found statistically significant (Fig. 12.4).<br />

Fig. 12.4 Effect of changing precision (standard deviation). Two variables with the same sample size and effect size. In the case on the left there is a large standard deviation, while on the right there is a small standard deviation. The situation on the right will be statistically significant (p < 0.05) while the one on the left will not (p > 0.05).

Negative studies<br />

A Type II error can only be made in a negative clinical trial. These are trials<br />

reporting no statistically significant difference or association. Therefore, when<br />

reading negative clinical trials, one needs to assess the chance that a Type II<br />

error occurred. This is important because a negative result may not be due to<br />

the lack of an important effect, but simply because of the inability to detect that<br />

effect statistically. This is called a study with low power. From an interpretation<br />

perspective, the question one asks is, “For a given β level and a difference that<br />

I consider clinically important, did the researcher use a large enough sample<br />

size?”<br />

Since the possibility of a Type II error is a non-trivial problem, one must perform<br />

his or her own interpretation of a negative clinical trial. The three common<br />

ways of doing this are through the interpretation of the confidence intervals, by



using sample size nomograms, and with published power tables. We will discuss<br />

the first two methods since they can be done most simply without specialized<br />

references.<br />

Evaluating negative studies using confidence intervals<br />

Confidence intervals (CI) can be used to represent the level of significance. There<br />

are several rules of thumb that must be remembered before using CIs to determine<br />

the potential of a Type II error. First, if the point estimate value of one<br />

variable is within the 95% CI range of the other variable, there is no statistical<br />

significance to the difference between the two groups. Second, if the 95%<br />

CI for a difference includes 0, the difference found is not statistically significant.<br />

Last, if the 95% CI for a ratio includes 1, the ratio found is not statistically<br />

significant.<br />

Unlike P values, which are only a single number, 95% CIs allow the reader to<br />

actually see a range of possible values that includes the true value with 95% certainty.<br />

For the difference between two groups, it gives the range of the most likely<br />

difference between the two groups under consideration. For a given effect size,<br />

one can look at the relationship between the limits of the CI and the null point,<br />

the point at which there is no difference or association. A 95% CI that is skewed<br />

in one direction and where one end of the interval is very near the null point can<br />

have occurred as a result of low power. In that case, a larger sample size might<br />

show a statistically significant effect.<br />

For example, in a study of the effect of two drugs on pain, the change in the<br />

visual analog score (VAS) was found to be 25mm with a 95% CI from –5mm to<br />

55mm. This suggests that a larger study could find a difference that was statistically<br />

significant, although maybe not as large as 25mm. If one added a few more<br />

patients, the CI would be narrower and would most likely not include the null<br />

point, 0 in this case. If there were no other evidence available, it might be reasonable<br />

to use the better drug until either a more powerful study or a well-done<br />

meta-analysis showed a clear-cut superiority of one treatment over the other,<br />

or showed equivalence of the two drugs. On the other hand, if the 95% CI were<br />

–15mm to 60mm, it would be unlikely that adding even a large number of additional<br />

patients would change the results. There is approximately the same degree<br />

of the 95% CI on either side of the null point, suggesting that the true values are<br />

most likely to be near the null point and less likely to be near either extreme.<br />

In this case, consider the study to be negative, at least until another and much<br />

larger study comes along.<br />
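Because the half-width of a confidence interval shrinks roughly in proportion to 1/√n, the effect of a larger study can be projected. A rough sketch using the VAS example above; it assumes the point estimate and the spread of the data stay the same:

# Project the 95% CI (difference 25 mm, CI -5 to 55 mm) onto a study 4x as large.
diff, lo, hi = 25.0, -5.0, 55.0
n_ratio = 4                                   # new sample is 4 times the old one
half_width = (hi - lo) / 2 / n_ratio ** 0.5   # 30 mm shrinks to 15 mm
print(f"projected 95% CI: {diff - half_width:.0f} to {diff + half_width:.0f} mm")
# -> 10 to 40 mm, which no longer crosses the null point of 0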

The 95% CI can also be used to evaluate positive studies. If the absolute<br />

risk reduction (ARR) for an intervention is 0.05 with a 95% CI of 0.01 to 0.08,<br />

the intervention achieves statistical significance, but barely achieves clinical


significance. In this case, if the intervention is extremely expensive or dangerous, its use should be strongly debated based upon such a small effect size.

Fig. 12.5 Sequence of events for analyzing negative studies using sample size: begin with the sample size that was used in the study and use the nomogram to determine the effect size (δ) that could be found with this sample. If the potential δ is smaller than the clinically important δ, ignore the results (the study lacks power); if the potential δ is larger than the clinically important δ, accept the results.

Evaluating negative studies using a nomogram<br />

There are two ways to analyze the results of a negative study using published<br />

nomograms from an article by Young and others. 1 These begin either with the<br />

sample size or with the effect size. Either method will show, for a study with sufficient<br />

power, what sample size was necessary or what effect size could be found<br />

to produce statistical significance.<br />

In the first method, use the nomogram to determine the effect size that the<br />

sample size of the study had the power to find. Begin with the sample size and<br />

work backward to find the effect size. If the effect size that could potentially have<br />

been found with this sample size was larger than the effect size that a clinician<br />

or patient would consider clinically important, accept the study as negative. In<br />

other words, in this study, the clinically important difference could have been<br />

found and was not. On the other hand, if the clinically important effect size could<br />

not have been found with the sample size that was enrolled, the study was too<br />

small. Ignore the study and consider the result a Type II error. Wait for confirmatory<br />

studies before using the information (Fig. 12.5).<br />

The second way of analyzing a negative study is to determine the sample size<br />

needed to get a clinically important effect size. Use the nomograms starting from<br />

the effect size that one considers clinically important and determine the sample<br />

size that would be needed to find this effect size. This clinically important effect<br />

size will most likely be larger than the actual difference found in the study. If the<br />

actual sample size is greater than the sample size required to find a clinically<br />

important difference, accept the results as negative. The study had the power to find a clinically important effect size and did not. If the actual study sample size is less than the required sample size to find a clinically important difference, ignore the results with the caveats listed below. The study didn't have the power to find a difference that is clinically important (Fig. 12.6).

Fig. 12.6 Schematic of sequence of events for analyzing negative studies using effect size: decide on a clinically important difference and use the nomogram to determine the sample size (n) needed to find this difference. If the actual n is less than the required n, ignore the results (the study lacks power); if the actual n is greater than the required n, accept the results.

1 M. J. Young, E. A. Bresnitz & B. L. Strom. Sample size nomograms for interpreting negative clinical studies. Ann. Intern. Med. 1983; 99: 248–251.

There are some caveats which must be considered in using this method to<br />

evaluate negative studies. If the needed sample size is huge, it is unlikely that<br />

a group that large can ever be studied, so accept the results as a negative study.<br />

If the needed sample size is within about one order of magnitude greater than<br />

the actual sample size, wait for the bigger study to come along before using the<br />

information. This process is illustrated in Fig. 12.7 (dichotomous variables) and

Fig. 12.8 (continuous variables). The CD-ROM has some sample problems that<br />

will help you understand this process.<br />

Using a nomogram for dichotomous variables<br />

Dichotomous variables are those for which there are only two possible values<br />

(e.g., cured or not cured).<br />

(1) Identify one group as the control group and the other as the experimental<br />

group, which should be evident from the study design.<br />

(2) Decide what relative rate reduction (RRR) would be clinically important.<br />

(3) RRR = (CER – EER)/CER, where CER = control event rate and EER = experimental<br />

event rate.<br />

(4) Locate this % change on the horizontal axis.<br />

(5) Extend a vertical line to intersect with the diagonal line representing the percentage<br />

response rate of the control group (CER).<br />

(6) Extend a horizontal line from the intersection point to the vertical axis and<br />

read the required sample size (n) for each group.<br />
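The steps above can be cross-checked with a standard sample-size formula for two proportions. A sketch under the same assumptions as the nomogram (α = 0.05 two-sided, power = 0.80); the function is illustrative and assumes scipy is available:

from scipy.stats import norm

def n_per_group(cer, rrr, alpha=0.05, power=0.80):
    """Approximate n per group for two proportions (normal approximation,
    two-sided alpha), given a control event rate and the RRR to detect."""
    eer = cer * (1 - rrr)                    # expected experimental event rate
    p_bar = (cer + eer) / 2                  # pooled event rate
    z_a = norm.ppf(1 - alpha / 2)            # 1.96 for alpha = 0.05
    z_b = norm.ppf(power)                    # 0.84 for power = 0.80
    numerator = (z_a * (2 * p_bar * (1 - p_bar)) ** 0.5
                 + z_b * (cer * (1 - cer) + eer * (1 - eer)) ** 0.5) ** 2
    return numerator / (cer - eer) ** 2

# Fig. 12.7's worked example: 60% control response rate, 20% RRR.
print(round(n_per_group(0.60, 0.20)))        # ~270 per group, in the same
                                             # range as the nomogram's >200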

Using a nomogram for continuous variables<br />

Continuous variables are those for which multiple possible values can exist and<br />

which have proportional intervals.


Fig. 12.7 Nomogram for dichotomous variables (α = 0.05 two-sided and β = 0.20; horizontal axis: relative rate reduction, or increase one wants to find, RRR (%); vertical axis: sample size, n; diagonal lines: percentage responding in the control group). If a study found a 20% relative risk reduction and there was a 60% response rate in the control group (vertical line), you would find this effect size statistically significant only if there was a sample size of more than 200 in each group (horizontal line). If the actual study had only 100 patients in each group and found a 20% relative risk reduction, which was not statistically significant, you should wait until a slightly larger study (200 per group) is done. After M. J. Young, E. A. Bresnitz & B. L. Strom. Sample size nomograms for interpreting negative clinical studies. Ann. Intern. Med. 1983; 99: 248–251 (used with permission).

Fig. 12.8 Nomogram for continuous variables (α = 0.05 two-sided and β = 0.20; horizontal axis: actual effect size; vertical axis: sample size, n; diagonal lines: standard deviation, SD). If a study found a difference of 1 unit and the control group had a standard deviation of 2 (vertical line), you would find this effect size statistically significant only if there was a sample size of more than 70 per group (horizontal line). If the actual study found an effect size of only 0.5, and you thought that was clinically but not statistically significant, you would need to wait for a larger study (about 250 in each group) to be done before accepting that this was a negative study. After M. J. Young, E. A. Bresnitz & B. L. Strom. Sample size nomograms for interpreting negative clinical studies. Ann. Intern. Med. 1983; 99: 248–251 (used with permission).



(1) Decide what difference (absolute effect size) is clinically important.<br />

(2) Locate this difference on the horizontal axis.<br />

(3) Extend a vertical line to the diagonal line representing the standard deviation<br />

of the data being measured. You can use the SD for either group.<br />

(4) Extend a horizontal line to the vertical axis and read the required sample size<br />

(n) for each group.
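Here too the nomogram can be cross-checked with the standard approximation n per group ≈ 2(z₁₋α/₂ + z₁₋β)²(SD/δ)². A sketch in Python (assuming scipy; the function is illustrative):

from scipy.stats import norm

def n_per_group_continuous(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a difference in means of delta,
    given standard deviation sd (normal approximation, two-sided alpha)."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)    # 1.96 + 0.84
    return 2 * (z * sd / delta) ** 2

# Fig. 12.8's worked examples: difference of 1 unit (SD = 2), then 0.5 units.
print(round(n_per_group_continuous(1.0, 2.0)))   # ~63 (the nomogram reads ~70)
print(round(n_per_group_continuous(0.5, 2.0)))   # ~251 (the nomogram reads ~250)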

Non-inferiority studies and equivalence studies<br />

Sometimes a goal of a research study can be to determine that the experimental<br />

treatment is no worse than the standard treatment or placebo. In that<br />

case, an approach has been suggested that only seeks to show non-inferiority<br />

of the experimental therapy to the comparison. In these studies, the null hypothesis is that the experimental therapy is inferior to the standard therapy, and the alternative hypothesis is that the experimental treatment is equal to or better than the standard, placebo, or control treatment. In order for this study

to be done, there must have been previous research studies showing that when<br />

compared to standard therapy or placebo, there is either no difference or the<br />

results were not statistically significant. It is also possible that there was a difference<br />

but the studies were of very poor quality, possibly lacking correct randomization<br />

and blinding so that the majority of physicians would not accept the<br />

results.<br />

It is important for the reader to recognize that what the authors are essentially<br />

saying is that they are willing to do a one-tailed test for showing that the treatment<br />

is equal to or better than the control or placebo group. This leads to a value<br />

of P for statistical significance on one tail that should be less than 0.05 rather<br />

than the traditional 0.025. The standard two-tailed statistical tests should not be<br />

done as they are more likely to lead to a failure to find statistical significance,

which in this case would be most likely a Type II error. In other words, they will<br />

most likely find that there is no difference in the groups when in fact there is a<br />

difference. Non-inferiority studies are most often seen in drug studies used by<br />

manufacturers to demonstrate that a new drug is at least as good as the standard<br />

drugs that are available. Of course, common sense would dictate that if a new<br />

drug is more expensive than a standard one and if it does not have a track record<br />

of safety, there ought to be no reason to use the new drug simply because it is not<br />

inferior.


13<br />

Risk assessment<br />

We saw the risk we took in doing good,<br />

But dared not spare to do the best we could.<br />

Robert Frost (1874–1963): The Exposed Nest<br />

Learning objectives<br />

In this chapter you will learn:<br />

the basic concept and measures of risk<br />

the meanings, calculations, uses, and limitations of:<br />

absolute risk<br />

relative risk<br />

odds ratios<br />

attributable risk and number needed to harm<br />

attributable risk percent<br />

the use of confidence intervals in risk<br />

how to interpret the concept of “zero risk”<br />

Risk is present in all human activities. What is the risk of getting breast cancer if<br />

a woman lives on Long Island and is exposed to organochlorines? What is the<br />

risk of getting lung cancer because there is a smoke residue on a co-worker’s<br />

sweater? What is the risk of getting paralyzed as a result of spinal surgery? How<br />

about the risk of getting diarrhea from amoxicillin? Some of these risks are real<br />

and others are, at best, minimally increased risks of modern life. Risks may be<br />

those associated with a disease, with therapy, or with common environmental<br />

factors. Physicians must be able to interpret levels of risk for better care of their<br />

patients.<br />




Measures of risk<br />

First one must understand that risk is the probability that an event, disease, or<br />

outcome will occur in a particular population. The absolute risk of an event,<br />

disease, or outcome in exposed subjects is defined as the ratio of patients who<br />

are exposed to the risk factor and develop the outcome of interest to all those<br />

patients exposed to the risk. For example, if we study 1000 people who drink<br />

more than two cups of coffee a day and 60 of them develop pancreatic cancer,<br />

the risk of developing pancreatic cancer among people drinking more than two<br />

cups of coffee a day is 60/1000 or 6%. This can also be written as a conditional<br />

probability: P{outcome | risk} = probability of the outcome if exposed to the risk

factor. The same calculation can be done for people who are not exposed to the<br />

risk and who nevertheless get the outcome of interest. Their absolute risk is the<br />

ratio of those not exposed to the risk factor and who have the outcome of interest<br />

to all those not exposed to the risk factor.<br />

Risk calculations can help us in many clinical situations. They can help associate<br />

an etiology such as smoking to an outcome such as lung cancer. Risk calculations<br />

can estimate the probability of developing an outcome such as the<br />

increased risk of endometrial cancer because of exposure to estrogen therapy.<br />

They can demonstrate the effectiveness of an intervention on an outcome such<br />

as showing a decreased mortality from measles in children who have been vaccinated<br />

against the disease. Finally, they can target interventions that are most<br />

likely to be of benefit. For example, they can measure the effect of aspirin as<br />

opposed to stronger blood thinners like heparin or low-molecular-weight heparin<br />

on mortality from heart attacks.<br />

The data used to estimate risk come from research studies. The best estimates<br />

of risk come from randomized clinical trials (RCTs) or well done cohort studies.<br />

These studies can separate groups by the exposure and then measure the risk<br />

of the outcome. They can also be set up so that the exposure precedes the outcome,<br />

thus showing a cause and effect relationship. The measure of risk calculated<br />

from these studies is called the relative risk, which will be defined shortly.<br />

Relative risk can also be measured from a cross-sectional study, but the cause<br />

and effect cannot be shown from that study design. Less reliable estimates of<br />

risk may still be useful and can come from case–control studies, which start with<br />

the assumption that there are equal numbers of subjects with and without the<br />

outcome of interest. The estimates of risk from these studies approximate the<br />

relative risk calculated from cohort studies using a calculation known as an odds<br />

ratio, which will also be defined shortly.<br />

There are several measures associated with any clinical or epidemiological<br />

study of risk. The study design determines which way the data are gathered and<br />

this determines the type of risk measures that can be calculated from a given


study. Patients are initially identified either by exposure to the risk factor as in cohort studies or RCTs, by their outcome as in case–control studies, or by both as in cross-sectional studies. These are summarized in Fig. 13.1.

Fig. 13.1 A pictorial way to look at studies of risk. Note the difference in sampling direction for different types of studies: direction of sampling A (cohort study or RCT) runs along the rows; direction of sampling B (case–control study) runs down the columns.

                     Disease present (D+)   Disease absent (D−)
Risk present (R+)             a                       b             a + b
Risk absent (R−)              c                       d             c + d
                            a + c                   b + d           n (= a + b + c + d, the population)

Absolute risk<br />

Absolute risk is the probability of the outcome of interest in those exposed or<br />

not exposed to the risk factor. It compares those with the outcome of interest<br />

and the risk factor (a) to all subjects in the population exposed to the risk factor<br />

(a + b). In probabilistic terms, it is the probability of the outcome if exposed to<br />

the risk factor, also written as P{outcome | risk} = P(O+ | R+). One can also do

this for patients with the outcome of interest who are not exposed to the risk factor<br />

(c) and compare them to all of those who are not exposed to the risk factor<br />

[c/(c + d)]. Probabilistically it is written as P{outcome | no risk} = P(O+ | R−).

Absolute risk only gives information about the risk of one group, either those<br />

exposed to the risk factor or those not exposed to the risk factor. It can only be<br />

calculated from cross-sectional studies, cohort studies, or randomized clinical<br />

trials, because in these study designs, you can calculate the incidence of a particular<br />

outcome for those exposed or not exposed to the risk factor. One must<br />

know the relative proportions of the factors in the total population in order to<br />

calculate this number, as demonstrated in the rows of the 2 × 2 table in Fig. 13.1.



Fig. 13.2 Absolute risk (calculated from the 2 × 2 table of Fig. 13.1, sampling along the rows as in a cohort study or RCT):
Absolute risk (if exposed to risk factor) = number exposed and with outcome / number exposed = a/(a + b) = P{outcome | risk present}
Absolute risk (if not exposed to risk factor) = number not exposed and with outcome / number not exposed = c/(c + d) = P{outcome | risk absent}

The absolute risk is the probability that someone with the risk factor has the<br />

outcome of interest. In the 2 × 2 diagram (Fig. 13.2), patients labeled a are those<br />

with the risk factor who have the outcome and those labeled a + b are all patients<br />

with the risk factor. The ratio a/(a + b) is the probability that one will have the<br />

outcome if exposed to the risk factor. This is a statement of conditional probability.<br />

The same can be done for the row of patients who were not exposed to the<br />

risk factor. The absolute risk for them can be written as c/(c + d). These absolute<br />

risks are the same as the incidence of disease in the cohort being studied.<br />

Relative risk<br />

Relative risk (RR) is the ratio of the two absolute risks. This is the absolute risk<br />

of the outcome in subjects exposed to the risk factor divided by the absolute risk<br />

of the outcome in subjects not exposed to the risk factor. It shows whether that<br />

risk factor increases or decreases the outcome of interest. In other words, it is the<br />

ratio of the probability of the outcome if exposed to the probability of the outcome<br />

if not exposed. Relative risk can only be calculated from cross-sectional<br />

studies, cohort studies or randomized clinical trials. The larger or smaller<br />

the relative risk, the stronger the association between the risk factor and the<br />

outcome.<br />

If the RR is greater than 1, the risk factor is associated with an increase in the<br />

rate of the outcome. If the RR is less than 1, the risk factor is associated with a<br />

reduction in the rate of the outcome. If it is 1, there is no change in risk from the<br />

baseline risk level and it is said that the risk factor has no effect on the outcome.<br />

The higher the relative risk, the stronger the association that is discovered. A relative<br />

risk greater than 4 is usually considered very strong. Values below this could<br />

have been obtained because of systematic flaws in the study. This is especially<br />

true for observational studies like cross-sectional and cohort studies where there<br />

may be many confounding variables that could be responsible for the results. In


studies showing a reduction in risk, look for RR to be less than 0.25 for it to be considered a strong result.

Fig. 13.3 Relative risk (calculated from the 2 × 2 table of Fig. 13.1, sampling along the rows as in a cohort study or RCT):
Relative risk = (incidence of outcome in exposed group)/(incidence of outcome in non-exposed group) = P{outcome | risk} ÷ P{outcome | no risk} = [a/(a + b)]/[c/(c + d)]

A high relative risk does not prove that the risk factor is responsible for outcome:<br />

it merely quantifies the strength of association of the two. It is always possible<br />

that a third unrecognized factor, a surrogate or confounding variable, is<br />

the cause of the association because it equally affects both the risk factor and the<br />

outcome. The calculation of relative risk is pictured in Fig. 13.3.<br />

Data collected for relative-risk calculations come from cross-sectional studies,<br />

cohort studies, non-concurrent cohort studies, and randomized clinical<br />

trials. These studies are used because they are the only ones capable of calculating<br />

incidence. Importantly, cohort studies should demonstrate complete<br />

follow-up of all study subjects, as a large drop-out rate may lead to invalid<br />

results. The researchers should allow for an adequate length of follow-up in order<br />

to ensure that all possible outcome events have occurred. This could be years<br />

or even decades for cancer while it is usually weeks or days for certain infectious<br />

diseases. This follow-up cannot be done in cross-sectional studies, which<br />

can only show the strength of association but not that the cause preceded the<br />

effect.<br />

Odds ratio<br />

An odds ratio is the calculation used to estimate the relative risk or the association<br />

of risk and outcome for case–control studies. In case–control studies, subjects<br />

are selected based upon the presence or absence of the outcome of interest.

This study design is used when the outcome is relatively rare in the population<br />

and calculating relative risk would require a cohort study with a huge number of<br />

subjects in order to find enough patients with the outcome. In case–control studies,<br />

the number of subjects selected with and without the outcome of interest are<br />

independent of the true ratio of these in the population. Therefore the incidence,<br />

the rate of occurrence of new cases of each outcome associated with and without


the risk factor, cannot be calculated. Relative risk cannot be calculated from this study design.

Fig. 13.4 Odds ratio (calculated from the 2 × 2 table of Fig. 13.1, sampling down the columns as in a case–control study):
Odds of having risk factor if outcome is present = a/c
Odds of having risk factor if outcome is not present = b/d
Odds ratio = (a/c)/(b/d) = ad/bc. This is also called the “cross product”.

Odds are a different way of saying the same thing as probabilities. Odds tell<br />

someone the number of times an event will happen divided by the number of<br />

times it won’t happen. Although they are different ways of expressing the same<br />

number, odds and probability are mathematically related. In case–control studies,<br />

one measures the individual odds of exposure in subjects with the outcome<br />

as the ratio of subjects with and without the risk factor among all subjects with<br />

that outcome. The same odds can be calculated for exposure to the risk factor<br />

among those without the outcome.<br />

The odds ratio compares the odds of having the risk factor present in the subjects<br />

with and without the outcome under study. This is the odds of having the<br />

risk factor if a person has the outcome divided by the odds of having the risk factor<br />

if a person does not have the outcome. Overall, it is an estimate of the relative<br />

risk for case–control studies (Fig. 13.4).<br />

Using the odds ratio to estimate the relative risk

The odds ratio best estimates the relative risk when the disease is very rare. The rationale for this is not intuitively obvious. Cohort-study patients are evaluated on the basis of exposure and then the outcome is determined. Therefore, one can calculate the absolute risk, or the incidence of disease if the patient is or is not exposed to the risk factor, and subsequently the relative risk can be calculated. Case–control study patients are evaluated on the basis of outcome, and exposure is then determined. The true ratio of patients with and without the outcome in the general population cannot be known from the study, but is an arbitrary ratio set by the researcher. One can only look at the ratio of the odds of risk in the diseased and non-diseased groups, hence the odds ratio.

In the cohort study, the relative risk (RR) is [a/(a + b)]/[c/(c + d)]. If the disease or outcome is very rare, a is very small compared to b and c is very small compared to d, so that (a + b) ≈ b and (c + d) ≈ d. The relative risk then becomes approximately (a/b)/(c/d) = ad/bc, which is exactly the odds ratio. In the case–control study, only this odds ratio can be calculated.
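A quick numerical check of this rare-disease approximation, using hypothetical cohort counts:

```python
# Hypothetical cohort counts for a rare outcome: a << b and c << d.
a, b, c, d = 10, 9990, 5, 9995

rr = (a / (a + b)) / (c / (c + d))  # relative risk from the cohort layout
or_ = (a * d) / (b * c)             # odds ratio (cross product)
print(f"RR = {rr:.3f}, OR = {or_:.3f}")  # RR = 2.000, OR = 2.001
```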

Attributable risk

Fig. 13.6 Attributable risk and the number needed to harm (cohort study or RCT: the direction of sampling is from risk factor to outcome).

                     Disease present (D+)   Disease absent (D−)
Risk present (R+)    a                      b                      a + b
Risk absent (R−)     c                      d                      c + d
                     a + c                  b + d                  n

Attributable risk percent (ARP) = {[a/(a + b)] − [c/(c + d)]}/[c/(c + d)]
Absolute attributable risk (AAR) = [a/(a + b)] − [c/(c + d)]
Number needed to treat to harm (NNTH) = 1/AAR

Attributable risk is the difference in the risk of the outcome between those exposed and those not exposed to the risk factor. It can also be reported relative to those exposed to the risk factor. It tells you how much of the change in risk is due to the risk factor, either absolutely or relative to the risk in the control group. For example, 95% of cases of lung cancer can be attributed to smoking. This percentage is the risk of cases of lung cancer relative to people who don't smoke. The attributable risk of lung cancer in non-smokers would be 5%, and it is the absolute attributable risk divided by the absolute risk in smokers. Attributable risk can only be calculated from cross-sectional studies, cohort studies, or randomized clinical trials, since these can provide good measurement of the incidence of the outcome. This construct tries to quantify the contribution of other unidentifiable risk factors to the differences in outcomes between exposed and non-exposed groups.

Attributable risk quantitates the contribution of the risk factor in producing the outcome in those exposed to the risk factor. It is helpful in calculating the cost–benefit ratio of eliminating the risk factor from the population. Absolute attributable risk, also known as the absolute risk increase, is analogous to the absolute risk reduction between the control and experimental event rates that was mentioned in the previous chapters. It allows for the calculation of the number needed to treat to harm (NNTH = 1/AAR or 1/ARI). This was previously called the number needed to harm (NNH). It tells us how many people need to be exposed before one additional person will be harmed or one additional harmful outcome will occur (Fig. 13.6).
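The formulas in Fig. 13.6 translate directly into code. A minimal sketch with hypothetical cohort counts:

```python
# Hypothetical cohort/RCT 2x2 counts (Fig. 13.6 layout).
a, b, c, d = 30, 970, 10, 990

exposed_rate = a / (a + b)     # incidence in those exposed to the risk factor
unexposed_rate = c / (c + d)   # incidence in those not exposed

aar = exposed_rate - unexposed_rate   # absolute attributable risk (absolute risk increase)
arp = aar / unexposed_rate            # attributable risk percent, as defined in Fig. 13.6
nnth = 1 / aar                        # number needed to treat to harm

print(f"AAR = {aar:.3f}, ARP = {arp:.0%}, NNTH = {nnth:.0f}")
# AAR = 0.020, ARP = 200%, NNTH = 50
```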

Putting risk into perspective

A large increase in relative risk may represent a clinically unimportant increase in personal risk. This is especially true if the outcome is relatively rare in the population. For instance, several years ago there was a concern that the influenza vaccine could cause a serious and potentially fatal neurologic syndrome called Guillain–Barré syndrome (GBS). This syndrome consists of progressive weakness of the muscles of the body in an ascending pattern. It is usually reversible, but may require a period of time on a ventilator receiving artificial respiration. There were 74 cases of GBS related to the influenza vaccine in 1993–1994. The odds ratio for that season was 1.5, meaning a 50% increase in the number of cases. Since the base incidence of this disease is approximately two in one million, even a 10-fold increase in risk would have little impact on the general population. This risk needed to be balanced against the number of lives saved by the influenza vaccine. That number is thousands of times greater than the small increased risk of GBS with the vaccine. Although the news of this possible reaction was alarming to many patients, it had very little clinical significance.

Conversely, a small increase in relative risk may represent a clinically important increase in personal risk if the outcome is common in the population. For example, if an outcome has an incidence of 12 in 100, increasing the risk by a factor of 1.5, the same 50% increase as seen in the previous example, will have a significant impact on the general population. In this case, the examination of all possible outcome data is necessary to determine if eliminating the risk is associated with appropriate gains. For example, it is known that the use of conjugated estrogens in postmenopausal women can reduce the rate of osteoporosis, but these estrogens are associated with an increased risk of endometrial carcinoma. Would the decreased morbidity and mortality due to osteoporosis balance the increase in morbidity and mortality due to endometrial cancer among women using conjugated estrogens? Good clinicians must be able to interpret these risks for patients and help them make an informed decision.

Confidence intervals give an idea of the relative precision of a study result. They are derived from the standard error of the relative risk or odds ratio. They should always be reported whenever relative risks or odds ratios are reported! Small, or as the statisticians say, tight, confidence intervals suggest that the sampling error due to random events is small, leading to a very precise result. A large confidence interval is also called loose and suggests that there is a lot of random error, leading to a very imprecise result. For example, if the RR is 2 and the CI is 1.01 to 6, there is indeed an association, but it may be very strong (6) or very weak (1.01). Remember, if the confidence interval for a relative risk or odds ratio includes the number 1, there is no statistical association between risk factor and outcome. Statistically this is equivalent to a study result with P > 0.05.

The confidence interval allows someone to look at the spread of the results and interpret the strengths and weaknesses of the results. Loose confidence intervals should suggest a need for more research. Usually they represent small samples, and the addition of one or two new events could dramatically change the numbers. Very tight intervals that are close to one suggest a high degree of precision in the result, but also a low strength of association, which may not be clinically important.
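The text does not give a formula for computing these intervals. One standard approach works on the log scale, where the standard error of ln(OR) is approximately sqrt(1/a + 1/b + 1/c + 1/d); a minimal sketch with hypothetical counts:

```python
import math

# Hypothetical 2x2 counts.
a, b, c, d = 45, 80, 15, 150

or_ = (a * d) / (b * c)
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)  # standard error of ln(OR)

lower = math.exp(math.log(or_) - 1.96 * se_log_or)
upper = math.exp(math.log(or_) + 1.96 * se_log_or)
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
# If the interval includes 1, the association is not statistically significant.
```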


Reporting relative risk and odds ratios

Over the past 15 years, more and more epidemiologic cohort and case–control studies have been reporting their results in terms of relative risks and odds ratios. The intelligent consumer of the medical literature will be able to determine whether these resulting measures of risk were used correctly. Sometimes these measures are not used correctly, as illustrated below.

A recent example of this was a report in the New England Journal of Medicine about the effect of race and gender on physician referral for cardiac catheterization. 1,2 The original study reported that physicians, when given standardized scenarios, were more likely to refer white men, white women, and black men than black women for evaluation of coronary artery disease (CAD). The newspapers reported that blacks and women were 40% less likely to be referred for cardiac catheterization than whites and men. The actual study showed that 90.6% of the white men, white women, and black men were referred, while 78.8% of the black women were referred. The authors incorrectly calculated the odds ratios for these numbers and came up with an odds ratio of 0.4. The actual odds associated with a 90.6% probability are 9.6 to 1, while those associated with a 78.8% probability are 3.7 to 1. When the data were recalculated for men and women or whites and blacks, the results showed that men were referred more often (90.6%) than women (84.7%) and whites (90.6%) more often than blacks (84.7%). The odds here were men (9.6), women (5.5), whites (9.6), and blacks (5.5), making the odds ratio for both of these comparisons equal to 0.6.
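The conversion between probability and odds used in this example is easy to verify; a short sketch:

```python
def odds(p):
    """Convert a probability to odds (times the event happens per time it doesn't)."""
    return p / (1 - p)

print(round(odds(0.906), 1))   # 9.6 -> referral odds for white men, white women, black men
print(round(odds(0.788), 1))   # 3.7 -> referral odds for black women
print(round(odds(0.788) / odds(0.906), 1))  # 0.4 -> the odds ratio that was reported
print(round(0.847 / 0.906, 2))              # 0.93 -> the relative risk for the comparison
```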

However, there were two problems with these numbers. First, the outcome was not rare in the diseased group. All of the groups were equal in size and the outcome was not rare in the general population. This distorts the odds ratio as an approximation of the relative risk. Second, the study was a clinical trial with the risk factors of race and gender being the independent variables and the referral for catheterization the dependent variable. Therefore, the relative risk and not the odds ratio should have been calculated. Had this been done, the relative risk for white vs. black and men vs. women was 0.93 with the 95% CI from 0.89 to 0.99. Not only is the risk much smaller than reported in the news, but it approaches the null point, suggesting lack of clinical significance or the possibility of a Type I error. Ultimately, the original report using odds ratios led to a distortion in reporting of the study by the media.

1 K. A. Schulman, J. A. Berlin, W. Harless, J. F. Kerner, S. Sistrunk, B. J. Gersch, R. Dubé, C. K. Taleghani, J. E. Burke, S. Williams, J. M. Eisenberg & J. J. Escarce. The effect of race and sex on physicians' recommendations for cardiac catheterization. N. Engl. J. Med. 1999; 340: 618–626.
2 L. M. Schwartz, S. Woloshin & H. G. Welch. Misunderstandings about the effects of race and sex on physicians' referrals for cardiac catheterization. N. Engl. J. Med. 1999; 341: 279–283.


A user's guide to the trials of harm or risk

The following standardized set of methodological criteria can be used for the critical assessment of a trial studying risk, also called harm. It is based upon the Users' Guides to the Medical Literature published by JAMA and used with permission. 3 The University of Alberta (www.med.ualberta.ca/ebm) has online worksheets for evaluating articles of therapy that use this guide, and they are available as free-use documents.

(1) Was the study valid?
(a) Except for the exposure under study, were the compared groups similar to each other? Was this an RCT, a cross-sectional, cohort, or case–control study? Were other known prognostic factors similar or adjusted for?
(b) Were the outcomes and exposures measured in the same way in the compared groups? Was there recall or interviewer bias? Was the exposure opportunity similar?
(c) Was follow-up sufficiently long and complete? What were the reasons for incomplete follow-up?
(d) Were risk factors similar in those lost and not lost to follow-up?
(e) Is the temporal relationship correct? Did the exposure precede the outcome?
(f) Is there a dose–response relationship? Did the risk of the outcome increase with the quantity or duration of the exposure?
(2) What are the results?
(a) How strong is the association between exposure and outcome? What are the RRs or ORs? Was the correct measure of risk used for the study? RR should be used for cross-sectional, cohort, or randomized studies, and OR for case–control studies.
(b) How precise is the estimate of risk? Were there wide or narrow confidence intervals?
(c) If the study results were negative, did the study have a sufficiently large sample size?
(3) Will the results help me in patient care?
(a) Can the results be applied to my patients? Were patients similar for demographics, severity, co-morbidity, and other prognostic factors?
(b) Are treatments and exposures similar?
(c) What is the magnitude of the risk? What is the absolute increase and its reciprocal, the NNTH?
(d) Should I attempt to stop the exposure? How strong is the evidence? What is the magnitude of the risk? Are there any adverse effects of reducing exposure?

3 G. H. Guyatt & D. Rennie (eds.). Users' Guides to the Medical Literature: a Manual for Evidence-Based Practice. Chicago: AMA, 2002. See also Bibliography.



What does a zero numerator mean? Is there ever zero risk?

What if you read a study that found no instances of a particular outcome? A zero numerator does not mean that there is no risk. One can still infer an estimate of the potential size of the risk. There is an excellent article by Hanley and Lippman-Hand that shows how to handle this eventuality. 4 Their example is used here.

Suppose a given study shows no adverse events in 14 consecutive patients. What is the largest number of adverse events we can reasonably expect? What we are doing here is calculating the upper limit of the 95% CI for this sample. The rule of three can be used to determine this risk. The maximum event rate that can be expected when no events have been observed is 3/n. For this study finding no adverse events in 14 patients, the upper limit of the 95% CI is 3/14 = 21.4%. One could expect to see as many as one adverse event in every 5 patients and still have come up with no events in the 14 patients in the initial study.

Assume that the study of 14 patients resulted in no adverse outcomes. What if in reality there is an adverse outcome rate of 1:1000? The probability of no adverse events in one patient is 1 minus the probability of at least one adverse event in one patient. Another way of writing this is p(no adverse event in one patient) = 1 − p(at least one adverse event in one patient). This makes the probability of no adverse events = 1 − 0.001 = 0.999. Therefore p(no adverse events in n patients) is 0.999^n. For 14 patients this is 0.986, so there is a 98.6% chance that in 14 patients we would find no adverse outcome events.

Now suppose that the actual rate of adverse outcomes is 1:100. Then p(no adverse outcome in one patient) = 1 − 0.01 = 0.99, and p(no adverse events in 14 patients) = 0.99^14. This means that there is an 86.9% chance that we would find no adverse outcome in these 14 patients. We can continue to reduce the actual adverse event rate to 1:10, and using the same process we get p(no adverse events in 14 patients) = 0.90^14, so there is a 22.9% chance that we would find no adverse outcome events in these 14 patients.

Similarly, for an actual rate of 1:5, p(no adverse events in 14 patients) = 0.8^14, or 4.4%, and for an actual rate of 1:6 the corresponding probability is 7.8%. Therefore the upper limit of the 95% CI lies between event rates of 1:5 and 1:6. The rate estimated by our rule of three for adverse events is 3/n = 3/14 = 1/4.7 = 21.4%. The exactly calculated value is 1 − 0.05^(1/14) = 19.3%, or about 1/5.2.

Mathematically, one must solve the equation (1 − maximum risk)^n = 0.05 to find the upper limit of the 95% CI. Solving the equation for the maximum risk gives 1 − maximum risk = 0.05^(1/n), so the maximum risk = 1 − 0.05^(1/n). For n > 30, 0.05^(1/n) is close to (n − 3)/n, making the maximum risk ≈ 1 − [(n − 3)/n] = 3/n. The actual numbers are shown in Table 13.1. One can use a similar process to approximate the upper limit of the 95% CI if there are 1, 2, 3, or 4 events in the numerator. Table 13.2 gives the estimate of the maximum number of events one might expect if the actual number of events found is from 0 to 4.

4 J. A. Hanley & A. Lippman-Hand. If nothing goes wrong, is everything all right? Interpreting zero numerators. JAMA 1983; 249: 1743–1745.

Table 13.1. Actual vs. estimated rates of adverse events if there is a zero numerator

Rate found in study    Exact 95% CI    Rule of 3/n
0/10                   26%             30%
0/20                   14%             15%
0/30                   10%             10%
0/100                  3%              3%

Table 13.2. Approximate maximum event rate for small numerators

Number of events in the numerator    Estimate of maximum number of events
0                                    3/n
1                                    4/n
2                                    5/n
3                                    7/n
4                                    9/n
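A minimal sketch in Python comparing the rule-of-three approximation with the exact calculation derived above:

```python
# Compare the rule-of-three approximation with the exact upper 95% limit
# for a zero numerator, r_max = 1 - 0.05**(1/n).
for n in (10, 14, 20, 30, 100):
    exact = 1 - 0.05 ** (1 / n)
    rule_of_three = 3 / n
    print(f"0/{n}: exact {exact:.1%}, rule of three {rule_of_three:.1%}")
# 0/14 gives an exact 19.3% vs. 21.4% from the rule of three.
```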

For example, studies of head-injured patients to date have shown that none of the 2700 low-risk patients, those with laceration only or a bump without loss of consciousness, headache, vomiting, or change in neurological status, had any intracranial bleeding or swelling. Therefore, the largest risk of intracranial injury in these low-risk patients would be 3/2700 = 1/900 = 0.11%. This is the upper limit of the 95% confidence interval.

To find the upper limit of the 99% CI, use a rule of 4.6/n, which can be derived in a similar manner. Table 13.3 gives the 95% CIs for extreme results with a variety of sample sizes.

Table 13.3. 95% confidence limits on extreme results

If the            And the % is 0, the true    And the % is 100, the true
denominator is    % could be as high as       % could be as low as
10                26%                         74%
20                14%                         86%
30                10%                         90%
40                7%                          93%
50                6%                          94%
60                5%                          95%
70                4%                          96%
80                4%                          96%
90                3%                          97%
100               3%                          97%
150               2%                          98%
300               1%                          99%

General observations on the nature of risk

Most people don't know how to make reasonable judgments about the nature of risk, even in terms of risks that they know they are exposed to. This was articulated in 1662 by the Port Royal monks in their treatise about the nature of risk. If people did have this kind of judgment, very few people would be smoking. There are several important biases that come into play when talking about risk. The physician should be aware of these when discussing risks with a patient.

People are more likely to risk a poor outcome if it is due to voluntary action rather than imposed action. They are likely to smoke and accept the associated risks because they think it is their choice rather than an addiction. Similarly, they will accept risks that they feel they have control over rather than risks controlled by others. Because of this, people are much more likely to be very upset when they find out that their medication causes a very uncommon, but previously known, side effect.

One only has to read the newspapers to know that there are more stories on the front page about catastrophic accidents like plane crashes or fatal automobile accidents than about minor automobile accidents. This is also true of medical situations. Patients are more willing to accept the risk of death from cancer or sudden cardiac death than death due to unforeseen complications of routine surgery. If there is a clear benefit to avoiding a particular risk, for example that one shouldn't drink poison, patients are more likely to accept a bad outcome if they engage in that risky behavior. A major exception to this rule is cigarette smoking, because of the social nature of smoking and the addictive nature of nicotine.

People are democratic about their perception of risk. They are more willing to accept risk that is distributed to all people rather than risk that is biased toward some people. Natural risks are more acceptable than man-made risks. There is a perception that man-made objects ought not to fail, while if there is a natural disaster it is God's will. Risk that is generated by someone in a position of trust, such as a doctor, is less acceptable than that generated by someone not in that position, like one's neighbor. We are more accepting of risks that are likely to affect adults than of those primarily affecting children, of risks that are more familiar over those that are more exotic, and of random events like being struck by lightning rather than catastrophes such as a storm without adequate warning.



14 Adjustment and multivariate analysis

Stocks have reached what looks like a permanently high plateau.
Irving Fisher, Professor of Economics, Yale University, 1929

Learning objectives
In this chapter you will learn:
the essential features of multivariate analysis
the different types of multivariate analysis
the limitations of multivariate analysis
the concept of propensity scoring
the Yule–Simpson paradox

Studies of risk often look at situations where there are multiple risk factors associated with a single outcome, which makes it hard to determine whether a single statistically significant result is a chance occurrence or a true association between cause and effect. Since most studies of risk are observational rather than interventional studies, confounding variables are a significant problem. There are several ways of analyzing the effect of these confounding variables. Multivariate analysis and propensity scores are methods of evaluating data to determine the strength of any one of multiple associations uncovered in a study. They are attempts to reduce the influence of confounding variables on the study results.

What is multivariate analysis?

Multivariate analysis answers the question "What is the importance of one risk factor for the risk of a disease, when controlling for all other risk factors that could contribute to that disease?" Ideally, we want to quantitate the added risk for each individual risk factor. For example, in a study of lipid levels and the risk for coronary artery disease, it was found that after adjusting for advancing age, smoking, elevated systolic blood pressure, and other factors, there was a 19% decrease in coronary heart disease risk for each 8% decrease in total cholesterol level.

In studies of diseases with multiple etiologies, the dependent variable can be affected by multiple independent variables. In the example described above, coronary heart disease is the dependent variable. Smoking, advancing age, elevated systolic blood pressure, other factors, and cholesterol levels are the independent variables. The process of multivariate analysis looks at the changes in magnitude of risk associated with each independent variable when all the other contributing independent variables are held fixed.

In studies using multivariate analysis, the dependent variable is most often an outcome variable. Some of the most commonly used outcome variables are incidence of new disease, death, time to death, and disease-free survival. In studies involving small populations or uncommon outcomes, there may not be enough outcome endpoints for analysis. In these cases, composite variables are often used to get enough outcome endpoints to enable a valid statistical analysis to be done. The independent variables are the risk factors that are suspected of influencing the outcome.

How multivariate analysis works: determining risk

Multivariate analysis looks at the changes in magnitude of the risk of a dependent variable associated with each suspected risk factor when the other suspected risk factors are held fixed. A schematic example of how this works can be seen in Fig. 14.1. 1

If more variables are to be adjusted for, further division into even smaller groups must be done. This is shown in Fig. 14.2. One will notice that as more and more variables are added, the number of patients in each cell of every 2 × 2 table gets smaller and smaller. This will result in the confidence intervals of each odds ratio or relative risk getting larger and larger.

Fig. 14.1 The method of adjusting for a single variable in a multivariate analysis.

For the entire group, arranged in a 2 × 2 table (D+/D− by R+/R−) with cells A, B, C, and D:
RR = [A/(A + B)]/[C/(C + D)] and OR = AD/BC.

Separate the subjects into two groups by the presence or absence of a potential confounding risk factor, for instance age < 65 or age > 65.

For the age < 65 group (cells A', B', C', D'): RR' = [A'/(A' + B')]/[C'/(C' + D')] and OR' = A'D'/B'C'.
For the age > 65 group (cells A'', B'', C'', D''): RR'' = [A''/(A'' + B'')]/[C''/(C'' + D'')] and OR'' = A''D''/B''C''.

Combine the results statistically to create an overall adjusted RR or OR.

1 Demonstrated to me by Karen Rossnagel from the Institute of Social Medicine, Epidemiology and Health Economics of the Charité University Medical Center in Berlin, Germany.

What can multivariate analysis do?

Some studies will look at multiple risk factors to determine which are most important in making a diagnosis or predicting the outcome of a disease. The output of these studies is often the result of a multivariate analysis. Although this can suggest which variables are most important, those important variables should be specifically evaluated in more detail in another study. The important variables

are referred to as the derivation set, and if the statistical significance found initially is still present after the multivariate analysis, it is less likely to be due to a Type I error. The researchers still need to do a follow-up or validation study to verify that the association did not occur purely by chance. Multivariate analysis can also be used for data dredging to confirm statistically significant results already found as a result of simple analysis of multiple variables. Finally, multivariate analysis can combine variables and measure the magnitude of effect of different combinations of variables on the outcome.
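The "combine the results statistically" step shown in Fig. 14.1 can be done in several ways; one classic choice is the Mantel–Haenszel summary odds ratio. A minimal sketch, with hypothetical counts for two age strata:

```python
# Each stratum is a 2x2 table (a, b, c, d) laid out as in Fig. 14.1.
# Hypothetical counts: (age < 65 stratum, age > 65 stratum).
strata = [(20, 80, 10, 90), (40, 60, 25, 75)]

def mantel_haenszel_or(tables):
    """Summary odds ratio across strata: sum(a*d/n) / sum(b*c/n)."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in tables)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in tables)
    return num / den

print(f"Adjusted (Mantel-Haenszel) OR = {mantel_haenszel_or(strata):.2f}")  # 2.09 here
```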

There are four basic types of multivariate analysis, depending on the type of outcome variable. Multiple linear regression analysis is used when the outcome variable is continuous. Multiple logistic regression analysis is used when the outcome variable is a binary event like alive vs. dead, or disease-free vs. recurrent disease. Discriminant function analysis is used when the outcome variable is categorical, such as better, worse, or about the same. Proportional hazards regression analysis (Cox regression) is used when the outcome variable is the time to the occurrence of a binary event. An example of this is the time to death or time to tumor recurrence among treated cancer patients.

Fig. 14.2 Two confounding variables tested to see if the relationship between risk and outcome would still be true. The overall 2 × 2 table (D+/D− by R+/R−) is first separated into two groups by the presence or absence of the first potential confounding risk factor (age < 65 or > 65), and each of those groups is separated further by the presence or absence of the second potential confounding risk factor (systolic blood pressure < 150 or > 150 mm Hg), giving four 2 × 2 tables.
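As an illustration of how a multiple logistic regression is typically fitted in practice, here is a minimal sketch using the statsmodels library on synthetic data; the variable names and coefficients are invented purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

# Synthetic independent variables: age (years), smoker (0/1), systolic BP (mm Hg).
age = rng.uniform(40, 80, n)
smoker = rng.integers(0, 2, n)
sbp = rng.normal(140, 20, n)

# Synthetic binary outcome generated so that all three variables matter.
logit = -12 + 0.08 * age + 0.9 * smoker + 0.03 * sbp
outcome = rng.random(n) < 1 / (1 + np.exp(-logit))

X = sm.add_constant(np.column_stack([age, smoker, sbp]))
fit = sm.Logit(outcome.astype(float), X).fit(disp=False)

# exp(coefficient) is the adjusted odds ratio for each independent variable,
# i.e., its effect with the other variables held fixed.
print(np.exp(fit.params[1:]))
```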

Assumptions and limitations

There are several types of problems associated with the interpretation of the results of multivariate analysis. These include overfitting, underfitting, linearity, interaction, concomitance, coding, and outliers. All of these can produce error during the process of adjustment and should be considered by the author of the study.

Overfitting occurs when too many independent variables allow the researcher to find a relationship when in fact none exists. Overfitting leads to a Type I error. For example, in a cohort of 1000 patients there are 20 deaths due to cancer. If there are 15 baseline characteristics considered as independent variables, it is likely that one or two will produce a result which has statistical significance by chance alone. As a rule of thumb, there should be at least 10, and some statisticians say at least 20, outcome events per independent variable of importance for statistical tests to be valid. In the example here, with only 20 outcome events, adjustment for one or at most two independent variables is all that should be done. Overfitting of variables is characterized by large confidence intervals for each outcome measure.
by large confidence intervals for each outcome measure.


Fig. 14.3 Non-linear curves and the effect of crossing curves (percent surviving over time for treatments A and B, whose survival curves cross).

Underfitting occurs when there are too few outcome events to find a difference that actually exists. Underfitting causes a Type II error. For example, a study of cigarette smokers followed 200 patients, of whom two got lung cancer over 10 years. This may not have been a long enough time to follow the cohort, and the number of cancer cases is too small to find a relationship between smoking and lung cancer. Too few cases of the outcome of interest may make it impossible to find any statistical relationship with any of the independent variables. Like overfitting, underfitting of variables is also characterized by large confidence intervals. To minimize the effects of underfitting, the sample size should be large enough for there to be at least 10, and preferably 20, outcome events for each independent variable chosen.

Linearity assumes that a linear relationship exists between the independent and dependent variables, and this is not always true. Linearity means that a change in the independent variable always produces the same proportional change in the dependent variable. If this is not true, one cannot use linear regression analysis. In the Cox method of proportional hazards, the increased risk due to an independent variable is assumed to be constantly proportional over time. This means that when the risks of two treatments are plotted over time, the curves will not cross. If there is a crossover (Fig. 14.3), the early survival advantage of treatment B may not be noted, since the initial improvement in survival in that group may be cancelled out by the later reduction in survival.

Interaction between independent variables must be evaluated. For example, smoking and oral contraceptive (OC) use are both risk factors for pulmonary embolism in young women. When considering the risk of both of these factors, it turns out that they interact: the risk of pulmonary embolism is greater in smokers using OCs than with either risk factor alone. In cases like this, the study should include enough patients with simultaneous presence of both risk factors so that the adjustment process can determine the degree of interaction between the independent variables.


Concomitance refers to a close relationship between variables. If two apparently closely related independent variables are being evaluated, only one should be used unless it can be shown that there is actually no relationship between them. If one measures both ventricular ejection fraction and ventricular contractility and correlates them with cardiovascular mortality, it is possible that one will get redundant results. In most cases, both independent variables will predict the dependent variable, but it is possible that only one variable would be predictive when in fact they both ought to give the same result. This is an example of concomitance. Researchers should use the variable that is most important clinically as the primary independent variable. In this example, ventricular ejection fraction is easier to measure clinically and therefore more useful in a study.

Coding of the independent variables can affect the final result in unpredictable ways. For example, if age is used as an independent variable, the results of a study will likely differ depending on whether it is recorded in 1-year intervals, in 10-year intervals, or as a dichotomous value such as less than or greater than 65. There should always be a clear explanation of how the independent variables were coded for the analysis and why that method of coding was chosen. One can suspect that the authors selected the coding scheme that led to the best possible results, and one should be skeptical when reading studies in which this information is not explicitly given. Some authors might participate in post-hoc coding as a method of data dredging.

Outliers are influential observations that occur when one data point or a group of points clearly lies outside the majority of the data. These should be explained during the discussion of the results, and an analysis that includes and excludes these points should be presented. Outliers can be caused by error in the way the data are measured or by extreme biological variation in the sample. A technique called stratified analysis can be used to evaluate outliers.

In the evaluation of any study using multivariate analysis, the standard processes of critical appraisal should be followed. There should be an explicit hypothesis, the data collection should be done in an objective, non-biased, and thorough manner, and the software package used should be specified. An excellent overview of multivariate analysis is by J. Concato and others. 2

Finally, it may not be possible to completely identify all of the confounders present in a study, especially when studying multifactorial chronic illnesses. Any study that uses multivariate analysis should be followed up with a study that looks specifically at those factors that are most important.

2 J. Concato, A. R. Feinstein & T. R. Holford. The risk of determining risk with multivariable models. Ann. Intern. Med. 1993; 118: 201–210.



Propensity scores

Propensity scores are another mathematical method for adjusting the results of a study to attempt to decrease the effect of confounders. They were developed specifically to counteract the selection bias that can occur in an observational study, where patients may be selected based upon characteristics that are not explicitly described in the methods of the study. They have become a popular tool for adjustment over the past few years. Standard adjustment is done after the final results of the study are complete. Propensity scores are used before any calculations are done and typically use a scoring system to create different levels of likelihood, or propensity, for placing a particular patient into one or the other group. Patients with a high propensity score are those most likely to get the therapy being tested when compared to those with a low propensity score. The propensity score can then be used to stratify the results and determine whether one group will actually have a different result than the other groups. Usually the groups being compared are the ones with the highest or lowest propensity scores. Patients who are likely to benefit the most from the chosen therapies will have the highest propensity scores. If a study is done using a large sample including patients who are less likely to benefit from the therapy, the study results may not be clinically or statistically important. But if the data are reanalyzed using only those groups with high propensity scores, it may be possible to show that there is improvement and justify the use of the drug, at least in the group most likely to respond positively. The main problem with propensity scores is that the external validity of the result is limited. Ideally, the treatment should only be used for groups that have the same propensity scores as the group in the study. Those with much lower propensity scores should not have the drug used for them unless a study shows that they would also benefit from the drug.

Another use of propensity scores is to determine the effect of patients who drop out of a research study. The patients' propensity to attain the outcome of interest can be calculated using this score. Be aware that if there are too many coexisting confounding variables, it is unlikely that these approximations are reasonable and valid. One downfall of propensity scores is that they are often used as a means of obtaining statistically significant results, which are then generalized to all patients who might meet the initial study inclusion criteria. Propensity scores should be critically evaluated using the same rules applied to multivariate analysis as described at the start of this chapter.
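A minimal sketch of the propensity-score idea on synthetic data (all names and numbers here are invented): fit a model predicting who receives the therapy, take the predicted probability as the propensity score, and compare outcomes within score strata rather than overall:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n = 2000

# Synthetic covariates; sicker patients (higher severity) are more likely to be treated.
severity = rng.normal(0, 1, n)
age = rng.normal(65, 10, n)
treated = (rng.random(n) < 1 / (1 + np.exp(-1.5 * severity))).astype(int)
outcome = (rng.random(n) < 1 / (1 + np.exp(-(severity - 0.5 * treated)))).astype(int)

# Propensity score: predicted probability of receiving the therapy.
X = np.column_stack([severity, age])
ps = LogisticRegression().fit(X, treated).predict_proba(X)[:, 1]

# Compare outcomes within propensity-score strata (quintiles) rather than overall.
edges = np.quantile(ps, [0.2, 0.4, 0.6, 0.8])
for s in range(5):
    in_stratum = np.digitize(ps, edges) == s
    t, u = in_stratum & (treated == 1), in_stratum & (treated == 0)
    if t.any() and u.any():
        print(f"stratum {s}: treated {outcome[t].mean():.2f} vs untreated {outcome[u].mean():.2f}")
```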

Yule–Simpson paradox

This statistical anomaly was discovered independently by Yule in 1903 and rediscovered by Simpson in the 1950s. It states that it is possible for one of two groups to be superior overall and for the other group to be superior in multiple subgroups. For example, one hospital has a lower overall mortality rate while a second, competing hospital has a higher overall mortality rate but lower mortality in the various subgroups, such as high-risk and low-risk patients. This is a purely mathematical phenomenon that occurs when there are large discrepancies in the sizes of these two subgroups between the two hospitals. Table 14.1 below demonstrates how this might occur.

Ideally, adjustment of the data should compensate for the potential for the Yule–Simpson paradox. However, this is not always possible, and it is certainly reasonable to assume that particular factors may be more important than others and that these may not be adjusted for in the data. Readers should be careful to determine that all important factors have been included in the adjustments, and still consider the possibility of the Yule–Simpson paradox if the results are fairly close together or if discrepant results occur for subgroups.

Table 14.1. Yule–Simpson paradox: mortality of patients with pneumonia in two hospitals (a)

Characteristic    High risk patients    Low risk patients    Total mortality
Hospital A        30/100 = 30%          1/10 = 10%           31/110 = 28%
Hospital B        6/10 = 60%            20/100 = 20%         26/110 = 24%

(a) Hospital A has lower mortality for each of the subgroups while Hospital B has lower total mortality.
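The arithmetic in Table 14.1 can be verified in a few lines:

```python
# Mortality counts from Table 14.1: (deaths, patients) per subgroup.
hospital_a = {"high risk": (30, 100), "low risk": (1, 10)}
hospital_b = {"high risk": (6, 10), "low risk": (20, 100)}

for name, hosp in (("Hospital A", hospital_a), ("Hospital B", hospital_b)):
    deaths = sum(d for d, n in hosp.values())
    patients = sum(n for d, n in hosp.values())
    rates = ", ".join(f"{k} {d/n:.0%}" for k, (d, n) in hosp.items())
    print(f"{name}: {rates}, total {deaths/patients:.0%}")
# Hospital A wins both subgroups (30% vs 60%, 10% vs 20%),
# yet Hospital B has the lower total mortality (24% vs 28%).
```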


15 Randomized clinical trials

One pill makes you larger,
and one pill makes you small.
And the ones your mother gives you,
don't do anything at all.
Grace Slick, The Jefferson Airplane: White Rabbit, from Surrealistic Pillow, 1967

Learning objectives
In this chapter you will learn:
the unique features of randomized clinical trials (RCTs)
how to undertake critical interpretation of RCTs

The randomized clinical trial (RCT) is the ultimate paradigm of clinical research. Many consider the RCT to be the most important medical development of the twentieth century, as their results are used to dictate clinical practice. Although these trials are often put on a pedestal, it is important to realize that, as with all experiments, there may be flaws in the design, implementation, and interpretation of these trials. The competent reader of the medical literature should be able to evaluate the results of a clinical trial in the context of the potential biases introduced into the research experiment, and determine if it contains any fatal flaws.

Introduction

The clinical trial is a relatively recent development in medical research. Prior to the 1950s, most research was based upon case series or uncontrolled observations. James Lind, a surgeon in the British Navy, can claim credit for performing the first recorded clinical trial. In 1747, aboard the ship Salisbury, he took 12 sailors with scurvy and divided them into six groups of two each. He made sure they were similar in every way except for the treatment they received for scurvy.

Dr. Lind found that the two sailors who were given oranges and lemons got better while the other ten did not. After that trial, the process of the clinical trial went relatively unused until it was revived with studies of the efficacy of streptomycin for the treatment of tuberculosis done in 1948. The randomized clinical trial, or randomized controlled trial, has remained the premier source of new knowledge in medicine since then.

A randomized clinical trial is an experiment. In an RCT, subjects are randomly assigned to one of two or more therapies and then treated in an identical manner for all other potential variables. Subjects in an RCT are just as likely or unlikely to get the therapy of interest as they are to get the comparator therapy. Ideally, the researchers are blinded to the group to which the subjects are allocated, and the randomization code is not broken until the study is finally completed. There are variations on this theme using blinded safety committees to determine if the study should be stopped. Sometimes it is warranted to release the results of a study that is stopped early because it showed a huge benefit and continuing the study would not be ethical.

Physician decision making and RCTs

There are several ways that physicians make decisions on the best treatment for their patients. Induction is the retrospective analysis of uncontrolled clinical experience, or the extension of the expected mechanism of disease as taught in pathophysiology. It is doing that which "seems to work," "worked before," or "ought to work." "Abdication" or seduction is doing something because others say that it is the right thing to do. These others may be teachers, consultants, colleagues, advertisements, pharmaceutical representatives, authors of medical textbooks, and so on. One accepts their analysis of the medical information on faith, and this dictates what one actually does for his or her patient.

Deduction is the prospective analysis and application of the results of critical appraisal of formal randomized clinical trials. This method of decision making will successfully withstand formal attempts to demonstrate the worthlessness of a proven therapy. Therapy proven by well-done RCTs is what physicians should be doing for their patients, and it is what medical students should integrate into clinical practice for the rest of their professional lives. One note of caution belongs here: it is not possible to have an RCT for every question about medicine. Some diseases are so rare, or some therapies so dangerous, that it is unlikely that a formal large RCT will ever be done to answer that clinical query. For these types of questions, observational studies or less rigorous forms of evidence may need to be applied to patients.


Table 15.1. Schema for randomized clinical trials

Ultimate objective                      Specific treatment       Target disorder
cure                                    drug therapy             disease
reduce mortality                        surgery                  illness
prevent recurrence                      other therapies          predicament
limit deterioration                     nutrition
relieve distress                        prevention
deliver reassurance                     psychological support
allow the patient to die comfortably
There are three global issues to identify when evaluating an RCT (Table 15.1). These are (1) the ultimate objective of treatment, (2) the nature of the specific treatment, and (3) the treatment target. The ultimate objective of treatment must be defined before the commencement of the trial. While we want therapy to cure and eliminate all traces of disease, more often than not other outcomes will be sought. Therapy can reduce mortality or prevent a treatable death, prevent recurrence, limit structural or functional deterioration, prevent later complications, relieve the current distress of disease including pain in the terminal phase of illness, or deliver reassurance by confidently estimating the prognosis. These are all very different goals, and any study should specify which ones are being sought.

After deciding on the specific outcome one wishes to achieve, one must then decide which element of sickness the therapy will most affect. This is not always the disease or the pathophysiologic derangement itself. It may be the illness experience of the patient, or how that pathophysiologic derangement affects the patient through the production of certain signs and symptoms. Finally, it could also be how the illness directly or indirectly affects the patient through disruption of the social, psychological, and economic function of their lives.

Characteristics of RCTs

The majority of RCTs are drug studies or studies of therapy. Often, researchers or drug companies are trying to prove that a new drug is better than drugs that are currently in use for a particular problem. Other researched treatments can be surgical operations, physical or occupational therapy, procedures, or other modalities to modify illness. We will use the example of drug trials for most of this discussion; however, any other medical question can be substituted for the subject of an RCT. The basic rules to apply to critically evaluate RCTs are covered in the following pages.

Hypothesis

The study should contain a hypothesis regarding the use of the drug in the general medical population or the specific population tested. There are two basic types of drug-study hypotheses: first, the drug can be tested against placebo, or second, the drug can be tested against another regularly used active drug for the same indication. "Does the drug work better than nothing?" looks at how well the drug performs against a placebo or inert treatment. The placebo effect has been shown to be relatively consistent over many studies and has been approximated to account for up to 35% of the treatment effect. A compelling reason to compare the drug against a placebo would be in situations where there is a question of the efficacy of standard therapies. The use of Complementary and Alternative Medicines (CAM) is an example of testing against placebo, and it can often be justified since the CAM therapy is expected to be less active than standard medical therapy. Testing against placebo would also be justified if the currently used active drug has never been rigorously tested against active therapy. Otherwise, the drug being tested should always be compared against an active drug that is in current use for the same indication and is given in the correct dose for the indication being tested.

The other possibility is to ask "Does the drug work against another drug which has been shown to be effective in the treatment of this disease in the past?" Beware of comparisons of drugs being tested against drugs not commonly used in clinical practice, given in inadequate dosage, or given by uncommon routes of administration. These caveats also apply to studies of medical devices, surgical procedures, or other types of therapy. Blinding is difficult in studies of modalities such as procedures and medical devices, and should be done by a non-participating outside evaluation team. Another way to study these modalities is by "expert-based" randomization. In this method, various practitioners are selected as the basis of randomization, and patients enrolled in the study are randomized to the practitioner rather than the modality.

When ranking evidence, the well-done RCT with a large sample size is the highest level of evidence for populations. A subgroup of RCTs called the n-of-1 trial is stronger evidence for the individual patient and will be discussed later. An RCT can reduce the uncertainty surrounding conflicting evidence obtained from lesser-quality studies, as illustrated in the following example. Over the past 20 years, there were multiple studies that demonstrated decreased mortality if magnesium was given to patients with acute myocardial infarction (AMI). Most of these studies were fairly small and individually showed no statistically significant improvement in survival. However, when they were combined in a single systematic review, also called a meta-analysis, there was definite statistical and clinical improvement. Since then, a single large randomized trial called ISIS-4 enrolled thousands of patients with AMI and showed no beneficial effect attributable to giving magnesium. It is therefore very unlikely that magnesium therapy would benefit AMI patients.

RCTs are the strongest research design capable of proving cause-and-effect relationships. The cause is often the treatment, preventive medicine, or diagnostic test being studied, and the effect is the outcome of the disease being treated, disease prevention by early diagnosis, or disease diagnosis by a test. The study design alone does not guarantee a quality study, and a poorly designed RCT can give false results. Thus, just like all other studies, critical evaluation of the components is necessary before accepting the results.

The hypothesis is usually found at the end of the introduction. Each study should contain at least one clearly stated, unambiguous hypothesis. One type of hypothesis to be aware of is a single hypothesis attempting to prove multiple cause-and-effect relationships. This cannot be analyzed with a single statistical test and will lead to data dredging. Multiple hypotheses can be analyzed with multivariate analysis, and the risks noted in Chapter 14 should be considered when analyzing these studies. The investigation should be a direct test of the hypothesis, although occasionally it is easier and cheaper to test a substitute hypothesis. For example, drug A is studied to determine its effect in reducing cardiovascular mortality, but what is measured is its effect on exercise-stress-test performance. In this case, the exercise-stress-test performance is a surrogate outcome and is not necessarily related to the outcome in which most patients are interested, mortality.

Inclusion and exclusion criteria

Inclusion and exclusion criteria for subjects should be clearly spelled out so that anyone reading the study can replicate the selection of patients. These criteria ought to be sufficiently broad to allow generalization of the study results from the study sample to a large segment of the population. This concept is also called external validity and was discussed in Chapter 8. The source of patients recruited into the study should minimize sampling or referral bias. For instance, patients selected from specialty health-care clinics often are more severely ill or have more complications than most patients. They are not typical of all patients with a particular disease, so the results of the RCT may not be generalizable to all patients with the disorder. A full list of the reasons for patients' exclusion, the number of patients excluded for each reason, and the methods used to determine exclusion criteria must be defined in the study. Additionally, these reasons should have face validity. Commonly accepted exclusions are patients with rapidly fatal diseases that are unrelated to the disease being studied, those with absolute or relative contraindications to the therapy, and those likely to be lost to follow-up. Beware if there are too many subjects excluded without sound reasons, as this may be a sign of bias.

Randomization

Randomization is the key to the success of the RCT. The main purpose of randomization is to create study groups that are equivalent in every way except for the intervention being studied. Proper randomization means subjects have an equal chance of inclusion in any of the study groups. By making the groups as equal as possible, the researcher seeks to limit potential confounding variables. If these factors are equally distributed in both groups, bias due to them is minimized.
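As an illustration of how equal-chance assignment is often implemented in practice, here is a minimal sketch of permuted-block randomization, a common scheme, though not one the text specifically prescribes; the block size and group labels are arbitrary choices:

```python
import random

def permuted_block_randomization(n_subjects, block_size=4, seed=None):
    """Assign subjects to 'treatment'/'control' in shuffled blocks of equal halves."""
    rng = random.Random(seed)
    assignments = []
    while len(assignments) < n_subjects:
        block = ["treatment"] * (block_size // 2) + ["control"] * (block_size // 2)
        rng.shuffle(block)  # the order within each block is unpredictable
        assignments.extend(block)
    return assignments[:n_subjects]

# Equal group sizes are guaranteed after every complete block.
print(permuted_block_randomization(10, seed=42))
```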

Some randomization schemes have the potential for bias. The date of admission<br />

to hospital, location of bed in hospital (Berkerson’s bias), day of birth, and<br />

common physical characteristics such as eye color, all may actually be confounding<br />

variables and result in unequal qualities of the groups being studied. The first<br />

table in most research papers is a comparison of baseline variables of the study<br />

and control groups. This documents the adequacy of the randomization process.<br />

In addition, statistical tests should be done to show the absence of statistically<br />

significant differences between groups. Remember that the more characteristics<br />

looked at, the higher the likelihood that one of them will show differences<br />

between groups, just by chance alone. The characteristics listed in this first table<br />

should be the most important ones or those most likely to confound the results<br />

of the study.<br />
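The arithmetic behind this warning is simple: if each baseline characteristic is tested independently at the conventional 5% level, the chance that at least one shows a spurious “significant” difference grows quickly with the number of characteristics compared, as this short illustrative calculation shows.

```python
# Chance of at least one false-positive baseline difference when k
# independent characteristics are each tested at alpha = 0.05.
for k in (1, 5, 10, 20):
    print(f"{k:2d} characteristics: P(>=1 spurious difference) = {1 - 0.95**k:.2f}")
```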

Allocation of patients to the randomization scheme should be concealed.<br />

This means that the process of randomization itself is completely blinded. If a<br />

researcher knew to which study group the next patient was going to be assigned,<br />

it would be possible to switch their group assignment. This can have profound<br />

effects on the study results, acting as a form of selection bias. Patients who<br />

appeared to be sicker could be assigned to the study group preferred by the<br />

researcher, resulting in better or worse results for that group. Current practice<br />

requires that the researcher state whether allocation was concealed. If this is

not stated, it should be assumed that it was not done and the effect of that bias<br />

assessed.<br />

There are two newer randomization schemes that merit consideration as methods of answering increasingly complex questions of efficacy. The first is to allow

all patients requiring a particular therapy to choose whether they will be randomized<br />

or be able to freely choose their own therapy. The researchers can then<br />

compare the group that chose randomization with the group that chose to self-select

their therapy. This has an advantage if the outcome of the therapy being


170 <strong>Essential</strong> <strong>Evidence</strong>-Based <strong>Medicine</strong><br />

studied has strong components of quality of life measures. It answers the question<br />

of whether the patient’s choice of being in a randomized trial has any effect<br />

on the outcome when compared with the possibility of having either of the<br />

therapies being studied as a choice. Another method of randomization is using<br />

expertise as the point of randomization. In this method, the patients are not randomized,<br />

but the therapy is randomized by the provider, with one provider or<br />

group of providers using one therapy and another, the comparison therapy. This<br />

is a useful method for studying surgical techniques or complementary and alternative<br />

medicine therapies.<br />

Blinding<br />

Blinding prevents confounding variables from affecting the results of a study. If<br />

all subjects, treating clinicians, and observers are blinded to the treatment being<br />

given during the course of the research, any subjective effects that could lead to<br />

biased results are minimized. Blinding prevents observer bias, contamination,<br />

and cointervention bias in either group. Lack of blinding can lead to finding an<br />

effect where none exists, or vice versa. No matter how honest, researchers may<br />

subconsciously tend to find what they want to find. Ideally, tests for adequacy of<br />

blinding should be done in any RCT. The simplest test is to ask participants if they

knew which therapy they were getting. If there is no difference in the responses<br />

between the two groups, the blinding was successful and there is not likely to be<br />

any bias in the results due to lack of blinding.<br />
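As a sketch of how such a check might be analyzed, the participants’ guesses can be tabulated by arm and compared; the counts below are hypothetical, and the comparison uses SciPy’s chi-square test of a contingency table.

```python
from scipy.stats import chi2_contingency

# Hypothetical blinding check: in each arm, how many participants
# guessed they received "treatment" vs. "control".
guesses = [[40, 60],   # treatment arm
           [38, 62]]   # control arm
chi2, p, dof, expected = chi2_contingency(guesses)
print(f"p = {p:.2f}")  # a large p suggests similar guessing in both arms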

Some types of studies make blinding challenging, although it can still be done. Studies of different surgical methods or operations can be blinded by using sham operations. This has been performed successfully and in some cases has shown that standard therapeutic surgical procedures were not particularly beneficial.

A recent series of studies showed that when compared to sham arthroscopic<br />

surgery for osteoarthritis, actual arthroscopic surgery had no benefit on<br />

outcomes such as pain and disability. Similar use of sham with acupuncture<br />

showed an equal degree of benefit from real acupuncture and sham acupuncture,<br />

with both giving better results than patients treated with no acupuncture.<br />

A recent review of studies of acupuncture for low back pain found that there was<br />

a dramatic effect of blinding on the outcomes of the studies. The non-blinded<br />

studies found acupuncture to be relatively useful for the short-term treatment of<br />

low back pain with a very low NNTB. However, when blinded studies were analyzed,<br />

no such effect was found and the results, presented in Table 15.2, were not<br />

statistically significant. 1<br />

1 E. Ernst & A. R. White. Acupuncture for back pain: a meta-analysis of randomized controlled trials.<br />

Arch. Intern. Med. 1998; 158: 2235–2241.



Table 15.2. Effects of acupuncture on short-term outcomes in back pain

Type of study | Number of trials | Improved with acupuncture (%) | Improved with control (%) | Relative benefit (95% CI) | NNT (95% CI)
Blinded       | 4                | 73/127 (57)                   | 61/123 (50)               | 1.2 (0.9 to 1.5)          | 13 (5 to no benefit)
Non-blinded   | 5                | 78/117 (67)                   | 33/87 (38)                | 1.8 (1.3 to 2.4)          | 3.5 (2.4 to 6.5)
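The relative benefit and NNT columns can be reproduced directly from the raw counts; here is a short sketch using the numbers from Table 15.2.

```python
def relative_benefit_and_nnt(impr_rx, n_rx, impr_ctl, n_ctl):
    """Relative benefit (ratio of improvement rates) and the number
    needed to treat (reciprocal of the absolute rate difference)."""
    rate_rx, rate_ctl = impr_rx / n_rx, impr_ctl / n_ctl
    return rate_rx / rate_ctl, 1 / (rate_rx - rate_ctl)

for label, counts in [("Blinded", (73, 127, 61, 123)),
                      ("Non-blinded", (78, 117, 33, 87))]:
    rb, nnt = relative_benefit_and_nnt(*counts)
    print(f"{label}: relative benefit = {rb:.1f}, NNT = {nnt:.1f}")
```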

Description of methods<br />

The methods section should be so detailed that the study could be duplicated by<br />

someone uninvolved with the study. The intervention must be well described,<br />

including dose, frequency, route, precautions, and monitoring. The intervention<br />

also must be reasonable in terms of current practice, since if the intervention being tested is compared to a non-standard therapy, the results

will not be generalizable. The availability, practicality, cost, invasiveness, and<br />

ease of use of the intervention will also determine the generalizability of the<br />

study. In addition, if the intervention requires special monitoring it may be too<br />

expensive and difficult to carry out and therefore, impractical in most ordinary<br />

situations.<br />

Instruments and measurements should be evaluated using the techniques discussed<br />

in Chapter 7. Appropriate outcome measures should be clearly stated,<br />

and their measurements should be reproducible and free of bias. Observers<br />

should be blinded and should record objective outcomes. If there are subjective<br />

outcomes measured in the study, use caution. Subjective outcomes don’t<br />

automatically invalidate the study and observer blinding can minimize bias from<br />

subjective outcomes. Measurements should be made in a manner that ensures<br />

consistency and maximizes objectivity in the way the results are recorded. For<br />

statistical reasons, beware of composite outcomes, subgroup analysis, and post-hoc cutoff points, which can all lead to Type I errors.

The study should be clear about the method, frequency, and duration of<br />

patient follow-up. All patients who began the study should be accounted for at<br />

the end of the study. This is important because patients may leave the study for<br />

important reasons such as death, treatment complications, treatment ineffectiveness,<br />

or compliance issues, all of which will have implications on the application<br />

of the study to a physician’s patient population. As a rough guide, a study attrition rate greater than 20% may invalidate the final results. However,

even a smaller percentage of patient drop-outs may affect the results of a<br />

study if not taken into consideration. The results should be analyzed with an<br />

intention-to-treat analysis or using a best case/worst case analysis.



Analysis of results<br />

The preferred method of analysis of all subjects when there has been a significant<br />

drop-out or crossover rate is to use an intention-to-treat methodology. In this<br />

method, all patient outcomes are counted with the group to which the patient<br />

was originally assigned even if the patient dropped out or switched groups. This<br />

approximates real life where some patients drop out or are non-compliant for<br />

various reasons. Patients who dropped out or switched therapies must still be<br />

accounted for at the end of the trial since if their fates are unknown, it is impossible<br />

to accurately determine their outcomes. Some studies will attempt to use<br />

statistical models to estimate the outcomes that those patients should have had<br />

if they had completed the study, but the accuracy of this depends on the ability<br />

of the model to mimic reality.<br />
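As a sketch of the counting rule, using hypothetical records of (assigned arm, arm actually received, improved), grouping outcomes by assignment gives the intention-to-treat result, while grouping by treatment actually received (an “as-treated” analysis) can make equivalent therapies look different when sicker patients cross over.

```python
# Hypothetical records: (assigned arm, arm actually received, improved?)
records = [("A", "A", True), ("A", "B", False), ("A", "A", True),
           ("B", "B", True), ("B", "B", False), ("B", "A", True)]

def success_rate(arm, index):
    """index 0: group by assignment (intention-to-treat);
    index 1: group by treatment actually received (as-treated)."""
    subset = [r for r in records if r[index] == arm]
    return sum(r[2] for r in subset) / len(subset)

for arm in ("A", "B"):
    print(f"arm {arm}: ITT = {success_rate(arm, 0):.2f}, "
          f"as-treated = {success_rate(arm, 1):.2f}")
```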

A good example of intention-to-treat analysis was in a study of survival after<br />

treatment with surgery or radiation for prostate cancer. The group randomized to radical prostatectomy, or complete removal of the prostate gland, did

much better than the group randomized to either radiation therapy or watchful<br />

waiting with no treatment. Some patients who were initially randomized to the<br />

surgery arm of the trial were switched to the radiation or watchful waiting arm<br />

of the trial when, during the surgery, it was discovered that they had advanced<br />

and inoperable disease. These patients should have been kept in their original<br />

surgery group even though their cancerous prostates were not removed. When<br />

the study was re-analyzed using an intention-to-treat analysis, the survival in all<br />

three groups was identical. Removing those patients biased the original study<br />

results, since patients with similarly advanced cancer were not removed

from the other two groups.<br />

Another biased technique involves removing patients from the study. Removing<br />

patients after randomization for reasons associated with the outcome is<br />

patently biased and grounds to invalidate the study. Leaving them in the analysis<br />

as an intention-to-treat is honest and will not inflate the results. However, if the<br />

outcomes of patients who left the study are not known, a best case/worst case<br />

scenario should be applied and clearly described so that the reader can determine<br />

the range of effects applicable to the therapy.<br />

In the best case/worst case analysis, the results are re-analyzed considering<br />

that all patients who dropped out or crossed over had the best outcome possible<br />

or worst outcome possible. This should be done by adding the drop-outs of the<br />

intervention group to the successful patients in the intervention group and at the<br />

same time subtracting the drop-outs of the comparison group from the successful<br />

patients in that group. The opposite process, subtracting drop out patients<br />

from the intervention group and adding them to the comparison group, should<br />

then be done. This will give a range of possible values of the final effect size. If<br />

this range is very large, we say that the results are sensitive to small changes that



could result from drop-outs or crossovers. If the range is very small, we call the<br />

results robust, as they are not likely to change drastically because of drop-outs<br />

or crossovers.<br />
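A minimal sketch of this re-analysis with hypothetical counts; drop-outs are assumed to be all successes in one direction and all failures in the other, giving the range of the effect size.

```python
def best_worst_case(succ_rx, drop_rx, n_rx, succ_ctl, drop_ctl, n_ctl):
    """Range of the success-rate difference (intervention minus control)
    under the two extreme assumptions about drop-out outcomes; n_* is
    the number originally randomized to each arm."""
    best = (succ_rx + drop_rx) / n_rx - succ_ctl / n_ctl
    worst = succ_rx / n_rx - (succ_ctl + drop_ctl) / n_ctl
    return worst, best

# Hypothetical trial: 100 randomized per arm, 10 drop-outs per arm,
# 60 vs. 50 observed successes among completers.
worst, best = best_worst_case(60, 10, 100, 50, 10, 100)
print(f"risk difference could range from {worst:+.2f} to {best:+.2f}")
```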

Compliance with the intervention should be measured and noted. Lack of<br />

compliance may influence outcomes since the reason for non-compliance may<br />

be directly related to the intervention. High compliance rates in studies may<br />

not be duplicated in clinical practice. Other clinically important outcomes<br />

that should be measured include adverse effects, direct and indirect costs,<br />

invasiveness, and monitoring of an intervention. A blinded and independent<br />

observer should measure these outcomes, since an outcome that is not objectively measured may limit the usefulness of the therapy. Remember, no adverse

effects among n patients could signify as many as 3/n adverse events in actual<br />

practice.<br />
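The 3/n figure is the “rule of three”: solving (1 − p)^n = 0.05 for the largest event rate p still consistent with observing zero events gives approximately 3/n.

```python
# "Rule of three": with 0 adverse events observed in n patients, the
# upper 95% bound on the true event rate is approximately 3/n.
for n in (30, 300, 3000):
    print(f"0 events in {n:4d} patients: true rate could be up to ~{3/n:.4f}")
```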

Results should be interpreted using the techniques discussed in the sections<br />

on statistical significance (Chapters 9–12). Look for both statistical and clinical<br />

significance. Look at confidence intervals and assess the precision of the results.<br />

Remember, narrow CIs are indicative of precise results while wide CIs are imprecise.<br />

Determine if any positive results could be due to Type I errors. For negative<br />

studies determine the relative likelihood of a Type II error.<br />
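As a sketch of why sample size drives precision, the half-width of a normal-approximation confidence interval for a proportion shrinks with the square root of n; the counts here are hypothetical.

```python
import math

def proportion_95ci(successes, n):
    """Approximate 95% CI for a proportion (normal approximation)."""
    p = successes / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p - half_width, p + half_width

print(proportion_95ci(30, 50))     # 60% in a small trial: wide CI
print(proportion_95ci(600, 1000))  # 60% in a large trial: narrow CI
```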

Discussion and conclusions<br />

The discussion and conclusions should be based upon the study data and limited

to settings and subjects with characteristics similar to the study setting and<br />

subjects. Good studies will also list weaknesses of the current research and offer<br />

directions for future research in the discussion section. Also, the author should<br />

compare the current study to other studies done on the same intervention or<br />

with the same disease.<br />

In summary, no study is perfect, all studies have flaws, but not all flaws are<br />

fatal. After evaluating a study using the standardized format presented in this<br />

chapter, the reader must decide if the merits of a study outweigh the flaws before<br />

accepting the conclusions as valid.<br />

Further problems<br />

A study published in JAMA in February 1995 reviewed several systematic reviews<br />

of clinical trials, and found that if the trials were not blinded or the results were<br />

incompletely reported, there was a trend toward showing better results. 2 This highlights

the need for the reader to be careful in evaluating these types of trials.<br />

2 K. F. Schulz, I. Chalmers, R. J. Hayes & D. G. Altman. Empirical evidence of bias. Dimensions of<br />

methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995;<br />

273: 408–412.



[Fig. 15.1 Effect of blinding and sample size on results in trials of acupuncture for low back pain: percent improved with acupuncture plotted against percent improved with control, with trials grouped as Blind < 50, Blind 100–200, Nonblind < 50, and Nonblind 50–100 subjects. From E. Ernst & A. R. White. Arch. Intern. Med. 1998; 158: 2235–2241.]

Always look for complete randomization, total double blinding, and reporting<br />

of all potentially important outcomes. An example of this phenomenon can be<br />

seen in the systematic review of studies of acupuncture for back pain that was<br />

described earlier.<br />

L’Abbé plots are a graphic technique for presenting the results of many individual

clinical trials. 3 The plot provides a simple visual representation of all the<br />

studies of a particular clinical question. It is a way of looking for the presence<br />

of bias in the studies done on a single question. The plot shows the proportion<br />

of patients in each study who improved taking the control therapy against<br />

the proportion who improved taking the active treatment. Each study is represented<br />

by one point and the size of the circle around that point is proportional<br />

to the sample size of the study. The studies closest to the diagonal show the least<br />

effect of therapy, and farther from the diagonal show a greater effect. In addition<br />

to getting an idea of the strength of the difference between the two groups,<br />

one can also look for the effects of blinding, sample size, or any other factor on<br />

the study results. Figure 15.1 shows the results of studies of the effectiveness of<br />

acupuncture on short-term improvements in back pain. The studies are divided<br />

by blinded vs. non-blinded and by size of sample. One can clearly see that the<br />

results of the blinded trials were less spectacular than the unblinded ones.<br />

3 K. A. L’Abbé, A. S. Detsky & K. O’Rourke. Meta-analysis in clinical research. Ann. Intern. Med. 1987;<br />

107: 224–233.
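A minimal matplotlib sketch of a L’Abbé plot, using made-up trial results of the form (percent improved with control, percent improved with treatment, sample size):

```python
import matplotlib.pyplot as plt

# Hypothetical trials: (% improved with control, % improved with
# treatment, number of subjects).
trials = [(38, 67, 90), (50, 57, 250), (30, 70, 40), (55, 60, 180)]

fig, ax = plt.subplots()
for ctl, rx, n in trials:
    ax.scatter(ctl, rx, s=n, alpha=0.5)   # circle size tracks sample size
ax.plot([0, 100], [0, 100], "k--")        # diagonal = no treatment effect
ax.set(xlim=(0, 100), ylim=(0, 100),
       xlabel="Percent improved with control",
       ylabel="Percent improved with treatment")
plt.show()
```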



The n-of-1 trial<br />

An n-of-1 trial is done like any other experiment, but with only one patient as a<br />

subject. Some have called this the highest level of evidence available. However, it<br />

is only useful in the one patient to whom it is applied. It is a useful technique to<br />

determine optimal therapy in a single patient when there appears to be no significant<br />

advantage of one therapy over another based on reported clinical trials. In

order to justify the trial, the effectiveness of therapy must really be in doubt, the<br />

treatment should be continued long-term if it is effective, and the patient must<br />

be highly motivated to allow the researcher to do an experiment on them. It is<br />

helpful if there is a rapid onset of action of the treatment in question and rapid<br />

cessation when treatment is discontinued. There should be easily measurable<br />

and clinically relevant outcome measures and sensible criteria for stopping the<br />

trial.<br />

Additionally, the patient should give informed consent before beginning the<br />

trial. The researcher must have a willing pharmacist and pharmacy that can dispense<br />

identical, unlabeled active and placebo or comparison medications. Endpoints<br />

must be measurable with as much objectivity as possible. Also, the patient<br />

should be asked if they knew which of the two treatments they were taking and a<br />

statistician should be available to help evaluate the results. 4<br />
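As a sketch, the treatment schedule for an n-of-1 trial is commonly built from randomized pairs of periods, so that each pair contains the active drug and the comparison in random order; the period counts here are arbitrary.

```python
import random

def n_of_1_schedule(n_pairs=3, seed=None):
    """Randomized schedule for an n-of-1 trial: each pair of treatment
    periods contains active drug and placebo in random order."""
    rng = random.Random(seed)
    schedule = []
    for _ in range(n_pairs):
        pair = ["active", "placebo"]
        rng.shuffle(pair)
        schedule.extend(pair)
    return schedule

print(n_of_1_schedule(seed=7))
```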

A user’s guide to the randomized clinical trial<br />

of therapy or prevention<br />

The following is a standardized set of methodological criteria for the critical assessment of a randomized clinical trial article when looking for the best therapy to use in practice. It is based, with permission, upon the Users’

Guides to the Medical Literature published by JAMA. 5 The University of Alberta<br />

(www.med.ualberta.ca.ebm) has online worksheets for evaluating articles of<br />

therapy that use this guide.<br />

(1) Was the study valid?<br />

(a) Was the assignment of patients to treatments really randomized?<br />

(i) Was similarity between groups documented?<br />

(ii) Was prognostic stratification used in allocation?<br />

(iii) Was there allocation concealment?<br />

(iv) Were both groups of patients similar at the start of the study?<br />

(b) Were all patients who entered the study accounted for at its conclusion?<br />




(i) Was there complete follow-up of all patients?<br />

(ii) Were drop-outs, withdrawals, non-compliers, and those who<br />

crossed over handled appropriately in the analysis?<br />

(c) Were the patients, their clinicians, and the study personnel including<br />

recorders or measurers of outcomes blind to the assigned treatment?<br />

(d) Were the baseline factors the same in both groups at the start of the trial?<br />

(e) Aside from the intervention being tested, were the two groups of patients<br />

treated in an identical manner?<br />

(i) Was there any contamination?<br />

(ii) Were there any cointerventions?<br />

(iii) Was the compliance the same in both groups?<br />

(2) What are the results?<br />

(a) How large was the treatment effect, and were both statistical and clinical significance considered?

(i) If statistically significant, was the difference clinically important?<br />

(ii) If not statistically significant, was the study big enough to show a<br />

clinically important difference if it should occur?<br />

(iii) Was appropriate adjustment made for confounding variables?<br />

(b) How precise are the results? What is the size of the 95% confidence intervals?<br />

(3) Will the results help me care for my patient?<br />

(a) Were the study patients recognizably similar to my own?<br />

(i) Are reproducibly defined exclusion criteria stated?<br />

(ii) Was the setting primary or tertiary care?<br />

(b) Were all clinically relevant outcomes reported or at least considered?<br />

(i) Was mortality as well as morbidity reported?<br />

(ii) Were deaths from all causes reported?<br />

(iii) Were quality-of-life assessments conducted?<br />

(iv) Was outcome assessment blind?<br />

(c) Is the therapeutic maneuver feasible in my practice?<br />

(i) Is it available, affordable, and sensible?<br />

(ii) Was the maneuver administered in an adequately blinded manner?<br />

(iii) Was compliance measured?<br />

(d) Are the benefits worth the costs?<br />

(i) Can I identify all the benefits and costs, including non-economic<br />

ones?<br />

(ii) Were all potential harms considered?

4 For more information on the n-of-1 RCT, consult D. L. Sackett, R. B. Haynes, P. Tugwell & G. H. Guyatt. Clinical Epidemiology: a Basic Science for Clinical Medicine. 2nd edn. Boston: Little Brown, 1991, pp. 225–238.
5 G. H. Guyatt & D. Rennie (eds.). Users’ Guides to the Medical Literature: A Manual for Evidence-Based Practice. Chicago: AMA, 2002. See also Bibliography.

The CONSORT statement<br />

Beginning in 1993, the Consolidated Standards of Reporting Trials Group, known as the CONSORT group, began its attempt to standardize and improve the reporting of the process of randomized clinical trials. This was a result of lax reporting of the results of these trials. Currently most medical journals require that the CONSORT requirements be followed in order for an RCT to be published. Look for the CONSORT flow diagram at the start of any RCT and be suspicious that there are serious problems if there is no flow diagram for the study. The CONSORT flow diagram is outlined in Table 15.3.

Table 15.3. Template for the CONSORT format of an RCT showing the flow of participants through each stage of the study

1. Assessed for eligibility (n = ...)
2. Enrollment: Excluded (n = ...): Not meeting inclusion criteria (n = ...), Refused to participate (n = ...), Other reasons (n = ...)
3. Randomized (n = ...)
4. Allocation: Allocated to intervention (n = ...), Received allocated intervention (n = ...), Did not receive allocated intervention (n = ...) (give reasons)
5. Follow-up: Lost to follow-up (n = ...) (give reasons), Discontinued intervention (n = ...) (give reasons)
6. Analysis: Analyzed (n = ...), Excluded from analysis (n = ...) (give reasons)
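The flow counts are easy to represent and sanity-check in code; this sketch uses hypothetical numbers and simply verifies that every randomized participant is accounted for.

```python
# Hypothetical CONSORT flow counts for a two-arm trial.
flow = {"assessed": 500, "excluded": 200, "randomized": 300,
        "allocated": {"intervention": 150, "control": 150},
        "lost_to_follow_up": 20, "analyzed": 280}

assert flow["assessed"] - flow["excluded"] == flow["randomized"]
assert sum(flow["allocated"].values()) == flow["randomized"]
assert flow["randomized"] - flow["lost_to_follow_up"] == flow["analyzed"]
print("Every participant is accounted for at each stage.")
```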

Ethical issues<br />

Finally, there are always ethical issues that must be considered in the evaluation<br />

of any study. Informed consent must be obtained from all subjects. This is<br />

a problem in some resuscitation studies, where other forms of consent such as<br />

substituted or implied consent may be used. Look for Institutional Review Board<br />

(IRB) approval of all studies. If it is not present, it may be an unethical study. It is<br />

the responsibility of the journal to publish only ethical studies. Therefore most<br />

journals will not publish studies without IRB approval. Decisions about whether<br />

or not to use the results of unethical studies are very difficult and beyond the<br />

scope of this book. As always, in the end, readers must make their own ethical<br />

judgment about the research.<br />

All the major medical journals now require authors to list potential conflicts<br />

of interest with their submissions. These are important to let the reader know<br />

that there may be a greater potential for bias in these studies. However, there are<br />

always potential reasons to suspect bias based upon other issues that may not

be so apparent. These include the author’s need to “publish or perish,” desire to<br />

gain fame, and belief in the correctness of a particular hypothesis. A recent study<br />

on the use of bone-marrow transplantation in the treatment of stage 3 breast<br />

cancers showed a positive effect of this therapy. However, some time after publication,<br />

it was discovered that the author had fabricated some of his results, making<br />

the therapy look better than it actually was.



All RCTs should be described, before initiation of the research, in a registry<br />

of clinical trials. This can be seen on the ClinicalTrials.gov website, a project of<br />

the National Institutes of Health in the United States. This is a registry of clinical<br />

trials conducted around the world. The site gives information about the purpose<br />

of the clinical trial, who may participate, locations, and phone numbers for more<br />

details. These details should be adequate for anyone else to duplicate the trial.<br />

The purpose of the registry is to get the details of the trial published on-line prior<br />

to initiation of the trial itself. This way, the researchers cannot spin the results<br />

to look better by reporting different outcomes than were originally specified or<br />

by using different methods than originally planned. Most journals will no longer<br />

publish trials that are not registered in this or a similar international registry.<br />

The question of placebo controls is one ethical issue which is constantly being<br />

discussed. Since there are therapies for almost all diseases, is it ever ethical to<br />

have a placebo control group? This is still a contentious area with strong opinions<br />

on both sides. One test for the suitability of placebo use is clinical equipoise.<br />

This occurs when the clinician is unsure about the suitability of a therapy and<br />

there is no other therapy that works reasonably well to treat the condition. Here<br />

placebo therapy can be used. Both the researcher and the patient must be similarly<br />

inclined to choose either the experimental or a standard therapy. If this is<br />

not true, placebo ought not to be used.


16<br />

Scientific integrity and the responsible<br />

conduct of research<br />

John E. Kaplan, Ph.D.<br />

Integrity without knowledge is weak and useless, and knowledge without integrity is<br />

dangerous and dreadful.<br />

Samuel Johnson (1709–1784)<br />

Learning objectives<br />

In this chapter you will learn:<br />

what is meant by responsible conduct of research<br />

how to be a responsible consumer of research<br />

how to define research misconduct and how to deal with it<br />

how conflicts of interest may compromise research, and how they are managed<br />

why and how human participants in research studies are protected<br />

what constitutes responsible reporting of research findings<br />

how peer review works<br />

The responsible conduct of research<br />

The conduct and ethics of biomedical researchers began to receive increased<br />

attention after World War II. This occurred in part as a response to the atrocities<br />

of Nazi medicine and in part because of the increasing rate of technological<br />

advances in medicine. This interest intensified in the United States in<br />

response to the publicity surrounding improper research practices, particularly<br />

the Tuskegee syphilis studies, studies of the effects of LSD on unsuspecting subjects,

and studies of radiation exposure. While these issues triggered important<br />

reforms, the focus was largely restricted to protection of human experimental<br />

subjects.<br />

The conduct of scientists again became an area of intense interest in the 1980s<br />

after a series of high-profile cases of scientific misconduct attracted the attention<br />




both of the US public and of the US Congress, which conducted a series of investigations<br />

into the matter. These included the misconduct cases regarding Robert<br />

Gallo, a prominent AIDS researcher, and Nobel Laureate David Baltimore. Even<br />

cases that were not found to be misconduct increased public and political interest<br />

in the behavior of researchers. This interest resulted in the development of<br />

federally prescribed definitions of scientific misconduct. Now there are requirements<br />

that federally funded institutions adopt policies for responding to allegations<br />

of research fraud and for protecting the whistle-blowers. This was followed<br />

by the current requirement that certain researchers be given ethics training with<br />

funding from federal research training grants.<br />

This initial regulation was scandal-driven and was focused on preventing<br />

wrong or improper behavior. As these policies were implemented, it became<br />

apparent that this approach was not encouraging proper behavior. This new<br />

focus on fostering proper conduct by researchers led to the emergence of the<br />

field now generally referred to as the responsible conduct of research. This development<br />

is not the invention of the concept of scientific integrity, but it has significantly<br />

increased the attention bestowed on adherence to existing rules, regulations,<br />

guidelines, and commonly accepted professional codes for the proper<br />

conduct of research. It has been noted that much of what constitutes responsible conduct of research would be achieved if we all adhered to the basic code of conduct we learned in kindergarten: play fair, share, and tidy up.

The practice of evidence-based medicine requires high quality evidence. A primary source of such evidence is from scientifically based clinical research. To be

able to use this evidence, one must be able to believe what one reads. For this<br />

reason it is absolutely necessary that the research be trustworthy. Research must<br />

be proposed, conducted, reported, and reviewed responsibly and with integrity.<br />

Research, and the entire scientific enterprise, are based upon trust. In order for

that trust to exist, the consumer of the biomedical literature must be able to<br />

assume that the researcher has acted responsibly and conducted the research<br />

honestly and objectively.<br />

The process of science and proper conduct of evidence-<strong>based</strong> medicine are<br />

equally dependent on the consumption and application of research findings<br />

being conducted with responsibility and integrity. This requires readers to be<br />

knowledgeable and open-minded in reading the literature. They must know the<br />

factual base and understand the techniques of experimental design, research,<br />

and statistical analysis. It is as important that the reader consumes and applies<br />

research without bias as it is that the research is conducted and reported without<br />

bias. Responsible use of the literature requires that the reader be conscientious<br />

in obtaining a broad and representative, if not complete, view of that segment of the literature.

Building one’s knowledge-base on reading a selected part of that literature, such<br />

as abstracts alone, risks incorporating incomplete or wrong information into<br />

clinical practice and may lead to bias in the interpretation of the work. Worse



would be to act on pre-existing bias and selectively seek out only those studies<br />

in the literature that one agrees with or that support one’s point of view, and to<br />

ignore those parts that disagree. In addition, it is essential that when one uses or<br />

refers to the work of others their contribution be appropriately referenced and<br />

credited.<br />

Scientists conducting research with responsibility and integrity constitute the

first line of defense in ensuring the truth and accuracy of biomedical research. It<br />

is important to recognize that the accuracy of scientific research does not depend<br />

upon the integrity of any single scientist or study, but instead depends on science<br />

as a whole. It relies on findings being reproduced and reinforced by other scientists,<br />

which is a mechanism that protects against a single finding or study being<br />

uncritically accepted as fact. In addition, the process of peer review further protects<br />

the integrity of the scientific record.<br />

Research misconduct<br />

Research or scientific misconduct represents events in which error is introduced<br />

into the body of scientific knowledge knowingly, through deception and misrepresentation.<br />

Research misconduct does not mean honest error or differences in<br />

opinion. Errors occurring as the result of negligence in the way the experiment<br />

is conducted are also not generally considered research misconduct. However,<br />

negligence in the experiment does fall outside the scope of responsible conduct<br />

of science guidelines.<br />

In many respects, research misconduct is a very tangible concept. This contrasts<br />

to other areas within the broad scope of responsible conduct of research.<br />

Both the agencies sponsoring research and the institutions conducting research<br />

develop policies to deal with research misconduct. These policies require that<br />

a specific definition of research misconduct be developed. This effort has been<br />

fraught with controversy and resulted in a proliferation of similar, but not identical,<br />

definitions from various government agencies that sponsor research. Nearly<br />

all definitions agree that three basic concepts underlie scientific misconduct.<br />

These include fabrication, falsification, and plagiarism. In a nutshell, definitions<br />

agree that scientists should not lie, cheat, or steal. These ideas have now<br />

been included in a new single federal definition (Federal Register: November 2,<br />

2005 [Volume 70, Number 211]).<br />

The previous definition of research misconduct from the National Institutes<br />

of Health, the agency sponsoring most US government funded biomedical<br />

research, also included a statement prohibiting “other serious deviations from<br />

accepted research practices.” This statement is difficult to define specifically but<br />

reflects the belief that there are other behaviors besides fabrication, falsification,



and plagiarism that constitute research misconduct. A government-wide definition<br />

has been developed and approved. According to this policy “research misconduct<br />

is defined as fabrication, falsification, or plagiarism in proposing, performing,<br />

or reviewing research, or in reporting research results.” (Federal Register:<br />

November 2, 2005 [Volume 70, Number 211]).

These three types of misconduct are defined as follows:<br />

Fabrication is making up data and recording or reporting them.<br />

Falsification is manipulating research materials, equipment, or processes, or<br />

changing or omitting data such that the research is not accurately represented<br />

in the research record.<br />

Plagiarism is the appropriation of another person’s ideas, processes, results,<br />

or words without giving appropriate credit.<br />

It is likely that the vast majority of scientists, and people in general, know that<br />

it is wrong to lie, cheat, or steal. This probably includes those who engage in such<br />

behavior. There are clearly numerous motivations that lead people to engage in<br />

such practices. These may include, but are not limited to, acting on personal or<br />

political biases, having personal financial incentives, personal and professional<br />

ambition, and fear of failure. In our system of research, the need for financial<br />

support and desire for academic advancement as measures of financial and professional<br />

success are dependent upon the productivity of a research program.<br />

Until there are some fundamental changes in the way research is funded, these<br />

questionable incentives are likely to remain in place.<br />

Many people believe that a substantial amount of research misconduct goes<br />

unreported because of concerns that there will be consequences to the whistleblower.<br />

All institutions in the United States that engage in federally supported<br />

research must now have in place formal policies to prevent retaliation against<br />

whistle-blowers. Unfortunately, it is unlikely that someone will be able to recognize<br />

scientific misconduct simply by reading a research study unless the misconduct<br />

is plagiarism of work they did or is very familiar to them. Usually such misconduct,<br />

if found at all, is discovered locally or during the review process prior to<br />

publication and may never be disclosed to the general scientific community.<br />

Conflict of interest<br />

Conflicts of interest may provide the motivation for researchers to act outside of<br />

the boundaries of responsible conduct of research. Webster’s dictionary defines<br />

conflict of interest as “A conflict between the private interests and professional<br />

responsibilities of a person in a position of trust.” A useful definition in the context<br />

of biomedical research and patient care was given by D. F. Thompson, who stated that “a conflict of interest is a set of conditions in which professional

judgement concerning a primary interest (such as patient welfare or the validity


of research) tends to be unduly influenced by secondary interest (such as financial gain).” 1 These relationships are diagrammed in Fig. 16.1.

[Fig. 16.1 Conflict of interest schematic: a primary professional interest and another interest both bear on professional judgment; bias from the other interest can shift the decision to an alternative decision.]

It is very important

to recognize that conflicts of interest per se are common among people with<br />

complex professional careers. Simply having conflict of interest is not necessarily<br />

wrong and is often unavoidable. What is wrong is when one is inappropriately<br />

making decisions founded on these conflicts or when one places a new interest above a previous professional responsibility. An example of this would be

a physician becoming a part owner of a lab, to which he or she sends patients for<br />

bloodwork, at the cost of the physician’s previous priority of patient care. Decisions<br />

that are made based upon the bias produced by these interests are especially

insidious when they result in the compromise of patient care or in research<br />

misconduct.<br />

Many of the rules regarding conflict of interest focus on financial gain, not because it is the worst consequence, but because it is more objective and more easily regulated. There is substantial reason for concern that financially based conflicts

of interest have affected research outcomes. Recent studies of calcium channel<br />

blockers, 2 non-steroidal anti-inflammatory drugs, 3 and health effects of secondhand<br />

smoke 4 each found that physicians with financial ties to manufacturers<br />

were significantly less likely to criticize safety or efficacy. A study of clinical-trial<br />

publications 5 determined a significant association between positive results and<br />

pharmaceutical company funding. Analysis of the cost-effectiveness of six oncology<br />

drugs 6 found that pharmaceutical company sponsorship of economic analyses<br />

led to a reduced likelihood of reporting unfavorable results.<br />

1 D. F. Thompson. Understanding financial conflicts of interest. N. Engl. J. Med. 1993; 329: 573–576.
2 H. T. Stelfox, G. Chua, K. O’Rourke & A. S. Detsky. Conflict of interest in the debate over calcium-channel antagonists. N. Engl. J. Med. 1998; 338: 101–106.
3 P. A. Rochon, J. H. Gurwitz, R. W. Simms, P. R. Fortin, D. T. Felson, K. L. Minaker & T. C. Chalmers. A study of manufacturer-supported trials of nonsteroidal anti-inflammatory drugs in the treatment of arthritis. Arch. Intern. Med. 1994; 154: 157–163.
4 R. M. Werner & T. A. Pearson. What’s so passive about passive smoking? Secondhand smoke as a cause of atherosclerotic disease. JAMA 1998; 279: 157–158.
5 R. A. Davidson. Source of funding and outcome of clinical trials. J. Gen. Intern. Med. 1986; 1: 155–158.
6 M. Friedberg, B. Saffran, T. J. Stinson, W. Nelson & C. L. Bennett. Evaluation of conflict of interest in economic analyses of new drugs used in oncology. JAMA 1999; 282: 1453–1457.



Most academic institutions attempt to manage researchers’ potential conflicts

of interest. This is justified as an attempt to limit the influence of those conflicts<br />

and protect the integrity of research outcomes and patient-care decisions.<br />

Surprisingly, some academicians have argued against such management on the<br />

grounds that it impugns the integrity of honest physicians and scientists. Some<br />

institutions have decided that limiting the opportunity for outside interests prevents<br />

recruitment and retention of the best faculty. The degree to which these<br />

activities are conflicts of interests remains an ongoing debate in the academic<br />

community.<br />

Nearly all academic institutions engaging in research currently have policies<br />

to manage and/or limit conflicts of interest. Most of these focus exclusively on<br />

financial conflicts and are designed primarily to protect the institutions financially.<br />

Increased awareness of the consequences of conflict of interest will hopefully<br />

result in the development of policies that offer protection to research subjects<br />

and preserve the integrity of the research record.<br />

There are several ways that institutions choose to manage conflict of interest.<br />

The most common is requiring disclosure of conflicts of interest with the rationale<br />

that individuals are less likely to act on conflicts if they are known. Other<br />

methods include limitations on the value of outside interests such as limiting<br />

the equity a researcher could have in a company with whom they work or limiting<br />

the amount of consultation fees they can collect. Recently some professional<br />

organizations have suggested that the only effective management for potential<br />

conflicts of interest is their complete elimination.<br />

Some of the most difficult conflicts occur when physicians conduct clinical<br />

studies where they enroll their own patients as research subjects. This can place<br />

the performance of the research and patient care in direct conflict. Another common<br />

area of conflict is in studies funded by pharmaceutical companies. Often<br />

they desire a veto in all decisions affecting the conduct and publication of the<br />

results.<br />

Research with human participants<br />

In order to obtain definitive information on the pathophysiologic sequelae of<br />

human disease, as well as to assess risk factors, diagnostic modalities, and therapeutic<br />

interventions, it is necessary to use people as research subjects. After<br />

several instances of questionable practices in studies using human subjects, the<br />

US Congress passed the National Research Act in 1974. One outcome of this legislation<br />

was the publication of the Belmont Report that laid the foundation of<br />

ethical principles which govern the conduct of human studies and provide protection<br />

for human participants. These principles are respect for personal autonomy,<br />

beneficence, and justice.



The principle of respect for persons manifests itself in the practice of informed<br />

consent. Informed consent requires that individuals be made fully aware of the<br />

risks and benefits of the experimental protocol and that they be fully able to<br />

evaluate this information. Consent must be fully informed and entirely free of<br />

coercion.<br />

The principle of beneficence manifests itself in the assessment of risk and<br />

benefit. The aim of research involving human subjects is to produce benefits<br />

to either the research subject, society at large, or both. At the same time, the<br />

magnitude of the risks must be considered. The nature of experimental procedures generally dictates that not everything about them is known, and so risks, including some that are unforeseen, may occur. Research on human subjects

should only take place when the potential benefits outweigh the potential<br />

risks. Another way of looking at this is the doctrine of clinical equipoise.<br />

At the onset of a study, the research arms, treatment and control, must be equally

likely to result in the best outcome. At the very least, the comparison group<br />

must be receiving a treatment consistent with the current standard of care.<br />

The application of this principle could render some placebo-controlled studies<br />

unethical.<br />

The principle of justice manifests itself in the selection of research subjects.<br />

This principle dictates that the benefits and the risks of research be distributed<br />

fairly within the population. There should be no favoritism shown when<br />

enrolling patients into a study. For example, groups should be selected for inclusion<br />

into the research study based on characteristics of patients who would benefit

from the therapy, and not because they are poor or uneducated.<br />

The responsibilities for ensuring that these principles are applied rest with<br />

Institutional Review Boards (IRBs). These must include members of varying<br />

background, both scientific and non-scientific, who are knowledgeable of the<br />

institution’s commitments and regulations, applicable law and ethics, and standards<br />

of professional conduct and practice. The IRB must approve both the initiation<br />

and continuation of each study involving human participants. The IRB<br />

seeks to ensure that risk is minimal and reasonable in relation to the anticipated<br />

benefit of the knowledge gained. The IRB evaluates whether selection of research<br />

subjects is equitable and ensures that consent is informed and documented, that<br />

provisions are included to monitor patient safety, and that privacy and confidentiality<br />

are protected.<br />

One of the most difficult roles for the physician is the potential conflict<br />

between patient care responsibilities and the objectivity required of a researcher.<br />

Part of the duty of the IRB ought to be an evaluation of the methodology of<br />

the research study. Some researchers disagree with this role. But, it ensures<br />

that subjects, our patients, are not subjected to useless or incompetently done<br />

research.



Peer review and the responsible reporting of research<br />

Peer review and the responsible reporting of research are two important and<br />

related subjects that impact directly on the integrity of the biomedical research<br />

record. Peer review is the mechanism used to judge the quality of research and<br />

is applied in several contexts. This review mechanism is founded on the premise<br />

that a proposal or manuscript is best judged by individuals with experience and<br />

expertise in the field.<br />

The two primary contexts are the evaluation of research proposals and<br />

manuscript reviews for journals. This mechanism is used by the National Institutes<br />

of Health and nearly every other non-profit sponsor of biomedical research<br />

(e.g., American Heart Association, American Cancer Society, etc.) to evaluate<br />

research proposals. Almost all journals also use this mechanism. In general, readers<br />

should be able to assume that journal articles are peer-reviewed although it<br />

is important to be aware of those that are not. Readers should have a lower level<br />

of confidence in research reported in journals that are not peer-reviewed. In general,<br />

society-sponsored and high-profile journals are peer-reviewed. If there are<br />

doubts, check the information for authors section, which should describe the<br />

review process.<br />

To be a responsible peer reviewer, one must be knowledgeable, impartial, and<br />

objective. It is not as easy as it might seem to meet all of these criteria. The more<br />

knowledgeable a reviewer is in the field of a proposal, the more likely they are to<br />

be a collaborator, competitor, or friend of the investigators. These factors, as well<br />

as potential conflicts of interest, may compromise their objectivity. Prior to publication<br />

or funding, proposals and manuscripts are considered privileged confidential<br />

communications that should not be shared. It is the responsibility of<br />

the reviewer to honor this confidentiality. It is similarly the responsibility of the<br />

reviewer not to appropriate any information gained from peer review into his or<br />

her own work.<br />

As consumers and, perhaps, contributors to the biomedical literature, we<br />

need research to be reported responsibly. Responsible reporting of research also<br />

includes making each study a complete and meaningful contribution as opposed<br />

to breaking it up to achieve as many publications as possible. Additionally, it is<br />

important to make responsible conclusions and issue appropriate caveats on the<br />

limitations of the work. It is necessary to offer full and complete credit to all those<br />

who have contributed to the research, including references to earlier works. It is<br />

essential to always provide all information that would be essential to others who<br />

would repeat or extend the work.


17<br />

Applicability and strength of evidence<br />

Find out the cause of this effect, Or rather say, the cause of this defect, For this effect<br />

defective comes by cause.<br />

William Shakespeare (1564–1616): Hamlet<br />

Learning objectives<br />

In this chapter you will learn:<br />

the different levels of evidence<br />

the principles of applying the results of a study to a patient<br />

The final step in the EBM process is the application of the evidence found in clinical<br />

studies to an individual patient. In order to do this, the reader of the medical<br />

literature must understand that all evidence is not created equal and that some<br />

forms of evidence are stronger than others. Once a cause-and-effect relationship<br />

is discovered, can it always be applied to the patient? What if the patient is of a<br />

different gender, socioeconomic, ethnic, or racial group than the study patients?<br />

This chapter will summarize these levels of evidence and help to put the applicability<br />

of the evidence into perspective. It will also help physicians decide how<br />

to apply lower levels of evidence to everyday clinical practice.<br />

Applicability of results<br />

The application of the results of a study is often difficult and frustrating for<br />

the clinician. Overall, one must consider the generalizability of study results to<br />

patients. A sample question would be: “Is a study of the risk of heart attack that

was done in men applicable to a woman in your practice?” Answering this question<br />

involves inducing the strength of a presumed cause-and-effect relationship<br />

in that patient based upon uncertain evidence. This is the essence of the art of medicine and is a blend of the available evidence, clinical experience, the clinical situation, and the patient’s preferences (Fig. 17.1).

[Fig. 17.1 The application of evidence to a particular clinical situation (from Chapter 2): best evidence, clinical experience, patient values, and the clinical situation together shape the decision.]

One must consider the strength of the evidence for a particular intervention or risk factor. The stronger the study, the more likely it is that those results will be borne out in practice. A well-done RCT with a large sample size is the strongest evidence for the efficacy of a practice in a defined population. However, these are very expensive and difficult to perform, and physicians often must make vital clinical decisions based upon less stringent evidence.

Sir Austin Bradford Hill, the father of modern biostatistics and epidemiology, developed a useful set of rules to determine the strength of causation based upon the results of a clinical study. These are summarized in Table 17.1.

Table 17.1. Criteria for application of results (in decreasing order of importance)

Strength of research design
Strength of result
Consistency of studies
Specificity (confounders)
Temporality (time-related)
Dose–response relationship
Biological plausibility
Coherence (consistency in time)
Analogous studies
Common sense

Source: After Sir Austin Bradford Hill. A Short Textbook of Medical Statistics. Oxford: Oxford University Press, 1977, pp. 309–323.

Levels of evidence<br />

Strength of the research design<br />

The strongest design for evaluation of a clinical question is a systematic review<br />

(SR) of multiple randomized clinical trials. Ideally, the studies in these reviews



will be homogeneous and done with carefully controlled methodology. The process<br />

of data analysis of these meta-analyses is complex, and is the basis of<br />

the Cochrane Collaboration, a loose network of physicians who systematically<br />

review various topics and publish the results of their reviews. We will discuss<br />

these further in Chapter 33.<br />

The randomized clinical trial (RCT) with human subjects is the strongest single<br />

research design capable of proving causation. It is least likely to have methodological<br />

confounders and is the only study design that can show that altering<br />

the cause alters the effect. Confounding variables, both recognized and unrecognized,

can and should be evenly distributed between control and experimental<br />

groups through adequate randomization, minimizing the likelihood of bias<br />

due to these differences. Ideally, the study should be double-blinded. In studies<br />

with strong results, those results should be accompanied by narrow confidence<br />

intervals. Clearly, the strongest evidence is the RCT carried out in the exact population<br />

that fits the patient in question. However, such a study is rarely available<br />

and physicians must use what evidence they can find, combined with their previous<br />

knowledge, to determine how the evidence produced by the study should<br />

be used. The n-of-1 RCT is another way of obtaining high quality evidence for a<br />

patient, but is difficult to perform and usually outside the scope of most medical<br />

practices at this time.<br />

The next best level of evidence comes from observational studies. The<br />

results of such studies may only represent association and can never prove<br />

that changes in a cause can change the effect. The strongest observational-study

research design supporting causation is a cohort study, which can be<br />

done with either a prospective or a non-concurrent design. Cohort studies<br />

can show that cause precedes effect but not that altering the cause alters the<br />

effect. Bias due to unrecognized confounding variables between the two groups<br />

might be present and should be sought and controlled for using multivariate<br />

analysis.<br />

A case–control study is a weaker research design that can still support causation.<br />

The results of these studies can prove an association between the cause<br />

and the effect. Sometimes, the cause can be shown to precede the effect. However,<br />

altering the cause cannot be shown to alter the effect. A downside to these<br />

studies is that they are subject to many methodological problems that may bias<br />

the outcome. But, for uncommon and rare diseases, this may be the strongest<br />

evidence possible and can provide high-quality evidence if the study is done correctly.<br />

Finally, case reports and descriptive studies including case series and crosssectional<br />

studies have the lowest strength of evidence. These studies cannot<br />

prove cause and effect, they can only suggest an association between two variables<br />

and point the way toward further directions of research. For very rare conditions<br />

they can be the only, and therefore the best, source of evidence. This is


190 <strong>Essential</strong> <strong>Evidence</strong>-Based <strong>Medicine</strong><br />

true when they are the first studies to call attention to a particular disorder or<br />

when they are of the “all-or-none” type.<br />

Hierarchies of research studies

There are several published hierarchies of classification for research studies. A system published by the Centre for Evidence-Based Medicine of Oxford University grades studies into levels from 1 through 5 and is an excellent grading scheme for clinical studies. This system grades studies by their overall quality and design. Level 1 studies are very large RCTs or systematic reviews. Level 2 studies are smaller RCTs with fewer than 50 subjects, RCTs of lower quality, or large high-quality cohort studies. Level 3 studies are smaller cohort or case–control studies. Level 4 evidence comes from case reports and low-level case–control and cohort studies. Finally, Level 5 is expert opinion or consensus based upon experience, physiology, or biological principles. Evidence-based medicine resources such as critically appraised topics (CATs) or Journal Club Banks must be evaluated on their own merits and should be peer-reviewed.
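For quick reference, the therapy-and-prevention branch of this hierarchy can be condensed into a simple lookup, sketched below. The level descriptions are paraphrased from this section; the dictionary name is ours and is not part of the Oxford scheme, which appears in full in Appendix 1.

```python
# Illustrative summary of the Oxford 1-5 levels described above
# (therapy/prevention only; see Appendix 1 for the complete scheme).
OXFORD_LEVELS = {
    1: "very large RCTs or systematic reviews of RCTs",
    2: "smaller (<50 subjects) or lower-quality RCTs; large high-quality cohorts",
    3: "smaller cohort or case-control studies",
    4: "case reports; low-level case-control and cohort studies",
    5: "expert opinion or consensus based on experience or physiology",
}

print(OXFORD_LEVELS[2])
```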

These levels of evidence are cataloged for articles of therapy or prevention, etiology or harm, prognosis, diagnosis, and decision and economic analyses. This scheme, developed at the Centre for Evidence-Based Medicine at Oxford University, is shown in Appendix 1.

Another classification scheme uses levels A through D to designate the strength of recommendations based upon the available evidence. Grade A is the strongest evidence and D the weakest. For studies of therapy or prevention, the following is a brief description of this classification of recommendations.

Grade A is a recommendation based on the strongest study designs and consists of sublevels 1a to 1c. Sublevel 1a is systematic reviews with homogeneity, free of worrisome variations, also known as heterogeneity, in the direction and degree of the results between individual studies. Heterogeneity, whether statistically significant or not, does not necessarily disqualify a study and should be addressed on an individual basis. Sublevel 1b is an individual randomized clinical trial with narrow confidence intervals. Studies with wide confidence intervals should be viewed with care and would not qualify as 1b-level evidence. Finally, the inclusion of the all-or-none case series as 1c evidence is somewhat controversial. These studies may be helpful for studying new, uniformly fatal, or very rare disorders, but they should be viewed with care as they are incapable of proving any elements of contributory cause and are only considered preliminary findings.

Grade B is a recommendation based on the next level of strength of design and includes 2a, systematic reviews of homogeneous cohort studies; 2b, strong individual cohort studies or weak RCTs with less than 80% follow-up; and 2c, outcomes research. Also included are 3a, systematic reviews of homogeneous case–control studies, and 3b, individual case–control studies.

Grade C is a recommendation based on the weakest study designs and includes level 4, case series and lower-quality cohort and case–control studies. These studies fail to clearly define comparison groups, to measure exposures and outcomes in the same objective way in both groups, to identify or appropriately control for known confounding variables, or to carry out a sufficiently long and complete follow-up of patients.

Finally, grade D recommendations are not based upon any scientific studies and are therefore the lowest level of evidence. Also called level 5, they consist of expert opinion without explicit critical appraisal of studies, based solely upon personal experience, applied physiology, or the results of bench research.

These strength-of-evidence recommendations apply to average patients. Individual practitioners can modify them in light of a patient’s unique characteristics, risk factors, responsiveness, and preferences about the care they receive. A level that fails to provide a conclusive answer can be preceded by a minus sign (−). This may occur because of wide confidence intervals that result in a lack of statistical significance but fail to exclude a clinically important benefit or harm. It also may occur as a result of a systematic review with serious and statistically significant heterogeneity. Evidence with these problems is inconclusive and can only generate grade C recommendations.

A new proposal for grading evidence is the recently published GRADE scheme, from the Grading of Recommendations Assessment, Development and Evaluation Working Group. Established in 2000, this group consists of EBM researchers and practitioners, many of whom had their own quality-of-evidence schemes in regular use, schemes that were often in conflict with each other. The group has created a uniform schema for classifying the quality of research studies based on their ability to prove a cause-and-effect relationship. The scheme is outlined in Table 17.2.¹ Software for the GRADE process is available as shareware on the group’s website, www.gradeworkinggroup.org, and through the Cochrane Collaboration.

¹ D. Atkins, D. Best, P. A. Briss, M. Eccles, Y. Falck-Ytter, S. Flottorp, G. H. Guyatt, R. T. Harbour, M. C. Haugh, D. Henry, S. Hill, R. Jaeschke, G. Leng, A. Liberati, N. Magrini, J. Mason, P. Middleton, J. Mrukowicz, D. O’Connell, A. D. Oxman, B. Phillips, H. J. Schunemann, T. T. Edejer, H. Varonen, G. E. Vist, W. R. Williams Jr. & S. Zaza; GRADE Working Group. Grading quality of evidence and strength of recommendations. BMJ. 2004; 328: 1490.

Table 17.2. GRADE recommendations

High – Randomized clinical trial: further research is unlikely to change our confidence in the estimate of the effects.
Moderate – Further research is likely to have an important impact on our confidence in our estimate of the effects and may change the estimate.
Low – Cohort studies: further research is likely to have an important impact on our confidence in our estimate of the effects and is likely to change the estimate.
Very low – Any other evidence: any estimate of effect is uncertain.

Decrease grade if:
1. Serious (−1) or very serious (−2) limitations to study quality
2. Important inconsistency
3. Some (−1) or major (−2) uncertainty about directness
4. Imprecise or sparse data (−1)
5. High probability of reporting bias (−1)

Increase grade if:
1. Strong evidence of association – significant relative risk >2 (or <0.5) based on consistent evidence from two or more observational studies with no plausible confounders (+1)
2. Very strong evidence of association – significant relative risk >5 (or <0.2) based on direct evidence with no major threats to validity (+1)
3. Evidence of a dose–response gradient (+1)
4. All plausible confounders would have reduced the effect (+1)
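The up- and down-grading in Table 17.2 amounts to simple arithmetic on a four-level scale, sketched below. The function name and numeric encoding are ours, purely to illustrate that arithmetic; the GRADE Working Group’s own software implements the full process.

```python
# Minimal sketch of the grading arithmetic in Table 17.2: start at "high"
# for randomized trials or "low" for observational (cohort) studies, then
# apply the -1/-2 and +1 modifiers. Names and encoding are illustrative.
LEVELS = ["very low", "low", "moderate", "high"]

def grade_quality(design: str, downgrades: int = 0, upgrades: int = 0) -> str:
    start = 3 if design == "rct" else 1           # index into LEVELS
    score = max(0, min(3, start - downgrades + upgrades))
    return LEVELS[score]

print(grade_quality("rct", downgrades=1))     # serious quality limits -> "moderate"
print(grade_quality("cohort", upgrades=1))    # e.g., consistent RR > 2 -> "moderate"
```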

Strength of results

The actual strength of association is the next important issue to consider. This refers to the clinical and statistical significance of the results. It is reflected in the magnitude of the effect size, or the difference found between the two groups studied. The larger the effect size and the lower the P value, the more likely it is that the results did not occur by chance alone and that there is a real difference between the groups. Other common measures of association are odds ratios and relative risk: the larger they are, the stronger the association. A relative risk or odds ratio over 5, or over 2 with very narrow confidence intervals, should be considered strong. Confidence intervals (CI) quantify the precision of the result and give the potential range of this strength of association. Confidence intervals should be routinely given in any study.

Even if the effect size, odds ratio (OR), or relative risk (RR) is statistically significant, one must decide if this result is clinically important. There are a number of factors to consider when assessing clinical importance. First, lower levels of RR or OR may be important in situations where the baseline risk level is fairly high. However, if the CI for these measures is overly wide, the results are less precise and therefore less meaningful. Second, finding no effect size, or one that was not statistically significant, may have occurred because of lack of power. The skew of the CI may give a subjective sense of the power of a negative study. Last, other measures of strength of association include the number needed to treat to get benefit (NNTB), obtained from randomized clinical trials, and the number needed to screen to get benefit (NNSB) and number needed to treat to get harm (NNTH), obtained from cohort or case–control studies.
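As a worked illustration of these measures, consider a hypothetical trial in which 8 of 100 treated patients and 16 of 100 controls have the outcome event; every number here is invented for the example.

```python
# Hypothetical 2 x 2 trial: 8/100 events on treatment, 16/100 on control.
treated_events, treated_total = 8, 100
control_events, control_total = 16, 100

risk_treated = treated_events / treated_total             # 0.08
risk_control = control_events / control_total             # 0.16

relative_risk = risk_treated / risk_control               # 0.50
odds_ratio = (treated_events / (treated_total - treated_events)) / (
    control_events / (control_total - control_events))    # ~0.46

absolute_risk_reduction = risk_control - risk_treated     # 0.08
nntb = 1 / absolute_risk_reduction                        # 12.5, round up to 13

print(f"RR={relative_risk:.2f} OR={odds_ratio:.2f} "
      f"ARR={absolute_risk_reduction:.2f} NNTB={nntb:.1f}")
```

Here the treatment halves the risk (RR 0.5), and about 13 patients must be treated for one additional patient to benefit.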

John Snow performed what is acknowledged as the first modern recorded epidemiologic study in 1854. In what became known as the Broad Street pump study, he proved that the cause of a cholera outbreak in London was the pump on Broad Street. This pump was supplied with water from one company and was associated with a high rate of cholera infection in the houses it fed, while a pump supplied by a different company had a much lower rate of infection. The relative risk of death was 14, suggesting a very strong association between consumption of water from the tainted pump and death due to cholera. A modern-day example of high strength of association is the connection between smoking and lung cancer, where the relative risk in heavy smokers is about 20. With such high association, competing hypotheses for the cause of lung cancer are unlikely and the course for the clinician should be obvious.

Consistency of evidence

The next feature to consider when looking at levels of evidence is the consistency of the results. Overall, it is critical that different researchers in different settings and at different times have done research on the same topic. The results of these comparable studies should be consistent, and if the effect size is similar across these studies, the evidence is stronger. Be aware that less consistency exists among studies that use different research designs, clinical settings, or study populations. A good example of the consistency of evidence occurred with studies looking at smoking and lung cancer. For this association, prior to the 1965 Surgeon General’s report, there were 29 retrospective and 7 prospective studies, all of which showed an association between smoking and lung cancer.

A single study that shows results discordant from many other studies suggests the presence of bias in that particular study. However, sometimes a single large study will show a discordant result compared with multiple small studies. This may be due to lack of power of the small studies, and if this occurs, the reader must carefully evaluate the methodology of all the studies and use those with the best and least-biased methodology. In general, large studies produce more believable results. If a study is small, a change in the outcome status of one or two patients could change the entire study conclusion from positive to negative.

Specificity

The next characteristic to consider is the specificity of the results. This means making sure that the cause in the study is the actual factor associated with the effect. Often the putative risk factor is confused with a confounding factor, or a surrogate marker may produce both the cause and the effect.

Specificity can be a problematic feature of generalization, as there are usually multiple sources of causation in chronic illness and multiple effects from one type of cause. For example, before the advent of milk pasteurization, there were multiple diverse diseases associated with the consumption of milk. A few of these were tuberculosis, undulant fever, typhoid, and diphtheria. To attribute the cause of all these diseases to the milk ignores the fact that what they have in common is that they are all caused by bacteria. The milk is simply the vehicle, and once the presence of bacteria and their role in human diseases were determined, it could be seen that ridding milk of all bacteria was the solution to preventing milk-borne transmission of these diseases. The next step was then inspecting the cows for those same diseases and eradicating them from the herd.

We can relate this concept to cancer of the lung in smokers. Overall, the death rate in smokers is higher than in non-smokers. For most causes of death, the increase in the death rate in smokers is about double (200%) that of non-smokers. However, for lung cancer specifically, the increase in the death rate in smokers is almost 2000%, an increase of 20 times. This lung cancer death rate is more specific than the increased death rate for other diseases. In those other diseases, smoking is a less significant risk factor, since there are multiple other factors that contribute to the death rate for those diseases. However, it is still a factor! In lung cancer, smoking is a much more significant factor in the death rate.

Temporal relationship

The next characteristic that should be considered is the temporal relationship between the purported cause and effect. For a temporal relationship to exist, the study should find an appropriate chronological sequence of events. The disease progression should follow a predictable path from risk-factor exposure to the outcome, and that pattern should be reproducible from study to study. Be aware that it is also possible for the effect to produce the cause. For example, some smokers quit smoking just prior to getting sick with lung cancer. While they may attribute their illness to quitting, the illness was present long before they finally decided to quit. Is quitting smoking the cause and lung cancer the effect? In this case, the cancer is actually the cause and the cessation of smoking the effect. The causality may be difficult to determine in many cases, especially with slowly progressive and chronic diseases.

Dose–response

The dose–response gradient can help define cause and effect if there are varying concentrations of the cause and varying degrees of association with the effect. Usually, the association becomes stronger with increasing amounts of exposure to the cause. However, some cause-and-effect relationships show the opposite correlation, with increasing strength of association as exposure decreases. An example of this inverse relationship is the connection between vitamin intake and birth defects: as the consumption of folic acid increases in a population, the incidence of neural-tube birth defects decreases. The direction and magnitude of the effect should also show a consistent dose–response gradient. This gradient can be demonstrated in randomized clinical trials and cohort studies, but not in case–control or descriptive studies.

In general, we would expect that an increased dose or duration of the cause would produce an increased risk or severity of the effect. The more cigarettes smoked, the higher the risk of lung cancer. The risk of lung cancer decreases among former smokers as the time from giving up smoking increases. Some phenomena produce a J-shaped curve relating exposure to outcome. In these cases, the risk is highest at both increased and decreased rates of exposure, while it is lowest in the middle. For example, a recent study of the effect of obesity on mortality showed a higher mortality among patients with the highest and lowest body mass index, with the lowest mortality among people with mid-range levels of body mass index.

Biological plausibility

When trying to decide on the applicability of study results, biological plausibility should be considered. The results of the study should be consistent with what we know about the biology of the body, cells, tissues, and organs, and with data from various branches of the biological sciences. There should be some basic-science in-vitro bench or animal studies to support the conclusions, and previously known biologic mechanisms should be able to explain the results. Is there a reason in biology that men and women smokers will have different rates of lung cancer? For some medical issues, gender, ethnicity, or cultural background has a huge influence, while for others the influence is very small. To determine which areas fall into each category, more studies of gender and other differences in medical interventions are required.

Coherence of the evidence over time

In order to have strong evidence, there should be consistency of the evidence over varying types of studies. The results of a cohort study should be similar to those of case–control or cross-sectional studies done on the same cause-and-effect relationship. Studies that show consistency with previously known epidemiological data are said to show epidemiological consistency. Also, results should agree with previously discovered relationships between the presumed cause and effect in studies done on other populations around the world. An association of high cholesterol with increased deaths due to myocardial infarction was noted in several epidemiological studies in Scandinavian countries, and a prospective study in the United States found similar results. As an aside, a potential confounding factor in this is the increase in cigarette smoking and related diseases in men after World War I and in women following World War II.

Analogy

Reasoning by analogy is one of the weakest criteria allowing generalization. Knowing that a certain vitamin deficiency predisposes women to deliver babies with certain birth defects only marginally strengthens the evidence that another vitamin or nutritional factor has a similar effect. When using analogy, the proposed cause-and-effect relationship is supported by findings from studies using the same methods but different variables. For example, multiple studies using the same methodology have demonstrated that aspirin is an effective agent for the secondary prevention of myocardial infarction (MI). From this, one could infer that a potent anticoagulant like warfarin ought to have the same effect. However, warfarin may increase mortality because of its side effect of causing increased bleeding. How about suggesting that warfarin use decreases the risk of stroke in patients who have had transient ischemic attacks, or of MI in patients with unstable angina? Again, although it is suggested by an initial study, the proposed new intervention may not prove beneficial when studied alone.

Common sense

Finally, in order to consider applying a study result to a patient, the association should make sense, and competing explanations associating risk and outcome should be ruled out. For instance, very sick patients are likely to have a poor outcome even if given a very good drug, thus making the drug look less efficacious than it truly is. Conversely, if most patients with a disease do well without any therapy, it may be very difficult to prove that one drug is better than another for that disease. This is referred to as the Pollyanna effect. When dealing with this effect, an inordinately large number of patients would be necessary to prove a beneficial effect of a medication. There are a few consequences of not using common sense. It may lead to the overselling of potent drugs, and it may result in clinical researchers neglecting more common, cheaper, and better forms of therapy. Similarly, patients thinking that a new wonder drug will cure them may delay seeking care at a time when a potentially serious problem is easily treated and complications averted.

Finally, it is up to the individual physician to determine how a particular piece of evidence should be used in a particular patient. As stated earlier, this is the art of medicine. There are many people who decry the slavish use of EBM in patient-care decisions. There are also those who demand that we use only the highest-level evidence. There must be a middle ground. We must learn to use the best evidence in the most appropriate situations and communicate this effectively to our patients. There is a real need for more high-quality evidence for the practice of medicine; however, we must treat our patients now with the highest-quality evidence available.

[Fig. 17.2 Pathman’s Pipeline of application of innovation from well-done research to use in all appropriate patients (from P. Glasziou, used with permission). The pipeline runs from sound research (filtering out unsound research) through the stages Aware, Accepted, Applicable, Able, Acted on, Agreed, and Adhered to. The leaks are addressed by evidence-based medicine (questioning, skills in EBM, evidence resources, time), the 5S schema for obtaining evidence, patient choice (decision aids, education, compliance aids), and quality improvement (skills, systems).]

Pathman’s Pipeline

The Pathman ‘leaky’ pipeline is a model of knowledge transfer, taking the best evidence from the research arena into everyday practice. This model considers the ways that evidence is lost in the process of diffusion into the everyday practice of medicine. It was developed by D. E. Pathman, a family physician, in the 1990s to model the reasons why physicians did not give children routine vaccinations, and it has been expanded to model the reasons that physicians don’t use the best evidence (Fig. 17.2). Any model of EBM must consider the effects of the constructs in this model on the behavior of practicing physicians and on the acceptability of evidence to patients.

Providers must be aware of the evidence through reading journals or through notification by list services, other on-line resources, or Continuing Medical Education (CME). They must then accept the evidence as being legitimate and useful; acceptance follows a bell-shaped curve, with the innovators followed by the early adopters, the early majority, the late majority, and finally the laggards. Providers must believe that the evidence is applicable to their patients, specifically the one in their clinic at that time. They must then be able to perform the intervention, which can be a problem in rural areas or less-developed countries. Finally, the providers must act upon the evidence and apply it to their patients. However, it is still up to the patient to agree to accept the evidence and, finally, to be compliant and adhere to it. The next chapter will discuss the process of communicating the best evidence to patients.


18

Communicating evidence to patients

Laura J. Zakowski, M.D., Shobhina G. Chheda, M.D., M.P.H., and Christine S. Seibert, M.D.

Think like a wise man but communicate in the language of the people.
William Butler Yeats (1865–1939)

Learning objectives

In this chapter you will learn:
• when to communicate evidence with a patient
• five steps to communicating evidence
• how health literacy affects the communication of evidence
• common pitfalls to communicating evidence and their solutions

When a patient asks a question, the health-care provider may need to review evidence or evidence-based recommendations to best answer that question. Once the provider is familiar with study results or clinical recommendations directed at the patient’s question, the evidence can be communicated to the patient through a variety of methods. Only when the patient’s perspective is known can this advice be tailored to the individual patient. This chapter addresses both the patient’s and the health-care provider’s roles in the communication of evidence.

Patient scenario

To highlight the communication challenges for evidence-based medicine, we will start with a clinical case. A patient in clinic asks whether she should take aspirin to prevent strokes and heart attacks. She is a 59-year-old woman who has high cholesterol (total cholesterol 231 mg/dL, triglycerides 74 mg/dL, HDL cholesterol 52 mg/dL, and LDL cholesterol 164 mg/dL), a BMI of 35 kg/m², and a sedentary lifestyle. She has worked for at least a year on weight loss and cholesterol reduction through diet and is frustrated by her lack of results. She is otherwise healthy. Her family history is significant for stroke in her mother at age 75 and heart attack in her father at age 60. She is hesitant to take medication; however, she wants to know if she should take aspirin to prevent strokes and heart attacks. Throughout the chapter, we will refer to this case and the dilemma that this patient presents.

Table 18.1. Steps for communicating evidence with patients

1. Understand the patient’s experience and expectations
2. Build partnerships
3. Provide evidence, including uncertainties
4. Present recommendations
5. Check for understanding and agreement

From: R. M. Epstein, B. S. Alper & T. E. Quill. Communicating evidence for participatory decision making. JAMA. 2004; 291: 2359–2366.

Steps to communicating evidence

Questions like this do not have a simple yes or no answer; therefore, more discussion between the provider and the patient is often needed. This discussion provides an opportunity for the provider to encourage the patient to be involved in the decision. Shared or participatory decision making is part of a larger effort toward patient-centered care, in which neither the patient nor the provider alone makes the decision about what to do; rather, both parties participate. The provider is responsible for getting the best available evidence to the patient, who must then be assisted in interpreting this evidence and putting it into the context of their life.

Very little evidence exists as to the best approach to communicating evidence to patients in either shared or physician-driven decision-making models. However, Epstein and colleagues have proposed a step-wise approach to this discussion, using a shared decision model of communication, that we have found helpful (Table 18.1). We use these steps as a basis for discussion about the communication of evidence.

Step 1: Understand the patient’s experience and expectations

Using the patient’s query about aspirin as an example, first determine why the patient is asking, using a simple question such as “What do you know about how aspirin affects heart attacks and strokes?” This will help the provider understand whether the patient has a rudimentary or a more advanced understanding of the question. When communicating evidence, knowing the patient’s baseline understanding of the question avoids reviewing information of which the patient is already aware. Finding the level of understanding is a sure way to acknowledge that the process of care is truly patient-centered.

This first step helps determine if it is necessary to communicate evidence. A patient with a question does not automatically trigger the need for a discussion of the evidence, since a patient may have already decided on a course of action and asks the question as a means of validating her knowledge. A patient may also ask a question that does not require a review of the evidence. For example, a patient may ask her physician’s opinion about continuing her bisphosphonate for osteoporosis. When asked further about her perspective, she tells you that she is concerned about the cost of the treatment. In this case, communication of the benefits of bisphosphonates will not answer her question directly; rather, understanding her financial limitations is more appropriate. For some questions about therapy, there may be no need to discuss evidence, because the patient and the provider may be in clear agreement about the treatment. Our patient’s question about aspirin as a preventive treatment against stroke and heart attacks is one that seems to require a discussion of the best available evidence.

Though typical office visits are short, taking time to understand the patient’s perspective may help avoid cultural assumptions. For example, when seeing a patient who is culturally different from you, you might assume that the patient’s values are different as well. On the other hand, it is easy to make false assumptions of shared values based on misperceived similarities of background between the provider and the patient. Understanding the patient’s perspective comes from actively questioning the patient to determine their values and perspectives, and it avoids assumptions about similarities and differences.

Patients have varying levels of understanding of health-care issues, some with vast and others with limited previous health-care experience. The patient’s level of health literacy clearly affects her perspective on the question and how she will interpret any discussion of results and recommendations. During the initial phases of the discussion about her question, it is important to understand her health literacy and general literacy levels. Asking the patient what she knows about the problem can provide an impression of health literacy. This may be adequate, but asking a question such as “How comfortable are you with the way you read?” can provide an impression of general literacy.

This initial step also helps to frame a benefit-and-risk discussion. For example, if a patient wishes to avoid taking a medication because he or she is more concerned about the side effects of treatment than the benefits of treatment, focus the discussion on the evidence in this area.


Pitfalls to understanding the patient’s experience and expectations

Some of the most well-designed studies, which have the highest potential to affect practice, are often time-limited and do not always address the long-term effects in which patients frequently have an interest. Also, many studies report major morbidity and mortality of treatment, yet patients may be more concerned about the quality-of-life effects of treatment over many years. In other studies, the use of composite outcomes can make it difficult to directly answer a patient’s question, since some of these outcomes are more important to the patient than others. The patient in our example wishes to know whether aspirin reduces the risk of heart attack. Although one may find a study that shows a statistically significant reduction of myocardial infarction, if the result is only reported as a composite outcome along with other outcomes, such as reduced incidence of angina and heart failure, the result will not directly address your patient’s question. This type of presentation of data is used by authors when an individual outcome is not itself statistically significant; the combination of outcomes is used to achieve statistical significance and get the study published. But the composite is often made up of various outcomes, not all of which have the same value to the patient. The goal of a discussion with the patient is to explain the results of each of the composite components so that she can make up her mind about which of the outcomes are important to her.

Recommendations for understanding the patient’s experience and expectations

The patient’s perspective on the problem, as well as the available evidence, determines the true need to proceed with further steps to communicate evidence. It is possible that the patient’s questions relate only to background information, which is clearly defined in the science of medicine and not dependent on your interpretation of the most recent research evidence for an answer. If evidence is needed to answer a patient’s question, first check to see whether it truly addresses the patient’s query about her desired outcomes, rather than outcomes that are not important to her.

Step 2: Build partnerships

Taking time for this step is a way to build rapport with the patient. After discussing the patient’s perspective, an impression will have developed of whether one generally agrees or disagrees with the patient. At this point in the discussion, it should be clear what existing evidence, if any, may be of interest to the patient. The physician will also have a good understanding of whether to spend the majority of the time discussing basic or more advanced information. Using phrases such as “Let me summarize what you told me so far” or “It sounds like you are not sure what to do next” can help to build a partnership that will allow a transition to the third step in the process of communicating evidence. In the example, the patient who is interested in aspirin for prevention of strokes and heart attacks is frustrated by her lack of reduction of weight or cholesterol after implementing some lifestyle changes. Expressing empathy for her struggles will likely help the patient see you as a partner in her care.

Step 3: Provide evidence

As health-care providers, we treat numbers as an important consideration in our decision-making process. While some patients may want the results this way, many do not want results that are that specific or in numerical form. As a general rule, patients tend to want few specific numbers, although patients’ preferences range from not wanting to know more than a brief statement or the “bottom line” of what the evidence shows, to wanting to know as much as is available about the actual study results. Check the patient’s preference for information by asking: “Do you want to hear specific numbers or only general information?” Many patients aren’t sure about this, and many providers don’t ask. Another way to start is by giving minimal information and allowing the patient to ask for more, or by following this basic information by asking the patient whether more specific information is desired. Previous experiences with the patient can also assist in determining how much information to discuss.

Presenting the information

There are a number of ways to communicate information to patients, and understanding the patient’s desires can help determine the best way to do this. The first approach is to use conceptual terms, such as “most patients,” “almost every patient,” or “very few patients.” This approach avoids the use of numbers when presenting results. A second approach is to use general numerical terms, such as “half the patients” or “1 in 100 patients.” The use of numerical terms is more precise than conceptual terms, but can be more confusing to patients. While these are the most common verbal approaches, both conceptual and numerical representations can also be graphed, either with rough sketches or stick figures. In a few clinical situations, more refined means of communicating evidence have been developed, such as the decision-aid programs available for prostate cancer screening. The patient answers questions at a computer about his preferences regarding prostate cancer screening and treatment. These preferences then determine a recommendation for that patient about prostate cancer screening, using a decision tree similar to the ones that will be discussed in Chapter 30. Unfortunately, these types of programs are not yet widely developed for most decision making.

The quality of the evidence also needs to be communicated, in addition to a discussion of the risks and benefits of treatment. For example, if the highest level of evidence found was an evidence-based review from a trusted source, the quality of the evidence being communicated is higher and the discussion can proceed with more confidence. If there is only poor-quality evidence, such as would be available from a case series, the provider will be less confident in the quality of the evidence and should convey more uncertainty.

Pitfalls to providing the evidence

The most common pitfall when providing evidence is giving the patient more information than she wants or needs, although often the most noteworthy pitfalls are related to the misleading nature of words and numbers. Consider this example: a patient who has a headache asks about the cause. The answer given to the patient is: “Usually headaches like yours are caused by stress. Only in extremely rare circumstances is a headache like yours caused by a brain tumor.” How frequently is this type of headache caused by stress? How frequently is it caused by a brain tumor? In this example, expressing the common nature of stress headaches as “usually” can be very vague. When residents and interns in medicine and surgery were asked to quantify this term, they chose a range of percentages from 50% to 95%. Stating that headaches due to a brain tumor occur only in “extremely rare” circumstances is also imprecise: when asked to quantify “extremely rare,” residents and interns chose a range of percentages from 1% to 10%. Knowing that the disease is rare or extremely rare may be consoling, but if there is a 1% to 10% chance that it is present, this may not be very satisfactory for the patient. It is clear that there is a great potential for misunderstanding when converting numbers to words.
words.<br />

Unfortunately, using actual numbers to provide evidence is not necessarily clearer than using words. Results of studies of therapy can be expressed in a variety of ways. For example, in a study where the outcomes are reported in binary terms, such as life or death, or heart attack or no heart attack, a physician can describe the results numerically as a relative risk reduction, an absolute risk reduction, a number needed to treat to benefit, or a length of survival or disease-free interval. Depending on how the outcomes are described, the same results have the potential to sound quite different to a patient. The following example describes the same outcome in different ways:

Relative risk reduction: This medication reduces heart attacks by 34% when compared to placebo.

Absolute risk reduction: 1.4% fewer patients taking this medication experienced heart attacks compared to placebo.

Number needed to treat to benefit (NNTB): For every 71 patients treated with this medication, one additional patient will benefit. This also means that for every 71 patients treated, 70 get no additional benefit from taking the medication.

Calculated length of disease-free interval: Patients who take this medication for 5 years will live approximately 2 months longer before they get a heart attack.
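All four statements can be generated from a single pair of event rates. The sketch below back-calculates those rates from the numbers quoted above (an absolute risk reduction of 1.4% and a relative risk reduction of 34% imply baseline risks of roughly 4.1% on placebo and 2.7% on the medication); the rates themselves are therefore illustrative, not taken from an actual trial.

```python
# One hypothetical trial result, re-expressed three ways.
# Event rates implied by the chapter's numbers (ARR 1.4%, RRR 34%).
risk_placebo = 0.041      # ~4.1% of placebo patients had heart attacks
risk_treated = 0.027      # ~2.7% of treated patients had heart attacks

arr = risk_placebo - risk_treated     # absolute risk reduction
rrr = arr / risk_placebo              # relative risk reduction
nntb = 1 / arr                        # number needed to treat to benefit

print(f"ARR = {arr:.1%}, RRR = {rrr:.0%}, NNTB = {nntb:.0f}")
# -> ARR = 1.4%, RRR = 34%, NNTB = 71
```

The same treatment effect can thus be quoted as 34%, as 1.4%, or as 71 patients, which is exactly why the choice of framing matters.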

When treatment benefits are described in relative terms, such as a relative risk reduction, patients are more likely to think that the treatment is helpful. The description of outcomes in absolute terms, such as an absolute risk reduction, leads patients to perceive less benefit from the medications. This occurs because relative changes sound bigger than absolute changes and are, therefore, more attractive. When the NNTB and length of disease-free survival were compared, a recent study showed that patients preferred treatment outcomes to be expressed as the NNTB. The authors of this study suggested that patients saw the NNTB as an avoidance of a heart attack or as a gamble, thinking that “maybe I will be the one who won’t have the heart attack,” as opposed to a postponement of an inevitable event.

A patient’s ability to understand study results for diagnostic tests may be hampered by using percentages instead of frequencies to describe those outcomes. Gigerenzer has demonstrated that for most people, describing results as 2% instead of 1 in 50 is more likely to be confusing (see the Bibliography). Using these “natural frequencies” to describe statistical results can make fairly complex statistics much easier to understand. When describing a diagnostic test using natural frequencies, give the sensitivity and specificity as the number who have the disease and will be detected (the true positive rate) and the number who don’t have the disease but will be detected as having it (the false positive rate). Then you can give the numbers who have the disease and a positive or negative test as a proportion of those with a positive or negative test. The concept of natural frequencies has been described in much more detail by Gerd Gigerenzer in his book about describing risk.
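As a sketch of this technique, take a hypothetical test with 80% sensitivity and 90% specificity for a disease affecting 1 person in 100; the numbers are invented purely to show the natural-frequency restatement.

```python
# Natural frequencies for a hypothetical test: sensitivity 80%,
# specificity 90%, prevalence 1 in 100, applied to 1000 people.
population = 1000
with_disease = 10                                  # 1 in 100
without_disease = population - with_disease        # 990

true_positives = round(with_disease * 0.80)        # 8 sick people test positive
false_positives = round(without_disease * 0.10)    # 99 healthy people test positive

positive_tests = true_positives + false_positives  # 107 positives in all
ppv = true_positives / positive_tests
print(f"{true_positives} of {positive_tests} positives are real ({ppv:.0%})")
# -> 8 of 107 positives are real (7%)
```

Saying “of 107 people who test positive, only 8 actually have the disease” is typically far clearer to patients than quoting a 7% positive predictive value.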

Patients’ interpretations of study results are frequently affected by how the results of the study are presented, or framed. For example, if a study evaluated an outcome such as life or death, this can be presented either in a positive way, by saying that 4 out of 5 patients lived, or in a negative way, that 1 out of 5 patients died. The use of positive or negative terms to describe study outcomes does influence a patient’s decision and is described as framing bias.


A study of patients, radiologists, and business students illustrated this point. They were asked to imagine they had lung cancer and to choose between surgery and radiation therapy. When the same results were presented first in terms of death and then in terms of life, about one quarter of the study subjects changed their minds about their preference. To avoid the confusion associated with percentages or framing biases, using comparisons can be helpful. For example, if a patient is considering whether to proceed with a mammogram, a statement such as “The effect of yearly screening is about the same as driving 300 fewer miles per year” is helpful, if such a comparison is known. This relates the risk to a common risk of daily living and helps the patient put it into perspective. We will discuss this further when talking about quantifying patient values in Chapter 30.

Recommendations about providing the evidence

The most important recommendation is to avoid overwhelming the patient with too much information. The key to avoiding this pitfall is to repeatedly check with the patient before and during delivery of the information to find out how much she understands. Using verbal terms such as “usually” instead of numbers is less precise and may give unintended meaning to the information. If the use of numbers is acceptable to the patient, we recommend using them. When numbers are used as part of the discussion, present them as natural frequencies rather than percentages. If familiar comparisons are available, they can be additionally helpful. To avoid the framing bias, results should be presented in both positive and negative terms.

Another recommendation is to use a variety of examples to communicate evidence. For our example patient, who is interested in aspirin to prevent heart attacks and strokes, it may be most practical to use multiple modalities for presenting information, including verbal and pictorial presentations, presenting the evidence in this way: “In a large study of women like you who took aspirin for 10 years, there was no difference in the number of heart attacks between patients who took aspirin and those who didn’t. Two out of 1000 fewer women who took aspirin had strokes. In that study, 1 out of 1000 women experienced excessive bleeding from the aspirin.”
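Those counts convert directly into the number-needed-to-treat language introduced in Chapter 17. A minimal sketch, treating the quoted 2-per-1000 stroke reduction and 1-per-1000 excess bleeding as attributable to aspirin:

```python
# Restating the aspirin counts above as NNTB and NNTH.
arr_stroke = 2 / 1000          # 2 per 1000 fewer strokes over 10 years
ari_bleeding = 1 / 1000        # 1 per 1000 excess bleeding episodes

nntb = 1 / arr_stroke          # 500 women treated to prevent one stroke
nnth = 1 / ari_bleeding        # 1000 women treated for one excess bleed
print(nntb, nnth)              # -> 500.0 1000.0
```

Framing the benefit as “500 women like you would need to take aspirin for 10 years to prevent one stroke” is another honest way of presenting the same result.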

Step 4: Present recommendations

If a number of options exist and one is not clearly superior, the choices should be presented objectively. If one has a strong belief that one option is best for the patient, state that, with an explicit discussion of the evidence and of how the option best fits with the patient’s values. This step is closely connected to the strength of the evidence. When the evidence is less than robust, either because it comes from weak study designs or because there are no known studies available, you cannot give strong evidence-based recommendations and must mitigate this by presenting options. When the evidence is stronger, present a recommendation and explain how that recommendation may meet the patient’s goals. In all cases, the physician has to be careful to differentiate evidence-based recommendations from those generated from personal experiences or biases regarding treatment.

For our patient interested in aspirin for prevention of strokes and heart attacks, we might say: “While I understand it has been hard to lose weight and reduce your cholesterol, taking an aspirin won’t help you prevent heart attacks and is only very minimally helpful in preventing strokes. I do not recommend that you take aspirin.”

Step 5: Check for understanding and agreement

Bringing the interview to a close should include checking for understanding by using questions such as “Have I explained that clearly?” This may not be enough; instead, ask the patient, “How would you summarize what I said?” This is more likely to indicate whether the patient understands the evidence and your recommendations. Another important part of this step is to allow the patient time to ask questions. When the physician and the patient are both in agreement that the information has been successfully transmitted and all questions have been answered, then a good decision can be made.


19

Critical appraisal of qualitative research studies

Steven R. Simon, M.D., M.P.H.

You cannot acquire experience by making experiments. You cannot create experience. You must undergo it.
Albert Camus (1913–1960)

Learning objectives

In this chapter you will learn:
• the basic concepts of qualitative research
• the process for critical appraisal of qualitative research
• the goals and limitations of qualitative research

While the evidence-based medicine movement has espoused the critical appraisal and clinical application of controlled trials and observational studies to guide medical decision making, much of medicine and health care revolves around issues and complexities not ideally suited to quantitative research. Qualitative research is a field dedicated to characterizing and illuminating the knowledge, attitudes, and behaviors of individuals in the context of health care and clinical medicine. Whereas quantitative research is interested in testing hypotheses and estimating effect sizes with precision, qualitative research attempts to describe the breadth of issues surrounding a problem, frequently yielding questions and generating hypotheses to be tested. Qualitative research in medicine frequently draws on expertise from anthropology, psychology, and sociology, fields steeped in a tradition of careful observation of human behavior. Unfortunately, some in medicine have an attitude that qualitative research is not particularly worthwhile for informing patient care. But you will see that qualitative studies can be powerful tools to expose psychosocial issues in medicine and can serve as hypothesis-generating studies about the personal preferences of patients and health-care workers.

Types of qualitative research studies

Qualitative research studies usually involve the collection of a body of information through direct observation, interviews, or existing documents. Researchers then apply one or more analytic approaches to sift through the available data to identify the main themes and the range of emotions, concerns, or approaches. In the medical literature, in-depth interviews with individuals such as patients or health-care providers, and focus-group interviews and discussions among patients with a particular condition, are the most common study designs encountered. Observations of clinical behavior and analyses of narratives found in medical documents (e.g., medical records) also appear with some frequency. Examples of qualitative research studies are described in Table 19.1.

When is it appropriate to use qualitative research?

Qualitative research is an appropriate approach to answering research questions about the social, attitudinal, behavioral, and emotional dimensions of health care. When the spectrum of perspectives needs to be known for the development of interventions such as educational programs or technological implementations, qualitative research can characterize the barriers to and facilitators of change toward the desired practice. This can be the initial research done to determine the barriers to adoption of new research results in general practice. When the research question is “Why do patients behave in a certain way?” or “What issues drive a health-care organization to establish certain policies?”, qualitative research methods offer a rigorous approach to data collection and analysis that can reduce the need to rely on isolated anecdote or opinion.

What are the methods of qualitative research?

Although qualitative research studies have more methodological latitude to accommodate the wide range of data used for analysis, readers of qualitative research reports can nevertheless expect to find a clear statement of the study objectives, an account of how subjects were selected to participate and the rationale behind that selection process, a description of the data elements and how they were collected, and an explanation of the analytic approach. Readers of qualitative studies should be able to critically appraise all of these components of the research methods.


Table 19.1. Examples of qualitative research studies

In-depth interviews
In developing an intervention to improve the use of acid-reducing medications in an HMO, researchers carried out in-depth interviews with 10 full-time primary care physicians about their knowledge, attitudes, and practice regarding dyspepsia; the use of chronic acid-suppressing drugs; approaches to diagnosing and treating Helicobacter pylori infection; and the feasibility and acceptability of various potential interventions that might be used in a quality improvement program, in order to explore the rationale underlying various medication decisions and the barriers to prescribing consistent with evidence-based guidelines. (Reference: S. R. Majumdar, S. B. Soumerai, M. Lee & D. Ross-Degnan. Designing an intervention to improve the management of Helicobacter pylori infection. Jt. Comm. J. Qual. Improv. 2001; 27: 405–414.)

Focus-group interviews
To investigate the role of secrets in medicine, researchers conducted a series of eight focus groups among 61 primary care physicians in Israel with a wide variety of seniority, ethnic, religious, and immigration backgrounds. The authors’ analysis revealed insights about the definitions, prevalence, process, and content of secrets in primary care. (Reference: S. Reis, A. Biderman, R. Mitki & J. M. Borkan. Secrets in primary care: a qualitative exploration and conceptual model. J. Gen. Intern. Med. 2007; 22: 1246–1253.)

Observation of clinical encounters
In a study of patient–physician communication about colorectal cancer screening, researchers drew from an existing data set of videotaped primary care encounters to explore the extent to which colorectal cancer screening discussions occur in everyday clinical practice. The researchers transcribed the videotaped discussions and reviewed both the videotapes and the transcriptions, coding content related to the specific types of screening discussed, the messages conveyed, and the time spent. (Reference: M. S. Wolf, D. W. Baker & G. Makoul. Physician–patient communication about colorectal cancer screening. J. Gen. Intern. Med. 2007; 22: 1493–1499.)

Study objective<br />

The study objective should be explicitly stated, usually in the Introduction section<br />

of the article. This objective is often framed as a research question and is<br />

the alternative or research hypothesis for the study. Unlike quantitative research<br />

studies, where the study objective is generally very specific and outcome-based,

the objective or research question in qualitative studies frequently has a nonspecific<br />

or general flavor. In fact, it is one of the strengths of qualitative research<br />

that the specific details surrounding the study objective often emerge through the data collection and analytic processes and can actually change the direction



of the research. Nevertheless, it is important for readers to be able to assess what<br />

the researchers originally set out to accomplish.<br />

Sampling<br />

While quantitative research studies generally recruit participants through random<br />

selection or other similar approaches to minimize the potential for selection<br />

bias, qualitative research studies are not concerned with accruing a pool<br />

of individuals that resemble the larger population. Instead, qualitative studies<br />

use purposive sampling, the intentional recruitment of individuals with specific<br />

characteristics to encompass the broadest possible range of perspectives<br />

on the issue being studied. In qualitative research, a sample size is generally<br />

not pre-specified. Instead, researchers identify and recruit participants until it<br />

becomes apparent that all salient attitudes or perspectives have been identified.<br />

This approach is known variously as theoretical saturation or sampling to<br />

redundancy. Readers should assess the researchers’ rationale for selecting and<br />

sampling the set of study participants, and that rationale should be consistent<br />

with the study objectives.<br />

Data collection

In assessing the validity of the results of quantitative studies, the reader can consider<br />

whether and how all relevant variables were measured, whether adequate<br />

numbers of study participants were included, and whether the data were measured<br />

and collected in an unbiased fashion. Similarly, in qualitative research<br />

studies, the reader should expect to find a credible description of how the<br />

researchers obtained the data and be able to assess whether the data collection<br />

approach likely yielded all relevant perspectives or behaviors being studied.<br />

This criterion is tricky for both researchers and readers, since determining<br />

the spectrum of relevant concepts likely comprises part of the study’s objective.<br />

Researchers should describe the iterative process by which they collected information<br />

and used the data to inform continued data collection. The approach<br />

chosen for data collection should combine feasibility and validity. Readers<br />

should ask, and authors should articulate, whether alternative approaches were<br />

considered and, if so, why they were not taken.<br />

Authors should also detail the efforts undertaken to ascertain information that<br />

may be sensitive for a variety of reasons. For example, there may be issues of<br />

privacy or social standing which could prevent individuals from revealing information<br />

relevant to the study questions. Researchers and readers must always<br />

be concerned about social desirability bias when considering the responses



or comments that participants may provide when they know they are being<br />

observed. The extent to which researchers attempt to collect richly detailed perspectives<br />

from study subjects can help to reassure the reader that subjects at least<br />

had ample opportunity to express their knowledge, attitudes, or concerns.<br />

Analysis<br />

There is no single correct approach to analyzing qualitative data. The approach<br />

that researchers take will reflect the study question, the nature of the available<br />

data, and the preferences of the researchers themselves. This flexibility can be<br />

daunting for researcher and reader alike. Nevertheless, several key principles<br />

should guide all qualitative analyses, and readers should be able to assess how<br />

well the study adhered to these principles.<br />

All data should be considered in the analysis. This point may seem obvious,<br />

but it is important that readers feel reasonably confident that the data collection<br />

not only captured all relevant perspectives but that the analysis did not disregard<br />

or overlook elements of data that should be considered. There is no sure-fire way<br />

to determine whether all data were included in the analysis, but readers can reasonably<br />

expect study authors to report that they used a systematic method for<br />

cataloguing all data elements. While not essential, many studies use computer<br />

software to manage data. Consider whether multiple observers participated in<br />

the analysis and whether the data were reviewed multiple times. The agreement<br />

between observers, also known as the inter-rater reliability, should be measured<br />

and reported.<br />

The results of interviews or open-ended questions can be analyzed using an<br />

iterative technique of identification of common themes. First the answers to<br />

questions given by an initial group are reviewed and the important themes are<br />

selected by one observer. The responses are catalogued into these themes. A second<br />

researcher goes over those same responses with the list of themes and catalogues<br />

the responses, blinded to the results of the first researcher. Following

this process, inter-rater reliability is assessed and quantified using a test such as<br />

the Kappa statistic. If the degree of agreement is substantial, one reviewer can<br />

categorize and analyze the remaining responses.<br />
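As a concrete illustration of this two-coder workflow, here is a minimal sketch in Python (our own illustration, not drawn from any study cited here; the themes and codings are hypothetical). Cohen's kappa measures the agreement between the two researchers' theme assignments after correcting for the agreement expected by chance:

```python
# Minimal sketch: Cohen's kappa for two researchers who each assigned
# ten interview responses to one of three hypothetical themes.
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Cohen's kappa for two raters' categorical codings of the same items."""
    n = len(rater1)
    # Observed agreement: proportion of items coded identically.
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    # Chance agreement: sum over themes of the product of each rater's
    # marginal proportions.
    c1, c2 = Counter(rater1), Counter(rater2)
    expected = sum(c1[t] * c2[t] for t in set(rater1) | set(rater2)) / n ** 2
    return (observed - expected) / (1 - expected)

rater1 = ["cost", "access", "cost", "trust", "access",
          "cost", "trust", "cost", "access", "trust"]
rater2 = ["cost", "access", "cost", "trust", "cost",
          "cost", "trust", "cost", "access", "trust"]
print(f"kappa = {cohens_kappa(rater1, rater2):.2f}")  # -> kappa = 0.85
```

A kappa of 1 is perfect agreement and 0 is agreement no better than chance; values above roughly 0.6 are conventionally read as substantial, the situation in which a single reviewer can finish the coding.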

Studies of human subjects’ attitudes or perspectives rarely yield a set of observations<br />

that unanimously signal a common theme or perspective. It is common<br />

in qualitative studies for investigators to come upon observations or sentiments<br />

that do not seem to fit what the preponderance of their data seem to be signaling.<br />

These discrepancies are to be expected in qualitative research and, in fact, are<br />

an important part of characterizing the range of emotions or behaviors among<br />

the study participants. Readers should be suspicious of the study’s findings if the<br />

results of a qualitative study all seem to fall neatly in line with one salient emerging<br />

theory or conclusion.



Researchers should triangulate their observations. Triangulation refers to the<br />

process by which key findings are verified or corroborated through multiple<br />

sources. For example, researchers will frequently have subjective reactions to<br />

qualitative data, and these reactions help them to formulate conclusions and<br />

should lead to further data collection. Having multiple researchers independently<br />

analyzing the primary data helps to ensure that the findings are not<br />

unduly influenced by the subjective reactions of a single researcher. Another<br />

form of triangulation involves comparing the results of the analysis with external<br />

information, either from or about the study participants or from other studies.<br />

Theories or conclusions from one study may not be consistent with existing<br />

theories in similar fields, but when such similarities are observed, or when the<br />

results would seem to fit broader social science theories or models, researchers<br />

and readers may be more confident about the validity of the analysis.<br />

Researchers frequently perform another form of triangulation known as<br />

member-checking. This approach involves taking the study findings back to the<br />

study participants and verifying the conclusions with them. Frequently, this process<br />

of member-checking will lead to additional data and further illumination of<br />

the conclusions. Since the purpose of qualitative research is, in large measure, to<br />

describe or understand the phenomena of interest from the perspective of the<br />

participants, member-checking is useful, because the participants are the only<br />

ones who can legitimately judge the credibility of the results.<br />

Readers of qualitative articles will encounter a few analytic approaches and<br />

principles that are commonly employed and deserve mention by name. A content<br />

analysis generally examines words or phrases within a wide range of texts<br />

and analyzes them as they are used in context and in relationship with other language.<br />

An example of a content analytic strategy is immersion-crystallization.<br />

Using this approach, researchers immerse themselves repeatedly in the collected<br />

data, usually in the form of transcripts or audio or video recordings, and through<br />

iterative review and interaction in investigator meetings, coupled with reflection<br />

and intuitive insight, clear, consistent, and reportable observations emerge and<br />

crystallize.<br />

Grounded theory is another important qualitative approach that readers will<br />

encounter. The self-defined purpose of grounded theory is to develop theory<br />

about phenomena of interest, but this theory must be grounded in the reality<br />

of observation. The methods of grounded theory research include coding, memoing,<br />

and integrating. Coding involves naming and labeling sentences, phrases,

words, or even body language into distinct categories; memoing means that the<br />

researchers keep written notes about their observations during data analysis and<br />

during the coding process; and integration, in short, involves bringing the coded<br />

information and memos together, through reflection and discussion, to form<br />

a theory that accounts for all the coded information and researchers’ observations.<br />

For grounded theory, as for any other qualitative approach, triangulation,<br />

member-checking and other approaches to ensuring validity remain relevant.



Applying the results of qualitative research<br />

How do I apply the results?<br />

Judging the validity of qualitative research is no easy task, but determining<br />

when and how to apply the results is even murkier. When qualitative research<br />

is intended to generate hypotheses for future research or to test the feasibility<br />

and acceptability of interventions, then applying the results is relatively straightforward.<br />

Whatever is learned from the qualitative studies can be incorporated in<br />

the design of future studies, typically quantitative, to test hypotheses. For example,<br />

if a qualitative research study suggests that patients prefer full and timely<br />

disclosure when medical errors occur, survey research can determine whether<br />

this preference applies broadly and whether there are subsets of the population<br />

for whom it does not apply. Moreover, intervention studies can test whether educating<br />

clinicians about disclosure results in greater levels of patient satisfaction<br />

or other important outcomes.<br />

But when can the results of qualitative research be applied directly to the day-to-day

delivery of patient care? The answer to this question is, as for quantitative<br />

research, that readers must ask, “Were the study participants similar to those in<br />

my own environment?” If the qualitative study under review included patients<br />

or community members, were they similar in demographic and clinical characteristics<br />

to patients in my own practice or community? If the study participants<br />

were clinicians, were their clinical and professional situations similar to my own?<br />

If the answers to these questions are “yes,” or even “maybe,” then the reader<br />

can use the results of the study to reflect on his or her own practice situation. If<br />

the qualitative research study explored patients’ perceived barriers to obtaining<br />

preventive health care, for example, and if the study population seems similar<br />

enough to one’s own, then the clinician can justifiably consider these potential<br />

barriers among his or her own patients, and ask about them. Considering<br />

another example, if a qualitative study exploring patient–doctor interactions at<br />

the end of life revealed evidence of physicians distancing themselves from relationships<br />

with their patients, clinicians should reflect and ask themselves – and<br />

their patients – how they can improve in this area.<br />

Qualitative research studies rarely result in landmark findings that, in and of<br />

themselves, transform the practice of medicine or the delivery of health care.<br />

Nevertheless, qualitative studies increasingly form the foundation for quantitative<br />

research, intervention studies, and reflection on the humanistic components<br />

of health care.


20<br />

An overview of decision making in medicine<br />

Nothing is more difficult, and therefore more precious, than to be able to decide.<br />

Napoleon I (1769–1821)<br />

Learning objectives<br />

In this chapter you will learn:<br />

- how to describe the decision-making strategies commonly used in medicine
- the process of formulating a differential diagnosis
- how to define pretest probability of disease
- the common modes of thought that can aid or hinder good decision making
- the problem associated with premature closure of the differential diagnosis and some tactics to avoid that problem

Chapters 21 to 31 teach the process involved in making a diagnosis and thereby<br />

determining the best course of management for one’s patient. First, we will<br />

address the principles of how to use diagnostic tests efficiently and effectively.<br />

Then, we will present some mathematical techniques that can help the health-care

practitioner and the health-care system policy maker come to the most<br />

appropriate medical decisions for both individuals and populations of patients.<br />

Medical decision making<br />

Medical decision making is more complex now than ever before. The way one<br />

uses clinical information will affect the accuracy of diagnoses and ultimately the<br />

outcome for one’s patient. Incorrect use of data will lead the physician away from<br />

the correct diagnosis, may result in pain, suffering, and expense for the patient,<br />

and may increase cost and decrease the efficiency of the health-care system.<br />




Clinical diagnosis requires early hypothesis generation called the differential<br />

diagnosis. This is a list of plausible diseases from which the patient may be suffering,<br />

based upon the information gathered in the history and physical examination.

Gathering more clinical data, usually obtained by performing diagnostic<br />

tests, refines this list. However, using diagnostic tests without paying attention<br />

to their reliability and validity can lead to poor decision making and ineffective<br />

care of the patient. Overall, we are trying to measure the ability of each element<br />

of the history, physical examination, and laboratory testing to accurately distinguish<br />

patients who have a given disease from those without that disease. The<br />

quantitative measure of this is expressed mathematically as the likelihood ratios<br />

of a positive or negative test. This tells us how much more likely it is that a patient<br />

has the disease if the test is positive or how much less likely the disease is if the<br />

test is negative.<br />

Diagnostic-test characteristics are relatively stable characteristics of a test and<br />

must be considered in the overall process of diagnosis and management of a<br />

disease. The most commonly measured diagnostic-test characteristics are the<br />

sensitivity, which is the ability of a test to find disease when it is present, and<br />

specificity, defined as the ability of a test to correctly identify the absence of disease in people who are not diseased. A test's ability to predict disease when it is positive is the positive predictive value. Similarly, the negative predictive value

is the test’s ability to predict lack of disease when it is negative. These values<br />

both depend on the disease prevalence in a population, which is also called the<br />

pre-test probability. The likelihood ratios can then be used to revise the original<br />

diagnostic impression to calculate the statistical likelihood of the final diagnosis,<br />

the post-test probability. This can be calculated using a simple equation or<br />

nomogram.<br />
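To make these quantities concrete, here is a minimal sketch (our own illustration with hypothetical numbers, not an example from the text) that derives the test characteristics from a 2 × 2 table and then uses a likelihood ratio to turn a pretest probability into a post-test probability by way of odds:

```python
# Minimal sketch: diagnostic-test characteristics from a 2x2 table
# (tp = true positives, fp = false positives, fn = false negatives,
# tn = true negatives); all counts hypothetical.
def test_characteristics(tp, fp, fn, tn):
    sens = tp / (tp + fn)        # sensitivity: disease found when present
    spec = tn / (tn + fp)        # specificity: non-disease found when absent
    ppv = tp / (tp + fp)         # positive predictive value (prevalence-dependent)
    npv = tn / (tn + fn)         # negative predictive value (prevalence-dependent)
    lr_pos = sens / (1 - spec)   # likelihood ratio of a positive test
    lr_neg = (1 - sens) / spec   # likelihood ratio of a negative test
    return sens, spec, ppv, npv, lr_pos, lr_neg

def post_test_probability(pretest_prob, lr):
    """Revise a pretest probability with a likelihood ratio via odds."""
    pretest_odds = pretest_prob / (1 - pretest_prob)
    post_odds = pretest_odds * lr
    return post_odds / (1 + post_odds)

# A test with sensitivity 0.90 and specificity 0.80 has LR+ = 4.5;
# applied to a 25% pretest probability it yields a 60% post-test probability.
print(round(post_test_probability(0.25, 4.5), 2))  # -> 0.6
```

The same odds arithmetic is what the nomogram performs graphically.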

The characteristics of tests can be used to find treatment and testing thresholds.<br />

The treatment threshold is the pretest probability above which we would<br />

treat without testing. The testing threshold is the pretest probability below which

we would neither treat nor test for a particular disease. Finally, the receiver<br />

operating characteristic (ROC) curves are graphs that summarize sensitivity and<br />

specificity over a series of cutoff values. They are used to determine the overall<br />

value of a test, the best cutoff point for a test, and the best test when comparing<br />

two diagnostic tests.<br />
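A short sketch (hypothetical test values, our own illustration) shows where the points of an ROC curve come from: each candidate cutoff produces one pair of sensitivity and 1 − specificity.

```python
# Minimal sketch: tracing an ROC curve for a continuous test in which
# higher values suggest disease. All values are hypothetical.
diseased = [68, 72, 75, 80, 85, 90]       # test results, diseased patients
non_diseased = [50, 55, 60, 62, 70, 74]   # test results, non-diseased patients

for cutoff in range(50, 95, 5):
    sens = sum(v >= cutoff for v in diseased) / len(diseased)
    spec = sum(v < cutoff for v in non_diseased) / len(non_diseased)
    print(f"cutoff {cutoff}: sensitivity {sens:.2f}, 1 - specificity {1 - spec:.2f}")

# Plotting sensitivity against (1 - specificity) across all cutoffs traces
# the ROC curve; the cutoff nearest the upper-left corner is generally the
# best compromise between sensitivity and specificity.
```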

More advanced mathematical constructs for making medical decisions involve<br />

the use of decision trees, which quantify diagnostic and treatment pathways<br />

using branch points to help choose between treatment options. Ideally, they will<br />

show the most effective care process. This is heavily influenced by patient values,<br />

which can be quantified for this process. Finally, the cost-effectiveness of a given<br />

treatment can be determined and it will help choose between treatment options<br />

when making decisions for a population.
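As a toy example of how such a tree is evaluated (a minimal sketch with hypothetical probabilities and utilities, not a clinical recommendation), each strategy's expected value is the probability-weighted average of the utilities at its branch endpoints, and the strategy with the higher expected value is preferred:

```python
# Minimal sketch: "folding back" a one-level decision tree. Each strategy
# is a list of (probability, utility) branches; the probabilities within a
# strategy sum to 1. All numbers are hypothetical.
def expected_value(branches):
    return sum(p * u for p, u in branches)

surgery = [(0.90, 0.95), (0.10, 0.20)]  # 90% cure, 10% serious complication
medical = [(0.70, 0.90), (0.30, 0.50)]  # 70% response, 30% persistent disease
print("surgery:", expected_value(surgery))  # -> 0.875
print("medical:", expected_value(medical))  # -> 0.78, so surgery is favored here
```

Patient values enter as the utilities, and a cost-effectiveness analysis attaches costs to the same branches.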



Variation in medical practice and the justification for the use of<br />

practice guidelines<br />

More than ever in the current health-care debate, physician decisions are being<br />

challenged. One major reason is that not all physician decisions are correct or<br />

even consistent. A recent study of managed care organization (MCO) physicians<br />

showed that only half of the physicians in the study treated their diabetic and<br />

heart-attack patients with proven lifesaving drugs. A recent estimate of medical<br />

errors suggested that up to 98 000 deaths per year in the United States were due<br />

to preventable medical errors. This leads to the perception that many physician<br />

decisions are arbitrary and highly variable.<br />

Several studies done in the 1970s showed a marked geographic variation in<br />

the rate of common surgeries. In Maine, hysterectomy rates varied from less<br />

than 20% in one county to greater than 70% in another. This variation was true<br />

despite similar demographic patterns and physician manpower in the two counties.<br />

Studies looking at prostate surgery, heart bypass, and thyroid surgery show<br />

variation in rates of up to 300% in different counties in New England. Among<br />

Medicare patients, rates for many procedures in 13 large metropolitan areas varied<br />

by greater than 300%. Rates for knee replacement varied by 700% and for<br />

carotid endarterectomies by greater than 2000%.<br />

How well do physicians agree among themselves about treatment or diagnosis?<br />

In one study, cardiologists reviewing angiograms could not reliably agree<br />

upon whether there was an arterial blockage. Sixty percent disagreed on whether<br />

the blockage was at a proximal or distal location. There was a 40% disagreement<br />

on whether the blockage was greater or less than 50%. In another study, the same<br />

cardiologists disagreed with themselves from 8% to 37% of the time when rereading<br />

the same angiograms. Given a hypothetical patient and asked to give<br />

a second opinion about the need for surgery, half of the surgeons asked gave<br />

the opinion that no surgery was indicated. When asked about the same patient<br />

2 years later, 40% had changed their mind.<br />

Physicians routinely treat high intraocular pressure because if intraocular<br />

pressure is high it could lead to glaucoma and blindness. How high must the<br />

intraocular pressure be in order to justify treatment? In 1961, the ophthalmologic<br />

textbooks said 24 mmHg. In 1976, it was noted to be 30 mmHg without any<br />

explanation for this change based upon clinical trials.

There are numerous other examples of physician disagreement. Physician<br />

experts asked to give their estimate of the effect on mortality of screening for<br />

colon cancer varied from 5% to 95%. Heart surgeons asked to estimate the 10-<br />

year failure rates of implanted heart valves varied from 3% to 95%. All of these<br />

examples suggest that physician decision making is not standardized. Evidence-based

decision making in health care, the conscientious application of the best



possible evidence to each clinical encounter, can help us regain the confidence<br />

of the public and the integrity of the profession.<br />

More standardized practice can help reduce second-guessing of physician<br />

decisions. This questioning commonly occurs with utilization review of physician<br />

decisions by managed care organizations or government payors. It can lead<br />

to rejection of coverage for “extra” hospital days or refusal of payment for recommended<br />

surgery or other therapies. This questioning also occurs in medical<br />

malpractice cases where an expert reviews care through a retrospective review of<br />

medical records. Second-guessing, as well as the marked variation in physician<br />

practices, can be reduced through the use of practice guidelines for the diagnosis<br />

and treatment of common disorders. When used to improve diagnosis, we refer<br />

to these guidelines as diagnostic clinical prediction rules.<br />

A primary cause of physician variability lies in the complexity of clinical problems.<br />

Clinical decision making is both multifaceted and practiced on highly<br />

individualized patients. Some factors to consider with clinical decision making<br />

include patient expectations, changing reimbursement policies, competition,<br />

malpractice threat, peer pressure, and incomplete information. Overall, physicians<br />

are well-meaning and confront not only biological but also sociological<br />

and political variability. We can’t know the outcomes of our decisions beforehand,<br />

but must act anyway.<br />

There are some barriers to the process of using best evidence in medical decision<br />

making. The quality of evidence that one is looking for is often only fair or<br />

poor. Some physicians believe that if there is no evidence from well-done randomized<br />

controlled trials, then the treatment in question should not be used. Be

aware that lack of evidence is not equal to evidence of lack of effect. Most physicians<br />

gladly accept much weaker evidence, yet don’t have the clinical expertise to<br />

put that evidence into perspective for a particular clinical encounter. They also<br />

may not be able to discern well-done RCTs or even observational studies from<br />

those that are heavily biased. This goes to show that there is a need for clinical<br />

expertise as part of the EBM process.<br />

Some of the reasons for the high degree of uncertainty in physician decision<br />

making are noted in Table 20.1. Physicians want some certainty before<br />

they are willing to use an intervention, yet tend to do what was learned in<br />

medical school or learned from the average practitioner. The rationalization for<br />

this is that if everyone is doing the treatment, it must be appropriate. Some<br />

physician treatment decisions are based on the fact that a disease is common

or severe. If a disease is common, or the outcome severe, they are more willing<br />

to use whatever treatment is available. There are even times when physicians<br />

feel the need simply to do something, and the proposed treatment is all<br />

they have. There is also a certain amount of fascination with new diagnostic<br />

or treatment modalities that results in wholesale increases in usage of those<br />

methods.



Table 20.1. Causes of variability in physician performance
(1) Complexity of clinical problem: multiple factors influence actions
(2) Uncertainty of outcomes of decisions: variability of outcomes in studies
(3) Need to act: feeling on our part that we have to "do something"
(4) Large placebo effect: spontaneous cures (sometimes doing nothing but educating is the best thing)
(5) Patient expectations: expectation from patients and society that what we do will work
(6) Political expectations: do what is cheapest and best
(7) Malpractice threat: don't make any mistakes
(8) Peer pressure: do things the same way that other physicians are doing them
(9) Technological imperative: we have a new technology so let's use it

Fig. 20.1 Anatomy of a decision (figure elements: physician judgment, patient preferences, best evidence, analysis, shared judgment, final decision, potential outcomes).

One way physicians can do better is by having better clinical research and<br />

improved quality of evidence for clinical decisions. Physicians must also increase<br />

their ability to use the available evidence through improving individual and collective<br />

reasoning and actions. Figure 20.1 shows the anatomy of a clinical decision,<br />

a simplified look at decision making in general and the factors that influence<br />

the process. Reduction of error in the decision-making process requires<br />

better training of physicians in all three parts of EBM: evaluating the evidence,<br />

understanding the clinical situation, and having good patient communications.<br />

Another way to reduce error is by “automating” the decision process. If there is<br />

good evidence for a certain practice, it ought to be done the best way known at all<br />

times. Practice guidelines are one way of automating part of the decision-making<br />

process for physicians.<br />

In 1910, Abraham Flexner asked physicians and medical schools to stop teaching<br />

empiricism and rely on solid scientific information. In those days, empiric<br />

facts were usually based on single-case testimonials or poorly documented



Table 20.2. Components of the H&P (with a clinical example)
Chief complaint: Why the patient sought medical care (e.g., coughing up blood, hemoptysis)
History of present illness: Description of symptoms: what, when, where, how much, etc. (e.g., coughing up spots of bright red blood four or five times a day for 3 days associated with some shortness of breath, fever, poor appetite, occasional chest pain, and fatigue)
Past medical history: Previous illness and operations, medications and allergies, including herbal, vitamin, and supplement use (e.g., seizure disorder, on phenytoin daily, no operations or allergies)
Family and social histories: Hereditary illness, habits and activities, diet, etc. (e.g., iv drug abuser, homeless, poor diet, adopted and does not know about his or her family medical history)
Review of systems: Review of all possible symptoms of all bodily systems (e.g., recent weight loss and night sweats for the past 3 weeks, occasional indigestion)
Physical examination: (e.g., somewhat emaciated male in minimal respiratory distress, cervical lymphadenopathy, dullness to percussion at right upper lobe area and few rales in this area, multiple skin lesions consistent with needle marks and associated sclerosis of veins, remainder of examination normal)

case presentations. He proposed teaching and applying the pathophysiological<br />

approach to diagnosis and treatment. The medical establishment endorsed this,<br />

and the modern medical school was born. Currently, we are in the throes of a<br />

paradigm shift. We want to see the empirical data for a particular therapy or diagnosis<br />

and ought to act only on evidence that is of high quality.<br />

The clinical examination<br />

In most cases in health care, a patient does not walk into the physician’s office<br />

and present with a pre-made diagnosis. They arrive with a series of signs and<br />

symptoms that one must interpret correctly in order to make a diagnosis and<br />

initiate the most appropriate therapy. The process by which this occurs begins<br />

with the clinical examination. Traditionally, this consists of several components<br />

collectively called the history and physical or H&P (Table 20.2).



Table 20.3. OLDCARTS acronym for history of the present illness<br />

O Onset of symptoms and chronological description of change in the symptoms<br />

L Location of symptoms and radiation to other areas<br />

D Duration of individual episodes or from when symptoms started<br />

C Characteristics of the symptoms<br />

A Associated or aggravating factors<br />

R Relieving factors<br />

T Timing, when is it worse or better<br />

S Severity on a scale from 0 to 10<br />

The chief complaint is the stated reason that the patient comes to medical<br />

attention. It is often a disorder of normal functioning that alarms the patient<br />

and tells the clinician in which systems to look for pathology.<br />

The history of the present illness is a chronological description of the chief<br />

complaint. The clinician seeks to determine the onset of the symptoms,<br />

their quality, frequency, duration, associated symptoms, and exacerbating<br />

and alleviating factors. The acronym OPQRSTAAAA is often used to remind<br />

clinicians of the elements of the history of the present illness. OPQRSTAAAA<br />

stands for Onset, Position, Quality, Radiation, Severity, Timing, Aggravating,<br />

Alleviating, Associated factors, and Attribution. A brief review of the patient’s<br />

symptoms seeks to find dysfunction in any other parts of the body that could<br />

be associated with the potential disease. It is important to include all the<br />

pertinent positives and negatives in reporting the history of the present illness.<br />

Another acronym for the history of the present illness, OLDCARTS, is

described in Table 20.3.<br />

The past medical history, past surgical history, family history, social and occupational<br />

history, and the medication and allergy history are all designed to<br />

get a picture of the patient’s medical and social background. This puts the<br />

illness into the context of the person’s life and is an integral part of any<br />

medical history. The accuracy and adequacy of this part of the history is<br />

extremely important. Some experts feel that this is the most important part<br />

of the practice of holistic medicine, helping ensure that the physician looks<br />

at the whole patient and the patient’s environment.<br />

The review of systems gives the clinician an overview of the patient’s additional<br />

medical conditions. These may or may not be related to the chief complaint.<br />

This aspect of the medical history helps the clinician develop other<br />

hypotheses as to the cause of the patient’s problem. It also gives the clinician<br />

more insight into the patient’s overall well-being, attitudes toward illness,<br />

and comfort level with various symptoms.



Finally, the physical examination is an attempt to elicit objective signs of disease<br />

in the patient. The physical exam usually helps to confirm or deny the<br />

clinician’s suspicions <strong>based</strong> upon the history.<br />

An old adage states that in 80% of patients, the final diagnosis comes solely<br />

from the history. In another 15% it comes from the physical examination,<br />

and only in the remaining 5% from additional diagnostic testing. This may<br />

appear to overstate the value of the history and physical, but not by much.<br />

Clinical observation is a powerful tool for deciding what diseases are possible<br />

in a given patient, and most of the time the results of the H&P determine<br />

which additional data to seek. Once the H&P has been exhausted, the clinician<br />

must know how to obtain the additional required data in a reliable and<br />

accurate way by using diagnostic tests which can appropriately achieve the<br />

best outcome for the patient. For the health-care system, this must also be<br />

done at a reasonable cost not only in dollars, but also in patient lives, time,<br />

and anxiety if an incorrect diagnosis is made.<br />

Hypothesis generation in the clinical encounter<br />

While performing the H&P, the clinician develops a set of hypotheses about what<br />

diseases could be causing the patient’s problem. This list is called the differential<br />

diagnosis and some diseases on this list are more likely than others to be<br />

present in that patient. When finished with the H&P, the clinician estimates the<br />

probability of each of these diseases and rank-orders this list. The probability of<br />

a patient having a particular disease on that list is referred to as the pretest probability<br />

of disease. It may be equivalent to the prevalence of that disease in the<br />

population of patients with similar results on the medical history and physical<br />

examination.<br />

The numbers for pretest probability come from one’s knowledge of medicine<br />

and from studies of disease prevalence in medical literature. Let’s use the example<br />

of a 50-year-old North American alcoholic with no history of liver disease,<br />

who presents to an emergency department with black tarry stools that are suggestive<br />

of digested blood in the stool. This symptom is most likely caused by<br />

esophageal varices, by gastritis, or by a stomach ulcer. The prevalence of each<br />

of these diseases in this population is 5% for varices, 55% for ulcer, and 40% for<br />

gastritis. In this particular case, the probabilities add up to 100% since there are<br />

virtually no other diagnostic possibilities. This is also known as sigma p equals

one, and applies when the diseases on the list of differential diagnoses are all<br />

mutually exclusive. Rarely, a person fitting this description will turn out to have<br />

gastric cancer, which occurs in less than 1% of patients presenting like this and<br />

can be left off the list for the time being. If none of the other diseases are diagnosed,<br />

then one needs to look for this rare disease. In this case, a single diagnostic



test, the upper gastrointestinal endoscopy, is the test of choice for detecting all<br />

four diagnostic possibilities.<br />
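A quick check of the arithmetic behind this example (our worked restatement of the figures above):

$$\sum_i p_i = P(\text{varices}) + P(\text{ulcer}) + P(\text{gastritis}) = 0.05 + 0.55 + 0.40 = 1.00$$

so the three mutually exclusive diagnoses exhaust the initial list.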

There are other situations when the presenting history and physical are much<br />

more vague. In these cases, it is likely that the total pretest probability can add<br />

up to more than 100%. This occurs because of the desire on the part of the physician<br />

not to miss an important disease. Therefore, each disease should be considered<br />

by itself when determining the probability of its occurrence. This probability<br />

takes into account how much the history and physical examination of the<br />

patient resemble the diseases on the differential diagnosis. The assigned probability<br />

value based on this resemblance is very high, high, moderate, low, or very

low. In our desire not to miss an important disease, probabilities that may be<br />

much greater than the true prevalence of the disease are often assigned to some<br />

diagnoses on the list. We will give an example of this shortly.<br />

Physicians must take the individual patient’s qualities into consideration when<br />

assigning pretest probabilities. For example, a patient with chest pain can have<br />

coronary artery disease, gastroesophageal reflux disease, panic disorder, or a<br />

combination of the three. In general, panic disorder is much more likely in a 20-<br />

year-old, while coronary artery disease is more likely in a 50-year-old. When considering<br />

this aspect of pretest probabilities, it becomes evident that a more realistic<br />

way of assigning probabilities is to have them reflect the likelihood of that<br />

disease in a single patient rather than the prevalence in a population. This allows<br />

the clinician to consider the unique aspects of a patient’s history and physical<br />

examination when making the differential diagnosis.<br />

Constructing the differential diagnosis<br />

The differential diagnosis begins with diseases that are very likely and for which<br />

the patient has many of the classical symptoms and signs. These are also known<br />

as the leading hypotheses or working diagnoses. Next, diseases that are possible<br />

are included on the list if they are serious and potentially life- or limb-threatening.

These are the active alternatives to the working diagnoses and must<br />

be ruled out of the list. This means that the clinicians must be relatively certain<br />

from the history and physical examination that these alternative diagnoses are<br />

not present. Put another way, the pretest probability of those alternative diseases<br />

is so vanishingly small that it becomes clinically insignificant. If the history and<br />

physical examination do not rule out a diagnosis, then a diagnostic test that can<br />

reliably rule it out must be performed. Diseases that can be easily treated can also<br />

be included in the differential diagnosis; occasionally such a diagnosis is established by a trial of therapy which, if successful, confirms the diagnosis. Last to be

included are diseases that are very unlikely and not serious, or are more difficult<br />

and potentially dangerous to treat. These diseases are less possible because they



Fig. 20.2 A 2 × 2 table view of pretest probabilities.

                    Common presentation    Rare presentation
Common disease      90%                    9%
Rare disease        0.9%                   0.09%
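One way to see where figures like these come from (our reconstruction under stated assumptions, not a derivation from the text): suppose the common disease is about 100 times as prevalent as the rare one, and within each disease the typical presentation is about 10 times as frequent as the atypical one. Then, approximately,

$$P(\text{common disease, rare presentation}) \approx 0.99 \times \tfrac{1}{11} \approx 9\%$$
$$P(\text{rare disease, common presentation}) \approx 0.0099 \times \tfrac{10}{11} \approx 0.9\%$$

so a rare presentation of a common disease is roughly ten times as likely as a common presentation of a rare disease, which is the point of the adage below.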

have already been ruled out by the history and physical, but ought to be kept in<br />

mind for future consideration if necessary or if any clues to their presence show<br />

themselves during the evaluation. A good example of this would be a patient with<br />

chest pain and no risk factors for pulmonary embolism who has a low transcutaneous<br />

oxygen saturation. Now one should begin to look more closely for the<br />

diagnosis of pulmonary embolism in this patient.<br />

When considering a diagnosis, it is helpful to have a framework for considering<br />

likelihood of each disease on one’s list. One schema for classifying this is<br />

shown in Fig. 20.2, which describes the overall probability of diseases using a 2 ×<br />

2 table. This only helps to get an overview and does not help one determine the<br />

pretest probability of each disease on the differential diagnosis. In this schema,<br />

each disease is considered as if the total probability of disease adds up to 100%.<br />

One must tailor the probabilities in one’s differential diagnosis to the individual<br />

patient. Bear in mind that a patient is more likely to present with a rare or<br />

unusual presentation of a common disease, than a common presentation of a<br />

rare disease.<br />

As stated earlier, the first step in generating a differential diagnosis is to systematically<br />

make a list of all the possible causes of a patient’s symptoms. This<br />

skill is learned through the intensive study of diseases and reinforced by clinical<br />

experience and practice. When medical students first start doing this, it is useful<br />

to make the list as exhaustive as possible to avoid missing any diseases. Think of<br />

all possible diseases by category that might cause the signs or symptoms. There<br />

are several helpful mnemonics that can help get a differential diagnosis started.<br />

One is VINDICATE (Table 20.4). Initially, list all possible diseases for a chief complaint<br />

by category. Then assign a pretest probability for each disease on the differential<br />

list. The values of pretest probability are relative and can be assigned<br />

according to the scale shown in Table 20.5. Physicians are more likely to agree<br />

with each other on prioritizing diagnoses if using a relative scale like this, rather<br />

than trying to assign a numerical probability to each disease on the list. One must<br />

consider the ramifications of missing a diagnosis. If the disease is immediately<br />

life- or limb-threatening, it needs to be ruled out, regardless of the probability<br />

assigned. If the likelihood of a disease is very, very low, the diagnostician should look for evidence that the disease might be present, such as an aberrant element of the history, physical examination, or diagnostic tests to suggest that the



Table 20.4. Mnemonic to remember classification of disease for a differential diagnosis
V  Vascular
I  Inflammatory/Infectious
N  Neoplastic/Neurologic and psychiatric
D  Degenerative/Dietary
I  Intoxication/Idiopathic/Iatrogenic
C  Congenital
A  Allergic/Autoimmune
T  Trauma
E  Endocrine & metabolic

Table 20.5. Useful schema for assigning pretest (a-priori) probabilities<br />

Pretest probability Action Interpretation<br />



or a pectoralis muscle strain are the cause of his pain. These would have very<br />

high pretest probabilities (50–90%). One should also consider slightly less likely<br />

and more serious causes which are easily treatable, such as pericarditis, spontaneous<br />

pneumothorax, pneumonia, or esophageal spasm secondary to acid<br />

reflux. These would have variably lower pretest probabilities (1–50%). Next, there<br />

are hypotheses that are much less likely, such as myocardial infarction, dissecting<br />

thoracic aortic aneurysm, and pulmonary embolism. The pretest probabilities of<br />

these are all much less than 1%. Finally, one must consider some disorders, such<br />

as lung cancer, that are so rare and not immediately life- or limb-threatening that<br />

they are ruled out because of the patient’s age.<br />

If a 39-year-old man presented with the same complaint of chest pain, but<br />

not the typical squeezing, pressure-like pain of angina pectoris, one could look

up the pretest probability of coronary artery disease in population studies. This<br />

can be found in an article by Patterson, which states that the probability that this<br />

patient has angina pectoris is about 20%. 1 This means that about 1/5 of all 39-<br />

year-old men with this presentation will have significant coronary artery disease.<br />

These data would change one’s list and put myocardial infarction higher up on<br />

the differential. Since this is a potentially dangerous disease, additional testing is<br />

required to rule it out.<br />

Making the differential diagnosis means considering diseases from three perspectives:<br />

probability of the disease, severity of the disease, and ease of treatment<br />

of the disease. The differential diagnosis is a complex interplay between these<br />

factors and the patient’s signs and symptoms.<br />

Narrowing the differential<br />

Here is a more common, everyday example. A physician is examining a 7-year-old child who is sick with a sore throat. The physician suspects that this child

might have strep throat, which is a common illness in children and thus assigns<br />

it a high pretest probability of disease. This is the working diagnosis. The differential<br />

diagnosis also includes another common disease, viral pharyngitis.<br />

Also included are uncommon diseases like epiglottitis, which is severe and life-threatening,

and mononucleosis. Finally, extremely rare diseases are included<br />

such as diphtheria and gonorrhea. For this patient’s workup, the more serious<br />

and uncommon diseases must be actively ruled out. In this case, that can almost<br />

certainly be done with an accurate history disclosing lack of sexual abuse and<br />

oral–genital contact to rule out gonorrhea. A history of diphtheria immunization<br />

and a physical examination without the typical pseudomembrane in the<br />

1 R. E. Patterson & S. F. Horowitz. Importance of epidemiology and biostatistics in deciding clinical<br />

strategies for using diagnostic tests: a simplified approach using examples from coronary artery disease.<br />

J. Am. Coll. Cardiol. 1989; 13: 1653–1665.



Table 20.6. Differential diagnosis of sample patient

Disease                    Pretest probability of disease
Streptococcal infection    50%    Likely, common, and treatable
Viruses                    50%    Likely, common, and self-limiting
Mononucleosis              1%     Unlikely, uncommon, and self-limiting
Epiglottitis



Table 20.7. Relative costs of tests

Disease                    Test                                   Cost    Relative ease to treat
Streptococcal infection    Rapid strep antigen or throat culture  $       Easy and safe
Viruses                    Viral culture                          $$$     Easy and safe
Epiglottitis               Neck x-ray                             $$      Difficult
Mononucleosis              Epstein–Barr antigen test              $$      Easy
Diphtheria                 Culture or diphtheria serology         $$$$    Difficult
Gonorrhea                  Gonorrhea culture                      $$      Difficult

the test will make a difference for the patient. In the previous example, if the<br />

diagnosis of strep throat was in question, a rapid strep antigen would be the test<br />

of choice to rule it in or out. We usually don’t do viral cultures since the treatment<br />

is the same whether the patient is known to have a particular virus or not.<br />

For our 39-year-old man with chest pain, the differential diagnosis would initially<br />

include anxiety, musculoskeletal, coronary artery disease, aneurysm, and<br />

pneumothorax. For anxiety and musculoskeletal causes, the pretest probability<br />

is high, as these are common in this age group. In fact, as previously discussed,<br />

the most likely cause of chest pain in a 39-year-old is going to be pain<br />

of musculoskeletal origin. For some of the other diseases on the list, their pretest<br />

probabilities would be approximately similar to that of coronary artery disease.<br />

However, because of the potential severity of heart disease and most of the other<br />

diseases on the differential, it is necessary to do some diagnostic testing to rule<br />

out those possibilities. For some diseases such as pneumothorax, dissecting

aortic aneurysm, and pneumonia, a single chest x-ray can rule them out if<br />

the image is normal. For others such as coronary artery disease or pulmonary<br />

embolism, a more complex algorithmic scheme is necessary to rule in or rule out<br />

the diseases.<br />

Strategies for making a medical diagnosis<br />

There are several diagnostic strategies that clinicians employ when using patient<br />

data to make a diagnosis. These are presented here as unique methods even<br />

though most clinicians use a combination of them to make a diagnosis.<br />

Pattern recognition is the spontaneous and instantaneous recognition of a<br />

previously learned pattern. It is usually the starting point for creating a differential<br />

diagnosis and determines those diagnoses that will be at the top of the list.<br />

This method is employed by the seasoned clinician for most patients. Usually, an<br />

experienced clinician will be able to sense when the pattern is not characteristic



of the disease. This occurs when there is a rare presentation of common disease<br />

or common presentation of a rare disease. An experienced doctor knows when to<br />

look beyond the apparent pattern and to search for clues that the patient is presenting<br />

with an unusual disease. Premature closure of the differential diagnosis<br />

is a pitfall of pattern recognition that is more common to neophytes and will be<br />

discussed at the end of this chapter.<br />

The multiple branching strategy is an algorithmic approach to diagnosis using<br />

a preset path with multiple branching nodes that will lead to a correct final<br />

conclusion. Examples of this are diagnostic clinical guidelines or decision rules.<br />

These are tools to assist the clinician in remembering the steps to make a proper<br />

diagnosis. If they are simple and easily memorized, they can be very useful.<br />

More complex diagnostic decision tools can be of greater help when used with a<br />

computer.<br />

The strategy of exhaustion, also called diagnosis by possibility, involves “the<br />

painstaking and invariant search for but paying no immediate attention to the<br />

importance of all the medical facts about the patient.” 2 This is followed by carefully

sifting through the data for a diagnosis. Although, more often than not, this technique will come up with the correct diagnosis, the process is time

consuming and not cost-effective. A good example of this can be found in the<br />

Case Records of the Massachusetts General Hospital feature found in each issue<br />

of the New England Journal of Medicine. This strategy is most helpful in diagnosing

very uncommon diseases or very uncommon presentations of common<br />

diseases.<br />

The hypothetico-deductive strategy, also called diagnosis by probability,<br />

involves the formulation of a short list of potential diagnoses from the earliest<br />

clues about the patient. Initial hypothesis generation is based on pattern recognition

to suggest certain diagnoses. This basic differential diagnosis is followed<br />

by the performance of clinical maneuvers and diagnostic tests that will increase<br />

or decrease the probability of each disease on the list. Further refinement of the<br />

differential results in a short list of diagnoses and the further testing or the initiation

of treatment will lead to the final diagnosis. This is the best strategy to use<br />

and will lead to a correct diagnosis in most cases. A good example of this can be<br />

found in the Clinical Decision Making feature, which appears frequently but irregularly in the New England Journal of Medicine.

Heuristics: how we think<br />

Heuristics are cognitive shortcuts used in prioritizing diagnoses. They help to<br />

deal with the magnitude and complexity of clinical data. Heuristics are not<br />

2 D. L. Sackett, R. B. Haynes, P. Tugwell & G. H. Guyatt. Clinical Epidemiology: A Basic Science for Clinical<br />

Medicine. 2nd edn. Boston: Little Brown, 1991.



always helpful, but physicians should recognize the way they use them in order<br />

to solve problems effectively and prevent mistakes in clinical diagnosis. There<br />

are three important heuristics that are used in medical diagnosis. They are representativeness,<br />

availability, and competing hypotheses heuristics.<br />

Representativeness heuristic. The probability that a diagnosis is thought of<br />

is based upon how closely its essential features resemble the features of a

typical description of the disease. This is analogous to the process of pattern<br />

recognition and is accurate if a physician has seen many typical and atypical<br />

cases of common diseases. It can lead to erroneous diagnosis if one initially<br />

thinks of rare diseases based upon the patient presentation. For example,

because a child’s sore throat is described as very severe, a physician might<br />

immediately think of gonorrhea, which is particularly painful. The severity<br />

of the pain of the sore throat represents gonorrhea in diagnostic thinking.<br />

To ignore or minimize the more common causes of sore throat, thinking<br />

of a rare disease more often than a common one, is incorrect. Remember<br />

that unusual or rare presentations of common diseases, such as strep throat,

occur more often than common presentations of rare diseases such as pharyngeal<br />

gonorrhea.<br />

Availability heuristic. The probability of a diagnosis is judged by the ease with which the diagnosis is remembered. The diagnoses of patients who have been most recently cared for are the ones that are brought to the forefront of one's consciousness. This can be thought of as a form of recall bias. If a physician recently took care of a patient with a sore throat who had gonorrhea, he or she will be more likely to look for that as the cause of sore throat in the next patient, even though this is a very rare cause of sore throat. The availability heuristic is much more problematic and likely to occur if a recently missed diagnosis was of a rare and serious disease.

Anchoring and adjustment. This heuristic refers to the way that specific characteristics of a patient are used to estimate the probability of a given diagnosis. A differential diagnosis is initially formed, and additional information is used to increase or decrease the probability of disease. This technique is the way we think about most diagnoses, and is also called the competing hypotheses heuristic. For example, if a patient presents with a sore throat, the physician should think of common causes of sore throat and come up with diagnoses of either a viral pharyngitis or strep throat. These are the anchors. After getting more history and doing a physical examination, the physician decides that the characteristics of the sore throat are more like a viral pharyngitis than strep throat. This is the adjustment, and as a result, the other diagnoses on the differential diagnosis list are considered extremely unlikely. The adjustment is based on diagnostic information from the history and physical examination and from diagnostic tests. The process is shown in Fig. 20.3. Throughout the patient encounter, new information is compared against all diagnoses being considered, which subsequently changes the probability estimates for each diagnosis and reorders the differential.

[Fig. 20.3 Hypothetico-deductive strategy using anchoring and adjustment: the pretest probability of disease (the anchor), on a scale from 0% to 100%, is adjusted up or down as new information arrives.]

The problem of premature closure of the differential diagnosis

One of the most common problems novices have with diagnosis is that they are unable to recognize atypical patterns. This common error in diagnostic thinking occurs when the novice jumps to the conclusion that a pattern exists when in reality it does not. There is a tendency to attribute illness to a common and often less serious problem rather than search for a less likely, but potentially more serious, illness. This is called premature closure of the differential diagnosis. It represents the removal of many diseases from consideration on the differential diagnosis list because the clinician jumped too early to a conclusion about the nature of the patient's illness.

Sadly, this phenomenon is not limited to neophytes. Even experienced clinicians can make this mistake, thinking that a patient has a common illness when, in fact, it is a more serious but less common one. No one expects the clinician to always immediately come up with the correct diagnosis of a rare presentation or a rare disease. However, the key to good diagnosis is recognizing when a patient's presentation or response to therapy is not following the pattern that was expected, and revisiting the differential diagnosis when this occurs.

Premature closure of the differential diagnosis can be avoided by following two simple rules. The first is to always include a healthy list of possibilities in the differential diagnosis for any patient. Don't be seduced by an apparently obvious diagnosis. When one finds oneself diagnosing a patient within the first few minutes of initiating the history, step back and look for other clues that could dismiss one diagnosis and add other diagnoses to the list. Then ask oneself whether those other diseases can be excluded simply through the history and physical examination. Since most common diseases do occur commonly, the disease that was first thought of will often turn out to be correct. However, a physician who focuses only on that first diagnosis is more likely to miss important clues to the presence of another, less common disease.

The second step is to avoid modifying the final list until all the relevant information has been collected. After completing the history, make a detailed and objective list of all the diseases for consideration and determine their relative probabilities. The formal application of such a list will be invaluable for the novice student and resident, and will be done in a less and less formal way by the expert.


21

Sources of error in the clinical encounter

Here is my secret, it is very simple: it is only with the heart that one can see rightly; what is essential is invisible to the eye.

Antoine de Saint-Exupéry (1900–1944): The Little Prince

Learning objectives

In this chapter you will learn:
- the measures of precision in clinical decision making
- how to identify potential causes of clinical disagreement and inaccuracy in the clinical examination
- strategies for preventing error in the clinical encounter

The clinical encounter between doctor and patient is the beginning of the medical decision making process. During the clinical encounter, the physician has the opportunity to gather the most accurate information about the nature of the illness and the meaning of that illness to the patient. If there are errors made in processing this information, the resulting decisions may not be in the patient's best interests. This can lead to overuse, underuse, or misuse of therapies and increased error in medical practice.

Measuring clinical consistency

Precision is the extent to which multiple examinations of the same patient agree with one another. In addition, each part of the examination should be accurately reproducible by a second examiner. Accuracy is the proximity of a given clinical observation to the true clinical state. The synthesis of all the clinical findings should represent the actual clinical or pathophysiological derangement possessed by the patient.



If two people measure the same parameter several times, for instance the temperature of a sick child, we can determine the consistency of this measurement. In this example, different observers can obtain different results when they measure the temperature of a child using a thermometer because they use slightly different techniques, such as varying the time that the thermometer is left in the patient or reading the mercury level differently. The kappa statistic is a statistical measurement of the precision of a clinical finding. It measures inter-observer consistency, the agreement between measurements made by different observers, and intra-observer consistency, the ability of the same observer to reproduce a measurement. The kappa statistic is described in detail in Chapter 7 and should be calculated and reported in any study of the usefulness of a diagnostic test.
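To make the calculation concrete, here is a minimal sketch of the kappa computation for two examiners who each classify the same patients as positive or negative for a finding; the function name and all counts are hypothetical, purely for illustration. Kappa is the observed agreement corrected for the agreement expected by chance.

    # Hypothetical sketch: Cohen's kappa for two examiners rating the same
    # 100 patients as positive or negative for a heart murmur.
    def cohens_kappa(both_pos, a_pos_b_neg, a_neg_b_pos, both_neg):
        n = both_pos + a_pos_b_neg + a_neg_b_pos + both_neg
        p_observed = (both_pos + both_neg) / n      # raw agreement
        p_a = (both_pos + a_pos_b_neg) / n          # examiner A's positive rate
        p_b = (both_pos + a_neg_b_pos) / n          # examiner B's positive rate
        # agreement expected by chance from the two marginal rates
        p_expected = p_a * p_b + (1 - p_a) * (1 - p_b)
        return (p_observed - p_expected) / (1 - p_expected)

    # Both hear the murmur in 40 patients, neither in 45, and they disagree on 15:
    print(round(cohens_kappa(40, 5, 10, 45), 2))  # 0.7

A kappa of 1.0 would mean perfect agreement beyond chance, while a kappa of 0 would mean agreement no better than chance.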

We often assume that all diagnostic tests are precise. However, many studies have demonstrated that most non-automated tests have some degree of subjectivity in their interpretation. This has been seen in commonly used x-ray tests such as CT scan, mammography, and angiography. It is also present in tests commonly considered to be the gold standard, such as the interpretation of tissue samples from biopsies or surgery.

There are many potential sources of error and clinical disagreement in the process of the clinical examination. If the examiner is not aware of these, they will lead to inaccurate data. A broad classification of these sources of error includes the examiner, the examinee, and the environment.

The examiner

Tendencies to record inference rather than evidence

The examiner should record actual findings, including both the subjective ones reported by the patient and objective ones detected by the physician's senses. The physician should not make assumptions about the meaning of exam findings prior to creating a complete differential diagnosis. For example, a physician examining a patient's abdomen may feel a mass in the right upper quadrant and record that he or she felt the gall bladder. This may be incorrect; in fact, the mass could be a liver cancer, aneurysm, or hernia.

Ensnarement by diagnostic classification schemes

Jumping to conclusions about the nature of the diagnosis based on an incorrect coding scheme can lead to the wrong diagnosis through premature closure of the differential diagnosis. If a physician hears wheezes in the lungs and assumes that the patient has asthma when in fact the patient has congestive heart failure, there will be a serious error in diagnosis, leading to incorrect treatment. The diagnosis of heart failure can be made from other features of the history and clues in the physical exam.

Entrapment by prior expectation

Jumping to conclusions about the diagnosis based upon a first impression of the chief complaint can lead to the wrong diagnosis due to lack of consideration of other diagnoses. This, along with incorrect coding schemes, is called premature closure of the differential diagnosis, and is discussed in Chapter 20. If a physician examines a patient who presents with a sore throat, fever, aches, nasal congestion, and cough and thinks it is a cold, he or she may miss hearing wheezes in the lungs by doing only a cursory examination of the chest. This occurs because the physician didn't expect wheezes to be present in a cold; in fact, the patient may have acute bronchitis, which will present with wheezing. In either case, the symptoms can be easily and effectively treated, but the therapy will be ineffective if the diagnosis is incorrect.

Bias

Everyone brings an internal set of biases with them, which are based upon upbringing, schooling, training, and experiences. These biases can easily lead to erroneous diagnoses. If a physician assumes, without further investigation, that a disabled man with alcohol on his breath is simply a drunk who needs a place to stay, a significant head injury could easily be missed. Denying pain medication to someone who may appear to be a drug abuser can result in unnecessary suffering for the patient, incorrect diagnosis, and incorrect therapy.

Biologic variations in the senses

Hearing, sight, smell, and touch vary between examiners and change with the examiner's age. As one's hearing decreases, it becomes harder to hear subtle sounds like heart murmurs or gallop sounds.

Not asking

If you don't ask, you won't find out! Many clinicians don't ask newly diagnosed cancer patients about the presence of depression, although at least one-third of cancer patients are depressed, and treating the depression may make it easier to treat the cancer. Treatment for depression will make the patient feel more in control, and thus less likely to look for other methods of therapy, such as alternative or complementary medicine, to the exclusion of proven chemotherapy. Other typical examples involve asking difficult questions. Many physicians don't ask about sexual history, alcohol use, or domestic violence because they may be afraid of opening Pandora's box. On the other hand, most patients are reluctant to give important information spontaneously about these issues, and need to be asked in a non-threatening way. When asked in an honest and respectful manner, almost all patients are pleased that these difficult questions are being asked and will give accurate and detailed information. This is part of the art of medicine.

Simple ignorance

Physicians have to know what they are doing in order to be able to do it well. Poor history and physical examination skills will lead to incorrect diagnoses. For example, if a physician doesn't know the significance of the straight leg raise test in the back examination, he or she won't do it or will do it incorrectly. This can lead to a missed diagnosis of a herniated lumbar disc and continued pain for the patient.

Level of risk

Physicians must be aware of their own level of risk taking, as this will directly affect the amount of risk projected onto the patient. If the physician doesn't personally like taking risks, then he or she may try to minimize risk for the patient. On the other hand, if the physician doesn't mind taking risks, he or she may not try to minimize risk for the patient. Physicians can be classified by their risk-taking behavior into risk minimizers or test minimizers. Risk-taking physicians are less likely to admit patients with chest pain to the hospital than physicians who are risk averse or risk minimizers.

Risk minimizers tend to order more tests than test minimizers. They may order more tests than would be necessary in order to reduce the risk of missing the diagnosis. They are more likely to order tests or recommend treatments even when the risk of missing a diagnosis or the potential benefit from the therapy is small. Test minimizers may order fewer tests than might be necessary and thereby increase the risk of missing a diagnosis in the patient. They are less likely to recommend certain tests or treatments, thinking that their patient would not want to take the risk associated with the test or therapy, but will be willing to take the risk associated with an error of omission in the process of diagnosis or treatment. The test minimizer projects that the patient is willing to take the risk of missing an unlikely diagnosis and would not want any additional tests performed.



To minimize the bias associated with risk-taking behavior, physicians must ask themselves what they would do if the patient were their loved one. Then, the physician should do that for his or her patient. Additionally, use the communication techniques discussed in Chapter 18 to maximize understanding, informed consent, and shared decision making with the patient. Scrupulous honesty and open communication with the patient are a must here.

Know when you are having a bad day

Everyone has off days. If things aren't working right because of personal issues, such as a fight with your spouse, kids, or partners, problems paying your bills, or other issues, don't take it out on patients. Physicians must learn to overcome their own feelings and not let them get in the way of good and empathic communication with patients. If this is not possible, it is better to reschedule for a different day.

The examinee

Biologic variation in the system being examined

The main source of random error in medicine is biologic variation. People are complex biological organisms and all physiological responses vary from person to person, or from time to time in the same person. For example, some patients with chronic bronchitis will have audible wheezes and rhonchi while others won't have wheezes and will only have a cough on forced expiration. Some people with heart attacks have typical crushing substernal chest pain while others have a fainting spell, weakness, or shortness of breath as their only symptom. Understanding this will lead to better appreciation of subtle variations in the history and physical examination.

Effects of illness and medication

Ignoring the effect of medication or illness on the physiologic response of the patient may result in an inaccurate examination. For instance, patients who take beta-blocker drugs for hypertension will have a slowing of the pulse, so they may not have the expected physical exam findings like tachycardia, even if they are in a condition such as shock.

Memory and rumination

Patients may remember their medical history differently at different times, which results in a form of recall bias. This explains the commonly observed phenomenon that the attending seems to obtain the most accurate history. The intern or medical student will usually obtain the first history from a patient. When the attending gets the history later, the patient will have had time to reconsider their answers to the questions and may give a different and more accurate history. They may have recalled things they did not remember or thought were not important during the first questioning. A way to reduce this effect is to summarize the history obtained several times during the initial encounter.

Filling in

Sometimes patients will invent parts of the history because they cannot recall what actually happened. This commonly occurs with dementia patients and alcoholics during withdrawal. In most of these cases, orientation to time and place is also lost. In some instances, otherwise oriented patients will be unable to recall an event because they were briefly impaired and actually don't know what happened. This is common in the elderly who fall as a result of a syncopal episode. These patients may fill in a plausible explanation for their fall such as "I must have tripped." In a case like this, try to get an explicit description of the entire event step by step before simply attributing the fall to tripping over something.

Toss-ups

Some questions can be answered correctly in many different ways, and because of this, the way a question is worded may result in the patient giving apparently contradictory answers. Descriptors of pain and discomfort are notoriously vague and will change from telling to telling by the patient. Asking "do you have pain?" could be answered no by a patient who describes the sensation as pressure and doesn't equate that with pain. The examiner will not find out that this person has chest pain without asking more specific questions using other common descriptors of chest pain such as aching, burning, pressure, or discomfort.

Patient ignorance

The patient may not be able to give accurate and correct answers due to lack of understanding of the examiner's questions. The average patient understands at the level of a tenth-grade student, meaning that half of patients are below the tenth-grade level. They may not understand the meaning of a word as simple as congestion, and answer no when they have a cough and stuffed nose. To avoid this error, avoid using complex medical or non-medical terminology.



Patient speaks a different language

Situations in which the patient and physician cannot understand each other often lead to misinterpretation of communication. Federal law requires US hospitals to have translators available for any patient who cannot speak or understand spoken English, including deaf persons. In situations where a translator is not immediately available, a translation service sponsored by AT&T is available by phone. This is especially important because patients who do not speak English are more likely to be admitted to the hospital from the Emergency Department, and to have additional and often unnecessary diagnostic testing performed.

Patient embarrassment

Patients will not usually volunteer sensitive information, although they may be very anxious to discuss these same topics when asked directly. This includes questions about sexual problems, domestic violence, and alcohol or drug abuse. For example, even though many teenagers are engaged in sexual activity, they may not know how to ask about protection from pregnancy or sexually transmitted diseases. It is better to assume that most patients will not feel comfortable asking questions about these awkward subjects; thus the physician should ask about these issues directly in an empathetic and non-judgmental manner.

Denial

Some patients will minimize certain complaints because they are afraid of finding out they have a bad disease. They may say that their pain is really not so bad and that the tests or treatments the physician is proposing are not necessary. The physician's job is to determine the patient's fear, educate the patient about the nature of the illness, and help him or her make an informed decision.

Patient assessment of risk and level of risk taking

Some patients will reject the physician's interpretation of the nature of their complaint because of their own risk-taking behavior. They may be more willing or less willing to take a risk than the physician thinks is reasonable. The physician must follow the precept of patient autonomy here. The physician's job is to educate the patient about the nature of their illness and the level of risk they are assuming by their behavior, and then help them make an informed decision. In the end, if the patient decides to refuse the physician's suggestions for evaluation and treatment after being fully informed of the risks and benefits, they have the capacity to refuse care and should be treated with therapies that they will accept.



Lying

Finally, there are occasions when a patient will simply lie to the physician. Questions about alcohol or drug abuse, child abuse, and sexual activity are common areas where this occurs. The physician may detect inconsistencies in the history or pick up secondary clues that suggest this may be happening. The best way to handle this situation is to get corroborating evidence from the family, current and previous physicians, and medical records. Sometimes, the physician must simply believe the patient and treat them anyway.

The environment

Disruptive environments for the examination

Excess noise or interruptions, including background noise or children in the examination room, make it hard to perform an accurate examination. This may be unavoidable in some circumstances, as in the Emergency Department with its chaotic environment and constant noise from disruptive patients. If it is impossible to remove the noise, make sure it is compensated for in some other way. It may take longer to gather information in these circumstances, but the physician will be rewarded with increased accuracy.

Disruptive interactions between the examiner and the examined

Patients who are uncooperative, delirious, agitated, or in severe pain, as well as crying children, are in this category. In this circumstance, the physician must simply try his or her best to do a competent examination despite the interruptions. Providing pain relief for patients with severe pain early in the encounter will usually help to obtain a better history and more accurate examination. Occasionally in the Emergency Department, patients have to be sedated in order to examine them properly.

Reluctant co-workers

Nurses, residents, and other physicians may disagree with your evaluation. If you believe that your evaluation is correct and evidence-based, their opinions should not stand in the way. For instance, if a patient comes to the Emergency Department with the worst headache of their life, the correct medical action is to rule out a subarachnoid hemorrhage. This is done with a CT scan and, if that is negative, a spinal tap. The fact that this occurs at two o'clock in the morning should not make a difference in the decision to order the CT scan. This is true even if the radiologist asks to wait until morning to do the procedure or if the nurses say that the spinal tap is unnecessary since it takes more nursing time. The physician must know when to stand his or her ground and stick up for the patient.

Incomplete function or use of diagnostic tools

Diagnostic instruments and tools should be functioning properly and the examiner should be an expert in their use. One should know how the stethoscope, blood pressure cuff, ophthalmoscope, otoscope, reflex hammer, and tuning fork are correctly used, and check them before use. Practice using these tools before seeing patients. This also applies to more technological tools such as x-rays and other imaging devices, electrocardiograms, and transcutaneous oximetry devices, to name a few of the diagnostic tools in common usage.

Strategies for preventing or minimizing error in the clinical examination

The following suggestions will help to avoid making errors in the clinical examination. The examination is a tool for making an accurate final diagnosis. In order to serve this purpose, the examination must be done in a meticulous and systematic way.

(1) Match the diagnostic environment to the diagnostic task. It is necessary to make sure the environment is user friendly to the physician and the patient. Wherever possible, get rid of noisy distractions.

(2) Repeat key elements of the examination. Physicians should review and summarize the history with patients to make sure the data are correct. Make sure the physical examination findings are accurate by repeating them and observing how they change with time and treatment.

(3) Corroborate important elements of the patient history with documents and witnesses. Physicians need to ensure that all the information is gathered personally, without relying on secondhand information. If the patient does not speak English or is deaf, get a translator. Overall, physicians should not make clinical decisions based on an incomplete history due to the inability to accurately understand the patient, or based on secondhand history that is not corroborated.

(4) Confirm key clinical findings with appropriate tests. The physician should determine which tests are most useful in order to refine the diagnosis. This aspect of medical decision making is the basis of the next several chapters.

Table 21.1. Problem-oriented medical record: the SOAP format

S – Subjective information gathered directly from the patient – the history.
O – Objective information gathered during the patient examination and from diagnostic tests.
A – Assessment of the patient's problem, including a differential diagnosis and the likelihood of each disease on the list, as well as other psycho-social problems that may affect the diagnostic process or therapeutic relationship. This is where inference should be noted. Make a determination of the nature of the patient's problem and the interpretation of that problem, the diagnosis. Initially this will be a provisional diagnosis, differential diagnosis, or just a summary statement of the problem.
P – Plan of treatment or further diagnostic testing.

(5) Ask blinded colleagues to examine the patient. Physicians should corroborate findings to make sure that they are accurate. This will occur more often during medical school and residency training, and may be difficult to do in private practice. However, even experienced physicians will occasionally ask colleagues to check part of their clinical examination when things don't quite add up. Obtaining reasonable and timely consultation with a specialist is another way of double-checking examination findings.

(6) Report evidence as well as inference, making a clear distinction between the two. Initially, the physician should record the facts only. When this is done, it is then appropriate to clearly note clinical interpretations in the record by using the problem-oriented medical record and the SOAP format (Table 21.1).

(7) Use appropriate technical tools. Physicians need to make sure that physical examination tools are working properly and that they know how to use them well.

(8) Blind the assessment of raw diagnostic test data. The physician should look at the results of diagnostic tests objectively, applying the principles of medical decision making contained in the next several chapters. The physician should not be overly optimistic or pessimistic about the value of a single lab test, and should apply rigorous methods of decision making in determining the meaning of the test results.

(9) Apply the social sciences, as well as the biologic sciences, of medicine. The physician should remember that the patient is functioning within a social context. Emotional, cultural, and spiritual components of health are important in getting an accurate picture of the patient. These can easily affect the interpretation of the information gathered.

(10) Write legibly. Physicians must realize that others will read their notes and prescriptions. If the handwriting is not legible, mistakes will occur. If this is a serious problem, individual physicians could consider dictating charts or using a computer for medical charting.


22

The use of diagnostic tests

Science is always simple and always profound. It is only the half-truths that are dangerous.

George Bernard Shaw (1856–1950): The Doctor's Dilemma, 1911

Learning objectives

In this chapter you will learn:
- the uses and abuses of diagnostic tests
- the hierarchical format to determine the usefulness of a diagnostic test

The Institute of Medicine has determined that error in medicine is due to overuse, underuse, and misuse of medical resources – resources such as diagnostic tests. In order to understand the best way to use diagnostic tests, it is helpful to have a hierarchical format within which to view them.

The use of medical tests in making a diagnosis

Before deciding on ordering a diagnostic test, physicians should have a good reason for doing the test. There are four general reasons for ordering a diagnostic test.

(1) To establish a diagnosis in a patient with signs and symptoms. Examples of this are a throat culture in a patient with a sore throat to look for hemolytic group A streptococcus bacteria, or a mammogram in a woman with a palpable breast mass to look for a cancer.

(2) To screen for disease among asymptomatic patients. Examples of this are the phenylketonuria test in a healthy newborn to detect a rare genetic disorder, a mammogram in a woman without signs or symptoms of a breast mass, or the prostate specific antigen test in a healthy asymptomatic man to look for prostate cancer. Screening tests will not directly benefit the majority of people who get them, since they don't have the disease, but the result can be reassuring if it is negative. In general there are five criteria that must be met for a successful screening test – burden of suffering, early detectability, test validity, acceptability, and improved outcome – and unless all of these are met, the test should not be recommended. We will discuss these in Chapter 28.

(3) To provide prognostic information on patients with established disease. Examples of this are a CD4 count or viral load in a patient with HIV infection to look for susceptibility to opportunistic infection, or a CA 27.29 or CA 15.3 level in a woman with breast cancer to look for disease recurrence.

(4) To monitor ongoing therapy, maximize effectiveness, and minimize side effects. One example of this is monitoring the prothrombin time in patients on warfarin therapy. This checks the patient's level of anticoagulation and prevents levels from being either too low, leading to new clotting, or too high, leading to excess bleeding. Another example is monitoring the therapeutic gentamicin level in patients on this antibiotic to reduce the likelihood of toxic levels causing renal failure.

Important features to determine the usefulness of a diagnostic test

There are several ways of looking at the usefulness of diagnostic tests. This hierarchical evaluation uses six possible endpoints to determine a test's utility. The more criteria in the schema that are fulfilled, the more potentially useful the test will be. Conversely, tests that fulfill fewer criteria have more limited usefulness. These criteria are based on an article by Pearl.1

(1) Technical aspects. What are the technical performance characteristics of the test? How easy and cheap is it to perform and how reliable are the results?

(a) Reliable and precise – results should be reproducible, giving the same result when the test is repeated on the same individual under the same conditions. This is usually a function of the instrumentation or operator reliability of the test. While precision used to be assumed to be present for all diagnostic tests, many studies have demonstrated that with most non-automated tests, there is some degree of subjectivity in test interpretation. This has been seen in x-ray tests such as CT scan, mammography, and angiography. It is also present in tests commonly considered to be the "gold standard" such as the interpretation of tissue samples from autopsies, biopsies, or surgery.

(b) Accurate – the test should produce the correct result, or the actual value of the variable it is seeking to measure, all of the time. The determination of accuracy depends upon the ability of the instrument's result to be the same as the result determined using a standardized specimen and an instrument that has been specially calibrated to always measure the same result.

1 W. S. Pearl. A hierarchical outcomes approach to test assessment. Ann. Emerg. Med. 1999; 33: 77–84.

(c) Operator dependence – test results may depend on the skill of the person performing the test. A person with more experience, better training, or more talent will get more precise and accurate results on many tests.

(d) Feasibility and acceptability – how easy is it to do the test? Is there a large and expensive machine that must be bought? Is the test invasive or uncomfortable to perform? For example, many patients cannot tolerate being in an MRI machine because they have claustrophobia. For this subset of patients, an MRI would be an unacceptable test. If a test is very expensive and not covered by health insurance, the patient may not be able to pay for it, making it a useless test for them.

(e) Interference and cross-reactivity – are there any substances such as bodily components, medications, or foods that will interfere with the results? These substances may create false positive test results. The substances may also prevent the test from picking up true positives and thereby make them false negatives. An example of this is that a person who eats poppy-seed bagels will give a false positive urine test for opiates.

(f) Inter-observer and intra-observer reliability – previously discussed in the section on the kappa statistic (Chapter 7), this concept is related to operator dependence.

(2) Diagnostic accuracy. How well does the test help in making the diagnosis of the disease? This includes the concepts of validity, likelihood ratios, sensitivity, specificity, predictive values, and area under the ROC curve. These concepts will be discussed in the next several chapters.

(a) Validity – the test should discriminate between individuals with and without the disorder in question. How does the test result compare to that obtained using the gold standard? Criterion-based validity describes how well the measurement agrees with other approaches for measuring the same characteristic, and is a very important measurement in studies of diagnostic tests.

(b) The gold standard – this is also known as the reference standard. The result of a gold-standard test defines the presence or absence of the disease (i.e., all patients with the disease have a positive test and all patients without the disease have a negative test). All other diagnostic tests must be compared to a gold standard for the disease. There are very few true gold standards in medicine and some are better or scientifically more pure than others. Some typical gold standards are:

(i) Surgical or pathological specimens. These are traditionally considered to be the ultimate gold standard, but their interpretations can vary with different pathologists.


(ii) Blood culture for bacteremia. Theoretically, all bacteria that are present in the blood should grow on a suitable culture medium. Sometimes, for technical reasons, the culture does not grow bacteria even though they were present in the blood. This can occur because the technician doesn't plate the culture properly, it is stored at an incorrect temperature, or there just happened to be no bacteria in the particular 10-cc vial of blood that was sampled.

(iii) Jones criteria for rheumatic fever. This is a set of fairly objective criteria for making a diagnosis of rheumatic fever. Factors that could decrease the accuracy of these criteria are that a component of the criteria, such as temperature, may be measured incorrectly in some patients, or another criterion like arthritis may be interpreted incorrectly by the observer.

(iv) DSM-IV criteria for major depression. These criteria are objective, yet depend on the clinician's interpretation of the patient's description of their symptoms.

(v) X-rays. As mentioned previously, x-rays are open to variation in the reading, even by experienced radiologists.

(vi) Long-term follow-up. This is the ultimate fall-back or de-facto gold standard. If we are ultimately interested in finding out how well a test works to separate the diseased patients from the healthy patients, we can follow everyone who received the test for a specified period of time and see which outcomes they all have. This technique works as long as the time period is long enough to see all the possible disease outcomes, yet short enough to study realistically.

(3) Diagnostic thinking. Does the result of the test cause a change in diagnosis after testing is complete? This includes the concepts of incremental gain and confidence in the diagnosis. If we are almost certain that a patient has a disease based upon one test result or the history and physical exam, we don't need a second test to confirm that result. Diagnostic thinking only considers how the test performs in making the diagnosis in a given clinical setting, and is therefore closely related to diagnostic accuracy. The setting within which this thinking operates is dependent on the prevalence of the disease in the patient population being tested.

(4) Therapeutic effectiveness. Is there a change in management as a result of the outcome of the test? Also, is the test cost-effective in the management of the particular disease? For example, the venogram is the gold-standard test in the diagnosis of deep venous thrombosis. It is an expensive and invasive test that can cause some side effects, although these side effects are rarely lethal. Is this test worth it if an ultrasound is almost as accurate? Part of the art of medicine is determining which patients with one negative ultrasound can safely wait for a confirmatory ultrasound 3 days later, and which patients need to have an immediate venogram or initiation of anticoagulant medication therapy.

(5) Patient outcomes. Does the result of the test mean that the patient will feel or be better? This considers biophysiological parameters, symptom severity, functional outcome, patient utility, expected values, morbidity avoided, mortality change, and cost-effectiveness of outcomes. We will discuss some of these issues in the chapter on decision trees and patient values (Chapter 31).

(6) Societal outcomes. Is the test effective for society as a whole? Even a cheap test, if done excessively, may result in prohibitive costs to society. Outcomes include the additional cost of evaluation or treatment of patients with false positive test results and the psychosocial cost of these results on the patient and community. Other outcomes are the risk of missing the correct diagnosis in patients who are falsely negative and may suffer negative outcomes as a result of the diagnosis being missed. Again, physicians may need to consider a cost analysis when evaluating the test. Interestingly, the perspective of the analysis can be that of the patient, the payor, or society as a whole. Overall, patient or societal outcomes ultimately determine the usefulness of a test as a screening tool.


23

Utility and characteristics of diagnostic tests: likelihood ratios, sensitivity, and specificity

It seems to me that science has a much greater likelihood of being true in the main than any philosophy hitherto advanced.

Bertrand Russell (1872–1970): The Philosophy of Logical Atomism, 1924

Learning objectives

In this chapter you will learn:
- the characteristics and definitions of normal and abnormal diagnostic test results
- how to define, calculate, and interpret likelihood ratios
- the process by which diagnostic decisions are modified in medicine and the use of likelihood ratios to choose the most appropriate test for a given purpose
- how to define, calculate, and use sensitivity and specificity
- how sensitivity and specificity relate to positive and negative likelihood ratios
- the process by which sensitivity and specificity can be used to make diagnostic decisions in medicine and how to choose the most appropriate test for a given purpose

In this chapter, we will be talking about the utility of a diagnostic test. This is a mathematical expression of the ability of a test to find persons with disease or exclude persons without disease. In general, a test's utility will depend on two factors: the likelihood ratios and the prevalence of disease in the target population. Additional test characteristics that will be introduced are sensitivity and specificity. These factors will tell the user how useful the test will be in the clinical setting. Using a test without knowing these characteristics will result in problems that include missing correct diagnoses, over-ordering tests, increasing health-care costs, reducing trust in physicians, and increasing discomfort and side effects for the patient. Once one understands these properties of diagnostic tests, one will be able to determine when best to order them.

Why order a diagnostic test?

The indications for ordering a diagnostic test can be distilled into two simple rules. They are:

(1) When the characteristics of that test give it validity in the clinical setting. Will a positive or negative test be a true positive or a true negative result? Will that result help in correctly distinguishing a diseased patient from one without disease?

(2) When the test result will change the probability of the disease, leading to a change of clinical strategy. What will a positive or negative test result tell me about this patient that I don't already know and that I need to know? Will the test results change my treatment plan for this patient?

If the test that is being considered does not fall into one of these categories, it should not be done!

What do diagnostic tests do?

Diagnostic tests are a way of obtaining information that provides a basis for revising disease probabilities. When a patient presents with a clinical problem, one first creates a differential diagnosis. One attempts to reduce the number of diseases on this list by ordering diagnostic tests. Ideally, each test will either rule in or rule out one or more of the diseases on the differential diagnosis list. Diseases which are common, have serious sequelae such as death or disability, or can be easily treated are usually the ones which must initially be ruled in or out.

We rule in disease when a positive test for that disease increases the probability of disease, making its presence so likely that we would treat the patient for that disease. This should also make all the other diseases on the differential diagnosis list so unlikely that we would no longer consider them as possible explanations for the patient's complaints. We rule out disease when a negative test for that disease reduces the probability of that disease, making it so unlikely that we would no longer look for evidence that our patient had that disease.

After setting up a list of possible diseases, we can assign a pretest probability to each disease on the differential. This is the estimated likelihood of disease in the particular patient before any testing is done. As we discussed earlier, it is based on the history and physical examination as well as on the prevalence of the disease in the population. It is also called the prior or a-priori probability of disease in that patient.


[Fig. 23.1 Bayes' theorem: Post-test probability ∝ Pretest probability × Test factor. In words: what we know after doing the test = what we knew before doing the test × how much the test results change the likelihood of what we knew before.]

After doing a diagnostic test, we are able to calculate the post-test probability of disease. This is the estimated likelihood of the disease in a patient after testing is done. This is also called the posterior or a-posteriori probability of disease. We can do this when the test result is either positive or negative. A positive test tends to rule in the disease while a negative test tends to rule out the disease. We normally think of a test as being something done by a lab or radiologist. However, the test can be an item of history, part of the physical examination, a laboratory test, a diagnostic x-ray, or any other diagnostic maneuver. Common examples of this are pulmonary function testing, psychological testing, EEG, or EKG.

Mathematically, the pretest probability of the disease is modified by the application of a diagnostic test to yield a post-test probability of the disease. This revision of the pretest disease probabilities is done using a number called the likelihood ratio (LR). Likelihood ratios are stable characteristics of a diagnostic test and give the strength of that test. The likelihood ratio can be used to revise disease probabilities using a form of Bayes' theorem (Fig. 23.1). We will return to Bayes' theorem in Chapter 24. Before fully looking at likelihood ratios, it is useful to look at the definitions of normality in diagnostic tests.
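As a concrete preview of that revision process, here is a minimal sketch of Bayes' theorem in its odds form, the version used with likelihood ratios; the function name, the pretest probability, and the LR are hypothetical, purely for illustration.

    # Hypothetical sketch: revising a pretest probability with a likelihood ratio.
    def post_test_probability(pretest_prob, likelihood_ratio):
        pretest_odds = pretest_prob / (1 - pretest_prob)
        post_test_odds = pretest_odds * likelihood_ratio  # the "test factor" of Fig. 23.1
        return post_test_odds / (1 + post_test_odds)

    # A 30% pretest probability revised by a positive test with LR+ = 8:
    print(round(post_test_probability(0.30, 8), 2))  # 0.77

In this hypothetical case, a positive test raises the probability of disease from 30% to about 77%.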

Types of test results

Dichotomous test results can have only two possible values. Typical results are yes or no, positive or negative, alive or dead, better or not. A common dichotomous result is an x-ray reading, which is reported either as normal or as abnormal and showing a particular abnormality. There is also a middle ground, or gray zone, in these tests, as sometimes they will be unreadable because of poor technical quality. In addition, there are many subtle gradations that can appear on an x-ray and lead to various readings, but they may not pertain to the disease for which the patient is being evaluated.

Continuous test results can have more than two possible values. The serum sodium level, or the level of other blood components, is an example of a continuous test. A patient can have any of a theoretically infinite number of values for the test result. In real life, serum sodium can take any value from about 100 to 170, although at the extremes the person is near death. In practice, we often take continuous tests and select a set of values for the variable that will be considered normal (135–145 mEq/L for serum sodium), thereby turning the continuous test into a dichotomous test, which is reported as normal or abnormal. Values of the serum sodium below 135 mEq/L, called hyponatremia, or above 145 mEq/L, called hypernatremia, are both abnormal. Clearly, the farther from the normal range, the more serious the problem.
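As a small illustration of this dichotomization, the sketch below applies the normal range quoted above; the function name is hypothetical, purely for illustration.

    # Hypothetical sketch: turning a continuous serum sodium value into a
    # dichotomous normal/abnormal result using the 135-145 mEq/L range.
    def classify_sodium(value_meq_per_liter):
        if value_meq_per_liter < 135:
            return "abnormal (hyponatremia)"
        if value_meq_per_liter > 145:
            return "abnormal (hypernatremia)"
        return "normal"

    print(classify_sodium(128))  # abnormal (hyponatremia)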

Definitions of a normal test result

There are many mathematical ways to describe the results of a diagnostic test as normal or abnormal. In the method of percentiles, cutoffs are chosen at preset percentiles of the diagnostic test results. These preset percentiles are chosen as the upper and lower limits of normal. All values above the upper limit or below the lower limit of the normal percentiles are abnormal. This method assumes that all diseases have the same prevalence. A special case of this method is the Gaussian method. In this method, normal is the central 95% of results, which is plus or minus two standard deviations (± 2 SD) of the mean of the values observed in all tests done (Fig. 23.2). Results are only specific to the population being studied and cannot be generalized to other populations.

[Fig. 23.2 Gaussian results of a diagnostic test: a normal distribution whose central 95% (± 2 SD) is labeled normal, with both tails labeled abnormal.]
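The sketch below illustrates both definitions on a set of simulated measurements; the data are randomly generated, purely for illustration.

    # Hypothetical sketch: the percentile and Gaussian definitions of "normal"
    # applied to 1000 simulated test results.
    import random
    import statistics

    random.seed(1)
    results = sorted(random.gauss(140, 3) for _ in range(1000))

    # Percentile method: the central 95% of observed results is "normal".
    lower = results[int(0.025 * len(results))]   # 2.5th percentile
    upper = results[int(0.975 * len(results))]   # 97.5th percentile
    print(f"percentile normal range: {lower:.1f} to {upper:.1f}")

    # Gaussian method: mean +/- 2 standard deviations.
    mean = statistics.mean(results)
    sd = statistics.stdev(results)
    print(f"Gaussian normal range: {mean - 2 * sd:.1f} to {mean + 2 * sd:.1f}")

On roughly Gaussian data the two methods give nearly the same range; on skewed data they can differ, which is one reason the choice of definition matters.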

In reality, there are two normal distributions of test results (Fig. 23.3). One is for patients who are afflicted with the disease and the other is for those free of disease. There is usually an overlap of the distributions of test values for the sick and not-sick populations. The goal of the diagnostic test is to differentiate between the two groups. Some disease-free patients will have abnormal test results while some diseased patients will have normal results, thus setting any single value of the test as the cutoff between normal and abnormal will usually misclassify some patients. The ideal test, the gold standard, will have none of this overlap between diseased and non-diseased populations and will therefore be able to differentiate between them perfectly at all times.


[Fig. 23.3 The "real-life" results of a diagnostic test: overlapping distributions of test values for healthy and diseased populations. A test cutoff divides results into negative and positive; the healthy curve contributes true negatives (TN) below the cutoff and false positives (FP) above it, while the diseased curve contributes false negatives (FN) below the cutoff and true positives (TP) above it.]

For almost all tests that are not a gold standard, there are four possible outcomes. True positives (TP) are those patients with disease who have a positive or abnormal test result. True negatives (TN) are those without the disease who have a negative or normal test result. False negatives (FN) are those with disease who have a negative or normal test result. False positives (FP) are those without disease who have a positive or abnormal test result. We can see this graphically in Fig. 23.3.
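To see how the overlap in Fig. 23.3 produces these four groups, here is a minimal sketch using simulated test values; the distributions and the cutoff are hypothetical, purely for illustration.

    # Hypothetical sketch: counting the four outcomes when healthy and diseased
    # test values overlap and a single cutoff is applied (Fig. 23.3).
    import random

    random.seed(2)
    healthy = [random.gauss(100, 10) for _ in range(1000)]
    diseased = [random.gauss(120, 10) for _ in range(1000)]
    cutoff = 110   # values above the cutoff are called "positive"

    tp = sum(x > cutoff for x in diseased)   # true positives
    fn = len(diseased) - tp                  # false negatives
    fp = sum(x > cutoff for x in healthy)    # false positives
    tn = len(healthy) - fp                   # true negatives
    print(tp, fn, fp, tn)  # all four counts are nonzero: no cutoff is perfect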

Ideally, when a research study of a diagnostic test is done, patients with and without the disease are all given both the diagnostic test and the gold-standard test. The results will show that some patients with a positive gold-standard test, who therefore have the disease, will have a positive diagnostic test and some will have a negative diagnostic test. The ones with a positive test are the true positives and those with a negative test are false negatives. A similar situation exists among patients who have a negative gold-standard test and who therefore are all actually disease-free. Some of them will have a negative diagnostic test result and are called true negatives, and some will have a positive test result and are called false positives.

Strength of a diagnostic test

The results of a clinical study of a diagnostic test can determine the strength of the test. The ideal diagnostic test, the gold standard, will always discriminate diseased from non-diseased individuals in a population. This is another way of saying that the test is 100% accurate. The diagnostic test we are comparing to the gold standard is a test that is easier, cheaper, or safer than the gold standard, and we want to know its accuracy. That tells us how often it is correct, yielding either a true positive or true negative result, and how often it is incorrect, yielding either a false positive or false negative result.

From the results of this type of study, we can create a 2 × 2 table that divides a real or hypothetical population into four groups depending on their disease status (D+ or D–) and test results (T+ or T–). Patients are either diseased (D+) or free of disease (D–) as determined by the gold-standard test. The diagnostic test is applied to the sample, and patients have either a positive (T+) or negative (T–) diagnostic test. We can then create a 2 × 2 table to evaluate the mathematical characteristics of this diagnostic test. This 2 × 2 table (Fig. 23.4) is the conceptual basis for almost all calculations made in the next several chapters.

Fig. 23.4 Results of a study of a diagnostic test.

          D+    D–
    T+    TP    FP
    T–    FN    TN

D+ = disease present; D– = disease absent; T+ = test positive; T– = test negative; TP = true positive; FP = false positive; FN = false negative; TN = true negative.

Fig. 23.5 Positive likelihood ratio (LR+) calculations.

    L{T+ if D+} = TP/(TP + FN) – this is also called SENSITIVITY or the True Positive Rate (TPR).
    L{T+ if D–} = FP/(FP + TN) – this is called the False Positive Rate (FPR).
    LR+ = L{T+ if D+}/L{T+ if D–} = TPR/FPR = Sensitivity/FPR (the likelihood ratio of a positive test).

We can calculate the likelihood or probability of finding a positive test result if a person does or does not have the disease. Similarly, we can calculate the likelihood of finding a negative test result if a person does or does not have the disease. Comparing these likelihoods gives a ratio that shows the strength of the test. Likelihoods are calculated for each of the four possible outcomes. They can be compared in two ratios and are analogous to the relative risk in studies of risk or harm. These are called the positive and negative likelihood ratios. In studies of diagnostic tests, we are looking at the probability that a person with the disease will have a positive test. Compare that to the probability that a person without the disease has a positive test, and the likelihood ratio of a positive test can be calculated (LR+ in Fig. 23.5).

The LR+ tells us by how much a positive test increases the likelihood of disease in a person being tested. We start with the likelihood of disease, do the test, and as a result of a positive test that likelihood increases. The LR+ tells us how much of an increase in this likelihood we can expect. We can do the same thing for a negative test. In this case, we are looking at the likelihoods of having a negative test in people with and without the disease. The LR−, or likelihood ratio of a negative test, tells us by how much a negative test decreases the likelihood of disease in persons who are having the test done. Figure 23.6 describes these calculations.

Fig. 23.6 Negative likelihood ratio (LR−) calculations.

L{T− if D+} = FN/(TP + FN). This is called the False Negative Rate (FNR).
L{T− if D−} = TN/(FP + TN). This is also called SPECIFICITY or the True Negative Rate (TNR).
LR− = L{T− if D+}/L{T− if D−} = FNR/TNR = FNR/Specificity (the likelihood ratio of a negative test).
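The arithmetic in Figs. 23.5 and 23.6 is simple to script. The following is a minimal Python sketch (an illustration, not from the original text) that computes all four characteristics from the cells of a study's 2 × 2 table:

    def test_characteristics(tp, fp, fn, tn):
        """Illustrative helper: stable test characteristics from a 2 x 2 table."""
        sensitivity = tp / (tp + fn)              # true positive rate (TPR)
        specificity = tn / (tn + fp)              # true negative rate (TNR)
        lr_pos = sensitivity / (1 - specificity)  # LR+ = TPR/FPR
        lr_neg = (1 - sensitivity) / specificity  # LR- = FNR/TNR
        return sensitivity, specificity, lr_pos, lr_neg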

Likelihood ratios are called stable characteristics of a test. This means that they do not change with the prevalence of the disease. Their values are determined by clinical studies against a gold standard; therefore, published reports of likelihood ratios are only as good as the gold standard against which they are based and the quality of the study that determined their value.

The likelihood ratios are the strength of the diagnostic test. The larger the value of LR+, the more a positive test will increase the probability of disease in a patient to whom the test is given and who then has a positive result. In general, one would like the likelihood ratio of a positive test to be very high, ideally greater than 10, to maximally increase the probability of disease after doing the test and getting a positive result. Similarly, one would want the likelihood ratio of a negative test to be very low, ideally less than 0.1, to maximally decrease the probability of disease after doing the test and getting a negative result. A qualitative list of LRs has been devised to show the strength of a test based upon LR values. These are listed in Table 23.1.

Table 23.1. Strength of test by likelihood ratio

Qualitative strength    LR+    LR−
Excellent               10     0.1
Very good               5      0.2
Fair                    2      0.5
Useless                 1      1



The likelihood that a patient with the disease has a positive test is also known as the sensitivity or the true positive rate (TPR). This tells the reader how sensitive the test is for finding those persons with disease when only looking at those with disease. It displays how often the result is a true positive compared to a false negative, as it is the fraction of people with the disease who test positive. It is important to note that sensitivity can only be calculated from among people who have the disease. Probabilistically, it is expressed as P[T+|D+], the probability of a positive test if the person has disease.

If the result of a very sensitive test is negative, it tells us that the patient doesn't have the disease and the test is Negative in Health (NIH). This is because in a very sensitive test there are very few false negatives; therefore, virtually all negative tests must occur in non-diseased people. In addition, if the clinician has properly reduced the number of diagnostic possibilities, it would be even more unlikely that the patient has the disease in question. As a general rule, when two or more tests are available, the most sensitive one should be done to minimize the number of false negatives. This is especially true for serious diseases that are easily treated. An example of a very sensitive test is the thyroid-stimulating hormone (TSH) test for hypothyroidism. A normal TSH makes it extremely unlikely that the patient has hypothyroidism; thus, with a normal TSH, hypothyroidism is ruled out. A sensitive test rules out disease – and the mnemonic is SnOut (Sensitive = ruled Out).

Similarly, the likelihood that a patient without disease has a positive test is also known as the false positive rate (FPR). It is equal to one minus the specificity. It tells us how often the result is a false positive compared to a true negative: FPR = FP/(FP + TN). This is the proportion of non-diseased people with a positive test.

The likelihood that a patient without the disease has a negative test is also known as the specificity or the true negative rate (TNR). It tells the reader how specific the test is for finding those persons without disease, when only looking at those without disease. It demonstrates how often the result is a true negative compared to a false positive, as it is the fraction of people without the disease who test negative. It is important to realize that specificity can only be calculated from among people who do not have the disease. Probabilistically, it is expressed as P[T−|D−], the probability of a negative test if the person does not have disease.

If the result of a very specific test is positive, it tells us that the patient has the disease and the test is Positive in Disease (PID). This is because there are very few false positives; therefore, virtually any positive test must occur in diseased people. If the clinician has properly reduced the number of diagnostic possibilities, then it would be even more likely that the patient does have the disease in question. When two or more tests are available, the most specific should be done to minimize the number of false positives. This is especially true for diseases that are not easily treated or for which the treatment is potentially dangerous. An example of a very specific test is the ultrasound for deep venous thrombosis of the leg. If the ultrasound is positive, it is extremely likely that there is a clot in the vein. Thus, a deep vein thrombosis is ruled in. A specific test rules in disease – and the mnemonic is SpIn (SPecific = ruled In).

Table 23.2. Mnemonics for sensitivity and specificity

(1) SeNsitive tests are Negative in health (NIH); SPecific tests are Positive in disease (PID)
(2) Sensitivity: SnOut – sensitive tests rule out disease; Specificity: SpIn – specific tests rule in disease
(3) SeNsitivity includes False Negatives; SPecificity includes False Positives

Similarly, the likelihood that a patient with disease has a negative test is also known as the false negative rate (FNR). It is equal to one minus the sensitivity. It tells the reader how often the result is a false negative compared to a true positive: FNR = FN/(FN + TP). This is the proportion of diseased people with a negative test.

Using sensitivity and specificity

The sensitivity and specificity are the mathematical components of the likelihood ratios. They are the characteristics that are most often measured and reported in studies of diagnostic tests in the medical literature. Three mnemonics can help to remember the difference between sensitivity and specificity. These are listed in Table 23.2. Like likelihood ratios, the true positive rate, false positive rate, true negative rate, and false negative rate are also intrinsic characteristics of a diagnostic test. From the study results, we can use our 2 × 2 table (Fig. 23.7), which divided a real or hypothetical population into four groups depending on their disease status (D+ or D−) and test results (T+ or T−), as a starting point to evaluate these characteristics of the diagnostic test.

We have previously noted the mathematical relationship between sensitivity and specificity and the likelihood ratios. Likelihood ratios can be calculated from sensitivity and specificity. The formulas are as follows:

LR+ = sensitivity/(1 − specificity)
LR− = (1 − sensitivity)/specificity


Fig. 23.7 Sensitivity and specificity calculations.

Sensitivity = TPR = TP/(TP + FN), calculated within the D+ column of the 2 × 2 table.
Specificity = TNR = TN/(TN + FP), calculated within the D− column.

Fig. 23.8 The effect of changing the cutoff point for a diagnostic test: lowering the cutoff makes the test more sensitive but less specific; raising the cutoff makes it more specific but less sensitive.

There is also a dynamic relationship between sensitivity and specificity. As the sensitivity of a test increases, the cutoff point moves to the left in Fig. 23.8. The number of true positives increases compared to the number of false negatives. At the same time, the number of false positives will increase compared to the number of true negatives. This will result in a decrease in the specificity. Notice what happens to the sensitivity and specificity in Fig. 23.8 when the test cutoff moves to the right. Now the sensitivity decreases as the specificity increases. We will see this dynamic relationship better when we discuss receiver operating characteristic curves in Chapter 25.
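This tradeoff can be made concrete with a small simulation. The Python sketch below uses hypothetical, illustrative numbers (not from the text): test values are normally distributed in both populations, and sliding the cutoff upward trades sensitivity for specificity, exactly as in Fig. 23.8:

    from statistics import NormalDist

    healthy = NormalDist(mu=100, sigma=15)   # hypothetical healthy population
    diseased = NormalDist(mu=130, sigma=15)  # hypothetical diseased population

    for cutoff in (105, 115, 125):
        sensitivity = 1 - diseased.cdf(cutoff)  # diseased patients above the cutoff
        specificity = healthy.cdf(cutoff)       # healthy patients below the cutoff
        print(f"cutoff {cutoff}: sensitivity {sensitivity:.2f}, "
              f"specificity {specificity:.2f}")
    # Raising the cutoff from 105 to 125 drops sensitivity from 0.95 to 0.63
    # while specificity rises from 0.63 to 0.95.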

Sample problem

Diarrhea in children is usually caused by viral infection. However, in some cases bacterial infection causes the diarrhea, and these cases should be treated with antibiotics. A study was done in which 156 young children with diarrhea had stool samples taken. All of them were tested for the presence of white blood cells in the stool, and a positive test was defined as one in which there were >5 white blood cells per high-power field. All the children had a stool culture done, which was the gold standard. There were 27 children who had positive cultures, and 23 of these had smears that were positive for fecal white blood cells. Of the 129 who had a negative stool culture, 16 had smears that were positive for fecal white blood cells. What are the likelihood ratios of the stool leukocyte test?

First make your 2 × 2 table (Fig. 23.9). From this you can tell that the prevalence is 27/156 = 0.17.

Fig. 23.9 A 2 × 2 table using data from the study of the use of fecal leukocytes in the diagnosis of bacterial diarrhea in children. The prevalence of disease is 27/156 = 0.17. From: T. G. DeWitt, K. F. Humphrey & P. McCarthy. Clinical predictors of acute bacterial diarrhea in young children. Pediatrics 1985; 76: 551–556.

          D+         D−           Totals
T+        23 (TP)    16 (FP)      39
T−        4 (FN)     113 (TN)     117
Totals    27         129          156 (N)

L{T+|D+} = sensitivity or TPR = TP/(TP + FN) = 23/(23 + 4) = 0.85
L{T+|D−} = 1 − specificity = FPR = FP/(TN + FP) = 16/(113 + 16) = 0.12

From these we can calculate the likelihood ratio of a positive test:

LR+ = L{T+|D+}/L{T+|D−} = 0.85/0.12 = 7.08

Doing the same for a negative test leads to the following results:

L{T−|D+} = 1 − sensitivity = FNR = FN/(TP + FN) = 4/(23 + 4) = 0.15
L{T−|D−} = specificity or TNR = TN/(TN + FP) = 113/(113 + 16) = 0.88

LR− = L{T−|D+}/L{T−|D−} = 0.15/0.88 = 0.17

These likelihood ratios are pretty good, and this is a fairly good test, since the LR+ of 7.08 and the LR− of 0.17 are very close to those of a strong test (LR+ > 10 and LR− < 0.1). This is a test that will increase the likelihood of disease by a lot if the test is positive and decrease the likelihood of disease by a lot if the test is negative. We will talk about applying these numbers in a real clinical situation in a later chapter.
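As a check, the same numbers can be reproduced in a few lines of Python (an illustration, not from the original text). The slightly lower LR+ of about 6.9 appears because the chapter rounds sensitivity and FPR to two decimals before dividing:

    tp, fp, fn, tn = 23, 16, 4, 113  # counts from the fecal leukocyte study

    sensitivity = tp / (tp + fn)                  # 23/27 = 0.85
    specificity = tn / (tn + fp)                  # 113/129 = 0.88
    prevalence = (tp + fn) / (tp + fp + fn + tn)  # 27/156 = 0.17
    lr_pos = sensitivity / (1 - specificity)      # about 6.9 (7.08 with rounded inputs)
    lr_neg = (1 - sensitivity) / specificity      # about 0.17
    print(lr_pos, lr_neg)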

It is always necessary to be aware of biases in a study, and this example is no different. The following are the potential biases in this study. It was done on 156 children who presented to an emergency department with severe diarrhea and were entered into the study. This meant that someone, either the resident or attending physician on duty at the time, thought that the child had infectious or bacterial diarrhea. Therefore, they were already screened before any testing was done on them, and the study is subject to filter or selection bias. This simply means that the population in the study may not be representative of the population of all children with diarrhea, like the ones being seen in a pediatric or family-practice office. The next chapter will deal with this problem and how to generalize the results of this study to real patients.


24

Bayes' theorem, predictive values, post-test probabilities, and interval likelihood ratios

As far as the laws of mathematics refer to reality, they are not certain; and as far as they are certain, they do not refer to reality.
Albert Einstein (1879–1955)

Learning objectives

In this chapter you will learn:
how to define predictive values of positive and negative test results and how they differ from sensitivity and specificity
the difference between odds and probability and how to use each correctly
Bayes' theorem and the use of likelihood ratios to modify the probability of a disease
how to define, calculate, and use interval likelihood ratios for a diagnostic test
how to calculate and use positive and negative predictive values
how to use predictive values to choose the appropriate test for a given diagnostic dilemma
how to apply basic test characteristics to solve a clinical diagnostic problem
the use of interval likelihood ratios in clinical decision making

In this chapter, we will be talking about the application of likelihood ratios, sensitivity, and specificity to a patient.

Introduction

Likelihood ratios, sensitivity, and specificity of a test are derived from studies of patients with and without disease. They are stable and essential characteristics of the test that give us the probabilities of a positive or negative test if the patient does or does not have disease. This is not the information a clinician needs to know in order to apply the test to a single patient.

What the clinician needs to know is: if a patient has a positive test, what is the likelihood that the patient has the disease? The clinician is interested in how the test result relates to the patient. For a given patient, how will the probability of disease change given a positive or negative test result? Applying likelihood ratios, or sensitivity and specificity, to a selected pretest probability of disease will give the post-test probability that answers this question. There are two methods for doing this calculation. The first uses Bayes' theorem, while the second calculates the predictive values of a positive and negative test directly from sensitivity, specificity, and prevalence using the 2 × 2 table.

Predictive values

The positive predictive value (PPV) is the proportion of patients with the disease among all those who have a positive test. If the test comes back positive, it shows the probability that this patient really has the disease. Probabilistically, it is expressed as P[D+|T+], the probability of disease if a positive test occurs. It is also called the post-test or posterior probability of a positive test. A related concept is the false alarm rate (FAR), which is equal to 1 − PPV. That is the proportion of people with a positive test who do not have disease and will then be falsely alarmed by a positive test result.

The negative predictive value (NPV) is the proportion of patients without the disease among all those who have a negative test. If the test comes back negative, it shows the probability that this patient really does not have the disease. Probabilistically, it is expressed as P[D−|T−], the probability of not having disease if a negative test occurs. It is also called the post-test or posterior probability of a negative test. A related concept is the false reassurance rate (FRR), which is equal to 1 − NPV. That is the proportion of people with a negative test who have disease and will be falsely reassured by a negative test result.

Bayes' theorem

Thomas Bayes was an English clergyman with broad talents. His famous theorem was presented posthumously in 1763. In eighteenth-century English, it said: "The probability of an event is the ratio between the value at which an expectation depending on the happening of the event ought to be computed and the value of the thing expected upon its happening." Now, that's not so easy to understand, is it? In simple language, the theorem was an updated way to predict the odds of an event happening when confronted with new information. In statistics, this new information is that gained in the research process. In making diagnoses in clinical medicine, this new information is the likelihood ratio. Bayes' theorem is a way of using likelihood ratios (LRs) to revise disease probabilities.

Bayes' theorem was put into mathematical form by Laplace, the discoverer of his famous law. Its use in statistics was supplanted at the start of the twentieth century by Sir Ronald Fisher's ideas of statistical significance, the use of P < 0.05 for statistical significance. It was kept in the dark until revived in the 1980s. We won't get into the actual formula in its usual and original form here because it only involves another very long and useless formula. A derivation and the full mathematical formula for Bayes' theorem are given in Appendix 5, if interested. In its simplest and most useful form, it states:

Post-test odds = pretest odds × LR

Odds and probabilities

In order to use Bayes' theorem and likelihood ratios, one must first convert the probability of disease to the odds of disease. Odds describe the chance that something will happen against the chance that it will not happen. Probability describes the chance that something will happen against the chance that it will or will not happen. The odds of an outcome are the number of people affected divided by the number of people not affected. In contrast, the probability of an outcome is the number of people affected divided by the number of people at risk, those affected plus those not affected. Probability is what we are estimating when we select a pretest probability of disease for our patient. We next have to convert this to odds.

Let's use a simple example to show the relationship between odds and probability. If we have 5 white blocks and 5 black blocks in a jar, we can calculate the probability or odds of picking a black block at random and, of course, without looking. The odds of the outcome of interest, picking a black block, are 5/5 = 1. There are equal odds of picking a white and a black block. For every one black block that is picked, on average, one white block will be picked. The probability of the outcome of interest, picking a black block, is 5/10 = 0.5. Half of all the picks will be a black block. Figure 24.1 shows this relationship.

In our society, odds are usually associated with gambling. In horse racing or other games of chance, the odds are usually given backward by convention. For example, the odds against Dr. Disaster winning the fifth race at Saratoga are 7 : 1. This means that this horse is likely to lose 7 times for every eight races he enters. In usual medical terminology, these numbers are reversed. We put the outcome we want on top and the one we don't want on the bottom. Therefore, the odds of him winning would be 1 : 7, or 1/7, or 0.14. He will win one time in eight.

The probability of Dr. Disaster winning is different. Here we answer the question of how many times will he have to race in order to win once? He will have to race eight times in order to have one win. The probability of him winning any one race is 1 in 8, or 1/8, or 0.125. Since the odds and probabilities are small numbers, they are very similar. If he were a better horse and the odds of him winning were 1 : 1, or one win for every loss, the odds could be expressed as 1/1 or 1.0. Here the probability would be that he would win one race in every two he starts. The probability of winning is 1/2 or 0.5.

Fig. 24.1 Relationship between odds and probability. As the odds and probabilities get smaller, they also approximate each other. As they get larger, they become more and more different.

Black and white blocks in a jar    Odds          Probability
9 black, 1 white                   9/1 = 9       9/10 = 0.9
3 black, 1 white                   3/1 = 3       3/4 = 0.75
2 black, 2 white                   2/2 = 1       2/4 = 0.5
1 black, 3 white                   1/3 = 0.33    1/4 = 0.25
1 black, 9 white                   1/9 = 0.11    1/10 = 0.1

Fig. 24.2 Converting odds to probability (and back).

To convert odds to probability: Probability = Odds/(1 + Odds)
To convert probability to odds: Odds = Probability/(1 − Probability)

The language for odds and probabilities differs. Odds are expressed as one number to another: for example, odds of 1 : 2 are expressed as "one to two" and equal the fraction 0.5. This is the same as saying the odds are 0.5 to 1. Probability is expressed as a fraction. The same 1 : 2 odds would be expressed as "one in three" = 0.33. These two expressions and numbers are the same way of saying that for every three attempts, there will be one successful outcome.

There are mathematical formulas for converting odds to probability and vice versa. They are listed in Fig. 24.2.
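Both conversions are one-liners in code. Here is a minimal Python sketch (an illustration, not from the original text):

    def probability_to_odds(p):
        """Odds = probability / (1 - probability)."""
        return p / (1 - p)

    def odds_to_probability(odds):
        """Probability = odds / (1 + odds)."""
        return odds / (1 + odds)

    print(probability_to_odds(0.5))  # 1.0: even odds
    print(odds_to_probability(1/7))  # 0.125: Dr. Disaster's chance of any one win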

Using likelihood ratios to revise pretest probabilities of disease

Likelihood ratios (LRs) can be used to revise disease pretest probabilities when test results are dichotomous, using Bayes' theorem. This says post-test odds of disease equal pretest odds of disease times the likelihood ratio. We get the pretest probability of disease from our differential diagnosis list and our estimate of the possibility of disease in our patient. The pretest probability is converted to pretest odds and multiplied by the likelihood ratio. This results in the post-test odds, which are converted back to a probability, the post-test probability.

The end result of using Bayes' theorem when a positive test occurs is the post-test probability of disease. This is also called the positive predictive value (PPV). For a negative test, Bayes' theorem calculates the probability that the person still has disease even if a negative test occurs. This is called the false reassurance rate (FRR). From this, one can calculate the negative predictive value (NPV), which is the probability that a person with a negative test does not have the disease. Mathematically it is 1 minus the FRR. The process is represented graphically in Fig. 24.3.

Fig. 24.3 Flowchart for Bayes' theorem: start with the pretest probability and convert it to pretest odds; multiply by the LR to get the post-test odds; then convert back to a post-test probability (the PPV if the test was positive, the FRR if negative).

We will demonstrate this with an example. A study was done to evaluate the use of the urine dipstick in testing for urinary tract infections (UTI) in children seen in a pediatric emergency department.¹ A positive leukocyte esterase and nitrite test on a urine dipstick was defined as being diagnostic of a UTI. In this case, a urine culture was done on all the children and therefore was the gold standard. A positive test on both indicators, the leukocyte esterase and nitrite, had a positive likelihood ratio (LR+) of 20 but a negative likelihood ratio (LR−) of 0.61. In the study population, the probability of a urinary tract infection in the children being evaluated in that setting was 0.09 (9%).

¹ From K. N. Shaw, D. Hexter, K. L. McGowan & J. S. Schwartz. Clinical evaluation of a rapid screening test for urinary tract infections in children. J. Pediatr. 1991; 118: 733–736.

Suppose you are in a practice and estimate that a particular child whom you are seeing for fever has a pretest probability of 10% of having a UTI. This is equivalent to a low pretest probability of disease. If you want to find out what the post-test probabilities of a urinary tract infection are after using the dipstick test, use Bayes' theorem and do the following steps:

(1) Convert probability to odds. Pretest probability = 0.1; therefore, pretest odds = 0.1/(1 − 0.1) = 0.11. (Remember, for low values the odds and the probability are close enough that the same number can be used.)
(2) Apply Bayes' theorem. Multiply pretest odds by the likelihood ratio for a positive test (LR+). In this case, LR+ = 20, a very high LR+, so the test is very powerful if positive. Post-test odds = pretest odds × LR+ = 0.11 × 20 = 2.2.
(3) Convert odds back to probability. Post-test probability = odds/(odds + 1) = 2.2/3.2 = 0.69. (Here we have to do the formal calculation back to probability to get a reasonable result.)
(4) Interpret the result. The post-test probability, or positive predictive value, of disease is 69%. In other words, a positive urine dipstick has increased the probability of a urinary tract infection from 0.1 to 0.69. This is a big jump! Most tests have much less ability to move the patient's pretest probability.

Using the same example for a negative test:
(1) Pretest probability and odds of disease are unchanged. Pretest odds = 0.11.
(2) LR− = 0.61, and post-test odds = 0.11 × 0.61 = 0.067.
(3) Post-test probability = 0.067/1.067 = 0.063.

In other words, a negative urine dipstick has reduced the probability of urinary tract infection from 0.1 to 0.06. This is the false reassurance rate (FRR) and tells us how many children we will falsely tell not to worry, in this case 6 out of 100. We can also calculate the negative predictive value, which is 1 − FRR, or 1 − 0.06. The NPV is, therefore, 0.94: 94% of children with a negative test are free of disease. Of course, it is important to recognize that the pretest probability of not having a urinary tract infection before doing any test was estimated at 90%.
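The four steps collapse into one small function. This Python sketch (an illustration, not from the original text) reproduces both dipstick results:

    def post_test_probability(pretest_probability, lr):
        """Bayes' theorem: post-test odds = pretest odds x likelihood ratio."""
        pretest_odds = pretest_probability / (1 - pretest_probability)
        post_test_odds = pretest_odds * lr
        return post_test_odds / (1 + post_test_odds)

    print(post_test_probability(0.10, 20))    # positive dipstick: about 0.69
    print(post_test_probability(0.10, 0.61))  # negative dipstick: about 0.063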

When we get a negative test result, we have to make a clinical decision. Should we do the urine culture, or gold standard test, for all children who have a negative dipstick test in order to pick up the 6% who actually have an infection? Or should we just reassure them and repeat the test if the symptoms persist? This conundrum must be accurately communicated to the patient, in this case the parents, and plans made for all contingencies. Choosing to do the urine culture on all children with a negative test will result in a huge number of unnecessary cultures. They are expensive and will result in a large expenditure of effort and money for the health-care system. Whether or not to do the urine culture depends on the consequences of not diagnosing an infection at the time the child presents with the initial symptoms. In the office, it is not known if these undetected children progress to kidney damage. The available evidence suggests that there is no significant delayed damage: the majority of these infections will spontaneously clear, or the child will show up with persistent symptoms and be treated at a later time.
treated at a later time.


Fig. 24.4 Nomogram for Bayes' theorem, with scales for the pretest probability, the likelihood ratio, and the post-test probability. From T. J. Fagan. [letter.] N. Engl. J. Med. 1975; 293: 257. Used with permission.

The nomogram

A nomogram to calculate post-test probability using likelihood ratios was developed in 1975 by Fagan (Fig. 24.4). Begin by marking the LR and pretest probability on the nomogram. Connect these two points, and continue the line until the post-test probability is reached. This obviates the need to calculate pretest odds and post-test probability. For our example of a child with signs and symptoms of a urinary tract infection, the plot of the post-test probability for this clinical situation is shown in Fig. 24.5.

Fig. 24.5 Using the Bayes' theorem nomogram in the example of UTI in children. The line shown is for the positive dipstick (pretest probability 10%, LR+ = 20). To use the nomogram: (1) find 10% on the pretest probability scale; (2) find the LR value of 20; (3) connect these points and continue that line until it intersects the post-test probability line; (4) read the post-test probability (69%) off that line.

Calculating post-test probabilities using sensitivity and specificity directly

The other way of calculating post-test probabilities uses sensitivity and specificity directly to calculate the predictive values. Not only are the positive and negative predictive values of the test related to the sensitivity and specificity, but they are also dependent on the prevalence of disease. The prevalence of disease is the pretest probability of disease that has been assigned to the patient or the prevalence of disease in the population of interest. The history and physical exam give an estimate of the pretest probability. Simply knowing the sensitivity and specificity of a test without knowing the prevalence of the disease in the population from which the patient is drawn will not help to differentiate between disease and non-disease in your patient. Go back to Table 20.5 in Chapter 20 and look at the table of pretest probabilities again. This ought to help it make more sense.

Clinicians can use the pretest probabilities for disease and non-disease, respectively, along with the test sensitivity and specificity, to calculate the post-test probability that the patient has the disease (post-test probability = predictive value). This is shown graphically in Fig. 24.6.

Calculating predictive values step by step

(1) Pick a likely pretest probability (P) of disease using the rules we discussed in Chapter 20. Moderate errors in the selection of this number will not significantly affect the results or alter the interpretation of the result.
(2) Set up a cohort of 1000 (N) patients, or use a similarly convenient number to make the math as easy as possible, and divide them into diseased (D+ = P × N) and non-diseased (D− = (1 − P) × N) groups based on the estimated pretest probability or prevalence (P). Use the 2 × 2 table.
(3) Multiply D+ and D− by the sensitivity and specificity, respectively, to get the contents of the boxes TP and TN:
Sensitivity × D+ = TP
Specificity × D− = TN
(4) Fill in the remaining boxes, FN and FP: FN = (D+) − TP and FP = (D−) − TN.
(5) Calculate the predictive values using the formulas:
PPV = TP/(TP + FP)
NPV = TN/(TN + FN)

Fig. 24.6 Predictive values calculations.

PPV = TP/(TP + FP) = positive predictive value
NPV = TN/(TN + FN) = negative predictive value
FAR = FP/(FP + TP) = 1 − PPV = false alarm rate
FRR = FN/(FN + TN) = 1 − NPV = false reassurance rate
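These five steps translate directly into code. This Python sketch (an illustration, not from the original text) is shown with the fecal leukocyte test numbers used in the example that follows; the hand calculation below, which rounds cell counts to whole patients, reports an NPV of 0.996:

    def predictive_values(prevalence, sensitivity, specificity, n=1000):
        """Predictive values by the step-by-step cohort method."""
        d_pos = prevalence * n    # diseased patients (D+)
        d_neg = n - d_pos         # non-diseased patients (D-)
        tp = sensitivity * d_pos  # true positives
        tn = specificity * d_neg  # true negatives
        fn = d_pos - tp           # false negatives
        fp = d_neg - tn           # false positives
        return tp / (tp + fp), tn / (tn + fn)  # (PPV, NPV)

    ppv, npv = predictive_values(prevalence=0.02, sensitivity=0.85, specificity=0.88)
    print(f"PPV = {ppv:.2f}, NPV = {npv:.3f}")  # PPV = 0.13, NPV = 0.997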

Let's go back to the 156 young children with diarrhea whom we met at the end of Chapter 23.² Recall that we calculated the sensitivity and specificity of the stool sample test for fecal white blood cells, with >5 cells/high-power field defining a positive test, and got 85% and 88%, respectively. We have already decided that this study population does not represent all children with diarrhea who present to a general pediatrician's office. In this setting, the pediatrician estimates that the prevalence of bacterial diarrhea is closer to 0.02 than the 0.17 (27/156) found in the study. How does the lower prevalence change the predictive values of the test? What is the likelihood of disease in a child with a positive or negative test?

(1) First, use 1000 patients (N) to set up the 2 × 2 table using the new estimated clinical prevalence of bacterial diarrhea of 0.02, or 20 out of 1000 (Fig. 24.7).

² T. G. DeWitt, K. F. Humphrey & P. McCarthy. Clinical predictors of acute bacterial diarrhea in young children. Pediatrics 1985; 76: 551–556.


Fig. 24.7 Set up the 2 × 2 table using a population of 1000 patients (N) and an estimated clinical prevalence (P) of bacterial diarrhea of 0.02.

          D+    D−
T+        TP    FP     TP + FP
T−        FN    TN     FN + TN
          20    980    1000 (N)

(2) Next, multiply the number with disease by the sensitivity and the number without disease by the specificity to get the values of TP and TN. Round off decimals (Fig. 24.8).

Fig. 24.8 Multiply the number with disease (D+ = P × N) by the sensitivity and the number without disease (D− = (1 − P) × N) by the specificity to get the values of TP and TN.

          D+                 D−
T+        20 × 0.85 = 17     FP
T−        FN                 980 × 0.88 = 862
          20                 980                  1000 (N)

(3) Fill in the FP and FN boxes and add the lines across (Fig. 24.9).

Fig. 24.9 Subtract TP from D+ and TN from D− to get the values of FN and FP, and add the lines across.

          D+        D−             Totals
T+        17        118            135
T−        3         862            865
          20 (P)    980 (N − P)    1000 (N)

(4) Calculate PPV, NPV, FAR, and FRR:
PPV = TP/T+ = 17/135 = 0.13
NPV = TN/T− = 862/865 = 0.996
FAR = FP/T+ = 1 − PPV = 0.87
FRR = FN/T− = 1 − NPV = 0.004

(5) Interpret the results and decide how to use them.

Compared to the original population with a prevalence of 17.3%, we can see that the PPV drops significantly when the prevalence decreases. This is a general rule of the relationship between PPV and prevalence.


A PPV of 13% means that most positives are not true positives; in fact, they are children who do not have bacterial diarrhea. For every seven children treated with antibiotics in the belief that they had bacterial diarrhea, only one really needed them. The others got no benefit from any kind of antibacterial treatment. Clinicians have to decide whether it is better to treat six children without bacterial diarrhea in order to treat the one with the disorder, to treat no one with antibiotics, or to order another test to further eliminate the false positives. The upside of antibiotics is that bacterial diarrhea will get better faster with antibiotics. The downsides of antibiotic use include rare side effects such as allergic reactions, and problems that are removed from the individual, like increased bacterial resistance with high rates of antibiotic usage in the population. So, if a clinician decides this is not a serious problem and treatment is a reasonable trade-off, then he or she will use antibiotics. If, on the other hand, a clinician decides that antibiotic resistance is a real and significant problem, and that treatment will not change the course of the illness in a dramatic manner or significantly alleviate much suffering, then he or she would choose not to treat. In that case, the clinician would decide not to do the fecal white blood cell test, since even with a positive result the patient would not be treated with antibiotics.

An NPV of 99.6% means that among children with a negative test, only about 4 in 1000 actually have bacterially caused infectious diarrhea and will be missed, so the physician can safely avoid treating them with antibiotics. This is especially true since the result of non-treatment is simply prolonging the diarrhea by a day. The physician's approach would be different if the results of non-treatment were serious, resulting in prolonged disease with significant morbidity or mortality. In that case, even 4 out of 1000 could be too many to miss, and the physician should do the gold standard test on all the children.

Predictive values are the numbers that clinicians need in order to determine the likelihood of disease in a patient with a positive or negative test result and a given pretest probability. These numbers will modify the differential diagnosis and change the pretest probabilities assigned to the patient.

Finally, we can do the same problem with likelihood ratios. The calculations are as follows:

LR+ = sensitivity/(1 − specificity) = 0.85/0.12 = 7.08
LR− = (1 − sensitivity)/specificity = 0.15/0.88 = 0.17

Using a pretest probability of 2%, the probability and odds are essentially the same: 0.02. Applying Bayes' theorem, post-test odds = LR+ × 0.02 = 7.08 × 0.02 = 0.14, and post-test probability = 0.124. Compare this to the PPV of 0.13. Similarly, for a negative test: post-test odds = LR− × 0.02 = 0.17 × 0.02 = 0.0034. Compare this to the FRR of 0.004.


In summary, we now have two ways of calculating the post-test probability of disease given the operating characteristics of the tests. One is to use Bayes' theorem and likelihood ratios to modify pretest odds and calculate post-test odds. The other is to use prevalence, sensitivity, and specificity in a 2 × 2 table to calculate predictive values.

Finally, we must think about accuracy. This term has been used more in the past to designate the strength of a diagnostic test. In this instance, it is the true positives and true negatives divided by the total population to whom the test was applied. However, this can be a grossly misleading number. If there are many more people without the disease than with it, a very specific test with few false positives will appear accurate even with poor sensitivity. Thus, accuracy says nothing about the sensitivity and should not be used as the measure of a test's performance. The same holds true for a population with a very high prevalence of disease and a test with high sensitivity but poor specificity.
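A quick numerical illustration (hypothetical numbers, not from the original text) shows how badly accuracy can mislead:

    # A test with poor sensitivity but excellent specificity, at 2% prevalence.
    n, prevalence = 1000, 0.02
    sensitivity, specificity = 0.30, 0.99

    tp = sensitivity * prevalence * n        # 6 of the 20 diseased are detected
    tn = specificity * (1 - prevalence) * n  # 970.2 of the 980 healthy are cleared
    accuracy = (tp + tn) / n
    print(f"accuracy = {accuracy:.2f}")      # 0.98, even though 70% of cases are missed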

Fig. 24.10 Interval likelihood ratio (iLR).

iLR = [(patients with disease and with test result in the interval)/(total patients with disease)] ÷ [(patients without disease and with test result in the interval)/(total patients without disease)]
    = (% of patients with disease who have results in the interval)/(% of patients without disease who have results in the interval)

Interval likelihood ratios (iLR)

Likelihood ratios allow us to calculate post-test probabilities when continuous rather than just dichotomous test results are used. Single cutoff points of tests with continuous variable results set potential "traps" for the unwary clinician. Often in studies where the outcome variable of interest is a continuous variable, a single dichotomous cutoff point is selected as the best single-point cutoff between normal and abnormal patients. Valuable data are disregarded if the results of such a test are considered only "positive" or "negative." We can obviate this problem using interval likelihood ratios.

The "interval" LR (iLR) is the probability of a test result in the interval under consideration among diseased subjects, divided by the probability of a test result within the same interval among non-diseased subjects. Simply put, the interval likelihood ratio is the percentage of patients with disease who have test results in the interval divided by the percentage of patients without disease with test results in the interval (Fig. 24.10). If the iLR associated with an interval is less than 1, the probability of disease decreases; if it is greater than 1, the probability of disease increases.

Table 24.1. Distribution of white blood cell count in patients with and without appendicitis

WBC/μL         With appendicitis (% of 59)    Without appendicitis (% of 145)    iLR+ (95% CI)
4000–7000      1 (2%)                         30 (21%)                           0.1 (0–0.39)
7000–9000      9 (15%)                        42 (29%)                           0.52 (0–1.57)
9000–11000     4 (7%)                         35 (24%)                           0.29 (0–0.62)
11000–13000    22 (37%)                       19 (13%)                           2.8 (1.2–4.4)
13000–15000    6 (10%)                        9 (6%)                             1.7 (0–3.6)
15000–17000    8 (14%)                        7 (5%)                             2.8 (0–6.0)
17000–19000    4 (7%)                         3 (2%)                             3.5 (0–10)
19000–22000    5 (8%)                         0 (0%)                             Infinite (NA)
Total          59 (100%)                      145 (100%)

Example: for WBC from 4000 to 7000, iLR = (1/59)/(30/145) = 2%/21% = 0.1. From S. Dueholm, P. Bagi & M. Bud. Laboratory aid in the diagnosis of acute appendicitis. A blinded prospective trial concerning diagnostic value of leukocyte count, neutrophil differential count, and C-reactive protein. Dis. Colon Rectum 1989; 32: 855–859.

When data are gathered for results of a continuous variable, predetermined cutoff points should be set. Then the number of people with and without disease in each interval can be determined. Many authorities believe that these results are more accurate and represent the true state of things better than a single cutoff point. The following example, using the white cell count in appendicitis, illustrates this issue.

A 16-year-old girl comes to the emergency department complaining of right-lower-quadrant abdominal pain for 14 hours and a decreased appetite. Her physical examination reveals right-lower-quadrant tenderness and spasm, and the clinician thinks that she might have appendicitis. A white blood count (WBC) is obtained, and the result is a level of 10 200 cells/μL. The "normal" range is 4 500–11 000 cells/μL. Although this test result is "normal," it is just below the cutoff for an elevated WBC count. You know that a mildly elevated WBC count has a different implication than a highly elevated WBC count of 17 000 cells/μL. Interval likelihood ratios can help attack this question quantitatively.

Table 24.1 represents the distribution of WBC count results among 59 patients with confirmed appendicitis and 145 patients without appendicitis. For each interval, the probabilities for results within the interval were used to calculate an iLR.
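The iLRs in Table 24.1 can be recomputed from the raw counts with a short Python sketch (an illustration, not from the original text); small differences from the table reflect the rounding used there:

    # Patients in each WBC interval (thousands of cells per microliter).
    with_disease = {"4-7": 1, "7-9": 9, "9-11": 4, "11-13": 22,
                    "13-15": 6, "15-17": 8, "17-19": 4, "19-22": 5}
    without_disease = {"4-7": 30, "7-9": 42, "9-11": 35, "11-13": 19,
                       "13-15": 9, "15-17": 7, "17-19": 3, "19-22": 0}

    total_d = sum(with_disease.values())      # 59 with appendicitis
    total_nd = sum(without_disease.values())  # 145 without appendicitis

    for interval, count in with_disease.items():
        p_d = count / total_d                        # fraction of D+ in interval
        p_nd = without_disease[interval] / total_nd  # fraction of D- in interval
        ilr = p_d / p_nd if p_nd else float("inf")
        print(f"WBC {interval}k: iLR = {ilr:.2f}")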


Note that in this study the interval likelihood ratio is lower for the third interval (9k–11k) than for the second interval (7k–9k), and similarly for the intervals 11k–13k and 13k–15k. The 95% CIs overlap in each case and include the point estimate of the other group's iLR; therefore, the iLR differences found for these intervals are not statistically different. This is the result of the small sample size in this study and probably represents a Type II error. These iLR values would more likely be in line, and show a positive dose–response relationship, if there were more patients. But the inconsistency of these results points up the need for more research to be done in this area.

Ideally, a 95% CI should always be given for each LR. This allows the reader to determine the statistical significance of the results. In initial studies, researchers often "data dredge" by using several different cutoff points to see which gives the best LR or iLR and which are statistically significant. These results must be verified in a second study on a different population, called a validation study.

Given this girl's symptoms and physical findings, we estimate that her pretest probability of appendicitis before obtaining the results of the WBC count is about 0.50. This says that we're not sure and it is a toss-up. What is the probability of appendicitis if our patient had a WBC count of 10 200? We will demonstrate how to determine this using Bayes' theorem.

Start with the pretest probability of 50% and calculate the odds: 0.5/(1 − 0.5). Pretest odds (appendicitis) = 0.5/0.5 = 1, and iLR = 0.29. Therefore, post-test odds (appendicitis) = 1 × 0.29 = 0.29, and post-test probability (appendicitis) = 0.29/1.29 = 0.22. This is less than before, but not low enough to rule out the diagnosis. We must therefore decide either to do another test or to observe the patient.

What happens if her white cell count is 7 500 (iLR = 0.52)? The pretest odds are unchanged, and the post-test odds (appendicitis) = 1 × 0.52 = 0.52. Post-test probability (appendicitis) = 0.52/1.52 = 0.34, leading to the same problem as with a white-cell count of 10 200.

What if her white-cell count is 17 500 (iLR = 3.5)? Again, the pretest odds are unchanged, and the post-test odds (appendicitis) = 1 × 3.5 = 3.5. Post-test probability (appendicitis) = 3.5/4.5 = 0.78. This is much higher, but far from good enough to immediately treat her for the suspected disease. In this case, treatment requires an operation on the appendix. This is major surgery, and although pretty safe in this day and age, it is still more risky than not operating if the patient does not have appendicitis. Most surgeons want the probability of appendicitis to be over 85% before they will operate on the patient. This is called the treatment threshold.
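All three scenarios can be checked with the same Bayes' theorem function sketched earlier (an illustration, not from the original text):

    def post_test_probability(pretest_probability, lr):
        """Post-test odds = pretest odds x LR, converted back to a probability."""
        odds = pretest_probability / (1 - pretest_probability)
        post_odds = odds * lr
        return post_odds / (1 + post_odds)

    # Pretest probability of 0.50 and the interval LRs from Table 24.1.
    for wbc, ilr in [("10 200", 0.29), ("7 500", 0.52), ("17 500", 3.5)]:
        print(f"WBC {wbc}: {post_test_probability(0.50, ilr):.2f}")
    # 0.22, 0.34, and 0.78: none of these crosses the 85% treatment threshold.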

Therefore, even with the white cell count this high, we have not crossed the treatment threshold of 85%. This value was adopted based upon previous studies and prevailing surgical practice, when it was considered important to have a negative operative rate of 15% in order to prevent missing appendicitis and risking rupture of the appendix. Therefore, if the probability of appendicitis is greater than 0.85, the patient should be operated upon.

Fig. 24.11 The 2 × 2 table for the use of a white blood cell count of greater than 9000 as a cutoff for diagnosing appendicitis. Data from S. Dueholm, P. Bagi & M. Bud. Dis. Colon Rectum 1989; 32: 855–859.

                      D+    D−
WBC > 9K    T+        49    73
WBC < 9K    T−        10    72
Totals                59    145

Fig. 24.12 The 2 × 2 table to calculate the post-test probability of a urinary tract infection using the dipstick results on urine testing for UTI. Data from K. N. Shaw, D. Hexter, K. L. McGowan & J. S. Schwartz. J. Pediatr. 1991; 118: 733–736.

          D+    D−
T+        36    18     54
T−        54    892    946
          90    910    1000

LR+ = (36/90)/(18/910) = 20
LR− = (54/90)/(892/910) = 0.61

Let's see what will happen if we lump the test results together and consider a white blood cell count of 9 000 as the upper limit of normal. Now use likelihood ratios to calculate predictive values and apply them to a population with a prevalence of 50%. For the original study patients, LR+ = 1.66 and LR− = 0.34 (Fig. 24.11). For the patient in our example, post-test odds = 1 × 1.66 = 1.66, and the post-test probability = 1.66/2.66 = 0.62. This is slightly different from the results using the interval likelihood ratio, but is still below the treatment threshold.

For the study on the use of urine-dipstick testing for UTI, which we discussed earlier in this chapter, the 2 × 2 table is shown in Fig. 24.12. In the original study, the prevalence was 0.09. Using the 2 × 2 table allows you to visualize the number of patients in each cell and gives an idea of the usefulness of the test. The probability of disease if a positive test occurs is 36/54 = 0.67, and the probability of disease if the test is negative is 54/946 = 0.057. These are very similar to the values calculated using the LRs. Remember, for our population we used a prevalence of 10% (not 9%).


25

Comparing tests and using ROC curves

His work's a man's, of course, from sun to sun, But he works when he works as hard as I do – Though there's small profit in comparisons. (Women and men will make them all the same.)
Robert Frost (1874–1963): A Servant to Servants

Learning objectives

In this chapter you will learn:
the dynamic relationship between sensitivity and specificity
how to construct and interpret an ROC curve for a diagnostic test
Analysis of diagnostic test performance using ROC curves<br />

ROC is an acronym for Receiver Operating Characteristics. It is a concept that<br />

originated during the early days of World War II when radar was a newly developed<br />

technology. The radar operators had to learn to distinguish true signals,<br />

approaching enemy planes, from noise, usually flocks of birds like geese or<br />

clouds. The ROC curve let them decide which signals were most likely to be<br />

which. In medicine, an ROC curve tells you which test has the best ability to differentiate<br />

healthy people from ill ones.<br />

The ROC curve plots sensitivity against specificity. The convention has been to plot the sensitivity, the true positive rate, against 1 − specificity, the false positive rate. This ratio looks like the likelihood ratio, doesn't it? The ROC curve for a particular diagnostic test tells which cutoff point maximizes sensitivity, specificity, or both. ROC curves for two tests can also tell you which test is best.

By convention, when drawing ROC curves the x-axis is the false positive rate,<br />

1 – specificity, going from 0 to 1 or 0% to 100%, and the y- axis is the sensitivity<br />

or true positive rate, also going from 0 to 1 or 0% to 100%. The best cutoff point<br />

for making a diagnosis using a particular test would be the point closest to the<br />

(0,1) point, the point at which there is perfect sensitivity and specificity. It is by<br />

276


Table 25.1. Sensitivity and specificity for each cutoff point of WBC count in appendicitis

WBC/μL     Sensitivity (95% CI)   Specificity (95% CI)   1 − specificity (95% CI)
>4000      100 (95–100)           0 (0–3)                100 (97–100)
>7000      98 (91–100)            21 (15–29)             79 (71–85)
>9000      83 (71–92)             50 (42–59)             50 (41–58)
>11000     76 (63–86)             74 (62–84)             26 (16–38)
>13000     39 (27–53)             87 (73–98)             13 (2–27)
>15000     29 (18–47)             93 (78–100)            7 (0–22)
>17000     15 (7–27)              98 (80–100)            2 (0–20)
>19000     6 (3–19)               100 (85–100)           0 (0–15)

Source: From S. Dueholm, P. Bagi & M. Bud. Dis. Colon Rectum 1989; 32: 855–859.

Look at the data from the study about the usefulness of the white blood cell count in the diagnosis of appendicitis in the example of the girl with right-lower-quadrant pain (Table 25.1) and draw the ROC curve for the results (Fig. 25.1). The sensitivity and specificity were calculated for each cutoff point as a different dichotomous value. This creates a curve of the sensitivity and specificity for the different cutoff points of the white blood cell count in diagnosing appendicitis.
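The construction just described is mechanical enough to automate. The sketch below (my own illustration, not part of the original text) encodes the rows of Table 25.1 as ROC points and picks the cutoff whose (FPR, TPR) point lies closest to the perfect (0,1) corner:

    # (cutoff in WBC/uL, sensitivity, specificity) from Table 25.1,
    # converted from percentages to proportions.
    points = [
        (4000, 1.00, 0.00), (7000, 0.98, 0.21), (9000, 0.83, 0.50),
        (11000, 0.76, 0.74), (13000, 0.39, 0.87), (15000, 0.29, 0.93),
        (17000, 0.15, 0.98), (19000, 0.06, 1.00),
    ]

    def distance_to_ideal(sens, spec):
        # Euclidean distance from (FPR, TPR) to the perfect point (0, 1).
        fpr, tpr = 1 - spec, sens
        return (fpr ** 2 + (1 - tpr) ** 2) ** 0.5

    best = min(points, key=lambda p: distance_to_ideal(p[1], p[2]))
    print(best)  # (11000, 0.76, 0.74): the cutoff nearest (0,1)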

Comparing diagnostic tests

ROC curves can help determine which of two tests is better for a given purpose. First, examine the ROC curves for the two tests. Is one clearly better, by virtue of being closer to the upper left corner, than the other? For the hypothetical tests A and B depicted in Fig. 25.2(a), it is clear that test A outperforms test B over the entire range of laboratory values. This means that for any given cutoff point, the sensitivity and specificity of test A will always be better than at the corresponding point of test B.

Tests can also be compared even if their ROC curves overlap. This is illustrated in Fig. 25.2(b), where the curves for tests C and D overlap. One option is to choose a single cutoff value at the point closest to the (0,1) point on the graph, which will always be the best single cutoff point for making the diagnosis.

Another approach uses the concept of the area under the curve (AUC).


Fig. 25.1 ROC curve for white blood cell count in appendicitis, based on data in Table 25.1. The x-axis is the false positive rate (1 − specificity) and the y-axis is the true positive rate (sensitivity), each running from 0.0 to 1.0.

Fig. 25.2 ROC curves of four hypothetical tests A, B, C, and D, plotted as sensitivity against 1 − specificity: panel (a) shows tests A and B; panel (b) shows tests C and D.


A test whose ROC curve is the diagonal from the upper right (point (1,1)) to the lower left (point (0,0)) is a worthless test. At any given point, its sensitivity and false positive rate are equal, making diagnosis using this test a coin toss at every cutoff point. Think of the likelihood ratio as being one for every point on that curve. The AUC for this curve is 0.5, the same as flipping a coin. Similarly, the gold-standard test is perfect and has an AUC of one (1.0). Ideally, look for an AUC that is as close to one as possible.

ROC curves that lie close to this imaginary diagonal line represent poor tests; for these tests, the AUC is only slightly greater than 0.5. Obviously, ROC curves that fall under the line are such poor tests that they are worse than flipping a coin. The AUC can be used to statistically compare the areas under two ROC curves.

The AUC has an understandable meaning. It answers the "two-alternative forced-choice (2AFC) problem." This means that "given a normal patient chosen at random from the universe of normal patients, and an abnormal patient, again chosen at random, from the universe of abnormal patients, the AUC describes the probability that one can identify the abnormal patient using this test alone."¹

There are several ways to measure the AUC of an ROC curve. The simplest is to count the blocks and calculate the percentage under the curve (the medical-student level). A slightly more complex method is to calculate the trapezoidal area under the curve by approximating each segment as a regular geometric figure (the high-school-geometry level). The most complex way is to use the technique known as the "smoothed area using maximum likelihood estimation techniques," which is done using a computer. There are programs written to calculate these areas under the curve.
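The trapezoidal method is simple enough to do in a few lines of code. This sketch is my own illustration, assuming the ROC points are sorted by false positive rate and anchored at (0,0) and (1,1); applied to the CAGE data discussed below, it reproduces an AUC of about 0.89.

    def trapezoidal_auc(roc_points):
        # roc_points: (fpr, tpr) pairs sorted by fpr,
        # including the anchor points (0, 0) and (1, 1).
        auc = 0.0
        for (x1, y1), (x2, y2) in zip(roc_points, roc_points[1:]):
            auc += (x2 - x1) * (y1 + y2) / 2  # area of one trapezoid
        return auc

    # CAGE cutoffs from Table 25.2, with the conventional endpoints added:
    cage = [(0.0, 0.0), (0.00, 0.19), (0.02, 0.44),
            (0.09, 0.73), (0.19, 0.89), (1.0, 1.0)]
    print(round(trapezoidal_auc(cage), 2))  # prints 0.89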

A study looked at the usefulness of the CAGE questionnaire as a screening and diagnostic tool for identifying alcoholism among adult patients in the outpatient medical practice of a university teaching hospital. In this population, the sensitivity of an affirmative answer to one or more of the CAGE questions (Table 25.2) was about 0.9 and the specificity was about 0.8. Although one could consider the CAGE "positive" if a patient answers one or more questions in the affirmative, in reality the CAGE is more "positive" the more affirmative answers are given on the four component questions. In this test, each affirmative answer is given one point, making a total score from zero to four.

Have you ever felt you should Cut down on your drinking?
Have people Annoyed you by criticizing your drinking?
Have you ever felt bad or Guilty about your drinking?
Have you ever had a drink first thing in the morning to steady your nerves or get rid of a hangover (Eye-opener)?

1 Michigan State University, Department of Internal Medicine. Power Reading: Critical Appraisal of the Medical Literature. Lansing, MI: Michigan State University, 1995.


Table 25.2. Results of CAGE questions using different cutoffs

Number of questions        Alcoholic   Non-alcoholic   Sensitivity   1 − specificity
answered affirmatively     (test +)    (test −)        (TPR)         (FPR)
>3                         56/294      526/527         0.19          0.00
>2                         130/294     516/527         0.44          0.02
>1                         216/294     482/527         0.73          0.09
>0                         261/294     428/527         0.89          0.19

Source: Data from D. G. Buchsbaum, R. G. Buchanan, R. M. Centor, S. H. Schnoll & M. J. Lawton. Screening for alcohol abuse using CAGE scores and likelihood ratios. Ann. Intern. Med. 1991; 115: 774–777.

Fig. 25.3 ROC curve of CAGE question data from Table 25.2, plotted as true positive rate (TPR = sensitivity) against false positive rate (FPR = 1 − specificity), with points labeled by cutoff (>3, >2, >1, >0).

One has the choice of considering the CAGE questionnaire "positive" if the patient answers all four, three or more, two or more, or one or more of the component questions in the affirmative. Moving from a more stringent to a less stringent cutoff tends to sacrifice specificity (1 − FPR) for sensitivity (TPR).

By convention, ROC curves start at the point FPR = 0 and TPR = 0. Here the CAGE is considered always negative regardless of the patient's answers: there are no false positives, but no alcoholics are detected. ROC curves end at the point FPR = 1.0 and TPR = 1.0. Here the CAGE is considered always positive regardless of the patient's answers: the test has perfect sensitivity, but all non-alcoholics are falsely identified as positives.

When the ROC curve is plotted, the area under it is 0.89 units with a standard error of 0.13 units, so we would expect a randomly selected alcoholic patient from the sample population to have a higher CAGE score than a randomly selected non-alcoholic patient about 89% of the time. Computers can be used to compare ROC curves by calculating the AUCs and determining the statistical variance of the results. Another study of the CAGE questionnaire was done by Mayfield² on psychiatric inpatients, whereas Buchsbaum's study (Table 25.2 and Fig. 25.3) used general-medicine outpatients. The Mayfield study had an AUC of 0.9 with a standard error of 0.17. By statistical test, these two study results are not statistically different, validating the result.
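For AUCs estimated from two independent samples, one common approach is a z-test on the difference between the areas using their standard errors. The book does not say which test was used, so the following is only an illustrative sketch of my own under that assumption:

    import math

    def compare_independent_aucs(auc1, se1, auc2, se2):
        # z-statistic for the difference between two AUCs from
        # independent samples (an assumed, commonly used approach).
        return (auc1 - auc2) / math.sqrt(se1 ** 2 + se2 ** 2)

    # Buchsbaum outpatients (0.89, SE 0.13) vs. Mayfield inpatients (0.90, SE 0.17):
    z = compare_independent_aucs(0.89, 0.13, 0.90, 0.17)
    print(round(z, 2))  # about -0.05, far from any conventional significance level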

2 D. Mayfield, G. McLeod & P. Hall. The CAGE questionnaire: validation of a new alcoholism screening instrument. Am. J. Psychiatry 1974; 131: 1121–1123.


26

Incremental gain and the threshold approach to diagnostic testing

Science is the great antidote to the poison of enthusiasm and superstition.
Adam Smith (1723–1790): The Wealth of Nations, 1776

Learning objectives

In this chapter you will learn:
how to calculate and interpret the incremental diagnostic gain for a given clinical test result
the concept of threshold values for testing and treating
the use of multiple tests and the effect of independent and dependent tests on predictive values
how predictive values help make diagnostic decisions in medicine and how to use predictive values to choose the appropriate test for a given purpose
how to apply basic test characteristics to solve a clinical diagnostic problem

Revising probabilities with sensitivity and specificity

Remember the child from Chapter 20 with the sore throat? Let's revisit our differential diagnosis list (Table 26.1). Since strep and viruses are the only strong contenders on this list, it would be hoped that a negative strep test would mean that the likelihood of a virus as the cause of the sore throat is high enough to defer antibiotic treatment for this child. One would only need to rule out strep to do this. Therefore, it makes sense to do a rapid strep test. It comes up positive. Looking up the sensitivity and specificity of this test shows that they are 0.9 and 0.9, respectively. The pretest probability is 0.5 (50%). There are two ways to solve this problem: using likelihood ratios, or using sensitivity and specificity to get the predictive values.


Table 26.1. Pretest probability: sore throat

Streptococcal infection    50%
Viruses                    75%
Mononucleosis              5%
Epiglottitis


Fig. 26.3 Results of calculating the values of the 2 × 2 table for a population of 1000 patients (N) and a clinical prevalence of strep throat infection that is low, or 0.1 (100 out of 1000). Calculations for PPV and NPV are shown.

         D+     D−
T+       90     90      180
T−       10     810     820
         100    900     1000

PPV = 90/180 = 0.5
NPV = 810/820 = 0.988

At a pretest probability of 50%, the PPV is 0.9 and the NPV is also 0.9. Therefore, with a positive test result, it is reasonable to accept this diagnosis, realizing that one out of every 10 children treated with antibiotics will have been treated unnecessarily and would not actually have strep throat. But the cost of that is low enough that it is reasonable not to worry. This is based on the risks of antibiotic treatment: rare allergy to antibiotics and occasional gastrointestinal discomfort and diarrhea. This balances against the benefit of treatment: a 1-day shorter course of symptoms and some decrease in the very rare sequelae of strep infection, tonsillar abscess and acute rheumatic fever.

Similarly, if the test had come up negative, the likelihood of strep is extremely low, and one could accept that 10%, or one out of every 10 children, would be falsely reassured when they could have been treated with antibiotics for this type of sore throat. However, looking at the risks of not treating the patient, one realizes that in this case they are also small. Rheumatic fever, once a common complication of strep throat, is now extremely rare, with much less than 1% of strep infections leading to it, and the rate is even lower in most populations.

Bacterial resistance from overuse of antibiotics is the only other problem left, and for now it is reasonable to decide that this will not deter writing a prescription for antibiotics. The decision on when to treat in order to decrease overuse of antibiotics would be deferred to a high-level government policy panel; in the meantime, we vow to try to use antibiotics only when reasonably indicated, for a positive strep test and not for things like a common cold. This simple decision-making process will do until a blue-ribbon panel looks at all the evidence and makes a clinical guideline, algorithm, or practice guideline on when to treat and when to test for strep throat.

If the pretest probability of strep based upon signs and symptoms were much lower, say 10%, this equation changes (Fig. 26.3). Use the likelihood ratios to get the same results by starting with the pretest probability of disease, which is now 10%. The pretest odds are 0.11, and applying Bayes' theorem for a positive test with LR+ = 9 gives post-test odds of 9 × 0.11 = 0.99. This makes the post-test probability 0.99/1.99 = 0.497, which is the positive predictive value, pretty close to the 0.5 obtained using the 2 × 2 table. Similarly, for a negative test, using LR− = 0.11, the post-test odds are 0.11 × 0.11 = 0.0121, giving a post-test probability of 0.012. This is equivalent to the false reassurance rate, or FRR, and the negative predictive value (1 − FRR) is 0.988.
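The same numbers can be verified by building the 2 × 2 table directly from the prevalence, sensitivity, and specificity, as was done for Fig. 26.3. Here is a brief sketch of my own for illustration:

    def two_by_two(prevalence, sens, spec, n=1000):
        # Return (TP, FP, FN, TN) counts for a population of n patients.
        diseased = prevalence * n
        healthy = n - diseased
        tp, fn = sens * diseased, (1 - sens) * diseased
        tn, fp = spec * healthy, (1 - spec) * healthy
        return tp, fp, fn, tn

    tp, fp, fn, tn = two_by_two(0.10, 0.90, 0.90)
    print(tp / (tp + fp))  # PPV = 90/180 = 0.5
    print(tn / (tn + fn))  # NPV = 810/820, about 0.988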

The PPV for a positive test is now 50%. With the patient as a partner in shared decision making, it is now reasonable to decide that, since 1 day less of symptoms is the major benefit of antibiotics, it is not worth the excess antibiotic use to treat one child without strep throat for every one with strep throat, and it is reasonable to withhold treatment. In the case of a pretest probability of 10%, it is then reasonable to decide not to do the test in the first place. If practicing in a community with a high incidence of acute rheumatic fever after strep throat infections, it may still be reasonable to test, since that could make it worthwhile to treat all the positives to prevent this more serious sequela even though one would over-treat half of the children. Over-treating one child for every one correctly treated is a small price to pay for the prevention of a disease as serious as acute rheumatic fever, which will leave its victims with permanent heart deformities.

Incremental gain

Incremental gain is the expected increase in diagnostic certainty after the application of a diagnostic test. It is the change from the pretest estimate of a given diagnosis. Mathematically, for a positive test it is PPV − P, the positive predictive value minus the pretest probability. For a negative test, the incremental gain is NPV − (1 − P): begin with the probability of no disease (1 − P) and go up to the NPV. The difference simply tells how much the test will change the probability of disease, or how much "bang for your buck" occurs when using a particular diagnostic test. This is one measure of the usefulness of a diagnostic test. By convention, use absolute values so that all the incremental gains are positive numbers; they are all improvements on the previous level of probability.

For a given range of pretest probabilities, what is the diagnostic gain from doing the test? Using the example of strep throat in a child and beginning with a pretest probability of 50%, after doing the test the new probability of disease was 90%. This represents an incremental gain of 40% (90 − 50). For a negative test, the incremental gain would also be 40%, since the initial probability of no disease was 50% and the post-test probability of no disease was 90% (90 − 50). Now do the same calculations for a patient with a higher pretest probability of disease, but in whom there is still some uncertainty of strep on clinical grounds: say the pretest probability is estimated to be between a coin toss (50%) and certainty (100%), so put it at about 75%. How would that change the incremental gain? Figure 26.4 shows the 2 × 2 table and the calculations based on predictive values.


Fig. 26.4 Results of calculating the values of the 2 × 2 table for a population of 1000 patients (N) and a clinical prevalence of strep throat infection that is moderately high, or 0.75 (750 out of 1000). Calculations for PPV, NPV, FAR, and FRR are shown.

         D+     D−
T+       675    25      700
T−       75     225     300
         750    250     1000

PPV = 675/700 = 0.964
NPV = 225/300 = 0.750
FAR = 1 − PPV = 0.036
FRR = 1 − NPV = 0.250

Fig. 26.5 Results of calculating the values of the 2 × 2 table for a population of 1000 patients (N) and a clinical prevalence of strep throat infection that is very high, or 0.9 (900 out of 1000). Calculations for PPV, NPV, FAR, and FRR are shown.

         D+     D−
T+       810    10      820
T−       90     90      180
         900    100     1000

PPV = 810/820 = 0.988
NPV = 90/180 = 0.50
FAR = 1 − PPV = 0.012
FRR = 1 − NPV = 0.50

Using likelihood ratios, we start with the pretest probability of disease, which is now 75%, making the pretest odds equal to 3. The post-test odds for a positive test using LR+ are 9 × 3 = 27, making the post-test probability 27/28 = 0.964. Similarly, for a negative test, use LR− to calculate the post-test odds: 0.11 × 3 = 0.33. Therefore, the post-test probability if the test is negative is the FRR, which is 0.25, and the negative predictive value, 1 − FRR, is 0.75.

Now the post-test probability of disease is more certain (96.4%), but if the test is negative it will be wrong more often (25%). The incremental gain is now only 21.4% for a positive test (96.4 − 75) and up to 50% for a negative test (75 − 25).

Now do the same for a pretest probability of 90%, which represents near certainty based on signs and symptoms (Fig. 26.5). Using likelihood ratios, the pretest probability of disease is now 90%, so the pretest odds are 9; multiply that by the likelihood ratio LR+ to get the post-test odds for a positive test. This is 9 × 9 = 81. The post-test probability is therefore 81/82 = 0.988. For a negative test, the post-test odds are calculated using LR−: 0.11 × 9 = 1, so the post-test probability, the FRR, is 0.5 and the negative predictive value is 1 − FRR = 0.5.

The incremental gains are now:
Positive test: 98.8 − 90 = 8.8
Negative test: 50 − 10 = 40
So little (8.8%) is gained if the test is positive and a lot (40%) is gained if the test is negative. In order to avoid the false negatives, it would probably be best to choose not to do the test if one was this certain and gave such a high pretest probability that the child had strep throat.


Table 26.2. Incremental gains for rapid strep throat tests

Pretest probability   Incremental gain T+   FN        Incremental gain T−   FP
10%                   40 (10 to 50)         10/1000   8.8 (90 to 98.8)      90/1000
50%                   40 (50 to 90)         50/1000   40 (50 to 90)         50/1000
75%                   21.4 (75 to 96.4)     75/1000   50 (25 to 75)         25/1000
90%                   8.8 (90 to 98.8)      90/1000   40 (10 to 50)         10/1000

Putting all of these results into a table (Table 26.2) makes it easy to compare them.

In general, the greatest incremental gain occurs when the pretest probability is in an intermediate range, usually between 20% and 70%. Notice also that as the pretest probability increased, the number of false negatives increased and the number of false positives decreased. The opposite happens when the pretest probability is very low: there will be an increased number of false positives and a lower number of false negatives. This last situation occurs when working with a screening test.
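The whole of Table 26.2 can be generated the same way. Here is a short loop, again my own sketch, reusing the two_by_two helper above with a sensitivity and specificity of 0.9:

    for p in (0.10, 0.50, 0.75, 0.90):
        tp, fp, fn, tn = two_by_two(p, 0.90, 0.90)
        ppv, npv = tp / (tp + fp), tn / (tn + fn)
        gain_pos = abs(ppv - p)          # incremental gain for a positive test
        gain_neg = abs(npv - (1 - p))    # incremental gain for a negative test
        print(f"{p:.0%}: T+ gain {gain_pos:.1%} (FN {fn:.0f}/1000), "
              f"T- gain {gain_neg:.1%} (FP {fp:.0f}/1000)")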

The question that must then be asked is: at what level of clinical certainty or pretest probability should a given test be done? This depends on the situation and the test. The use of threshold values can assist the clinician in making this judgment.

Threshold values

Incremental gain tells how much a diagnostic test increases the pretest probability, assigned based upon the history and physical examination and modified by the characteristics of the test and the prevalence of disease in the population from which the patient is drawn. This simply tells the amount of certainty gained by doing the test. One can decide not to do the test if the incremental gain is very small, since very little is gained clinically. The midrange of pretest probabilities yields the highest incremental gain, which is lost at the extremes of the pretest probability range.

Another way to look at the process of deciding whether to do a test is the method of threshold values. In this process, find the probability of disease above which one should treat no matter what, and conversely the level below which one would never treat, and therefore shouldn't even do the test. These are determined using the test characteristics and incremental gain to decide whether it will be worthwhile to do a particular diagnostic test.


Determine threshold values by calculating the PPV and NPV for many different levels of pretest probability. At each step, ask if one would still treat based upon a positive result or would be willing to rule out based on a negative test result. Decision trees can also be used to determine the threshold values; these will be covered in Chapter 30. An alternative method uses a simple balance sheet to approximate the threshold values; an explanation of this can be found in Appendix 6.

In practice, clinicians use their clinical judgment to determine the threshold values for each clinical situation. This is part of the "art of medicine," that part of EBM based upon clinical experience. Clinicians ask themselves "will I gain any additional useful clinical information by doing this test?" If the answer to this question is no, they shouldn't do the test. They already know enough about the patient and should either treat or not treat regardless of the test result, since no useful additional information is gained by performing the test.

The treatment threshold is the value at which the clinician asks "do I know enough about the patient to begin treatment, and would I treat regardless of the results of the test?" If the answer to this question is yes, the test shouldn't be done. This occurs at high values of pretest probability. If a test is done, it ought to be one with high specificity, which can be used to rule in disease. But if a negative test result is obtained, a confirmatory test or the gold-standard test must be done to avoid missing a person with a false negative test. If a test with high specificity only is chosen, a positive test will rule in disease, but there will be too many false negatives, which must be confirmed with a second or gold-standard test.

The testing threshold is the value at which the clinician asks "is the likelihood of disease so low that even if I got a positive test I would still not treat the patient?" If the answer to this question is yes, the test shouldn't be done. This occurs at low values of pretest probability. If a test is done, it ought to be one with high sensitivity, which can be used to rule out disease. But if a positive test result is obtained, a confirmatory test or the gold-standard test must be done to avoid over-treating a person with a false positive test. If a test with high sensitivity only is chosen, a negative test will rule out disease, but there will be too many false positives, which must be confirmed with a second or gold-standard test.

Both of these threshold levels depend not only on the test characteristics, the sensitivity and specificity, and the prevalence of disease, but also on the risks and benefits associated with treatment or non-treatment. The values of probability of disease for the treatment and testing thresholds should be established before doing the test. The clinician selects a pretest probability of disease and determines whether performing the test will result in placing the patient above the treatment threshold or below the testing threshold. If it won't, the test is not worth doing.
be worth doing.<br />

At pretest probabilities above the treatment threshold, testing may produce an unacceptable number of false negatives in spite of a high PPV.


Fig. 26.6 Thresholds for the strep throat example, on a scale of prior probability of disease from 0 to 1: below the testing threshold, no test and no treatment; between the testing and treatment thresholds, test and treat on the basis of the test result; above the treatment threshold, do not test and get on with treatment.

Some patients would be denied the benefits of treatment, perhaps more than would benefit from discovery of the disease and subsequent treatment. The pretest probability of disease is so great that treatment should proceed regardless of the results of the test. This is because if the test result is negative, it is more likely to be a false negative, and one could miss someone with the disease. In that setting one must be ready to do a confirmatory test, possibly the gold-standard test. In other words, one should be more willing to treat someone who does not have the disease and has a false positive test result than to miss treating someone who is a false negative. This may not be true if treatment involves a lot of risk and suffering, such as needing a major operation or taking potentially toxic medication.

At pretest probabilities below the testing threshold, testing would lead to an unacceptable number of false positives, or a high FAR. Patients would be unnecessarily exposed to the side effects of further testing or treatment with very little benefit. The likelihood of disease in someone with a positive test is so small that treatment should not be given even if the test is positive, since it is too likely that a positive test will be a false positive. Again, one must be ready to do a confirmatory test. This approach is summarized in Fig. 26.6.

For the child in our example with a sore throat, the testing threshold is a pretest probability of strep throat of 10%. Below this level, applying the rapid strep antigen test and getting a positive result would still not increase the probability of disease enough to treat the patient, and one can be certain enough that disease is not present that the benefit of treating is extremely small. Similarly, the treatment threshold is a pretest probability of strep throat of 50%. Above this level, applying the rapid strep antigen test and getting a negative result would still not decrease the probability of disease enough to refrain from treating the patient, and one can be certain enough that disease is present that the benefit of treatment is reasonably great. Between these values of pretest probability (from 10% to 50%), do the test first and treat only if the test is positive, since the post-test probability then increases above the treatment threshold. If the test is negative, the post-test probability falls below the testing threshold.

In this example of the child with a sore throat, almost all clinicians agree that if the pretest probability is 90%, as would be present in a child with a severe sore throat, large lymph nodes, pus on the tonsils, bright red tonsils, fever, and no signs of a cold, the child ought to be treated without doing a test. There would still be a likelihood of incorrectly diagnosing about 10% of viral sore throats as strep throats with this estimate of disease. In general, as the probability of disease increases, the absolute number of missed strep throats will increase. In fact, most clinicians agree that if the post-test probability is greater than 50%, the child ought to be treated. This is the treatment threshold.

Similarly, if the probability of strep throat were 10% or less, as in a child with a mild sore throat, slight redness, minimal enlargement of the tonsils, no pus, minimally swollen and non-tender lymph nodes, no fever, and signs of a cold, half of all positives would be false positives and too many children would be over-treated. There won't be much gain from a negative test, since almost all children are negative before we do the test. For a pretest probability of 10%, the PPV (as calculated before) is 50%, which is not above the treatment threshold value of 50%. The addition of the test is not going to help in differentiating the diagnosis of strep throat from that of viral pharyngitis. Therefore one should not do the test if this is the pretest probability of disease. This is the testing threshold.

If the pretest probability is between 10% and 50%, choose to do a test, probably the rapid strep antigen test, which can be done quickly in the office and gives an immediate result. Choose to treat all children with a positive test result. Then decide what to do with a negative test. The options here are not to treat, or to do the gold-standard test on all those children with a negative rapid strep test and a moderately high pretest probability of about 50%. In this case one should do the throat-culture test. It is about five times more expensive and takes 2 days, as opposed to 10 minutes for the rapid strep antigen test. However, there will still be savings, since the gold-standard test is done on fewer than half of the patients: it is avoided in all those with a low pretest probability and a negative rapid test, and in those with a high pretest probability who were treated without any testing.

In the example of strep throat, the "costs" of doing the relatively inexpensive test, of missing a case of the uncommon complications, and of treatment reactions such as allergies and side effects are all relatively low. Therefore the threshold for treatment will be pretty low, as will the threshold for testing.

This method is more important, and becomes more complex, in more serious clinical situations. Consider a patient complaining of shortness of breath. If one suspects a pulmonary embolism, a blood clot in the lungs, should an expensive and potentially dangerous test in which dye is injected into the pulmonary arteries, called a pulmonary angiogram and the gold standard for this disease, be done in order to be certain of the diagnosis? The test itself is very uncomfortable, has serious complications, with about 10% major bleeding at the site of injection, and can cause death in less than 1% of patients.

Should one instead begin treatment based upon history, physical examination, and an "imperfect" diagnostic test, such as a chest CT or ventilation–perfusion lung scan, that came up positive? There are problems with treatment. Treating with anticoagulants, or "blood thinners," causes excess bleeding in an increasing number of patients as time on the drug increases, and the patient will be falsely labeled as having a serious disease, which could affect their future employability and insurability. These are difficult decisions that must be made considering all the options and the patient's values. They are the ultimate combination of medical science and the physician's art.

Finally, 95% confidence intervals should be calculated for all values of likelihood ratios, sensitivity, specificity, and predictive values. The formulas for these are very complex. The best online calculator for doing this can be found on the School of Public Health of the University of British Columbia website at http://spph.ubc.ca/sites/healthcare/files/calc/bayes.html. For very high or low values of sensitivity and specificity (FN or FP less than 5), use the rules for zero numerator to estimate the 95% CI; these are summarized in Chapter 13.
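Although the exact formulas are complex, one widely used approximation is the Wilson score interval. The sketch below is my own illustration (not necessarily the method used by the online calculator above) and computes 95% limits for any proportion, such as a sensitivity of x true positives out of n diseased patients:

    import math

    def wilson_ci(x, n, z=1.96):
        # Wilson score 95% CI for a proportion x/n.
        p_hat = x / n
        denom = 1 + z ** 2 / n
        center = (p_hat + z ** 2 / (2 * n)) / denom
        half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n
                                       + z ** 2 / (4 * n ** 2))
        return center - half, center + half

    # A hypothetical sensitivity of 45/50 = 0.90:
    low, high = wilson_ci(45, 50)
    print(round(low, 2), round(high, 2))  # roughly 0.79 to 0.96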

Multiple tests

The ideal test is capable of separating all normal people from people who have disease, and defines the "gold standard." This test would be 100% sensitive and 100% specific, and therefore would have no false positive or false negative results. Few tests are both this highly sensitive and specific, so it is common practice to use multiple tests in the diagnosis of disease. Using multiple tests to rule in or rule out disease changes the pretest probability for each new test when they are used in combination: each test performed should raise or lower the pretest probability for the next test in the sequence. It is not possible to predict a priori what happens to the probability of disease when multiple tests are used in combination, or whether there are any changes in their operating characteristics when used sequentially.

This occurs because the tests may be dependent upon each other, measuring the same or similar aspects of the disease process. One example is using two different enzyme markers to measure heart-muscle cell damage in a heart attack. Tests are independent of each other if they measure completely different things. An example of this would be cardiac muscle enzymes and a radionuclide scan of the heart muscle. An overview of the effects of using multiple tests is shown in Fig. 26.7.

In many diagnostic situations, multiple tests must be used to determine the final diagnosis. This is required when application of an initial test does not raise the probability of disease above the treatment threshold. If a positive result on the initial test does not increase the post-test probability of disease above the treatment threshold, a second, "confirmatory" test must be done. The expectation in this case is that a positive result on the second test will "clinch" the diagnosis by putting the post-test probability above the treatment threshold.


Fig. 26.7 Using multiple tests: (a) independent tests and (b) dependent tests. Each panel traces the chain from pretest probability to pretest odds, multiplication by the LR+ of the first and second tests, and conversion of the post-test odds back to a post-test probability.
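For independent tests, panel (a) of Fig. 26.7 amounts to multiplying the pretest odds by each test's likelihood ratio in turn. A sketch of that chain (my own illustration; as discussed below, real tests are often not fully independent, in which case this simple multiplication overstates the evidence):

    def serial_post_test_probability(pretest_p, likelihood_ratios):
        # Apply the LRs of a sequence of independent tests to a pretest probability.
        odds = pretest_p / (1 - pretest_p)
        for lr in likelihood_ratios:
            odds *= lr  # each independent test updates the running odds
        return odds / (1 + odds)

    # Two positive, independent tests with LR+ of 4 and 5,
    # starting from a pretest probability of 20%:
    print(round(serial_post_test_probability(0.20, [4, 5]), 2))  # prints 0.83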

If the second test is negative, this leads to more problems. The negative result must be considered in the calculations of post-test probability. If the post-test probability after the negative second test is below the testing threshold, the diagnosis is ruled out. Similarly, if the second test is positive and the post-test probability after the second test is above the treatment threshold, the diagnosis is confirmed. If the second test is negative and the resulting post-test probability is not below the testing threshold, a third test must be done. If that is positive, more testing may still need to be done to resolve the discordant results of the three tests.

A complication in this process of calculating post-test probability is that the two tests may not be independent of each other. If the tests are independent, they measure different things that are related to the same pathophysiological process: they both measure the same process, but by different mechanisms.

Another example of independent tests is in the diagnosis of blood clots in the legs, deep vein thrombosis or DVT. Ultrasound testing takes a picture of the veins and of blood flow through the veins using sound waves and a transducer. The serum level of D-dimer measures the presence of a byproduct of the clotting process. The two tests are complementary and independent. A positive D-dimer test is very non-specific, and a positive test does not confirm the diagnosis of DVT. A subsequent positive ultrasound virtually confirms the diagnosis. The ultrasound is not as sensitive, but it is very specific, and a positive test rules in the disease.


Two tests are dependent if they both measure the same pathophysiological process in more or less the same way. An example would be the release of enzymes from damaged heart-muscle cells in an acute myocardial infarction (AMI). The release of creatine kinase (CK) and troponin I (TropI) both occur through related pathological mechanisms as infarcted myocardial muscle cells break down. Therefore they ought to have about the same characteristics of sensitivity and specificity, and the two tests should give the same or similar results when done consecutively on the same patient. There is a difference in the time course of release of each enzyme: both are released early, but TropI persists for a longer time than CK. This makes the two of them useful tests when monitored over time. If a patient has an increased serum level of CK, the diagnosis of AMI is confirmed. A negative TropI may cast doubt upon the diagnosis, and a positive TropI will confirm the diagnosis.

The use of multiple tests is a more challenging clinical problem than the use of a single test alone. In general, a result that confirms the previous test result is considered confirmatory. A result that does not confirm the previous test result will most often not change the diagnosis immediately, and should only lead to questioning the veracity of the diagnosis; it must then be followed up with another test. If the pretest probability is high and the initial test is negative, the risk of a false negative is usually too great and a confirmatory test must be done. If the pretest probability is low and the initial test is positive, the risk of a false positive is usually too great and a confirmatory test must be done.

If the pretest probability is high, a positive test is confirmatory unless the specificity of that test is very low. If the pretest probability is low, a negative test excludes disease unless the sensitivity of that test is very low. Obviously, if the pretest probability is either very high or very low, the clinician ought to consider not doing the test at all. In the case of very high pretest probability, immediate initiation of treatment without doing the test should be considered, as the pretest probability is probably above the treatment threshold. Similarly, in the case of very low pretest probability, the test ought not to be done in the first place, since the pretest probability is probably below the testing threshold.

Real-life application of these principles

What happens in real life? Can these concepts be used clinically? It is relatively easy to learn to do the calculations necessary to determine post-test probability. However, in the clinical situation, "in the trenches," this is often not very helpful. Almost all clinicians will most often do what they always do and have been taught to do in a particular clinical situation, when it is similar to other clinical encounters they have had in the past. Those actions should be based on these same principles of rational decision making, but are learned through training and continuing education. However, in difficult cases one will sometimes need to think about these concepts and go through the process of applying diagnostic test characteristics and Bayes' theorem to one's patient. There are some general rules that ought to be followed when using diagnostic tests.

If the pretest probability of a diagnosis is high and the test result is positive, there should be no question but to treat the patient. Similarly, if the pretest probability is low and the test result is negative, there should be no question but not to treat the patient. However, if the suspected disease has a high pretest probability and the test is negative, additional tests must be used to confirm that the patient does not have the disease. If the second test is positive, that should lead to further investigation with additional tests, probably the gold standard, to "break the tie." Similarly, if the disease has a low pretest probability and the test is positive, additional tests must be done to confirm that the patient actually has the disease. If the second test is negative, that should lead to further investigation with additional tests, probably the gold-standard test, to "break the tie."

In patients with a medium pretest probability, it may not be possible for a single test to determine the need to treat, unless that test has a very high positive or very low negative likelihood ratio. In general, go with the results of the test if that result puts the post-test probability over the treatment threshold or under the testing threshold. The higher the LR+ of a positive test (preferably over 10), the more likely it is to put the probability over the treatment threshold. The lower the LR− of a negative test (preferably under 0.1), the more likely it is to put the probability under the testing threshold.


27

Sources of bias and critical appraisal of studies of diagnostic tests

It is a vice to trust all, and equally a vice to trust none.
Seneca (c. 3 BC – AD 65): Letters to Lucilius

Learning objectives

In this chapter you will learn:
the potential biases in studies of diagnostic tests
the elements of critical appraisal of studies of diagnostic tests

Studies of diagnostic tests are unique in their design. Ideally they compare the tests in a sample of patients who have a diagnosis that we are certain is correct. The reader must be aware of potential sources of bias in evaluating these studies.

Overview of studies of diagnostic tests

In order to find bias in studies of diagnostic tests, it is necessary to know what these studies are intended to do. When evaluating studies of a diagnostic test, it is useful to use a structured approach. The first step is to formulate the four-part clinical question in the PICO format. In these cases, the question relates the diagnostic test, the intervention, to the gold standard, the comparison. The patient population is those patients in whom the test would normally be done in a clinical setting, and the target disorder is the disease that one is attempting to diagnose.

A typical PICO question might be framed as follows. A blood clot in the lungs, or a pulmonary embolism (PE), can be diagnosed with the new-generation x-ray computed tomography (CT) scanners of the chest. Is this diagnostic tool as accurate as the gold-standard pulmonary angiogram, obtained by squirting dye into the pulmonary artery and taking an x-ray, and is it better than the old standard test, the ventilation–perfusion (V/Q) scan of the lungs? The clinical question asks: in patients suspected of having a PE (population), does the chest CT (intervention) diagnose PE (outcome as determined by angiogram) better than the V/Q scan (comparison)? This question asks what the sensitivity and specificity of the CT and V/Q scans are relative to the gold-standard test, the angiogram, which is assumed to have perfect sensitivity and specificity.

Studies of diagnostic tests should begin with a representative sample of patients in whom the reasonable and average practitioner would be looking for the target disorder. This may not always be possible, since studies done with different populations may produce different test characteristics, a result which cannot be predicted. Patient selection can easily limit the external validity of the test. In the ideal situation, the patients enrolled in the study are all given the diagnostic test and the gold-standard test without the researchers or the patients knowing the results of either test. The number of correct and incorrect diagnoses can then be computed.

As with any clinical study, there will be sources of bias in studies of diagnostic tests. Some of these are similar to the biases presented in Chapter 8 on sources of bias in research, but others are unique to studies of diagnostic tests. You ought to look for three broad categories of bias when evaluating studies of diagnostic tests: selection bias, observer bias, and miscellaneous biases.

Selection bias

Filter bias

If the patients studied for a particular diagnostic test are selected because they possess a particular characteristic, the operating characteristics found by the study can be skewed. The process of patient selection should be explicit in the study methods, but it is often omitted. Part of the actual clinical diagnostic process is the clinician selecting, or filtering out, those patients who should get a particular diagnostic test and those who don't need it. A clinician who believes that a particular patient does not have the target disorder would not order the test for that disease.

Suspect this form of bias when only a portion of eligible patients are given the test or entered into the study. The process by which patients are screened for testing should be explicitly stated in any study of a diagnostic test, allowing the reader to determine the external validity of the study. Decide for yourself whether a particular patient is actually similar enough to the patients in the study to have the test ordered and to expect results similar to those found in the study.


Using the example of a study of patients with suspected PE, what if only those patients who were strongly suspected of having a PE were enrolled in the study? If there is no clear-cut and reproducible way to determine how they were selected, it would be difficult, if not impossible, to determine how to select patients on whom to do the test. It is possible that an unknown filter was applied to the process of patient selection for the study. Although this filter could be applied in an equitable and non-differential manner, it can still cause bias, since its effect may differ between patients with and without the target disease. This selection process usually makes the test work better than it would in the community setting. The community doctor, not knowing what that filter was, would not know which patients to select for the suggested test and would tend to be less selective about the patients to whom the test is applied.

Spectrum and subgroup bias (case-mix bias)

A test may be more accurate when given to patients with classical forms of a disease. The test may be more likely to identify patients whose disease is more severe or "well-developed," and less likely to accurately identify the disease in patients who present earlier in its course or in whom the disease is occult or not obvious. This can be a reflection of real-life test performance. Most diagnostic tests have very little utility in the general, asymptomatic population, while being very useful in specific clinical situations. Most of that problem is due to the large percentage of false positives that occurs when a very low prevalence population is tested.

There are also cases in which the test characteristics, sensitivity and specificity, increase as the severity of disease increases. Some patients with leaking cerebral aneurysms present with severe headaches. If only a small leak is present, the patient is more likely to present with a severe headache and no neurological deficits; in this case, the CT scan will miss the bleed almost 50% of the time. If there is a massive bleed and the patient is unconscious or has a severe neurologic deficit, the CT is positive in almost 100% of cases. These are the sensitivities of the test in these two situations.

In the 1950s and 1960s, the yearly "executive physical examination," which included many laboratory, x-ray, and other tests, was very popular, especially among corporate executives. The yield of these examinations was very low. In fact, the results were most often normal and, when abnormal, were usually falsely positive. There is a similar phenomenon today, with a proliferation of private CT scanners that are advertised as generalized screening tests for anyone who can pay for them. They are touted as being able to spot asymptomatic disease in early and curable stages, with testimonials given on their usefulness. The correct use of screening tests will be discussed in Chapter 28.


Verification bias

Patients can be selected to receive the gold-standard test based upon the results of the diagnostic test being evaluated. Sometimes those who have negative tests won't all have the gold-standard test done, and some other method is used for evaluating the presence or absence of disease in them. This will usually make the test perform better than it would if the gold standard were done on all patients who would be considered for the test in a real clinical situation. Frequently, patients with negative tests are followed clinically for a certain period of time instead of having the gold-standard test performed on them. This may be appropriate if no patients are lost to follow-up and if the presence of disease results in some measurable change in the patient over the time of follow-up. It cannot be done with silent diseases that become apparent only many years later, unless all of the patients in the study are followed for many years.

Incorporation bias

This occurs if the diagnostic test being studied is used as, or is part of, the gold standard. One common way this happens is that a diagnostic sign of interest becomes a reason that patients are enrolled in the study. This means that the final diagnosis of the disease is dependent on the presence of a positive diagnostic test. Ideally the diagnostic test and the gold standard should be independent of each other, meaning that there is no mechanistic relationship between the diagnostic test and the gold standard.

A classic example of this type of bias occurs in studies of acute myocardial infarction (AMI). One criterion for the diagnosis of AMI is elevation of the creatine kinase enzyme (CK) in the blood as a result of muscle damage from the infarction. Another criterion is characteristic changes on the electrocardiogram. Studies of the usefulness of CK as a serum marker for making the diagnosis of AMI will be flawed if CK is used as part of the definition of AMI. It will be increased in all AMI patients, since it is both the diagnostic test being investigated and part of the reference or gold-standard test. This will make the diagnostic test look better, or more accurate, in the diagnosis of AMI, resulting in higher sensitivity and specificity than it probably has in real-life diagnosis.

In another example, patients with suspected carpal tunnel syndrome have certain common clinical signs of the syndrome, such as tenderness over the carpal tunnel. The presence of this sign gets them into a study looking at the validity and usefulness of common signs of carpal tunnel syndrome, which are important diagnostic criteria in patients referred for specialty care. This bias makes the sign look better than it actually is at making a positive diagnosis, since patients who might not have this sign, and who likely have milder disease, were never referred to the specialist and were therefore excluded from the study.


Observer bias

Absence of a definitive test, or the tarnished gold standard

This is probably the most common problem with studies of diagnostic tests. The gold standard must be reasonably defined. In most cases, no true gold standard exists, and a research study must make do with the best that is available. The authors ought to discuss the lack of a gold standard as part of their results.

For example, patients with abdominal trauma may undergo a CT scan of the abdomen to look for internal organ damage. If the scan is positive, they are admitted to the hospital and may be operated upon. If it is negative, they are discharged and followed for a period of time to make sure a significant injury was not missed. However, if the follow-up time is too short or incomplete, there may be some patients with significant missed injuries who are not discovered, and some may be lost to follow-up. The real gold standard, operating on everyone with abdominal trauma, would be ethically unacceptable.

Review or interpretation bias

Interpretation of a test can be affected by knowledge of the results of other tests or clinical information. This can be prevented if the persons interpreting the test results are blinded to the nature of the patient's other test results or clinical presentation. If this bias is present, the test will appear to work better than it otherwise would in an uncontrolled clinical situation. There are two forms of review bias.

In test review bias, the person interpreting the test has prior knowledge of the patient's outcome or result on the gold-standard test. Therefore, they may be more likely to interpret the test so that it confirms the already known diagnosis. For example, a radiologist reading the myocardial perfusion scan mapping blood flow through the heart of a patient whom they know to have an AMI is more likely to read an equivocal area of the scan as showing no flow and therefore consistent with an MI. This is because the radiologist knows that there is a heart attack in that area that should show up as an area of diminished blood flow to some of the heart muscle. As a result, the radiologist interprets the equivocal sign as definitely showing no flow, a positive test for AMI.

In diagnostic review bias, the person interpreting the gold-standard test knows the result of the diagnostic test. This may change the interpretation of the gold standard and make the diagnostic test look better, since the reviewer will make it concur with the gold standard more often. This will not occur if the gold-standard test is completely objective, by being totally automated with a dichotomous result, or if the interpreter of the gold standard is blinded to the results of the diagnostic test. For example, a patient with a positive ultrasound of the leg veins is diagnosed with deep venous thrombosis, a blood clot in the veins. A radiologist reading the venogram (a dye-assisted x-ray of the veins), which is the gold standard in this case, is more likely to read an equivocal area as one showing blockage, since he or she knows that the diagnostic test showed an area consistent with a clot.

Context bias

This is a common heuristic, or thought pattern. The person interpreting the test will base their reading of the test upon known clinical information. This can be a bias when determining raw test data or in a real-life situation. Radiologists are more likely to read pneumonia on a chest x-ray if they are told that the patient has classic findings of pneumonia such as cough, fever, and localized rales over one part of the lungs on examination. In daily clinical situations, this will make the correlation between clinical data and test results seem better than it may be in a situation in which the radiologist is given no clinical information but is asked only to interpret the x-ray findings.

Miscellaneous sources of bias

Indeterminate and uninterpretable results

Some tests have results that are not always clearly positive or negative, but may be unclear, indeterminate, or uninterpretable. If these are classified as positive or negative, the characteristics of the test will be changed. This makes calculation and manipulation of likelihood ratios or sensitivity and specificity much more complicated, since the categories are no longer dichotomous but have other possible outcomes.

For example, some patients with pulmonary emboli have an indeterminate perfusion–ventilation lung scan showing the distribution of radioactive material in the lung. This means that the results are neither positive nor negative, and the clinician is unsure about how to proceed. Similarly, the CT scan for appendicitis may not show the entire appendix in some patients with the condition. This is more likely to occur if the appendix lies in an unusual location such as the pelvis or retrocecal area. In patients who actually have the disease, if the result is classified as positive, the patient will be correctly classified. If, however, the result is classified as negative, the patient will be incorrectly classified. Again, the need for blinded reading and careful a-priori definitions of a positive and negative test can prevent the errors that go with this type of problem.
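As a purely illustrative sketch (the counts are invented, not drawn from the book), here is how the two classification choices shift the calculated sensitivity and specificity of the same test:

```python
# Hypothetical 2x2 counts plus 10 indeterminate readings, 5 in diseased
# patients and 5 in healthy ones.
tp, fn, fp, tn = 80, 15, 10, 90
ind_diseased, ind_healthy = 5, 5

def sens_spec(tp, fn, fp, tn):
    return tp / (tp + fn), tn / (tn + fp)

# Indeterminates all counted as positive: sensitivity rises, specificity falls.
print(sens_spec(tp + ind_diseased, fn, fp + ind_healthy, tn))
# Indeterminates all counted as negative: the shifts reverse.
print(sens_spec(tp, fn + ind_diseased, fp, tn + ind_healthy))
```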



Reproducibility

The performance of a diagnostic test depends on the skill of the technician and on the equipment used to perform the test. Tests that are operator-dependent are most prone to error because of lack of reproducibility. They may perform very well when carried out in a research setting, but when extrapolated to the community setting, the persons performing them may never rise to the level of expertise required, either because they don't do enough of the tests to become really proficient or because they lack the enthusiasm or interest. For example, CT scans for appendicitis are harder to read than those taken for other GI problems. When tested in a center that was doing research on this use, they performed very well. When extrapolated to the community hospital setting, they did less well. Tests initially studied in one center should be studied in a wide variety of other settings before the results of their operating characteristics are accepted.
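Reproducibility between observers is commonly summarized with an agreement statistic such as Cohen's kappa. The book does not prescribe a particular calculation here, so the following is only an assumed sketch, with invented readings from two hypothetical observers:

```python
# Cohen's kappa: chance-corrected agreement between two observers.
def cohen_kappa(a, b):
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)
    return (observed - expected) / (1 - expected)

reader1 = ["pos", "pos", "neg", "neg", "pos", "neg", "neg", "pos", "neg", "neg"]
reader2 = ["pos", "neg", "neg", "neg", "pos", "neg", "pos", "pos", "neg", "neg"]
print(round(cohen_kappa(reader1, reader2), 2))  # 0.58: only moderate agreement
```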

Post-hoc selection of test positivity criteria

This situation is often seen when a continuous variable is converted to a dichotomous one for purposes of defining the cutoff between normal and abnormal. In studying the test, it is discovered that most patients with the disease being sought have a test value above a certain threshold and most without the disease have a test value below that threshold. There is statistical significance for the difference in disease occurrence in these two groups (P < 0.05). That threshold is therefore selected as the cutoff point.

In some cases, the researchers looked at several cutoff points before deciding on a final one, and some of those cutoffs produced differences that were not statistically significant. This is a form of data dredging and could be classified as a Type I error. A validation study should be done to verify the result, and the results should be given as likelihood ratios rather than simple differences and P values. This problem can be evaluated by using likelihood ratios and sensitivity and specificity, plotting them on the receiver operating characteristic (ROC) curve for the data, rather than using only statistical significance as the defining variable of test performance.
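As a sketch of that approach, the fragment below (with invented test values) walks a grid of candidate cutoffs and reports sensitivity, specificity, the positive likelihood ratio, and the Youden index (sensitivity + specificity - 1), one common way of comparing points along an ROC curve:

```python
# Hypothetical continuous test values for diseased and healthy subjects.
diseased = [8.2, 7.9, 9.5, 6.8, 8.8, 7.4, 9.1, 6.5]
healthy  = [5.1, 6.2, 4.8, 7.0, 5.9, 6.6, 5.5, 4.9]

for cutoff in [5.0, 6.0, 7.0, 8.0]:
    sens = sum(v >= cutoff for v in diseased) / len(diseased)
    spec = sum(v < cutoff for v in healthy) / len(healthy)
    lr_pos = sens / (1 - spec) if spec < 1 else float("inf")
    youden = sens + spec - 1  # ranks cutoffs by overall discrimination
    print(f"cutoff {cutoff}: sens={sens:.2f} spec={spec:.2f} "
          f"LR+={lr_pos:.1f} Youden={youden:.2f}")
```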

Temporal changes

Test characteristics measured at one point in time may change as the test is technically improved. The measures calculated from studies of the newer technology will not apply to the older technology. This is especially true in radiology, where new generations of MRI machines, CT scanners, and other imaging modalities are regularly introduced. The results of a study done with the latest generation of CT scanners may not be seen if your hospital is still using older scanners. Look for this problem in the use of newer biochemical or pathological tests, as well as in questionnaire tests if the questionnaire is constantly being improved. There may also be problems associated with the technological improvement in tests. Newer generations of CT scanners are more likely to deliver higher doses of radiation to the body.

Publication bias

Studies that are positive, that find a statistically significant difference between groups, are more likely to be published than those that find no difference. Consider the possibility that there may be several unpublished negative studies "out there" when deciding to accept the results of studies of a new test. Ideally, diagnostic tests should be studied in a variety of clinical settings and with different mixes of patients.

Words of caution: the manufacturers of a new test want as many physicians as possible to use the test as often as possible, and may sponsor studies that suffer from various of the biases noted above. There is a lot of money to be made in the introduction of a new test, especially if it involves an expensive new technology. For example, a magnetic resonance imaging (MRI) machine costs several million dollars, which must be justified by the performance of lots of scans. These may not be justified based on good objective evidence obtained through well-conducted studies of the technology. As a conscientious physician, you must decide when these expensive technologies are truly useful to your patient. Working with well-done published guidelines and knowing the details of the studies of these new modalities can help to put their use into perspective.

Studies sponsored by the manufacturer of the test being studied are always open to extra scrutiny. Although sponsorship does not automatically make a study bad, authors who have a financial stake in the results often "spin" the results in the most favorable manner. Conversely, a company producing a diagnostic test will resist publication of a negative study, and this may lead to suppression of important medical information.

The ideal study of diagnostic tests

The following is a hypothetical example of an ideal research study of a diagnostic test. The study looks at the use of head CT in predicting the complications of stroke therapy with clot-dissolving medication. Patients who are having a stroke get an immediate head CT. The scan is initially read by a community radiologist who is part of the treating physician group, and not by a neuro-radiologist who specializes in reading head CTs. If in that radiologist's opinion the scan shows any sign of potential bleeding into the brain, that patient is excluded from the study.



This scan is then taken to two neuro-radiologists who are experts in reading head CTs. They read the scan without knowing the nature of the patient's problem or each other's reading of the scan. If they disagree with each other's reading, a third radiologist is called in as a tiebreaker. All patients who are felt to be clinically eligible for the drug are randomized to be given either the drug or placebo. The rate of resolution of symptoms and the percentages of patients who make full recovery, do worse, and die are measured for each group.

The reference standard is the reading of the two blinded neuro-radiologists, or a majority of two in the case of disagreement. This is not perfect, but mirrors the best that could be done in any radiology setting. The outcome should then be judged by a clinician, who would probably be a neurologist in this case and who is also blinded to the results of the CT and the group to which the patient was randomized. Although not perfect, and no study is, there are adequate safeguards to ensure the validity of the results. The inclusion criteria are specified and the filter by which patients are chosen is explicit. The biggest problem with this study is that patients who are excluded by the initial reading of the CT may in fact have been eligible for the treatment if a bleed was mistakenly read. However, in a real-life situation this is what would occur, so the results are generalizable to the setting of a community hospital. The group with positive CT scans can be studied separately if all of their CT scans are taken and read by the same panel of neuro-radiologists, who then record their final readings, the gold standard. This will tell us how accurate the community radiologists' reading of a bleed on the CT scans was.

The gold standard is clearly defined and about as good as it gets. The test (CT read by community radiologists) and the gold standard (CT read by neuro-radiology specialists) are independent of each other and read in a blinded manner, since the two groups of radiologists are not communicating with each other. A more perfect gold standard could be another test, such as magnetic resonance imaging (MRI) of the brain, in which case all patients would need to have both the diagnostic test and the gold-standard test. The follow-up period must be made sufficiently long that all possible outcome events are captured. That is not a significant problem here, as all patients can be observed immediately for the outcome. The outcome is being measured by a clinician who is blinded to the results of the gold-standard test and the treatment given to the patient. The only potential problem is that the time needed to get the patient into a CT scan and then an MRI might make the time to getting the medication too long and lead to worse results for the patient.

How to evaluate research studies of diagnostic tests: putting it all together

All practicing physicians will be faced with an ever-increasing number of diagnostic tests that they are able to order. Many of these will have only theoretical promise and may not have been tested very thoroughly in clinical practice. One must be able to critically evaluate the studies of diagnostic tests and determine whether a test is appropriate to use in one's particular clinical setting. The criteria discussed in this chapter are taken with permission from the series called Users' Guides to the Medical Literature, published in JAMA (see Bibliography).

Are the results valid?

(1) Was there an independent, blind comparison with a reference (gold) standard of diagnosis?

Diagnostic test studies measure the degree of association between the predictor variable, or test result, and the outcome, or disease. The presence or absence of the outcome or disease is determined by the result of a reference or gold-standard test. The diagnostic test under study cannot be used to determine the presence or absence of the disease; that would be an example of incorporation bias.

The term "normal" must be sensibly defined, and how this term is arrived at must be specified. This could be done using a Gaussian distribution, percentile rank, risk-factor presence or absence, culturally desirable outcome, diagnostic outcome, or therapeutic outcome, and the choice should be specified. If prolonged follow-up of apparently well patients is used to define the absence of disease, the period of follow-up must be long enough that almost all latent cases of the disease in question will develop to a stage where the disease can be readily identified.

Both the diagnostic test being studied and the gold standard must be applied to the study and control subjects in a standardized and blinded fashion. This should be done following a standardized protocol and using trained observers to improve reliability. Comparing the new test to the gold standard assesses accuracy and validity, while blinding reduces measurement bias. Ideally, the test should be automated and not operator-dependent, multiple measurements should be made, and at least two investigators should be involved: one to apply or interpret the new diagnostic test on the subjects while the second applies or interprets the gold standard.

(2) Was the study test described adequately?

The test results should be easily reproducible or reliable, and easy to interpret with low inter-observer variation. Enough information should be present in the Methods section to perform the diagnostic test, including any special requirements, dosages, precautions, and timing sequences. An estimated cost of performing the test should be given, including reagents, physician or technician time, specialty care, and turn-around time. Long- and short-term side effects and complications associated with the test should be discussed. The test parameters may be very variable in different settings because test reliability varies. For operator-dependent tests, the level of skill of the person performing the test should be noted, and some discussion of how such persons are trained should be included in the description of the study so that the training program can be duplicated.

(3) Was the diagnostic test evaluated in an appropriate spectrum of patients?

In order to reduce sampling bias, the study patients should be adequately described and representative of the population likely to receive the test. The distribution of age, sex, and spectrum of other medical disorders unrelated to the outcome of interest should be representative of the population in whom the test will ultimately be used. The spectrum of disease should be wide enough to represent all the levels of patients for whom the test may be used, and should include early disease, late disease, classic cases, and difficult-to-diagnose cases, those commonly confused with other disorders. If only very classic cases are studied, the diagnostic test may perform better than it would for less characteristic cases, an example of spectrum bias.

Frequently, research studies of diagnostic tests are done at referral centers that see many cases of severe, classic, or unmistakable disease. This may not correlate with the distribution of levels of disease seen in physicians' offices or community hospitals, leading to referral or sampling bias. Investigators testing a new test will often choose a sample of subjects that has a higher-than-average prevalence of disease, which may not represent the prevalence of disease in the general population. If the study is a case–control or retrospective study, typically 50% of the subjects will have disease and 50% will be normal, a ratio that is very unlikely to actually exist in the general population. Physicians tend to order testing in subjects who are less likely to have the disease than those usually studied when the test is developed.

There should be a clear description of the way that people were selected for the test. This means that the reader should be able to clearly understand the selection filter that was used to preselect those people who were eligible for the test. They should be able to determine which patients are in the group most likely to have the disease, as opposed to other patients who have a lower prevalence of the disease and yet might also be eligible for the test. In a case–control study, the control patients should be similar in every way to the diseased subjects except for the presence of disease. This cannot be done using only young healthy volunteers as the study subjects! The cases with the disease should be as much like the controls without the disease in every other way possible. The similarity of study and control subjects increases the possibility that the test is measuring differences due to disease and not to age, sex, general health, or other factors or disease conditions.

(4) Was the reference standard applied regardless of the diagnostic test result?

The choice of a reference gold or diagnostic standard may be very difficult. The diagnostic standard test may be invasive, painful, costly, and possibly even dangerous to the patient, resulting in morbidity and even mortality. Obviously, taking a surgical biopsy is a very good reference standard, but it may involve major surgery for the patient. For that reason, many diseases will require prolonged follow-up of patients suspected of being free of the disease as an acceptable reference standard. How and for how long this follow-up is done will often determine the internal validity of the study. The study should be free of verification bias and other forms of review bias, such as test review and context bias, which can occur during the process of observing patients who are suspected of having or not having the disease. Adequate blinding of observers is the best way to avoid these biases.

(5) Has the utility of the test been determined?

If the test is to be used, or the investigators desire that it be used, as part of a battery or sequence of tests, the contribution of this test to the overall validity of the battery or sequence must be determined. Is the patient better off for having the test done alone or as part of the battery of tests? Is the diagnosis made earlier, the treatment made more effective, or the diagnosis made more cheaply or more safely? These questions should all be answered, especially before we use a new and very expensive or dangerous test. Some of these questions are answered by the magnitude of the results, but there are always logistical questions that must be answered to determine the usefulness of a test in varied clinical situations.

What is the impact of the results?

The study results must be important. This means that the study must determine the likelihood ratios of the test; in most studies this will be done by calculation of the sensitivity and specificity. If these are reasonably good, the next step is deciding to which patients the results can be applied. Confidence intervals for the likelihood ratios should be given as part of the results. Where multiple test cutoff points are possible, an ROC curve should be provided and the best cutoff point determined; all of these points have associated confidence intervals. In any study of a diagnostic test, the initial study should be considered a derivation study and followed by one or more large validation studies. These will determine whether the initial good results were actually true or whether they were just that good by chance alone.
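As an illustration of reporting likelihood ratios with their confidence intervals (the 2 × 2 counts here are hypothetical), the usual log-scale standard error for a ratio of two proportions gives an approximate 95% confidence interval for the positive likelihood ratio:

```python
# Sensitivity, specificity, LR+, and an approximate 95% CI for LR+.
import math

tp, fn, fp, tn = 90, 10, 20, 80      # hypothetical counts vs. the gold standard
sens = tp / (tp + fn)
spec = tn / (tn + fp)
lr_pos = sens / (1 - spec)

# Standard error of ln(LR+) for a ratio of two independent proportions.
se_log = math.sqrt((1 - sens) / tp + spec / fp)
low = math.exp(math.log(lr_pos) - 1.96 * se_log)
high = math.exp(math.log(lr_pos) + 1.96 * se_log)
print(f"sens={sens:.2f} spec={spec:.2f} LR+={lr_pos:.1f} (95% CI {low:.1f}-{high:.1f})")
```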

Can the results be applied to my patients?

Consider the population tested and the patient who is being evaluated. The answer to the question of generalizability, or particularizability, depends on how similar each individual patient is to the study population. You have to ask whether he or she would have been included in the sample being studied. Ideally, the answer to that question ought always to be yes, but sometimes there are reasons for using a particular population. For example, studies done in the Veterans Affairs Hospital System will be mostly of men. This does not automatically disqualify a female patient from having the test done for the target disorder; there ought to be a good physiological reason, based on the results from a study of men, to exclude her from having the test. Perhaps there is a hormonal effect that will alter the results of the test. Each physician must use their best clinical judgment to determine whether the results of the study can be used in a given individual patient. Other factors that might affect the characteristics of the test in a single patient include age and ethnic group.

(1) Is the diagnostic test available, affordable, accurate, and precise in my setting?

How do the capabilities of the lab or diagnostic center that one is working in compare with the one described in the study? This is a function of the type of equipment used and the operator-dependency of the test. Some very sophisticated and complex tests may only be available at referral or research centers and not readily available in the average community hospital setting. The estimated costs of false positive and false negative test results should be addressed, including the cost of repeat testing or further diagnostic procedures for false positive results, and of a missed diagnosis due to false negative results. The cost of the test itself should be given, as well as the cost of following up on false positive tests and of missing some patients with false negative tests. This could include the cost of malpractice insurance and payment of awards in cases of missed disease, which is very complex, since the notion of negligence in missing a diagnosis depends more on one's pretest probability of disease and how one handles the occurrence of a false negative test.

(2) Can I come up with a reasonable pretest probability of disease for my patient?

This was addressed earlier, and although small deviations from the true pretest probability are not important, large variations are. One does not want to be very far off in estimating the prior probability. If the physician estimates that the patient has a 10% probability of disease and the true probability of disease is 90%, this will seriously and adversely decrease the ability to diagnose the problem. Data on pretest probability come from several sources: published studies of symptoms; one's personal experience; the study itself, if the sample is reasonably representative of the population of patients from which one's patient comes; and clinical judgment based on the information gathered in the history and physical exam. If none of these gives a reasonable pretest probability, consider getting some help from an expert colleague or consultant, who will probably be able to help here. Most reasonable and prudent physicians will agree on a ballpark figure, high, medium, or low, for the pretest probability in most patient presentations of illness.



Indication creep is a phenomenon that occurs when a diagnostic test is used in more and more patients who are less and less likely to have the disease being sought. This happens after a test is studied in one group of patients, usually those with more severe or classical disease, and is then extended to patients with a lower pretest probability of disease. As the test gets marketed and put into widespread clinical use, the type of patient who gets the test tends to have a lower and lower pretest probability of disease, and eventually the test is frequently done in patients who have almost zero pretest probability of disease. Physicians are especially cautious to avoid missing anyone with a disease, for fear of being sued for malpractice. However, they must be equally cautious about over-testing those patients with such a low probability of disease that almost all positive tests will be false positives.

(3) Will the post-test probability change my management of this patient?

This is probably the most important question to ask about the usefulness of a diagnostic test, and it will determine whether the test should or should not be done. The first issue is a mathematical one. Will the resulting post-test probability move the probability across the testing or treatment threshold? If not, either do not do the test, or be prepared to do a second or even a third test to confirm the diagnosis.
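The arithmetic behind that question is the odds form of Bayes' theorem: convert the pretest probability to odds, multiply by the likelihood ratio of the observed result, and convert back to a probability. A minimal sketch with assumed values:

```python
# Post-test probability from pretest probability and a likelihood ratio.
def post_test_probability(pretest_p, lr):
    pretest_odds = pretest_p / (1 - pretest_p)
    post_odds = pretest_odds * lr
    return post_odds / (1 + post_odds)

pretest = 0.30        # assumed clinical estimate before testing
print(post_test_probability(pretest, 8.0))  # positive result, LR+ = 8 -> ~0.77
print(post_test_probability(pretest, 0.2))  # negative result, LR- = 0.2 -> ~0.08
# If neither value crosses the testing or treatment threshold, the test
# will not change management and is probably not worth doing.
```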

Next, is the patient interested in having the test done, and are they going to be "part of the team"? If the patient is not a willing partner in the process, it is not a good idea to begin doing the test or tests. Give the information to the patient in a manner they can understand, and then ask them if they want to go through with the testing. They ought to understand the risks of the disease, the risks of correct and incorrect results of testing, and the ramifications of positive and negative test results. Incorporated in this is the question of the ultimate utility of the test. The prostate specific antigen (PSA) test to screen for prostate cancer is a good example, since a positive test must be followed up with a prostate biopsy, which is invasive and potentially dangerous. In some men, a positive PSA test does not mean prostate cancer, but only an enlarged prostate, which could be diagnosed by other means. The decision making for this problem is very complex and should be done through careful consideration of all of the options and the patient's situation, such as age, general health, and the presence of other medical conditions.

Finally, how will a positive or negative result help the patient reach his or her goals for treatment? If the patient has "heartburn" and you no longer suspect a cardiac problem, but suspect gastritis or peptic ulcers, will doing a test for Helicobacter pylori infection as a cause of ulcers, with specific anti-microbial treatment if positive or symptomatic treatment if negative, satisfy the patient that he or she does not have a gastric carcinoma? If not, then endoscopy, the gold standard in this case, ought to be considered without stopping for the intermediate test.

Studies of diagnostic tests should determine the sensitivity and specificity of the test under varying circumstances. The prevalence of disease in the population studied may be very different from that in most clinical practices. Therefore, predictive values reported in the literature should be reserved for validation studies and studies of the use of the test under well-defined clinical conditions. Remember that the predictive value of a test depends not only on the likelihood ratios, but also very directly on the pretest probability of disease.

Final thoughts about diagnostic test studies

It is critical to realize that studies of diagnostic tests done in the past were often done using different methodology than what is now recommended. Many of the studies done years ago only looked for the correlation between a diagnostic test and the final diagnosis. For example, a study of pneumonia might look at all physical examination findings for patients who were subjected to chest x-rays, and determine which correlated most closely with a positive chest x-ray, the gold standard.

There are two problems with these types of studies. First, the patients are selected by inclusion criteria that include getting the test done, here a chest x-ray, which already narrows down the probability that they have the illness. In other words, some selection filter was applied to the population. Second, correlation only tells us that you are more or less likely to find a certain clinical finding with an illness. It does not tell you what the probability of the illness is after application of that finding or test. Correlation does not give the same useful information that you get from likelihood ratios or from sensitivity and specificity, which tell the clinician how strongly certain diagnostic findings are associated with the presence of illness and how to use those clinical findings to determine the presence or absence of disease.


28

Screening tests

Detection is, or ought to be, an exact science, and should be treated in the same cold and unemotional manner. You have attempted to tinge it with romanticism, which produces much the same effect as if you worked a love-story or an elopement into the fifth proposition of Euclid.

Sir Arthur Conan Doyle (1859–1930): The Sign of Four, 1890

Learning objectives

In this chapter you will learn:
- the attributes of a good screening test
- the effects of lead-time and length-time biases and how to recognize them in evaluating a screening test
- how to evaluate the usefulness of a screening test
- how to evaluate studies of screening tests

Introduction

Screening tests are defined as diagnostic tests that are useful in detecting disease in asymptomatic or presymptomatic persons. The goal of all screening tests is to diagnose the disease at a stage when it is more easily curable (Fig. 28.1). This is usually earlier than the symptomatic stage, and is one of the reasons for doing a diagnostic test to screen for disease.

Screening tests must rise to a higher level of utility, since the majority of people being screened derive no benefit from having the test done. Because the vast majority of people who are screened do not have the disease, they get minimal reassurance from a negative test, because their pretest probability of disease was low before the test was even done. However, for many people the psychological relief of having a negative test, especially for something they are really scared of, is a worthwhile positive outcome.


[Fig. 28.1 Disease timeline and diagnosis by screening or diagnostic test: the ideal screening test. The timeline runs from onset of disease through an asymptomatic, reversible phase into a symptomatic phase that becomes irreversible and ends in death. Diagnosis by screening, with treatment, results in a prolonged period of time until death, compared with the usual diagnosis in patients presenting with signs or symptoms of disease.]

There are three rules for diagnostic tests that must be applied even more carefully to screening tests. The first rule is that there is no free lunch: as the sensitivity of a test increases to detect a greater percentage of diseased persons, specificity falls and the number of false positives increases. The second rule is that the prevalence of the disease matters: as the prevalence decreases, the number of false positives increases and the ratio of true positives to false positives decreases. The final rule is that the burden of proof regarding efficacy depends upon the clinical context, which can depend on multiple factors. If the intervention is innocuous and without side effects, screening can be done more liberally than if the intervention is dangerous, high-risk, or toxic. Similarly, if the test or treatment is very expensive, the level of proof of benefit of the screening test must be greater.
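The second rule can be made concrete with a few lines of arithmetic. In this sketch the sensitivity and specificity are assumed, deliberately optimistic values; even so, the positive predictive value collapses as the prevalence falls:

```python
# With sensitivity and specificity fixed, PPV is driven by prevalence.
sens, spec = 0.95, 0.95

for prevalence in [0.10, 0.01, 0.001]:
    true_pos = sens * prevalence
    false_pos = (1 - spec) * (1 - prevalence)
    ppv = true_pos / (true_pos + false_pos)
    print(f"prevalence {prevalence:.3f}: PPV = {ppv:.3f}")
# prevalence 0.100: PPV = 0.679
# prevalence 0.010: PPV = 0.161
# prevalence 0.001: PPV = 0.019
```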

During the 1950s the executive physical examination was used to screen for "all" diseases in corporate executives and other, mostly wealthy, people. It was a comprehensive set of diagnostic tests including multiple x-rays, blood tests, exercise stress tests, and others, usually administered while the patient spent a week in the hospital. It was justified by the thought that finding disease early was good and would lead to improved length and quality of life: the more diseases looked for, the more likely that disease would be found at an earlier phase in its course, and treatment at this early stage would lead to better health outcomes. Subsequent analysis of the data from these extensive examination programs revealed no change in health outcomes as a result of these examinations. There were more people incorrectly labeled with diseases that they didn't have than there were diseases detected early enough to reduce mortality or morbidity. Ironically, most of the diseases that were identified in these programs could have been detected simply from a comprehensive history.

This is occurring again with the advent of full-body CT scans to screen for hidden illness, mostly cancer. In this case most of the positive tests are false positives, and the further testing required to determine whether the test is a false or true positive usually requires invasive testing such as operative biopsy. Finally, it has recently been determined that the radiation exposure from the CT scans could actually cause more disease, specifically cancers, than would be picked up with the screening.

Table 28.1. Criteria for a valid screening test

(1) Burden of suffering: the disease must be relatively common, and the burden of suffering must be sufficiently great.
(2) Early detectability: the disease must be detectable at an early stage, preferably when totally curable.
(3) Accuracy and validity: the test must be accurate and valid; it must reliably pick up disease (few misses) and not falsely label too many healthy people.
(4) Acceptability: the test must be simple, inexpensive, not noxious, and easy to administer. It must be acceptable to the patient and to the health-care system.
(5) Improved outcome: there must be treatment available which, if given at the time that early disease is detected, will result in improved outcome (lower mortality and morbidity) among those patients being screened.

Criteria for screening

There are five criteria that must be fulfilled before a test should be used as a screening test. These are listed in Table 28.1. Following these rules will prevent the abuses of screening tests that occurred in the 1950s and 1960s and which continue today.

The disease must impose a significant burden of suffering on the population to be screened. This means either that the disease is common or that it results in serious or catastrophic disability. This disability may result in loss of productive employment, patient discomfort or dissatisfaction, or passing the disease on to others. It also means that it will cost someone a lot of money to care for persons with the disease. The hope is to reduce this cost, both in human suffering and in dollars, by treating at an earlier stage of disease and preventing complications or early death. This depends on well-designed studies of harm or risk to tell which diseases are likely to be encountered in a significant portion of the population, in order to decide that screening for them is needed.

For example, it would be unreasonable to screen the population of all 20-year-old women for breast cancer with yearly mammography. The risk of disease is so low in this population that even a minuscule risk of increased cancer associated with the radiation from the examination may cause more cancers than the test would detect. Similarly, the prevalence of cancer in this population is so low that the likelihood that a positive test represents cancer is very low, and there will be many more false positives than true positives. Likewise, screening for HIV in an extremely low-risk population would incorrectly label many unaffected people as being HIV-positive, producing false positives. This could cause a lot of psychological trauma and require lots of confirmatory testing in these positives, costing a huge amount of money to find one true case of HIV.

The screening test must be a good one and must accurately detect disease in the population of people who are in the presymptomatic phase of disease; that is, it must have high sensitivity. It should also reliably exclude disease in the population without disease, or have high specificity. Of the two, we want the sensitivity to be perfect or almost perfect, so that we can identify all patients with the disease. We'd like the specificity to be extremely high so that only a few people without disease are mislabeled, leading to a high positive predictive value. This usually means that a reasonable confirmatory test must be available that will more accurately discriminate between those people with a positive screening test who do and don't have the disease. This confirmatory test ought to be very specific and acceptable to most people. It should be relatively comfortable, not very painful, should not cause serious side effects, and should also be reasonably priced.

A screening test may be unacceptable if it produces too many false positives, since those people will be falsely labeled as having the disease, a circumstance that could lead to psychological trauma, anxiety, insurance or employment discrimination, or social conflicts. False labeling has a deleterious effect on most people. Several studies have found significant increases in anxiety that interferes with life activities in persons who were falsely labeled as having disease on a screening test. This is an especially serious issue with genetic tests, in which a positive test does not mean the disease will express itself, but only that a person has the gene for the disease.

There are practical qualities of a good screening test. The cost ought to be low so that it can be done economically on large populations. It should be simple to perform with good accuracy and reliability. And finally, it must be acceptable to the patient. For screening tests, most people will tolerate only a low level of discomfort, either from the test procedure itself or from the paperwork involved in getting the test done. People would much rather have their blood pressure taken to screen for hypertension than have a colonoscopy to look for early signs of colon cancer. Finally, people are more willing to have a test performed to detect disease when they are symptomatic than when they are well.



The mechanics of a screening program must be well planned if the plan is to give a huge number of people a diagnostic test. If the test is too complex, such as screening colonoscopy for colon cancer, many people will not be willing to have it done. A test that is very uncomfortable, such as a digital rectal exam for prostate or rectal cancer, may be refused by a large proportion of patients. Both examples also require more complex logistics, such as individual examining rooms and sedation for the colonoscopy, than a screening test such as blood pressure measurement. Screening tests must also be well advertised so that people will know why and how to have the test done.

Pitfalls in the screening process

Simply diagnosing the disease at an earlier stage is not helpful unless the prognosis is better when treatment is begun at that earlier stage of the illness. The treatment must be acceptable and more effective before people will be willing to accept treatment at an asymptomatic stage of illness. Why should someone take a drug for hypertension if they have no signs or symptoms of the disease, when that drug can cause significant side effects and must be taken for a lifetime?

During the 1960s and 1970s, some lung cancers were detected at an earlier stage by routine screening chest x-rays. However, immediate treatment of these cancers did not result in increased survival, and caused increased patient suffering due to serious side effects of the surgery and chemotherapeutic drugs. Therefore, even though cancers were detected at an earlier stage, mortality was the same.

The validity of a screening test can be determined from the evidence in the literature. Screening tests must balance the need to learn something about a patient, the diagnostic yield, with the ability to actively and effectively intervene in the disease process at an earlier stage.

There are three significant problems in studies of screening tests: lead-time, length-time, and compliance biases. Lead-time bias produces over-optimistic results for the screening test in the clinical study. The patients seem to live longer, but this is only because their disease is detected earlier. In this case, the total time from onset of illness to death is the same in the group of patients who were screened and treated early compared with the unscreened group. The lead time is the time from diagnosis of disease by the screening test to the appearance of symptoms. The time from appearance of symptoms to death is the same whether the disease was detected by the screening test or not, so the total life span of the screened patient is no different from that of the unscreened patient. The time between early diagnosis with the screening test and appearance of symptoms, the lead time, will now be spent undergoing treatment (Fig. 28.2). This could be very uncomfortable due to the side effects of treatment, or even dangerous if treatment can result in serious morbidity or death of the patient.

[Fig. 28.2 Lead-time bias. With no screening, survival time runs from the onset of symptoms to death. With screening but ineffective early treatment, the apparent survival time is longer only because diagnosis is earlier (lead-time bias). With screening and effective early treatment, survival time is genuinely prolonged.]
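A minimal simulation makes the effect visible. In this invented model, screening advances the date of diagnosis by two years but treatment does not change the date of death, so survival measured from diagnosis looks two years longer in the screened group even though every patient dies at exactly the same age:

```python
# Invented lead-time model: earlier diagnosis, identical death dates.
import random

random.seed(0)
onset_age = [random.uniform(50, 70) for _ in range(1000)]

detect_by_screen = 1.0     # years after onset that screening finds the disease
detect_by_symptoms = 3.0   # years after onset that symptoms appear
death_after_onset = 8.0    # early treatment changes nothing in this model

def mean(xs):
    return sum(xs) / len(xs)

dx_screened = [a + detect_by_screen for a in onset_age]
dx_unscreened = [a + detect_by_symptoms for a in onset_age]
death = [a + death_after_onset for a in onset_age]

print(mean([d - x for d, x in zip(death, dx_screened)]))    # 7.0 years "survival"
print(mean([d - x for d, x in zip(death, dx_unscreened)]))  # 5.0 years "survival"
print(mean(death) - mean(onset_age))                        # 8.0 years either way
```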

Length-time bias is much more likely to occur in observational studies. Patients are not randomized, and the spectrum of disease may be very different in the screened group when compared to the unscreened group. A disease that is indolent and slowly progressive is more likely to be detected by screening than one that is rapidly progressive and quickly fatal. Patients with aggressive cancers are more likely to die shortly after their cancer is detected, while those with slow-growing, indolent tumors are more likely to be cured of their disease after screening and will live a long time until they die of other causes. There are some patients whose disease is too early to detect and who will be missed by screening; without screening, their disease will be detected when it becomes symptomatic, which will be at a later stage. Length-time bias is illustrated in Fig. 28.3. This problem can be reduced in large population studies by effective randomization that ensures a similar spectrum of disease in screened and unscreened patients.

[Fig. 28.3 Length-time bias. A rapidly progressive tumor is detected too late for any survival benefit; a slowly progressive tumor is detected while still curable, with a survival benefit; a very slowly growing tumor is missed by screening and the patient is reassured, with no actual survival benefit although survival appears longer.]
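The sampling effect can also be sketched in code. In this invented model, half of all cases are indolent, but because indolent disease spends eight times longer in its detectable preclinical window, a single screening round catches far more indolent than aggressive cases:

```python
# Invented length-time model: one screening round over-samples slow disease.
import random

random.seed(0)
cases = []
for _ in range(10_000):
    indolent = random.random() < 0.5
    window = 8.0 if indolent else 1.0        # years the disease is detectable
    start = random.uniform(0, 100)           # when the window opens
    cases.append((indolent, start, start + window))

screen_time = 50.0
caught = [c for c in cases if c[1] <= screen_time <= c[2]]
print(sum(c[0] for c in caught) / len(caught))  # ~0.89 indolent, not the true 0.50
```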

Compliance bias occurs because, in general, patients who are compliant with therapy do better than those who are not, regardless of the therapy. Compliant patients may have other characteristics, such as being more health-conscious in their lifestyle choices, which lead to better outcomes. Studies of screening tests often compare a group of people who are in a screening program with people in the population who are not in the screening program, usually without randomizing them to either group. Therefore, the screened group is more likely to be composed of people who are more compliant or health-conscious, since they took advantage of the screening test in the first place. This will make it more likely that the screened group will do better, since they may be the healthier patients in general. This bias can be avoided if patients in these studies are randomized before being put through the screening test. One way to test for this bias is to



have two groups of patients, one that is randomized to receive the screening test or not, and another group that has a choice of whether to get screened or not. This was described in Chapter 15 on the randomized clinical trial (RCT).

Effectiveness of screening

Another problem with screening tests revolves around their overall effectiveness. For example, consider the use of mammograms for the early detection of breast cancer. Women aged 50–70 in whom the cancer is detected at an early stage do appear to have better outcomes. The use of mammography for screening younger women (age 40–50) is still controversial. In studies of this group, it made very little difference in ultimate survival whether the woman was screened. Early detection in this population resulted in a large number of false positive tests requiring biopsy and unnecessary worry for the women affected. It also resulted in increased exposure to x-rays among these women and increased the cost of health care for everyone in society.

A convenient concept to use in the calculation of benefit is the number needed to screen to benefit (NNSB). Like the number needed to treat to benefit (NNTB), it is simply 1/ARR, where ARR is the absolute risk reduction, the difference in percentage response between the screened and unscreened groups. The ideal numbers to use here are the percentages of women who die from their cancer in the screened (EER) and unscreened (CER) groups. The NNSB can be used to balance the positive and negative effects of screening. For example, in the case of using mammograms to screen for breast cancer in women at age 40, we can make the spreadsheet as in Table 28.2.



Table 28.2. Screening 40- to 50-year-old women for breast cancer using mammography

                                    Screened    Not screened
Total population                    1000        1000
Positive mammogram                  300         –
Biopsies (invasive procedures)      150         –
New breast cancers                  15          15
Deaths from breast cancer           5–8         7–8

Source: From: D. Eddy. Clinical Decision Making. Sudbury, MA: Jones & Bartlett, 1996.

On the benefit side, there is the prevention of at most three deaths per 1000 women screened. This leads to a large NNSB of 333, meaning that 333 women must be screened to prevent one death from breast cancer.

CER = 8/1000    EER = 5/1000    ARR = (8/1000 − 5/1000) = 3/1000
NNSB = 1/ARR = 1/(3/1000) = 1/0.003 = 333

If the two groups actually had the same number of deaths from breast cancer, about 8 per 1000 in both groups, the NNSB would be infinite and there would be no benefit of screening.
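Written out as code, the same arithmetic from Table 28.2, using the most favorable death counts of 5 versus 8 per 1000, reproduces the NNSB of 333:

```python
# NNSB = 1/ARR, using the death rates from Table 28.2.
cer = 8 / 1000          # deaths from breast cancer, unscreened group
eer = 5 / 1000          # deaths from breast cancer, screened group
arr = cer - eer         # absolute risk reduction
nnsb = 1 / arr
print(f"ARR = {arr:.3f}, NNSB = {nnsb:.0f}")   # ARR = 0.003, NNSB = 333
```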

Typical acceptable NNSBs for currently used screening modalities are in the 100–1000 range. If the test is relatively benign, or treatment is very easy and the expected outcome in the screened population is very good, a much larger NNSB is acceptable. More randomized clinical trials of screening tests are needed to determine acceptable levels of NNSB.

The United States Public Health Service (USPHS) has published a set of criteria for an acceptable screening test. The test must be accurate and able to detect the target condition earlier than without screening, and with sufficient accuracy to avoid producing large numbers of false positive and false negative results. Screening for and treating persons with early disease must be effective, and should improve the likelihood of favorable health outcomes by reducing disease-specific mortality or morbidity compared to treating patients when they present with signs or symptoms of the disease. These criteria come from the USPHS Guide to Clinical Preventive Services, which also contains a compendium of recommendations for the use of the most important screening tests.¹ There are also very effective evidence-based guidelines for screening put out by the Agency for Healthcare Research and Quality.²

¹ US Preventive Services Task Force. Guide to Clinical Preventive Services. 2nd edn. Washington, DC: USPHS, 1996. Available online through the National Library of Medicine's HSTAT service at hstat.nlm.nih.gov.
² Agency for Healthcare Research and Quality. www.ahrq.gov.

Critical appraisal of studies of screening tests³

³ Adapted with permission from the Users' Guides to the Medical Literature, published in JAMA (see Bibliography).

(1) Are the recommendations valid?

(a) Is there evidence from RCTs that earlier intervention works? Most screening strategies are based upon observational studies. Ideally, the intervention should be shown to be effective in a well-done RCT, and the overall screening strategy should also be validated by an RCT. Only when the therapeutic intervention is extremely dramatic, which most are not, is there likely to be no question about its efficacy. A good example of this is screening for hypothyroidism in the newborn: early detection and treatment prevent serious problems in this rare congenital disorder. Observational studies of screening tests are weaker than a well-done RCT. If there is an RCT of the screening modality, it should first be analyzed using the Users' guide for studies of therapy, which is also the guide that would be used to determine the efficacy of prevention.

(b) Were the data identified, selected, and combined in an unbiased fashion? Look for potential confounding factors in the process by which subjects are recruited or identified for inclusion in a study of screening; these can easily produce serious bias, usually from confounding variables. Innate differences between the screened and not-screened groups should be aggressively sought. Frequently these differences are glossed over as insignificant when they are not, and they can lead to confounding bias.

(2) What are the recommendations and will they help me in caring for my<br />

patients? What are the benefits?<br />

(a) The benefits should be calculable using the NNSB (number needed<br />

to screen to benefit). The beneficial outcomes that the results refer to<br />

should be important for the patient. The confidence intervals should be<br />

narrow.<br />

(b) The harms or potential harms should be clearly identified. Persons who are labeled with the disease but who are really disease-free will at least be inconvenienced and may require additional testing that is not benign. At the least, they may have increased anxiety until the final diagnosis is made. Early treatment may result in side effects so severe that patients may not want the treatment. You should be able to calculate the NNSH (number needed to screen to harm) of the intervention based upon the study data. This should be done with 95% confidence intervals to demonstrate the precision of that result; a sketch of this calculation follows this list.

2 Agency for Healthcare Research and Quality. www.ahrq.gov.
3 Adapted with permission from the Users' Guides to the Medical Literature, published in JAMA (see Bibliography).

(c) Benefits and harms should be compared across different groups of people and across different screening strategies; all plausible strategies should be examined when evaluating a screening program. Different strategies may result in different outcomes, either in final results or in patient suffering, depending on the prevalence of disease in the population screened and the screening and verification strategy employed.

(d) Look for the impact of the screening test on people's values and preferences. There ought to be an evaluation of patient values as part of the study. This can be done using focus groups or qualitative studies of patient populations. If such an evaluation is missing, be suspicious about the acceptability of the screening strategy. The study should ask patients how they feel about the screening test itself as well as about the possibility of being falsely labeled.

(e) The study should explore the impact of uncertainty by performing a sensitivity analysis, as described in Chapter 30. There is uncertainty associated with any study result, and the 95% confidence intervals should be given.

(f) The cost-effectiveness should be evaluated considering all the possible costs associated with the screening, including, but not limited to, setting up the program, advertising, following up positives, and excess testing and treatment of positives. A more complete guide to cost-effectiveness analysis is found in Chapter 31.
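As referenced in item (b), the NNSB or NNSH and its precision can be computed directly from raw event counts. A minimal sketch (the counts echo the earlier mammography example, and the confidence interval uses the standard normal approximation for a difference in proportions):

```python
import math

def arr_and_nns(events_c: int, n_c: int, events_e: int, n_e: int,
                z: float = 1.96):
    """Absolute risk reduction with a normal-approximation 95% CI,
    plus the point estimate of the number needed to screen (NNS).
    If the ARR interval crosses zero, the CI for NNS spans infinity."""
    cer, eer = events_c / n_c, events_e / n_e
    arr = cer - eer
    se = math.sqrt(cer * (1 - cer) / n_c + eer * (1 - eer) / n_e)
    ci = (arr - z * se, arr + z * se)
    nns = float("inf") if arr == 0 else 1 / abs(arr)
    return arr, ci, nns

# Hypothetical counts: 8/1000 deaths unscreened vs. 5/1000 screened
arr, (lo, hi), nns = arr_and_nns(8, 1000, 5, 1000)
print(f"ARR={arr:.4f}  95% CI=({lo:.4f}, {hi:.4f})  NNSB~{nns:.0f}")
# Here the ARR interval crosses zero, so the NNSB interval includes
# infinity: with event counts this small, no benefit cannot be excluded.
```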



29<br />

Practice guidelines and clinical prediction rules<br />

Any fool can make a rule
And every fool will mind it.
Henry David Thoreau (1817–1862): Journal, 1860

Whoever controls guidelines controls medicine.
D. Eddy, JAMA, 1990; 263: 877–880

Learning objectives<br />

In this chapter you will learn:<br />

the reasons for and origins of practice guidelines<br />

the problems associated with practice guidelines and the process by which<br />

they are developed<br />

how to evaluate practice guidelines and how they are actually used in<br />

practice<br />

the process of clinical prediction rule development<br />

the significance of different levels of prediction rules<br />

What are practice guidelines?<br />

Practice guidelines have always been a part of medical practice. They are present<br />

in the “diagnosis” and “treatment” sections in medical textbooks. Unfortunately,<br />

published practice guidelines are not always evidence-based. As an example, for the treatment of frostbite of the fingers, a surgical textbook says that operation should wait until the frostbitten part falls off, yet there are no studies backing up this claim. Treatment guidelines for glaucoma state that treatment should be initiated if the intraocular pressure is over 30 mmHg, or over a value in the mid-20 mmHg range if the patient has two or more risk factors. They then give a list of over 100 risk factors but no probability estimates of the increased rate of glaucoma attributable to any single risk factor. Clearly these guidelines are neither evidence-based nor particularly helpful to the individual practitioner.


Practice guidelines are simply an explicit set of steps that, when followed, will

result in the best outcome. In the past, they have been used for good reasons<br />

such as hand washing before vaginal delivery to prevent childbed fever or puerperal<br />

sepsis and for bad ones such as frontal lobotomies to treat schizophrenia.<br />

In some cases they are promulgated as a result of political pressure. One recent<br />

example is breast-cancer screening with mammograms in women between 40<br />

and 50 years old. This has been instituted in spite of lack of good evidence of<br />

improved outcomes. This particular program can cost a billion dollars a year<br />

without saving very many lives and can irrationally shape physician and patient<br />

behavior for years.<br />

A physician in 1916 said “once a Caesarian section, always a Caesarian section,”<br />

meaning that if a woman required a Caesarian section for delivery, all<br />

subsequent deliveries should be by Caesarian section. As a result of this one<br />

statement, the practice became institutionalized. This particular “guideline” was<br />

based on a bad outcome in just a few patients. It may have been valuable 85 years

ago, but with modern obstetrical techniques it is less useful now. Many recent<br />

studies have cast doubts on the validity of this guideline, but a new study suggests<br />

that there is a slightly increased risk of uterine rupture and poor outcome<br />

for mother and baby if vaginal delivery is attempted in these women. Clearly the<br />

jury is still out on this one and it is up to the individual patient with her doctor’s<br />

input to make the best decision for her and her baby.<br />

Practice guidelines are used for a variety of purposes. Primarily they ought to<br />

be used as a template for optimal patient care. This should be the best reason for<br />

their implementation and use in clinical practice. When evidence-based practice guidelines are written, reviewed, and based upon solid high-quality evidence, they should be implemented by all physicians. A good example of an evidence-based clinical guideline in current use is weight-based dosing of the anticoagulant heparin for the treatment of deep venous thrombosis (DVT). When the guideline is used, there are fewer adverse effects of treatment, fewer treatment failures, less excess bleeding, and better outcomes, leading to more rapid resolution of the DVT.
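A guideline of this kind is easy to express as executable logic. The sketch below is illustrative only, not the guideline itself: the bolus and infusion rates (80 units/kg and 18 units/kg/h) are taken from one commonly cited weight-based heparin nomogram and are assumptions here; a real implementation would follow the institution's own protocol:

```python
def weight_based_heparin(weight_kg: float,
                         bolus_u_per_kg: float = 80.0,
                         infusion_u_per_kg_h: float = 18.0) -> dict:
    """Illustrative weight-based heparin starting doses for DVT.
    The default rates reflect one published nomogram; they are
    assumptions here, not a treatment recommendation."""
    return {
        "bolus_units": round(weight_kg * bolus_u_per_kg),
        "infusion_units_per_h": round(weight_kg * infusion_u_per_kg_h),
    }

# For a 70 kg patient: bolus 5600 units, infusion 1260 units/h
print(weight_based_heparin(70))
```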

However, there are “darker” consequences that accompany the use of practice<br />

guidelines. They can be used as means of accreditation or certification. Currently<br />

several specialty boards use chart-review processes as part of their specialty<br />

recertification process. Managed care organizations (MCOs) can develop<br />

accreditation rules that depend on physician adherence to practice guidelines in<br />

the majority of their patients with a given problem. Performance criteria can be<br />

used as incentives in the determination of merit pay or bonuses, a process called<br />

Pay for Performance (P4P).<br />

In the last 30 years there has been an increase in the use of practice guidelines<br />

in determining the proper utilization of hospital beds. Utilization review<br />

has resulted in the reduction of hospital stays, which occurred in most cases without any increase in mortality or morbidity. The process of utilization review is strongly supported by managed care organizations and third-party payors. The guidelines upon which these rules are based ought to be evidence-based (Table 29.1).

Table 29.1. Desirable attributes of a clinical guideline

(1) Accurate: the methods used must be based on good-quality evidence
(2) Accountable: the readers (users) must be able to evaluate the guideline for themselves
(3) Evaluable: the readers must be able to evaluate the health and fiscal consequences of applying the guideline
(4) Facilitate resolution of conflict: the sources of disagreement should be able to be identified, addressed, and corrected
(5) Facilitate application: the guidelines must be able to be applied to the individual patient situation

Development of practice guidelines<br />

How should practice guidelines be developed? The process of guideline development<br />

should be evidence-based. Ideally a panel of interested physicians is

assembled and collects the evidence for and against the use of a particular set<br />

of diagnostic or therapeutic maneuvers. Some guidelines are simply consensus- or expert-based, and their recommendations may not be consistent with the best available evidence.

When evaluating a guideline, it ought to be possible to determine the process by which the guideline was developed. This process is summarized by the AGREE criteria. 1 This working group of evidence-based practitioners has developed six domains for evaluating the quality of the process of making a practice guideline: scope and purpose of the guideline, stakeholder involvement, rigor of development, clarity and presentation, applicability, and editorial independence. This process only indirectly assesses the quality of the studies that make up the evidence used to create the guideline.

There are several general issues that should be evaluated when appraising the<br />

validity of a practice guideline. First, the appropriate and important health outcomes<br />

must be specified. They should be those outcomes that will matter to<br />

patients and all relevant outcomes should be included in the guideline. These<br />

include pain, anxiety, death, disfigurement, and disability. They should not be<br />

chemical or surrogate markers of disease. Next, the evidence must be analyzed<br />

for validity and the effect of these interventions on the outcomes of interest. This must include explicit descriptions of the manner in which the evidence was collected, evaluated, and combined. The quality of the evidence used should be explicitly given.

1 AGREE criteria: found at the website of the AGREE Collaboration: http://www.agreecollaboration.org

The magnitudes of benefits and risks should be estimated and benefits compared<br />

to harms. This must include the interests of all parties involved in providing<br />

care for the patient. These are the patient, health-care providers, third-party<br />

payors, and society at large. The preferences assigned to the outcomes should<br />

reflect those of the people or patients who will receive those outcomes.<br />

The costs, both economic and non-economic, should be estimated and the net

health benefits compared to the costs of providing that benefit. Alternative procedures<br />

should be compared to the standard therapies in order to determine<br />

the best therapy. Finally, the analysis of the guideline must incorporate reasonable<br />

variations in care provided by reasonable clinicians. A sensitivity analysis<br />

accounting for this reasonable variation must be part of the guideline.<br />

Once a guideline is developed, physicians who will use this guideline in practice<br />

must evaluate its use. If the guideline is not acceptable for the practitioner,<br />

it will not be used. For example, in 1992 a clinical guideline was developed for<br />

the management of children aged 3 to 36 months with fever but no obvious source, in order to detect and treat occult bacteremia. This guideline was published simultaneously in the professional journals Annals of Emergency Medicine and Pediatrics.

After a few years, the guideline was only selectively used by pediatricians, but<br />

almost universally used by emergency physicians. Why? The patients seen in<br />

pediatricians’ offices are significantly different than those seen in emergency<br />

departments (ED). Sicker kids are sent to the ED by their pediatricians for further<br />

evaluation. The pediatricians are able to closely follow their febrile kids while<br />

emergency physicians are unable to do this. Therefore, emergency physicians<br />

felt better doing more testing and treating of febrile children in the belief that<br />

they would prevent serious sequelae. Finally, testing was easier to do in an ED<br />

than in a pediatrician's office. This guideline has since been withdrawn because most of the children in this age group are now immunized against the worst bacteria causing occult bacteremia, Haemophilus and pneumococcus.

Even if a practice guideline is validated and generally accepted by most physicians,<br />

there may still be a delay in the general acceptance of this guideline. This<br />

is mostly because of inertia. Physicians’ behavior has been studied and certain<br />

interventions have been found to change behavior. These include direct<br />

intervention such as reminders on a computer or ordering forms for drugs or<br />

diagnostic tests, follow-up by allied health-care personnel, and education from<br />

opinion leaders in their field. One of the most effective interventions involved<br />

using prompts on a computer when ordering tests or drugs. These resulted in<br />

improved drug-ordering practices and long-term changes in physician behavior.<br />

Less effective were audits of patient care charts and distributed educational<br />

materials. Least effective were formal continuing medical education (CME)



presentations, especially if they were of brief duration (less than 1 day). In some

cases, these very short presentations actually produced negative results leading<br />

to lower use of high quality evidence in physician practices. The construct called<br />

Pathman’s Pipeline demonstrating the barriers to uptake of validated evidence<br />

was discussed in Chapter 17.<br />

Practice guidelines should be developed using a preset process called the evidence- and outcomes-based approach. Separate the two main steps of the policy-making process: estimating the outcomes and judging their desirability. First estimate the specific outcomes of each of the proposed interventions and the probability of each outcome. Then make judgments about the desirability of each of the outcomes. Explicitly estimate the effect of the intervention on all outcomes that are important to patients. Estimate how the outcomes are likely to vary with different patient characteristics, basing the estimates on the highest-quality experimental evidence available. Use formal methods such as systematic reviews or formal critical appraisal of the component studies to analyze the evidence and estimate the outcomes. Finally, to accurately understand patient preferences, use actual assessments of patients' preferences to determine the desirability of the outcomes.

Critical appraisal of clinical practice guidelines 2<br />

(1) Are the recommendations valid?<br />

(a) Were all important options and outcomes considered? These must be considered<br />

from the perspective of the patient as well as the physician. All reasonable<br />

physician options should be considered including comments on<br />

those options not evidence-based but in common practice.

(b) Was a reasonable, explicit, and sensible process used to identify, select,<br />

and combine evidence? This must be reproducible by anyone reading<br />

the paper outlining how the guideline was developed. An explicit rationale for the choice of studies should be given. Evidence should be presented and graded by quality indicators.

(c) Was a reasonable, explicit, and sensible process used to consider the<br />

relative value of different outcomes? The different outcomes should be<br />

described explicitly and the reasons why each outcome was chosen<br />

should be given. Patient values should be used where available.<br />

(d) Is the guideline likely to account for recent developments of importance?<br />

The bibliography should include the most recent evidence regarding the<br />

topic.<br />

(e) Has a peer-review and testing process been applied to the guideline? Ideally, clinicians who are expert in the area of the guideline should develop and review the guideline. The guideline developers must balance the need to have experts create a guideline with the potential conflicts of interest of those experts. It should be tested in various settings to determine if physicians are willing to use it and to ensure that it accomplishes its stated goals.

2 Adapted with permission from the Users' Guides to the Medical Literature, published in JAMA (see Bibliography).

(2) What are the recommendations?<br />

(a) Are practical and clinically important recommendations made? The<br />

guidelines should be simple enough and make enough sense for most<br />

clinicians to use them.<br />

(b) How strong are the recommendations? The evidence for the guideline<br />

should be explicitly listed and graded using a commonly used grading<br />

scheme. Currently the GRADE criteria or the levels of evidence from<br />

the Centre for Evidence-Based Medicine at Oxford University are probably

the grading schemes most often used. The results of the studies<br />

should be compelling with large effect sizes to back up the use of the<br />

evidence.<br />

(c) How much uncertainty is associated with the evidence and values used<br />

in creating the guideline? It should be clear from the presentation of<br />

the evidence how uncertainty in the evidence has been handled. Some<br />

sort of sensitivity analysis should be included. What happens when basic<br />

assumptions are changed within the limits of the 95% CI of the different<br />

outcomes?<br />

(3) Will the recommendations help me in caring for my patients?<br />

(a) Is the primary objective of the guideline important clinically? The guidelines<br />

ought to meet your needs for improving the care of the patient you<br />

are seeing. They should be consistent with your patient’s health objectives.<br />

(b) How are the recommendations applicable to your patients? The patient<br />

must meet the criteria for inclusion into the guideline. Patient preferences<br />

must be considered after a thorough discussion of all the options.<br />

It must be reasonable for any physician to provide the needed follow-up<br />

and support for patients who require the recommended health care.<br />

Clinical prediction rules<br />

Physicians are constantly looking for sets of rules to assist them in the diagnostic<br />

process. Prediction rules are more specific than clinical guidelines for certain<br />

diagnoses. Clinical prediction rules are defined as decision-making support tools that can help physicians to make a diagnosis. They are

derived from original research and incorporate three or more variables into the<br />

decision process.



The Ottawa ankle rules<br />

The Ottawa ankle rules were first developed in the early 1990s and are now in nearly universal use in EDs and primary-care practices. Their development is an

excellent model for how prediction rules should be created. The main reason<br />

for developing this rule was to attempt to decrease the number of ankle x-rays<br />

ordered for relatively minor trauma. The rule has been successfully applied in<br />

various settings and resulted in decreased use of ankle x-rays. This has become<br />

the prototype for the development of clinical prediction rules.<br />

The first step in the development of these rules was to determine the underlying<br />

processes in making a particular diagnosis and initiating treatment modalities.<br />

In the case of the Ottawa ankle rules, this involved defining the components<br />

of the ankle examination, determining whether physicians could accurately<br />

assess them, and attempting to duplicate the results in a variety of settings. In the<br />

case of the ankle rules, it was found that only a few physical examination findings<br />

could be reliably and reproducibly assessed. Surprisingly, not all physicians reliably<br />

documented findings as apparently obvious as the presence of ecchymosis.<br />

For some of the physical-examination findings the kappa values were less than<br />

0.6. This level was considered to be the minimum acceptable level of agreement.<br />

The next step was to take all these physical-examination variables and apply<br />

them to a group of patients with the complaint of traumatic ankle pain. The<br />

authors determined which of these multiple variables were the most predictive<br />

of an ankle fracture. These variables were then applied to a group of patients and<br />

a statistical model was used to determine the final variables in the rule. When<br />

combined, these gave the rule the best operating characteristics. This means<br />

that when these variables are correctly applied to a patient they have the best<br />

sensitivity and specificity for diagnosing ankle fractures. In this case the rule<br />

creators decided that they wanted 100% sensitivity and were willing to sacrifice<br />

some specificity in the attempt. The process of determining which variables will<br />

be part of the rules is pure and simple data dredging. The results of this study<br />

become the derivation set for the prediction rule. This is defined as a Level-4 prediction<br />

rule. It is developed in a derivation set and ready for testing prospectively<br />

in the medical community as a validation set in different settings. For the Ottawa<br />

ankle rules, the clinical prediction rule was positive, requiring that an x-ray be taken, if the patient could not walk four steps both immediately after the injury and in the Emergency Department, or if they had tenderness over the lateral or medial malleolus of the ankle.
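Expressed as code, the simplified version of the rule described above might look like the following Python sketch. It captures only the two elements named in the text, not the full published rule, which also distinguishes malleolar and midfoot zones:

```python
def ankle_xray_indicated(can_walk_4_steps_immediately: bool,
                         can_walk_4_steps_in_ed: bool,
                         malleolar_tenderness: bool) -> bool:
    """Simplified Ottawa ankle rule as described in the text:
    x-ray if the patient cannot bear weight for four steps both
    right after the injury and in the ED, or if there is bony
    tenderness over the lateral or medial malleolus."""
    unable_to_bear_weight = not (can_walk_4_steps_immediately
                                 and can_walk_4_steps_in_ed)
    return unable_to_bear_weight or malleolar_tenderness

# A patient who walked four steps at the scene and in the ED,
# with no malleolar tenderness, needs no film under this rule.
print(ankle_xray_indicated(True, True, False))  # False
```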

Following this the rules were applied to another group of patients, the validation<br />

set. The same rules were applied to a new population in a prospective<br />

manner. In this case the rule functioned perfectly. This raised the rule to a<br />

Level-2 rule, since it had been validated in a different study population. If the<br />

rule were only valid in a small subpopulation, it would be a Level-3 rule. In this case, the rule was tried in a cross-section of the population that included men and women of all ages. There was not a large ethnic mix in the population, but this is a relatively minor point in this disease since there is no a priori reason to think that African-Americans or other non-Caucasian ethnic groups will respond differently to an ankle examination than Caucasians.

Table 29.2. Levels of clinical decision rules

Level 1: Rule that can be used in a wide variety of settings with confidence that it can change clinician behavior and improve patient outcomes. At least one prospective validation in a different population and one impact analysis demonstrating change in clinician behavior with beneficial consequences.

Level 2: Rule that can be used in various settings with confidence in its accuracy. Demonstrated accuracy in at least one prospective study including a broad spectrum of patients and clinicians, or validated in several smaller settings that differ from one another.

Level 3: Rule that clinicians may consider using with caution, and only if patients in the study are similar to those in the clinician's clinical setting. Validated in only one narrow prospective sample.

Level 4: Rule that is derived but not validated, or validated only in split samples, large retrospective databases, or by statistical techniques.

Source: From T. G. McGinn, G. H. Guyatt, P. C. Wyer, C. D. Naylor, I. G. Stiell & W. S. Richardson. Users' guides to the medical literature. XXII. How to use articles about clinical decision rules. Evidence-based medicine working group. JAMA 2000; 284: 79–84. Used with permission.

Finally, a Level-1 rule is one that is ready for general use and has been shown<br />

to work effectively in many clinical settings. It should also show that the savings<br />

predicted from the initial study were maintained when the rule was applied in<br />

other clinical settings. This is now true of the Ottawa ankle rules.<br />

There are some published standards for clinical prediction rules. Wasson and<br />

others developed these in 1985, and a modified version was published in JAMA<br />

in 2000 (Table 29.2).<br />

Methodological standards for developing clinical decision rules<br />

The clinical problem addressed should be a fairly commonly encountered condition.<br />

It will be very difficult if not impossible to determine the accuracy of<br />

the examination or laboratory tests for uncommon or rare illnesses. The clinical<br />

predicament should have led to variable practices by physicians in order to



support the need for a clinical prediction rule. This means that physicians act in<br />

very different ways when faced with several patients who have the same set of<br />

symptoms. There should also be general agreement that the current diagnostic<br />

practice is not fully effective, and a desire on the part of many physicians for this<br />

to change.<br />

There must be an explicit definition of findings used to predict the outcome.<br />

Ideally the inter-observer agreement should be able to be determined. Only<br />

those with a high enough inter-observer reliability as demonstrated by a high<br />

kappa value should then be used as part of the final rule. There are several versions<br />

of the kappa test. For most dichotomous data the simple kappa is used.<br />

Other statistical methods are used for more complex data such as the weighted<br />

kappa for ordinal data and intra-class correlation coefficient for continuous<br />

interval data. Once tested, only those signs (also called predictor variables) with good agreement across various levels of provider experience should be used in the final rule.
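For dichotomous findings, the simple (Cohen's) kappa mentioned above is straightforward to compute from a 2 × 2 agreement table. A minimal sketch, with made-up counts for two examiners assessing the same patients:

```python
def cohens_kappa(both_yes: int, a_only: int, b_only: int, both_no: int) -> float:
    """Simple kappa for two raters and a dichotomous finding.
    Arguments are the four cells of the 2 x 2 agreement table."""
    n = both_yes + a_only + b_only + both_no
    observed = (both_yes + both_no) / n
    # Chance agreement from each rater's marginal 'yes'/'no' rates
    p_a_yes, p_b_yes = (both_yes + a_only) / n, (both_yes + b_only) / n
    expected = p_a_yes * p_b_yes + (1 - p_a_yes) * (1 - p_b_yes)
    return (observed - expected) / (1 - expected)

# Hypothetical: 40 patients; both raters find ecchymosis in 12,
# they disagree on 8, and both find none in 20.
print(round(cohens_kappa(both_yes=12, a_only=5, b_only=3, both_no=20), 2))
# kappa ~ 0.58: just below the 0.6 minimum cited for the ankle-rule work
```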

All the important predictor variables must be included in the derivation process.<br />

These predictors are the components of the history and physical exam that<br />

will be in the rule to be developed. If significant components are left out of the<br />

prediction rule, providers are less likely to use the rule, as it will not have face<br />

validity for them. The predictor variables all must be present in a significant proportion<br />

of the study population or they are not likely to be useful in making the<br />

diagnosis.<br />

Next, there should be an explicit definition of the outcomes. They must be easily<br />

understandable by all providers and be clinically important to the patient.<br />

Finding people with a genetic defect that is not clinically important may be<br />

interesting for physicians and researchers, but may not directly benefit patients.<br />

Therefore, most providers will not be interested in this outcome and will not seek<br />

to accomplish it using that particular guideline.<br />

The outcome event should be assessed in a blinded manner to prevent bias.<br />

The persons observing the outcome should be different from those recording<br />

and assessing the predictor variables. In cases where the person assessing the<br />

predictor variable is also the one determining the outcome, observation bias can<br />

occur. This occurs when the people doing the study are aware of the assessment<br />

and the outcome and may change their definitions of the outcome or the assessment<br />

of the patient. This may occur in subtle ways yet still produce dramatic<br />

alterations in the results.<br />

The subjects should be carefully selected. There should be a range of ages, ethnic<br />

groups, and genders of patients. The selection of a sample should include the<br />

process of selection, inclusion and exclusion criteria, and the clinical and demographic<br />

characteristics of the sample. Patient selection should be free of bias and<br />

there should be a wide spectrum of patient and disease characteristics. The study



should determine the population of patients to which this rule will be applied.<br />

This gives the clinician the parameters for application of the rule. In the Ottawa<br />

ankle rules, there were no children under age 18 and therefore initially the rule<br />

could not be applied to them. Subsequent studies found that the rule applied<br />

equally well in children as young as 12.<br />

The setting should also be described. Studies that are done only in a specialized<br />

setting will result in referral bias. In these cases, the rules developed may not<br />

apply in settings where physicians are not as academic or where the patient base<br />

has a broader spectrum of the target disorder. A rule that is validated in a specialized<br />

setting must be further validated in more diverse community settings.<br />

The original Ottawa ankle rule was derived and validated in both a university teaching hospital emergency department and a community hospital. The results

were the same in both settings.<br />

The sample size and number of outcome events should be large enough to<br />

prevent a Type II error. If there are too few outcome events, the rule will not be<br />

particularly accurate or precise and have wide confidence intervals for sensitivity<br />

or specificity. As a rule of thumb, there should be at least 10–20 desired outcome<br />

events for each independent variable. For example, if we want to study a prediction<br />

rule for cervical spine fracture in injured patients and have five predictor<br />

variables, we should have at least 50 and preferably 100 significant cervical spine<br />

fractures. A Type I error can also occur if there are too many predictor variables<br />

compared to the number of outcome events. If the rule worked perfectly, it would<br />

have a sensitivity of 100%, the definition of a perfect screening rule. This rule will<br />

rule out disease if it is completely negative. It will not rule in disease if positive.<br />

However, since a sample size of 50 patients with cervical spine fractures is pretty small, the confidence interval on this sensitivity would go from 94% to 100%. If the outcome is not too bad, this is a reasonable rule. However, if the outcome were

possible paralysis, missing up to 6% of the patients with a potential for this outcome<br />

would be disastrous. This would prevent that rule from being universally<br />

used.<br />
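The 94% lower bound quoted above can be reproduced with the "rule of three" for an observed proportion of 100%. A minimal sketch (an exact Clopper–Pearson interval would give a slightly wider bound):

```python
def rule_of_three_lower_bound(n_positives: int) -> float:
    """Approximate one-sided 95% lower confidence bound on sensitivity
    when all n observed cases were detected (sensitivity = 100%).
    The 'rule of three': the true miss rate is below 3/n."""
    return 1 - 3 / n_positives

# 50 fracture cases, all caught by the rule
print(f"{rule_of_three_lower_bound(50):.0%} to 100%")  # 94% to 100%
```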

The mathematical model used to create the rule should be adequately<br />

described. The most common methods are recursive partitioning and classification<br />

and regression trees (CART) analysis. In each of these, the various predictor<br />

variables are modeled to see how well they can predict the ultimate<br />

outcome. In the recursive-partitioning method, the most powerful predictor<br />

variable is tested to see which of the positive patients are identified. Those<br />

patients are then removed from the analysis and the rest are tested with the<br />

next most powerful predictor variable. This is continued until all patients with<br />

the desired outcome are identified. The CART methodology, a more formal statistical tree-building technique, is much more complex and beyond the scope of this text.
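To make the recursive-partitioning idea concrete, here is a minimal sketch using scikit-learn's CART implementation (assumed installed) on invented data. The predictor names echo the Ottawa example; nothing here reproduces the actual derivation study:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Invented patients: [malleolar_tenderness, cannot_bear_weight]
X = [[1, 1], [1, 0], [0, 1], [0, 0], [1, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 1, 0, 0, 0]  # 1 = fracture on x-ray

# CART greedily splits on the most informative predictor first,
# then recursively partitions the remaining patients.
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree,
                  feature_names=["malleolar_tenderness",
                                 "cannot_bear_weight"]))
```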



There must be complete follow-up, ideally 100% of all patients enrolled in the<br />

study. If fewer patients are followed to completion of the study, the effect of<br />

patient loss should be assessed. This can be done with a best case/worst case<br />

analysis, which will give a range of values of sensitivity and specificity within<br />

which the rule can be expected to operate.<br />

The rule should be sensible. This means it must be clinically reasonable, easy<br />

to use, and with a clear-cut course of action if the rule is positive or negative.<br />

A nine-point checklist for determining which heart-attack patient should go to<br />

the intensive care unit and which can be admitted to a lower level of care is not<br />

likely to be useful to most clinicians. There are just too many variables for anyone<br />

to remember. One way of making it useful is to incorporate it into the order form<br />

for admitting patients to these units, or creating a clinical pathway with a written<br />

checklist that incorporates the rule and must be used prior to admission to the<br />

cardiac unit.<br />

For most physicians, rules that give probability of the outcome are less useful<br />

than those that tell the physician there are specific things that must be done<br />

when a certain outcome is achieved. However, future physicians, who will be better<br />

versed in the techniques of Bayesian medical decision making, will have an<br />

easier time using rules that give probability of disease rather than specific outcome<br />

actions. They will also be better able to explain the rationale for a particular<br />

decision to their patients. The Wells criteria for risk-stratifying patients<br />

in whom you suspect deep vein thrombosis (DVT) are an example of probabilities<br />

as the outcome of the rule. 3 The final outcome classifies patients into high,<br />

moderate, and low levels of risk for having a DVT. Each of these has a probability<br />

that is pretty well defined through the use of experimental studies of diagnostic<br />

tests.<br />
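The stratification logic of such a rule is simple to express in code. A sketch, assuming the cutoffs commonly cited from the 1997 Wells rule (a total of 3 or more points is high risk, 1–2 moderate, 0 or fewer low); the item weights themselves are omitted and would come from the published rule:

```python
def wells_dvt_risk(score: int) -> str:
    """Risk stratum from a Wells DVT point total, using cutoffs
    commonly cited from the 1997 rule (an assumption here; consult
    the published rule for the individual item weights)."""
    if score >= 3:
        return "high"
    if score >= 1:
        return "moderate"
    return "low"

for s in (-2, 1, 4):
    print(s, "->", wells_dvt_risk(s))  # low, moderate, high
```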

The rule should be tested in a prospective manner. Ideally this should be done<br />

with a population and setting different than that used in the derivation set. This<br />

is a test for misclassification when the rule is put into effect prospectively. If the<br />

rule still functions in the same manner that it did in the derivation set, it has<br />

passed the test of applicability. This is where provider training in the use of the<br />

rule can be studied. How long does it take to learn to use the rule? If it takes<br />

too long, most providers in community settings will be reluctant to take the time<br />

to learn it. They will feel that the rule is something that will be only marginally<br />

useful in a few instances. Providers who have a stake in development of the rule<br />

are more likely to use it better and more effectively than those who are grudgingly<br />

goaded into using it by an outside agency.<br />

3 P. S. Wells, D. R. Anderson, J. Bormanis, F. Guy, M. Mitchell, L. Gray, C. Clement, K. S. Robinson & B. Lewandowski. Value of assessment of pretest probability of deep-vein thrombosis in clinical management. Lancet 1997; 350: 1795–1798.



It must still be tested at other sites and with other practitioners in order to determine the effect of its clinical use in those settings. This testing should be done

in a prospective manner. As part of this testing, the use of the rule should be shown to reduce unnecessary medical care. This should result in automatic cost-effectiveness of the rule. A rule designed to reduce the number of x-rays taken of the neck, if correctly applied, will result in fewer x-rays ordered, and there is no question that there will be an overall cost saving. Of course, if there is a complex and lengthy training process involved, some of the cost savings will be transferred to the training program, making the rule less effective. And if the rule doesn't work well, it may lead to malpractice suits because of errors in patient care, making it even more expensive.

Critical appraisal of prediction rules<br />

(1) Is the study valid?<br />

(a) Were all important predictors included in the derivation process? The<br />

model should include all those factors that physicians might take into<br />

account when making the diagnosis.<br />

(b) Were all important predictors present in a significant proportion of the<br />

study population? The predictor variables should be those that are common.<br />

No specific percentage is required, but clinical judgment should<br />

decide this.<br />

(c) Were all the outcome events and predictors clearly defined? The description<br />

of the outcomes and predictors should be easily reproducible by anyone<br />

in clinical practice.<br />

(d) Were those assessing the outcome event blinded to the presence of the<br />

predictors and those assessing the presence of predictors blinded to the<br />

outcome event?<br />

(e) Was the sample size adequate and did it include adequate outcome<br />

events? There should be at least 10–20 cases of the desired outcome,<br />

patients with a positive diagnosis, for each of the predictor variables<br />

being tested.<br />

(f) Does the rule make clinical sense? The rule should not fly in the face of current clinical practice; otherwise it will not be used.

(2) What are the results?<br />

(a) How well do clinicians agree on the presence or absence of the findings<br />

incorporated into the rule? Inter- and intra-rater agreement and kappa<br />

values with confidence intervals should be given.<br />

(b) What are the sensitivity and specificity of the prediction rule? The rule should lead to a high LR+, ideally > 10, and a low LR−, ideally < 0.1; a short sketch of this calculation follows this list.



(c) How well does the rule predict the outcome? Depending on the severity<br />

of the outcome, the rule should find patients with the desired outcome<br />

almost all of the time. Is the post-test probability for the rule high in all<br />

clinical scenarios?<br />

(3) How can I apply the results to my patients?<br />

(a) Are the patients in the study similar enough to my patient?<br />

(b) Can I efficiently and effectively use the rule in my patients?
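As referenced in item 2(b), likelihood ratios follow directly from sensitivity and specificity. A minimal sketch with hypothetical values:

```python
def likelihood_ratios(sensitivity: float, specificity: float):
    """LR+ and LR- from a rule's sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg

# Hypothetical screening-oriented rule: 98% sensitive, 40% specific
lr_pos, lr_neg = likelihood_ratios(0.98, 0.40)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")  # LR+ = 1.63, LR- = 0.05
```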


30<br />

Decision analysis and quantifying patient values<br />

Chance favors only the prepared mind.<br />

Louis Pasteur (1822–1895)<br />

Learning objectives<br />

In this chapter you will learn:<br />

the function of each part of a decision tree<br />

how to use a decision tree in conjunction with the uncertainties of a<br />

diagnostic test to assist in decision making for patients<br />

different ways of quantifying patient values using linear rating scales, time<br />

trade-off, and standard gamble<br />

how to define and use QALYs<br />

Introduction<br />

How do physicians choose between various treatment options? For the individual<br />

physician treating a single patient, it is a matter of obtaining the relevant clinical<br />

information to make a diagnosis. This is followed by treatment as set down<br />

in some sort of clinical practice guideline or from the results of a well-done RCT.<br />

However, these results may have a high degree of uncertainty with large 95% CI<br />

and may not consider the patient’s preferences or values. To help deal with these<br />

issues there are some statistical techniques that we can apply to quantify the<br />

process.<br />

To put the concept of risk into perspective, we must briefly go back a few hundred<br />

years. Girolamo Cardano (1545) and Blaise Pascal (1660) noted that in making<br />

a decision that involved any risk there were two completely distinct elements, yet both were required to make the decision. These were the

objective facts about the likelihood of the risk and the subjective views on the<br />

part of the risk taker about the utility of the outcomes involved in the risk. This<br />


second factor leads to the usefulness or expected value of the outcomes expected<br />

from the risk. This involved weighing the gains and losses involved in taking each<br />

of the potential risks and attaching a value to each outcome. Pascal created the<br />

first recorded decision tree when deciding whether or not to believe in God.<br />

The Port Royal text on logic (1662) noted that people who are “pathologically<br />

risk-averse” make all their choices based only upon the consequences and will

refuse to make a choice if there is even the remotest possibility of an adverse<br />

consequence. They do not consider the statistical likelihood of that particular<br />

consequence in making a decision. Later, in the early eighteenth century,<br />

Daniel Bernoulli noted that those who make choices based only upon the probability

of an outcome without any regard for the quality of the risk involved<br />

with that particular outcome would be considered foolhardy. Most of us are<br />

somewhere in between, which takes us to the modern era in medical decision<br />

making.<br />

There is a systematic way in which the components of decision making can be<br />

incorporated to make a clinical decision and determine the best course of therapy.<br />

This statistical method for determining the best path to diagnosis and treatment<br />

is called expected-values decision making. Given the probability of each<br />

of the risks and benefits of treatment, which strategy will produce the greatest<br />

overall benefit for the patient? The theory of expected-values decision making<br />

is based on the assumption that there is a risk associated with every treatment

option and uncertainty associated with each risk.<br />

By using the technique known as instrumental rationality the clinician can calculate<br />

the treatment strategy which will produce the most benefit for the average<br />

or typical patient. The clinician quantifies each treatment strategy by assigning<br />

a numerical value to each outcome called the utility and multiplying that value<br />

by the probability of the occurrence of that outcome. The utilities and probabilities<br />

can be varied to account for variation in patient values and likelihood of<br />

outcomes.<br />

The vocabulary of expected-values decision making: expected<br />

value = utility × probability<br />

The probability is a number from 0 to 1 that represents the likelihood of a particular<br />

outcome of interest. You must know as much about each outcome of the various<br />

treatment options as possible. The probability of each outcome (P) comes<br />

from clinical research studies of patient populations. Ideally, they will have the<br />

same or similar characteristics as the patient or population that is being treated.<br />

These can also come from systematic reviews of many clinical studies or meta-analyses.

They are usually not exact, but are only a best approximation, and<br />

ought to come with 95% confidence intervals attached.



There must then be an assignment of a value or utility (U) to each outcome<br />

that quantifies the desirability or undesirability of that outcome. A utility of 1<br />

is assigned to a perfect outcome, usually meaning a complete cure or perfect<br />

health. A utility of 0 is usually thought of as a totally unacceptable outcome,<br />

usually reserved for death. Intermediate utility values are assigned to other outcomes.<br />

The quality of life resulting from each intermediate outcome will be less<br />

than expected with a total cure but more than death. This outcome state may be<br />

wholly or partially unbearable due to treatment side effects or adverse effects<br />

of the illness. A numerical value for utility between 1 and 0 is then assigned<br />

to this outcome. Recent studies of patient values for outcomes of cardiopulmonary<br />

resuscitation (CPR) revealed that some patients will give negative scores<br />

to outcomes such as surviving in a persistent vegetative state and being maintained<br />

on a ventilator. This means that they consider these outcomes to be worse<br />

than death. As research into the development of patient values has continued,<br />

it is clear that there are many outcomes that are valued as less than zero. A<br />

recent example was a study that requested patients to determine their values<br />

in stroke care. Being alive but with a severe disability was rated as less than<br />

zero.<br />

A decision tree illustrating treatment options can then be constructed, as<br />

seen from the following clinical example. Thrombolytic therapy, the use of the clot-dissolving medication called t-PA, can be used to treat acute embolic or thrombotic

cerebrovascular accident, a CVA, or stroke due to a blood clot in the brain.<br />

Consider a patient who is a 60-year-old man with sudden onset of weakness of<br />

the right arm and leg associated with inability to speak. A stroke is suspected and<br />

the physician wants to try this new form of treatment to dissolve the suspected<br />

clot in the artery supplying the left parietal area of the brain. A CT scan shows no<br />

apparent bleeding in the brain. There are two options for the patient at this point.<br />

Thrombolytic therapy (t-PA) can be given to dissolve the clot or the patient can<br />

be treated using traditional methods of anticoagulation and intensive physical<br />

rehabilitation therapy.<br />

The first step is to list the possible outcomes for each therapy. For purposes<br />

of the exercise we will greatly simplify this process and assume that there are<br />

only three possible outcomes. Thrombolytic therapy can result in one of two outcomes,<br />

either a cure with complete resolution of the symptoms or death from<br />

intracranial hemorrhage, bleeding into the substance of the brain. Traditional<br />

medical therapy will result in some improvement in the clinical symptoms in all<br />

patients but leave all of them with some residual deficit.<br />

Next, find the probabilities of each of the outcomes. Outcome probabilities are obtained from studies of populations of patients with similar characteristics for both the stroke and the risk factors for bleeding. The probability of death from thrombolytic therapy is P_d; the probability of complete cure is P_c, which is equal to 1 − P_d; and for partial improvement with medical therapy, in this example only, the probability is 1.



The next step is to assign a utility to each of the outcomes. The utility of complete<br />

cure is 1, death is 0, and that of the unknown residual chronic disability is U_x.

These values are obtained from studies of patient attitudes toward each of the<br />

outcomes in question and will be discussed in more detail shortly.<br />

Mechanics of constructing a decision tree<br />

There are three components to any decision tree. Nodes are junctures where<br />

something happens. There are three types of nodes: decision, probability or<br />

chance, and stationary. A decision node is the point where the clinician or patient<br />

must choose between two or more possible options. A probability node is the<br />

point where one of two or more possible outcomes can occur by chance. A<br />

stationary node is the point where the patient starts, their initial presentation,<br />

or finishes, their ultimate outcome. The symbols for the nodes are shown in<br />

Fig. 30.1.<br />

Fig. 30.1 Symbols used in a decision tree: one symbol each for the decision node, the probability node, and the stationary node.

Arms connect the nodes. Each arm represents one treatment or management<br />

strategy. Figure 30.2 shows a simple decision tree for our problem. In this simplified<br />

decision tree for stroke, one arm represents thrombolytic therapy and the<br />

other represents standard medical therapy. The thrombolytic therapy arm has<br />

a probability node and then two other arms come from that. These are cure or<br />

death.<br />

In the simplified stroke-therapy example, calculate the expected value in each arm of the tree by multiplying the utility and probability and summing their values around each node. Therefore, for thrombolytic therapy the expected value E will equal 1(1 − P_d) + 0(P_d). For standard medical therapy, since the utility of chronic residual disability is U_x and since all patients have this intermediate outcome, the expected value E is U_x. The patient should always prefer the strategy that leads to the highest expected value. In this example, the patient would always choose standard medical treatment for stroke if the expected value for this arm is 100%, which will occur if U_x = 1 and there is a measurable death rate for treating with thrombolytic therapy, making the expected value of the thrombolytic arm 100% − P_d.



Fig. 30.2 Decision tree for thrombolytic therapy. From the starting decision node (patient presents with acute stroke), the thrombolytic arm leads through a probability node to cure (probability 1 − P_d, U = 1) or death (probability P_d, U = 0); the standard-medical-therapy arm leads with probability 1 to chronic disability (U = U_x). The expected value of each arm is E(thrombolytics) = (1 − P_d) × 1 + P_d × 0 and E(medicine) = 1 × U_x.

However, the value of a lifetime of chronic neurological disability is not 100%,<br />

and let's assume for this example that it is 0.9. This means that living with chronic

neurological disability is somehow equated with living 90% of a normal life.<br />

Recalculating the expected value of each arm will determine what probability<br />

of death from thrombolytics would result in wanting to choose thrombolytics<br />

over medical therapy. We must solve the equation 1 − P_d = 0.9. Since the value of E for the medicine arm is now 0.9, thrombolytic therapy should be the chosen

modality as long as P_d < 0.10.

Disagreeable events such as side effects may reduce the value of a given arm. For example, if the experience of getting thrombolytics were unpleasant, that might lead to a utility reduction of 0.01, changing the expected value of that arm to 1 − 0.01 − P_d. In that case, if U_x were still 0.9, thrombolytics would be favored as long as P_d < 0.09.
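These expected-value calculations are mechanical and easy to script. A minimal sketch of the simplified two-arm tree, where the probabilities and utilities are the illustrative values from this example, not data:

```python
def expected_value(outcomes):
    """Expected value of one arm: sum of probability x utility."""
    return sum(p * u for p, u in outcomes)

p_death = 0.05          # illustrative probability of fatal hemorrhage
u_disability = 0.9      # utility of chronic disability (example value)

e_lytics = expected_value([(1 - p_death, 1.0), (p_death, 0.0)])
e_medical = expected_value([(1.0, u_disability)])

print(f"E(thrombolytics) = {e_lytics:.2f}")   # 0.95
print(f"E(medical)       = {e_medical:.2f}")  # 0.90
# Thrombolytics are favored whenever 1 - p_death > u_disability,
# i.e. p_death < 0.10 for these utilities.
```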

In reality, there are more outcomes than shown in this example. For the<br />

thrombolytic-therapy arm, the clot can be dissolved successfully, there can be<br />

residual deficit, or the patient may have an intracranial bleed resulting in death,<br />

or have partial improvement but be left with a residual deficit. The degree of<br />

deficit can also be divided into different categories, for example using the Modified<br />

Rankin Scale to create six criteria for outcomes. The thrombolytic arm of the<br />

decision tree would then look as shown in Fig. 30.3, where P_c is the probability of cure and P_h the probability of hemorrhage.



Fig. 30.3 Expanded decision tree for thrombolytic therapy. The thrombolytic arm leads to cure (P_c, U = 1), residual damage (1 − P_c − P_h, U = U_x), or hemorrhage (P_h); hemorrhage then leads to death (P_dt, U = 0) or residual damage (1 − P_dt, U = U_x). The expected value is E = P_c(1) + (1 − P_c − P_h)U_x + P_h[P_dt × 0 + (1 − P_dt)U_x].

Fig. 30.4 Expanded decision tree for standard medical therapy. The arm leads to resolution (cure) (P_c, U = 1), death (P_dm, U = 0), or residual damage (1 − P_c − P_dm, U = U_x). The expected value is E = P_c(1) + (1 − P_c − P_dm)U_x + P_dm × 0.

The probability of death due to hemorrhage is P_dt and that of residual damage due to hemorrhage is 1 − P_dt. For residual damage we will use the same utility, U_x = 0.9, as in the previous example for the standard-therapy arm.

Similarly, standard medical treatment can result in spontaneous cure or death. That side of the decision tree then looks like Fig. 30.4, where P_c is the probability of complete resolution and P_dm the probability of death.

The reason that a decision tree is needed at all is that, while there is an increase in complete cures with thrombolytic therapy, there is also an increase in

intracranial hemorrhage leading to residual damage or death. Simply balancing<br />

the two, using NNTB for cure and NNTH for death due to hemorrhage, ignores<br />

the patient’s values for each of these outcomes. This is especially true when one<br />

or both of the alternative outcomes can lead to a lifetime of disability.
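The expanded-tree expected values from Figs. 30.3 and 30.4 are equally mechanical to compute. A sketch in Python; all input values below are invented for illustration, not trial data:

```python
def e_thrombolytics(p_c, p_h, p_dt, u_x):
    """Expected value of the expanded thrombolytic arm (Fig. 30.3)."""
    return (p_c * 1.0
            + (1 - p_c - p_h) * u_x
            + p_h * (p_dt * 0.0 + (1 - p_dt) * u_x))

def e_medical(p_c, p_dm, u_x):
    """Expected value of the expanded medical arm (Fig. 30.4)."""
    return p_c * 1.0 + (1 - p_c - p_dm) * u_x + p_dm * 0.0

# Invented illustrative inputs; which arm wins depends entirely on them
print(e_thrombolytics(p_c=0.40, p_h=0.06, p_dt=0.50, u_x=0.9))  # ~0.91
print(e_medical(p_c=0.25, p_dm=0.05, u_x=0.9))                  # ~0.88
```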



Fig. 30.5 One-way sensitivity analysis of a simplified hypothetical stroke therapy model: the expected value E of each arm (standard medical therapy versus thrombolytic therapy) plotted against the probability of death from thrombolytics, over the plausible range of P_d from the literature.

Sensitivity analysis

Sensitivity analysis is a way to deal with imprecision in the data used to create the decision tree. We have discussed that this imprecision is present in almost all data obtained from the medical literature, and we insist that the results of any kind of study carry appropriate confidence intervals to convey the uncertainty of the result. A sensitivity analysis tests the "robustness" of the conclusions over a range of different values of the probabilities for each branch of the decision tree. It asks what would happen to the expected value of thrombolytics against standard medical management if we varied the probability or utility of any of the outcomes. One simple way of doing this is to take the 95% confidence intervals of the probabilities and use them as the extremes in the sensitivity analysis. In other words, recalculate the expected values of each arm of the tree using first the upper and then the lower 95% CI value as the new probability for one arm.
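For instance, reusing the two expected-value functions from the sketch above, a one-way sensitivity analysis sweeps a single input, here P_dt, across its 95% CI (the CI limits of 0.35–0.65 below are assumed, as are the other inputs) while everything else stays fixed:

    # One-way sensitivity analysis: vary only P_dt (death given hemorrhage).
    e_s = ev_standard(p_c=0.25, p_dm=0.05)
    for p_dt in (0.35, 0.40, 0.45, 0.50, 0.55, 0.60, 0.65):
        e_t = ev_thrombolytics(p_c=0.40, p_h=0.06, p_dt=p_dt)
        favored = "thrombolytics" if e_t > e_s else "standard therapy"
        print(f"P_dt = {p_dt:.2f}: E_t = {e_t:.3f} vs E_s = {e_s:.3f} -> {favored}")

In this toy example the thrombolytic arm stays ahead across the whole range, which is exactly the kind of "robustness" the analysis is meant to demonstrate.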

If there is very little difference between the expected values of the two treatments being compared, then a slight change in the probabilities assigned to each arm could easily alter the direction of the decision. In that case, if the values of the probabilities are off by just a little bit, the entire result will change, and the patient and physician will have little useful information about the relative merits of the two treatments or about which one is superior.


[Fig. 30.6: One-way sensitivity analysis of a more complex hypothetical model for stroke therapy. Expected value E is plotted against the probability of death, P_d or P_dm; the thrombolytic-therapy and standard-medical-therapy curves cross within the range of plausible values, separating the region where thrombolytic therapy is better from the region where it is worse.]

Sensitivity analysis determines how much variation in the final outcome will result from plausible variations in each of the input variables. One-way sensitivity analysis changes only one parameter at a time (Figs. 30.5 and 30.6). Multi-way sensitivity analysis looks for the variables that cause the biggest changes in the value of the overall model. The analysis then changes all of the assumptions to which the model is "very sensitive" to see what happens to the result. Finally, a curve is drawn to show what happens to the expected values when the two most "sensitive" variables are changed together (Fig. 30.7).

The results of a sensitivity analysis can be graphed, showing the effect on the final outcomes of a change in each of these values. Expected values are usually calculated for each branch of the decision tree as quality-adjusted life years (QALYs). A QALY equals E × life expectancy, where E is the expected value calculated from the decision tree.

In the decision tree on thrombolytic therapy and stroke, adding the uncertainty associated with the results of a CT scan, which checks for early signs of intracranial bleeding as the cause of the stroke, complicates the previous example. This is because a small amount of bleeding is difficult to diagnose on the CT scan, and if thrombolytic therapy is given in the presence of even a very small bleed, the likelihood of a serious and possibly fatal intracranial hemorrhagic stroke increases. Since the presence of a bleed is not always detected, the CT is not a perfectly valid test, and the construction of the decision tree must incorporate the possibility of incorrect interpretations of the CT.


[Fig. 30.7: Two-way sensitivity analysis of a complex model of treatment for stroke based on the results of the CT scan. Sensitivity is plotted against 1 − specificity, dividing the plane into "treat" and "don't treat" regions. (Yes, the graph of sensitivity vs. 1 − specificity is the ROC curve.)]

The sensitivity and specificity of the CT in stroke patients would help to calculate the probabilities associated with these additions to the tree.

It is now possible to determine the probability of giving thrombolytic therapy when there actually is a bleed because the CT scan is read incorrectly, a false negative CT, and the probability of withholding the therapy when there is truly no bleed and yet one is read on the CT scan, a false positive CT. The ultimate decision should still be based on whichever strategy gives the highest final expected utility. Figure 30.8 shows this more complex but also more realistic decision tree of thrombolytic therapy in stroke.

Reality check! (disclaimer)

This is not a model of what doctors actually do now at the bedside, but a mathematical modeling technique that can help doctors and patients find the best possible way of making complex medical decisions. It can be used to create health policy or to determine the best strategy for a practice guideline. In actuality, physicians have trouble applying decision analysis to individual patients even when there is a clearly superior treatment. Also, the model requires that the outcomes be put into a few discrete categories, when in fact there are many outcomes that are not as clear-cut as in the model.

In this example, complications of thrombolytic therapy can vary from mild to serious in severity. Chronic disability can also vary, from a mild deficit to a constant and very severe disabling deficit, and some deficits are severe but last for only a brief period of time and then spontaneously resolve. Standard medical treatment may actually result in more patients having only a small amount of residual deficit.


[Fig. 30.8: Complex decision tree incorporating the use of CT scans in decision making for stroke; the probabilities have been omitted for clarity. After a stroke, the choice is to test or not to test. A negative CT may be a true negative (no bleed, P = NPV) or a false negative (bleed, P = 1 − NPV); a positive CT may be a true positive (bleed, P = PPV) or a false positive (no bleed, P = 1 − PPV). Each branch leads to thrombolytics or standard therapy, with the outcomes cure (U = 1), chronic residual disability (U = U_x), and death (U = 0). The "don't test" branch has the same options as above, with different probabilities.]

On the other hand, thrombolytic treatment may result in more cases with increased residual deficit or death, both unsatisfactory outcomes. This can occur even if a "cure" is obtained in a few more patients in the thrombolytic group. All of these outcomes must be included to make this a more realistic model of the situation. Finally, any decision analysis must include a reasonable "time horizon" over which the outcomes are assessed.

Computers can be used to show patients how their personal values for each outcome will change the expected value of each treatment. Computer programs have been developed to assist patients in making difficult decisions about whether or not to have prostate cancer screening and what options to take if the screening test is positive, but they are not yet commercially available and are currently used only in research programs. This is clearly a direction for future research in decision-making theory. The development of user-friendly computerized interfaces will help improve the quality of patient decisions. This will never make the doctor obsolete: the doctor must continue to be able to educate his or her patients about the consequences of each action and to describe the objective reality of each disease state and the treatment options, so that the patient can make appropriate decisions about the utility they want to assign to each outcome. In short, the role of the health-care provider is to give patients the facts and probabilities of the outcomes and to help them decide on their utility for each outcome.

Threshold approach to decision making

Earlier, in Chapter 26, we talked about the treatment and testing thresholds. The threshold approach to testing and treatment can use decision trees to determine when diagnostic testing should be done. Consider the situation of a patient complaining of shortness of breath in whom you suspect a pulmonary embolism, or blood clot in the lungs. Should you order a pulmonary angiogram, a test in which dye is injected into the pulmonary arteries? The test itself is very uncomfortable, causes some complications, and can rarely cause death. There are basically three options:

(1) Treat based on the clinical examination and give the patient an anticoagulant without doing the test. Do this if the probability of disease is above the treatment threshold.

(2) Test first and treat only if the test is positive. Do this if the probability of disease lies between the two thresholds.

(3) Neither test nor treat if one is very certain that the disease is not present. Do this if the probability of disease is below the testing threshold.

The treatment threshold is the probability of disease above which a physician should initiate treatment for the disease without first testing for it. Above this level, testing would produce an unacceptable number of false negatives, and those patients would be denied the benefits of treatment. "The pretest probability of disease is so great that I will treat regardless of the results of the test."

The testing threshold is the probability of disease above which a physician should test before initiating treatment for that disease. Below this probability, there would be an unacceptable number of false positives, and those patients would be unnecessarily exposed to the side effects of treatment. "The pretest probability of disease is so small that I will not treat even if the test is positive."

If the post-test probability of disease after a positive test, the positive predictive value, is still below the treatment threshold, don't start treatment; it may take another test to decide whether the patient has the disease or not. If the post-test probability after a negative test, the false reassurance rate, falls below the testing threshold, it was a worthwhile test and the patient does not need treatment. The test took the probability of disease from a value at which testing should precede treating to one at which neither treatment nor further testing is beneficial. In essence, this means that the disease has been ruled out. Decision trees are another way to determine the cutoffs for testing and treating.



In order to complete the decision tree for our example of thrombolytic therapy and stroke, the posterior probability that an intracranial bleed has occurred when the CT scan has been read as negative must be known. This requires knowing the sensitivity and specificity of the CT scan and the prevalence of intracranial bleeding. If the post-test probability of a bleed is low, a worsening bleed is very unlikely with t-PA, making thrombolytic therapy more beneficial and, conversely, standard medical therapy less beneficial. If the post-test probability of a bleed is high, standard treatment is likely to be better, since thrombolytic therapy is more likely to lead to increased bleeding in the brain.

Both of the thresholds depend on prevalence, the pretest probability; here the "disease" being tested for is an intracranial bleed. At a low pretest probability, even a positive CT ought not to make a difference, since many positives would be false positives: you shouldn't do the test at all, because you are more likely to get a false positive and unnecessarily withhold thrombolytic therapy from someone who would benefit. At a high pretest probability, even a negative CT ought not to make a difference, since many negatives would be false negatives: you shouldn't do the test at all, because you are more likely to get a false negative and give thrombolytic therapy to someone with a bleed. An example of a low pretest probability would be a person with known atrial fibrillation, not on anticoagulants, who had a sudden onset of severe left hemiparesis without a headache. Changing one fact of this pattern would change the probability of a bleed and the final decision. Even so, the consequence of giving thrombolytic therapy to someone with a bleed makes the CT worthwhile, since treating anyone who truly has a positive scan would result in a real tragedy.

At a high pretest probability, the clinical picture is so strong that the test shouldn't be done at all, since a false negative is much more likely than a true negative and would lead to treating someone with a potential bleed. An example would be someone with a sudden onset of the worst headache of their life whose only deficit is slight weakness of the non-dominant hand. Here the potential for giving thrombolytic therapy to someone with a bleed is too high and the projected benefit not great enough.

Mathematical expression of the threshold approach to testing

There are formulas for calculating these thresholds, but please don't memorize them.

Test threshold = [(FP rate × risk of inappropriate Rx) + risk of test] / [(FP rate × risk of inappropriate Rx) + (TP rate × benefit of appropriate Rx)]

Treatment threshold = [(TN rate × risk of inappropriate Rx) − risk of test] / [(TN rate × risk of inappropriate Rx) + (FN rate × benefit of appropriate Rx)]
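For readers who want to see the formulas in action, here is a minimal sketch; the test characteristics and the risk and benefit figures are assumed numbers chosen only to illustrate the arithmetic.

    def test_threshold(fp, tp, risk_rx, benefit_rx, risk_test):
        # [(FP)(risk of inappropriate Rx) + risk of test] /
        # [(FP)(risk of inappropriate Rx) + (TP)(benefit of appropriate Rx)]
        return (fp * risk_rx + risk_test) / (fp * risk_rx + tp * benefit_rx)

    def treatment_threshold(tn, fn, risk_rx, benefit_rx, risk_test):
        # [(TN)(risk of inappropriate Rx) - risk of test] /
        # [(TN)(risk of inappropriate Rx) + (FN)(benefit of appropriate Rx)]
        return (tn * risk_rx - risk_test) / (tn * risk_rx + fn * benefit_rx)

    # Assumed example: sensitivity 0.90 (TP), specificity 0.80 (TN),
    # so FP = 0.20 and FN = 0.10; risks and benefits on a 0-1 utility scale.
    print(test_threshold(fp=0.20, tp=0.90, risk_rx=0.05,
                         benefit_rx=0.30, risk_test=0.005))       # ~0.05
    print(treatment_threshold(tn=0.80, fn=0.10, risk_rx=0.05,
                              benefit_rx=0.30, risk_test=0.005))  # 0.50

With these assumed numbers, testing is worthwhile when the pretest probability lies roughly between 5% and 50%: below the lower threshold, neither test nor treat; above the upper one, treat without testing.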


[Fig. 30.9: Markov model schematic, with states for mild symptoms, major symptoms, disability, and death. From F. A. Sonnenberg & J. R. Beck. Markov models in medical decision making: a practical guide. Med. Decis. Making 1993; 13: 322–338.]

Determining the risks and benefits of incorrect diagnosis will set these thresholds. A false positive test resulting in the unnecessary use of risky tests or treatments, such as cardiac catheterization or cardiac drugs, and a false negative test resulting in unnecessarily withholding beneficial tests or treatments are both adverse outcomes of testing. You can substitute different values of the test characteristics, different positive and negative predictive values, and different values of the benefit and risk of treatment in a sensitivity analysis of the decision tree and determine what effect these changes will have on the utility of each treatment arm.

Markov models

Another method of performing a decision analysis is the use of Markov models. These consider the simultaneous interaction of all possible health states; a patient can be in only one health state at a time. The difficulty with these models is that there must be some data on the average time a given patient spends in each health state. This time is then weighted by the quality of life for each state.

Ovals are states of health associated with quality measures such as death (U = 0), complete health or cure (U = 1), and other outcomes (U varies from 0 to 1). Arrows are transitions between states or within a state and carry probabilities, the likelihood of changing states or remaining in the same state. This type of model is ideal for putting into a computer to get the final expected values. A Markov model of health decision making is diagrammed in Fig. 30.9.
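As an illustration, here is a minimal sketch of a Markov cohort simulation in the spirit of Fig. 30.9; the four states come from the figure, but the per-cycle transition probabilities and the state utilities are assumed for the example.

    # Markov cohort model with the states of Fig. 30.9. The one-year
    # transition probabilities and state utilities below are assumed.
    states = ["mild symptoms", "major symptoms", "disability", "death"]
    P = [
        [0.70, 0.20, 0.05, 0.05],  # from mild symptoms
        [0.10, 0.60, 0.20, 0.10],  # from major symptoms
        [0.00, 0.00, 0.80, 0.20],  # from disability
        [0.00, 0.00, 0.00, 1.00],  # death is an absorbing state
    ]
    utility = [0.9, 0.6, 0.3, 0.0]

    cohort = [1.0, 0.0, 0.0, 0.0]  # everyone starts with mild symptoms
    qalys = 0.0
    for _ in range(30):  # 30-year time horizon
        qalys += sum(c * u for c, u in zip(cohort, utility))
        cohort = [sum(cohort[i] * P[i][j] for i in range(4)) for j in range(4)]
    print(f"Expected QALYs over 30 years: {qalys:.2f}")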

Ethical issues

Finally, there are significant ethical issues raised by the use of decision trees and expected-values decision making. After performing a decision-tree analysis, one must place ethical values on the decisions; issues of morality and fairness must be considered. When there are limited resources, is it more just to spend a large amount of resources for a small gain? Is a small gain defined as one affecting only a few people, or as one having only a small health benefit? Some of these questions can be answered using cost-effectiveness analyses and will be covered in the next chapter.

The use of a decision tree in making medical decisions can help the patient, provider, and society decide which treatment modality will be most just. Look for treatments that benefit the most people or have the largest overall improvement in health outcome. Ethical problems arise when a choice has to be made about whether to consider the best outcome from the perspective of a large population or of the individual patient. If we take the perspective of the individual patient, how are we to know that the treatment will benefit that particular patient, the next patient, or the next 20 patients? Should we use the perspective of statistical significance (P < 0.05), or is it fairer to use the NNTB? Is the decision up to each individual, or should it be legislated by society?

Decision trees allow the provider, society, and the patient to decide which therapy is going to be the most beneficial for the most people. Whether decision trees are a mathematical expression of utilitarianism is a hotly debated issue among bioethicists.

Siegler's schema (Table 30.1) is useful for applying these models to medical and ethical decision making. The basic perspectives of medical care within the traditional patient–physician relationship include medical indications, which are physician-directed, and patient preferences, which are patient-driven. Both of these are input variables in the decision tree. Current or added perspectives modify the decision and include quality of life, which considers the impact of high-technology interventions on the individual, and contextual features, which are cultural, societal, family, religious or spiritual, community, and economic factors. These are all part of the discussion between the provider and the patient and form the basis of the provider–patient relationship.

Assessing patient values

Patient values must be incorporated into medical decision making and health-care policies by providers, government, managed care organizations, and other decision makers. The output of decision trees is variable and is ultimately based on patient preferences. We can measure and quantify patient values and use them in decision trees to help patients make difficult decisions.

Unadjusted life expectancy, or life years, cannot compare various states of health that carry the same number of years of life, because it does not quantify the quality of those years. Quality-of-life scales, measures of status rated by others or by the patients themselves, include health status, functional status, well-being, and patient satisfaction. Common scales are the Activities of Daily Living (ADL) and the Arthritis Activity Scale used in rheumatoid arthritis.


Table 30.1. Siegler's schema for ethical decision making in medicine

Ethical concern: MEDICAL INDICATIONS. What is the best treatment? What are the alternatives?
Ethical principle: BENEFICENCE, the duty to promote the good of the patient.

Ethical concern: PATIENT PREFERENCES. What does the patient want? What outcome does the patient prefer?
Ethical principle: AUTONOMY, respect for the patient's right to self-determination.

Ethical concern: QUALITY OF LIFE. What impact will the proposed treatment, or the lack of it, have on the patient's life?
Ethical principle: NON-MALEFICENCE, the duty not to inflict harm or injury.

Ethical concern: SOCIOECONOMIC ISSUES (CONTEXTUAL FEATURES). What does the patient want within their own socioeconomic milieu? What are the needs of the patient's society?
Ethical principle: JUSTICE, the patient is given what is their "due".

Source: From A. R. Jonsen, M. Siegler & W. J. Winslade. Clinical Ethics. 3rd edn. New York: McGraw-Hill, 1992. pp. 1–10.

[Fig. 30.10: Linear rating scale, anchored at 0 (death) and 1 (normal). Simply measure the patient's mark on the scale as a percentage of the entire length of the scale.]

These scales are difficult to use in a quantitative manner. This discussion will therefore present several standardized quantitative measures of patient preference that can be used to measure the relative preference a patient has for one outcome or another.

The linear-rating-scale method uses a 10-cm visual analog scale (VAS) with one end being zero, or death, and the other end one, or a completely healthy life (Fig. 30.10). The patient is asked: "Where on this scale would you rate your life if you had to live with this chronic disease?" In the t-PA in stroke example, that would be the residual neurological deficit from the stroke syndrome. The resulting value of U is the position of the patient's mark as a proportion of the total length of the line.

The time trade-off method for this example asks: "Suppose you have 10 years left to live with chronic residual neurological disability from the stroke. If you could trade those 10 years for x years without any residual neurological deficit, what is the smallest number of years x you would accept to be deficit-free?" The utility of the disabled state is then x/10. Since it is a direct question, there is a lot of variability in the answers between patients.


[Fig. 30.11: Standard gamble. The sure thing is living with the chronic condition (utility U_CND); the gamble leads to death (U = 0) with probability P_di or to cure, no stroke (U = 1), with probability 1 − P_di.]

The standard-gamble or utility method attempts to find out how much risk the patient is willing to take. The patient is told to consider an imaginary situation in which they are given a pill that will instantly cure their stroke but occasionally causes instant, painless death. If the pill gave 100% cure and 0% death, every patient would always take it. On the other hand, if it gave 0% cure and 100% death, no one would ever take it unless they were extremely depressed and considered their life totally worthless. Continue to change the cure-to-death ratio until the person cannot decide which course of action to take: this is the point of indifference. "At what level of risk would you be indifferent between the two outcomes?" In our stroke example, the sure thing is chronic residual neurological deficit and the gamble is no deficit or death. Set up a "mini decision tree" and solve for the utility of living with the chronic neurological deficit. This is diagrammed in Fig. 30.11, where P_di is the probability of death at the point of indifference, the information learned from the standard gamble:

U_CND = (0 × P_di) + (1 − P_di) × 1, or U_CND = 1 − P_di.

This is the value of living with a chronic stroke syndrome that the patient assigns as an outcome through the standard gamble.
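All three elicitation methods reduce to simple arithmetic, summarized in this minimal sketch; the example inputs (a 7.2-cm mark on the VAS, trading 10 disabled years for 8 healthy years, indifference at a 10% risk of death) are assumed values.

    def vas_utility(mark_cm, scale_cm=10.0):
        # Linear rating scale: utility = position of the mark / scale length
        return mark_cm / scale_cm

    def tto_utility(healthy_years, disabled_years):
        # Time trade-off: U = x / t at the point of indifference
        return healthy_years / disabled_years

    def sg_utility(p_di):
        # Standard gamble: U_CND = 1 - P_di
        return 1 - p_di

    print(vas_utility(7.2))    # 0.72
    print(tto_utility(8, 10))  # 0.8
    print(sg_utility(0.10))    # 0.9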

QALYs are the units of a standardized measure that combines quality of life and life expectancy, and they are the output measure most commonly used in decision analyses. A decision analysis can determine how many QALYs result from each strategy. The QALY total is determined by taking the normal life expectancy and multiplying it by the patient's value, or utility, of one year of life in the given health state.
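For example, with a remaining life expectancy of 20 years and a utility of 0.9 for life with the residual deficit (the figure used earlier in this chapter), the outcome is worth 20 × 0.9 = 18 QALYs.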

Different values will be obtained from each method used to measure patient values: the linear rating scale measures the quality or functionality of life, the time trade-off introduces a choice between two certainties, and the standard gamble introduces probability and the willingness to take risks into the equation.

Attitudes toward risk and framing effects

Attitudes toward risk vary among individuals and at different periods of time during their lives. Patient values can be related to special events such as the birth of a child or marriage, to habits such as smoking or drinking, or to age. The length of time involved in the trade-off will differ when asked of a younger or an older person, since a younger person may be less willing to trade off years. Personal preferences related to the amount of risk a person is generally willing to take in other activities, such as sky-diving, also play a role in determining patient values. Since values tend to be very personal, providers should not be the ones to assign them: values based on the provider's own risk-taking behavior will not accurately measure the values of the patient.

How the questions are worded or framed will influence the answers. Asking what probability of death a patient is willing to accept will likely give a lower number than asking what probability of survival they are willing to accept. The framing of the questions may reflect the risk-taking attitude of the provider. A patient is more likely to prefer a treatment if told that 90% of those treated are alive 5 years later than if told that 10% are dead after the same time period, even though the outcomes are exactly the same. The feelings aroused by the idea of death are more likely to lead to the rejection of an option framed from the perspective of death, when the same option would be endorsed under the opposite framing of the choice, the perspective of survival. Although apparently inconsistent and irrational, this effect is a recurring phenomenon. This irrationality is not due to lack of knowledge, since physicians respond no differently than non-physician patients.

Probability means different things to different people. This is related to how individuals relate to numbers and how well people understand probabilities. In general, people, including physicians and other health-care providers, do not understand probabilities very well. Physicians tend to give qualitative rather than quantitative expressions of risk, in many different and ambiguous ways. For example, what does a "rare risk" of death mean? Does it mean 1% of the time or one in a million? From the patient's perspective, a rare event happens 100% of the time if it happens to them.

Finally, patient values change when they have the disease in question as opposed to when they do not. Patients who are having a stroke are much more willing to accept moderate disability than well persons who are asked about the abstract notion of disability were they to have a stroke. This means that stroke patients assign a higher utility (U) to residual deficit than well people asked in the abstract. Most clinical studies of these issues now being done have quality-of-life and patient-preference measures attached to the possible outcomes. They should help clarify the effects of variations in patient values on the outcomes of decision trees.


31

Cost-effectiveness analysis

When gold argues the cause, eloquence is impotent.

Publilius Syrus (first century BC): Moral Sayings

Learning objectives

In this chapter you will learn:
- the process of evaluating an article on cost-effectiveness
- the concepts of marginal cost and marginal benefit
- how to use these tools to help make medical decisions for a population
- how to calculate a simple cost-effectiveness problem and evaluate the cost-effectiveness of a specific therapy

The cost of medical care is constantly rising. The health-care provider of the future will seek to use the most cost-efficient methods to care for her or his patients. Cost-effectiveness analysis can be used to help choose between treatment options for an individual patient or for large populations. Governments and managed care organizations use cost-effectiveness techniques to justify their coverage of various health-care "products." Drug companies often produce cost-effectiveness studies to show that their more expensive drugs are actually cheaper in the long run, by being cheaper to administer or by saving future health-care costs. Health-care providers, policy makers, and insurance-plan administrators must be able to evaluate the validity of these claims through the critical analysis of cost-effectiveness studies.

How do we decide if a test or treatment is worth it?

If one treatment costs less and is clearly more effective than the alternative option, there is no question about which treatment to use. Similarly, if the treatment costs more and is clearly less effective, there will also be no question about which to use: treating with the more effective modality serves the patient and saves money in the process. More often than not, however, the situation arises in which one therapy costs much more and is marginally more effective than a much less expensive therapy, or the converse, where one therapy is clearly less effective but is also less expensive. Cost-effectiveness analysis gives us the data to answer the questions "how much more will this extra effectiveness cost?" and "how much more will use of the less effective therapy ultimately cost?"

This is a serious ethical issue for society and relates to a concept called opportunity costs. If one very expensive treatment is beneficial for a few people and we decide to pay for it, we may be unable to afford other equally or more effective treatments that could help many more people. There is only so much money to go around, and you can't spend the same dollar twice! If we fund bone marrow transplants for questionably beneficial indications, we may not be able to pay for hypertension screening leading to treatment that could prevent the need for certain other high-cost therapies, like kidney or heart transplants, in the future. A bone marrow transplant may prolong one life by 6 years, yet result in loss of funds for hypertension screening and treatment programs that could prevent six new deaths from uncontrolled hypertension in the same period. Cost-effectiveness analysis should be able to tell us whether the cost of a new therapy is "worth it," or whether we should instead be paying for some other, cheaper, and possibly more effective therapy.

Cost rationing has always been a contentious issue in medicine. The wealthy can get any medical procedure done regardless of efficacy or cost, while the poor must wait for available services. This is known as de-facto rationing and is manifested by long waiting times in a municipal hospital emergency department, or for an appointment to be examined by a specialist or to get diagnostic studies done. In the United States, there may be reduced availability of certain drugs to patients in some managed care organizations, on Medicaid, and certainly to uninsured working people who cannot afford to pay for a drug out of their own pockets. The State of Oregon used a type of cost-effectiveness analysis to decide which services the state Medicaid program should cover. We are constantly making value judgments about how we as a society will spend our money. The ethical issues will be left to the politicians and ethicists to discuss; this chapter will present the tools needed to evaluate studies of cost-effectiveness.

Cost-effectiveness studies can be very complex to evaluate. On the most basic level, they simply add up all the costs of a particular procedure, subtract from them the cost of the comparison procedure that is in current use, and divide by the benefit, usually the number of additional QALYs obtained by using the new procedure. These are the same QALYs that were calculated in the previous chapter on expected-values decision making. However, the manner in which the analysis is set up will have an enormous impact on the kind of result obtained. It is difficult to do a good and fair cost analysis and relatively simple to do a bad and often biased one. It is therefore up to the reader to apply a few simple rules when reading a cost analysis. If these rules are followed, you can be fairly sure the analysis is relatively fair and usually valid.

Guidelines for assessing an economic analysis of clinical care

Was a broad enough viewpoint adopted?

Is there a specified point of view, whether of a hospital, a health-insurance entity, a ministry of health, or preferably society as a whole, from which the costs and effects are being viewed?¹ The viewpoint should reflect who is paying for the treatment and who is affected by the decision of what to treat and not treat. Often these studies compare usual fee-for-service or third-party insurance against managed-care costs. However, the comparison may simply be of the costs of the treatments alone, without a specific viewpoint on who is paying for them or how much is being reimbursed.

There is a disconnect between costs and charges in health-care finances because of the large amount of uncompensated and negotiated care that is delivered. This must be considered in any economic analysis. Costs are the amount of money required to initiate and run a particular intervention. Charges are the amount of money that will be requested from the payors. It is disingenuous to use charges, since they always overestimate the costs. However, when using simple costs only, the cost of treating non-insured patients must be factored into the accounting.

The different programs being compared must be adequately described. It should be possible, from reading the article's methods, to set up the same program in any comparable setting. This requires a full description of the process of setting up the program, of its costs and effects, and of how these were measured.

Were all the relevant clinical strategies compared?

Does the analysis compare well-defined alternative courses of action? The comparison between treatment options must be specified. Typically, two treatment options, or treatment as opposed to non-treatment, are considered in a cost-effectiveness analysis. The treatment options ought to be those in common use by the bulk of physicians in a particular field, not just by fringe practitioners; using treatments that are no longer in common use will bias the result of the analysis.

¹ Adapted with permission from the User's Guide to the Medical Literature, published by JAMA (see Bibliography).

Was clinical effectiveness established?

The program's effectiveness should have been validated. There should be hard evidence from well-done randomized clinical trials showing that the intervention is effective, and this should be explicitly stated. Where this has not previously been done, a systematic review or meta-analysis should be performed as part of the analysis. A cost-effectiveness analysis should not be based on the assumption that because we can do something, it is good. If no RCT is available that addresses the relevant clinical question, observational studies can be used, but with the caveat that they are more prone to bias, especially from confounding variables.

Were the costs measured accurately?

Does the analysis identify all of the important and relevant costs and effects? Were credible measures selected for the costs and effects incorporated into the analysis? On the cost side this includes the actual costs of organizing and setting up a program and continuing its operations, additional costs to the patient and family, costs outside the health-care system such as time lost from work and decreased productivity, and intangible costs such as loss of pleasure or companionship. These costs must be compared both for carrying out the intervention program and for not doing the program but doing the alternatives.

On the effect side, the analysis should include "hard" clinical outcomes: mortality, morbidity, residual functional ability, quality and utility of life, and the effect on future resources. These include the availability of services and the future costs of health care and other services incurred by extending life. For example, it may be fiscally better to allow people to continue to smoke, since this will reduce their life span and save money on end-of-life care for those who die prematurely. This doesn't mean we should encourage smoking.

The error made most often in performing cost-effectiveness analyses is the omission of the opportunity costs referred to at the start of this chapter. If you pay for one therapeutic intervention, you may not be able to pay for another. Cost-effectiveness analyses must include an analysis of these opportunity costs so that the reader can see what equivalent types of programs might need to be cut from the health-care budget in order to finance the new, and presumably better, intervention. Analyses that do not consider this issue give a biased view of the usefulness of the new program and keep it out of the context of the most good for the greater society.


Table 31.1. Comparing inpatient vein stripping (IP stripping) to outpatient injection (OP injections) of varicose veins

    Treatment       Cost to hospital per patient (indexed)   No further treatment needed   Support stockings needed   Further treatment needed
    OP injections   9.77                                     78%                           9%                         13%
    IP stripping    44.22                                    86%                           11%                        3%

What is the resulting cost, or cost per unit health gained, and is this gain impressive?

The marginal or incremental gain in both costs and effects should be calculated. First, the degree of risk reduction is determined. On a superficial basis, a very simple way to do a quick cost-effectiveness analysis is with the number needed to treat to benefit (NNTB). This is the number of patients you must treat in order to achieve the desired effect in one additional patient, and it is the inverse of the attributable risk reduction (ARR) between the two therapies. This is compared to the marginal cost of the better treatment to get a cost-effectiveness estimate.

For example, in the GUSTO trial of thrombolytic therapy for myocardial infarction, a difference in outcomes was found when t-PA was used instead of streptokinase: t-PA at $2000/dose resulted in 6.5% mortality, while streptokinase at $200/dose resulted in 7.5% mortality. The ARR is the difference between the two, or 1%. The NNTB for t-PA is 100 (1/ARR), which is how many patients must be treated with t-PA instead of streptokinase to prevent one additional death. The marginal or incremental cost per life saved is then $180 000 [($2000 − $200) × 100 patients treated].
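This back-of-the-envelope calculation is easy to express in code; a minimal sketch using the GUSTO figures quoted above:

    def cost_per_life_saved(mort_old, mort_new, cost_old, cost_new):
        # ARR = mortality difference; NNTB = 1/ARR; the marginal cost per
        # life saved is the extra cost per patient times the NNTB.
        arr = mort_old - mort_new
        nntb = 1 / arr
        return (cost_new - cost_old) * nntb

    # From the text: streptokinase $200/dose with 7.5% mortality,
    # t-PA $2000/dose with 6.5% mortality.
    print(cost_per_life_saved(0.075, 0.065, 200, 2000))  # ~180000 dollars per life saved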

The prices used to calculate costs should be appropriate to the time and place. Using US dollars in studies of Canadian health-care resources will not produce a credible cost analysis. The effects measured should include lives or years of life saved, improvement in level of function, or the utility of the outcome for the patient.

There are several different ways to analyze costs and effects. In a cost-minimization analysis, only costs are compared. This works if the effects of the two interventions are equal or minimally different. For example, when comparing inpatient vein stripping to outpatient injection of varicose veins, the results shown in Table 31.1 were obtained. Here the cost difference is so large that even if 13% of outpatients require additional hospitalization (and therefore we must pay for both procedures), money will still be saved by performing outpatient injections. We are assuming that the end results are similar in both groups.
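As a quick check on that reasoning, here is a minimal sketch of the expected cost per patient; it assumes, for illustration only, that patients needing further treatment go on to inpatient stripping (the table does not specify what the further treatment is).

    # Indexed costs and further-treatment rates from Table 31.1.
    cost_op, cost_ip = 9.77, 44.22
    expected_cost_op = cost_op + 0.13 * cost_ip  # 13% of OP patients re-treated
    expected_cost_ip = cost_ip + 0.03 * cost_ip  # 3% of IP patients re-treated (assumed)
    print(f"OP injections: {expected_cost_op:.2f}, IP stripping: {expected_cost_ip:.2f}")
    # -> roughly 15.52 vs. 45.55: outpatient injection still costs far less.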


Table 31.2. Comparing doxycycline to azithromycin for Chlamydia infections

    Treatment      Cost to hospital per patient   No further treatment needed   Adverse effects   Compliance rate
    Doxycycline    3                              77%                           29%               70%
    Azithromycin   30                             81%                           23%               100%

Source: Data extracted from A. C. Haddix, S. D. Hillis & W. J. Kassler. The cost effectiveness of azithromycin for Chlamydia trachomatis infections in women. Sex. Transm. Dis. 1995; 22: 274–280.


Another analysis compared doxycycline 100 mg twice a day for 7 days to azithromycin 1 g given as a one-time dose for the treatment of Chlamydia infections in women (Table 31.2). It found that some patients do not complete the full 7-day course of doxycycline and then need to be retreated, and that they can infect other people during that period. The cost of azithromycin at which the drug would become cost-effective for all patients can then be calculated. In this case, the drug company making azithromycin actually lowered its price for the drug by over 50% based on that analysis, to a level that made azithromycin more cost-effective.

In a cost-effectiveness analysis, the researcher seeks to determine how much more has to be paid in order to achieve a benefit such as preventing death or reducing time spent with disability. Here the effects are unequal, and all outcomes must be compared: costs, well years, total years, and utility or benefits. The outcome is expressed as the incremental or marginal cost over the incremental benefit, in units such as additional dollars per QALY or per life saved.

The first step in a cost-effectiveness analysis is to determine the difference in the benefits or effects of the two treatment strategies or policies being compared. This gives the incremental or marginal gain, expressed in QALYs or other units of utility, and is done using an expected-values decision analysis as described in Chapter 30. It is possible that one of the tested strategies may have a relatively small benefit and yet be more cost-effective overall than other therapies that, although slightly more effective, are very much more expensive.

Next, the difference in cost of the two treatment strategies or policies must be determined, giving the incremental or marginal cost. The cost-effectiveness is the ratio of the incremental cost to the incremental gain. Consider the example of two strategies, A and B. With the first (A), the quality-adjusted life expectancy is 15 QALYs and the cost per case is $10 000. With the second (B), the life expectancy is 20 QALYs, a definite improvement, but at a cost of $110 000 per case. The cost-effectiveness of B as compared to A is the difference in cost divided by the difference in effects: ($110 000 − $10 000)/(20 − 15) = $20 000 per QALY gained. Note that if the more effective treatment had also cost less, you should obviously use it unless it has other serious drawbacks such as known serious side effects. Calculate this ratio only when the more effective treatment strategy or policy is also more costly.
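A minimal sketch of this incremental cost-effectiveness ratio (ICER) calculation, including the dominance check just described, using the numbers from the A-versus-B example:

    def icer(cost_a, qaly_a, cost_b, qaly_b):
        # If one strategy is both cheaper and at least as effective,
        # it dominates and no ratio is needed.
        if cost_b <= cost_a and qaly_b >= qaly_a:
            return "B dominates A: use B"
        if cost_a <= cost_b and qaly_a >= qaly_b:
            return "A dominates B: use A"
        return (cost_b - cost_a) / (qaly_b - qaly_a)  # dollars per QALY gained

    print(icer(cost_a=10_000, qaly_a=15, cost_b=110_000, qaly_b=20))  # 20000.0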

Are the conclusions unlikely to change with sensible changes in costs and outcomes?

Since most research on a given therapy is done at different times, changes over time must be accounted for. This process is called discounting, and it considers inflation and depreciation. It takes into account that inflation occurs and that, instead of paying for a program now, those funds can be invested and other funds used to pay for solving the problem later. For example, you can pay $200 a year for 10 years or $2000 in 10 years; future costs are usually expressed in current dollars, since $200 in the future is equivalent to less than $200 today. The actuarial and accounting methods used should be specified in the methods section of the analysis.
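For illustration, a minimal sketch of discounting a future cost stream into present-value dollars; the 3% annual discount rate is an assumed figure, since the chapter does not specify one.

    def present_value(amount, years, rate=0.03):
        # Value today of a cost incurred `years` from now at the given
        # annual discount rate (assumed 3% here).
        return amount / (1 + rate) ** years

    # $200 a year for 10 years, each payment discounted to today:
    stream = sum(present_value(200, year) for year in range(1, 11))
    # versus $2000 paid once, 10 years from now:
    lump = present_value(2000, 10)
    print(f"stream: {stream:.2f}, lump sum: {lump:.2f}")  # ~1706 vs. ~1488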

Setting up a program is usually a greater cost than running it, and initial costs are usually amortized over several decades. Discounting the value side of the equation recognizes that the value of a year of life saved now may be greater than that of a year saved later: adding a year of life to someone now, at age 40, may mean more to them than adding a year of life that they only receive after reaching the age of 60. This was considered in the discussion of patient preferences and values in Chapter 30.

As with any other clinical research study, the numbers used to perform the analysis are only approximations and have 95% confidence intervals attached. Therefore, a sensitivity analysis should always be done to check the assumptions made in the analysis. This is a process by which the results of the analysis are recalculated using reasonable changes in costs or effects, based upon the 95% CI values. Suitable graphs can demonstrate the change in the overall cost-effectiveness resulting from changes in one or more parameters. If the cost curve is relatively flat, a large change in a baseline characteristic does not produce much change in the cost-effectiveness of the intervention.

Are the estimates of the costs and outcomes appropriately related to the baseline risk in the population?

There may be various levels of risk within the population, and what is cost-effective for one subgroup may not be cost-effective for another. The study should attempt to identify these subgroups and perform individual cost-effectiveness analyses for each of them. For example, in looking at the cost-effectiveness of positive inotropic agents in the treatment of heart failure, it may be that their use is cost-effective for severe heart failure but not for less severe cases. The use of beta-blocker drugs in heart failure has been studied, and the cost-effectiveness is much greater when the drug is used in high-risk patients than in low-risk patients; however, it is above the usual definition of the threshold for saving a life in both circumstances.

Final comments: ethical issues

How much are we willing to spend to save a life? What is an acceptable cost per QALY gained? A commonly accepted figure in the decision-analysis literature is $40 000 to $60 000 per QALY, approximately the cost of maintaining a person on dialysis for 1 year. This number has increased only slightly over the past 40 years, since renal dialysis has become more common although more expensive. In the United Kingdom, the National Institute for Health and Clinical Excellence (NICE) considers the threshold of cost-effectiveness to be between £20 000 and £30 000 per QALY.

There are multiple ethical issues involved in the use of cost-effectiveness analyses. The provider is being asked to side with the option that will cost the least, or at least be the most cost-effective, and this may not be the best option for each patient. Cost-effectiveness analyses are really more useful as political tools for making decisions on coverage by insurance schemes than for daily use in bedside clinical decision making.

There are some cases in which the most cost-effective option is also the best thing to do for the individual patient. Universally, these situations occur when the best practice is also the cheapest. One example is the use of antibiotics for treating urethral Chlamydia infections, mentioned earlier. More importantly, since most physicians cannot follow all the issues involved in cost-effectiveness analyses when these come up in health-policy areas, they should turn to agencies that perform such analyses on a regular basis: the AHRQ in the United States and NICE in the United Kingdom. Pharmaceutical, medical-instrument, and device manufacturers, and some specialty physicians, often try to assert that their service, product, or procedure is the best and most cost-effective because, although more expensive now, it will lead to savings later. This can occur because of the "spin" put on their cost-effectiveness analysis. Picking up the inconsistencies and omissions in a cost-effectiveness analysis is very difficult; however, most physicians ought at least to be able to understand the analysis and the subsequent comments made by people who are more highly trained in evaluating this type of study. Recognizing the presence or absence of conflict of interest in these commentaries is of the utmost importance.


One current debate is over the use of chest-pain evaluation units (CPEU) in the Emergency Departments (ED) of acute care hospitals. These units are for patients who are at low risk of having a myocardial infarction and for whom a 48-hour stay in an intensive care unit is very expensive and probably unnecessary. In this discussion, it is assumed that discharge home from the ED is not safe, as up to 4% of acute MIs are missed by emergency physicians. Proponents of these CPEUs point out that a lot of money will be saved if these low-risk patients are put into the CPEU rather than into an acute-care hospital bed. They have done cost-effectiveness analyses showing only a slight overall increase in costs under the assumption of the current hospital admission rate for these patients. However, if all the extremely low-risk patients, including those who have virtually no risk, are now admitted to the CPEU, the overall admission rate may actually increase, resulting in markedly increased costs. Clearly there must be a search for some other method of dealing with these patients that will be cost-effective and will result in decreased hospital-bed utilization. The methods of cost-effectiveness analysis must look at all eventualities.


32

Survival analysis and studies of prognosis

He ended; and thus Adam last replied:
How soon hath thy prediction, seer blest,
Measured this transient world, the race of time,
Till time stand fixed! Beyond is all abyss,
Eternity, whose end no eye can reach.

John Milton (1608–1674): Paradise Lost

Learning objectives

In this chapter you will learn:
- how to describe various outcome measures such as survival and prognosis of illness
- the ways outcomes may be compared
- the steps in reviewing an article that measures survival or prognosis

One of the most important pieces of information that patients want is to know what is going to happen to them during their illness. The clinician must be able to provide information about prognosis to the patient in all medical encounters. Patients want to know the details of the outcomes they can expect from their disease and its treatment, and evaluation of the clinical research literature on prognosis is a required skill for the health-care provider of the future. Outcome analysis looks at the interplay of three factors: the patient, the intervention, and the outcome. We want to know how long a patient with a given illness will survive if given one of two possible treatments, which can be two active therapies or a therapy and placebo. Studies of outcomes or prognosis should clearly define these three elements.

The patient: the inception cohort

To start an outcome study, an appropriate inception cohort must be assembled. This means a group of patients for whom the disease is identified at a uniform point in its course, called the inception. This can occur at the appearance of the first unambiguous sign or symptom of a disease or at the first application of testing or therapy. Ideally this point should be as early in the disease as possible; however, it should be at a stage where most reasonably prudent providers can make the diagnosis, and not sooner, since most providers won't be able to make the diagnosis and initiate therapy at an earlier stage of disease. Collecting the cohort after the occurrence of the outcome event and looking backward will distort the results, in either a positive or a negative direction, if some patients with the disease die before diagnosis or commonly have spontaneous remissions soon after diagnosis. A study of survival in patients with acute myocardial infarction who are studied from the time they arrive in the coronary care unit will miss those who die suddenly, either before seeking care or in the emergency department.

Incidence/prevalence bias can be a fatal flaw in a study if the inception cohort is assembled at different stages of illness, since this confuses new with ongoing cases of the illness, and patients at these various stages may have very different prognoses. Lead-time and length-time bias occurring as a result of screening programs should be avoided by proper randomization. These were discussed in detail in Chapter 28 on screening tests.

Diagnostic criteria, disease severity, referral pattern, comorbidity, and demographic<br />

details for inclusion of patients into the study must be specified. Patients<br />

referred from a primary-care center may be different than those referred from a<br />

specialty or tertiary-care center. Termed referral filter bias, this is due to an overrepresentation<br />

of patients with later stages of disease or more complex illness<br />

who are more likely to have poor results. Centripetal bias is another name for<br />

cases referred to tertiary-care centers because of the need for special expertise.<br />

Popularity bias occurs when the more challenging and interesting cases only are<br />

referred to the experts in the tertiary care center. The results of these biases limit<br />

external validity in other settings where most patients will present with earlier or<br />

milder disease.<br />

All members of the inception cohort should be accounted for at the end of the study and their outcomes known. This is much more important in these types of studies, since we really want to know all of the possible outcomes of the illness. There are non-trivial reasons why patients drop out of a study, including recovery, death, refusal of therapy due to the disease, side effects of therapy, loss of interest, and moving away. One study showed that patients who were harder to track and more likely to drop out had a higher mortality rate.

There are several rules of thumb for determining the effect of incomplete follow-up. First, identify the outcome of most interest to you and determine the fraction of patients who had this outcome. Then add the patients lost to follow-up to both the numerator and the denominator, which gives the result if all the patients lost had the outcome of interest. Next, add the patients lost to follow-up to only the denominator, giving the lowest result, as if no patient lost had the outcome of interest. Compare these two results. If they are very close to each other, the result is still useful; if not, the result of the study may be useless. In the example in Table 32.1, the difference in relapse rates is minor while the difference in mortality rates is quite large. As a general rule, the lower the rate of an outcome, the more likely it is to be affected by patients lost to follow-up.

Table 32.1. A study of 71 patients, 6 of whom were lost to follow-up

                  Original study    "Highest" case    "Lowest" case
Relapse rate      39/65 = 60%       45/71 = 63%       39/71 = 55%
Mortality rate    1/65 = 1.5%       7/71 = 10%        1/71 = 1.4%
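These bounds are easy to compute directly. The following minimal sketch (in Python; the function name is ours, and the figures are those of Table 32.1) shows the calculation:

    def follow_up_bounds(events, followed, lost):
        """Bound an outcome rate when some patients are lost to follow-up."""
        total = followed + lost
        observed = events / followed        # rate among patients actually followed
        highest = (events + lost) / total   # as if every lost patient had the outcome
        lowest = events / total             # as if no lost patient had the outcome
        return observed, highest, lowest

    # Relapse rate: 39/65 = 60%, bounds 63% and 55% (a minor difference).
    print(follow_up_bounds(39, 65, 6))
    # Mortality rate: 1/65 = 1.5%, bounds 10% and 1.4% (a large difference).
    print(follow_up_bounds(1, 65, 6))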

The intervention

There should be a clear and easily reproducible description of the intervention being tested. All details of a therapeutic program should be described in the study, so that the reader could duplicate the process of the study at another institution. The interventions tested or compared should be ones that could make a difference in practice. It is of paramount importance that the intervention proposed in the study be one that can be performed in settings other than only the most advanced tertiary-care setting. Similarly, testing a drug against placebo may not be as important or useful as testing it against the drug that is currently most favored for that indication. Most of these issues were discussed in Chapter 15 on randomized clinical trials.

The outcome

The outcome criteria should be objective, reproducible, and accurate. The outcome assessment should also be done in a blinded manner to avoid diagnostic-suspicion and expectation bias in the assessment of patient outcomes. Significant bias can be introduced into the study if the outcomes are not measured in a consistent manner. Ideally, the outcome measures should be unmistakably objective. Life and death are clear and easily measured outcome variables, although the cause of death as recorded on a death certificate is not always a reliable, clear, or objective measure of the actual cause of death. Admission to the hospital appears to be clear and objective, but the reasons or threshold for admission may be very subjective and subject to significant inter-rater variability. Outcomes such as "full recovery at home" or "feeling better" have a higher degree of subjectivity associated with them.

There should be adjustment for extraneous prognostic factors. The researcher should determine whether a prognostic factor is merely a marker or is actually responsible for causing the outcome. This determines whether there are alternative explanations for the outcomes due to some confounding variable. Count on the article being reviewed by a statistician who can determine whether the authors used the correct statistical analysis, but be aware that the adjustment for extraneous factors may not have been done correctly, if at all. If the authors suggest that a group of signs, symptoms, or diagnostic tests accurately predicts an outcome, look for a validation sample in a second study which attempts to verify that these results occurred because of a causal relationship and not just by chance. Look for at least 10, and preferably 20, patients who actually had the outcome of interest for each prognostic factor evaluated in order to give clinically and statistically significant results. Chapter 14 has a more detailed discussion of multivariate analysis.

Most often, outcomes are expressed as a dichotomous nominal variable (e.g., dead or alive, disease or no disease, a patent or occluded bypass, improved or worse, it works or it doesn't). One is interested in the association of an independent variable such as drug use, therapy, risk factor, diagnostic test result, tumor stage, age of patient, or blood pressure with the dependent or outcome variable.

Diagnostic-suspicion bias occurs when the physician caring for the patient knows the nature and purpose of the outcomes being measured and, as a result, changes the interpretation of a diagnostic test or the actual care or observation of the patient. Expectation bias occurs when the person measuring the outcome knows the clinical features of the case or the results of a diagnostic test and alters his or her interpretation of the outcome event. This is less likely when the intervention and outcome measures are clearly objective. Ideally, blinding the diagnosis, treatment, and assessment of all the patients going through the study will prevent these biases.

Another problem with the outcomes selected occurs when multiple outcomes are lumped together. More and more studies of therapy compare two groups for several outcomes at once; these so-called composite outcomes were discussed in greater detail in Chapter 11. Commonly used composite measures of heart therapies might include death, an important outcome; non-fatal myocardial infarction, important but less so than death; and the need for a revascularization procedure, much less important than death. The use of these measures can lead to over-optimistic conclusions regarding the therapy being tested. If each outcome were measured alone, none would reach statistical significance, possibly because of a Type II error; when combined, the multiple or composite outcomes may then show statistical significance.

One example is the CAPRIE trial comparing clopidogrel, an antiplatelet agent, against aspirin. The primary outcome measures were the overall number of deaths and the deaths due to stroke, myocardial infarction, or vascular causes; the definition of vascular causes was not made clear. The end result was that there were no decreases in deaths from stroke or myocardial infarction, but there was a 20% reduction in deaths among the patients with peripheral arterial disease. The absolute reduction was 1.09% (from 4.80% to 3.71%, giving an NNTB of 91). If these patient outcomes were considered as separate groups, the differences would not have been statistically significant. Another danger is that some patients may be counted several times because they have several of the outcomes. Finally, the clinical significance of the combined outcome is unknown.

There are basically three types of data that are used to indicate the risk of an outcome. Interval data, such as blood pressure, are usually considered to be normally distributed and are measured on a continuous scale. Nominal data, like tumor type or treatment options, are categorical and often dichotomous, like alive and dead or positive and negative test results. Ordinal data, such as tumor stage, are also categorical but with some relation between the categories. There are three types of analyses applied to this type of problem: frequency tables, logistic analysis, and survival analysis. Decision theory uses probability distributions to estimate the probability of an outcome, and a loss function measures the relative benefit or utility of that outcome.

Frequency tables

Frequency tables use a chi-square analysis to compare the association of the outcome with risk factors that are nominal or ordinal. For the chi-square analysis, data are usually presented in a table where the columns are outcomes, the rows are risk factors, and the frequencies appear as table entries. The observed data are compared with the data that would be expected if there were no association. The analysis results in a P value, which indicates the probability that the observed outcome could have been obtained by chance when it was really no different from the expected value. Fisher's exact test is used when the expected value of any cell is less than 5.
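As an illustration, a minimal sketch of such an analysis in Python, assuming the scipy package is available; the 2 × 2 table below is hypothetical:

    from scipy.stats import chi2_contingency, fisher_exact

    # Hypothetical table: rows = risk factor present/absent,
    # columns = outcome present/absent.
    table = [[20, 80],
             [10, 90]]

    chi2, p, dof, expected = chi2_contingency(table)
    print(f"chi-square = {chi2:.2f}, P = {p:.3f}")

    # If any expected cell count is below 5, use Fisher's exact test instead.
    if (expected < 5).any():
        odds_ratio, p = fisher_exact(table)
        print(f"Fisher's exact P = {p:.3f}")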

Logistic analysis

This is a more general approach to measuring outcomes than using frequency tables. Logistic regression estimates the probability of an outcome based on one or more risk factors. The risk factors may be interval, ordinal, or nominal variables. Results of logistic regression analysis are often reported as the odds ratio, relative risk, or hazard ratio. For one independent variable of interval-type data and relative risk, this method calculates how much of an increase in the risk of the outcome occurs for each incremental increase in the exposure to the risk factor. An example of this would answer the question "how much additional risk of stroke will occur for each increase of 10 mm Hg in systolic blood pressure?" For ordinal data, the analysis calculates the probability of an outcome based on the stage of disease, for example, the recurrence of a stage 4 compared with a stage 2 tumor.
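A minimal sketch of such a model, assuming Python with the statsmodels package; the blood pressures and stroke outcomes below are hypothetical:

    import numpy as np
    import statsmodels.api as sm

    # Hypothetical data: systolic blood pressure (mm Hg) and stroke (1 = yes).
    sbp = np.array([120, 135, 150, 165, 180, 125, 140, 155, 170, 185])
    stroke = np.array([0, 1, 0, 1, 1, 0, 0, 1, 0, 1])

    X = sm.add_constant(sbp)                 # intercept plus one interval predictor
    model = sm.Logit(stroke, X).fit(disp=0)

    beta = model.params[1]                   # change in log-odds per 1 mm Hg
    print("OR per 10 mm Hg rise:", np.exp(10 * beta))

Exponentiating 10 times the fitted coefficient gives the change in the odds of stroke per 10 mm Hg rise in systolic pressure, the kind of quantity described above.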

For multiple variables, is there some combination of risk factors that predicts an outcome better than any one risk factor alone? Which of these risk factors is the best predictor of that outcome? The identification of significant risk factors can be done using multiple regression or stepwise regression analyses, as discussed in Chapter 29 on clinical prediction rules.

Survival analysis

In the real world, the ultimate outcome is often not yet known: rather than "dead," a patient's status may be "not dead yet" or "so far, so good." It would be difficult to justify waiting until all patients in a study die so that survival in two treatment or risk groups can be compared. Another common problem with comparing survival between groups is deciding what to do with patients who are doing fine but die of an incident unrelated to their medical problem, such as death in a motor-vehicle accident of a patient who had a bypass graft 15 years earlier. This will alter the information used in an analysis of time to occlusion with two different types of bypasses. Finally, how should the study handle the patient who simply moves away and is lost to follow-up?

The situations described above are examples of censored data. The data consist of a time interval and a dichotomous variable indicating status, either failure (dead, graft occluded, etc.) or censored (not dead yet, success so far, etc.). In the latter case, the patient may still be alive, may have died but not from the disease of interest, or may have been alive when last seen but could not be located again.

A potential problem in these analyses is the definition of the start time. Early diagnosis may automatically confer apparently longer survival if the time of diagnosis is the start time. This is also called lead-time bias, as discussed in Chapter 28, and is a common problem with screening tests. Censoring bias occurs when one of the treatment groups is more likely to be censored than the other. If certain patients are lost as a result of treatment (e.g., harmful side effects), their chances of being censored are not independent of their survival times. A survival analysis initially assumes that any patient censoring is independent of the outcome. Figure 32.1 shows an example of the effects of censoring on a hypothetical study.

[Fig. 32.1 Censoring. Patients are enrolled in a study over a 2-year period (1975–1977) and all are followed until 1980; patients who die are marked with an x. Some patients (2 and 5) are enrolled at a late stage of their disease, and their inclusion will bias the cohort toward poorer survival. Two patients (4 and 6) are still alive at the end of the observation period. Patient 1 lived longer than everyone except patient 4, although it appears that patient 1 didn't live as long, since survival before enrollment (pre-1975) does not count in the analysis. We don't know how long patient 4 will live, since he or she is still alive at the end of the observation period, and the data are censored at t = 5 years. Two other patients (3 and 8) are lost to follow-up, and their data are censored early (o).]

Survival curves

The distribution of survival times is most often displayed as a survivor function, also called a survival curve. This is a plot of the proportion of subjects surviving versus time. It is important to note that "surviving" may indicate things other than actual survival (i.e., life vs. death), such as success of therapy (e.g., patent vs. non-patent coronary bypass grafts). These curves can be deceptive, since the number of individuals represented by the curve decreases as time increases. It is key that a statistical analysis be applied to the results of the curves at several time points, and the number of patients at each stage of the curve should also be given. The Kaplan–Meier curve is the one most commonly used.

[Fig. 32.2 Kaplan–Meier survival curve. Percent survival over time for a treatment group and a control group (P < 0.05 for the difference), with the numbers of patients remaining at risk shown along the time axis (treatment: 300, 200, 140, 80; control: 300, 150, 100, 20).]

There is one primary method for plotting and analyzing survival curves. The actuarial-life-table method measures the length of time from the moment the patient is entered into the study until failure occurs. The product-limit method is a graphic representation of the actuarial-life-table method and is also known as the Kaplan–Meier method. It is the plot of survival most commonly used in medicine. The analysis looks at the period of time (the month or year since the subject entered the study) in which the outcome of interest occurred. A typical Kaplan–Meier curve is shown in Fig. 32.2.
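The product-limit calculation itself is short. A minimal sketch in Python, using hypothetical follow-up times in months (status 1 = failure, 0 = censored):

    # Kaplan-Meier product-limit estimate from (time, status) pairs.
    data = [(2, 1), (3, 0), (6, 1), (6, 1), (7, 0), (10, 1), (15, 0), (16, 1)]

    survival = 1.0
    for t in sorted({time for time, status in data if status == 1}):
        deaths = sum(1 for time, status in data if time == t and status == 1)
        at_risk = sum(1 for time, status in data if time >= t)
        survival *= (at_risk - deaths) / at_risk   # step down at each failure time
        print(f"t = {t}: S(t) = {survival:.3f}")

Plotted as a step function, these S(t) values trace the familiar descending curve; censored patients contribute to the number at risk only until they leave observation.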

There are several commonly performed tests of the equality of these survivor functions or curves. One of the most popular is the Mantel–Cox, also known as the log-rank test. The Cox proportional-hazards model uses interval data as the independent variable and determines how much the hazard of the outcome is altered by each unit of change in that variable. This answers the question of how much the risk of stroke increases with each increase of 10 mm Hg in mean arterial blood pressure. Further discussion of survival curves and outcome analysis is beyond the scope of this book; two of the Users' Guides to the Medical Literature articles provide more detail.1,2
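As an illustration only, a sketch of the log-rank test assuming Python with the third-party lifelines package and two hypothetical groups of survival times:

    import numpy as np
    from lifelines.statistics import logrank_test

    # Hypothetical survival times (months) and event indicators (1 = died).
    t_treat = np.array([6, 9, 12, 14, 20, 24])
    e_treat = np.array([1, 0, 1, 1, 0, 0])
    t_control = np.array([3, 5, 7, 8, 11, 13])
    e_control = np.array([1, 1, 1, 0, 1, 1])

    result = logrank_test(t_treat, t_control,
                          event_observed_A=e_treat, event_observed_B=e_control)
    print("log-rank P =", result.p_value)

For a Cox model with systolic pressure as the predictor, exponentiating 10 times the fitted coefficient would give the hazard ratio per 10 mm Hg increase.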

1 A. Laupacis, G. Wells, W. S. Richardson & P. Tugwell. Users' guides to the medical literature. V. How to use an article about prognosis. Evidence-Based Medicine Working Group. JAMA 1994; 272: 234–237.
2 C. D. Naylor & G. H. Guyatt. Users' guides to the medical literature. X. How to use an article reporting variations in the outcomes of health services. Evidence-Based Medicine Working Group. JAMA 1996; 275: 554–558.


33
Meta-analysis and systematic reviews

Common sense is the collection of prejudices acquired by age eighteen.
Albert Einstein (1879–1955)

Learning objectives

In this chapter you will learn:
the principles of evaluating meta-analyses and systematic reviews
the concepts of heterogeneity and homogeneity
the use of L'Abbé, forest, and funnel plots
measures commonly used in systematic reviews: odds ratios and effect size
how to review a published meta-analysis and use the results to solve a clinical problem

Background and rationale for performing meta-analysis

Over the past 50 years there has been an explosion of research in the medical literature. In the worldwide English-language medical literature alone, there were 1,300 biomedical journals in 1940, while in 2000 there were over 14,000. It has become almost impossible for the individual practitioner to keep up with the literature, which is all the more frustrating when contradictory studies are published on a given topic. Meta-analyses and systematic reviews are relatively new techniques used to synthesize and summarize the results of multiple research studies on the same topic.

A primary analysis refers to the original analysis of research data as presented in an observational study or randomized clinical trial (RCT). A secondary analysis is a re-analysis of the original data, either using another statistical technique or answering new questions with previously obtained data.

The traditional review article is a qualitative review: a summary of all primary research on a given topic. It may provide good background information that is more up to date than a textbook, but review articles have the disadvantage of being somewhat subjective, reflecting the biases of an author who may be very selective in the articles chosen for review. One must be knowledgeable about the literature being reviewed in order to evaluate this type of article critically.

Meta-analysis is more comprehensive than, or "transcends," the traditional analysis of data. Typically, a meta-analysis looks at data from multiple studies of the same clinical question and uses a variety of statistical techniques to integrate their findings. It may be called a quantitative systematic review and represents the rigorous application of research techniques and statistical analysis to present an overview of a given topic.

A meta-analysis is usually done to reconcile studies with different results. It can look at multiple negative studies to uncover Type II errors, or at clinical problems where there are some negative and some positive studies to uncover Type I or Type II errors. It can help uncover a single study whose results are totally different because of systematic error or bias in the research process. Large confidence intervals in some studies may be narrowed by combining them. For example, multiple small trials done before 1971 showed both positive and negative effects of light or phototherapy on hyperbilirubinemia in newborns; a meta-analysis in 1985 showed an overall positive effect.

Occasionally a large trial shows an opposite effect from that found in multiple small trials. This is often due to procedural or methodologic differences in the design of the trials. However, as a general rule, correctly done large cooperative trials are more reliable than a meta-analysis of many smaller trials. For example, a meta-analysis of multiple small trials of magnesium in acute myocardial infarction (AMI) showed a positive effect on decreasing mortality. The ISIS-4 trial, a large multicenter RCT in which magnesium was given in one arm of the study, showed no benefit, although it was given later in the course of the AMI than it had been in the smaller studies. The disparity of study methodologies in this case required that the researchers set up a new multicenter study of the use of magnesium in AMI, called MAGIC, which is now in progress. The use of meta-analysis does not reduce the need for large, well-done studies of primary clinical modalities.

Guidelines for evaluation of systematic reviews

Were the question and methods clearly stated, and were comprehensive search methods used to locate relevant studies?

In meta-analysis, the process of article selection and analysis should proceed by a preset protocol. By not changing the process in mid-analysis, the author's bias and retrospective bias are minimized. This means that the definitions of the outcome and the predictor or therapy variables of the analysis are not changed in mid-stream. The research question must be clearly defined, including a defined patient population and clear and consistent definitions of the disease, interventions, and outcomes.

A carefully defined search strategy must be used to detect and prevent publication bias. This bias occurs because trials with positive results and those with large sample sizes are more likely to be published. Sources should include conference proceedings, dissertation abstracts, and other databases, as well as the usual search of MEDLINE. A manual search of relevant journals may uncover some additional studies, and the bibliographies of all relevant articles found should be hand-searched to find any misclassified articles that were missed in the original search.

The authors must cite where they looked and should be exhaustive in looking for unpublished studies. Not using foreign studies may introduce bias, since only some foreign studies are published in English-language journals while the others will be missed. The authors should also contact the authors of all the studies found and ask them about other researchers working in the area who may have unpublished studies available. The Cochrane Collaboration maintains a register of controlled trials called CENTRAL, which attempts to document all current trials regardless of result. Also, the National Library of Medicine and the National Institutes of Health in the United States have an online repository of clinical trials, www.clinicaltrials.gov, which can be accessed to determine whether a clinical trial is ongoing and proceeding according to its original plan.

Were explicit methods used to determine which articles to include in the review, and were the selection and assessment of the methodologic quality of the primary studies reproducible and free from bias?

Objective selection of articles for the meta-analysis should be clearly laid out, including inclusion and exclusion criteria. The objectives and procedures must be defined ahead of time. This includes a clearly defined research and abstraction method and a scoring system for assessing the quality of the included studies. Several factors ought to be assessed for each study. The publication status may point to the stronger studies, in that those that were never published or were published only in abstract form may be significantly deficient in methodologic areas.

The strength of the study design will determine the ability to prove causation. Randomized clinical trials are the strongest study design, although a well-designed observational study with appropriate safeguards to prevent or minimize bias and confounding will also give very strong results. The methods of meta-analysis include ranking or grading the quality of the evidence. The Cochrane Collaboration is using the new GRADE recommendations to rank the quality of studies in its systematic reviews. Appendix 1 gives two commonly used criteria for grading various levels of evidence: the one used by the Centre for Evidence-Based Medicine at Oxford and the GRADE criteria. The full set of GRADE criteria can be downloaded from the Cochrane Collaboration's website, www.cochrane.org.

The study sites and patient populations of the individual studies may limit the generalizability of the meta-analysis. The interventions or exposures should be similar between studies. Finally, the studies should be measuring the same or very similar outcomes. We will discuss below how to judge homogeneity and how heterogeneous studies may be combined.

Independent review of the methods section looks at inclusion and exclusion criteria, coding, and replication issues. There must be accurate and objective abstraction of the data, ideally done by blinded abstracters. Two abstracters should gather the data independently, and the author should check for inter-rater agreement. The methods and results sections should be disguised to prevent reviewers from discovering the source of the research. Inter-rater reliability of the coders should be maximized, with a minimal level of 0.9 on the kappa statistic. Once this has been established, a single coder can code all the remaining study results.
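The kappa statistic corrects raw agreement for the agreement expected by chance alone. A minimal sketch in Python, with hypothetical codings from two abstracters:

    from collections import Counter

    def kappa(rater1, rater2):
        """Cohen's kappa for two raters' categorical codings."""
        n = len(rater1)
        observed = sum(a == b for a, b in zip(rater1, rater2)) / n
        c1, c2 = Counter(rater1), Counter(rater2)
        expected = sum(c1[k] * c2[k] for k in c1) / n**2   # chance agreement
        return (observed - expected) / (1 - expected)

    coder_a = ["yes", "yes", "no", "yes", "no", "yes", "no", "no"]
    coder_b = ["yes", "yes", "no", "yes", "no", "no", "no", "no"]
    print(round(kappa(coder_a, coder_b), 2))   # 0.75 here, below the 0.9 target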

Were the differences in individual study results adequately explained, and were the results of the primary studies combined appropriately?

Studies may be homogeneous or heterogeneous, and there are both qualitative and quantitative measures of heterogeneity. Testing for heterogeneity of the studies is done to determine whether the studies are similar enough to combine. The tests for heterogeneity include the Mantel–Haenszel chi-square test, the Breslow–Day test, and the Q statistic by the DerSimonian and Laird method. They all suffer from low power and so are likely to make a Type II error. If the test statistic is statistically significant (P < 0.05), the studies are likely to be heterogeneous. However, the absence of statistical significance does not prove homogeneity; it may simply reflect the low power of the statistical test for heterogeneity.

The presence of heterogeneity among the studies analyzed will result in erroneous interpretation of the statistical results. If the studies are very heterogeneous, one strategy for analyzing them is to remove the study with the most extreme or outlier results and recalculate the statistic. If the statistic is no longer statistically significant, it can be assumed that the outlier study was responsible for all or most of the heterogeneity. That study should then be examined more closely to determine what about its design might have caused the observed extreme result. This could be due to differences in the population studied or to systematic bias in the conduct of the study.
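As a sketch of the quantitative side, Cochran's Q (the statistic underlying the DerSimonian and Laird approach) can be computed from each study's effect estimate and standard error; the log odds ratios below are hypothetical:

    import numpy as np
    from scipy.stats import chi2

    # Hypothetical per-study log odds ratios and their standard errors.
    log_or = np.array([-0.4, -0.2, -0.5, 0.1, -0.3])
    se = np.array([0.20, 0.25, 0.30, 0.22, 0.18])

    w = 1 / se**2                            # inverse-variance weights
    pooled = np.sum(w * log_or) / np.sum(w)
    Q = np.sum(w * (log_or - pooled)**2)     # Cochran's Q
    df = len(log_or) - 1
    print(f"Q = {Q:.2f}, P = {chi2.sf(Q, df):.3f}")

A significant P here suggests heterogeneity; a non-significant one, as noted above, does not prove homogeneity.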

Analysis and aggregation of the data can be done in several ways, but should consider sample sizes and the magnitude of effects. A simple vote count, in which the number of studies with positive results is directly compared with the number of studies with negative results, is not an acceptable method, since neither effect size nor sample size is considered. Pooled analysis, or lumping the data, adds the numerators and denominators of each study together to produce a new result. This is better than a vote count, but it is still not acceptable, since the process ignores the confidence intervals of each study and allows errors to multiply as the results are added. Simple combination of P values is not acceptable because it considers neither the direction nor the magnitude of the effect. Weighted outcomes compare small and large studies, analyze them as equals, and then weight the results by the sample size; this involves adjusting each outcome by a value that accounts for the sample size and degree of variation. Confidence intervals should be applied to the mean results of each study evaluated. Aggregate study-group and control-group means and confidence intervals can then be calculated. Subgroups should be analyzed where appropriate, recognizing the potential for making a Type I error. There are two standard measures for evaluating the results of a meta-analysis: the odds ratio and the effect size.

The odds ratio (OR) is the most common way of combining results in meta-analysis. The odds ratio can be calculated for each study, showing whether the intervention increases or decreases the odds of a favorable outcome. These can then be combined statistically and the 95% confidence intervals calculated for all the odds ratios. If we are looking at a positive outcome, such as the percentage still alive, an OR > 1 favors the experimental treatment. If we are looking at a negative outcome, such as the mortality rate, an OR < 1 favors the experimental treatment. The OR is used rather than the relative risk (RR), even though the studies are usually RCTs, because of the mathematical problems of combining RRs. Newer calculation techniques are making it possible to calculate an aggregate RR, and this is becoming more common in meta-analyses.

The effect size (d or δ) is a standard metric that can be compared across studies. It is a relative and not an absolute value. The equation for effect size is d = (m1 − m2)/SD, where m1 and m2 are the means of the two groups being studied and SD is the standard deviation of either sample population. A difference (δ) of 0.2 SD is a small effect, 0.5 SD a moderate effect, and more than 0.8 SD a large effect. If the data are skewed, it is better to use the median rather than the mean to calculate the effect size, but this requires other, more complex statistical methods to accomplish the analysis.
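Both measures are straightforward to compute. A sketch in Python with hypothetical numbers:

    import math

    # Odds ratio for one study's 2 x 2 table, with its 95% CI on the log scale.
    a, b, c, d = 15, 85, 30, 70            # hypothetical events/non-events per arm
    or_ = (a * d) / (b * c)
    se_log = math.sqrt(1/a + 1/b + 1/c + 1/d)
    low = math.exp(math.log(or_) - 1.96 * se_log)
    high = math.exp(math.log(or_) + 1.96 * se_log)
    print(f"OR = {or_:.2f} (95% CI {low:.2f} to {high:.2f})")

    # Effect size d = (m1 - m2)/SD for a continuous outcome.
    m1, m2, sd = 12.0, 10.5, 3.0           # hypothetical means and standard deviation
    print("d =", (m1 - m2) / sd)           # 0.5 SD: a moderate effect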

The statistical analytic procedures usually employed in systematic reviews are far too complex to discuss here. However, the reader should be aware of important distinctions between the methods used in the presence and in the absence of heterogeneity of the study results. If the data are relatively homogeneous, a statistical process called the fixed-effects model can be used. This assumes that all the studies can be statistically analyzed as equals. However, if the data are very heterogeneous, a statistical process called the random-effects model should be used. This model is more complex and takes into account that the various studies are only part of a larger population of possible studies of the question; the result is wider confidence intervals. Unfortunately, the methods used for the random-effects model give more weight to the smaller studies, which is a potential source of bias if there are more small studies with positive results as a consequence of publication bias. Frequently, a single meta-analysis will use both methods to determine statistical significance. If the two methods give the same result, the statistical significance is more "powerful" than if one method finds statistical significance and the other does not.

[Fig. 33.1 Hypothetical meta-analysis (forest plot). Studies 1 through 6 (1976–1998) and the overall result are plotted as odds ratios on a log scale from 0.01 to 100; values to the left of 1 favor treatment and values to the right favor placebo. Lines A and B mark the visual bounds used to check for heterogeneity. All of the initial studies except one lacked power to find a difference; a difference was found when all studies were combined.]
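The contrast between the two models can be sketched with the same hypothetical studies used above: the fixed-effects pool weights each study by the inverse of its variance, while the DerSimonian and Laird random-effects pool adds a between-study variance (tau squared) to every study, widening the confidence interval and flattening the weights:

    import numpy as np

    log_or = np.array([-0.4, -0.2, -0.5, 0.1, -0.3])   # hypothetical studies
    se = np.array([0.20, 0.25, 0.30, 0.22, 0.18])

    def pool(effects, var):
        """Inverse-variance pooled estimate with a 95% confidence interval."""
        w = 1 / var
        est = np.sum(w * effects) / np.sum(w)
        half_width = 1.96 * np.sqrt(1 / np.sum(w))
        return est, est - half_width, est + half_width

    fixed = pool(log_or, se**2)                        # studies treated as equals

    # DerSimonian-Laird estimate of the between-study variance tau^2.
    w = 1 / se**2
    mean = np.sum(w * log_or) / np.sum(w)
    Q = np.sum(w * (log_or - mean)**2)
    tau2 = max(0.0, (Q - (len(log_or) - 1)) / (np.sum(w) - np.sum(w**2) / np.sum(w)))
    random = pool(log_or, se**2 + tau2)                # every study's variance widened

    for name, (est, lo, hi) in (("fixed", fixed), ("random", random)):
        print(f"{name}: OR = {np.exp(est):.2f} (95% CI {np.exp(lo):.2f} to {np.exp(hi):.2f})")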

There are three graphic techniques that can be used to look at the overall data. All of them demonstrate the effect of publication bias, but in different ways. Large studies and those showing positive effects are more likely to be published. It is very likely that a small study showing a positive effect will be published, while a small study showing a negative effect or no difference between the groups is less likely to be published. It is important to be able to estimate the effect of this phenomenon on the results of the meta-analysis. Graphic displays are a powerful tool for showing the differences in study results.

The most common way of graphing meta-analysis results is the forest plot. This shows the result of each study as a point estimate for the rate, risk difference, or ratio (odds ratio, relative risk, or effect size) and a line for the 95% confidence interval around this point estimate. A log scale is commonly used so that reciprocal values are an equal distance from 1 (Fig. 33.1). Always be careful to check the scales. It is easy to see whether the confidence interval crosses the point of no significance: 0 for differences or 1 for ratios.
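A basic forest plot can be drawn with any plotting library; for example, assuming Python with matplotlib, and using hypothetical odds ratios with their confidence limits:

    import numpy as np
    import matplotlib.pyplot as plt

    labels = ["Study 1", "Study 2", "Study 3", "Study 4", "Study 5", "Overall"]
    point = np.array([0.67, 0.82, 0.61, 1.11, 0.74, 0.76])   # hypothetical ORs
    lower = np.array([0.45, 0.50, 0.34, 0.72, 0.52, 0.63])
    upper = np.array([0.99, 1.34, 1.09, 1.71, 1.05, 0.92])

    y = np.arange(len(labels))[::-1]
    plt.errorbar(point, y, xerr=[point - lower, upper - point], fmt="s", capsize=3)
    plt.axvline(1.0, linestyle="--")   # line of no effect for a ratio
    plt.xscale("log")                  # reciprocals sit equidistant from 1
    plt.yticks(y, labels)
    plt.xlabel("Odds ratio (log scale)")
    plt.show()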

There is a visual guide that can suggest heterogeneity in this type of plot.1 Simply draw a perpendicular through the higher end of the 95% CI of the study with the lowest point value; in Fig. 33.1 this is line A, drawn through the higher end of the 95% CI of study 4. Draw a similar line through the lower end of the 95% CI of the study with the highest point value; here it is line B, through the lower end of the 95% CI of study 5. If the confidence intervals of all of the studies fall in the space between these two lines, the studies are probably not heterogeneous. Any study outside this area may be the cause of significant heterogeneity in the aggregate analysis of the study results.

The L'Abbé plot, shown in Fig. 33.2, is used to show how much each individual study contributes to the outcome. The two possible outcome rates, for the control and intervention groups, are plotted on the x- and y-axes, respectively. Each study is represented by a circle whose diameter is proportional to the sample size, and a key to the sample sizes is given with the plot. The L'Abbé plot is a better visual aid for seeing the differences between study results and how much those depend on the sample size.

[Fig. 33.2 L'Abbé plot of a hypothetical meta-analysis: % responding with treatment plotted against % responding with placebo, with circles sized by sample size (10 < n < 50, 50 < n < 100, 100 < n < 500) and the combined result marked. The largest studies showed the most effect of the treatment, suggesting that the smaller studies lacked power.]

Finally, the funnel plot, shown in Fig. 33.3, is another way to show the effect of sample size on the effect size. This is a plot of the effect size (δ) on the x-axis and the sample size on the y-axis. If there are many positive small studies with large effect sizes, the resulting plot will look like an asymmetric triangle, or half of an upside-down funnel. This suggests that the overall result of the meta-analysis is being unduly influenced by these many very positive small studies, which could be due to publication bias, or that all of these studies may share similar systematic biases and perhaps fatal flaws in their execution.

[Fig. 33.3 Funnel plot of six studies: effect size (SD units, 0 to 1.2) on the x-axis against sample size (0 to 100) on the y-axis. Notice that the largest effect sizes were found in the smallest studies. A plot with this configuration suggests publication bias.]

1 Shown to me by Rose Hatala, M.D., from the Department of Medicine of the University of British Columbia, Vancouver, BC, Canada.

Were the reviewers' conclusions supported by the data cited?

A sensitivity analysis should be done to address the possibility of publication bias, also called the "file-drawer effect." Negative and unpublished studies are frequently small and usually won't be able to drastically change the results of the meta-analysis. Using the funnel or L'Abbé plots and other methods will help alert the reader to the potential presence of publication bias.

There is a way of calculating the potential effect of publication bias. The fail-safe N is an estimate of the number of negative studies that would be needed to eliminate the difference between treatment and outcome, or cause and effect, that was found. This can mean reversing the direction of the pooled effect or increasing the overall probability of finding a difference when one doesn't exist to a value higher than the α level (i.e., P > 0.05). If a large part of the positive effect found is due to a few small and very positive studies, it is possible that there are also a few small and clearly negative studies that, because of publication bias, have never been published. If the fail-safe N is small, it means that only a few small negative studies would be needed to reverse the finding, which is a plausible occurrence. But if the fail-safe N is very large, it is unlikely that there are that many unpublished negative studies "out there," and you would accept the results as being positive.
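One widely used version is Rosenthal's formula, which asks how many unpublished studies averaging a null result would be needed to drag the combined result below significance. A sketch with hypothetical per-study Z scores:

    import math

    z_values = [2.1, 1.8, 2.5, 1.2, 2.9]   # hypothetical Z scores, one per study
    k = len(z_values)
    z_sum = sum(z_values)

    # Rosenthal's fail-safe N: null studies needed to push the combined
    # (Stouffer) Z below the one-tailed critical value of 1.645 (P = 0.05).
    n_fs = (z_sum / 1.645) ** 2 - k
    print("fail-safe N =", math.ceil(n_fs))   # 36 here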

Some common problems with meta-analyses are that they may be comparing diverse studies with different designs or studies done over different time periods. There may be excessive inter-observer variability in deciding which trials to evaluate and how much weight to give to each trial. These issues ought to be addressed by the authors and the differences in the results explained. In many cases, the methodologies will contribute biases that can be uncovered in the meta-analysis process.


[Figure: individual analysis and conventional meta-analysis (odds ratios with 95% confidence intervals, log scale 0.1 to 10) compared with the cumulative Mantel–Haenszel method (odds ratios) for 33 trials published between 1959 (Fletcher) and 1988 (ISIS-2, Wisenberg), totaling 36,974 patients; overall z = −8.16, P < 0.001.]

A recent addition to the quantitative systematic review literature comes from the Cochrane Collaboration, which was described in Chapter 5. The Cochrane Collaboration is now a worldwide network of interested clinicians, epidemiologists, and scientists who perform systematic reviews and meta-analyses of clinical questions. Their reviews are standardized, of the highest quality, and updated regularly as more information becomes available. They are available online in the Cochrane Library.

Additional guidelines for meta-analysis

There are some additional guidelines for creating and reviewing meta-analyses, published in 1984 by Green and Hall, that are still very useful to follow.2 The inclusion and exclusion criteria for the relevant studies should be defined and reported. This may raise substantive and conceptual issues, such as how to handle a study with missing or incomplete data. The coding categories should be developed in a manner that will accommodate the largest proportion of the identified literature; over-coding of study characteristics is better than under-coding. The following characteristics should be coded: type and length of the intervention, sample characteristics, research design characteristics and quality, source of the study (e.g., published, dissertation, internal report, and the like), date of the study, and so on. The reliability of the coders should be checked with the kappa statistic.

Multiple independent and dependent variables should be separately evaluated using a sensitivity analysis. Interactions between variables outside the principal relationship being reviewed should be looked for. The distribution of results should be examined and graphed, and outliers looked at more closely. Statistical tests for the heterogeneity of results should be performed; if the studies are found to be heterogeneous, a sensitivity analysis should be performed to identify the outlier study. The effect size should be specified and the level of significance or confidence intervals given. Effect sizes should be recalculated to give both unadjusted and adjusted results. Where necessary, nonparametric and parametric effect-size estimates should be calculated.

In the conclusions, the authors should examine other approaches to the same problem. Quantitative evaluation of all studies should be combined with qualitative reviews of the topic, looking at the comparability of treatment and control groups from study to study. The authors should also look at other potentially interesting and worthwhile studies that are not part of the quantitative review. Finally, the limitations of the review and ideas for future research should be discussed. For the reader, it is well to remember that "data analysis is an aid to thought, not a substitute."3

The same is true of evidence-based medicine in general. It should be an aid to thought and an encouragement to integrate the science of medical research into clinical practice, but it is not a substitute for critical thinking and the art of medicine. There is a great tendency to accept meta-analyses as the ultimate word in evidence, yet the results of such an analysis are only as good as the evidence upon which it is based. Then again, this statement can apply to all evidence in medicine. We will always be faced with making difficult decisions in the face of uncertainty. In that setting, it takes our clinical experience, intuition, common sense, and good communication with our patients to decide upon the best way to use the best evidence.

2 B. F. Green & J. A. Hall. Quantitative methods for literature review. Annu. Rev. Psychol. 1984; 35: 37–54.
3 B. F. Green & J. A. Hall, ibid.


Appendix 1 Levels of evidence and grades of recommendations

Adapted and used with permission from the Oxford Centre for Evidence-Based Medicine Levels of Evidence (May 2001), available at www.cebm.net/levels_of_evidence.asp.
Adapted and used with permission from the GRADE working group of the Cochrane Collaboration.


Levels of evidence

Level 1a
• Therapy/Prevention, Etiology/Harm: SR (with homogeneity) [a] of RCTs
• Prognosis: SR (with homogeneity) of inception cohort studies; CDR [d] validated in different populations
• Diagnosis: SR (with homogeneity) of Level 1 diagnostic studies; CDR [d] with 1b studies from different clinical centers
• Differential diagnosis/Symptom prevalence study: SR (with homogeneity) of prospective cohort studies
• Economic and decision analyses: SR (with homogeneity) of Level 1 economic studies

Level 1b
• Therapy/Prevention, Etiology/Harm: Individual RCT (with narrow confidence interval)
• Prognosis: Individual inception cohort study with ≥80% follow-up; CDR validated in a single population
• Diagnosis: Validating [g] cohort study with good [h] reference standards; or CDR tested within one clinical center
• Differential diagnosis/Symptom prevalence study: Prospective cohort study with good follow-up [j]
• Economic and decision analyses: Analysis based on clinically sensible costs or alternatives; systematic review(s) of the evidence; and including multi-way sensitivity analyses

Level 1c
• Therapy/Prevention, Etiology/Harm: All or none [b]
• Prognosis: All-or-none case series
• Diagnosis: Absolute SpPins and SnNouts [i]
• Differential diagnosis/Symptom prevalence study: All-or-none case series
• Economic and decision analyses: Absolute better-value or worse-value analyses [k]

Level 2a
• Therapy/Prevention, Etiology/Harm: SR (with homogeneity) of cohort studies
• Prognosis: SR (with homogeneity) of either retrospective cohort studies or untreated control groups in RCTs
• Diagnosis: SR (with homogeneity) of Level >2 diagnostic studies
• Differential diagnosis/Symptom prevalence study: SR (with homogeneity) of 2b and better studies
• Economic and decision analyses: SR (with homogeneity) of Level >2 economic studies

Level 2b
• Therapy/Prevention, Etiology/Harm: Individual cohort study (including low-quality RCT; e.g., <80% follow-up)
• Prognosis: Retrospective cohort study or follow-up of untreated control patients in an RCT; derivation of CDR, or CDR validated on split-sample [e] only
• Diagnosis: Exploratory [g] cohort study with good [h] reference standards; CDR after derivation, or validated only on split-sample [e] or databases
• Differential diagnosis/Symptom prevalence study: Retrospective cohort study, or poor follow-up
• Economic and decision analyses: Analysis based on clinically sensible costs or alternatives; limited review(s) of the evidence, or single studies; and including multi-way sensitivity analyses

Level 2c
• Therapy/Prevention, Etiology/Harm: "Outcomes" research; ecological studies
• Prognosis: "Outcomes" research
• Differential diagnosis/Symptom prevalence study: Ecological studies
• Economic and decision analyses: Audit or outcomes research

Level 3a
• Therapy/Prevention, Etiology/Harm: SR (with homogeneity) of case–control studies
• Diagnosis: SR (with homogeneity) of 3b and better studies
• Differential diagnosis/Symptom prevalence study: SR (with homogeneity) of 3b and better studies
• Economic and decision analyses: SR (with homogeneity) of 3b and better studies

Level 3b
• Therapy/Prevention, Etiology/Harm: Individual case–control study
• Diagnosis: Non-consecutive study; or without consistently applied reference standards
• Differential diagnosis/Symptom prevalence study: Non-consecutive cohort study, or very limited population
• Economic and decision analyses: Analysis based on limited alternatives or costs, or poor-quality estimates of data, but including sensitivity analyses incorporating clinically sensible variations

Level 4
• Therapy/Prevention, Etiology/Harm: Case series (and poor-quality cohort and case–control studies) [c]
• Prognosis: Case series (and poor-quality prognostic cohort studies) [f]
• Diagnosis: Case–control study; poor or non-independent reference standard
• Differential diagnosis/Symptom prevalence study: Case series or superseded reference standards
• Economic and decision analyses: Analysis with no sensitivity analysis

Level 5 (all columns)
• Expert opinion without explicit critical appraisal or based on physiology, bench research, economic theory, or "first principles"

SR = systematic review. CDR = clinical decision rule.

Users can add a minus sign to denote a level of evidence that fails to provide a conclusive answer because of EITHER a single result with a wide confidence interval (such that, for example, an ARR in an RCT is not statistically significant but whose confidence intervals fail to exclude clinically important benefit or harm) OR a systematic review with troublesome (and statistically significant) heterogeneity. Such evidence is inconclusive, and therefore can only generate Grade D recommendations.

[a] By homogeneity we mean a systematic review that is free of worrisome variations (heterogeneity) in the directions and degrees of results between individual studies. Not all systematic reviews with statistically significant heterogeneity need be worrisome, and not all worrisome heterogeneity need be statistically significant. As noted above, studies displaying worrisome heterogeneity should be tagged with a "−" (minus sign) at the end of their designated level.
[b] All or none: met when all patients died before the therapy became available, but some now survive on it; or when some patients died before the therapy became available, but none now die on it.
[c] By poor-quality cohort study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded) objective way in both exposed and non-exposed individuals and/or failed to identify or appropriately control known confounders and/or failed to carry out a sufficiently long and complete follow-up of patients. By poor-quality case–control study we mean one that failed to clearly define comparison groups and/or failed to measure exposures and outcomes in the same (preferably blinded) objective way in both cases and controls and/or failed to identify or appropriately control known confounders.
[d] Clinical decision rules are algorithms or scoring systems which lead to a prognostic estimation or a diagnostic category.
[e] Split-sample validation is achieved by collecting all the information in a single group, then artificially dividing this into "derivation" and "validation" samples.
[f] By poor-quality prognostic cohort study we mean one in which sampling was biased in favour of patients who already had the target outcome, or in which the measurement of outcomes was accomplished in <80% of study patients, or in which outcomes were determined in an unblinded, non-objective way, or in which there was no correction for confounding factors.
[g] Validating studies test the quality of a specific diagnostic test based on prior evidence; an exploratory study collects information and trawls the data (e.g., using a regression analysis) to find which factors are "significant."
[h] Good reference standards are independent of the test and applied blindly or objectively to all patients. Poor reference standards are haphazardly applied but still independent of the test. Use of a non-independent reference standard (where the "test" is included in the "reference" or where the "testing" affects the "reference") implies a level 4 study.
[i] An "absolute SpPin" is a diagnostic finding whose specificity is so high that a positive result rules in the diagnosis. An "absolute SnNout" is a diagnostic finding whose sensitivity is so high that a negative result rules out the diagnosis.
[j] Good follow-up in a differential diagnosis study is >80%, with adequate time for alternative diagnoses to emerge (e.g., 1–6 months acute, 1–5 years chronic).
[k] Better-value treatments are clearly as good but cheaper, or better at the same or reduced cost. Worse-value treatments are as good and more expensive, or worse and equally or more expensive.

Grades of recommendation

A: consistent level 1 studies
B: consistent level 2 or 3 studies, or extrapolations from level 1 studies
C: level 4 studies, or extrapolations from level 2 or 3 studies
D: level 5 evidence, or troublingly inconsistent or inconclusive studies of any level

"Extrapolations" are where data are used in a situation which has potentially clinically important differences from the original study situation.


GRADE quality assessment criteria

Quality of evidence: High, Moderate, Low, or Very low.
Study design: randomized trials start as high-quality evidence; observational studies start as low-quality evidence.

Lower the grade if [*]:
• Study quality: −1 serious limitations; −2 very serious limitations
• −1 important inconsistency
• Directness: −1 some uncertainty; −2 major uncertainty
• −1 sparse data
• −1 high probability of reporting bias

Raise the grade if [*]:
• Strong association: +1 strong, no plausible confounders, consistent and direct evidence [**]
• +2 very strong, no major threats to validity and direct evidence [***]
• +1 evidence of a dose–response gradient
• +1 all plausible confounders would have reduced the effect

[*] 1 = move up or down one grade (for example, from high to moderate); 2 = move up or down two grades (for example, from high to low).
[**] A statistically significant relative risk of >2 (<0.5), with no plausible confounders.
[***] A statistically significant relative risk of >5 (<0.2), based on direct evidence with no major threats to validity.


Moving from strong to weak recommendations

Factors that can weaken the strength of a recommendation (each answered Yes or No):
• Absence of high-quality evidence
• Imprecise estimates
• Uncertainty or variation in how different individuals value the outcomes
• Small net benefits
• Uncertainty about whether the net benefits are worth the costs (including the costs of implementing the recommendation)

Frequent "yes" answers will increase the likelihood of a weak recommendation.

• Strong recommendation: the panel is confident that the desirable effects of adherence to a recommendation outweigh the undesirable effects.
• Weak recommendation: the panel concludes that the desirable effects of adherence to a recommendation probably outweigh the undesirable effects, but is not confident.


Appendix 2
Overview of critical appraisal

Adapted from G. Guyatt & D. Rennie (eds.) Users' Guides to the Medical Literature: a Manual for Evidence-Based Clinical Practice. Chicago: AMA, 2002. Used with permission.

(1) Randomized clinical trials (commonly studies of therapy or prevention)
(a) Are the results valid?
(i) Were the patients randomly assigned to treatment and was allocation effectively concealed?
(ii) Were the baseline characteristics of all groups similar at the start of the study?
(iii) Were the patients who entered the study fully accounted for at its conclusion?
(iv) Were participating patients, family members, treating clinicians, and other people (observers or managers) involved in the study "blind" to the treatment received?
(v) Were all measurements made in an objective and reproducible manner?
(vi) With the exception of the experimental intervention, were all patients treated equally?
(vii) Were the patients analyzed in the groups to which they were randomized?
(viii) Was follow-up complete?
(b) What are the results?
(i) What is the treatment effect? (Absolute Rate Reduction, Relative Rate Reduction, Number Needed to Treat)
(ii) What is the variability of this effect? (Confidence Intervals)
(c) Will the results help me in my patient care?
(i) Were all clinically important outcomes considered in the study?
(ii) Will the benefits of the experimental treatment counterbalance any harms and additional costs?
(iii) Can the results of this study be applied to most of my patients with this or similar problems?

(2) Cohort studies (commonly studies of risk or harm or etiology)
(a) Are the results valid?
(i) With the exception of the risk factor under study, were all groups similar to each other at the start of the study?
(ii) Were all measurements (outcome and exposure) made in an objective and reproducible manner and carried out in the same ways in all groups?
(iii) Were all patients that were entered into the study accounted for at the end of the study, and was the follow-up for a sufficiently long time?
(b) What are the results?
(i) Is the temporal relationship between the cause and effect correct?
(ii) Is there a dose–response gradient between the cause and effect?
(iii) How strong is the association between cause and effect? (Relative Risk Reduction, Relative Risk, Absolute Risk Reduction, Number Needed to Harm)
(iv) What is the variability of this effect? (Confidence Intervals)
(c) Will the results help me in my patient care?
(i) What is the relative magnitude of the risk in my patient population?
(ii) Can the results of this study be applied to most of my patients with this or similar problems?
(iii) Should I encourage the patient to stop the exposure? If yes, how soon?

(3) Case–control studies (commonly studies of etiology or risk or harm)
(a) Are the results valid?
(i) With the exception of the presence of the disease under study, were all groups similar to each other at the start of the study?
(ii) Were all measurements (disease and exposure) made in an objective and reproducible manner and carried out in the same ways in all groups? Was an explicit chart review method used for all patients?
(iii) Was the risk factor information obtained for all patients who were entered into the study?
(b) What are the results?
(i) Is there a dose–response gradient between the cause and effect?
(ii) How strong is the association between cause and effect? (Odds Ratio)
(iii) What is the variability of this effect? (Confidence Intervals)
(c) Will the results help me in my patient care?
(i) What is the relative magnitude of the risk in my patient population?
(ii) Can the results of this study be applied to most of my patients with this or similar problems?
(iii) Should I encourage the patient to stop the exposure? If yes, how soon?

Hierarchy of relative study strength
RCT > Cohort > Case–control > Case series

(4) Studies of diagnosis (commonly cohort or case–control studies)
(a) Are the results valid?
(i) Were all the patients in the study similar to those patients for whom the test would be used in general medical practice?
(ii) Was there a reasonable spectrum of disease in the patients in the study?
(iii) Were the details of the diagnostic test described adequately?
(iv) Were all diagnostic and outcome measurements made in an objective and reproducible manner and carried out in the same ways in all patients?
(v) Were both the test under study and a reasonable reference standard used to test all patients?
(vi) Was the comparison of the test under study to the reference standard done in a blinded manner?
(vii) Did the results of the test being studied influence the decision to perform the reference standard test?
(b) What are the results?
(i) How strong is the diagnostic test? (Likelihood Ratios, Sensitivity and Specificity)
(ii) What is the variability of this result? (Confidence Intervals)
(c) Will the results help me in my patient care?
(i) Can the test be used in my patient population when considering factors of availability, performance, and cost?
(ii) Can I determine a reasonable pretest probability of disease in my patients?
(iii) Will the performance of the test result in a significant change in management for my patients?
(iv) Will my patient be better off as a result of having obtained the test?


Appendix 3
Commonly used statistical tests

The following is a very simplistic summary of the usual tests used in statistical inference.

Descriptive statistics

Single variable
• Ratio or interval data: central tendency is described by the mean; dispersion by the standard deviation; deviation from normality by skew or kurtosis. Graphs: histogram, stem–leaf plot, frequency polygon, box plot.
• Ranks (ordinal data): central tendency is described by the median; dispersion by the range and interquartile range. Graphs: box plot, bar chart.
• Named (nominal data): central tendency is described by the mode; dispersion by the number of categories. Graphs: bar chart, dot plot.

Two variables
• Ratio or interval data: association, measured by Pearson's r. Graph: scatter plot.
• Nominal or ordinal data: comparison, measured by kappa, weighted kappa, phi, or rho. Graphs: paired bar chart, scatter plot.


Inferential statistics

Ratio or interval data (dependent variable)
• One or two means, no independent variable: t-test, or z-test when n > 100
• Continuous independent variable: F-test or t-test
• Nominal independent variable: t-test
• Multiple continuous independent variables (multiple regression): F-test
• Multiple nominal independent variables (ANOVA): F-test or Student–Newman–Keuls test
• Multiple continuous and nominal independent variables (ANCOVA): F-test
• Association: Pearson's r
• Predicting variable values: regression

Ordinal data (dependent variable)
• No independent variable: Wilcoxon signed rank test
• Ordinal independent variable: Spearman's test
• Nominal independent variable: Mann–Whitney test
• Multiple ordinal independent variables: χ²-test
• Multiple nominal independent variables: Kruskal–Wallis test
• Association: Spearman's ρ

Nominal data (dependent variable)
• No independent variable (affected by time): normal approximation to the Poisson distribution
• Nominal independent variable (paired): McNemar's test
• Nominal independent variable (unpaired): χ²-test, normal approximation, or Mantel–Haenszel test
• Continuous independent variable: χ²-test for trend
• Multiple continuous or nominal independent variables: χ²-test
• Multiple nominal independent variables: Mantel–Haenszel test
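To make a few rows of this table concrete, here is a brief Python sketch assuming the SciPy library; the data are invented and purely illustrative.

# A sketch of three rows of the table: nominal group + interval outcome
# (t-test), nominal group + ordinal outcome (Mann-Whitney test), and
# unpaired nominal vs. nominal counts (chi-squared test).
from scipy import stats

treated = [5.1, 4.8, 5.6, 5.0, 4.7]
control = [5.9, 6.1, 5.8, 6.3, 6.0]

t, p_t = stats.ttest_ind(treated, control)        # interval outcome, 2 groups
u, p_u = stats.mannwhitneyu(treated, control)     # ordinal outcome, 2 groups
chi2, p_c, df, _ = stats.chi2_contingency([[10, 20], [30, 15]])  # 2 x 2 counts
print(p_t, p_u, p_c)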

Multivariate analysis

Multiple linear regression is used when the outcome variable is continuous.
Multiple logistic regression is used when the outcome variable is a binary event (e.g., alive or dead, disease-free or recurrent disease, etc.).
Discriminant function analysis is used when the outcome variable is categorical (better, worse, or about the same).
Proportional hazards regression (Cox regression) is used when the outcome variable is the time to the occurrence of a binary event (e.g., time to death or tumor recurrence).
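As a sketch of the first two of these models, assuming the statsmodels library and simulated data (the variable names are ours, not from the book):

# A sketch pairing outcome types with regression models: a continuous
# outcome fits a multiple linear regression (OLS); a binary outcome
# fits a multiple logistic regression (Logit).
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = sm.add_constant(rng.normal(size=(100, 2)))   # two predictors + intercept
y_cont = X @ [1.0, 0.5, -0.3] + rng.normal(size=100)
y_bin = (y_cont > y_cont.mean()).astype(int)

linear = sm.OLS(y_cont, X).fit()       # multiple linear regression
logistic = sm.Logit(y_bin, X).fit()    # multiple logistic regression
print(linear.params, logistic.params)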


Appendix 4
Formulas

Descriptive statistics

Mean: μ = (Σxᵢ)/n
where xᵢ is the numerical value of the ith data point and n is the total number of data points.
Variance (s² or σ²): s² = (Σ(xᵢ − μ)²)/(n − 1)
Standard deviation (SD, s, or σ): s = √s²

Confidence intervals using the standard error of the mean

95% CI = μ ± Z95% (σ/√n)
Z95% = 1.96 (the number of standard deviations defining 95% of the data)
SEM = σ/√n
95% CI = μ ± 1.96 (SEM)
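A minimal Python sketch of these formulas, using only the standard library and a made-up data set:

# Mean, variance, standard deviation, SEM, and 95% CI as defined above.
from math import sqrt

data = [4.2, 5.1, 3.8, 4.9, 5.5, 4.4, 5.0, 4.6]
n = len(data)
mean = sum(data) / n
variance = sum((x - mean) ** 2 for x in data) / (n - 1)
sd = sqrt(variance)
sem = sd / sqrt(n)
ci_95 = (mean - 1.96 * sem, mean + 1.96 * sem)
print(mean, sd, sem, ci_95)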

Basic probability

Probability that event a or event b will occur (mutually exclusive events): P(a or b) = P(a) + P(b)
Probability that event a and event b will both occur (independent events): P(a and b) = P(a) × P(b)
Probability that at least one of several independent events will occur = 1 − P(none of the events will occur),
where P(none of the events will occur) = P(not a) × P(not b) × P(not c) × ...
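For example, a short sketch of the "at least one" rule for independent events (the three probabilities are hypothetical):

# Chance that at least one of three independent events occurs:
# 1 minus the product of the individual "does not occur" probabilities.
p_events = [0.05, 0.10, 0.02]
p_none = 1.0
for p in p_events:
    p_none *= (1 - p)
print(1 - p_none)  # ~0.162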

Event rates (Fig. A.4.1)

Control event rate = CER = control patients with the outcome of interest/all control patients = A/CE
Experimental event rate = EER = experimental patients with the outcome of interest/all experimental patients = C/EE
Absolute rate reduction = ARR = |EER − CER|
Relative rate reduction = RRR = (CER − EER)/CER
Number needed to treat to benefit = NNTB = 1/ARR
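A small Python sketch of these event-rate calculations, using hypothetical counts laid out as in Fig. A.4.1:

# A = control patients with the outcome, CE = all controls;
# C = experimental patients with the outcome, EE = all experimental patients.
A, CE = 15, 100          # 15% of controls had the outcome
C, EE = 10, 100          # 10% of experimental patients had the outcome

CER = A / CE
EER = C / EE
ARR = abs(EER - CER)                 # absolute rate reduction
RRR = (CER - EER) / CER              # relative rate reduction
NNTB = 1 / ARR                       # number needed to treat to benefit
print(CER, EER, ARR, RRR, NNTB)      # 0.15 0.1 0.05 0.33... 20.0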

Relative risk and odds ratio (Fig. A.4.2)

Absolute risk of disease in the risk group = A/(A + B)
Absolute risk of disease in the no-risk group = C/(C + D)



Fig. A.4.1 Event-rate calculations: 2 × 2 table.

                           Events of interest   Other events   Total
Control or placebo group   A                    B              CE (control group events)
Experimental group         C                    D              EE (experimental group events)

Fig. A.4.2 Relative-risk and odds-ratio calculations: 2 × 2 table. Case–control studies sample by disease status (down the columns); cohort studies and RCTs sample by risk status (across the rows).

               Disease present   Disease absent   Total
Risk present   A                 B                A + B
Risk absent    C                 D                C + D
Total          A + C             B + D            N (population)

Relative risk of disease = RR = [A/(A + B)]/[C/(C + D)]
Absolute attributable risk = AAR = [A/(A + B)] − [C/(C + D)]
Attributable risk percent relative to the no-risk group = [A/(A + B) − C/(C + D)]/[C/(C + D)]
This is also called relative attributable risk. Depending on which variable you want to measure this relative to, it can also be written as:
Attributable risk percent relative to the risk group = [A/(A + B) − C/(C + D)]/[A/(A + B)]
Number needed to treat to harm = NNTH = 1/AAR
Odds of risk factor if diseased = A/C
Odds of risk factor if not diseased = B/D
Odds ratio = OR = [A/C]/[B/D] = AD/BC

Confidence intervals (Fig. A.4.3)

For the odds ratio: CI = exp[ln(OR) ± 1.96 √(1/A + 1/B + 1/C + 1/D)]
For the relative risk: CI = exp[ln(RR) ± 1.96 √((1 − A/(A + B))/A + (1 − C/(C + D))/C)]
Let the computer do the calculations!
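Taking that advice literally, here is a minimal Python sketch with hypothetical 2 × 2 counts laid out as in Fig. A.4.2:

# Relative risk, odds ratio, and their 95% confidence intervals,
# computed on the log scale as in the formulas above.
from math import exp, log, sqrt

A, B, C, D = 30, 70, 15, 85    # risk present: A, B; risk absent: C, D

RR = (A / (A + B)) / (C / (C + D))
OR = (A * D) / (B * C)

se_ln_or = sqrt(1/A + 1/B + 1/C + 1/D)
ci_or = (exp(log(OR) - 1.96 * se_ln_or), exp(log(OR) + 1.96 * se_ln_or))

se_ln_rr = sqrt((1 - A/(A + B))/A + (1 - C/(C + D))/C)
ci_rr = (exp(log(RR) - 1.96 * se_ln_rr), exp(log(RR) + 1.96 * se_ln_rr))
print(RR, ci_rr, OR, ci_or)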



Fig. A.4.3 Confidence interval for relative risk: the figure shows a relative-risk point estimate of 1.8 with a 95% confidence interval of 1.3 to 2.6, on an axis running from 0 to 3. A relative risk below 1 means the risk factor reduces the risk of the outcome; above 1 it increases the risk of the outcome.

Diagnostic tests (Fig. A.4.4)

Fig. A.4.4 Diagnostic tests: 2 × 2 table.

                 Disease present (D+)   Disease absent (D−)   Total
Test positive    TP                     FP                    T+
Test negative    FN                     TN                    T−

True positive rate = TPR = TP/D+ = sensitivity
False positive rate = FPR = FP/D− = 1 − specificity
False negative rate = FNR = FN/D+ = 1 − sensitivity
True negative rate = TNR = TN/D− = specificity
Likelihood ratio of a positive test = LR+ = sensitivity/(1 − specificity)
Likelihood ratio of a negative test = LR− = (1 − sensitivity)/specificity
Positive predictive value = PPV = TP/T+
Negative predictive value = NPV = TN/T−
False alarm rate = FAR = 1 − PPV
False reassurance rate = FRR = 1 − NPV

Bayes' theorem

Odds = probability/(1 − probability)
Post-test odds = pretest odds × likelihood ratio (this gives the PPV if LR+ is used and the FRR if LR− is used)
Probability = odds/(1 + odds)
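A short Python sketch tying the test characteristics above to Bayes' theorem, with hypothetical counts from a 2 × 2 table like Fig. A.4.4:

# Sensitivity, specificity, likelihood ratios, and the post-test
# probability obtained by converting probability to odds and back.
TP, FP, FN, TN = 90, 15, 10, 85

sens = TP / (TP + FN)                 # true positive rate
spec = TN / (TN + FP)                 # true negative rate
lr_pos = sens / (1 - spec)
lr_neg = (1 - sens) / spec

def post_test_probability(pretest_p, lr):
    # Post-test odds = pretest odds x likelihood ratio.
    pre_odds = pretest_p / (1 - pretest_p)
    post_odds = pre_odds * lr
    return post_odds / (1 + post_odds)

print(post_test_probability(0.30, lr_pos))   # probability after a positive test
print(post_test_probability(0.30, lr_neg))   # probability after a negative test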


Appendix 5
Proof of Bayes' theorem

For a given test with the following parameters: sensitivity = N, specificity = S, and pretest probability (prevalence of disease) = P, the 2 × 2 table will be as shown in Fig. A.5.1.

Using the sensitivity and specificity:

PPV = NP/T+ = NP/(NP + (1 − S)(1 − P))

Using Bayes' theorem:

O(pre) = P/(1 − P) and O(post) = O(pre) × LR+
LR+ = N/(1 − S)
O(post) = [P/(1 − P)] × [N/(1 − S)] = NP/((1 − S)(1 − P))
P(post) = O(post)/(1 + O(post)) = [NP/((1 − S)(1 − P))]/[1 + NP/((1 − S)(1 − P))]

Now multiply top and bottom by (1 − S)(1 − P):

P(post) = NP/((1 − S)(1 − P) + NP) = NP/(NP + (1 − S)(1 − P))

This is the same as the PPV.

Similarly:

FRR = 1 − NPV = (1 − N)P/T− = P(1 − N)/(S(1 − P) + P(1 − N))
LR− = (1 − N)/S
O(post) = [P/(1 − P)] × [(1 − N)/S] = P(1 − N)/(S(1 − P))
P(post) = [P(1 − N)/(S(1 − P))]/[1 + P(1 − N)/(S(1 − P))]

Now multiply top and bottom by S(1 − P):

P(post) = P(1 − N)/(S(1 − P) + P(1 − N))

This is the same as the FRR (Fig. A.5.1).



Fig. A.5.1 Bayes' theorem: 2 × 2 table.

        D+          D−               Total
T+      NP          (1 − S)(1 − P)   NP + (1 − S)(1 − P)
T−      (1 − N)P    S(1 − P)         (1 − N)P + S(1 − P)
Total   P           1 − P
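A quick numerical check of this proof, in a few lines of Python (the values of N, S, and P are arbitrary):

# For any N, S, P, the Bayes post-test probability equals the PPV
# computed directly from the 2 x 2 table in Fig. A.5.1.
N, S, P = 0.90, 0.85, 0.20            # sensitivity, specificity, prevalence

ppv = (N * P) / (N * P + (1 - S) * (1 - P))           # from the table
post_odds = (P / (1 - P)) * (N / (1 - S))             # Bayes' theorem
bayes_ppv = post_odds / (1 + post_odds)
assert abs(ppv - bayes_ppv) < 1e-12
print(ppv, bayes_ppv)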


Appendix 6
Using balance sheets to calculate thresholds

Strep throat

Suppose you are examining a 36-year-old white male with a sore throat and want to know whether treatment for strep throat is a good idea. The exam is equivocal, with large tonsils with exudate, but no cervical nodes or scarlatiniform rash, and only slight coryza.1

Disease: Strep throat.
Prevalence in the literature: About 20% for large tonsils with exudate. If no exudate, this drops to about 10%, and if there are also tender cervical nodes it increases to 40%.

Estimate the treatment threshold.

Potential harm from antibiotic treatment: 4–5% of patients will get a rash or diarrhea, both of which are uncomfortable but not life-threatening. Anaphylaxis (life-threatening allergy) is very rare (<1 : 200 000) and will not be counted in the analysis. Harm = 0.05.
Impact of this harm: Discomfort for about 2–3 days, rated about 0.1 on a 0–1 scale. It could be greater if the patient modeled swimwear and a rash would put him or her out of work for those days. Impact = 0.1.
Impact of improvement: Since treatment results in relief of symptoms about 1 day sooner, this should be similar to the harm impact, 0.1 on the 0–1 scale. Impact = 0.1. Improvement = 1 (100% get better by this 1 day).
Action or treatment threshold: (harm × harm impact)/(improvement × improvement impact) = (0.05 × 0.1)/(1 × 0.1) = 0.05.
This is the threshold for treatment without testing.

Will a test change your mind if the pretest probability is 20%?
The sensitivity and specificity of throat culture are 0.9 and 0.85, respectively. If you apply these to a pretest probability of 20%, a negative test leaves a post-test probability of disease (the false reassurance rate, 1 − NPV) of 0.03 (3%). This is below the action (treatment) threshold (5%), so treatment would not be initiated if the test were negative. Therefore it pays to do the test.

1 From R. Gross. Making Medical Decisions: an Approach to Clinical Decision Making for Practicing Physicians. Philadelphia, PA: American College of Physicians, 1999.
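The arithmetic of this example can be checked with a few lines of Python (a sketch using the numbers above):

# Treatment threshold from the balance sheet, and the post-test
# probability of strep after a negative throat culture (via LR-).
harm, harm_impact = 0.05, 0.1
improvement, improvement_impact = 1.0, 0.1
threshold = (harm * harm_impact) / (improvement * improvement_impact)  # 0.05

sens, spec, pretest = 0.90, 0.85, 0.20
pre_odds = pretest / (1 - pretest)
post_odds = pre_odds * (1 - sens) / spec        # LR- for a negative culture
p_disease_after_neg = post_odds / (1 + post_odds)
print(threshold, p_disease_after_neg)           # 0.05 and ~0.03: below threshold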

Tuberculosis

Now let's consider a different problem in an Asian man with lung lesions, fever, and cough, and let's use a slightly different methodology. The differential is between tuberculosis (highly contagious and treated with antibiotics) and sarcoidosis (not contagious and treated with steroids). The initial testing is negative for both. How should the patient be treated while waiting for the results of the culture for TB (the gold standard)? The clinical probability of TB is estimated at 70% before initial testing and 40% after initial testing (normal angiotensin-converting-enzyme level, negative TB skin test, noncaseating granulomas on biopsy).

Normal angiotensin-converting-enzyme level: against sarcoidosis, but poor sensitivity.
Negative TB skin test: against TB, but can be present in overwhelming TB infection (poor sensitivity).
Noncaseating granulomas on biopsy: against TB and for sarcoidosis.

Benefit (B) = untreated TB mortality − treated TB mortality = 50% − 20% = 30%
Risk (R) = death from hepatitis due to treatment = prevalence of hepatitis in Asian men treated with TB medications (2%) × risk of death from hepatitis (7.6%) = 0.15%
Treatment threshold = 1/(B : R + 1)
B : R = 30 : 0.15 = 200
Treatment threshold = 1/201 = 0.005

Therefore treat with TB medications, since the estimated probability of disease in this patient is 40%, greater than the treatment threshold. If B is very high and R is very low, you will almost always treat regardless of the test result. If the converse holds (R high and B low), you will be much less likely to treat without a fairly high degree of evidence of the target disorder.

Acute myocardial infarction

In this case, you must consider how sure you are of the diagnosis in order to use the more expensive thrombolytic therapy (t-PA) rather than the cheaper streptokinase (SK).

B = 0.01 (1%): the difference between the mortality of AMI with t-PA compared to SK
R = 0.008 (0.8%): the difference in the occurrence of acute cerebral bleed with t-PA over SK
Therefore B : R = 1.25
B : R + 1 = 2.25, and T = 1/2.25 ≈ 0.45, so you would not initiate thrombolytic therapy unless the probability of thrombotic MI was greater than about 45%.
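Both of the last two examples use the same threshold formula, T = 1/(B : R + 1). A small Python sketch with the numbers from the text:

# Treatment threshold from a benefit-to-risk ratio.
def treatment_threshold(benefit, risk):
    return 1 / (benefit / risk + 1)

print(treatment_threshold(0.30, 0.0015))   # TB: ~0.005
print(treatment_threshold(0.01, 0.008))    # t-PA vs. SK: ~0.45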


Glossary

2AFC (two-alternative-forced-choice) problem: The probability that one can identify an abnormal patient from a normal patient using this test alone.

Absolute risk: The percentage of subjects in a group that experiences a discrete outcome.

Absolute risk (rate) reduction (ARR): The difference in rates of outcomes between the control group and the experimental or exposed group. An efficacious therapy serves to reduce that risk. For example, if 15% of the placebo group died and 10% of the treatment group died, the ARR or absolute reduction in the risk of death is 5%.

Accuracy: Closeness of a given observation to the true value of that state.

Adjustment: Changing the probability of disease as a result of performing a diagnostic maneuver (additional history, physical exam, or diagnostic test of some kind).

Algorithm: A preset path which takes the clinician from the patient's presenting complaints to a final management decision through a series of predetermined branching decision points.

All-or-none case series: In previous studies all the patients who were not given the intervention died and now some survive, or many of the patients previously died and now none die.

Alternative hypothesis: There is a difference between groups or an association between predictor and outcome variables. Example: the patients being treated with a newer antihypertensive drug will have a lower blood pressure than those treated with the older drug.

Anchoring: The initial assignment of pretest probability of disease based upon elements of the history and physical.

Applicability: The degree to which the results of a study are likely to hold true in your practice setting. Also called external validity, generalizability, particularizability, or relevance.

Arm (of decision tree): A particular diagnostic modality, risk factor, or treatment method.

Assessment: The clinician's inferences on the nature of the patient's problem. Synonymous with differential diagnosis or hypotheses of cause of the underlying problems.

AUC (area under the ROC curve): Probability that one can identify a diseased patient from a healthy one using this test alone.

Availability heuristic: The ability to think of something depends upon how recently you studied that fact.

Bayes' theorem: What we know after doing a test equals what we knew before doing the test times a modifier (based on the test results). Post-test odds = pretest odds × likelihood ratio.

Bias: Any factor other than the experimental therapy that could change the study results in a non-random way. The direction of the offset caused by bias may be unpredictable. The validity of a study is integrally related to the degree to which the results could have been affected by biased factors.

Blinding: Masking or concealment from study subjects, caregivers, observers, or others involved in the study of some or all details of the study. The process by which neither the subject nor the research team members who have contact with the subject know to which treatment condition the subject is assigned. Single-blind means that one person (patient or physician) does not know what is going on. Double-blind means that at least two people (usually patient and treating physician) don't know what's going on. Triple-blind means that the patient, treating physician, and person measuring the outcome don't know to which group the patient is assigned. It can also mean that the paper is written before the results are tabulated. The whole point of blinding is to prevent bias.

Case–control study: Subjects are grouped by outcome, cases having the disease or outcome of interest and controls not having it. The presence of the risk factor of interest is then compared in the two groups. These studies are usually retrospective.

Case report or case series: One or a group of cases of a particular disease or outcome of interest with no control group.

Clinical guideline: An algorithm used in making clinical decisions. Also called a practice guideline.

Clinical significance: Results that make enough difference to you and your patient to justify changing your way of doing things. For example, a drug which is found in a megatrial of 50 000 adults with acute asthma to increase FEV1 by only 0.5% (P < 0.0001) would fail this test of significance. The findings must have practical importance as well as statistical importance.

Cochrane Collaboration: An internationally organized effort to catalog and systematically evaluate all existing clinical studies into systematic reviews easily accessible to practicing clinicians so as to facilitate the process of using the best clinical evidence in patient care.

Cohort study: Subjects are grouped by the risk factor, and those with and without the risk factor are followed to see who develops the disease and who doesn't. The occurrence of the outcome of interest is compared in the two groups. These studies can be prospective or retrospective (non-concurrent).

Cointervention: A treatment that is not under investigation given to a study patient. Can be a source of bias in the study.

Competing-hypotheses heuristic: A way of thinking in which all possible hypotheses are evaluated for their likelihood and the final decision is based on the most likely hypothesis modified by secondary evaluations.

Confidence intervals: An interval around an observed parameter guaranteed to include the true value to some level of confidence (usually 95%). The true value can be expected to be within that interval with 95% confidence.

Continuous test results: A test resulting in an infinite number of possible outcome values.

Control group: The subjects in an experiment who do not receive the treatment procedure being studied. They may get nothing, a placebo, or a standard or previously validated therapy.

Controlled clinical trial: Any study that compares two groups for exposure to different therapies or risk factors. A true experiment in which one group is given the experimental intervention and the other group is a control group.

Cost-effectiveness: Marginal cost divided by marginal benefit: (cost of treatment A − cost of treatment B)/(benefit of treatment A − benefit of treatment B).

Cost-effectiveness (or cost–benefit) analysis: Research study which determines how much more has to be paid in order to achieve a given benefit of preventing death, disability days, or another outcome.

Cost-minimization analysis: Analysis in which only costs are compared.

Criterion-based validity: How well a measurement agrees with other approaches for measuring the same characteristic.

Critical appraisal: The process of assessing and interpreting evidence systematically, considering its validity, results, and relevance.

Critical value: The value of a test statistic to which the observed value is compared to determine statistical significance. The observed test statistic indicates that significant differences or associations exist if its value is greater than the critical value.

Critically appraised topic (CAT): A summary of a search and critical appraisal of the literature related to a focused clinical question. A catalogue of these kept in an easily accessible place (e.g., online) can be used to help make real-time clinical decisions.

Decision analysis: A systematic way in which the components of decision making can be incorporated to make the best possible clinical decision using a mathematical model. Also known as expected-values decision making.

Decision node: A point on a branching decision tree at which the clinician must make a decision to either perform a clinical maneuver (diagnosis or management) or not.

Degrees of freedom (df): A number used to select the appropriate critical value of a statistic from a table of critical values.

Dependent variable: The outcome variable that is influenced by changes in the independent variable of a study.

Descriptive research: Study which summarizes, tabulates, or organizes a set of measures (i.e., answers the questions who, what, when, where, and how).

Descriptive statistics: The branch of statistics that summarizes, tabulates, and organizes data for the purpose of describing observations or measurements.

Diagnostic test characteristics: Those qualities of a diagnostic test that are important to understand how valuable it would be in a clinical setting. These include sensitivity, specificity, accuracy, precision, and reliability.

Diagnostic tests: Modalities which can be used to increase the accuracy of a clinical assessment by helping to narrow the list of possible diseases that a patient can have.

Dichotomous outcome: Any outcome measure for which there are only two possibilities, like dead/alive, admitted/discharged, graduated/sent to glue factory. Beware of potentially fake dichotomous outcome reports such as "improved/not improved", particularly when derived from continuous outcome measures. For example, if I define a 10-point or greater increase in a continuous variable as "improved," I may show what looks like a tremendous benefit when that result is clinically insignificant. This is lesson 2a in "How to lie with statistics."

Dichotomous test results: Only two possible outcome values: yes or no, positive or negative, alive or dead, etc.

Differential diagnosis: A list of possible diseases that your patient can have, in descending order of clinical probability.

Effect size: The amount of change measured in a given variable as a result of the experiment. In meta-analyses, when different studies have measured somewhat different things, a statistically derived generic size of the combined result.

Effectiveness: How well the proposed intervention actually works in practice to produce a desired outcome in more generalized clinical situations. This is usually the desired outcome for the patient and society.

Efficacy: How well the proposed intervention works to produce a desired and measurable effect in a well-done clinical trial. These results may not be duplicated in "real life."

Event rate: The percentage of events of interest in one or the other of the groups in an experiment. These rates are compared to calculate the number needed to treat. This is also a term for absolute risk.

Expected values (E): Probability × utility (P × U). The value of each arm of the decision tree or the entire decision tree (sum of P × U).

Expected-values decision making: See Decision analysis.

Experimental group(s): The subjects in an experiment who receive the treatment procedure or manipulation that is being proposed to improve health or treat illness.

Explanatory research – experimental: Study in which the independent variable (usually a treatment) is changed by the researcher, who then observes the effect of this change on the dependent variable (usually an outcome). The key here is the willful manipulation of the two variables.

Explanatory research – observational: Study looking for possible causes of disease (dependent variable) based upon exposure to one or more risk factors (independent variable) in the population.

Exposure: Any type of contact with a substance that causes an outcome. A drug, a surgical procedure, a risk factor, even a diagnostic test can be an exposure. In therapy, prognosis, or harm studies the "exposure" is the intervention being studied.

External validity: See Applicability.

False negative (FN): Patients with disease who have a normal or negative test.

False positive (FP): Patients without disease who have an abnormal or positive test.

FAR (false alarm rate): Percentage of patients with a positive test who don't have disease and will be unnecessarily tested or treated based on the incorrect results of a test.

Filter: A process by which patients are entered into or excluded from a study. Inclusion and exclusion criteria when stated explicitly.

FNR (false negative rate): One minus the sensitivity (1 − sens). Percentage of diseased patients with a negative or normal test.

FPR (false positive rate): One minus the specificity (1 − spec). Percentage of non-diseased patients with a positive or abnormal test.

Framing effect: How a question is worded (or framed) will influence the answer to the question.

FRR (false reassurance rate): Percentage of patients with a negative or normal test result who actually have disease and will lose the benefits of treatment for the disease.

Functional status: An outcome which describes the ability of a person to interact in society and carry on with their daily living activities (e.g., Activities of Daily Living (ADL) or the Arthritis Activity Scale used in rheumatoid arthritis).

Gaussian: The typical bell-shaped frequency curve, in which "normal" test values are defined as the central 95% (±2 SD) of all test results.

Generalizability: See Applicability.

Gold standard: The reference standard for evaluation of a measurement or diagnostic test. The "gold-standard" test is assumed to correctly identify the presence or absence of disease 100% of the time.

Harm vs. benefit: An accounting of the positive and negative aspects of an exposure (positive or negative) on the outcomes of a study.

Heuristics: Models for the way people think.

Homogeneity: Whether the results from a set of independently performed studies on a particular question are similar enough to make statistical pooling valid.

Hypothesis: An educated guess on the nature of the patient's illness, usually obtained by selecting those diseases having the same history or physical examination characteristics as the patient.

Hypothetico-deductive strategy: A diagnosis is made by advancing a hypothesis and then deducing the correctness or incorrectness of that hypothesis through the use of statistical methods, specifically the characteristics of diagnostic tests.

Incidence: The rate at which an event occurs in a defined population over time. The number of new cases (or other events of interest) divided by the total population at risk.

Incorporation bias: The test being measured is part of the gold standard or inclusion criteria for entry into a study.

Incremental gain: The amount of increase in diagnostic certainty. The change in the pretest probability of a diagnosis as a result of performing a diagnostic test.

Independent variable(s): The treatment or exposure variable that is presumed to cause some effect on the outcome or dependent variable.

Inferential statistics: Drawing conclusions about a population based on findings from a sample.

Instrumental rationality: Calculation of a treatment strategy which will produce the greatest benefit for the patient.

Instrumentation: The process of selecting or developing measuring devices.

Instruments (measuring devices): Something that makes a measurement, e.g., a thermometer, sphygmomanometer (blood pressure cuff and manometer), questionnaire, etc.

Intention-to-treat: Patients assigned to a particular treatment group by the study protocol are retained in that group for the purpose of analysis of the study results no matter what happens.

Internal validity: See Validity.

Inter-observer reliability: Consistency between two different observers' measurements.

Interval likelihood ratios (iLR): The probability of a test result in the interval among diseased subjects divided by the probability of a test result within the interval among non-diseased subjects.

Intra-observer reliability: The ability of the same observer to reproduce a measure.

Intrinsic characteristics of a diagnostic test: See Diagnostic test characteristics.

Justice: Equal access to medical care for all patients who require it, based only upon the severity of their disease.

Kappa statistic: A measure of inter- or intra-observer reliability.

Level of significance (confidence level): Describes the probability of incorrectly rejecting the null hypothesis and concluding that there is a difference when in fact none exists (i.e., the probability of Type I error). Many times this probability is 0.01, 0.05, or 0.10. For medical studies it is most commonly set at 0.05.

Likelihood ratio of a negative test (LR−): The false negative rate divided by the true negative rate. The amount by which the pretest probability of disease is reduced in patients with a negative test.

Likelihood ratio of a positive test (LR+): The true positive rate divided by the false positive rate. The amount by which the pretest probability is increased in patients with a positive test.

Likelihood ratio: A single number which summarizes test sensitivity and specificity and modifies the pretest probability of disease to give a post-test probability.

Linear rating scale: A scale from zero to one on which patients can place a mark to determine their value for a particular outcome.

Markov models: A method of decision analysis that considers all possible health states and their interactions at the same time.

Matching: An attempt in an experiment to create equivalence between the control and treatment groups. Control subjects are matched with experimental subjects based upon one or more variables.

Mean: A measure of central tendency; the arithmetic average.

Measurement: The application of an instrument or method to collect data systematically. What the use of the instrument tells us, e.g., temperature, blood pressure, results of a dietary survey, etc.

Meta-analysis: A systematic review of a focused clinical question following rigorous methodological criteria and employing statistical techniques to combine data from multiple independently performed studies on that question.

Multiple-branching strategy: An algorithmic method used for making diagnoses.

N or n: The number of subjects in the sample or the number of observations made in a study.

Negative predictive value (NPV): The probability of no disease after a negative test result.

Nodes: Junctures where something happens. The common ones are decision and probability nodes.

Non-inferiority trial: A study that seeks to show that one of two treatments is not worse than the other.

Normal: (1) A normal distribution or Gaussian distribution of variables, the bell-shaped curve. (2) A value of a diagnostic test which defines patients who are not diseased.

Null hypothesis: The assumption that there is no difference between groups or no association between predictor and outcome variables.

Number needed to follow (NNF): The number of patients who must be followed before one additional bad outcome is noted. The lower this number, the worse the risk factor.

Number needed to treat to harm (NNTH): The number of patients who must be treated or exposed to a risk factor to have one additional bad outcome. The lower this number, the worse the exposure.

Number needed to treat to benefit (NNTB): The number of patients who must be treated to have one additional successful outcome. The lower that number, the better the therapy.

Objective: Information observed by the physician from the patient examination and diagnostic tests.

Observational study: Any study of therapy, prevention, or harm in which the exposure is not assigned to the individual subject by the investigator(s). A synonym is "non-experimental"; examples are case–control and cohort studies.

Odds: The number of times an event occurred divided by the number of times it didn't.

Odds ratio: The odds of an event in one group divided by the odds in another group.

One-tailed statistical test: Used when the alternative hypothesis is directional (i.e., specifies a particular direction of the difference between the groups).

Operator-dependent: The results of a test are dependent on the skill of the person performing the test.

Outcome: The disease or final state of the patient (e.g., alive or dead).

Outcomes study: The outcome of an intervention, exposure, or diagnosis measured over a period of time.

P value: The probability that the difference(s) observed between two or more groups in a study occurred by chance if there really was no difference between the groups.

Pathognomonic: The presence of signs or symptoms of disease which can lead to only one diagnosis (i.e., they are only characteristic of that one disease).

Patient satisfaction: A rating scale which measures the degree to which patients are happy with the care they received or feel that the care was appropriate.

Patient values: A number, generally from 0 (usually death) to 1 (usually complete recovery), which denotes the degree to which a patient is desirous of a particular outcome.

Pattern recognition: Recognizing a disease diagnosis based on a pattern of signs and symptoms.

Percentiles: Cutoffs between positive and negative test results chosen within preset percentiles of the patients tested.

Placebo: An inert substance given to a study subject who has been assigned to the control group to make them think they are getting the treatment under study.

Plan: What treatment or further diagnostic testing is required.

Point: On a decision tree, the outcome of possible decisions made by the patient and clinician.

Point estimate: The exact result that has been observed in a study. The confidence interval tells you the range within which the true value of the result is likely to lie with 95% confidence.

Point of indifference: The probability of an outcome of certain death at which a patient no longer can decide between that outcome and an uncertain outcome of partial disability.

Population: The group of people who meet the criteria for entry into a study (whether they actually participated in the study or not). The group of people to whom the study results can be generalized.

Positive predictive value: The probability of disease after the occurrence of a positive test result.

Post-test odds: The odds of disease after a test has been done. Post-test odds = pretest odds × likelihood ratio.

Post-test probability: The probability of disease after a test has been performed. This is calculated from the post-test odds converted to probability. Also called posterior or a-posteriori probability.

Power: The probability that an experimental study will correctly observe a statistically significant difference between the study groups when that difference actually exists.

Precision: The measurement is nearly the same value each time it is measured. A measure of random variation or error, or a small standard deviation of the measurement across multiple measurements.

Predictive values: The probability that a patient with a particular outcome on a diagnostic test (positive or negative) has or does not have the disease.

Predictor variable: The variable that is going to predict the presence or absence of disease, or the results of a test.

Pretest odds: The odds of disease before a test is run.

Pretest probability: The probability of disease before a test is run. This is converted to odds for use with Bayes' theorem. Also called prior or a-priori probability.

Prevalence: The proportion of people in a defined group who have a disease, condition, or injury. The number affected by a condition divided by the population at risk. In the context of diagnosis, this is also called "pretest probability."

Probability node: A point in the decision tree at which two or more events occur by chance.

Problem-oriented medical record (POMR): A format of keeping medical records by which one keeps track of and updates a patient's problems regularly.

Prognosis: The possible outcomes for a given disease and the length of time to those outcomes.

Prospective study: Any study done forward in time. Important in studies on therapy, prognosis, or harm, where retrospective studies make hidden biases more likely.

Publication bias: The possibility that studies with conflicting results (most often negative studies) are less likely to be published.

Quality of life: A composite measure of the satisfaction of a patient with their life and their ability to function appropriately.

Quality-adjusted life years (QALYs): A standardized measure of quality and life expectancy commonly used in decision analyses. Life expectancy times expected value or utility.

Random selection or assignment: A selection process of a sample of the population such that every subject in the population has an equal chance of being selected for each arm of the study.

Randomization: A technique that gives every patient an equal chance of winding up in any particular arm of a controlled clinical trial.

Randomized clinical trial or randomized controlled trial (RCT): An interventional study in which the patients are randomly selected or assigned either to a group which gets the intervention or to a control group.

Receiver operating characteristic (ROC) curve: A plot of sensitivity versus one minus specificity (true positive rate versus false positive rate) that can give the quality of a diagnostic test and determine which is the best cutoff point.

Referral bias: Patients entered into a study because they have been referred for a particular test or to a specialty provider.

Relative risk: The probability of the outcome in the group with the exposure divided by the probability of the outcome in the group without the exposure.

Reliability: A loose synonym of precision; the extent to which repeated measurements of the same phenomenon are consistent, reproducible, and dependable.

Representativeness heuristic: The ease with which a diagnosis is recalled depends on how closely the patient presentation fits the classical presentation of the disease.

Research question (hypothesis): A question stating a general prediction of results which the researcher attempts to answer by conducting a study.

Retrospective study: Any study in which the outcomes have already occurred before the study and collection of data has begun.

Risk: The probability of an adverse event divided by all of the times one is exposed to that event.

Risk factor: Any aspect of an individual's life, behavior, or inheritance that could affect (increase or decrease) the likelihood of an outcome (disease, condition, or injury).

Rule in: To effectively determine that a particular diagnosis is correct by either excluding all other diagnoses or making the probability of that diagnosis so high that other diagnoses are effectively excluded.

Rule out: To effectively exclude a diagnosis by making the probability of that disease so low that it is effectively considered non-existent.

Sample: That part of the population selected to be studied. The group specifically included in the actual study.

Sampling bias: Selecting patients for study based on some criteria that could relate to the outcome.

Screening: Looking for disease among asymptomatic patients.

Sensitivity: The ability of a test to identify patients who have disease when it is present. The true positive rate.

Sensitivity analysis: An analytical procedure to determine how the results of a study would change if the input variables are changed.

Setting: The place in which the testing for a disease occurs, usually referring to the level of care.

SOAP notes: Subjective, objective, assessment, and plan. The typical format for problem-oriented medical record notes.

Specificity: The ability of a test to identify patients without the disease when it is negative. The true negative rate.

Spectrum: In a diagnostic study, the range of clinical presentations and relevant disease advancement exhibited by the subjects included in the study.

Spectrum bias: The sensitivity of a test is higher in more severe or "well-developed" cases of a disease, and lower when patients present earlier in the course of disease, or when the disease is occult.

Standard gamble: A technique to determine patient values by which patients are given a choice between a known outcome and a hypothetical probabilistic outcome.

Statistic: A number that describes some characteristic of a set of data.

Statistical power: See Power.

Statistical significance: A measure of how confidently an observed difference between two or more groups can be attributed to the study interventions rather than chance alone.

Stratified randomization: A way of ensuring that the different groups in an experimental trial are balanced with respect to some important factors that could affect the outcome.

Strategy of exhaustion: Listing all possible diseases which a patient could have and running every diagnostic test available and necessary to exclude all diseases on that list until only one is left.

Subjective: Information from the patient; the history which the patient gives you and which they are experiencing.

Surrogate marker: An outcome variable that is associated with the outcome of interest, but changes in this marker are not necessarily a direct measure of changes in the clinical outcome of interest.

Survival analysis: A mathematical analysis of outcome after some kind of therapy in which patients are followed for a given period of time to determine what percentage are still alive or disease-free after that time.

Systematic review: A formal review of a focused clinical question based on a comprehensive search strategy and structured critical appraisal of all relevant studies.

Testing threshold: The probability of disease above which we should test before initiating treatment for that disease, and below which we should neither treat nor test.

Threshold approach to decision making: Determining values of pretest probability below which neither testing nor treatment should be done and above which treatment should be begun without further testing.

Time trade-off: A method of determining patient utility using a simple question of how much time in perfect health a patient would trade for a given amount of time in imperfect health.

Treatment threshold: The probability of disease above which we should initiate treatment without first doing the test for the disease.

Triggering: A thought process which is initiated by recognition of a set of signs and symptoms leading the clinician to think of a particular disease.

Two-tailed statistical test: Used when the alternative hypothesis is non-directional and there is no specification of the direction of differences between the groups.

Type I error: The error made by rejecting the null hypothesis when it is true and accepting the alternative hypothesis when it isn't true.

Type II error: The error made by not rejecting the null hypothesis when it is false and the alternative hypothesis is true.

Unadjusted life expectancy (life years): The number of years a person is expected to live based solely on their age at the time. Adjusting would consider lifestyle factors such as smoking, risk-taking, cholesterol, weight, etc.

Uncertainty: The inability to determine precisely what an outcome would be for a disease or diagnostic test.

Utility: The measure of the value of an outcome. Also whether a patient is truly better off as a result of a diagnostic test.

Validity: (1) The degree to which the results of a study are likely to be true, believable, and free of bias. (2) The degree to which a measurement represents the phenomenon of interest.

Variable: Something that can take on different values, such as a diagnostic test, risk factor, treatment, outcome, or characteristic of a group.

Variance: A measure of the spread of values around the mean.

Yule–Simpson paradox: A statistical paradox in which one group is superior overall while the other is superior for all of the subgroups.


Bibliography

Common medical journals

The following are the major peer-reviewed medical journals grouped by specialty. This is only a partial list. Many other peer-reviewed journals exist in all specialties.

General
New England Journal of Medicine
JAMA (Journal of the American Medical Association)
BMJ (British Medical Journal)
Lancet
Postgraduate Medicine

Emergency Medicine
Annals of Emergency Medicine
American Journal of Emergency Medicine
Journal of Emergency Medicine
Academic Emergency Medicine

Family Practice
Family Physician
Journal of Family Practice
Journal of the American Board of Family Practice
Archives of Family Practice

Internal Medicine
Annals of Internal Medicine
Journal of General Internal Medicine
Archives of Internal Medicine
American Journal of Medicine

Internal Medicine Specialties
American Journal of Cardiology
Circulation
Thorax


Annual Review of Respiratory Diseases
Gut
Gastroenterology
Nephron
Blood

Medical Education
Academic Medicine
Medical Teacher

Neurology and Neurosurgery
Annals of Neurology
Neurology
Stroke
Journal of Neurosurgery
Neurosurgery

Obstetrics and Gynecology
Obstetrics and Gynecology
American Journal of Obstetrics and Gynecology

Pediatrics
Pediatrics
Journal of Pediatrics
American Journal of Diseases of Children

Psychiatry
American Journal of Psychiatry
Journal of Clinical Psychiatry

Radiology
AJR (American Journal of Roentgenology)

Surgery
Annals of Surgery
American Journal of Surgery
Archives of Surgery
American Surgeon
Journal of the American College of Surgeons

Common non-peer-reviewed journals (also known as "throw-aways")
Hospital Physician
Resident and Physician


There has been a real explosion of books and articles that discuss EBM. This is just a brief selection.

Books

American National Standards Institute. American National Standard for the Preparation of Scientific Papers for Written or Oral Presentation. ANSI Z39.16. Washington, DC: American National Standards Institute, 1972.
Ball, C. M. & Phillips, R. S. Evidence-Based on Call: Acute Medicine. Edinburgh: Churchill Livingstone, 2001.
Bernstein, P. L. Against the Gods: the Remarkable Story of Risk. New York, NY: Wiley, 1998.
Bradford Hill, A. A Short Textbook of Medical Statistics. Oxford: Oxford University Press, 1977.
Cochrane, A. L. Effectiveness & Efficiency: Random Reflections on Health Services. London: Royal Society of Medicine, 1971.
Cohen, J. Statistical Power Analysis for the Behavioral Sciences. 2nd edn. Orlando, FL: Academic Press, 1988.
Daly, J. Evidence-Based Medicine and the Search for a Science of Clinical Care. Berkeley, CA: University of California Press, 2005.
Dawes, M., Davies, P., Gray, A., Mant, J., Seers, K. & Snowball, R. Evidence-Based Practice: a Primer for Health Care Professionals. Edinburgh: Churchill Livingstone, 1999.
Dixon, R. A., Munro, J. F. & Silcocks, P. B. The Evidence Based Medicine Workbook: Critical Appraisal for Clinical Problem Solving. Oxford: Oxford University Press, 1997.
Ebell, M. R. Evidence-Based Diagnosis: a Handbook of Clinical Prediction Rules. Berlin: Springer, 2001.
Eddy, D. Clinical Decision Making. Sudbury, MA: Jones & Bartlett, 1996.
Fletcher, R. H., Fletcher, S. W. & Wagner, E. H. Clinical Epidemiology: the Essentials. Baltimore, MD: Williams & Wilkins, 1995.
Friedland, D. J., Go, A. S., Davoren, J. B., Shlipak, M. G., Bent, S. W., Subak, L. L. & Mendelson, T. Evidence-Based Medicine: A Framework for Clinical Practice. Stamford, CT: Appleton & Lange, 1998.
Gelbach, S. H. Interpreting the Medical Literature. New York, NY: McGraw-Hill, 1993.
Geyman, J. P., Deyo, R. A. & Ramsey, S. D. Evidence-Based Clinical Practice: Concepts and Approaches. Boston, MA: Butterworth Heinemann, 1999.
Glasziou, P., Irwig, L., Bain, C. & Colditz, G. Systematic Reviews in Health Care: a Practical Guide. Cambridge: Cambridge University Press, 2001.
Glasziou, P., DelMar, C. & Salisbury, J. Evidence-Based Practice Workbook. 2nd edn. Blackwell Publishing, 2007.
Greenhalgh, T. & Donald, A. Evidence-Based Health Care Workbook. London: BMJ Books, 2000.
Gray, J. A. M. Evidence-Based Healthcare: How to Make Health Policy and Management Decisions. Philadelphia, PA: Saunders, 2001.
Gross, R. Making Medical Decisions: an Approach to Clinical Decision Making for Practicing Physicians. Philadelphia, PA: American College of Physicians, 1999.
Decisions and Evidence in Medical Practice: Applying Evidence-Based Medicine to Clinical Decision Making. St Louis, MO: Mosby, 2001.
Guyatt, G. & Rennie, D. (eds.). Users’ Guides to the Medical Literature: a Manual for Evidence-Based Clinical Practice. Chicago: AMA, 2002.
Hamer, S. & Collinson, G. Achieving Evidence-Based Practice: A Handbook for Practitioners. Edinburgh: Bailliere Tindall, 1999.
Hulley, S. B. & Cummings, S. R. Designing Clinical Research. Baltimore, MD: Williams & Wilkins, 1988.
Matthews, J. R. Quantification and the Quest for Medical Certainty. Princeton, NJ: Princeton University Press, 1995.
McDowell, J. E. & Newell, C. Measuring Health: a Guide to Rating Scales and Questionnaires. New York, NY: Oxford University Press, 1987.
McGee, S. R. Evidence-Based Physical Diagnosis. Philadelphia, PA: Saunders, 2001.
McKibbon, A., Eady, A. & Marks, S. PDQ Evidence-Based Principles and Practice. Hamilton, ON: B.C. Decker Inc., 1999.
Norman, G. & Streiner, D. Biostatistics: the Bare Essentials. Hamilton, ON: B.C. Decker Inc., 2000.
Riegelman, R. K. & Hirsch, D. S. Studying a Study and Testing a Test: How to Read the Medical Literature. 4th edn. Boston, MA: Little Brown, 2000.
Sackett, D. L., Haynes, R. B., Guyatt, G. H. & Tugwell, P. Clinical Epidemiology: a Basic Science for Clinical Medicine. 2nd edn. Boston, MA: Little Brown, 1991.
Sackett, D. L., Straus, S. E., Richardson, W. S., Rosenberg, W. & Haynes, R. B. Evidence Based Medicine: How to Practice and Teach EBM. 2nd edn. London: Churchill Livingstone, 2000.
Sox, H. C., Blatt, M. A., Higgins, M. C. & Marton, K. I. Medical Decision Making. Boston, MA: Butterworth Heinemann, 1988.
Spencer, J. W. & Jacobs, J. Complementary and Alternative Medicine: an Evidence-Based Approach. St Louis, MO: Mosby, 2003.
Straus, S. E., Hsu, S., Ball, C. M. & Phillips, R. S. Evidence-Based Acute Medicine. Edinburgh: Churchill Livingstone, 2001.
Straus, S. E., Richardson, W. S., Glasziou, P. & Haynes, R. B. Evidence Based Medicine: How to Practice and Teach EBM. 3rd edn. Edinburgh: Elsevier Churchill Livingstone, 2005.
Tufte, E. R. The Visual Display of Quantitative Information. Cheshire, CT: Graphics Press, 1983.
Velleman, P. ActivStats. Reading, MA: Addison-Wesley, 1999.
Wulff, H. R. & Gotzsche, P. C. Rational Diagnosis and Treatment: Evidence-Based Clinical Decision-Making. 3rd edn. London: Blackwell, 2000.

Journal articles

General

Ad Hoc Working Group for Critical Appraisal of the Medical Literature. A proposal for more informative abstracts of clinical articles. Ann. Intern. Med. 1987; 106: 598–604.
Bradford Hill, A. Statistics in the medical curriculum? Br. Med. J. 1947; ii: 366.
Cuddy, P. G., Elenbaas, R. M. & Elenbaas, J. K. Evaluating the medical literature. Part I: abstract, introduction, methods. Ann. Emerg. Med. 1983; 12: 549–555.
Day, R. A. The origins of the scientific paper: the IMRAD format. AMWA J. 1989; 4: 16–18.
Department of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences Centre. How to read clinical journals. I: why to read them and how to start reading them critically. Can. Med. Assoc. J. 1981; 124: 555–558.
How to read clinical journals. V: to distinguish useful from useless or even harmful therapy. Can. Med. Assoc. J. 1981; 124: 1156–1162.
Diamond, G. A. & Forrester, J. S. Clinical trials and statistical verdicts: probable grounds for appeal. Ann. Intern. Med. 1983; 98: 385–394.
Elenbaas, R. M., Elenbaas, J. K. & Cuddy, P. G. Evaluating the medical literature. Part II: statistical analysis. Ann. Emerg. Med. 1983; 12: 610–620.
Elenbaas, J. K., Cuddy, P. G. & Elenbaas, R. M. Evaluating the medical literature. Part III: results and discussion. Ann. Emerg. Med. 1983; 12: 679–686.
Ernst, E. Evidence based complementary medicine: a contradiction in terms? Ann. Rheum. Dis. 1999; 58: 69–70.
Greenhalgh, T. How to read a paper. The Medline database. BMJ 1997; 315: 180–183.
Haynes, B., Glasziou, P. & Straus, S. Advances in evidence-based information resources for clinical practice. ACP J. Club 2000; 132: A11–A14.
Haynes, R. B., Mulrow, C. D., Huth, E. J., Altman, D. G. & Gardner, M. J. More informative abstracts revisited. Ann. Intern. Med. 1990; 113: 69–76.
Haynes, R. B., Wilczynski, N., McKibbon, K. A., Walker, C. J. & Sinclair, J. C. Developing optimal search strategies for detecting clinically sound studies in MEDLINE. J. Am. Med. Inform. Assoc. 1994; 1: 447–458.
Isaacs, D. & Fitzgerald, D. Seven alternatives to evidence based medicine. BMJ 1999; 319: 1618.
Mulrow, C. D., Thacker, S. B. & Pugh, J. A. A proposal for more informative abstracts of review articles. Ann. Intern. Med. 1988; 108: 613–615.
Rennie, D. & Glass, R. M. Structuring abstracts to make them more informative. JAMA 1991; 266: 116–117.
Sackett, D. L., Rosenberg, W. M., Gray, J. A., Haynes, R. B. & Richardson, W. S. Evidence based medicine: what it is and what it isn’t. BMJ 1996; 312: 71–72.
Sackett, D. L. & Straus, S. E. Finding and applying evidence during clinical rounds: the “evidence cart”. JAMA 1998; 280: 1336–1338.
Taddio, A., Pain, T., Fassos, F. F., Boon, H., Ilersich, A. L. & Einarson, T. R. Quality of nonstructured and structured abstracts of original research articles in the British Medical Journal, the Canadian Medical Association Journal and the Journal of the American Medical Association. CMAJ 1994; 150: 1611–1615.
Taplin, S., Galvin, M. S., Payne, T., Coole, D. & Wagner, E. Putting population-based care into practice: real option or rhetoric? J. Am. Board Fam. Pract. 1998; 11: 116–126.
Woolf, S. H. The need for perspective in evidence-based medicine. JAMA 1999; 282: 2358–2365.

Cause and effect

Department of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences Centre. How to read clinical journals. IV: to determine etiology or causation. Can. Med. Assoc. J. 1981; 124: 985–990.
Evans, A. S. Causation and disease: a chronological journey. The Thomas Parran Lecture. Am. J. Epidemiol. 1978; 108: 249–258.
Weiss, N. S. Inferring causal relationships: elaboration of the criterion of “dose-response.” Am. J. Epidemiol. 1981; 113: 487–490.

Study design

Bogardus, S. T., Jr., Concato, J. & Feinstein, A. R. Clinical epidemiological quality in molecular genetic research: the need for methodological standards. JAMA 1999; 281: 1919–1926.
Burkett, G. L. Classifying basic research designs. Fam. Med. 1990; 22: 143–148.
Gilbert, E. H., Lowenstein, S. R., Koziol-McLain, J., Barta, D. C. & Steiner, J. Chart reviews in emergency medicine research: where are the methods? Ann. Emerg. Med. 1996; 27: 305–308.
Hayden, G. F., Kramer, M. S. & Horwitz, R. I. The case-control study: a practical review for the clinician. JAMA 1982; 247: 326–331.
Lavori, P. W., Louis, T. A., Bailar, J. C., III & Polansky, M. Designs for experiments – parallel comparisons of treatment. N. Engl. J. Med. 1983; 309: 1291–1299.
Mantel, N. & Haenszel, W. Statistical aspects of the analysis of data from retrospective studies of disease. J. Natl. Cancer Inst. 1959; 22: 719–748.

Measurement

Department of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences Centre. Clinical disagreement. I: how often it occurs and why. Can. Med. Assoc. J. 1980; 123: 499–504.
Clinical disagreement. II: how to avoid it and how to learn from one’s mistakes. Can. Med. Assoc. J. 1980; 123: 613–617.

Bias

Croskerry, P. Achieving quality in clinical decision making: cognitive strategies and detection of bias. Acad. Emerg. Med. 2002; 9: 1184–1204.
Feinstein, A. R., Sosin, D. M. & Wells, C. K. The Will Rogers phenomenon. Stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer. N. Engl. J. Med. 1985; 312: 1604–1608.
Sackett, D. L. Bias in analytic research. J. Chronic Dis. 1979; 32: 51–63.
Sackett, D. L. & Gent, M. Controversy in counting and attributing events in clinical trials. N. Engl. J. Med. 1979; 301: 1410–1412.
Schulz, K. F., Chalmers, I., Hayes, R. J. & Altman, D. G. Empirical evidence of bias. Dimensions of methodological quality associated with estimates of treatment effects in controlled trials. JAMA 1995; 273: 408–412.

General biostatistics

Berwick, D. M. Experimental power: the other side of the coin. Pediatrics 1980; 65: 1043–1045.
Moses, L. E. Statistical concepts fundamental to investigations. N. Engl. J. Med. 1985; 312: 890–897.
Streiner, D. L. Maintaining standards: differences between the standard deviation and standard error, and when to use each. Can. J. Psychiatry 1996; 41: 498–502.

Type I and II errors

Cook, R. J. & Sackett, D. L. The number needed to treat: a clinically useful measure of treatment effect. BMJ 1995; 310: 452–454.
Cordell, W. H. Number needed to treat (NNT). Ann. Emerg. Med. 1999; 33: 433–436.
Freiman, J. A., Chalmers, T. C., Smith, H. Jr. & Kuebler, R. R. The importance of beta, the type II error and sample size in the design and interpretation of the randomized control trial. Survey of 71 “negative” trials. N. Engl. J. Med. 1978; 299: 690–694.
Todd, K. H. & Funk, J. P. The minimum clinically important difference in physician-assigned visual analog pain scores. Acad. Emerg. Med. 1996; 3: 142–146.
Todd, K. H., Funk, K. G., Funk, J. P. & Bonacci, R. Clinical significance of reported changes in pain severity. Ann. Emerg. Med. 1996; 27: 485–489.
Young, M., Bresnitz, E. A. & Strom, B. L. Sample size nomograms for interpreting negative clinical studies. Ann. Intern. Med. 1983; 99: 248–251.

Risk

Concato, J., Feinstein, A. R. & Holford, T. R. The risk of determining risk with multivariable analysis. Ann. Intern. Med. 1993; 118: 201–210.
Hanley, J. A. & Lippman-Hand, A. If nothing goes wrong, is everything all right? Interpreting zero numerators. JAMA 1983; 249: 1743–1745.
Schulman, K. A., Berlin, J. A., Harless, W., Kerner, J. F., Sistrunk, S., Gersh, B. J., Dubé, R., Taleghani, C. K., Burke, J. E., Williams, S., Eisenberg, J. M. & Escarce, J. J. The effect of race and sex on physicians’ recommendations for cardiac catheterization. N. Engl. J. Med. 1999; 340: 618–626.
Schwartz, L. M., Woloshin, S. & Welch, H. G. Misunderstandings about the effects of race and sex on physicians’ referrals for cardiac catheterization. (Sounding Board.) N. Engl. J. Med. 1999; 341: 279–283.

Clinical trials

Bailar, J. C., III, Louis, T. A., Lavori, P. W. & Polansky, M. Studies without internal controls. N. Engl. J. Med. 1984; 311: 156–162.
Elwood, J. M. Interpreting clinical trial results: seven steps to understanding. Can. Med. Assoc. J. 1980; 123: 343–345.
Ernst, E. & Resch, K. L. Concept of true and perceived placebo effects. BMJ 1995; 311: 551–553.
Ernst, E. & White, A. R. Acupuncture for back pain: a meta-analysis of randomized controlled trials. Arch. Intern. Med. 1998; 158: 2235–2241.
Hróbjartsson, A. & Gotzsche, P. C. Is the placebo powerless? An analysis of clinical trials comparing placebo with no treatment. N. Engl. J. Med. 2001; 344: 1594–1602.
Louis, T. A., Lavori, P. W., Bailar, J. C. III & Polansky, M. Crossover and self-controlled designs in clinical research. N. Engl. J. Med. 1984; 310: 24–31.
Standards of Reporting Trials Group. A proposal for structured reporting of randomized controlled trials. JAMA 1994; 272: 1926–1931. Correction: JAMA 1995; 273: 776.
Working Group on Recommendations for Reporting Clinical Trials in the Biomedical Literature. Call for comments on a proposal to improve reporting of clinical trials in the biomedical literature. Ann. Intern. Med. 1994; 121: 894–895.

Communicating evidence to patients

Epstein, R. M., Alper, B. S. & Quill, T. E. Communicating evidence for participatory decision making. JAMA 2004; 291: 2359–2366.
Halvorsen, P. A., Selmer, R. & Kristiansen, I. S. Different ways to describe the benefits of risk-reducing treatments: a randomized trial. Ann. Intern. Med. 2007; 146: 848–856.
Gigerenzer, G. Reckoning with Risk: Learning to Live with Uncertainty. Harmondsworth: Penguin, 2002.
McNeil, B. J., Pauker, S. G., Sox, H. C. Jr. & Tversky, A. On the elicitation of preferences for alternative therapies. N. Engl. J. Med. 1982; 306: 1259–1262.

Diagnostic tests

Mower, W. R. Evaluating bias and variability in diagnostic test reports. Ann. Emerg. Med. 1999; 33: 85–91.
Patterson, R. E. & Horowitz, S. F. Importance of epidemiology and biostatistics in deciding clinical strategies for using diagnostic tests: a simplified approach using examples from coronary artery disease. J. Am. Coll. Cardiol. 1989; 13: 1653–1665.

Miscellaneous

Department of Clinical Epidemiology and Biostatistics, McMaster University Health Sciences Centre. How to read clinical journals. III: to learn clinical course and prognosis of disease. Can. Med. Assoc. J. 1981; 124: 869–872.
L’Abbé, K. A., Detsky, A. S. & O’Rourke, K. Meta-analysis in clinical research. Ann. Intern. Med. 1987; 107: 224–233.
Olson, C. M. Consensus statements: applying structure. JAMA 1995; 273: 72–73.
Sonnenberg, F. A. & Beck, J. R. Markov models in medical decision making: a practical guide. Med. Decis. Making 1993; 13: 322–338.
Wasson, J. H., Sox, H. C., Neff, R. K. & Goldman, L. Clinical prediction rules: application and methodological standards. N. Engl. J. Med. 1985; 313: 793–799.

Users’ guides to the medical literature

Barratt, A., Irwig, L., Glasziou, P., Cumming, R. G., Raffle, A., Hicks, N., Gray, J. A. & Guyatt, G. H. Users’ guides to the medical literature: XVII. How to use guidelines and recommendations about screening. Evidence-Based Medicine Working Group. JAMA 1999; 281: 2029–2034.
Bucher, H. C., Guyatt, G. H., Cook, D. J., Holbrook, A. & McAlister, F. A. Users’ guides to the medical literature: XIX. Applying clinical trial results. A. How to use an article measuring the effect of an intervention on surrogate end points. Evidence-Based Medicine Working Group. JAMA 1999; 282: 771–778.
Dans, A. L., Dans, L. F., Guyatt, G. H. & Richardson, S. Users’ guides to the medical literature: XIV. How to decide on the applicability of clinical trial results to your patient. Evidence-Based Medicine Working Group. JAMA 1998; 279: 545–549.
Drummond, M. F., Richardson, W. S., O’Brien, B. J., Levine, M. & Heyland, D. Users’ guides to the medical literature: XIII. How to use an article on economic analysis of clinical practice. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 1997; 277: 1552–1557.
Giacomini, M. K. & Cook, D. J. Users’ guides to the medical literature: XXIII. Qualitative research in health care. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 2000; 284: 357–362.
Users’ guides to the medical literature: XXIII. Qualitative research in health care. B. What are the results and how do they help me care for my patients? Evidence-Based Medicine Working Group. JAMA 2000; 284: 478–482.
Guyatt, G. & Rennie, D. (eds.). Users’ Guides to the Medical Literature: a Manual for Evidence-Based Clinical Practice. Chicago: AMA, 2002.
Guyatt, G. H., Sackett, D. L. & Cook, D. J. Users’ guides to the medical literature: II. How to use an article about therapy or prevention. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 1993; 270: 2598–2601.
Users’ guides to the medical literature: II. How to use an article about therapy or prevention. B. What were the results and will they help me in caring for my patients? Evidence-Based Medicine Working Group. JAMA 1994; 271: 59–63.
Guyatt, G. H., Sackett, D. L., Sinclair, J. C., Hayward, R., Cook, D. J. & Cook, R. J. Users’ guides to the medical literature: IX. A method for grading health care recommendations. Evidence-Based Medicine Working Group. JAMA 1995; 274: 1800–1804.
Guyatt, G. H., Naylor, C. D., Juniper, E., Heyland, D. K., Jaeschke, R. & Cook, D. J. Users’ guides to the medical literature: XII. How to use articles about health-related quality of life. Evidence-Based Medicine Working Group. JAMA 1997; 277: 1232–1237.
Guyatt, G. H., Sinclair, J., Cook, D. J. & Glasziou, P. Users’ guides to the medical literature: XVI. How to use a treatment recommendation. Evidence-Based Medicine Working Group and Cochrane Applicability Methods Working Group. JAMA 1999; 281: 1836–1843.
Guyatt, G. H., Haynes, R. B., Jaeschke, R. Z., Cook, D. J., Green, L., Naylor, C. D., Wilson, M. C. & Richardson, W. S. Users’ guides to the medical literature: XXV. Evidence-based medicine: principles for applying the Users’ Guides to patient care. Evidence-Based Medicine Working Group. JAMA 2000; 284: 1290–1296.
Hayward, R. S., Wilson, M. C., Tunis, S. R., Bass, E. B. & Guyatt, G. Users’ guides to the medical literature: VIII. How to use clinical practice guidelines. A. Are the recommendations valid? Evidence-Based Medicine Working Group. JAMA 1995; 274: 570–574.
Hunt, D. L., Jaeschke, R. & McKibbon, K. A. Users’ guides to the medical literature: XXI. Using electronic health information resources in evidence-based practice. Evidence-Based Medicine Working Group. JAMA 2000; 283: 1875–1879.
Jaeschke, R., Guyatt, G. H. & Sackett, D. L. Users’ guides to the medical literature: III. How to use an article about a diagnostic test. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 1994; 271: 389–391.
Users’ guides to the medical literature: III. How to use an article about a diagnostic test. B. What are the results and will they help me in caring for my patients? Evidence-Based Medicine Working Group. JAMA 1994; 271: 703–707.
Laupacis, A., Wells, G., Richardson, W. S. & Tugwell, P. Users’ guides to the medical literature: V. How to use an article about prognosis. Evidence-Based Medicine Working Group. JAMA 1994; 272: 234–237.
Levine, M., Walter, S., Lee, H., Haines, T., Holbrook, A. & Moyer, V. Users’ guides to the medical literature: IV. How to use an article about harm. Evidence-Based Medicine Working Group. JAMA 1994; 271: 1615–1619.
McAlister, F. A., Laupacis, A., Wells, G. A. & Sackett, D. L. Users’ guides to the medical literature: XIX. Applying clinical trial results. B. Guidelines for determining whether a drug is exerting (more than) a class effect. JAMA 1999; 282: 1371–1377.
McAlister, F. A., Straus, S. E., Guyatt, G. H. & Haynes, R. B. Users’ guides to the medical literature: XX. Integrating research evidence with the care of the individual patient. Evidence-Based Medicine Working Group. JAMA 2000; 283: 2829–2836.
McGinn, T. G., Guyatt, G. H., Wyer, P. C., Naylor, C. D., Stiell, I. G. & Richardson, W. S. Users’ guides to the medical literature: XXII. How to use articles about clinical decision rules. Evidence-Based Medicine Working Group. JAMA 2000; 284: 79–84.
Naylor, C. D. & Guyatt, G. H. Users’ guides to the medical literature: X. How to use an article reporting variations in the outcomes of health services. The Evidence-Based Medicine Working Group. JAMA 1996; 275: 554–558.
Users’ guides to the medical literature: XI. How to use an article about a clinical utilization review. Evidence-Based Medicine Working Group. JAMA 1996; 275: 1435–1439.
O’Brien, B. J., Heyland, D., Richardson, W. S., Levine, M. & Drummond, M. F. Users’ guides to the medical literature: XIII. How to use an article on economic analysis of clinical practice. B. What are the results and will they help me in caring for my patients? Evidence-Based Medicine Working Group. JAMA 1997; 277: 1802–1806.
Oxman, A. D., Sackett, D. L. & Guyatt, G. H. Users’ guides to the medical literature: I. How to get started. The Evidence-Based Medicine Working Group. JAMA 1993; 270: 2093–2095.
Oxman, A. D., Cook, D. J. & Guyatt, G. H. Users’ guides to the medical literature: VI. How to use an overview. Evidence-Based Medicine Working Group. JAMA 1994; 272: 1367–1371.
Randolph, A. G., Haynes, R. B., Wyatt, J. C., Cook, D. J. & Guyatt, G. H. Users’ guides to the medical literature: XVIII. How to use an article evaluating the clinical impact of a computer-based clinical decision support system. JAMA 1999; 282: 67–74.
Richardson, W. S. & Detsky, A. S. Users’ guides to the medical literature: VII. How to use a clinical decision analysis. A. Are the results of the study valid? Evidence-Based Medicine Working Group. JAMA 1995; 273: 1292–1295.
Users’ guides to the medical literature: VII. How to use a clinical decision analysis. B. What are the results and will they help me in caring for my patients? Evidence-Based Medicine Working Group. JAMA 1995; 273: 1610–1613.
Richardson, W. S., Wilson, M. C., Guyatt, G. H., Cook, D. J. & Nishikawa, J. Users’ guides to the medical literature: XV. How to use an article about disease probability for differential diagnosis. Evidence-Based Medicine Working Group. JAMA 1999; 281: 1214–1219.
Richardson, W. S., Wilson, M. C., Williams, J. W. Jr, Moyer, V. A. & Naylor, C. D. Users’ guides to the medical literature: XXIV. How to use an article on the clinical manifestations of disease. Evidence-Based Medicine Working Group. JAMA 2000; 284: 869–875.
Wilson, M. C., Hayward, R. S., Tunis, S. R., Bass, E. B. & Guyatt, G. Users’ guides to the medical literature: VIII. How to use clinical practice guidelines. B. What are the recommendations and will they help you in caring for your patients? The Evidence-Based Medicine Working Group. JAMA 1995; 274: 1630–1632.

Web sites

The classic sites

Centre for Evidence-Based Medicine, Oxford University This is one of the oldest and best EBM sites, with many features including a toolbox, a Critically Appraised Topics (CAT) maker, a glossary, and links to other sites. There is also a CAT-bank of previously prepared critical analyses. The toolbox has an all-purpose four-fold calculator, which requires Macromedia Shockwave Player.
www.cebm.net

Bandolier This is an excellent site for getting quick information about a given topic. They do very brief summary reviews of the current literature. Sponsored by the Centre for Evidence-Based Medicine.
www.medicine.ox.ac.uk/bandolier

Evidence Based Emergency Medicine at the New York Academy of Medicine This is an excellent site with many features including a Journal Club Bank, critical review forms, a glossary, the Users’ Guides to the Medical Literature, and links to other sites.
www.ebem.org/cgi-bin/index.php

University of British Columbia Written by Martin Schechter, this is an excellent site for online calculations of NNT, likelihood ratios, and confidence intervals. Select links, then go to Calculators and select either the Bayesian or Clinical Significance Calculators. Data must be in dichotomous form.
www.spph.ubc.ca

Evidence Based Medicine Tool Kit, University of Alberta An excellent site for working through the Users’ Guides to the Medical Literature. This site has worksheets for all the guides and links to text versions of the original articles, made available by the Canadian Centres for Health Evidence.
www.ebm.med.ualberta.ca/

Best evidence compilations

Evidence Updates from the BMJ Sponsored by the BMJ Group and McMaster University’s Health Information Research Unit, this site is a great place to look for evaluations of recent studies and reviews. Very much up to date, with a searchable database and email alert service, this is a free service of the BMJ. Citations are all pre-rated for quality, clinical relevance, and interest by practicing physicians.
http://bmjupdates.mcmaster.ca/index.asp

Evidence-Based On-Call This is a wonderful site for Critically Appraised Topics (CATs). It tends to favor acute care medicine, and you may well find the answer to your query very quickly. A professional team writes and reviews all CATs. There are 39 topic areas with a total of hundreds of CATs.
http://www.eboncall.org

Agency for Health Research and Quality This US government agency is responsible for evaluating the evidence behind new and upcoming technologies and improvements in the practice of health care in the United States. They have an excellent list of topics with an evaluation of the strength of the evidence behind them.
http://www.ahrq.gov

BestBETs This is a free site that contains CATs, many of which are related to acute-care topics. There are also unfinished CATs and topics needing CATs, and the site developers hope that others will input their information into the site.
www.bestbets.org

Ganfyd (Get a note from your doctor) This is a medical wiki that catalogues medical knowledge. It can be edited by any registered medical practitioner and tries to be evidence-based, with many of the citations graded for quality of the evidence. Some of the evidence is better than others, with no consistency, but that is the fun of wikis.
www.ganfyd.org

Trip Answers This is a spin-off from the Trip Database search engine. Questions can be posed to the site and will be answered quickly using the best evidence available.
www.tripanswers.org/

The following websites contain excellent links and other resources for learning and practicing EBM

Evidence-Based Health Informatics Health Information Research Unit, McMaster University.
http://hiru.mcmaster.ca/hiru

Netting the Evidence.
www.shef.ac.uk/scharr/ir/netting

New York Academy of Medicine EBM Resource Center.
www.ebmny.org

Mount Sinai School of Medicine.
www.mssm.edu/medicine/general-medicine/ebm

Cochrane Collaboration abstracts The abstracts of the Cochrane reviews can all be accessed here, and no subscription is required to view them. The full Cochrane Library is free in many countries, but not in the United States; many libraries have subscriptions. The abstracts are good if you want only the bottom line, but you won’t get any of the details or be able to decide for yourself whether the review is valid or potentially biased.
www.update-software.com/abstracts/crgindex.htm

Golden Hour This is an Israeli site with many features, including links and evidence-based medical information.
www.goldenhour.co.il

NHS Centre for Reviews and Dissemination at the University of York This is the sponsoring site for the Database of Abstracts of Reviews of Effects (DARE).
www.york.ac.uk/inst/crd

Sites requiring subscription

InfoPOEMs Now called Essential Evidence Plus, this is the website for family-practice-related CATs (called POEMs, or Patient-Oriented Evidence that Matters). The site has a free trial period, but requires a subscription after that.
www.essentialevidenceplus.com

Cochrane Collaboration main site It contains a collection of the best and most uniformly performed systematic reviews.
www.update-software.com/cochrane

Clinical Evidence from the British Medical Journal (BMJ) This is mainly geared to internal medicine and has an accompanying book and CD-ROM.
www.clinicalevidence.com

Searching sites

TRIP Database contains a free set of critically appraised topics and evidence-based references.
www.tripdatabase.com

University of Virginia – Health Sciences Library Excellent access to evidence-based sites, which appear to be free to the public.
http://www.hsl.virginia.edu/internet/library/collections/ebm/index.cfm

Position statements

The AGREE Research Trust This is the home of the AGREE instrument for the evaluation of Clinical Practice Guidelines. The instrument and training manual are free downloads from the site.
http://www.agreetrust.org/

The CONSORT Group CONSORT stands for Consolidated Standards of Reporting Trials. Their site has the CONSORT statement for reporting RCTs with its associated checklist and flow diagram.
http://www.consort-statement.org/

Learning EBM

JAMAevidence A new feature of the JAMA site will have the Users’ Guides to the Medical Literature and the Rational Clinical Examination series available without cost. Look for it to open to the general public early in 2009.
http://jamaevidence.com/

Evidence-Based Knowledge Portal Vanderbilt University (Tennessee, United States) has a series of very nice and simple-to-use tutorials introducing EBM. There are also some virtual cases that can be used to learn and practice the principles of EBM. Passwords are required, but the site is open to the general public.
http://www.mc.vanderbilt.edu/biolib/ebmportal/login.html

Delfini Group This consulting group has put together some nice resources for use in critical appraisal of the medical literature. There are also some slide shows, which are excellent for EBM education. They are free.
http://www.delfini.org

Michigan State University Introduction to EBM Course This is an excellent interactive introduction to EBM. The seven modules, covering Information Mastery, Critical Appraisal, and Knowledge Transfer, can be done in about 14–20 hours total.
http://www.poems.msu.edu/InfoMastery/

Anesthetist.com – Interactive Receiver Operating Characteristic Curves This is an excellent way to learn about the use of diagnostic tests through interactive ROC curves. Other interesting material can be found on the Anesthetist.com website.
http://www.anaesthetist.com/mnm/stats/roc/Findex.htm

Contacts within EBM

CHAIN Contact, Help, Advice, and Information Networks are a free networking tool for health-care workers and others. Specific areas of interest relevant to EBM include knowledge transfer and lifelong learning. It is a way of connecting with others in the field and exchanging ideas. It is free to join. CHAIN Canada is found at http://www.epoc.uottawa.ca/CHAINCanada.
http://chain.ulcc.ac.uk/chain/index.html

Centre for Evidence-Based Child Health This is a main link in child health EBM in the United Kingdom. The website has an excellent list of links for physicians and non-physicians who are interested in child health.
http://www.ich.ucl.ac.uk/ich/academicunits/Centre for evidence based child health/Homepage

Teachers and Developers of EBM International A worldwide group of interested parties that meets every 2 years to discuss the teaching and practice of Evidence Based Medicine. Their activities are chronicled on this site.
http://www.ebhc.org/

Create your own EBM sites

EBM Page Generator This shareware, created by Yale and Dartmouth Universities (United States), allows anyone to set up an interactive site to link to any EBM sources. It is easy to adapt to most interactive educational web platforms.
http://www.ebmpyramid.org/home.php


Index

Locators in italic refer to figures and tables
Locators in bold refer to major entries
Locators for headings with subheadings refer to general aspects of that topic

α error 29
a-posteriori probability see post-test probability
a-priori probability see pre-test probability
AAR (absolute attributable risk) 147
abdication 165
absolute attributable risk (AAR) 147
absolute rate reduction (ARR) 114, 396
absolute risk 142, 143, 144, 143–4, 396
absolute risk increase (ARI) 147, 151
absolute risk reduction (ARR) 136, 147, 205
abstracts 28, 28, 28
acronyms, mnemonic 221, 221, 222–3, 223, 257, 257; see also mnemonics
accuracy 73, 75–6, 233, 272, 396
accuracy criteria 245–7
ACP (American College of Physicians) 12, 13
ACP Journal Club 13
Activities of Daily Living (ADL) Scale 346
ActivStats 108
actuarial-life tables 366
adjustment 231, 396
adjustment heuristic 230–1
ADL (Activities of Daily Living) Scale 346
Agency for Healthcare Research and Quality 318
AGREE criteria 322
AHRQ, US 357
AIDS 57, 313
Alcohol Use Disorders Identification Test (AUDIT) 72
algebra 4
algorithms, definition 396
all-or-none case series 58, 190, 396
alternative hypotheses 28, 110, 140, 396
American College of Physicians (ACP) 12, 13
American Society of Internal Medicine 13
analogy, reasoning by 196
anchoring 231, 396
anchoring heuristic 230–1
ancient history of medicine 2–3
ANCOVA 387
AND (Boolean operator) 36, 35–7
animal research studies 25–7
Annals of Internal Medicine 13
ANOVA (analysis of variance) 387
applicability 187–8, 306–7, 332, 396; see also strength of evidence/applicability
appropriate tests 241; see also diagnostic tests
Arabic numerals 4
area under the curve (AUC) 277–9, 397
ARI (absolute risk increase) 147, 151
arm, decision tree 397
ARR see absolute rate reduction; absolute risk reduction; attributable risk reduction
Ars magna (The Great Art) 5
art of medicine 16–18, 187, 225, 288, 291, 377
assessment 397
association (between cause and effect) 59
asterisk truncation function 50
attack rates 107
attributable risk 148, 147–8
attributable risk reduction (ARR) 354
attrition, patient 63, 88–9, 171, 361, 360–1
AUC (area under the curve) 277–9, 397
AUDIT (Alcohol Use Disorders Identification Test) 72
author bias 368
availability heuristic 230, 397
β errors 29, 135–6; see also Type II errors
background questions 14, 13–14
Bacon, Francis 109, 110–11
Bacon, Roger 109
balance sheets 394–5
Bandolier 12, 50
bar graphs 97, 98
baseline variables 169
Bayes’ theorem 251, 251, 262–3, 264–6, 294, 330
  definition 397
  formulas 391
  proof 392, 393
bell-shaped curve 104, 103–4, 198
beneficence, principle of 185
Berkson’s bias 85, 169
Bernard, Claude 6
Bernoulli, Daniel 6, 334
Bernoulli, Jacob 5
Best Bets 54
best case/worst case strategies 88, 172–3
bias 15, 27, 31, 90–2, 235; see also diagnostic tests, critical appraisal; error; precision; Type I errors
bibliography, medical literature 28, 31, 413–14; see also journals
biological plausibility 195
biological variability
  patient 237
  physician 235
biomedical research, recent history 7–8
biopsy 305
blinding 29, 65, 170
  and bias 85, 86, 361, 362
  clinical prediction rules development 328
  definition 397
  error reduction 242
  gold standard comparisons 304
  unobtrusive measurements 75
bloodletting 4, 6, 7
BMJ (British Medical Journal) 16, 18, 25, 31, 52, 54
body mass index (BMI) 73, 195
bone marrow transplantation 177
Bonferroni correction 123
Boolean operators 36, 35–7, 39, 50
Boots Pharmaceuticals 91
box-and-whisker plots 98, 100
Bradford Hill, Austin 8, 16, 188
Breslow-Day test 370
British Medical Journal (BMJ) 16, 18, 25, 31, 52, 54
Broad Street Pump study 6, 193
burden of proof 311
CA see critical appraisal
CAGE questionnaire for alcoholism 72, 74, 280, 280, 279–81
CAM (complementary and alternative medicines) 167
CAPRIE trial 362–3
Cardano, Girolamo 5, 333–4
cardiopulmonary resuscitation (CPR) 335
CART analysis 329
Case Records journal feature 229
case-control studies 22, 23, 61, 60–2
  definition 397
  measures of risk 142
  odds ratios/relative risk 146–7
  overview 385
  recall bias 83
  research design strength 189
case-mix bias 297
case reports 57, 57–8, 189–90, 397
case series 57, 57–8, 189–90, 397
case studies 7, 25
cases 60
CAT (critically appraised topic) 12, 399
causation/cause-and-effect 19–20, 59, 61
  bibliography 415–16
  clinical question 21–3
  cohort studies 62
  contributory cause 62
  learning objectives 19
  multiple 194
  proving 189
  quotation 19
  randomized clinical trials 168
  strength of 188, 188
  temporal 194
  types 20–1
CDSR (Cochrane Database of Systematic Reviews) 48, 54
censored data 364, 365
censoring bias 364, 365
CENTRAL (Cochrane Central Register of Controlled Trials) 49, 54, 369
central limit theorem 116
central tendency measures 30, 94, 98–100
Centre for Evidence-Based Medicine 12, 190, 369, 378–81
centripetal bias 360
CER (control event rate) 114
chakras 2, 2
chance nodes 336, 336
children, examining 240
Chinese belief systems 2, 2, 2–3
chi-squared analysis 363
chi-squared test 370
CI see confidence intervals
CINAHL 34
circulation of blood 4
classification and regression trees (CART) analysis 329
clinical consistency 217–20, 233–4
Clinical Decision Making journal feature 229
Clinical Evidence database 50, 52, 54
clinical examination 220, 221, 220–2
clinical guidelines see guidelines
clinical prediction rules 325, 327, 327–32; see also guidelines; Ottawa ankle rules
Clinical Queries search function 38, 38–9, 47, 50
clinical question 15–16, 21–3
clinical research studies 27–8
clinical reviews 27, 27, 30, 188; see also meta-analyses
clinical significance 124–5, 173, 397–8; see also significance
clinical trials 46, 65, 64–6, 128–9, 417–18; see also controlled clinical trials; randomized clinical trials
clinical uncertainty 112
clinically significant effect size 114
clipboard function 42
CME (continuing medical education) 197, 323
Cochrane, Archie 8
Cochrane Central Register of Controlled Trials 49, 54, 369
Cochrane Collaboration 8, 47, 48, 53, 54, 189
  definition 398
  GRADE scheme 369
  levels of evidence 191
  meta-analyses/systematic reviews 375–6
Cochrane databases 50, 55
Cochrane Library 13, 34, 49, 47–50
Cochrane Methodology Register 49
Code of Hammurabi 2
coding 213, 376
coffee 131, 142
cohort studies 22, 23, 62–4
  definition 398
  measures of risk 142
  odds ratios to estimate relative risk 146–7
  overview 384–5
  research design strength 189
cointervention 88, 398
colorectal screening 35
common sense 196–7, 367, 377
common themes, identifying 212
communication with patients 200, 377
  bibliography 205
  checking for understanding/agreement 207
  converting numbers to words 204, 206
  decision making, shared 200
  framing bias 205–6
  learning objectives 199
  natural frequencies 205, 206
  patient experience/expectations 200, 202
  patient scenario 199–200, 200, 201, 202, 203
  presenting information 203–4
  presenting recommendations 206–7
  providing evidence 203, 204–6
  quotation 199
  rapport, building 202–3
  steps toward 200
  too much information 204, 206
comparison groups see controls/control groups
comparisons, PICO/PICOT model 15–16, 21, 35
competing hypothesis heuristic 230, 398
complementary and alternative medicines (CAM) 167
Complete Metabolic Profile 106
compliance bias 315–16
compliance rates 173
composite endpoints/outcomes 72, 90, 122–3, 128, 171, 362–3
computer software 212
computerization 342
conclusions 28, 31, 173, 374–6
concomitance 161
conditional probability 106
confabulation 238, 240
confidence formula 132
confidence intervals (CI) 30, 116, 173, 398
  calculator 291
  formulas 389, 390
  hypothesis testing 116
  meta-analyses/systematic reviews 371, 376
  negative studies, evaluating 136–7
  relative risk 391
  results strength 192–3
  risk assessment 149, 154
  rules of thumb 124
  Type I errors 123–4
confidence levels see significance
conflicts of interest 177, 183, 182–4
confounding bias 87, 318
confounding variables 59, 76, 145, 156, 169, 170; see also multivariate analysis
  cohort studies 63
  prognosis 362
  research design strength 189
  specificity 194
consistency of evidence 193
consistency of evidence over time 195–6
Consolidated Standards of Reporting Trials Group (CONSORT) statement 177, 176–7
construct validity 73
contamination bias 88
contamination of results 76
content analysis 213
context bias 300
continuing medical education (CME) 197, 323
continuous data 69
continuous test results 251–2, 398
continuous variables 139, 138–40
contradictory answers 238
contributory cause 21, 22, 62
control event rate (CER) 114
controlled clinical trials 63, 398; see also randomized clinical trials
controls/control groups 58, 60, 65, 70, 114, 178, 398
cookbook medicine 18
cost-benefit analysis 323, 398; see also cost-effectiveness
cost-effectiveness 216, 398
  accurate cost measurement 353
  baseline variables 356–7
  clinical effectiveness, establishing 353
  clinical prediction rules development 331
  comparison of relevant alternatives 352–3
  costs per unit of health gained 354–6
  deciding if a test/treatment is worth it 350–2
  differing perspectives 352
  discounting 356
  ethics 357–8
  guidelines for assessing economic analysis 352–7
  learning objectives 350
  quotation 350
  screening 319
cost-minimization analysis 398
costs, medical tests 227, 228, 246, 248, 307
Cox proportional hazard model 160, 366, 387
Cox regression 158
criterion-based validity 73, 246, 398
critical appraisal 10, 12–13, 377, 384–6, 399
critical value 399
critically appraised topic (CAT) 12, 399
cross-reactivity, diagnostic tests 246
cross-sectional studies 22, 23, 57, 59–60, 142
CT scanning 245, 295
  intra-observer consistency 234
  screening 311–12
  technological improvement of tests 301–2
cumulative frequency polygons 98, 99
cumulative meta-analysis 374–6
cutoff points 258
DARE (Database of Abstracts of Reviews of Effects) 48–9, 54
data acquisition 29
data analysis 212–13, 370–1, 377
data collection, qualitative research 211–12
data display see graphing techniques
data dredging 122–3, 168
Database of Abstracts of Reviews of Effects (DARE) 48–9, 54
database studies see non-concurrent cohort studies
databases 34, 51–2
death
  certificates 71
  guidelines 322
  outcome criteria 361
  outcome measure 72
  probability 349
  rates 37
  from unrelated causes 364
decimal system 4
decision making 215–16, 219, 333–4
  clinical consistency/physician disagreement 217–20
  clinical examination 220, 221, 220–2
  decision trees 336, 335–6, 337, 338, 336–8
  definition 399
  differential diagnosis 216, 224, 225, 225, 223–5, 226, 227
  ethics 345–6, 347
  exhaustion strategy 229
  expected-values 334, 334–6
  expert vs. evidence-based 11–12
  guidelines/automation 218, 219
  heuristics 231
  hypothesis generation 221, 222–3
  hypothetico-deductive strategy 229
  learning objectives 215, 333
  Markov models 345, 345
  multiple branching strategy 229
  patient/physician shared 200, 200
  pattern recognition 228–9, 231
  physician 165–6
  premature closure 229, 231–2
  pre-test probability 225, 224–5, 226
  quotation 215, 333
  reality checks 341–3
  refining the options 226–8
  risk, attitudes to 348–9
  sensitivity analysis 339, 340, 339–40, 341, 342
  threshold approach 343–5
  uncertainty/incomplete information 218, 219
  values, patient 346–8
decision nodes 336, 336, 399, 404
decision theory 363
decision trees 216, 337, 338, 397
  methods of construction 336, 336–8
  thrombolytic therapy example 337, 338, 336–8
deduction 165
de-facto rationing 351
degenerative diseases 21, 26
degrees of freedom 399
denial, patient 239
dependent events 105
dependent tests 293
dependent variables 20, 68, 399
depreciation 356
depression 59, 69, 72, 81–2, 235, 247
derivation sets 62, 123, 158
descriptive research 399
descriptive statistics 94, 387, 389, 399
descriptive studies 56, 57–60, 189–90
detection bias 83
diagnosis, consistency 217; see also decision making
diagnosis, study type 22, 22, 23, 38, 385–6
diagnostic classification schemes 234–5
diagnostic review bias 299–300
diagnostic tests 303–4, 309; see also probability of disease; utility
  absence of definitive tests 299
  accuracy criteria 245–7
  applicability 306–7
  bibliography 418
  characteristics 216, 399
  comparison 276, 278, 280, 280, 277–81
  context bias 300
  costs/applicability 307
  definition 399
  diagnostic thinking 247
  filter bias 296–7
  formulas 391
  gold standard comparisons 304, 305–6
  ideal research study 302–3
  incorporation bias 298
  indeterminate/uninterpretable results 300
  learning objectives 244, 276, 295
  observer bias 299–300
  overview of studies of diagnostic tests 295–6
  patient outcome criteria 248; see also decision trees; values, patient
  post-hoc selection of positivity criteria 301
  post-test probability and patient management 308–9
  pretest probability 307–8
  publication bias 302
  quotations 244, 276, 295
  reproducibility 301
  results impact 306
  review/interpretation bias 299–300
  ROC curves 276–7, 278, 280
  sampling bias 305
  selection bias 296–8
  social outcome criteria 248
  spectrum/subgroup bias 297, 305
  study description/methods 304–5
  technical criteria 245–6
  technological improvement of tests 301–2
  therapeutic effectiveness criteria 247–8
  two by two tables 391
  uses/applications 244–5
  validity of results 304–6
  verification bias 298
diagnostic thinking 247
diagnostic-suspicion bias 361, 362
dichotomous data 69
dichotomous outcomes 399
dichotomous test results 251, 400
dichotomous variables 138, 139
diet 65
differential diagnosis 216, 224, 225, 225, 223–5, 226, 227, 282, 400
difficult patients 240
digitalis 4
disability 322, 337, 347, 349
discounting 356
discrete data 69
discriminant function analysis 158, 387
discussions 28, 30–1, 173; see also IMRAD style
disease-free interval 205
disease oriented evidence (DOE) 12, 13
disease-oriented outcomes 72
dispersion measures 30, 94, 100–1
distribution of values 101–2
doctor–patient relationship 3, 202–3<br />

doctrine of clinical equipoise 185<br />

DOE (disease oriented evidence) 12, 13<br />

dogmatists, Ancient Greek 3<br />

doing/doers 11<br />

dose-response gradients 194–5<br />

Double, Francois 7<br />

double-blinded studies 29, 76; see also<br />

blinding<br />

drop-out, patient (attrition) 63, 88–9, 171, 361,<br />

360–1<br />

DynaMed database 34, 51<br />

early detection 37<br />

early termination of clinical trials 128–9,<br />

165<br />

EBCP (evidence-based clinical practice) 10<br />

EBHC see evidence-based health care<br />

Economic Evaluation Database 49<br />

editorials 27<br />

educational prescription 14<br />

EER (experimental event rate) 114<br />

effect size 30, 113–14, 133, 134, 400<br />

meta-analyses/systematic reviews 371, 376<br />

results strength 192–3<br />

effectiveness 400<br />

Effectiveness and Efficiency (Cochrane) 8, 47<br />

effects 19<br />

efficacy 400<br />

eligibility requirements 29<br />

embarrassment, patient 239<br />

EMBASE records 54<br />

empowerment, patient 18<br />

empiricists, Ancient Greek 3<br />

energy balance beliefs 2, 2–3, 4<br />

Entrez dates 40<br />

environmental sources of error 240–1<br />

epidemiology 6, 107, 107–8, 193<br />

equation, evidence usefulness 53<br />

equivalence studies 140<br />

Ehrlich, Paul 6<br />

error 69–70; see also bias; precision; Type I-IV<br />

errors<br />

appropriate tests 241; see also diagnostic<br />

tests<br />

biological variations 235, 237<br />

chance 90<br />

clinical consistency, measuring 233–4<br />

confabulation 238<br />

contradictory answers 238<br />

denial, patient 239<br />

diagnostic classification schemes 234–5<br />

diagnostic tools malfunction 241, 242<br />

difficult patients 240<br />

disruptive examination environments 240,<br />

241<br />

embarrassment, patient 239<br />

environmental sources of error 240–1<br />

examinee sources 240<br />

examiner sources 234–7<br />

expectations, physician 235<br />

hypothesis testing 111–13<br />

inference and evidence 234, 242<br />

language barriers 239<br />

learning objectives 233<br />

lying, patient 240<br />

medical 217<br />

medication effects 237<br />

minimization strategies 241–3<br />

patient ignorance 238<br />

physician ignorance 236<br />

physician off-days 237<br />

problem-oriented medical record 242<br />

questioning patients 235–6<br />

quotation 233<br />

recall bias 237–8<br />

research conduct/misconduct 181<br />

risk maximization/minimization 236–7,<br />

239<br />

staff non-cooperation 240–1<br />

validity 74<br />

Essential Evidence Plus database 34, 51–2<br />

ethics 66; see also responsible conduct of<br />

research<br />

cost-effectiveness 351, 357–8<br />

decision making 345–6, 347<br />

randomized clinical trials 177–8<br />

etiology, study type 22, 22, 22–3, 38<br />

event rates 104, 114, 115, 389, 390, 400<br />

evidence, consistency 193<br />

evidence, strength of see strength of<br />

evidence/applicability<br />

evidence-based clinical practice (EBCP) 10<br />

Evidence Based Emergency Medicine Group<br />

12<br />

Evidence Based Health Care (EBHC)<br />

university course ix, 218–19<br />

evidence-based health care 10<br />

art of medicine 16–18<br />

background/foreground questions 14,<br />

13–14<br />

clinical question structure 15–16<br />

critical appraisal 10, 12–13<br />

definition 10<br />

expert vs. evidence-based decision making<br />

11–12<br />

importance of evidence 9–10, 188<br />

learning objectives 9



quotation 9<br />

steps in practicing 14–15<br />

Evidence-Based Interest Group, ACP 12<br />

Evidence Based On Call 54<br />

evidence carts 13<br />

evidence-and-outcomes-based approach<br />

324<br />

examiner error 234–7<br />

executive tests 244–5, 311; see also screening<br />

exclusion criteria 29, 168–9, 369–70, 376<br />

exercise 119<br />

exhaustion strategy 229, 409<br />

expectation bias 361, 362<br />

expectations, patient 200, 202, 218, 219; see<br />

also placebo effect<br />

expectations, physician 235<br />

expected-values decision making 334, 334–6,<br />

352, 355, 400<br />

experimental event rate (EER) 114<br />

experimental group 114, 400<br />

experimental settings 29, 329, 408<br />

expert based randomization 167<br />

expert bias 27<br />

expert opinion 54<br />

expert reviews 25<br />

expert vs. evidence-based decision making<br />

11–12<br />

explanatory research 400<br />

explicit reviews 61<br />

exploratory studies 59<br />

exposure 400<br />

exposure suspicion bias 84<br />

external validity 74, 89–90, 168–9, 396; see also<br />

applicability<br />

fabrication of results 181–2; see also<br />

responsible conduct of research<br />

face validity 74<br />

fail-safe N method 374<br />

false alarm rates (FARs) 262, 289, 401<br />

false labeling 313, 319<br />

false negative 401<br />

false negative rates (FNR) 257, 401<br />

false negative test results 112, 130, 246, 253,<br />

253<br />

false positive 401<br />

false positive rates (FPR) 205, 256, 401<br />

false positive test results 112, 120, 246, 253,<br />

253<br />

false reassurance rates (FRRs) 262, 265, 401<br />

falsification of results 181–2; see also<br />

responsible conduct of research<br />

FARs (false alarm rates) 262, 289, 401<br />

fatal flaws 81, 173<br />

Fibonacci 4<br />

field searching 47<br />

file-drawer effect 374<br />

filter bias 296–7, 360<br />

filtering 85<br />

filters 401<br />

filters, literature search 38–9, 46–7<br />

financial incentives 183, 302<br />

Fisher, Sir Ronald 8, 111, 263<br />

five S schema 53<br />

fixed-effects model 371<br />

Florence Nightingale 6<br />

FNR (false negative rates) 257, 401<br />

focus-group interviews 210<br />

follow-up, patient 171, 247, 330, 361, 360–1<br />

foreground questions 14, 13–14<br />

Forest plot 372<br />

FPR (false positive rates) 205, 256, 401<br />

framing bias 205–6<br />

framing effects 348–9, 401<br />

fraud, research 180; see also responsible<br />

conduct of research<br />

frequency polygons 98, 99<br />

frequency tables 363<br />

Freud, Sigmund 6<br />

FRRs (false reassurance rates) 262, 265, 401<br />

functional status 401<br />

funnel plot 374, 373–4<br />

Galen 3<br />

gambling odds 263–4<br />

Ganfyd 34<br />

Gaussian distribution 104, 103–4, 106, 401<br />

gold standard comparisons 304<br />

strength of evidence/applicability 198<br />

test results 252, 252<br />

gender 150, 195<br />

generalizability 396; see also applicability<br />

generalizability of population 101<br />

germ theory 6<br />

gold-standard tests 59, 76, 234, 245<br />

comparing with other tests 252–3, 305–6<br />

comparisons 304<br />

definition 401<br />

diagnostic tests, comparison 279<br />

diagnostic tests, critical appraisal 309<br />

examples 246–7<br />

ideal research study 303<br />

interpretation bias 299–300<br />

post-test probability/patient management<br />

308<br />

pulmonary angiograms 295<br />

strep throat 290<br />

tarnished 299



threshold values 288<br />

verification bias 298<br />

Google/Google Scholar 54<br />

GRADE (Grading of Recommendations<br />

Assessment, Development and<br />

Evaluation) scheme 191, 192, 325, 369,<br />

382<br />

grades of evidence 190–1, 378–81; see also<br />

levels of evidence<br />

grades of recommendation 382<br />

graphing techniques<br />

bar graphs 97, 98<br />

box-and-whisker plots 98, 100<br />

deceptive 94–5<br />

frequency polygons 98, 99<br />

histograms 98, 98<br />

meta-analyses/systematic reviews 372–4<br />

presenting information to patients 203<br />

stem-and-leaf plots 97, 96–7<br />

Graunt, John 5<br />

grounded theory 213<br />

Guide to Clinical Preventive Services (USPHS)<br />

317<br />

guidelines 218, 219, 322, 397; see also clinical<br />

prediction rules<br />

critical appraisal 324–5<br />

development 322–4<br />

learning objectives 320<br />

nature of/role in medicine 320–2<br />

quotation 320<br />

handwriting, legible 243<br />

harm vs. benefit 401<br />

Hawthorne effect 75, 86<br />

hazard ratios 363<br />

health literacy 201<br />

Health Technology Assessment Database 49<br />

heterogeneity 370, 371–2, 373, 376<br />

heuristics 231, 401<br />

hierarchies of research studies 190–1<br />

Hippocratic principle 3<br />

histograms 98, 98<br />

history and physical (H & P) 220, 221, 220–2,<br />

241, 268; see also medical history-taking<br />

history function 41, 41<br />

history of medicine<br />

ancient history 2–3<br />

learning objectives 1<br />

modern biomedical research 7–8<br />

quotation 1<br />

recent history 6–7<br />

Renaissance 3–4<br />

statistics 4–6<br />

homeopathy 4<br />

homogeneity 401<br />

hospital bed management 321<br />

human participants in research 184–5<br />

humours 2, 2<br />

hypothesis 167–8, 402; see also research<br />

question<br />

hypothesis generation 221, 222–3<br />

hypothesis testing 110, 109–10, 112; see also<br />

statistics<br />

confidence intervals 116<br />

effect size 113–14<br />

error 111–13<br />

event rates 104, 114<br />

hypothesis, nature of 109, 110–11<br />

learning objectives 109<br />

placebo effect 118–19<br />

quotation 109<br />

signal-to-noise ratio 115–16<br />

statistical tests 116–18, 387<br />

hypothetico-deductive strategy 229, 231, 402<br />

ignorance, patient/physician 236, 238<br />

iLR (interval likelihood ratios) 272, 275, 272–5,<br />

402<br />

IM see information mastery<br />

implicit reviews 61<br />

IMRAD style (introduction, methods, results,<br />

discussion) 27<br />

inception cohort 361, 359–61<br />

incidence 59, 62, 107, 108, 402<br />

incidence bias 59, 360<br />

inclusion criteria 29, 168–9, 369–70, 376<br />

incorporation bias 298, 402<br />

incremental gain 286, 287, 285–7, 402<br />

learning objectives 282<br />

likelihood ratios 283<br />

multiple tests 292, 291–3<br />

quotation 282<br />

real-life applications 293–4<br />

sensitivity/specificity 282, 283, 284, 283–5<br />

threshold values 289, 287–91, 395<br />

two by two tables 283, 284, 283–5<br />

independent events 105<br />

independent tests 292<br />

independent variables 20, 67, 160, 161, 402<br />

in-depth interviews 210<br />

indeterminate/uninterpretable results 300<br />

Indian belief systems 2, 2, 3<br />

indication creep 307<br />

induction 165<br />

inductive reasoning 7<br />

industrial revolution 3–4<br />

inference and evidence 234



inferential statistics 94, 387, 402<br />

inflation 356<br />

InfoPOEMS 50; see also Essential Evidence<br />

Plus database<br />

information, patient 18; see also medical<br />

records<br />

information mastery 10, 12, 53<br />

information presentation to patients 206; see<br />

also communication with patients<br />

information retrieval strategies 35–7<br />

informed consent 177, 185<br />

initial exploratory studies 59<br />

Institutional Review Boards (IRBs) 66, 177,<br />

185<br />

instrumental rationality 334, 402<br />

instruments/instrument selection 29, 70–2,<br />

241, 242, 402; see also<br />

measurements/instruments<br />

insurance, medical 351, 352<br />

integrating 213<br />

integrity, scientific see responsible conduct of<br />

research<br />

intention-to-treat 89, 172, 402<br />

interference, test results 246<br />

inter-observer consistency 76, 78, 234, 246,<br />

328, 374<br />

inter-observer reliability 76–7, 370, 402<br />

internal validity 74; see also validity<br />

interpretation bias 299–300<br />

interquartile range 101<br />

interval data 68, 363, 387<br />

interval likelihood ratios (iLR) 272, 275, 272–5,<br />

402<br />

intervention criteria 361<br />

intervention studies 214; see also clinical trials<br />

interventions, PICO/PICOT model 15, 35<br />

interviews 210<br />

intra-observer consistency 76, 78, 234, 246<br />

intra-observer reliability 402<br />

intra-rater agreement 76<br />

intrinsic characteristics, test 399<br />

introductions, medical literature 28, 28; see<br />

also IMRAD style<br />

intuition 377<br />

JCB (journal club bank) 12<br />

Jones criteria 247<br />

journal club bank (JCB) 12<br />

Journal of the American Medical Association<br />

(JAMA) 25, 31, 91, 151<br />

journals 411–13<br />

jumping to conclusions 234–5; see also<br />

premature closure<br />

justice, principle of 185, 403<br />

Kaplan-Meier curve 365, 366<br />

Kaposi’s sarcoma 57<br />

kappa statistic 77, 78, 212, 234, 403<br />

clinical prediction rules development<br />

328<br />

meta-analyses/systematic reviews 376<br />

precision/validity 76–7, 78, 78–9<br />

Kelvin, Lord 19<br />

key words 35<br />

knowledge transfer model 197–8<br />

Koch, Robert 6<br />

Koch’s postulates 20–1, 22<br />

KT (knowledge translation) 10<br />

L’Abbé plots 174, 373, 373<br />

lack of proportionality 94, 96<br />

The Lancet 25, 31<br />

language barriers 239<br />

Laplace, Pierre 7, 263<br />

law of large numbers 5<br />

laws, professional conduct 184; see also<br />

malpractice suits; responsible conduct of<br />

research<br />

lead-time bias 315, 314–15, 360<br />

legal cases, professional misconduct 180; see<br />

also malpractice suits; responsible<br />

conduct of research<br />

length-time bias 315, 316, 360<br />

level of significance 120, 134, 133–5, 376, 403<br />

levels of evidence 188–90, 378–81; see also<br />

grades of evidence<br />

Liber abaci (Book of the Abacus) 4<br />

Liber de ludo aleae (Book on Games of<br />

Chance) 5<br />

librarians, health science 55<br />

life years 410<br />

likelihood ratios 216, 251, 283, 403; see also<br />

interval likelihood ratios<br />

positive/negative 254, 255, 255, 254–5, 403<br />

pretest probability 265, 264–6<br />

Likert Scales 68, 71–2<br />

limits function 40, 39–40<br />

Lind, James 7, 164–5<br />

linearity 160, 160<br />

linear-rating scales 347, 347, 403<br />

Lister, Joseph 6<br />

literature see medical literature<br />

literature searching see medical literature<br />

searching<br />

Lloyd, Edward 5<br />

local variations in healthcare provision 9<br />

logistic analysis 363–4<br />

log-rank test 366<br />

longitudinal studies 56–7, 60–4



Louis, Pierre 6, 7<br />

lying, patient 238, 240<br />

MAGIC study 368<br />

malpractice suits 218, 219, 308; see also<br />

responsible conduct of research<br />

mammography 77, 76–7, 78, 244, 245<br />

framing bias 206<br />

intra-observer consistency 234<br />

screening 316<br />

managed care organizations (MCOs) 321, 351<br />

Mantel-Cox curve 366<br />

Mantel-Haenszel chi-squared test 370<br />

Markov models 345, 345, 403<br />

Massachusetts General Hospital 229<br />

MAST (Michigan Alcohol Screening Test) 72,<br />

74<br />

matching 403<br />

mathematics and medicine ix<br />

McMaster University, Canada 11<br />

MCOs (managed care organizations) 321, 351<br />

mean 94, 99–100, 403; see also regression to<br />

the mean; standard error of the mean<br />

measurements/instruments 29<br />

attributes 72–3<br />

bibliography 416<br />

definition 403<br />

error 69–70<br />

evaluating 171<br />

improving precision/accuracy 75–6<br />

instruments/instrument selection 70–2<br />

inter/intra-rater reliability tests 76–7<br />

kappa statistic 77, 76–7, 78, 78, 78–9<br />

learning objectives 67<br />

quotation 67<br />

types of data/variables 67–9<br />

validity 73, 73–5<br />

measures of central tendency 30, 94, 98–100<br />

measures of dispersion 30, 94, 100–1<br />

median 94, 100<br />

Medicaid 351<br />

medical history-taking 220, 221, 220–2, 241;<br />

see also history and physical (H & P)<br />

medical literature 25, 28, 31–2<br />

abstracts 28, 28, 28<br />

basic science research 25–7<br />

bibliography/references 28, 31, 414–16, 421;<br />

see also journals<br />

clinical research studies 27–8<br />

clinical reviews 27<br />

conclusions 28, 31<br />

discussion 28, 30–1<br />

editorials 27<br />

explosion 367<br />

introductions 28, 28<br />

journals 24–5<br />

learning objectives 24<br />

meta-analyses/clinical reviews 27<br />

methods 28, 29<br />

quotation 24<br />

results 28, 29–30<br />

searching see medical literature searching<br />

medical literature searching 15, 33–4, 54–5;<br />

see also MEDLINE; PUBMED website<br />

clipboard function 42<br />

Cochrane Library 49, 47–50<br />

databases 34<br />

field searching 47<br />

history function 39–40, 41<br />

information retrieval strategies 35–7<br />

learning objectives 33<br />

limits function 40, 39–40<br />

MeSH search terms 44, 44, 45, 46, 43–6<br />

methodological terms/filters 46–7<br />

point of care databases 53, 51–4<br />

printing/saving 42<br />

quotation 33<br />

responsible 180–1<br />

synonyms/wildcard symbol 37<br />

TRIP database 50–1<br />

medical records 242, 406<br />

medication 154, 167, 237<br />

medicine<br />

art/science of 16–18, 187, 225, 288, 291, 377<br />

and mathematics ix<br />

MEDLINE 34, 37–8, 54; see also PUBMED<br />

website<br />

clipboard function 42<br />

field searching 47<br />

general searching 42, 43<br />

history function 41, 41<br />

limits function 40, 39–40<br />

MeSH search terms 44, 44, 45, 46, 43–6<br />

methodological terms/filters 46–7<br />

saving/printing functions 42<br />

member checking 213<br />

membership bias 84–5<br />

memoing 213<br />

MeSH database 38, 39<br />

MeSH search terms 44, 44, 45, 46, 43–6, 50<br />

meta-analyses 27, 372, 403; see also clinical<br />

reviews; systematic reviews<br />

additional guidelines 376–7<br />

guidelines for evaluating 368–76<br />

inclusion criteria 369–70, 376<br />

learning objectives 367<br />

quotation 367<br />

rationale 367–8



methods 28, 29, 171, 304–5, 368–9, 370<br />

microscope, invention 3<br />

Middle Ages 3<br />

‘Mikey liked it’ phenomenon 58<br />

milk pasteurization 194<br />

mining, data 122–3<br />

misclassification bias 64, 86–7<br />

misconduct, scientific see ethics; responsible<br />

conduct of research<br />

mnemonics 224, 225, 257, 257; see also<br />

acronyms, mnemonic<br />

mode 94, 100<br />

Modified Rankin Scale 337<br />

Moivre, Abraham de 103<br />

monitoring therapy 245<br />

mortality rates 107, 108, 202<br />

cardiovascular disease 168<br />

colon cancer 36, 37<br />

measles 142<br />

pneumonia 74<br />

Morton, William 6<br />

multiple branching strategy 229, 403<br />

multiple causation 194<br />

multiple linear regression analysis 158, 387<br />

multiple logistic regression analysis 158, 387<br />

multiple outcomes 122–3, 362–3<br />

multiple regression 158, 364, 387; see also<br />

CART analysis<br />

multiple tests use 292, 291–3<br />

multiplication tables 4<br />

multivariate analysis 87, 161<br />

applications 157–9<br />

concomitance 161<br />

independent variables – coding 161<br />

interactions between independent<br />

variables 160<br />

learning objectives 156<br />

linearity 160, 160<br />

nature of 156–7<br />

outliers to the mean 161<br />

overfitting 159<br />

prognosis 362<br />

propensity scores 162–3<br />

quotation 156<br />

research design strength 189<br />

risk determination 157, 158, 159<br />

underfitting 160<br />

Yule–Simpson paradox 162, 163<br />

mutually exclusive events 105–6<br />

n-of-1 trial 175, 189<br />

National Institutes of Health 186, 369<br />

National Research Act (1974) 184<br />

Native American belief systems 2, 2<br />

natural frequencies 205, 206<br />

Nazi atrocities 179<br />

NCBI accounts 42, 50<br />

negative in health (NIH) test result 256, 257<br />

negative likelihood ratios 254, 255, 255, 254–5,<br />

403<br />

negative predictive values (NPVs) 262, 265,<br />

404<br />

negative studies, evaluating 130–1, 135–6; see<br />

also Type II errors<br />

confidence intervals 136–7<br />

continuous variables 139, 138–40<br />

dichotomous variables 138, 139<br />

nomograms 138, 139, 137–40<br />

New England Journal of Medicine 25, 31, 150,<br />

229<br />

new tests 301–2<br />

New York Academy of Medicine 12<br />

NHS (National Health Service) 13, 48, 49<br />

NICE, UK 357<br />

Nightingale, Florence 6<br />

NIH (negative in health) test result 256, 257<br />

NLM (National Library of Medicine), US 47,<br />

369<br />

NNEH (number needed to expose to harm)<br />

127<br />

NNF (number needed to follow) 404<br />

NNSB (number needed to screen to benefit)<br />

127, 317, 316–17, 318<br />

NNSH (number needed to screen to harm)<br />

319<br />

NNTB (number needed to treat to benefit)<br />

125, 125, 205, 346, 354, 404<br />

NNTH (number needed to treat to harm)<br />

125–7, 148, 148, 151, 404<br />

nodes 404; see also decision nodes;<br />

probability nodes<br />

noise 115–16, 276, 277–81; see also ROC curves<br />

nominal data 68, 363, 387<br />

nomograms 138, 139, 137–40, 267, 267, 268<br />

non peer-reviewed journals 25<br />

non-concurrent cohort studies 62, 64, 83<br />

non-inferiority trial 140, 404<br />

non-respondent bias 84<br />

non-steroidal anti-inflammatory drugs<br />

(NSAIDs) 26, 183<br />

normal distribution 103, 104, 103–4, 106,<br />

404<br />

gold standard comparisons 304<br />

strength of evidence/applicability 198<br />

test results 252, 253<br />

NOT (Boolean operator) 36, 35–7<br />

NPVs (negative predictive values) 262, 265,<br />

404



NSAIDs (non-steroidal anti-inflammatory<br />

drugs) 26, 183<br />

null hypothesis 28, 110–11, 140, 404<br />

null point 136<br />

numbering systems 4<br />

objective information 404<br />

objectivity, research 185, 186<br />

observational studies 65, 189, 210, 404<br />

observer bias 85–6, 299–300, 328<br />

odds 264, 263–4, 404<br />

odds ratios 142, 146, 147, 145–7, 150, 363<br />

definition 404<br />

formulas 390, 389–90<br />

meta-analyses/systematic reviews 371<br />

results strength 192–3<br />

two by two tables 390<br />

off-days, physician 237<br />

OLDCARTS acronym 221, 221, 222–3<br />

one-tailed tests 121, 121–2, 132, 140, 404<br />

operator dependence 246, 404<br />

opportunity costs 351, 353<br />

OPQRSTAAAA acronym 221, 222–3<br />

OR (Boolean operator) 36, 35–7<br />

ordinal data 68, 363, 387<br />

outcome criteria 248, 361–3; see also decision<br />

trees; values, patient<br />

outcome measurement bias 85–7<br />

outcome misclassification 87<br />

outcomes 64, 328, 404<br />

PICO/PICOT model 16, 21, 35<br />

outcomes study 405<br />

outliers to the mean 99, 376<br />

overfitting 159<br />

Oxford Database of Perinatal Trials 48<br />

Oxford University, UK 11, 12, 190<br />

P value 30, 405<br />

P4P (Pay for Performance) 321<br />

Pacioli, Luca 4<br />

pain<br />

confidence intervals 136<br />

contradictory answers 238<br />

guidelines 322<br />

measurement 71–2<br />

placebo effect 119<br />

relief 240<br />

scales 70<br />

Paracelsus 3<br />

particularizability 396<br />

Pascal, Blaise 5, 333–4<br />

passive smoking 127, 183<br />

Pasteur, Louis 6, 56<br />

Pathman’s pipeline analogy 197, 197–8<br />

pathognomonic 405<br />

pathological specimens 246<br />

patient attrition 63, 88–9, 171, 361, 360–1<br />

patient inception cohort 361, 359–61<br />

patient satisfaction 405<br />

patient values see values, patient<br />

patient-oriented evidence that matters<br />

(POEMS) 12, 13; see also Essential<br />

Evidence Plus database<br />

patient-oriented outcomes 72<br />

patients, PICO/PICOT model 15, 22<br />

pattern recognition 228–9, 231, 405<br />

Pay for Performance (P4P) 321<br />

Pay-Per-View 50<br />

peer pressure 218, 219<br />

peer review 186<br />

peer-review guidelines 324<br />

peer-reviewed journals 24<br />

percent of a percent 104<br />

percentages 104–5<br />

percentages of small numbers 105<br />

percentiles 94, 100, 405<br />

performance criteria 321<br />

persistent vegetative states 335<br />

perspectives, patient 201; see also values,<br />

patient<br />

phototherapy 368<br />

physician behavior, changing 323–4<br />

physician ignorance 236<br />

PICO/PICOT model 14, 15–16, 35, 295–6<br />

PID (positive in disease) test result 256,<br />

257<br />

Pisano, Leonardo 4<br />

placebo 405<br />

placebo controls see controls/control groups<br />

placebo drugs 113<br />

placebo effect 118–19, 167, 219<br />

plagiarism 181–2<br />

plans 405<br />

podcasts 55<br />

POEMS (patient-oriented evidence that<br />

matters) 12, 13; see also Essential<br />

Evidence Plus database<br />

PogoFrog 54<br />

point estimate 405<br />

point of care databases 53, 51–4<br />

point of indifference 405<br />

points, decision tree 405<br />

POMR (problem-oriented medical record)<br />

242, 406<br />

popularity bias 360<br />

population, patient 22, 35, 101, 405<br />

Port Royal text on logic 334<br />

positive in disease (PID) test result 256, 257



positive likelihood ratios 254, 255, 255, 254–5,<br />

403<br />

positive predictive values (PPVs) 262, 265,<br />

405<br />

possession by demons 2<br />

posterior probability see post-test probability<br />

post-hoc subgroup analysis 128, 171<br />

post-test odds 406<br />

post-test probability 225, 251, 262, 267–8, 269,<br />

286, 308–9, 406<br />

potential bias 31<br />

power, statistical 29, 30, 112, 131, 406<br />

determining 131–5<br />

effect size 133, 134<br />

level of significance 134, 133–5<br />

sample size 133, 132–3<br />

standard deviation 135, 135<br />

PPVs (positive predictive values) 262, 265, 405<br />

practice guidelines see guidelines<br />

precision 30, 72–3, 233, 406<br />

diagnostic tests 245<br />

improving 75–6, 76<br />

kappa statistic 78<br />

prediction rules see clinical prediction rules<br />

predictive validity 74<br />

predictive values 262, 270, 268–72, 406<br />

predictor variables 328, 406<br />

prehistory of medicine 2–3<br />

premature closure 229, 231–2<br />

pretest odds 406<br />

pretest probability 224, 225, 224–5, 226, 250,<br />

406<br />

diagnostic tests, critical appraisal 307–8<br />

incremental gain 285–7<br />

and likelihood ratios 265, 264–6<br />

multiple tests 293<br />

prevalence 59, 62, 107, 108, 311, 406<br />

prevalence bias 59, 360<br />

prevention studies 23<br />

primary analysis 367<br />

principle of beneficence 185<br />

principle of justice 185<br />

principle of respect for persons 185<br />

printing/saving functions 42<br />

prior probability see pre-test probability<br />

probability 105–7, 264, 263–4, 334, 334–6<br />

probability of disease, diagnostic tests 261–2<br />

Bayes’ theorem 262–3, 264–6, 392<br />

interval likelihood ratios 272, 275, 272–5<br />

learning objectives 261<br />

likelihood ratios/pretest probability 265,<br />

264–6<br />

nomograms 267, 267, 268<br />

odds/probability 264, 263–4<br />

post-test probability calculation 225, 267–8,<br />

269<br />

predictive values 262<br />

predictive values calculation 270, 268–72<br />

quotation 261<br />

probability nodes 336, 336, 404, 406<br />

probability of survival, historical comparisons<br />

6<br />

probability theory 5–6, 389<br />

procedures, experimental 29<br />

professional misconduct see ethics;<br />

responsible conduct of research<br />

prognosis 406<br />

frequency tables 363<br />

inception cohort 361, 359–61<br />

intervention criteria 361<br />

learning objectives 359<br />

logistic analysis 363–4<br />

outcome criteria 361–3<br />

quotation 359<br />

study type 22, 22, 23, 38<br />

survival analysis 364, 365, 409<br />

survival curves 365, 364–6<br />

prognostics 245<br />

propensity scores 162–3<br />

proportional hazards regression analysis 158,<br />

160, 366, 387<br />

proportionality 94, 96<br />

prospective studies 57, 406<br />

PsycINFO 34<br />

publication bias 90, 302, 369, 374, 406<br />

publish or perish 177<br />

PUBMED website 37–8, 39, 38–9, 53, 55<br />

Clinical Queries search function 50, 51<br />

clipboard function 42<br />

field searching 47<br />

general searching 42, 43<br />

history function 41, 41<br />

limits function 40, 39–40<br />

MeSH search terms 44, 44, 45, 46, 43–6<br />

methodological terms/filters 46–7<br />

saving/printing functions 42<br />

purposive sampling 211<br />

Q statistic 370<br />

QALYs (quality-adjusted life years) 348, 351,<br />

355–6, 357, 407<br />

qi 2, 2, 2–3<br />

qualitative research<br />

applications 209<br />

applying results 214<br />

data analysis 212–13<br />

data collection 211–12<br />

learning objectives 208



methods 209<br />

quotation 208<br />

sampling 211<br />

study objectives 210–11<br />

study types 209, 210<br />

qualitative reviews 367, 376<br />

quality-of-life 202, 340, 346, 407<br />

quantitative systematic review see<br />

meta-analyses; systematic reviews<br />

quartiles 94, 100<br />

question, research 369, 407; see also<br />

hypothesis<br />

questioning patients 235–6<br />

questionnaires 70<br />

race 150<br />

random error 69–70<br />

random selection/assignment 407<br />

random-effects model 371–2<br />

randomization 29, 65, 169–70, 407<br />

randomized clinical trials (RCTs) 8, 23, 47,<br />

164–5, 166–7, 173–4<br />

blinding 170<br />

CONSORT statement 177, 176–7<br />

definition 407<br />

discussions/conclusions 173<br />

early termination 165<br />

ethics 177–8<br />

evaluating 166, 166<br />

hypothesis 167–8<br />

inclusion/exclusion criteria 168–9<br />

learning objectives 164<br />

measures of risk 142<br />

methods, description 171<br />

methodological terms/filters 46<br />

n-of-1 trial 175<br />

overview 384<br />

physician decision making 165–6, 176<br />

quotation 164<br />

randomization 169–70<br />

research design strength 189<br />

results 176<br />

results, analysis 172–3<br />

user’s guide 175–6<br />

validity 175–6<br />

range 94, 100<br />

ratio data 68–9, 387<br />

recall bias 61, 62, 83–4, 237–8<br />

receiver operating characteristic (ROC) curves<br />

216, 276–7, 278, 280, 306, 407<br />

recursive partitioning 329<br />

reference standards see gold-standard tests<br />

references, medical literature 28, 31<br />

referral bias 61, 82–3, 150, 329, 360, 407<br />

registry of clinical trials 178<br />

regression to the mean 118<br />

relative rate reduction (RRR) 114<br />

relative risk (RR) 145, 144–5, 147, 146–7, 150,<br />

363<br />

communication with patients 205<br />

confidence intervals (CI) 391<br />

definition 407<br />

formulas 390, 389–90<br />

meta-analyses/systematic reviews 371<br />

results strength 192–3<br />

two by two tables 390<br />

relevance 396<br />

reliability 73, 245, 407<br />

removing patients from study 172<br />

Renaissance 3–4<br />

repeat observations 241<br />

replication 54<br />

replicators 11<br />

reporting bias 61, 83–4, 150<br />

representativeness heuristic 230, 407<br />

reproducibility 301<br />

research conduct/misconduct see responsible<br />

conduct of research<br />

research design strength 188–90<br />

Research and Development Programme, NHS<br />

48<br />

research question 369, 407; see also hypothesis<br />

respect for persons, principle of 185<br />

responsible conduct of research; see also<br />

ethics<br />

conflicts of interest 183, 182–4<br />

definitions of misconduct 181–2<br />

human participants in research 184–5<br />

learning objectives 179<br />

managing conflicts of interest 184<br />

motives for misconduct 182, 183<br />

objectivity 185, 186<br />

peer-review 186<br />

quotation 179<br />

research conduct/misconduct 179–82<br />

results 28, 29–30, 176; see also IMRAD style<br />

applicability 187–8, 332<br />

case–control studies 385<br />

clinical prediction rules 331–2<br />

cohort studies 384–5<br />

diagnosis – study type 385–6<br />

impact 306<br />

meta-analyses/systematic reviews 370–4<br />

randomized clinical trials (RCTs) 172–3, 384<br />

risk assessment 151<br />

specificity 193–4<br />

strength 191–3



retrospective bias 368<br />

retrospective studies 57, 83, 407; see also<br />

case-control studies; non-concurrent<br />

cohort studies<br />

review bias 299–300<br />

risk 407, 417<br />

risk assessment<br />

absolute risk 143, 144, 143–4<br />

attributable risk 148, 147–8<br />

confidence intervals 149, 154<br />

learning objectives 141<br />

measures of risk 143, 142–3<br />

nature of risk 153–5<br />

number needed to treat to harm (NNTH)<br />

148, 148<br />

odds ratios 142, 146, 147, 145–7, 150<br />

perspectives on risk 148–9<br />

quotation 141<br />

relative risk 145, 144–5, 147, 146–7, 150<br />

reporting bias 150<br />

user’s guide 151<br />

zero numerator 153, 152–3, 154<br />

risk, attitudes to 334, 348, 348–9<br />

risk determination 157, 158, 159<br />

risk factors 62, 63, 407<br />

decision making 333–4; see also decision<br />

making<br />

estrogen therapy 83<br />

multiple 156; see also multivariate<br />

analysis<br />

and study design 64<br />

risk maximization/minimization 236–7<br />

robust results 173<br />

ROC (receiver operating characteristics)<br />

curves 216, 276–7, 278, 280, 306, 407<br />

Roentgen, Wilhelm 6<br />

Roman numerals 4<br />

RRR (relative rate reduction) 114<br />

RR see relative risk<br />

RSS feeds 55<br />

rule in/out 408<br />

rules see clinical prediction rules<br />

Rush, Benjamin 4<br />

sample selection/assignment 29<br />

sample size 30, 133, 132–3, 137<br />

samples 29, 94, 101–2, 408<br />

sampling 211<br />

sampling bias 61, 81–2, 305, 408<br />

sampling theory see statistical sampling<br />

sampling to redundancy 211<br />

sanitary engineering 6, 20<br />

saving/printing functions 42<br />

schizophrenia 321<br />

School of Health, University of British<br />

Columbia 291<br />

science of medicine 16–18; see also art of<br />

medicine<br />

scientific misconduct see responsible conduct<br />

of research<br />

Scopus 34, 54<br />

screening 311, 310–12, 408<br />

compliance bias 315–16<br />

criteria for screening 312, 312–14<br />

critical appraisal of studies 318–19<br />

effectiveness 318<br />

executive tests 244–5, 311<br />

lead-time bias 315, 314–15<br />

learning objectives 310<br />

length-time bias 315, 316<br />

medical literature searching 36, 37<br />

pitfalls 314–16<br />

quotation 310<br />

spectrum/subgroup bias 297<br />

scurvy 7, 164–5<br />

secondary analysis 367<br />

second-guessing 218<br />

sedation 240<br />

selection bias 81–2, 86, 296–8, 328–9<br />

self-assessment learning exercises (SALES) 13<br />

SEM (standard error of the mean) 94, 101, 115,<br />

115–16, 389<br />

Semmelweis, Ignatz 6<br />

senses, biological variations 235<br />

sensitive results 172<br />

sensitivity 38, 258, 408<br />

analysis 339, 340, 339–40, 341, 342, 374, 376,<br />

408<br />

cost-effectiveness 356<br />

diagnostic tests 256<br />

differential diagnosis 282<br />

guidelines 325<br />

incremental gain 284, 283–5<br />

mnemonics 257, 257<br />

physician senses 235<br />

post-test probability and patient<br />

management 309<br />

screening 311, 313<br />

spectrum/subgroup bias 297<br />

settings, experimental 29, 329, 408<br />

side effects, medication 154<br />

signal-to-noise ratio 115–16<br />

significance, statistical 30, 117, 124–5, 173,<br />

346, 408<br />

levels of 120, 134, 133–5, 376, 403<br />

single-blinded studies 29, 76<br />

size of study 193<br />

SK (streptokinase) 126–7, 375, 375



skewed distributions 101, 102, 103<br />

smallpox vaccine 4<br />

snooping 122–3<br />

SnOut (sensitive tests rule out disease)<br />

acronym 257<br />

Snow, John 6, 193<br />

snowballing 55<br />

SOAP formats 242, 242, 408<br />

social context of medicine 242–3<br />

social desirability bias 211<br />

social outcome criteria 248<br />

software, computer 212<br />

SORT (standards of reporting trials) 28<br />

specificity 38, 193–4, 258, 408<br />

diagnostic tests 256<br />

differential diagnosis 282<br />

incremental gain 284, 283–5<br />

mnemonics 257, 257<br />

post-test probability/patient management<br />

309<br />

screening 313<br />

spectrum/subgroup bias 297<br />

spectrum 408<br />

spectrum bias 83, 297, 305, 408<br />

spin 357<br />

SpIn (specific tests rule in disease) acronym<br />

257, 257<br />

spirits 2<br />

SRs (systematic reviews) 54, 188, 409; see also<br />

clinical reviews; meta-analyses<br />

staff non-cooperation 240–1<br />

standard deviation 94, 101, 135, 135<br />

standard gamble 348, 408<br />

standardized therapy groups see<br />

controls/control groups<br />

stationary nodes 336, 336<br />

statistical analysis 29, 362<br />

statistical power see power, statistical<br />

statistical sampling 5, 6, 7<br />

statistical significance see significance,<br />

statistical<br />

statistical tests 116–18, 387<br />

statistically significant effect size 114<br />

statistics 408; see also hypothesis testing<br />

bibliography 416–17<br />

descriptive 94, 387, 389, 399<br />

distribution of values 101–2<br />

epidemiology 107, 107–8<br />

formulas 389–91<br />

history of 4–6, 16<br />

inferential 94, 387, 402<br />

learning objectives 93<br />

measures of central tendency 30, 94,<br />

98–100<br />

measures of dispersion 30, 94, 100–1<br />

nature of/role in medicine 93<br />

normal distribution 104, 103–4, 106<br />

percentages 104–5<br />

populations 101<br />

probability 105–7<br />

quotation 93<br />

samples 101–2<br />

visual display see graphing techniques<br />

StatSoft 108<br />

stem-and-leaf plots 97, 96–7<br />

stepwise regression analysis 364<br />

strategy of exhaustion see exhaustion<br />

strategy<br />

stratified randomization 409<br />

strength of evidence/applicability 383<br />

analogy, reasoning by 196<br />

applicability of results 188, 187–8<br />

biological plausibility 195<br />

common sense 196–7<br />

consistency of evidence 193<br />

consistency of evidence over time 195–6<br />

dose-response gradients 194–5<br />

hierarchies of research studies 190–1<br />

learning objectives 187<br />

levels of evidence 188–90<br />

Pathman’s pipeline analogy 197, 197–8<br />

quotation 187<br />

research design strength 188–90<br />

results strength 191–3<br />

specificity of results 193–4<br />

temporal relationships 194<br />

strong recommendations 383<br />

study design<br />

bibliography 416<br />

case reports/series 57, 57–8<br />

case–control studies 61, 60–2<br />

clinical trials 65, 64–6<br />

cohort studies 62–4<br />

cross-sectional studies 59–60<br />

descriptive studies 56, 57–60<br />

learning objectives 56<br />

longitudinal studies 56–7, 60–4<br />

prospective studies 57<br />

quotation 56<br />

retrospective studies 57<br />

strengths/weaknesses 61, 63, 64, 66<br />

types 56–7<br />

study size 193<br />

subgroup analysis 90, 171<br />

subgroup bias 297<br />

subject bias 85<br />

subjective information 409<br />

surgery 3, 7, 154, 170



surrogate markers 20, 59, 72, 89–90, 145, 194,<br />

409<br />

survival analysis 364, 365, 409; see also<br />

prognosis<br />

survival curves 365, 364–6<br />

survival rates 37<br />

symmetrical distributions 101, 102, 103<br />

synonyms 37, 44<br />

systematic error 70<br />

technical criteria 245–6<br />

technological improvement of tests 301–2<br />

temporal relationships, cause-and-effect 194<br />

terminology, non-medical 238<br />

test minimizers 236–7<br />

test review bias 299<br />

testing thresholds 216, 288, 409<br />

tests, diagnostic see diagnostic tests<br />

theoretical saturation 211<br />

therapeutic relationship 3, 202–3<br />

therapy studies 22, 22, 23, 38<br />

threshold approach to decision making<br />

343–5, 409<br />

threshold values 289, 287–91, 394–5; see also<br />

incremental gain<br />

time, PICO/PICOT model 16<br />

time trade-offs 409<br />

timelines, disease 311<br />

tissue samples 234, 245, 305<br />

TNRs (true negative rates) 256<br />

TPRs (true positive rates) 205, 256<br />

translation services 239<br />

treatment thresholds 216, 274, 275, 288,<br />

409<br />

trephination 2<br />

triangulation 213<br />

triggering 409<br />

TRIP database 34, 50–1<br />

triple-blinded studies 29, 76<br />

true negative rates (TNRs) 256<br />

true negative test results 130, 253, 253<br />

true positive rates (TPRs) 205, 256<br />

true positive test results 253, 253<br />

trust 180; see also responsible conduct of<br />

research<br />

truth 74<br />

Tuskegee syphilis studies 179<br />

two-alternative forced-choice problem 279,<br />

396<br />

two by two tables 224, 257, 258, 259, 269, 275<br />

diagnostic tests 391<br />

incremental gain 283, 284, 283–5<br />

two-tailed tests 121, 121–2, 132, 140, 409<br />

Type I errors 90, 112, 120–1, 173, 409<br />

bibliography 417<br />

clinical prediction rules development 329<br />

confidence intervals 124, 123–4<br />

learning objectives 120<br />

meta-analyses/systematic reviews 368<br />

multiple outcomes 122–3<br />

number needed to treat 125–7<br />

one-tailed/two-tailed tests 121, 121–2<br />

other sources of error 127–9<br />

quotation 120<br />

randomized clinical trials 171<br />

statistical/clinical significance 124–5<br />

Type II errors 112–13, 131, 173, 410; see also<br />

negative studies, evaluating<br />

bibliography 417<br />

clinical prediction rules development 329<br />

determining power 131–5<br />

effect size 133, 134<br />

learning objectives 130<br />

level of significance 134, 133–5<br />

meta-analyses/systematic reviews 368, 370,<br />

375<br />

non-inferiority/equivalence studies 140<br />

prognosis 362<br />

quotation 130<br />

sample size 133, 132–3<br />

standard deviation 135, 135<br />

Type III errors 113, 133<br />

Type IV errors 113, 133<br />

unadjusted life expectancy 410<br />

uncertainty 410<br />

underfitting 160<br />

University of British Columbia 291<br />

unobtrusive measurements 75–6<br />

users 11, 53<br />

Users’ Guides to the Medical Literature 151,<br />

318, 366, 384–6<br />

bibliography 418–21<br />

randomized clinical trials 175–6<br />

risk assessment 151<br />

website 175<br />

USPHS (United States Public Health Service)<br />

317<br />

utility, diagnostic tests 306<br />

cutoff points 258<br />

definition 410<br />

function 250–1<br />

indications 250<br />

learning objectives 249–50<br />

normal test results 252–3<br />

quotation 249<br />

sample problem 259, 258–60<br />

sensitivity/specificity 256



sensitivity/specificity, using 225, 257, 258,<br />

259, 257–60, 267–8, 269<br />

strength of tests 253<br />

two by two tables 254, 253–4<br />

types of test result 251–2<br />

utility theory 6, 334, 334–6<br />

validation samples 362<br />

validation sets 62<br />

validation studies 158<br />

validity 175–6, 410<br />

case–control studies 385<br />

clinical prediction rules 331<br />

cohort studies 384–5<br />

diagnosis – study type 385–6<br />

diagnostic tests 246<br />

diagnostic tests, critical appraisal<br />

304–6<br />

guidelines 324–5<br />

measurements/instruments 73<br />

randomized clinical trials (RCTs) 384<br />

risk assessment 151<br />

screening studies 318<br />

types 73–5<br />

values, patient 10, 319, 335, 349, 356, 405<br />

and decision making 346–8<br />

variables 410; see also confounding variables<br />

continuous 139, 138–40<br />

dependent 68<br />

dichotomous 138, 139<br />

independent 67<br />

variance 101, 410<br />

VAS (visual analog scale) 71, 136, 347, 347<br />

Venn diagrams 35, 35<br />

verification bias 298<br />

Vesalius 3<br />

VINDICATE mnemonic 224, 225<br />

visual analog scale (VAS) 71, 136, 347, 347<br />

vitamins 4, 87, 164–5, 195, 196<br />

V/Q (ventilation-perfusion) scan 296<br />

water treatment engineering 6, 20<br />

weak recommendations 383<br />

Web of Science 34, 54<br />

websites 12, 34, 421–4<br />

Clinical Evidence database 52<br />

clinical trials 369<br />

Cochrane Collaboration 49<br />

confidence intervals calculator for Bayes<br />

Theorem 291<br />

GRADE scheme 369<br />

levels of evidence 191<br />

number needed to treat 127<br />

PogoFrog 54<br />

PUBMED 37–8, 39, 38–9<br />

registry of clinical trials 178<br />

risk assessment 151<br />

Users’ Guides to the Medical Literature<br />

175<br />

weight 73<br />

weighted outcomes 371<br />

whistle-blowers 180, 182<br />

wildcard symbol 37, 39<br />

Wiley InterScience interface 49, 49<br />

World Health Organization (WHO) 52<br />

writing, legible 243<br />

Yule–Simpson paradox 162, 163, 410<br />

zero numerator 153, 152–3, 154<br />

zero points 94, 95
