Data Mining


Chapter: Pharmacovigilance: Data Mining in Pharmacovigilance

Data Mining in Pharmacovigilance: A View from the Uppsala Monitoring Centre


In this chapter we use the term ‘data mining’ for any computational method used to automatically and continuously extract useful information from large amounts of data. Data mining is a form of exploratory data analysis (Hand, Mannila and Smyth, 2001) and a key component of the knowledge discovery process (Fayyad, Piatetsky-Shapiro and Smyth, 1996). Data mining can be used on any data set, but the approach seems particularly valuable when the amount of data is large and the possible relationships within the data set are numerous and complex. Although data mining of drug utilisation information, and of other relevant data sets such as those relating to poisoning, medical error and patient records, will add greatly to pharmacovigilance (Anonymous, 2003; Bate et al., 2004), research in this area is still preliminary and will not be discussed in detail here.

In principle the WHO Collaborating Centre for International Drug Monitoring (the Uppsala Monitoring Centre, UMC) has been doing data mining since the mid-1970s, using an early relational database. As with many automated systems, the relational database to a very large extent replicated a manual approach, in this instance the Canadian ‘pigeon hole’ system (Napke, 1977), in which reports were physically assigned to a slot, an arrangement that encouraged visual inspection. It could thus be observed when the number of reports in a certain category was unexpectedly high. From the UMC database, countries in the WHO Programme for International Drug Monitoring have been provided with information, reworked by the UMC, on the summarised case data submitted by each national centre. This information has been presented to them according to categories and classifications agreed amongst Programme members from time to time. This kind of system suffers from the following limitations:

·    It is prescriptive, the groupings being determined by what experience has found to be broadly useful.

·    Each category is relatively simple, but the information beneath each heading is complex and rigidly formatted.

·    There is no indication of the probability of any relationship other than the incident numbers in each time period.

This system does not even have all the user-friendliness of the pigeon hole system, which allowed a user to visually scan the reports as they were filed and see the rate of build-up in each pigeon hole. Admittedly, the sorting was relatively coarse, but the continuous visual cue given by the accumulation of case reports was very useful. In improving on the pigeon hole system and adapting it for the ever-increasing amounts of data involved, one can imagine a computer program surveying all data fields, looking for any pair of events that stand out as occurring together more frequently than expected.

Different measures of association have been proposed for the purpose of analysing disproportional reporting of ADR terms with drug substances. The proportional reporting ratio (PRR), which is akin to a relative risk, and the reporting odds ratio (ROR) are classical statistical measures of association that can be combined with, for example, chi-squared tests for association to guard against spurious findings. Bayesian and empirical Bayesian approaches take this one step further by providing shrinkage estimates such as the Information Component (IC) (Bate et al., 1998; Orre et al., 2000) and the EBGM (DuMouchel, 1999). These are typically closer to the null hypothesis of independence than classical estimates and less volatile when data is scarce. As such, they provide robust measures of association that account for both significance and strength. Furthermore, a Bayesian approach is intuitively well suited to a situation where the probability of relationships must be continuously re-assessed as new data is acquired over time. In Bayesian inference, new data modifies the prior probabilities to give posterior probabilities, and the posterior probabilities can be used as prior probabilities in subsequent analyses. The process can be iterated indefinitely.
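The measures of association described above can be made concrete with a small calculation on a 2x2 table of report counts. The following is a minimal sketch, not the UMC's implementation: the function name, the cell labels a, b, c, d and the example counts are illustrative assumptions, and the shrunk IC shown here uses a simple "add 0.5 to observed and expected" form rather than the full Bayesian estimate of the cited papers.

```python
import math


def disproportionality(a, b, c, d):
    """Disproportionality measures for one drug/ADR pair.

    Hypothetical cell labels (all counts assumed > 0):
      a: reports with the drug and the ADR term
      b: reports with the drug, without the ADR term
      c: reports with the ADR term, without the drug
      d: reports with neither
    """
    n = a + b + c + d

    # PRR: proportion of the drug's reports that mention the ADR,
    # relative to the same proportion among all other reports.
    prr = (a / (a + b)) / (c / (c + d))

    # ROR: odds of the ADR among the drug's reports vs. other reports.
    ror = (a * d) / (b * c)

    # Pearson chi-squared statistic (1 degree of freedom, no correction).
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

    # Expected count of the pair if drug and ADR were reported independently.
    expected = (a + b) * (a + c) / n

    # Unshrunk log2 observed-to-expected ratio.
    ic_raw = math.log2(a / expected)

    # Simplified shrinkage (an assumption of this sketch): adding 0.5 to
    # both counts pulls the estimate towards 0, i.e. towards independence.
    ic_shrunk = math.log2((a + 0.5) / (expected + 0.5))

    return {"prr": prr, "ror": ror, "chi2": chi2,
            "expected": expected, "ic_raw": ic_raw, "ic_shrunk": ic_shrunk}


# Made-up counts: a well-populated pair and a pair with a single report.
common = disproportionality(20, 980, 180, 98820)
scarce = disproportionality(1, 9, 199, 99791)

for label, result in (("common", common), ("scarce", scarce)):
    print(label, {key: round(value, 3) for key, value in result.items()})
```

With the single-report pair, the unshrunk ratio is very large although it rests on one case, while the shrunk value is much closer to zero. This is the sense in which shrinkage estimates are less volatile when data is scarce.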

The next level of complexity is to consider the effects of adding other objects as variables. Complex pattern recognition in spontaneous reporting data may extract information related to ADR syndromes, patient risk groups, drug interactions and data quality problems. It typically increases the computational demands and often requires more sophisticated quantitative methods. The UMC has chosen the Bayesian Confidence Propagation Neural Network (BCPNN) as the most favourable framework for development in this area. This is a statistical neural network consisting of a matrix of interconnected nodes that represent different data fields. It is trained according to Bayes' law on the data provided to it. The use of Bayesian logic seems natural, since the relationship between each pair of nodes will alter as more data is added. The network ‘learns’ the new weights between nodes, and can be asked how much those weights are changed by the addition of new case data or by the consideration of higher-order associations.
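How a weight between two nodes shifts as case data accumulates can be illustrated with a toy calculation. The sketch below is not the BCPNN itself. It assumes that the weight between a drug node and an ADR node is the log2 ratio of their joint probability to the product of their marginal probabilities, and it estimates those probabilities as plug-in posterior means under a weak prior centred on independence. The class name PairCounts, the prior pseudo-counts and the batch figures are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class PairCounts:
    """Running pseudo-counts for one drug/ADR pair (hypothetical structure).

    The defaults act as a weak prior worth one report spread evenly over
    the four cells of the 2x2 table, so the prior weight is exactly 0.
    """
    n_total: float = 1.0
    n_drug: float = 0.5
    n_adr: float = 0.5
    n_both: float = 0.25

    def update(self, total, drug, adr, both):
        """Add a batch of reports; the result serves as the next prior."""
        return PairCounts(self.n_total + total, self.n_drug + drug,
                          self.n_adr + adr, self.n_both + both)

    def weight(self):
        """log2 of joint probability over the product of the marginals."""
        p_drug = self.n_drug / self.n_total
        p_adr = self.n_adr / self.n_total
        p_both = self.n_both / self.n_total
        return math.log2(p_both / (p_drug * p_adr))


prior = PairCounts()

# Two made-up batches of case reports: (total, with drug, with ADR, with both).
after_first = prior.update(1000, 50, 20, 5)
after_second = after_first.update(1000, 60, 25, 9)

# Processing both batches at once gives the same state as updating in turn.
all_at_once = prior.update(2000, 110, 45, 14)

print("prior weight:       ", prior.weight())
print("after first batch:  ", round(after_first.weight(), 3))
print("after second batch: ", round(after_second.weight(), 3))
print("change from batch 2:", round(after_second.weight() - after_first.weight(), 3))
```

Because each update only adds counts, yesterday's posterior can be stored and reused as today's prior without revisiting earlier reports, and the difference between two successive weights answers the question of how much a new batch of case data changed the association.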
