Children's Mercy Hospital

Category: Data mining (June 18, 2007) [incomplete] Data mining is a broad class of statistical tools that are designed for massive data sets. Many of the links in this category refer to methods for genetic data sets, especially microarray studies. Articles are arranged by date with the most recent entries at the top. You can find the theme and closely related categories and other resources at the bottom of this page.

Stats: The pros and cons of control charts versus data mining (November 17, 2007). In a talk I gave in December 2006, I highlighted how in the analysis of adverse event data, control charts can augment more complex statistical tools like data mining. Here's a summary of the pros and cons of using control charts.

Stats: Justifying the sample size for a microarray study (August 9, 2007). I'm helping out with a grant proposal that is using microarrays for part of the analysis. A microarray is a system for quantitative measurement of circulating mRNA in human, animal, or plant tissue. A microarray will typically measure thousands or tens of thousands of different mRNA sequences. An important issue for this particular grant (and many grants involving microarray data) is how to justify the sample size. Here are a few references that I will use to develop such a justification.

Stats: Resources for fMRI data analysis (February 8, 2007). I was asked to provide feedback on a grant that will use functional magnetic resonance imaging (fMRI) as one component of the research. This technique is used to quantify brain activity by measuring changes in blood flow in various regions of the brain. It effectively produces information in the three dimensions of the brain structure, plus a dimension of time. The technology today can produce images localized to a cube with dimensions of approximately 2-4 mm, and these can be measured every 1-4 seconds.

Stats: Resources describing biplots (January 15, 2007). I've written some code in R to present a graphical summary of a complex data set using biplots. I wrote most of the code myself using the singular value decomposition function (svd) in R. There is a wide range of techniques that can be loosely classified as biplots, such as principal components analysis, multidimensional scaling, correspondence analysis, and canonical variate analysis.

Stats: PharmaIQ talks (December 6, 2006). I attended the conference "Signal Detection and Data Mining" sponsored by PharmaIQ. Here are some notes I took during some of the talks.

Stats: Pharmacogenetics Research Network (September 14, 2006). I received an email today discussing a special conference being held by the Pharmacogenetics Research Network (PGRN). It's an "invitation only" conference, so I must have had someone recommend me for this group. I did not know anything about PGRN, so I did a brief web search.

Stats: (Seminar notes) Working with molecular biologists (July 17, 2006). One of the talks at the 18th Annual Applied Statistics in Agriculture Conference, sponsored by Kansas State University, was "A visual aid to help a statistician work with a molecular biologist" by Debbie Boykin of the USDA Agricultural Research Service, coauthored by Earl Taliercio, also of the USDA Agricultural Research Service, and Rowena Kelly from Mississippi State University. The original title of the talk was "Improving Power of Microarray Experiments by Adjusting Data so Fewer Differentially Expressed Genes are Overlooked" but Dr. Boykin reviewed the material and decided to change the focus.

Stats: Methods for haplotype analysis (May 31, 2006). I am not an expert on haplotype analysis, but as I understand it, a haplotype is a combination of several SNPs (Single Nucleotide Polymorphisms) that show a stronger association with disease than any single SNP might. Haplotype analysis is difficult because you often only have partial information about the genomes.

Stats: Univariate Model Based Clustering (April 18, 2006). Back in 2001, I attended an excellent short course on a new approach to cluster analysis taught by Adrian Raftery and Chris Fraley at the Joint Statistics Meetings. Their approach, model based clustering, examined the fits of mixtures of normal distributions. This approach is useful for unidimensional and multidimensional data and has many advantages over other clustering approaches like hierarchical clustering and k-means clustering. While I greatly enjoyed that class, I never had need to use this approach until just recently. So I dusted off my old notes and worked out a few simple examples to refresh my memory. I want to present some of these examples here on this weblog.

Stats: The Healthcare Cost and Utilization Project (April 11, 2006). On April 20, I will be attending a webcast sponsored by the Agency for Healthcare Research and Quality (AHRQ) on a large data set they collected, the Healthcare Cost and Utilization Project. The acronym for this data set is HCUP, which I always pronounced HICCUP, but apparently, you are supposed to pronounce it H-CUP.

Stats: An Ensembl search (February 1, 2006). While working on a microarray experiment with a researcher, we had to find a bit of information about a gene with the gene symbol NCOA3. We went to the Ensembl web site (www.ensembl.org) and did a search which yielded the following information.

Stats: How do the various clustering algorithms work? (January 31, 2006). I'm working with someone on some clustering models for his microarray experiment. He asked how the various clustering algorithms work.

Stats: A simple function for a Biplot in R (January 24, 2006). I regularly use a biplot or principal components plot as an initial exploratory tool for microarray analyses, but I have not found a good package that does this for me automatically. Rather than re-inventing the code every time, I created a simple R function that does the job for me. It's not the fanciest or best code in the world, but I wanted to put it here and comment on the various alternative forms of the biplot and principal components plot when I have time.

Stats: Machine Learning tools in R (January 24, 2006). There are a variety of different models for supervised learning or classification problems: Diagonal Linear Discriminant Analysis (DLDA), Neural Networks (NN), Support Vector Machines (SVM), k Nearest Neighbors (kNN), Bagging, and Boosting. R has a library, MLInterfaces, that puts a uniform interface in front of the input and output from all of these procedures.

Stats: Haplotype analysis (January 13, 2006). One of the people I work with wants to include a haplotype analysis in their research grant. I know nothing about haplotype analysis, so I am currently investigating various publications, web sites, and software. I want to include these resources here and eventually organize a web page that describes the statistical approach to haplotype analysis. I also think that there may be some benefit to using an information theory model in this type of analysis, but that is just some preliminary speculation on my part. I have looked a bit at this issue already while trying to understand the HapMap project.

Stats: RMA normalization of microarrays (October 24, 2005). If you ask most statisticians if they want raw data or processed data, they will, for the most part, prefer to look at the raw data. There are two reasons for this. First, the statisticians want to understand the processing of the data and how that might influence the precision of any further calculations based on the raw data. Second, statisticians may want to try alternative approaches for processing the data and see if that produces better results. An example of this involves the normalization of Affymetrix microarray chips.

Stats: A totally negative microarray experiment (October 14, 2005). I've been cleaning out my old emails and am finding some real gems of good advice. Someone wrote into the Bioconductor email list wondering what to do when the lowest adjusted p-value in the entire experiment was still very large (0.66). A nice response outlined three strategies.

Stats: More on discovering gene information (October 12, 2005). I was reading an interesting microarray article: A mechanism of cyclin D1 action encoded in the patterns of gene expression in human cancer. Lamb J, Ramaswamy S, Ford HL, Contreras B, Martinez RV, Kittrell FS, Zahnow CA, Patterson N, Golub TR, Ewen ME. Cell 2003: 114(3); 323-34. I was curious what information I could find about cyclin D1. The article mentions the gene symbol (CCND1) but provides no other obvious clues (at least clues that were obvious to me).

Stats: Naming conventions for genes, proteins, etc. (September 8, 2005). When you are analyzing a microarray experiment, the mRNA sequences can be referred to by several different names.

Stats: Finding more information about a gene (September 6, 2005). I ran a few simple experiments using microarray data from a public source: www.genome.org/cgi/content/full/15/3/443/DC1. This is the data set used in the publication: Database of mRNA gene expression profiles of multiple human organs. Son CG, Bilke S, Davis S, Greer BT, Wei JS, Whiteford CC, Chen QR, Cenacchi N, Khan J. Genome Res 2005: 15(3); 443-50.

Stats: Statistical Analysis of Microarrays by Insightful (August 31, 2005). I attended a seminar presented by Michael O'Connell of Insightful Corporation on microarray analysis. Insightful Corporation has a program, S+ArrayAnalyzer, and this talk showed some of the capabilities of this software.

Stats: Publicly available microarray data (August 18, 2005). A working paper from the Johns Hopkins Biostatistics department (Searching for Differentially Expressed Gene Combinations, by Marcel Dettling, Edward Gabrielson, and Giovanni Parmigiani) uses two microarray data sets to test its methodology.

Stats: More on normalization (July 28, 2005). I am looking into a variety of ways to normalize a set of microarrays from Affymetrix. The concept is similar to normalization for a two-dye array, but there are some subtle differences also.

Stats: Permutation tests for microarrays (July 27, 2005). A very simple approach to estimating the proportion of differentially expressed genes uses a permutation approach. The reference for this is Empirical Bayes Analysis of a Microarray Experiment. Efron B, Tibshirani R, Storey JD, Tusher V. Journal of the American Statistical Association 2001: 96(456); 1151-1160.
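
As a rough illustration of the permutation idea for a single gene, here is a minimal Python sketch. It is not the empirical Bayes method of the Efron et al. paper; it only shows how relabeling the arrays builds a reference distribution. The function name and the expression values are hypothetical.

```python
from itertools import combinations

def permutation_pvalue(group_a, group_b):
    """Two-sided exact permutation p-value for a difference in group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    n_b = len(pooled) - n_a
    total = sum(pooled)

    def mean_diff(indices):
        # Mean of the values labeled "A" minus the mean of the rest.
        sum_a = sum(pooled[i] for i in indices)
        return sum_a / n_a - (total - sum_a) / n_b

    observed = abs(mean_diff(range(n_a)))
    extreme = 0
    n_relabelings = 0
    # Enumerate every way of assigning n_a of the arrays to group A.
    for indices in combinations(range(len(pooled)), n_a):
        n_relabelings += 1
        # Small tolerance guards against floating point round-off.
        if abs(mean_diff(indices)) >= observed - 1e-12:
            extreme += 1
    return extreme / n_relabelings

# Hypothetical log expression values for one gene on eight arrays.
exposed = [8.1, 8.4, 8.3, 8.6]
control = [7.0, 7.2, 6.9, 7.3]
p = permutation_pvalue(exposed, control)
```

With four arrays per group there are only 70 relabelings, so the smallest possible two-sided p-value is 2/70, which is one reason small microarray experiments have limited resolution gene by gene.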

Stats: Analysis of Gene Expression Data Short Course (July 26, 2005). I'll be taking a short course at the Joint Statistical Meetings next month. It will be taught by Terry Speed, Jean Yang, Ben Bolstad, and James Wettenhall.

Stats: Step-down procedures for multiple comparisons (June 16, 2005). In some research studies, you have a large and difficult to manage set of outcome measures. This is especially true in microarray experiments, where for thousands or tens of thousands of genes, you are measuring the difference in expression levels between two types of tissue. A simple p-value is worthless in this situation, because it will be swamped by thousands of other p-values.
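
The summary above does not say which step-down procedure the entry covers. As one well-known example, here is a minimal Python sketch of Holm's step-down adjustment; the function name and the p-values are made up for illustration.

```python
def holm_adjust(pvalues):
    """Return Holm step-down adjusted p-values in the original order."""
    m = len(pvalues)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on.
        candidate = min(1.0, (m - rank) * pvalues[i])
        # Adjusted p-values are never allowed to decrease along the ordering.
        running_max = max(running_max, candidate)
        adjusted[i] = running_max
    return adjusted

# Hypothetical unadjusted p-values for four genes.
pvals = [0.01, 0.04, 0.03, 0.005]
adj = holm_adjust(pvals)
```

A gene is declared significant at level alpha when its adjusted p-value is at most alpha. With thousands of genes the multipliers become very large, which is why less conservative criteria such as the false discovery rate are often preferred for microarrays.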

Stats: Application of the ROC curve to microarray data (May 26, 2005). Life is full of surprises. When I was looking at whether the software package R could compute and analyze Receiver Operating Characteristic (ROC) curves, I found out that there is an application of ROC curves for microarray data. Apparently, the positive false discovery rate can be conceived of in a diagnostic testing format as relating to the positive predictive value.
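
To make the diagnostic testing analogy concrete, here is a small Python sketch that treats each gene called significant as a "positive test". It computes the observed proportion of false discoveries rather than the formal pFDR, which is an expected value; the counts are hypothetical and would not be known in a real experiment.

```python
def ppv_and_false_discovery_proportion(true_positives, false_positives):
    """Positive predictive value and false discovery proportion among called genes."""
    called = true_positives + false_positives
    if called == 0:
        raise ValueError("no genes were called significant")
    ppv = true_positives / called   # fraction of called genes that are truly changed
    fdp = false_positives / called  # fraction of called genes that are false leads
    return ppv, fdp

# Hypothetical outcome: 100 genes called significant, 80 of them truly changed.
ppv, fdp = ppv_and_false_discovery_proportion(true_positives=80, false_positives=20)
```

The two quantities always sum to one, so a false discovery proportion of 0.2 corresponds to a positive predictive value of 0.8.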

Stats: Dimension reduction in a microarray experiment (May 25, 2005). Given the large number of genes in a microarray experiment, you need to find some way of looking at subsets or linear combinations of these genes. Assume that you have G genes and M microarrays and that the normalized signals are in a matrix X with G rows and M columns. Assume that information about the particular tissues (phenotypic data) is in a matrix Y with M rows and P columns.

Stats: Microarray data analysis, again (April 22, 2005). One of these days, I will have a coherent set of pages talking about microarray data analysis, but for now, all I have is a haphazard set of pages and weblog entries, most of which are woefully incomplete. In an effort to try to pull these together, I am listing below all of these links.

Stats: More articles on microarrays (March 10, 2005). It's impossible to keep up with the flood of research on microarrays, but here are a few articles published in the journal, Statistical Applications in Genetics and Molecular Biology that sounded interesting.

Stats: Review articles on microarrays (March 7, 2005). The Medical Science Monitor listed these three articles among their most frequently requested downloads. They all look like good overviews of microarray technology.

Stats: Two cautionary tales about data mining (January 6, 2005). I attended a 7am seminar this morning on data warehousing and data mining, which was quite good. It reminded me of a couple of stories I heard about the pitfalls of data mining.

Stats: A simple microarray experiment (September 21, 2004). Someone just gave me some data with a small microarray. There are four exposed animals (Exp.1 through Exp.4) and four control animals (Exp.5 through Exp.8). The microarray has 96 genes, as well as some housekeeping genes.

Stats: Microarray data analysis (March 18, 2004). The large amount of data in a typical DNA microarray assay makes for a lot of challenges for us statisticians. I've wanted to write a simple introductory web page on this topic for a while, but have never found the time to do it well. There are a couple of recent articles on microarrays published on BioMed Central with full text available on line.

Stats: Guidelines for data mining models (September 22, 2003). I'm not an expert on data mining, but I wanted to outline some of the basic issues associated with data mining problems. This material is based largely on notes that I took during a training class on data mining taught by Richard De Veaux.

Stats: Steps in a typical micro array data analysis (no date). I am not an expert in micro array data analysis. In fact, I'm just starting. I thought that outlining some of the things I am learning as I start to do micro array data analyses would be helpful to others.

Stats: Microarray bibliography and links (no date). Here are some resources if you (as I) are just starting to learn about microarray data analysis.

Stats: Data management in a microarray experiment (no date). I was asked to concentrate on a set of genes associated with the Folate pathway. This list of 43 genes, ABCB1, ABCC1, ABCC3, ..., SLC19A1, and TYMS, was stored in an Excel file called FolateGeneList.xls. I converted this file to a csv format, and read it into R.

Stats: Design of microarray experiments (no date). There are a variety of research designs that you can use in a microarray experiment.

Stats: Differential expression in microarray data (no date). You can compute an expression ratio for each gene by taking the average of the log expression levels in the treatment group and subtracting the average of the log expression levels in the control group. This actually produces a log ratio, and you can compute the actual ratio by taking the antilog.
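
The calculation described above can be sketched in a few lines of Python. The base of the logarithm is not stated in the summary, so base 2 is an assumption here, and the expression values are invented for illustration.

```python
import math

def log2_ratio(treatment, control):
    """Mean log2 expression in the treatment group minus that in the control group."""
    mean_log_treatment = sum(math.log2(x) for x in treatment) / len(treatment)
    mean_log_control = sum(math.log2(x) for x in control) / len(control)
    return mean_log_treatment - mean_log_control

def expression_ratio(treatment, control):
    """Antilog of the log ratio, which is a ratio of geometric means."""
    return 2.0 ** log2_ratio(treatment, control)

# Hypothetical raw expression levels for one gene.
treatment = [400.0, 800.0, 1600.0]
control = [100.0, 200.0, 400.0]
lr = log2_ratio(treatment, control)
ratio = expression_ratio(treatment, control)
```

Because the averaging is done on the log scale, the antilog gives a ratio of geometric means rather than a ratio of arithmetic means. In this example the log ratio is 2, which corresponds to a four-fold difference.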

Stats: Importing data from a microarray experiment (no date). There are so many different ways that data can come to you in a microarray experiment that it is hard to document how to import the data. Here are a few examples, plus some random notes and thoughts.

Stats: Normalization for microarray data (no date). Normalization is the process of adjusting values in a microarray experiment to improve consistency and reduce bias.
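
The summary above does not name a specific method. As one common example, here is a minimal Python sketch of quantile normalization, which forces every array to share the same distribution of values. The function name and the data are hypothetical, and ties are handled in a simplified way.

```python
def quantile_normalize(arrays):
    """Quantile-normalize a list of equal-length lists (one list per microarray)."""
    n_genes = len(arrays[0])
    if any(len(a) != n_genes for a in arrays):
        raise ValueError("all arrays must have the same number of genes")
    # Reference distribution: average the k-th smallest value across arrays.
    sorted_arrays = [sorted(a) for a in arrays]
    reference = [sum(column) / len(arrays) for column in zip(*sorted_arrays)]
    normalized = []
    for a in arrays:
        # Rank the genes within this array (ties broken by position, a simplification).
        order = sorted(range(n_genes), key=lambda g: a[g])
        out = [0.0] * n_genes
        for rank, g in enumerate(order):
            # Replace each value with the reference value of the same rank.
            out[g] = reference[rank]
        normalized.append(out)
    return normalized

# Hypothetical log signals: three arrays, four genes each.
chips = [[5.0, 2.0, 3.0, 4.0],
         [4.0, 1.0, 4.5, 2.0],
         [3.0, 4.0, 6.0, 8.0]]
normed = quantile_normalize(chips)
```

After normalization each array contains exactly the same set of values, and only the assignment of values to genes differs. The ordering of genes within each array is preserved.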

Stats: Software for microarray data analysis (no date). There is a wide range of software available for the analysis of microarrays. I will use Bioconductor, which is a set of libraries for a statistical programming language called R. Both Bioconductor and R are open source, which means that you can obtain the package at no cost.

Stats: Supervised learning for microarray data (no date). Here are some documented examples of how to use supervised learning methods for the analysis of microarray data.

Stats: Unsupervised learning for microarray data (no date). Here are some documented examples of how to use unsupervised learning methods for the analysis of microarray data.

Stats: What is a microarray? (no date). A microarray is a tool for measuring the amount of messenger RNA (mRNA) that is circulating in a cell. It is the mRNA that carries information from the genes in the DNA inside the nucleus of a cell to create various proteins. Even though they have the exact same DNA, different cells have different amounts of various mRNA because they need to produce different proteins. For example, only certain cells in the pancreas produce insulin even though the DNA code for producing insulin exists inside all cells.

Theme and closely related categories:

Other resources:


This webpage was written on 2007-06-18 and was last modified on 2008-07-08.