620-472 Data Mining
Data Mining refers to the management and analysis of large data sets. As it has matured it has developed a more statistical flavour, but Data Mining still owes much of its character to disciplines such as machine learning, pattern recognition, database design and high performance computing.
Data Mining became possible with the advent of large-scale data collection and the computing power necessary to process it. Data Mining involves all of the following steps:
1. Data Warehousing
2. Data Cleaning
3. Data Description and Visualisation
4. Data Analysis and Interpretation
This course deals only with step 4 of the Data Mining process: data analysis
and interpretation.
Techniques covered by the course include: Market Basket Analysis; Tree-based classification; Logistic Regression; Neural Networks; Hierarchical and K-means clustering; and Regression Splines.
The themes that run through the course are:
- Model fitting and selection and how to avoid overfitting
- Scalable algorithms that can be used with very large data sets
- How to accommodate high-dimensional data
- Actionability and interpretability of models
Prerequisites
None required; however, students would benefit from having completed an introductory probability or statistics unit, such as 620-131, 620-160, 260-201 or 620-370.
Lecturer
Dr Owen Jones, room 221 Richard Berry building.
Contact details are available on the departmental web site.
Lectures are on Mondays, 10 - 12, room 215 of the Richard Berry building.
A lab help session has been timetabled for Fridays, 3:15 - 4:15, room G70 (Wilson lab) of the Richard Berry building.
Course Material
The book by Kuhnert and Venables uses R to apply many of the techniques we cover:
Assessment
20% coursework (weekly/fortnightly assignments); 80% exam
Past Exams
Online Resources
Two useful data mining websites
An interactive and educational implementation of the k-means algorithm
Here are links to some useful R Resources:
References
- T. Hastie, R. Tibshirani and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2001.
- P.N. Tan, M. Steinbach and V. Kumar. Introduction to Data Mining. Addison Wesley, 2005.
- P. Giudici. Applied Data Mining: Statistical Methods for Business and Industry. Wiley, 2003.
- M. Hegland. Data Mining Techniques. Acta Numerica, pp 313-355, 2001.
- B.D. Ripley. Pattern Recognition and Neural Networks. CUP, 1996.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. Morgan Kaufmann, 2001.
- I.H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd Edition. Morgan Kaufmann, 2005.
- U.M. Fayyad, G. Piatetsky-Shapiro, P. Smyth and R. Uthurusamy (Eds.). Advances in Knowledge Discovery and Data Mining. MIT Press, 1996.
- G. Piatetsky-Shapiro and W. Frawley. Knowledge Discovery in Databases. AAAI Press, 1991.
- W. N. Venables and B. D. Ripley. Modern Applied Statistics with S. Fourth Edition. Springer, 2002.