Honours Projects in Applied Statistics and Stochastic Processes

The following is a sample of projects currently available, or that have been completed in the past, listed by supervisor. If there is a topic that you are interested in that does not appear on this list, then we will happily consider that too. For more information on these or other topics just ask someone in the Applied Statistics or Stochastic Processes groups.

Kostya Borovkov

Large deviation probabilities for random walks with mixture distributions. Large deviation probabilities have important applications, for example in risk theory and statistics. It is well known that the large deviation behaviour of random walks with "light tails" differs qualitatively from that of random walks with "heavy tails". The aim of this project is to study the transient case, when the jump distributions are mixtures of light- and heavy-tailed distributions.

Credit migration processes. Bank customers are regularly ranked according to their capacity to meet their financial commitments. A standard model for credit migration assumes that each customer's credit ranking evolves according to a Markov chain with a common transition matrix. This assumption is poorly supported by the empirical data. The project will aim at extending the standard model, in particular by allowing different customers to have different transition matrices.
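
As a minimal sketch of why a common transition matrix can be restrictive, the Python snippet below compares two-step migration probabilities for a mixed population with those of a single "averaged" chain. The two rating states, the two customer types and all the numbers are hypothetical, chosen only for illustration.

```python
def matmul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mix(a, b, weight):
    """Entrywise mixture: weight * a + (1 - weight) * b."""
    return [[weight * x + (1 - weight) * y for x, y in zip(ra, rb)]
            for ra, rb in zip(a, b)]

# Hypothetical two-state example: state 0 = "good", state 1 = "default" (absorbing).
P_a = [[0.9, 0.1], [0.0, 1.0]]   # assumed matrix for customer type A
P_b = [[0.5, 0.5], [0.0, 1.0]]   # assumed matrix for customer type B

# Two-step behaviour of a 50/50 population: average the squared matrices.
mixture_two_step = mix(matmul(P_a, P_a), matmul(P_b, P_b), 0.5)

# Two-step behaviour predicted by a single common (averaged) matrix.
P_avg = mix(P_a, P_b, 0.5)
common_two_step = matmul(P_avg, P_avg)

print(mixture_two_step[0][0], common_two_step[0][0])
```

In this toy case the mixed population keeps more customers in the "good" state after two steps than the common-matrix model predicts, so the population-level process is not a Markov chain with the averaged matrix.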

Value-at-risk vs. expected shortfall. The project will aim at analysing and comparing these two popular risk measures, with possible extensions involving some more sophisticated characteristics of the underlying probabilistic models (e.g. the expected shortfall).
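
To make the comparison concrete, here is a small Python sketch of empirical versions of the two risk measures. It assumes one common convention (value-at-risk as an order statistic of the losses, expected shortfall as the average of the losses at or beyond it); other conventions exist, and the loss sample is invented for illustration.

```python
import math

def value_at_risk(losses, level):
    """Empirical VaR: smallest observed loss x such that at least
    a fraction `level` of the sample is <= x."""
    ordered = sorted(losses)
    rank = max(math.ceil(level * len(ordered)), 1)  # 1-based rank
    return ordered[rank - 1]

def expected_shortfall(losses, level):
    """Average of the observed losses at or beyond the empirical VaR."""
    var = value_at_risk(losses, level)
    tail = [x for x in losses if x >= var]
    return sum(tail) / len(tail)

# Hypothetical loss sample with one extreme observation.
losses = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 100.0]
var_90 = value_at_risk(losses, 0.9)
es_90 = expected_shortfall(losses, 0.9)
print(var_90, es_90)
```

The single extreme loss leaves the 90% value-at-risk unchanged but moves the expected shortfall substantially, which is one reason the two measures can rank risks differently.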

Ian Gordon

Propensity score adjustment. Accounting for potential bias in non-randomised studies of an intervention has often been done by fitting a statistical model that adjusts for relevant, measured, explanatory variables. Rubin and others have advocated an alternative approach in which the probability of receiving the intervention is first modelled using all available information: this is the "propensity score". Then the substantive estimation of the intervention effect is made, adjusting for the propensity score. This project looks at the propensity score and its application in more general settings.

Graham Hepworth

Estimation of proportions by group testing. Group testing occurs when units from a population are pooled together and tested as a group for the presence of a particular attribute. Examples of its application are the testing of blood for diseases, the transfer of viruses to plants by insects, assessment of the proportion of defective units in manufacturing, and testing of seed for pathogens. Group testing can substantially reduce cost, but increases the complexity of estimating the proportion of the population possessing the attribute.
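
As an illustration of the simplest case, the sketch below computes the maximum likelihood estimate of the proportion when all pools have the same size. It assumes a perfect test (no false positives or negatives) and independent units, so that a pool of size k is negative with probability (1 - p)^k; the function name and the numbers are hypothetical.

```python
def group_testing_mle(positive_pools, total_pools, pool_size):
    """MLE of the individual-level proportion p from equal-sized pools,
    assuming a perfect test and independent units."""
    pool_rate = positive_pools / total_pools
    # Invert P(pool positive) = 1 - (1 - p)^k.
    return 1.0 - (1.0 - pool_rate) ** (1.0 / pool_size)

# Hypothetical data: 3 positive pools out of 20, each pool of 10 units.
p_hat = group_testing_mle(3, 20, 10)
print(round(p_hat, 5))
```

With unequal pool sizes or imperfect tests no such closed form is available in general, which is part of the added complexity mentioned above.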

Discrete interval estimation. Confidence intervals for parameters that take discrete rather than continuous values have attracted more research since greater computing power has enabled exact methods to be employed more readily. Interval estimation for a binomial proportion in particular is of interest to researchers and practitioners alike.
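
One well-known exact method for a binomial proportion is the Clopper-Pearson interval. The sketch below is an illustrative implementation that finds the limits by bisection on the binomial tail probabilities rather than through beta quantiles; the function names and the example counts are assumptions made for this illustration.

```python
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def bisect_root(g, iterations=60):
    """Root of a function g that is increasing on (0, 1)."""
    lo, hi = 0.0, 1.0
    for _ in range(iterations):
        mid = (lo + hi) / 2
        if g(mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(x, n, alpha=0.05):
    """Exact two-sided interval for a binomial proportion."""
    # Lower limit solves P(X >= x | p) = alpha / 2 (increasing in p).
    lower = 0.0 if x == 0 else bisect_root(
        lambda p: (1 - binom_cdf(x - 1, n, p)) - alpha / 2)
    # Upper limit solves P(X <= x | p) = alpha / 2 (decreasing in p).
    upper = 1.0 if x == n else bisect_root(
        lambda p: alpha / 2 - binom_cdf(x, n, p))
    return lower, upper

lower, upper = clopper_pearson(3, 10)
print(round(lower, 4), round(upper, 4))
```

Because the data are discrete, the actual coverage of this interval is at least the nominal level and often noticeably higher, which is one of the issues a project in this area could examine.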

Richard Huggins

Statistical analysis of family and twin data. Examine models, maximum likelihood and robust inference for data arising from family and twin studies. The project can consider the theoretical aspects, including nonparametric methods and measurement error models, or concentrate on various practical projects, some of which may be joint projects with researchers in other departments.

Statistical analysis of mark recapture data. Several projects are possible ranging from theoretical properties of models and inference procedures to the fitting of these models to real data sets. Models can range from the classical parametric models to modern nonparametric models for the relationships between covariates and capture probabilities. There is the possibility of collaboration with wildlife researchers.

Owen Jones

Yield Management. Yield management is used in the travel and hotel industries to optimise revenue: as the flight time nears or if a room is still unbooked, the retailer will reduce the price to try to encourage customers to buy. This project will look at how to apply yield management techniques to help a local travel company optimise its revenue.

Scaling stochastic processes. Unusual scaling properties have been observed in a variety of data such as EEG (brainwaves), ECG (heart rate), meteorological, telecommunications and financial data. The search for good models for such data is a hot topic, and so far has produced self-similar, fractal, multi-fractal, multi-scaling and long-range dependent processes. I have a number of applied and theoretical problems one could look at in this area.

Markov models for fish competition. An applied project, using Markov chains to model the behaviour of small fish competing for a limited resource (in this case, rocks to hide under). The challenge is to allow for the size of the fish.

Stochastic optimisation with pruning. This project will first look at stochastic optimisation and later developments such as simultaneous perturbation and common random numbers. It will then look at how branching and pruning ideas can be used to improve the algorithm.

Experimental design. Can we generate experimental designs by constructing measures of how good a design is, and then using numerical optimisation techniques to find the best design? How does this approach compare to traditional experimental design?

Branching process models for fish stocks. Modelling fish stocks poses a number of problems, such as the age structure of the population, large variations in fecundity (birth rate) from year to year, and the effect of harvesting. The variation in birth rate in particular lends itself to a stochastic modelling approach.

Incremental noisy optimisation using a Bayesian surface model. A more speculative project, to look at efficient ways of finding the maximum of a function when you can only observe it approximately, and when each observation is expensive.

Modelling variation in mass spectrometry experiments. Mass spectrometry is an exciting scientific tool, but there is a need for a statistical analysis of how errors appear and are propagated through the measurement process.

Stochastic modelling of a mineral supply problem. A stochastic modelling, simulation and optimisation problem provided by Geoff Robinson, CSIRO. Allowing for variable production and demand, can we demonstrate the feasibility of a proposed scheme for supplying minerals of different grades?

Guoqi Qian

Statistical estimation and testing for data involving missing values. Data collected from observational studies often contain missing values. Ignoring missing values in data analysis may lead to wrong conclusions. The interesting issues here are how to model the data involving missing values, how to feasibly estimate the parameters involved, and whether the missing values are ignorable or non-ignorable. Several projects are available on this topic, concerning various issues and using either real or simulated data. Statistical tools to be used for these projects include the maximum likelihood principle and the Monte Carlo EM algorithm.

Statistical model selection. Model selection is important in almost every application of statistics. But it is also one of the major unresolved problems in statistics. To perform model selection for a given problem one needs to have a model selection criterion. If the number of candidate models is enormous one also needs to have a feasible computing procedure to carry out the search of the best model. Further, derivation of large sample asymptotic properties will provide justification for using a particular model selection method. To reduce the complexity, one can focus on a particular family of statistical models.

Statistical analysis of longitudinal data. Data are longitudinal when a set of variables is repeatedly measured for each individual over time, whereas measurements for different individuals are obtained independently. Longitudinal data are widely seen in epidemiological studies of chronic diseases. Three types of models are available for analysing longitudinal data: marginal models, mixed-effect models and Markov transitional models. Techniques such as generalised estimating equations (GEE), Monte Carlo EM and Markov chain Monte Carlo (MCMC) are often used in these models.

Andrew Robinson

Seed Dispersal. Dispersal is a critical process in invasions by exotic species and in the maintenance of structure within plant and animal communities. A large number of studies in recent years have modelled the rate of spread and pattern within such populations. However, most of these models make arbitrary assumptions about the pdf of dispersal distances. We have one of the most detailed data sets for dispersal: we have mapped x,y coordinates of fruits on the ground after dispersal and x,y,z coordinates of fruits on plants prior to dispersal in wild radish (an agricultural weed). The usual approach is to consider the distances of all dispersed pods from the centre of the plant. However, seeds on the outermost branches of a plant may drop straight to the ground: their dispersal distance is much less than their distance from the plant's centre. Can we estimate the parameters of a pdf for dispersal distance from these data with respect to their point of origin within the parent plant's canopy? What is the most appropriate function to use for this?

Methods of Inference. Statistical inference is not a monolith. There are at least three distinct approaches to statistical inference that support flourishing communities of scientists and decision-makers, and several more in niche areas. A number of these have become established in only the last decade or so. This project will survey the range of statistical inference strategies, identifying the common points and differences, and then apply them to some straightforward modelling problems for estimation and inference: e.g. estimating a mean, regression parameters, analysis of variance, etc. Key to the outcome will be identifying the conceptual underpinnings of the inference: what it is necessary to assume, what it is necessary to believe, and so on. The tools used will be R and the argument mapping software called Reason!able. The outcome will be a comparative survey of inference strategies and tools and an R package to implement them in simple modelling problems.

Environmental Monitoring. Monitoring and assessing environmental events requires a translation from a biological effect to an abstract model, fitting the model, and interpreting its parameter estimates in the context of an asymmetric loss function. The EPA of Victoria wishes to develop models and procedures that will connect toxicity information for specific species to a biological event model, with sparse data, in order to be able to establish guidelines for action. The guidelines will alert the EPA that, for example, the probability of a toxic episode has become unacceptably high. The current strategy uses percentile cutoffs, but the nominated rank and the cutoff are arbitrary.

Big BAF. A new forest inventory technique has been recently introduced, catchily entitled "Big-BAF". Data from such an inventory are hierarchical, and as yet, there seems to be little agreement as to the best way to analyse them. Heuristic evidence suggests that Big-BAF is more efficient than its predecessor VBAR, but the conditions under which it will be true have not yet been explored.

Critical Period Analysis. Critical period analysis is a new technique sometimes used in agriculture to try to discern optimal weed-control strategies. This requires fitting two models to experimental data and estimating where the models cross, and how far apart their asymptotes are. This project considers a recent approach to this problem using maximum likelihood and segmented regression.

Sampling and Estimation. In ecological studies it is often of interest to be able to estimate the age of the oldest tree in a forest. This turns out to be a sampling problem with numerous facets. This project will look at the application of hierarchical extreme-value distributions to the problem.

Size distributions are useful tools for the management of natural resources. They provide managers with important information about the maturity and cohesiveness of the resource in question. This project will use some of the exciting new techniques of Functional Data Analysis to fit and predict from size distributions.

Search engines are a useful tool of the Internet. Key marketing features for search engines are the index size (the number of unique pages that the engine has indexed) and timeliness (how up to date the index is). Using such indicators, this project will apply statistical techniques to quantitatively compare several search engines, such as Sensis, MSN, Yahoo!, and Google.

Felisa Vazquez-Abad

Stochastic Lagrange methods for optimisation under uncertainty. Primal-dual methods for non-linear optimisation are used to find local optima of cost functions. Interesting areas of application deal with systems under uncertainty: telecommunications and transportation networks, inventory systems, etc. A stochastic recursive algorithm is used when the cost function cannot be found exactly, but rather is subject to random error. This project will consider fundamental problems of bias caused by these observation errors.

Implementation of Measure Valued Derivatives. This project will implement and compare various derivative estimation methods for Gaussian processes, notably MVD. Models for a transportation system and a financial problem can be the focal point of the project.

Optimisation under probability constraints. Many telecommunication and risk problems in insurance and finance involve optimisation under a probability constraint. A reformulation of the problem in terms of the percentiles of distributions may yield better numerical properties for optimisation. In particular, there are typically fewer numerical instabilities than in the original problem. In this project we will study the problem of gradient estimation of percentiles for a series of non-linear constraint functions.

Study of the behaviour of some TCP/IP algorithms for congestion control. On-going research with colleagues at Caltech, Los Angeles, which could provide a number of projects for students interested in Internet traffic.

Study of the Melbourne airport car park. On-going project with Natashia Boland, concerned with optimising the design and use of car parks at Melbourne airport. Various sub-problems would be suitable honours projects.

Study of the behaviour of certain policies for Optical Burst Switching networks. On-going research with colleagues at Ecole Polytechnique, Montreal, which could provide projects for students interested in Internet traffic.

Ray Watson

Population processes. Population sizes fluctuate over time and so we like to model them with stochastic processes. This project will consider various classical and more modern models for populations. Factors that should be incorporated include population and environmental pressures, competition, predation, immigration, etc.

Survival analysis (in discrete time). Classical survival analysis estimates the life-span of individuals, assuming that this is a continuous random variable. Many life spans are more naturally modelled as discrete random variables, e.g. the number of times you can use something before it breaks, or the number of IVF treatments before success. This project will look at how survival analysis can be adapted to the discrete setting.
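
As a rough sketch of the discrete setting, the snippet below estimates a discrete-time hazard (the proportion of those still at risk who experience the event at each time) and multiplies the complements to obtain a survival curve, in the style of a life table. The data, the censoring convention (a censored subject is counted as at risk up to and including its recorded time) and the function name are assumptions for illustration.

```python
def discrete_survival(times, observed):
    """Life-table style estimate of S(t) = P(T > t) for integer times.

    times[i]    : last time at which subject i was at risk
    observed[i] : True if the event happened at times[i], False if censored
    """
    survival, s = {}, 1.0
    for t in range(1, max(times) + 1):
        at_risk = sum(1 for u in times if u >= t)
        events = sum(1 for u, e in zip(times, observed) if u == t and e)
        hazard = events / at_risk if at_risk else 0.0
        s *= 1.0 - hazard          # survive time t given survival so far
        survival[t] = s
    return survival

# Hypothetical data: number of treatment cycles until success (or drop-out).
times = [1, 1, 2, 3, 3]
observed = [True, True, False, True, False]
surv = discrete_survival(times, observed)
print(surv)
```

Here the estimated probability of not yet having succeeded is about 0.6 after one cycle and about 0.3 after three, with the censored subject contributing to the risk set only while under observation.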

Epidemic models. Epidemic models describe the spread of disease, and are extremely important for developing strategies for containing outbreaks or for developing effective immunisation programmes. This project will look at various epidemic models and apply them to real data.

Probability of fault detection. Fault detection techniques, for example those used to find structural defects in aircraft wings, can produce both false negatives and false positives. This project will look at problems of calibrating and optimising such techniques.

Modelling (breast) cancer and cancer screening. Given limited resources, how do we most effectively screen for cancer? Answering this question properly requires an understanding of how the cancer behaves and how the screening test behaves, and both of these are highly variable. Various honours projects are available in this area.

Sports statistics. If you want to win you should play the percentages, and to estimate percentages you need statistics. Gamblers have also been known to take a keen statistical interest in sports.

Aihua Xia

Normal approximation to dependent data. A simple model for the total sum of the claims on an insurance portfolio is the well-known Cramér-Lundberg model, under which claims occur according to a renewal process and claim sizes are i.i.d. This project considers the quality of normal approximations to the distribution of the total sum process S(t) and applies the results to some real data sets.

Discrete Central Limit Theorems. Many events, such as call arrivals in a telecommunications system or the positions of palindromes in the DNA sequences of a family of herpes virus genomes, can be modelled as a point process and are inherently discrete. Models for these processes often need to be very complex, which makes it difficult or impossible to apply them in practice, and so we try to find approximations using more tractable discrete models. This project considers a discrete version of the central limit theorem, where we approximate a sum using a discrete random variable rather than a normal random variable.

Association behaviour in non-linear death processes. A death process is a Markov process describing a closed population where individuals die at a rate dependent on the current population size. This project considers the lifespan of each individual in the case where individuals interact.

Walter and Eliza Hall Institute (WEHI)

There are many ongoing projects in bioinformatics, statistical genetics and biostatistics, within which an honours student could work. See http://bioinf.wehi.edu.au. Projects will be jointly supervised by a department member and a member of WEHI. Contact Richard Huggins for details.