Set Cover Algorithms for Very Large Datasets
by Tony Wirth
Abstract: The problem of Set Cover, finding the smallest subcollection of sets that covers some universe, is at the heart of many data and analysis tasks. It arises in a wide range of settings, including operations research, machine learning, planning, data quality and data mining.
Although finding an optimal solution is NP-hard, the greedy algorithm is widely used, and typically finds solutions that are close to optimal.
However, a direct implementation of the greedy approach, which picks the set with the largest number of uncovered items at each step, does not behave well when the input is very large and disk resident. The greedy algorithm must make many random accesses to disk, which are unpredictable and costly compared to linear scans. To scale Set Cover to large datasets, we provide a new algorithm that finds a solution provably close to greedy's, but that is much more efficient to implement on modern disk technology. Our experiments show a ten-fold improvement in speed on moderately-sized datasets, and an even greater improvement on larger datasets.
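As a point of reference, the direct greedy rule described above can be sketched in a few lines of Python. This is an illustrative sketch only (the function name and data representation are my own, not from the paper); it also makes clear why the approach is costly at scale, since every iteration rescans every set against the uncovered items:

```python
def greedy_set_cover(universe, sets):
    """Greedy Set Cover: repeatedly pick the set that covers the most
    still-uncovered elements, until the universe is covered.

    universe: iterable of items; sets: list of Python sets.
    Returns the indices of the chosen sets, in selection order.
    """
    uncovered = set(universe)
    cover = []
    while uncovered:
        # Pick the set with the largest intersection with the uncovered items.
        # Note: this scan over all sets is what becomes many random disk
        # accesses when the collection does not fit in memory.
        best = max(range(len(sets)), key=lambda i: len(sets[i] & uncovered))
        if not sets[best] & uncovered:
            raise ValueError("universe cannot be covered by the given sets")
        cover.append(best)
        uncovered -= sets[best]
    return cover
```

For example, with universe {1,...,5} and sets [{1,2,3}, {2,4}, {3,4}, {4,5}], greedy first takes {1,2,3} (three new items) and then {4,5}, covering everything with two sets.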
This paper is joint work with Graham Cormode and Howard Karloff, both at AT&T Labs-Research, and will appear in CIKM 2010.
Tony Wirth received the BSc (Hons) and MSc degrees from the University of Melbourne, and the MA and PhD degrees from Princeton University.
Tony's research interests center on theoretical computer science, including graph algorithms, approximation algorithms, external-memory algorithms, communication complexity, adaptive sampling, and sequence analysis problems in bioinformatics. Tony joined the Department of Computer Science and Software Engineering as a lecturer in 2005; the day after this seminar, he will become a senior lecturer.
For More Information: contact: David Wood. email: email@example.com