Introduction
Recently, there has been an increasing use of python for numerical computation. Its use has expanded to several fields, including (but not limited to) signal processing, image processing, biology, electromagnetics and astronomy. For numerical computation, python, in combination with packages such as numpy, scipy and matplotlib, is becoming an alternative to specialized numerical software such as matlab and IDL.
Machine learning is one field where python still lacks key tools. In machine learning research, matlab is often used, and many matlab toolboxes are available, such as netlab and the toolboxes by K. Murphy. For people interested in doing machine learning with python, the lack of existing tools is a large barrier to entry. The goal of this proposal is to implement a package that leverages and extends existing tools (numpy and scipy), thus making python a strong contender for use by the machine learning community.
Background and motivation
Machine learning is a large component of what is commonly called Artificial Intelligence (AI). It consists of studying methods that help computers learn from data. It is therefore related to statistical estimation and data mining (which tries to find patterns in data automatically or semi-automatically). Examples of applications are DNA sequencing, speech processing, natural language processing and stock market analysis.
Several types of algorithms exist in current machine learning. Supervised algorithms use labeled training data to create a mapping that can then be used to classify new data. For example, current automatic speech recognition systems (e.g. converting a speech signal into a sequence of words) often require the user to first train the system with their voice. Unsupervised algorithms, on the other hand, try to find structures and patterns in the data automatically; clustering algorithms are the best-known example.
The strength of this proposal is that it will unify a series of existing development efforts and algorithms. By defining a common API we will make these algorithms easier to use together, and provide a natural home for further development of new algorithms. This will help to build a development and user community that will help sustain this code base after the coding project is complete.
The author of this proposal is using statistical clustering for speech processing, in particular speech detection in noisy environments.
Goal
The goal is to implement several classic algorithms used in machine learning, as well as some related visualization and storage tools. Several machine learning algorithms are already implemented in python (svm by Albert Strasheim from the 2006 SOC, gaussian mixtures by the author of this proposal), but they lack a common API and storage model. One of the strengths of the python language is its easy-to-read syntax, and we want the implementation to follow this principle.
We would therefore like to focus on the following issues:
- a core set of simple algorithms, with low level data representation (numpy arrays). Candidate algorithms are basic clustering algorithms (e.g. K-means), EM for mixtures of Gaussians, and support vector machines (SVM). The goal here is to unify the API of existing code.
- a set of functions for data and classification visualization.
- all implemented algorithms should have a straightforward, 100% python implementation. Of course, for efficiency reasons, C implementation of some algorithms may be desirable at a later stage. The goal is to make simple things simple, and complicated things possible; the implementation should be readable by students who would like to study one particular algorithm.
- the high level API should be usable by somebody without a deep knowledge of the underlying algorithms, in order to make the barrier to entry as low as possible.
Once core functionality is complete and works for different kinds of data (speech data in my case, images for neuroimaging - http://neuroimaging.scipy.org), we can add other models.
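As an illustration of what a shared convention might look like, here is a minimal pure-python K-means sketch. The class name, the train/classify method names and the parameters are assumptions made for this example; they are not the API of any existing scipy module or of the planned package. The point is only that every model would accept numpy arrays with one observation per row and expose the same two calls.

```python
import numpy as np


class KMeans:
    """Hypothetical estimator: rows of `data` are observations, columns are features."""

    def __init__(self, k, n_iter=20, seed=0):
        self.k = k
        self.n_iter = n_iter
        self.seed = seed
        self.centers = None

    def train(self, data):
        data = np.asarray(data, dtype=float)
        rng = np.random.default_rng(self.seed)
        # Initialize the centers with k distinct observations picked at random.
        idx = rng.choice(len(data), size=self.k, replace=False)
        self.centers = data[idx].copy()
        for _ in range(self.n_iter):
            labels = self.classify(data)
            for j in range(self.k):
                members = data[labels == j]
                if len(members) > 0:  # keep the old center if a cluster is empty
                    self.centers[j] = members.mean(axis=0)
        return self

    def classify(self, data):
        data = np.asarray(data, dtype=float)
        # Squared Euclidean distance from every observation to every center.
        d2 = ((data[:, None, :] - self.centers[None, :, :]) ** 2).sum(axis=2)
        return d2.argmin(axis=1)


# Two small, well separated groups of 2-D points.
points = np.array([[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]])
labels = KMeans(k=2).train(points).classify(points)
```

A Gaussian mixture or SVM wrapper following the same two-method shape could then be swapped in without changing the calling code, which is the kind of uniformity the list above asks for.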
Implementation
Some packages are already implemented: scipy.cluster (K-means); scipy.sandbox.pyem for the EM algorithm (finite mixtures of Gaussians); scipy.sandbox.svm for SVM. The idea is first to clean up those packages and make them follow similar conventions for data representation, such as row vs column conventions for arrays and data types. Our goal is to make the packages easier to use: scipy.cluster is used, but has generated some confusion and requests for a wider set of options (according to the scipy mailing list).
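One way to enforce a single row vs column convention is to route every input through a small normalizing helper. The sketch below assumes the convention "one observation per row, one feature per column" and a float data type; the function name as_observations and its obs_axis parameter are hypothetical, chosen only for this example.

```python
import numpy as np


def as_observations(data, obs_axis=0):
    """Return a float array of shape (n_observations, n_features).

    obs_axis tells which axis of a 2-D input indexes the observations.
    A 1-D input is treated as n observations of a single feature.
    """
    arr = np.asarray(data, dtype=float)
    if arr.ndim == 1:
        return arr[:, np.newaxis]
    if arr.ndim != 2:
        raise ValueError("expected a 1-D or 2-D array")
    if obs_axis not in (0, 1):
        raise ValueError("obs_axis must be 0 or 1")
    # Transpose column-oriented input so that rows are always observations.
    return arr if obs_axis == 0 else arr.T


# Three observations of two features, stored one observation per column.
column_major = [[1, 2, 3],
                [4, 5, 6]]
rows = as_observations(column_major, obs_axis=1)
```

If each cleaned-up package called such a helper at its entry points, the convention would be checked in one place instead of being re-implemented per algorithm.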
Once the existing algorithms are cleaned up, I will implement a common high level interface. Several well known data-mining algorithms rely on the same abstractions:
- feature: this can be the raw data, or more commonly data derived from the raw data.
- domain: the possible values for the feature. The common domains are finite, integer and continuous. Of course, numerical representation means that no number in memory can really be continuous, but it is a useful and commonly used abstraction for theory.
- class: the labels corresponding to the data.
For example, in speech recognition (more precisely in the acoustic model part of speech recognition), the features are often derived from the spectral representation of the speech signal and are in the continuous domain, while the classes are phonemes (a, i, s, etc.), which form a finite domain.
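The three abstractions above can be written down as a small data structure. This is only a sketch under assumptions: the names Domain, Feature and LabeledData, their fields and the sample values are invented for illustration and do not come from any existing package.

```python
from dataclasses import dataclass
from enum import Enum

import numpy as np


class Domain(Enum):
    FINITE = "finite"
    INTEGER = "integer"
    CONTINUOUS = "continuous"


@dataclass
class Feature:
    name: str
    domain: Domain


@dataclass(eq=False)
class LabeledData:
    features: list        # one Feature per column of `data`
    data: np.ndarray      # shape (n_observations, n_features)
    classes: np.ndarray   # shape (n_observations,), one label per row
    class_values: tuple   # the finite domain of the labels

    def __post_init__(self):
        self.data = np.asarray(self.data)
        self.classes = np.asarray(self.classes)
        if self.data.ndim != 2 or self.data.shape[1] != len(self.features):
            raise ValueError("data must have one column per feature")
        if self.classes.shape != (self.data.shape[0],):
            raise ValueError("need exactly one class label per observation")
        unknown = set(self.classes.tolist()) - set(self.class_values)
        if unknown:
            raise ValueError(f"labels outside the class domain: {sorted(unknown)}")


# Speech example: continuous spectral features, phoneme labels (made-up values).
speech = LabeledData(
    features=[Feature("spectral_coeff_1", Domain.CONTINUOUS),
              Feature("spectral_coeff_2", Domain.CONTINUOUS)],
    data=[[0.12, -1.30], [0.95, 0.40], [-0.50, 2.10]],
    classes=["a", "i", "s"],
    class_values=("a", "i", "s"),
)
```

Keeping the domain next to the data lets a high level interface reject inconsistent input early, before any algorithm runs.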
Packages to leverage
Packages which will be used:
- data representation and supporting algorithms: numpy and scipy
- storage: pytables
- visualization: matplotlib.
numpy and scipy are becoming the standard tools for efficient numerical computation with python: numpy gives python an efficient array structure, and scipy builds on numpy to give higher level scientific tools.
pytables is a package which can be used to store huge amounts of data in python. The advantages of pytables are the following:
- based on a robust, widely used file format, hdf5, which is also usable directly from C and java: http://hdf.ncsa.uiuc.edu/HDF5/.
- the hdf5 format supports nested data, compression, ragged arrays, etc.
- pytables natively supports numpy, and is easy to install (at least on linux and windows).
Timeline
Two milestones are defined.
Milestone 1
- Look at existing packages, to see what kind of problems they solve, and how they do it (in particular, data representation).
- Clean up scipy.sandbox.pyem (use logsumexp for underflow problems, and a method for regularization of covariance; add a classifier method/class; make documentation compatible with the numpy standard; see the TODO in pyem). Move it out of the sandbox.
- Clean up K-means in scipy.cluster (row vs column representation, change distance function, unit tests).
- Clean up scipy.sandbox.svm. Upgrade to the latest version of libsvm. Move it out of the sandbox.
Due *9th July.*
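The two numerical fixes named for pyem in Milestone 1 can be sketched in a few lines. The code below is an illustration, not the pyem implementation: the function names responsibilities and regularize_covariance and the eps default are assumptions, and adding a small constant to the diagonal is only one common way to regularize a covariance matrix (the proposal does not say which method will be used).

```python
import numpy as np


def logsumexp(a, axis=-1):
    """Compute log(sum(exp(a))) along `axis` without underflow."""
    a = np.asarray(a, dtype=float)
    # Shift by the maximum so that the largest term becomes exp(0) = 1.
    m = np.max(a, axis=axis, keepdims=True)
    m = np.where(np.isfinite(m), m, 0.0)  # guard against all -inf inputs
    out = m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))
    return np.squeeze(out, axis=axis)


def responsibilities(log_joint):
    """Posterior p(component | x) from log p(x, component), one row per observation."""
    log_norm = logsumexp(log_joint, axis=1)
    return np.exp(log_joint - log_norm[:, None])


def regularize_covariance(cov, eps=1e-6):
    """Add eps to the diagonal so that the matrix stays invertible."""
    cov = np.asarray(cov, dtype=float)
    return cov + eps * np.eye(cov.shape[0])


# exp(-1000) underflows to 0 in double precision, so the naive
# log(sum(exp(.))) would return -inf here; the shifted version does not.
log_joint = np.array([[-1000.0, -1001.0]])
posterior = responsibilities(log_joint)
```

In the E step of EM for a Gaussian mixture, log_joint would hold the log of each mixture weight plus the log density of each component, so the posterior rows always sum to one even for observations far from every component.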
Milestone 2
- arff reader/writer: arff is a file format for data representation, widely used by weka, a commonly used data-mining system: http://www.cs.waikato.ac.nz/~ml/weka/arff.html
- Basic GUI tools as a scikits project: fix level for probability density (pdf) functions, pdf estimation via kernel for visualization, representation for plotting data vs labels.
- Some "high level" examples with known data sets for clustering: this is really important from the point of view of package evangelism and is the equivalent in scientific programming of screenshots for other software.
- Better implementation of models: scalability for different fields (speech processing, image processing, etc.).
- If time is still available, other models (with feedback from the scipy community, hopefully).
Due *August 20.*
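To give an idea of the arff format mentioned in Milestone 2: a file is a text header (a relation name and a list of attribute declarations) followed by comma-separated data rows. The sketch below reads only that basic layout. The function name parse_arff is hypothetical, and the parser deliberately ignores quoted names, missing values, sparse rows and type conversion, all of which a real reader would have to handle.

```python
def parse_arff(text):
    """Parse a minimal ARFF document into (relation, attributes, rows).

    attributes is a list of (name, type) pairs in declaration order;
    rows is a list of lists of strings, left unconverted.
    """
    relation = None
    attributes = []
    rows = []
    in_data = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("%"):  # skip blank lines and comments
            continue
        if in_data:
            rows.append([field.strip() for field in line.split(",")])
            continue
        lower = line.lower()  # header keywords are case-insensitive
        if lower.startswith("@relation"):
            relation = line.split(None, 1)[1]
        elif lower.startswith("@attribute"):
            _, name, kind = line.split(None, 2)
            attributes.append((name, kind))
        elif lower.startswith("@data"):
            in_data = True
    return relation, attributes, rows


sample = """% a tiny made-up data set
@relation shapes
@attribute width numeric
@attribute color {red,green}
@data
5.0,red
4.5,green
"""
relation, attributes, rows = parse_arff(sample)
```

The attribute declarations map naturally onto the feature and domain abstractions: a numeric attribute is a continuous feature, and a brace-delimited list of values is a finite domain.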
Documentation and development methodology
Bug tracking and source control
Having already contributed non-trivial bug fixes to both numpy and scipy, and being the main author of one package in the scipy sandbox, I already have a good understanding of the scipy codebase and its community, as well as most of their tools: the bug tracking system (based on Trac), the source control system (svn, to which I have had write access for several months) and the test methods (scipy uses its own unit test classes, based on the standard python unittest module). Part of the clean-up will be to implement unit tests (sandbox.svm and sandbox.pyem already have tests, but scipy.cluster does not).
Doc format
The numpy community has recently defined a standard for documenting code, using a reST-derived format. pyem already uses reST for global documentation, but the docstrings need to be improved.
About the author
My name is David Cournapeau, and I am a French national working as a first year PhD student at Kyoto University in Japan (entered in April 2006) in the computer science department. From December 2004 to March 2006, I worked as an engineer researcher at ATR in speech processing.
I have both an engineering degree from the ENST Paris (major in signal processing and statistical estimation, obtained in 2004) and a master's degree awarded by Paris VI university in signal processing, computer science and acoustics applied to music processing (obtained in 2003).
I have been a user of open source software for several years, and over the last year I have become increasingly involved in open source software development. I have contributed several thousand lines of code to scipy (I am the author of pyem, a toolbox for machine learning, included in scipy), and I am the main author of pyaudiolab and pysamplerate, two small python toolboxes useful for audio processing in python.
License
Same as scipy, that is, a BSD-like license (without the advertising clause).
Useful links
Reflections on implementation problems arising with machine learning algorithms:
- Programming Languages for Machine Learning Implementations: http://hunch.net/?p=230
- Which programming language should I use? http://www.cs.ubc.ca/~murphyk/Software/which_language.html
Links for datasets:
Similar software
Similar software in various languages:
- toolboxes by Kevin Murphy (matlab): http://www.cs.ubc.ca/~murphyk/Software/index.html
- torch (c++): http://www.torch.ch/
- pymix (python): http://algorithmics.molgen.mpg.de/pymix.html
Higher level (e.g. can create models graphically, etc.):
- netlab (matlab): http://www.ncrg.aston.ac.uk/netlab/
- orange (C++ + python): http://magix.fri.uni-lj.si/orange/
- weka (java): http://www.cs.waikato.ac.nz/ml/weka/
Bibliography
- Information Theory, Inference, and Learning Algorithms by David MacKay (great for insight)
- Neural Networks for Pattern Recognition by C. Bishop (really good, different approach from MacKay)
- Pattern Classification (2nd ed.) by Richard O. Duda (not great, but covers most classical algorithms)
