Introduction

This is the main page of pymachine, a project to implement machine learning tools for the scipy environment. The project is starting as a Summer of Code project, but we hope that other people will join later to improve it and add functionality. The original wikified proposal can be found here: MachineLearningOriginalProposal. Useful information will also be posted on the pymachine blog: http://pymachine.blogspot.com/.

Outside the proposal, not much information is available yet, since the project is only just starting (28th May 2007). As nothing has been released yet, this page is mainly a brain dump. Once I have something public to release, I will clean this page up.

Requirements

You need the following software to build scikits.learn:

  • python (>= 2.4)
  • numpy (any version > 1.0.1 should do)
  • scipy (>= 0.5.2)
  • setuptools

Getting the code

There is no official release yet and the code is still rough around the edges, but you should at least be able to use the code which was moved from the scipy sandboxes (pyem, svm, ga and ann). To get the code, simply run:

svn co http://svn.scipy.org/svn/scikits/trunk/learn learn.svn

This will give you a directory learn.svn. You can then install the package from inside that directory:

python setup.py install

Note that the package depends on numpy, scipy, as well as setuptools. To use the sandboxes from scipy, you can do:

from scikits.learn.machine import svm

This will give you exactly the same toolbox as from scipy.sandbox import svm.

Note that Python 2.5 is needed for the scikit.

Presentation and wishes from the community

There was quite a lengthy thread on scipy-user about pymachine; I will sum it up at MachineLearningOpeningThread. http://projects.scipy.org/pipermail/scipy-user/2007-May/012146.html

Content

Manifold Learning

  • Dimensionality reduction
  • Multidimensional regression
  • Probabilistic projection

Things to do

Handling large data

One of the main limitations of packages such as Orange or Weka is their inability to handle large data (that is, data which cannot fit in memory). This is a difficult problem to solve, both at the implementation and API levels. Solving it requires:

  • an abstraction at the dataset level: that is, instead of loading all the data at once, it should be possible to load the data block by block.
    • At the IO level, pytables looks like a good candidate (it can handle gigabytes of data).
    • At the API level: ?
  • an abstraction at the learning algorithm level: this depends on the algorithm used. The problem is that most of the time, the algorithms assume all the data are available at once. For k-means and EM, for example, it is possible to extend the algorithm to run in "mini-batch" mode, but I am not sure there is a general approach.
  • parallel execution, clusters: I have absolutely no knowledge of this. This is a hard problem, and it brings a new set of issues of its own.
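As a rough illustration of the first two points, here is a minimal sketch of block-wise loading combined with a mini-batch variant of k-means. Nothing here is part of scikits.learn: the names iter_blocks and minibatch_kmeans are hypothetical, and the data source is assumed to be any object supporting len() and slicing (a numpy array here, but an on-disk array could play the same role). Each center is kept as the running mean of the points assigned to it, so a block can be discarded as soon as it has been processed.

```python
import numpy as np


def iter_blocks(data, block_size):
    """Yield consecutive row blocks of `data`, one block at a time.

    Assumption: `data` supports len() and slicing, so it may be an
    in-memory array or an on-disk array-like object.
    """
    for start in range(0, len(data), block_size):
        yield np.asarray(data[start:start + block_size], dtype=float)


def minibatch_kmeans(blocks, init_centers):
    """One pass of mini-batch k-means over an iterable of blocks.

    Each center is the running mean of all points assigned to it so far.
    """
    centers = np.array(init_centers, dtype=float)
    counts = np.zeros(len(centers))
    for block in blocks:
        # Squared distance from every point in the block to every center.
        d2 = ((block[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        nearest = d2.argmin(axis=1)
        for k in range(len(centers)):
            members = block[nearest == k]
            if len(members) == 0:
                continue
            counts[k] += len(members)
            # Running-mean update: weight the block mean by its share of
            # all the points this center has seen.
            step = len(members) / counts[k]
            centers[k] += (members.mean(axis=0) - centers[k]) * step
    return centers, counts
```

A single pass like this depends strongly on the initial centers; several passes or a better initialisation would be needed in practice, and this does not answer the question of a general approach for other algorithms.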

Datasets concept

A subpackage containing datasets has been started: http://projects.scipy.org/scipy/scikits/browser/trunk/learn/scikits/learn/datasets. This is intended to be released independently of learn in the near future. Some tools have to be implemented for easy manipulation of the data, such as:

  • getting only some attributes
  • getting only some classes of the dataset if class information is available

Some problems off the top of my head:

  • scalability
  • compatibility (with Orange and Weka: ARFF reader, tab-delimited file reader, etc.)
  • common description:
    • each dataset package defines a load function, which returns a dictionary with the keys data, label and class (only data is mandatory).
    • data is a record array with one field per attribute, label[i] is the label index of data[i], and class[label[i]] is the class name of data[i].
    • this convention should be flexible enough for most datasets, while being powerful enough to enable the manipulation mentioned before (a kind of normalization, if the data are seen as a relational database). I should take a deeper look at PyTables and see what kind of "queries" are possible with it, to decide whether it is worthwhile to use it as THE underlying IO engine (it means one more dependency).
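To make the convention concrete, here is a small sketch of what a dataset package could look like, together with the two manipulation tools listed above. The toy values and the helper names select_attributes and select_classes are made up for illustration; only the load function and the data, label and class keys come from the convention itself. The class-selection helper assumes that label and class are both present.

```python
import numpy as np


def load():
    """Return a toy dataset following the data/label/class convention."""
    # data is a record array: one named field per attribute.
    data = np.array(
        [(5.1, 3.5), (4.9, 3.0), (6.3, 3.3), (5.8, 2.7)],
        dtype=[("length", float), ("width", float)],
    )
    label = np.array([0, 0, 1, 1])           # label[i] indexes into class
    classes = np.array(["small", "large"])   # class[label[i]] names data[i]
    return {"data": data, "label": label, "class": classes}


def select_attributes(dataset, names):
    """Return a shallow copy of the dataset with only the named attributes."""
    out = dict(dataset)
    out["data"] = dataset["data"][list(names)]
    return out


def select_classes(dataset, names):
    """Return a shallow copy of the dataset with only the named classes.

    Labels still index the original class array, so class[label[i]]
    remains valid after filtering.
    """
    keep = [i for i, c in enumerate(dataset["class"]) if c in names]
    mask = np.isin(dataset["label"], keep)
    out = dict(dataset)
    out["data"] = dataset["data"][mask]
    out["label"] = dataset["label"][mask]
    return out
```

For example, select_classes(load(), ["large"]) keeps the last two records, and select_attributes(load(), ["length"]) keeps a single field per record. Both helpers work on in-memory arrays only, so they say nothing about the scalability problem.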

Prototype for classifiers

One of the stated goals of the SoC is to have "high level" tools for machine learning. It should be easy enough to have some basic classifier/dataset/trainer classes that use the datasets already available and the training algorithms already available (svm, em, cluster), and that give basic results (using CV, accuracy, etc.).
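The following is only a sketch of what such a high-level interface might look like, not a design decision. The class NearestMeanClassifier, its train/predict methods and the cross_validate function are all hypothetical names, and the classifier is a deliberately trivial stand-in for the real algorithms (svm, em, cluster). For simplicity, the data are assumed to be a plain 2-D float array rather than a record array.

```python
import numpy as np


class NearestMeanClassifier:
    """Toy classifier: predict the class whose training mean is closest."""

    def train(self, data, label):
        self.classes_ = np.unique(label)
        # One mean vector per class, in the order of self.classes_.
        self.means_ = np.array(
            [data[label == c].mean(axis=0) for c in self.classes_]
        )
        return self

    def predict(self, data):
        # Squared distance from every sample to every class mean.
        d2 = ((data[:, None, :] - self.means_[None, :, :]) ** 2).sum(axis=2)
        return self.classes_[d2.argmin(axis=1)]


def cross_validate(make_classifier, data, label, n_folds=5, seed=0):
    """Return the mean accuracy over n_folds shuffled folds."""
    order = np.random.RandomState(seed).permutation(len(data))
    folds = np.array_split(order, n_folds)
    scores = []
    for i, test_idx in enumerate(folds):
        # Train on every fold except the i-th, test on the i-th.
        train_idx = np.concatenate(
            [f for j, f in enumerate(folds) if j != i]
        )
        clf = make_classifier().train(data[train_idx], label[train_idx])
        predicted = clf.predict(data[test_idx])
        scores.append(np.mean(predicted == label[test_idx]))
    return float(np.mean(scores))
```

With this shape, any object exposing train and predict could be passed to cross_validate, for instance cross_validate(NearestMeanClassifier, data, label), which is one possible way of plugging the existing algorithms into common evaluation tools.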