WEKA is a machine learning workbench that supports many activities of machine learning practitioners.
Data preprocessing : In addition to its native file format (ARFF), WEKA supports various other
formats (for instance CSV and Matlab ASCII files) and database connectivity through JDBC.
Data can be filtered by a large number of methods (over 75), ranging from removing particular
attributes to advanced operations such as principal component analysis.
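To make the native format concrete, the following is a minimal sketch of how a small table could be written out as ARFF text. It uses only the Java standard library and does not call WEKA; the class name `ArffSketch`, the method `toArff`, and the sample relation and attribute names are hypothetical and chosen purely for illustration. It assumes all attributes are numeric except a final nominal class attribute.

```java
import java.util.List;

// Illustrative sketch only: renders a table as ARFF text without using WEKA.
// All names here (ArffSketch, toArff, iris_subset, ...) are hypothetical.
class ArffSketch {

    /** Builds ARFF text for numeric attributes followed by one nominal class attribute. */
    static String toArff(String relation, List<String> numericNames,
                         String className, List<String> classValues,
                         List<String[]> rows) {
        StringBuilder sb = new StringBuilder();
        // Header section: relation name, then one declaration per attribute.
        sb.append("@relation ").append(relation).append("\n\n");
        for (String name : numericNames) {
            sb.append("@attribute ").append(name).append(" numeric\n");
        }
        // A nominal attribute lists its allowed values in braces.
        sb.append("@attribute ").append(className)
          .append(" {").append(String.join(",", classValues)).append("}\n\n");
        // Data section: one comma-separated instance per line.
        sb.append("@data\n");
        for (String[] row : rows) {
            sb.append(String.join(",", row)).append("\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
            new String[] {"5.1", "3.5", "setosa"},
            new String[] {"6.2", "2.9", "versicolor"});
        System.out.print(toArff("iris_subset",
            List.of("sepal_length", "sepal_width"),
            "species", List.of("setosa", "versicolor"), rows));
    }
}
```

The header declares each attribute with its type, and the data section lists one instance per line, which is the structure WEKA's loaders expect from an ARFF file.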
Classification : One of WEKA’s drawing cards is the more than 100 classification methods it
contains. Classifiers are divided into “Bayesian” methods (Naive Bayes, Bayesian nets, etc.),
lazy methods (nearest neighbor and variants), rule-based methods (decision tables, OneR,
RIPPER), tree learners (C4.5, Naive Bayes trees, M5), function-based learners (linear regression,
SVMs, Gaussian processes), and miscellaneous methods. Furthermore, WEKA includes
meta-classifiers such as bagging, boosting, and stacking; multiple-instance classifiers; and interfaces
for classifiers implemented in Groovy and Jython.
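As an illustration of the simplest of the rule-based methods named above, the following sketches the idea behind OneR for purely nominal data: for every attribute, build a rule that maps each attribute value to its majority class, then keep the attribute whose rule makes the fewest training errors. This is a standard-library sketch under those assumptions, not WEKA's implementation; the class `OneRSketch` and its methods are hypothetical names, and ties are broken arbitrarily.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative OneR-style sketch for nominal data; not WEKA's implementation.
class OneRSketch {

    /** For one attribute, counts how often each class occurs with each attribute value. */
    static Map<String, Map<String, Integer>> classCounts(String[][] rows, int attr, int classIndex) {
        Map<String, Map<String, Integer>> counts = new HashMap<>();
        for (String[] row : rows) {
            counts.computeIfAbsent(row[attr], k -> new HashMap<>())
                  .merge(row[classIndex], 1, Integer::sum);
        }
        return counts;
    }

    /** Returns the index of the attribute whose one-level rule makes the fewest training errors. */
    static int bestAttribute(String[][] rows, int classIndex) {
        int best = -1;
        int bestErrors = Integer.MAX_VALUE;
        for (int a = 0; a < rows[0].length; a++) {
            if (a == classIndex) continue;
            int errors = 0;
            for (Map<String, Integer> perClass : classCounts(rows, a, classIndex).values()) {
                int total = 0;
                int max = 0;
                for (int c : perClass.values()) {
                    total += c;
                    max = Math.max(max, c);
                }
                errors += total - max; // instances outside the majority class are misclassified
            }
            if (errors < bestErrors) {
                bestErrors = errors;
                best = a;
            }
        }
        return best;
    }

    /** Builds the rule for one attribute: each attribute value maps to its majority class. */
    static Map<String, String> rule(String[][] rows, int attr, int classIndex) {
        Map<String, String> rule = new HashMap<>();
        for (Map.Entry<String, Map<String, Integer>> e : classCounts(rows, attr, classIndex).entrySet()) {
            String majority = null;
            int max = -1;
            for (Map.Entry<String, Integer> c : e.getValue().entrySet()) {
                if (c.getValue() > max) {
                    max = c.getValue();
                    majority = c.getKey();
                }
            }
            rule.put(e.getKey(), majority);
        }
        return rule;
    }

    public static void main(String[] args) {
        // Hypothetical toy data: outlook, humidity, class (play).
        String[][] rows = {
            {"sunny", "high", "no"}, {"sunny", "normal", "no"},
            {"rainy", "high", "yes"}, {"rainy", "normal", "yes"},
            {"overcast", "high", "yes"}};
        int attr = bestAttribute(rows, 2);
        System.out.println("attribute " + attr + " -> " + rule(rows, attr, 2));
    }
}
```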
Clustering : Unsupervised learning is supported by several clustering schemes, including EM-based
mixture models, k-means, and various hierarchical clustering algorithms. Although fewer
methods are available than for classification, most of the classic algorithms are included.
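For readers unfamiliar with k-means, the following is a minimal sketch of the usual assign-then-update iteration (Lloyd's algorithm) on one-dimensional data. It is a standard-library illustration rather than WEKA's clusterer; the class `KMeansSketch`, its methods, and the convention of passing the initial centroids explicitly are assumptions made to keep the example deterministic.

```java
import java.util.Arrays;

// Illustrative one-dimensional k-means sketch; not WEKA's implementation.
class KMeansSketch {

    /** Index of the centroid closest to point p (the first one wins ties). */
    static int nearest(double[] centroids, double p) {
        int best = 0;
        for (int c = 1; c < centroids.length; c++) {
            if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[best])) {
                best = c;
            }
        }
        return best;
    }

    /** Iterates from the given initial centroids and returns the final centroids. */
    static double[] cluster(double[] points, double[] initialCentroids, int maxIterations) {
        double[] centroids = initialCentroids.clone();
        for (int iter = 0; iter < maxIterations; iter++) {
            double[] sums = new double[centroids.length];
            int[] counts = new int[centroids.length];
            // Assignment step: attach every point to its nearest centroid.
            for (double p : points) {
                int c = nearest(centroids, p);
                sums[c] += p;
                counts[c]++;
            }
            // Update step: move each centroid to the mean of its assigned points.
            boolean changed = false;
            for (int c = 0; c < centroids.length; c++) {
                if (counts[c] == 0) continue; // an empty cluster keeps its centroid
                double mean = sums[c] / counts[c];
                if (mean != centroids[c]) {
                    centroids[c] = mean;
                    changed = true;
                }
            }
            if (!changed) break; // converged: no centroid moved
        }
        return centroids;
    }

    public static void main(String[] args) {
        double[] points = {1.0, 2.0, 3.0, 10.0, 11.0, 12.0};
        System.out.println(Arrays.toString(cluster(points, new double[] {0.0, 5.0}, 100)));
    }
}
```

On the toy data in `main`, the two centroids settle on the means of the two visibly separate groups of points.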
Attribute selection : The set of attributes used is essential for classification performance. Various
selection criteria and search methods are available.
Data visualization : Data can be inspected visually by plotting attribute values against the
class, or against other attribute values. Classifier output can be compared to training data
in order to detect outliers and observe classifier characteristics and decision boundaries. For
specific methods there are specialized tools for visualization, such as a tree viewer for any
method that produces classification trees, a Bayes network viewer with automatic layout, and
a dendrogram viewer for hierarchical clustering.
WEKA also includes support for association rule mining, comparing classifiers, data set generation,
facilities for annotated documentation generation for source code, distribution estimation, and data