The Knowledge Base is a dynamic part of the system that can be supplemented and refreshed through the
Intelligent KB Editor. There are two potential sources of knowledge to be discovered for the proposed
system: the analysis of the theoretical background that lies behind the feature extraction and
classification methods, and field experiments.
In the first case, knowledge is formulated by an expert in the area of the specific feature extraction methods
and classification schemes, and then represented as a set of rules by a knowledge engineer in terms of a
knowledge representation language supported by the system. We argue that it is both possible and
reasonable to categorise the facts and rules that are present in the Knowledge Base. Categorisation can be
done according to the way the knowledge has been obtained: was it derived from the analysis of
experimental results or from the domain theory, and was it added automatically by the Intelligent KB Editor or by a
knowledge engineer (who could be a data miner as well)? Another categorisation criterion is the level of
confidence of a rule. The expert can be certain about one fact but may merely hypothesize about another.
In a similar way, consider a rule that has just been generated from the analysis of results of experiments on
artificially generated data sets but has never been verified on real-world data sets, and a rule that has been
verified on a number of real-world problems. These two rules definitely should not have the same level of confidence.
In addition to this "trust" criterion, the categorisation of the rules makes it possible to adapt the system to a
concrete researcher's needs and preferences by giving higher weights to the rules that are of particular interest to that researcher.
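The categorisation criteria above (knowledge source, authorship, confidence, and researcher-specific weight) can be made concrete as a small data structure. The sketch below is purely illustrative: the field names, the example rules, and the scoring formula are assumptions for this sketch, not part of the actual KB representation language.

```python
from dataclasses import dataclass

# Hypothetical sketch of how KB rules might be categorised; all field
# names and values are illustrative, not taken from the actual system.
@dataclass
class KBRule:
    condition: str            # antecedent in the KB representation language
    action: str               # recommended method / parameter setting
    source: str               # "domain_theory" | "artificial_experiments" | "real_world"
    author: str               # "expert" | "kb_editor"
    confidence: float         # 0.0 (mere hypothesis) .. 1.0 (verified fact)
    user_weight: float = 1.0  # researcher-specific preference weight

    def score(self) -> float:
        """Effective trust used when ranking conflicting rules."""
        return self.confidence * self.user_weight

# A rule verified on real-world data outranks one tested only on
# artificially generated data, even when the user weights are equal.
verified = KBRule("n_classes > 5", "use PCA", "real_world", "expert", 0.9)
tentative = KBRule("n_classes > 5", "use ICA", "artificial_experiments", "kb_editor", 0.4)
```

Ranking candidate rules by `score()` then realises both the "trust" criterion (via `confidence`) and the researcher-specific preferences (via `user_weight`) in a single comparison.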
In the second case, a data miner can discover knowledge during the analysis of results obtained from
the experiments, in the form of separate facts, trends, and dependencies. In the same manner, the discovered knowledge is
represented as a set of rules by a knowledge engineer using the knowledge representation language.
Alternatively, the knowledge acquisition process can be automatic, i.e. the knowledge discovery process
is accomplished without any intervention from a human expert. This can be achieved by
deriving new rules and updating the old ones based on the analysis of results obtained during self-run experiments.
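One way the Intelligent KB Editor could update old rules automatically is to revise a rule's confidence from the outcomes of self-run experiments. The function below is a minimal, hypothetical sketch of such an update; the learning-rate scheme is an assumption of this sketch, not the Editor's actual mechanism.

```python
# Hypothetical sketch of automatic rule updating by the Intelligent KB Editor:
# a rule's confidence is shifted toward the success rate observed when the
# rule's recommendation was followed in self-run experiments.
def update_confidence(confidence, successes, trials, learning_rate=0.3):
    """Move confidence toward the observed success rate (illustrative)."""
    if trials == 0:
        return confidence          # no new evidence, keep the old value
    observed = successes / trials
    return confidence + learning_rate * (observed - confidence)
```

For example, a rule with confidence 0.5 that succeeds in 9 of 10 self-run experiments has its confidence raised, while one that succeeds only once in 10 has it lowered.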
In both of the latter cases the problem is one of learning: the Intelligent KB Editor should try to build a
classification or regression model on the meta-data resulting from the experiments. In this context the input
parameters for a classification model are specific data set characteristics together with a classification model's outputs,
which include accuracy, sensitivity, specificity, time complexity, etc. The combination of a feature extraction
method's and a classification model's names with their parameter values represents a class label. When
building a regression model, the meta-data-set attributes are the data set characteristics and the names of the
feature extraction method and the classification model, and one of the model's output characteristics is the attribute
whose (continuous) value has to be predicted.
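The meta-learning setting for the classification case can be sketched as follows. This is a minimal, hypothetical example: each meta-example maps data set characteristics to a class label naming a (feature extraction method, classifier) combination, and a nearest-neighbour lookup stands in for the classification model. The characteristics chosen, the method names, and the scaling constants are all invented for illustration.

```python
import math

# Hypothetical meta-data: data set characteristics -> class label, where the
# label is the name of a (feature extraction method, classifier) combination.
meta_data = [
    # (n_attributes, n_instances, n_classes) -> best-performing combination
    ((10, 500, 2),  "PCA+NaiveBayes"),
    ((80, 3000, 8), "ICA+C4.5"),
    ((40, 1500, 4), "PCA+kNN"),
]

def recommend(characteristics):
    """1-nearest-neighbour over scaled data set characteristics (illustrative)."""
    scales = (100.0, 5000.0, 10.0)   # rough attribute ranges used to normalise
    def dist(a, b):
        return math.sqrt(sum(((x - y) / s) ** 2
                             for x, y, s in zip(a, b, scales)))
    return min(meta_data, key=lambda row: dist(row[0], characteristics))[1]
```

A new data set with 12 attributes, 600 instances, and 2 classes is closest to the first meta-example, so `recommend((12, 600, 2))` returns `"PCA+NaiveBayes"`. The regression variant would instead predict a continuous output characteristic (e.g. accuracy) from the characteristics plus the method names.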
The results obtained at the present stage of research show a high level of complexity in the
dependencies between the data set characteristics and the best-suited scheme for the data mining process.
In order to further develop our understanding, it is necessary to proceed with the following steps:
- Generation of artificial data sets with known characteristics (simple, statistical, and information-theoretic);
- Design of experiments on the generated artificial data sets;
- Derivation of dependencies and definition of the criteria from the obtained results;
- Development of a knowledge base defining a set of rules on the set of obtained criteria;
- Proof of the constructed theory with a set of experiments on real-world data sets.
Thus, three basic research methods are used: the theoretical approach, the constructive
approach, and the experimental approach. These approaches are closely related and are applied in parallel.
The theoretical backgrounds are exploited during the constructive work and the constructions are used for
experimentation. The results of constructive and experimental work are used to refine the theory.
An example of such a procedure can be outlined as follows:
- Generation of artificial data sets with the number of attributes ranging from 2 to 100, the number of
instances from 150 to 5000, the number of classes from 2 to 10, the average correlation
between the attributes from 10% to 90%, the average noisiness of attributes from 10% to 50%,
and the proportion of irrelevant attributes from 10% to 50% of the total number of attributes.
- Design of experiments on the generated artificial data sets, analysing the accuracy and efficiency of
the classification models built with different learning algorithms and different feature extraction
methods. Tuning of the input parameters for each combination is required.
- Analysis of the dependencies and trends between output accuracies and efficiencies, feature
extraction methods and classifiers, their input parameters, and pre-defined data set characteristics.
- Definition of a set of rules that reflect the discovered dependencies and trends.
- Execution of a number of experiments on UCI data sets, using the DSS to select the best-suited feature
extraction method and classifier.
- Addition of the newly derived rules that were successfully validated during the tests on the benchmark
data sets to the knowledge base.
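The first step of this procedure, generating an artificial data set whose parameters are drawn from the ranges listed above, can be sketched as below. The correlation and noise mechanism here is a deliberately simplified assumption for illustration, not the generation procedure used in the study.

```python
import random

# Hypothetical generator for one artificial data set; parameter ranges follow
# the procedure above, but the signal/noise model is a simplified assumption.
def generate_dataset(rng):
    n_attrs = rng.randint(2, 100)             # number of attributes
    n_inst = rng.randint(150, 5000)           # number of instances
    n_classes = rng.randint(2, 10)            # number of classes
    corr = rng.uniform(0.10, 0.90)            # avg correlation between attributes
    noise = rng.uniform(0.10, 0.50)           # avg noisiness of attributes
    n_irrelevant = int(n_attrs * rng.uniform(0.10, 0.50))
    data = []
    for _ in range(n_inst):
        label = rng.randrange(n_classes)
        base = float(label)                   # class-dependent signal
        row = []
        for j in range(n_attrs):
            if j < n_irrelevant:              # irrelevant attributes: pure noise
                row.append(rng.gauss(0.0, 1.0))
            else:                             # correlated, noisy class signal
                row.append(corr * base + noise * rng.gauss(0.0, 1.0))
        data.append((row, label))
    return {"n_attributes": n_attrs, "n_instances": n_inst,
            "n_classes": n_classes, "data": data}

ds = generate_dataset(random.Random(42))
```

Because the true characteristics of each generated data set are known by construction, the subsequent experiments can relate them directly to the observed accuracy and efficiency of each feature extraction method and classifier combination.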