Input Datasets
This project uses four standardized text classification datasets to feed the neural nets and report the efficiency of the algorithms.
The idea is to evaluate both architectures on different datasets. The table below summarizes each dataset: its size, its number of
features and classes, and the distribution of class sizes (mean, smallest class, quartiles, and largest class).
The optimization is done trial by trial: each trial chooses a set of parameters and applies it to each dataset. There are two versions
of each dataset: TFIDF and distance-based meta-features (MF).
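To make the two dataset versions concrete, here is a minimal sketch using only the Python standard library. It is an illustration under assumptions, not the project's actual preprocessing: the function names (`tfidf_vectors`, `centroid_meta_features`), the toy documents, the TF-IDF weighting variant, and the choice of cosine similarity to class centroids as the distance-based meta-feature are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def tfidf_vectors(docs):
    """Return one sparse {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return dot / norm if norm else 0.0


def centroid_meta_features(vectors, labels):
    """Assumed MF variant: one similarity-to-class-centroid value per class."""
    counts = Counter(labels)
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in zip(vectors, labels):
        for term, w in vec.items():
            sums[label][term] += w
    classes = sorted(counts)
    centroids = {
        c: {term: w / counts[c] for term, w in sums[c].items()} for c in classes
    }
    # In a real pipeline the centroids should come from training folds only,
    # otherwise label information leaks into the features.
    return [[cosine(vec, centroids[c]) for c in classes] for vec in vectors]


# Toy data, invented for illustration only.
docs = [
    ["neural", "net", "training"],
    ["neural", "net", "layers"],
    ["stock", "market", "prices"],
    ["stock", "market", "crash"],
]
labels = ["cs", "cs", "finance", "finance"]

vectors = tfidf_vectors(docs)                             # TFIDF version
meta_features = centroid_meta_features(vectors, labels)   # MF version
```

In this sketch the TFIDF version has one dimension per vocabulary term, while the MF version has one dimension per class.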
| Dataset | Size | #Features | #Classes | Mean | Minor Class | 1st Quartile | Median | 3rd Quartile | Major Class |
|---|---|---|---|---|---|---|---|---|---|
| 20NG | 18766 | 61050 | 20 | 938 | 627 | 952 | 978 | 988 | 998 |
| 4UNI | 8274 | 40195 | 7 | 1182 | 13 | 343 | 929 | 1382 | 3757 |
| REUTERS | 13327 | 19590 | 90 | 148 | 2 | 8 | 29 | 91 | 3964 |
| ACM | 24897 | 59990 | 11 | 2263 | 63 | 761 | 2041 | 3278 | 6562 |
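
The class-size columns of the table can be reproduced from a list of labels. The sketch below is an assumption about how such a summary might be computed, not the script used for the table; `class_size_summary` is a hypothetical helper, and because quartile conventions differ between tools, the quartile values it returns may differ slightly from those reported above.

```python
from collections import Counter
from statistics import mean, quantiles


def class_size_summary(labels):
    """Summarize the number of documents per class for one dataset."""
    sizes = sorted(Counter(labels).values())
    # method="inclusive" treats the observed class sizes as the full population.
    # quantiles() requires at least two classes.
    q1, median, q3 = quantiles(sizes, n=4, method="inclusive")
    return {
        "size": len(labels),
        "classes": len(sizes),
        "mean": mean(sizes),
        "minor_class": sizes[0],
        "q1": q1,
        "median": median,
        "q3": q3,
        "major_class": sizes[-1],
    }


# Toy labels with class sizes 1, 2, 3, 4 and 5 (invented for illustration).
toy_labels = ["a"] * 1 + ["b"] * 2 + ["c"] * 3 + ["d"] * 4 + ["e"] * 5
summary = class_size_summary(toy_labels)
```

As a sanity check on the table itself, the Mean column equals Size divided by #Classes, for example 18766 / 20 ≈ 938 for 20NG.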
The datasets used to create all visualizations are derived from the optimization process (with 5-fold cross-validation)
of each set of parameters applied to each dataset version. During the optimization we record every trial (the number of trials is set to 80),
together with the set of parameters tested, the time, and the loss.
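
The trial log described above can be sketched as a simple loop. This is a hedged illustration, not the project's optimizer: it uses plain random search, and `run_trials`, `kfold_indices`, `toy_fit_and_score`, and the parameter names `lr` and `hidden` are all hypothetical. Only the 80 trials, the 5 folds, and the recorded fields (parameters, time, loss) come from the text.

```python
import random
import time


def kfold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, valid_idx) pairs for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def run_trials(fit_and_score, space, n_samples, n_trials=80, k=5, seed=0):
    """Random search: one record per trial with its parameters, time and loss."""
    rng = random.Random(seed)
    records = []
    for trial in range(n_trials):
        params = {name: rng.choice(values) for name, values in space.items()}
        start = time.perf_counter()
        # The same folds are reused for every trial so losses are comparable.
        fold_losses = [
            fit_and_score(params, train, valid)
            for train, valid in kfold_indices(n_samples, k, seed)
        ]
        records.append({
            "trial": trial,
            "params": params,
            "loss": sum(fold_losses) / k,          # mean loss over the folds
            "time": time.perf_counter() - start,   # wall-clock seconds
        })
    return records


def toy_fit_and_score(params, train_idx, valid_idx):
    """Stand-in for training a network on train_idx and scoring on valid_idx."""
    return abs(params["lr"] - 0.01) + 1.0 / params["hidden"]


space = {"lr": [0.1, 0.01, 0.001], "hidden": [64, 128, 256]}
records = run_trials(toy_fit_and_score, space, n_samples=100)
best = min(records, key=lambda r: r["loss"])
```

Each entry of `records` corresponds to one row of the derived dataset used for the visualizations: the trial number, the parameters tested, the elapsed time, and the cross-validated loss.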