This project uses transformer-based techniques to classify scientific Bengali texts into their respective scientific domains. In addition, the performance of traditional machine learning (ML) and deep learning (DL) models is investigated using traditional and local contextual word embedding techniques.
You can find the code for this project on GitHub.
Data Collection
Data was gathered from Bengali academic books, specifically those used for the Secondary School Certificate (SSC) and Higher Secondary Certificate (HSC) exams conducted by the Intermediate and Secondary Education Boards of Bangladesh. The dataset is categorized into six classes: five scientific domains and one non-scientific domain. The classes are 'Physics', 'Chemistry', 'Biology', 'Information and Communication Technology (ICT)', 'Mathematics', and 'Others'. Over a period of two months, 6000 texts were collected, with 1000 texts for each class.
Annotation was primarily based on the subject matter of the source books. However, some texts fell on overlapping class boundaries. To address this, three annotators assigned a label to each data point, and the final label was determined by majority vote. An expert annotator then verified the final labels to produce the definitive dataset. The average kappa value for the dataset is 0.9928, indicating almost perfect agreement on the kappa scale.
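Agreement between annotators of this kind is commonly quantified with Cohen's kappa averaged over annotator pairs. A minimal sketch (the label arrays below are illustrative placeholders, not the real annotations):

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label lists."""
    n = len(a)
    po = sum(x == y for x, y in zip(a, b)) / n  # observed agreement
    ca, cb = Counter(a), Counter(b)
    # Expected chance agreement from each annotator's label marginals
    pe = sum(ca[c] * cb[c] for c in set(a) | set(b)) / (n * n)
    return (po - pe) / (1 - pe)

def average_pairwise_kappa(annotations):
    """Mean kappa over all annotator pairs (here: three annotators)."""
    pairs = [(0, 1), (0, 2), (1, 2)]
    return sum(cohens_kappa(annotations[i], annotations[j]) for i, j in pairs) / len(pairs)

# Hypothetical labels from three annotators on six texts
ann = [
    ["Physics", "Chemistry", "Biology", "ICT", "Math", "Others"],
    ["Physics", "Chemistry", "Biology", "ICT", "Math", "Others"],
    ["Physics", "Chemistry", "Biology", "ICT", "Physics", "Others"],
]
print(round(average_pairwise_kappa(ann), 4))  # → 0.8667 for these toy labels
```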
You can find the dataset on Kaggle.
Data Preprocessing
The preprocessing steps included:
- Text Cleaning: Removing punctuation, numbers, special characters, as well as Greek and English letters from the text.
- Tokenization: Splitting the text into individual words or tokens.
- Stop Words Removal: Removing common words that do not contribute to the text's meaning.
- Vectorization: Converting text into numerical vectors by exploring transformer-based techniques, traditional embedding techniques (BoW, TF-IDF) and local contextual embedding techniques (Word2Vec, FastText, GloVe).
- Padding: Ensuring all text sequences are of equal length by adding special padding tokens where necessary.
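The steps above can be sketched as follows. The regular expressions and the tiny stop-word list are illustrative assumptions, not the actual resources used in the project:

```python
import re

# Illustrative Bengali stop words; a real list would be much larger
STOP_WORDS = {"এবং", "ও", "এই", "যে"}

def clean(text):
    """Remove English/Greek letters, digits, and punctuation, keeping Bengali script."""
    text = re.sub(r"[A-Za-z\u0370-\u03FF]+", " ", text)   # English and Greek letters
    text = re.sub(r"[০-৯0-9]+", " ", text)                # Bengali and ASCII digits
    text = re.sub(r"[^\u0980-\u09FF\s]+", " ", text)      # anything outside the Bengali block
    return re.sub(r"\s+", " ", text).strip()

def tokenize(text):
    """Whitespace tokenization into word tokens."""
    return text.split()

def remove_stop_words(tokens):
    return [t for t in tokens if t not in STOP_WORDS]

def pad(tokens, max_len, pad_token="<pad>"):
    """Truncate or right-pad a token sequence to a fixed length."""
    return (tokens + [pad_token] * max_len)[:max_len]

tokens = remove_stop_words(tokenize(clean("পদার্থবিজ্ঞান এবং রসায়ন 101!")))
print(pad(tokens, 4))  # → ['পদার্থবিজ্ঞান', 'রসায়ন', '<pad>', '<pad>']
```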
Data Analysis
Exploratory data analysis (EDA) was performed to understand the distribution of texts across the different domains. Various visualization techniques such as bar charts, histograms, and word clouds were used to gain insights into the data.
Key findings from the analysis included:
- The total number of words and unique words in the dataset.
- The most frequent words in each class.
- The length distribution of texts in the dataset.
- The most common unigrams, bigrams, and trigrams for each class.
- Word clouds visualizing the most frequent words for each class.
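As an illustration of how per-class n-gram statistics like these can be computed, here is a small sketch; the token lists stand in for real, preprocessed class texts:

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def top_ngrams(texts, n, k=3):
    """The k most frequent n-grams across a class's tokenized texts."""
    counts = Counter()
    for tokens in texts:
        counts.update(ngrams(tokens, n))
    return counts.most_common(k)

# Placeholder tokenized texts for one class
physics_texts = [["বল", "এবং", "গতি"], ["গতি", "এবং", "ত্বরণ"], ["বল", "এবং", "গতি"]]
print(top_ngrams(physics_texts, 2, k=2))
```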
For detailed visualizations and further insights, please refer to the README file on my GitHub repository.
Classification
Several classification models were implemented and evaluated, spanning transformer-based models, traditional machine learning models, and deep learning models:
- Transformers:
Five transformer-based models were implemented: mBERT, Distil mBERT, Bangla BERT, Indic BERT, and XLM-RoBERTa. Initially, the dataset was divided into train, validation, and test sets in a ratio of 70:15:15. The models were trained and validated using the train and validation sets, focusing on hyperparameter optimization during the learning phase. Only the best models were evaluated on unseen data from the test set.
The following hyperparameters were used:

| Hyperparameter | Value |
| --- | --- |
| Dropout | 0.1 |
| Maximum sequence length | 40 |
| Batch size | 32 |
| Number of epochs | 20 |
| Early stopping patience | 5 |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Weight decay | 0.01 |
| Loss function | Cross entropy |

XLM-RoBERTa was the best-performing of the five classifiers, with a macro F1-score of 0.9158 and an accuracy of 0.9156.
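The 70:15:15 split can be reproduced with two stratified calls to scikit-learn's `train_test_split`; a sketch in which the placeholder corpus stands in for the 6000 labelled texts:

```python
from sklearn.model_selection import train_test_split

# Placeholder corpus: real data would be the labelled Bengali texts
texts = [f"text-{i}" for i in range(100)]
labels = [i % 5 for i in range(100)]

# First hold out 30%, then halve it into 15% validation / 15% test
X_train, X_rest, y_train, y_rest = train_test_split(
    texts, labels, test_size=0.30, stratify=labels, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_rest, y_rest, test_size=0.50, stratify=y_rest, random_state=42
)
print(len(X_train), len(X_val), len(X_test))  # → 70 15 15
```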
- Traditional Machine Learning Models: Embeddings from BoW and TF-IDF were used to train four traditional machine learning (ML) models: Naive Bayes (NB), Logistic Regression (LR), Support Vector Machines (SVM), and Random Forest (RF). TF-IDF+SVM gave the best F1-score of 0.9151 and accuracy of 0.9150.
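The best traditional pipeline, TF-IDF features feeding a linear SVM, can be sketched with scikit-learn. The toy two-class corpus below is a placeholder for the preprocessed Bengali texts:

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

# Toy corpus standing in for the real training data
train_texts = ["বল গতি ত্বরণ", "শক্তি বল কাজ", "অম্ল ক্ষার লবণ", "মৌল যৌগ বিক্রিয়া"]
train_labels = ["Physics", "Physics", "Chemistry", "Chemistry"]

# TF-IDF vectorization followed by a linear SVM classifier
model = make_pipeline(TfidfVectorizer(), LinearSVC())
model.fit(train_texts, train_labels)
print(model.predict(["বল এবং গতি"]))  # → ['Physics']
```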
- Deep Learning Models: Embeddings from Word2Vec, FastText, and GloVe were used to train three deep learning (DL) models: Convolutional Neural Network (CNN), Bidirectional Long Short-Term Memory (BiLSTM), and CNN+BiLSTM. FastText+BiLSTM achieved the best F1-score of 0.9165 and accuracy of 0.8722.
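A minimal PyTorch sketch of a BiLSTM classifier of this kind; the vocabulary size and layer dimensions are illustrative assumptions, and in the project the embedding layer would be initialized from pretrained FastText vectors:

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size=5000, embed_dim=100, hidden_dim=64, num_classes=6):
        super().__init__()
        # In practice this layer would load pretrained FastText weights
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.bilstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, num_classes)  # 2x for the two directions

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)              # (batch, seq, embed_dim)
        _, (hidden, _) = self.bilstm(embedded)            # hidden: (2, batch, hidden_dim)
        final = torch.cat([hidden[0], hidden[1]], dim=1)  # concat forward/backward states
        return self.fc(final)                             # (batch, num_classes) logits

model = BiLSTMClassifier()
logits = model(torch.randint(1, 5000, (8, 40)))  # batch of 8 padded length-40 sequences
print(logits.shape)  # → torch.Size([8, 6])
```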
The performance of these models was evaluated using metrics such as accuracy, precision, recall, and F1-score. The transformer-based models, particularly XLM-RoBERTa, demonstrated superior performance in classifying scientific Bengali texts.
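Macro-averaged metrics of the kind reported above can be computed with scikit-learn; the label lists here are illustrative:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Placeholder gold labels and predictions
y_true = ["Physics", "Chemistry", "Biology", "Physics", "ICT", "Others"]
y_pred = ["Physics", "Chemistry", "Biology", "Chemistry", "ICT", "Others"]

accuracy = accuracy_score(y_true, y_pred)
# Macro averaging: compute per-class scores, then take their unweighted mean
precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"accuracy={accuracy:.4f} macro-F1={f1:.4f}")  # → accuracy=0.8333 macro-F1=0.8667
```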
Future Work
There are several directions for future work to enhance this project:
- The dataset can be extended to include more diverse and comprehensive data samples.
- Overlapping class data can be increased to improve the model's ability to handle ambiguous cases.
- Other model variants can be implemented, and additional hyperparameters can be explored to further optimize performance.
Research Publication
The full research work is published in the proceedings of the 4th International Conference on Robotics, Electrical and Signal Processing Techniques 2025. Read the full paper on IEEE Xplore: