Coursework 1:
Submitted for the partial fulfilment of the DSM140 course
By Hendrik Matthys van Rooyen
230221176
Introduction¶
Problem area¶
The area of concern for this text categorization project is scientific literature categorization, utilizing the data provided in the arXiv-10 dataset. This dataset, and the problem area it represents, is central to the rapid and accurate classification of scientific documents, which is particularly important given the volume of literature produced across a wide variety of fields. (Papers with Code - arXiv-10 Dataset, 2022)
An effective categorization model can speed up labeling, thereby enhancing access to both newly published and archival literature. The increasing volume and diversity of scientific literature (‘Number of Academic Papers Published Per Year – WordsRated’, 2023) necessitate such advancements in categorization techniques. As research continues to grow across various scientific domains, the ability to quickly identify and classify documents becomes ever more critical.
The arXiv-10 dataset, representing a broad spectrum of scientific research, offers an ideal testing ground for developing and refining these categorization models. By efficiently categorizing this dataset, researchers can more easily locate relevant literature, thereby accelerating research and discovery processes.
Dataset¶
Origin and Purpose:
The dataset was created by a group of researchers with the goal of facilitating the study and creation of machine learning models. (Papers with Code - arXiv-10 Dataset, 2022)
The dataset is compiled from the arXiv repository, which is itself a large, openly and freely available collection of academic literature. (Index - arXiv info)
Size and Scope:
The dataset contains 100,000 entries, ranging over a wide variety of scientific fields.
Dataset Structure:
The arXiv-10 dataset has its entries equally distributed across 10 classes. Each entry provides three fields: title, abstract, and label.
Data Types and Features:
Text is the main data type in the dataset, with the label being categorical text. The other metadata fields were removed in the preparation of the dataset. (Papers with Code - arXiv-10 Dataset, 2022)
Applications and Utility:
- Ideal for training and evaluating models in text classification, topic modeling, and natural language understanding.
- Serves as a benchmark for comparing the performance of various machine learning models, especially in the domain of academic text analysis.
Challenges and Considerations:
The level of academic language and the specialized terminology used in the papers may pose challenges for model training and accuracy, but the same terminology could also serve as a strong differentiator between classes.
Access and Availability:
The arXiv-10 dataset is freely available for download (Papers with Code - arXiv-10 Dataset) under an open-source license. The master arXiv repository is available through the provided arXiv API, as well as on Kaggle (arXiv Dataset) and on other online repositories (Index - arXiv info), under a specified CC0: Public Domain license.
Ongoing Developments and Updates:
The master arXiv repository is continuously updated with new literature, while arXiv-10 remains largely static.
Objectives¶
The primary objective of this project is to create an effective text classifier tailored for the arXiv-10 dataset.
The project seeks to compare the accuracy of these simple classification methods to benchmarks reported in research papers.
The following classification algorithms from the sklearn library will be implemented for comparison:
- RandomForestClassifier
- SVC
- MultinomialNB
- LogisticRegression
- SVC - linear
- AdaBoostClassifier
As part of the analysis, confusion matrices will be rendered to help with the understanding of where classification problems occur. (Python Machine Learning - Confusion Matrix)
Lastly, the investigation also focuses on how hyperparameter tuning influences the training of a well-performing model.
As previously mentioned, the resulting model could be utilized in the classification of other texts. Therefore, attempts will be made to export the model, so it can hypothetically be utilized at a later stage by interested parties.
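A minimal sketch of what that export could look like, assuming sklearn's recommended joblib persistence; 'final_model' and the filename are hypothetical placeholders:
from joblib import dump, load
# 'final_model' stands in for whichever fitted sklearn estimator is chosen.
dump(final_model, 'arxiv10_classifier.joblib')
restored_model = load('arxiv10_classifier.joblib')  # ready for .predict() later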
Evaluation methodology¶
The method uses accuracy as the primary success metric: the proportion of predictions that match the true labels (for example, 750 correct labels out of 1,000 test samples gives an accuracy of 0.75). This metric is chosen specifically because it is the measure reported by the published benchmarks.
The project compares various classification algorithms from the sklearn library. It examines RandomForestClassifier, SVC, MultinomialNB, LogisticRegression, SVC with a linear kernel, and AdaBoostClassifier. The basis of comparison is their accuracy.
These methods' accuracy is also contrasted with benchmarks established in existing research papers.
To gain a deeper understanding of each classifier, confusion matrices are employed. These matrices are instrumental in discerning not just the general accuracy of the classifier but also the specific errors it commits, such as confusing different classifications.
The effect of hyperparameter tuning on the models' performance is also explored. This involves adjusting the algorithms' parameters to observe their impact on the model's ability to learn from the training data. Finally, well-performing models are persisted for later reuse. (Persistence — joblib 1.4.dev0 documentation)
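A minimal sketch of how such tuning could be carried out, assuming sklearn's GridSearchCV and a hypothetical grid over LogisticRegression's regularization strength (the actual model and parameter values are chosen later in the notebook):
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
# Hypothetical grid; X_train_features and y_train are the vectorized
# features and labels produced in the pre-processing section below.
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=3, scoring='accuracy', n_jobs=-1)
grid_search.fit(X_train_features, y_train)
print(grid_search.best_params_, grid_search.best_score_)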
Data Pre-processing¶
For pre-processing, the data is read from the pre-downloaded arxiv100.csv dataset into a pandas dataframe.
After the nltk stopwords and wordnet resources are downloaded, the reused functions are defined. These include:
- A function used to clean the text
- A function to get some text statistics
- Non-Empty Strings Count
- Average Word Count
- Most Common Word
- Average Character Length of Entries
- Count of Unique Words
It may be worth noting the decision to pre-define some of the reusable components of the text-cleaning function (the compiled regex, stop-word set, and lemmatizer), as this was done to speed up the runtime of said function.
The data is then cleaned by first cleaning the title and abstract columns individually, and then combining their results.
The cleaning process involves:
- Removing non-letters from the text using regex.
- Lemmatizing the text (‘Python | Lemmatization with NLTK’, 2018)
- Removing stop words from the lemmatized text. (‘Removing stop words with NLTK in Python’, 2017)
Once this has been accomplished, the data is split into training and testing sets using the sklearn library's train_test_split.
Finally, the CountVectorizer is used to vectorize the features (Bag of Words) (‘Using CountVectorizer to Extracting Features from Text’, 2020). The vectorizer is fitted on the training data only.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
import nltk
import re
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from collections import Counter
from joblib import dump, load
file_path = 'arxiv100.csv'
data = pd.read_csv(file_path)
data.head()
| | title | abstract | label |
|---|---|---|---|
| 0 | The Pre-He White Dwarfs in Eclipsing Binaries.... | We report the first $BV$ light curves and hi... | astro-ph |
| 1 | A Possible Origin of kHZ QPOs in Low-Mass X-ra... | A possible origin of kHz QPOs in low-mass X-... | astro-ph |
| 2 | The effects of driving time scales on heating ... | Context. The relative importance of AC and D... | astro-ph |
| 3 | A new hard X-ray selected sample of extreme hi... | Extreme high-energy peaked BL Lac objects (E... | astro-ph |
| 4 | The baryon cycle of Seven Dwarfs with superbub... | We present results from a high-resolution, c... | astro-ph |
# Download necessary NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
# Preload reused components in attempt to speed up the clean_text function.
regex = re.compile("[^a-zA-Z]")
stops = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
# Function to clean text data
def clean_text(text):
# Remove non-letter characters and lower case all words
text = regex.sub(" ", text).lower()
# Remove stop words and lemmatize the words
meaningful_words = [lemmatizer.lemmatize(word) for word in text.split() if word not in stops]
return " ".join(meaningful_words)
def text_column_stats(df, column_name):
if column_name not in df.columns:
return
# Filtering out empty or NaN entries
valid_texts = df[column_name].dropna().astype(str)
# Function to calculate average word count
def avg_word_count(texts):
word_counts = texts.apply(lambda x: len(x.split()))
return np.mean(word_counts)
# Function to find the most common word
def most_common_word(texts):
words = ' '.join(texts).split()
most_common = Counter(words).most_common(1)
return most_common[0][0] if most_common else 'No words'
stats = {
'Non-Empty Strings Count': valid_texts.count(),
'Average Word Count': avg_word_count(valid_texts),
'Most Common Word': most_common_word(valid_texts),
'Average Character Length of Entries': valid_texts.apply(len).mean(),
'Count of Unique Words': len(set(' '.join(valid_texts).split()))
}
return stats
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\mvanr\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\mvanr\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
text_column_stats(data, 'title')
{'Non-Empty Strings Count': 100000, 'Average Word Count': 9.96153, 'Most Common Word': 'of', 'Average Character Length of Entries': 77.21297, 'Count of Unique Words': 95785}
text_column_stats(data, 'abstract')
{'Non-Empty Strings Count': 100000, 'Average Word Count': 154.75792, 'Most Common Word': 'the', 'Average Character Length of Entries': 1057.10563, 'Count of Unique Words': 505570}
# Clean the dataset
data['cleaned_title'] = data['title'].apply(clean_text)
data['cleaned_abstract'] = data['abstract'].apply(clean_text)
# Combine title and abstract for text representation
data['combined_text'] = data['cleaned_title'] + " " + data['cleaned_abstract']
# Splitting the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(data['combined_text'], data['label'], test_size=0.2, random_state=42)
text_column_stats(data, 'combined_text')
{'Non-Empty Strings Count': 100000, 'Average Word Count': 107.29445, 'Most Common Word': 'model', 'Average Character Length of Entries': 844.74635, 'Count of Unique Words': 95658}
data['combined_text']
0 pre white dwarf eclipsing binary wasp report f... 1 possible origin khz qpos low mass x ray binary... 2 effect driving time scale heating coronal arca... 3 new hard x ray selected sample extreme high en... 4 baryon cycle seven dwarf superbubble feedback ... ... 99995 semiparametric estimation space time max stabl... 99996 spatial causal analysis wildland fire contribu... 99997 neural conditional event time model event time... 99998 efficient estimation com poisson regression ge... 99999 algcomparison comparing performance graphical ... Name: combined_text, Length: 100000, dtype: object
# Vectorization using Bag of Words model
vectorizer = CountVectorizer(analyzer="word", tokenizer=None, preprocessor=None, stop_words=None, max_features=5000)
X_train_features = vectorizer.fit_transform(X_train)
X_test_features = vectorizer.transform(X_test)
X_train_features.shape, X_test_features.shape
((80000, 5000), (20000, 5000))
Baseline¶
The baseline models described on Papers with Code - arXiv-10 Dataset make use of different model architectures (see table below), with accuracies ranging between 0.746 and 0.794, the most recent result being achieved in 2023.
These benchmark models are, however, different from those implemented in this project, as they make use of more advanced machine learning techniques, such as Transformers and Hierarchical Attention Networks (HAN), while this project focuses on simpler methods.
These benchmarks were chosen because they were achieved on the same data, as maintained on Papers with Code - arXiv-10 Dataset; the Protoformer result is also fairly recent, which provides an up-to-date goal.
Model | Accuracy | Paper | Year |
---|---|---|---|
Protoformer | 0.794 | Protoformer: Embedding Prototypes for Transformers | 2023 |
RoBERTa | 0.779 | RoBERTa: A Robustly Optimized BERT Pretraining Approach | 2019 |
DocBERT | 0.764 | DocBERT: BERT for Document Classification | 2019 |
HAN | 0.746 | Hierarchical Attention Networks for Document Classification | 2019 |
Classification¶
Starting the classification process, a function is created that will be utilized at the end of each model's training to both produce the accuracy statistic and display the confusion matrix for further consideration.
As discussed before, the following models from the sklearn library are then trained:
- RandomForestClassifier
- SVC
- MultinomialNB
- LogisticRegression
- SVC - linear
- AdaBoostClassifier
After consideration of each model's accuracy statistic, as well as the speed at which it trains, a model is chosen on which hyperparameter tuning is performed.
Throughout the training process, some of the well-performing models are stored for future use.
# Function to print the results
def print_results(y_test, predictions):
accuracy = accuracy_score(y_test, predictions)
print("Accuracy:", accuracy)
unique_labels = y_test.unique()
conf_matrix = confusion_matrix(y_test, predictions, labels=unique_labels)
plt.figure(figsize=(10, 7))
sns.heatmap(conf_matrix, annot=True, fmt='g', cmap='Blues', xticklabels=unique_labels, yticklabels=unique_labels)
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.title('Confusion Matrix')
plt.show()
RandomForestClassifier¶
With the random forest model being the first to be trained, it was quite surprising that this simple model outperformed the earlier benchmarks set by RoBERTa (0.779), DocBERT, and HAN, and came close to the benchmark (0.794) set by Protoformer in 2023 as well.
from sklearn.ensemble import RandomForestClassifier
# Instantiate the model
rf_model = RandomForestClassifier(n_estimators=100, random_state=42)
# Train the model
rf_model.fit(X_train_features, y_train)
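A minimal sketch of the evaluation step for this model, mirroring the SVC cell below and reusing the print_results helper defined earlier:
# Evaluate the model (sketch mirroring the SVC evaluation below)
rf_predictions = rf_model.predict(X_test_features)
print_results(y_test, rf_predictions)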
SVC - rbf¶
The SVC model took considerably longer to train, at over 35 minutes, and its rate of prediction was similarly underwhelming, taking much longer as well.
The results, however, do offer some justification: the model beats the Protoformer benchmark as well as the earlier Random Forest Classifier.
from sklearn.svm import SVC
# Instantiate the model
svm_model = SVC()
# Train the model
svm_model.fit(X_train_features, y_train)
SVC()
# Evaluate the model
svm_predictions = svm_model.predict(X_test_features)
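As with the other models, the print_results helper defined earlier can then report the accuracy and render the confusion matrix:
print_results(y_test, svm_predictions)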