A Clickbait Classifier Using Quantum NLP
A simple clickbait classifier using Quantum Natural Language Processing
Introduction
In a previous post I wrote an introduction to Quantum Natural Language Processing (QNLP); in this post I will show how to develop a clickbait classifier using DisCoCat diagrams and quantum classifiers. We will use the Clickbait Dataset from Kaggle.
The process can be divided into the steps described in the sections below:
Transforming sentences into DisCoCat diagrams
Categorical Compositional Distributional (DisCoCat) semantics is a mathematical framework that converts sentences into linear maps, using tensor products to relate the different grammatical types. This is particularly interesting because tensor products can be implemented naturally on quantum computers, which makes DisCoCat a quantum-friendly approach for QNLP.
I am not going to explain this concept in depth because it is rather dense; instead, I will focus on implementing the algorithm in Python, using the lambeq and discopy packages, which are very user friendly. If you want to learn more about the fundamentals of QNLP, I strongly encourage you to study the subject further.
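Before diving into the full script, here is a minimal sketch of the core idea: lambeq's parser turns a single sentence into a DisCoCat diagram that can be drawn and inspected (the headline is just an illustrative example).

from lambeq import BobcatParser

parser = BobcatParser()
# Parse one headline into a DisCoCat (pregroup) diagram and draw it
diagram = parser.sentence2diagram('Pittsburgh Penguins win Stanley Cup')
diagram.draw()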
This is the implementation of the first part of the classifier, which transforms our data into DisCoCat diagrams:
import random
import numpy as np
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
from lambeq import BobcatParser
from lambeq import remove_cups

SEED = 42
random.seed(SEED)
np.random.seed(SEED)

df = pd.read_csv('clickbait_data.csv')

# One-hot targets: [1, 0] for non-clickbait, [0, 1] for clickbait
df['target'] = df['clickbait'].map(lambda x: [1.0, 0.0] if x == 0 else [0.0, 1.0])
df['headline'] = df['headline'].str.lower()

# Expand common contractions before stripping punctuation, otherwise the
# apostrophes would already be gone and nothing would match
contractions = {
    "'s": " is", "you're": "you are", "isn't": "is not", "hasn't": "has not",
    "haven't": "have not", "hadn't": "had not", "'ll": " will",
    "wouldn't": "would not", "mustn't": "must not", "couldn't": "could not",
    "shouldn't": "should not", "must've": "must have", "could've": "could have",
    "you've": "you have", "should've": "should have", "you'd": "you would",
    "i'd": "i would", "-": " ",
}
for old, new in contractions.items():
    df['headline'] = df['headline'].str.replace(old, new, regex=False)

# Remove any remaining punctuation and symbols
df['headline'] = df['headline'].str.replace(r'[^a-zA-Z0-9\s]+', '', regex=True)

# Keep only headlines with 3 to 5 words to limit circuit size
df['size'] = df['headline'].map(lambda x: len(x.split(' ')))
df = df[(df['size'] > 2) & (df['size'] < 6)]
train_data, test_data, train_labels, test_labels = train_test_split(
    df['headline'].tolist(), df['target'].tolist(),
    test_size=0.6, random_state=SEED, stratify=df['target'])
reader = BobcatParser(verbose='text')
raw_train_diagrams = reader.sentences2diagrams(train_data, suppress_exceptions=True)
raw_test_diagrams = reader.sentences2diagrams(test_data, suppress_exceptions=True)
# remove_cups fails for sentences the parser could not handle (returned as
# None when suppress_exceptions=True), so drop the matching labels and data
count = 0
train_diagrams = []
for idx, diagram in enumerate(raw_train_diagrams):
    try:
        train_diagrams.append(remove_cups(diagram))
    except Exception:
        train_labels.pop(idx - count)
        train_data.pop(idx - count)
        count += 1

count = 0
test_diagrams = []
for idx, diagram in enumerate(raw_test_diagrams):
    try:
        test_diagrams.append(remove_cups(diagram))
    except Exception:
        test_labels.pop(idx - count)
        test_data.pop(idx - count)
        count += 1
In this part of the script I did some simple data preparation to clean the text (sorry for the long list of contraction replacements; I am not used to handling the apostrophe in my mother language). I also kept only headlines with 3 to 5 words, because the longer the sentence, the more complex the resulting quantum circuit becomes, and larger circuits add complexity that our limited computational resources cannot handle.
Here the BobcatParser is responsible for converting our sentences into DisCoCat diagrams, and remove_cups is responsible for greatly simplifying these diagrams by removing cups. In practice, this reduces the complexity of the resulting quantum circuit and thus simplifies the optimization task.
This is the simplified diagram of the sentence "Pittsburgh Penguins win Stanley Cup":
Note that the sentence is centered around the word "win", which carries the s wire, representing the sentence type.
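The simplified diagram above can be reproduced with a short sketch, reusing the reader defined in the script:

# Parse the example headline and remove cups to obtain the simplified diagram
simplified = remove_cups(reader.sentence2diagram('pittsburgh penguins win stanley cup'))
simplified.draw()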
Converting diagrams into quantum circuits
These DisCoCat diagrams can be transformed into different types of quantum circuits. The lambeq documentation lists all the available options; here we use the IQP ansatz.
Here is the code that transforms the simplified diagrams into parameterizable quantum circuits:
from lambeq import AtomicType, IQPAnsatz

# One qubit per noun and sentence type, one layer, one parameter per qubit
ansatz = IQPAnsatz({AtomicType.NOUN: 1, AtomicType.SENTENCE: 1},
                   n_layers=1, n_single_qubit_params=1)

# The ansatz may fail for some diagrams; drop the matching labels and data
count = 0
train_circuits = []
for idx, diagram in enumerate(train_diagrams):
    try:
        train_circuits.append(ansatz(diagram))
    except Exception:
        train_labels.pop(idx - count)
        train_data.pop(idx - count)
        count += 1

count = 0
test_circuits = []
for idx, diagram in enumerate(test_diagrams):
    try:
        test_circuits.append(ansatz(diagram))
    except Exception:
        test_labels.pop(idx - count)
        test_data.pop(idx - count)
        count += 1
The IQPAnsatz was parameterized in the simplest way possible: each noun (n) and sentence (s) type gets one qubit, and each qubit has only one parameter to be optimized. This choice was made to reduce the complexity of the optimization task, which is very resource-intensive.
Here we have the circuit for the sentence "Pittsburgh Penguins win Stanley Cup".
Note that each word has only one qubit, except for "win". This happens because one of its parameters is related to the word itself and the other to the sentence.
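To see this in practice, here is a quick sketch (reusing the reader and ansatz defined above) that converts the same example sentence into its parameterized circuit and draws it:

# Diagram -> IQP ansatz circuit for the example headline
example = remove_cups(reader.sentence2diagram('pittsburgh penguins win stanley cup'))
circuit = ansatz(example)
circuit.draw()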
Preparing data for model training
In this step we set up our data for model training. Since we are using quantum device simulators, we face strong constraints on computational power, so we will use a small number of samples for our classifier, in this case 50. If you run this code, you will see that training the model in the next step takes a couple of days, even with only 50 training samples.
df_train = pd.DataFrame(columns=['headline', 'circuits', 'labels'])
df_train['headline'] = train_data
df_train['circuits'] = train_circuits
df_train['labels'] = train_labels
df_train50 = df_train.sample(50)
train_data50 = df_train50['headline'].tolist()
train_circuits50 = df_train50['circuits'].tolist()
train_labels50 = df_train50['labels'].tolist()
joblib.dump(train_data50, 'train_data50_1q.sav')
joblib.dump(train_labels50, 'train_labels50_1q.sav')
joblib.dump(train_circuits50, 'train_circuits50_1q.sav')
joblib.dump(test_data, 'test_data_1q.sav')
joblib.dump(test_labels, 'test_labels_1q.sav')
joblib.dump(test_circuits, 'test_circuits_1q.sav')
Model training
Now that we have our quantum circuits, it is time to train our model. Note that each sentence has its own circuit, but the same word always shares the same parameters across circuits. The optimization task iterates over the circuits to find the parameter values that best fit our clickbait classification.
BATCH_SIZE = 10
EPOCHS = 300
LEARNING_RATE = 0.001
SEED = 42
import random
import numpy as np
import pandas as pd
import joblib
from sklearn.model_selection import train_test_split
random.seed(SEED)
np.random.seed(SEED)
train_circuits = joblib.load('train_circuits50_1q.sav')
train_labels = joblib.load('train_labels50_1q.sav')
test_circuits = joblib.load('test_circuits_1q.sav')
test_labels = joblib.load('test_labels_1q.sav')
from pytket.extensions.qiskit import AerBackend
from lambeq import TketModel
all_circuits = train_circuits + test_circuits
backend = AerBackend()
backend_config = {
    'backend': backend,
    'compilation': backend.default_compilation_pass(2),
    'shots': 256
}
model = TketModel.from_diagrams(all_circuits, backend_config=backend_config)
from lambeq import BinaryCrossEntropyLoss
# Using the builtin binary cross-entropy error from lambeq
bce = BinaryCrossEntropyLoss()
acc = lambda y_hat, y: np.sum(np.round(y_hat) == y) / len(y) / 2 # half due to double-counting
eval_metrics = {"acc": acc}
from lambeq import QuantumTrainer, SPSAOptimizer
trainer = QuantumTrainer(
    model,
    loss_function=bce,
    epochs=EPOCHS,
    optimizer=SPSAOptimizer,
    optim_hyperparams={'a': 0.1, 'c': 0.1, 'A': 0.0004 * EPOCHS},
    evaluate_functions=eval_metrics,
    seed=0
)
from lambeq import Dataset
train_dataset = Dataset(
    train_circuits,
    train_labels,
    batch_size=BATCH_SIZE)
trainer.fit(train_dataset)
model.save('qnlp50_300_1q.lt')
import matplotlib.pyplot as plt
fig, (ax_tl, ax_bl) = plt.subplots(2, 1, sharex=True, sharey='row', figsize=(10, 6))
ax_tl.set_title('Training set')
ax_bl.set_xlabel('Iterations')
ax_bl.set_ylabel('Accuracy')
ax_tl.set_ylabel('Loss')
colours = iter(plt.rcParams['axes.prop_cycle'].by_key()['color'])
range_ = np.arange(1, trainer.epochs + 1)
ax_tl.plot(range(1,len(trainer.train_epoch_costs)+1), trainer.train_epoch_costs, color=next(colours))
ax_bl.plot(range(1, len(trainer.train_eval_results['acc'])+1), trainer.train_eval_results['acc'], color=next(colours))
Here we are using the TketModel from lambeq with the SPSA optimizer, which is commonly used in QML. I chose 300 epochs to achieve good convergence and a reasonable loss at the end of training.
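As a quick sanity check, once training has finished we can inspect the trainable parameters; each symbol corresponds to one word/type parameter that is shared by every circuit containing that word (a small optional sketch):

# Each symbol is a shared, trainable parameter; weights holds their values
print(f'{len(model.symbols)} trainable parameters')
print(list(zip(model.symbols, model.weights))[:5])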
Model testing
In this step we test our trained model on the test data. Keep in mind that the test data contains a significant number of words that were not present in the training data, which will significantly affect the results here.
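To get a feel for how severe this is, a rough sketch can count the test words that never appear in the 50 training headlines, using the files saved earlier:

import joblib

train_data50 = joblib.load('train_data50_1q.sav')
test_data = joblib.load('test_data_1q.sav')
train_vocab = {word for headline in train_data50 for word in headline.split()}
test_vocab = {word for headline in test_data for word in headline.split()}
print(f'{len(test_vocab - train_vocab)} of {len(test_vocab)} test words are unseen in training')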
import random
import numpy as np
import pandas as pd
import joblib
test_circuits = joblib.load('test_circuits_1q.sav')
test_labels = joblib.load('test_labels_1q.sav')
from pytket.extensions.qiskit import AerBackend
from lambeq import TketModel
backend = AerBackend()
backend_config = {
    'backend': backend,
    'compilation': backend.default_compilation_pass(2),
    'shots': 256
}
model = TketModel(backend_config)
model.load('qnlp50_300_1q.lt')
from lambeq import BinaryCrossEntropyLoss
bce = BinaryCrossEntropyLoss()
acc = lambda y_hat, y: np.sum(np.round(y_hat) == y) / len(y) / 2 # half due to double-counting
eval_metrics = {"acc": acc}
y_pred = model(test_circuits)
test_acc = acc(y_pred, test_labels)
print('Test accuracy:', test_acc)
# Probability assigned to the clickbait class (index 1)
y_true = [label[1] for label in test_labels]
pred = [label[1] for label in y_pred]
from sklearn.metrics import precision_score, recall_score, f1_score
test_pr = precision_score(y_true, np.round(pred, 0), average='macro')
test_rc = recall_score(y_true, np.round(pred, 0), average='macro')
test_f1 = f1_score(y_true, np.round(pred, 0), average='macro')
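For completeness, these metrics can be printed alongside the accuracy:

print(f'Precision: {test_pr:.3f}, Recall: {test_rc:.3f}, F1: {test_f1:.3f}')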
If we run this code on the train and test data, we get the following results:
As we can see, there is a huge performance gap between the train and test data. This is expected, since the model was trained on a very small number of sentences. Given that it is not viable to increase the amount of training data with the available resources, closing this gap was never the objective here.
Therefore, we have shown that it is possible to create a simple QNLP clickbait classifier using simulated devices. However, it is not expected to produce good results, since there are critical limitations in using quantum simulators for this type of application.