How to Calculate BLEU Score in Python?

BLEU (Bilingual Evaluation Understudy) is a metric that measures the quality of machine translation output. Although it was originally designed for translation models, it is now used in other natural language processing applications as well, and it is a common evaluation metric for image captioning models. The BLEU score compares a candidate sentence against one or more reference sentences and indicates how well the candidate matches them. The score lies between 0 and 1, and a score of 1 means that the candidate sentence perfectly matches one of the reference sentences.

In this tutorial, we will use the sentence_bleu() function from the nltk library. Let’s get started.

Calculating the BLEU Score in Python

To calculate the BLEU score, we need to provide the reference and candidate sentences in the form of tokens.

We will learn how to do that and compute the score in this section. Let’s start with importing the necessary modules.

from nltk.translate.bleu_score import sentence_bleu

Next, we define the reference sentences as a list. Each sentence must be split into tokens before it is passed to the sentence_bleu() function.

1. Input and Split the Sentences

The sentences in our reference list are:


'this is a dog'
'it is dog'
'dog it is'
'a dog, it is'

We can split them into tokens using the string split() method.


reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split() 
]
print(reference)

Output:


[['this', 'is', 'a', 'dog'], ['it', 'is', 'dog'], ['dog', 'it', 'is'], ['a', 'dog,', 'it', 'is']]

This is what the sentences look like in the form of tokens. Now we can call the sentence_bleu() function to calculate the score.
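One detail in the token lists above is worth noting: split() breaks only on whitespace, so the comma stays attached and 'dog,' is treated as a different token from 'dog'. The following is a minimal sketch of a slightly more careful tokenizer built on the standard re module. The function name simple_tokenize is hypothetical and not part of nltk, and lowercasing the text is an assumption that may not suit every task. Dedicated tokenizers, such as the ones shipped with nltk, are another option.

```python
import re

def simple_tokenize(sentence):
    """Lowercase the sentence, then split it into word tokens and
    single punctuation tokens (hypothetical helper, not an nltk function)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", sentence.lower())

print(simple_tokenize('a dog, it is'))
# ['a', 'dog', ',', 'it', 'is']
```

Whichever tokenizer is chosen, the same one should be applied to both the reference and the candidate sentences.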

2. Calculate the BLEU Score in Python

To calculate the score, use the following lines of code:


candidate = 'it is dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

Output:


BLEU score -> 1.0

We get a perfect score of 1 because the candidate sentence is identical to one of the reference sentences. Let’s try another one.


candidate = 'it is a dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

Output:


BLEU score -> 0.8408964152537145

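The candidate closely resembles the reference sentences, but none of them matches it exactly. In particular, only one of its two three-word sequences ('is a dog') appears in a reference, which pulls the score down to about 0.84.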

3. Complete Code for Implementing BLEU Score in Python


from nltk.translate.bleu_score import sentence_bleu

# Reference sentences, each split into a list of tokens
reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split()
]

# Candidate that is identical to one of the references
candidate = 'it is dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

# Candidate that is close to the references but matches none exactly
candidate = 'it is a dog'.split()
print('BLEU score -> {}'.format(sentence_bleu(reference, candidate)))

4. Calculating the n-gram Score

While matching sentences, you can choose how many consecutive words are matched at once. For example, words can be matched one at a time (1-gram), in pairs (2-gram), or in triplets (3-gram).
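To make the idea of n-grams concrete, here is a small sketch that lists the 1-, 2- and 3-grams of the candidate sentence used in this tutorial. The ngrams helper below is written for illustration with plain Python and is an assumption of this example, not the function that sentence_bleu() uses internally.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

candidate = 'it is a dog'.split()
for n in (1, 2, 3):
    print(n, ngrams(candidate, n))
# 1 [('it',), ('is',), ('a',), ('dog',)]
# 2 [('it', 'is'), ('is', 'a'), ('a', 'dog')]
# 3 [('it', 'is', 'a'), ('is', 'a', 'dog')]
```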

In the sentence_bleu() function, you can pass a weights argument whose entries correspond to the individual n-gram orders.

For example, to calculate each n-gram score individually, you can use the following weights:

  • Individual 1-gram: (1, 0, 0, 0)
  • Individual 2-gram: (0, 1, 0, 0)
  • Individual 3-gram: (0, 0, 1, 0)
  • Individual 4-gram: (0, 0, 0, 1)

The Python code for this is given below:


from nltk.translate.bleu_score import sentence_bleu
reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split() 
]
candidate = 'it is a dog'.split()

print('Individual 1-gram: %f' % sentence_bleu(reference, candidate, weights=(1, 0, 0, 0)))
print('Individual 2-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 1, 0, 0)))
print('Individual 3-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 1, 0)))
print('Individual 4-gram: %f' % sentence_bleu(reference, candidate, weights=(0, 0, 0, 1)))

Output:


Individual 1-gram: 1.000000
Individual 2-gram: 1.000000
Individual 3-gram: 0.500000
Individual 4-gram: 1.000000
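
The individual scores above can be cross-checked by counting n-gram matches by hand. The sketch below computes a clipped ("modified") n-gram precision with collections.Counter: each candidate n-gram is credited at most as many times as it occurs in any single reference. The helper names ngrams and modified_precision are chosen for this illustration, and the code is a simplified approximation of the idea, not nltk's actual implementation.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(references, candidate, n):
    """Fraction of candidate n-grams found in the references, with each
    n-gram's count clipped to its maximum count in any single reference."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0  # candidate is too short to contain any n-gram of this order
    max_ref_counts = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref_counts[gram] = max(max_ref_counts[gram], count)
    clipped = sum(min(count, max_ref_counts[gram])
                  for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())

reference = [
    'this is a dog'.split(),
    'it is dog'.split(),
    'dog it is'.split(),
    'a dog, it is'.split()
]
candidate = 'it is a dog'.split()

precisions = [modified_precision(reference, candidate, n) for n in (1, 2, 3, 4)]
print(precisions)
# [1.0, 1.0, 0.5, 0.0]
```

Under these assumptions the sketch gives precisions of 1.0, 1.0 and 0.5 for the 1-, 2- and 3-grams, in line with the output above, and 0.5 ** 0.25 is about 0.8409, the BLEU score reported earlier for this candidate. For the 4-gram it gives 0.0, because 'it is a dog' occurs in no reference. The 1.000000 printed above therefore appears to reflect how the NLTK version used for this tutorial treated n-gram orders with no matches. Newer NLTK releases are expected to warn about zero n-gram overlaps in this situation and return a value close to 0 unless a smoothing function is supplied.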

By default, the sentence_bleu() function calculates the cumulative 4-gram BLEU score, also called BLEU-4. BLEU-4 weights the four n-gram orders equally, so its weights are (0.25, 0.25, 0.25, 0.25).

Let’s see the BLEU-4 code:


score = sentence_bleu(reference, candidate, weights=(0.25, 0.25, 0.25, 0.25))
print(score)

Output:


0.8408964152537145

That’s the exact score we got earlier without passing the weights argument.
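
Other cumulative scores follow the same pattern: weights such as (1, 0, 0, 0), (0.5, 0.5, 0, 0) and (0.33, 0.33, 0.33, 0) are commonly used for BLEU-1, BLEU-2 and BLEU-3. As a rough illustration of what the weights do, the sketch below combines the individual 1-, 2- and 3-gram precisions reported earlier (1.0, 1.0 and 0.5) with a weighted geometric mean. The helper name combine_precisions is hypothetical, and the brevity penalty that BLEU also applies is left out on the assumption that it equals 1 here, since the candidate has the same length as one of the references.

```python
import math

def combine_precisions(precisions, weights):
    """Weighted geometric mean of n-gram precisions (brevity penalty omitted)."""
    total = 0.0
    for p, w in zip(precisions, weights):
        if w == 0:
            continue  # orders with zero weight do not contribute
        if p == 0:
            return 0.0  # a weighted order with no matches zeroes the product
        total += w * math.log(p)
    return math.exp(total)

# Individual 1-, 2- and 3-gram precisions for 'it is a dog' from the earlier output
precisions = (1.0, 1.0, 0.5)

print('BLEU-1: %.4f' % combine_precisions(precisions, (1, 0, 0)))
print('BLEU-2: %.4f' % combine_precisions(precisions, (0.5, 0.5, 0)))
print('BLEU-3: %.4f' % combine_precisions(precisions, (1 / 3, 1 / 3, 1 / 3)))
# BLEU-1: 1.0000
# BLEU-2: 1.0000
# BLEU-3: 0.7937
```

The scores drop as higher orders are included because longer n-grams are harder to match. Values returned by sentence_bleu() for the same weights may differ slightly depending on the NLTK version and any smoothing in use.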

Conclusion

This tutorial was about calculating the BLEU score in Python. We learned what it is and how to calculate individual and cumulative n-gram BLEU scores. Hope you had fun learning with us!