Part of Speech Augmentation in Markov Chain Next Word Predictor

Stephen Paff

Comp 6720 Class Project

Introduction

This project analyzes the effectiveness of augmenting Markov chains with part-of-speech information when predicting the next word of a sentence, given the sentence so far. The following question guided the research: Could the part of speech of words be used to augment Markov chains and improve their accuracy in predicting the next word in a sentence? Three different methods were tested:

1) Word Markov Chains:

Step 1: Given the order n, it takes the chain of the last n words of the sentence so far (or all of its words, if the sentence so far is shorter than n words) and finds the set of all phrases in the training data that start with that chain.

Step 2: It determines and returns the most common next word among that set as its prediction for the next word.

Sadviya’s description (2018) formed the basis for this process, and others have used a Markov chain process like this as a word predictor and/or text generator (Dejeu 2017, Ward 2019, Shiffman 2016, Yurichev 2019). In this project, it functions as the control method, and the other two methods test different ways of using part of speech to augment the Markov chain next-word prediction. A small illustrative sketch of the lookup follows.
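The sketch below uses a made-up two-sentence corpus and order n = 2; it is illustrative only and separate from the project's implementation, which appears in the Testing section below.

from collections import Counter

# Made-up corpus of two pre-split sentences, and a sentence so far ending in 'she was'
corpus = [['she', 'was', 'very', 'tired'], ['she', 'was', 'very', 'curious']]
chain = ['she', 'was']                       # the last n = 2 words of the sentence so far

# Count every word that follows the chain anywhere in the corpus
next_words = Counter()
for sentence in corpus:
    for i in range(len(sentence) - len(chain)):
        if sentence[i:i + len(chain)] == chain:
            next_words[sentence[i + len(chain)]] += 1

print(next_words.most_common(1)[0][0])       # prints 'very', the most common next word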

2) Excluder:

Step 1: Given the order n, it finds all phrases in the training data whose words have the same combination of parts of speech as the last n words of the sentence so far, using nltk.pos_tag() to determine part of speech (Bird 2019, NLTK 2019, Liberman 2019). For example, if the given sentence so far consists of an adjective + noun, then it finds all phrases in the training data that start with an adjective and a noun.

Step 2: Among this set, it determines the most common next part of speech and establishes the set of all phrases that use this part of speech next. For example, if most of the time a verb follows an adjective and noun, it will find all phrases that are adjective, noun, and then verb.

Step 3: Like the Word Markov Chains method, it then determines the most common next word within this narrowed set as its prediction for the next word.

This method involves excluding candidate words that do not match the most common next part of speech, hence the name “excluder.” A short illustration of the nltk.pos_tag() tagging used in Step 1 follows.
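The snippet below shows the kind of Penn Treebank tags nltk.pos_tag() assigns (exact tags can vary slightly with the tagger version); DT marks a determiner, JJ an adjective, and NN a noun.

import nltk
# nltk and the tagger are also set up in the Imports cell below; repeated here so the snippet stands alone
nltk.download('averaged_perceptron_tagger')
print(nltk.pos_tag(['the', 'white', 'rabbit']))
# typically: [('the', 'DT'), ('white', 'JJ'), ('rabbit', 'NN')]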

3) Multiplier:

Step 1: Like Word Markov Chains, it starts by taking the chain of the last n words of the sentence so far (or all of its words, if the sentence so far is shorter than n words) and finding the set of all phrases in the training data that start with that chain.

Step 2: For each candidate next word w and part of speech p, it calculates p_w and p_p respectively. p_w is the probability that the given word was the next word in the training data, that is, the proportion of times it occurs as the next word in this set. p_p is the probability that the next word is of part of speech p, that is, the proportion of times that the next word in this set is of that part of speech. nltk.pos_tag() was again used to determine part of speech.

Step 3: For each potential next word, it calculates k = p_w * p_p, where p is that word’s part of speech. (k is simply a label for this product; a worked example with made-up counts follows Step 4.)

Step 4: It finds the word with the greatest k as its prediction for the next word.
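The following small calculation, using made-up counts, shows how the multiplier can pick a different word than the plain word chain:

# Made-up counts: the chain matches 10 phrases in the training data.
# Candidate next words: 'ran' (a verb) occurs 4 times; 'rabbit' (a noun) occurs 3 times.
# Across all 10 matches, the next word is a verb 4 times and a noun 6 times.
p_w_ran, p_p_verb = 4 / 10, 4 / 10
p_w_rabbit, p_p_noun = 3 / 10, 6 / 10
k_ran = p_w_ran * p_p_verb           # 0.16
k_rabbit = p_w_rabbit * p_p_noun     # 0.18
# 'rabbit' has the larger k and would be predicted, even though 'ran' is the more
# common next word on its own.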

This project used the text of “Alice in Wonderland” by Lewis Carroll, made available by Project Gutenberg (Project Gutenberg 2008), to test the effectiveness of each method. Each method was trained on the first six chapters of the book and tested on the seventh chapter. The project used word accuracy – that is, the percentage of next words correctly predicted – a common accuracy measure for next-word algorithms (Mirowski and Vlachos 2015; Masse 2018; Dejeu 2017; Yuecheng, Mnih, and Hinton 2008). For each method, order values n from 1 to 9 were tested.
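For example, a method that correctly predicted 140 of 1,000 next words in the test chapter would have a word accuracy of 100 × 140 / 1,000 = 14%.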

Summary of Results: The multiplier performed the same as the control, but the excluder performed slightly worse.

Set-Up

Imports

This section imports the libraries used throughout the project and downloads the NLTK part-of-speech tagger; the Alice in Wonderland text itself is read in the Models section below.

In [1]:
import nltk
nltk.download('averaged_perceptron_tagger')
import pandas as pd
from string import punctuation
[nltk_data] Downloading package averaged_perceptron_tagger to
[nltk_data]     C:\Users\spaff\AppData\Roaming\nltk_data...
[nltk_data]   Package averaged_perceptron_tagger is already up-to-
[nltk_data]       date!

Data Cleaning and Preparation Functions

These are a set of functions intended to clean and prepare the data.

In [2]:
# These are functions needed to clean the data. 

# sentence_markers mark the characters used to end sentences
sentence_markers = ['.', '?', '!', '\n\n']

# For the given string text, clean_data splits it into sentences and cleans the sentences
def clean_data(text): 
    sentences = split_into_sentences(text)
    sentences = replace(sentences, '\n', ' ')
    sentences = replace(sentences, '’', ' ')
    sentences = lower(sentences)
    sentences = remove_punctuation(sentences)
    sentences = remove_starting_spaces(sentences)
    return sentences

# For the given list of strings l, lower converts all of their characters to lowercase. 
def lower(l):
    for x in range(len(l)): 
        l[x] = l[x].lower()
    return l

# For the given list of strings l, remove_punctuation removes all punctuation. 
def remove_punctuation(l):
    new_list = []
    for value in l: 
        new_list.append(''.join(c for c in value if c not in punctuation))
    return new_list

# For the given list of strings l, remove_starting_spaces removes the leading space from any string that starts with a space, turning e.g. " hello" into "hello".
def remove_starting_spaces(l):
    new_l = []
    for value in l:
        new_value = value
        if len(value) > 0 and value[0] == ' ':
            new_value = value[1:]
        new_l.append(new_value)
    return new_l

# For the given list of strings l, this replaces the character char1 with character char2 in all of its strings. 
def replace(l, char1, char2):
    for x in range(len(l)): 
        l[x] = l[x].replace(char1, char2)
    return l

# For the given string s, split_into_sentences splits it into sentences, returning a list of sentences. 
def split_into_sentences(s):
    l = [s]
    for marker in sentence_markers: 
        l2 = []
        for string in l:
            for piece in string.split(marker):
                l2.append(piece)
        l = l2
    return l
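As an illustrative check of the cleaning pipeline on a made-up string (expected output worked out by hand, not a recorded run; note the trailing empty string left by splitting on the final '!'):

clean_data('Alice was beginning to get very tired. Down the Rabbit-Hole!')
# expected: ['alice was beginning to get very tired', 'down the rabbithole', '']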
In [3]:
# These are methods needed to prepare the training and testing data once cleaned. 

# For the given list of sentence word-lists sentence_list, find_most_common_word finds the most common word across all sentences. 
def find_most_common_word(sentence_list): 
    words = [] 
    for sentence in sentence_list: 
        words += sentence
    df = pd.DataFrame(words)
    return df[df.columns[-1]].value_counts().index.tolist()[0]    

# For the given list of strings text_list, this splits each string into a list by space. It returns a list of lists where each list value is a list of words in the corresponding sentence.
def get_sentence_lists(text_list): 
    sentence_splits = []
    for text in text_list:
        split = text.split(' ')
        while '' in split:
            split.remove('')
        sentence_splits.append(split)
    return sentence_splits

'''
For the given list of strings text_list, this splits each string into a list by space and also finds the part of speech tag for each of these words, using nltk.pos_tag().
It returns a two-value tuple. Its first value is a list of lists where each list value is a list of words in the corresponding sentence.
Its second value is a list of lists where each list value is a list of part of speech tags for each word in the corresponding sentence.
'''
def get_sentence_list_and_tags(text_list): 
    sentence_tags = []
    sentence_splits = []
    for text in text_list:
        split = text.split(' ')
        while '' in split:
            split.remove('')
        sentence_splits.append(split)
        l = nltk.pos_tag(split)
        tags = []
        for value in l: 
            tags.append(value[1])
        sentence_tags.append(tags)
    return (sentence_splits, sentence_tags)

# For the given list of words list_of_text, get_sentence_tags returns the list of part-of-speech tags, one per word, using nltk.pos_tag().
def get_sentence_tags(list_of_text): 
    while '' in list_of_text:
        list_of_text.remove('')
    l = nltk.pos_tag(list_of_text)
    tags = []
    for value in l: 
        tags.append(value[1])
    return tags 
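As an illustrative use of the preparation helpers on two made-up cleaned sentences (not a recorded run; the tags shown are what nltk.pos_tag() typically returns and may vary with the tagger version):

sentences = ['she was very tired', 'she was very curious']
words, tags = get_sentence_list_and_tags(sentences)
# words -> [['she', 'was', 'very', 'tired'], ['she', 'was', 'very', 'curious']]
# tags  -> e.g. [['PRP', 'VBD', 'RB', 'JJ'], ['PRP', 'VBD', 'RB', 'JJ']]
find_most_common_word(words)
# -> one of 'she', 'was', or 'very' (each appears twice; value_counts() breaks the tie)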

Models

Importing Training and Testing Data

This section imports the training and testing data and performs basic data cleaning and preparation.

In [4]:
'''
Training Data: Alice in Wonderland Chapters 1-6
Testing Data: Chapter 7 "Alice's Adventure"
'''
training_data = clean_data(open('Alice in Wonderland - Training Data.txt','r').read())
testing_data = clean_data(open('Alice in Wonderland - Testing Data.txt','r').read())

training_data_info = get_sentence_list_and_tags(training_data)
most_common_word = find_most_common_word(training_data_info[0])

Testing

This section conducts the testing for each of the three methods described above.

In [5]:
# excluder implements method 2 from the introduction; phrase_so_far must already be split into a list of words
def excluder(phrase_so_far, sentence_list, sentence_tags): 
    search_tags = get_sentence_tags(phrase_so_far)
    phrases, tags = find_all_instances(sentence_list, search_tags, sentence_tags)
    most_common_next_tag = most_common_final_word(tags)
    phrases_matching_tag = []
    for x in range(len(tags)): 
        if tags[x][-1] == most_common_next_tag: 
            phrases_matching_tag.append(phrases[x])
    instances = find_all_instances(phrases_matching_tag, phrase_so_far)
    return most_common_final_word(instances)

'''

find_all_instances searches sentence_list for all instances of the pattern in search_values.
If sentence_tags is empty, it searches for all phrases that start with the words in search_values. For example, if search_values were ['she', 'was'], it would find all instances of phrases starting with 'she was' in sentence_list, each returned together with the word that follows.
If sentence_tags is not empty, it searches for all phrases whose parts of speech match search_values (e.g. if search_values is of the form adjective + noun, then it finds all instances of an adjective followed by a noun), returning both the matching word phrases and their tags.
Parameter Requirements: 
sentence_list must be a list of lists of strings, where each string is a word in the original text and each list of strings is a sentence in that text. 
search_values must be a list of strings corresponding to the phrase or tag sequence being searched for (e.g. ['she', 'was'])
sentence_tags must either be empty or a list of lists of strings, where each string is the part-of-speech tag corresponding (in order) to a word in sentence_list
'''
def find_all_instances(sentence_list, search_values, sentence_tags = []):
    l_values = len(search_values)
    if sentence_tags == []: 
        instances = [] 
        for sentence in sentence_list:
            l_sentence = len(sentence)
            if l_values > l_sentence: 
                continue
            for x in range(l_sentence - l_values): 
                phrase = sentence[x:x+l_values]
                if phrase == search_values: 
                    if x + l_values != l_sentence: 
                        instances.append(phrase + [sentence[x + l_values]])
                    else: 
                        instances.append(phrase + [''])
        return instances
    else: 
        new_sentence_tags = []
        new_sentence_list = []
        l_values = len(search_values)
        for x in range(len(sentence_tags)): 
            tags = sentence_tags[x]
            sentence = sentence_list[x]
            l_tags = len(tags)
            if l_values > l_tags: 
                continue
            for y in range(l_tags - l_values): 
                tag_phrase = tags[y:y+l_values]
                sentence_phrase = sentence[y:y+l_values]
                if tag_phrase == search_values: 
                    if y + l_values != l_tags: 
                        new_sentence_tags.append(tag_phrase + [tags[y + l_values]])
                        new_sentence_list.append(sentence_phrase + [sentence[y + l_values]])
                    else: 
                        new_sentence_tags.append(tag_phrase + [''])
                        new_sentence_list.append(sentence_phrase + [''])
        return (new_sentence_list, new_sentence_tags)

'''
markov() uses Markov chains to predict the next word given sentence_so_far and training_data. order, do_excluder, and do_multiplier determine which method is used.
Parameters: 
sentence_so_far must be a list of strings containing the words of the sentence up to that point. For example, ['she', 'was'] would represent the partial sentence 'she was...'. 
training_data must be a two-value tuple, as returned by get_sentence_list_and_tags: its first value is the list of sentence word-lists trained on, and its second value is the corresponding list of part-of-speech tag lists.
order is an integer corresponding to the order n, i.e. how many words back the chain goes (see introduction)
do_excluder must be True if the excluder should be used and otherwise False
do_multiplier must be True if the multiplier should be used and otherwise False
'''
def markov(sentence_so_far, training_data, order, do_excluder, do_multiplier): 
    phrase_so_far = sentence_so_far[-1*order:]
    if(do_multiplier): 
        return multiplier(phrase_so_far, training_data[0], training_data[1])
    elif(do_excluder): 
        return excluder(phrase_so_far, training_data[0], training_data[1])
    else: 
        return word_markov(phrase_so_far, training_data[0])

# Given the list of phrase lists list_of_values, most_common_final_word finds the most common final word and returns it; if the list is empty, it falls back to the most common word in the training data. 
def most_common_final_word(list_of_values): 
    df = pd.DataFrame(list_of_values)
    if len(df) == 0: 
        return most_common_word
    return df[df.columns[-1]].value_counts().index.tolist()[0]

'''
Finds the word with the maximum of the formula k = (p_p*p_w) where p_p is the probability that p is the part of speech of the next word and p_w is the probability that w is the next word (see introduction)
Parameters: 
phrase_so_far must be a list of strings, the chain of words used to predict the next word. 
sentence_list must be a list of lists of strings, the sentences of words used as training data
sentence_tags must be a list of lists of strings, the parts of speech of the training data words in sentence_list
'''
def multiplier(phrase_so_far, sentence_list, sentence_tags):
    instances = find_all_instances(sentence_list, phrase_so_far)
    df_instances = pd.DataFrame(instances)
    if(len(df_instances) == 0): 
        return most_common_word
    df = pd.DataFrame(df_instances[df_instances.columns[-1]].value_counts())    
    # wc stands for word count
    df.columns = ['wc']
    df['p_w'] = df['wc'] / len(df)
    # t stands for tags
    df['t'] = get_sentence_tags(df.index)
    # tc stands for tag count
    def determine_tag_count(x): 
        return df['t'].value_counts()[x]
    df['tc'] = df['t'].apply(determine_tag_count)
    df['p_p'] = df['wc'] / len(df)
    df['k'] = df['p_p'] * df['p_w']
    return df[df['k'] == max(df['k'])].index[0]

'''
This conducts the Word Markov Chains process described in the introduction. 
Parameters: 
phrase_so_far must be a list of strings, the chain of words used to predict the next word. 
sentence_list must be a list of lists of strings, the sentences of words used as training data
'''
def word_markov(phrase_so_far, sentence_list):
    instances = find_all_instances(sentence_list, phrase_so_far)
    return most_common_final_word(instances)
    
'''
test conducts the test given testing_sentences and training_data. order, do_excluder, and do_multiplier determine which method is tested.
Parameters: 
testing_sentences must be a list of strings, where each string is a cleaned sentence to test; each sentence is split into words internally.
training_data must be a two-value tuple, as returned by get_sentence_list_and_tags: its first value is the list of sentence word-lists trained on, and its second value is the corresponding list of part-of-speech tag lists.
order is an integer corresponding to the order n, i.e. how many words back the chain goes (see introduction)
do_excluder must be True if the excluder should be tested and otherwise False
do_multiplier must be True if the multiplier should be tested and otherwise False
'''
def test(testing_sentences, training_data, order, do_excluder, do_multiplier):
    correct = 0 
    total = 0 
    testing_sentence_list = get_sentence_lists(testing_sentences)
    for sentence in testing_sentence_list: 
        for x in range(len(sentence)): 
            phrase = sentence[0:x]
            predicted_word = markov(phrase, training_data, order, do_excluder, do_multiplier)
            if predicted_word == sentence[x]: 
                correct+=1
            total+=1
    return 100*correct / total
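With these functions in place, a single next-word prediction can be made as follows (illustrative only and not a recorded run; the word returned depends on the training text):

# predict the word following 'the white' with an order-2 chain, for each of the three methods
markov(['the', 'white'], training_data_info, 2, False, False)   # word Markov (control)
markov(['the', 'white'], training_data_info, 2, True, False)    # excluder
markov(['the', 'white'], training_data_info, 2, False, True)    # multiplier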
In [6]:
# Creates a DataFrame table with the accuracy of each method based on order n and outputs it both in this document and as an Excel file
accuracy_table = pd.DataFrame(columns = ['Word Markov', 'Excluder', 'Multiplier'])
for x in range(1, 10): 
    word = test(testing_data, training_data_info, x, False, False)
    excl = test(testing_data, training_data_info, x, True, False)
    mult = test(testing_data, training_data_info, x, False, True)
    accuracy_table.at[x, 'Excluder'] = excl
    accuracy_table.at[x, 'Word Markov'] = word
    accuracy_table.at[x, 'Multiplier'] = mult
accuracy_table.to_excel('Accuracy Table.xlsx')
accuracy_table
Out[6]:
Word accuracy (%) by order n:

n   Word Markov   Excluder   Multiplier
1   14.0067       8.52787    14.0067
2   12.1963       9.2425     12.1963
3   10.5288       8.43259    10.5288
4   9.14721       7.62268    9.14721
5   8.909         7.62268    8.909
6   8.909         7.62268    8.909
7   8.909         7.62268    8.909
8   8.909         7.62268    8.909
9   8.909         7.62268    8.909

Results and Conclusions

As shown in the table above, the excluder method is less accurate than the control method at every order tested. The multiplier method has exactly the same accuracy as the control (which most likely means that it predicts the same word as the control every time). Markov chains of order 1 are the most accurate, with accuracy diminishing or staying the same as n increases.

Regarding the guiding question of whether using part of speech to augment Markov chains would improve accuracy, the results indicate that it most likely does not, at least with these two methods of doing so. They are either less accurate than, or only as accurate as, not factoring in part of speech at all.

Others have used Markov chains to create new sentences in the “same style” as the original text (Sadviya 2018, Dejeu 2017, Ward 2019, Shiffman 2016, Yurichev 2019). The project’s results could indicate that segmenting by part of speech does not improve how well the generated text corresponds to the original author’s style. In addition to Markov chains, other methods, such as neural networks, have also become popular for next-word prediction, generally with significantly better accuracy (Zweig, Platt, Meek, Burges, Yessenalina, and Liu 2012; Hinton, Srivastava, and Swersky 2016; Mnih and Hinton 2007; Yuecheng, Mnih, and Hinton 2008; Masse 2018; Ghosh and Kristensson 2015; Gangwar, Ghate, and Raj 2018; Mirowski and Vlachos 2015; Woods 2016), and more research would be needed to determine whether there are good strategies for incorporating part of speech into those models to improve accuracy.

Bibliography

Bird, Steven, Ewan Klein, and Edward Loper. 2019. "Categorizing and Tagging Words." In Natural Language Processing with Python. https://www.nltk.org/book/ch05.html.

Dejeu, Alexander. 2017. From “What is a Markov Model” to “Here is how Markov Models Work”. January 29. https://hackernoon.com/from-what-is-a-markov-model-to-here-is-how-markov-models-work-1ac5f4629b71.

Gangwar, Aparna, Pinak Ghate, and Vaibhaw Raj. 2018. "Sentence Completion Using Text Prediction."

Ghosh, Shaona, and Per Ola Kristensson. 2015. "Neural Networks for Text Correction and Completion in Keyboard Decoding." Journal of Latex Class Files 1-14.

Hinton, Geoffrey, Nitish Srivastava, and Kevin Swersky. 2016. "Learning to Predict the Next Word." Edited by Collin McDonnell. February 4. https://www.youtube.com/watch?v=ReUrmqStBd4.

Liberman, Mark. 2019. Alphabetical list of part-of-speech tags used in the Penn Treebank Project. https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html.

Masse, David. 2018. Predicting the Next Word: Back-Off Language Modeling. August 30. https://medium.com/@davidmasse8/predicting-the-next-word-back-off-language-modeling-8db607444ba9.

Mirowski, Piotr, and Andreas Vlachos. 2015. "Dependency Recurrent Neural Language Models for Sentence Completion." Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics 511-517.

Mitchell, Tom M. 1997. Machine Learning. Redmond: McGraw-Hill Science.

Mnih, Andriy, and Geoffrey Hinton. 2007. "Three New Graphical Models for Statistical Language Modelling." Proceedings of the 24th International Conference on Machine Learning 641-648.

NLTK. 2019. Corpus Readers. http://www.nltk.org/howto/corpus.html.

Project Gutenberg. 2008. Project Gutenberg's Alice's Adventures in Wonderland, by Lewis Carroll. http://www.gutenberg.org/ebooks/11.

Sadviya, Ashwin. 2018. Next Word Prediction using Markov Model. March 16. https://medium.com/ymedialabs-innovation/next-word-prediction-using-markov-model-570fc0475f96.

Shiffman, Daniel. 2016. "Markov Chains - Programming with Text." The Coding Train, October 22. https://www.youtube.com/watch?v=v4kL0OHuxXs.

Suwal, Bhushan, and Andrew Savage. 2018. "Improving Sentences with Online Policy Gradient Methods."

Ward, Daniel. 2019. Markov Models for Text Analysis. http://www.stat.purdue.edu/~mdw/CSOI/MarkovLab.html.

Woods, Aubrie. 2016. "Exploiting Linguistic Features for Sentence Completion." Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics 438-442.

Yuecheng, Zhang, Andriy Mnih, and Geoffrey Hinton. 2008. "Improving a statistical language model by modulating the effects of context words." European Symposium on Artificial Neural Networks.

Yurichev, Dennis. 2019. Autocomplete using Markov chains. February 20. https://yurichev.com/blog/markov/.

Zhang, Zheng. 2018. "Sense - Recommendation System based on Affective Interpretations of Social Media Posts: A Proposed User Interface Design." RIT Scholar Works 1-33.

Zweig, Geoffrey, John C Platt, Christopher Meek, Christopher Burges, Ainur Yessenalina, and Qiang Liu. 2012. "Computational Approaches to Sentence Completion." ACL '12 Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics 601-610.