Pythonic Explorations of Wordle and 5 Letter English Words
MANIFESTO font created by Tomaz Leskovec, artwork by Kevin Lease in homage to the word puzzle discussed in the blog.
Some of my friends enjoy doing Wordle (https://www.nytimes.com/games/wordle/index.html), a game created by Josh Wardle in which you try to identify a 5 letter word. This made me think about the lexicon of 5 letter English words and Wordle strategies. For example, which words would in general be better initial guesses? Given a word attempt and the resulting feedback, which word would be the next best attempt? A Google search showed that other people have already published Wordle solvers. Therefore, this is going to be an exercise for my own edification, with the goals of improving my use of Python and enjoying working through this thought experiment.
With 26 letters and 5 positions, there are theoretically 26 to the 5th power, or 11,881,376, five letter combinations. Of course, most of these are not English words, so starting from a list of English words will significantly reduce the possible solution space.
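As a quick check of that arithmetic, here is a short calculation in Python. The variable names are only illustrative.

```python
letters = 26    # letters in the English alphabet
positions = 5   # letter positions in a Wordle word

# each position can hold any of the 26 letters independently
total_combinations = letters ** positions
print(f"{total_combinations:,}")
```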
I obtained a list of English words from https://github.com/dwyl/english-words and used the file words_alpha.txt, which contains 370,103 words.
My first step was to winnow the full list of words down to a list of 5 letter words.
with open("words_alpha.txt") as file:
    allwords = file.read().splitlines()
for i in allwords:
    if len(i) == 5:
        print(i)
This gave me a list of 15,918 five letter words, which I saved as fiveletterwords. That is 4.3% of the full list of 370,103 words. There is a huge difference in size between the nearly 12 million possible 5 letter combinations and the much smaller list of actual five letter English words.
The next questions I asked were: how many of these 5 letter words contain a given letter of the alphabet, and how many times does each letter occur altogether in the 5 letter word set?
import string
fiveletterwords = []
with open("words_alpha.txt") as file:
    allwords = file.read().splitlines()
alphabet_string = string.ascii_lowercase
alphabet_list = list(alphabet_string)
def testletter(letter):
    letter_total_freq = 0      # total occurrences of the letter
    letter_uniq_word_freq = 0  # number of words containing the letter
    for i in fiveletterwords:
        x = i.count(letter)
        letter_total_freq = letter_total_freq + x
        if x > 0:
            letter_uniq_word_freq = letter_uniq_word_freq + 1
    print(letter + " " + str(letter_uniq_word_freq) + " " + str(letter_total_freq))
for i in allwords:
    if len(i) == 5:
        fiveletterwords.append(i)
for letter in alphabet_list:
    testletter(letter)
Output (letter, number of 5 letter words containing the letter, total number of times the letter occurs in the 5 letter word set):
a 7247 8392
b 1936 2089
c 2588 2744
d 2639 2811
e 6728 7800
f 1115 1238
g 1867 1971
h 2223 2284
i 4767 5067
j 372 376
k 1663 1743
l 3923 4246
m 2361 2494
n 3773 4043
o 4613 5219
p 2148 2299
q 139 139
r 4864 5143
s 5871 6537
t 3866 4189
u 3241 3361
v 853 878
w 1160 1171
x 357 361
y 2476 2521
z 435 474
Output list with just the vowels (counting y as a vowel):
a 7247 8392
e 6728 7800
i 4767 5067
o 4613 5219
u 3241 3361
y 2476 2521
Top 10 consonants from the above list, in descending order of the number of words containing them:
s 5871 6537
r 4864 5143
l 3923 4246
t 3866 4189
n 3773 4043
d 2639 2811
c 2588 2744
m 2361 2494
h 2223 2284
p 2148 2299
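Rather than ranking the consonants by eye, the counts can be sorted programmatically. The sketch below pastes in the letter counts from the output above, drops the vowels, and sorts by the number of words containing each letter. Treating y as a vowel follows the vowel list above. The names `letter_counts`, `VOWELS` and `top10` are only illustrative.

```python
# Output of the letter-count script: letter, words containing it, total occurrences
letter_counts = """\
a 7247 8392
b 1936 2089
c 2588 2744
d 2639 2811
e 6728 7800
f 1115 1238
g 1867 1971
h 2223 2284
i 4767 5067
j 372 376
k 1663 1743
l 3923 4246
m 2361 2494
n 3773 4043
o 4613 5219
p 2148 2299
q 139 139
r 4864 5143
s 5871 6537
t 3866 4189
u 3241 3361
v 853 878
w 1160 1171
x 357 361
y 2476 2521
z 435 474"""

VOWELS = set("aeiouy")  # assumption: y is counted as a vowel, as in the vowel list

# parse each line into (letter, word_count, total_count)
rows = []
for line in letter_counts.splitlines():
    letter, word_count, total_count = line.split()
    rows.append((letter, int(word_count), int(total_count)))

# keep the consonants and sort by the number of words containing the letter
consonants = [row for row in rows if row[0] not in VOWELS]
top10 = sorted(consonants, key=lambda row: row[1], reverse=True)[:10]
for letter, word_count, total_count in top10:
    print(letter, word_count, total_count)
```

Sorting on `row[2]` instead would rank the consonants by total occurrences rather than by the number of words containing them.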
My father suggested to me that a guess need not be an actual word. For example, someone could guess the all-vowel 5 letter combination AEIOU, which contains most of the highest frequency letters.
I tried this strategy and found that Wordle rejected AEIOU with "not in word list."
So a Wordle entry has to be an actual word.
One candidate for the best initial Wordle choice would be the word with the greatest chance of sharing at least one letter, at any position, with the Wordle solution.
Or to put it another way: take the letters from each English 5 letter word and ask how many unique 5 letter English words (the word itself included) contain at least one of those letters somewhere in the word (position of match ignored).
import string
fiveletterwords = []
with open("words_alpha.txt") as file:
    allwords = file.read().splitlines()
alphabet_string = string.ascii_lowercase
alphabet_list = list(alphabet_string)
def testword(word):
    counter = 0
    allmatches = []
    unique_list = []
    list_of_letters = list(word)
    # collect every word that contains at least one letter of the guess
    for j in list_of_letters:
        for k in fiveletterwords:
            if j in k:
                allmatches.append(k)
    # drop duplicates so each matching word is counted once
    for x in allmatches:
        if x not in unique_list:
            unique_list.append(x)
    for x in unique_list:
        counter = counter + 1
    print(word + " " + str(counter))
for i in allwords:
    if len(i) == 5:
        fiveletterwords.append(i)
for i in fiveletterwords:
    testword(i)
The top 25 initial word choices predicted by the first method, with their scores, are below. Did you know the meaning of kioea? Me neither. Apparently it was a Hawaiian bird that is now extinct. Interesting, but not useful, because it is not a word that Wordle will allow. Nor will it allow aoife or aueto. I get down to AEONS on my list before I find a word it will take. I think the Wordle lexicon must be somewhat like the Scrabble lexicon. Stoae is the next word from the list that Wordle will take (the plural of stoa, a freestanding colonnade or covered walkway; I looked it up).
kioea 15239
aoife 15227
aueto 15206
aeons 15151
aotes 15130
stoae 15130
arose 15129
oreas 15129
seora 15129
aesir 15123
aries 15123
arise 15123
raise 15123
serai 15123
aloes 15119
alose 15119
osela 15119
solea 15119
ousia 15104
aurei 15055
uraei 15055
hosea 15043
oshea 15043
aisle 15018
elias 15018
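The same score can be computed much faster with sets. The sketch below is an alternative formulation, not the script that produced the list above. It builds, for each letter, the set of words containing that letter, then scores a word by the size of the union of the sets for its letters. The short word list and the names `words_by_letter` and `coverage_score` are illustrative assumptions; in practice `words` would be the full five letter word list.

```python
from collections import defaultdict

# tiny illustrative word list; replace with the full five letter word list
words = ["arose", "raise", "aeons", "stoae", "fuzzy", "jumpy"]

# map each letter to the set of words that contain it
words_by_letter = defaultdict(set)
for w in words:
    for letter in set(w):
        words_by_letter[letter].add(w)

def coverage_score(word):
    """Count words sharing at least one letter with `word` (itself included)."""
    matched = set()
    for letter in set(word):
        matched |= words_by_letter[letter]
    return len(matched)

# print the words from highest to lowest score
for w in sorted(words, key=coverage_score, reverse=True):
    print(w, coverage_score(w))
```

Each word is looked up at most five times here, instead of being compared against every other word in the list.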
A second approach to selecting the best initial Wordle guess was to take each five letter word and, for each of its 5 letters, count how many words in the 5 letter list have that same letter at the exact same position, with one point given for each match. A higher score means that the word's letters match the letters at the same positions in more of the 5 letter words.
import string
fiveletterwords = []
with open("words_alpha.txt") as file:
    allwords = file.read().splitlines()
alphabet_string = string.ascii_lowercase
alphabet_list = list(alphabet_string)
def testword(word):
    wordscore = 0
    list_of_letters = list(word)
    for index1, value1 in enumerate(list_of_letters):
        for k in fiveletterwords:
            secondwordlistofletters = list(k)
            for index2, value2 in enumerate(secondwordlistofletters):
                # one point for the same letter at the same position
                if index1 == index2:
                    if value1 == value2:
                        wordscore = wordscore + 1
    print(word + " " + str(wordscore))
for i in allwords:
    if len(i) == 5:
        fiveletterwords.append(i)
for i in fiveletterwords:
    testword(i)
To sort the output (read here from the file wordscore_list2.txt, one word and score per line):
import string
five_word_dict = {}
with open("wordscore_list2.txt") as file:
    fivescores = file.read().splitlines()
for i in fivescores:
    info = []
    info = i.split(' ', 1 )
    five_word_dict[info[0]] = info[1]
sort_orders = sorted(five_word_dict.items(), key=lambda x: int(x[1]), reverse=True)
for i in sort_orders:
    print(i[0], i[1])
Top 25 initial word choices by the second method with their scores:
sanes 11579
sales 11401
sores 11295
cares 11268
bares 11213
sates 11124
tares 11053
pares 11016
sones 10989
seres 10984
canes 10962
mares 10921
banes 10907
dares 10873
sades 10855
soles 10811
sages 10802
sabes 10787
fares 10756
lares 10751
bales 10729
panes 10710
saris 10697
sires 10683
cores 10678
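Because each point depends only on a letter and its position, the same score can also be obtained by first counting how often each letter occurs at each position and then summing five lookups per word. This is a sketch of that alternative, not the script that produced the list above. The short word list and the names `position_counts` and `position_score` are illustrative assumptions.

```python
from collections import Counter

# tiny illustrative word list; replace with the full five letter word list
words = ["sanes", "sales", "cares", "bares", "kioea"]

# position_counts[p][letter] = number of words with `letter` at position p
position_counts = [Counter() for _ in range(5)]
for w in words:
    for p, letter in enumerate(w):
        position_counts[p][letter] += 1

def position_score(word):
    """Sum, over the five positions, the words with the same letter there."""
    return sum(position_counts[p][letter] for p, letter in enumerate(word))

# print the words from highest to lowest score
for w in sorted(words, key=position_score, reverse=True):
    print(w, position_score(w))
```

As in the original script, a word's match with itself is included in its score.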
Once an initial guess is made, the results can be used to filter the 5 letter English word set to help make the next guess.
There are four filters to apply to our list of words based on the results of an attempt:
1. Letters present at some position in the word
2. Letters absent from any position in the word
3. Letter present but excluded from a position in the word
4. Letter present and must be at a position in the word
I have created a Python script to filter the word list by these rules:
#kl_python_word_filter.py by Kevin Lease 2022
#licensed under a Creative Commons
# Attribution-NonCommercial-ShareAlike 4.0 International License
import string
fiveletterwords = []
keepwords = []
keepwords2 = []
wordchars = []
keep_letters = ['o','u', 't']
#keep_letters = ['a', 'r', 'l']
remove_letters = ['p', 'i', 's','y','h']
#remove_letters = ['e', 'o', 'n', 's', 'g', 'v', 'y', 'i', 'b']
# positions are indexed from 0
remove_at_position = ['o2','u3','o1','u2']
#remove_at_position = ['a0', 'r1', 'a2', 'l0']
#keep_at_position = []
keep_at_position = []
with open("words_alpha.txt") as file:
    allwords = file.read().splitlines()
for i in allwords:
    if len(i) == 5:
        fiveletterwords.append(i)
print("five letter words:" + str(len(fiveletterwords)))
def keepword(word, keep_letters):
    # filter 1: keep the word only if it contains every required letter
    global keepwords
    flag = 1
    for j in keep_letters:
        if j not in word:
            flag = 0
    if flag == 1:
        keepwords.append(word)
def remove_letter_func(letter):
    # filter 2: drop words that contain an absent letter
    global keepwords
    global keepwords2
    for word in keepwords:
        if letter not in word:
            if word not in keepwords2:
                keepwords2.append(word)
    keepwords.clear()
    keepwords = [i for i in keepwords2]
    keepwords2.clear()
def remove_letter_at_position(letter, index):
    # filter 3: drop words that have the letter at an excluded position
    global keepwords
    global keepwords2
    global wordchars
    keepwords2.clear()
    for word in keepwords:
        wordchars.clear()
        wordchars = list(word)
        if wordchars[int(index)] == letter:
            pass
            #print(word + " " + wordchars[value] + " " + key)
            #keepwords.remove(word)
        if wordchars[int(index)] != letter:
            if word not in keepwords2:
                keepwords2.append(word)
    keepwords.clear()
    keepwords = [i for i in keepwords2]
def keep_letter_at_position(letter, index):
    # filter 4: keep only words that have the letter at the required position
    global keepwords
    global keepwords2
    global wordchars
    keepwords2.clear()
    for word in keepwords:
        wordchars.clear()
        wordchars = list(word)
        if wordchars[int(index)] == letter:
            keepwords2.append(word)
    keepwords.clear()
    keepwords = [i for i in keepwords2]
if keep_letters: #check to make sure list not empty
    for i in fiveletterwords:
        keepword(i, keep_letters)
    print("words containing keep letters:" + str(len(keepwords)))
else:
    # with no required letters, start from the full list so later filters have input
    keepwords = list(fiveletterwords)
if remove_letters: #check to make sure list not empty
    for letter in remove_letters:
        remove_letter_func(letter)
    print("words left after removing absent letters:" + str(len(keepwords)))
if remove_at_position: #check to make sure list not empty
    for pair in remove_at_position:
        pairlist = list(pair)
        letter = pairlist[0]
        index = pairlist[1]
        #print ("letter:" + letter + " " + "index:" + str(index))
        remove_letter_at_position(letter, index)
    print("words after removing letters by position:" + str(len(keepwords)))
if keep_at_position: #check to make sure list not empty
    for pair in keep_at_position:
        pairlist = list(pair)
        letter = pairlist[0]
        index = pairlist[1]
        keep_letter_at_position(letter, index)
    print("words after keeping letters by position:" + str(len(keepwords)))
for i in keepwords:
    print(i)
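The four filters can also be written as a single test applied to each word. The following is a compact sketch of that idea rather than a replacement for the script above. The function name `matches_feedback` and the tiny word list are illustrative assumptions, and the position filters use the same letter-plus-index strings (for example 'o2') as the script.

```python
def matches_feedback(word, keep_letters, remove_letters,
                     remove_at_position, keep_at_position):
    """Return True if `word` passes all four filters."""
    # 1. every required letter appears somewhere in the word
    if any(letter not in word for letter in keep_letters):
        return False
    # 2. no absent letter appears anywhere in the word
    if any(letter in word for letter in remove_letters):
        return False
    # 3. a letter must not sit at a position where it was ruled out
    if any(word[int(pair[1])] == pair[0] for pair in remove_at_position):
        return False
    # 4. a letter must sit at each position where it was confirmed
    if any(word[int(pair[1])] != pair[0] for pair in keep_at_position):
        return False
    return True

# tiny illustrative word list; replace with the full five letter word list
words = ["outdo", "tutor", "about", "mouth", "stout"]
candidates = [
    w for w in words
    if matches_feedback(w, ['o', 'u', 't'], ['p', 'i', 's', 'y', 'h'],
                        ['o2', 'u3', 'o1', 'u2'], [])
]
print(candidates)
```

Passing the filter lists as arguments avoids the global variables, and an empty list simply means that filter removes nothing.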
Looking over the list of possible word solutions at earlier stages, there are many words with which I am unfamiliar, which reminds me of how much I don't know and to be humble.
However, if I look at the historical Wordle answers for this month, all of the solutions are words in my personal lexicon. So another filter could be built on this principle: a 'familiarity index.' The solution that springs to mind is to take one or more freely available books from Project Gutenberg, parse the text into words, and keep the 5 letter words as a list. Theoretically, this could be used to narrow the list of 5 letter words to one more likely to contain Wordle solutions.
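A minimal sketch of that idea follows. It has not been tried against real books: the sample string stands in for the text of a book, and the names `five_letter_words_in`, `familiar` and `likely` are hypothetical.

```python
import re

def five_letter_words_in(text):
    """Return the set of distinct 5 letter words found in `text`, lowercased."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return {t for t in tokens if len(t) == 5}

# stand-in for the text of a book; in practice, read a downloaded text file
sample_text = "It was the best of times, it was the worst of times. Those were the days."
familiar = five_letter_words_in(sample_text)

# keep only the candidate words that also appear in the book text
candidates = ["times", "worst", "kioea", "those", "stoae"]
likely = [w for w in candidates if w in familiar]
print(likely)
```

The simple pattern splits contractions at the apostrophe, so a real version might need a more careful tokenizer.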
Disclaimer #1:
WORDLE is copyright 2022 to the New York Times. This blog is not affiliated with either Wordle or the New York Times. This content was produced for my own scholarship and not for any profit.
Disclaimer #2:
The author does not make any warranties about the completeness, reliability and accuracy of this information. Any action you take upon the information on this site is strictly at your own risk, and the author will not be held liable for any losses and damages in connection with the use of this information.
This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.