What's the longest rotational word in the English language?

MrAl

Joined Jun 17, 2014
8,063
Here is my final code. A few changes:

1. I changed the source dictionary to usa2.txt from http://www.gwicks.net/dictionaries.htm. It is a USA English dictionary with no nonsense (AFAICT). Note that "sexploitation" is not a word in this dictionary.
2. I added code to pre-sanitize the dictionary file, allowing only whole words with only two or more lowercase letters. Presumably, this rejects acronyms, abbreviations, hyphenated words, and proper nouns.
3. Rather than grepping the whole file for each dictionary word, I now grep only the lines after the line containing the current word. This shrinks the search space by one line each iteration and avoids duplicate self-matches like the "weewee"s.
4. Since I pre-sanitize, I don't need to check each word for validity prior to each grep.
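As a quick illustration of item 2, the sanitizing step is just a grep. Against a made-up mini-dictionary (the file name and contents here are hypothetical), it drops everything except whole words of two or more lowercase letters:

```shell
# Hypothetical mini-dictionary illustrating the sanitizing filter.
printf 'NASA\nco-op\nParis\na\ntables\nstable\n' > mini.txt

# Keep only whole words of two or more lowercase letters, sorted.
grep -E '^[a-z][a-z]+$' mini.txt | sort
# prints:
# stable
# tables
```

"NASA" (all caps), "co-op" (hyphen), "Paris" (initial capital), and "a" (one letter) are all rejected.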

For the usa2.txt dictionary, the entire run time is:

real 8m46.443s
user 8m43.483s
sys 2m58.861s

Resulting output file attached.

@nsaspook: please run this on your best box and report back the execution time. Thanks.

Bash:
#!/bin/bash

#Rotodrome Finder
#JoeyD999@AAC
#December 2019

#Select desired dictionary (or add your own)

#SOURCEDICT=/usr/share/dict/words                                #Ubuntu Linux Dictionary
#SOURCEDICT=words.txt                           #from https://github.com/dwyl/english-words -- contains nonsense, abbreviations, proper names, and acronyms.
SOURCEDICT=usa2.txt                                                            #from http://www.gwicks.net/dictionaries.htm -- just USA English words

DICT=infile.txt
TEMP=tempfile.txt
OUT=outfile.txt

#Clean start -- delete old temporary and output files

rm -f $TEMP $OUT

#avoid file not found error during dup check

touch $TEMP

#Sanitize SOURCEDICT to DICT:  Create sorted DICT with words length >= 2 that contain only lowercase english alphabet letters

grep -E "^[a-z][a-z]+$" $SOURCEDICT | sort > $DICT

#Set up a source dictionary line counter to limit search space

LINE=1

#Iterate over each word in the dictionary file

cat $DICT | while read WORD || [[ -n $WORD ]]; do

    LINE=$(($LINE+1))                                                            #search space is all source file lines below this one
    LENGTH=${#WORD}                                                       #get word length

    if (! grep -iwq $WORD $TEMP); then                      #only process words that do not exist in $TEMP file

        #Construct a regexp to compare all permutations simultaneously
        #i.e. "joeyd" => "^(oeydj|eydjo|ydjoe|djoey)$"

        REGEX="^("

        for ((i=1 ; i < LENGTH; i++)); do

            if ((i>1)); then
                REGEX=${REGEX}\|${WORD:$i}${WORD:0:((i))}
            else
                REGEX=${REGEX}${WORD:$i}${WORD:0:((i))}
            fi

        done

        REGEX="${REGEX})$"  #close the regexp
      
        #Match all lines in $DICT beginning with current word line+1

        unset MATCHLIST

        for MATCH in `tail -n +$LINE $DICT | grep -iE $REGEX`; do
            MATCHLIST="$MATCHLIST $MATCH"
        done
      
        #Append each successful set of matches to $TEMP file

        if [[ -v MATCHLIST ]]; then
            echo "$LENGTH $WORD $MATCHLIST" | tee -a $TEMP
        fi

    fi
done

#sort the final $TEMP into $OUT

sort -n $TEMP > $OUT
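The rotation-regexp construction in the script can be checked in isolation. This sketch reproduces just that loop for the example word from the script's own comment:

```shell
# Build the all-rotations regexp for one word, exactly as the script does.
WORD=joeyd
LENGTH=${#WORD}
REGEX="^("

for ((i = 1; i < LENGTH; i++)); do
    if ((i > 1)); then
        REGEX=${REGEX}\|${WORD:i}${WORD:0:i}    # later rotations, pipe-separated
    else
        REGEX=${REGEX}${WORD:i}${WORD:0:i}      # first rotation, no leading pipe
    fi
done

REGEX="${REGEX})$"  #close the regexp
echo "$REGEX"
# prints: ^(oeydj|eydjo|ydjoe|djoey)$
```

Each alternative is the word with its first i letters moved to the end, so one grep pass tests all non-trivial rotations at once.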
Another question that comes up is if we have "tables" and find "stable" should we list:
"stable tables"

alone or also list:
"tables stable"

as a separate entry since both words have their valid rotations so maybe that counts as 2 instead of just 1.
 

joeyd999

Joined Jun 6, 2011
4,477
Another question that comes up is if we have "tables" and find "stable" should we list:
"stable tables"

alone or also list:
"tables stable"

as a separate entry since both words have their valid rotations so maybe that counts as 2 instead of just 1.
This test:

Bash:
if (! grep -iwq $WORD $TEMP); then                      #only process words that do not exist in $TEMP file
explicitly prevents such (IMO) duplication.

Remove it if you wish.
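A minimal sketch of that check (the temp-file contents here are hypothetical): once "stable tables" has been written, the word "tables" already appears in the temp file, so grep -iwq finds it and the word is skipped as a new key:

```shell
# Hypothetical temp file state after "stable" has been processed.
printf '6 stable tables\n' > tempfile.txt

# The script's duplicate check: skip any word already present in the file.
if ! grep -iwq tables tempfile.txt; then
    echo "tables not seen yet -- would be processed"
else
    echo "tables already listed -- skipped"
fi
# prints: tables already listed -- skipped
```

The -w flag matches whole words only, so "table" would not falsely suppress "tables".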
 

MrAl

Joined Jun 17, 2014
8,063
I found the following alpha distribution of finds:

Code:
 a    b    c    d    e    f   g   h    i    j  k   l    m    n   o    p    q   r    s     t    u    v   w    x  y   z
{315, 147, 243, 145, 287, 65, 89, 201, 132, 7, 88, 232, 163, 91, 212, 286, 22, 213, 1166, 455, 104, 39, 173, 7, 49, 12}
Clearly words that start with 's' are most abundant, while 't' is second, and 'a' is third.
Looks like the 'x' group came in because of some Roman numerals.
This distribution counts each found rotation separately, so "abc" and "bca" would come out as two entries, not one: one entry leads with the word beginning in 'a' and the other with the word beginning in 'b', listed as:
abc bca
bca abc

so the distribution counts would increase by 1 for 'a' and by 1 for 'b' (if those were real words, of course).
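One way to reproduce such a first-letter tally from an output file is with cut/sort/uniq. The sample lines and file name here are hypothetical, using the script's "<length> <word> <matches...>" format:

```shell
# Hypothetical sample of result lines in the script's output format.
printf '3 abc bca\n3 bca abc\n6 stable tables\n' > finds.txt

# Tally entries by the first letter of the lead word (field 2).
cut -d' ' -f2 finds.txt | cut -c1 | sort | uniq -c
# prints a count of 1 each for a, b, and s
```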


SIDE NOTE:
For some reason, every time I edit this post I have to put the space back in front of the 'a' in the first line. Editing removes that first space every time.
 
Last edited:

joeyd999

Joined Jun 6, 2011
4,477
Attached is the output of the big dictionary at https://github.com/dwyl/english-words after using the "sanitizing" code. A glaring omission is the pair "no" and "on". For some reason, the dictionary spells "no" with a capital "N", so it gets sanitized out.

The processing time was:

real 73m41.120s
user 72m17.594s
sys 20m28.459s


The longest words are:

11 anarthropod arthropodan
11 decimosexto sextodecimo
11 deviscerate eviscerated
11 ethological lethologica
11 grasshopper hoppergrass
11 housemother motherhouse
11 peculations speculation
11 stabulation tabulations
12 breakweather weatherbreak
12 manslaughter slaughterman
13 exploitations sexploitation
 

MrAl

Joined Jun 17, 2014
8,063
Attached is the output of the big dictionary at https://github.com/dwyl/english-words after using the "sanitizing" code. A glaring omission is the pair "no" and "on". For some reason, the dictionary spells "no" with a capital "N", so it gets sanitized out.

The processing time was:

real 73m41.120s
user 72m17.594s
sys 20m28.459s


The longest words are:

11 anarthropod arthropodan
11 decimosexto sextodecimo
11 deviscerate eviscerated
11 ethological lethologica
11 grasshopper hoppergrass
11 housemother motherhouse
11 peculations speculation
11 stabulation tabulations
12 breakweather weatherbreak
12 manslaughter slaughterman
13 exploitations sexploitation
I noticed they used capitals for some words, so I used:
List=lower(List)
which makes them all lower case before I start.
 

joeyd999

Joined Jun 6, 2011
4,477
I noticed they used capitals for some words, so I used:
List=lower(List)
which makes them all lower case before I start.
The point of "pre-sanitizing" the dictionary was to remove acronyms, abbreviations, hyphenated words, and proper nouns prior to the search -- to produce a result with less garbage.

I could allow capital letters -- maybe only on the first letter (which would let in proper nouns). The preceding code already ignores character case during the search -- without requiring a "lower()" function. That's what the -i option does in:

Bash:
for MATCH in `tail -n +$LINE $DICT | grep -iE $REGEX`; do
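For instance (the two-line file here is hypothetical), a rotation built from "No" would be "oN", and -i still lets it match the lowercase "on":

```shell
# Hypothetical dictionary fragment with mixed case.
printf 'No\non\n' > pair.txt

# The rotation of "No" is "oN"; -i matches "on" regardless of case.
grep -iE '^(oN)$' pair.txt
# prints: on
```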
I'll run it and report back later so you can see the difference.
 

joeyd999

Joined Jun 6, 2011
4,477
I'll run it and report back later so you can see the difference.
Attached is the output allowing initial capitalization (i.e. proper nouns and names).

Processing time:

real 98m55.727s
user 96m44.686s
sys 26m33.463s

Here's the code:

Bash:
#!/bin/bash

#Rotodrome Finder
#JoeyD999@AAC
#December 2019

#Select desired dictionary (or add your own)

#SOURCEDICT=/usr/share/dict/words                                #Ubuntu Linux Dictionary
SOURCEDICT=words.txt                           #from https://github.com/dwyl/english-words -- contains nonsense, abbreviations, proper names, and acronyms.
#SOURCEDICT=usa2.txt                                                            #from http://www.gwicks.net/dictionaries.htm -- just USA English words

DICT=infile.txt
TEMP=tempfile.txt
OUT=outfile.txt

#Clean start -- delete old temporary and output files

rm -f $TEMP $OUT

#avoid file not found error during dup check

touch $TEMP

#Sanitize SOURCEDICT to DICT:  Create sorted DICT with words length >= 2 that contain only lowercase english alphabet letters
# -- or an initial capital (i.e. proper nouns)

grep -E "^[a-zA-Z][a-z]+$" $SOURCEDICT | sort > $DICT

#Set up a source dictionary line counter to limit search space

LINE=1

#Iterate over each word in the dictionary file

cat $DICT | while read WORD || [[ -n $WORD ]]; do
 
    LINE=$(($LINE+1))                                                            #search space is all source file lines below this one
    LENGTH=${#WORD}                                                       #get word length
  
    if (! grep -iwq $WORD $TEMP); then                      #only process words that do not exist in $TEMP file

        #Construct a regexp to compare all permutations simultaneously
        #i.e. "joeyd" => "^(oeydj|eydjo|ydjoe|djoey)$"

        REGEX="^("

        for ((i=1 ; i < LENGTH; i++)); do

            if ((i>1)); then
                REGEX=${REGEX}\|${WORD:$i}${WORD:0:((i))}
            else
                REGEX=${REGEX}${WORD:$i}${WORD:0:((i))}
            fi

        done
 
        REGEX="${REGEX})$"  #close the regexp
        
        #Match all lines in $DICT beginning with current word line+1

        unset MATCHLIST

        for MATCH in `tail -n +$LINE $DICT | grep -iE $REGEX`; do
            MATCHLIST="$MATCHLIST $MATCH"
        done
        
        #Append each successful set of matches to $TEMP file

        if [[ -v MATCHLIST ]]; then
            echo "$LENGTH $WORD $MATCHLIST" | tee -a $TEMP
        fi

    fi
done

#sort the final $TEMP into $OUT

sort -n $TEMP > $OUT
 

If you're into palindromes, how about a book?
Lid Off A Daffodil A Book of Palindromes by John Pool Illustrations by Peter Brookes
Not that I'm really into such things...
 

MrAl

Joined Jun 17, 2014
8,063
If you're into palindromes, how about a book?
Lid Off A Daffodil A Book of Palindromes by John Pool Illustrations by Peter Brookes
Not that I'm really into such things...
Sounds like an interesting read on a cold winter's night, curled up in front of the fireplace.
Or you could just take a sleeping pill; it will work just as well. :)
 

joeyd999

Joined Jun 6, 2011
4,477
@joeyd999
Processing time

Could you explain the difference between real, user and sys?

From https://stackoverflow.com/questions/556405/what-do-real-user-and-sys-mean-in-the-output-of-time1

Real is wall clock time - time from start to finish of the call. This is all elapsed time including time slices used by other processes and time the process spends blocked (for example if it is waiting for I/O to complete).

User is the amount of CPU time spent in user-mode code (outside the kernel) within the process. This is only actual CPU time used in executing the process. Other processes and time the process spends blocked do not count towards this figure.

Sys is the amount of CPU time spent in the kernel within the process. This means executing CPU time spent in system calls within the kernel, as opposed to library code, which is still running in user-space. Like 'user', this is only CPU time used by the process.
Note that I ran the code on a quad-core Intel i7, which can run up to 8 threads simultaneously. Therefore, the combined User and Sys time can exceed the elapsed Real time.
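Converting the reported times from the 73-minute run to seconds shows the combined CPU time really does exceed the wall-clock time:

```shell
# Convert the reported times (73m41.120s real, 72m17.594s user,
# 20m28.459s sys) to seconds and compare CPU time against wall-clock time.
awk 'BEGIN {
    real = 73*60 + 41.120
    user = 72*60 + 17.594
    sys  = 20*60 + 28.459
    printf "real = %.3f s, user+sys = %.3f s\n", real, user + sys
}'
# prints: real = 4421.120 s, user+sys = 5566.053 s
```

About 1145 seconds of CPU work was overlapped across cores.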
 