Invitational round | 5 points | Tags: ComputationalPuzzle
Starting from "small":

- "small" is followed by: elephants (2 times), strawberries (1 time); so the next word is "elephants".
- "elephants" is followed by: eat (1 time), are (2 times); so the next word is "are".
- "are" is followed by: not (1 time), alleged (2 times), very (1 time); so the next word is "alleged".
- "alleged" is followed by: spies (2 times), blueberries (1 time); so the next word is "spies".

So the sentence is "small elephants are alleged spies."
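The greedy walk above can be sketched in a few lines. This is only a sketch, not the official solution; the successor counts are hard-coded from the observations listed above.

```python
# Bigram successor counts, copied from the observations above.
successors = {
    "small": {"elephants": 2, "strawberries": 1},
    "elephants": {"eat": 1, "are": 2},
    "are": {"not": 1, "alleged": 2, "very": 1},
    "alleged": {"spies": 2, "blueberries": 1},
}

def greedy_decode(start: str) -> str:
    """Repeatedly append the most frequent successor until none is known."""
    words = [start]
    while words[-1] in successors:
        counts = successors[words[-1]]
        words.append(max(counts, key=counts.get))
    return " ".join(words)

print(greedy_decode("small"))  # small elephants are alleged spies
```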
In J2, the critical information is as follows:
And the common token counts:
| | 1️⃣ | 2️⃣ | 3️⃣ | 4️⃣ | 5️⃣ | 6️⃣ |
|---|---|---|---|---|---|---|
| 1️⃣ | | 2 | 2 | 1 | | 0 |
| 2️⃣ | 2 | | | 0 | 1 | |
| 3️⃣ | 2 | | | | 0 | |
| 4️⃣ | 1 | 0 | | | 1 | 2 |
| 5️⃣ | | 1 | 0 | 1 | | |
| 6️⃣ | 0 | | | 2 | | |
Note that in general, the words overlap quite a bit, so we want to tokenize the words such that they share tokens as much as possible. That is, we always want to extract longest common substrings as tokens. My first assumption: "c" never appears as a token on its own, because the two tokens above are both "c" followed by something else.
A. [ch]anging
B. [co][a][ch][es]
C. ex[ce]ssive
D. pro[ce]eding
E. pro[ce]sses
F. started
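The "extract longest common substrings" heuristic can be sketched with the standard dynamic-programming routine; the inputs below are words from the list above.

```python
def longest_common_substring(a: str, b: str) -> str:
    """DP over prefix pairs: cur[j] holds the length of the longest
    common suffix of a[:i] and b[:j]; track the best match seen."""
    best_len, best_end = 0, 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                cur[j] = prev[j - 1] + 1
                if cur[j] > best_len:
                    best_len, best_end = cur[j], i
        prev = cur
    return a[best_end - best_len:best_end]

print(longest_common_substring("excessive", "processes"))  # cess
print(longest_common_substring("changing", "coaches"))     # ch
```

Note that the heuristic alone would happily extract "cess" whole; the actual split into "ce" and "ss" is forced by how the other words share tokens.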
The "ex" and "pro" prefixes must be tokens, because the suffixes look more complex and we only have 4 tokens to work with. Among the suffixes, note that "sses" contains "es", a known token. The remainder, "ss", also appears in C.
A. [ch]anging
B. [co][a][ch][es]
C. [ex][ce][ss][ive]
D. [pro][ce]eding
E. [pro][ce][ss][es]
F. started
If we divide "eding" into "ed" and "ing", then "ed" can be reused for F and "ing" can be reused for A.
A. [ch]ang[ing]
B. [co][a][ch][es]
C. [ex][ce][ss][ive]
D. [pro][ce][ed][ing]
E. [pro][ce][ss][es]
F. start[ed]
Let's count the common tokens. Although we haven't fully tokenized A and F, we can still be sure a token does not appear in them whenever the word contains no such substring; where a token might or might not appear, the count is written as a range (e.g. 0/1).
| | A | B | C | D | E | F |
|---|---|---|---|---|---|---|
| A | | 1/2 | 0 | 1 | 0 | 0/1 |
| B | 1/2 | | 0 | 0 | 1 | 0/1 |
| C | 0 | 0 | | 1 | 2 | 0 |
| D | 1 | 0 | 1 | | 2 | 1 |
| E | 0 | 1 | 2 | 2 | | 0 |
| F | 0/1 | 0/1 | 0 | 1 | 0 | |
- 1️⃣ shares 2 tokens with 2 other words; only E can do that (with C and D), so 1️⃣ must be E, while 2️⃣ and 3️⃣ are C and D, in some order.
- 1️⃣ shares 1 token with 4️⃣, and E shares 1 token only with B, so 4️⃣ is B.
- 4️⃣ is known to share 2 tokens with 6️⃣, so 6️⃣ must be A, and the tokenization of A contains the token "a": A. [ch][a][ng][ing].
- B is known to share no tokens with C or D, but 4️⃣ needs to share 1 token with 5️⃣, so 5️⃣ must be F, and F also contains the token "a": F. st[a]rt[ed].
- Finally, 5️⃣ shares 1 token with 2️⃣ and no tokens with 3️⃣, so 2️⃣ is D and 3️⃣ is C.
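As a sanity check, a short script can confirm the shared-token counts used in the deduction. This is a sketch using the final tokenizations above; the never-tokenized "st" and "rt" of F are left out of its token set.

```python
# Final tokenizations of the six words (F's "st"/"rt" are never tokens).
tokens = {
    "A": {"ch", "a", "ng", "ing"},
    "B": {"co", "a", "ch", "es"},
    "C": {"ex", "ce", "ss", "ive"},
    "D": {"pro", "ce", "ed", "ing"},
    "E": {"pro", "ce", "ss", "es"},
    "F": {"a", "ed"},
}

def shared(x: str, y: str) -> int:
    """Number of tokens the two words have in common."""
    return len(tokens[x] & tokens[y])

# The counts cited in the deduction above:
assert shared("E", "C") == 2 and shared("E", "D") == 2  # 1 is E
assert shared("E", "B") == 1                            # 4 is B
assert shared("B", "A") == 2                            # 6 is A
assert shared("B", "F") == 1                            # 5 is F
assert shared("F", "D") == 1 and shared("F", "C") == 0  # 2 is D, 3 is C
print("all checks pass")
```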
(This is mostly guesswork, based on the assumption that the tokenization should create as many common tokens as possible. I'm not sure if there's an entirely rigorous way to do this.)