The Shaney Text Composer
Shaney is a program which reads in some text files, does some statistical analysis and then generates text based on the results of the analysis. In detail, it works as follows:
When reading a file, the file is split into words. Normally, this is
done by simply cutting the input at whitespace (hence, punctuation is
conserved and "he", "he.", "he," and "He" are four different words).
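As an illustration of the splitting rule, here is a minimal Python sketch. It is not shaney's own code, and the function name split_words is made up for this example.

```python
def split_words(text):
    """Cut the text at runs of whitespace; punctuation and case stay attached."""
    return text.split()

words = split_words("he said he. Then, he, and He left.")
distinct = set(words)
# "he", "he.", "he," and "He" all end up as separate entries in `distinct`.
```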
Text generation then proceeds word by word. For order 1 (the default), one word is chosen randomly at the beginning; call it A. Then, the next word is chosen randomly based on which words followed A (the context) in the source text. If, for example, word B followed A twice and word C followed A once in the source text, shaney will choose B with a probability of 2/3 and C with a probability of 1/3. Once this word is written, it serves in turn as the context for the next one, and so on.
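The order-1 rule can be sketched in Python as follows. This is only an illustration under simple assumptions, not shaney's implementation; the names build_followers and next_word are hypothetical. Keeping duplicates in the follower list is one easy way to get the 2/3 versus 1/3 weighting.

```python
import random
from collections import defaultdict

def build_followers(words):
    """Map each word to the list of words that followed it (duplicates kept)."""
    followers = defaultdict(list)
    for current, nxt in zip(words, words[1:]):
        followers[current].append(nxt)
    return followers

def next_word(followers, context, rng=random):
    """Pick uniformly from the list; repeated followers are therefore likelier."""
    return rng.choice(followers[context])

# In this toy source, B follows A twice and C follows A once.
words = "A B x A B y A C".split()
followers = build_followers(words)
```

Here followers["A"] holds B twice and C once, so a uniform pick from that list returns B about two times out of three.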
Shaney allows you to set arbitrary context sizes (i.e. order values). For order N, the last N written words, A1...AN, serve as the context and the next word to be written is chosen among those words which followed A1...AN in the source text.
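For order N, a sketch along the same lines could key the table by tuples of N words. Again this is an assumption-laden illustration rather than shaney's code: build_table and generate are hypothetical names, the order must be at least 1, and starting from a randomly chosen N-word context is an assumption made for this example.

```python
import random
from collections import defaultdict

def build_table(words, order):
    """Map every run of `order` consecutive words to the words that followed it."""
    table = defaultdict(list)
    for i in range(len(words) - order):
        context = tuple(words[i:i + order])
        table[context].append(words[i + order])
    return table

def generate(table, order, count, rng):
    """Write up to `count` words, always using the last `order` words as context."""
    context = rng.choice(sorted(table))   # assumed start: a random known context
    out = list(context)
    while len(out) < count:
        candidates = table.get(context)
        if not candidates:                # context only occurred at the very end
            break
        out.append(rng.choice(candidates))
        context = tuple(out[-order:])
    return out

words = "the cat sat on the mat and the cat ran".split()
table = build_table(words, 2)
text = generate(table, 2, 8, random.Random(0))
```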
Of course, large values of N (around 5 and above) will quite likely simply reproduce the source text. With option -m, shaney marks every word for which it had a choice about what to write next by prefixing that word with "> ". The most "interesting" orders are 1 and 2, but the implementation does not impose any limit on the value.
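Why large orders tend to reproduce the source can be seen by counting the contexts that leave any choice at all. The Python sketch below is an illustration only; it assumes that "having a choice" means more than one distinct follower, which the description above does not spell out, and the names build_table, mark and contexts_with_choice are hypothetical.

```python
from collections import defaultdict

def build_table(words, order):
    """Map every run of `order` consecutive words to the words that followed it."""
    table = defaultdict(list)
    for i in range(len(words) - order):
        table[tuple(words[i:i + order])].append(words[i + order])
    return table

def mark(word, candidates):
    """Prefix the word with '> ' if more than one distinct follower was possible."""
    return "> " + word if len(set(candidates)) > 1 else word

def contexts_with_choice(table):
    """Count the contexts that have at least two distinct followers."""
    return sum(1 for c in table.values() if len(set(c)) > 1)

words = "a b c a b d a b c".split()
low = contexts_with_choice(build_table(words, 1))   # one context offers a choice
high = contexts_with_choice(build_table(words, 3))  # no context offers a choice
```

In this toy input, every 3-word context has exactly one follower, so generation at order 3 can only replay the source.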
Shaney is programmed in a way that allows for large input files. I've tested it with a 600 kB text file (80000 words, 20000 different) without any trouble; it may, however, consume a lot of memory for large order values (e.g. 140 MB at order 30 for the 600 kB text mentioned before, while "just" consuming 30 MB for order 3).
You may feed multiple files into shaney. At the end of each file, an
EOF word is read in, and once the EOF word is reached during text production,
the algorithm stops. If you pass option -r, text production
is restarted after the EOF word using a randomly chosen start word.
If you pass -s, no EOF word will be read between
input files (effectively treating all input files like a single large one).
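The EOF word handling might look roughly like this in Python. It is a sketch under assumptions: the sentinel object EOF and the function read_words are hypothetical, the inputs are passed as strings instead of file names to keep the example self-contained, and it assumes that with -s a single EOF word is still read after the last file.

```python
EOF = object()  # hypothetical sentinel; it can never collide with a real word

def read_words(texts, single_stream=False):
    """Concatenate the words of all inputs, adding an EOF word after each one.

    With single_stream=True (modelled on -s), no EOF word is placed between
    inputs; one EOF word is still appended at the very end (an assumption).
    """
    words = []
    for text in texts:
        words.extend(text.split())
        if not single_stream:
            words.append(EOF)
    if single_stream:
        words.append(EOF)
    return words

separate = read_words(["a b", "c"])
joined = read_words(["a b", "c"], single_stream=True)
```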
You may set a word limit for shaney using option -l=NUM. If set to something larger than 0, no more than NUM words will be written. Use together with -r to guarantee that NUM words get written.
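How the word limit and the restart option interact can be sketched as an order-1 loop in Python. This is an illustration, not shaney's code: generate, the EOF marker and the hand-written table are all hypothetical. Note that with no limit and restart enabled, this loop would never terminate.

```python
import random

EOF = None  # hypothetical end-of-file marker stored among the followers

def generate(table, limit=0, restart=False, rng=random):
    """Order-1 generation; `table` maps each word to its list of followers.

    limit <= 0 means "no limit". Without restart, the first EOF ends the text,
    so fewer than `limit` words may be written.
    """
    starts = sorted(table)
    context = rng.choice(starts)
    out = [context]
    while limit <= 0 or len(out) < limit:
        word = rng.choice(table[context])
        if word is EOF:
            if not restart:
                break
            word = rng.choice(starts)   # restart from a random start word
        out.append(word)
        context = word
    return out

# Hand-built table for the one-file source "a b": a -> b, b -> EOF.
table = {"a": ["b"], "b": [EOF]}
short = generate(table, limit=5, rng=random.Random(1))
full = generate(table, limit=5, restart=True, rng=random.Random(1))
```

Without restart the text ends at the first EOF word, after at most two words here; with restart exactly five words are written.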
The output can be formatted to fit your terminal using option -c=COLS. If set to something non-negative, shaney tries to never make lines longer than COLS chars; if not specified (or negative), all the text is written into a single line (line-wrapped by your terminal).
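A simple greedy line filler shows one way such formatting could work. This Python sketch is an assumption about the approach, not a description of shaney's actual code, and wrap_words is a hypothetical name. A word longer than COLS still gets a line of its own, which is why the limit can only be "tried".

```python
def wrap_words(words, cols):
    """Greedily fill lines of at most `cols` characters; negative means no wrap."""
    if cols < 0:
        return [" ".join(words)]
    lines, current = [], ""
    for word in words:
        if not current:
            current = word                        # first word on a fresh line
        elif len(current) + 1 + len(word) <= cols:
            current += " " + word                 # still fits, with one space
        else:
            lines.append(current)                 # line is full, start a new one
            current = word
    if current:
        lines.append(current)
    return lines

wrapped = wrap_words("aa bb cc dd".split(), 5)
unwrapped = wrap_words("aa bb cc dd".split(), -1)
```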
Download and Build
For the ix86-linux-gnu platform, you may download a binary which should run without trouble. For other platforms, and all those interested, the source code is provided.
To build the sources, I recommend using GCC.
Simply compile using
For online help, use
Although shaney is effectively a two-afternoon hack, the code should have
a fairly clean design. However, a more memory-conserving storage method
could be used and less frequent use of realloc() would be
appreciated. (For efficiency, all reallocation calls will do alloc-ahead,
i.e. allocate more than actually needed to reduce the number of subsequent
calls.) Furthermore, linear lookups in the inner loops could be
replaced by faster algorithms.
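The effect of alloc-ahead can be illustrated with a small Python model that counts how often a growing array would have to be reallocated. This is a model only, not shaney's C code; count_reallocs is a hypothetical name, and doubling the capacity is an assumed growth policy, since the amount shaney allocates ahead is not stated here.

```python
def count_reallocs(n_items, grow):
    """Count reallocation calls while appending n_items one at a time.

    `grow(capacity, needed)` returns the new capacity whenever the array is full.
    """
    capacity = 0
    calls = 0
    for size in range(1, n_items + 1):
        if size > capacity:
            capacity = grow(capacity, size)
            calls += 1
    return calls

# Allocate exactly what is needed: one call per appended item.
exact = count_reallocs(1000, lambda cap, need: need)
# Alloc-ahead with doubling (assumed policy): only a handful of calls.
ahead = count_reallocs(1000, lambda cap, need: max(need, cap * 2))
```

In this model, 1000 appends cost 1000 reallocation calls with exact sizing and 11 with doubling.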