
The Shaney Text Composer

Shaney

Shaney is a program which reads in some text files, does some statistical analysis and then generates text based on the results of the analysis. In detail, it works as follows:

When reading a file, the file is split into words. Normally, this is done by simply cutting the input at whitespace (hence, punctuation is conserved and "he", "he.", "he," and "He" are four different words).
Then, for each word, Shaney analyzes the probability that it comes next after a context of N previous words. The parameter N is called the order (tuneable using option -o=N). Use option -T to get a complete text dump of the read-in text before computing statistics.
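
The reading and counting steps can be sketched as follows. This is a minimal illustration and not the code from shaney.cc; the names split_words, build_table, Context, Counts and Table are made up for this example, and std::map is simply used for brevity.

```cpp
#include <cstddef>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical types for this sketch (not taken from shaney.cc).
using Context = std::vector<std::string>;   // the N previous words
using Counts  = std::map<std::string, int>; // follower word -> how often seen
using Table   = std::map<Context, Counts>;

// Cut the input at whitespace only; punctuation and case stay untouched,
// so "he", "he.", "he," and "He" come out as four different words.
std::vector<std::string> split_words(const std::string& text) {
    std::istringstream in(text);
    std::vector<std::string> words;
    std::string w;
    while (in >> w) words.push_back(w);
    return words;
}

// For every run of `order` consecutive words, count which word follows it.
Table build_table(const std::vector<std::string>& words, std::size_t order) {
    Table table;
    for (std::size_t i = 0; i + order < words.size(); ++i) {
        Context ctx(words.begin() + i, words.begin() + i + order);
        ++table[ctx][words[i + order]];
    }
    return table;
}
```

For the input "A B A B A C" and order 1, the context A ends up with the counts B: 2 and C: 1.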

Hence, for order 1 (default), one word is chosen randomly at the beginning; call it A. Then, the next word is chosen randomly based upon which words followed A (= the context) in the source text. If, for example, word B followed A 2 times and word C followed A 1 time in the source text, shaney will choose B with a probability of 2/3 and C with a probability of 1/3. Once this word is written, it serves as the context for the next one, and so on.
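
The weighted choice in this example can be written down in a few lines. The sketch below assumes the follower counts are available as a map from word to count; pick_next is a hypothetical name, and the real program may store and select followers differently.

```cpp
#include <map>
#include <random>
#include <string>

// Pick a follower with probability proportional to its count.
// With counts {B: 2, C: 1}, B is returned with probability 2/3, C with 1/3.
std::string pick_next(const std::map<std::string, int>& followers,
                      std::mt19937& rng) {
    int total = 0;
    for (const auto& kv : followers) total += kv.second;
    if (total <= 0) return std::string();  // no follower known for this context

    std::uniform_int_distribution<int> dist(0, total - 1);
    int r = dist(rng);
    for (const auto& kv : followers) {
        if (r < kv.second) return kv.first;  // r falls into this word's share
        r -= kv.second;
    }
    return std::string();  // not reached when total > 0
}
```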

Shaney allows you to set arbitrary context sizes (i.e. order values). For order N, the last N written words, A1...AN, serve as the context and the next word to be written is chosen among those words which followed A1...AN in the source text.
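
A sliding context of N words might look like the following sketch. It assumes order >= 1 and stores followers with repetition, so that a uniform pick among them is automatically weighted by frequency. The names build_followers and generate, and the max_words cap, are inventions for this example and do not come from shaney.cc.

```cpp
#include <cstddef>
#include <map>
#include <random>
#include <string>
#include <vector>

using Words = std::vector<std::string>;
// Context (N words) -> every word that followed it, repeats included.
using FollowerTable = std::map<Words, Words>;

FollowerTable build_followers(const Words& words, std::size_t order) {
    FollowerTable table;
    for (std::size_t i = 0; i + order < words.size(); ++i) {
        Words ctx(words.begin() + i, words.begin() + i + order);
        table[ctx].push_back(words[i + order]);
    }
    return table;
}

// Write the start context, then keep choosing a follower of the last N
// written words until no follower is known or max_words is reached.
// Assumes the context holds at least one word (order >= 1).
Words generate(const FollowerTable& table, Words context,
               std::size_t max_words, std::mt19937& rng) {
    Words out = context;
    while (out.size() < max_words) {
        FollowerTable::const_iterator it = table.find(context);
        if (it == table.end()) break;  // this context has no known follower
        const Words& f = it->second;
        std::uniform_int_distribution<std::size_t> dist(0, f.size() - 1);
        const std::string& next = f[dist(rng)];
        out.push_back(next);
        context.erase(context.begin());  // drop A1 ...
        context.push_back(next);         // ... and append the new word
    }
    return out;
}
```

If every context in the source has exactly one follower, the output is a verbatim copy of the source from the start context onwards, which is the effect described for large N.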

Of course, large values for N (around 5 and above) will most likely just reproduce the source text. With option -m, shaney marks every word for which it had a choice of successors by prefixing that word with "". The most "interesting" orders are 1 and 2, but the implementation does not impose any limit on the value.

Shaney is written to cope with large input files. I've tested it with a 600 kB text file (80000 words, 20000 of them distinct) without any trouble; it may, however, consume a lot of memory for large order values (e.g. 140 MB at order 30 for the 600 kB text mentioned above, while "just" consuming 30 MB for order 3).

You may feed multiple files into shaney. At the end of each file, an EOF word is read in, and once the EOF word is reached during text production, the algorithm stops. If you pass option -r, text production is restarted after the EOF word using a randomly chosen start word. If you pass -s, no EOF word will be read between input files (effectively treating all input files like a single large one).
You may use argument - to read from stdin.
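
One way to picture the EOF word is as a sentinel appended to each file's word list. The sketch below is only an assumption about how this could be done: the empty string as marker, the name join_inputs, and the choice to still append one marker at the very end in single-stream mode are not taken from shaney.cc.

```cpp
#include <cstddef>
#include <string>
#include <vector>

using Words = std::vector<std::string>;

// Hypothetical EOF marker: splitting at whitespace never yields an empty
// string, so it cannot collide with a real word.
const std::string kEofWord = "";

// Concatenate the word lists of all input files. With single_stream set
// (comparable to -s), no marker is placed between files; this sketch
// assumes one marker still terminates the combined input.
Words join_inputs(const std::vector<Words>& files, bool single_stream) {
    Words all;
    for (std::size_t i = 0; i < files.size(); ++i) {
        all.insert(all.end(), files[i].begin(), files[i].end());
        bool last = (i + 1 == files.size());
        if (!single_stream || last) all.push_back(kEofWord);
    }
    return all;
}
```

During text production, drawing kEofWord as the next word would then either end the output or, in the spirit of -r, trigger a restart from a new random start word.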

You may set a word limit for shaney using option -l=NUM. If set to something larger than 0, no more than NUM words will be written. Use together with -r to guarantee that NUM words get written.

The output can be formatted to fit your terminal using option -c=COLS. If set to something non-negative, shaney tries to never make lines longer than COLS chars; if not specified (or negative), all the text is written into a single line (line-wrapped by your terminal).
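
The wrapping rule can be illustrated with a simple greedy approach. This is a sketch under the assumption that words are joined by single spaces and that a word longer than COLS is placed on a line of its own; wrap_words is a hypothetical name, not a function from shaney.cc.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Join words with single spaces, starting a new line whenever the next
// word would make the current line longer than cols characters.
// A negative cols disables wrapping: everything goes on one line.
std::string wrap_words(const std::vector<std::string>& words, int cols) {
    std::string out;
    std::size_t line_len = 0;
    for (const std::string& w : words) {
        if (line_len > 0) {
            if (cols >= 0 &&
                line_len + 1 + w.size() > static_cast<std::size_t>(cols)) {
                out += '\n';  // word does not fit, break the line
                line_len = 0;
            } else {
                out += ' ';
                ++line_len;
            }
        }
        out += w;
        line_len += w.size();
    }
    return out;
}
```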

Download and Build

For the ix86-linux-gnu platform, you may download a binary which should run without trouble. For other platforms, and all those interested, the source code is provided.

Source:  shaney.cc   [21kb C++ source]
Version: 3.0   (Oct 22, 2003)
Author:  Wolfgang Wieser   (report bugs here)
License: GNU GPL (Version 2)
Binary:  ix86-linux-gnu: shaney-i386 [42kb]   (only requires libc.so.6)

To build the sources, I recommend using GCC. Compiling with
  g++ -O2 shaney.cc -o shaney
should do the trick on Linux.
For other operating systems, you may need to comment out the signal handling, which makes shaney exit cleanly on a termination signal such as SIGTERM or ^C on the terminal (which normally sends SIGINT).

For online help, use
  shaney --help

Bugs

Although shaney is effectively a two-afternoon hack, the code should have a fairly clean design. However, a more memory-conserving storage method could be used, and less frequent use of realloc() would be appreciated. (For efficiency, all reallocation calls do alloc-ahead, i.e. they allocate more than actually needed to reduce the number of subsequent calls.) Furthermore, linear lookups in the inner loops could be replaced by faster algorithms.
Nevertheless, feel free to report any bugs you find directly to me...
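
The alloc-ahead idea mentioned above can be sketched like this. The doubling factor, the initial capacity of 16 and the names IntBuffer and append are assumptions made for the example; the growth policy actually used in shaney.cc may differ.

```cpp
#include <cstddef>
#include <cstdlib>

// Hypothetical growable array of ints managed with realloc().
struct IntBuffer {
    int* data = nullptr;
    std::size_t used = 0;      // number of elements stored
    std::size_t capacity = 0;  // number of elements allocated
};

// Append one value. When the buffer is full, allocate ahead by doubling
// the capacity, so that most calls do not need to call realloc() at all.
// Returns false (leaving the buffer unchanged) if allocation fails.
bool append(IntBuffer& buf, int value) {
    if (buf.used == buf.capacity) {
        std::size_t new_cap = buf.capacity ? buf.capacity * 2 : 16;
        void* p = std::realloc(buf.data, new_cap * sizeof(int));
        if (!p) return false;
        buf.data = static_cast<int*>(p);
        buf.capacity = new_cap;
    }
    buf.data[buf.used++] = value;
    return true;
}
```

With this policy, appending 100 values triggers only four realloc() calls (capacities 16, 32, 64, 128) instead of one per element. The caller releases the memory with std::free(buf.data).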


Copyright © 2003-2004 by Wolfgang Wieser
Last modified: 2004-10-04 23:01:28