Writing Sugar Documentation with a Neural Network

1 Dec 2016
What a terrible failure, probably?

Believe it or not, Sugar has documentation. But what if we could have more documentation? Maybe we could use a Recurrent Neural Network to learn from the docs we have already written, and have it write new docs for us? Well, you can't say no if you don't try!

Let's do it!

We are going to use a library called Torch RNN, which basically does everything for us:

docker pull crisbal/torch-rnn:base
mkdir -p $HOME/torch-rnn/sugar-data/
cd $HOME/torch-rnn/sugar-data/
# SELinux systems (e.g. Fedora) only: allow the container to access the volume
sudo chcon -Rt svirt_sandbox_file_t $HOME/torch-rnn/sugar-data/

sudo docker run --rm --tty=true --interactive=true --volume $HOME/torch-rnn/sugar-data:/data crisbal/torch-rnn:base bash
# Now we are running inside the pre-setup docker container

Great, now quit the docker container; we'll come back to it later. First we need to extract the data from the help activity into a single text file to train our network:

git clone https://github.com/godiard/help-activity --depth=1
find help-activity/source/ -type f -name '*.rst' -print0 | xargs -0 cat > input.txt
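
If you are curious how much training text that scraped together, here is a quick and entirely optional sanity check:

wc -c input.txt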

If you open the input.txt file, you will see that it is one big pile of help documentation text. This is what we will train our network on. Go back into the docker container (docker run ... from above) and now we can train the network:

# python scripts/preprocess.py --input_txt /data/input.txt --output_h5 data/input.h5 --output_json data/input.json
Total vocabulary size: 117
Total tokens in file: 361025
  Training size: 288821
  Val size: 36102
  Test size: 36102
Using dtype  <type 'numpy.uint8'>

# th train.lua -input_h5 data/input.h5 -input_json data/input.json -gpu -1
Epoch 1.01 / 50, i = 1 / 5750, loss = 4.752145
Epoch 1.02 / 50, i = 2 / 5750, loss = 4.644123
Epoch 1.03 / 50, i = 3 / 5750, loss = 4.498253
...a.long.time...
Epoch 4.13 / 50, i = 360 / 5750, loss = 2.037364
...ultrabook.not.so.ultra.now...
Epoch 5.16 / 50, i = 478 / 5750, loss = 1.796518
...graphics.card.would.have.been.good...
Epoch 5.81 / 50, i = 553 / 5750, loss = 1.690430
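
By the way, the training run above just uses torch-rnn's defaults: a 2-layer, 128-unit LSTM trained for 50 epochs, which is where the 5750 iterations come from. The same script takes a few knobs if you want to experiment; the values below are illustrative, not what I used:

th train.lua -input_h5 data/input.h5 -input_json data/input.json \
   -gpu -1 -rnn_size 256 -num_layers 2 -max_epochs 25 -checkpoint_every 500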

While you're waiting, now is the right time to check out Presenter Club. With Presenter Club, you can make great presentations, faster - even faster than training this network! Presenter Club is the only speech-first presentation app. Best of all, it is free as in price and free as in AGPLv3. Sign up for free while you wait!

Results

So training the model is really slow. How slow? It took a good hour or longer on my laptop. Fun fact - if you thought your laptop was slow because it took so long to compile WebKit, your laptop is really not the best for machine learning :(

I trained it up to checkpoint 5750 (all the way until the training script stopped!). Then I generated a few examples from different seed words.
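
For reference, sampling from a checkpoint looks something like this. These are torch-rnn's standard sample.lua flags; the checkpoint path assumes the default cv/ output directory, and the seed word goes in -start_text:

# Run inside the container, once training has written cv/checkpoint_5750.t7
th sample.lua -checkpoint cv/checkpoint_5750.t7 -length 2000 -gpu -1 -start_text "Browse"

Here is what came out for each seed: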

Browse

Browse is vilute signeds, bloptering whith to view. When are button then to eatch

- Activity to make Ewigh 200, the community name a but work tiving the encrients and Vinuse losize rewill retund for bech are group,, and the serect stops you to chapsenars a
  nd page can sugar for collenterax

In this other haptions
, mith to which it Protcusing by mavight, your nout moring on Called on) wating the ficas.

3. Oper bouttate. The seter indrograge can the improscay in the from Journal studebadatch

**_Toold_**

.. image :: ../images/Wirseding.rst-:

Sugar

Sugar iswith in the re internal displayeetters Activity

senized and unternet we the coper's your cauleting what your find more sets and some sure messources.

.. image :: ../image`.png

- 1 and instrresples, wor this for icon, sugar Activite prosect more http:/,

This iesson locace anyill—boud, there ease conterster (1. 4 ancelser network can button: View is 22) and indease, the Ibacus alongmance is the Support Acking work phover. The tollows as mear 2005 ``impage.

1. Grame it worblest by choition

Scaning number used you can drog the friger with a felling files usife number on the plassiona selected inture is it.

- Activity is desp.
- - B loc. Anallably icon culd teen, have by while port of your projectles. Be seic-ter tcop peroce voractions:

- 4 Neyboard.ust” Chould entre turnerts type Finlest tito Actitition

Using, where to and copy you can timelabla

Activity

tivity View make roing inswer main abovem.8. In starting: you are Sames Toold (Cactigins

- Actio domgs, it secosk done instateds, playboud
  :::::::

       AL   × Clisude select dowunteral.

Note <http://wiki.laptops.org/go/usernap.pckgug/::::

.. image:: ../imake': . Helt it click on lonks to match and your view to the lasts stepce think wates (will button\*

- Internils menu allange filew.

Using sterb reported ove Activity. Hele and searn you will finls of sansticed, that you ovelotalinent) (is invideat with open on a properting mane.

## Tuble hill you wart the chilicking the access

So this is just random text for the most part. But it is important to appreciate what the network has been able to learn even with our tiny dataset:

  • The rst .. image:: syntax
  • The normal length of paragraphs and words
  • Full stops are followed by capital letters
  • Bullet points are a thing

Conclusion

So, this technology is probably not yet ready to replace our actual documentation, or even the contributions of some GCI students! But this just highlights how exciting machine learning is. Problems that traditional programmers thought of as "hard" - like image classification or translation - are now just as easy as collecting a training dataset. If you want a function approximated, then machine learning is your friend.

VC firms have said that mobile is eating the world, wearables are the future, IoT will change everything, and VR will eat the world. Not every claim has panned out for them. But I'm going to place my bets that machine learning is not only the future, but the past and present. We live in very exciting times.