Tuesday, February 7, 2012

Writing Smell Detector (WSD) - a tool for finding problematic writing

tl;dr version: WSD is a Python tool to help find problems in your writing. Here's the source and here's example output.

In grad school, I wrote a program that used a series of regular expressions to detect "writing smell" (analogous to code smell), i.e., telltale signs of bad writing and mistakes. The rules for smelliness were loosely based on one of my favorite writing how-to's: Style: Toward Clarity and Grace by Joseph Williams.

The program took a text file as input and produced an annotated report with snippets of the offending bits. I used it for all my papers and found it really helpful, but the coding was very, um, academic (i.e., written for use by the person who wrote it) and it was written in Mathematica [1], which was the language I knew best at the time. FWIW, here is my original version.
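To make the idea concrete, here is a minimal sketch in Python of the regex-and-snippet approach described above. It is not the original Mathematica code and not the WSD source: the two rules, their patterns, and the `find_smells` helper are all made up for illustration.

```python
import re

# Hypothetical rules for illustration only; not the actual WSD rule set.
RULES = {
    "repeated word": re.compile(r"\b(\w+)\s+\1\b", re.IGNORECASE),
    "weasel word": re.compile(r"\b(?:very|really|quite|fairly)\b", re.IGNORECASE),
}


def find_smells(text, context=20):
    """Return (rule_name, snippet) pairs for every match, grouped by rule."""
    hits = []
    for name, pattern in RULES.items():
        for m in pattern.finditer(text):
            # Keep a little surrounding text so the match can be read in context.
            start = max(m.start() - context, 0)
            end = min(m.end() + context, len(text))
            hits.append((name, text[start:end]))
    return hits


if __name__ == "__main__":
    sample = "This is is a really simple test."
    for name, snippet in find_smells(sample):
        print(f"{name}: ...{snippet}...")
```

Keeping each rule as a named pattern is one way to make new rules easy to add; a real report would also want line numbers and an explanation per rule.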

For a long time, I've wanted to port it to some other language and make it accessible and capable of receiving new rule contributions and explanations. To this end, I recently commissioned an oDesk contractor (utapyngo) to make a more polished, modular version in Python. I think he totally outdid himself. It's got a nice modular model now that lets you easily incorporate new rules and he greatly improved upon my often-flawed regular expressions. Be forewarned---the documentation is non-existent and the rules aren't explained, but I plan to fix this over time, while I'm using it.

It's open source (courtesy of oDesk, who paid the bills) and available here on GitHub (live example output). To use it, just clone it, install the Python package jinja2 and then do:

$ python wsd.py -o output_file.html your_masterpiece.tex

Here's a screenshot of what the HTML output looks like, illustrating the a/an rule (i.e., that it's "an ox" but "a cat"):

Note the statement of the rule, the patterns that it looks for and the snippets. It also has a hyperlink to the full text, which is available at the bottom of the document.
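As a rough illustration of how a rule like a/an can be expressed as patterns, here is a naive sketch. It is an assumption about the general technique, not the rule as implemented in WSD: the pattern names and the `check_a_an` helper are hypothetical, and because it looks only at spelling rather than pronunciation it will wrongly flag cases like "an hour" or "a university".

```python
import re

# Naive spelling-based check; the real rule depends on pronunciation.
# "a" followed by a word that starts with a vowel letter.
A_BEFORE_VOWEL = re.compile(r"\ba\s+[aeiou]\w*", re.IGNORECASE)
# "an" followed by a word that starts with a consonant letter.
AN_BEFORE_CONSONANT = re.compile(r"\ban\s+[b-df-hj-np-tv-z]\w*", re.IGNORECASE)


def check_a_an(text):
    """Return the suspicious article-plus-word snippets found in text."""
    hits = []
    for pattern in (A_BEFORE_VOWEL, AN_BEFORE_CONSONANT):
        hits.extend(m.group(0) for m in pattern.finditer(text))
    return hits


if __name__ == "__main__":
    print(check_a_an("I saw a ox and an cat."))
```

Since the check is only a heuristic, its hits are best treated as places to look rather than as errors.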

A few thoughts:
  1. If you're interested in contributing (rules or features), let me know. 
  2. It might be nice to turn this into a web-service, though my instinct is that someone interested in algorithmically evaluating their LaTeX/structured text isn't going to find cloning the repository & then running a script to be a big obstacle. And they probably don't want to make their writing public.   
  3. A few weeks ago, I read this usethis profile of CS professor Matt Might. In the software section of the interview, he said that he had some shell scripts that do something similar. I haven't really investigated, but maybe there are ideas here worth incorporating.

[1] When I told the other members of the oDesk Research / Match Team that I had code for doing this writing smell thing, they were impressed and wanted a copy; when I told them it was written in Mathematica, they thought this was hilarious and mocked me for several minutes. I tried to explain that Mathematica actually has great tools for pattern matching, but this fell on deaf ears.


  1. It might be interesting to turn it into an Emacs add-on since a person can then edit the text. Making it LaTeX-aware (ignore text between $'s for instance) would be interestinger.

    (I use http://bnbeckwith.com/index.php/writegood-mode/ to help with passive voice.)

  2. These are not "smells", they are syntax rules. You used "code smells" as an example... pretty sure the whole point of "smells" is that they cannot be identified with formal rules.

    1. Some of the rules are straightforward syntax rules, but not all. For example, the code finds nominalizations. In many cases, they are perfectly fine and appropriate, but they are symptomatic of unclear writing and are an indication you might want to re-factor your writing in that section. Many of the code smell examples are similar---they are not syntax rules and you can't just do some automatic re-factoring but they do have detectable patterns that could help you automatically spot the problems.

  3. I'm pretty much a grammar-crazy person, but being a non-native English speaker, I just found a couple of misplaced "it's" in my articles. Oops! Awesome little tool, thank you for sharing :D

  4. You might be interested in the series of software developed by Jack Lyon. I have no idea how your program compares to what he's developed. I am a professional editor and I know many editors who use his programs: PerfectIt, EditorsToolKit, etc. They find they can work much faster as a result. I suspect yours would be a good complement to his. His blog, with links to his websites where he offers free trials of his software, is here: http://americaneditor.wordpress.com/category/editorial-matters/editing-tools-editorial-matters/

  5. Thank you & congratulations. I have had automating parts of Williams' Style: Toward Clarity and Grace on my someday projects list for a long time. I'm impressed how much of the book you've managed to create rules for.
    I still don't know what kind of output would be useful for applying Williams' ideas of coherence, cohesion and finding the actors of the story. And I'd love to be able to find places to apply parallel structure and other techniques for adding grace.

  6. Really, a WSD, that's a thing I wouldn't have run into in ordinary coding.