MBTAsux - Mining the zeitgeist

13-Mar-09

My latest personal project is MBTAsux.

For those who don’t know, the MBTA is Boston’s public transportation authority, running subways, busses, commuter rail, and the like.

To say the least, many people are unhappy with the way it is operated. Since the subject is near and dear to my heart, I decided to make MBTAsux.

What it does: it grabs Twitter messages and posts them in a format that allows easy skimming, and it extracts some data from the text.

The things I’m interested in:

  • Rudimentary sentiment analysis, i.e. how are people feeling about the MBTA right now?
  • Location tracking. I want to figure out where people complain the most. Control that for the “size” of the stations (Park and South Station would probably win the popular vote here).

Things I’ve yet to implement that I think are essential:

  • Submission form, and a mobile version of it. You know, for people who don’t use twitter.
  • Map. Alas, people are not really mentioning their exact stops when they complain, so only a really small percentage of the incoming tweets would be mappable. This brings me to the next feature:
  • A nano-format for complaining about the MBTA on twitter and other media. Something like: s:kendall someone just played the Marseillaise on the hanging pipes #mbtasux
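
A minimal sketch of how such a nano-format could be parsed, in Python. The grammar here is an assumption built from the single example above: an "s:<station>" token with no spaces in the station name, plus the #mbtasux hashtag. The function name parse_complaint is hypothetical and is not taken from the MBTAsux code.

```python
import re

# Assumed grammar: an optional "s:<station>" token anywhere in the message,
# plus the #mbtasux hashtag that marks the message as a complaint.
STATION = re.compile(r"(?:^|\s)s:([a-z0-9_-]+)", re.IGNORECASE)
HASHTAG = re.compile(r"(?:^|\s)#mbtasux\b", re.IGNORECASE)


def parse_complaint(message):
    """Return (station, text) for a tagged complaint, or None if untagged."""
    if not HASHTAG.search(message):
        return None
    match = STATION.search(message)
    station = match.group(1).lower() if match else None
    # Strip the machine-readable tokens so only the human text remains.
    text = HASHTAG.sub("", STATION.sub("", message))
    return station, " ".join(text.split())
```

Under this grammar a multi-word stop would need a joined spelling such as s:south_station; that convention is also an assumption.
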
How I wrote it:

  • Python
  • Google App Engine

Latest version of the code will be released soon under a BSD license.

Enjoy!

unisteg.py — Hiding text in text using unicode

29-Dec-08

I’m proudly presenting my latest little script: unisteg.py.

This is a steganography tool that can hide text within text that is unicode encoded, and has lots of diacritics. I’m exploiting a feature of unicode that allows characters with diacritics to be written either as a monolithic “composed” character that is a single symbol, or in a “decomposed” form in which the component symbols combine. These two different ways to represent the underlying characters are visually indistinguishable. This is where I’m hiding the secret plaintext.
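
To make the idea concrete, here is a minimal Python sketch of the composed/decomposed trick using the standard unicodedata module. It is not the code of unisteg.py. The function names (is_carrier, hide, reveal) and the convention that a decomposed character means 1 and a composed one means 0 are assumptions made for illustration.

```python
import unicodedata


def is_carrier(ch):
    """True if ch is a precomposed letter that splits into base + combining marks."""
    nfd = unicodedata.normalize("NFD", ch)
    return (
        len(nfd) > 1
        and unicodedata.combining(nfd[0]) == 0
        and all(unicodedata.combining(c) for c in nfd[1:])
        and unicodedata.normalize("NFC", nfd) == ch
    )


def hide(bits, cover):
    """Encode a string of 1s and 0s: 1 = decomposed form, 0 = composed form."""
    pending = list(bits)
    out = []
    for ch in unicodedata.normalize("NFC", cover):
        if pending and is_carrier(ch):
            if pending.pop(0) == "1":
                ch = unicodedata.normalize("NFD", ch)
        out.append(ch)
    if pending:
        raise ValueError("cover text has too few accented characters")
    return "".join(out)


def reveal(stego):
    """Read one bit per carrier; the caller truncates to the payload length."""
    clusters = []
    for ch in stego:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch  # attach a combining mark to its base character
        else:
            clusters.append(ch)
    bits = []
    for cluster in clusters:
        composed = unicodedata.normalize("NFC", cluster)[0]
        if is_carrier(composed):
            bits.append("0" if cluster[0] == composed else "1")
    return "".join(bits)
```

Every unused carrier after the payload reads back as 0, so a real tool would also need some way to record the payload length. The sketch also normalizes the cover text to NFC first, which discards whatever mix of forms the cover originally had.
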

Usage: unisteg.py [options]  >

Prints output to stdout by default.

Options:
  -h, --help            show this help message and exit
  -s, --steg            Hide plaintext in covertext to produce cyphertext.
  --url-plain=URL_PLAIN
                        URL to retrieve plaintext from
  --url-cover=URL_COVER
                        URL to retrieve covertext from
  --file-plain=FILE_PLAIN
                        File to retrieve plaintext from
  --file-cover=FILE_COVER
                        File to retrieve covertext from
  -b, --binary          Use if the plaintext is a string of 1s and 0s
  -e ENCODING, --encoding=ENCODING
                        Encoding of the covertext, if not unicode. See Python
                        codecs module for possible values.
  -u, --unsteg          Derive plaintext from cyphertext.
  --url-steg=URL_STEG   URL to retrieve cyphertext from
  --file-steg=FILE_STEG
                        File to retrieve cyphertext from
  -o OUT, --out=OUT     Filename of output

To test:


$ unisteg.py -s --url-cover "http://www.theholyquran.org/sura_print.php?kid=1&sid=2" -e latin5 -o steg.txt "this is a test"
$ unisteg.py -u --file-steg steg.txt

This software is distributed under a BSD license with the endorsement restriction clause removed.

Exocortex Paper

28-Dec-08

I have finished my independent study course titled Exploring the Exocortex.

I enjoyed it immensely and learned a lot while doing it, only some of which I was able to condense into the paper below.

Some thanks:

  • Dan Grover - for mentioning MontyLingua to me and speeding up the development process many-fold
  • Hugo Liu - for MontyLingua
  • Steven Bird, Edward Loper, and Ewan Klein - for NLTK
  • James Allen - for providing the impetus to choose VerbNet over FrameNet, thus saving me many headaches
  • Timothy Hickey - for advising the course and allowing such non-standard research to take place

The paper:

“Exploring the Exocortex: An Approach to Optimizing Human Productivity” by Michael Katsevman (PDF)

I will publish the code as soon as I finish cleaning it up and packaging it.

Processing User Goals and Narratives

11-Nov-08

In order to model a strategy to reach a goal, we need to parse some user input.

A goal is a particular frame with particular arguments. Each step in a strategy is—in fact—also a goal! Some goals are stubs, certainly.

This feature means that the system understands the underlying details better and better. If once "make a salad" is specified as a step in "make a dinner", and later "make a salad" is narrated, next time "make a dinner" is undertaken, the details of salad-making can be taken into account.

So, how does one undertake processing a narrative?

Each sentence is examined separately. It is an underlying assumption of the system that sentences will be kept simple. So, the goal is one statement, and each sentence in the narrative is a statement.

I am adopting the method described in “A Maximum Entropy Approach to FrameNet Tagging” (2003) by Michael Fleischman and Eduard Hovy.

According to that model, the MaxEnt classifier (I’ll be using an NLTK implementation) will take these features:

  • Phrase type: PP, NP, etc.
  • Voice
  • Position: position in the sentence
  • Grammatical function: external argument, object argument, etc.
  • Head word: the verb in question

From these it decides which frame element (Agent, Cause, etc.) each word in the sentence is.

In addition to those features, an n-gram model may be applied, wherein each subsequent word is also supplied with the classifications of some of the previous words, since once one word is classified as an Agent, another one is unlikely to be one as well.

So, a user tells a simple story, and what do we get? We get a frame tagged with the head word. That is, a Motion frame, for example, would also include the particular verb lemma:

    The boy walked to school

  • Theme: “the boy”
  • Direction: “to”
  • Goal: “school”
  • Head: “WALK”
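
As a rough sketch, the feature list and the n-gram idea described earlier could be turned into training data for this sentence as follows. The dictionary keys, the "function" values, and the helper name featuresets are all hypothetical. The output is a list of (feature dict, label) pairs, which is the shape that NLTK classifier trainers such as nltk.classify.MaxentClassifier.train accept. NLTK itself is not imported here.

```python
def featuresets(constituents, head_word, voice, history=2):
    """Yield (features, label) pairs for one sentence, left to right."""
    previous = []
    for position, c in enumerate(constituents):
        features = {
            "phrase_type": c["phrase_type"],
            "voice": voice,
            "position": position,
            "grammatical_function": c["function"],
            "head_word": head_word,
        }
        # n-gram context: labels already assigned to the preceding constituents.
        for k in range(1, history + 1):
            key = "label-%d" % k
            features[key] = previous[-k] if len(previous) >= k else "<none>"
        yield features, c["label"]
        previous.append(c["label"])


# Hand-annotated version of "The boy walked to school".
# The "function" values are illustrative guesses, not parser output.
sentence = [
    {"text": "the boy", "phrase_type": "NP", "function": "external", "label": "Theme"},
    {"text": "to", "phrase_type": "PP", "function": "oblique", "label": "Direction"},
    {"text": "school", "phrase_type": "NP", "function": "object", "label": "Goal"},
]
training_pairs = list(featuresets(sentence, head_word="walk", voice="active"))
```

At prediction time the previous labels would be the classifier's own earlier guesses rather than gold annotations.
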

The space of strategies is basically a graph of frames. As each frame gets defined in terms of possible subsequent frames, a Hidden Markov Model of narratives is generated.

Then, a wide variety of techniques is available for leveraging HMMs to get us better strategies!
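
A small Python sketch of the "graph of frames" idea, under simplifying assumptions: each narrative is already reduced to a list of frame names, and only frame-to-frame transitions are counted. That makes it a plain first-order Markov chain, the transition half of an HMM, with no hidden states or emission model. The frame names other than Motion are illustrative.

```python
from collections import Counter, defaultdict


def transition_probabilities(narratives):
    """Estimate P(next frame | current frame) from sequences of frame names."""
    counts = defaultdict(Counter)
    for frames in narratives:
        for current, following in zip(frames, frames[1:]):
            counts[current][following] += 1
    return {
        current: {f: n / sum(followers.values()) for f, n in followers.items()}
        for current, followers in counts.items()
    }


# Two toy narratives for the same goal, already tagged with frames.
narratives = [
    ["Motion", "Commerce_buy", "Cooking_creation"],
    ["Motion", "Cooking_creation"],
]
probs = transition_probabilities(narratives)
```
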

Douglas Engelbart - Augmenting Human Intellect: A Conceptual Framework

15-Oct-08

In my paper reports I focus on materials that are relevant to my goals, rather than a general and exhaustive overview of what the papers discussed. I will concentrate on presenting the pertinent ideas I have gleaned from these sources. I will include asides by myself—i.e. comments on the material—within blockquotes.

As one of my initial papers I chose a very important work by one of the luminaries of human-computer interaction, Douglas Engelbart, who is best known for inventing the computer mouse. Augmenting Human Intellect: A Conceptual Framework is a fairly hefty research report describing an approach to augmenting human intellectual capabilities.

  • Engelbart follows the common model of human cognition as a sensory-mental-motor complex. Inputs are provided by the senses, processed via some mental system, and then various motor functions output the results back into the world.
  • Problems are approached by humans by creating solutions that are broken down into many processes and subprocesses. These process collections are called process hierarchies.

    These are what I have chosen to call strategies, and each (sub)process is essentially equivalent to a frame.

  • Different process capabilities of an individual—i.e. the actions the individual may perform—form that individual’s repertoire hierarchy.
  • Goals/problems are treated as general things that stand for general solutions, e.g. memorandum would represent the sequence of actions involved in writing a memo.

    It seems that the goals, as described by Engelbart, are similar to the concept of prototypes.

  • Engelbart provides a figure representing a fun experiment he conducted. In order to figure out how one may augment a human further, one must understand better how we have been augmenting ourselves up to now. So, this experiment has to do with “de-augmenting” an individual. First, the subject wrote “Augmentation is fundamentally a matter of organization” using a typewriter, taking only a few seconds. Then, the subject produced the statement in cursive, much more slowly. Then the experiment of “de-augmenting a human by attaching a brick to a pen” proceeded. With a brick attached to the pen, cursive writing took markedly longer and the quality of the product dropped markedly.

    Although the nature of the product itself had not changed much, the efficiency and convenience of the activity were greatly reduced, first by eliminating augmenting tools, and then by actively reducing the capability of the remaining tools. This shows that the statement to be written, “Augmentation is fundamentally a matter of organization”, is truly a key point. The organization of the writing procedure into typing improves overall productivity greatly.

  • Augmenting capabilities does not hinge on a particular mental theory, since it is only the selection and efficiency of capabilities that is affected. The exact nature and process of the capabilities is of secondary importance.
  • Then, Engelbart refers to Vannevar Bush’s seminal 1945 article in the Atlantic Monthly, “As We May Think”. He quotes extensively from it, describing Bush’s Memex system (a major inspiration for the World Wide Web). He goes on to note that the Memex offers only the added benefit of speed and convenience over a traditional filing system.

    That is, no new capabilities were truly added. Only that instead of walking through a hall of filing cabinets, recall is fast. Much like a phone call is a mere spatial surrogate of talking in person.

    One of the reasons that Bush’s “predictions” (perhaps self-fulfilling since many inventors and developers were inspired by this article) are so apt is that little technological development remains that is not just an externalization of faculties (i.e. capabilities) that were previously performed less efficiently or maybe wholly internally.

Engelbart lays the foundations of my approach to helping humans achieve goals. I want to derive process hierarchies and repertoire hierarchies by annotating strategy narratives using FrameNet, so that the system may select an optimal process hierarchy for each goal (at each point in time, the optimal strategy may most certainly change based on further input).

References:

  • D. C. Engelbart, “Augmenting human intellect: A conceptual framework,” Stanford Research Institute, Tech. Rep., October 1962. (HTML | PDF)

Exploring the Exocortex

08-Oct-08

For the past several weeks I have been working on an Independent Study course at Brandeis University which I have titled Exploring the Exocortex: Machine Learning for Human Behavior, advised by Professor Tim Hickey.

Originally conceived as an attempt to use biologically inspired machine learning techniques such as neural nets and genetic algorithms towards modeling and then improving day-to-day human behavior, the course has moved towards a more direct path to solving that problem.

I have read several papers and chapters in books, summaries of which I will post soon. In the end, this series of posts (which may be followed using the category exocortex on this blog) will be adapted and augmented into a paper, which I will also post here. I believe that research should be done openly and publicly, and so, that’s just what I shall do.

What is the exocortex?
To the best of my knowledge, this is a term coined by researcher Ben Houston–and popularized by science fiction author Charlie Stross–to describe the various systems humans may use in thinking but which are not part of our bio-brain. Already, our Blackberries, iPhones, and other essential electronic devices are proto-exocortices (yup, the plural isn’t pretty).

Why am I working on the exocortex?
As human civilization has grown, we have increased in complexity. Some welcome this, some don’t. Some believe that it will lead to some sort of Singularity. The Flynn Effect most likely is a result of humans attempting to adapt to this environment which is growing exponentially more complex. Already the problems of an Attention Economy, pioneered by the same people who pioneered modeling human behavior and augmenting human cognition, are apparent: There are more things one must pay attention to, within the same time constraints and physical limitations.

Thus, it seems obvious to me that to cope with this information load and, more importantly, this attention load, humans must create appropriate tools. The exocortex is a collective name for those tools.

What do you mean by “Human Behavior”?
I am planning to specifically tackle the problems I have the greatest difficulty allocating attention to: those pesky appointments and other things one might put on a calendar. These things have a relatively high importance, and also allow pretty easy assessment of goal completion.

How do you plan to work on that?
I am devoting this course to creating a document detailing what I believe is a path of least resistance to a piece of software that can model strategies for goal completion and evaluate the best ones. If time permits, I will implement as much of it as I can.

Here’s a rough plan for such a system:

  • Goals are inputted into the system.
  • User provides goal strategy by narrating real-life activities.
  • User strategy narrative is annotated using frames from FrameNet.
  • Goal-completion satisfaction is rated by user. This is somehow applied to constituent frames.
  • Process is repeated, and different frames are assigned different valuations based on perceived contribution to goal completion.
  • System provides best set of frames to form optimal strategy for the completion of each goal.
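
One naive reading of the rating steps in this plan, sketched in Python. The plan only says the rating is "somehow applied" to constituent frames, so everything here is an assumption: each frame gets the running average of the ratings of the strategies it appeared in, and candidate strategies (assumed non-empty) are compared by the mean value of their frames. The class name FrameValuations is hypothetical.

```python
from collections import defaultdict


class FrameValuations:
    """Running average of user satisfaction for each frame."""

    def __init__(self):
        self.total = defaultdict(float)
        self.count = defaultdict(int)

    def rate(self, frames, satisfaction):
        """Credit every distinct frame in a narrated strategy with the rating."""
        for frame in set(frames):
            self.total[frame] += satisfaction
            self.count[frame] += 1

    def value(self, frame):
        """Average rating seen for this frame, or 0.0 if it was never rated."""
        if not self.count[frame]:
            return 0.0
        return self.total[frame] / self.count[frame]

    def best(self, candidate_strategies):
        """Pick the candidate whose frames have the highest mean valuation."""
        return max(
            candidate_strategies,
            key=lambda frames: sum(map(self.value, frames)) / len(frames),
        )
```

Equal credit for every frame is the crudest possible choice; it cannot tell which step actually contributed to the outcome.
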

How does this system differ from a PIM, e.g. on a Blackberry?
Various calendar systems may provide reminders, perhaps with some intelligence noticing your location etc., that assign to you the task of evaluating your current strategy and seeing if it matches a hypothetical optimal strategy for accomplishing the goal specified in the reminder. This is an attention heavy process. Instead, I would like to move as much of the strategy modeling and evaluation as possible out to the exocortex. Even when strategy evaluation is not yet optimal within the system, merely providing concrete strategy options should reduce the attention needed by the user to evaluate a course of action.

Are you really going to do this, and not let it stagnate like you’ve done with Gargoyle?
Well, I’m still working on Gargoyle, slowly but surely! Many new things in my life have taken time I could spend on it (some will be revealed soon).

This exocortex project, however, is guaranteed within the semester time frame as a grade depends on it. So, you can be assured of results. I hope you’re interested and excited, because I am!

Science Fiction — Narratives That Let Us Grow

22-Sep-08

Science Fiction. I like reading it. But I find just about anything calling itself that in video form to be very disappointing.

Why do I like science fiction? Well, one could pick any of numerous tropes and assume it has some attraction to me. Perhaps it’s space, or aliens, or AI. Maybe I’m just attracted to the futurism aspect, the extrapolation of current circumstances and the examining of possibilities. Well, although the last one may approach what SF means to me, none of these reasons really captures its appeal.

Let’s start with what bothers me about most SF in movies and on TV. In just about every case, I can create a description such as “Soap Opera… IN SPACE” (that’s Star Trek) or “Jesus/Harry Potter/Frodo Baggins*… IN SPACE” (that’s Star Wars). Well, what’s the problem with that? Assuming whatever I put before “IN SPACE” or “WITH LASERS” or “IN THE NOT TOO FAR OFF FUTURE” is valid and valuable, there really isn’t a problem, right?

Various forms of narratives, be they prose, poetry, movies, tv series, oral histories, what have you, attempt to present and explicate some aspect of humanity, the universe, and our experiences in it. Since this encompasses just about everything, well, there aren’t real limits. Much of what we would call “fiction” or “drama” works within the bounds of the “real world”. This means that although the characters may have never existed, their experiences are set within an environment that we right now, or in the past, may have found easily possible. Science Fiction, and really just about everything one could call Speculative Fiction, instead sets the interactions within a reality stretched somehow. Perhaps it’s stretched into the future. Perhaps it’s stretched merely beyond the bounds of the mundane (as in magical realism).

So, in both the SF I like and the SF I dislike, this stretching seems to take place, so what is my problem? Well, it is that the stretching of reality must serve a purpose. Just as most narratives cannot be truly mundane, for then we could not extract anything from them to add to our lives, the stretching of reality cannot have utterly no effect on the narrative itself. This is, however, exactly the case in my “…IN SPACE” examples. The same lessons, ideas, memes, and emotions can just as easily be gleaned from the narrative even with vast shifts in environment.

The SF I enjoy most, however, makes the stretching of reality integral. That is the whole point. It means to stretch reality in some way, and then examine, speculatively, the effects on the aforementioned humans, the universe, and their mutual interactions. Change the environment, and suddenly things just don’t make sense. Isaac Asimov’s Foundation series may have been roughly patterned on Gibbon’s The Decline and Fall of the Roman Empire, but without pretty serious revisions it can’t be easily placed within another context. His Robot stories fundamentally examine the interactions of robots and humans. You can’t merely replace “robots” with something else and retain narrative cohesiveness.

My point of view is limited by what I can experience. “Plain” Fiction can provide many new points of view, which is truly necessary, in my opinion. SF, however, goes even further beyond that. It provides not only new ways to view similar things, but it creates wholly novel things and shows different ways of looking at those. This means that not only does it train one to perhaps see things as others see them, but it also allows a better understanding of how others might see things that are not yet existing. That is, faced with different choices, not only can the consequences be plotted, but also a more complex set of potential multiple understandings of the choices.

Yes, this may sound like an overly practical end for narratives. But I believe that whether we want to or not, we internalize the narratives we consume and then proceed to reapply them as sorts of priors. If we merely wallow in archetypes the narratives that we ourselves create will be constrained.

I want our minds to be free, our future to be thick with possibility, and our past to be replete with the ambiguities that it truly contained rather than the mere certainty of what came to pass.

* A young, orphan-ish male living a mundane life discovers that his father (or uncle) was rather greater than he seemed, and receives a relic/gift from him (usually arbitrarily). He proceeds to go on an adventure of discovery, gaining much wisdom in the process.

Moviesneak — Stretch your movie ticket dollar

07-Jul-08

Ever sneak from one movie to another at the movie theatre, after only paying for one ticket? Well, I have. And I like it.

This can often be done by looking at the showtimes ahead of time and finding movies that are timed relatively near each other. Although this is sometimes thwarted when the theatre places temporally adjacent movies in different sections of the building, there are often opportunities for this type of sneaking.

But instead of bothering to study movie showtime tables and such, I’ve written a little python script to do it for me. It’s rather rudimentary. Here’s how to use it:

  1. Get moviesneak.py
  2. Get showtimes for a theatre in your area.
    1. The format moviesneak accepts is Google Showtimes, which can be accessed by googling showtimes <zipcode>, or using this url: http://www.google.com/movies?near=<zipcode>.
    2. Then select all the showtimes in a particular movie theatre, starting from the review stars, not including the theatre’s address.
    3. Copy these into a file somewhere.
  3. Run moviesneak: moviesneak.py showtimes-file [optional time threshold in seconds].
    • By default, the threshold is 15 minutes (900 seconds). The threshold is how long before or after the end of one movie the following one should be. So a 15 minutes threshold means the following movie can start 15 mins before the current one ends, or 15 after (remember that even though it starts 15 mins before the current ends, you won’t have to leave early, you’ll only miss the previews)
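
The threshold rule described in this list can be sketched in a few lines of Python. This is not the code of moviesneak.py: it skips the Google Showtimes parsing and assumes the showings are already available as (title, start time, runtime in minutes) tuples. The name sneak_pairs and the movie titles are hypothetical.

```python
from datetime import datetime, timedelta


def sneak_pairs(showings, threshold=900):
    """Yield (first, second) showings where second starts near first's end.

    showings: list of (title, start datetime, runtime in minutes).
    threshold: allowed gap in seconds, before or after the first movie ends.
    """
    window = timedelta(seconds=threshold)
    for title_a, start_a, runtime_a in showings:
        end_a = start_a + timedelta(minutes=runtime_a)
        for title_b, start_b, _ in showings:
            if title_a != title_b and abs(start_b - end_a) <= window:
                yield (title_a, start_a), (title_b, start_b)


# Made-up showings for one evening at one theatre.
showings = [
    ("Movie A", datetime(2008, 7, 7, 19, 0), 100),
    ("Movie B", datetime(2008, 7, 7, 20, 30), 95),
    ("Movie C", datetime(2008, 7, 7, 21, 30), 120),
]
pairs = list(sneak_pairs(showings))
```

Here Movie A ends at 20:40 and Movie B starts at 20:30, ten minutes before that, so the pair is inside the default 15-minute window.
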

The code is under a BSD license, so just about anything can be done with it. If you have time to muck with BeautifulSoup or just plain regex, it would be nice if moviesneak could query google showtimes (or any other source) directly. The algorithm is very simple, and pretty ugly, but it can be plugged into pretty surroundings easily (such as a web interface).

Edit: I haven’t bothered writing a chaining algorithm (i.e. find the longest sequence of contiguous movies) mostly because I really can’t watch more than 2 movies in a row at a movie theatre. However, one shouldn’t be hard to write, since the corpus size is small enough even horrible algorithms will chug through quickly. Maybe when I’m bored enough/free enough I’ll write one to extend moviesneak.
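
For what it is worth, a chaining search along those lines might look like the brute-force Python sketch below. It is one of the "horrible algorithms" mentioned above: a depth-first search that tries every possible continuation. The input format of (title, start time, runtime in minutes) tuples and the name longest_chain are assumptions, not part of moviesneak.

```python
from datetime import datetime, timedelta


def longest_chain(showings, threshold=900):
    """Brute-force search for the longest run of back-to-back showings."""
    window = timedelta(seconds=threshold)

    def follows(a, b):
        # b may start up to `threshold` seconds before or after a ends.
        end_a = a[1] + timedelta(minutes=a[2])
        return abs(b[1] - end_a) <= window

    def extend(chain):
        best = chain
        seen = {title for title, _, _ in chain}
        for nxt in showings:
            # Never watch the same title twice; this also rules out cycles.
            if nxt[0] not in seen and follows(chain[-1], nxt):
                candidate = extend(chain + [nxt])
                if len(candidate) > len(best):
                    best = candidate
        return best

    return max((extend([s]) for s in showings), key=len, default=[])
```

The search is exponential in the worst case, but a single theatre's daily schedule is small enough for that not to matter.
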

Teaching

23-Jun-08

The primary task of a useful teacher is to teach his students to recognize ‘inconvenient’ facts - I mean facts that are inconvenient for their party opinions. And for every party opinion there are facts that are extremely inconvenient, for my own opinion no less than for others. I believe the teacher accomplishes more than a mere intellectual task if he compels his audience to accustom itself to the existence of such facts. I would be so immodest as even to apply the expression ‘moral achievement’, though perhaps that may sound too grandiose for something that should go without saying.

An svnwiki in Python

20-Jun-08

On a whim, I wrote a little wiki type thing in Python that sits on top of an SVN repo. It’s incredibly basic, and basically lets you browse a repo and edit files. Natively, it supports markdown as its default display mechanism, but it would be trivial to teach it the meaning of file extensions and have other view templates.

The intended use was for a personal notebook type thing, which I decided to abandon. Basically, it’s a lot like Jottit, except you actually have all your data, and can replicate it between locations. Yes, it also sounds like git-wiki, but I only found out about that after I had finished coding this version.

Although I decided not to use it, someone else may find it useful, or at least the codebase. I am offering it here with absolutely no warranty, and you can use it however you like, with or without attribution. Since this was built for private use, i.e. no public access, I was going to integrate grep and other such utilities into it, leveraging unix text processing for search, mass editing, etc. So one may be interested in continuing that. I did not implement a facility for adding pages easily, though that’s a trivial piece of coding.

Code: svnwiki.tar.gz Change the base variable to reflect the location of your repo.
Requirements:

  • web.py
  • pysvn (on debian this is the python-svn package rather than python-subversion)
  • markdown
