Wordseye Is Artistic Software Designed To Depict Language

Developed by two young university prodigies, Wordseye turns syntax into images.

While their fellow students at Columbia University were reading science-fiction books about intergalactic battles on distant planets, Bob Coyne and Richard Sproat were developing the uncanny Wordseye project. The premise of the software is disarmingly simple: the user describes a scene or situation in a short text, and the program then uses this verbal input to generate a visual representation. It functions like an image-and-text version of the Surrealist game exquisite corpse, a back-and-forth volley of interpretation between human and machine in which we get a real sense of just how well we understand each other (or not).

The results are sometimes so weird and abstract that we wanted to have a few words with one of the project’s creators, Bob Coyne.

The Creators Project: I’ve never had the chance to meet a PhD student in computational linguistics.
Bob Coyne: I’ve always loved and been fascinated by language, especially its connotative/poetic/associative aspects. A single word can evoke so much and can mean such different things in different contexts. In order to understand all the possible associations, you have to be able to understand the more literal/prosaic meaning first. So my interest combines wanting to understand how it all works with an interest in the artistic-expression side of an artificially intelligent system. When I was in college I wrote some poetry-generating software… so that was the start of it. Then I worked in computer graphics for quite a while, but I found that I was more interested in pictures as language (how do they represent and connote meaning?) than in the pixel-level aspects. (Similar, I think, to Duchamp’s position against “retinal art.”)

What was the original aim of Wordseye?
I had worked in computer graphics for quite a while (about 15 years) and never really had the time to use the tools I created. There’s always so much preparation and work that goes into creating graphics. So the idea came to me that it would be great to create graphics very quickly by just describing what you wanted. You’d give up a lot of control over detail, but you could create stuff really fast, assuming the system could interpret the text input. I also liked the indeterminacy of not knowing exactly what you’d get. My original intention was to have it create animation, but I quickly realized that static scenes (like comic-book panels) would be more feasible and actually more interesting in some ways.

And how did you choose the graphics?
We licensed a library of 3D models, so we use whatever’s in that. There are about 2,000 different objects—probably half a dozen dogs, a couple dozen types of tables, etc. Sometimes the models are very generic, more like the Platonic form, perhaps, and sometimes they’re more specialized subtypes. For example, if you type “the cat is on the table” into the system, you might get an ordinary table, like a kitchen table, but you could also get a pool table or blackjack table, etc. The system lets the user specify textually which one they want (e.g., by typing “kitchen table”), or they can choose graphically from a list of possibilities.
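To make that selection process concrete, here’s a minimal sketch of the idea in Python. Everything in it (the `MODEL_LIBRARY` contents, the `resolve_model` function, the default-choice rule) is hypothetical, invented to illustrate the behavior described above rather than taken from Wordseye itself.

```python
# Hypothetical model library: each generic noun maps to the specific
# 3D models available for it. Contents invented for illustration.
MODEL_LIBRARY = {
    "table": ["kitchen table", "pool table", "blackjack table"],
    "dog": ["beagle", "bulldog", "poodle"],
}

def resolve_model(noun, subtype=None):
    """Pick a concrete 3D model for a noun mentioned in the input text.

    If the user textually specified a subtype ("kitchen table"), use it;
    otherwise fall back to the first model as a generic default (a real
    system might instead show the user the list of possibilities).
    """
    candidates = MODEL_LIBRARY.get(noun)
    if not candidates:
        raise KeyError(f"no 3D model for {noun!r}")
    if subtype in candidates:
        return subtype
    return candidates[0]

print(resolve_model("table"))                # -> "kitchen table" (default)
print(resolve_model("table", "pool table"))  # -> "pool table"
```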

One interesting thing here is that language and graphics can be vague in different ways. If you say “table,” you don’t [automatically] know what type. But when you see a table, it’s always a particular type, even if that type is generic. On the other hand, the intentionality in pictures can be much more open to interpretation than in language. If you see two people facing each other, seated at a table, you don’t really know what they’re doing. They might be trying to figure out a problem, or thinking about something, or just chatting, or looking at something together. We’re currently working on extending the system to handle the depiction of verbs and actions.

But what happens if a user types something the software doesn’t understand?
There are many levels to what it doesn’t understand. At the simplest level, there might be a word it doesn’t know or an object it doesn’t have. For example, say you type “An armadillo is on the road” and it doesn’t know the word “armadillo” (it actually does know it, but let’s just assume). It will then create a 3D object by extruding the letters of the word “armadillo” and put those letters, as 3D objects, on the road. This is actually kind of amusing, and often informative: if you misspell something, the misspelled word, which it can’t find in its dictionary, shows up right in the scene. Alternatively, it can pick a related object. So if you type in “robin” and it doesn’t have one, it can put another type of bird in the scene as a substitute.
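That fallback chain (known object, else a related substitute, else extruded 3D letters) can be sketched in a few lines. This is a hedged illustration; all the names here (`KNOWN_OBJECTS`, `RELATED`, `place_object`) are hypothetical stand-ins, not Wordseye’s internals.

```python
KNOWN_OBJECTS = {"road", "sparrow", "cat", "table"}  # words that have 3D models
RELATED = {"robin": "sparrow"}                        # crude relatedness map

def place_object(word):
    """Decide what actually ends up in the scene for a given noun."""
    if word in KNOWN_OBJECTS:
        return f"3D model of {word}"
    if word in RELATED:
        # Substitute a related object the library does have.
        return f"3D model of {RELATED[word]} (stand-in for {word!r})"
    # Last resort: extrude the letters of the word into 3D objects,
    # which usefully exposes misspellings right in the scene.
    return f"extruded 3D letters spelling {word!r}"

print(place_object("armadillo"))  # extruded letters placed in the scene
print(place_object("robin"))      # a sparrow stands in for the robin
```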

Other types of things are handled less gracefully. For example, objects have parts, and it really doesn’t know much about the parts of objects. So sometimes it’ll think the part you refer to is a separate object. I could go on and on about all the things that can confuse it. There are so many that the user really has to adapt to the system and work within the realm of things it can handle. Though, of course, we’re trying to extend that range.

In many cases, the resulting images seem a bit surrealistic. I think it’s oddly paradoxical because surrealism is supposed to be the pure automatism of the human psyche. At least, the Surrealists wanted us to believe it was.
Interesting observation and question. I think that one of the key “principles” of surrealism is the separation of form and function. So rather than a system of this sort representing our thoughts with perfect accuracy, I think it can instead represent the blanks in our thoughts. One thing I find with the system (and this relates to how pictures can be both more specific and more vague than language, in different ways) is that you get a picture from some very low-level descriptive text and can then give it a title that lends it a completely different interpretation than the one you originally had. So finding the title becomes part of the process. A process like this: thought/intention → low-level description → picture → interpretation → picture with title. Part of the surrealist effect is also the rendering process and lighting; computer graphics, in general, tend to look more surrealistic. But I think there’s more to it than that, and that the “pure computational automatism” of the system, coupled with the bounce-back to the human to interpret and discover a title, adds a truer type of surrealism.

Speaking of which, what are your thoughts on the technological singularity?
I think we’re quite a ways from the point when computers can be truly intelligent. Language is a good test area for it, though. Much of the work going on in language processing is statistical in nature and doesn’t really model meaning. A lot can be done with statistical models (e.g., Google search or rough machine translation), but I think those methods will hit a wall. So some people are overly optimistic about how fast things will progress.

Ah, so computers are doomed.
No, I just think there will be roadblocks along the way. In the ’80s there was great hope for AI and expert systems, etc., but none of that really panned out. Now there’s great hope for machine learning techniques to scale up in the same way. But ultimately, I think even if you had an infinite amount of hardware power, there’s still the issue of what you do with it to achieve machine intelligence. Somehow the answer of just simulating the brain doesn’t seem satisfactory (assuming it’s even feasible) without being able to tap directly into the symbolic nature of our thoughts. So that will require a better understanding at the conceptual level of language and semantics. I think the link between language and the world (at least as perceived) is the real hard nut to crack. Language implies semantics of some sort, and that semantics is grounded in our models and perceptions of the world. These are difficult questions, like how symbol systems operate and encode meaning.

There is actually software that can approximately guess who wrote a book.
Yes, they look for statistical similarity in word choice, syntactic patterns, etc. These sorts of things are very sensitive to genre, however: if someone is writing a tweet versus a PhD thesis, there won’t be much similarity even though it’s the same author. But these statistical methods are very effective for what they can do. I just don’t see them scaling up to true intelligence/understanding without better ways of modeling the semantic content of language, even though you can easily make people think that a piece of software is really intelligent.
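As a toy illustration of the kind of statistical similarity Coyne describes, the sketch below compares two texts by the cosine similarity of their word-frequency vectors. Real stylometry systems use much richer features (function-word rates, syntactic patterns); the `style_similarity` function here is a hypothetical bare-bones stand-in.

```python
import math
from collections import Counter

def style_similarity(text_a, text_b):
    """Cosine similarity of bag-of-words frequency vectors (a toy feature set)."""
    a = Counter(text_a.lower().split())
    b = Counter(text_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()))
    norm *= math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

# Same author, different genres (a tweet vs. a thesis) can still score low:
print(style_similarity("omg my cat is so cute lol",
                       "we propose a statistical model of authorship attribution"))
# Similar register scores much higher:
print(style_similarity("omg my cat is so cute lol",
                       "omg my dog is so cute lol"))
```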

It’s pretty limited, though. Cornell students took a chatbot with pre-made answers and connected it to itself. The result is a silly dialogue.
That’s hilarious! You realize quickly that they have no idea what they’re saying. A very old AI program called Eliza did a similar thing… it gave vague and plausible-sounding replies to its human conversation partner. It helped that Eliza was playing the role of a psychologist, so statements of the sort “How does it make you feel that…?” could make it seem like it truly understood the content.
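Eliza’s trick is straightforward to reproduce: match a pattern in the user’s statement and reflect a fragment of it back inside a canned template, with a vague fallback when nothing matches. Here’s a minimal sketch of that idea; the rules and phrasing are invented for illustration, not Weizenbaum’s original code.

```python
import re

# Eliza-style rules: reflect the user's own words back inside a template.
RULES = [
    (re.compile(r"\bi feel (.+)", re.I), "Why do you feel {}?"),
    (re.compile(r"\bmy (.+)", re.I), "Tell me more about your {}."),
]
FALLBACK = "How does that make you feel?"  # vague reply when nothing matches

def reply(statement):
    for pattern, template in RULES:
        match = pattern.search(statement)
        if match:
            return template.format(match.group(1).rstrip(".!?"))
    return FALLBACK

print(reply("I feel nobody listens to me."))  # Why do you feel nobody listens to me?
print(reply("The weather is nice."))          # How does that make you feel?
```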

Have you ever felt outsmarted by a piece of software?
With Wordseye, there have been various times when I couldn’t figure out why it came up with the picture it did, but after thinking about it, I realized it was interpreting things in a different way… not necessarily in a plausible way (for humans), but at least in a possible way. My hope is that as the software progresses there will be more and more interesting and surprising “misinterpretations.” These really highlight where our interpretation of language is guided by context, default expectations, etc.

Do you think you’ll be able to make Wordseye represent abstract notions?
Yeah, one way of depicting abstract notions is to take the metaphorical language they’re often expressed in and depict it literally. [For example], for “time flies” you could have a scene with a watch with wings hovering over a landscape. Another strategy is to use iconic elements in scenes to represent abstractions, for example using a not sign (a circle with a slash through it) to enclose the objects for the concept being negated. A third is to leave the scene vague, maybe by just putting the referenced objects next to each other, and let the user infer the abstract relation. The original version of Wordseye handled some of these, and we’ll be adding that into the current version as well. It should be ready in the next couple of months.
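Those three strategies are easy to picture as a simple dispatch. Below is a purely illustrative Python sketch; the `depict_abstract` function, the `METAPHORS` table, and the trigger conditions are all hypothetical, not Wordseye’s actual logic.

```python
# Known metaphor-to-scene mappings, invented for illustration.
METAPHORS = {"time flies": "a watch with wings hovering over a landscape"}

def depict_abstract(phrase, objects):
    """Return a scene description for an abstract phrase (toy logic)."""
    # 1. Depict the underlying metaphor literally.
    if phrase in METAPHORS:
        return METAPHORS[phrase]
    # 2. Use an iconic element, e.g. a not sign around negated objects.
    if phrase.startswith(("no ", "not ")):
        return f"a not sign enclosing {' and '.join(objects)}"
    # 3. Leave the scene vague and let the viewer infer the relation.
    return f"{' next to '.join(objects)}, relation left to the viewer"

print(depict_abstract("time flies", []))
print(depict_abstract("no smoking", ["a cigarette"]))
print(depict_abstract("friendship", ["a man", "a woman"]))
```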