Project Description

Project Overview

Learning is an active area of research, both from the psychological perspective of discovering how people learn and from the artificial intelligence perspective of trying to program computers to mimic human learning abilities. While learning is often thought of as gathering knowledge, much learning also takes place on a more abstract level, for example in forming generalizations about the things we encounter, in categorizing objects, or in learning the structure of the knowledge we acquire. Although it is not hard to make computers adept at gathering information (e.g. Google), it is much harder to program computers to learn how to learn.

Josh Tenenbaum, the mentor of my UROP, is currently investigating categorical learning, where the inputs are various objects and the output is a categorization of those objects. The simplest form of learning to learn, in this context, is a first-order generalization about categories. For example, if a learner is categorizing animals, the learner might find that humans have two legs and no tail, tigers have four legs and a tail, and spiders have eight legs and no tail. If the learner is categorizing letters, the learner might discover that most Latin letters are the same thickness, lack small details, and are mostly connected (i.e. each letter is a single piece), while most Chinese characters are similar in thickness, have many details, and are made of multiple pieces. There are also second-order generalizations about categories. Within any species of animal, the number of legs and the presence of a tail are generally constant, while the size and surface patterns may vary widely. In the case of letters, a learner may realize that the level of detail and the connectedness of characters are generally constant within an alphabet, but vary between alphabets. Learners who make second-order generalizations can form reasonable hypotheses about new categories on the basis of a very small number of samples. Seeing only a single character from a new alphabet, say one formed without much detail in a single piece, a learner who thinks that connectedness and level of detail are similar within a category can reasonably infer that this alphabet has letters consisting of a single connected piece without small details.
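The idea of a second-order generalization can be made concrete with a small sketch. This is not the model used in the project; the function name, the feature names, and all of the numbers below are invented for illustration. It simply measures, for each feature, how much of the overall variation disappears once samples are grouped by category. A feature with a score near 1 (such as the number of legs) is nearly constant within a category, so a single sample from a new category is a good guide to that feature for the whole category.

```python
from statistics import pvariance

# Hypothetical toy data (all values invented): each category maps to a
# list of samples, and each sample is a dict of numeric features.
animals = {
    "human":  [{"legs": 2, "size": 0.5},  {"legs": 2, "size": 1.8}],
    "tiger":  [{"legs": 4, "size": 0.9},  {"legs": 4, "size": 2.9}],
    "spider": [{"legs": 8, "size": 0.01}, {"legs": 8, "size": 0.3}],
}

def within_category_consistency(categories, feature):
    """Return 1 - (average within-category variance / total variance).

    A value near 1 means the feature barely varies inside a category;
    a value near 0 means knowing the category tells us little about it.
    """
    all_values = [s[feature] for samples in categories.values() for s in samples]
    total = pvariance(all_values)
    if total == 0:
        return 1.0  # the feature never varies at all
    # Average the per-category variances, weighted by category size.
    within = sum(
        len(samples) * pvariance([s[feature] for s in samples])
        for samples in categories.values()
    ) / len(all_values)
    return 1.0 - within / total

for feature in ("legs", "size"):
    print(feature, within_category_consistency(animals, feature))
```

In this toy data the score for "legs" is exactly 1, while the score for "size" is noticeably lower, which matches the intuition that leg count is a property of the species and size is not.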

The UROP I am doing involves collecting and analyzing many handwriting samples of various characters from various alphabets that the writer may or may not have seen before. The handwriting samples will be used as input data to help design an algorithm that can transfer inferences about a single category (alphabet) to new categories, via a model of second order generalizations. Because the samples will be collected digitally, via Amazon Mechanical Turk, I will be working on campus.

Personal Role & Responsibilities

Twenty handwriting samples of each character in fifty different alphabets have been collected via Amazon Mechanical Turk. Examples can be found at [...]. Russ and Brenden (grad students involved in the UROP) have programmed various algorithms (both general and specific, pixel-based and stroke-based) to attempt various character recognition and categorization tasks. A description of the performance of Brenden's model can (currently) be found at lake_etal_cogsci2011.pdf. I've begun to collect data for a human baseline performance on similar tasks. So far, I've designed and run, on a small scale, an experiment in which subjects are shown an example character for a brief period of time (on the order of 50 ms), which is then covered with noise, and are then shown a second image. The subject must say whether the two images are examples of the same character or of different characters. As this is the first time (as far as I understand what my mentors have told me) that a data set like this has been collected, we're not sure which experiments are best for finding a human baseline. Presently, I'm designing a webpage (the experiment's data collection setup) for collecting data on how long it takes subjects to distinguish between different characters. Other data to be collected might include how well people categorize characters into alphabets, or data more similar to the tasks that the models are tested on, such as, given an example of a character that they've never seen before, picking out which of twenty images (also never seen before) are drawings of the same character as the example. Part of the process will be to discuss and evaluate various ways of asking the question "how good are people at learning characters that they've never seen before?", though I anticipate not being able to contribute much to this decision process. (I do, however, find such discussions between Josh, Russ, and Brenden interesting.)
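As a rough illustration of the same/different experiment described above, here is a minimal sketch of how a balanced list of trials might be generated and scored. It is not the actual experiment code; the function names, the character labels, and the image file names are all hypothetical, and the timing and noise mask would be handled by the page that displays the trials.

```python
import random

def make_trials(samples_by_character, n_trials, seed=0):
    """Build a shuffled, balanced list of same/different trials.

    samples_by_character maps a character label to a list of image IDs
    (at least two images per character, and at least two characters).
    Each trial is a tuple (first_image, second_image, is_same).
    """
    rng = random.Random(seed)  # seeded so a run can be reproduced
    characters = sorted(samples_by_character)
    trials = []
    for i in range(n_trials):
        is_same = (i % 2 == 0)  # alternate to balance the two conditions
        if is_same:
            # Two different drawings of the same character.
            char = rng.choice(characters)
            first, second = rng.sample(samples_by_character[char], 2)
        else:
            # One drawing from each of two different characters.
            char_a, char_b = rng.sample(characters, 2)
            first = rng.choice(samples_by_character[char_a])
            second = rng.choice(samples_by_character[char_b])
        trials.append((first, second, is_same))
    rng.shuffle(trials)  # so the answer cannot be guessed from trial order
    return trials

def accuracy(trials, responses):
    """Fraction of trials where the subject's answer matches the truth."""
    correct = sum(resp == is_same
                  for (_, _, is_same), resp in zip(trials, responses))
    return correct / len(trials)

# Hypothetical image IDs, for illustration only.
samples = {
    "alpha": ["alpha_01.png", "alpha_02.png", "alpha_03.png"],
    "beta":  ["beta_01.png", "beta_02.png", "beta_03.png"],
    "gamma": ["gamma_01.png", "gamma_02.png", "gamma_03.png"],
}
trials = make_trials(samples, n_trials=10)
```

Balancing the two conditions means that a subject who always answers "same" scores exactly one half, which gives a clear chance level against which to compare a human baseline.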

I might help my mentors with the models they'll be running on the data, and I hope to contribute to the paper my mentors are working on.

Goals/Personal Statement

I find the concept of learning interesting, both from the perspective of trying to understand how the mind works and from the perspective of artificial intelligence. I hope to learn more about learning from this UROP. I hope to learn what kinds of questions there are to be asked about learning, and possibly to be involved in discovering answers to some of them. For example, when people are asked to draw a new instance of a character they've never seen before, do they start with a general concept of what a character is and then refine it to better match the given example, do they start with drawings that are close to the given image and then impose the constraint that it must be "character-like", or do they do something else entirely? Additionally, I enjoy programming, and hope to improve my skills by working on the data collection aspect of this UROP. Ultimately, I hope to enjoy my UROP experience, and to better learn what type of research, and which processes involved in research, I enjoy.


Get the Dataset


You can download the data set or download the data set including unreviewed submissions. Data set zip files are (hopefully) updated daily at midnight. Click here to see the last time the data set was updated. You may also download the MATLAB file results.mat.





Please contribute by tracing a random alphabet, to be added to the data set.

Other Tasks


Please email jgross AT mit DOT edu with questions, concerns, or bug reports.