Annotation is usually seen as a task undertaken by domain experts (or at least those who have had significant training in a given annotation schema and understand the content being annotated). What if we could get non-experts to do the same, thereby parallelizing effort and dramatically reducing the amount of training required? We explore the feasibility of leveraging crowd workers to annotate text (e.g., Ubuntu IRC logs) with complex labels (e.g., entities) that require expertise they may not possess.
I am also a part of Project Sapphire, a partnership with IBM Research (Project Sapphire is within IBM's broader Cognitive Horizons Network), as well as a part of the Michigan Interactive and Social Computing (MISC) group.
In-home robots offer the promise of automatically handling mundane daily tasks, thereby improving access for people with disabilities and providing on-demand access to remote physical environments. Unfortunately, while robotic motion planning and manipulation have been active areas of research in recent years, the ability to understand never-before-seen scenes with unknown objects remains an open challenge. We are developing EURECA, a system that leverages crowds to help understand 3D scenes in near real-time and introduces novel computer-assisted selection tools that allow groups of non-experts to segment objects more quickly and accurately than existing methods. Further, we can do this in a privacy-preserving manner that is appropriate for home and office settings.
See: S.R. Gouravajhala, J.Y. Song, J. Yim, R. Fok, Y. Huang, F. Yang, K. Wang, Y. An, W.S. Lasecki. Towards Hybrid Intelligence for Robotics. In Collective Intelligence Conference (CI 2017). New York, NY.
Crowd-powered conversational systems, in which ever-changing groups of remote human workers collectively hold a conversation with end users, can help bootstrap automated dialog systems by generating training data in real scenarios, and can succeed where well-trained automated approaches fail. However, since no one worker is present during all sessions, these systems fail to remember all relevant information from interactions that span multiple sessions, and conversational context is lost over time. We introduce Mnemo, a crowd-powered dialog system plug-in that uses collective processes and automated support to maintain a “collective crowd memory” of user conversations. This memory is built from crowd-generated facts that workers predict will be important when a given user discusses a similar topic again in the future.
LegionTools is a software tool that allows researchers and end users to easily recruit and route Amazon Mechanical Turk (AMT) workers for synchronous real-time tasks. LegionTools provides an easy-to-use interface for requesters whose HITs require richer integration with their projects, and makes it easier to recruit, retain, route, and bonus workers for synchronous tasks.
See: the LegionTools interface here. Please email me if you would like more information!
Conversing with Data Using Crowds
We introduced a framework for using the crowd as a middle-layer that allows intelligent ML systems to become better at interacting with, and understanding, large datasets.
See: S.R. Gouravajhala, D. Koutra, W.S. Lasecki. Towards Crowd-Assisted Data Mining. In CHI Workshop on Human Centred Machine Learning (HCML 2016). San Jose, CA. 2016.
Master's Degree Work
Light-Emitting Data: Inferring Web Browsing Activity From Router LED Blinks
Developed a system and algorithms to fingerprint webpages based solely on router LED blink patterns. We propose a modified edit-distance-based algorithm that uses $k$-NN to classify webpages, and we evaluate it using threat models that reflect real-world uses of privacy-enhancing technologies (e.g., VPNs and private browsing windows).
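As a rough illustration of the general idea of pairing an edit distance with $k$-NN, here is a minimal Python sketch. It is not the modified algorithm from this project: it uses the standard Levenshtein distance, and the encoding of blink traces as strings of `0`/`1` characters, the function names, and the example labels are all hypothetical assumptions made for the example.

```python
from collections import Counter


def edit_distance(a, b):
    """Standard Levenshtein distance between two sequences.

    Assumption: this is the textbook, unmodified distance, used here
    only as a stand-in for the project's modified variant.
    """
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def knn_classify(trace, labeled_traces, k=3):
    """Label a trace by majority vote among its k nearest labeled traces."""
    neighbors = sorted(labeled_traces,
                       key=lambda item: edit_distance(trace, item[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical training data: blink traces encoded as on/off strings.
training = [
    ("110100", "site-a"),
    ("110101", "site-a"),
    ("111100", "site-a"),
    ("001011", "site-b"),
    ("001111", "site-b"),
]

predicted = knn_classify("110110", training, k=3)
```

With this toy data, the three traces closest to `"110110"` all carry the `site-a` label, so that is the predicted class. A real fingerprinting pipeline would also need to decide how raw LED observations are turned into sequences, which this sketch does not address.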
GrayFuzz: Fuzzing Using Side Channels
Developed a system and framework that leverages information leaked through side channels (i.e., power and timing) to map out the internal state machine transitions of a device, which allows for more intelligent fuzzing.
Stigmalware: Investigating the Prevalence of Malware in the Clinical Domain
Detection of anomalous signals and malware signatures in medical device network flow traffic and darknet flows. Designed heuristics to filter suspicious traces from connection information, including performing timing analysis and graph analysis.