Joshua V. Dillon, PhD

Software Engineer, Google, Inc., Mountain View, California
Contact Information


Research Interests

My research interests lie in machine learning, computational statistics, and information visualization. My application areas are large-scale data problems, primarily text analysis; I am also interested in modeling social, image, biological, and financial data. Broadly speaking, my work addresses two themes:

  1. Methodology: What theoretical performance can we expect of machine learning algorithms?
  2. Application: Can we use these insights to organize and present real world data in a more meaningful and useful way?

These themes have natural interplay—they parallel the iterative process typical in analyzing and modeling data. Central to this interplay is a fundamental question: How can resources be efficiently expended to achieve desirable levels of accuracy?

My thesis, “Stochastic m-Estimators for Controlling Accuracy-Cost Tradeoffs,” addresses this issue by developing a mathematical framework capable of spanning a continuum of explicit and implicit tradeoffs present in machine learning. Abstractly, these tradeoffs consist of exchanging finite physical resources for improved accuracy. More concretely, these limiting factors, or costs, may be computational, such as time-limited cluster access for parameter learning, or they may be financial, such as purchasing human-labeled training data under a fixed budget. This work explores these accuracy-cost tradeoffs by proposing a family of estimators that maximizes a stochastic variation of the traditional m-estimator.

These “stochastic m-estimators” (SMEs) are constructed by stitching together different m-estimators, at random. Each such instantiation resolves the accuracy-cost tradeoff differently, and taken together they span a continuous spectrum of accuracy-cost tradeoff resolutions. My thesis proves the consistency of the estimators and provides formulas for their asymptotic variance and statistical robustness. I also demonstrate their usefulness for:

  1. Controlling the computational complexity of parameter learning in Markov random fields,
  2. Controlling the labeling cost associated with semi-supervised learning (“To label or not to label?”),

as well as a variety of other tradeoffs, e.g., active learning, robust loss functions, and random variate generation.

My advisor is Professor Guy Lebanon, and I frequently collaborate with Doctor Kevyn Collins-Thompson at Microsoft Research.

Dissertation / Defense (handout-4up, handout-6up)


Curriculum Vitae

Education
2011 Ph.D., Computational Science & Engineering,
Georgia Institute of Technology, Atlanta, Georgia. January 2009 – December 2011.
2008 M.S., Electrical & Computer Engineering,
Purdue University, West Lafayette, Indiana. August 2005 – December 2008.
  • GPA: 3.69/4.00
2005 B.S., Computer Engineering & Electrical Engineering,
Michigan Technological University, Houghton, Michigan. August 2001 – April 2005.
  • Summa Cum Laude
  • Double Major (CPU Design, Signal Processing)
  • GPA: 3.90/4.00
Engineering Experience
Current Software Engineer, Google, Inc., Mountain View, California. October 2011 – Present.
  • AdSense.
2005 Intern, ThermoAnalytics, Hancock, Michigan. Spring 2005.
  • Solely designed and implemented a QScript-to-C translator optimized for numerical computing applications. Efforts included lexical analysis, context free grammar specification, developing an abstract syntax tree representation with corresponding auto-typing symbol table, and implementing the semantic (code-emitting) routines.
2004 Intern, IBM, Rochester, Minnesota. Summer 2004.
  • Implemented VHDL logic designs for the floating-point core of the Cell processor. Conducted timing analysis, synthesis, and testing of over 20 logic macros. Significantly improved team turnaround time by automating several report generating tasks (Perl) and implementing a tailored layout prototyping tool (Java).
2003 Intern, IBM, Rochester, Minnesota. Summer 2003.
  • Designed, implemented, and packaged an SAP R/3 cluster management plug-in for iSeries Navigator (Java). Design goals included extensibility, graphical ease-of-use, and an aggressive release cycle to meet clients' demands. Efforts also involved the coordination of domestic and German colleagues. End product was delivered in a fully packaged form, ahead of schedule.
2002 Intern, Michigan Department of Transportation, Cass City, Michigan. Summer 2002.
  • Sole on-site inspector responsible for verifying contractors' adherence to design specifications. Responsible for chemical and physical quality control, logging payable items, and updating project plans.
Research Experience
2011 Research Assistant, Georgia Institute of Technology, Atlanta, Georgia. January 2009 – December 2011.
  • Quantified the asymptotic accuracy of generative semi-supervised learning based on an extension of stochastic composite likelihood.
  • Developed the stochastic m-estimator framework for controlling tradeoffs in machine learning.
2009 Intern, Microsoft Research, Redmond, Washington. Summer 2009.
  • Developed a flexible optimization framework for constraining probabilistic models with imprecise domain knowledge. Applied this framework to find robust pseudo-relevance feedback models for information retrieval that balance notions of expansion reward and risk due to term uncertainty.
2008 Research Assistant, Purdue University, West Lafayette, Indiana. January 2006 – December 2008.
  • Proposed a family of point estimators that resolve the computation-accuracy tradeoff present in maximum likelihood. Proved their consistency and provided formulas for their asymptotic variance and computational complexity. Demonstrated their usefulness for several graphical models including CRFs and Boltzmann machines.
  • Developed the locally weighted bag of words framework for representing sequential text. Applied framework to several text analysis tasks, i.e., classification, segmentation, summarization, and visualization.
  • Investigated statistical machine translation and diffusion kernels for unsupervised metric learning for text documents.
2006 Summer Scholar, DOE Joint Genome Institute and Lawrence Livermore National Laboratory, Walnut Creek, California. Summer 2006.
  • Investigated unsupervised learning techniques for statistical process control. Applied these techniques to the Joint Genome Institute's DNA sequencing process to identify combinations of reagents, machines, and operators that lead to under-performing modes of operation.
2005 Extreme Blue Intern, IBM, Austin, Texas. Summer 2005.
  • Developed unsupervised classification techniques which exploit harmonically related features. Applied this work to automatic error detection in the Linux kernel. Feature engineering included source/binary alignment, resolving dynamic control flow, and system profiling with call-stack resolution.
Teaching Experience
2008 Lecturer, Purdue University. Spring, Fall 2008.
  • Two-semester instructor for an ECE undergraduate course which acquaints students with scripting-language software engineering tools, i.e., Python and Korn shell. Responsibilities included curriculum design and delivering weekly lectures to 60+ students. One of only two graduate student lecturers.
2007 Coordinating TA, Purdue University. Fall 2007.
  • Coordinating TA for “Microprocessor System Design and Interfacing,” an undergraduate course which introduces microprocessor system design, assembly programming, and digital/analog interfaces. Held lab office hours and managed five undergraduate TAs.
Fellowships
2011 Marshall Sherfield Postdoctoral Fellowship, Marshall Aid Commemoration Commission, 2011–13.
2010 DHS Fellowship in Data Analysis and Visual Analytics, Department of Homeland Security, 2010–12.
2005 Ross Graduate Fellowship, Purdue University, 2005–06.
2001 Board of Control—Full Tuition, Michigan Technological University, 2001–2005.
Distinctions
2007 US delegate, 57th Lindau Meeting of Nobel Laureates and Students. Germany, 2007.
  • Sponsored by the NSF Directorate for Mathematical and Physical Sciences.
2005 Summa Cum Laude, Dept. of ECE. Michigan Technological University, 2005.
2002 Award of Excellence. Dept. of Mathematics, Michigan Technological University, 2002.
Other Distinctions
2006 Eta Kappa Nu ECE Honor Society, Beta Chapter. Purdue University, 2006.
2005 Eta Kappa Nu ECE Honor Society, Beta Gamma Chapter. Michigan Technological University, 2005.
2004 Phi Kappa Phi Honor Society. Michigan Technological University, 2004.
2004 Tau Beta Pi Engineering Honor Society, Michigan Beta Chapter. Michigan Technological University, 2004.
2001 Valedictorian, Cass City High School. Cass City, Michigan, 2001.
Publications
Journal
2010 J. Dillon and G. Lebanon. Stochastic Composite Likelihood. Journal of Machine Learning Research (JMLR) (in press), 2010.
2007 G. Lebanon, Y. Mao, and J. Dillon. The Locally Weighted Bag of Words Framework for Document Representation. Journal of Machine Learning Research (JMLR) 8(Oct):2405-2441, 2007.
2007 Y. Mao, J. Dillon, and G. Lebanon. Sequential Document Visualization. IEEE Transactions on Visualization and Computer Graphics (INFOVIS), 13(6), 2007.
Conference
2010 J. Dillon, K. Balasubramanian, and G. Lebanon. Asymptotic analysis of generative semi-supervised learning. In Proc. of the International Conference on Machine Learning (ICML), 2010.
2010 J. Dillon and K. Collins-Thompson. A Unified Optimization Framework for Finding Reliable Pseudo-Relevance Feedback Models. Proceedings of the Nineteenth International Conference on Information and Knowledge Management (CIKM), 2010.
2009 J. Dillon and G. Lebanon. Statistical and Computational Tradeoffs in Stochastic Composite Likelihood. Proc. of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
2007 J. Dillon, Y. Mao, G. Lebanon, and J. Zhang. Statistical Translation, Heat Kernels, and Expected Distance. Proc. of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
Workshop
2010 K. Collins-Thompson and J. Dillon. Controlling the Search for Expanded Query Representations by Constrained Optimization in Latent Variable Space. SIGIR Workshop on Query Representation and Understanding, 2010.
2006 J. Dillon, Y. Mao, G. Lebanon, and J. Zhang. Statistical Translation, Heat Kernels, and Expected Distance. NIPS workshop on Learning to Compare Examples, 2006.
Working
2010 S. Kim, J. Dillon, and G. Lebanon. Visualizing version controlled documents. Manuscript upon request, 2010.
Software
2011 J. Dillon. Semaphore, Matlab Central. August 2011.

Matlab interface to POSIX semaphore functionality.

2010 J. Dillon. Sharedmatrix, Matlab Central. August 2010.

Allows a 2D full/sparse matrix or 2D cell to be shared between multiple Matlab sessions, provided they have access to the same shared memory resources, i.e., the processes are on the same physical system. This program uses the shared memory functions specified by POSIX and therefore does not use disk I/O for sharing. It should work trivially on Linux (tested on Ubuntu) and will probably work when compiled with Cygwin.
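
The general idea of sharing a matrix through named shared memory can be sketched in a few lines. The following is only an illustration in Python using the standard library's multiprocessing.shared_memory module; it is not the Sharedmatrix code, and the function names (publish_matrix, attach_matrix) and the two-integer header layout are assumptions made for this example. It handles dense matrices of doubles only and copies the data on read for simplicity.

```python
import struct
from multiprocessing import shared_memory

HEADER = "qq"  # hypothetical layout: two signed 64-bit ints (rows, columns)
HEADER_SIZE = struct.calcsize(HEADER)


def publish_matrix(rows):
    """Copy a dense 2D list of floats into a newly created shared memory block."""
    n_rows, n_cols = len(rows), len(rows[0])
    flat = [float(x) for row in rows for x in row]
    shm = shared_memory.SharedMemory(create=True, size=HEADER_SIZE + 8 * len(flat))
    struct.pack_into(HEADER, shm.buf, 0, n_rows, n_cols)
    struct.pack_into(f"{len(flat)}d", shm.buf, HEADER_SIZE, *flat)
    return shm  # the creator keeps this handle and must close() and unlink() it


def attach_matrix(name):
    """Attach to an existing block by name and read the matrix back out."""
    shm = shared_memory.SharedMemory(name=name)
    try:
        n_rows, n_cols = struct.unpack_from(HEADER, shm.buf, 0)
        flat = struct.unpack_from(f"{n_rows * n_cols}d", shm.buf, HEADER_SIZE)
    finally:
        shm.close()  # detach only; the creator is responsible for unlinking
    return [list(flat[r * n_cols:(r + 1) * n_cols]) for r in range(n_rows)]


if __name__ == "__main__":
    owner = publish_matrix([[1.0, 2.0], [3.0, 4.0]])
    try:
        # A second process on the same machine would call attach_matrix
        # with the same name; here it is called in-process for brevity.
        print(attach_matrix(owner.name))
    finally:
        owner.close()
        owner.unlink()
```

Only the block's name needs to be passed between sessions; the matrix itself never touches disk.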

Copyright © 2010, Joshua V. Dillon. All rights reserved.