|
The Enron email corpus is being used by a number of
researchers in a variety of contexts. Below is a list of known projects
and research topics that make use of the corpus.
This list aims to capture all known uses. Through the Enron Corpus mailing list, we hope to
avoid duplicated effort in processing and marking up the corpus for
research purposes (e.g., adding specific markup of concepts or features, analysing the structure of the corpus, loading the corpus into database tables).
- Carnegie Mellon University:
- William Cohen has taken on the responsibility for distributing the
Enron email corpus as a
resource for researchers.
- Bryan Klimt and Yiming Yang have done
some data cleaning and
preliminary data analysis of the Enron corpus, in terms of folder usage, thread
characteristics and message distribution across users.
Read More about Enron work at CMU
- University of California, Berkeley:
- A search interface for the Enron email collection, developed by
Andrew Fiore and Marti Hearst.
- A set of
categories, developed by Marti Hearst and her students, intended
for annotating a subset of the Enron email messages.
- A
subset of about 1700 labeled email messages (4.5M). The selection focuses on
business-related emails, the California energy crisis, and emails that
occurred later in the collection, while trying to avoid very personal messages,
jokes, and so on. Students in Marti Hearst's course annotated the selected messages
with the category labels. Each message was labeled by two people, but no claims
of consistency, comprehensiveness, or generality are made about these
labelings.
- The
Enronic email visualization and clustering tool by Jeff Heer, built on his prefuse toolkit (1.9M jar
file). It provides graph-based visualizations of social networks
within Enron, based on email interaction between users.
- A database
representation (219 MB compressed) of the Enron email collection, built by
Andrew Fiore and Jeff Heer.
This version contains many but not all of
the tables used in the search tool, as well as special tables to be used with
the Enronic visualization tool. The contents of the database have undergone
a substantial amount of processing
to remove duplicates, normalize names, and so on.
Read More about Enron work at UC Berkeley
- University of Massachusetts, Amherst:
- Ron Bekkerman, a PhD
student supervised by Andrew McCallum, has
worked with the Enron corpus to perform automatic categorization of
email into folders. The work presented to date has made use of only
seven users' email, for whom the number of messages and folders are
particularly large.
- Andrés
Corrada-Emmanuel has done a large amount of data consistency
checking, based on MD5 digests for each of the email messages. This has
allowed him to identify duplicate messages, and to provide mappings
between messages and their respective senders and receivers. Other data
cleansing work includes normalisation of email addresses, which
suggests that only 149 different folder users exist in the corpus.
Read More about Enron work at UMass
- University of Southern California:
- Jitesh Shetty has worked
with Jafar Adibi (ISI) to use
the Enron corpus for testing the effectiveness of some Link Discovery
techniques that are used for counter-terrorism and fraud detection.
To do so, they have created another cleaned database (MySQL)
version of the corpus. They have also looked at social network
analysis, and have constructed a list of job titles for many
of the users represented in the Enron corpus.
Read More about Enron work at USC/ISI
- University of Iowa:
- Columbia University:
Owen Rambow, Aaron Hanly, Martin Jansche and others at Columbia University are
using the Enron corpus for work in email summarisation, particularly for email
thread summarisation.
|
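The digest-based consistency check mentioned in the UMass entry above can be illustrated with a minimal Python sketch. It is not the code used in that work. The choice to hash only the message body, the whitespace normalisation step, and the names `message_digest`, `find_duplicates`, and the sample message ids are all assumptions made for illustration.

```python
import hashlib
from collections import defaultdict
from typing import Dict, List


def message_digest(body: str) -> str:
    """Return the MD5 hex digest of a message body.

    Assumption: runs of whitespace are collapsed before hashing so that
    trivially reformatted copies get the same digest. The original work
    may have hashed different fields or used no normalisation at all.
    """
    normalized = " ".join(body.split())
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()


def find_duplicates(messages: Dict[str, str]) -> List[List[str]]:
    """Group message ids by digest and return the groups with more than one id."""
    groups = defaultdict(list)
    for msg_id, body in messages.items():
        groups[message_digest(body)].append(msg_id)
    return [sorted(ids) for ids in groups.values() if len(ids) > 1]


# Hypothetical sample: the same message stored in two folders.
sample = {
    "inbox/1": "Meeting moved to 3pm.",
    "sent/7": "Meeting  moved to 3pm.\n",
    "inbox/2": "Please review the attached contract.",
}
duplicates = find_duplicates(sample)
print(duplicates)
```

Grouping by digest avoids comparing every pair of messages, which matters for a collection of this size. MD5 is used here only as a fingerprint for duplicate detection, not for any security purpose.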