The Enron Email Corpus

The Enron email corpus contains email from roughly 150 senior executives of Enron. It was originally made public during the Federal Energy Regulatory Commission's investigation of Enron.

The corpus contains roughly 500,000 messages, organised into folders. Thanks to the work of people at MIT, SRI and CMU, the dataset has been cleaned (mostly stripping attachments and removing some emails for reasons of employee privacy), and made available online.

An unparalleled research dataset

The Enron corpus is unparalleled among email datasets that can be used for research purposes. It is more extensive than any other research-friendly email corpus (that I know of) by several orders of magnitude. Many people in Natural Language Processing, Machine Learning and a number of other fields have realised this, and have started to use the corpus as the basis for a range of research programs. These range from investigations into social networks and organisational communication to data mining and text classification tasks.

Coordinating research using the Enron corpus

Unfortunately, despite such widespread interest, the community using the Enron corpus seems to be very fragmented, with many researchers unaware of how others are using the corpus. This could result in much wasted effort if different research groups duplicate each other's work, especially in data markup and data cleansing, which are both huge tasks given the size and inconsistencies of the corpus.

The main motivation for creating this website is to pull together all the known work happening with the Enron Corpus, and to encourage users to share data and knowledge about the corpus. The Enron Corpus Mailing List has been set up for exactly this purpose.

If you're working with (or thinking of using) the Enron dataset, why not join the discussion list?