Big Data Data Analysis Data Visualization

Social Network Analysis with Hadoop & Gephi

As part of a Hadoop course project, I was tasked with utilizing Cloudera’s Hadoop Ecosystem. The dataset which I used was the Enron email dataset which contained most of the emails sent between the employees of ENRON. The tools of the Hadoop Ecosystem would be used to analyse and aggregate the information following which I will demonstrate the usage of Gephi for a quick SNA

SNA using Gephi

My approach is to keep the project simple while implementing the core concepts of Hadoop. Since the original dataset consists of several files [C], would be utilizing the data [A] [B] from the database which was created by Shetty and Adibi [1].

Let’s begin with loading the entire email data into MySQL. We will filter out a section of the good stuff we would need for further processing in Hadoop. Used the following scripts to load the data.

  1. #1. Create a database in MySQL
  3. $ mysql -u root -e 'create database enron'
  5. #2. Load the Enron data into the database
  7. $ mysql -u root enron < enron-mysqldum_v5.sql
  9. #3. Login as root to the Enron database created in Step 1
  11. $ mysql -u root enron

This is the list of tables that exist in the newly created database:

Skeleton Page

For analysis of the data, we dumped the results of the query below in mysql and used Hue to load the TSV file into HDFS. Copy the code as below:

  1. $ mysql -u root -B -e "select sender, rvalue as 'receiver', date, rtype,DAYNAME(date) as 'dayn',MONTHNAME(date) as 'monthn',YEAR(date) as 'yr' from message m, recipientinfo r where m.mid = r.mid;" enron > enron_email.tsv

Hive helps us aggregate data using the SQL. We wanted the total number of emails exchanged between all employee pairs. So for this, a table was created with all senders, receivers and the count where sender is smaller than receiver (Alphabetically). Then all records were fetched where sender is greater than receiver and inserted into the same table but while inserting the receivers were sent to sender columns and senders to receivers. The plan being to sort the senders and receivers so that we can further group them to find the count of all the emails exchanged between them.

  1. create table combined as
  2. select sender, receiver, count(*) NUM_OF_EMAILS_SENT from enron_email_chain
  3. where sender != receiver
  4. and sender < receiver
  5. group by sender, receiver
  6. order by NUM_OF_EMAILS_SENT desc
  8. insert into table combined
  9. select receiver, sender, count(*) NUM_OF_EMAILS_SENT from enron_email_chain
  10. where sender != receiver
  11. and sender > receiver
  12. group by sender, receiver
  14. select sender, receiver, sum(num_of_emails_sent) as total from combined
  15. group by sender, receiver
  16. order by total desc

Now we need to get the data in the following form so that it can be ingested easily into Gephi.
Final Version

Once we import the data into the nodes in Gephi, we select the a layout and click on Run. To get the final version below:

Final Version

Leave a Reply

Your email address will not be published. Required fields are marked *