Categories
Company News

Featured in ET Newsmakers

Categories
Company News

Published in Hindustan Times – Mint

Categories
Data Engineering

Exploring Data Engineering on AWS

Migrating your infrastructure to another location can be a stressful and fiddly experience, not helped by the fact that you can’t simply move your files, scripts and database.

I started exploring options for moving our company's data ops infrastructure to a more flexible and scalable environment. As far as cloud solutions go, AWS and Google Cloud are some of the more cost-effective options I could think of. I created a playlist of good material I found as part of the research.

Categories
Data Engineering, Data Visualization

Dashboarding with the ELK (Elasticsearch, Logstash, Kibana) Stack

My recent internship at Audience Partners gave me an opportunity to create a dashboard for “Display Reach Analysis” on campaign operations data. The primary source of the data is the DSPs (demand-side platforms) that help serve ads with the publishers. As a proof of concept for the data pipeline, I used the ELK stack (Elasticsearch, Logstash, Kibana) to come up with a neat dashboard backed by scalable infrastructure. We will walk through the design of the framework and familiarize ourselves with the components of the ELK stack.

Kibana Dashboard

To be added …

Categories
Big Data, Data Analysis, Data Visualization

Social Network Analysis with Hadoop & Gephi

As part of a Hadoop course project, I was tasked with using Cloudera’s Hadoop ecosystem. The dataset I used was the Enron email dataset, which contains most of the emails sent between Enron employees. I will use the tools of the Hadoop ecosystem to analyse and aggregate the information, and then demonstrate the use of Gephi for a quick social network analysis (SNA).

SNA using Gephi

My approach is to keep the project simple while implementing the core concepts of Hadoop. Since the original dataset consists of several files [C], I will be using the data [A] [B] from the database created by Shetty and Adibi [1].

Let’s begin by loading the entire email data into MySQL. We will then filter out the part we need for further processing in Hadoop. I used the following commands to load the data.

# 1. Create a database in MySQL
$ mysql -u root -e 'create database enron'

# 2. Load the Enron data into the database
$ mysql -u root enron < enron-mysqldum_v5.sql

# 3. Log in as root to the Enron database created in step 1
$ mysql -u root enron

This is the list of tables that exist in the newly created database:

Tables in the enron database

For the analysis, we dumped the results of the query below from MySQL into a TSV file and used Hue to load that file into HDFS. Copy the command below:

$ mysql -u root -B -e "select sender, rvalue as 'receiver', date, rtype, DAYNAME(date) as 'dayn', MONTHNAME(date) as 'monthn', YEAR(date) as 'yr' from message m, recipientinfo r where m.mid = r.mid;" enron > enron_email.tsv
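To make the shape of enron_email.tsv concrete: in batch mode (-B) the mysql client prints one tab-separated row per line, with the column names on the first line. The JavaScript sketch below parses such text into objects. The parseTsv helper and the sample rows are made up for illustration and are not part of the original pipeline.

```javascript
// Parse tab-separated text as written by `mysql -B`:
// the first line holds the column names, each following line is one row.
function parseTsv(text) {
  const lines = text.split(/\r?\n/).filter((line) => line.length > 0);
  if (lines.length === 0) return [];
  const header = lines[0].split("\t");
  return lines.slice(1).map((line) => {
    const cells = line.split("\t");
    const row = {};
    header.forEach((name, i) => {
      row[name] = cells[i];
    });
    return row;
  });
}

// Made-up sample rows (not taken from the Enron data), only to show the layout.
const sampleTsv = [
  "sender\treceiver\tdate\trtype\tdayn\tmonthn\tyr",
  "alice@example.com\tbob@example.com\t2001-05-14 09:30:00\tTO\tMonday\tMay\t2001",
  "bob@example.com\tcarol@example.com\t2001-05-15 10:00:00\tCC\tTuesday\tMay\t2001",
].join("\n");

const rows = parseTsv(sampleTsv);
console.log(rows.length, rows[0].sender, rows[0].receiver);
```

Every value comes back as a string, so numeric columns such as yr would still need converting before any arithmetic.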

Hive lets us aggregate the data using SQL. We wanted the total number of emails exchanged between every pair of employees, regardless of direction. To get this, we first created a table holding sender, receiver and email count for all pairs where the sender sorts alphabetically before the receiver. We then fetched all records where the sender sorts after the receiver and inserted them into the same table with the two columns swapped, so that receivers go into the sender column and senders into the receiver column. With every pair stored in the same order, a final group-by gives the count of all emails exchanged between the two people.

-- Pairs that are already in alphabetical order (sender < receiver)
create table combined as
select sender, receiver, count(*) NUM_OF_EMAILS_SENT from enron_email_chain
where sender != receiver
and sender < receiver
group by sender, receiver
order by NUM_OF_EMAILS_SENT desc;

-- Remaining pairs, inserted with sender and receiver swapped
insert into table combined
select receiver, sender, count(*) NUM_OF_EMAILS_SENT from enron_email_chain
where sender != receiver
and sender > receiver
group by sender, receiver;

-- Total emails exchanged per pair, in either direction
select sender, receiver, sum(num_of_emails_sent) as total from combined
group by sender, receiver
order by total desc;
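
The pair-counting idea in the Hive queries above can also be sketched outside Hive. The JavaScript below only illustrates the logic and is not what was run on the cluster; countPairs and the sample addresses are hypothetical, and JavaScript string comparison may order some addresses differently from Hive.

```javascript
// Count emails per unordered pair: (a, b) and (b, a) land in the same bucket.
function countPairs(emails) {
  const totals = new Map();
  for (const { sender, receiver } of emails) {
    if (sender === receiver) continue; // same filter as `sender != receiver`
    // Put the alphabetically smaller address first, mirroring the two Hive steps.
    const [first, second] =
      sender < receiver ? [sender, receiver] : [receiver, sender];
    const key = first + "\t" + second;
    totals.set(key, (totals.get(key) || 0) + 1);
  }
  return [...totals.entries()]
    .map(([key, total]) => {
      const [sender, receiver] = key.split("\t");
      return { sender, receiver, total };
    })
    .sort((x, y) => y.total - x.total); // same as `order by total desc`
}

// Made-up sample data for illustration.
const sampleEmails = [
  { sender: "alice@example.com", receiver: "bob@example.com" },
  { sender: "bob@example.com", receiver: "alice@example.com" },
  { sender: "alice@example.com", receiver: "bob@example.com" },
  { sender: "carol@example.com", receiver: "alice@example.com" },
  { sender: "carol@example.com", receiver: "carol@example.com" },
];

const pairs = countPairs(sampleEmails);
console.log(pairs);
```

Normalizing the order of each pair up front replaces the create-then-insert-swapped steps with a single pass, which is convenient in application code but is a different formulation from the one used in the post.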

Now we need to get the data in the following form so that it can be ingested easily into Gephi.
Final Version
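
The screenshot of the target layout is not reproduced here. As a rough sketch, assuming the aggregated rows have sender, receiver and total fields and that the data goes in through Gephi's spreadsheet (CSV) import as an edge table with Source, Target, Type and Weight columns, the conversion could look like this in JavaScript. The helper names and sample rows are hypothetical.

```javascript
// Quote a CSV cell only when it contains a comma, a quote or a line break.
function csvCell(value) {
  const text = String(value);
  return /[",\r\n]/.test(text) ? '"' + text.replace(/"/g, '""') + '"' : text;
}

// Turn aggregated pair counts into an edge list that Gephi can import.
// Assumed input shape: { sender, receiver, total }.
function toGephiEdgeCsv(pairCounts) {
  const lines = ["Source,Target,Type,Weight"];
  for (const { sender, receiver, total } of pairCounts) {
    lines.push(
      [sender, receiver, "Undirected", total].map(csvCell).join(",")
    );
  }
  return lines.join("\n");
}

// Made-up sample rows for illustration.
const samplePairs = [
  { sender: "alice@example.com", receiver: "bob@example.com", total: 3 },
  { sender: "alice@example.com", receiver: "carol@example.com", total: 1 },
];

const edgeCsv = toGephiEdgeCsv(samplePairs);
console.log(edgeCsv);
```

The edges are marked Undirected here because the counts cover both directions of each pair.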

Once we import the data into Gephi, we select a layout and click Run to get the final version below:

Final Version

Categories
Data Visualization

Data Visualization with Keshif

While crawling the web for data visualization tools and packages built with D3.js, I chanced upon a library called “Keshif”. Keshif was created at the University of Maryland and is the brainchild of M. Adil Yalçın, Ph.D. You can find his profile here: PROFILE.

keshif

Let’s get started with a quick tutorial on creating a Keshif data visualization online here:

Now that we have covered the basics, let’s move on to some advanced stuff. I am going to show how we can customize and use the Keshif engine to create our own data visualization dashboard. The Git repository for Keshif provides plenty of demo examples of the types of visualization browsers we can create. We will try recreating one of these, called “SelfieCity”.

Let’s begin with the skeleton HTML of our main page.

 
 
 
<!DOCTYPE html>
<html>
	<head>
		<script type="text/javascript" src="https://www.google.com/jsapi"></script>
		<script type="text/javascript" src="./js/jquery-1.11.1.min.js"></script>
		<script type="text/javascript" src="./js/d3.3.5.5.min.js" charset="utf-8"></script>
		<script type="text/javascript" src="./js/keshif.js" charset="utf-8"></script>
		<script type="text/javascript" src="./js/moment.min.js" charset="utf-8"></script>
		<script type="text/javascript" src="./js/demo.js" charset="utf-8"></script>
		<title>Selfiecity – Self-portraits (selfies) in five cities across the world</title>
	</head>
	<body>
		<div id="chart_div"></div>
	</body>
</html>
 

The page currently looks like this:

Skeleton Page

Keshif has a single main function that loads the data source, creates the summaries and displays the items. There are several ways of invoking it; for more examples, check the demos on Keshif’s website. We will create another script file called “keshif_demo.js” to hold the browser main function. Copy the code below:

$(document).ready( function(){
  resizeBrowser(100,130);
  $(window).resize(function(){
    resizeBrowser(100,130);
    browser.updateLayout();
  });
 
  browser = new kshf.Browser({
    domID: "#chart_div",
    categoryTextWidth: 150,
    itemName: "Selfies",
    source: {
      gdocId: '1KwkzkZo7rHyORyqF5qL2TmIeBgOHz1FtGXg0sPrTsbw',
      tables: "Selfiecity"	  
    },
    summaries: [
      { title: "City", value: "city",
        catLabel: function(){
          switch(this.id){
            case 'ny': return "New York";
            case 'bangkok': return "Bangkok";
            case 'sao_paulo': return "Sao Paulo";
            case 'moscow': return "Moscow";
            case 'berlin': return "Berlin";
          }
        } },
      { title: "Sex", value: "sex" },
      { title: "Age", value: "age", showPercentile: true },
      { title: "Photo: Eyes Closed", value: "eye_closed" },
      { title: "Photo: Mouth Open", value: "mouth_open_wide", collapsed: true },
      { title: "Photo: Has Glasses", value: "glasses" },
 
      { title: "Happyness", value: "emotion_happy", layout: 'right', showPercentile: true },
      { title: "Calmness", value: "emotion_calm", layout: 'right', type: 'interval', },
      { title: "Confusion", value: "emotion_confused", layout: 'right', type: 'interval',},
      { title: "Sadness", value: "emotion_sad", layout: 'right', },
      { title: "Anger", value: "emotion_angry", collapsed: true, layout: 'right', },
      { title: "Disgust", value: "emotion_disgust", collapsed: true, layout: 'right', type: 'interval' }
    ],
    itemDisplay: {
      sortingOpts: {title:"Happyness", value: "emotion_happy"},
      displayType: "grid",
      detailsToggle: "off",
      maxVisibleItems_Default: 104,
      recordView: function(){ return "<img>"; },
      visibleCb: function(d){
          d3.select(d.DOM.record).select("img").attr("src",
              'https://d25rsf93iwlmgu.cloudfront.net/selfies/150/'+this.city+'/'+this.id);
      }
    }
  });
});

The main options used in this call are:
    • domID: In our case #chart_div, which is the same id we gave to the div in the HTML body.
    • gdocId: The id of the Google Doc that contains our data. Keshif provides several ways to load data and also supports other formats, including XML, JSON and text files.
    • tables: The name of the table in the Google Doc.
    • summaries: This section summarizes the data as visual elements in the browser. Each summary is identified by its title field. Additional options such as showPercentile, catLabel, collapsed, layout and type can be used to modify its appearance. See the API documentation for more help.
    • itemDisplay: This section creates the middle section of the browser. Its recordView and visibleCb elements together render each item.
    • sortingOpts: Set inside itemDisplay; provides a dropdown list to sort the items in the middle section.
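
One detail of the catLabel callback above is that its switch returns undefined for any city id it does not list. A lookup table with a fallback is one way to avoid that. This is a minimal sketch in plain JavaScript, assuming (as the code above suggests) that the callback is invoked with the category as `this` and its raw value in `this.id`; CITY_LABELS and cityLabel are hypothetical names, not part of the Keshif API.

```javascript
// Display names for the city ids found in the data.
const CITY_LABELS = {
  ny: "New York",
  bangkok: "Bangkok",
  sao_paulo: "Sao Paulo",
  moscow: "Moscow",
  berlin: "Berlin",
};

// Same contract as the catLabel callback in the post: reads `this.id`.
// Unknown ids fall back to the raw id instead of undefined.
function cityLabel() {
  return Object.prototype.hasOwnProperty.call(CITY_LABELS, this.id)
    ? CITY_LABELS[this.id]
    : String(this.id);
}

console.log(cityLabel.call({ id: "sao_paulo" }));
console.log(cityLabel.call({ id: "london" }));
```

If used, it would be passed as `catLabel: cityLabel` in the "City" summary in place of the inline function.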

The page currently looks like this:

Step 2 - Work in Progress

Let’s customize it a bit more. Keshif provides hidden options to remove the ribbons and make it look neater. Let’s set these options. Copy the code below before the browser main function.

socialShare = false;
noRibbon = true;
showLogo = true;

The page currently looks like this:

Step 3 - Removing Ribbons

Finally, let’s add some CSS (listed below) to create the hover effects on the selfies in the browser. With it, we have successfully recreated the demo on the Keshif website here: Keshif SelfieCity
Selfiecity is a project by Dr. Lev Manovich, Moritz Stefaner, Mehrdad Yazdani, Dr. Dominikus Baur, Daniel Goddemeyer, Alise Tifentale, Nadav Hochman, Jay Chow.

You can also see their interactive visual selfie browser here: Selfie Exploratory

.listItemGroup{
  background-color: black !important;
}
.listItem{
  width: 70px;
  padding-top: 0px !important;
  background-color: black !important;
  overflow: visible !important;
}
.listItem[highlight=true]{
  background-color: orangered !important;
}
.listItem[highlight=true] .content img{
  border-color: orangered;
  border-width: 3px;
}
.listItem:hover{
  z-index: 200 !important;
}
.listItem:hover .content img{
  transform: scale(2);
}
.listItem > .itemRow{
  display: block !important;
  overflow: visible !important;
}
.content img{
  display: block;
  border: solid;
  border-color: black;
  border-width: 1px;
  border-radius: 0px;
  background-color: white;
  width: 100%;
  transition: all 200ms linear;
  -webkit-transition: all 200ms linear;
  -o-transition: all 200ms linear;
  -moz-transition: all 200ms linear;
}
.listItem[highlight=true] .content img{
  background-color: orangered;
}
.content span.title{
  display: block;
  margin-left: auto;
  margin-right: auto;
  margin-top: 2px;
  text-align: center;
  font-size: 0.8em;
}

Our final HTML source looks like this:

 
 
 
<!DOCTYPE html>
<html>
	<head>
		<script type="text/javascript" src="https://www.google.com/jsapi"></script>
		<script type="text/javascript" src="./js/jquery-1.11.1.min.js"></script>
		<script type="text/javascript" src="./js/d3.3.5.5.min.js" charset="utf-8"></script>
		<script type="text/javascript" src="./js/keshif.js" charset="utf-8"></script>
		<script type="text/javascript" src="./js/moment.min.js" charset="utf-8"></script>
		<script type="text/javascript" src="./js/demo.js" charset="utf-8"></script>
		<script type="text/javascript" src="./js/keshif_demo.js" charset="utf-8"></script>
		<title>Selfiecity – Self-portraits (selfies) in five cities across the world</title>
	</head>
	<body>
		<div id="chart_div"></div>
	</body>
</html>
 

Final version:

Final Version

For more information about the API:
GitHub Source
API Documentation