Friday, 10 February 2012

Postgresql – quickly check your data demographics

Hi all,

A few days ago, when starting go work on a new postgresql datawarehouse, I was scared by poor performance and poor exec plans while data volume was not so big…

Here is a quick way to check how your data demographics are, by using the pg_stat_user_tables view.


select relname, last_vacuum, last_autovacuum, last_analyze, last_autoanalyze 
from
pg_stat_user_tables,


order by
relname



image



Then, like me, you can ask you dba why some tables do not have up to date data demographics … especially big / frequently queried tables. Note that you have a lot of interesting views / tables to learn more about your data demographics in postgresql :




  • pg_stat_activity,


  • pg_stat_database,


  • pg_stat_all_tables

    • pg_stat_sys_tables


    • pg_stat_user_tables






  • pg_stat_all_indexes,

    • pg_stat_sys_indexes


    • pg_stat_user_indexes





And others, using the same naming schema …

Tuesday, 26 July 2011

Query Twitter with Talend to see what people think about …

Hi all,

Almost on holidays after a very hard working year. I have some beach time.

This morning, I tried to query Twitter and to process the data. My purpose is to quickly build a data set showing what people are talking about … let’s say, Obama. Easy, Twitter is providing an interface to run queries and retrieve the results back with json format.

Well, my proposition is to implement a basic word frequency analysis. My tool is Talend, but this process is also easy to set up using java, php, python, shell, Kettle …

image

Here are some details :

  • fileInputJSON : just read the json from Twitter search engine. The syntax is : http://search.twitter.com/search.json?&q=YOUR_QUERY_HERE&rpp=10000, where YOUR_QUERY_HERE = the key word you want to search for. Don’t forget to distribute the answer into a string column :
    • answer ====> “$..text” : where “$..text” stands for the tweet message you want to read.
  • tmap : just to transform the text into lower case. I tried to do it on the fly in the json step, but it does not work.
  • tnormalize : transform all words from anwer into rows (separator is space).
  • tfilter : filter your data in order to avoid “|” (pipes) and other special chars + get rid of words having less than 4 letters.
  • taggregate : aggregate on the words and add a new column named nb that will store the distinct count of all words.
  • tsortrow : now you have words and counts, just sort the data from higher counts to lower (desc).
  • tsamplerow : as we just want to read the first top 20 words, we create a sample based on the sorted rows with range “1..20”.

Finally, I print everything into the console by using a tlogrow + “:” separator.

Here are the data for the keyword “obama”, the query was done on July 26 at 16h00.

image

Finally, I will use Jonathan Feinberg, wordle, to create a nice and sexy word map with the above results.

image

 

Setting up the process and testing it only took 20 minutes. Of course, this can be improved a lot by adding string cleaning, custom data filtering… or better :

  • Capture who created the tweet and who received it : you can create relationship networks with, for instance, networkx.
  • Send the data to a powerfull text analytics framework like NLTK, for better analytics.
  • Create the word map picture into Talend. I have some source code for this, will have to work on it …

I ll see how to add one of these features soon.

Tuesday, 19 July 2011

Interfacing Talend with Amazon SDB (AWS SDB) – quick way

Hi all,

I had the following challenge : read some ftp account informations (ftp server, username, password, target directory) stored in Amazon SDB and use it in a Talend transformation, published as a web service. You know about SDB I hope. For those who don’t, SDB is a key / value database provided by Amazon. So you can name SDB a noSQL database.
I played with the SDB API from Amazon, and succeeded after some coding and “Talending”. Here is how I did.
Below is only a small part of a much larger project, composed of a large webservice collection, created for my client.
This project does the following :
  • Query into a database using dynamic params given by the user at run time from a Flex portal (a query engine like business objects !),
  • Return a rowcount of the query, into the portal. I’m working in web marketing : counting people (segments) before creating a campaign is very important …
  • Generate an extract of the data, process this extract according to the params given by the user (separator, encoding, spliting, zipping …),
  • Send this data file to different “tubes” : router, ftp, AWS S3, local download …
Here we focus on the ftp sending part, using SDB to retrieve infos.

 

The process.

 

image

The job (partial).

image

 

Data structure in AWS SDB.

I’m using a very nice firefox plugin in order to have easy and quick access to my SDB ecosystem : sdbtool. My data structure is simple ( “dd” is of course not the true value …) :
  • Item : ftp
    • Attribute names : Address :
        • Attribute value : dd
      • Attribute names :Login
        • Attribute value : dd
      • Attribute names : PKey
        • Attribute value : dd
      • Attribute names : Password
        • Attribute value : dd
      • Attribute names : Port
        • Attribute value : 21
Here is a screencap of my sdbtool view :
image

Explanations.

First, we load all the needed libraries, using tlibraryload component.
  • aws-java-sdk-1.0.14.jar
  • commons-codec-1.3.jar
  • commons-httpclient-3.0.1.jar
  • commons-logging-1.1.1.jar
  • jackson-core-asl-1.4.3.jar
  • stax-api-1.0.1.jar
  • stax-1.2.0.jar
For the aws-java-sdk-1.0.14.jar import, I had to write some imports. These imports are required to be able to use the aws jdk.
image
Then we have a tRowGenerator in which I create the value for the variable myDomain, that will be used in SDB queries (see code below). You can avoid this step, I created it only for quick testing purpose.
Then, we have to code a little tJavarow. This java code will :
  • connect to AWS SDB. You must have an account for AWS SDB.
  • run several queries, using SQL, to retrieve ftp account informations :
    • ftp server address
    • ftp server login
    • ftp server pass
    • ftp server port
    • ftp server pkey, if needed
  • store the query results in output_row.[name] so they can be used in Talend process.
The code in the tJavaRow. First we create some credentials (use yours) and then create an endpoint with sdb address from AWS : https://sdb.eu-west-1.amazonaws.com. Be carefull to set a valid endpoint, using a valid country zone.
The code is finally simple : create a string containing your sql query, then call a getItems() function. An Item is sent back, simply call a getAttribute in order to retrieve the value you need.
I chose, for simplicity, to run a different query for each item I need from SDB. Of course, you can write it shortly.
  
        BasicAWSCredentials credentials = new BasicAWSCredentials("KL45LKJ4325MLKJ2345", "LKJ45LKJmlkjdlkjGRhjKLJSFSDG432534");    
        
    final String[] FTP_Items;
    
        

        AmazonSimpleDB sdb = new AmazonSimpleDBClient(credentials);
        sdb.setEndpoint("https://sdb.eu-west-1.amazonaws.com");
        try {
        int i = 0;
            String myDomain = "Clients"; 
            String selectExpression = "select FTP_Address from `" + myDomain + "`where code_client = '" + context.client_name +"'";
           
            SelectRequest selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Address = attribute.getValue().toString();
                }
            }         
            selectExpression = "select FTP_Login from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Login = attribute.getValue().toString();
                }
            }
            selectExpression = "select FTP_Pass from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Pass = attribute.getValue().toString();
                }
            }
            selectExpression = "select FTP_Port from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_Port = Integer.valueOf(attribute.getValue());
                }
            }
            selectExpression = "select FTP_PKey from `" + myDomain + "`where code_client = '" + context.client_name +"'";
            selectRequest = new SelectRequest(selectExpression);
            for (Item item : sdb.select(selectRequest).getItems()) {
                for (Attribute attribute : item.getAttributes()) {
                    output_row.FTP_PKey = attribute.getValue().toString();
                }
            }
        } catch (AmazonServiceException ase) {
            System.out.println("AWSException");
            System.out.println("ErrorMsg:    " + ase.getMessage());
            System.out.println("HTTPStatcode: " + ase.getStatusCode());
            System.out.println("AWS Errcode:   " + ase.getErrorCode());
            System.out.println("Errortype:       " + ase.getErrorType());
            System.out.println("RequestID:       " + ase.getRequestId());
        } catch (AmazonClientException ace) {
            System.out.println("AWSClientException");
            System.out.println("Error Message: " + ace.getMessage());


Final.

After retrieving all the item I need for sending on ftp (server, username, pass, port or location for ssh key), I store all this into global variables. Then, these global variables are used as arguments into two very customized scripts (needed in my case) that will send the files : simple ftp or sftp when needed. Finally, I catch some usefull infos from the custom ftp scripts, process it into a tmap and send this information into a tBufferOutput step. That way, I can provide a soap feed back when calling this webservice.

This post is very consice, feel free to ask me for more infos about this process.

Links.

Tuesday, 21 June 2011

… about AWS cloud, Talend, Jaspersoft, Postgresql and typical EC2 internal addressing issues …

Hi all,

I’m terribly late with this article, initially scheduled for January 2011 … sorry. Maybe it is a bit outdated now, anyway, I publish it …
Let’s talk about EC2 cloud computing, Talend, Postgresql and JasperServer. Basic setup.
You already know all the pros and cons with cloud computing, I won’t talk about that. As to me, I love cloud computing and use it everyday, because of these particular advantages :
  • Scalablity : scale up or down any instance, according to your needs,
  • Flexibility : create your own instances, boot them, create quick sandboxes, replicate data …
  • Pay per use : you pay for what you use (cpu, storage, security …),
  • Opex, no capex !
Cloud computing is still something new, and it is not surprising to discover softwares that are not ready for it or not fully “cloud compliant”. I recently faced such an issue when implementing Postgresql, Talend and Jaspersoft, which remain my preferred open source BI tools.

First issue

Let’s imagine we have a single server, hosting Postgresql. No big deal with that as long as we use this instance in a simple way : I can start my instance, host data on a persistent EBS, connect to it and stop it whenever I want. By using elastic IPs, I can assign a “fixed” IP address to this server and can easily set up a connection string. Note on 16/12/2010 : Amazon is now offering a DNS service.
Now let’s imagine we need a typical BI architecture (tiers) : one ETL (Talend or Pentaho of course !), a Postgresql database in the middle and Jaspersoft for reporting.
That’s a bit more complex because we need our Postgresql server to allow connections from the ETL and from the reporting tool. On top of that, we want to fully leverage all cloud computing features : stop the servers when they are not used, boot them when the service is needed, maybe change their network properties ... eventually we want this to be fully automated and working without any human actions like changing the connection strings, starting/stopping the servers …
Let’s have a look to a little schema now. As you can see, we have now our architecture up and running. We are also using elastic IPs for each server, which is mandatory for the following demonstration. IPs are fake.
image

How to read Public DNS, Private DNS and Elastic IPs on AWS EC2 ?
Imagine we have an instance running. This instance has an Elastic IP which is 46.52.186.25 and the private IP address is 11.235.33.6.
The Private DNS name is : ip-11-235-33-6.eu-west-1.compute.internal
The Public DNS name is : ec2-46-52-186-25.eu-west-1.compute.amazonaws.com
You see the relationship ?

Ok, now, how do you think we will configure Postgresql server to allow connexions from the ETL server and from the Reporting server ? Easy, here is one answer :
  1. By making the ETL Server and the reporting server point to Postgresql. For that, we will use this nice little Elastic IP we previously set up for Postgresql server because it’s soooo easy to do that way …
  2. By writing the ETL server Elastic IP and reporting server Elastic IP into Postgresql pg_hba.conf of course … because here again it is soooo easy natural to do so.
  3. Don’t forget to open the corresponding ports in your security groups (see picture above).
Ok, easy. Let’s go for it. We make Talend and Jasper point to Postgresql like this :
Jasper server connexion screen : Postgresql database <===> Jasperserver
image
Talend client connexion screen : your client <===> Talend server
image

Talend server connexion screen : Talend server <===> Postgresql database
image

And then we write down the Elastic IPs into the pg_hba file like this, in order to allow Talend server and JasperServer to connect to the postgresql database. This is a basic pg_hba.conf, I encourage you to add stronger authentication.
image
We are done. Don’t forget to adjust the security groups like this :
  • Talend Server : allow 8080, allow 22
  • Postgresql Server : allow 5432, allow 22
  • Jasperserver : allow 80 (or 443 if https), allow 22
Okay, this stuff is fully working, you can test it.
But wait … that’s not the good way to do ! By using the elastic IPs to set up communication between each server/node, we just created a weird monster that makes the traffic going OUT of the cloud and going BACK INTO the cloud. Don’t forget you are paying for that. Look at this schema.

image

First solution

The best practice is to avoid using elastic IPs in order to set up network traffic between servers that are hosted inside the EC2 cloud. Instead, use EC2 internal adresses.
Ok, but … wait a minute.
  • How do I do to retrieve the internal address from inside EC2 ? 
The solution rely on a poorly documented EC2 feature : when you query an ec2 public DNS server from inside EC2, you will be given back the corresponding internal IP address. Just what we need !!!!
For instance, if you query your ETL Server from your your Postgresql server, by using the famous host command, you will have :
image

You see what you have to do ? Replace all elastic IPs, except for your Talend client, by internal IPs. Like that, your internal data won’t leave the cloud, like below.

image
After using the internal addressing, the connexion screens will look like this :
Jasper server connexion screen : Postgresql database <===> Jasperserver

image

 

Talend server connexion screen : Talend server <===> Postgresql database
image


Second issue

Well, ok, we solved our first issue : using internal addresses between the ETL server and the Postgresql server. But, I can see two other issues :
  • Postgresql still does not accept DNS names in the pg_hba.conf ! Only IP addresses allowed. So We can’t ask Postgresql and pg_hba.conf to resolve the dns for us.
  • What if I decide to reboot the ETL server, or the Reporting server ? These internal adresses are nice but they are changing each time I reboot / restart server in EC2. Then, how to keep my Postgreqsl pg_hba.conf updated with frequently changing adresses ?
image…not allowed …

Second solution

No, there is still no support for DNS entries in the pg_hba.conf. I know this is a long awaited feature, at least by me. But, unless I’m wrong (tell me), writing down a DNS name in pg_hba.conf won’t work and the server won’t start.
We need to find a way to update the pg_hba.conf with the last / current ec2 internal addresses corresponding to the ETL server and the Reporting server. Easy, we will use a bit of shell code here. This script will retrieve the internal IP Address for each server (ETL and JasperServer) by using the command host and will update this address in the pg_hba.conf by using some sed or awk. Then, by using a sighup, Postgresql server will apply the new address configuration.
Nothing complex, but the success rely on a good timing.
image
Note here : I created an ORCHESTRATOR, a specialized instance in EC2, to monitor all my servers. This orchestrator will run this kind of script as soon as it detects any change in the internal addressing schema. This ORCHESTRATOR will be detailed in a future article (I made several public presentations, and a lot of people seem interested …).
And the shell script. This shell asks for the internal address, then updates the corresponding line. For that, you must  maintain your file in a tidy way : labels are needed.
################################ 
#                              # 

#      IP adress lookup        # 

#                              #  
################################ 
# POSTGRES (DATABASE) Server
# Public DNS : ec2-12-345-678-999.eu-west-1.compute.amazonaws.com 


# TALEND (ETL) Server     
ETL_SERVER=`host ec2-11-222-33-444.eu-west-1.compute.amazonaws.com | sed 's/.*has address //g'` 

# JASPER (BI & reports) Server    
JASPER_SERVER=`host ec2-22-33-444-555.eu-west-1.compute.amazonaws.com | sed 's/.*has address //g'` 



# Echoing all     
echo "" 

echo "################## EC2 Addresses Update ######################" 

echo "Will update EC2 Talend Server address with : " $ETL_SERVER 

echo "Will update EC2 Jasper Server address with : " $JASPER_SERVER 



echo ""

# Find and replace line Talend Server TALEND_NB=`grep -n "Talend server connexion" /mnt/postgres/data/pg_hba.conf | cut -d":" -f1` TALEND_NB=$((TALEND_NB+1)) sed -i "$TALEND_NB s%.*%host    all         all         $ETL_SERVER/32      md5%" /mnt/postgres/data/pg_hba.conf # Find and replace line Jasper Server JASPER_NB=`grep -n "JasperServer connexion" /mnt/postgres/data/pg_hba.conf | cut -d":" -f1` JASPER_NB=$((JASPER_NB+1)) sed -i "$JASPER_NB s%.*%host    all         all         $JASPER_SERVER/32      md5%" /mnt/postgres/data/pg_hba.conf



The end

Having a small (or even big) BI architecture up and running into EC2 is not a big deal. Having it properly set – in order not to pay extra fees – is something different and need some basic thinking before doing. The addressing issue which is technically simple, can have negative impact on your project if you don’t manage it from the start.

I will recommand any AWS / EC2 user (BI or not) to create their own admin tools and scripts, based on the various available APIs, in order to  :
  • reduce reaction time,
  • be fully independent,
  • spare time (graphical tools are nice but need clicks, clicks and clicks …)
Some usefull links about AWS / EC2 documentation :
    Feel free to contact me if this article is not clear enough.

    Monday, 28 March 2011

    Wordle, wordmap, word clouds … what’s in a name ?

    Hi all,

    Wordmaps, wordles, word clouds … are pretty famous these days. But they are still “difficult” to generate : no api, often proprietary code, often only web based tools, difficulty to render good looking fonts ….

    Last week, I was attending a client meeting and we showed some wordmaps of our own, created with R : great success. If you are involved into datamining or simply data visualization, these wordmaps / wordle are definitely a must have.

    Here is a wordle I created using the excellent web site wordle.net, created by Jonathan Feinberg. Simply paste some of your data (previously prepared and formated) into text fields, press generate and the magic comes !

    imageWordmap based on the keywords used to reach this blog.

    Matthias Oehler and I are currently working on a webservice aimed at creating wordmaps. The process will be : send a soap message with your data in it and you will get your png wordmap in return. Promising, hum ?

    Stay in touch …

    Thursday, 24 March 2011

    New Data cleansing / mining using Google Refine in the cloud with AWS

    Hi all,

    As promised, an article about a really nice piece of software that will allow you to run data cleansing and data mining jobs with fun.

    Let’s implement Google Refine in Amazon Web Services (aka “le cloud”).

    image

    Google refine ?

    According to Google, “Google Refine is a power tool for working with messy data, cleaning it up, transforming it from one format into another, extending it with web services, and linking it to databases like Freebase.”

    With Google Refine, it’s easy to load big datafiles and process this data : cell fusion, clustering, groups, adding key-values, transcoding, data modification / data customization with web service calls … Just imagine an Excel grid, but on steroïds.

    image

    Let’s go now …

    As usual, you first need a valid AWS / EC2 account. Once it’s been done, you need an instance (a server). I recommend using a Fedora Core instance for Google Refine instead of a Ubuntu one. I’m a great fan of Ubuntu and use them in a lot of crucial apps, but I faced many issues with running Google Refine on top of a Ubuntu Lucid AMI (ram usage, freezing, jdk erratic behaviour). Please, choose instead the fedora one : amazon/fedora-8-x86_64-v1.14-std. Of course, let’s go with 64 bits and with an EBS volume attached to the instance in order to provide data consistency over time (everything not located on EBS is ephemeral).

     

    image

    I won’t detail how to create and start an instance on AWS, but here are the majors steps. You can easily go through these steps by using the Firefox pluging called Elastifox. Or, for the guys having muscles, use the AWS EC2 api, which can be donwloaded here.

    • Create a key pair and download the private key : this will give you SSH access to your instance.

    image

    • Create a security group : create a dedicated security group for your instance and open the following ports : 22 for ssh and 3333 for Google Refine webservice.

    image

    • Choose your instance : amazon/fedora-8-x86_64-v1.14-std is a good choice. This corresponds to AMI ami-1d042f69

    image

    • Run your instance :
      • Be carefull with the availability zone. This one is in USA but you can also place your instance in Ireland or in Asia. Be sure not to put yourself outlaw by placing sensitive data out of your safe harbor.
      • Be sure you assign the key pair and the security group you created on steps 1 and 2.
      • Choose an instance size. As I have plenty of money, I chose a m2.xlarge instance with plenty of ram.

    image

    • SSH into your instance : you can use Putty, but don’t forget to convert the pem key (unix style private key) into ppk (windows style private key).

    image

    • Create a volume and attach it to your newly running instance. This volume will be used to store Google Refine itself and all the data you will work with.
      • You can create a volume by using Elasticfox and attach it to the instance. Below you can find the different prompts. I choose 100 Gigas but you can set it smaller.

    image

    image

      • Once the volume is created and attached to the instance, simply create a filesystem on it, using ext3 for instance : sudo mkfs.ext3 /dev/sdf
      • Create a mount point in /mnt : sudo mkdir /mnt/refine
      • Add mount point in /etc/fstab
      • Mount the newly created volume : sudo mount /mnt/refine
    • Download and install jdk 1.6
    • Finally, assign an Elastic IP to your instance. This will be easier to connect to your instance and start using Google Refine. Once again, using ElasticFox will save you a lot of time. (of course you can assign an elastic IP before connecting with SSH, which is more logic I admit …).

      image

    • Download and install Google Refine. This is really a no brainer … 
      • Go into /mnt/refine (that means goind into your EBS).
      • Download Refine : sudo wget http://google-refine.googlecode.com/files/google-refine-2.0-r1836.tar.gz
      • unzip and untar the archive :
        • sudo tar xzf google-refine-2.0-r1836.tar.gz
      • Start Google Refine
        • sh refine –I 0.0.0.0 –m 8000M
        • That means : start refine, listen to all addresses and assign 8 Giga of memory. Hey, that’s what we need here when playing with data !
        • Google Refine starts ….

          Starting Google Refine at 'http://0.0.0.0:3333/'

          10:11:01.905 [refine_server] Starting Server bound to '0.0.0.0:3333' (0ms)
          10:11:01.906 [refine_server] Max memory size: 8000M (1ms)
          10:11:01.955 [refine_server] Initializing context: '/' from '/mnt/refine/google-refine-2.0/webapp' (49ms)
          10:11:03.288 [refine] Starting Google Refine 2.0 [r1836]... (1333ms)
          10:11:03.297 [FileProjectManager] Using workspace directory: /root/.local/share/google/refine (9ms)
          10:11:03.299 [FileProjectManager] Loading workspace: /root/.local/share/google/refine/workspace.json (2ms)

     

    Time to play now !

    Ok, most of the work is done. I hope the quick AWS EC2 walk thru is sufficient for most of you. In the other case, feel free to contact me.

    Now let’s play with Google Refine. Simply open your browser and point to the IP address you assigned to your instance : http://your IP Adress:3333/

    … and you are done !

    image

    Ok, I have some more time, let’s create a simple project. Simply choose a file in the Data File zone, name your project and provide some more informations about your file : separator, header, limit, auto detect value types …

    After a short while, booom, your file is loaded and you have access to your data, ready to work on it.

    image

    Stay in touch for the next coming articles, I will show you how to fully leverage Google Refine and how to enrich your data with spectacular value added and services.

    One last thing, you definitely need to go and read my friend’s blog about datamining. His name is Matthias Oehler (dataminer) and he is a kind of R and Google Refine wizard. You will learn a lot reading his articles as soon as his website will be open (from a few hours to a few days according to him …).

    More informations

    • You can compile and deploy Google Refine into Google AppEngine, which is even a better solution.
    • You can find some screencast about Google Refine here.

     

    Feel free to contact me if needed.

    Wednesday, 23 March 2011

    PageRank 4 and next coming articles ...

    Hi all,

    First, I want to thank everybody (well, people reading that blog) for my pagerank 4.

    image

    It's been a long time since I wrote a post here. As you may remember, I'm deeply involved into a datamining startup as CTO.

    Be prepared because I'm back in a short while with a lot of articles / experiences and feed back about :
    Vincent