Wednesday, January 28, 2009

Integrating Facebook Hive with Symfony

A symfony feature that has been really helpful while working on projects with it is the debugging capabilities of the framework. Almost every day I see my self going to the command line an doing a tail -f log/frontend_dev.log in my symfony project. This helps me see what is going on behind the scenes, spot bugs, find places where I can improve the code, etc. If something goes wrong there is always a developer on the team that shouts “check the symfony logs”, showing that they had became an essential tool for development. 


But not everything shines under the sun. Sometimes it happens that we would like to have a tool that allows us to filter the logs according to specific criteria. A tool that goes beyond a simple cat logs/frontend_dev.log | grep SELECT. We wanted to perform some analysis on the website usage, basically to help us improve it performance.


A colleague talked about Facebook Scribe, that was not actually related with this dream tool but later lead me to learn about the existence of the Apache Hive project (which was started by Facebook).


The Hive project allows to load a logs file and then filter it with SQL like commands. -Hive is more than this raw description, you can read more about it here.


My idea was to adapt the file logging format of symfony to make easy to import those files inside a Hive database. Because Hive support table partitioning by date, it should be easier to load the data from the logs and then perform the analysis with the SQL like syntax provided by Hive.


After some fights with Ant and Java 1.6 It was possible for me to get Hive running in my Mac. Then I just created a shameless copy of the sfFileLogger and renamed it to sfHiveLogger. I did small changes here and there and got it ready to log in a format that it’s easy to load later into Hive. I browsed through my testing symfony project to generate some logs and then I moved to the command line to start the fun with Hive.


There I created a table to hold the logs with the following command:


CREATE TABLE sflogs(

  logTime STRING, 

  priorityName STRING,

  message STRING

  COMMENT 'This is the sflogs table' 

  PARTITIONED BY(dt STRING) 

  ROW FORMAT DELIMITED

    FIELDS TERMINATED BY '\011' 

    LINES TERMINATED BY '\012';


Everything was working smoothly. The next step was to load the data from the logs file into the sflogs table to start issuing queries to it. I did it with this command:


LOAD DATA LOCAL INPATH '/path/to/myproject/log/frontend_dev.log.2009-01-28' INTO TABLE sflogs PARTITION (dt='2009-01-28');


With the data loaded I started to issue some commands against the Hive console like:


SELECT * FROM sflogs WHERE message LIKE "{sfRequest}%";


SELECT DISTINCT priorityname FROM sflogs;


SELECT COUNT(1) FROM sflogs;


The results were similar as when we work on a mysql client which was awesome. What amazed me the most was how with some easy changes it's possible to adapt symfony to our needs. But which were this changes? Here they are:

First we need to enable the sfHiveLogger in the logging.yml file of your symfony project under the sf_file_debug entry


Then do the shameless copy of the symfony class to your lib folder and rename it to sfHiveLogger.


Then change the code of the initialize method to look like this:


    if (!isset($options['file']))

    {

      throw new sfConfigurationException('File option is mandatory for a file logger');

    }


    $dir = dirname($options['file']);


    if (!is_dir($dir))

    {

      mkdir($dir, 0777, 1);

    }

    

    $logFileName = $options['file'] . '.' . date('Y-m-d');


    $fileExists = file_exists($logFileName);

    if (!is_writable($dir) || ($fileExists && !is_writable($logFileName)))

    {

      throw new sfFileException(sprintf('Unable to open the log file "%s" for writing', $logFileName));

    }


    $this->fp = fopen($logFileName, 'a');

    if (!$fileExists)

    {

      chmod($logFileName, 0666);

    }


Basically the change there is to append the current date in “Y-m-d-” format at the end of the logs file name. This will make easier to import the logs into Hive -only if you want them partitioned by date-.


Then on the log method change the line with:


$line = sprintf("%s %s [%s] %s%s", strftime('%b %d %H:%M:%S'), 'symfony', $priorityName, $message, DIRECTORY_SEPARATOR == '\\' ? "\r\n" : "\n");


to:


$line = sprintf("%s\t%s\t%s%s", strftime('%b %d %H:%M:%S'), $priorityName, $message, "\n");


Here what we do is to apply a little formating there. The most important part is to have the tab and then new line characters as delimiters because this was what we specified in the CREATE TABLE command above.


With this easy steps we can have a Hive enabled log file. If you want to learn more about Hive and the supported commands and the theory behind it please refer to the wiki.


Conclusion:


Besides that I’m still comparing Hive with other solutions to parse and analyze the logs, I think that this tool has a lot of potential to help debugging and profiling symfony applications. If we polish the log format and refine the table structure, then it’s shouldn’t be hard to setup some cronjobs that generate reports of the website usage, improving the usability of the symfony logs.

2 comments:

Mis papis said...

This is cool, you should consider creating a symfony plugin.

Robert Speer said...
This comment has been removed by the author.