In my last update on regular expressions (regex), I mentioned that regex supports groups, a feature that Java developers use extensively to get more out of a single pattern. In this update, we are going to talk about groups in regular expressions.

The concept of groups is quite simple. We wrap parts of the pattern in parentheses, and each part captures the piece of the input string that it matches, so the input ends up divided into a number of groups. This number is not fixed and depends on our requirements. We only need to make sure that the pattern, groups included, correctly describes the format of the input string.
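As a minimal sketch of the idea, separate from the log parser discussed in this post, the snippet below splits a date string into three groups. The class name GroupBasics, the helper splitDate, and the sample date are made up for illustration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupBasics {

    // Three pairs of parentheses give three capturing groups: day, month, year.
    private static final Pattern DATE = Pattern.compile("(\\d{2})/(\\w{3})/(\\d{4})");

    // Returns {day, month, year}, or null when the input is not in that format.
    public static String[] splitDate(String input) {
        Matcher matcher = DATE.matcher(input);
        if (!matcher.matches()) {
            return null;
        }
        // group(0) would be the whole match; groups 1 to 3 follow the parentheses.
        return new String[] { matcher.group(1), matcher.group(2), matcher.group(3) };
    }

    public static void main(String[] args) {
        String[] parts = splitDate("01/Apr/2007");
        // prints "01 | Apr | 2007"
        System.out.println(parts[0] + " | " + parts[1] + " | " + parts[2]);
    }
}
```

The number of groups is simply the number of parenthesised parts we chose to write, which is why it varies with the requirement.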

There is not much more to explain about this concept. The previous post covered the basics of regex, and the same thing is implemented here, this time with the help of groups. That is it.

The following program will give you some insight into the concept.

package regex;

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SampleLogFileParsing {

    public static void main(String[] args) throws IOException {
        // Compile the pattern once, outside the loop. The dots in the IP address
        // and in the HTTP version are escaped so that they match a literal '.'.
        Pattern pattern = Pattern.compile(
                "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}) (\\S+) (\\S+) "              // groups 1-3
                + "\\[(\\d{2}/\\S{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2}) ([0-9\\-\\#\\S]{5})(\\]) " // groups 4-6
                + "(\")(\\S{1,5}) "                                                          // groups 7-8
                + "((#N/A|(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
                + "(/[#&\\n\\-=?\\+__\\%/\\w]+)([\\.\\w]+)?)) "                              // groups 9-14
                + "((HTTP/)[0-9]\\.[0-9]?)\" (\\d{3}) (\\d{0,9}) "                           // groups 15-18
                + "(.|\"(http://)?[a-zA-Z_0-9\\-]+(\\.\\w[a-zA-Z_0-9\\-]+)+"
                + "(/[#&\\n\\-=?\\+\\%/\\.\\w ]+)\") "                                       // groups 19-22
                // Was ((.*)+); the nested quantifier is dropped to avoid heavy
                // backtracking on lines that do not match. Still group 23.
                + "\"(.*)\"$");

        // try-with-resources closes both files even if an exception is thrown.
        try (BufferedReader bufferedReader = new BufferedReader(new FileReader("logfile.txt"));
             BufferedWriter fileWriter = new BufferedWriter(new FileWriter("parsedLogFile.csv"))) {

            fileWriter.write("IP,USERIDENTIFIER,USERID,DATEANDTIME,HTTPMETHOD,REQUEST,HTTPVERSION,STATUSCODE,SIZE,REFERRER,BROWSERINFORMATION");
            fileWriter.newLine();

            String line;
            while ((line = bufferedReader.readLine()) != null) {
                Matcher matcher = pattern.matcher(line);
                if (matcher.find()) {
                    // Groups 5 to 7 (time zone, closing bracket, opening quote) and
                    // the nested groups are matched but not written to the CSV.
                    fileWriter.write(matcher.group(1) + ",");   // IP
                    fileWriter.write(matcher.group(2) + ",");   // user identifier
                    fileWriter.write(matcher.group(3) + ",");   // user id
                    fileWriter.write(matcher.group(4) + ",");   // date and time
                    fileWriter.write(matcher.group(8) + ",");   // HTTP method
                    fileWriter.write(matcher.group(9) + ",");   // request
                    fileWriter.write(matcher.group(15) + ",");  // HTTP version
                    fileWriter.write(matcher.group(17) + ",");  // status code
                    fileWriter.write(matcher.group(18) + ",");  // size
                    fileWriter.write(matcher.group(19) + ",");  // referrer
                    fileWriter.write(matcher.group(23));        // browser information, no trailing comma
                    // Lines that do not match are skipped instead of leaving an empty row.
                    fileWriter.newLine();
                }
            }
        }
    }
}

 

 

If you take a close look at the above program, you will see that it deals with two files: an input file and an output file. The input file, "logfile.txt", contains 10 lines of log data, and this data is totally unstructured. The program parses this data and stores the structured output in the output file, "parsedLogFile.csv".
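The group numbers passed to matcher.group(...) in the program are not consecutive (1 to 4, then 8, 9, 15 and so on). That is because Java numbers capturing groups by the position of their opening parenthesis, counting from left to right, so groups the program does not need and groups nested inside other groups still take up a number. The sketch below shows this on a much smaller pattern; the class name GroupNumbering, the helper groupsOf, and the sample input are illustrative and not part of the program above.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GroupNumbering {

    // Opening parentheses, counted left to right:
    //   1: ((HTTP/)[0-9]\.[0-9])  outer group, the full version string
    //   2: (HTTP/)                nested inside group 1
    //   3: (\d{3})                the status code
    private static final Pattern PATTERN =
            Pattern.compile("((HTTP/)[0-9]\\.[0-9]) (\\d{3})");

    // Returns every group of the first match; index 0 is the whole match.
    // Returns an empty array when nothing matches.
    public static String[] groupsOf(String input) {
        Matcher matcher = PATTERN.matcher(input);
        if (!matcher.find()) {
            return new String[0];
        }
        String[] groups = new String[matcher.groupCount() + 1];
        for (int i = 0; i <= matcher.groupCount(); i++) {
            groups[i] = matcher.group(i);
        }
        return groups;
    }

    public static void main(String[] args) {
        String[] groups = groupsOf("HTTP/1.0 200");
        for (int i = 0; i < groups.length; i++) {
            System.out.println(i + " -> " + groups[i]);
        }
    }
}
```

Here the status code is group 3 rather than group 2, because the nested (HTTP/) group is counted first. The same effect explains why the HTTP version is group 15 and the status code is group 17 in the log pattern.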

The contents of the input file are as follows.

50.113.9.20 - jerry [01/Apr/2007:18:11:13 50005] "GET http://www.sample.com/sports/cricket.htm HTTP/1.0" 200 69 . "Mozilla/5.0 (compatible; MSIE 7.0; Windows NT 5.2; WOW64)"
231.133.16.229 - cartha [01/Apr/2007:05:12:56 -0503] "GET http://www.sample.com/worldnews/euro.htm HTTP/1.0" 200 106 "http://www.aol.co.jp/search.htm?tennis" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de-AT; rv:1.8a1) Gecko/20040520"
133.219.103.64 - kenrry [01/Apr/2007:23:18:26 00000] "GET http://www.sample.com/sports/cricket.htm HTTP/1.0" 408 69 "http://www.msn.co.ch/search.htm?ayurveda" "Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a6) Gecko/20050111"
185.174.232.245 - kim [01/Apr/2007:16:59:47 00000] "GET http://www.sample.com/health/hospitals.htm HTTP/1.0" 200 294 . "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 4.0)"
249.162.73.253 - cartul [01/Apr/2007:13:57:02 -00-1] "GET http://www.sample.com/worldnews/euro.htm HTTP/1.0" 200 106 "http://www.bbc.co.it/search.htm?Swiss Holidays" "Mozilla/5.0 (Windows; U; Win 9x 4.90; en-US; rv:1.4) Gecko/20030624 Netscape/7.1 (ax)"
99.35.88.231 - cheryl [01/Apr/2007:15:42:37 -00-9] "GET http://www.sample.com/localnews/central_bank.htm HTTP/1.0" 200 148 . "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
98.134.55.24 - jean [01/Apr/2007:11:29:11 -00-9] "GET http://www.sample.com/localnews/bangalore.htm HTTP/1.0" 200 267 . "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
59.3.99.251 - meeiti [01/Apr/2007:02:43:56 -00-1] "GET http://www.sample.com/sports.htm HTTP/1.0" 101 230 "http://www.msn.co.de/search.htm?formula1" "Mozilla/5.0 (Windows; U; Windows NT 5.1; de-AT; rv:1.8b) Gecko/20050217"
242.81.121.248 - julie [01/Apr/2007:12:04:11 -00-9] "GET http://www.sample.com/education/b-schools.htm HTTP/1.0" 200 90 . "Mozilla/4.0 (compatible; MSIE 6.0b; Windows NT 5.1)"
21.249.45.243 - cinfer [01/Apr/2007:01:13:23 -00-9] "GET http://www.sample.com/education/b-schools.htm HTTP/1.0" 200 90 . "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0; Alexa Toolbar)"

As you can see, the input data is totally unstructured and it is difficult to pick apart manually, but regex does this with ease.
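Counting parentheses in a long pattern is easy to get wrong. As an alternative sketch, which is not what the program above does, Java 7 and later also support named groups written as (?<name>...), so a field can be read by name instead of by number. The pattern below is deliberately simplified and covers only the first four fields of a log line; the class name NamedGroupSketch, the helper parsePrefix, and the group names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class NamedGroupSketch {

    // Simplified pattern: IP, user identifier, user id and timestamp only.
    // (?:...) is a non-capturing group, so it does not get a number or a name.
    private static final Pattern PREFIX = Pattern.compile(
            "^(?<ip>\\d{1,3}(?:\\.\\d{1,3}){3}) (?<ident>\\S+) (?<user>\\S+) "
            + "\\[(?<time>\\d{2}/\\w{3}/\\d{4}:\\d{2}:\\d{2}:\\d{2}) ");

    // Returns the named fields in pattern order, or an empty map when the
    // line does not start with the expected prefix.
    public static Map<String, String> parsePrefix(String line) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher matcher = PREFIX.matcher(line);
        if (matcher.find()) {
            fields.put("ip", matcher.group("ip"));
            fields.put("ident", matcher.group("ident"));
            fields.put("user", matcher.group("user"));
            fields.put("time", matcher.group("time"));
        }
        return fields;
    }

    public static void main(String[] args) {
        String line = "50.113.9.20 - jerry [01/Apr/2007:18:11:13 50005] \"GET ...\"";
        System.out.println(parsePrefix(line));
    }
}
```

Named groups still receive a number as well, so the two styles can be mixed, but reading fields by name means the code does not have to change when a group is added earlier in the pattern.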

The output file will look like this:

[Image: ParsedOutput (parsedlogfile)]

 

I hope this clears things up. Thanks for reading. Good day.
