Parsing Apache log files

I just started learning Python and would like to read an Apache log file and put parts of each line into different lists.

A line from the file:

172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"

According to the Apache website, the format is:

%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"

I can open the file and read it as is, but I don’t know how to parse each line in that format so that I can put each part into a list.

Answers:

Method 1

This is a job for regular expressions.

For example:

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
regex = r'([(\d\.)]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'

import re
print(re.match(regex, line).groups())

The output would be a tuple with 6 pieces of information from the line (specifically, the groups within parentheses in that pattern):

('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
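
The pattern above hard-codes the `- -` identity and user fields and a `-` size, so it only matches lines shaped like the sample. Below is a minimal sketch of a more general version that also collects the parts into separate lists, as the question asks. It assumes the combined format quoted in the question; `COMBINED`, `parse_lines`, and the list names are illustrative and not part of the original answer.

```python
import re

# One named group per field of:
# %h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>.*?)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>.*?)" "(?P<agent>.*?)"\s*$'
)

def parse_lines(lines):
    """Collect host, time, request and status of each matching line into lists."""
    hosts, times, requests, statuses = [], [], [], []
    for line in lines:
        m = COMBINED.match(line)
        if m is None:  # skip lines that do not fit the assumed format
            continue
        hosts.append(m.group('host'))
        times.append(m.group('time'))
        requests.append(m.group('request'))
        statuses.append(m.group('status'))
    return hosts, times, requests, statuses

sample = [
    '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" '
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"\n',
    'this line is not in the combined format\n',
]
hosts, times, requests, statuses = parse_lines(sample)
print(hosts, statuses)
```

Because `parse_lines` accepts any iterable of strings, an open file object can be passed in place of `sample`.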

Method 2

I have created a Python library which does just that: apache-log-parser.

>>> import apache_log_parser
>>> from pprint import pprint
>>> line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b  \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
>>> log_line_data = line_parser('127.0.0.1 <<6113>> [16/Aug/2013:15:45:34 +0000] 1966093us "GET / HTTP/1.1" 200 3478  "https://example.com/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)" - -')
>>> pprint(log_line_data)
{'pid': '6113',
 'remote_host': '127.0.0.1',
 'remote_logname': '-',
 'remote_user': '',
 'request_first_line': 'GET / HTTP/1.1',
 'request_header_referer': 'https://example.com/',
 'request_header_user_agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)',
 'response_bytes_clf': '3478',
 'status': '200',
 'time_received': '[16/Aug/2013:15:45:34 +0000]',
 'time_us': '1966093'}

Method 3

Use a regular expression to split a row into separate “tokens”:

>>> row = """172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827" """
>>> import re
>>> list(map(''.join, re.findall(r'"(.*?)"|\[(.*?)\]|(\S+)', row)))
['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']
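
The token list is positional, so it can help to attach field names to it. A small sketch, assuming the nine fields of the combined format from the question; `FIELDS` and `tokenize` are hypothetical names introduced here for illustration.

```python
import re

# Field names for the combined format, in the order the tokens appear.
FIELDS = ['host', 'ident', 'user', 'time', 'request',
          'status', 'size', 'referer', 'agent']

def tokenize(row):
    # A token is a quoted string, a bracketed string, or a bare word.
    # Only one of the three groups is non-empty per match, so joining
    # the groups yields the token text.
    return [''.join(groups)
            for groups in re.findall(r'"(.*?)"|\[(.*?)\]|(\S+)', row)]

row = ('172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" '
       '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"')
record = dict(zip(FIELDS, tokenize(row)))
print(record['host'], record['status'])
```

Note that `zip` silently stops at the shorter sequence, so a line with missing tokens produces a shorter dictionary rather than an error.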

Another solution is to use a dedicated tool, e.g. http://pypi.python.org/pypi/pylogsparser/0.4

Method 4

RegEx seemed extreme and problematic considering the simplicity of the format, so I wrote this little splitter which others may find useful as well:

def apache2_logrow(s):
    ''' Fast split on Apache2 log lines

    http://httpd.apache.org/docs/trunk/logs.html
    '''
    row = [ ]
    qe = qp = None # quote end character (qe) and quote parts (qp)
    for s in s.replace('\r','').replace('\n','').split(' '):
        if qp:
            qp.append(s)
        elif '' == s: # blanks
            row.append('')
        elif '"' == s[0]: # begin " quote "
            qp = [ s ]
            qe = '"'
        elif '[' == s[0]: # begin [ quote ]
            qp = [ s ]
            qe = ']'
        else:
            row.append(s)

        l = len(s)
        if l and qe == s[-1]: # end quote
            if l == 1 or s[-2] != '\\': # don't end on escaped quotes
                row.append(' '.join(qp)[1:-1].replace('\\'+qe, qe))
                qp = qe = None
    return row

Method 5

import re


HOST = r'^(?P<host>.*?)'
SPACE = r'\s'
IDENTITY = r'\S+'
USER = r'\S+'
TIME = r'(?P<time>\[.*?\])'
REQUEST = r'"(?P<request>.*?)"'
STATUS = r'(?P<status>\d{3})'
SIZE = r'(?P<size>\S+)'

REGEX = HOST+SPACE+IDENTITY+SPACE+USER+SPACE+TIME+SPACE+REQUEST+SPACE+STATUS+SPACE+SIZE+SPACE

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('host'),
            match.group('time'),
            match.group('request'),
            match.group('status'),
            match.group('size'))


logLine = """180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1" 404 10202 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"""
result = parser(logLine)
print(result)
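
The `time` group comes back as a bracketed string. If a real timestamp is needed, it can be converted with `datetime.strptime`. This is a sketch rather than part of the original answer; `parse_apache_time` is a hypothetical helper name, and the `%b` directive assumes an English (or C) locale for the month abbreviation.

```python
from datetime import datetime

def parse_apache_time(value):
    """Turn '[24/Mar/2017:19:37:57 +0000]' into a timezone-aware datetime."""
    # Strip the surrounding brackets, then parse day/month/year:time offset.
    return datetime.strptime(value.strip('[]'), '%d/%b/%Y:%H:%M:%S %z')

stamp = parse_apache_time('[24/Mar/2017:19:37:57 +0000]')
print(stamp.isoformat())
```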

Method 6

Add this to httpd.conf to write the Apache logs as JSON.

LogFormat "{\"time\":\"%t\", \"remoteIP\":\"%a\", \"host\":\"%V\", \"request_id\":\"%L\", \"request\":\"%U\", \"query\":\"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" json_log

CustomLog /var/log/apache_access_log json_log
CustomLog "|/usr/bin/python -u apacheLogHandler.py" json_log

Now you will see your access logs in JSON format.
Use the Python code below to follow the JSON logs as they are continuously updated.

apacheLogHandler.py

import time
f = open('/var/log/apache_access_log', 'r') # file written by the CustomLog directive above
for line in f: # read all lines already in the file
  print(line.strip())

# keep waiting forever for more lines.
while True:
  line = f.readline() # just read more
  if line: # if you got something...
    print('got data:', line.strip())
  else:
    time.sleep(1) # nothing new yet, wait before polling again
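
The loop above only prints the raw lines. To work with individual fields, each line can be decoded with the standard `json` module. The following is a minimal sketch: `parse_json_line` is a hypothetical helper and the sample line is made up to match the LogFormat keys. Apache writes some special characters as `\xhh` escapes, which are not valid JSON, so the sketch returns `None` for lines that fail to decode instead of raising.

```python
import json

def parse_json_line(line):
    """Return the decoded record as a dict, or None if the line is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

# Made-up sample line using a few of the keys from the LogFormat directive.
sample = ('{"time":"[25/Sep/2002:14:04:19 +0200]", "remoteIP":"172.16.0.3", '
          '"request":"/", "method":"GET", "status":"401"}')
record = parse_json_line(sample)
print(record['remoteIP'], record['status'])
```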

All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.
