I just started learning Python and would like to read an Apache log file and put parts of each line into different lists.
A line from the file:
172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"
According to the Apache website, the format is:
%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i"
I’m able to open the file and read it line by line, but I don’t know how to parse each line according to that format so that I can put each part into a list.
Answers:
Method 1
This is a job for regular expressions.
For example:
import re

line = '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"'
# Literal brackets are escaped; \d matches a digit.
regex = r'([\d.]+) - - \[(.*?)\] "(.*?)" (\d+) - "(.*?)" "(.*?)"'
print(re.match(regex, line).groups())
The output would be a tuple with 6 pieces of information from the line (specifically, the groups within parentheses in that pattern):
('172.16.0.3', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827')
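To get from one matched line to the separate lists the question asks for, the same pattern can be applied to every line. The following is a minimal sketch, assuming the combined log format shown in the question. The names `LOG_PATTERN` and `parse_lines` and the sample lines are illustrative, the pattern is loosened slightly so the `-` fields may hold other values, and an in-memory list stands in for the open file.

```python
import re

# Pattern from the answer above, loosened (an assumption) so that the
# identity, user, and size fields may hold any non-space token.
LOG_PATTERN = re.compile(
    r'([\d.]+) \S+ \S+ \[(.*?)\] "(.*?)" (\d+) \S+ "(.*?)" "(.*?)"'
)

def parse_lines(lines):
    """Return one list per captured field; lines that do not match are skipped."""
    hosts, times, requests, statuses, referers, agents = [], [], [], [], [], []
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match is None:
            continue  # malformed or differently formatted line
        host, when, request, status, referer, agent = match.groups()
        hosts.append(host)
        times.append(when)
        requests.append(request)
        statuses.append(status)
        referers.append(referer)
        agents.append(agent)
    return hosts, times, requests, statuses, referers, agents

# Illustrative input; any iterable of strings works, including a file object.
sample = [
    '172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" '
    '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"',
    'not a log line',
]
hosts, times, requests, statuses, referers, agents = parse_lines(sample)
print(hosts)     # ['172.16.0.3']
print(statuses)  # ['401']
```

With a real log, the open file object can be passed in place of `sample`, since iterating over a file yields its lines.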
Method 2
I have created a Python library which does just that: apache-log-parser.
>>> import apache_log_parser
>>> from pprint import pprint
>>> line_parser = apache_log_parser.make_parser("%h <<%P>> %t %Dus \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %l %u")
>>> log_line_data = line_parser('127.0.0.1 <<6113>> [16/Aug/2013:15:45:34 +0000] 1966093us "GET / HTTP/1.1" 200 3478 "https://example.com/" "Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)" - -')
>>> pprint(log_line_data)
{'pid': '6113',
'remote_host': '127.0.0.1',
'remote_logname': '-',
'remote_user': '',
'request_first_line': 'GET / HTTP/1.1',
'request_header_referer': 'https://example.com/',
'request_header_user_agent': 'Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.2.18)',
'response_bytes_clf': '3478',
'status': '200',
'time_received': '[16/Aug/2013:15:45:34 +0000]',
'time_us': '1966093'}
Method 3
Use a regular expression to split a row into separate “tokens”:
>>> row = """172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827" """
>>> import re
>>> list(map(''.join, re.findall(r'"(.*?)"|\[(.*?)\]|(\S+)', row)))
['172.16.0.3', '-', '-', '25/Sep/2002:14:04:19 +0200', 'GET / HTTP/1.1', '401', '-', '', 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827']
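The token list is positional, so it can be paired with field names to make each part easier to pick out. This is a small sketch built on the same `findall` pattern. The `FIELDS` labels and the `tokenize` helper are illustrative names, not part of the answer, and the nine-token check assumes the combined log format from the question.

```python
import re

# Illustrative labels for the nine tokens of the combined log format
# (%h %l %u %t "%r" %>s %b "%{Referer}i" "%{User-Agent}i").
FIELDS = ['host', 'identity', 'user', 'time', 'request',
          'status', 'size', 'referer', 'user_agent']

# Quoted strings, bracketed strings, or runs of non-space characters.
TOKEN = re.compile(r'"(.*?)"|\[(.*?)\]|(\S+)')

def tokenize(row):
    """Split a log row into labelled tokens; return None if the count is off."""
    # Each findall result is a 3-tuple with at most one non-empty group.
    tokens = [''.join(groups) for groups in TOKEN.findall(row)]
    if len(tokens) != len(FIELDS):
        return None
    return dict(zip(FIELDS, tokens))

row = ('172.16.0.3 - - [25/Sep/2002:14:04:19 +0200] "GET / HTTP/1.1" 401 - "" '
       '"Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.1) Gecko/20020827"')
record = tokenize(row)
print(record['host'], record['status'])  # 172.16.0.3 401
```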
Another solution is to use a dedicated tool, e.g. http://pypi.python.org/pypi/pylogsparser/0.4
Method 4
RegEx seemed extreme and problematic considering the simplicity of the format, so I wrote this little splitter which others may find useful as well:
def apache2_logrow(s):
    ''' Fast split on Apache2 log lines

    http://httpd.apache.org/docs/trunk/logs.html
    '''
    row = [ ]
    qe = qp = None # quote end character (qe) and quote parts (qp)
    for s in s.replace('\r','').replace('\n','').split(' '):
        if qp:
            qp.append(s)
        elif '' == s: # blanks
            row.append('')
        elif '"' == s[0]: # begin " quote "
            qp = [ s ]
            qe = '"'
        elif '[' == s[0]: # begin [ quote ]
            qp = [ s ]
            qe = ']'
        else:
            row.append(s)

        l = len(s)
        if l and qe == s[-1]: # end quote
            if l == 1 or s[-2] != '\\': # don't end on escaped quotes
                row.append(' '.join(qp)[1:-1].replace('\\'+qe, qe))
                qp = qe = None
    return row
Method 5
import re

# Named groups capture the fields of interest; identity and user are skipped.
HOST = r'^(?P<host>.*?)'
SPACE = r'\s'
IDENTITY = r'\S+'
USER = r'\S+'
TIME = r'(?P<time>\[.*?\])'
REQUEST = r'"(?P<request>.*?)"'
STATUS = r'(?P<status>\d{3})'
SIZE = r'(?P<size>\S+)'
REGEX = HOST+SPACE+IDENTITY+SPACE+USER+SPACE+TIME+SPACE+REQUEST+SPACE+STATUS+SPACE+SIZE+SPACE

def parser(log_line):
    match = re.search(REGEX, log_line)
    return (match.group('host'),
            match.group('time'),
            match.group('request'),
            match.group('status'),
            match.group('size'))

logLine = '''180.76.15.30 - - [24/Mar/2017:19:37:57 +0000] "GET /shop/page/32/?count=15&orderby=title&add_to_wishlist=4846 HTTP/1.1" 404 10202 "-" "Mozilla/5.0 (compatible; Baiduspider/2.0; +http://www.baidu.com/search/spider.html)"'''
result = parser(logLine)
print(result)
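The `time` group comes back as a bracketed string. If the timestamps need to be compared or sorted, it can be converted to a `datetime`. The following is a hedged sketch; `parse_apache_time` is an illustrative helper name, and it assumes the default Apache `%t` layout and an English locale for the month abbreviation.

```python
from datetime import datetime

def parse_apache_time(raw):
    """Convert '[24/Mar/2017:19:37:57 +0000]' to a timezone-aware datetime."""
    # Strip the surrounding brackets captured by the time group.
    # %b matches the month abbreviation in the current locale (assumed English);
    # %z parses the +hhmm UTC offset.
    return datetime.strptime(raw.strip('[]'), '%d/%b/%Y:%H:%M:%S %z')

stamp = parse_apache_time('[24/Mar/2017:19:37:57 +0000]')
print(stamp.isoformat())  # 2017-03-24T19:37:57+00:00
```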
Method 6
Add this to httpd.conf to write the Apache logs as JSON.
LogFormat "{\"time\":\"%t\", \"remoteIP\" :\"%a\", \"host\": \"%V\", \"request_id\": \"%L\", \"request\":\"%U\", \"query\" : \"%q\", \"method\":\"%m\", \"status\":\"%>s\", \"userAgent\":\"%{User-agent}i\", \"referer\":\"%{Referer}i\" }" json_log
CustomLog /var/log/apache_access_log json_log
CustomLog "|/usr/bin/python -u apacheLogHandler.py" json_log
Now your access logs are written in JSON format.
Use the Python code below to read the JSON log lines as they are appended to the file.
apacheLogHandler.py
import time

# Adjust the path to match the CustomLog target in httpd.conf.
f = open('apache_access_log.log', 'r')
for line in f:  # read all lines already in the file
    print(line.strip())

# keep waiting forever for more lines.
while True:
    line = f.readline()  # just read more
    if line:  # if you got something...
        print('got data:', line.strip())
    time.sleep(1)
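The loop above only prints each raw line. To turn a line into fields, it can be decoded with the standard `json` module. This is a minimal sketch: `parse_json_log_line` is an illustrative helper name and the sample line is made up to resemble the `json_log` format. Apache may escape some characters in header values as `\xhh`, which is not valid JSON, so this sketch returns `None` for lines that fail to decode.

```python
import json

def parse_json_log_line(line):
    """Decode one JSON log line into a dict; return None if it is not valid JSON."""
    try:
        return json.loads(line)
    except json.JSONDecodeError:
        return None

# Hypothetical line in the shape produced by the json_log format above.
sample = ('{"time":"[24/Mar/2017:19:37:57 +0000]", "remoteIP":"180.76.15.30", '
          '"host":"example.com", "request":"/shop/page/32/", "query":"?count=15", '
          '"method":"GET", "status":"404", "userAgent":"Mozilla/5.0", "referer":"-"}')
entry = parse_json_log_line(sample)
print(entry['method'], entry['status'])  # GET 404
```

Inside the `while True` loop, the call `parse_json_log_line(line)` could replace the plain `print` so that each new line becomes a dictionary.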
All methods were sourced from stackoverflow.com or stackexchange.com and are licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.