Parser for Apache log files. This is a port to Python of Peter Hickman's
Apache::LogRegex Perl module:
<http://cpan.uwinnipeg.ca/~peterhi/Apache-LogRegex>

Takes the Apache logging format defined in your httpd.conf and generates
a regular expression which is used to parse a line from the log file and
return it as a dictionary with keys corresponding to the fields defined
in the log format.
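
The format-to-regex idea described above can be sketched in a few lines.
This is a simplified stand-in, not this module's actual implementation:
build_regex and parse_line are hypothetical names, and the field rules
(quoted fields, bracketed %t, everything else non-whitespace) are reduced
assumptions that only cover simple formats.

```python
import re

# Simplified sketch of the technique; NOT the module's own code.
def build_regex(log_format):
    names, parts = [], []
    for element in log_format.split(' '):
        quoted = element.startswith('\\"')
        if quoted:
            element = element[2:-2]       # drop the leading and trailing \"
        names.append(element)
        if quoted:
            parts.append(r'"([^"]*)"')    # quoted field: anything but a quote
        elif element.endswith('t'):
            parts.append(r'(\[[^\]]+\])') # time field: [ ... ]
        else:
            parts.append(r'(\S*)')        # plain field: no whitespace
    return names, re.compile('^' + ' '.join(parts) + '$')

def parse_line(names, regex, line):
    match = regex.match(line.strip())
    if match is None:
        raise ValueError("Unable to parse: %s" % line)
    # Pair each format field with the text captured for it
    return dict(zip(names, match.groups()))

names, regex = build_regex(r'%h %l %u %t \"%r\" %>s %b')
entry = parse_line(names, regex,
    '212.74.15.68 - - [23/Jan/2004:11:36:20 +0000] '
    '"GET /images/previous.png HTTP/1.1" 200 2607')
print(entry['%r'])
```

The keys of the resulting dictionary are the raw format fields, which is
why the returned dictionary depends entirely on the format you pass in.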

    import sys
    import apachelog

    # Format copied and pasted from Apache conf - use raw string + single quotes
    format = r'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'

    p = apachelog.parser(format)

    for line in open('/var/apache/access.log'):
        try:
            data = p.parse(line)
        except apachelog.ApacheLogParserError:
            sys.stderr.write("Unable to parse %s" % line)

The return dictionary from the parse method depends on the input format.
For the above example, the returned dictionary would look like:

    {
    '%>s': '200',
    '%b': '2607',
    '%h': '212.74.15.68',
    '%l': '-',
    '%r': 'GET /images/previous.png HTTP/1.1',
    '%t': '[23/Jan/2004:11:36:20 +0000]',
    '%u': '-',
    '%{Referer}i': 'http://peterhi.dyndns.org/bandwidth/index.html',
    '%{User-Agent}i': 'Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021202'
    }

...given an access log entry like (split across lines for formatting):

    212.74.15.68 - - [23/Jan/2004:11:36:20 +0000] "GET /images/previous.png HTTP/1.1"
    200 2607 "http://peterhi.dyndns.org/bandwidth/index.html"
    "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.2) Gecko/20021202"

You can also re-map the field names by subclassing the parser and
overriding (or re-pointing) the alias method.
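
A minimal sketch of that re-mapping follows. MiniParser is a hypothetical
stand-in reduced to the alias hook, so that the example runs on its own;
the mapping values are illustrative assumptions, not names defined by this
module. With the real parser you would subclass it and override alias in
the same way.

```python
# Hypothetical stand-in for the parser class, reduced to the alias hook.
class MiniParser(object):
    def __init__(self, fields):
        # alias() is applied once per field, at construction time
        self.names = [self.alias(f) for f in fields]

    def alias(self, name):
        return name  # default: keep the Apache format string as the key


class FriendlyParser(MiniParser):
    # Illustrative mapping; pick whatever names suit your application.
    mapping = {'%h': 'host', '%>s': 'status', '%b': 'bytes'}

    def alias(self, name):
        # Fall back to the original field name when no alias is defined
        return self.mapping.get(name, name)


p = FriendlyParser(['%h', '%l', '%>s', '%b'])
print(p.names)  # ['host', '%l', 'status', 'bytes']
```

Because the hook runs when the parser is built, the re-mapping adds no
per-line cost while parsing.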

Generally you should be able to copy and paste the format string from
your Apache configuration, but remember to place it in a raw string
using single-quotes, so that backslashes are handled correctly.
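
To see why the raw string matters, compare the same literal with and
without the r prefix. This short illustration uses only built-in string
behaviour; the variable names are arbitrary.

```python
raw = r'\"%r\"'     # backslashes preserved, exactly as written in httpd.conf
cooked = '\"%r\"'   # Python consumes each backslash, leaving "%r"

print(len(raw), len(cooked))  # 6 4
print(raw == cooked)          # False
```

Without the r prefix the backslashes from the Apache format never reach
the parser.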

This module provides three of the most common log formats in the
formats dictionary:

    # Common Log Format (CLF)
    p = apachelog.parser(apachelog.formats['common'])

    # Common Log Format with Virtual Host
    p = apachelog.parser(apachelog.formats['vhcommon'])

    # NCSA extended/combined log format
    p = apachelog.parser(apachelog.formats['extended'])

For notes regarding performance while reading lines from a file
in Python, see <http://effbot.org/zone/readline-performance.htm>.
A further performance boost can be gained by using psyco
<http://psyco.sourceforge.net/>.

On my system, using a loop like:

    for line in open('access.log'):
        p.parse(line)

...the parser handled ~60,000 lines / second. Adding psyco to the mix
raised that to ~75,000 lines / second.
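
If you want to measure throughput on your own system, a rough sketch like
the one below may help. It times a hand-written Common Log Format regex
(an assumed stand-in, not the pattern this module generates) against one
sample line held in memory, so file I/O is excluded and the figure will
not be directly comparable to the numbers quoted above.

```python
import re
from timeit import default_timer

# Assumed stand-in regex for the Common Log Format; not generated by this module.
regex = re.compile(r'^(\S*) (\S*) (\S*) (\[[^\]]+\]) "([^"]*)" (\S*) (\S*)$')
line = ('212.74.15.68 - - [23/Jan/2004:11:36:20 +0000] '
        '"GET /images/previous.png HTTP/1.1" 200 2607')

count = 20000
start = default_timer()
for _ in range(count):
    regex.match(line)
elapsed = default_timer() - start

# Guard against a zero reading on very coarse timers
rate = count / elapsed if elapsed > 0 else float('inf')
print("%.0f lines / second" % rate)
```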
82 __license__ = """Released under the same terms as Perl.
83 See: http://dev.perl.org/licenses/
85 __author__ = "Harry Fuecks <hfuecks@gmail.com>"
87 "Peter Hickman <peterhi@ntlworld.com>",
88 "Loic Dachary <loic@dachary.org>"
93 class ApacheLogParserError(Exception):
    Allows dicts to be accessed via dot notation as well as subscripts.
    Makes using the friendly names nicer.
101 def __getattr__(self, name):
106 # Explanatory comments copied from
107 # http://httpd.apache.org/docs/2.2/mod/mod_log_config.html
112 # Size of response in bytes, excluding HTTP headers.
113 '%B':'response_bytes',
114 # Size of response in bytes, excluding HTTP headers. In CLF
115 # format, i.e. a "-" rather than a 0 when no bytes are sent.
116 '%b':'response_bytes_clf',
117 # The contents of cookie Foobar in the request sent to the server.
118 # Only version 0 cookies are fully supported.
121 # The time taken to serve the request, in microseconds.
122 '%D':'response_time_us',
123 # The contents of the environment variable FOOBAR
130 # The request protocol
131 '%H':'request_protocol',
132 # The contents of Foobar: header line(s) in the request sent to
133 # the server. Changes made by other modules (e.g. mod_headers)
137 # Number of keepalive requests handled on this connection.
138 # Interesting if KeepAlive is being used, so that, for example,
139 # a "1" means the first keepalive request after the initial one,
140 # "2" the second, etc...; otherwise this is always 0 (indicating
141 # the initial request). Available in versions 2.2.11 and later.
142 '%k':'keepalive_num',
143 # Remote logname (from identd, if supplied). This will return a
144 # dash unless mod_ident is present and IdentityCheck is set On.
145 '%l':'remote_logname',
147 '%m':'request_method',
148 # The contents of note Foobar from another module.
151 # The contents of Foobar: header line(s) in the reply.
153 '%{}o':'reply_header',
154 # The canonical port of the server serving the request
156 # The canonical port of the server serving the request or the
157 # server's actual port or the client's actual port. Valid
158 # formats are canonical, local, or remote.
161 # The process ID of the child that serviced the request.
163 # The process ID or thread id of the child that serviced the
164 # request. Valid formats are pid, tid, and hextid. hextid requires
165 # APR 1.2.0 or higher.
168 # The query string (prepended with a ? if a query string exists,
169 # otherwise an empty string)
171 # First line of request
172 # e.g., what you'd see in the logs as 'GET / HTTP/1.1'
174 # The handler generating the response (if any).
175 '%R':'response_handler',
176 # Status. For requests that got internally redirected, this is
177 # the status of the *original* request --- %>s for the last.
        # Time the request was received (standard English format)
182 # The time, in the form given by format, which should be in
183 # strftime(3) format. (potentially localized)
184 #'%{format}t':'TODO',
185 # The time taken to serve the request, in seconds.
186 '%T':'response_time_sec',
187 # Remote user (from auth; may be bogus if return status (%s) is 401)
189 # The URL path requested, not including any query string.
191 # The canonical ServerName of the server serving the request.
192 '%v':'canonical_server_name',
193 # The server name according to the UseCanonicalName setting.
194 '%V':'server_name_config', #TODO: Needs better name
195 # Connection status when response is completed:
196 # X = connection aborted before the response completed.
197 # + = connection may be kept alive after the response is sent.
198 # - = connection will be closed after the response is sent.
199 '%X':'completed_connection_status',
200 # Bytes received, including request and headers, cannot be zero.
201 # You need to enable mod_logio to use this.
202 '%I':'bytes_received',
203 # Bytes sent, including headers, cannot be zero. You need to
204 # enable mod_logio to use this
208 def __init__(self, format, use_friendly_names=False):
210 Takes the log format from an Apache configuration file.
212 Best just copy and paste directly from the .conf file
213 and pass using a Python raw string e.g.
215 format = r'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"'
216 p = apachelog.parser(format)
221 self._use_friendly_names = use_friendly_names
222 self._parse_format(format)
224 def _parse_format(self, format):
        Converts the input format to a regular
        expression, as well as extracting fields.

        Raises an exception if it couldn't compile
        the generated regex.
232 format = format.strip()
233 format = re.sub('[ \t]+',' ',format)
237 findquotes = re.compile(r'^\\"')
238 findreferreragent = re.compile('Referer|User-Agent', re.I)
239 findpercent = re.compile('^%.*t$')
240 lstripquotes = re.compile(r'^\\"')
241 rstripquotes = re.compile(r'\\"$')
244 for element in format.split(' '):
247 if findquotes.search(element): hasquotes = 1
250 element = lstripquotes.sub('', element)
251 element = rstripquotes.sub('', element)
253 if self._use_friendly_names:
254 self._names.append(self.alias(element))
256 self._names.append(element)
261 if element == '%r' or findreferreragent.search(element):
262 subpattern = r'\"([^"\\]*(?:\\.[^"\\]*)*)\"'
264 subpattern = r'\"([^\"]*)\"'
266 elif findpercent.search(element):
267 subpattern = r'(\[[^\]]+\])'
269 elif element == '%U':
272 subpatterns.append(subpattern)
274 self._pattern = '^' + ' '.join(subpatterns) + '$'
276 self._regex = re.compile(self._pattern)
278 raise ApacheLogParserError(e)
280 def parse(self, line):
        Parses a single line from the log file and returns
        a dictionary of its contents.

        Raises an exception if it couldn't parse the line.
288 match = self._regex.match(line)
292 for k, v in zip(self._names, match.groups()):
296 raise ApacheLogParserError("Unable to parse: %s with the %s regular expression" % ( line, self._pattern ) )
298 def alias(self, name):
        Override / replace this method if you want to map format
        field names to something else. This method is called
        when the parser is constructed, not when actually parsing
        a log file.
305 For custom format names, such as %{Foobar}C, 'Foobar' is referred to
306 (in this function) as the custom_format and '%{}C' as the name
308 If the custom_format has a '-' in it (and is not a time format), then the
309 '-' is replaced with a '_' so the name remains a valid identifier.
311 Takes and returns a string fieldname
316 if name.startswith('%{'):
317 custom_format = '_' + name[2:-2]
318 name = '%{}' + name[-1]
321 custom_format = custom_format.replace('-', '_')
324 return self.format_to_name[name] + custom_format
330 Returns the compound regular expression the parser extracted
331 from the input format (a string)
337 Returns the field names the parser extracted from the
338 input format (a list)
Frequently used log formats stored here
346 # Common Log Format (CLF)
347 'common':r'%h %l %u %t \"%r\" %>s %b',
349 # Common Log Format with Virtual Host
350 'vhcommon':r'%v %h %l %u %t \"%r\" %>s %b',
352 # NCSA extended/combined log format
353 'extended':r'%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\"',