The server access log records all requests processed by the server. The location and content of the access log are controlled by the CustomLog directive. Of course, storing the information in the access log is only the start of log management. The next step is to analyse this information to produce useful statistics.
The principal use of awk is to break up each line of a file into ‘fields’ or ‘columns’ using a pre-defined separator. Because each line of the log file is based on the standard format we can do many things quite easily.
Using the default separator, which is any white-space (spaces or tabs), we get the following fields from the standard combined log format:
awk '{print $1}' access.log # ip address (%h)
awk '{print $2}' access.log # RFC 1413 identity (%l)
awk '{print $3}' access.log # userid (%u)
awk '{print $4,$5}' access.log # date/time (%t)
awk '{print $9}' access.log # status code (%>s)
awk '{print $10}' access.log # size (%b)
awk -F\" '{print $2}' access.log # request line (%r)
awk -F\" '{print $4}' access.log # referer
awk -F\" '{print $6}' access.log # user agent
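To see how the two separators carve up an entry, the sketch below runs them against a single made-up line in the combined log format. The sample line is an assumption for illustration only; your own entries may differ if your LogFormat has been customised.

```shell
# One made-up entry in the combined log format (illustrative data only).
sample='203.0.113.7 - alice [29/Jul/2015:06:25:14 +0000] "GET /index.html HTTP/1.1" 200 5120 "http://example.com/start" "Mozilla/5.0 (X11; Linux x86_64)"'

# Whitespace-separated fields: client address, status code, response size.
printf '%s\n' "$sample" | awk '{print $1, $9, $10}'

# The timestamp spans fields 4 and 5 because of the space before the offset.
printf '%s\n' "$sample" | awk '{print $4, $5}'

# Quote-separated fields: request line, referer, user agent.
printf '%s\n' "$sample" | awk -F\" '{print $2; print $4; print $6}'
```

Splitting on the double quote is what keeps the request line, referer and user agent in one piece, since all three can contain spaces.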
Now that you understand the basics of breaking up the log file and identifying different elements, we can move on to more practical examples. But before we do that, we should explain how you can modify your log format and quickly extend capabilities of these simple examples.
The format argument to the LogFormat and CustomLog directives is a string. This string is used to log each request to the log file. It can contain literal characters copied into the log files and the C-style control characters “\n” and “\t” to represent new-lines and tabs. Literal quotes and backslashes should be escaped with backslashes.
The characteristics of the request itself are logged by placing “%” directives in the format string, which are replaced in the log file by the values as follows:
%%
The percent sign
%a
Remote IP-address
%A
Local IP-address
%B
Size of response in bytes, excluding HTTP headers.
%b
Size of response in bytes, excluding HTTP headers. In CLF format, i.e. a ‘-’ rather than a 0 when no bytes are sent.
%{Foobar}C
The contents of cookie Foobar in the request sent to the server. Only version 0 cookies are fully supported.
%D
The time taken to serve the request, in microseconds.
%{FOOBAR}e
The contents of the environment variable FOOBAR
%f
Filename
%h
Remote host
%H
The request protocol
%{Foobar}i
The contents of Foobar: header line(s) in the request sent to the server. Changes made by other modules (e.g. mod_headers) affect this. If you’re interested in what the request header was prior to when most modules would have modified it, use mod_setenvif to copy the header into an internal environment variable and log that value with the %{VARNAME}e described above.
%k
Number of keepalive requests handled on this connection. Interesting if KeepAlive is being used, so that, for example, a ‘1’ means the first keepalive request after the initial one, ‘2’ the second, etc…; otherwise this is always 0 (indicating the initial request). Available in versions 2.2.11 and later.
%l
Remote logname (from identd, if supplied). This will return a dash unless mod_ident is present and IdentityCheck is set On.
%m
The request method
%{Foobar}n
The contents of note Foobar from another module.
%{Foobar}o
The contents of Foobar: header line(s) in the reply.
%p
The canonical port of the server serving the request
%{format}p
The canonical port of the server serving the request or the server’s actual port or the client’s actual port. Valid formats are canonical, local, or remote.
%P
The process ID of the child that serviced the request.
%{format}P
The process ID or thread id of the child that serviced the request. Valid formats are pid, tid, and hextid. hextid requires APR 1.2.0 or higher.
%q
The query string (prepended with a ? if a query string exists, otherwise an empty string)
%r
First line of request
%R
The handler generating the response (if any).
%s
Status. For requests that got internally redirected, this is the status of the *original* request — %>s for the last.
%t
Time the request was received (standard English format)
%{format}t
The time, in the form given by format, which should be in an extended strftime(3) format (potentially localized). If the format starts with begin: (default) the time is taken at the beginning of the request processing. If it starts with end: it is the time when the log entry gets written, close to the end of the request processing. In addition to the formats supported by strftime(3), the following format tokens are supported:
sec
number of seconds since the Epoch
msec
number of milliseconds since the Epoch
usec
number of microseconds since the Epoch
msec_frac
millisecond fraction
usec_frac
microsecond fraction
These tokens can not be combined with each other or strftime(3) formatting in the same format string. You can use multiple %{format}t tokens instead. The extended strftime(3) tokens are available in 2.2.30 and later.
%T
The time taken to serve the request, in seconds.
%{UNIT}T
The time taken to serve the request, in a time unit given by UNIT. Valid units are ms for milliseconds, us for microseconds, and s for seconds. Using s gives the same result as %T without any format; using us gives the same result as %D. Combining %T with a unit is available in 2.2.30 and later.
%u
Remote user (from auth; may be bogus if return status (%s) is 401)
%U
The URL path requested, not including any query string.
%v
The canonical ServerName of the server serving the request.
%V
The server name according to the UseCanonicalName setting.
%X
Connection status when response is completed:
X = connection aborted before the response completed.
+ = connection may be kept alive after the response is sent.
- = connection will be closed after the response is sent.
(This directive was %c in late versions of Apache 1.3, but this conflicted with the historical ssl %{var}c syntax.)
%I
Bytes received, including request and headers, cannot be zero. You need to enable mod_logio to use this.
%O
Bytes sent, including headers, cannot be zero. You need to enable mod_logio to use this.
%{VARNAME}^ti
The contents of VARNAME: trailer line(s) in the request sent to the server.
%{VARNAME}^to
The contents of VARNAME: trailer line(s) in the response sent from the server.
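Extending the format pays off quickly. As a hedged sketch, assume the combined format has had %D appended as its last field, for example with LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\" %D" timed (the nickname timed, the function names and the sample entries below are all invented for illustration). The average service time per URL can then be read from the last field:

```shell
# Hypothetical sample: combined-format entries with %D (microseconds) appended.
sample_log() {
cat <<'EOF'
203.0.113.7 - - [29/Jul/2015:06:25:14 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "curl/7.43.0" 1500
203.0.113.8 - - [29/Jul/2015:06:25:15 +0000] "GET /index.html HTTP/1.1" 200 5120 "-" "curl/7.43.0" 2500
203.0.113.7 - - [29/Jul/2015:06:25:16 +0000] "GET /slow.php HTTP/1.1" 200 900 "-" "curl/7.43.0" 40000
EOF
}

# Average service time per URL path in milliseconds, slowest first.
# $7 is the request path; $NF is the trailing %D value (an assumption
# about where %D was placed in the format string).
avg_time_per_url() {
    awk '{ sum[$7] += $NF; hits[$7]++ }
         END { for (u in sum) printf "%.1f ms %s\n", sum[u] / hits[u] / 1000, u }' "$@" |
        sort -rn
}

sample_log | avg_time_per_url
```

Against a real log, pass the file name instead of piping the sample, e.g. avg_time_per_url access.log. Using $NF rather than a fixed field number keeps the sketch working even when the user agent contains spaces.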
List all user agents ordered by the number of times they appear
awk -F\" '{print $6}' access.log | sort | uniq -c | sort -rn
Identify problems with your site
Count the different server responses, then look at the requests that caused them:
awk '{print $9}' access.log | sort | uniq -c | sort -rn
The output shows how many of each type of response your site is returning. A ‘normal’ request results in a 200 code, which means a page or file has been requested and delivered, but there are many other possibilities.
The most common responses are:
200 – OK
206 – Partial Content
301 – Moved Permanently
302 – Found
304 – Not Modified
401 – Unauthorised (password required)
403 – Forbidden
404 – Not Found
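When the list of individual codes gets long, it can help to group them by class first. The following is a minimal sketch, assuming the status code is in field 9 as in the combined format; the function names and sample entries are invented for illustration.

```shell
# Hypothetical sample entries in the combined log format.
sample_log() {
cat <<'EOF'
203.0.113.7 - - [29/Jul/2015:06:25:14 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
203.0.113.7 - - [29/Jul/2015:06:25:15 +0000] "GET /logo.png HTTP/1.1" 200 2048 "-" "curl/7.43.0"
203.0.113.8 - - [29/Jul/2015:06:25:16 +0000] "GET /logo.png HTTP/1.1" 304 - "-" "curl/7.43.0"
203.0.113.9 - - [29/Jul/2015:06:25:17 +0000] "GET /missing HTTP/1.1" 404 209 "-" "curl/7.43.0"
203.0.113.9 - - [29/Jul/2015:06:25:18 +0000] "POST /form HTTP/1.1" 500 530 "-" "curl/7.43.0"
EOF
}

# Count responses per status class (2xx, 3xx, 4xx, 5xx).
# The class is the first digit of field 9 followed by "xx".
status_classes() {
    awk '{ class[substr($9, 1, 1) "xx"]++ }
         END { for (c in class) print class[c], c }' "$@" | sort -k2
}

sample_log | status_classes
```

A sudden rise in the 4xx or 5xx line is usually the cue to drill down into the individual codes. Malformed entries where field 9 is not a status code will show up as odd-looking classes.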
What is Causing 404s?
A 404 error is defined as a missing file or resource. Looking at the request URI will tell you which one it is.
$ grep " 404 " access.log | cut -d ' ' -f 7 | sort | uniq -c | sort -nr
404 Request Responses
$ cat access.log | awk '($9 == 404)' | awk '{ print $7 }' | sort | uniq -c | sort -rn | head -n 25
Unique Request IP Addresses
$ cat access.log | awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n 25
Unique Request IP Addresses – Resolve country
Requires the geoiplookup command (on Debian or Ubuntu: apt-get install geoip-bin)
$ cat access.log | awk '{ print $1 }' | sort | uniq -c | sort -rn | head -n 25 | awk '{ printf("%5d\t%-15s\t", $1, $2); system("geoiplookup " $2 " | cut -d \\: -f2 ") }'
Who’s ‘hotlinking’ my images?
Hotlinking is when other websites embed your images directly, so your server pays the bandwidth for their pages. The following command lists the referers of image requests that did not come from your own site (here www.n0where.net):
awk -F\" '($2 ~ /\.(jpg|gif)/ && $4 !~ /^http:\/\/www\.n0where\.net/){print $4}' access.log | sort | uniq -c | sort -n
Blank User Agents
A ‘blank’ user agent is typically an indication that the request is from an automated script or someone who really values their privacy. The following command will give you a list of IP addresses for those user agents so you can decide if any need to be blocked:
awk -F\" '($6 ~ /^-?$/)' access.log | awk '{print $1}' | sort | uniq
Too Much Load From One Source?
When your site is under a heavy load, you should know whether the load is from real users or something else:
A configuration or system problem
A client app or bot hitting your site too fast
A denial of service attack
cat access.log | cut -d ' ' -f 1 | sort | uniq -c | sort -nr
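On a busy site the full per-address list is long, so it can be more useful to print only the addresses above a threshold. This is a rough sketch; the function name heavy_hitters, the threshold argument and the sample entries are assumptions made for illustration.

```shell
# Hypothetical sample entries in the combined log format.
sample_log() {
cat <<'EOF'
198.51.100.9 - - [29/Jul/2015:06:25:14 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
198.51.100.9 - - [29/Jul/2015:06:25:14 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
203.0.113.8 - - [29/Jul/2015:06:25:15 +0000] "GET /a HTTP/1.1" 200 100 "-" "curl/7.43.0"
198.51.100.9 - - [29/Jul/2015:06:25:15 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
203.0.113.7 - - [29/Jul/2015:06:25:16 +0000] "GET /b HTTP/1.1" 200 100 "-" "curl/7.43.0"
203.0.113.8 - - [29/Jul/2015:06:25:17 +0000] "GET /c HTTP/1.1" 200 100 "-" "curl/7.43.0"
EOF
}

# Print "count address" for every client with more than LIMIT requests,
# busiest first. Usage: heavy_hitters LIMIT [logfile...]
heavy_hitters() {
    limit=$1
    shift
    awk -v limit="$limit" '{ hits[$1]++ }
        END { for (ip in hits) if (hits[ip] > limit) print hits[ip], ip }' "$@" |
        sort -rn
}

sample_log | heavy_hitters 2
```

Against a real log this would be called as heavy_hitters 1000 access.log, with the threshold chosen to suit your normal traffic.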
Top 10 visiting IPs
cat access.log | awk '{ print $1 }' | sort | uniq -c | sort -n -r | head -n 10
Traffic in kilobytes per status code
cat access.log | awk '{ total[$9] += $10 } END { for (x in total) { printf "Status code %3d : %9.2f KB\n", x, total[x]/1024 } }'
Top 10 referrers
cat access.log | awk -F\" '{ print $4 }' | grep -v '^-$' | grep -v 'http://www.adayinthelife' | sort | uniq -c | sort -rn | head -n 10
Top 10 user-agents
This is the same pipeline as for referrers, with the user agent in field 6 instead of field 4 and without the greps:
cat access.log | awk -F\" '{ print $6 }' | sort | uniq -c | sort -rn | head -n 10
Count requests per IP address over the last 10,000 hits to a site:
tail -n 10000 access.log | awk '{print $1}' | sort | uniq -c | sort -n
Requests per day
awk '{print $4}' access.log | cut -d: -f1 | uniq -c
Requests per hour
grep "29/Jul" access.log | cut -d'[' -f2 | cut -d']' -f1 | awk -F: '{print $2":00"}' | sort -n | uniq -c
Requests per minute
Run the following command to see requests per minute for one hour of one day. Only minutes with more than 10 requests are shown:
grep "29/Jul/2015:06" access.log | cut -d'[' -f2 | cut -d']' -f1 | awk -F: '{print $2":"$3}' | sort | uniq -c | awk '{ if ($1 > 10) print $0}'
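The same per-minute count can be produced in a single awk pass without hard-coding the date. This is a minimal sketch, assuming the default %t timestamp in field 4 and a log written in chronological order; the function name per_minute and the sample entries are invented for illustration.

```shell
# Hypothetical sample entries in the combined log format.
sample_log() {
cat <<'EOF'
203.0.113.7 - - [29/Jul/2015:06:25:14 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
203.0.113.8 - - [29/Jul/2015:06:25:30 +0000] "GET /a HTTP/1.1" 200 100 "-" "curl/7.43.0"
203.0.113.7 - - [29/Jul/2015:06:25:59 +0000] "GET /b HTTP/1.1" 200 100 "-" "curl/7.43.0"
203.0.113.9 - - [29/Jul/2015:06:26:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
203.0.113.7 - - [29/Jul/2015:06:27:10 +0000] "GET /c HTTP/1.1" 200 100 "-" "curl/7.43.0"
203.0.113.8 - - [29/Jul/2015:06:27:45 +0000] "GET /d HTTP/1.1" 200 100 "-" "curl/7.43.0"
EOF
}

# Count requests per minute, printed in the order the minutes first appear.
# substr($4, 2, 17) drops the leading "[" and the seconds, leaving
# dd/Mon/yyyy:hh:mm as the grouping key.
per_minute() {
    awk '{ m = substr($4, 2, 17)
           if (!(m in n)) order[++k] = m
           n[m]++ }
         END { for (i = 1; i <= k; i++) print n[order[i]], order[i] }' "$@"
}

sample_log | per_minute
```

Keeping the first-seen order avoids having to sort on the timestamp, whose month abbreviation does not sort chronologically as text.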
Total unique visitors:
cat access.log | awk '{print $1}' | sort | uniq -c | wc -l
Unique visitors today:
cat access.log | grep "$(date '+%d/%b/%Y')" | awk '{print $1}' | sort | uniq -c | wc -l
Unique visitors this month:
cat access.* | grep "$(date '+%b/%Y')" | awk '{print $1}' | sort | uniq -c | wc -l
Unique visitors on arbitrary date:
cat access.* | grep 28/Jul/2015 | awk '{print $1}' | sort | uniq -c | wc -l
Unique visitors for the month:
cat access.* | grep Jun/2015 | awk '{print $1}' | sort | uniq -c | wc -l
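Rather than running one grep per date, the unique-visitor count for every day in the log can be collected in one pass. The sketch below treats each distinct IP address as one visitor, as the commands above do; the function name unique_per_day and the sample entries are assumptions for illustration.

```shell
# Hypothetical sample entries in the combined log format.
sample_log() {
cat <<'EOF'
203.0.113.7 - - [28/Jul/2015:09:00:01 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
203.0.113.7 - - [28/Jul/2015:09:00:05 +0000] "GET /a HTTP/1.1" 200 100 "-" "curl/7.43.0"
203.0.113.8 - - [28/Jul/2015:17:30:00 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
203.0.113.7 - - [29/Jul/2015:06:25:14 +0000] "GET / HTTP/1.1" 200 5120 "-" "curl/7.43.0"
EOF
}

# Print "visitors day" for each day, in the order the days first appear.
unique_per_day() {
    awk '{
        day = substr($4, 2, 11)          # e.g. 28/Jul/2015
        if (!(day in count)) order[++k] = day
        if (!((day, $1) in seen)) {      # first request from this IP that day
            seen[day, $1] = 1
            count[day]++
        }
    }
    END { for (i = 1; i <= k; i++) print count[order[i]], order[i] }' "$@"
}

sample_log | unique_per_day
```

Changing the substr length from 11 to 8 and the start from 2 to 5 would group by month (Jul/2015) instead of by day.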
Sorted statistics of “number of visits/requests” “visitor’s IP address”:
cat access.log | awk '{print "requests from " $1}' | sort | uniq -c | sort
Most Popular URLs
$ cat access.log | awk '{ print $7 }' | sort | uniq -c | sort -rn | head -n 25
Real-time Requests
$ tail -f access.log | awk '{ printf("%-15s\t%s\t%s\t%s\n", $1, $6, $9, $7) }'
Real time – Resolve IPs
$ tail -f access.log | awk '{ cmd = "geoiplookup " $1 " | cut -d \\: -f2"; cmd | getline geo; close(cmd); printf("%-15s\t%s\t%s\t%-20s\t%s\n", $1, $6, $9, geo, $7) }'
Unique IP addresses:
cat access.log | awk '{print $1}' | sort | uniq
Unique IP addresses with date-time stamp:
cat access.log | awk '{print $1 " " $4}' | sort | uniq
Unique IP addresses and browser:
cat access.log | awk '{print $1 " " $12 " " $19}' | sort | uniq
Unique IP addresses and OS:
cat access.log | awk '{print $1 " " $13}' | sort | uniq
Unique IP addresses, date-time and request method:
cat access.log | awk '{print $1 " " $4 " " $6}' | sort | uniq
Unique IP addresses, date-time and request URL:
cat access.log | awk '{print $1 " " $4 " " $7}' | sort | uniq