awk is a versatile command in Unix/Linux for text processing, data extraction, and reporting.
Below are the takeaways from the GeeksforGeeks documentation:
Reference: https://www.geeksforgeeks.org/awk-command-unixlinux-examples/
Syntax:
awk options 'selection_criteria {action}' input-file > output-file
Built-in variables
- Awk's built-in variables include the field variables $1, $2, $3, and so on ($0 is the entire line).
- NR: keeps a running count of the number of input records (the current line number); printing NR together with each line prints all the lines along with their line numbers.
- NF: holds the number of fields in the current input record; $NF therefore refers to the last field.
- FS: the input field separator character, used to divide fields on the input line. The default is "white space", meaning space and tab characters. FS can be reassigned to another character (typically in BEGIN) to change the field separator, as shown in the example after this list.
- RS: the current record separator character. Since an input line is the input record by default, the default record separator is a newline.
- OFS: the output field separator, which separates the fields when Awk prints them. The default is a blank space. Whenever print has several parameters separated by commas, it prints the value of OFS between each parameter.
- ORS: the output record separator, which separates the output lines when Awk prints them. The default is a newline character; print automatically outputs the contents of ORS at the end of whatever it is given to print.
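For instance, a minimal sketch (assuming /etc/passwd as a colon-separated input file, which is not part of the source notes) that reassigns FS and OFS in a BEGIN block and prints the first and last field of each record:
awk 'BEGIN {FS = ":"; OFS = " -> "} {print $1, $NF}' /etc/passwd
Because the two fields are given to print as comma-separated parameters, the value of OFS (" -> ") appears between the login name and the shell.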
NR and NF notes
- Print the line numbers of empty lines:
awk 'NF == 0 {print NR}' geeksforgeeks.txt
- Print the last field of each line ($NF represents the last field):
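For example, a minimal sketch reusing the same sample file:
awk '{print $NF}' geeksforgeeks.txt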
- Count the lines in a file:
awk 'END { print NR }' geeksforgeeks.txt
Programming
Print the squares of the numbers from 1 to n (say n = 6):
awk 'BEGIN { for (i = 1; i <= 6; i++) print "square of", i, "is", i*i }'
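The same loop could also be written with a user-defined function; a minimal sketch, not from the source:
awk 'function square(x) { return x * x } BEGIN { for (i = 1; i <= 6; i++) print "square of", i, "is", square(i) }'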
Complex use cases
Case 1 – Parsing and Summarizing Log Files
Say we have an HTTP request log (access.log) like the one below:
192.168.1.1 - - [10/Oct/2023:13:55:36 -0400] "GET /index.html HTTP/1.1" 200 2326
192.168.1.2 - - [10/Oct/2023:14:05:21 -0400] "POST /login HTTP/1.1" 302 512
192.168.1.3 - - [10/Oct/2023:14:15:55 -0400] "GET /products/1 HTTP/1.1" 200 1548
192.168.1.4 - - [10/Oct/2023:14:20:33 -0400] "GET /about-us HTTP/1.1" 200 984
192.168.1.5 - - [10/Oct/2023:14:35:47 -0400] "GET /contact HTTP/1.1" 200 1105
192.168.1.6 - - [10/Oct/2023:14:45:22 -0400] "POST /api/data HTTP/1.1" 200 2048
192.168.1.1 - - [10/Oct/2023:15:00:18 -0400] "GET /news HTTP/1.1" 200 3072
192.168.1.2 - - [10/Oct/2023:15:05:29 -0400] "DELETE /api/session HTTP/1.1" 204 0
192.168.1.3 - - [10/Oct/2023:15:15:45 -0400] "GET /images/logo.png HTTP/1.1" 200 1024
192.168.1.4 - - [10/Oct/2023:15:25:53 -0400] "PUT /profile/update HTTP/1.1" 200 768
192.168.1.5 - - [10/Oct/2023:15:40:19 -0400] "GET /dashboard HTTP/1.1" 200 2150
192.168.1.6 - - [10/Oct/2023:15:50:05 -0400] "GET /settings HTTP/1.1" 200 1234
awk '{print $1}' access.log | sort | uniq -c | sort -nr
The command above analyzes the access log to count how many requests were made from each IP address:
- awk '{print $1}' access.log extracts the first field (the IP address) from each line.
- sort sorts the IP addresses.
- uniq -c counts the occurrences of each IP.
- sort -nr sorts the counts in numeric descending order.
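The per-IP tally can also be kept entirely inside awk with an associative array; a minimal sketch, not taken from the source:
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -nr
The final sort -nr is still useful because awk's for-in loop visits array elements in an unspecified order.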
Case 2 – Generating Reports from CSV Data
Sample CSV data (sales.csv):
ItemID,ItemName,Quantity,Price
1001,Apple,30,0.50
1002,Banana,20,0.20
1003,Orange,25,0.35
1001,Apple,15,0.50
1003,Orange,10,0.35
1002,Banana,30,0.20
awk -F, 'BEGIN {print "ItemName | Total Quantity | Total Sales ($)"; print "-----------------------------------------"} NR > 1 {qty[$2] += $3; sales[$2] += $3 * $4} END {for (item in qty) printf "%-10s | %14d | %15.2f\n", item, qty[item], sales[item]}' sales.csv
- -F, : tells awk to use the comma (,) as the field separator, since the data is in CSV format.
- NR > 1 : NR ("Number of Records") is, in awk, the current line number being processed; NR > 1 skips the first line (the header).
- qty[$2] += $3; sales[$2] += $3 * $4 : for each data line, these expressions accumulate the total quantity and total sales per item. $2, $3, and $4 refer to the second (ItemName), third (Quantity), and fourth (Price) fields of the current line, respectively. The arrays qty and sales are indexed by the item name.
- BEGIN {print "ItemName | Total Quantity | Total Sales ($)"; ...} : before any data is processed, this prints a header row and a divider line for the report.
- END {for (item in qty) printf ...} : after all lines have been processed, this block iterates over the qty array and prints the item name, total quantity, and total sales for each item.
- printf "%-10s | %14d | %15.2f\n", item, qty[item], sales[item] : uses printf for formatted output, so the columns stay aligned and sales are shown with two decimal places. The %-10s format specifier left-aligns the item name in a field 10 characters wide.
Output:
ItemName | Total Quantity | Total Sales ($)
-----------------------------------------
Apple | 45 | 22.50
Banana | 50 | 10.00
Orange | 35 | 12.25
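One caveat: for (item in qty) visits array elements in an unspecified order, so the rows may not come out alphabetically. With gawk specifically (an assumption; the source does not say which awk is used), the traversal order can be pinned down in the BEGIN block, sketched here on a stripped-down version of the report:
awk -F, 'BEGIN {PROCINFO["sorted_in"] = "@ind_str_asc"} NR > 1 {qty[$2] += $3} END {for (item in qty) print item, qty[item]}' sales.csv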
Case 3 – Filtering and Transforming Data
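The notes leave this case without an example. As a minimal sketch (my own illustration, reusing the sales.csv data from Case 2), the command below filters out low-quantity rows and transforms each remaining row into an "item: line total" record:
awk -F, 'NR > 1 && $3 >= 25 {printf "%s: %.2f\n", $2, $3 * $4}' sales.csv
The pattern (NR > 1 && $3 >= 25) does the filtering; the printf action does the transformation.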
Case 4 – Analyzing System Processes
ps -u root -o pid,vsz,comm | awk '{print $2, $3}' | sort -nr
This lists the root user's processes and sorts them by memory usage (the VSZ field) in descending order.
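A small variant (a sketch that assumes ps reports VSZ in kilobytes, as it does on Linux) that also skips the ps header line and converts the value to megabytes:
ps -u root -o pid,vsz,comm | awk 'NR > 1 {printf "%.1f MB %s\n", $2 / 1024, $3}' | sort -nr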
Case 5 – Text Transformation and Cleanup
Example: Remove HTML tags from a file, leaving only the visible text.
awk '{gsub(/<[^>]*>/, ""); print}' file.html
gsub(/<[^>]*>/, "") uses the global substitution function gsub to remove anything matching the regular expression for HTML tags; print then outputs the cleaned line.
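A related cleanup sketch (my own addition, not from the source) that strips leading and trailing whitespace from every line of a plain-text file:
awk '{gsub(/^[ \t]+|[ \t]+$/, ""); print}' file.txt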