awk is a versatile command in Unix/Linux for text processing, data extraction, and reporting.
Below are the takeaways from the GeeksforGeeks documentation:
Reference: https://www.geeksforgeeks.org/awk-command-unixlinux-examples/
Syntax:
awk options 'selection_criteria {action}' input-file > output-file
Built-in variables
- Awk's built-in variables include the field variables $1, $2, $3, and so on ($0 is the entire line).
- NR: keeps a running count of the number of input records (the current line number); printing NR together with each line prints all the lines along with their line numbers.
- NF: holds the number of fields in the current input record; $NF therefore refers to the last field.
- FS: the input field separator character, used to divide fields on the input line. The default is "white space", meaning space and tab characters. FS can be reassigned to another character (typically in BEGIN) to change the field separator, as shown in the example after this list.
- RS: the current record separator character. Since an input line is the input record by default, the default record separator is a newline.
- OFS: the output field separator, which separates the fields when Awk prints them. The default is a blank space. Whenever print has several parameters separated by commas, it prints the value of OFS between each parameter.
- ORS: the output record separator, which separates the output lines when Awk prints them. The default is a newline character; print automatically outputs the contents of ORS at the end of whatever it is given to print.
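For instance, a minimal sketch (assuming /etc/passwd as a colon-separated input file, which is not part of the source notes) that reassigns FS and OFS in a BEGIN block and prints the first and last field of each record:
awk 'BEGIN {FS = ":"; OFS = " -> "} {print $1, $NF}' /etc/passwd
Because the two fields are given to print as comma-separated parameters, the value of OFS (" -> ") appears between the login name and the shell.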
NR and NF notes
- Print the line numbers of empty lines:
awk 'NF == 0 {print NR}' geeksforgeeks.txt
- Print the last field of each line ($NF represents the last field):
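For example, a minimal sketch reusing the same sample file:
awk '{print $NF}' geeksforgeeks.txt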
- Count the lines in a file:
awk 'END { print NR }' geeksforgeeks.txt
Programming
Print the squares of the numbers from 1 to n (say n = 6):
awk 'BEGIN { for (i = 1; i <= 6; i++) print "square of", i, "is", i*i }'
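The same loop could also be written with a user-defined function; a minimal sketch, not from the source:
awk 'function square(x) { return x * x } BEGIN { for (i = 1; i <= 6; i++) print "square of", i, "is", square(i) }'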
Complex use cases
Case 1 – Parsing and Summarizing Log Files
Say we have an HTTP request log (access.log) like the one below:
192.168.1.1 - - [10/Oct/2023:13:55:36 -0400] "GET /index.html HTTP/1.1" 200 2326
192.168.1.2 - - [10/Oct/2023:14:05:21 -0400] "POST /login HTTP/1.1" 302 512
192.168.1.3 - - [10/Oct/2023:14:15:55 -0400] "GET /products/1 HTTP/1.1" 200 1548
192.168.1.4 - - [10/Oct/2023:14:20:33 -0400] "GET /about-us HTTP/1.1" 200 984
192.168.1.5 - - [10/Oct/2023:14:35:47 -0400] "GET /contact HTTP/1.1" 200 1105
192.168.1.6 - - [10/Oct/2023:14:45:22 -0400] "POST /api/data HTTP/1.1" 200 2048
192.168.1.1 - - [10/Oct/2023:15:00:18 -0400] "GET /news HTTP/1.1" 200 3072
192.168.1.2 - - [10/Oct/2023:15:05:29 -0400] "DELETE /api/session HTTP/1.1" 204 0
192.168.1.3 - - [10/Oct/2023:15:15:45 -0400] "GET /images/logo.png HTTP/1.1" 200 1024
192.168.1.4 - - [10/Oct/2023:15:25:53 -0400] "PUT /profile/update HTTP/1.1" 200 768
192.168.1.5 - - [10/Oct/2023:15:40:19 -0400] "GET /dashboard HTTP/1.1" 200 2150
192.168.1.6 - - [10/Oct/2023:15:50:05 -0400] "GET /settings HTTP/1.1" 200 1234
awk '{print $1}' access.log | sort | uniq -c | sort -nr
The command above analyzes the access log to count how many requests were made from each IP address:
- awk '{print $1}' access.log extracts the first field (the IP address) from each line.
- sort sorts the IP addresses.
- uniq -c counts the occurrences of each IP.
- sort -nr sorts the counts in numeric descending order.
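The per-IP tally can also be kept entirely inside awk with an associative array; a minimal sketch, not taken from the source:
awk '{count[$1]++} END {for (ip in count) print count[ip], ip}' access.log | sort -nr
The final sort -nr is still useful because awk's for-in loop visits array elements in an unspecified order.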
Case 2 – Generating Reports from CSV Data
Sample CSV data (sales.csv):
ItemID,ItemName,Quantity,Price
1001,Apple,30,0.50
1002,Banana,20,0.20
1003,Orange,25,0.35
1001,Apple,15,0.50
1003,Orange,10,0.35
1002,Banana,30,0.20
awk -F, 'BEGIN {print "ItemName | Total Quantity | Total Sales ($)"; print "-----------------------------------------"} NR > 1 {qty[$2] += $3; sales[$2] += $3 * $4} END {for (item in qty) printf "%-10s | %14d | %15.2f\n", item, qty[item], sales[item]}' sales.csv
- -F, : tells awk to use the comma (,) as the field separator, since the data is in CSV format.
- NR > 1 : NR ("Number of Records") is, in awk, the current line number being processed; NR > 1 skips the first line (the header).
- qty[$2] += $3; sales[$2] += $3 * $4 : for each data line, these expressions accumulate the total quantity and total sales per item. $2, $3, and $4 refer to the second (ItemName), third (Quantity), and fourth (Price) fields of the current line, respectively. The arrays qty and sales are indexed by the item name.
- BEGIN {print "ItemName | Total Quantity | Total Sales ($)"; ...} : before any data is processed, this prints a header row and a divider line for the report.
- END {for (item in qty) printf ...} : after all lines have been processed, this block iterates over the qty array and prints the item name, total quantity, and total sales for each item.
- printf "%-10s | %14d | %15.2f\n", item, qty[item], sales[item] : uses printf for formatted output, so the columns stay aligned and sales are shown with two decimal places. The %-10s format specifier left-aligns the item name in a field 10 characters wide.
Output:
ItemName | Total Quantity | Total Sales ($)
-----------------------------------------
Apple | 45 | 22.50
Banana | 50 | 10.00
Orange | 35 | 12.25
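One caveat: for (item in qty) visits array elements in an unspecified order, so the rows may not come out alphabetically. With gawk specifically (an assumption; the source does not say which awk is used), the traversal order can be pinned down in the BEGIN block, sketched here on a stripped-down version of the report:
awk -F, 'BEGIN {PROCINFO["sorted_in"] = "@ind_str_asc"} NR > 1 {qty[$2] += $3} END {for (item in qty) print item, qty[item]}' sales.csv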
Case 3 – Filtering and Transforming Data
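The notes leave this case without an example. As a minimal sketch (my own illustration, reusing the sales.csv data from Case 2), the command below filters out low-quantity rows and transforms each remaining row into an "item: line total" record:
awk -F, 'NR > 1 && $3 >= 25 {printf "%s: %.2f\n", $2, $3 * $4}' sales.csv
The pattern (NR > 1 && $3 >= 25) does the filtering; the printf action does the transformation.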
Case 4 – Analyzing System Processes
ps -u root -o pid,vsz,comm | awk '{print $2, $3}' | sort -nr
This lists the root user's processes and sorts them by memory usage (the VSZ field) in descending order.
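A small variant (a sketch that assumes ps reports VSZ in kilobytes, as it does on Linux) that also skips the ps header line and converts the value to megabytes:
ps -u root -o pid,vsz,comm | awk 'NR > 1 {printf "%.1f MB %s\n", $2 / 1024, $3}' | sort -nr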
Case 5 – Text Transformation and Cleanup
Example: Remove HTML tags from a file, leaving only the visible text.
awk '{gsub(/<[^>]*>/, ""); print}' file.html
gsub(/<[^>]*>/, "") uses the global substitution function gsub to remove anything matching the regular expression for HTML tags; print then outputs the cleaned line.
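A related cleanup sketch (my own addition, not from the source) that strips leading and trailing whitespace from every line of a plain-text file:
awk '{gsub(/^[ \t]+|[ \t]+$/, ""); print}' file.txt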