Filter out duplicate lines from sorted text files using uniq

Mastering the uniq Command in Linux: unlock the power of filtering out duplicate lines from sorted text files and command output. This guide covers everything from basic usage to advanced techniques, empowering DevOps engineers to manage and manipulate data effortlessly.

Vamsi Penmetsa
itversity
6 min read · Oct 20, 2024


uniq command in the Linux terminal

Introduction

Imagine you’re an editor tasked with identifying and removing duplicate lines from a massive manuscript. You need a precise and efficient tool to streamline this process. In the world of Linux systems, the uniq command serves a similar purpose. It allows you to filter out or report repeated lines in sorted text files or command output, making data manipulation straightforward and efficient. This article delves into the intricacies of the uniq command, offering both theoretical insights and practical use cases to help you master data filtering.

Follow https://medium.com/itversity publication for articles on Full Stack, Data Engineering, DevOps, Cloud, etc.

✅ Save the List: LINUX for DevOps Engineer on Medium

Do SUBSCRIBE 📩 Vamsi Penmetsa for daily DevOps dose.

Understanding uniq

What is uniq?

The uniq command is a Unix utility used to filter out or report repeated lines in a sorted file or from the output of a command. Because uniq only compares each line with the one immediately before it, duplicates must be adjacent, which is why the input is normally sorted first. This makes it particularly useful for identifying unique lines in a data set during data analysis and processing tasks.

Historical Background

The uniq command has been an essential part of Unix-like operating systems since their early development. It provides a simple yet powerful way to manage and manipulate text data, making it an indispensable tool for system administrators and DevOps engineers.

Real-world Analogy

Imagine uniq as a sieve that allows only unique data to pass through, filtering out duplicates. Whether you need unique entries from a sorted list or want to count occurrences of specific lines, uniq provides the precision and efficiency you need.

Photo by Olga Kovalski on Unsplash

Key Concepts and Definitions

Before diving into the usage of uniq, it’s essential to understand some key terms:

  • Sorted File: A file where lines are arranged in a specific order (usually alphabetical or numerical).
  • Duplicate Line: A line that appears more than once in a file.
  • Unique Line: A line that appears exactly once in a file.

In-Depth Usage and Examples

Basic Usage of uniq

To filter out duplicate lines from a sorted file, use the following syntax:

$ uniq [options] filename

Filtering Unique Lines

To display only unique lines, use the -u option:

$ uniq -u filename

Example

Display unique lines from example.txt:

$ uniq -u example.txt
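
For instance, assuming example.txt holds the following (hypothetical) sorted contents, -u prints only the lines that occur exactly once:

$ cat example.txt
apple
apple
banana
grape

$ uniq -u example.txt
banana
grape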

Counting Duplicate Lines

To count the number of occurrences of each line, use the -c option:

$ uniq -c filename

Example

Count occurrences of each line in example.txt:

$ uniq -c example.txt
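
Using the same hypothetical contents shown above, -c prefixes every line with its number of occurrences:

$ uniq -c example.txt
      2 apple
      1 banana
      1 grape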

Displaying Duplicate Lines

To display only duplicate lines, use the -d option:

$ uniq -d filename

Example

Display duplicate lines from example.txt:

$ uniq -d example.txt
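
With the same hypothetical file, -d prints one copy of each line that appears more than once:

$ uniq -d example.txt
apple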

Ignoring Case

To ignore case differences when comparing lines, use the -i option:

$ uniq -i filename

Example

Ignore case differences in example.txt:

$ uniq -i example.txt
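
As a sketch, assuming sample_case.txt contains mixed-case duplicates that are already adjacent, -i treats them as equal and keeps the first line of each group:

$ cat sample_case.txt
Apple
apple
Banana
banana

$ uniq -i sample_case.txt
Apple
Banana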

Specifying Fields to Compare

To skip a specified number of fields before comparing lines, use the -f option:

$ uniq -f number filename

Example

Skip the first field before comparing lines in example.txt:

$ uniq -f 1 example.txt
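
Fields here are runs of non-blank characters separated by blanks. As a hypothetical illustration with a file ids.txt, the first two lines differ only in their first field, so skipping that field makes them compare as duplicates:

$ cat ids.txt
101 apple
102 apple
103 banana

$ uniq -f 1 ids.txt
101 apple
103 banana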

Common Options for uniq

  • -c, --count: Prefix lines by the number of occurrences.
  • -d, --repeated: Only print duplicate lines.
  • -u, --unique: Only print unique lines.
  • -i, --ignore-case: Ignore differences in case when comparing lines.
  • -f, --skip-fields=N: Avoid comparing the first N fields.
  • -s, --skip-chars=N: Avoid comparing the first N characters.
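
These options are often combined in pipelines. A common pattern, sketched here with a hypothetical data.txt, counts every line and then ranks the results so the most frequent entries come first:

$ sort data.txt | uniq -c | sort -rn | head -5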

Intermediate and Advanced Techniques

Using uniq with Pipes

You can use uniq in combination with other commands using pipes to filter unique lines from command output.

Example

Filter unique usernames from the /etc/passwd file:

$ cut -d ':' -f 1 /etc/passwd | sort | uniq
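
A variation of the same pipeline counts how many accounts use each login shell (field 7 of /etc/passwd):

$ cut -d ':' -f 7 /etc/passwd | sort | uniq -c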

Counting Unique Lines

Count the number of unique lines in a file:

$ sort filename | uniq | wc -l

Example

Count unique lines in example.txt:

$ sort example.txt | uniq | wc -l
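
Since sort can also deduplicate on its own, sort -u gives the same count in a single step:

$ sort -u example.txt | wc -l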

Filtering Unique Lines with Specific Fields

Extract and compare specific fields to filter unique lines.

Example

Extract unique user IDs from the /etc/passwd file:

$ cut -d ':' -f 3 /etc/passwd | sort | uniq
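
Adding -d to the same pipeline flags any UID that is assigned to more than one account, which is usually worth investigating:

$ cut -d ':' -f 3 /etc/passwd | sort -n | uniq -d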

Hands-On Exercise

Let’s put your knowledge to the test with a practical exercise.

Prerequisites

  • A Linux system with the uniq command available.
  • Basic knowledge of the terminal.
  • A sample text file for testing.

Exercise

Filter Unique Lines:

  • Create a sample text file named sample.txt.
  • Use uniq to display only unique lines from sample.txt.

Count Duplicate Lines:

  • Use uniq to count the number of occurrences of each line in sample.txt.

Display Duplicate Lines:

  • Use uniq to display only duplicate lines from sample.txt.

Ignore Case Differences:

  • Create a sample text file named sample_case.txt with mixed case lines.
  • Use uniq to ignore case differences and filter unique lines from sample_case.txt.

Use uniq with Pipes:

  • Use uniq in combination with other commands to filter unique lines from the /etc/passwd file.

Expected Results

By the end of this exercise, you should be able to:

  • Filter unique and duplicate lines using uniq.
  • Count occurrences of each line using uniq.
  • Ignore case differences when comparing lines using uniq.
  • Use uniq in combination with other commands using pipes.

Advanced Use Cases

Extracting Data from Log Files

In a DevOps environment, filtering unique entries from log files can help you analyze and troubleshoot issues efficiently.

Example: Filtering Unique IP Addresses

Extract unique IP addresses from a log file:

$ cut -d ' ' -f 1 log_file.txt | sort | uniq
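
Assuming the client IP address is the first space-separated field, as in the common Apache/Nginx access-log format, the busiest clients can be ranked by request count:

$ cut -d ' ' -f 1 log_file.txt | sort | uniq -c | sort -rn | head -10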

Processing Large Text Files

When dealing with large text files, uniq can be used to filter and analyze unique lines without loading the entire file into memory.

Example: Filtering Unique Lines

Filter unique lines from a large text file:

$ sort large_file.txt | uniq
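
sort handles files larger than memory by spilling to temporary files, and its -u option produces the same deduplicated output in one step:

$ sort -u large_file.txt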

Integrating uniq in Shell Scripts

You can integrate uniq into shell scripts to automate data filtering tasks.

Example: Automating Data Filtering

Create a script filter_unique.sh to filter unique lines from a text file:

#!/bin/bash
# Sort the input file so duplicates become adjacent, then drop them.
sort "$1" | uniq > unique_lines.txt

Make the script executable:

$ chmod +x filter_unique.sh

Run the script:

$ ./filter_unique.sh sample.txt
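
A slightly more defensive sketch of the same script (keeping the unique_lines.txt output name) checks its argument before running:

#!/bin/bash
# filter_unique.sh - write the unique lines of the given file to unique_lines.txt
if [ $# -ne 1 ] || [ ! -r "$1" ]; then
    echo "Usage: $0 <readable-file>" >&2
    exit 1
fi
sort "$1" | uniq > unique_lines.txt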

Troubleshooting uniq Issues

Common Errors

  • Unsorted Input: Ensure the input file is sorted before using uniq.
  • File Not Found: Ensure the file path is correct and the file exists.
  • Permission Denied: Ensure you have the necessary permissions to read the file.

Example: Resolving Unsorted Input

Sort the File

To ensure that the uniq command works correctly, the input file needs to be sorted. You can use the sort command to sort the file and save the sorted output to a new file.

$ sort sample.txt -o sample_sorted.txt

Use uniq on Sorted File

Once the file is sorted, you can use the uniq command to filter out duplicate lines.

$ uniq sample_sorted.txt

Full Example with Answers

1. Create a Sample File

First, let’s create a sample file named sample.txt with some duplicate lines.

$ cat > sample.txt << EOL
apple
banana
apple
orange
banana
grape
EOL

2. Sort the File

Now, sort the sample.txt file and save the output to sample_sorted.txt.

$ sort sample.txt -o sample_sorted.txt

Output:

$ cat sample_sorted.txt
apple
apple
banana
banana
grape
orange

3. Use uniq on Sorted File

Finally, use the uniq command to filter out duplicate lines from the sorted file.

$ uniq sample_sorted.txt

Output:

apple
banana
grape
orange

In summary, by sorting the file first, we ensure that the uniq command can correctly identify and filter out duplicate lines.

Bonus cheatsheet 🎁
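
A quick reference to the options covered in this article:

uniq file            # remove adjacent duplicate lines
uniq -c file         # prefix each line with its occurrence count
uniq -d file         # print only the lines that are repeated
uniq -u file         # print only the lines that are not repeated
uniq -i file         # ignore case when comparing lines
uniq -f N file       # skip the first N fields before comparing
uniq -s N file       # skip the first N characters before comparing
sort file | uniq     # sort first so duplicates become adjacent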

Conclusion

In this article, we’ve explored the depths of the uniq command, from its basic usage to advanced techniques. We’ve also provided practical examples and a hands-on exercise to help you master data filtering. By leveraging uniq, you can efficiently manage and manipulate text data, enhancing your ability to analyze and process information in Linux-based systems.

Your Next Challenge

Now that you’re familiar with uniq, challenge yourself to explore other text processing tools like awk, sed, and grep. Understanding these tools will further enhance your ability to manipulate and analyze text data effectively.

Next Steps for Further Learning

Practice Recommendations

  • Filter and manipulate different types of text data using uniq.
  • Experiment with different options and understand their implications.
  • Share your data filtering strategies and findings with the DevOps community for feedback and improvement.

Discussion Questions

  • How can you balance simplicity and efficiency when using uniq for data filtering?
  • What are some real-world scenarios where uniq proved invaluable for managing and manipulating text data?
  • How can you integrate uniq with other text processing tools for a comprehensive data management strategy?

If you liked this post:
🔔 Follow Vamsi Penmetsa
♻ Repost to help others find it
💾 Save this post for future reference


Written by Vamsi Penmetsa

Lead SRE, I post a FREE daily DevOps blog – FOLLOW ✅ & consider a SUBSCRIBE 📩 | DevOps Community (40K+ Pro's) Q&A? https://www.linkedin.com/groups/13986647