Removing Duplicates From a List Using Python
Recently I’ve been working on a Python project for a class. The project evaluates PCAPs: it looks through a packet capture using DPKT (among other libraries) and notes the source and destination IP addresses, along with any IP addresses mentioned within HTTP packets. It also notes other things, but that’s beside the point. When parsing a packet capture and noting every IP encountered, the list can get pretty long! Of course, there are ways to ensure that duplicates don’t get put into the list in the first place, but in this instance I chose to evaluate the list and remove the duplicates after the PCAP had been fully parsed.
Let’s start off simple and take a look at creating a list in Python:
listExample = ["abc", "def", "hij", "def"]
This is a simple list, but as you may notice, “def” appears twice, and the duplicate at the end will need to be removed later. Individual list items can be accessed by index, starting from 0, like in a print statement:
print(listExample[1])
print(listExample[3])
Both statements print “def”, since items 1 and 3 hold the same value. Because every item can be reached like this, it’s easy to iterate through the list looking for duplicates. Writing code to remove the duplicates from a Python list is relatively straightforward. Here’s an example:
listExample = ["abc", "def", "hij", "def"]
fixedList = []  # Initializing an empty list
for x in listExample:
    if x not in fixedList:  # Only keep values not already in fixedList
        fixedList.append(x)
print(fixedList)
At this point, the output should be ['abc', 'def', 'hij']. The important part of this code is the “if x not in fixedList” portion: the program checks whether the value is already present in the resulting list before appending it. If it is present, the value isn’t appended, so only the first occurrence of each value is kept.
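Since that check scans the result list once for every item, the work grows quickly as the list gets longer. Here is a minimal sketch of the same idea wrapped in a reusable function, next to a shorter alternative. The function names remove_duplicates and remove_duplicates_fast are made up for illustration, and the second one assumes the items are hashable (strings such as IP addresses are) and that you’re on Python 3.7 or later, where dictionaries keep insertion order.

```python
def remove_duplicates(items):
    """Return a new list with duplicates dropped, keeping first-seen order."""
    result = []
    for item in items:
        if item not in result:  # linear scan of result for every item
            result.append(item)
    return result


def remove_duplicates_fast(items):
    """Same result for hashable items, using dict keys to track what was seen."""
    # dict keys are unique and keep insertion order (Python 3.7+)
    return list(dict.fromkeys(items))


listExample = ["abc", "def", "hij", "def"]
print(remove_duplicates(listExample))
print(remove_duplicates_fast(listExample))
```

Both calls should print the same three-item list. For a short list either version is fine; the dictionary-based one mostly pays off once the list holds thousands of entries.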
That’s it! There are tons of different ways to do things in Python, but I thought I’d post this since it’s the first time I’ve actually had to use lists. Coming over from C, lists are nowhere near as easy to implement there, and arrays in both Python and C are not as flexible as lists for purposes like this.
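The introduction mentioned that duplicates can also be kept out while the list is being built. This is not what my project does, just a rough sketch of one way it could look, assuming the values are hashable. The addresses below are made-up sample values, and the names encountered, seen and unique_ips are only for illustration.

```python
# Hypothetical sample values standing in for addresses pulled from packets
encountered = ["192.0.2.1", "192.0.2.7", "192.0.2.1", "198.51.100.4", "192.0.2.7"]

seen = set()      # fast membership checks
unique_ips = []   # keeps the order in which addresses first appeared

for ip in encountered:
    if ip not in seen:  # first time this address shows up
        seen.add(ip)
        unique_ips.append(ip)

print(unique_ips)
```

The set does the “have I seen this already?” check, while the list preserves the order of first appearance, so no cleanup pass is needed afterwards.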
If you want to read more, check out these links:
Python Lists: https://www.w3schools.com/python/python_lists.asp
Removing Duplicates from Python Lists: https://www.geeksforgeeks.org/python-ways-to-remove-duplicates-from-list/
Python Arrays vs. Lists: https://learnpython.com/blog/python-array-vs-list/
This is an afterword about my current project, so feel free to skip it. I initially stored the information from the parsed packets in strings, separating the entries with newlines. When it came time to remove the duplicates and have a clean and easy way to print these things out, I went searching and found out the beauty of lists. I may eventually change the original functions so that they use lists first and foremost instead of strings to contain the data, but for now, they will stay this way. This search also led me to the split function in Python, which came in handy!
Called without arguments, the split function splits a string on whitespace and returns a list where each word from that string is a list item. It works like this:
string1 = "Hi how are you"
list1 = string1.split()
print(list1)
And the output would be ['Hi', 'how', 'are', 'you']. Pretty handy if you’ve backed yourself into a corner with a string and need to change that information over! More examples of this in use can be found here: https://www.w3schools.com/python/ref_string_split.asp
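For newline-separated strings like the ones described in this afterword, the pieces fit together roughly as follows. This is a sketch with made-up sample addresses, not the project’s actual code, and it uses the built-in splitlines string method so that a trailing newline doesn’t leave an empty item at the end of the list.

```python
# Hypothetical newline-separated string; the addresses are made-up sample values
ip_text = "192.0.2.1\n192.0.2.7\n192.0.2.1\n"

ip_list = ip_text.splitlines()  # one list item per line, no trailing empty item

unique_ips = []
for ip in ip_list:
    if ip not in unique_ips:  # skip addresses that were already kept
        unique_ips.append(ip)

print("\n".join(unique_ips))  # back to one address per line for printing
```

Joining with "\n" at the end turns the cleaned list back into one-entry-per-line text, which is handy if the rest of the code still expects strings.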