Performance of finding duplicate array entries

I ran across a question today while browsing.  The person asking wanted to know how to find duplicate entries in an int array, and the answer that someone else quickly gave was to make nested for loops to iterate over an array of integers.  My first thought was a little different than that, so I looked around to see if it had been answered other places in the same way.  Nested for loops seems to be the most common suggestion.  I also found a few that related to strings that use a HashSet, which I have used to identify duplicates before.  I wondered which way is really faster, which is where this quest begins.

I was tempted to go into some Big O notation ramblings, but it would take longer than I have here to analyze the methods fully, so let’s skip that and see what  happens when we write some code and run it.  It’s more fun to do an experiment anyway.

I intentionally duplicated code and logic in each section so each test would be as similar as possible, and I add the duplicates that are identified to an ArrayList<String> so I will have taken some action on it.  It does add to the elapsed time, but it adds the same time for each test.  When the sample size is small, you can print the duplicates ArrayList to see what was found.  It’s probably not a good idea to do that when you start with 10,000 elements though.

Sorted vs. Nested Loop:

The results were as I expected.  The relative overhead of sorting the arrays is very high when there are only a few values to search.  The efficiency, in my test environment, evens out at around 145 elements.  When there are fewer than 145, the nested loop is faster than the sorted array, but when there are more, the sorted array with sequential search method is faster than the nested loop. The same holds true for Strings.  If you have over 100,000 elements in the int test, the difference is more than 6 seconds in favor of the sorted method.

HashSet Method:

For strings, the HashSet method is the clear winner in every case.  Int arrays are a slightly different story.  To put the number into the HashSet, in my test, I converted it to a String first.  When the array contains 15000 or less elements, the HashSet is the same or slower than sorted, but when the array is larger, the HashSet method is quite a bit faster than either of the other two.  Just for fun I tried one with 1,000,000 int elements, and the sorted array much faster than the HashSet.

Conclusion:

If speed is your only concern, and you have some strict rules that your array size conforms to, then you can tune your code to match.  Don’t go overboard with this though. For many programmers, 1.9ms is not something you really need to worry about.  Some of us work in very high transaction rate systems, and every millisecond counts.  We still need to weigh that against memory usage, so saving 1.9ms while burning a Megabyte of memory might not be the right way to go.  Every application is going to be a little different, so do your homework and experiment a little.  Take a look at the results and code I added below.   If you have a different way that works better, I’d love to hear from you.  I’d love to see what this looks like when it is multi-threaded.

Results with 10 elements:

Random numbers generated = 10
sorted numbers duration = 558311
loopCounter = 10
unsorted numbers duration = 11733
loopCounter = 45
unsorted numbers in HashSet duration = 39600
loopCounter = 10
sorted strings duration = 494267
loopCounter = 10
unsorted strings duration = 26400
loopCounter = 45
unsorted strings in HashSet duration = 14178
loopCounter = 10

Results with 100 elements:

Random numbers generated = 100
sorted numbers duration = 651689
loopCounter = 100
unsorted numbers duration = 387688
loopCounter = 4950
unsorted numbers in HashSet duration = 379867
loopCounter = 100
sorted strings duration = 997333
loopCounter = 100
unsorted strings duration = 871689
loopCounter = 4950
unsorted strings in HashSet duration = 276711
loopCounter = 100

Results with 1,000 elements:

Random numbers generated = 1000
sorted numbers duration = 1973155
loopCounter = 1000
unsorted numbers duration = 23475470
loopCounter = 499500
unsorted numbers in HashSet duration = 3872978
loopCounter = 1000
sorted strings duration = 7550401
loopCounter = 1000
unsorted strings duration = 32994137
loopCounter = 499500
unsorted strings in HashSet duration = 954311
loopCounter = 1000

Results with 10,000 elements:

Random numbers generated = 10000
sorted numbers duration = 15986179
loopCounter = 10000
unsorted numbers duration = 106388102
loopCounter = 49995000
unsorted numbers in HashSet duration = 21584447
loopCounter = 10000
sorted strings duration = 28177604
loopCounter = 10000
unsorted strings duration = 1442832050
loopCounter = 49995000
unsorted strings in HashSet duration = 2630711
loopCounter = 10000

Results with 100,000 elements:

Random numbers generated = 100000
sorted numbers duration = 104342102
loopCounter = 100000
unsorted numbers duration = 6736310411
loopCounter = 704982704
unsorted numbers in HashSet duration = 59437652
loopCounter = 100000
sorted strings duration = 267270212
loopCounter = 100000
unsorted strings duration = 167903382832
loopCounter = 704982704
unsorted strings in HashSet duration = 28475825
loopCounter = 100000

Here is the code:

/**
* test the duration of different ways of finding duplicates in string and int arrays.
*/
public void duplicates(int randomNumberCount){
	long startTime = 0;
	long endTime = 0;
	int loopCounter = 0;

	ArrayList duplicates = new ArrayList();

	Random random = new Random(System.currentTimeMillis());
	int[] randomNumbers = new int[randomNumberCount];
	for(int count = 0; count &lt; randomNumberCount; count++){
		randomNumbers[count] = random.nextInt(randomNumberCount);
	}
	System.out.println("Random numbers generated = " + randomNumberCount);

	/********************************************************
	 *  Find a duplicate int
	 *  sort the array, then check for sequential duplicates.
	 ********************************************************/
	int[] sortedNumbers = Arrays.copyOf(randomNumbers, randomNumberCount);
	startTime = System.nanoTime();
	Arrays.sort(sortedNumbers);
	for (int i = 0; i &lt; sortedNumbers.length; i++) {
		int testVal = sortedNumbers[i];
		int tempPos = i + 1;
		boolean foundDuplicate = false;
		loopCounter++;
		while(tempPos &lt; sortedNumbers.length &amp;&amp; testVal == sortedNumbers[tempPos]){
			foundDuplicate = true;
			duplicates.add("" + testVal);
			tempPos ++;
			loopCounter++;
		}
		if(foundDuplicate){
			i = tempPos-1;
		}
	}
	endTime = System.nanoTime();
	System.out.println("sorted numbers duration = " + (endTime - startTime));
	System.out.println("loopCounter = " + loopCounter);
	loopCounter = 0;
	//System.out.println(duplicates.toString());
	duplicates.clear();

	/********************************************************
	 *  Find a duplicate int
	 *  Use nested for loops to find duplicates.
	 ********************************************************/

	int[] unSortedNumbers = Arrays.copyOf(randomNumbers, randomNumberCount);
	startTime = System.nanoTime();
	for (int i = 0; i &lt; unSortedNumbers.length; i++) {
		boolean foundDuplicate = false;
		for (int j = i+1; j &lt; unSortedNumbers.length; j++) {
			if(i != j &amp;&amp; unSortedNumbers[i] == unSortedNumbers[j]){
				foundDuplicate = true;
			}
			loopCounter++;
		}
		if(foundDuplicate){
			duplicates.add("" + unSortedNumbers[i]);
		}
	}
	endTime = System.nanoTime();
	System.out.println("unsorted numbers duration = " + (endTime - startTime));
	System.out.println("loopCounter = " + loopCounter);
	loopCounter = 0;
	//System.out.println(duplicates.toString());
	duplicates.clear();

	/*********************************************************************
	 *  Find a duplicate int
	 *  Make each int a string, and add it to a HashSet to find duplicates
	 *********************************************************************/
	//use a hashset to find duplicate ints.
	HashSet numbersHashSet = new HashSet();
	startTime = System.nanoTime();
	for (int number : unSortedNumbers) {
		if(!numbersHashSet.add("" + number)){
			duplicates.add("" + number);
		}
		loopCounter++;
	}
	endTime = System.nanoTime();
	System.out.println("unsorted numbers in HashSet duration = " + (endTime - startTime));
	System.out.println("loopCounter = " + loopCounter);
	loopCounter = 0;
	//System.out.println(duplicates.toString());
	duplicates.clear();

	/**************************************************
	 *  Make random strings out of numbers
	 **************************************************/
	String[] randomStrings = new String[randomNumberCount];
	for(int i = 0; i &lt; randomNumberCount; i++){
		randomStrings[i] = "" + random.nextInt(randomNumberCount);
	}
	String[] sortedWords = Arrays.copyOf(randomStrings, randomNumberCount);
	/********************************************************
	 *  Find a duplicate strings
	 *  sort the array, then check for sequential duplicates
	 ********************************************************/
	startTime = System.nanoTime();
	Arrays.sort(sortedWords);
	for (int i = 0; i &lt; sortedWords.length; i++) {
		String testVal = sortedWords[i];
		int tempPos = i + 1;
		boolean foundDuplicate = false;
		loopCounter++;
		while(tempPos &lt; sortedWords.length &amp;&amp; testVal.equals(sortedWords[tempPos])){
			foundDuplicate = true;
			tempPos ++;
			loopCounter++;
		}
		if(foundDuplicate){
			duplicates.add(testVal);
			i = tempPos-1;
		}
	}
	endTime = System.nanoTime();
	System.out.println("sorted strings duration = " + (endTime - startTime));
	System.out.println("loopCounter = " + loopCounter);
	loopCounter = 0;
	//System.out.println(duplicates.toString());
	duplicates.clear();
	/********************************************************
	 *  Find a duplicate int
	 *  Use nested for loops to search for duplicate strings.
	 ********************************************************/
	String[] unSortedWords = Arrays.copyOf(randomStrings, randomNumberCount);
	startTime = System.nanoTime();
	for (int i = 0; i &lt; unSortedWords.length; i++) {
		boolean foundDuplicate = false;
		for (int j = i+1; j &lt; unSortedWords.length; j++) {
			if(i != j &amp;&amp; unSortedWords[i].equals(unSortedWords[j])){
				foundDuplicate = true;
			}
			loopCounter++;
		}
		if(foundDuplicate){
			duplicates.add("" + unSortedWords[i]);
		}
	}
	endTime = System.nanoTime();
	System.out.println("unsorted strings duration = " + (endTime - startTime));
	System.out.println("loopCounter = " + loopCounter);
	loopCounter = 0;
	//System.out.println(duplicates.toString());
	duplicates.clear();

	/********************************************************
	 *  Find a duplicate strings
	 *  add each string to a HashSet to check for duplicates.
	 ********************************************************/
	HashSet stringHashSet = new HashSet();
	startTime = System.nanoTime();
	for (String string : unSortedWords) {
		if(!stringHashSet.add(string)){
			duplicates.add(string);
		}
		loopCounter++;
	}
	endTime = System.nanoTime();
	System.out.println("unsorted strings in HashSet duration = " + (endTime - startTime));
	System.out.println("loopCounter = " + loopCounter);
	loopCounter = 0;
	//System.out.println(duplicates.toString());
	duplicates.clear();

}

Leave a Reply