InfoQ Homepage Articles How to Speed up Large Collections Processing in Java

How to Speed up Large Collections Processing in Java

Lire ce contenu en français

Aug 26, 2022 13 min read

InfoQ Article Contest

Share your knowledge Win a ticket to a QCon event
or an InfoQ Dev SummitFind out more

Key Takeaways

High performance is crucial for the adoption of applications and can be negatively impacted by processing large or disparate data sets.
Java developers must know how to use built-in features like collections to optimize data processing and performance.
Java collections and java streams are two fundamental tools for improving application performance.
Developers should consider how various parallel stream processing approaches can affect app performance.
Applying the right parallel stream processing strategy to collections can be the difference between increased adoption and loss of customers.

Programming today involves working with large data sets, often including many different types of data. Manipulating these data sets can be a complex and frustrating task. To ease the programmer’s job, Java introduced the Java Collections Framework collections in 1998.

This article discusses the purpose behind the Java Collections Framework, how Java collections work, and how developers and programmers can use Java collections to their best advantage.

What is a Java collection?

Although it has passed the venerable age of 25, Java remains one of the most popular programming languages today. Over 1,000,000 websites use Java in some form, and more than a third of software developers have Java in their toolbox.

Throughout its life, Java has undergone substantial evolution. One early advancement came in 1998 when Java introduced the Collection Framework (JCF), which simplified working with Java objects. The JCF provided a standardized interface and common methods for collections, reduced programming effort, and increased the speed of Java programs.

Understanding the distinction between Java collections and the Java Collections Framework is essential. Java collections are simply data structures representing a group of Java objects. Developers can work with collections in much the same way they work with other data types, performing common tasks such as searches or manipulating the collection's contents.

An example of a collection in Java is the Set Collection interface (java.util.Set). A Set is a collection that does not allow for duplicate elements and does not store elements in any particular order. The Set interface inherits its methods from Collection (java.util.Collection) and contains only those methods.

In addition to sets, there are queues (java.util.Queue) and maps (java.util.Map). Maps are not collections in the truest sense as they don’t extend collection interfaces, but developers can manipulate Maps as if they are collections. Sets, Queues, Lists, and Maps each have descendants, such as sorted sets (java.util.SortedSet) and navigable maps (java.util.NavigableMap).

In working with collections, developers need to be familiar with and understand some specific collections-related terminology:

Modifiable vs. unmodifiable: As these terms suggest on their face, different collections may or may not support modification operations.
Mutable vs. immutable: Immutable collections cannot be modified after creation. While there are situations where unmodifiable collections may still change due to access by other code, immutable collections prevent such changes. Collections that can guarantee that no changes are visible with the Collection objects are immutable, whereas unmodifiable collections are collections that do not allow modification operations such as ‘add’ or ‘clear.’
Fixed-size vs. variable size: These terms refer only to the size of the collection and make no indication as to whether the collection is modifiable or mutable.
Random access vs. sequential access: If a collection allows for the indexing of individual elements, it is random access. In sequential access collections, you must progress through all prior elements to reach a given element. Sequential access collections can be easier to extend but take more time to search.

Beginning programmers may find it difficult to grasp the difference between unmodifiable and immutable collections. Unmodifiable collections are not necessarily immutable. Indeed, unmodifiable collections are often wrappers around a modifiable collection that other code can still access and modify. Other code may actually be able to modify the underlying collection. It will take some time working with collections to gain a degree of comfort with unmodifiable and immutable collections.

As an example, consider creating a modifiable list of the top five cryptocurrencies by market capitalization. You can create an unmodifiable version of the underlying modifiable list using the java.util.Collections.unmodifiableList() method. You can still modify the underlying list, which will appear in the unmodifiable list. But you cannot directly modify the unmodifiable version.

import java.util.*;
public class UnmodifiableCryptoListExample {  
    public static void main(String[] args) {  

        List<String> cryptoList = new ArrayList<>();  
        Collections.addAll(cryptoList, "BTC", "ETH", "USDT", "USDC", "BNB");  
        List<String> unmodifiableCryptoList = Collections.unmodifiableList(cryptoList);  
        System.out.println("Unmodifiable crypto List: " + unmodifiableCryptoList);  

        // try to add one more cryptocurrency to modifiable list and show in unmodifiable list
        cryptoList.add("BUSD");
        System.out.println("New unmodifiable crypto List with new element:" + unmodifiableCryptoList);

        // try to add one more cryptocurrency to unmodifiable list and show in unmodifiable list - unmodifiableCryptoList.add would throw an uncaught exception and the println would not run.
        unmodifiableCryptoList.add("XRP");
        System.out.println("New unmodifiable crypto List with new element:" + unmodifiableCryptoList);

        }  
}

On execution, you will see that an addition to the underlying modifiable list shows up as a modification of the unmodifiable list.

Note the difference, however, if you create an immutable list and then attempt to change the underlying list. There are many ways to create immutable lists from existing modifiable lists, and below, we use the List.copyOf() method.

import java.util.*;
public class UnmodifiableCryptoListExample {  
    public static void main(String[] args) {  

        List<String> cryptoList = new ArrayList<>();  
        Collections.addAll(cryptoList, "BTC", "ETH", "USDT", "USDC", "BNB");
        List immutableCryptoList = List.copyOf(cryptoList);
        System.out.println("Underlying crypto list:" + cryptoList)
        System.out.println("Immutable crypto ist: " + immutableCryptoList);  

        // try to add one more cryptocurrency to modifiable list and show immutable does not display change
        cryptoList.add("BUSD");
        System.out.println("New underlying list:" + cryptoList);
        System.out.println("New immutable crypto List:" + immutableCryptoList);

        // try to add one more cryptocurrency to unmodifiable list and show in unmodifiable list -
        immutableCryptoList.add("XRP");
        System.out.println("New unmodifiable crypto List with new element:" + immutableCryptoList);

        }  
}

After modifying the underlying list, the immutable list does not display the change. And trying to modify the immutable list directly results in an UnsupportedOperationException:

How do collections relate to the Java Collections Framework?

Prior to the introduction of the JCF, developers could group objects using several special classes, namely the array, the vector, and the hashtable classes. Unfortunately, these classes had significant limitations. In addition to lacking a common interface, they were difficult to extend.

The JCF provided an overarching common architecture for working with collections. The Collections Interface contains several different components, including:

Common interfaces: representations of the primary collection types, including sets, lists, and maps
Implementations: specific implementations of the collection interfaces, ranging from general-purpose to special-purpose to abstract; in addition, there are legacy implementations related to the older array, vector, and hashtable classes
Algorithms: static methods for manipulating collections
Infrastructure: underlying support for the various collections interfaces

The JCF offered developers many benefits compared to the prior object grouping methods. Notably, the JCF made Java programming more efficient by reducing the need for developers to write their own data structures.

But the JCF also fundamentally altered how developers worked with APIs. With a new common language for dealing with different APIs, the JCF made it simpler for developers to learn and design APIs and implement them. In addition, APIs became vastly more interoperable. An example is Eclipse Collections, an open source Java collections library fully compatible with different Java collections types.

Additional development efficiencies arose because the JCF provided structures that made it much easier to reuse code. As a result, development time decreased, and program quality increased.

The JCF has a defined hierarchy of interfaces. java.util.collection extends the superinterface Iterable. Within Collection there are numerous descendant interfaces and classes, as shown below:

As noted previously, Sets are unordered groups of unique objects. Lists, on the other hand, are ordered collections that may contain duplicates. While you can add elements at any point in a list, the remainder of the order is maintained.

Queues are collections where elements are added at one end and removed from the other end, i.e., it is a first-in, first-out (FIFO) interface. Deques (double-ended queues) allow for the addition or removal of elements from either end.

Methods for working with Java collections

Each interface in the JCF, including java.util.collection, has specific methods available for accessing and manipulating individual elements of the collection. Among the more common methods used in collections are:

size (): returns the number of elements in a collection
add (Collection element) / remove (Collection object): as suggested, these methods alter the contents of a collection; note that in the event a collection has duplicates, remove only affects a single instance of the element.
equals (Collection object): compares an object for equivalence with a collection
clear (): removes every element from a collection

Each subinterface may have additional methods as well. For example, although the Set interface includes only the methods from the Collection interface, the List interface has many additional methods based on accessing specific list elements, including:

get (int index): returns the list element from the specified index location
set (int index, element): sets the contents of the list element at the specified index location
remove (int,index): removes the element at the specified index location

Performance of Java collections

As the size of collections grows, they can develop noticeable performance issues. And it turns out that the proper selection of collection types and associated collection design can also substantially affect performance.

The ever-increasing amount of data available to developers and applications led Java to introduce new ways to process collections to increase overall performance. In Java 8, released in 2014, Java introduced Streams - new functionality whose purpose was to simplify and increase the speed of bulk object processing. Since their introduction, Streams have had numerous improvements.

It is essential to understand that streams are not themselves data structures. Instead, as Java explains it, streams are "Classes that support functional-style operations on streams of elements, such as map-reduced transformations on collections."

Streams use pipelines of methods to process data received from a data source such as a collection. Every stream method is either an intermediate method (methods that return new streams that can be further processed) or a terminal method (after which no additional stream processing is possible). Intermediate methods in the pipeline are lazy; that is, they are evaluated only when necessary.

Both parallel and sequential execution options exist for streams. Streams are sequential by default.

Applying parallel processing to improve performance

Processing large collections in Java can be cumbersome. While Streams simplified dealing with large collections and coding operations on large collections, it was not always a guarantee of improved performance; indeed, programmers frequently found that using Streams actually slowed processing.

As is well known with respect to websites, in particular, users will only allow a matter of seconds for loads before they move on out of frustration. So to provide the best possible customer experience and maintain the developer’s reputation for offering quality products, developers must consider how to optimize processing efforts for large data collections. And while parallel processing cannot guarantee improved speeds, it is a promising place to start.

Parallel processing, i.e., breaking the processing task into smaller chunks and running them simultaneously, offers one way to reduce the processing overhead when dealing with large collections. But even parallel stream processing can lead to decreased performance, even if it is simpler to code. In essence, the overhead associated with managing multiple threads can offset the benefits of running threads in parallel.

Because collections are not thread-safe, parallel processing can result in thread interference or memory inconsistency errors (when parallel threads do not see changes made in other threads and therefore have differing views of the same data). The Collections Framework attempts to prevent thread inconsistencies during parallel processing using synchronization wrappers. While the wrapper can make a collection thread-safe, allowing for more efficient parallel processing, it can have undesirable effects. Specifically, synchronization can cause thread contention, which can result in threads executing more slowly or ceasing execution.

Java has a native parallel processing function for collections: Collection.parallelstream. One significant difference between the default sequential stream processing and parallel processing is that the order of execution and output, which is always the same when processing sequentially, can vary from execution to execution when using parallel processing.

As a result, parallel processing is particularly effective in situations where processing order does not affect the final output. However, in situations where the state of one thread can affect the state of another, parallel processing can create problems.

Consider a simple example where we create a list of current accounts receivables for a list of 1000 customers. We want to determine how many of those customers have receivables in excess of $25,000. We can perform this check either sequentially or in parallel with differing processing speeds.

To set the example up for parallel processing, we will use the

import java.util.Random;
import java.util.ArrayList;
import java.util.List;

class Customer {

    static int customernumber;
    static int receivables;

    Customer(int customernumber, int receivables) {
        this.customernumber = customernumber;
        this.receivables = receivables;
    }

    public int getCustomernumber() {
        return customernumber;
    }

    public void setCustomernumber(int customernumber) {
        this.customernumber = customernumber;
    }

    public int getReceivables() {
        return receivables;
    }

    public void setReceivables() {
        this.receivables = receivables;
    }
}

public class ParallelStreamTest {

    public static void main( String args[] ) {

        Random receivable = new Random();

        int upperbound = 1000000;
   
            List < Customer > custlist = new ArrayList < Customer > ();

                for (int i = 0; i < upperbound; i++) {
    
                    int custnumber = i + 1;
                    int custreceivable = receivable.nextInt(upperbound);
                    custlist.add(new Customer(custnumber, custreceivable));
                 
 }
                
long t1 = System.currentTimeMillis();

                System.out.println("Sequential Stream count: " + custlist.stream().filter(c ->
c.getReceivables() > 25000).count());

                long t2 = System.currentTimeMillis();

                System.out.println("Sequential Stream Time taken:" + (t2 - t1));

               t1 = System.currentTimeMillis();

                System.out.println("Parallel Stream count: " + custlist.parallelStream().filter(c ->
c.getReceivables() > 25000).count());

                 t2 = System.currentTimeMillis();

                 System.out.println("Parallel Stream Time taken:" + (t2 - t1));

    }

}

Code execution demonstrates that parallel processing may lead to performance improvements when processing data collections:

Note, however, that each time you execute the code, you will obtain different results. In some instances, sequential processing will still outperform parallel processing.

In this example, we used Java’s native processes for splitting the data and assigning threads.

Unfortunately, Java’s native parallel processing efforts are not always faster in every situation than sequential processing, and indeed, they are frequently slower.

As one example, parallel processing is not useful when dealing with linked lists. Whereas data sources like ArrayLists are simple to split for parallel processing, the same is not true of LinkedLists. TreeMaps and HashSets lie somewhere in between.

One method for making decisions about whether to utilize parallel processing is Oracle’s NQ model. In the NQ model, N represents the number of data elements to be processed. Q, in turn, is the amount of computation required per data element. In the NQ model, you calculate the product of N and Q, with higher numbers indicating higher possibilities that parallel processing will lead to performance improvements.

When using the NQ model, there is an inverse relationship between N and Q. That is, the higher amount of computing required per element, the smaller the data set can be for parallel processing to have benefits. A rule of thumb is that for low computational requirements, a minimum data set of 10,000 is the baseline for using parallel processing.

Although beyond the scope of this article, there are more advanced methods for optimizing parallel processing in Java collections. For example, advanced developers can adjust the partitioning of data elements in the collection to maximize parallel processing performance. There are also third-party add-ons and replacements for the JCF that can improve performance. But beginners and intermediate developers, however, should focus on understanding which operations will benefit from Java’s native parallel processing features for data collections.

Conclusion

In a world of big data, finding ways to improve the processing of large data collections is a must to create high-performing web pages and applications. Java provides built-in collection processing features that help developers improve data processing, including the Collections Framework and native parallel processing functions. Developers need to become familiar with how to use these features and understand when the native features are acceptable and when they should shift to parallel processing.

About the Author

Nahla Davies

Show moreShow less

InfoQ Software Architects' Newsletter

Login with:

Don't have an InfoQ account?

How to Speed up Large Collections Processing in Java

InfoQ Article Contest

Key Takeaways

What is a Java collection?

Related Sponsored Content

How do collections relate to the Java Collections Framework?

Methods for working with Java collections

Performance of Java collections

Applying parallel processing to improve performance

Conclusion

About the Author

Nahla Davies

Rate this Article

This content is in the Java topic

Related Topics:

Related Editorial

Popular across InfoQ

The InfoQ Newsletter