It has been over a year since this article was last updated.
Recently I’ve been working on a project involving IO streams; here are some notes.
How does a stream get transferred?
Basically, streams can be copied via a method called transferTo, which allocates an 8k memory chunk where bytes accumulate and are written out in one go.
```java
// transferTo, IOUtils.copyLarge, and BufferedInputStream are roughly the same
```
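For reference, here is a minimal sketch of what such a copy loop looks like. It mirrors the general shape of `InputStream.transferTo` (Java 9+), not its exact OpenJDK source:

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class CopyLoop {
    private static final int DEFAULT_BUFFER_SIZE = 8192; // the 8k chunk

    // Read into a fixed 8k chunk and write it out, repeating until EOF.
    public static long copy(InputStream in, OutputStream out) throws IOException {
        long transferred = 0;
        byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
        int read;
        while ((read = in.read(buffer, 0, DEFAULT_BUFFER_SIZE)) >= 0) {
            out.write(buffer, 0, read);
            transferred += read;
        }
        return transferred;
    }
}
```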
In most cases, a single 8k chunk is enough when handling a continuous data stream, such as calling an API with JSON, downloading a blob, or forwarding a socket. For example, JFrog Artifactory, a production-proven binary storage solution, uses FileUtils.copyInputStreamToFile internally.
Should I use a buffer for performance?
We might add a redundant buffer via BufferedInputStream for “performance” in transmission. But a misused buffer often degrades performance instead. Here is an example.
```java
InputStream fis = Files.newInputStream(Paths.get("/512MB.dmg"));
InputStream in = new BufferedInputStream(fis, 8192 + 1); // misaligned buffer size
```
The unaligned chunk hits the fseek and fstat system calls on every loop iteration, causing a performance loss in transmission. Even if the buffer size is fixed to a multiple of 8k, the inefficiency cannot be eliminated, due to the overhead of the memcpy call in each loop.
Next, let’s profile some popular open source frameworks with the following code.
```java
// only the Okio example is shown here
```
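The original profiling snippets are elided here. As a stdlib-only stand-in, below is a rough sketch of how such a comparison might be timed; the file path, the `8192 + 1` wrapper, and the harness structure are illustrative assumptions, not the author’s exact benchmark:

```java
import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class CopyBenchmark {
    // Copy `in` to `out` in 8k chunks and return the elapsed nanoseconds.
    public static long timeCopy(InputStream in, OutputStream out) throws IOException {
        long start = System.nanoTime();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) >= 0) {
            out.write(chunk, 0, read);
        }
        return System.nanoTime() - start;
    }

    // Compare a bare stream against a BufferedInputStream-wrapped one.
    public static void main(String[] args) throws IOException {
        Path file = Path.of("/512MB.dmg"); // hypothetical test file
        try (InputStream bare = Files.newInputStream(file);
             OutputStream sink = OutputStream.nullOutputStream()) {
            System.out.println("no buffer:   " + timeCopy(bare, sink) + " ns");
        }
        try (InputStream wrapped = new BufferedInputStream(Files.newInputStream(file), 8192 + 1);
             OutputStream sink = OutputStream.nullOutputStream()) {
            System.out.println("8k+1 buffer: " + timeCopy(wrapped, sink) + " ns");
        }
    }
}
```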
Here are the results. The percentages below indicate how much of the execution time ChannelInputStream takes.
- No buffer, 96%
- BufferedInputStream, 8k buffer, 94.6% (actually no cache is used)
- BufferedInputStream, 8k + 1 buffer, 59.38% (wastes on fseek, fstat, memcpy)
- BufferedInputStream, 16k and more buffer, 85.86% (wastes on memcpy)
- Okio, 8k buffer, 84% (wastes on segment maintenance)
As seen above, no buffer is the best for simple and continuous transmission.
It is important to profile your use case before introducing a framework.
Searching StackOverflow, the only use case for BufferedInputStream is writing codecs and parsers that read byte by byte. But anyone with the skill to write a parser can also build a more sophisticated buffering mechanism than a simple array.
Here are some real use cases in famous projects:
- OkHttp, a popular HTTP client written in Java, uses Okio internally.
- Grasscutter, a game server, uses KcpChannel and io.netty.buffer internally.
- opentelemetry-java, a Java metrics client, uses Protobuf and manages flushing and buffering by itself.
- Apache MINA, a Java SSH implementation, implements pointers and arrays by itself.
Appendix
Timeout
Timeout can be implemented by:
- checking throwIfReached in each while loop (this is what Okio does)
- Guava’s SimpleTimeLimiter, which is backed by Future.get()
- a synchronized code fragment with Object.wait()
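The Future.get() approach can be sketched with plain java.util.concurrent; the helper name and timeout value here are illustrative:

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CopyWithTimeout {
    // Run `task` on a worker thread and give up after `timeoutMillis`,
    // interrupting the worker so a blocking (interruptible) read can abort.
    public static <T> T runWithTimeout(Callable<T> task, long timeoutMillis) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        Future<T> future = pool.submit(task);
        try {
            return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the copying thread
            throw e;
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that interruption only helps if the underlying read is interruptible; a plain blocking socket read may also need a socket-level timeout.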
Zero-Copy
- If you want to reduce memcpy and context-switch overhead, low-level techniques such as Java NIO or mmap will be required.
- If you need zero-copy at the CPU level, RDMA (e.g. RoCE Ethernet cards) will be required for network offload.
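The Java NIO route can be sketched with FileChannel.transferTo, which on Linux may be served by sendfile(2) so the bytes never pass through user space. A minimal sketch with hypothetical file paths:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopy {
    // Copy src to dst entirely inside the kernel where the OS supports it.
    public static long copyFile(Path src, Path dst) throws IOException {
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst,
                     StandardOpenOption.CREATE, StandardOpenOption.WRITE,
                     StandardOpenOption.TRUNCATE_EXISTING)) {
            long size = in.size();
            long position = 0;
            // transferTo may transfer fewer bytes than requested, so loop.
            while (position < size) {
                position += in.transferTo(position, size - position, out);
            }
            return position;
        }
    }
}
```

Usage would be e.g. `ZeroCopy.copyFile(Path.of("/512MB.dmg"), Path.of("/tmp/copy.dmg"))`.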