Profiling IO performance with different buffer sizes
2022-12-12 / modified at 2023-08-08 / 577 words / 3 mins

Recently I’ve been working on a project involving IO streams; here are some notes.

How does a stream get transferred?

Basically, a stream can be copied via a method called transferTo, which allocates an 8 KB memory chunk where bytes accumulate and are written out at once.

// transferTo, IOUtils.copyLarge, and BufferedInputStream behave roughly the same way
public long transferTo(OutputStream out) throws IOException {
    Objects.requireNonNull(out, "out");
    long transferred = 0;
    // an oversized buffer might hurt CPU cache locality and waste memory
    byte[] buffer = new byte[DEFAULT_BUFFER_SIZE];
    int read;
    // each iteration triggers memcpy twice, between kernel space and user space
    while ((read = this.read(buffer, 0, DEFAULT_BUFFER_SIZE)) >= 0) {
        out.write(buffer, 0, read);
        transferred += read;
    }
    return transferred;
}
// flush() is required on the OutputStream after the transmission

In most cases, a chunk as small as 8 KB is enough for handling a continuous data stream, including calling an API with JSON, downloading a blob, and forwarding a socket. For example, JFrog Artifactory, a production-proven binary storage solution, uses FileUtils.copyInputStreamToFile internally.
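To make the flush requirement above concrete, here is a minimal sketch of a plain transferTo copy. It assumes JDK 9+ (where InputStream.transferTo was introduced) and uses in-memory streams so it runs without any file:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class TransferDemo {
    public static void main(String[] args) throws IOException {
        byte[] payload = new byte[32 * 1024]; // stands in for a blob or JSON body
        InputStream in = new ByteArrayInputStream(payload);
        OutputStream out = new ByteArrayOutputStream();
        long transferred = in.transferTo(out); // JDK 9+, the 8 KB loop shown above
        out.flush(); // flush the sink once the copy completes
        System.out.println(transferred); // 32768
    }
}
```

For a real network sink the flush() matters; for ByteArrayOutputStream it is a no-op, which is why in-memory streams are safe for this demo.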

Should I use a buffer for performance?

We might add an extra buffer via BufferedInputStream for “performance” during transmission. But a misused buffer often leads to worse performance. Here is an example.

InputStream fis = Files.newInputStream(Paths.get("/512MB.dmg"));
BufferedInputStream bis = new BufferedInputStream(fis, 8192 + 1);
bis.transferTo(OutputStream.nullOutputStream());

The unaligned chunk hits the fseek and fstat system calls in every loop iteration, causing a performance loss in transmission. Even if the buffer size is fixed to a multiple of 8 KB, the inefficiency can’t be eliminated due to the overhead of the memcpy call inside the loop.
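The effect is easy to observe with a rough micro-benchmark. This sketch assumes JDK 11+ (for OutputStream.nullOutputStream) and uses a 64 MB scratch file in place of the 512 MB image above; absolute timings will vary by machine, so only the relative ordering is meaningful:

```java
import java.io.BufferedInputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class BufferSizeBench {
    // copies the file into a black hole through a BufferedInputStream of the given size
    static long copyMillis(Path file, int bufSize) throws Exception {
        long start = System.nanoTime();
        try (InputStream fis = Files.newInputStream(file);
             BufferedInputStream bis = new BufferedInputStream(fis, bufSize)) {
            bis.transferTo(OutputStream.nullOutputStream());
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) throws Exception {
        Path file = Files.createTempFile("bench", ".bin");
        Files.write(file, new byte[64 * 1024 * 1024]); // scratch data
        System.out.println("8k:   " + copyMillis(file, 8192) + " ms");
        System.out.println("8k+1: " + copyMillis(file, 8193) + " ms"); // unaligned
        System.out.println("16k:  " + copyMillis(file, 16384) + " ms");
        Files.delete(file);
    }
}
```

Running it a few times (after a warm-up pass) should show the 8193-byte buffer consistently lagging the aligned sizes.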

Next, let’s profile some popular open source frameworks with the following code.

// only the Okio example is shown here
InputStream fis = Files.newInputStream(Paths.get("/512MB.dmg"));
BufferedSource source = Okio.buffer(Okio.source(fis));
BufferedSink sink = Okio.buffer(Okio.blackhole());
sink.writeAll(source);

Here are the results. The percentage in each item indicates how much of the execution time ChannelInputStream takes.

  • No buffer, 96%
  • BufferedInputStream, 8 KB buffer, 94.6% (actually no cache is used)
  • BufferedInputStream, 8 KB + 1 buffer, 59.38% (wasted on fseek, fstat, memcpy)
  • BufferedInputStream, 16 KB or larger buffer, 85.86% (wasted on memcpy)
  • Okio, 8 KB buffer, 84% (wasted on segment maintenance)

As seen above, using no extra buffer is best for simple, continuous transmission.

It is important to profile your use cases before introducing a framework.

Searching StackOverflow, the only real use case for BufferedInputStream is writing codecs and parsers that read byte by byte. But anyone skilled enough to write a parser can, I believe, also build a more sophisticated buffering mechanism than a simple array.
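The byte-by-byte case is the one place the buffer earns its keep: each read() becomes a cheap array access instead of a syscall. A minimal sketch of such a parser, reading a 4-byte big-endian magic number (the class-file-style 0xCAFEBABE value here is just illustrative data):

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

public class MagicParser {
    // reads a 4-byte big-endian magic number one byte at a time;
    // the BufferedInputStream makes each single-byte read() hit memory, not the OS
    static int readMagic(InputStream raw) throws IOException {
        BufferedInputStream in = new BufferedInputStream(raw, 8192);
        int magic = 0;
        for (int i = 0; i < 4; i++) {
            int b = in.read();
            if (b < 0) throw new IOException("truncated header");
            magic = (magic << 8) | b;
        }
        return magic;
    }

    public static void main(String[] args) throws IOException {
        byte[] data = {(byte) 0xCA, (byte) 0xFE, (byte) 0xBA, (byte) 0xBE, 0x00};
        System.out.println(Integer.toHexString(readMagic(new ByteArrayInputStream(data))));
        // cafebabe
    }
}
```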

Here are some real use cases in well-known projects:

  • OkHttp, a popular Java HTTP client, uses Okio internally.
  • Grasscutter, a game server, uses KcpChannel and io.netty.buffer internally.
  • opentelemetry-java, a Java metrics client, uses Protobuf and manages flushing and buffering by itself.
  • Apache MINA, a Java SSH implementation, implements pointers and arrays by itself.

Appendix

Timeout

Timeouts can be implemented by

  • checking throwIfReached inside each while loop in Okio
  • Guava’s SimpleTimeLimiter, or Future.get() with a deadline
  • a synchronized code fragment with object.wait()
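The Future.get() variant from the list above can be sketched as follows; the sleeping task is a stand-in for a long-running stream copy, and the 100 ms deadline is arbitrary:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class CopyTimeout {
    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newSingleThreadExecutor();
        // the task stands in for a long-running transferTo copy
        Future<Long> copy = pool.submit(() -> {
            Thread.sleep(5_000); // simulated slow transfer
            return 0L;
        });
        try {
            copy.get(100, TimeUnit.MILLISECONDS); // bound the wait
        } catch (TimeoutException e) {
            copy.cancel(true); // interrupt the copying thread
            System.out.println("copy timed out");
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Note that cancel(true) only interrupts the worker thread; a copy loop must itself be interruptible (blocking reads on some streams are not) for the cancellation to actually stop the IO.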

Zero-Copy

  • If you want to reduce memcpy and context-switch overhead, low-level techniques such as Java NIO or mmap will be required.
  • If you need zero-copy at the CPU level, RDMA-capable network cards such as RoCE Ethernet adapters will be required to offload transfers.
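On the NIO side, FileChannel.transferTo is the standard way to let the kernel move bytes directly (via sendfile(2) on Linux) without a user-space buffer. A minimal sketch, using temp files so it runs anywhere:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class ZeroCopy {
    public static void main(String[] args) throws IOException {
        Path src = Files.createTempFile("src", ".bin");
        Path dst = Files.createTempFile("dst", ".bin");
        Files.write(src, new byte[1024 * 1024]); // 1 MB of scratch data
        try (FileChannel in = FileChannel.open(src, StandardOpenOption.READ);
             FileChannel out = FileChannel.open(dst, StandardOpenOption.WRITE)) {
            long pos = 0, size = in.size();
            // transferTo may copy fewer bytes than requested, so loop until done;
            // on Linux this can map to sendfile(2): no byte[] in user space at all
            while (pos < size) {
                pos += in.transferTo(pos, size - pos, out);
            }
        }
        System.out.println(Files.size(dst)); // 1048576
        Files.delete(src);
        Files.delete(dst);
    }
}
```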