The problem

When dealing with large files, for example files that don’t fit on RAM, splitting the file in chunks is a must.

Some data formats like NDJSON can be trivially split, as all \n bytes are line separators. However, CSV can have newlines within quotes, which don’t indicate a new row, but a newline in the row’s field.

There are two kinds of solutions to the problem of finding a splitting point in a CSV file:

  • Read it. By simply reading the file, we can distinguish between \n bytes that are within quotes and the ones that aren’t, and that are points in which we can split the CSV. With encodings that preserve the ASCII code, like UTF-8, for the special characters we’ll be able to avoid extra state and code.
  • Statistical solutions. By going directly to the point in which we want to split the file, and by reading some bytes around it, we can guess a good splitting point. This solution is naturally faster, as you need to process less bytes, but it can return wrong splitting points too.

We decided to go with the first approach to favor correctness over speed. However, speed is very important to us, how much faster can we make it?

Speed wins

Our first implementation was a python-based parser. We tested its correct behavior, but we certainly saw that performance with this approach was a no-go.

Hello, C

Python performance can be very limitting, but the good thing about Python is that C code can be integrated easily on simple problems like this one.

We ported the Python implementation to C and called it using CFFI. This single optimization, without any other improvement, increased performance by two orders of magnitude, reaching now the 1GB/s barrier.

Simple is faster

We were quite happy about the 1GB/s initial implementation, but we knew we could do better. We started optimizing the algorithm. The first version was based on a single complex loop. The optimized version was simpler, and had a nested loop to deal with the quoted fields.

Initial implementation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
while (pos < buffer_size) {

    if (buffer_text_encoded[pos] == escapechar) {
        next_special_char_found = escapechar;
    } else if (buffer_text_encoded[pos] == new_line) {
        next_special_char_found = new_line;
    } else if (buffer_text_encoded[pos] == quotechar) {
        next_special_char_found = quotechar;
    } else {
        next_special_char_found = 0;
    }

    if (parsing_state == estate_normal && next_special_char_found == 0) {
        pos = pos + 1;
        continue;
    }
    if (next_special_char_found == escapechar || next_special_char_found == quotechar) {
        if (parsing_state != estate_possible_escape_found) {
            state_previous_to_escape_char_found = parsing_state;
            parsing_state = estate_possible_escape_found;
            pos = pos + 1;
            continue;
        } else {
            parsing_state = state_previous_to_escape_char_found;
            pos = pos + 1;
            continue;
        }
    }
    if (parsing_state == estate_possible_escape_found) {
        if (state_previous_to_escape_char_found == estate_normal) {
            parsing_state = estate_quoted_text;
        } else {
            parsing_state = estate_normal;
        }
    }
    if (parsing_state == estate_normal && next_special_char_found == new_line) {
        last_unescaped_new_line_found = pos;
        pos = pos + 1;
        continue;
    }

    pos = pos + 1;
}

Optimized implementation.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
for (long int i = 0; i < buffer_size; i++){
    if (b[i] == escapechar){
        i++;
        continue;
    }

    if (b[i] == quotechar){
        do{
            i++;
        } while (((b[i] != quotechar) || (b[i] == quotechar && b[i-1] == escapechar)) && i < buffer_size);
        continue;
    }

    if (b[i] == '\n') {
        last_newline = i;
    }
}

SIMD

This implementation was quite fast, but we still thought we could go a step further. This optimized version seemed very difficult to improve, as it just performed three quick comparisons per byte on the happy path.

Improving the time it took to process each byte seemed almost impossible, and actually, we didn’t do that. Instead, we processesed multiple bytes per iteration.

All modern x86 CPUs include SIMD instructions like SSE and AVX. These instructions allow the processor to work on multiple registers in parallel, improving throughput.

This is how our last version looks like:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
__m128i sse_newline = _mm_set1_epi8('\n');
__m128i sse_quotechar = _mm_set1_epi8(quotechar);
__m128i sse_escapechar = _mm_set1_epi8(escapechar);
int max_simd = buffer_size - 16; // avoid reading past the buffer end due to SIMD
for (; i < buffer_size; i++){
    for (; i < max_simd; i += 16){
        // Load 16 bytes of the CSV in the SSE registers
        __m128i *sse_p = (__m128i *)&buffer[i];
        __m128i sse_a = _mm_loadu_si128(sse_p);
        // compare the 16 bytes to quotechar, and compact the result in the first 16 bits of mask_quoted
        int mask_quoted = _mm_movemask_epi8(_mm_cmpeq_epi8(sse_quotechar, sse_a));
        // compare against the newline
        int mask_newline = _mm_movemask_epi8(_mm_cmpeq_epi8(sse_newline, sse_a));
        int mask = mask_quoted | mask_newline;
        if (escapechar){
            // compare against the escapechar
            mask |= _mm_movemask_epi8(_mm_cmpeq_epi8(sse_escapechar, sse_a));
        }
        if (mask != 0){
            // there is at least one special character in the 16 bytes
            if (mask == mask_quoted){
                // the only special characters are quotes
                // we simply can count the number of quotes
                // the quoting state will change if this count is an odd number
                int quotes = __builtin_popcount(mask_quoted);
                state = (state + quotes) % 2;
            } else if (mask == mask_newline) {
                // the only special characters are newlines
                if (state == NORMAL) {
                    // and we were outside quotes
                    // the last bit set to '1' in the mask indicates where the last line was found
                    last_new_line_found = i + (31 - __builtin_clz(mask));
                }
            } else {
                // there is a complex combination of special characters
                // however, we can advance `i` by the number of initial bits on the mask that are set
                // to '0'
                i += __builtin_ctz(mask);
                break;
            }
        }
    }
    if (buffer[i] == quotechar){
        state = !state;
    } else if (state == NORMAL && buffer[i] == '\n') {
        last_new_line_found = i;
    } else if (buffer[i] == escapechar) {
        i++;
    }
}

Used compiler intrinsics:

  • __m128i _mm_set1_epi8(unsigned char c). Returns a 16-byte SSE group of registers with all bytes filled with c.
  • __m128i _mm_loadu_si128(__m128i const* mem_addr). Returns a 16-byte SSE group of registers with bytes copied from mem_addr.
  • __m128i _mm_cmpeq_epi8 (__m128i a, __m128i b). Returns a 16-byte SSE group of registers, comparing for equality the bytes of a and b. Each returned byte is set to 0xFF if the corresponding bytes of a and b are equal, 0 otherwise.
  • int _mm_movemask_epi8 (__m128i a). Compacts the 16 bytes of a by taking the most significand bit of each byte in a, returning a 16-bit sized mask.
  • int __builtin_popcount(unsigned int x). Counts the number of bits in x set to 1.
  • int __builtin_clz(unsigned int x). Returns the number of leading 0-bits in x, starting at the most significant bit position. If x is 0, the result is undefined.
  • int __builtin_ctz (unsigned int x). Returns the number of trailing 0-bits in x, starting at the least significant bit position. If x is 0, the result is undefined.

See also Intel’s and gcc’s docs.

This is certainly a more complex version, but thanks to SSE, it works at 3GB/s!

Want to help us solve these problems? Join us!