Commit 3cdc72a

Speed up inference ~4x for 7B model
Problem:
- inference for the 7B model is slow.
Solution:
- unroll the inner loop in matmul to perform 4 operations in parallel with SIMD.
Result (with float16):
- before: 16 tok/s
- after: 71 tok/s
1 parent f565089 commit 3cdc72a

File tree

1 file changed

+6
-2
lines changed

run.c

@@ -193,8 +193,12 @@ void matmul(float* xout, float* x, float* w, int n, int d) {
     #pragma omp parallel for
     for (int i = 0; i < d; i++) {
         float val = 0.0f;
-        for (int j = 0; j < n; j++) {
-            val += w[i * n + j] * x[j];
+        const int i_n = i * n;
+        for (int j = 0; j < n; j+=4) {
+            val += w[i_n + j] * x[j];
+            val += w[i_n + j + 1] * x[j + 1];
+            val += w[i_n + j + 2] * x[j + 2];
+            val += w[i_n + j + 3] * x[j + 3];
         }
         xout[i] = val;
     }
