#### NSFlow: An End-to-End FPGA Framework with Scalable Dataflow Architecture for Neuro-Symbolic Al

Hanchen Yang<sup>1\*</sup>, Zishen Wan<sup>1\*</sup>, Ritik Raj<sup>1</sup>, Joongun Park<sup>1</sup>, Ziwei Li<sup>1</sup>, Ananda Samajdar<sup>2</sup>, Arijit Raychowdhury<sup>1</sup>, Tushar Krishna<sup>1</sup> (\*Equal Contributions)

> <sup>1</sup>Georgia Tech, Atlanta, GA, USA <sup>2</sup>IBM Research, Yorktown Heights, NY, USA







# Background

#### Traditional NNs are not good at reasoning



(i) Remove all gray spheres. How many spheres are there? (3), (ii) Take away 3 cubes. How many objects are there? (7), (iii) How many blocks must be removed to get 1 block? (2)





#### Abstract Reasoning NN accuracy: 53%

Interactive Learning NN accuracy: 71%



Imagine that there are five people who are waiting in line to use a single-occupancy bathroom at a concert venue. Someone at the back of the line needs to throw up immediately. That person skips to the front of the line instead of waiting in the back.

At a summer camp, there is a pool. Right next to the pool is a tent where the kids at the camp have art class. The camp made a rule that there would be no cannonballing in the pool so that the art wouldn't get ruined by the splashing water. Today, there is a bee attacking this kid, and she needs to jump into the water quickly. This kid cannonballs into the pool

> **Ethical Decision Making** NN accuracy: 65%



IMO 2015 P3

"Let ABC be an acute triangle. Let (O) be its circumcircle, H its orthocenter, and F the foot of the altitude from A. Let M be the midpoint of BC. Let Q be the point on (O) such that  $QH \perp QA$  and let K be the point on (O) such that KH  $\perp$ KQ. Prove that the circumcircles (O<sub>1</sub>) and (O<sub>2</sub>) of triangles FKM and KQH are tangent to each other."



#### Automated Theorem Proving NN accuracy: 0%

Farmer John has N cows  $(2 \le N \le 10^5)$ . Each cow has a breed that is either Guernsey or Holstein. As is often the case, the cows are standing in a line, numbered  $1 \cdots N$  in this order.

Over the course of the day, each cow writes down a list of cows. Specifically, cow i's list contains the range of cows starting with herself (cow i) up to and including cow  $E_i$   $(i \leq E_i \leq N)$ .

FJ has recently discovered that each breed of cow has exactly one distinct leader. FJ does not know who the leaders are, but he knows that each leader must have a list that includes all the cows of their breed, or the other breed's leader (or both).

Help FJ count the number of pairs of cows that could be leaders. It is guaranteed that there is at least one possible pair. 🗩 Problem

**Competitive Programming** NN accuracy: 8.7%



# Background

#### Neural-Symbolic AI (NSAI)

• A compositional system to enhance cognitive capability for reasoning tasks.





#### Neural-Symbolic AI example

Visual Reasoning



**Question:** Are there an **equal number** of large things and metal spheres?





#### Neural-Symbolic AI example

Visual Reasoning



**Question:** Are there an **equal number** of large things and metal spheres?





Slide Adapted from MIT 6.S191: Neurosymbolic AI

#### Neural-Symbolic AI example

Visual Reasoning



**Question:** Are there an **equal number** of large things and metal spheres?





Slide Adapted from MIT 6.S191: Neurosymbolic AI

#### Neural-Symbolic AI example

Visual Reasoning



#### **Question:** Are there an **equal number** of large things and metal spheres?





#### Neural-Symbolic AI example

Visual Reasoning



#### Neural-Symbolic AI example

Visual Reasoning



#### NSAI algorithm structure

| <b>Representative Neuro-</b> |             | Neuro-Vector-Symbolic                                                              | Multiple-Input-Multiple-Output        | Probabilistic Abduction via Learning Rules   | Probabilistic Abduction and             |
|------------------------------|-------------|------------------------------------------------------------------------------------|---------------------------------------|----------------------------------------------|-----------------------------------------|
| Symbolic Al                  | [ Workloads | Architecture (NVSA) [13]                                                           | Neural Networks (MIMONet) [24]        | in Vector-symbolic Architecture (LVRF) [12]  | Execution Learner (PrAE) [40]           |
| Compute                      | Neuro       | CNN                                                                                | CNN/Transformer                       | CNN                                          | CNN                                     |
| Pattern                      | Symbolic    | VSA binding/unbinding (Circular Conv)                                              | VSA binding (Circular Conv)           | VSA binding/unbinding (Circular Conv)        | Probabilistic abduction                 |
| Application                  | Use Case    | Spatial-temporal and abstract reasoning<br>Higher joint representation efficiency, | Multi-input simultaneously processing | Probabilistic reasoning, OOD data processing | Spatial-temporal and abstract reasoning |
| Scenario                     | Advantage   | Higher joint representation efficiency,                                            | Higher throughput, Lower latency,     | Stronger OOD handling capability, One-pass   | Higher generalization, Transparency,    |
| Scenario                     | vs. Neural  | Better reasoning capability, Transparency                                          | Compositional compute, Transparency   | learning, Higher flexibility, Transparency   | Interpretability, Robustness            |







NSAI workloads system characterization





Z. Wan *et al.*, "Towards Cognitive AI Systems: Workload and Characterization of Neuro-Symbolic AI," 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

#### NSAI workloads system characterization





Z. Wan *et al.*, "Towards Cognitive AI Systems: Workload and Characterization of Neuro-Symbolic AI," 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

NSAI workloads system characterization





Z. Wan *et al.*, "Towards Cognitive AI Systems: Workload and Characterization of Neuro-Symbolic AI," 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)

NSAI workloads system characterization





*Neuro-symbolic workload exhibits high latency compared to neural models;* **Symbolic component** is processed inefficiently on off-the-shelf CPU/GPUs

NSAI algorithms deployment challenges:

Memory & compute inefficiency



NSAI algorithms deployment challenges:

- Memory & compute inefficiency
- Heterogeneous compute kernel



NSAI algorithms deployment challenges:

- Memory & compute inefficiency
- Heterogeneous compute kernel
- Algorithm diversity



NSAI algorithms deployment challenges:

- Memory & compute inefficiency
- Heterogeneous compute kernel
- Algorithm diversity







Efficient and flexible architecture

NSAI algorithms deployment challenges:

- Memory & compute inefficiency
- Heterogeneous compute kernel
- Algorithm diversity



Automated and customized FPGA deployment

Efficient and flexible architecture







#### **NSFlow**





#### **NSFlow**















An end-to-end automated FPGA framework for accelerating and deploying generic NSAI workloads.

Identifies data dependency for the workload





- Identifies data dependency for the workload
- Explores design space with parameterizable HW blocks.





- Identifies data dependency for the workload
- Explores design space with parameterizable HW blocks.
- Generates and deploys optimal dataflow architecture on FPGA





- Identifies data dependency for the workload
- Explores design space with parameterizable HW blocks.
- Generates and deploys optimal dataflow architecture on FPGA
- Enables automated, efficient and scalable dataflow and architecture solutions for NSAI









#### Frontend

#### Backend





Backend











#### **NSFlow Backend**

**Flexible Hardware Architecture** 





#### An accelerator template

w/ RTL blocks parameterizable by config files





An accelerator template

w/ RTL blocks parameterizable by config files

□ Adaptive Systolic Array:

for efficient Neuro & Symbolic processing









Z. Wan, H. Yang, R. Raj et al., "CogSys: Efficient and Scalable Neurosymbolic Cognition System via Algorithm-Hardware Co-Design," 2025 IEEE International Symposium on High Performance Computer Architecture (HPCA)





Enables efficient Vector Symbolic operation processing on systolic array

#### An accelerator template

w/ RTL blocks parameterizable by config files

□ Adaptive Systolic Array:

for efficient Neuro & Symbolic processing

□ SIMD Unit:

for element-wise ops





#### An accelerator template

w/ RTL blocks parameterizable by config files

□ Adaptive Systolic Array:

for efficient Neuro & Symbolic processing

□ SIMD Unit:

for element-wise ops

#### BRAM Blocks:

for flexible on-chip memory





#### An accelerator template

w/ RTL blocks parameterizable by config files

#### □ Adaptive Systolic Array:

for efficient Neuro & Symbolic processing

#### □ SIMD Unit:

for element-wise ops

#### BRAM Blocks:

for flexible on-chip memory

#### □ Control Logic:

for HW-level task scheduling





#### **Dataflow Architecture Generation (DAG)**





- Analyze workload characteristics and data dependencies
- Explore optimal HW configurations and dataflow for backend



1. Extract Execution Trace from the workload:

wokload.py



### 1. Extract Execution Trace from the workload:

wokload.py => et.json



```
graph():
   // Neuro Operation - CNN (Resnet18)
    %relu_1[16,64,160,160] : call_module[relu](args = (%bn1
         [16, 64, 160, 160]))
   %maxpool_1[16,64,160,160] : call_module[maxpool](args =
          (%relu 1[16,64,160,160]))
    %conv2d 1[16,64,160,160] : call module[conv2d](args =
         (%maxpool_1[16,64,160,160]))
    // Symbolic Operations
    // Inverse binding of two block codes vectors by
        blockwise cicular correlation
    %inv_binding_circular_1[1,4,256] : call_function[nvsa.
         inv binding circular (args = (\$vec 0[1, 4, 256]), \$
         vec 1[1,4,256]))
   %inv_binding_circular_2[1,4,256] : call_function[nvsa.
         inv binding circular](args = (\$vec 3[1,4,256], \$
         vec_4[1, 4, 256])
   // Compute similarity between two block codes vectors
    %match_prob_1[1] : call_function[nvsa.match_prob](args
         = (%inv_binding_circular_1[1,4,256], %vec_2
         [1, 4, 256]))
    // Compute similarity between a dictionary and a batch
         of query vectors
    %match_prob_multi_batched_1[1]: call_function[nvsa.
        match prob multi batched](args = (%
        inv_binding_circular_2[1,4,256], %vec_5[7,4,256]))
    %sum 1[1] : call function[torch.sum](args = (%)
        match_prob_multi_batched_1[1]))
    %clamp_1[1] : call_function[torch.clamp](args = (%sum_1)
         [1]))
    %mul_1[1] : call_function[operator.mul](args = (%
        match_prob_1[1], %clamp_1[1]))
    . . .
```















3. Explore optimal HW config and array partition strategy



### 3. Explore optimal HW config and array partition strategy

Algorithm 1: NSFlow Two-Phase DSE Algorithm

Data: R<sub>l</sub>, R<sub>v</sub>, Range<sub>H</sub> (H search range), Range<sub>W</sub> (W search range), M (max #PEs), Iter<sub>max</sub> (Phase II max iterations)
Result: H, W, N (total #sub-arrays), N<sub>l</sub>, N<sub>v</sub>
1 / \* Phase I \* /

2 for *H* in  $Range_H$ , *W* in  $Range_W$  do 3  $N = \lfloor M/(H \times W) \rfloor$  // get total #sub-arrays 4 for  $\overline{N}_l$  in [1, N) do 5 | // get optimal HW config for parallel mapping

6 Set all elements in  $N_l$  to  $\bar{N}_l$ 7 Set all elements in  $N_v$  to  $N - \bar{N}_l$ 

 $t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))$ 

Save the  $H, W, \overline{N}_l$  (and  $\overline{N}_v$ ) with minimal  $t_{para}$ .

#### 10 end

8 9

 $\begin{array}{ll} & \\ \textbf{11} & \\ \textbf{12} & \\ t_{seq} = \Sigma_i^G f_{l_i}(H,W,N) + \\ & \\ min(\Sigma_j^G f_{v_j,temp}(H,W,N), \ \Sigma_j^G f_{v_j,spatial}(H,W,N)) \end{array}$ 

13 // Set to sequential mode in case it has better performance

14 Return and set sequential mode if  $t_{seq} < t_{para}$  else Continue 15 end

#### **16** / \* Phase II \* /

17 for it in Iterman do for layer i in  $R_1$  do 18 Locate VSA node j' and j'' where layer i starts and ends 19 if  $t_{seq} < t_{para}$  do  $N_l[i] - -; N_v[j':j''] + +;$ else do  $N_l[i] + +; N_v[j':j''] - -;$ 20 21  $t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))$ 22 Save the  $H, W, N_l, N_v$  with minimal  $t_{para}$ . 23 24 end 25 end 26 Return  $H, W, N, N_l, N_v$ .

62

- 3. Explore optimal HW config and array partition strategy
  - *Phase I:* Assuming **static partition**, find the optimal array size (H, W, N).

Algorithm 1: NSFlow Two-Phase DSE Algorithm

```
Data: R_l, R_v, Range_H (H search range), Range_W (W search
      range), M (max #PEs), Iter_{max} (Phase II max iterations)
Result: H, W, N (total #sub-arrays), N_l, N_v
```

```
1 / * Phase I * /
2 for H in Range<sub>H</sub>, W in Range<sub>W</sub> do
```

21

22

23 24

25 end

end

26 Return  $H, W, N, N_l, N_v$ .

```
N = |M/(H \times W)| // \text{get total #sub-arrays}
3
       for N_l in [1, N) do
4
            // get optimal HW config for parallel mapping
5
            Set all elements in N_l to \bar{N}_l
 6
            Set all elements in N_v to N - \bar{N}_l
7
            t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))
8
            Save the H, W, \overline{N}_l (and \overline{N}_v) with minimal t_{para}.
9
10
       end
       // get sequential runtime
11
       t_{seq} = \sum_{i}^{G} f_{l_i}(H, W, N) +
12
        \min(\Sigma_{j}^{G} f_{v_{j},temp}(H,W,N), \ \Sigma_{j}^{G} f_{v_{j},spatial}(H,W,N))
       // Set to sequential mode in case it has better performance
13
       Return and set sequential mode if t_{seq} < t_{para} else Continue
14
15 end
16 / * Phase II * /
17 for it in Itermax do
       for layer i in R_1 do
18
            Locate VSA node j' and j'' where layer i starts and ends
19
            if t_{seq} < t_{para} do N_l[i] - -; N_v[j':j''] + +;
20
            else do N_l[i] + +; N_v[j':j''] - -;
```

 $t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))$ 

Save the  $H, W, N_l, N_v$  with minimal  $t_{para}$ .



- 3. Explore optimal HW config and array partition strategy
  - *Phase I:* Assuming **static partition**, find the optimal array size (H, W, N).
  - *Phase II:* Fine-tune for **dynamic partition** to better balance Neuro & Symb at runtime



```
Algorithm 1: NSFlow Two-Phase DSE Algorithm
   Data: R_l, R_v, Range_H (H search range), Range_W (W search
         range), M (max #PEs), Iter_{max} (Phase II max iterations)
   Result: H, W, N (total #sub-arrays), N_l, N_v
1 / * Phase I * /
2 for H in Range<sub>H</sub>, W in Range<sub>W</sub> do
       N = |M/(H \times W)| // get total #sub-arrays
3
       for N_l in [1, N) do
4
           // get optimal HW config for parallel mapping
5
           Set all elements in N_l to \bar{N}_l
           Set all elements in N_v to N - \bar{N}_l
7
           t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))
8
9
           Save the H, W, \overline{N}_l (and \overline{N}_v) with minimal t_{para}.
10
       end
       // get sequential runtime
11
       t_{seq} = \sum_{i}^{G} f_{l_i}(H, W, N) +
12
        \min(\Sigma_{j}^{G} f_{v_{j},temp}(H,W,N), \Sigma_{j}^{G} f_{v_{j},spatial}(H,W,N))
       // Set to sequential mode in case it has better performance
13
       Return and set sequential mode if t_{seq} < t_{para} else Continue
14
15 end
16 / * Phase II * /
17 for it in Itermax do
       for layer i in R_1 do
18
           Locate VSA node j' and j'' where layer i starts and ends
19
           if t_{seq} < t_{para} do N_l[i] - -; N_v[j':j''] + +;
20
           else do N_l[i] + +; N_v[j':j''] - -;
21
           t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))
22
           Save the H, W, N_l, N_v with minimal t_{para}.
23
24
       end
25 end
26 Return H, W, N, N_l, N_v.
```

- 3. Explore optimal HW config and array partition strategy
  - *Phase I:* Assuming **static partition**, find the optimal array size (H, W, N).
  - *Phase II:* Fine-tune for **dynamic partition** to better balance Neuro & Symb at runtime

### ✓ Reduces search space x10<sup>100</sup>

|          | HW config $(H, W, N)$        |                                   | Total design space, $m = 10$ |
|----------|------------------------------|-----------------------------------|------------------------------|
| Original | $m \times (m+1)/2$           | $(N-1)^k$ for each N              | $10^{300}$                   |
| DAG      | $PhaseI: 1/4 \le H/W \le 16$ | $PhaseII$ : Iter $\times$ #layers | $10^{3}$                     |



Algorithm 1: NSFlow Two-Phase DSE Algorithm **Data:**  $R_l, R_v, Range_H$  (H search range),  $Range_W$  (W search range), M (max #PEs),  $Iter_{max}$  (Phase II max iterations) **Result:** H, W, N (total #sub-arrays),  $N_l$ ,  $N_v$ 1 / \* Phase I \* / 2 for H in Range<sub>H</sub>, W in Range<sub>W</sub> do  $N = |M/(H \times W)|$  // get total #sub-arrays for  $N_l$  in [1, N) do 4 // get optimal HW config for parallel mapping 5 Set all elements in  $N_l$  to  $\bar{N}_l$ Set all elements in  $N_v$  to  $N - \bar{N}_l$ 7 8  $t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))$ 9 Save the  $H, W, \overline{N}_l$  (and  $\overline{N}_v$ ) with minimal  $t_{para}$ . 10 end // get sequential runtime 11  $t_{seq} = \sum_{i}^{G} f_{l_i}(H, W, N) +$ 12  $\min(\Sigma_{j}^{G}f_{v_{j},temp}(H,W,N),\ \Sigma_{j}^{G}f_{v_{j},spatial}(H,W,N))$ // Set to sequential mode in case it has better performance 13 Return and set sequential mode if  $t_{seq} < t_{para}$  else Continue 14 15 end 16 / \* Phase II \* / 17 for it in Itermax do for layer i in  $R_1$  do 18 Locate VSA node j' and j'' where layer i starts and ends 19 if  $t_{seq} < t_{para}$  do  $N_l[i] - -; N_v[j':j''] + +;$ 20 else do  $N_l[i] + +; N_v[j':j''] - -;$ 21  $t_{para} = max(t_{nn}(H, W, N_l), t_{vsa}(H, W, N_v))$ 22 Save the  $H, W, N_l, N_v$  with minimal  $t_{para}$ . 23 24 end 25 end 26 Return  $H, W, N, N_l, N_v$ .





### Experiments setup

- > Workloads:
  - Algorithms: NVSA, MIMONet, LVRF
  - Datasets: RAVEN, I-RAVEN, PGM, CVR, and SVRT
- > Hardwares:
  - Baselines: TX2, Xavier NX, Xeon CPU, RTX 3080, ML accelerators (TPU, Xilinx DPU)
  - FPGA deployment: AMD U250

| Workloads | Precision AdArray Configuration |      | SIMD<br>Size      | On-chip<br>SRAM Blocks<br>(BRAM)               |    |                 | On-chip<br>Cache | •      |         |     |     |     |      | Frequency |        |         |
|-----------|---------------------------------|------|-------------------|------------------------------------------------|----|-----------------|------------------|--------|---------|-----|-----|-----|------|-----------|--------|---------|
|           | NN                              | Symb | Size<br>(H, W, N) | Default Partition<br>$(\bar{N}_l : \bar{N}_v)$ |    | MemA1, MemA2    | Mem B            | Mem C  | (URAM)  | DSP | LUT | FF  | BRAM | URAM      | LUTRAM |         |
| NVSA      | INT8                            | INT4 | 32, 16, 16        | 14:2                                           | 64 | 2.7 MB, 1.1 MB  | 2.7 MB           | 1.6 MB | 16.2 MB | 89% | 56% | 60% | 34%  | 8%        | 24%    | 272 MHz |
| MIMONet   | INT8                            | INT8 | 32, 32, 8         | 6:2                                            | 64 | 3.4 MB, 1.2 MB  | 3.4 MB           | 2.1 MB | 20.1 MB | 89% | 44% | 52% | 43%  | 10%       | 20%    | 272 MHz |
| LVRF      | INT8                            | INT4 | 32, 16, 16        | 14:2                                           | 64 | 2.7 MB, 0.96 MB | 2.7 MB           | 1.4 MB | 15.5 MB | 89% | 56% | 60% | 31%  | 7%        | 24%    | 272 MHz |





### End-to-end runtime improvement

✓ ~2x speedup over GPU, 2~8x speedup over TPU, ~3x speedup over DPU





## **Evaluation**

### Mixed-precision optimization √~6x Memory reduction

| Reasoning Accuracy | FP32  | FP16  | INT8  | MP (IN8 for NN, INT4 for Symb) | INT4  |
|--------------------|-------|-------|-------|--------------------------------|-------|
| RAVEN [39]         | 98.9% | 98.9% | 98.7% | 98.0%                          | 92.5% |
| I-RAVEN [16]       | 99.0% | 98.9% | 98.8% | 98.1%                          | 91.3% |
| PGM [3]            | 68.7% | 68.6% | 68.4% | 67.4%                          | 59.9% |
| Memory             | 32MB  | 16MB  | 8MB   | 5.5MB                          | 4MB   |





### Scalability

✓ Only 4x runtime increase when symbolic workloads scale by 150x









NSFlow is the first end-to-end design automation framework dedicated to accelerate generic NSAI systems





NSFlow is the first end-to-end design automation framework dedicated to accelerate generic NSAI systems

• Identifies the unique optimization opportunities for NSAI acceleration



# Conclusion

NSFlow is the first end-to-end design automation framework dedicated to accelerate generic NSAI systems

- Identifies the unique optimization opportunities for NSAI acceleration
- Explores the dataflow and architecture design space with a novel algorithm



# Conclusion

NSFlow is the first end-to-end design automation framework dedicated to accelerate generic NSAI systems

- Identifies the unique optimization opportunities for NSAI acceleration
- Explores the dataflow and architecture design space with a novel algorithm
- Generates a efficient scalable design for FPGA deployment



### Conclusion

NSFlow paves the way for advancing efficient cognitive reasoning systems and unlocking new possibilities in NSAI.



