Table 4. Ablation study results
| Prosody embedding | MixLN | Mel2VAD predictor | CER (↓, %), seen | CER (↓, %), unseen | SECS (↑), seen | SECS (↑), unseen | F0 PCC (↑), seen | F0 PCC (↑), unseen | Energy PCC (↑), seen | Energy PCC (↑), unseen |
| × | × | × | 28.7 | 27.6 | 0.762 | 0.767 | 0.739 | 0.742 | 0.970 | 0.971 |
| ○ | × | × | 28.5 | 27.3 | 0.762 | 0.767 | 0.739 | 0.743 | 0.970 | 0.971 |
| ○ | ○ | × | 23.9 | 22.3 | 0.750 | 0.758 | 0.741 | 0.745 | 0.971 | 0.973 |
| × | × | ○ | 27.0 | 26.1 | 0.765 | 0.769 | 0.740 | 0.743 | 0.971 | 0.972 |
| ○ | ○ | ○ | 23.7 | 22.1 | 0.751 | 0.759 | 0.745 | 0.747 | 0.971 | 0.973 |
MixLN, mix layer normalization; CER, character error rate; SECS, speaker embedding cosine similarity; F0 PCC, F0 Pearson correlation coefficient; Energy PCC, energy Pearson correlation coefficient. ○, component used; ×, component not used; ↓, lower is better; ↑, higher is better.
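
For readers unfamiliar with the metrics in Table 4, the following is a minimal sketch of their textbook definitions. It is not the evaluation code behind the table. The function names are hypothetical, and the speech recognizer, speaker encoder, and F0/energy extraction used to obtain the inputs are not specified here and are assumed to be available upstream.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two character sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate in percent: edit distance over reference length."""
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length contours
    (for example, frame-level F0 or energy values)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def cosine_similarity(u, v):
    """Cosine similarity of two speaker embedding vectors (the SECS score)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

In this sketch, `cer` takes the reference transcript and the recognizer's transcript of the synthesized speech, `pearson` takes time-aligned reference and synthesized contours, and `cosine_similarity` takes embeddings of the reference and synthesized utterances.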