Ali Safaya

I am a final-year Computer Science PhD candidate at Koç University, advised by Deniz Yuret, and an AI Research Fellow at the KUIS AI Center in Istanbul. I expect to defend my thesis in 2026. My research focuses on memory augmentation for language models: adding external memory to Transformers so they can model long contexts efficiently.

Email  /  CV  /  Semantic Scholar  /  Google Scholar  /  Twitter  /  Github

Research

I am interested in natural language processing, language modeling (both large and small models), and deep learning in general. My PhD research is primarily focused on separating semantic and episodic information in language models, as well as developing improved memory architectures for modeling long sequences with Transformers.

In addition, I am an active member of the Turkish Data Depository (TDD) project, an open-source repository of data and tools for the Turkish language. I am responsible for managing TDD's dataset hub and for overseeing Mukayese, a collection of Turkish language benchmarks, including the newly released MukayeseLLM leaderboard.

Selected Publications & Projects
Neurocache: Efficient Vector Retrieval for Long-range Language Modeling
Ali Safaya, Deniz Yuret
NAACL 2024

ACL Anthology / arXiv / Code

Neurocache extends the effective context of large language models with an external vector cache of compressed past states, retrieved through efficient k-nearest-neighbor lookup and fused back into the attention layers.

Kanarya: An Open Turkish Language Model
Ali Safaya · Turkish Data Depository
2B-parameter Turkish LLM · Hugging Face

Model card

Kanarya is a 2-billion-parameter Turkish language model based on the GPT-J architecture, pretrained on large-scale filtered Turkish web text and released openly for Turkish text generation and NLP research.

Mukayese: Turkish NLP Strikes Back
Ali Safaya, Emirhan Kurtulus, Arda Goktogan, Deniz Yuret
Findings of ACL 2022

Project page / Video / arXiv

Mukayese is a collection of NLP benchmarks for the Turkish language, consisting of seven leaderboards for different NLP tasks. For each benchmark, we work with one or more datasets and present two or more baselines.

Other Publications

HuBERT-TR: Reviving Turkish Automatic Speech Recognition with Self-supervised Speech Representation Learning
Ali Safaya, Engin Erzin · arXiv, 2022

Beyond the Imitation Game: Quantifying and Extrapolating the Capabilities of Language Models (BIG-bench)
Aarohi Srivastava et al. (incl. Ali Safaya) · arXiv, 2022

Event Coreference Resolution for Contentious Politics Events
Ali Hurriyetoglu et al. (incl. Ali Safaya) · arXiv, 2022

KUISAIL at SemEval-2020 Task 12: BERT-CNN for Offensive Speech Identification in Social Media
Ali Safaya, Moutasem Abdullatif, Deniz Yuret · SemEval-2020

Automated Extraction of Socio-political Events from News (AESPEN): Workshop and Shared Task Report
Ali Hurriyetoglu et al. (incl. Ali Safaya) · AESPEN, 2020

COVCOR20 at WNUT-2020 Task 2: An Attempt to Combine Deep Learning and Expert Rules
Ali Hurriyetoglu et al. (incl. Ali Safaya) · W-NUT, 2020

Event Sentence Detection Task using Attention Model
Ali Safaya · CLEF, 2019

Blog Posts

Understanding the Impact of Token Redundancy in Language Models
How subword tokenizers (BPE, SentencePiece) end up learning duplicate tokens for the same word depending on a leading space, why that wastes capacity, and how models still learn to treat the duplicates alike — shown through embedding cosine-similarity heatmaps.

BERT-CNN for Detecting Offensive Speech
A BERT + CNN architecture for identifying offensive language in social media, from our SemEval-2020 Task 12 (OffensEval) submission.

Arabic-ALBERT: Pre-training Arabic Language Models
Open-source ALBERT language models pre-trained for Arabic, released with code and weights for the community.


This website is based on the source code of Jon Barron's website.