I work in the Speech and Natural Language Technologies department at Vicomtech, specializing in Natural Language Generation and Machine Translation, while pursuing my PhD. My research bridges computational linguistics and sustainability. I focus on optimizing language model training to enhance performance for low-resource languages while reducing energy consumption, contributing to the broader Green AI initiative. My work aims to make NLP technologies accessible, equitable, and environmentally sustainable.
🔥 News
- 2024.11: 🎉🎉 Published “Automating Easy Read Text Segmentation” at EMNLP 2024 Findings (Read Paper).
- 2024.08: 🎉🎉 Published “Split and Rephrase with Large Language Models” at ACL 2024 (Read Paper).
- 2023.07: 🎉🎉 Published “Unsupervised Subtitle Segmentation with Masked Language Models” at ACL 2023 (Read Paper).
📖 Education
- Ph.D. in Natural Language Processing (2022–Present)
University of the Basque Country (UPV/EHU). Thesis: Optimization, Adaptation, and Applications of Large Language Models
- M.Sc. in Artificial Intelligence (2019–2020)
University of Leeds. Graduated with Distinction
- B.Sc. in Computer Science (2015–2019)
University of Malaga
🏢 Employment History
Vicomtech – Researcher in Speech and Natural Language Technologies
San Sebastián, Spain
2020 – Present
- Conduct research and development in Natural Language Generation, Machine Translation, and Text Simplification for underrepresented languages.
- Lead and contribute to cutting-edge projects, including ADAGIO, ADAPT-IA, and IRAZ, focusing on optimizing language models and advancing Green AI methodologies.
- Collaborate on national and international initiatives to improve accessibility and sustainability in NLP.
- Supervise and mentor interns and junior researchers in the Speech and Natural Language Technologies department.
🛠 Skills
Research Skills
- Large Language Model Optimization
- Neural Machine Translation (NMT)
- Text Simplification
- Machine Translation Quality Estimation
- Green AI methodologies for sustainable model training
Technical Skills
- Programming Languages: Python, Golang, C++, TypeScript, HTML, CSS, SQL
- Frameworks & Tools: PyTorch, llama.cpp, LlamaIndex, MarianNMT, COMET, FastAPI, Angular
- Specialized Tools: Energy usage profiling tools (e.g., CodeCarbon), lightweight deployment frameworks
Languages
- Spanish: Native
- English: Full Professional Proficiency
- Basque: Professional Proficiency
📝 Publications
2024
- Split and Rephrase with Large Language Models
David Ponce, Thierry Etchegoyhen, Jesús Calleja, Harritxu Gete
ACL 2024 (Paper)
- Vicomtech@WMT 2024: Shared Task on Translation into Low-Resource Languages of Spain
David Ponce, Harritxu Gete, Thierry Etchegoyhen
WMT 2024 (Paper)
- Automating Easy Read Text Segmentation
Jesús Calleja, Thierry Etchegoyhen, Antonio David Ponce Martínez
EMNLP 2024 Findings (Paper)
2023
- Unsupervised Subtitle Segmentation with Masked Language Models
David Ponce, Thierry Etchegoyhen, Víctor Ruiz
ACL 2023 (Paper)
- Learning from Past Mistakes: Quality Estimation from Monolingual Corpora and Machine Translation Learning Stages
Thierry Etchegoyhen, David Ponce
MT Summit XIX (Paper)
- IRAZ: Easy-to-Read Content Generation via Automated Text Simplification
Thierry Etchegoyhen, Jesús Calleja, David Ponce
SEPLN 2023 (Paper)
2022
- TANDO: A Corpus for Document-level Machine Translation
Harritxu Gete, Thierry Etchegoyhen, David Ponce, et al.
LREC 2022 (Paper)
2021
- Online Learning over Time in Adaptive Neural Machine Translation
Thierry Etchegoyhen, David Ponce, Harritxu Gete, Víctor Ruiz
RANLP 2021 (Paper)
- ITAI: Adaptive Neural Machine Translation Platform
Thierry Etchegoyhen, David Ponce, Harritxu Gete, Víctor Ruiz
SEPLN 2021 (Paper)
📂 Projects
2024
- BIKAIN
Research and development of a high-reliability automatic quality estimation system to identify translation errors at the sentence, word, terminology, and context levels in an integrated manner.
- IKUN
Research and development of Large Multimodal Models for industrial domain adaptation, facilitating the generation of synthetic images and time series to enhance quality assurance processes. Includes multimodal conversational interfaces for industrial dashboards and knowledge bases.
- LiveAI
Development of AI services for accessibility and audiovisual translation, enabling real-time transcription, translation, and spoken interpretation (dubbing) of live events.
2023
- ADAGIO
Research and development of a system for automatic text generation adaptable to specific domains using Artificial Intelligence technologies.
- ADAPT-IA
Research and development of adaptive AI technologies and MLOps applied to Basque language technologies, focusing on industrial integration, continuous deployment of neural models, and exploring methodologies to optimize maintenance and adaptation.
- IACODE
Development of standardized code generation from existing code following MISRA technical programming guidelines, automating the process using generative AI models specialized in code generation.
2022
- LIDO
Research and development of a system for the optimization of multilingual linguistic data using Artificial Intelligence technologies.
- IRAZ
Development of an easy-to-read solution through automated text simplification, aimed at improving accessibility for people with reading difficulties.
2021
- STREAMS
Development of a cloud-based platform integrating AI-powered transcription, translation, automatic subtitling, and speech synthesis services in multiple languages (Basque, Spanish, French, and English). The platform enhances business processes across various sectors.
- IKA
Design, development, validation, and integration of an automatic translation quality estimation system to address challenges in the translation market and multilingual content generation.
2020
- TANDO
Research and development of document-level neural machine translation systems for Basque-Spanish, including fine-grained evaluations of gender and contextual phenomena.
- ITAI
Design, development, validation, and integration of a continuous learning system for neural machine translation, aimed at addressing challenges in multilingual content generation.