I work in the Speech and Natural Language Technologies department at Vicomtech, specializing in Natural Language Generation and Machine Translation, while pursuing my PhD. My research bridges computational linguistics and sustainability: I focus on optimizing language model training to improve performance for low-resource languages while reducing energy consumption, contributing to the broader Green AI initiative. My work aims to make NLP technologies accessible, equitable, and environmentally sustainable.


🔥 News

  • 2024.11: 🎉🎉 Published “Automating Easy Read Text Segmentation” at EMNLP 2024 Findings (Read Paper).
  • 2024.08: 🎉🎉 Published “Split and Rephrase with Large Language Models” at ACL 2024 (Read Paper).
  • 2023.07: 🎉🎉 Published “Unsupervised Subtitle Segmentation with Masked Language Models” at ACL 2023 (Read Paper).

📖 Education

  • Ph.D. in Natural Language Processing (2022–Present)
    University of the Basque Country (UPV/EHU)
    • Thesis: Optimization, Adaptation, and Applications of Large Language Models
  • M.Sc. in Artificial Intelligence (2019–2020)
    University of Leeds
    • Graduated with Distinction
  • B.Sc. in Computer Science (2015–2019)
    University of Malaga

🏢 Employment History

Vicomtech – Researcher in Speech and Natural Language Technologies

San Sebastián, Spain
2020 – Present

  • Conduct research and development in Natural Language Generation, Machine Translation, and Text Simplification for underrepresented languages.
  • Lead and contribute to projects including ADAGIO, ADAPT-IA, and IRAZ, focusing on optimizing language models and advancing Green AI methodologies.
  • Collaborate on national and international initiatives to improve accessibility and sustainability in NLP.
  • Supervise and mentor interns and junior researchers in the Speech and Natural Language Technologies department.

🛠 Skills

Research Skills

  • Large Language Model Optimization
  • Neural Machine Translation (NMT)
  • Text Simplification
  • Machine Translation Quality Estimation
  • Green AI methodologies for sustainable model training

Technical Skills

  • Programming Languages: Python, Golang, C++, TypeScript, HTML, CSS, SQL
  • Frameworks & Tools: PyTorch, llama.cpp, LlamaIndex, MarianNMT, COMET, FastAPI, Angular
  • Specialized Tools: Energy usage profiling tools (e.g., CodeCarbon), lightweight deployment frameworks

Languages

  • Spanish: Native
  • English: Full Professional Proficiency
  • Basque: Professional Proficiency

📝 Publications

2024

  • Split and Rephrase with Large Language Models
    David Ponce, Thierry Etchegoyhen, Jesús Calleja, Harritxu Gete
    ACL 2024 (Paper)

  • Vicomtech@WMT 2024: Shared Task on Translation into Low-Resource Languages of Spain
    David Ponce, Harritxu Gete, Thierry Etchegoyhen
    WMT 2024 (Paper)

  • Automating Easy Read Text Segmentation
    Jesús Calleja, Thierry Etchegoyhen, Antonio David Ponce Martínez
    EMNLP 2024 Findings (Paper)

2023

  • Unsupervised Subtitle Segmentation with Masked Language Models
    David Ponce, Thierry Etchegoyhen, Víctor Ruiz
    ACL 2023 (Paper)

  • Learning from Past Mistakes: Quality Estimation from Monolingual Corpora and Machine Translation Learning Stages
    Thierry Etchegoyhen, David Ponce
    MT Summit XIX (Paper)

  • IRAZ: Easy-to-Read Content Generation via Automated Text Simplification
    Thierry Etchegoyhen, Jesús Calleja, David Ponce
    SEPLN 2023 (Paper)

2022

  • TANDO: A Corpus for Document-level Machine Translation
    Harritxu Gete, Thierry Etchegoyhen, David Ponce, et al.
    LREC 2022 (Paper)

2021

  • Online Learning over Time in Adaptive Neural Machine Translation
    Thierry Etchegoyhen, David Ponce, Harritxu Gete, Víctor Ruiz
    RANLP 2021 (Paper)

  • ITAI: Adaptive Neural Machine Translation Platform
    Thierry Etchegoyhen, David Ponce, Harritxu Gete, Víctor Ruiz
    SEPLN 2021 (Paper)


📂 Projects

2024

  • BIKAIN
    Research and development of a high-reliability automatic quality estimation system to identify translation errors at the sentence, word, terminology, and context levels in an integrated manner.

  • IKUN
    Research and development of Large Multimodal Models for industrial domain adaptation, facilitating the generation of synthetic images and time series to enhance quality assurance processes. Includes multimodal conversational interfaces for industrial dashboards and knowledge bases.

  • LiveAI
    Development of AI services for accessibility and audiovisual translation, enabling real-time transcription, translation, and spoken interpretation (dubbing) of live events.

2023

  • ADAGIO
    Research and development of a system for automatic text generation adaptable to specific domains using Artificial Intelligence technologies.

  • ADAPT-IA
    Research and development of adaptive AI technologies and MLOps applied to Basque language technologies, focusing on industrial integration, continuous deployment of neural models, and exploring methodologies to optimize maintenance and adaptation.

  • IACODE
    Development of a system that generates standardized code from existing code following MISRA programming guidelines, automating the process with generative AI models specialized in code generation.

2022

  • LIDO
    Research and development of a system for the optimization of multilingual linguistic data using Artificial Intelligence technologies.

  • IRAZ
    Development of an easy-to-read content generation solution based on automated text simplification, aimed at improving accessibility for people with reading difficulties.

2021

  • STREAMS
    Development of a cloud-based platform integrating AI-powered transcription, translation, automatic subtitling, and speech synthesis services in multiple languages (Basque, Spanish, French, and English). The platform enhances business processes across various sectors.

  • IKA
    Design, development, validation, and integration of an automatic translation quality estimation system to address challenges in the translation market and multilingual content generation.

2020

  • TANDO
    Research and development of document-level neural machine translation systems for Basque-Spanish, including fine-grained evaluations of gender and contextual phenomena.

  • ITAI
    Design, development, validation, and integration of a continuous learning system for neural machine translation, aimed at addressing challenges in multilingual content generation.