Illustration of the Data, AI & Machine Learning skill - Jose DA COSTA
Technical skill · Data & AI

Data, AI & Machine Learning

RAG and LLM workflows at ACCENSEO, ML scoring at AdsPower, Ligneurs ETL at Pichet. A skill grown from real business needs, not from the hype cycle.

Personal Confidence
4.1/5 · Expert
Foundational · Developing · Proficient · Advanced · Expert
How this competency evolved over time

My definition

Data, AI, and machine learning is, in my own definition, the competency that turns events and texts into decisions. It covers relational and NoSQL databases, dataviz (Apache ECharts, business dashboards), data engineering, ML fundamentals, and applied LLM workflows (RAG, agentic, evaluation). It is the explicit strategic axis of my 2026-2028 project: coding agents wired into the development cycle, agentic LLMs that orchestrate real actions instead of replies, new agentic dev cycles (spec → review → test → refactor → doc) augmented by AI, and organizational workflows rebuilt around AI agents - from product ideation to production delivery.

I work across four layers in parallel.

Storage and modelling: advanced SQL, Prisma modelling (~91 models on the accounting SaaS, 98 on the broker SaaS), MongoDB and PostgreSQL in production at several hundred GB of RAM at ACCENSEO.

Data pipelines: custom ETL (Akeneo Ligneurs), Azure ML Studio ML pipelines (AdsPower 2016-2018), multi-vendor enrichment (Claude, GPT, Gemini, TRELLIS, TripoSR, Shap-E).

Applied product AI: hands-on RAG in the ACCENSEO pipeline, classification, 3D generation, multilingual translation, attribute extraction from visuals.

Agentic dev and organizational workflows: coding agents (Claude Code, Codex, GitHub Copilot Workspace) wired into my daily dev cycle on the accounting SaaS, n8n / Make.com / Power Automate orchestration on client engagements, agentic workflows already in place across spec, review, test, and documentation.

The skill is actively levelling toward Senior across all four: data engineering, applied ML, LLM-Ops and agentic dev.

Four parallel 2026 workstreams: agentic dev (MCP servers, coding agents, n8n), production LLM-Ops (hybrid RAG, multi-vendor routing across Claude/GPT/Gemini, monitoring), SaaS data engineering (Prisma, pgvector, MongoDB) and ML fundamentals (Azure ML Studio, k-means, NLP TF-IDF) - Jose DA COSTA

In 2026, the competitive moat of a vertical B2B SaaS is no longer in the chosen LLM but in the context you give it, proprietary permissioned data, real task execution with guardrails, and embedded distribution. This is the thesis Microsoft Azure develops in 10 RAG Shifts Redefining Production AI in 2026: agentic RAG is now the default pattern for answering complex questions and executing actions, and hybrid RAG is the production baseline. The CTO who knows how to design an industrialised RAG pipeline (eval + drift detection + cost per feature) on a regulated domain becomes sought after.

My evidence

Achievement

Anecdote 1: Co-founding AdsPower around AdTech ML pipelines

In January 2016, I co-founded AdsPower as CTO and Technical Project Manager of an early-stage bootstrapped startup. The bet: compete with Optmyzr (US) and Dolead (FR) using an ML-first approach to automatically optimise bids on Google AdWords, Bing Ads and Facebook Ads. The market was dominated by heuristic-based recommendation engines, and Azure ML Studio had just left public preview - the window was real, but so was the challenge: scarcity of ML skills in Bordeaux in 2017 and a limited runway.

I built a complete ML pipeline: a Data Collection Service wired to Google AdWords + Bing Ads + Facebook Ads SDKs, a custom SERP Scraper (Goutte + CasperJS) covering 6 search engines (Google, Bing, Yahoo, Yandex, Baidu, DuckDuckGo) and absorbing more than 10 million requests per month through a Memcached cache + Redis queue, and a Python Flask sidecar running NLTK + TF-IDF for multilingual NLP. On the modelling side, I trained supervised classifiers on Azure ML Studio for bid prediction, k-means clusters for negative-keyword detection, and the Google Prediction API for audience segmentation. The application stack: Symfony 3.2 + Angular with Electron desktop builds (Mac/Windows/Linux) and Cordova mobile (iOS/Android). To source ML freelancers, I ran geo-targeted GitHub searches on the machine-learning tags.
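To make the TF-IDF plus k-means idea concrete, here is a minimal standard-library sketch. It is not the AdsPower implementation (which ran on Azure ML Studio with NLTK); the function names, the smoothed IDF formula, the cosine-based assignment and the sample queries are all assumptions chosen for illustration. It clusters search queries so that an off-topic cluster can be reviewed as a source of negative-keyword candidates.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one L2-normalised sparse TF-IDF vector (term -> weight) per document."""
    tokenised = [doc.lower().split() for doc in docs]
    n = len(tokenised)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenised for term in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption): rare terms weigh more than common ones.
        vec = {
            t: (c / len(tokens)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        }
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vectors.append({t: w / norm for t, w in vec.items()})
    return vectors


def cosine(a, b):
    """Cosine similarity of two already-normalised sparse vectors."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def kmeans(vectors, k, iterations=10):
    """Cosine-based k-means; centroids start as the first k vectors (deterministic)."""
    centroids = [dict(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: each vector joins its most similar centroid.
        labels = [
            max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors
        ]
        # Update step: each centroid becomes the normalised sum of its members.
        for c in range(k):
            summed = Counter()
            for v, label in zip(vectors, labels):
                if label == c:
                    for t, w in v.items():
                        summed[t] += w
            if not summed:
                continue  # keep the previous centroid for an empty cluster
            norm = math.sqrt(sum(w * w for w in summed.values())) or 1.0
            centroids[c] = {t: w / norm for t, w in summed.items()}
    return labels


# Hypothetical search queries collected for a shoe retailer's campaign.
queries = [
    "running shoes sale",
    "free shoes pictures",
    "cheap running shoes",
    "free pictures download",
]
labels = kmeans(tfidf(queries), k=2)
```

In this toy data the "free ... pictures" queries end up in their own cluster, which is the kind of group a human would then review before adding negative keywords.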

The outcome: 3 major product iterations shipped in less than a year with a team of 4 freelancers I steered as Technical Project Manager, a platform covering 3 ad networks (Google, Bing, Facebook) with sub-500 ms recommendation latency, and 3 active beta testers on the v1 in November 2016.

That venture taught me viscerally that classification + bid optimisation can be productised - not just demoed in a notebook. The reflexes I forged there (sub-second latency, heuristic fallback when model confidence is low, quality-score monitoring) are the very ones I now replay on the ACCENSEO LLM workflows. AdsPower never reached PMF before the runway ran out, but it was my first real production ML school.
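The heuristic-fallback reflex mentioned above can be sketched in a few lines. This is an illustrative shape only: `decide_bid`, `BidDecision`, the 0.7 threshold and the stub model and heuristic are hypothetical names and values, not the AdsPower code.

```python
from dataclasses import dataclass


@dataclass
class BidDecision:
    bid: float
    source: str  # "model" or "heuristic", kept for monitoring


def decide_bid(features, model, heuristic, min_confidence=0.7):
    """Use the model's bid only when it is confident enough, else fall back.

    `model` returns (predicted_bid, confidence in [0, 1]);
    `heuristic` returns a bid from simple rules. Both are injected callables.
    """
    predicted_bid, confidence = model(features)
    if confidence >= min_confidence:
        return BidDecision(predicted_bid, "model")
    return BidDecision(heuristic(features), "heuristic")


# Stub callables standing in for a trained classifier and a rule engine.
def stub_model(features):
    return features["avg_cpc"] * 1.2, features["confidence"]


def stub_heuristic(features):
    return features["avg_cpc"]


confident = decide_bid({"avg_cpc": 1.0, "confidence": 0.9}, stub_model, stub_heuristic)
unsure = decide_bid({"avg_cpc": 1.0, "confidence": 0.4}, stub_model, stub_heuristic)
```

Recording the `source` of each decision is what makes the fallback rate observable over time.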

Achievement

Anecdote 2: Industrialising multi-vendor LLM enrichment at ACCENSEO

At ACCENSEO, one of the recurring themes with my e-commerce and PIM customers is massive AI-driven product enrichment: tens of thousands of product sheets to optimise - automatic taxonomy, SEO rewriting, image processing (background removal, watermarking), 3D model generation, multi-language translation, attribute extraction from visuals. The trap: locking yourself onto a single LLM vendor means inheriting its outages, pricing, and rate limits.

I built a multi-vendor pipeline by default. On the text side, I integrated OpenAI GPT, Anthropic Claude and Google Gemini behind a router that picks the model per task (Claude for precision, GPT for creativity, Gemini for lightweight multimodal). On the 3D side, I wired in TRELLIS, TripoSR, and Shap-E to generate 3D models from product photos. On the image side, automated background processing, cut-out, and watermarking. Orchestration runs through n8n and Make.com for automated workflows, Power Automate for Microsoft triggers, and the whole thing runs on dedicated OVH servers to keep customer catalogue data confidential.
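A per-task router of this kind might look like the sketch below. The first choice per task follows the text (Claude for precision, GPT for creativity, Gemini for lightweight multimodal); the fallback order, the `VendorError` type, the `route` function and the callable-based client interface are assumptions for illustration, and no real vendor SDK is called.

```python
class VendorError(Exception):
    """Raised by a vendor client on outage, rate limit, or timeout (hypothetical)."""


# Preferred vendor first, then fallbacks. Fallback order is an assumption.
ROUTES = {
    "precision": ["claude", "gpt", "gemini"],
    "creative": ["gpt", "claude", "gemini"],
    "multimodal": ["gemini", "gpt", "claude"],
}


def route(task_type, prompt, clients, routes=ROUTES):
    """Try each vendor configured for the task until one answers.

    `clients` maps a vendor name to a callable taking a prompt and
    returning text. Returns (vendor_used, response).
    """
    errors = {}
    for vendor in routes[task_type]:
        try:
            return vendor, clients[vendor](prompt)
        except VendorError as exc:
            errors[vendor] = str(exc)  # remember why this vendor was skipped
    raise RuntimeError(f"all vendors failed for {task_type!r}: {errors}")


# Stub clients: one vendor is down, the others echo the prompt.
def claude_down(prompt):
    raise VendorError("rate limited")


clients = {
    "claude": claude_down,
    "gpt": lambda prompt: "gpt:" + prompt,
    "gemini": lambda prompt: "gemini:" + prompt,
}
vendor, answer = route("precision", "classify this product", clients)
```

Keeping the routing table as plain data means a pricing change or an outage is handled by editing configuration rather than code.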

The outcome: enrichment deployed at scale across the e-commerce platforms of several customers (real estate, fashion, viticulture, automotive, fitted kitchens), a measurable catalogue quality lift without a linear human cost, and an internal product, Addly, derived from this expertise for Confluence/Atlassian Forge.

On this work I understood that production generative AI is won on observability discipline (token cost, latency, detected hallucination rate) and on multi-vendor strategy, not on prompt sophistication. That is the angle I want to push on the next CTO scale-up role: turn AI into a moat, not into a demo gimmick.
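As a rough illustration of that observability discipline, the sketch below aggregates token cost, latency and flagged-hallucination rate per feature. The class and field names are hypothetical, and the per-1k-token prices are placeholder numbers, not real vendor pricing.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class FeatureStats:
    calls: int = 0
    tokens: int = 0
    cost: float = 0.0
    latency_ms: float = 0.0
    flagged: int = 0  # responses flagged as hallucinated by some checker


class LLMObservatory:
    """Accumulates per-feature LLM usage metrics in memory."""

    def __init__(self, price_per_1k_tokens):
        self.prices = price_per_1k_tokens  # vendor name -> price per 1k tokens
        self.stats = defaultdict(FeatureStats)

    def record(self, feature, vendor, tokens, latency_ms, flagged=False):
        s = self.stats[feature]
        s.calls += 1
        s.tokens += tokens
        s.cost += tokens / 1000 * self.prices[vendor]
        s.latency_ms += latency_ms
        s.flagged += int(flagged)

    def report(self, feature):
        s = self.stats[feature]
        if s.calls == 0:
            return {"calls": 0, "cost": 0.0, "avg_latency_ms": 0.0, "flagged_rate": 0.0}
        return {
            "calls": s.calls,
            "cost": s.cost,
            "avg_latency_ms": s.latency_ms / s.calls,
            "flagged_rate": s.flagged / s.calls,
        }


# Placeholder prices, chosen only to keep the arithmetic easy to follow.
obs = LLMObservatory({"vendor_a": 2.0, "vendor_b": 4.0})
obs.record("seo_rewrite", "vendor_a", tokens=500, latency_ms=200.0)
obs.record("seo_rewrite", "vendor_b", tokens=250, latency_ms=400.0, flagged=True)
summary = obs.report("seo_rewrite")
```

A production version would persist these records and break them down by vendor and model version; the point here is only that cost per feature is a simple aggregation once every call is recorded.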

Achievement

Anecdote 3: Akeneo-to-portal real-estate ETL pipeline (Ligneurs)

For 4 years at Pichet (2019-2023), I was the sole technical owner of the Ligneurs export pipeline - the automated syndication engine for the group's real-estate listings, feeding around 20 partner portals (SeLoger, LeBonCoin, BienIci, LogicImmo...). The system fed an estimated volume of one lead every 2 seconds across all portals. Any interruption translated directly into lost leads and missed revenue.

I designed a per-partner modular architecture rather than a generic engine: one isolated Docker container per portal, orchestrated by Kubernetes on AWS EKS, with GitLab CI for targeted deployments that did not impact the other flows. On the ETL side, the pipeline extracts from the Akeneo PIM v2 REST API, transforms to each portal's specific format (XML, CSV, JSON), pre-renders multi-format images (4/3, 16/9, panoramic, square) centrally to avoid per-partner reprocessing, and ships via automated FTP/SFTP. I added defensive patterns on heterogeneous sources: circuit breaker on the PIM API, retry logic on FTP uploads, SKU matching algorithm between manual programs and ERP programs. The v1.4 to v2 migration was done portal by portal with business validation at every step, never big-bang.
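The two defensive patterns named above, a circuit breaker around the PIM API and retries on uploads, can be sketched as follows. This is a generic illustration under stated assumptions, not the Ligneurs code: the class and function names, thresholds, exponential backoff and injectable `clock` and `sleep` parameters are all hypothetical.

```python
import time


class CircuitOpenError(Exception):
    """Raised when calls are refused because the circuit is open."""


class CircuitBreaker:
    """Stops calling a failing dependency until a cool-down has elapsed."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("dependency temporarily disabled")
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # any success closes the circuit again
        return result


def retry(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func, retrying on any exception with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
```

A typical use would wrap the PIM fetch as `breaker.call(fetch_products)` and the upload as `retry(lambda: upload(path))`, where `fetch_products` and `upload` are placeholders for the real functions. The breaker protects a struggling API from being hammered, while the retry absorbs transient upload failures.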

The outcome: a zero-downtime migration across every partner portal, centralised monitoring with automated email alerts, and a pipeline that ran in continuous operation for 4 years without a major listing loss. No equivalent in the department was running with that level of reliability.

That project raised the data engineering bar I now carry on every ACCENSEO engagement: per-partner isolation, batch processing where real-time streaming brings nothing, observability per flow from day one. It is also where I durably understood data architectural debt: a generic single module looks easy at write time but becomes unmanageable at the tenth partner integration.

My self-critique

Level: Confirmed, actively levelling toward Senior. Foundations are solid: advanced SQL, Prisma modelling (~91 models on the accounting SaaS, 98 on the broker SaaS), MongoDB and PostgreSQL in production over hundreds of GB at ACCENSEO, Azure ML Studio ML pipelines (AdsPower), and multi-vendor applied LLM workflows (Claude, GPT, Gemini, Google Vertex). What still needs strengthening: industrialised RAG with eval and guardrails, production-grade MLOps (versioning, drift detection), and very large-scale data engineering (>TB).

This is the explicit strategic axis of my 2026-2028 project. It stitches together three layers: data foundations (rapid schema reading, pipeline audit), applied ML (classification, scoring, recommendation), and generative AI in production (RAG, agents, eval). For a vertical B2B SaaS scale-up CTO role, it is what turns AI into a moat rather than a demo gimmick.

Axis #1 of the 2026-2028 project: shifting from a CTO as modern-stack operator to an agent-native CTO, defending an AI roadmap to a board on product velocity, AI FinOps and AI Act compliance, distinguishing moat from commodity - Jose DA COSTA

Deliberate Confirmed → Senior climb triggered late 2024 and still ongoing: hands-on RAG plugged into the ACCENSEO pipeline, multi-vendor (Claude + GPT + Gemini), AI enrichment of tens of thousands of product sheets. Cadence is measurable quarter by quarter.

To myself: ship one small RAG or agentic project per quarter, with explicit eval, to keep the competency sharp, and maintain a journal of prompts that work and that don't.

To others: do not confuse AI demo with AI production, invest from day one in pipeline observability (token cost, latency, detected hallucination rate) and in guardrails (sanitisation, rate limit, human fallback). Pick a data-first stack before the model stack.
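The three guardrails listed above (sanitisation, rate limit, human fallback) can be combined in one small wrapper. The sketch below is illustrative only: `sanitise`, `RateLimiter`, `guarded_call`, the limits and the in-memory review queue are hypothetical, and the `llm` argument is any callable standing in for a real vendor client.

```python
import re
import time
from collections import deque

# Control characters other than tab, newline and carriage return.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def sanitise(text, max_chars=4000):
    """Strip control characters and surrounding whitespace, then cap the length."""
    return CONTROL_CHARS.sub("", text).strip()[:max_chars]


class RateLimiter:
    """Sliding window: at most max_calls within any `period` seconds."""

    def __init__(self, max_calls, period, clock=time.monotonic):
        self.max_calls = max_calls
        self.period = period
        self.clock = clock
        self.calls = deque()

    def allow(self):
        now = self.clock()
        # Forget calls that have left the window.
        while self.calls and now - self.calls[0] >= self.period:
            self.calls.popleft()
        if len(self.calls) >= self.max_calls:
            return False
        self.calls.append(now)
        return True


def guarded_call(prompt, llm, limiter, review_queue):
    """Return the LLM answer, or None after queueing the prompt for a human."""
    clean = sanitise(prompt)
    if not limiter.allow():
        review_queue.append(clean)  # over budget: hand over to a human
        return None
    try:
        return llm(clean)
    except Exception:
        review_queue.append(clean)  # vendor failure: hand over to a human
        return None
```

The human fallback is deliberately boring: anything the pipeline cannot answer within budget lands in a queue someone actually reads, instead of failing silently.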

My evolution in this skill

The 2026-2028 strategic axis

Data and AI are the axis that distinguishes my CTO profile in 2026. In the 24-month plan, they let me frame an AI-augmented vertical B2B SaaS product, hire a coherent data + ML / LLM team, and defend an AI product trajectory in front of a board separating moat from commodity. Without that axis, the 2026-2028 CTO role boils down to a modern-stack operator role.

By the end of 2027, the observable goal is to operate a production-grade data + AI platform with an industrialized RAG pipeline (eval + drift detection), an explicit cost per AI feature and a quarterly quality review. The Confirmed-to-Senior shift is measured on the triple mastery of data engineering + applied ML + LLM-Ops, not on an abstract score.

Production-grade data + AI platform by end of 2027: industrialized RAG pipeline (eval, drift detection), LLM cost observatory (cost per feature, cost per agent), versioned prompt registry compared on eval datasets, Q1-Q4 quarterly review and Expert 7/10 → 8/10 tier consolidation across the triple mastery data engineering, applied ML, LLM-Ops + agentic dev - Jose DA COSTA

Hands-on RAG integrated into the ACCENSEO pipeline (Claude + GPT + Gemini + Google Vertex multi-vendor), weekly intake of LLM releases. Master in Software Engineering active until 2026.

DeepLearning.AI Specialization and Coursera MLOps programs planned 2026-2027. Maven Applied LLM cohort (Hamel Husain for example) targeted 2026. GCP Professional Data Engineer certification considered depending on the target context.

Anchor reads: Designing Machine Learning Systems (Chip Huyen), Building LLM Powered Applications (Valentina Alto), curated arXiv papers. Continuous follow of Latent Space, Eugene Yan, Simon Willison. Monthly routine: a new model evaluated on a real case.

AI Engineering: Building Applications with Foundation Models book cover by Chip Huyen (O'Reilly, 2025), major reference on building applications with foundation models
Designing Machine Learning Systems book cover by Chip Huyen (O'Reilly), reference on production-grade ML systems
LLM Engineer's Handbook book cover by Paul Iusztin and Maxime Labonne (Packt, 2024), comprehensive guide on RAG, fine-tuning, and LLMOps in production
Hands-On Large Language Models book cover by Jay Alammar and Maarten Grootendorst (O'Reilly, 2024), visual and practical guide on LLMs
Build a Large Language Model (From Scratch) book cover by Sebastian Raschka (Manning, 2024), step-by-step construction of an LLM
Prompt Engineering for LLMs book cover by John Berryman and Albert Ziegler (O'Reilly, 2024), discipline of context engineering in production
Building LLM Powered Applications book cover by Valentina Alto (Packt), practical guide on building LLM applications
Fundamentals of Data Engineering book cover by Joe Reis and Matt Housley (O'Reilly, 2022), data engineering foundation for B2B SaaS
Ingénierie de l'IA book cover by Chip Huyen (First Interactive, French translation of AI Engineering), building applications with foundation models in French
Introduction au Machine Learning (3rd edition) book cover by Chloé-Agathe Azencott (Dunod, 2024), French academic reference from Mines Paris-PSL on ML fundamentals
