Annotation Formats

Computer Vision
COCO vs YOLO 8 min read
Deep Dive · Annotation Formats

COCO vs YOLO: Which Annotation Format Is Right for Your Pipeline?

Choosing the wrong annotation format isn't just an inconvenience — it's a project delay waiting to happen. COCO JSON and YOLO TXT are the two dominant formats in computer vision, but they were built for very different use cases, and the decision you make here echoes through your entire training pipeline.

COCO JSON gives you a richly structured format with support for bounding boxes, polygons, keypoints, and segmentation masks all in a single file. It's the format of choice for complex, multi-class datasets where you need to capture fine-grained object boundaries. The trade-off is file size and parsing overhead — COCO files grow fast, and loading a large dataset at training time requires careful handling.

YOLO TXT takes the opposite approach. One text file per image, one line per object, normalised coordinates. It's lean, fast, and modern YOLO architectures are optimised to consume it directly. If you're building a real-time inference pipeline — edge deployment, robotics, live video — YOLO format removes unnecessary overhead.

The conversion gotcha nobody warns you about: COCO stores coordinates, polygon vertices included, as absolute pixel values, while YOLO expects everything normalised to the image dimensions (centre-x, centre-y, width, height for boxes). Converting between them for complex segmentation masks is non-trivial and introduces rounding errors at scale. If you know your target model architecture upfront, annotate directly in that format from day one.
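To make that concrete, here is a minimal Python sketch of the bounding-box case. It assumes a standard COCO export (a hypothetical instances_train.json with images, annotations, and categories arrays, and bbox stored as [x_min, y_min, width, height] in pixels); polygon and mask conversion is deliberately left out.

```python
# Minimal COCO-bbox -> YOLO-line conversion sketch.
# Assumes the standard COCO bbox convention: [x_min, y_min, width, height] in pixels.
import json

def coco_bbox_to_yolo(bbox, img_w, img_h):
    """Convert a COCO pixel bbox to YOLO's normalised (cx, cy, w, h)."""
    x_min, y_min, w, h = bbox
    cx = (x_min + w / 2) / img_w
    cy = (y_min + h / 2) / img_h
    return cx, cy, w / img_w, h / img_h

with open("instances_train.json") as f:   # hypothetical COCO export
    coco = json.load(f)

images = {img["id"]: img for img in coco["images"]}
# Map COCO category ids to contiguous YOLO class indices.
class_index = {c["id"]: i for i, c in enumerate(sorted(coco["categories"], key=lambda c: c["id"]))}

lines_per_image = {}
for ann in coco["annotations"]:
    img = images[ann["image_id"]]
    cx, cy, w, h = coco_bbox_to_yolo(ann["bbox"], img["width"], img["height"])
    line = f'{class_index[ann["category_id"]]} {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}'
    lines_per_image.setdefault(img["file_name"], []).append(line)

# One .txt file per image, one line per object: the YOLO layout described above.
for file_name, lines in lines_per_image.items():
    with open(file_name.rsplit(".", 1)[0] + ".txt", "w") as f:
        f.write("\n".join(lines) + "\n")
```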

Annotation Guidelines 7 min read
Best Practice · Team Management

How to Write Annotation Guidelines Your Team Will Actually Follow

Vague guidelines are the most common root cause of label inconsistency we encounter when auditing annotation pipelines. Not bad annotators, not poor tooling — ambiguous instructions that leave too much room for individual interpretation.

The framework we use across all client projects starts with decision trees, not prose. If your guideline says "label the full vehicle including any visible trailer" and a flatbed truck appears with a load extending beyond its frame, what should the annotator do? A decision tree forces you to answer that question before your annotators encounter it at 2am on a deadline.
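One way to make that concrete is to encode the tree as data rather than prose, so it can be versioned and rendered inside the annotation tool. The questions, answers, and label names in this sketch are illustrative, not a real spec.

```python
# Illustrative sketch: a guideline decision tree encoded as data instead of prose.
VEHICLE_TRAILER_TREE = {
    "question": "Is a trailer or load physically attached to the vehicle?",
    "yes": {
        "question": "Does the load extend beyond the trailer or vehicle frame?",
        "yes": {"action": "One box covering vehicle, trailer, and overhanging load; flag as 'extended_load'."},
        "no": {"action": "One box covering vehicle and trailer."},
    },
    "no": {"action": "Box the vehicle only; label a detached trailer separately as 'trailer'."},
}

def resolve(tree, answers):
    """Walk the tree with a list of 'yes'/'no' answers and return the labelling action."""
    node = tree
    for answer in answers:
        node = node[answer]
        if "action" in node:
            return node["action"]
    return node.get("question")  # more answers needed

print(resolve(VEHICLE_TRAILER_TREE, ["yes", "yes"]))
```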

Real examples matter more than you think. Show labelled positives and negatives side by side. Annotators calibrate to visual examples faster than written descriptions, especially across language barriers. We include at minimum 3 positive and 3 negative examples per object class, plus 2 edge cases with explicit reasoning.

Finally: version your guidelines like code. Every change gets a changelog entry with a before/after example. When your IAA drops after a guideline update, you'll know exactly where to look.

IAA 6 min read
Quality · RLHF

What is Inter-Annotator Agreement — And Why 95% Isn't Enough

Inter-Annotator Agreement (IAA) is the metric that tells you how consistently your annotators are applying your guidelines. It's the single number that most accurately predicts downstream model performance — and the most commonly misunderstood quality metric in the industry.

Cohen's Kappa measures agreement between two annotators adjusted for chance; Fleiss' Kappa extends this to three or more annotators. Both are more meaningful than raw percentage agreement, which is inflated by class imbalance: if 90% of your images contain no object, two annotators can reach 90% raw agreement simply by labelling the empty images correctly, without agreeing on a single hard case.
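A quick illustration with fabricated labels and scikit-learn's cohen_kappa_score shows how far the two metrics diverge once class imbalance does the work:

```python
# Toy illustration (fabricated labels): raw percentage agreement looks high under
# class imbalance, while Cohen's kappa exposes the lack of agreement on hard cases.
from sklearn.metrics import cohen_kappa_score

# 90 empty images both annotators label "none"; of the 10 images containing an
# object, annotator B only agrees with annotator A on 2.
annotator_a = ["none"] * 90 + ["object"] * 10
annotator_b = ["none"] * 90 + ["object"] * 2 + ["none"] * 8

raw_agreement = sum(a == b for a, b in zip(annotator_a, annotator_b)) / len(annotator_a)
kappa = cohen_kappa_score(annotator_a, annotator_b)

print(f"raw agreement: {raw_agreement:.2f}")   # 0.92 -- looks fine
print(f"Cohen's kappa: {kappa:.2f}")           # ~0.31 -- the chance-corrected view is much worse
```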

Here's the critical insight: a 98% overall agreement score on a dataset where 95% of examples are clear-cut means your annotators disagree on roughly 40% of the ambiguous edge cases. If annotators agree on every clear-cut example, the entire 2-point shortfall is concentrated in the remaining 5% of hard examples, and 0.02 / 0.05 works out to 40% disagreement there. Those edge cases are exactly the scenarios your model will encounter first in production: unusual lighting, partial occlusion, out-of-distribution objects. High overall IAA can hide catastrophic disagreement precisely where it matters most.

We recommend reporting IAA separately for easy, medium, and hard examples. It takes more setup, but it's the only way to catch quality problems before they become model problems.
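A minimal sketch of that stratified reporting, assuming each item already carries a difficulty tag assigned at sampling time (the field names are illustrative):

```python
# Sketch of stratified IAA reporting: one kappa per difficulty stratum.
from collections import defaultdict
from sklearn.metrics import cohen_kappa_score

def iaa_by_difficulty(records):
    """records: iterable of dicts like {"difficulty": "hard", "ann_a": "car", "ann_b": "truck"}."""
    strata = defaultdict(lambda: ([], []))
    for r in records:
        a_labels, b_labels = strata[r["difficulty"]]
        a_labels.append(r["ann_a"])
        b_labels.append(r["ann_b"])
    return {d: cohen_kappa_score(a, b) for d, (a, b) in strata.items()}
```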


AV & Robotics

Perception AI
AV Edge Cases 9 min read
AV · Perception

Why Edge Cases Kill AV Models — And How to Label Them

Autonomous vehicle models don't fail on clear highway driving in good weather. They fail on scenarios that appear once in ten thousand miles: the pedestrian in a mascot costume, the shopping trolley blown into the lane, the partial occlusion of a stop sign by overgrown foliage at dusk. The long tail of rare scenarios disproportionately determines AV safety.

The challenge is that edge cases are, by definition, underrepresented in your natural data collection. A fleet driving 10 million miles may capture fewer than 100 examples of a specific failure-critical scenario. Annotating those 100 examples correctly is more valuable than perfectly annotating the next 100,000 highway miles.

How to identify edge cases systematically: Use your model's uncertainty outputs to surface high-entropy predictions on validation data. Cluster by failure mode rather than by object class. Review your incident logs — near-misses in simulation often predict real-world edge cases better than any synthetic taxonomy.
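As a rough sketch of the first step, assuming you have softmax outputs per validation frame, prediction entropy gives a simple ranking of frames to pull for review (the failure-mode clustering step is not shown):

```python
# Sketch: surface high-entropy (high-uncertainty) predictions for edge-case review.
# Assumes `probs` is an (N, num_classes) array of softmax outputs over validation frames.
import numpy as np

def prediction_entropy(probs, eps=1e-12):
    """Shannon entropy per prediction; higher means more model uncertainty."""
    return -np.sum(probs * np.log(probs + eps), axis=1)

def top_uncertain(probs, frame_ids, k=100):
    """Return the k frame ids the model is least sure about, most uncertain first."""
    entropy = prediction_entropy(probs)
    order = np.argsort(entropy)[::-1][:k]
    return [(frame_ids[i], float(entropy[i])) for i in order]
```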

For labelling edge cases, we increase our standard annotation consensus requirement from 2-of-3 annotators to 4-of-5, and add a mandatory written justification field for any bounding box that meets the edge-case criteria. The additional cost is real, but it's a fraction of the cost of a safety recall.
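A hedged sketch of what that consensus gate might look like in code; the record layout and status strings are illustrative, not a real tool's API:

```python
# Sketch of an edge-case consensus gate: 4-of-5 annotator agreement plus a
# mandatory written justification per annotator.
from collections import Counter

def edge_case_consensus(labels, justifications, required=4):
    """labels: one label per annotator; justifications: free-text rationale per annotator."""
    if any(not j.strip() for j in justifications):
        return None, "missing_justification"
    label, votes = Counter(labels).most_common(1)[0]
    if votes >= required:
        return label, "accepted"
    return None, "escalate_to_adjudication"
```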

LiDAR 7 min read
3D Annotation · LiDAR

LiDAR Point Cloud Annotation: A Practical Guide for Robotics Teams

3D cuboid annotation in point cloud data is fundamentally different from drawing bounding boxes on images, and teams that treat it like an image task routinely ship datasets that degrade model performance. The mistakes are more expensive because they're harder to spot during QA and propagate invisibly into downstream sensor fusion steps.

The most common error: annotators tighten cuboids to the visible point density rather than the physical object boundary. A vehicle in sparse LiDAR returns may have visible points covering only 60% of its actual extent. If your annotator snaps the cuboid to the visible returns, you're training your model to perceive vehicles as smaller than they are, with predictable consequences for path-planning margin calculations.

Our annotation spec requires annotators to estimate the full physical extent of the object using all available cues: camera fusion, object class priors, and adjacent frame continuity. For moving objects, we annotate in bird's-eye view first, then verify in front and side projections before finalising.
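For the bird's-eye-view step, a small sketch of the geometry involved: computing the footprint corners of a cuboid from its centre, dimensions, and yaw. The parameterisation is illustrative, not a specific tool's convention.

```python
# Sketch: bird's-eye-view footprint of a 3D cuboid (centre, length, width, yaw),
# the projection an annotator would sanity-check against class-size priors.
import numpy as np

def bev_corners(cx, cy, length, width, yaw):
    """Return the 4 (x, y) corners of the cuboid footprint in map coordinates."""
    local = np.array([
        [ length / 2,  width / 2],
        [ length / 2, -width / 2],
        [-length / 2, -width / 2],
        [-length / 2,  width / 2],
    ])
    rotation = np.array([[np.cos(yaw), -np.sin(yaw)],
                         [np.sin(yaw),  np.cos(yaw)]])
    return local @ rotation.T + np.array([cx, cy])
```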


Voice AI & Audio

Speech Annotation
Speaker Diarisation 8 min read
Voice AI · NLP

Speaker Diarisation: The Unsung Hero of Conversational AI

Every voice AI product that involves more than one speaker — call-centre analytics, meeting transcription, interview tools, conversational AI evaluation — depends on correctly attributing speech segments to individual speakers. This is speaker diarisation, and it's consistently underestimated during dataset planning.

Diarisation is harder than it looks for three reasons. First, speaker boundaries rarely align with clean silence — overlapping speech, laughter, and filler words create annotation ambiguity that no automated system handles reliably. Second, speaker identity is relative within a recording, not absolute across recordings, which means annotators can't rely on voice recognition shortcuts. Third, multi-speaker audio in real environments has variable background noise, channel effects, and recording artefacts that degrade automated pre-annotation quality.

Our quality protocol includes a dedicated calibration session for every new annotator on each language-accent combination, double annotation for any segment under 1.5 seconds, and a separate adjudication pass for all overlapping speech regions. Across our 40+ language variants, calibration reduces diarisation error rate by 31% compared to briefing-only onboarding.
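A minimal sketch of how segments might be routed for extra review, assuming first-pass segments arrive as (start, end, speaker) tuples; the thresholds mirror the protocol above, the data layout is an assumption.

```python
# Sketch: route diarisation segments to extra review.
# Segments under 1.5 s get double annotation; any overlap goes to adjudication.
def review_queue(segments, min_duration=1.5):
    segments = sorted(segments)
    double_annotate, adjudicate = [], []
    for i, (start, end, speaker) in enumerate(segments):
        if end - start < min_duration:
            double_annotate.append((start, end, speaker))
        # Overlap with the next segment (regardless of speaker) goes to adjudication.
        if i + 1 < len(segments) and segments[i + 1][0] < end:
            adjudicate.append(((start, end, speaker), segments[i + 1]))
    return double_annotate, adjudicate
```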


LLM & RLHF

AI Alignment
RLHF 10 min read
RLHF · LLM Alignment

RLHF From Scratch: What Your Human Annotators Actually Need to Know

Reinforcement Learning from Human Feedback sounds conceptually simple: humans rank or rate model outputs, those signals train a reward model, and that reward model guides further fine-tuning. In practice, the quality of your RLHF data is determined almost entirely by how well your annotators understand the task — and most annotator briefings vastly underestimate what that understanding requires.
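For readers who want the mechanics, here is a sketch of the pairwise preference loss commonly used to train reward models (a Bradley-Terry-style objective), with the reward model itself left abstract. This is a generic formulation under those assumptions, not a description of any specific stack.

```python
# Sketch of the pairwise preference loss commonly used for RLHF reward models:
# the model is pushed to score the human-preferred response above the rejected one.
# `reward_model` is an abstract callable returning a scalar score per (prompt, response).
import torch.nn.functional as F

def preference_loss(reward_model, prompt, chosen, rejected):
    r_chosen = reward_model(prompt, chosen)      # scalar tensor
    r_rejected = reward_model(prompt, rejected)  # scalar tensor
    # -log sigmoid(r_chosen - r_rejected): minimised when the chosen response wins by a margin.
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```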

The most critical concept to convey is implicit preference calibration. When an annotator rates two responses, they're not just evaluating what's written — they're encoding a value judgement about what a helpful, honest, and appropriate response should look like. Without explicit calibration sessions, annotators default to personal preferences, which vary wildly across demographic groups and produce noisy, inconsistent reward signals.

Domain expertise matters more than general writing quality judgement. An RLHF task for a medical information model requires annotators who can evaluate clinical accuracy, not just fluency. A coding assistant task requires annotators who can test whether the code actually runs. Staffing RLHF projects with general-purpose raters is one of the most common and costly mistakes we see.

For ambiguous cases — and there are always ambiguous cases — a structured disagreement protocol beats majority vote. When annotators disagree on a preference pair, surface it for senior adjudication with a written rationale requirement. These disagreements are your most valuable signal for guideline refinement.
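A small sketch of that routing logic, with an illustrative record layout: unanimous pairs pass through, any split vote goes to a senior adjudication queue with a rationale requirement attached.

```python
# Sketch of a disagreement protocol for preference pairs: no majority vote,
# every split decision is surfaced for adjudication.
def triage_preferences(pairs):
    accepted, adjudication = [], []
    for pair in pairs:
        votes = pair["votes"]          # e.g. ["A", "A", "B"]
        if len(set(votes)) == 1:
            accepted.append(pair)
        else:
            adjudication.append({**pair, "requires_rationale": True})
    return accepted, adjudication
```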

Red Teaming 7 min read
AI Safety · Red Teaming

Red Teaming Your LLM: What to Test Before You Ship

Red teaming is the practice of systematically attempting to elicit unsafe, harmful, or policy-violating outputs from your model before it reaches users. For LLMs entering any production environment, it's no longer optional — it's a regulatory expectation under the EU AI Act and an increasingly common requirement from enterprise customers conducting security due diligence.

Effective red teaming requires a taxonomy. Broad categories include: direct harmful instruction requests, indirect jailbreaks through roleplay or hypothetical framing, prompt injection via user-supplied context, multi-turn manipulation that gradually shifts model behaviour, and adversarial inputs targeting specific knowledge gaps. Each category requires different annotator expertise and different evaluation criteria.
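One way to keep coverage honest is to track attempts against the taxonomy as structured data. The category names below mirror the list above; the record fields and counting logic are assumptions for illustration.

```python
# Sketch: tracking red-team coverage and failure counts per taxonomy category.
from collections import Counter

TAXONOMY = [
    "direct_harmful_request",
    "roleplay_jailbreak",
    "prompt_injection",
    "multi_turn_manipulation",
    "knowledge_gap_adversarial",
]

def coverage_report(attempts):
    """attempts: iterable of dicts like {"category": "prompt_injection", "model_failed": True}."""
    tried = Counter(a["category"] for a in attempts)
    failed = Counter(a["category"] for a in attempts if a["model_failed"])
    return {c: {"attempts": tried.get(c, 0), "failures": failed.get(c, 0)} for c in TAXONOMY}
```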

The most important insight: your model's failure modes are domain-specific. A customer service LLM has very different vulnerability surfaces than a coding assistant or medical information tool. Generic red team prompts from public benchmarks will miss the failure modes that matter most for your deployment context. Effective red teaming requires custom adversarial prompts designed around your specific use case and user population.


Synthetic Data

Data Strategy
Synthetic Data Validation 6 min read
Synthetic Data · Validation

Synthetic Data Is Only as Good as Its Human Validation Layer

The promise of synthetic data is compelling: near-infinite scale, perfect label accuracy, controllable distribution. The reality is more complicated. Generated datasets consistently ship with three categories of problems that automated QA misses and that degrade real-world model performance in ways that are difficult to diagnose: distribution gaps, visual artefacts, and physics violations.

Distribution gaps occur when your synthetic data generator doesn't accurately model the statistical properties of real-world data. A synthetic pedestrian dataset generated from a 3D engine with limited character assets will underrepresent body size diversity, clothing variation, and motion patterns in ways that are invisible to automated metrics but immediately apparent to a trained human reviewer.

Visual artefacts — texture seams, incorrect shadow directions, physically impossible reflections — are reliably caught by human reviewers in seconds but missed by automated perceptual quality metrics. We've seen artefact rates above 12% in synthetic datasets that passed automated QA with scores above 95%.

The human validation layer we recommend: a stratified sample review covering at least 5% of generated images, reviewed by annotators with domain expertise in the target environment. For any artefact category appearing in more than 0.5% of reviewed samples, the full dataset should be flagged for generator-level correction before training begins.
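A sketch of that sampling and flagging logic, with illustrative strata and field names:

```python
# Sketch: stratified ~5% review sample, then flag any artefact category appearing
# in more than 0.5% of reviewed items.
import random
from collections import Counter

def stratified_sample(items, strata_key, fraction=0.05, seed=0):
    """items: dicts carrying a stratum tag, e.g. {"scene": "night_rain", ...}."""
    rng = random.Random(seed)
    by_stratum = {}
    for item in items:
        by_stratum.setdefault(item[strata_key], []).append(item)
    sample = []
    for stratum_items in by_stratum.values():
        k = max(1, round(len(stratum_items) * fraction))
        sample.extend(rng.sample(stratum_items, k))
    return sample

def flagged_artefacts(reviews, threshold=0.005):
    """reviews: list of dicts like {"artefacts": ["texture_seam"]} per reviewed image."""
    counts = Counter(a for r in reviews for a in r["artefacts"])
    return {a: c / len(reviews) for a, c in counts.items() if c / len(reviews) > threshold}
```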
