Stanford Hazy Research · 2018–2019

Programmatic Data Labeling for Multi-Sentence Relation Extraction

We extended Snorkel's programmatic labeling framework beyond single sentences, using heuristic labeling functions to generate training data for cross-sentence relation extraction with no manual annotation. The result was a 12% F1 improvement over single-sentence baselines.

Felipe Felix Arias · Alex Ratner · Christopher Ré

UIUC · Stanford University

Poster · Snorkel AI

The Problem

Relations span sentences. Labels don't.

Most relation extraction systems assume both entities appear in the same sentence. In practice, key relations often span multiple sentences, and manually labeling cross-sentence examples is expensive and slow. We needed a way to generate training labels programmatically at scale.

Approach: Programmatic Labeling Functions

Heuristics that write labels, not humans.

Using Snorkel's weak supervision framework, we wrote labeling functions: programmatic heuristics based on dependency paths, sliding windows, and structural signals that automatically label multi-sentence entity pairs. Each function votes on a candidate pair or abstains, and Snorkel's noise-aware generative model combines these noisy, overlapping votes into probabilistic training labels.
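
To make the idea concrete, here is a minimal sketch in plain Python of what multi-sentence labeling functions could look like. It is not the project's code and does not use the Snorkel API: the `Candidate` type, the three functions, the cue words, and the window size are all hypothetical, chosen only for illustration. Dependency-path heuristics are left out because they need a parser. The final step is an unweighted vote fraction, a deliberately simple stand-in for Snorkel's generative model, which instead learns how much to trust each function.

```python
from dataclasses import dataclass
from typing import Sequence

# Label convention assumed for this sketch: -1 means "no opinion".
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


@dataclass(frozen=True)
class Candidate:
    """Hypothetical entity pair from a sentence-split, tokenized document."""
    sentences: Sequence[Sequence[str]]  # tokens for each sentence
    e1_sent: int                        # sentence index of the first entity
    e2_sent: int                        # sentence index of the second entity


def lf_sliding_window(c: Candidate, window: int = 2) -> int:
    """Sliding-window heuristic: entities far apart are unlikely to be related."""
    return NEGATIVE if abs(c.e1_sent - c.e2_sent) > window else ABSTAIN


def lf_cue_between(c: Candidate, cues=frozenset({"causes", "induces"})) -> int:
    """Vote POSITIVE if an (illustrative) cue word occurs in the spanned sentences."""
    lo, hi = sorted((c.e1_sent, c.e2_sent))
    tokens = {tok.lower() for sent in c.sentences[lo:hi + 1] for tok in sent}
    return POSITIVE if tokens & cues else ABSTAIN


def lf_coreferent_opening(c: Candidate) -> int:
    """Structural heuristic: the later of two adjacent sentences opens with a
    pronoun or demonstrative, hinting that it continues the earlier one."""
    if abs(c.e1_sent - c.e2_sent) != 1:
        return ABSTAIN
    later = c.sentences[max(c.e1_sent, c.e2_sent)]
    opener = later[0].lower() if later else ""
    return POSITIVE if opener in {"it", "this", "these", "they"} else ABSTAIN


def vote_fraction(votes: Sequence[int]) -> float:
    """Fraction of non-abstaining votes that are POSITIVE (0.5 if all abstain).
    A stand-in for a learned label model, not Snorkel's generative model."""
    active = [v for v in votes if v != ABSTAIN]
    if not active:
        return 0.5
    return sum(v == POSITIVE for v in active) / len(active)


doc = [
    ["Aspirin", "was", "given", "daily", "."],
    ["It", "causes", "stomach", "bleeding", "in", "some", "patients", "."],
]
cand = Candidate(sentences=doc, e1_sent=0, e2_sent=1)
LFS = [lf_sliding_window, lf_cue_between, lf_coreferent_opening]
votes = [lf(cand) for lf in LFS]
prob = vote_fraction(votes)
```

For the example pair, the window function abstains while the other two vote positive, so the pair receives a soft label that a downstream extraction model could train on. A learned label model would replace `vote_fraction` by weighting each function according to its estimated accuracy.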

Results

+12% F1 over single-sentence extraction baselines via novel multi-sentence labeling functions.
Multi-task learning across sentence spans boosted extraction quality on long-range relations.
Zero hand-labeled data: all training labels generated programmatically from heuristic functions.
Collaborated with Alex Ratner (now CEO of Snorkel AI) and Christopher Ré.