Stanford Hazy Research · 2018–2019

Programmatic Data Labeling for Multi-Sentence Relation Extraction

We extended Snorkel's programmatic labeling framework beyond single sentences, using heuristic labeling functions to generate training data for cross-sentence relation extraction with no manual annotation. The result was a 12% F1 improvement over single-sentence baselines.

Felipe Felix Arias · Alex Ratner · Christopher Ré

UIUC · Stanford University

Snorkel overview diagram

The Problem

Relations span sentences. Labels don't.

Most relation extraction systems assume both entities appear in the same sentence. In practice, key relations often span multiple sentences, and manually labeling cross-sentence examples is expensive and slow. We needed a way to generate training labels programmatically at scale.

Approach: Programmatic Labeling Functions

Heuristics that write labels, not humans.

Using Snorkel's weak supervision framework, we wrote labeling functions: programmatic heuristics, based on dependency paths, sliding windows, and structural signals, that automatically label multi-sentence entity pairs. Snorkel's noise-aware generative model then combines the noisy, overlapping votes of these labeling functions into probabilistic training labels.
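To make the idea concrete, here is a minimal sketch in plain Python of what labeling functions over a cross-sentence entity pair might look like. Everything in it is an assumption made for illustration: the `Candidate` structure, the cue words, the distance threshold, and the three example functions are hypothetical and are not the project's actual heuristics. The final step uses a simple majority vote as a stand-in; it is not Snorkel's noise-aware generative model, which additionally estimates how accurate and how correlated the functions are.

```python
from dataclasses import dataclass
from typing import List

# Vote values: each labeling function returns one of these.
POSITIVE, NEGATIVE, ABSTAIN = 1, 0, -1


@dataclass
class Candidate:
    """Hypothetical entity pair drawn from a run of consecutive sentences."""
    sentences: List[str]
    e1: str
    e1_sent: int  # index of the sentence containing the first entity
    e2: str
    e2_sent: int  # index of the sentence containing the second entity


def lf_cue_word(c: Candidate) -> int:
    """Vote POSITIVE if an assumed cue phrase appears in either entity's sentence."""
    cues = ("causes", "induces", "leads to")
    text = f"{c.sentences[c.e1_sent]} {c.sentences[c.e2_sent]}".lower()
    return POSITIVE if any(cue in text for cue in cues) else ABSTAIN


def lf_far_apart(c: Candidate) -> int:
    """Vote NEGATIVE if the entities are more than two sentences apart (assumed threshold)."""
    return NEGATIVE if abs(c.e1_sent - c.e2_sent) > 2 else ABSTAIN


def lf_adjacent_pronoun(c: Candidate) -> int:
    """Vote POSITIVE if the entities sit in adjacent sentences and the later one opens with a pronoun."""
    if abs(c.e1_sent - c.e2_sent) != 1:
        return ABSTAIN
    words = c.sentences[max(c.e1_sent, c.e2_sent)].split()
    first = words[0].lower() if words else ""
    return POSITIVE if first in {"it", "this", "they"} else ABSTAIN


def majority_vote_prob(votes: List[int]) -> float:
    """Fraction of non-abstaining votes that are POSITIVE; 0.5 if every function abstained."""
    active = [v for v in votes if v != ABSTAIN]
    if not active:
        return 0.5
    return sum(v == POSITIVE for v in active) / len(active)


LFS = [lf_cue_word, lf_far_apart, lf_adjacent_pronoun]

cand = Candidate(
    sentences=[
        "Aspirin was given to the patient.",
        "It causes stomach bleeding in some cases.",
    ],
    e1="Aspirin", e1_sent=0,
    e2="stomach bleeding", e2_sent=1,
)
votes = [lf(cand) for lf in LFS]
prob = majority_vote_prob(votes)
print(votes, prob)
```

Each function either votes or abstains, so functions can overlap and disagree on the same pair. The output is a probability rather than a hard label, which is what lets a downstream model be trained on noisy supervision.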

Sliding window heuristic
Queue-based labeling
Union heuristic
Heuristic combinations
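The page names a sliding window heuristic without giving its details, so the following is only one plausible reading, sketched in Python: slide a fixed-size window over consecutive sentences and pair up entity mentions that fall inside it. The function name `sliding_window_pairs`, the `window` parameter, and the input format (one list of mention strings per sentence) are all assumptions for illustration, not the project's implementation.

```python
from itertools import combinations, product
from typing import List, Tuple

Pair = Tuple[int, str, int, str]  # (sentence index, mention, sentence index, mention)


def sliding_window_pairs(mentions: List[List[str]], window: int = 2) -> List[Pair]:
    """Pair mentions whose sentences lie within `window` consecutive sentences.

    `mentions[i]` holds the entity mentions found in sentence i.
    window=1 yields only same-sentence pairs; larger values add cross-sentence pairs.
    """
    pairs: List[Pair] = []
    for i in range(len(mentions)):
        # Pairs of distinct mentions inside sentence i.
        pairs.extend((i, a, i, b) for a, b in combinations(mentions[i], 2))
        # Pairs between sentence i and each later sentence still inside the window.
        for j in range(i + 1, min(i + window, len(mentions))):
            pairs.extend((i, a, j, b) for a, b in product(mentions[i], mentions[j]))
    return pairs


# Hypothetical document: four sentences, the third with no mentions.
doc_mentions = [["aspirin"], ["bleeding", "ulcer"], [], ["nausea"]]
narrow = sliding_window_pairs(doc_mentions, window=2)
wide = sliding_window_pairs(doc_mentions, window=3)
print(len(narrow), len(wide))
```

Widening the window trades recall for noise: more true cross-sentence pairs are captured, but so are more unrelated ones, which is one reason candidates like these would be passed to labeling functions rather than labeled positive outright.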

Results