Compute Matthew Jaro's Jaro similarity (1989) and William Winkler's Jaro-Winkler refinement (1990) — the string-matching metrics designed for short proper nouns, typo correction, and record linkage. The Winkler tweak gives extra credit for sharing a common prefix (up to 4 chars), which is why it dominates name-matching in census and address-cleaning systems.
Jaro counts a character as matching if the same character appears in the other string within a window of ⌊max(|s₁|,|s₂|)/2⌋ − 1 positions. Matched characters that appear in the same order in both strings count in full; each matched character that appears out of order counts as half a "transposition."
Below: prefix-bonus chars are the leading characters both strings share (aligned from position 0), matched chars fall inside the window, transposed chars are matched but out of order, and grey chars are unmatched.
For strings s₁, s₂ with m matches, let t be the number of matched characters that are out of order. Each swapped pair adds 2 to t, so t/2 is the number of transpositions:
jaro = ⅓ · ( m/|s₁| + m/|s₂| + (m − t/2)/m ) if m > 0
= 0 if m = 0
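The match window is ⌊max(|s₁|,|s₂|)/2⌋ − 1. Matching is one-to-one: once a character in s₂ has been matched to a character in s₁, it cannot be matched again, so a repeated letter in one string needs its own partner in the other.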
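The definition above can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function name `jaro`, the greedy left-to-right matching, the clamping of the window at 0 for very short strings, and the choice to return 0 when either string is empty are all assumptions made here.

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity in [0, 1], following the formula above (a sketch)."""
    if not s1 or not s2:
        # Assumption: an empty string gives m = 0, so the score is 0.
        return 0.0

    # Match window; clamped at 0 so 1- to 3-character strings still work.
    window = max(max(len(s1), len(s2)) // 2 - 1, 0)

    matched1 = [False] * len(s1)
    matched2 = [False] * len(s2)
    m = 0
    for i, ch in enumerate(s1):
        lo = max(0, i - window)
        hi = min(len(s2), i + window + 1)
        for j in range(lo, hi):
            # Each character of s2 may be claimed only once.
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                m += 1
                break
    if m == 0:
        return 0.0

    # Read the matched characters of each string in order; every position
    # where the two sequences disagree is one out-of-order character (t).
    seq1 = [c for c, flag in zip(s1, matched1) if flag]
    seq2 = [c for c, flag in zip(s2, matched2) if flag]
    t = sum(a != b for a, b in zip(seq1, seq2))

    return (m / len(s1) + m / len(s2) + (m - t / 2) / m) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))  # 0.9444
```

For "MARTHA" and "MARHTA" all six characters match (m = 6) and the T and H are out of order (t = 2), giving ⅓ · (1 + 1 + 5/6).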
Winkler observed that human-entered names get the start of the surname right far more often than the end. The Winkler variant adds a bonus proportional to the length of the common prefix (up to ℓ chars, conventionally 4):
jaro_winkler = jaro + ℓ_common · p · (1 − jaro)
where ℓ_common is the length of the common prefix (capped at ℓ) and p is the scaling factor. With ℓ = 4, p must be ≤ 0.25 so that ℓ · p ≤ 1 and the score stays in [0, 1]; the conventional value is p = 0.1. Some implementations only apply the boost if jaro ≥ threshold (often 0.7).
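The prefix bonus can be sketched as a small function that takes an already computed Jaro score. The name `winkler_boost` and its parameters `max_prefix` and `threshold` are hypothetical, chosen here for illustration; `threshold=None` means the boost is always applied.

```python
from typing import Optional


def winkler_boost(jaro_score: float, s1: str, s2: str, p: float = 0.1,
                  max_prefix: int = 4,
                  threshold: Optional[float] = None) -> float:
    """Apply the Winkler prefix bonus to a precomputed Jaro score (a sketch)."""
    if not 0 <= p * max_prefix <= 1:
        # Otherwise the boosted score could leave [0, 1].
        raise ValueError("p * max_prefix must lie in [0, 1]")

    # Optional gate: leave low scores untouched.
    if threshold is not None and jaro_score < threshold:
        return jaro_score

    # Length of the common prefix, capped at max_prefix.
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == max_prefix:
            break
        prefix += 1

    return jaro_score + prefix * p * (1 - jaro_score)


# "MARTHA" / "MARHTA" share the prefix "MAR"; their Jaro score is 17/18.
print(round(winkler_boost(17 / 18, "MARTHA", "MARHTA"), 4))  # 0.9611
```

Passing the Jaro score in, rather than recomputing it, keeps the bonus step separate from the matching step, which makes the effect of p and the threshold easy to see in isolation.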
| Metric | Best for | Watch out for |
|---|---|---|
| Jaro | Symmetric short-string similarity (≤ ~10 chars). | Ignores order beyond a small window; long strings drift toward 1. |
| Jaro-Winkler | Person names, surnames, business names. Heavily used in U.S. Census record linkage. | A typo in the first few characters costs more than the same typo near the end, because it forfeits the prefix bonus. |
| Levenshtein | Free-text typo correction, autocorrect, DNA. Counts insert/delete/sub operations. | Score grows with length — normalise by max length. |
| Damerau-Levenshtein | Like Levenshtein but counts a swap as one op — better for human typing. | Slightly slower. |
| Dice on bigrams | Document-length fuzzy match, code clone detection. | Less sensitive to character order than Jaro. |
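To put Levenshtein on the same 0-to-1 scale as Jaro, the table's advice to normalise by max length can be sketched like this. The function names and the convention similarity = 1 − distance / max length are assumptions for illustration; other normalisations exist.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / max(len(a), len(b))."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - levenshtein(a, b) / longest


print(levenshtein("MARTHA", "MARHTA"))  # 2
```

Plain Levenshtein scores the swapped "TH"/"HT" as two substitutions, which is the case Damerau-Levenshtein handles as a single operation.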
With p = 0.1, ℓ = 4:
jaro("MARTHA", "MARHTA") = 0.9444… ; winkler = 0.9611…
jaro("DWAYNE", "DUANE") = 0.8222… ; winkler = 0.8400…
jaro("DIXON", "DICKSONX") = 0.7666… ; winkler = 0.8133…