Corpus (click to download zip file) | LDC Catalog # | Languages | URL for description

WITHOUT SPEECH FILES

CELEX-2 (CEnter for LEXical Information) | LDC96L14 | Dutch, British English, German | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC96L14
Treebank-2 (Brown, Wall Street Journal) | LDC95T7 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T7
TIPSTER Vol. 1 (Wall Street Journal, AP Newswire, etc.) | LDC93T3B | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC93T3B
ICE-GB (International Corpus of English Great Britain) (tagged and parsed) | N/A | British English | http://www.ucl.ac.uk/english-usage/projects/ice-gb/
CSPAE (Corpus of Spoken Professional American English) (tagged and untagged) | N/A | American English | http://www.athel.com/cspatg.html
COMLEX Syntax Text Corpus Version 2.0 (COMputational LEXicon) | LDC96T11 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC96T11 AND http://nlp.cs.nyu.edu/comlex/
COMLEX English Syntax Lexicon | LDC98L21 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98L21 AND http://nlp.cs.nyu.edu/comlex/
CALLHOME American English Lexicon / PRONLEX (PRONunciation LEXicon) | LDC97L20 | American English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC97L20
European Language Newspaper Text | LDC95T11 | French, German, Portuguese | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T11
Hansard French/English | LDC95T20 | Canadian English, French | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T20
North American News Text Corpus | LDC95T21 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T21
TDT (Topic Detection & Tracking) Pilot Study Corpus | LDC98T25 | English |

WITH SPEECH FILES

1997 Broadcast News Speech Corpus (HUB4), Transcripts | LDC98S71, LDC98T28 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S71 AND http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98T28
HTIMIT (TIMIT re-recorded through handsets) | LDC98S67 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S67
HUB5 Mandarin Telephone Speech Corpus (from CALLFRIEND), Transcripts | LDC98S69, LDC98T26 | Mandarin | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S69 AND http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T26
HUB5 Spanish Telephone Speech Corpus (from CALLFRIEND), Transcripts | LDC98S70, LDC98T27 | Spanish | http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98S70 AND http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T27
LLHDB (MIT Lincoln Laboratories Handset Database) | LDC98S68 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S68
Santa Barbara Corpus of Spoken American English Part I | LDC2000S85 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2000S85
Santa Barbara Corpus of Spoken American English Part II | LDC2003S06 | English | http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2003S06