Corpus (click to download zip file)

LDC Catalog #

Languages

URL for description

WITHOUT SPEECH FILES

CELEX-2 (CEnter for LEXical Information)

LDC96L14

Dutch, British English, German

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC96L14

Treebank-2 (Brown, Wall Street Journal)

LDC95T7

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T7

TIPSTER Vol. 1 (Wall Street Journal, AP Newswire, etc.)

LDC93T3B

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC93T3B

ICE-GB (International Corpus of English Great Britain) (tagged and parsed)

N/A

British English

http://www.ucl.ac.uk/english-usage/projects/ice-gb/

CSPAE (Corpus of Spoken Professional American English) (tagged and untagged)

N/A

American English

http://www.athel.com/cspatg.html

COMLEX Syntax Text Corpus Version 2.0 (COMputational LEXicon)

LDC96T11

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC96T11
AND

http://nlp.cs.nyu.edu/comlex/

COMLEX English Syntax Lexicon

LDC98L21

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98L21
AND
http://nlp.cs.nyu.edu/comlex/

CALLHOME American English Lexicon/ PRONLEX (PRONunciation LEXicon)

LDC97L20

American English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC97L20

European Language Newspaper Text

LDC95T11

French, German, Portuguese

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T11

Hansard French/English

LDC95T20

Canadian English, French

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T20

North American News Text Corpus

LDC95T21

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC95T21

TDT (Topic Detection & Tracking) Pilot Study Corpus

LDC98T25

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98T25

WITH SPEECH FILES

1997 Broadcast News Speech Corpus (HUB4)
AND
Transcripts

LDC98S71
AND
LDC98T28

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S71
AND
http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98T28

HTIMIT (TIMIT re-recorded through handsets)

LDC98S67

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S67

HUB5 Mandarin Telephone Speech Corpus (from CALLFRIEND)
AND
Transcripts

LDC98S69
AND
LDC98T26

Mandarin

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S69
AND
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T26

HUB5 Spanish Telephone Speech Corpus (from CALLFRIEND)
AND
Transcripts

LDC98S70
AND
LDC98T27

Spanish

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98S70
AND
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC98T27

LLHDB (MIT Lincoln Laboratory Handset Database)

LDC98S68

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC98S68

Santa Barbara Corpus of Spoken American English Part I

LDC2000S85

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2000S85

Santa Barbara Corpus of Spoken American English Part II

LDC2003S06

English

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2003S06