Tutorial Series on NLP: Catalogue of Language Resources and Tools in Japan


Information Extraction refers to the automatic extraction of structured information such as entities, relationships between entities, and attributes describing entities from unstructured sources.

In this series of blog posts I will walk through several tutorials that explain what Information Extraction tasks consist of and give you basic code samples that you can build on.


Categories listed and briefly described in later sections:

  1. Newspaper
  2. Annotated corpus
  3. Unannotated corpus
  4. Thesaurus
  5. Lexicon
  6. Text(misc.)
  7. Speech
  8. Morphological analyzer
  9. Parser
  10. Annotation tool
  11. Visualization tool
  12. Search tool
  13. Machine Learning
  14. Tool(misc.)

Newspaper

  • Mainichi Shimbun CD-ROM
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Mainichi Shimbun newspaper articles for 1991–2001.
    Annotation.document: keyword
    Creator: Mainichi Shimbun Publisher, Japan
    Contact person: Nichigai Associates Co., Ltd. (data-sale(at)nichigai.co.jp)
    Price: 126,000 JPY per year
    Subject.language: Japanese
    Date: 1991–2001
    Format: 1 or 2 CD-ROM per year.
    Format.encoding: Shift_JIS
    URI: http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html (in Japanese)
    Usage Case: (Show in new window in Japanese)
  • Mainichi Shimbun CD-ROM (1991)
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Mainichi Shimbun newspaper articles for 1991. Approximately 100,000 articles.
    Annotation.document: keyword
    Creator: Mainichi Shimbun Publisher, Japan
    Contact person: Nichigai Associates Co., Ltd. (data-sale(at)nichigai.co.jp)
    Price: 126,000 JPY
    Subject.language: Japanese
    Date: 1991
    Format: 1 CD-ROM.
    Format.encoding: Shift_JIS
    URI: http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html (in Japanese)
  • Mainichi Shimbun CD-ROM (1992)
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Mainichi Shimbun newspaper articles for 1992. Approximately 100,000 articles.
    Annotation.document: keyword
    Creator: Mainichi Shimbun Publisher, Japan
    Contact person: Nichigai Associates Co., Ltd. (data-sale(at)nichigai.co.jp)
    Price: 126,000 JPY
    Subject.language: Japanese
    Date: 1992
    Format: 1 CD-ROM.
    Format.encoding: Shift_JIS
    URI: http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html (in Japanese)
  • Mainichi Shimbun CD-ROM (1993)
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Mainichi Shimbun newspaper articles for 1993. Approximately 100,000 articles.
    Annotation.document: keyword
    Creator: Mainichi Shimbun Publisher, Japan
    Contact person: Nichigai Associates Co., Ltd. (data-sale(at)nichigai.co.jp)
    Price: 126,000 JPY
    Subject.language: Japanese
    Date: 1993
    Format: 1 CD-ROM.
    Format.encoding: Shift_JIS
    URI: http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html (in Japanese)
  • Mainichi Shimbun CD-ROM (1994)
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Mainichi Shimbun newspaper articles for 1994. Approximately 100,000 articles.
    Annotation.document: keyword
    Creator: Mainichi Shimbun Publisher, Japan
    Contact person: Nichigai Associates Co., Ltd. (data-sale(at)nichigai.co.jp)
    Price: 126,000 JPY
    Subject.language: Japanese
    Date: 1994
    Format: 1 CD-ROM.
    Format.encoding: Shift_JIS
    URI: http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html (in Japanese)
  • Mainichi Shimbun CD-ROM (1995)
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Mainichi Shimbun newspaper articles for 1995. Approximately 100,000 articles.
    Annotation.document: keyword
    Creator: Mainichi Shimbun Publisher, Japan
    Contact person: Nichigai Associates Co., Ltd. (data-sale(at)nichigai.co.jp)
    Price: 126,000 JPY
    Subject.language: Japanese
    Date: 1995
    Format: 1 CD-ROM.
    Format.encoding: Shift_JIS
    URI: http://www.nichigai.co.jp/sales/mainichi/mainichi-data.html (in Japanese)
  • Nihon Keizai Shimbun CD-ROM
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Nihon Keizai Shimbun newspaper articles for 1990–2000. Purchase details can be viewed at http://www.nikkeish.co.jp/gengo/zenbun.htm.
    Annotation.document: keyword
    Creator: Nihon Keizai Shimbun Inc., Japan
    Contact person: Nikkei Shuppan Hanbai Co., Ltd. (eizo(at)nikkeish.co.jp)
    Price: 136,500 JPY per year
    Subject.language: Japanese
    Date: 1990–2000
    Format: 1 CD-ROM per year.
    URI: http://www.nikkeish.co.jp/shop/top.aspx (in Japanese)
    Usage Case: (Show in new window in Japanese)
  • Nihon Keizai Sangyo, Kin’yu, Ryutsu Shimbun CD-ROM
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Nihon Keizai Sangyo, Kin’yu, Ryutsu Shimbun newspaper articles for 1994–2000. Purchase details can be viewed at http://www.nikkeish.co.jp/gengo/zenbun.htm.
    Annotation.document: keyword
    Creator: Nihon Keizai Shimbun Inc., Japan
    Contact person: Nikkei Shuppan Hanbai Co., Ltd. (eizo(at)nikkeish.co.jp)
    Price: 136,500 JPY per year
    Subject.language: Japanese
    Date: 1994–2000
    Format: 1 CD-ROM per year.
    URI: http://www.nikkeish.co.jp/shop/top.aspx (in Japanese)
  • Yomiuri Shimbun CD-ROM (Japanese)
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Japanese newspaper articles of Yomiuri Shimbun for 1987–2001. Approximately 110,000 articles per year for 1987–1997, 230,000 for 1998–2000, 340,000 for 2001. Purchase details can be viewed at http://www.ndk.co.jp/yomiuri/.
    Annotation.document: keyword
    Creator: The Yomiuri Shimbun, Japan
    Contact person: Nihon Database Kaihatsu Co., Ltd. (yomiuri(at)ndk.co.jp)
    Price: 120,000–270,000 JPY per year (academic), 190,000–490,000 JPY per year (general)
    Subject.language: Japanese
    Date: 1987–2005
    Format: 1 or 2 CD-ROM per year.
    Format.encoding: Shift_JIS
    URI: http://www.ndk.co.jp/yomiuri/ (in Japanese)
    Usage Case: (Show in new window in Japanese)
  • Yomiuri Shimbun CD-ROM (English)
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of English newspaper articles of Yomiuri Shimbun for 1989–2001. Approximately 9,000 articles per year. Purchase details can be viewed at http://www.ndk.co.jp/yomiuri/.
    Creator: The Yomiuri Shimbun, Japan
    Contact person: Nihon Database Kaihatsu Co., Ltd. (yomiuri(at)ndk.co.jp)
    Price: 110,000–170,000 JPY per year (academic), 190,000–270,000 JPY per year (general)
    Subject.language: English
    Date: 1989–2005
    Format: 1 CD-ROM per year.
    URI: http://www.ndk.co.jp/yomiuri/ (in Japanese)
  • Asahi Shimbun CD-ROM
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Raw text of Asahi Shimbun newspaper articles for 1984–2005. Approximately 100,000 articles per year.
    Creator: Asahi Shimbun, Japan
    Contact person: Nichigai Associates Co., Ltd. (data-sale(at)nichigai.co.jp)
    Price: 126,000–189,000 JPY per year
    Subject.language: Japanese
    Date: 1984–2005
    Format: 1 CD-ROM per year.
    Usage Case: (Show in new window in Japanese)
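
Most of the newspaper CD-ROMs listed above are distributed in Shift_JIS, so the text usually needs to be decoded before modern tooling can read it. The following is a minimal sketch, assuming Python 3 and its built-in `shift_jis` and `euc_jp` codecs. The function name `to_utf8` and the sample strings are hypothetical illustrations and not part of any product above; real CD-ROM files may contain vendor-specific characters that need the `cp932` codec or an explicit `errors=` policy instead.

```python
def to_utf8(raw: bytes, source_encoding: str = "shift_jis") -> bytes:
    """Decode bytes from a legacy Japanese encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # raises UnicodeDecodeError on invalid bytes
    return text.encode("utf-8")


# Stand-in for bytes read from a Shift_JIS file (hypothetical sample, not real corpus data).
sample_sjis = "毎日新聞".encode("shift_jis")
print(to_utf8(sample_sjis).decode("utf-8"))

# The same helper covers resources that list EUC-JP as their encoding.
sample_euc = "京都".encode("euc_jp")
print(to_utf8(sample_euc, "euc_jp").decode("utf-8"))
```

Passing the encoding as a parameter keeps one helper usable for both the Shift_JIS and the EUC-JP resources in this catalogue.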

Annotated corpus

  • RWC Text Database
    Type: Collection
    Description: A collection of text databases developed by RWCP.
    Distribution of this corpus is now suspended.
    Creator: Real World Computing Partnership, Japan
    Subject.Language: Japanese
    Language: Japanese
    Date: 1998
    Format: 381 MB.
    Format.encoding: EUC-JP
    Usage Case: (Show in new window in Japanese)
  • RWC-DB-TEXT-94-1
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Morphologically analyzed data of MITI (Ministry of International Trade and Industry, Japan) white papers for 1993–1995, manually post-edited. 
    Distribution of this corpus is now suspended.
    Annotation.corpus: word segmentation, part-of-speech
    Creator: Real World Computing Partnership, Japan
    Subject.Language: Japanese
    Language: Japanese
    Date: 1994
    Format: 8.1 MB.
    Format.encoding: EUC-JP
  • RWC-DB-TEXT-94-2
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Morphologically analyzed data of the Japan Electronics Industry Development Agency’s annual report, survey report on the trend of natural language processing. Manually post-edited. 
    Distribution of this corpus is now suspended.
    Annotation.corpus: word segmentation, part-of-speech
    Creator: Real World Computing Partnership, Japan
    Subject.Language: Japanese
    Language: Japanese
    Date: 1994
    Format: 2.1 MB.
    Format.encoding: EUC-JP
  • RWC-DB-TEXT-95-3
    Type: Text
    Type.linguistics: annotation/text categorization
    Description: Articles tagged with UDC (30,000 articles from Mainichi Shimbun 1994).
    Distribution of this corpus is now suspended.
    Annotation.document: text category
    Creator: Real World Computing Partnership, Japan
    Subject.Language: Japanese
    Date: 1995
    Format: 1 MB.
  • RWC-DB-TEXT-96-2
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Morphologically analyzed data of Iwanami Japanese Dictionary (5th edition) with index tags. Manually post-edited. 
    Distribution of this corpus is now suspended.
    Annotation.corpus: word segmentation, part-of-speech
    Creator: Real World Computing Partnership, Japan
    Subject.Language: Japanese
    Language: Japanese
    Date: 1996
    Format: 40.6 MB.
    Format.encoding: EUC-JP
  • RWC-DB-TEXT-97-1
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Differential data from the morphological analysis of the CD-Mainichi Shimbun (all articles from 1991–1995).
    Distribution of this corpus is now suspended.
    Annotation.corpus: word segmentation, part-of-speech
    Creator: Real World Computing Partnership, Japan
    Subject.Language: Japanese
    Date: 1997
    Rights: research purpose
    Format: 280.5 MB.
  • CRL-DB-TEXT-97-1
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Syntactically analyzed sentences of RWC-DB-TEXT-95-2. Manually post-edited.
    Annotation.corpus: syntax
    Creator: Communications Research Laboratory, Japan
    Subject.Language: Japanese
    Language: Japanese
    Date: 1997
    Source: jp:rwc95-2
    Format: 40 MB.
    Format.encoding: EUC-JP
    Relation
    URI: http://www.rwcp.or.jp/wswg/rwcdb/text/ (in Japanese)
  • EDR Japanese Corpus
    Type: Text
    Type.linguistics: annotation/corpus
    Description: The linguistic data in the EDR Corpus was obtained by collecting a large number of example sentences and analyzing them on morphological, syntactic, and semantic levels. The Japanese Corpus contains approximately 200,000 sentences. Ver. 4.0 was released in 2010.
    Annotation.corpus: word segmentation, part-of-speech, syntax, word sense
    Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
    Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
    Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
    Subject.Language: Japanese
    Language: Japanese
    Format: 355 MB. 200,000 sentences.
    Format.encoding: EUC-JP
    URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
    Usage Case: (Show in new window in Japanese)
  • EDR English Corpus
    Type: Text
    Type.linguistics: annotation/corpus
    Description: The linguistic data in the EDR Corpus was obtained by collecting a large number of example sentences and analyzing them on morphological, syntactic, and semantic levels. The English Corpus contains approximately 120,000 sentences. Ver. 4.0 was released in 2010.
    Annotation.corpus: word segmentation, part-of-speech, syntax, word sense
    Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
    Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
    Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
    Subject.Language: English
    Language: English, Japanese
    Format: 218 MB. 120,000 sentences.
    Format.encoding: EUC-JP
    URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
    Usage Case: (Show in new window in Japanese)
  • Kyoto University Text Corpus
    Type: Text
    Type.linguistics: annotation/corpus
    Description: Morphologically and syntactically annotated corpus of 40,000 sentences from Mainichi Shimbun newspaper articles for 1995. Of these, 5,000 sentences are also annotated with case, anaphora and coreference information. The annotation is manually post-edited. Due to copyright restrictions, users can obtain only the annotation data for free and must purchase the “Mainichi Shimbun 1995 CD-ROM” to reconstruct the original corpus.
    Annotation.corpus: word segmentation, part-of-speech, syntax, case, anaphora, coreference
    Creator: Kurohashi and Kawahara laboratory, Kyoto University
    Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
    Price: free
    Subject.Language: Japanese
    Language: Japanese
    Format: 6 MB.
    Format.encoding: EUC-JP
    URI: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?Kyoto%20University%20Text%20Corpus
    Usage Case: (Show in new window in Japanese)
  • JEITA multimodal dialog corpus
    Type: Text
    Type.linguistics: transcription/
    Description: Task-oriented dialog corpus of conversations between two people. It contains 80 minutes of video covering 9 dialogs in two tasks, the ‘face task’ and the ‘traveling task’. Transcriptions of the dialogs are also included and are annotated with tags for dialog structure, syntax, coreference, prosody and facial expression. It can be obtained through GSK.
    Annotation.corpus: word segmentation, part-of-speech, syntax, dialog structure, coreference, prosody, facial expression
    Creator: Japan Electronics and Information Technology Industries Association (JEITA)
    Contact person: GSK (Gengo Shigen Kyokai)
    Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
    Subject.Language: Japanese
    Format: 2 CD-ROM
    URI: http://www.gsk.or.jp/en/catalog/gsk2007-a/
  • IREX public data and tools (final version)
    Type: Text
    Description: Test collection for IR (Information Retrieval) and IE (Information Extraction) developed for the IREX workshop held in 1999.
    Creator: IREX committee
    Contact person: IREX committee
    Price: free
    Subject.Language: Japanese
    Date: 1999
    Format: gzipped file, 2.82MB
    Format.encoding: EUC-JP
    URI: http://nlp.cs.nyu.edu/irex/index-e.html
    Usage Case: (Show in new window in Japanese)
  • NTCIR test collection
    Type: Text
    Description: Test collection of information retrieval, information extraction, question answering and summarization developed by NTCIR project.
    Creator: NTCIR project
    Contact person: NTCIR secretariat, Research Center for Information Resources, National Institute of Informatics: ntc-secretariat(at)nii.ac.jp
    Price: free
    Language: Japanese
    Date: 1999–2007
    Format: CD-ROM
    Format.encoding: EUC-JP
    URI: http://research.nii.ac.jp/ntcir/index-en.html
    Usage Case: (Show in new window in Japanese)
  • KNB corpus (Kyoto-University and NTT Blog corpus)
    Type: Text
    Description: Analyzed blog corpus consisting of 4,186 sentences in 249 articles on four themes (sightseeing in Kyoto, mobile phones, sports, gourmet). It is manually annotated with morphological, syntactic, case, ellipsis and opinion tags. (Distribution is now suspended.)
    Annotation.corpus: word segmentation, part-of-speech, syntax, case, ellipsis, opinion information
    Creator: Kyoto University, NTT Communication Science Laboratories
    Contact person: Kurohashi and Kawahara Laboratory, Kyoto University
    Price: free
    Subject.Language: Japanese
    Format.encoding: EUC-JP
    Usage Case: (Show in new window in Japanese)
  • News Article GDA Corpus 2004
    Type: Text
    Type.linguistics: annotation/corpus
    Description: 3,000 newspaper articles (about 37,000 sentences, 910,000 words) annotated with morphological information, syntactic structures and word senses. All annotations are manually revised. The corpus is compiled in GDA (Global Document Annotation) format. The data contains only the annotation metadata, not the original text; restoring the complete corpus with the text requires the Mainichi Shimbun CD-ROM (1994).
    Annotation.corpus: word segmentation, part-of-speech, syntax, word sense, co-reference
    Creator: Mitsubishi Electric Corporation
    Contact person: GSK (Gengo Shigen Kyokai)
    Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
    Subject.Language: Japanese
    Date: 2010/2
    Rights: Research/Education purpose only
    Format: 1 CD-ROM (60.7 MB)
    Format.encoding: Shift_JIS
    URI: http://www.gsk.or.jp/en/catalog/gsk2009-b/
    Usage Case: (Show in new window in Japanese)
  • Annotated Corpus of Iwanami Japanese Dictionary Fifth Edition 2004
     Type: Text
     Description: The corpus of Iwanami Japanese Dictionary Fifth Edition consisting of 56,000 headwords. It is annotated with morphological information, syntactic structures and word senses defined by the dictionary itself. All annotations are manually revised. It is compiled in GDA (Global Document Annotation) format. It contains about 198,000 sentences, 1,120,000 words.
     Annotation.corpus: word segmentation, part-of-speech, syntax, word sense, co-reference
     Creator: Iwanami Shoten Publishers, Mitsubishi Electric Corporation
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.Language: Japanese
     Date: 2010/5
     Rights: Research/Education purpose only
     Format: 1 CD-ROM (255MB)
     Format.encoding: Shift_JIS
     URI: http://www.gsk.or.jp/en/catalog/gsk2010-a/
     Usage Case: (Show in new window in Japanese)
  • Balanced Corpus of Contemporary Written Japanese
     Type: Text
     Type.linguistics: annotation/
     Description: A balanced corpus of randomly sampled contemporary written Japanese texts. It consists of a publication-based sub-corpus (35 million words), a library-based sub-corpus (30 million words) and a non-sampling sub-corpus (35 million words). Part of the corpus is annotated with manually edited morphological tags.
     Creator: National Institute for Japanese Language and Linguistics
     Contact person: National Institute for Japanese Language and Linguistics (kotonoha(at)ninjal.ac.jp)
     Subject.Language: Japanese
     Date: 2006-
     URI: http://www.ninjal.ac.jp/kotonoha/index.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Corpus of Spontaneous Japanese
     Type: Text
     Description: A database of spontaneous Japanese speech with annotations for speech processing research. It consists of spontaneous speech recordings (660 hours), their transcriptions (7 million words) and part-of-speech tags. The core data (45 hours, 500 thousand words) is additionally annotated with articulation and intonation labels.
     Creator: National Institute for Japanese Language and Linguistics, National Institute of Information and Communications Technology, Tokyo Institute of Technology
     Contact person: National Institute for Japanese Language and Linguistics
     Price: 25,000 JPY (academic, students), 50,000 JPY (academic, university/research institute), 250,000 JPY (academic, company), negotiable (commercial use); tax excluded
     Subject.Language: Japanese
     URI: http://www.ninjal.ac.jp/products-k/katsudo/seika/corpus/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • NAIST Text Corpus
     Type: Text
     Description: A Japanese corpus consisting of 40,000 sentences excerpted from Mainichi Shimbun 1995 articles, the same sentences as in the Kyoto University Text Corpus, annotated with co-reference and predicate-argument relations. Only the annotations are publicly available.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology
     Price: free
     Subject.Language: Japanese
     Date: 2006-
     URI: http://cl.naist.jp/nldata/corpus/ (in Japanese)
  • Japanese Corpus Annotated with Semantic Relations between Arguments Version 1.0
     Type: Text
     Type.linguistics: annotation/corpus
     Description: A corpus annotated with semantic relations between arguments.
     Creator: Inui-Okazaki Laboratory, Tohoku University
     Contact person: Inui-Okazaki Laboratory, Tohoku University
     Subject.Language: Japanese
     Rights: Contract required
     URI: http://www.cl.ecei.tohoku.ac.jp/stmap/sem_corpus.html (in Japanese)
  • OpenMWE for Japanese — Corpus
     Type: Text
     Type.linguistics: annotation/corpus
     Description: A corpus for the ‘idiom identification task’ (deciding whether an expression is used as an idiom or in its literal meaning). Each example sentence is annotated with the label ‘idiom’ or ‘literal meaning’. 1,000 example sentences are collected per idiom.
     Creator: Chikara Hashimoto, Daisuke Kawahara
     Contact person: Chikara Hashimoto, Daisuke Kawahara
     Price: Free
     Subject.Language: Japanese
     URI: http://openmwe.sourceforge.jp/pukiwiki-j/index.php?Corpus (in Japanese)
  • JEC Basic Sentence Data
     Type: Text
     Description: Basic Japanese sentences automatically extracted on the basis of the Kyoto University Case Frame data. It contains 5,304 manually revised sentences, together with manual translations of the Japanese basic sentences into English and Chinese.
     Creator: Kurohashi-Kawahara Lab., Kyoto University / NICT MASTAR Project, Multilingual Translation Lab.
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: free
     Subject.Language: Japanese, English, Chinese
     Date: 2011
     Rights: Creative Commons Attribution 3.0 Unported
     Format: Excel file
     URI: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JEC%20Basic%20Sentence%20Data
     Usage Case: (Show in new window in Japanese)
  • Konan-JIEM Learner Corpus Fourth Edition
     Type: Text
     Description: It consists of 233 essays written by Japanese college students, all manually annotated with grammatical errors, POS tags and phrase structures. It also includes the error detection/correction results obtained from the systems at the Error Detection and Correction Workshop (EDCW2012).
     Annotation.corpus: part-of-speech, syntax, error correction
     Creator: Nagata Laboratory, Konan University and The Japan Institute for Educational Measurement Inc. (JIEM)
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.Language: English
     Date: 2015/3
     Rights: No commercial use. Research/Education purpose only.
     Format: 1 CD-ROM
     URI: http://www.gsk.or.jp/en/catalog/gsk2015-a/
  • Dummy Electronic Health Record Text Data
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Texts excerpted from simulated clinical records. Texts written by a doctor are annotated with age, symptom, hospital, location, person, date, etc. The expiration date for use is March 31, 2016.
     Creator: The Joint Use Conference for Electronic Health Care Education (JUCEE), Aramaki Laboratory (Center for Knowledge Structuring, The University of Tokyo)
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: free (GSK member), 32,400JPY (GSK non-member)
     Subject.Language: Japanese
     Date: 2013/2
     Format: 1 file (220KB, zip archive)
     Format.encoding: UTF-8
     URI: http://www.gsk.or.jp/en/catalog/gsk2012-d/
  • REX Corpora
     Type: Text
     Type.linguistics: annotation/corpus
     Description: The REX corpora consist of 6 multimodal corpora of referring expressions in collaborative puzzle solving dialogues. The corpora have two notable features, namely (1) they include time-aligned extra-linguistic information (dialogue speech, movies of puzzle solving processes, participant’s mouse operations and eye-gaze) on top of linguistic information (transcribed utterances, referring expressions for puzzle pieces), (2) dialogues were collected with various configurations in terms of the puzzle type, hinting and language.
     Creator: Tokyo Institute of Technology (Tokunaga Laboratory)
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members (for educational and research purposes); 216,000 JPY for members of GSK, 432,000 JPY for non-members (for commercial use; contract required)
     Subject.Language: Japanese, English
     Date: 2013/5
     Format: 1 USB flash drive (14.9GB)
     Format.encoding: UTF-8
     URI: http://www.gsk.or.jp/en/catalog/gsk2013-a/
     Usage Case: (Show in new window in Japanese)
  • ASPEC (Asian Scientific Paper Excerpt Corpus)
     Type: Text
     Type.linguistics: annotation/corpus
     Description: It is a large bilingual scientific paper corpus consisting of a Japanese-English paper abstract corpus of 3M parallel sentences (ASPEC-JE) and a Japanese-Chinese paper excerpt corpus of 680K parallel sentences (ASPEC-JC). ASPEC-JE was constructed from Japanese-English scientific paper abstracts, which are the property of JST. NICT automatically created the 1-to-1 sentence alignments. ASPEC-JC was constructed by manually translating Japanese scientific papers into Chinese.
     Annotation.corpus: sentence alignment
     Creator: The Japan Science and Technology Agency (JST), The National Institute of Information and Communications Technology (NICT)
     Contact person: The Japan Science and Technology Agency (JST)
     Price: Free
     Subject.Language: Japanese, English, Chinese
     Date: 2014.1
     Rights: Research purpose only
     URI: http://orchid.kuee.kyoto-u.ac.jp/ASPEC/
     Usage Case: (Show in new window in Japanese)
  • Extended Named Entity Tagged Corpus
     Type: Text
     Type.linguistics: annotation/corpus
     Description: The core data of the “Balanced Corpus of Contemporary Written Japanese (BCCWJ)” (about 2,000 documents) and the newspaper article collection “Mainichi Shimbun CD-ROM 1995” (about 8,000 documents) are manually annotated with named entity tags defined in Sekine’s extended named entity hierarchy. There are 43,000 entities (100,000 tokens) in BCCWJ and 60,000 entities (240,000 tokens) in the newspaper data. The text itself is not included; only the annotations are provided to users.
     Annotation.corpus: named entity
     Creator: Tokyo Institute of Technology
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: free (GSK member), 32,400JPY (GSK non-member)
     Subject.language: Japanese
     Date: 2015/3
     Format: 1 CD-R
     URI: http://www.gsk.or.jp/catalog/gsk2014-a/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Extended Named Entity & Wikipedia Data
     Type: Text
     Description: Text data of all entries in Japanese Wikipedia annotated with extended named entities. 20,000 entries are manually annotated, while the rest of the entries are tagged by machine learning method.
     Annotation.corpus: named entity
     Creator: Language Craft
     Contact person: Language Craft (enew(at)languagecraft.com)
     Price: charged
     Subject.language: Japanese
     Date: 2016.3
     Format.markup: JSON
     URI: http://www.languagecraft.com/enew/ (in Japanese)
  • Kyutech corpus
     Type: Text
     Type.linguistics: annotation
     Description: A collection of dialogs in which four people hold discussions to reach a decision. Each transcribed utterance is annotated with topic tags. A summary of each discussion is also included.
     Creator: Shimada Laboratory, Kyushu Institute of Technology
     Contact person: Shimada Laboratory, Kyushu Institute of Technology
     Price: Free
     Subject.language: Japanese
     Date: 2015
     Rights: CC-BY-ND
     URI: http://www.pluto.ai.kyutech.ac.jp/~shimada/resources.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • JAIST Annotated Free Conversation Corpus
     Type: Text
     Type.linguistics: annotation
     Description: A corpus of free conversation between two people, where each utterance is annotated with a dialog act and a sympathy tag. Nine categories are used for dialog acts: Self-disclosure, Question(YesNo), Question(What), Response(YesNo), Response(Declaration), Backchannel, Filler, Confirmation and Request. Three categories are used for the sympathy tags: Sympathy, Antipathy and Neutral. They indicate whether a speaker shows sympathy or antipathy toward the partner. The corpus contains 97 dialogs (chats) and 92,020 utterances.
     Annotation.corpus: dialog act, sympathy
     Creator: Shirai Laboratory, Japan Advanced Institute of Science and Technology (JAIST)
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members (for educational and research purposes); 216,000 JPY for members of GSK, 432,000 JPY for non-members
     Subject.language: Japanese
     Date: 2017/07
     Rights: Research purpose only
     Format: 1 CD-ROM
     Format.encoding: EUC-JP
     Format.markup: tab-separated sheet
     URI: http://www.gsk.or.jp/en/catalog/gsk2017-b/
     Usage Case: (Show in new window in Japanese)
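
Every catalogue entry above follows the same shape: a bullet line with the resource name, followed by indented `Field: value` lines. A minimal sketch of turning that layout into Python dictionaries, for example to list the free resources, might look like the following. The function name `parse_entries`, the lowercase key convention (the catalogue mixes `Subject.language` and `Subject.Language`), and the shortened sample text are assumptions made for illustration; lines without a field label are simply skipped in this sketch.

```python
def parse_entries(text: str) -> list[dict[str, str]]:
    """Parse bullet-plus-'Field: value' catalogue text into one dict per entry."""
    entries: list[dict[str, str]] = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue
        if stripped.startswith("•"):
            # A bullet starts a new entry; the rest of the line is the resource name.
            entries.append({"name": stripped.lstrip("•").strip()})
        elif entries and ": " in stripped:
            # Split on the first ': ' only, so URLs in the value stay intact.
            key, value = stripped.split(": ", 1)
            entries[-1][key.strip().lower()] = value.strip()
    return entries


# Shortened sample in the catalogue's layout (fields taken from the entries above).
sample = """
  • Kyoto University Text Corpus
    Type: Text
    Price: free
    Format.encoding: EUC-JP
  • EDR Japanese Corpus
    Type: Text
    Price: 50,000 JPY (academic)
    Format.encoding: EUC-JP
"""

free = [e["name"] for e in parse_entries(sample) if e.get("price", "").lower() == "free"]
print(free)  # ['Kyoto University Text Corpus']
```

Lowercasing the keys and comparing the price case-insensitively lets the same filter catch both `Price: free` and `Price: Free`, which both occur in the listing.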

Unannotated corpus

  • ATR Dialogue Database
     Type: Text
     Type.linguistics: transcription/dialogue
     Description: Transcriptions of dialogues in both Japanese and English for the same conversations. It contains 4 different conversations covering 2 topics (registration for an international conference, conversation between a travel agency and a customer) and 2 media (telephone, keyboard), each on 1 CD-ROM.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 50,000 JPY per CD-ROM (research purpose)
     Subject.language: Japanese, English
     Format: 4 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • Examples for Writing English Business Letter
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Japanese and English examples for writing business letters.
     Creator: Nihon Keizai Shimbun Inc., Japan
     Contact person: Nikkei Shuppan Hanbai Co., Ltd. (eizo(at)nikkeish.co.jp)
     Price: 70,000 JPY
     Subject.language: Japanese, English
     Date: 1998
     Format: 1 CD-ROM.
     Format.encoding: Shift_JIS
     Format.markup: SGML
     URI: http://www.nikkeish.co.jp/gengo/eibun.htm (in Japanese)
  • Bensei Database
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Text database of Kobun (Japanese ancient writings), Waka (31-syllable Japanese poem) and Kanbun (text written in classical Chinese). It consists of approximate 50 text.
     Creator: Bensei Data Center, Japan
     Contact person: Bensei Data Center (+81–3–5351–3141)
     Price: 3,000–4,000 JPY per floppy disk
     Subject.language: Japanese
     Format: 1 floppy disk.
  • Data Novels
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Full text of Japanese literature.
     Creator: Computer Shuppan, Publisher, Japan
     Publisher: Computer Shuppan, Publisher
     Contact person: Computer Shuppan, Publisher (+81–3–5486–9481)
     Price: 1,800–18,000 JPY
     Subject.language: Japanese
     Format: 1 floppy disk.
  • Blue Sky Collection (Aozora Bunko)
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Internet library. A large body of Japanese literature, including works that are out of copyright, is available to the general public.
     Publisher: http://www.aozora.gr.jp/
     Contact person: aozora(at)voyager.co.jp
     Price: free
     Subject.language: Japanese
     URI: http://www.aozora.gr.jp/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Precedent Master (Hanrei Master)
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Text database of approximately 95,000 judicial precedents from 1947–1994. The database is updated every six months.
     Creator: Shin Nihon Houki Shuppan, Publisher, Japan
     Publisher: Shin Nihon Houki Shuppan, Publisher
     Contact person: Shin Nihon Houki Shuppan, Publisher (+81–52–211–1525)
     Price: 267,800 JPY, 40,000 JPY for update
     Subject.language: Japanese
  • Japanese Patent Database CD-ROM
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Database of Japanese patents from 1994. Approximately 150 CD-ROMs per year.
     Creator: Japan Patent Information Organization, Japan
     Contact person: Japan Patent Information Organization (+81–3–3503–3900)
     Price: 13,500–20,600 JPY per CD-ROM
     Subject.language: Japanese
     Usage Case: (Show in new window in Japanese)
  • Kodansha Japanese-English Dictionary
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Text corpus of the Kodansha Japanese-English Dictionary, including 38,000 example Japanese sentences with English translations. Users are required to submit a license agreement form to the National Institute of Advanced Industrial Science and Technology, Japan.
     Creator: Kodansha Ltd., Japan
     Contributor: Hasida, Koiti
     Contact person: Hasida, Koiti (hasida.k(at)aist.go.jp)
     Price: free
     Subject.language: Japanese
     Language: English
     Usage Case: (Show in new window in Japanese)
  • ZenBase CD-ROM
     Type: Text
     Type.linguistics: annotation/corpus
     Description: Corpus of Japanese “Zen” text.
     Creator: International Research Institute for Zen Buddhism, Japan
     Contact person: International Research Institute for Zen Buddhism (ursapp(at)mbox.kyoto-inet.or.jp)
     Price: 1,000 JPY
     Subject.language: Japanese
     Format.encoding: ISO-2022-JP(JIS code)
  • Power Shift Corpus G1–2009
     Type: Text
     Description: Collection of e-mail messages on business or private matters. Messages were written by men and women aged 10–39 using specified mobile phones or PCs in a simulated setting.
     Creator: Straight Word Inc.
     Publisher: Power Shift Inc.
     Contact person: Power Shift Inc. (http://www.powershift.co.jp/company/form.html)
     Price: 880,000 JPY + tax
     Subject.language: Japanese
     URI: http://www.powershift.co.jp/it/corpus.html (in Japanese)
  • The Konan Kodomo Corpus
     Type: Text
     Description: The Konan Kodomo corpus (KK corpus) consists of texts written by primary school students. Texts were collected from 66 students over a period of eight months.
     Creator: Edu-mining project team, Konan University
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: free for members of GSK
     Subject.language: Japanese
     Date: 2010/5
     Rights: Research/Education purpose only
     Format: 1 file (260KB, zip archive)
     Format.encoding: EUC-JP
     URI: http://www.gsk.or.jp/en/catalog/gsk2010-b/
  • CASTEL/J CD-ROM V1.5
     Type: Text
     Description: Corpora and databases for learning Japanese, developed by CASTEL/J. It contains books, white papers, movie scripts, a Kanji database, a Japanese-English dictionary database and more.
     Creator: CASTEL/J
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.language: Japanese
     Date: 2009/8
     Rights: Research/Education purpose only
     Format: 1 CD-ROM (594MB)
     Format.encoding: Shift_JIS
     URI: http://www.gsk.or.jp/en/catalog/gsk2009-a/
  • Natural conversation corpus in Japanese (former Meidai conversation corpus)
     Type: Text
     Type.linguistics: transcription/
     Description: A corpus of transcribed casual conversations among native Japanese speakers. It consists of 120 dialogues. The total conversation time is about 100 hours.
     Creator: Oso Mieko
     Price: Free
     Subject.language: Japanese
     Date: 2003
     Format.encoding: EUC-JP
     URI: https://nknet.ninjal.ac.jp/nknet/ndata/nuc/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
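Several entries above list a legacy Japanese encoding under Format.encoding (Shift_JIS, EUC-JP, ISO-2022-JP). Before such text is fed to modern tools it usually has to be normalized to UTF-8. The following is a minimal Python sketch of that step. The codec names are Python's standard ones; the mapping from catalogue values to codecs, the helper name to_utf8 and the sample string are assumptions made for illustration and are not part of any listed resource.

```python
# Map catalogue "Format.encoding" values to Python codec names.
# The keys mirror the catalogue; the mapping itself is an assumption.
CODECS = {
    "Shift_JIS": "shift_jis",
    "EUC-JP": "euc_jp",
    "ISO-2022-JP(JIS code)": "iso2022_jp",
}


def to_utf8(raw: bytes, catalogue_encoding: str) -> bytes:
    """Decode bytes in a legacy encoding and re-encode them as UTF-8."""
    codec = CODECS[catalogue_encoding]
    text = raw.decode(codec)  # raises UnicodeDecodeError if the encoding is wrong
    return text.encode("utf-8")


# Illustrative sample only: simulate legacy files by encoding a known string.
sample = "青空文庫"
for name, codec in CODECS.items():
    legacy_bytes = sample.encode(codec)
    assert to_utf8(legacy_bytes, name).decode("utf-8") == sample
```

For real files, open them with `open(path, encoding=CODECS[...])` instead of decoding by hand. Decoding with the wrong codec typically fails loudly or yields mojibake, so checking a few lines by eye is still worthwhile.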

Thesaurus

  • Bunrui Goi Hyo Database CD-ROM (enlarged and revised version)
     Type: Text
     Type.linguistics: lexicon/thesaurus
     Description: A Japanese thesaurus that comprises about 100,000 words. The data is derived from the printed book “Bunrui Goi Hyo (enlarged and revised version)”.
     Creator: National Institute for Japanese Language, Japan
     Price: free
     Subject.language: Japanese
     Rights: research purpose only
     Format: zip file
     Format.encoding: Shift_JIS
     Format.markup: comma-separated data
     URI: http://www.ninjal.ac.jp/publication/catalogue/goihyo/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Contemporary Japanese Noun Thesaurus
     Type: Text
     Type.linguistics: lexicon/thesaurus
     Description: Contemporary Japanese Noun Thesaurus consisting of 70,000 words.
     Creator: Ogino, Tsunao
     Contact person: Ogino, Tsunao (http://www.chs.nihon-u.ac.jp/jp_dpt/ogino/)
     Price: free (research purpose)
     Subject.language: Japanese
  • GoiTaikei
     Type: Text
     Type.linguistics: lexicon/thesaurus
     Description: Japanese thesaurus developed for the machine translation system ALT-J/E. It comprises about 300,000 words, which are classified into 3,000 semantic classes. It also has a valency dictionary containing 14,000 Japanese subcategorization patterns with corresponding English patterns.
     Creator: NTT Communication Science Laboratories
     Publisher: Iwanami Shoten, Publishers
     Contact person: NTT Communication Science Laboratories, Machine Translation Research Group (mt(at)cslab.kecl.ntt.co.jp)
     Price: 60,000 JPY
     Subject.language: Japanese, English
     Format: CD-ROM
     URI: http://www.kecl.ntt.co.jp/icl/mtg/resources/GoiTaikei/
     Usage Case: (Show in new window in Japanese)
  • BioCaster ontology
     Type: Text
     Type.linguistics: lexicon/thesaurus
     Description: The BioCaster public health ontology is based on a top-level SUMO taxonomy and covers 27 high priority infectious diseases including the pathogens that cause them, their symptoms, syndrome groupings etc. in six Asia-Pacific languages: Chinese (standard), English, Japanese, Korean, Thai, and Vietnamese. Term variants are also given for all terms. Links to major external resources such as MeSH, SNOMED CT and Wikipedia are included.
     Creator: Nigel Collier research group, National Institute of Informatics
     Contact person: Koichi Takeuchi (Okayama University, koichi(at)cl.it.okayama-u.ac.jp), Nigel Collier and Ai Kawazoe (National Institute of Informatics, collier(at)nii.ac.jp)
     Price: Free
     Subject.language: Chinese (standard), English, Japanese, Korean, Thai, Vietnamese
     Date: 2007
     URI: http://biocaster.nii.ac.jp/index.php?page=downloads
  • Japanese WordNet
     Type: Text
     Description: WordNet for Japanese. Japanese equivalents are given to synsets of the Princeton WordNet 3.0. It consists of 49,190 concepts (synsets), 85,966 words and 156,684 senses (synset-word pairs).
     Creator: National Institute of Information and Communications Technology
     Contact person: Francis Bond (jwordnet(at)gmail.com)
     Price: free
     URI: http://nlpwww.nict.go.jp/wn-ja/index.en.html
     Usage Case: (Show in new window in Japanese)
  • Verb Thesaurus
     Type: Text
     Type.linguistics: lexicon/verb thesaurus
     Description: A verb dictionary for natural language processing. It consists of 4,425 verbs and 7,473 senses. It contains hierarchical semantic categories, case frames and typical example sentences for each sense.
     Creator: Koichi Takeuchi, Kentaro Inui, Atsushi Fujita, Nao Takeuchi
     Contact person: Koichi Takeuchi
     Price: Free
     URI: http://cl.it.okayama-u.ac.jp/rsc/data/index.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
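The size figures quoted for Japanese WordNet and the Verb Thesaurus can be turned into rough ambiguity measures. The sketch below uses only the counts given in the entries above; the dictionary layout and function names are illustrative assumptions. Following the Japanese WordNet entry, a "sense" is treated as a synset-word pair.

```python
# Counts quoted from the catalogue entries above.
RESOURCES = {
    "Japanese WordNet": {"synsets": 49_190, "words": 85_966, "senses": 156_684},
    "Verb Thesaurus": {"words": 4_425, "senses": 7_473},
}


def senses_per_word(counts: dict) -> float:
    """Average number of senses attached to one word (a polysemy estimate)."""
    return counts["senses"] / counts["words"]


def words_per_synset(counts: dict) -> float:
    """Average number of words sharing one synset (a synonymy estimate)."""
    return counts["senses"] / counts["synsets"]


for name, counts in RESOURCES.items():
    line = f"{name}: {senses_per_word(counts):.2f} senses per word"
    if "synsets" in counts:
        line += f", {words_per_synset(counts):.2f} words per synset"
    print(line)
```

These are averages only. They say nothing about how skewed the distribution is, and a small number of highly polysemous words can dominate them.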

Lexicon

  • IPAL
     Type: Text
     Type.linguistics: lexicon/subcategorization dictionary
     Description: A Japanese lexicon comprising 861 verbs, 136 adjectives and 1,081 nouns, which are considered representative examples of Japanese words. Each entry includes information about semantics, morphology, grammatical categories, case frames and idiomatic usage. It can be obtained through GSK.
     Creator: Information-technology Promotion Agency, Japan
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: free (GSK member), 32,400 JPY (GSK non-member)
     Subject.language: Japanese
     Language: Japanese
     Date: 1998
     Format: 11 MB.
     Format.encoding: EUC-JP
     URI: http://www.gsk.or.jp/en/catalog/gsk2007-d/
     Usage Case: (Show in new window in Japanese)
  • EDR Electronic Dictionary
     Type: Collection
     Description: The EDR Electronic Dictionary is composed of 9 types of dictionaries (Japanese Word, English Word, Concept, Japanese Co-occurrence, English Co-occurrence, Japanese-English Bilingual, Japanese-Chinese Bilingual, English-Japanese Bilingual and Technical Terminology), as well as the EDR Corpus. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Subject.language: Japanese, English
     Language: Japanese, English
     Format: 9 CD-ROM.
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
     Usage Case: (Show in new window in Japanese)
  • EDR Japanese Word Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: The basic roles of the Word Dictionary include providing the relations between words and concepts related to each other, and providing grammatical attributes regarding these relationships. The Japanese Word Dictionary contains approximately 260,000 words. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: Japanese
     Language: Japanese, English
     Format: 103 MB. 260,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
     Usage Case: (Show in new window in Japanese)
  • EDR English Word Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: The basic roles of the Word Dictionary include providing the relations between words and concepts related to each other, and providing grammatical attributes regarding these relationships. The English Word Dictionary contains approximately 190,000 words. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: English
     Language: English, Japanese
     Format: 86 MB. 190,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
  • EDR Japanese-English Bilingual Dictionary
     Type: Text
     Type.linguistics: lexicon/bilingual lexicon
     Description: The main role of the Japanese-English Bilingual Dictionary is to describe the correspondence between a Japanese word and the concept it represents, and to provide the corresponding English word for that meaning. It contains approximately 240,000 words. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: Japanese
     Language: English, Japanese
     Format: 85 MB. 240,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
     Usage Case: (Show in new window in Japanese)
  • EDR Japanese-Chinese Bilingual Dictionary
     Type: Text
     Type.linguistics: lexicon/bilingual lexicon
     Description: The main role of the Japanese-Chinese Bilingual Dictionary is to describe the correspondence between a Japanese word and the concept it represents, and to provide the corresponding Chinese word for that meaning. It contains approximately 230,000 words. It was released in 2010.
     Creator: National Institute of Information and Communications Technology
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: Japanese
     Language: Chinese, Japanese
     Date: 2010
     Format: 85 MB. 240,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
  • EDR English-Japanese Bilingual Dictionary
     Type: Text
     Type.linguistics: lexicon/bilingual lexicon
     Description: The main role of the English-Japanese Bilingual Dictionary is to describe the correspondence between an English word and the concept it represents, and to provide the corresponding Japanese word for that meaning. It contains approximately 160,000 words. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: Japanese
     Language: English, Japanese
     Format: 53 MB. 160,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
  • EDR Concept Dictionary
     Type: Text
     Type.linguistics: lexicon/thesaurus
     Description: The Concept Dictionary contains information on the approximately 410,000 concepts listed in the Word Dictionary and is divided according to information type into the Headconcept Dictionary, the Concept Classification Dictionary, and the Concept Description Dictionary. The Headconcept Dictionary describes information on the concepts themselves. The Concept Classification Dictionary describes the super-sub relations among the approximately 410,000 concepts. The “super-sub” relation refers to the inclusion relation between concepts, and the set of interlinked concepts can be regarded as a type of thesaurus. The Concept Description Dictionary describes the semantic (binary) relations, such as ‘agent,’ ‘implement,’ and ‘place,’ between concepts that co-occur in a sentence. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: Japanese, English
     Language: Japanese, English
     Format: 97 MB. 410,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
     Usage Case: (Show in new window in Japanese)
  • EDR Japanese Co-occurrence Dictionary
     Type: Text
     Type.linguistics: lexicon/cooccurrence database
     Description: The Co-occurrence Dictionary describes collocational information in the form of binary relations. The Japanese Co-occurrence Dictionary contains approximately 930,000 phrases. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: Japanese
     Language: Japanese
     Format: 445 MB. 930,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
     Usage Case: (Show in new window in Japanese)
  • EDR English Co-occurrence Dictionary
     Type: Text
     Type.linguistics: lexicon/cooccurrence database
     Description: The Co-occurrence Dictionary describes collocational information in the form of binary relations. The English Co-occurrence Dictionary contains approximately 460,000 phrases. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: English
     Language: English, Japanese
     Format: 242 MB. 460,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
  • EDR Technical Terminology Dictionary
     Type: Text
     Type.linguistics: lexicon/technical terminology
     Description: The Technical Terms Dictionary contains technical terms in English and Japanese from the field of information processing. It is composed of the following subdictionaries: the Japanese Technical Terms Dictionary, the English Technical Terms Dictionary, the Japanese-English Bilingual Dictionary of Technical Terms, the English-Japanese Bilingual Dictionary of Technical Terms, the Concept Dictionary of Technical Terms, the Japanese Technical Terms Co-occurrence Data, and the English Technical Terms Co-occurrence Data. It contains 119,000 Japanese words and 78,000 English words. Ver. 4.0 was released in 2010.
     Creator: Japan Electronic Dictionary Research Institute, Ltd., Japan
     Contact person: National Institute of Information and Communications Technology (thoth(at)edr.co.jp)
     Price: 50,000 JPY (academic), 1,200,000 JPY (general, research use), 2,400,000 JPY (commercial use)
     Subject.language: Japanese, English
     Language: Japanese, English
     Format: 145 MB. 197,000 entries.
     Format.encoding: EUC-JP
     URI: http://www2.nict.go.jp/ipp/EDR/ENG/indexTop.html
  • Vocabulary of Japanese ancient writing (Koten Taisho Goi Hyo)
     Type: Text
     Type.linguistics: lexicon/
     Description: Vocabulary of approximately 23,000 content words in 14 Japanese ancient writings such as “Tsurezuregusa” and “Hojoki”. It also contains word frequencies.
     Creator: Kasama Shoin, Publisher, Japan
     Publisher: Kasama Shoin, Publisher
     Contact person: Kasama Shoin, Publisher (+81–3–3295–1331)
     Price: 6,695 JPY
     Subject.language: Japanese
  • ICOT Morphological Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: Morphological dictionary consisting of approximately 120,000 Japanese words. Headword, reading and POS are given for each word.
     Creator: Institute for New Generation Computer Technology, Japan
     Publisher: ftp://ftp.icot.or.jp
     Price: free
     Subject.language: Japanese
     Language: Japanese
     Format.encoding: ISO-2022-JP(JIS code)
     URI: ftp://ftp.icot.or.jp/ifs/README.j (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Life Science Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: Database for life science terms in English and Japanese.
     Creator: Life Science Dictionary Project, Japan
     Contributor: Faculty of Pharmaceutical Sciences, Kyoto University / National Institute of Genetics
     Publisher: http://lsd.pharm.kyoto-u.ac.jp
     Contact person: Life Science Dictionary Project (lsd(at)lsd.pharm.kyoto-u.ac.jp)
     Price: free
     Subject.language: Japanese, English
     URI: http://lsd.pharm.kyoto-u.ac.jp/index.html
     Usage Case: (Show in new window in Japanese)
  • English Basic Vocabulary List
     Type: Text
     Type.linguistics: lexicon/
     Description: List of approximately 5,000 basic English words selected by Linda Woo.
     Creator: Woo, Linda
     Contributor: Tonoike, Toshiyuki
     Publisher: http://www.lang.nagoya-u.ac.jp/~tonoike/linda5000.html
     Contact person: Tonoike, Toshiyuki (f43633a(at)nucc.cc.nagoya-u.ac.jp)
     Price: free
     Subject.language: English
     URI: http://www.lang.nagoya-u.ac.jp/~tonoike/linda5000.html (in Japanese)
  • Hokkaido University English Vocabulary List
     Type: Text
     Type.linguistics: lexicon/
     Description: List of approximately 7,500 basic English words developed by Hokkaido University.
     Creator: Hokkaido University, Japan
     Contact person: Sonoda, Katsuhide (ksonoda(at)ilcs.hokudai.ac.jp)
     Price: free
     Subject.language: English
  • EDICT
     Type: Text
     Type.linguistics: lexicon/
     Description: The EDICT file results from a long-running project to produce a freely available Japanese/English Dictionary in machine-readable form.
     Creator: The Electronic Dictionary Research and Development Group, Monash University
     Contact person: Jim Breen (jwb(at)csse.monash.edu.au)
     Price: free for research purpose
     Subject.language: Japanese, English
     Format: about 106,000 entries
     URI: http://www.csse.monash.edu.au/~jwb/edict_doc.html
     Usage Case: (Show in new window in Japanese)
  • CICC Malaysian Basic Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: The lexicon consists of 70,000 basic Malay words. POS, grammatical information and English translation are compiled for each word. Technical terms are also included.
     Creator: Center for the International Cooperation for Computerization
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.language: Malay
     Date: 1995
     Rights: Academic use only
     Format: 1 CD-ROM
     Format.encoding: ASCII code
     URI: http://www.gsk.or.jp/en/catalog/gsk2006-a-1/
  • CICC Indonesian Basic Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: The lexicon consists of 50,000 basic Indonesian words. POS, grammatical information and English translation are compiled for each word. Idioms, acronyms and technical terms are also included.
     Creator: Center for the International Cooperation for Computerization
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.language: Indonesian
     Date: 1995
     Rights: Academic use only
     Format: 1 CD-ROM
     Format.encoding: ASCII code
     URI: http://www.gsk.or.jp/en/catalog/gsk2006-a-2/
  • CICC Chinese Basic Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: The lexicon consists of 50,000 basic Chinese words. Pronunciations and grammatical information are compiled for each word. Technical terms are also included.
     Creator: Center for the International Cooperation for Computerization
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.language: Chinese
     Date: 1995
     Rights: Academic use only
     Format: 1 CD-ROM
     Format.encoding: GB code
     URI: http://www.gsk.or.jp/en/catalog/gsk2006-a-3/
  • CICC Thai Basic Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: The lexicon consists of 50,000 basic Thai words. English translations are compiled for each word. Collocation and technical terms are also included.
     Creator: Center for the International Cooperation for Computerization
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.language: Thai
     Date: 1995
     Rights: Academic use only
     Format: 1 CD-ROM
     Format.encoding: TIS 620–2529
     URI: http://www.gsk.or.jp/en/catalog/gsk2006-a-4/
  • CICC Technical Term Dictionary
     Type: Text
     Type.linguistics: lexicon/
     Description: The lexicon consists of technical terms in Malay, Indonesian, Chinese and Thai. It covers technical terms from computing, electronics, engineering and related areas. Japanese translations, English translations, POS, pronunciations, classifiers, syntactic information etc. are compiled for each word.
     Creator: Center for the International Cooperation for Computerization
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.language: Malay, Indonesian, Chinese, Thai
     Language: Malay, Indonesian, Chinese, Thai, English, Japanese
     Date: 1995
     Rights: Academic use only
     Format: 1 CD-ROM
     Format.encoding: ASCII code, GB code, TIS 620–2529, EUC, Shift-JIS
     URI: http://www.gsk.or.jp/en/catalog/gsk2006-a-5/
  • MUST1: Example Database of Japanese Compound Functional Expressions
     Type: Text
     Type.linguistics: lexicon/
     Description: Database of Japanese compound functional expressions and their example usage. It contains 337 compound functional expressions and at most 50 examples per expression. Examples are excerpted from newspaper articles. Mainichi Shimbun CD-ROM 1995 is required to reconstruct the complete database.
     Creator: Group MUST
     Contact person: Group MUST (Suguru Matsuyoshi, Takehiro Utsuro, Satoshi Sato, Masatoshi Tsuchiya)
     Price: free
     Subject.language: Japanese
     Date: 2007
     URI: http://nlp.iit.tsukuba.ac.jp/must/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Tori-Bank
     Type: Text
     Type.linguistics: lexicon/
     Description: Japanese Semantic Pattern Dictionary (Compound and Complex Sentence Eds.): a lexicon consisting of 227,000 Japanese-English translation patterns, together with its related documents and tools.
     Creator: Nihongo Hyogen Imijishoto Kanri Iinkai
     Price: Free (research purpose only)
     Subject.language: Japanese
     Date: 2007
     URI: http://unicorn.ike.tottori-u.ac.jp/toribank/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Tsutsuji — Lexicon of Japanese functional expressions
     Type: Text
     Type.linguistics: lexicon/
     Description: A lexicon compiling Japanese functional expressions (both functional words and compound words). It has a 9-level hierarchical structure. The lowest level contains 16,801 functional expressions.
     Creator: Suguru Matsuyoshi, Satoshi Sato
     Contact person: tsu90tsu80ji%sslab.nuee.nagoya-75u.ac.jp (remove all numbers, replace % with (at).)
     Price: free
     Subject.language: Japanese
     Date: 2007
     Rights: Creative Commons 3.0, Attribution-Noncommercial-Share Alike
     URI: http://kotoba.nuee.nagoya-u.ac.jp/tsutsuji/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • UniDic
     Type: Text
     Type.linguistics: lexicon/word
     Description: A machine-readable dictionary for morphological analysis of Japanese. It can be used as a dictionary for ChaSen and MeCab, which are publicly available Japanese morphological analyzers. It records canonical form, word form, writing variants, speech variants and accent. As of July 2009 it contained 15,000 words (canonical forms).
     Creator: DEN Yasuharu, YAMADA Atsushi, OGURA Hideki, KOISO Hanae, OGISO Toshinobu
     Contact person: unidic(at)ninjal.ac.jp
     Price: free
     Subject.language: Japanese
     Date: 2007-
     URI: http://www.tokuteicorpus.jp/dist/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • EVALDIC: Dictionary of Evaluative Expressions
     Type: Text
     Type.linguistics: lexicon/evaluation expressions
     Description: A lexicon compiling Japanese expressions used to express evaluations. It consists of about 52,000 evaluative expressions.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology
     Price: Free
     Subject.language: Japanese
     Date: 2006
     URI: http://www.syncha.org/evaluative_expressions.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • NAIST English Dictionary
     Type: Text
     Type.linguistics: lexicon/word
     Description: An English dictionary with parts-of-speech used in Penn Treebank. It contains a base form for each word.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology
     Price: Free
     Subject.language: English
     Date: 2007
     URI: http://sites.google.com/site/masayua/p/naist-edic (in Japanese)
  • NAIST Japanese Dictionary
     Type: Text
     Type.linguistics: lexicon/word
     Description: A Japanese lexicon which is a successor of IPAdic. Parts-of-speech of all words except for proper nouns are rechecked. Variants and compound word structures are added. It is used for ChaSen and MeCab.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology
     Price: Free
     Subject.language: Japanese
     URI: http://sourceforge.jp/projects/naist-jdic/ (in Japanese)
  • NAIST Chinese Dictionary
     Type: Text
     Type.linguistics: lexicon/word
     Description: A Chinese dictionary for morphological analyzer MeCab. It compiles about 120,000 words with their parts-of-speech.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology
     Subject.language: Chinese
     Rights: Contract required
     URI: http://cl.naist.jp/~masayu-a/ncd/ (in Japanese)
  • NAIST Japanese ENE Dictionary on Wikipedia
     Type: Text
     Type.linguistics: lexicon/named entity
     Description: A Japanese dictionary compiling lexical entries in Wikipedia with the extended named entity tags proposed by Prof. Sekine at NYU.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology
     Price: Free
     Subject.language: Japanese
     URI: http://sites.google.com/site/masayua/p/naist-jene (in Japanese)
  • Kyoto University’s Case Frame Data ver 1.0
     Type: Text
     Type.linguistics: lexicon/subcategorization dictionary
     Description: A large-scale Japanese case frame dictionary automatically obtained from 1.6 billion Web pages. A case frame is data about a predicate and its associated nouns. It contains case frames for 40,000 predicates. It is available only for GSK members.
     Creator: Language Media Lab., Kyoto University, Japan
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: only for GSK members, free
     Subject.language: Japanese
     Format: 11 MB.
     Format.encoding: EUC-JP
     URI: http://www.gsk.or.jp/en/catalog/gsk2008-b/
     Usage Case: (Show in new window in Japanese)
  • OpenMWE for Japanese — Idioms
     Type: Text
     Type.linguistics: lexicon/idioms
     Description: A list of 926 basic Japanese idioms. Each idiom is classified according to syntactic flexibility and semantic ambiguity.
     Creator: Chikara Hashimoto, Daisuke Kawahara
     Contact person: Chikara Hashimoto, Daisuke Kawahara
     Price: Free
     Subject.language: Japanese
     URI: http://openmwe.sourceforge.jp/pukiwiki-j/index.php?Idioms (in Japanese)
  • JC2 — List of Japanese Basic Words
     Type: Text
     Type.linguistics: lexicon/basic word
     Description: A list of basic Japanese words: 2,800 level A words and 3,000 level B words, about 5,800 words in total. Functional expressions and idioms are included as well as single words.
     Creator: Sato Laboratory, Nagoya University
     Contact person: Satoshi Sato
     Price: Free
     Subject.language: Japanese
     URI: http://kotoba.nuee.nagoya-u.ac.jp/jc2/base/list (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Five Literature Comparison Table of Basic Idioms
     Type: Text
     Type.linguistics: lexicon/idiom
     Description: A list of basic Japanese idioms. Idioms from five books are compiled into a comparison table. The table contains 3,629 idioms.
     Creator: Satoshi Sato
     Contact person: Satoshi Sato
     Price: Free
     Subject.language: Japanese
     URI: http://kotoba.nuee.nagoya-u.ac.jp/jc2/base/list (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Dictionary of Japanese Evaluation Expression (Predicates)
     Type: Text
     Type.linguistics: lexicon/evaluation expressions
     Description: A collection of 5,000 Japanese evaluation expressions (predicates) manually annotated with polarity tags. The tags have four classes, formed by combining positive/negative with subjective/objective.
     Creator: Inui-Okazaki Laboratory, Tohoku University
     Contact person: Inui-Okazaki Laboratory, Tohoku University
     Price: Free
     Subject.language: Japanese
     URI: http://www.cl.ecei.tohoku.ac.jp/index.php?%E5%85%AC%E9%96%8B%E8%B3%87%E6%BA%90%2F%E6%97%A5%E6%9C%AC%E8%AA%9E%E8%A9%95%E4%BE%A1%E6%A5%B5%E6%80%A7%E8%BE%9E%E6%9B%B8
     Usage Case: (Show in new window in Japanese)
  • Dictionary of Japanese Evaluation Expression (Nouns)
     Type: Text
     Type.linguistics: lexicon/evaluation expressions
     Description: Collection of 8,500 Japanese evaluation expressions (nouns or compound nouns) with their polarity. Polarity tags are manually checked.
     Creator: Inui-Okazaki Laboratory, Tohoku University
     Contact person: Inui-Okazaki Laboratory, Tohoku University
     Price: Free
     Subject.language: Japanese
     URI: http://www.cl.ecei.tohoku.ac.jp/index.php?%E5%85%AC%E9%96%8B%E8%B3%87%E6%BA%90%2F%E6%97%A5%E6%9C%AC%E8%AA%9E%E8%A9%95%E4%BE%A1%E6%A5%B5%E6%80%A7%E8%BE%9E%E6%9B%B8
     Usage Case: (Show in new window in Japanese)
  • Kyoto University’s Nominal Case Frame
     Type: Text
     Type.linguistics: lexicon/subcategorization dictionary
     Description: A large-scale lexicon of nominal case frames. A nominal case frame is the set of elements necessary to interpret a noun; frames are compiled for each sense of a noun. The lexicon covers 160,000 nouns and was constructed automatically from 1.6 billion Japanese sentences on the Web.
     Creator: Kurohashi and Kawahara laboratory, Kyoto University
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: free
     Subject.language: Japanese
     Format: 68MB.
     URI: http://nlp.ist.i.kyoto-u.ac.jp/index.php?%E4%BA%AC%E9%83%BD%E5%A4%A7%E5%AD%A6%E5%90%8D%E8%A9%9E%E6%A0%BC%E3%83%95%E3%83%AC%E3%83%BC%E3%83%A0 (in Japanese)
  • Japanese Dictionary of Appraisal -attitude-
     Type: Text
     Type.linguistics: lexicon/
     Description: An electronic dictionary consisting of 8,544 evaluative expressions (word senses) with their polarity (positive or negative). It can be used to classify evaluative expressions from several points of view (tender emotion view, ethic view).
     Creator: Center for Corpus Development, National Institute for Japanese Language and Linguistics
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: Free
     Subject.language: Japanese
     Date: 2011/9
     Rights: only for research or education
     Format: zip file
     Format.encoding: UTF-8
     URI: http://www.gsk.or.jp/en/catalog/gsk2011-c/
     Usage Case: (Show in new window in Japanese)
  • GSK Dictionary of Places and Facilities (Second Edition)
     Type: Text
     Type.linguistics: lexicon/
     Description: A set of three dictionaries. The dictionary of places is a collection of 117,075 place names (addresses) from across Japan, with pronunciations, Romanized transliterations, orthographic variants, latitude, longitude, etc. The dictionary of facilities is a collection of about 1,000 facilities, such as art galleries, museums and amusement parks, with pronunciations, orthographic variants, latitude, longitude, etc. The Web dictionary of facilities is a collection of facilities excerpted from Japanese Wikipedia, with pronunciations, addresses, categories, etc.; latitude and longitude are also included for some facilities. It contains 32,419 facilities (24,859 with precise latitude and longitude) and may contain errors since it is compiled automatically.
     Creator: GSK (Gengo Shigen Kyokai)
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: For educational and research purposes: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members. For commercial use (contract required): 216,000 JPY for members of GSK, 432,000 JPY for non-members.
     Subject.language: Japanese
     Date: 2009/2
     Format: 1 CD-ROM
     Format.encoding: Shift_JIS
     URI: http://www.gsk.or.jp/en/catalog/gsk2012-c/
     Usage Case: (Show in new window in Japanese)
  • Japanese MWE Lexicon (JMWEL)
     Type: Text
     Type.linguistics: lexicon/
     Description: A comprehensive database of Japanese multiword expressions, multiword units and formulaic language. It consists of 18 lexicons.
     1. JMWEL_nominal v2.0
     A Lexicon of Japanese idiomatic or collocational nominal phrases, e.g., makka-na-uso 真っ赤-な-嘘 “downright lie”, amai-mi-toosi 甘い-見-通し “over-optimistic prospect”. It includes about 23500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (8000JPY)
     2. JMWEL_verbal(class1) v2.0
     A Lexicon of Japanese idiomatic or collocational phrases which have NOUN-PARTICLE(“ga”, “wo”, or “ni”)-VERB construction, e.g., abura-wo-uru 油-を-売る “idle away one’s time”, hara-ga-tatsu 腹-が-立つ “get angry”. It includes about 36000 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (23000JPY)
     3. JMWEL_verbal(class2) v2.0
     A Lexicon of Japanese idiomatic or collocational verbal phrases which have the form other than NOUN-PARTICLE(“ga”, “wo”, or “ni”)-VERB construction, e.g., tama-no-kosi-ni-noru 玉-の-輿-に-乗る “marry into money”, bake-no-kawa-ga-hageru 化け-の-皮-が-剥げる “expose one’s true colors”. It includes about 13800 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (15000JPY)
     4. JMWEL_verbal(class3) v2.0
     A Lexicon of Japanese verbal compounds such as wameki-tateru 喚き-立てる “shout”, kakki-zuku 活気-づく “get lively”. It includes about 3700 head entries each of which is given the information on its notational variants and morphological structure. (7000JPY)
     5. JMWEL_adjective v2.0
     A Lexicon of Japanese idiomatic or collocational adjective phrases such as ki-ga-chiisai 気-が-小さい “be timid”, kigen-ga-yoi 機嫌-が-良い “be cheerful”. It includes about 3700 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (11000JPY)
     6. JMWEL_adjective verbal v2.0
     A Lexicon of Japanese idiomatic or collocational adjective verbal phrases such as sensaku-zuki 詮索-好き “be inquisitive”, kingen-jicchoku 謹厳-実直 “be serious-minded”. It includes about 2600 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (7000JPY)
     7. JMWEL_adverbial v2.0
     A Lexicon of Japanese idiomatic or collocational adverbial phrases such as omoi-mo-yora-zu 思い-も-よら-ず “unexpectedly”, ki-wo-tuke-te 気-を-付け-て “carefully”. It includes about 16200 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (15000JPY)
     8. JMWEL_adnominal v2.0
     A Lexicon of Japanese idiomatic or collocational adnominal phrases such as yo-ni-iu 世-に-云う “so called”, suji-no-toot-ta 筋-の-通っ-た “reasonable”. It includes about 16500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (15000JPY)
     9. JMWEL_discourse marker v2.0
     A Lexicon of Japanese idiomatic or collocational discourse-marking expressions or sentence connectives such as sou-ha-it-temo そう-は-言っ-ても “however”, odoroku-beki-koto-ni 驚く-べき-こと-に “astonishingly”. It includes about 1200 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (9000JPY)
     10. JMWEL_post-predicative v2.0
     A Lexicon of Japanese post-predicate multiword expressions such as beki-dat-ta-n-da-kedo べき-だっ-た-ん-だ-けど “… should have Vpp …”, te-itadake-mase-n-ka-ne て-頂け-ませ-ん-かね “Would you V …” which give the information on tense, aspect, modality, polarity, mood, speaker’s attitude to the proposition, etc. It includes about 4900 head entries each of which is given the information on its notational variants, morphological and syntactic function, syntactic structure, and semantic feature. (23000JPY)
     11. JMWEL_postpositional v2.0
     A Lexicon of Japanese postpositional multiword expressions such as ni-kansi-te に-関し-て “about”, wo-gisei-ni を-犠牲-に “at the expense of”, ta-ato-ni た-後-に “after” which give the semantic relationship among noun phrases and predicative phrases in the sentence. It includes about 2700 head entries each of which is given the information on its notational variants, morphological and syntactic function, syntactic structure, and samples of usage. (16000JPY)
     12. JMWEL_idiom v2.0
     A Lexicon of Japanese idioms such as abura-wo-uru 油-を-売る “idle away one’s time”, youryou-ga-ii 要領-が-良い “know how to swim with the tide”, me-to-hana-no-saki 目-と-鼻-の-先 “just a stone’s throw away”. It includes about 4500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (17000JPY)
     13. JMWEL_proverb saying cliche v2.0
     A Lexicon of Japanese proverbs, sayings, cliches such as kame-no-kou-yori-tosi-no-kou 亀-の-甲-より-年-の-功 “Age and experience teach wisdom”, kabe-ni-mimi-ari 壁-に-耳-あり “Walls have ears”, gou-ni-it-te-ha-gou-ni-sitagae 郷-に-入っ-て-は-郷-に-従え “When you are in Rome, do as Romans do”. It includes about 4000 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and internal modifiability. (9000JPY)
     14. JMWEL_onomatopoeic v2.0
     A Lexicon of Japanese onomatopoeic expressions such as kachikachi-ni-kooru カチカチ-に-凍る “get frozen solid”, buruburu-furueru ブルブル-震える “tremble”. It includes about 13000 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure. (20000JPY)
     15. JMWEL_four character word v2.0
     A Lexicon of Japanese four character words such as sessa-takuma 切磋-琢磨 “improving each other through friendly rivalry”, isseki-nichou 一石-二鳥 “killing two birds with one stone”. It includes about 3500 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure. Some head entries are given meaning explanations. (8000JPY)
     16. JMWEL_incomplete phrase v2.0
     A Lexicon of Japanese incomplete phrases which are commonly used in daily life, such as neko-ni-koban 猫-に-小判 “casting pearls before swine”, yamai-ha-ki-kara 病-は-気-から “worry often causes the illness”. It includes about 470 head entries each of which is given the information on its notational variants, type of the incompleteness, morphological structure, syntactic function and structure. (5000JPY)
     17. JMWEL_cranberry v2.0
     A Lexicon of Japanese cranberry expressions which include cranberry-type morphs as substrings. For example, shigami-tuku しがみ-付く “cling to” and usiro-metai 後ろ-めたい “feel guilty” are included as head entries, because shigami and metai, respectively, are regarded as cranberry-type morphs. It includes about 180 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure. (3000JPY)
     18. JMWEL_call response greeting monologue interjection v2.0
     A Lexicon of Japanese calling, responding, greeting, monologue or interjective expressions, such as ara-maa あら-まあ, uso ウソ, arigatou 有難う, o-tsukare-sama お-疲れ-様. It includes about 1050 head entries each of which is given the information on its notational variants, morphological structure, syntactic function and structure, and the semantic feature vector. (18000JPY)
     
     
     Creator: JEFI: Japanese Expressions Forest Institute
     Price: Free (research purpose only)
     Subject.language: Japanese
     URI: http://jefi.info (in Japanese)
  • mecab-ipadic-NEologd (Neologism dictionary for MeCab)
     Type: Text
     Type.linguistics: lexicon/
     Description: Customized system dictionary for the morphological analyzer MeCab. It includes many neologisms (new words) extracted from many language resources on the Web. When analyzing Web documents, it is better to use this system dictionary together with the default one (ipadic).
     Creator: Toshinori Sato
     Contact person: Toshinori Sato
     Price: Free
     Subject.language: Japanese
     Date: 2015-
     Rights: Apache License, Version 2.0
     URI: https://github.com/neologd/mecab-ipadic-neologd
     Usage Case: (Show in new window in Japanese)
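
As a rough illustration of how a polarity lexicon such as the Dictionary of Japanese Evaluation Expression (Predicates) listed above might be used, the Python sketch below models its four polarity classes (positive/negative combined with subjective/objective) and counts lexicon hits in a list of tokens. The entries, the in-memory format and the function names are hypothetical placeholders, not taken from the distributed dictionary, whose actual file format should be checked on its download page.

```python
from collections import Counter
from enum import Enum


class Polarity(Enum):
    """Four classes: positive/negative crossed with subjective/objective."""
    POSITIVE_SUBJECTIVE = ("positive", "subjective")
    POSITIVE_OBJECTIVE = ("positive", "objective")
    NEGATIVE_SUBJECTIVE = ("negative", "subjective")
    NEGATIVE_OBJECTIVE = ("negative", "objective")


# Hypothetical placeholder entries for illustration only;
# they are NOT copied from the actual dictionary.
TOY_LEXICON = {
    "良い": Polarity.POSITIVE_SUBJECTIVE,
    "治る": Polarity.POSITIVE_OBJECTIVE,
    "悪い": Polarity.NEGATIVE_SUBJECTIVE,
    "壊れる": Polarity.NEGATIVE_OBJECTIVE,
}


def polarity_counts(tokens, lexicon):
    """Count tokens per polarity class; tokens missing from the lexicon are skipped."""
    return Counter(lexicon[t] for t in tokens if t in lexicon)


def net_sentiment(tokens, lexicon):
    """Positive hits minus negative hits, ignoring the subjective/objective axis."""
    score = 0
    for polarity, n in polarity_counts(tokens, lexicon).items():
        score += n if polarity.value[0] == "positive" else -n
    return score
```

The tokens are assumed to be base forms produced by a morphological analyzer; a real pipeline would also have to handle negation and multiword expressions, which this sketch ignores.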

Text(misc.)

  • Studies on the Vocabulary of High and Middle School Textbooks
     Type: Text
     Type.linguistics: lexicon/
     Description: Reports on studies on the vocabulary of high and middle school textbooks in 1974 and 1980 in Japan. The vocabulary list is also included in “Vocabulary Survey of Broadcasts CD-ROM”.
     Creator: National Institute for Japanese Language, Japan
     Subject.language: Japanese
  • Vocabulary Survey of Broadcasts CD-ROM
     Type: Text
     Type.linguistics: lexicon/
     Description: Vocabulary from TV programs and commercials broadcast from April to June 1989 (26,000 words). It also includes the vocabulary list of “Studies on the Vocabulary of High and Middle School Textbooks”.
     Creator: National Institute for Japanese Language, Japan
     Publisher: Dainippon-tosho
     Contact person: Dainippon-tosho Co., Ltd. (+81–3–3561–8679)
     Price: 2500 JPY
     Subject.language: Japanese
     Format: 1 CD-ROM
     Format.encoding: Shift_JIS
  • Word Collocation Database
     Type: Text
     Description: Word collocation database consisting of 1,160,000 entries, such as triples of a verb, its case filler noun and its case marker, extracted from Japanese newspaper articles.
     Creator: Tanaka, Yasuhito
     Contact person: Tanaka, Yasuhito (+81–794–27–5111)
     Price: low price (carrying charge etc.)
     Subject.language: Japanese
  • Conversation among women — office version -
     Type: Text
     Type.linguistics: transcription/dialogue
     Description: Transcription of conversations (about 9 hours) at offices among 19 women in their 20s to 50s. ISBN 4–938669–93–5
     Creator: Gendai Nihongo Kenkyukai
     Publisher: Hituzi Syobo
     Contact person: Hituzi Syobo
     Price: 3675 JPY
     Subject.language: Japanese
     Format: 1 FD
  • Conversation among men — office version -
     Type: Text
     Type.linguistics: transcription/dialogue
     Description: Transcription of conversations (about 12 hours) at offices among 21 men in their 20s to 50s. ISBN 4–89476–161–0
     Creator: Gendai Nihongo Kenkyukai
     Publisher: Hituzi Syobo
     Contact person: Hituzi Syobo
     Price: 2940 JPY
     Subject.language: Japanese
     Format: 1 CD-ROM
  • Spoken language in days of World War II — from scenarios of radio dramas -
     Type: Text
     Type.linguistics: transcription/dialogue
     Description: Scenarios of radio dramas broadcast by NHK (Japan Broadcasting Corporation) from 1936 to 1955. They were written by Masaru Kobayashi. ISBN 4–89476–222–6
     Creator: Orie Endo et al.
     Publisher: Hituzi Syobo
     Contact person: Hituzi Syobo
     Price: 3990 JPY
     Subject.language: Japanese
     Format: 1 CD-ROM
  • Research on stories in chats among Japanese native speakers
     Type: Text
     Type.linguistics: transcription/dialogue
     Description: Transcriptions of chats (about 10 hours) between 15 pairs of female native speakers of Japanese aged 19 to 35. ISBN 978–4–87424–194–3
     Creator: Gendai Nihongo Kenkyukai
     Publisher: Kuroshio Publisher
     Contact person: Kuroshio Publisher (frontier_series(at)nifty.ne.jp)
     Price: 3990 JPY
     Subject.language: Japanese
     Date: 2000
     Format: PDF
  • Japanese-English English-Japanese Corpus Dictionary of Science and Technology
     Type: Text
     Description: Collection of Japanese translations for 15,000 English sentences excerpted from books, magazines and pamphlets about science and technology. ISBN 4–621–04991–7
     Annotation.document: keyword
     Creator: Atsushi Tomii
     Publisher: Maruzen CO.,LTD.
     Contact person: Maruzen CO.,LTD.
     Price: 18900 JPY
     Subject.language: English, Japanese
     Format: 1 CD-ROM
     URI: http://pub.maruzen.co.jp/cd_others/ko-pas/index.html (in Japanese)
  • Web Japanese N-gram version 1
     Type: Text
     Type.linguistics: n-gram
     Description: N-grams obtained from public Web pages written in Japanese and crawled by Google. It contains 1-grams to 7-grams that occur 20 or more times in 20 billion sentences.
     Creator: Google Inc.
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 21,600 JPY for personal members of GSK, 43,200 JPY for personal non-members, 43,200 JPY for organization members, 86,400 JPY for organization non-members
     Subject.language: Japanese
     Date: 2007
     Rights: Academic use only
     Format: 6 DVD-ROM
     Format.encoding: Unicode
     URI: http://www.gsk.or.jp/en/catalog/gsk2007-c/
     Usage Case: (Show in new window in Japanese)
  • Data Collection for Textual Entailment
     Type: Text
     Type.linguistics: data collection
     Description: Data collection for evaluating recognition of textual entailment (RTE). It consists of 2,700 sets, each annotated with a four-level score for the degree of entailment. Each set is classified into one of five categories: inclusion, word (noun), word (predicate), syntax and inference.
     Creator: Kurohashi and Kawahara laboratory, Kyoto University
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: Free
     Subject.language: Japanese
     Date: 2010
     URI: http://nlp.ist.i.kyoto-u.ac.jp/index.php?Textual%20Entailment%20%E8%A9%95%E4%BE%A1%E3%83%87%E3%83%BC%E3%82%BF (in Japanese)
  • Baidu Blog/BBS Timed Corpus
     Type: Text
     Type.linguistics: n-gram
     Description: N-grams (1-grams to 3-grams) of Japanese words obtained from BBS and blog texts crawled by Baidu. It provides N-grams for every month from January 2000 to July 2010.
     Creator: Baidu Japan
     Price: Free
     Date: 2010
     Format.encoding: UTF-8
     URI: http://www.baidu.jp/corpus/ (in Japanese)
  • Baidu Mobile Web Corpus with Pictograph
     Type: Text
     Type.linguistics: n-gram
     Description: N-grams (1-grams to 5-grams) of Japanese words obtained from texts crawled by Baidu for mobile search. It contains pictographs.
     Creator: Baidu Japan
     Price: Free
     Date: 2010
     Format.encoding: UTF-8
     URI: http://www.baidu.jp/corpus/ (in Japanese)
  • Rakuten Data
     Type: Text
     Description: Various data from Rakuten, Inc. (1) “Rakuten Ichiba”: all product data (approx. 50 million items). (2) “Rakuten Travel”: facility data (11,468 facilities), review data (350,000 reviews, 340,000 evaluations). (3) “Rakuten GORA” (Rakuten’s golf service): facility data (1,669 facilities), review data (320,000 reviews). The data is available via NII or ALAGIN.
     Creator: Rakuten Institute of Technology
     Subject.language: Japanese
     Date: 2010
     URI: http://rit.rakuten.co.jp/rdr/index_en.html
     Usage Case: (Show in new window in Japanese)
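
The frequency cutoff described for Web Japanese N-gram above (1-grams to 7-grams that occur at least 20 times) can be sketched in Python as follows. This is only a minimal illustration on a toy, pre-tokenized input; the function names are made up for this example, and it does not reproduce how the actual corpus was tokenized or built.

```python
from collections import Counter


def count_ngrams(sentences, max_n=7):
    """Count every 1-gram up to max_n-gram over pre-tokenized sentences."""
    counts = Counter()
    for tokens in sentences:
        for n in range(1, max_n + 1):
            # Slide a window of width n across the sentence.
            for i in range(len(tokens) - n + 1):
                counts[tuple(tokens[i:i + n])] += 1
    return counts


def filter_by_frequency(counts, min_count=20):
    """Keep only the n-grams that occur at least min_count times."""
    return {gram: c for gram, c in counts.items() if c >= min_count}


# Toy input (an assumption for illustration): one sentence repeated
# 20 times plus one sentence that appears only once.
toy_sentences = [["今日", "は", "晴れ"]] * 20 + [["明日", "は", "雨"]]
frequent = filter_by_frequency(count_ngrams(toy_sentences), min_count=20)
```

Here n-grams never cross sentence boundaries, which is a simplifying choice of this sketch. At Web scale the counts would not fit in a single in-memory Counter, so a real pipeline would count in shards and merge the results before applying the cutoff.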

Speech

  • ATR Speech Database
     Type: Collection
     Description: The database of read speech consisting of six different sets.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Subject.language: Japanese, English
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • ATR Speech Database (Set A)
     Type: Sound
     Type.linguistics: transcription/read speech
     Description: Japanese read speech by 20 different speakers, 8,500 different words.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 600,000 JPY (research purpose)
     Subject.language: Japanese
     Format: 1 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database (Set B)
     Type: Sound
     Type.linguistics: transcription/read speech
     Description: Japanese read speech by 10 different speakers, 503 different sentences.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 350,000 JPY (research purpose)
     Subject.language: Japanese
     Format: 1 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database (Set C)
     Type: Sound
     Type.linguistics: transcription/read speech
     Description: Japanese read speech by 20 different speakers, 84 different documents.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 600,000 JPY (research purpose)
     Subject.language: Japanese
     Format: 1 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database (Set D)
     Type: Sound
     Type.linguistics: transcription/read speech
     Description: Japanese read speech by 4 different speakers, 400 different documents.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 270,000 JPY (research purpose)
     Subject.language: Japanese
     Format: 1 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database (Set E)
     Type: Sound
     Type.linguistics: transcription/read speech
     Description: English read speech by 4 different speakers, 5,000 different words.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 270,000 JPY (research purpose)
     Subject.language: English
     Format: 1 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database (Set F)
     Type: Sound
     Type.linguistics: transcription/read speech
     Description: English read speech by 6 different speakers, 1,100 different sentences.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 600,000 JPY (research purpose)
     Subject.language: English
     Format: 1 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Spoken Dialogue Database
     Type: Sound
     Type.linguistics: transcription/dialogue
     Description: Five sets of speech data of simulated conversations between a travel agency and a customer. 892 conversations in Japanese, and 618 in Japanese and English. Transcription and morphological annotation are also available.
     Annotation.corpus: word segmentation, part-of-speech
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 180,000 JPY per set (research purpose)
     Subject.language: Japanese, English
     Format: 4 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database of Many Speakers
     Type: Collection
     Description: Speech database uttered by many speakers.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Subject.language: Japanese
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • ATR Speech Database of Many Speakers (Conversation)
     Type: Sound
     Type.linguistics: transcription/conversation
     Description: Speech database uttered by many speakers. 3,774 speakers held simulated conversations about scheduling a meeting. This database is divided into 4 sets.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 180,000 JPY per set (research purpose), 1,000,000 JPY per set (commercial use)
     Subject.language: Japanese
     Format: 3–5 CD-ROM per set.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database of Many Speakers (Sentence)
     Type: Sound
     Type.linguistics: transcription/read sentence
     Description: Speech database uttered by many speakers. 3,774 speakers read phonetically balanced sentences. This database is divided into 4 sets.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 180,000 JPY per set (research purpose), 1,000,000 JPY per set (commercial use)
     Subject.language: Japanese
     Format: 7–10 CD-ROM per set.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ATR Speech Database of Many Speakers (Dictionary)
     Type: Sound
     Type.linguistics: transcription/read sentence
     Description: Speech database uttered by many speakers. 3770 speakers read sentences in Japanese dictionaries.
     Creator: Advanced Telecommunications Research Institute International, Japan
     Contact person: Advanced Telecommunications Research Institute International, Japan
     Price: 180,000 JPY (research purpose), 1,000,000 JPY (commercial use)
     Subject.language: Japanese
     Format: 5 CD-ROM.
     URI: http://www.red.atr.co.jp/database_main.html (in Japanese)
  • ASJ Continuous Speech for Research
     Type: Sound
     Type.linguistics: transcription/dialogue
     Description: Speech database containing the following three contents: (a) ATR 503 phonetically balanced sentences (read speech) uttered by 64 speakers (30 males and 34 females), total 9,600 sentences. (b) Various guide task sentences (read speech) uttered by 36 speakers (18 males and 18 females), total 12,474 sentences. (c) 37 simulated dialogues with transcribed texts uttered by 37 speakers (29 males and 8 females).
     Creator: The Acoustical Society of Japan
     Contact person: Nishigaki, Shigeo (AI and Fuzzy Promotion Center, Japan Information Processing Development Center (JIPDEC), 3–5–8 Shibakoen, Minatoku, Tokyo 105, JAPAN, TEL. +81–3–3432–9390, FAX. +81–3–3431–4324)
     Price: 3090 JPY + carrying charge
     Subject.language: Japanese
     Format: 7 CD-ROM. Sampling: 16kHz, 16bits.
     Usage Case: (Show in new window in Japanese)
  • ASJ Continuous Speech Corpus — Japanese Newspaper Article Sentences(JNAS) — 
     Type: Sound
     Type.linguistics: transcription/dialogue
     Description: JNAS contains speech recordings and their orthographic transcriptions of 306 speakers (153 males and 153 females) reading excerpts from the Mainichi Newspaper and the ATR 503 PB-Sentences. All utterances and sentences are in Japanese.
     Creator: The Acoustical Society of Japan
     Contact person: Miyai, Chiyoko (Media Drive Co., Ltd. ) chiyoko(at)mediadrive.co.jp
     Price: carrying charge
     Subject.language: Japanese
     Format: 16 CD-ROM. Sampling: 16kHz, 16bits.
     URI: http://www.milab.is.tsukuba.ac.jp/jnas/instruct.html
     Usage Case: (Show in new window in Japanese)
  • ETL Spoken Dialog Corpus 1998
     Type: Sound
     Type.linguistics: transcription/dialogue
     Description: A corpus of dialogues between a system and a human on a town guidance task, recorded with the Wizard of Oz method. It can be used to analyze turn taking, head nodding, interruptions, replies to interruptions and so on. It consists of 162 dialogues by 33 speakers, more than 1,000 minutes in total. It contains speech data, pitch patterns, transcriptions, tags marking the beginnings and endings of dialogues, and semantic representations of utterances.
     Creator: Advanced Industrial Science and Technology (AIST)
     Contact person: GSK (Gengo Shigen Kyokai)
     Price: 32,400 JPY for personal members of GSK, 64,800 JPY for personal non-members, 64,800 JPY for organization members, 129,600 JPY for organization non-members
     Subject.language: Japanese
     Date: 1998
     Rights: Research purpose only
     Format: 1 DVD-R (3.66GB)
     Format.encoding: UTF-8
     URI: http://www.gsk.or.jp/en/
  • ETL Phonetic Balanced Word Set WD-I & II
     Type: Sound
     Type.linguistics: transcription/word
     Description: Speech data of a phonetically balanced word set uttered by 10 male speakers. WD-I consists of 492 words, while WD-II consists of 1,542 words. WD-I is a subset of WD-II.
     Creator: Electrotechnical Laboratory, Japan
     Contact person: Tanaka, Kazuyo (kaz.tanaka(at)aist.go.jp)
     Price: carrying charge
     Subject.language: Japanese
     URI: http://unit.aist.go.jp/is/speech/etlwd12a.html
  • The JEIDA Japanese Common Speech Corpus — The DAT version — 
     Type: Sound
     Description: This corpus is composed of 323 items (110 monosyllables, 178 isolated words, and 35 four-digit sequences), each repeated 4 times. The data amounts to 120 hours in total, contained on 76 DAT cassettes. Each item is uttered by 75 male and 75 female speakers, ranging in age from 20 to 60. The total number of samples is 193,800.
     Creator: Japan Electronics Industry Development Association
     Contact person: Sasaki (Sunrise Music Incorporated, Roppongi Fuji Bldg. 4F, 4–11–10 Roppongi, Minato-ku, Tokyo, 106, Japan, Tel: +81–3–3408–6541, Fax: +81–3–3408–1505)
     Subject.language: Japanese
     Format: Sampling: 44kHz, 16bits.
  • Continuous Speech Data (“Shiken Kenkyu”, Grant-in-Aid for Scientific Research, MEXT)
     Type: Sound
     Type.linguistics: transcription/
     Description: Speech database of various monosyllables, words, short sentences and documents uttered by 6 males and 6 females.
     Creator: Itahashi Lab., University of Tsukuba, Japan
     Contact person: Itahashi, Shuichi (itahashi(at)milab.is.tsukuba.ac.jp)
     Price: free (CD-ROM, for researchers), 70,000 JPY (DAT)
     Subject.language: Japanese
     Format: CD-ROM or DAT. Sampling: 16kHz, 16bit.
  • Dialect Speech Database
     Type: Sound
     Description: Speech Database of Japanese dialects. It is only distributed to universities and national research institutes.
     Creator: Tahara, Hiroshi (Osaka Shoin Women’s Univ., Japan), Egawa, Kiyoshi (National Institute for Japanese Language, Japan)
     Contributor: Grant-in-Aid for Scientific Research on Priority Areas on “Spoken Japanese”, provided by MEXT (Ministry of Education, Culture, Sports, Science and Technology, Japan)
     Contact person: Tahara, Hiroshi (Osaka Shoin Women’s Univ., Tel. +81–6–723–8181, Fax. +81–6–723–8881), Egawa, Kiyoshi (National Institute for Japanese Language, Tel. +81–3–3900–3111, Fax. +81–3–3906–3530)
     Subject.language: Japanese
     Format: 19 Audio CD. 3 CD-ROM.
  • “Juten Ryoiki Kenkyu” Speech Dialogue Corpus
     Type: Sound
     Type.linguistics: transcription/dialogue
     Description: This corpus consists of speech and transcriptions of 93 dialogues. (“Juten Ryoiki Kenkyu” means Grant-in-Aid for Scientific Research on Priority Areas, the name of the fund that supported this corpus.)
     Creator: Doshita, Shuji
     Contributor: Grant-in-Aid for Scientific Research on Priority Areas on “Understanding and Generating Dialogue by Integrated Processing of Speech, Language and Concept”, provided by MEXT (Ministry of Education, Culture, Sports, Science and Technology, Japan)
     Contact person: Media Drive Co., Ltd. (juten-corpus(at)mediadrive.co.jp)
     Price: 10,000 JPY
     Subject.language: Japanese
     Format: 4 CD-ROM.
     URI: http://winnie.kuis.kyoto-u.ac.jp/taiwa-corpus/ (in Japanese)
  • RWCP-DB-SPEECH-96-I (RWC Speech Dialogue Database)
     Type: Sound
     Type.linguistics: transcription/dialogue
     Description: Speech and transcriptions of 28 dialogues in the task “plan of travel abroad” and 28 in “purchase of a car”. Distribution of this corpus is now suspended.
     Creator: Real World Computing Partnership, Japan
     Subject.language: Japanese
     Format: 4 CD-ROM.
  • Tohoku University — Matsushita Speech Database of Words
     Type: Sound
     Description: Speech database of words. It is only distributed to universities and national research institutes.
     Creator: Makino, Shozo. Niyada, Masayuki. Mafune, Hiroo. Kido, Ken’ichi
     Contact person: Makino, Shozo (Tohoku Univ., Tel. +81–22–262–3469, Fax. +81–22–262–3469)
  • 100 Place Name Database by Shirai Lab. in Waseda University
     Type: Sound
     Description: Speech data of 100 place names. Each word was uttered twice by 12 male speakers.
     Creator: Shirai Lab., Waseda University, Japan
     Contact person: Ohira, Shigeki (ohira(at)shirai.info.waseda.ac.jp)
     Subject.language: Japanese
     Format: Sampling: 12.5kHz, 12bit.
  • Phonetic Balanced Word Set by Doshita Lab. in Kyoto University
     Type: Sound
     Description: Speech data of a phonetically balanced word set uttered by 48 males and 16 females.
     Creator: Doshita Lab., Kyoto University, Japan
     Contact person: Kawahara, Tatsuya (kawahara(at)kuis.kyoto-u.ac.jp)
     Format: Sampling: 16kHz, 16bit.
  • Power Shift Corpus V1–2009
     Type: Sound
     Description: A corpus of spontaneous speech by elderly men and women. The topics are their memories of the Taisho and early Showa eras and of playing in their childhood.
     Creator: Straight Word Inc.
     Publisher: Power Shift Inc.
     Contact person: Power Shift Inc. (http://www.powershift.co.jp/company/form.html)
     Price: 550,000 JPY + tax
     Subject.language: Japanese
     URI: http://www.powershift.co.jp/it/corpus.html (in Japanese)
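As a quick sanity check on the JEIDA Japanese Common Speech Corpus entry above, the stated totals can be reproduced from the item, repetition, and speaker counts it quotes. This is only a small arithmetic sketch; the variable names are my own and every figure comes from that entry.

```python
# Figures quoted in the JEIDA corpus description above.
monosyllables = 110
isolated_words = 178
digit_sequences = 35
repetitions = 4
speakers = 75 + 75  # 75 male and 75 female speakers

# Number of distinct items, then items x repetitions x speakers.
items = monosyllables + isolated_words + digit_sequences
total_samples = items * repetitions * speakers

print(items)          # 323
print(total_samples)  # 193800
```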

Morphological analyzer

  • JUMAN
     Type: Software
     Type.functionality: morphological analyzer
     Description: A user-extensible morphological analyzer for Japanese. The latest version is 7.0 (as of January 2012).
     Creator: Kurohashi and Kawahara laboratory, Kyoto University
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: free
     Subject.language: Japanese
     Format: 4 MB.
     Format.os: unix,MSWindows
     Format.sourcecode: C
     URI: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN
     Usage Case: (Show in new window in Japanese)
  • JUMAN++
     Type: Software
     Type.functionality: morphological analyzer
     Description: A morphological analyzer for Japanese that uses a Recurrent Neural Network Language Model (RNNLM). By considering the semantic fluency of word sequences, it achieves much higher accuracy than JUMAN or MeCab. The formats of the grammar, lexicon, output, and so on are inherited from JUMAN.
     Creator: Kurohashi and Kawahara laboratory, Kyoto University
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: free
     Subject.language: Japanese
     Format: 631 MB.
     Format.os: unix
     Format.sourcecode: C++
     URI: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?JUMAN%2B%2B
  • ChaSen
     Type: Software
     Type.functionality: morphological analyzer
     Description: ChaSen is a free Japanese morphological analyzer. It grew out of the development of JUMAN version 2.0 and significantly improved system performance. ChaSen version 1.0 was officially released on 19 February 1997 by the Computational Linguistics Laboratory, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST). The latest version is 2.2.9, released on 8 February 2002.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (chasen(at)is.aist-nara.ac.jp)
     Price: free
     Subject.language: Japanese
     Format: 3.3MB.
     Format.os: unix,MSWindows
     Format.sourcecode: C
     URI: http://chasen.naist.jp/hiki/ChaSen/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Sumomo
     Type: Software
     Type.functionality: morphological analyzer
     Description: A morphological analyzer for Japanese. It was customized to rapidly output only the most appropriate result. It also identifies unknown words with a simple scheme.
     Creator: NTT Communication Science Laboratory, Japan
     Contact person: Washisaka, Koichi (wasisaka(at)nttlabs.com), Yamazaki, Kenichi (yamazaki(at)t.onlab.ntt.co.jp)
     Price: free
     Subject.language: Japanese
     URI: http://www.t.onlab.ntt.co.jp/sumomo/index.html (in Japanese)
  • Breakfast
     Type: Software
     Type.functionality: morphological analyzer
     Description: A fast morphological analyzer. Users can analyze sentences with their own morphological grammar.
     Creator: Fujitsu Laboratories, Japan
     Contact person: Sassano, Manabu (bf-staff(at)ling.flab.fujitsu.co.jp)
     Price: free
     Subject.language: Japanese
     Format.os: Windows 95, NT 3.51, NT 4.0
     URI: http://www.labs.fujitsu.com/free/breakfast/index.html (in Japanese)
  • MeCab
     Type: Software
     Type.functionality: morphological analyzer
     Description: A reimplementation of the morphological analyzer ChaSen that is 3–4 times faster than the original.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Kudo, Taku (taku-ku(at)is.aist-nara.ac.jp)
     Price: free
     Date: 2001
     Format.os: unix
     URI: http://taku910.github.io/mecab/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • KyTea (Kyoto Text Analysis Toolkit)
     Type: Software
     Description: A general toolkit for analyzing text, with a focus on Japanese, Chinese, and other languages that require word or morpheme segmentation. It can perform word segmentation, pronunciation tagging, and POS tagging. Users can also train their own models.
     Creator: Graham Neubig, Tetsuro Sasada, Shinsuke Mori
     Contact person: Graham Neubig, Tetsuro Sasada, Shinsuke Mori
     Price: free
     Subject.language: Japanese etc.
     Date: 2009
     Rights: Apache License Version 2
     Format.os: Linux, Mac OS X, Cygwin
     URI: http://www.phontron.com/kytea/
     Usage Case: (Show in new window in Japanese)
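To make the idea of dictionary-based segmentation concrete, here is a minimal sketch that picks the lowest-cost way to split a string into dictionary words using dynamic programming. It is a toy illustration, not the algorithm of any specific tool listed above: the lexicon, the costs, and the `segment` function are all hypothetical, and real analyzers also use costs between adjacent words and part-of-speech information.

```python
# Toy lexicon: surface form -> cost (lower is preferred). Hypothetical values.
LEXICON = {
    "東京": 3,
    "京都": 3,
    "東": 4,
    "都": 3,
    "に": 1,
    "住む": 3,
}
UNKNOWN_COST = 10  # fallback cost for a single unknown character


def segment(text, lexicon=LEXICON, unknown_cost=UNKNOWN_COST):
    """Return the minimum-cost segmentation of text as a list of words."""
    n = len(text)
    max_len = max(len(w) for w in lexicon)
    best = [float("inf")] * (n + 1)  # best[i]: min cost of segmenting text[:i]
    back = [0] * (n + 1)             # back[i]: start index of the last word
    best[0] = 0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_len), end):
            word = text[start:end]
            if word in lexicon:
                cost = lexicon[word]
            elif end - start == 1:
                cost = unknown_cost  # treat an unknown character as a word
            else:
                continue
            if best[start] + cost < best[end]:
                best[end] = best[start] + cost
                back[end] = start
    # Follow the back-pointers from the end of the text.
    words = []
    end = n
    while end > 0:
        start = back[end]
        words.append(text[start:end])
        end = start
    return words[::-1]


print(segment("東京都に住む"))  # ['東京', '都', 'に', '住む']
```

With these toy costs, the split 東京 / 都 costs 6 while 東 / 京都 costs 7, so the first one wins. Characters missing from the lexicon fall back to single-character words at a high cost, which is one simple way to keep the search from failing on unknown input.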

Parser

  • KNP
     Type: Software
     Type.functionality: syntactic analyzer
     Description: A syntactic analyzer for Japanese. KNP first identifies “Bunsetsu” (chunks of words) boundaries in an input sentence, then analyzes the dependencies between them. The latest version is 4.0 (as of January 2012).
     Creator: Kurohashi and Kawahara laboratory, Kyoto University
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: free
     Subject.language: Japanese
     Format: 145 KB.
     Format.os: unix
     Format.sourcecode: C
     URI: http://nlp.ist.i.kyoto-u.ac.jp/EN/index.php?KNP
     Usage Case: (Show in new window in Japanese)
  • MSLR parser tool kit
     Type: Software
     Type.functionality: morphological and syntactic analyzer
     Description: This tool kit contains a morphological and syntactic LR parser (MSLR parser) and some related tools. The MSLR parser performs morphological and syntactic analysis simultaneously. A grammar and a dictionary for analyzing Japanese are included in the package, and users can replace them with their own.
     Creator: Tokyo Institute of Technology, Japan
     Contact person: Tokunaga Laboratory, Tokyo Institute of Technology (mslr(at)cl.cs.titech.ac.jp)
     Price: free
     Subject.language: Japanese
     Format: 1.5 MB.
     Format.os: unix
     Format.sourcecode: C
     Usage Case: (Show in new window in Japanese)
  • SAX
     Type: Software
     Type.functionality: Tool for syntactic analysis
     Description: It generates a syntactic analyzer, a Prolog program based on a bottom-up chart algorithm, from a definite clause grammar (DCG). It requires SICStus Prolog.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (nlt(at)is.aist-nara.ac.jp)
     Price: free
     URI: http://chasen.naist.jp/sax.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • BUP
     Type: Software
     Type.functionality: Tool for syntactic analysis
     Description: It generates a syntactic analyzer, a Prolog program based on a left corner parsing method, from a definite clause grammar (DCG). It requires SICStus Prolog.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (nlt(at)is.aist-nara.ac.jp)
     Price: free
     URI: http://chasen.naist.jp/bup.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • CaboCha
     Type: Software
     Type.functionality: syntactic analyzer
     Description: A syntactic analyzer for Japanese based on Support Vector Machines.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Kudo, Taku (taku-ku(at)is.aist-nara.ac.jp)
     Price: free
     Subject.language: Japanese
     Date: 2001
     Format.os: unix, windows
     URI: http://code.google.com/p/cabocha/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
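The output of a bunsetsu dependency analyzer such as the ones above can be pictured as a list of chunks, each pointing at the chunk it depends on. The sketch below is my own minimal representation, not the output format of KNP or CaboCha. The `Bunsetsu` class and `is_well_formed` are hypothetical names, and the two constraints checked (every chunk depends on a later chunk, and arcs do not cross) are common working assumptions for Japanese dependency analysis rather than something stated in this catalogue.

```python
from dataclasses import dataclass


@dataclass
class Bunsetsu:
    text: str
    head: int  # index of the chunk this one depends on; -1 for the root


def is_well_formed(chunks):
    """Check head-final, non-crossing dependencies over bunsetsu chunks."""
    n = len(chunks)
    if n == 0:
        return True
    # The last chunk is the root; every other chunk points to a later chunk.
    if chunks[-1].head != -1:
        return False
    for i, chunk in enumerate(chunks[:-1]):
        if not (i < chunk.head < n):
            return False
    # No two dependency arcs may cross.
    for i in range(n - 1):
        for j in range(i + 1, chunks[i].head):
            if chunks[j].head > chunks[i].head:
                return False
    return True


# "Taro handed a book to Hanako": all three chunks depend on the verb.
sentence = [
    Bunsetsu("太郎は", 3),
    Bunsetsu("花子に", 3),
    Bunsetsu("本を", 3),
    Bunsetsu("渡した", -1),
]
print(is_well_formed(sentence))  # True
```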

Annotation tool

  • VisualMorphs
     Type: Software
     Type.functionality: assistant tool for constructing POS-tagged corpora
     Description: An assistant tool for constructing POS-tagged corpora. It is a GUI tool that graphically displays the output of a morphological analysis tool and lets human annotators modify it.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (chasen(at)cl.aist-nara.ac.jp)
     Price: free
     Subject.language: Japanese
     Date: 2001
     Format.os: unix, windows
     Format.sourcecode: java
     URI: http://chasen.naist.jp/vm/
     Usage Case: (Show in new window in Japanese)
  • Tagrin
     Type: Software
     Type.functionality: Annotation Tool
     Description: Tagrin is a task-independent, customizable, SGML-based annotation tool. It imports and exports annotated corpora in SGML format.
     Creator: Tetsuro Takahashi
     Contact person: Tetsuro Takahashi
     Price: free
     Format.os: windows, linux (Tcl/Tk)
     URI: http://kagonma.org/tagrin/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • FuuTag
     Type: Software
     Type.functionality: Annotation Tool
     Description: FuuTag is an annotation tool for SGML text. Its default configuration is based on Sekine’s Extended Named Entity hierarchy, and tag descriptions can be customized through a config file.
     Creator: Satoshi Sekine
     Contact person: Satoshi Sekine
     Price: free
     Format.os: unix, windows
     URI: http://nlp.cs.nyu.edu/ene/
     Usage Case: (Show in new window in Japanese)
  • ChaKi
     Type: Software
     Type.functionality: annotation tool
     Description: A tool kit supporting the development, search, and annotation of natural language corpora.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Nara Institute of Science and Technology, Japan
     Price: Free
     Format.os: Windows
     URI: http://sourceforge.jp/projects/chaki/releases/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • PDFAnno
     Type: Software
     Description: A browser-based linguistic annotation tool for PDF documents. It offers functions for various types of linguistic annotations, including part-of-speech, named entity, dependency relation, and coreference chain.
     Price: Free
     Date: 2016
     URI: https://github.com/paperai/pdfanno
     Usage Case: (Show in new window in Japanese)

Visualization tool

  • ViJUMAN
     Type: Software
     Type.functionality: Visualization tool for morphological analyzer
     Description: Visualization tool for the morphological analyzer “JUMAN”.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (vijuman-adm(at)cl.aist-nara.ac.jp)
     Price: free
     Subject.language: Japanese
     Format.os: unix
     URI: http://chasen.naist.jp/vi4ma.html (in Japanese)
  • ViCha
     Type: Software
     Type.functionality: Visualization tool for morphological analyzer
     Description: Visualization tool for the morphological analyzer “ChaSen”.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (vijuman-adm(at)cl.aist-nara.ac.jp)
     Price: free
     Subject.language: Japanese
     Format.os: unix
     URI: http://chasen.naist.jp/vi4ma.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • VisIPS
     Type: Software
     Type.functionality: Visualization tool for syntactic analyzer
     Description: Visualization tool for a syntactic analyzer. It displays the CKY table and syntactic trees graphically.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (nlt(at)is.aist-nara.ac.jp)
     Price: free
     Format.os: unix
     URI: http://chasen.naist.jp/visips.html (in Japanese)
  • TableDisplay
     Type: Software
     Type.functionality: visualization tool
     Description: A visualization tool for displaying the results of natural language analysis. Because it is implemented as a CGI program, it can be used on many platforms.
     Creator: Kurohashi and Kawahara laboratory, Kyoto University
     Contact person: Kurohashi and Kawahara laboratory, Kyoto University (nl-resource(at)nlp.ist.i.kyoto-u.ac.jp)
     Price: free
     URI: http://nlp.ist.i.kyoto-u.ac.jp/nl-resource/tabledisplay/index.cgi (in Japanese)
     Usage Case: (Show in new window in Japanese)

Search tool

  • SUFARY
     Type: Software
     Type.functionality: Tool for string matching
     Description: Software package for string matching using a suffix array.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Computational Linguistics Laboratory, Nara Institute of Science and Technology (sufary(at)cl.aist-nara.ac.jp)
     Price: free
     Format.os: unix
     Format.sourcecode: C
     URI: http://nais.to/%7Eyto/tools/sufary/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • Minise: MIni Search Engine
     Type: Software
     Type.functionality: full text search tool
     Description: Minise is a compact search engine supporting basic features. It performs full-text search queries using several types of indexes: sequential search, inverted file index, character N-gram (q-gram), and suffix array. Minise is intended for small to medium-sized document sets (e.g. 200,000 documents) and for academic and research purposes.
     Creator: Daisuke Okanohara
     Contact person: Daisuke Okanohara
     Price: Free
     Date: 2009
     Rights: Research purpose only
     Format.os: unix
     Format.sourcecode: C++
     URI: http://www-tsujii.is.s.u-tokyo.ac.jp/~hillbig/minise.htm
  • Bep: Associative Arrays for Very Large Collections
     Type: Software
     Type.functionality: library for associative array
     Description: A library of associative arrays for very large collections. It uses minimal perfect hash functions to store the collection compactly.
     Creator: Daisuke Okanohara
     Contact person: Daisuke Okanohara
     Price: Free
     Date: 2007
     Format.os: unix
     Format.sourcecode: C++
     URI: http://www-tsujii.is.s.u-tokyo.ac.jp/~hillbig/bep.htm
  • Tx: Succinct Trie Data structure
     Type: Software
     Type.functionality: library for Trie
     Description: A library for a compact trie data structure. It requires 1/4–1/10 of the memory usage compared to the previous implementations, and can therefore handle quite a large number of keys (e.g. 1 billion) efficiently.
     Creator: Daisuke Okanohara
     Contact person: Daisuke Okanohara
     Price: Free
     Date: 2007
     Format.os: unix
     Format.sourcecode: C++
     URI: http://code.google.com/p/tx-trie/
  • SimString
     Type: Software
     Type.functionality: library for string search
     Description: A simple library for fast approximate string retrieval. It can find strings in a database whose similarity with a query string is no smaller than a threshold. It is applicable for spelling correction, flexible dictionary matching, duplicate detection and so on.
     Creator: Naoaki Okazaki
     Contact person: Naoaki Okazaki
     Price: Free
     Date: 2010
     Format.os: Unix
     Format.sourcecode: C++
     URI: http://www.chokkan.org/software/simstring/index.html.en
     Usage Case: (Show in new window in Japanese)
  • SAGACE
     Type: Software
     Type.functionality: Concordancer and collocation extraction
     Description: A concordancer for languages with little inflection, such as Japanese. Users can search a corpus for word-sequence patterns using a dictionary. It is distributed under the CeCILL (free) license.
     Creator: Blin R.
     Contact person: blin(at)ehess.fr
     Price: Free
     Subject.language: Japanese, any language with little inflection
     Format.os: Linux
     URI: http://crlao.ehess.fr/japonais-coreen/corpus/sagace/sagace_jp.html (in Japanese)
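Several of the tools above (SUFARY, Minise) mention suffix arrays as an index for string matching. The following is a minimal sketch of that general technique, not the implementation or API of either tool; `build_suffix_array` and `find_all` are hypothetical names. It sorts all suffix start positions and then uses two binary searches to find the range of suffixes that begin with the query.

```python
def build_suffix_array(text):
    """Return start positions of all suffixes of text in sorted order."""
    # Naive construction: fine for a demo, too slow for large corpora.
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_all(text, sa, pattern):
    """Return the sorted start positions where pattern occurs in text."""
    m = len(pattern)
    # Lower bound: first suffix whose m-character prefix is >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose m-character prefix is > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[first:lo])


text = "banana"
sa = build_suffix_array(text)
print(sa)                         # [5, 3, 1, 0, 4, 2]
print(find_all(text, sa, "ana"))  # [1, 3]
```

Because the index works on raw characters, it needs no word segmentation, which is one reason this kind of index suits Japanese text.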

Machine Learning

  • TinySVM
     Type: Software
     Type.functionality: tool for training of Support Vector Machine
     Description: An implementation of Support Vector Machines.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Kudo, Taku (taku-ku(at)is.aist-nara.ac.jp)
     Price: free
     Date: 2001
     Format.os: unix
     URI: http://chasen.org/%7Etaku/software/TinySVM/
     Usage Case: (Show in new window in Japanese)
  • CRF++
     Type: Software
     Type.functionality: tool for training of Conditional Random Field
     Description: A simple, customizable, open source implementation of Conditional Random Fields (CRFs) for segmenting and labeling sequential data. It is designed for general-purpose use and can be applied to a variety of NLP tasks.
     Creator: Taku Kudo
     Price: free
     URI: http://sourceforge.net/projects/crfpp/
  • ohmm: Online training for Hidden Markov Model
     Type: Software
     Type.functionality: library for training of Hidden Markov Model
     Description: A library for learning hidden Markov models with the online EM algorithm. It is specialized for large-scale data, e.g. 1 million words. The output includes the parameters and the estimation results.
     Creator: Daisuke Okanohara
     Contact person: Daisuke Okanohara
     Price: Free
     Date: 2009
     Format.os: unix
     URI: http://www-tsujii.is.s.u-tokyo.ac.jp/~hillbig/ohmm.htm
  • OLL: Online Learning Library
     Type: Software
     Type.functionality: library for online learning
     Description: A library of online learning algorithms (Perceptron, Averaged Perceptron, Passive Aggressive, ALMA, Confidence Weighted Linear-Classification), specialized for large-scale but sparse learning tasks such as Natural Language Processing tasks. These algorithms are very efficient in terms of speed and space (linear in the number of examples and features), and their performance is comparable to batch-style learning methods such as SVMs and MEs. It provides a C++ library and stand-alone programs for learning and predicting.
     Creator: Daisuke Okanohara
     Contact person: Daisuke Okanohara
     Price: Free
     Date: 2008
     Format.os: unix
     Format.sourcecode: C++
     URI: http://code.google.com/p/oll/wiki/OllMainEn
  • CRFsuite
     Type: Software
     Type.functionality: tool for training of Conditional Random Field
     Description: A fast implementation of Conditional Random Fields (CRFs) for labeling sequential data.
     Creator: Naoaki Okazaki
     Contact person: Naoaki Okazaki
     Price: Free
     Date: 2007
     Format.os: Linux, Windows
     URI: http://www.chokkan.org/software/crfsuite/
     Usage Case: (Show in new window in Japanese)
  • Classias
     Type: Software
     Type.functionality: machine learning tool
     Description: A collection of machine-learning algorithms for classification. Currently, it supports L1/L2-regularized logistic regression (a.k.a. Maximum Entropy), L1/L2-regularized L1-loss linear-kernel Support Vector Machine (SVM), and Averaged Perceptron.
     Creator: Naoaki Okazaki
     Contact person: Naoaki Okazaki
     Price: Free
     Date: 2009
     Format.os: Unix, Windows
     URI: http://www.chokkan.org/software/classias/index.html.en
     Usage Case: (Show in new window in Japanese)
  • MACCORI: Marginal Containers Covering Relevant Items
     Type: Software
     Type.functionality: tool for combinatorial optimization problem
     Description: A tool for solving a combinatorial optimization problem similar to the knapsack problem. For example, it can be used for multi-document summarization, i.e. choosing a small number of important sentences (extracts) from a given set of source documents.
     Creator: Naoaki Okazaki
     Contact person: Naoaki Okazaki
     Price: Free
     Format.os: Unix, Windows
     URI: http://www.chokkan.org/software/maccori/
  • lda — a Latent Dirichlet Allocation package
     Type: Software
     Type.functionality: tool of Latent Dirichlet Allocation
     Description: Latent Dirichlet Allocation package written both in MATLAB and C (command line interface).
     Creator: Daichi Mochihashi
     Contact person: Daichi Mochihashi
     Price: Free
     Format.os: unix
     Format.sourcecode: C, MATLAB
     URI: http://chasen.org/~daiti-m/dist/lda/
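As an illustration of the online learning algorithms named in the OLL entry above, here is a minimal sketch of a binary averaged perceptron over sparse features. It is not OLL's code or API. The function names and the toy data are hypothetical, and the averaging is done the naive way (summing the weights after every example), whereas a practical implementation would typically use a more efficient update.

```python
from collections import defaultdict


def train_averaged_perceptron(data, epochs=10):
    """Train a binary averaged perceptron.

    data: list of (features, label), where features is a dict
    name -> value and label is +1 or -1. Returns averaged weights.
    """
    w = defaultdict(float)      # current weights
    total = defaultdict(float)  # running sum of weights after each example
    steps = 0
    for _ in range(epochs):
        for feats, label in data:
            score = sum(w[f] * v for f, v in feats.items())
            if label * score <= 0:  # mistake: move weights toward the label
                for f, v in feats.items():
                    w[f] += label * v
            for f, v in w.items():
                total[f] += v
            steps += 1
    return {f: s / steps for f, s in total.items()}


def predict(weights, feats):
    """Return +1 or -1 for a sparse feature dict."""
    score = sum(weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1


# Toy, linearly separable data (hypothetical bag-of-words features).
DATA = [
    ({"good": 1, "movie": 1}, 1),
    ({"bad": 1, "movie": 1}, -1),
    ({"good": 1, "book": 1}, 1),
    ({"bad": 1, "book": 1}, -1),
]
weights = train_averaged_perceptron(DATA)
print(predict(weights, {"good": 1, "film": 1}))  # 1
```

Each example is processed once per pass and only the features present in it are touched during an update, which is what makes this family of algorithms attractive for sparse NLP features.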

Tool(misc.)

  • YamCha
     Type: Software
     Type.functionality: chunker
     Description: YamCha is a generic, customizable, open source text chunker. It uses Support Vector Machines.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Kudo, Taku (taku-ku(at)is.aist-nara.ac.jp)
     Price: free
     Subject.language: Japanese
     Date: 2001
     Format.os: unix
     URI: http://chasen.org/%7Etaku/software/yamcha/
     Usage Case: (Show in new window in Japanese)
  • Lexical Chainers
     Type: Software
     Type.functionality: text processing
     Description: A tool for computing lexical chains in a text, i.e. sequences of words that are identical or share the same meaning.
     Creator: Mochizuki, Hajime
     Contact person: Mochizuki, Hajime (motizuki(at)tufs.ac.jp)
     Price: free for research purpose
     Subject.language: Japanese
     Format.os: unix
     Format.sourcecode: C
     URI: http://www.tufs.ac.jp/ts/personal/motizuki/software/chainers/
  • Posum (simple text summarizer)
     Type: Software
     Type.functionality: text summarizer
     Description: Summarization tool excerpting important sentences from a text.
     Creator: Mochizuki, Hajime
     Contact person: Mochizuki, Hajime (motizuki(at)tufs.ac.jp)
     Price: free for research purpose
     Subject.language: Japanese
     Format.os: unix
     Format.sourcecode: C, perl
     URI: http://www.tufs.ac.jp/ts/personal/motizuki/software/posumcl/
     Usage Case: (Show in new window in Japanese)
  • DL-MT
     Type: Software
     Type.functionality: text summarizer
     Description: Document reading assistant system for learners of Japanese, which segments Japanese text into words and shows English translations for each word.
     Creator: Mochizuki, Hajime
     Contact person: Mochizuki, Hajime (motizuki(at)tufs.ac.jp)
     Price: free for research purpose
     Subject.language: Japanese
     Format.os: unix
     Format.sourcecode: perl
     URI: http://www-cl.tufs.ac.jp/pub/tools/dlmt/dlmt.html
  • Julius
     Type: Software
     Type.functionality: speech recognition engine
     Description: Julius is a large-vocabulary continuous speech recognition decoder. It is based on word 3-gram language models and context-dependent HMMs.
     Contact person: julius(at)kuis.kyoto-u.ac.jp
     Price: free
     Subject.language: Japanese
     Date: 2002
     Format.os: unix, windows
     Format.sourcecode: C
     URI: http://julius.sourceforge.jp/en/julius.html
     Usage Case: (Show in new window in Japanese)
  • SynCha
     Type: Software
     Type.functionality: predicate argument structure analyzer
     Description: A tool for predicate argument structure analysis of Japanese.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Nara Institute of Science and Technology, Japan
     Price: Free
     URI: http://syncha.sourceforge.jp/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • YuCha
     Type: Software
     Type.functionality: predicate argument structure analyzer
     Description: A tool for predicate argument structure analysis of Japanese.
     Creator: Computational Linguistics Laboratory, Nara Institute of Science and Technology, Japan
     Contact person: Hayashibe Yuta
     Price: Free
     Date: 2010
     URI: http://hayashibe.jp/yucha/ (in Japanese)
  • Semantic Role Labeling System
     Type: Software
     Type.functionality: semantic role labeling
     Description: A tool for semantic role labeling. It accepts a Japanese sentence, identifies the sense of the predicate, and assigns semantic roles to its arguments.
     Creator: Takeuchi Laboratory, Okayama University
     Contact person: Takeuchi Laboratory, Okayama University
     Price: Free
     URI: http://cl.it.okayama-u.ac.jp/study/project/sea.html (in Japanese)
     Usage Case: (Show in new window in Japanese)
  • JACABIT Japanese term extraction system
     Type: Software
     Type.functionality: term extraction
     Description: Free software for extracting Japanese terms from plain text on the basis of POS-based morphological patterns.
     Creator: Takeuchi Laboratory, Okayama University
     Contact person: Takeuchi Laboratory, Okayama University
     Price: Free
     URI: http://cl.cs.okayama-u.ac.jp/rsc/jacabit/index.html
  • TETDM — Total Environment for Text Data Mining
     Type: Software
     Type.functionality: text mining tool
     Description: A package for text mining. It consists of 10 mining tools and 17 visualization interfaces. Users can customize and revise it.
     Creator: Challenge for Realizing Early Profits — Total Environment for Text Data Mining, (in The Japanese Society for Artificial Intelligence)
     Contact person: Wataru Sunayama (user-support(at)tetdm.jp)
     Price: free
     Format.os: Windows (XP, Vista, 7), Mac OS X
     Format.sourcecode: Java
     URI: http://www.sys.info.hiroshima-cu.ac.jp/people/sunayama/future/newfuture.html (in Japanese)
  • TermExtract
     Type: Software
     Description: Tool to extract technical terms from documents. Terms are extracted in 3 steps: (1) word segmentation by morphological analyzers, (2) identification of compound words, (3) calculation of significance scores. The target languages are Japanese and English. A web service called “Gensen-Web” is also provided.
     Creator: Hiroshi Nakagawa, Akira Maeda, Hiroyuki Kojima
     Contributor: Tatsunori Mori
     Contact person: gs-web(at)mm.itc.u-tokyo.ac.jp
     Price: Free
     Subject.language: Japanese, English
     Date: 2003
     Format.sourcecode: Perl module
     URI: http://gensen.dl.itc.u-tokyo.ac.jp/ (in Japanese)
     Usage Case: (Show in new window in Japanese)
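The three steps described in the TermExtract entry above can be sketched in a few lines. This is an illustrative toy, not TermExtract's code: step 1 (word segmentation) is assumed to have been done already by a morphological analyzer, the POS labels are hypothetical, and plain frequency stands in for the significance score, which the real tool computes differently.

```python
from collections import Counter


def extract_terms(tagged_tokens):
    """Join runs of consecutive nouns into candidate terms and count them.

    tagged_tokens: list of (surface, pos) pairs from step 1.
    Returns (term, frequency) pairs, most frequent first.
    """
    counts = Counter()
    run = []
    # A sentinel token flushes a noun run that ends the input.
    for surface, pos in list(tagged_tokens) + [("", "EOS")]:
        if pos == "noun":
            run.append(surface)      # step 2: extend the compound
        else:
            if run:
                counts["".join(run)] += 1
            run = []
    return counts.most_common()      # step 3: frequency as a stand-in score


# Hypothetical analyzer output for two short sentences.
tokens = [
    ("形態素", "noun"), ("解析", "noun"), ("の", "particle"),
    ("結果", "noun"), ("を", "particle"), ("使う", "verb"), ("。", "symbol"),
    ("形態素", "noun"), ("解析", "noun"), ("は", "particle"),
    ("速い", "adjective"), ("。", "symbol"),
]
print(extract_terms(tokens))  # [('形態素解析', 2), ('結果', 1)]
```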

In the next post, I will describe the techniques I will use to develop each of the modules described in this post for our IE task.

Stay tuned :) 
Please support this article if it helps you.