Earlier this week Akshat took some time out to chat with open source user and talented designer Ian ‘Izo’ Cylkowski about his work, his tools, and his thoughts on designing in open source.

If the name sounds familiar then it should. Ian is an active designer within the open source community – for example, he created the logo for the semantic app launch tool ‘Synapse’ and has been working with the Novacut team on creating a brand identity for the project.

Izo’s work has also been featured here on OMG! Ubuntu! numerous times. This ranges from us drooling over his wonderfully rich Natty wallpaper and achingly beautiful ‘Ubuntu Tablet Designs,’ to us helping spread word of his work with Dan Rabbit on creating a comprehensive guide detailing the capabilities of the Murrine theme engine.

So, tell us something about yourself.

The name’s Ian Cylkowski aka “Izo”. I’m 28, British, love simplicity and typography and have a rad beard. I run designbyIZO which is the home of my blog and design portfolio. I’m also an avid fan of FOSS and the Ubuntu operating system as well.

When and how did you become interested in FOSS and Ubuntu?

About four years ago, I think. A friend of mine introduced me to Ubuntu on another forum I was on (he was called “Cookieninja”, I believe). I was also extremely interested in computers and tech anyway, and had been venting my frustrations at a Windows XP system I was running that was slowly dying a painful and ugly death. He pointed me to a Linux operating system called “Ubuntu” (at the time, version 6.06 “Dapper Drake” had just been released).

Initially, I was rather wary, as my only knowledge of Linux was that a terminal was often required (I’ve never been a coder/programmer, all that black space terrifies me). He assured me that it was a lot simpler.

So, with practically nothing to lose, I downloaded the ISO, burnt it, booted it, installed it, fell in love with it and have been, bar a few forays with other distros, a consistent user of Ubuntu ever since.

Do you use free software for all of your design work?

Almost. In a typical logo and identity design project, for example, a lot of the development of the logo, typography and other design elements is done entirely in Inkscape, which, in some ways, I find much better than Adobe’s Illustrator (though Illustrator is still an excellent piece of software).

I will also break out GIMP when some form of web-safe image editing is required and use Scribus to design my documents, identity guidelines, brief questionnaires and more. It’s only when I absolutely must have Pantone colour system requirements that I then need to switch to either Windows or a Mac to load up Photoshop/Illustrator.

What are your biggest hurdles that stop you from going free software full time?

Mostly, colour management and that means having the software for things like the Spyder monitor calibrator, better colour management in GIMP (although I understand GIMP is currently struggling in terms of development as fundamental aspects of it are being rebuilt); GIMP, as it is, is pretty poor for IRL print-ready conditions.

Inkscape is pretty good for handling CMYK colour models but lacks better PDF exporting options, and Scribus is even better, but we really need to see in-built support for the Pantone CMS and the easy exporting of multiple strains of PDF formats.

To be honest, I’d love to see the day when the likes of GIMP, Inkscape and Scribus are all unified and integrated into one great FOSS graphic design suite, that would be bodacious, but whether this will happen or not is another question. Plus, Scribus really needs to get ported to GTK.

What motivates you to contribute to FOSS projects?

FOSS is very, very important and it has dramatically changed how software is developed, distributed and used in the last decade alone. It signified a powerful shift of power from the companies developing the software to developers and users.

Nowadays, anyone is free to download a Linux-based OS, install it, modify it, share it, distribute it and more. This means the pace of development in the FOSS world is staggeringly quick. Someone can download a piece of software, check out its source code and submit patches and improvements in a matter of minutes. In the corporate world, such a pace of development is hindered by checks, policy checks, and the wait for many nods of approval from various management positions.

So it’s easy to see how FOSS has changed the game and I think it’s very important that people now have the extremely viable option of running software on their computer that they are then completely free to modify and distribute without fear of retribution and legal punishment from massive global conglomerates… because sometimes just a single person can create something incredible that many millions of people can enjoy. FOSS has become a global community endeavour, sharing, reciprocating, collaborating… it’s a beautiful thing to see.

And, in my mind, one of FOSS’ greatest success stories has been Ubuntu.

When you took up the task of designing a brand identity for Novacut, what expectations did you have?

Well, before one of my blog commentators alerted me to the fact that Novacut required a complete visual identity system, I honestly had never heard of the project; though I think that’s largely down to me not really following the development of video editors.

As I got in touch with the project leader, Jason DeRose, and his team, I began to realise that these guys meant business and were extremely ambitious with their goals and desires for Novacut.

It’s been extremely refreshing to see such a coordinated drive and determination in the project. Plus, the guys in Novacut place a strong priority on quality design; this is something I don’t often see in the FOSS world. There are many hundreds and thousands of incredible FOSS apps, but I can’t honestly say a lot of them are beautiful to look at and use. So it was nice to see the Novacut guys wanting to change this.

It has, so far, been an enlightening experience working for them.

What would you like to tell designers who are new to the Free Software world or don’t know about it?

Largely, be patient, exciting stuff is appearing over the horizon.

GIMP is in the process of being rebuilt with GEGL as the foundation, so that images edited in GIMP can finally move beyond the 8-bit world. Inkscape is getting better all the time. Scribus is immensely powerful.

If you need a graphic design tool for Linux, ask around, there will be someone out there who has been using awesome graphic design apps for Linux that you had no idea existed.

When you first switched to Ubuntu, what were your biggest frustrations with the tools available at that time? Do you think they have progressed a lot since then?

When I first started using Ubuntu, a couple of exciting things were happening, one of which was desktop compositing. Compiz was in heavy development, which forked into Beryl and then the two merged again as Compiz Fusion. Being able to manipulate windows and menus with flashy animations and incredible effects was terribly exciting, especially considering that some of the effects you could do with Compiz were just not possible with the likes of Mac OS X or Windows (see the Cube Desktop Switch effect as just one example).

Of course, compositing required the use of 3D-enabled hardware acceleration drivers for your graphics card and they were very much in their infancy. I remember installing, with success, an nVidia graphics driver on my Dapper installation, installing and enabling Compiz, falling in love all over again and then being driven to despair when a kernel upgrade totally broke Xorg and my display. That was frustrating. Enabling MP3 playback, at that time, was also rather tricky, especially considering that there appeared to be hundreds of different ways to do it.

So yeah, it wasn’t an easy ride back then but the changes since have been incredible. MP3 playback can be enabled during the installation by clicking on a checkbox. A default Ubuntu new install, now, will seek out your graphics hardware and recommend 3D-enabled drivers for your discretion. Installation of these is painless. So these two elements in the Linux world have dramatically improved. And now, of course, we’re seeing open-source 3D-enabled drivers for the likes of ATi and nVidia, which has enabled the new GNOME3 and Shell desktop environment, as well as KDE4.x, to enable compositing by default.

Is there anyone in the free software design world that you admire?

There are a few. Jakub Steiner, who is an incredible icon designer for GNOME and SuSE, was instrumental in how GNOME Shell now looks, and also has mad insane skills in web development and 3D modelling and animations.

There’s also Sean Wilson aka “half-left” on DeviantART, who’s taken on theming GNOME 3 and the new GNOME Shell by the horns ever since pre-beta. Dude has some rad skills.

I also love the work of Harno, who’s only 18 and already has incredible Inkscape and icon design skills; he was also instrumental in the identity design of Novacut.

What do you want to tell our wannabe-designer readers?

Learn the basics, I can’t stress that enough.

Jumping onto a cracked copy of Photoshop and applying some blending options on some text doesn’t make you a designer. Start reading. Find some books from the great design masters of the past. Read about typography and learn to distinguish your titles from your stress lines. Read about grid systems and how they’re immensely important in the essential structuring of information and legibility. Read about design masters from the past. Read about design movements. Read about design history. Keep things simple. Look up Dieter Rams’ Ten Principles of Good Design.

Be not concerned with learning how to create specific design objects but instead come to understand the design process, which is much more important.

Ask questions. Question questions. Question answers. Learn what the design rules are, because there’s centuries of design experience that discovered awesome design rules before you. Then break the rules. As long as it looks good. Keep things simple. Keep. Things. Simple.

Get excited about colours. See how past design can be applied to new design in the realms of the web. Try to identify typefaces in everyday life. Try to make the world, in your own way, nicer to look at and easier to use.

Which open source projects do you think have good design and an intuitive interface?

In no specific order, just off the top of my head: Google’s Chromium is an excellent redesign of the web browser interface, their implementation of closing tabs, as an example, is inspired; Banshee looks nice and the recent inclusion of offering little avatars for the artist list is definitely an improvement; Amarok, I think, is beautiful; the new Ubuntu One Control Panel is a vast improvement; Ubuntu’s new shell, Unity, is stunning as is GNOME’s new shell, GNOME Shell (for the record, I think both have their good and bad points, and I can see where one should borrow ideas from the other, I need to make that into a post soon); the KDE4.x desktop environment is also totally stunning. It’s an exciting time for design in the FOSS world.

Which projects have you contributed to?

I was once asked quite some time ago by Seif Lotfy of Zeitgeist to design a new logo for the Zeitgeist Project, which I did and then submitted it to the rest of the team. I think that one kind of dwindled away though, there didn’t appear to be any real need for a new logo. Mr. Lotfy also asked me to design the logo and icon for the Synapse smart launcher, which is now currently in use.

Of course, there’s the on-going work I’m producing for Novacut, and in the near future I may be working on the new logo and identity design for the Luz Live Visual Studio app, which looks pretty neat.

I’ve also contributed various wallpapers to Ubuntu and have produced some rather popular GTK and PekWM (a lightweight window manager) themes in the past.

What are some tips for a successful migration from Photoshop/other proprietary tools to GIMP etc.?

Patience, largely. You’ll need to get used to a different interface and, consequently, a different workflow. You’ll have to get used to different filter names in GIMP, for example, as well as generally different names for tools that you may have used a lot in the Creative Suite.

At this point, the internet is your friend so ask around, do a little Googling, plus Ubuntu has a magnificent community, someone will be able to help you.

With thanks to Ian.

Contents

Tools: Machine Translation, POS Taggers, NP chunking, Sequence models, Parsers, Semantic Parsers/SRL, NER, Coreference, Language models, Concordances, Summarization, Other
Corpora: Large collections, Particular languages, Treebanks, Discourse, WSD, Literature, Acquisition
SGML/XML
Dictionaries
Lexical/morphological resources
Courses, Syllabi, and other Educational Resources
Mailing lists
Other stuff on the Web: General, IR, IE/Wrappers, People, Societies

Tools

Machine Translation systems

Instructions

Building a baseline statistical phrase MT system
Wonderful pages about how to download a bunch of tools and some data and put them together to build a very competent baseline statistical MT system: NAACL 2006 WMT or 2009 WMT.

Freely downloadable

Moses
The most-used open-source phrase-based MT decoder. By Philipp Koehn and many others.
Phrasal
A Java phrase-based MT decoder, largely compatible with the core of Moses, with extra functionality for defining feature-rich ML models. By Daniel Cer, Michel Galley, Spence Green, and others.
Joshua
A Java hierarchical MT decoder, largely based on the design of Hiero. By Chris Callison-Burch and others.
Jane
A phrase-based MT decoder by the U. Aachen group.
cdec
A primarily SCFG-based MT decoder by Chris Dyer and many others. C++.
EGYPT system
System from 1999 JHU workshop. Mainly of historical interest.
GIZA++ and mkcls
Franz Och. C++. GPL. Still often used for word alignment.
Thot
Phrase-based model building kit
Phramer
An Open-Source Java Statistical Phrase-Based MT Decoder
Syntax Augmented Machine Translation via Chart Parsing
Andreas Zollmann and Ashish Venugopal

Free, but getting them requires hassle

Pharaoh decoder
Philipp Koehn, ISI.
MTTK
Machine Translation Tool Kit. Deng and Byrne.

Part of Speech Taggers

Freely downloadable

Stanford POS tagger
Loglinear tagger in Java (by Kristina Toutanova)
hunpos
An HMM tagger with models available for English and Hungarian. A reimplementation of TnT (see below) in OCaml. Pre-compiled models. Runs on Linux, Mac OS X, and Windows.
MBT: Memory-based Tagger
Based on TiMBL
TreeTagger
A decision tree based tagger from the University of Stuttgart (Helmut Schmid). It's language independent, but comes complete with parameter files for English, German, Italian, Dutch, French, Old French, Spanish, Bulgarian, and Russian. (Linux, Sparc-Solaris, Windows, and Mac OS X versions. Binary distribution only.) Page has links to sites where you can run it online.
SVMTool
POS Tagger based on SVMs (uses SVMlight). LGPL.
ACOPOST (formerly ICOPOST)
Open source C taggers originally written by Ingo Schröder. Implements maximum entropy, HMM trigram, and transformation-based learning. C source available under GNU public license.
MXPOST: Adwait Ratnaparkhi's Maximum Entropy part of speech tagger
Java POS tagger. A sentence boundary detector (MXTERMINATOR) is also included. Original version was only JDK1.1; later version worked with JDK1.3+. Class files, not source.
fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.
mu-TBL
An implementation of a Transformation-based Learner (a la Brill), usable for POS tagging and other things by Torbjörn Lager. Web demo also available. Prolog.
YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
QTAG Part of speech tagger
An HMM-based Java POS tagger from Birmingham U. (Oliver Mason). English and German parameter files. [Java class files, not source.]
The TOSCA/LOB tagger.
Currently available for MS-DOS only. But the decision to make this famous system available is very interesting from an historical perspective, and for software sharing in academia more generally. LOB tag set.
The venerable Brill's Transformation-based learning Tagger
A symbolic tagger, written in C. It's no longer available from a canonical location, but you might find a version from the Wikipedia page or you could try a reimplementation such as fnTBL.
Original Xerox Tagger
A Common Lisp HMM tagger available by ftp.
Lingua-EN-Tagger
Perl POS tagger by Maciej Ceglowski and Aaron Coburn. Version 0.11. (A bigram HMM tagger.)
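
Several of the taggers above are HMM taggers. As a rough illustration of what such a tagger does at decoding time, here is a minimal Viterbi sketch for a bigram HMM in Python. It is not code from any of the listed packages; the function name, the probability floor used in place of real smoothing, and the toy probabilities are all made up for the example.

```python
import math

# Toy model (hypothetical numbers, chosen only for illustration).
TAGS = ["DET", "NOUN", "VERB"]
START = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS = {
    "DET": {"DET": 0.05, "NOUN": 0.9, "VERB": 0.05},
    "NOUN": {"DET": 0.1, "NOUN": 0.2, "VERB": 0.7},
    "VERB": {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1},
}
EMIT = {
    "DET": {"the": 0.9},
    "NOUN": {"dog": 0.5, "barks": 0.1},
    "VERB": {"barks": 0.6, "dog": 0.05},
}


def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a bigram HMM."""
    if not words:
        return []

    def lp(p):
        # Log-probability with a small floor instead of proper smoothing.
        return math.log(max(p, floor))

    # best[i][t] = (log-prob of the best path ending in tag t at word i, previous tag)
    best = [{t: (lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0)), None)
             for t in tags}]
    for i in range(1, len(words)):
        column = {}
        for t in tags:
            emit = lp(emit_p[t].get(words[i], 0.0))
            column[t] = max(
                (best[i - 1][p][0] + lp(trans_p[p].get(t, 0.0)) + emit, p)
                for p in tags
            )
        best.append(column)

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return path[::-1]


tagged = viterbi(["the", "dog", "barks"], TAGS, START, TRANS, EMIT)
```

A real tagger estimates these tables from a tagged corpus and handles unknown words far more carefully than the fixed floor used here.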

Free, but require registration

TATOO
The ISSCO tagger. HMM tagger. Need to register to download.
PoSTech Korean morphological analyzer and tagger
Online registration.
TnT - A Statistical Part-of-Speech Tagger
Trainable for various languages, comes with English and German pre-compiled models. Runs on Solaris and Linux.

Usable by email or on the web, but not distributed freely

Memory-based tagger
From ILK group, Catholic University Brabant (Jakub Zavrel/Walter Daelemans). Does Dutch, English, Spanish, Swedish, Slovene. Other MBL demos are also available.
Birmingham tagger
Accepts only plain ASCII email message contents. The tagset used is similar to the Brown/LOB/Penn set.
CLAWS tagger
The UCREL CLAWS tagger is available for trial use on the web. (It's limited to 300 words though -- this site is more of an advertisement for licensing the real thing -- available as software for Suns or as a paid service.) You can also find info on CLAWS tagsets, though that page doesn't seem to link to the C7 tagset.
The AMALGAM tagger
The AMALGAM Project also has various other useful resources, in particular a web guide to different tag sets in common use. The tagging is actually done by a (retrained) version of the Brill tagger (q.v.).
Xerox XRCE MLTT Part Of Speech Taggers
Tags any of 14 languages (European and Arabic), online on the web.
Portuguese taggers on the web: Projecto Natura and a QTAG adaptation.

Not free

Lingsoft
Lingsoft in Finland has (symbolic) analysis tools for many European languages. More information can be obtained by emailing info@lingsoft.fi. There is an online demo.
Conexor
Conexor in Finland has demonstrations of EngCG-style taggers and parsers, for English, Swedish, and Spanish.
Xerox
Xerox has morphological analyzers and taggers for many languages. There are demos of some of their tools on the web. More information can be obtained by contacting Daniella Russo.
Infogistics
Infogistics, an Edinburgh spinoff has a tagging and NP/Verb group chunker available commercially, including an evaluation version.

No longer available

LT POS and LT TTT
The Edinburgh Language Technology Group tagger and text tokenizer (and sentence splitter) were binary-only Solaris tools which no longer seem to be available.

NP chunking

Downloadable

YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)
Mark Greenwood's Noun Phrase Chunker
A Java reimplementation of Ramshaw and Marcus (1995).
fnTBL
A fast and flexible implementation of Transformation-Based Learning in C++. Includes a POS tagger, but also NP chunking and general chunking models.

Generic sequence models

Downloadable

CRF++
Generic CRF-based model in C++. Open source. By the author of YamCha.
Carafe
Generic CRF-based sequence models in OCaml. Open source. By Ben Wellner.
FreeLing
A large suite of language analyzers. Written in C++. Covers text preprocessing, morphology, NER, POS tagging, parsing.

Parsers

Information on available probabilistic parsers can be found on the FSNLP: probabilistic parsing links page.

Semantic Parsers

Downloadable

ASSERT
PropBank semantic roles (and opinions, etc.) by Sameer Pradhan.
Shalmaneser
FrameNet-based by Katrin Erk.
Tree Kernels in SVMlight by Alessandro Moschitti.
A general package, but it has particularly been used for SRL.

Named Entity Recognition

Downloadable

Stanford Named Entity Recognizer
A Java Conditional Random Field sequence model with trained models for Named Entity Recognition. Java. GPL. By Jenny Finkel.
LingPipe
Tools include statistical named-entity recognition, a heuristic sentence boundary detector, and a heuristic within-document coreference resolution engine. Java. GPL. By Bob Carpenter, Breck Baldwin and co.
YamCha
SVM-based NP-chunker, also usable for POS tagging, NER, etc. C/C++ open source. Won CoNLL 2000 shared task. (Less automatic than a specialized POS tagger for an end user.)

Coreference (Anaphora) Resolution

Downloadable

Stanford Deterministic Coreference Resolution System
Winner of CoNLL 2011 shared task, with subsequent improvements. Distributed as part of Stanford CoreNLP. Heeyoung Lee and others. Java. GPL.
Reconcile
By Ves Stoyanov and others. Java. GPL.
Illinois Coreference Package
Java. University of Illinois Research and Academic Use License.
Berkeley Coreference Resolution
Greg Durrett et al. Mainly Scala. GPL.
BART
A Beautiful Anaphora Resolution Toolkit. By Yannick Versley and many others. Java. Apache with GPL components.
Guitar
Java. GPL.

Language modeling toolkits

Downloadable

IRSTLM Toolkit
Compatible with SRILM, suitable for very large language models. LGPL. By Marcello Federico, Nicola Bertoldi et al.
CMU-Cambridge Statistical Language Modeling toolkit

Downloadable, but requires registration

The SRI Language Modeling toolkit
by Andreas Stolcke is another good system for building language models, freely available for research purposes.

Not yet classified

Lextools
A package of tools for creating weighted finite-state transducers (WFST) from high-level linguistic descriptions. Lextools binaries are available free for non-commercial use at: http://www.research.att.com/sw/tools/lextools/. Supported platforms are: linux (i686), sgi (mips2) and sun4. Lextools is built on top of, and requires, the AT&T WFST toolkit (version 3.6), available free for non-commercial use from: http://www.research.att.com/sw/tools/fsm/

Friendly concordancing and text analysis tools

Wordsmith Tools (Mike Scott)
The thing to get if you are working in the Windows world.

Text summarization tools

A prototype Java Summarisation applet (System Quirk)
MEAD
A public domain portable multi-document summarization system. (Dragomir Radev and others.)

Other

Downloadable

Tilburg University's TiMBL
Tilburg's Memory Based Learner by Walter Daelemans et al. A general near-neighbour-based machine learning package, but optimized for statistical NLP applications.
splitta
Statistical sentence boundary detection by Dan Gillick.
Time Expression taggers
TIMEX2 standard taggers (site at Mitre).
NLTK
An open source Python package for NLP application development with tools such as tokenization, POS tagging and parsers by Ed Loper and Steven Bird.
Ted Pedersen's code
Ngram Statistics Package: Perl code that implements: Fisher's exact test, the likelihood ratio, Pearson's chi squared test, the Dice Coefficient, and Mutual Information; Duluth Senseval-2 word sense disambiguation systems; Senseval-1 data in Senseval-2 format; various other WSD datasets in Senseval formats, and semantic distances derived via WordNet.
ISIPtools
The main aim is a publicly available speech recognition system (alpha release available), but along the way there are also toolkits for discrete HMMs and statistical decision trees, and for various aspects of signal processing.
Mem. A Perl implementation of Generalized and Improved Iterative Scaling
by Hugo WL ter Doest.
Automorphology
A system (for Windows) for automatically learning the morphological forms of words in a corpus by John Goldsmith.
Wordnet
Wordnet is available by ftp, compiled for a variety of machine types. For money, one can also get EuroWordNet for various European languages, an Italian/English/Spanish MultiWordNet and there's now a site for Global Wordnet. (See also Mappings between WordNet versions and Perl WordNet-Similarity module by Ted Pedersen, and WordNet Domains (coarse-grained sense topic classifications).)
Penn XTAG project
A wide-coverage tree-adjoining grammar written in a mixture of C and Common Lisp. Also includes a large coverage morphological analyzer. Now includes more tools such as TCL/Tk tree viewer.
Dan Melamed's Assorted Tools
A collection of various tools including a simulated annealing program, a post-processor for English stemming for the Penn XTAG morphology system, Good-Turing smoothing software, general text processing tools, text statistics tools and bitext geometry tools (mainly written in Perl 5).
MULTEXT
Constructing corpora and tools for processing multilingual corpora. Contact: Jean Veronis veronis@univ-aix.fr. Some stuff including a multilingual text editor is downloadable. MULTEXT EAST has parallel versions of Orwell's 1984 available free (upon registration) for a number of Central European languages.
Naive Bayes algorithm
Software from the Rainbow/Libbow software package that implements several algorithms for text categorization, including naive Bayes, TF.IDF, and probabilistic algorithms. Accompanies Tom Mitchell's ML text.
HDDI
Text Data Mining API from Lehigh University.
Emdros: a text database engine for linguistic analysis and research
Chasen
Japanese morphological analyzer. Descendent of JUMAN.
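
Two of the association measures named under Ted Pedersen's Ngram Statistics Package above, the Dice coefficient and (pointwise) mutual information, are simple to compute from bigram counts. The Python sketch below uses the standard textbook definitions over a bigram contingency table; it is not the package's own Perl code, and the function name and sample sentence are invented for the example.

```python
import math
from collections import Counter


def bigram_association(tokens):
    """Map each adjacent word pair to (Dice coefficient, pointwise mutual information)."""
    bigrams = Counter(zip(tokens, tokens[1:]))

    # Marginal counts: how often a word fills the first / second bigram slot.
    first = Counter()
    second = Counter()
    for (w1, w2), n in bigrams.items():
        first[w1] += n
        second[w2] += n
    total = sum(bigrams.values())

    scores = {}
    for (w1, w2), n11 in bigrams.items():
        dice = 2 * n11 / (first[w1] + second[w2])
        pmi = math.log2(n11 * total / (first[w1] * second[w2]))
        scores[(w1, w2)] = (dice, pmi)
    return scores


scores = bigram_association("new york is in new york state".split())
dice_ny, pmi_ny = scores[("new", "york")]
```

On this tiny sample "new" is always followed by "york" and "york" is always preceded by "new", so the pair gets a Dice score of 1.0. On real corpora both measures are known to overrate rare pairs, which is one reason the package also offers tests such as the likelihood ratio.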

Free, but require registration

Stuttgart's IMS Corpus Workbench (CWB)
A workbench for full-text retrieval from large corpora (with a query language and corpus indexing). Includes the Corpus Query Processor (CQP) and xkwic. Available free for research groups (currently only as Solaris 1/2 or Linux binaries), on signing a license agreement.
Gate
University of Sheffield's General Architecture for Text Engineering. Primarily an Information Extraction system.
MITRE's Alembic Workbench
A workbench for the development of tagged corpora. Includes a tagger based on Brill's TBL approach.
SNoW
SNoW is a learning program that can be used as a general purpose multi-class classifier and is specifically tailored for learning in the presence of a very large number of features. The learning architecture is a sparse network of linear units over a pre-defined or incrementally acquired feature space (Dan Roth).

Unsure

INTEX
a finite-state transducer analysis system for English, French, and Italian that runs under NextStep. Contact: Max Silberztein silberz@ladl.jussieu.fr

The Penn Tools page collects information on a variety of NLP systems, many of which are available externally.

Corpora

Large collections aimed at the NLP community

LDC (Linguistic Data Consortium) and its catalogue by year.
Email: ldc@ldc.upenn.edu. Provides the largest range of corpora on CD-ROM. Cost ranges from cheap (e.g., ACL-DCI disk) to pricey. CDs can be purchased individually; institutions can become members and receive discounts on CDs. There's an LDC Online service for searches over the web (mainly intended for members, but there are samplers available).
European Language Resources Association and its catalogue.
Distribution agency is ELDA. Rapidly growing collection of materials in European languages.
ICAME (International Computer Archive of Modern English)
Sells various corpora (including Brown and London-Lund). Information on corpora on the web, by sending the message help to fileserv@nora.hd.uib.no, by ftp to nora.hd.uib.no. Also, manuals for these corpora.
Reuters @NIST
Reuters corpora are now distributed by NIST.
TRACTOR
TELRI Research Archive of Computational Tools and Resources. Corpora, many multilingual, in European community languages. Small fee for joining in order to be able to get corpora (unless you have contributed corpora).
CLR (Consortium for Lexical Research)
Email: lexical@nmsu.edu. Focuses more on language processing tools and lexicons, but does have some corpora. As of Feb 1996, you can get most of their stuff by anonymous ftp to clr.nmsu.edu. Their catalog is available as a postscript file.
OTA (Oxford Text Archive)
Provides mainly literary texts. Has a bright new website. Email: info@ota.ahds.ac.uk. Most materials are available on the web or by anonymous ftp to ota.ox.ac.uk. Some require negotiations with the providers.
Leipzig Corpora Collection
Sentence collections in MySQL database for 17 mainly European languages.
BNC (British National Corpus)
A 100 million word corpus of British English. You can search it online from their simple web interface or via View, a much better interface by Mark Davies, and there is an index to genres by David Lee. And now, an XML edition.
European Corpus Initiative Multilingual Corpus I (ECI/MCI)
A 98 million word corpus, covering most of the major European languages, as well as Turkish, Japanese, Russian, Chinese, and Malay. Cheap. Need to sign a license agreement available at either the WWW site. Also available from the LDC.
Survey of English Usage
At the Department of English Language and Literature at University College London. Includes the British part of ICE, the International Corpus of English project. Now available tagged, and parsed for function. 83,419 sentences. Includes ICECUP, dedicated retrieval software. Also, Diachronic Corpus of Present-Day Spoken English (800,000 words, tagged and parsed, half from ICE-GB and half from London-Lund).
International Corpus of English (ICE)
Million word collections of English from various world Englishes: ICE-NZ, ICE-HK, ICE-East Africa, etc. Several of them are downloadable from this site.
Corpora held by Lancaster University
This link provides its own annotations.
The European Language Activity Network
Promises a uniform query language for accessing corpora in all EU languages -- but isn't quite there yet.
Talkbank.
Rich video and transcripts.

Particular languages

English

English language corpora available from the sites above are not repeated here.

Corpora by Geoffrey Sampson's team
The SUSANNE corpus and the CHRISTINE corpus (SUSANNE markup of a speech corpus).
Michigan Corpus of Academic Spoken English (MICASE)
1.7 million words from 1997-2001.
Penn-Helsinki Parsed Corpus of Middle English
A syntactically annotated corpus of the Middle English prose samples in the Helsinki Corpus of Historical English, with additions. 1.3 million words. $200.
Corpus of Professional, Spoken American-English (CPSA)
2 million words from faculty and committee meetings and White House press conferences (50K word sample free on internet).
Lancaster Parsed Corpus
Dialogue Diversity Corpus (Bill Mann)
American National Corpus

Chinese

English language corpora available from the sites above are not repeated here.

The Lancaster Corpus of Mandarin Chinese (LCMC)
By Tony McEnery and Richard Xiao. Distinguished by being a balanced corpus, and freely available.

Multilingual

JRC-Acquis
A parallel corpus of EU documents across all member states. 8 million words or more in each of 20 languages.
EMILLE/CIIL
Monolingual written corpus data for 14 SouthAsian languages (Assamese, Bengali, Gujarati, Hindi, Kannada, Kashmiri,Malayalam, Marathi, Oriya, Punjabi, Sinhala, Tamil, Telegu and Urdu).Orthographically transcribed spoken data and parallelcorpus data for five South Asian languages (Bengali, Gujarati, Hindi,Punjabi and Urdu). In addition, the parallel corpus contains the Englishoriginals from which the translations stored in the corpus were derived.All data in the corpus is CES and Unicode compliant. The EMILLE corpustotals some 94 million words. Downloadable.
OPUS
An open source parallel corpus, aligned, in many languages, based on free Linux etc. manuals.
World Health Organization Computer Assisted Translation page.
Also includes a good selection of links on Computer Assisted Translation. (See also the copyright page.)
Searchable Canadian Hansard French-English parallel texts (1986-1993)
From the Laboratoire de Recherche Appliquée en Linguistique Informatique, Université de Montréal
European Union web server
Parallel text in all EU languages. (In particular try European legislation.)
TELRI CD-ROMs
Parallel and other text in Central and Eastern European languages.

Bosnian

The Oslo Corpus of Bosnian Texts.

Czech

Parallel Czech-English
Literature translations in Czech and English
Czech National Corpus project: SYN2000
100 million words of contemporary Czech.

French

Association des Bibliophiles Universels
Various French literary works.
American and French Research on the Treasury of the French Language (ARTFL)
150 million word corpus of various genres of French. You have to be a member to use it (but membership is fairly cheap).

German

COSMAS Corpus
Large (over a billion words!) online-searchable German and Austrian corpora. This is the publicly available part of the 1.85 billion word Mannheimer Corpus Collection.
NEGRA Corpus
Saarland University Syntactically Annotated Corpus of German Newspaper Texts. 20,000 sentences, tagged, and with syntactic structures. Available free of charge to academics.

Russian

Russian National Corpus
150 million words, 5 million words POS-tagged, some in dependency treebank.
Library of Russian Internet Libraries
Various literary works.

Slovene

Slovene-English parallel corpus
1 M words, free to download + on-line concordances.
Coming soon: Slovene reference corpus of 100 M words

Croatian

Croatian National Corpus
100 M words

Spanish and Portuguese

Tycho Brahe Parsed Corpus of Historical Portuguese
Over a million words of Portuguese from different historical periods, some of it morphologically analyzed/tagged. Free.
Information about Mark Davies' collection of (mainly historical) Spanish and Portuguese.
It's not clear what their availability is.
The CUMBRE corpus. Contact Professor Aquilino Sánchez
The CRATER Spanish corpus
Morphosyntactically tagged telecommunication manuals, available by ftp.
Corpus resources for Portuguese
In total about 70 million words, available free, from various sources (newswire, etc.)
Folha de S. Paulo newspaper
4 annual CDROMs with full text.
COMPARA
Portuguese-English parallel corpus. (In general, see the various resources at the Linguateca site.)
See also under ELRA, above.

Swedish

Spraakdata, Department of Swedish, Göteborgs University.
Has various searchable part-of-speech-tagged Swedish corpora (Parole, Bank of Swedish, etc.), and some material in Zimbabwean languages.

Treebanks

Name | Language | Size | Availability | Comments
Penn Treebank | US English | 2 million + words | Available (distributed by LDC) | 1 million WSJ, 1 million speech, surface syntax (1970s TG)
BLLIP WSJ corpus | US English | 30 million words | Available (distributed by LDC) | WSJ newswire. Automatically parsed, not hand checked. Same structure as Penn Treebank, except for some additional coreference marking
ICE-GB | UK English | 1 million words (83,394 sentences) | Available; c. 500 pounds | British part of ICE, the International Corpus of English project. Tagged and parsed for function. Half spoken material.
Bulgarian Treebank | Bulgarian | n/a | POS-tagged texts and dependency analyses are available (some are free on the web, others via a license agreement) | An under-construction Bulgarian HPSG treebank.
Penn Chinese Treebank | Chinese | 100,000 words | Available (LDC) | Based on Xinhua news articles. 1980s-style GB syntax.
The Prague Dependency Treebank 1.0 | Czech | 500,000 words | Free on completion of license agreement (available through LDC). | Analyzed at the levels of parts of speech, syntactic functions (and, in the future, semantic roles) in a dependency framework. Text from newspapers and weekly magazines.
Danish Dependency Treebank 1.0 | Danish | 100,000 words | Available free under the GPL. | Built on a portion of the Parole corpus.
Alpino Dependency Treebank | Dutch | 150,000 words | Freely downloadable | Assorted subcorpora. By far the largest is the full cdbl (newspaper) part of the Eindhoven corpus.
NEGRA Corpus | German | 20,000 sentences | Available free of charge to academics on completion of license agreement. | Saarland University Syntactically Annotated Corpus of German Newspaper Texts. Tagged, and with syntactic structures.
TIGER corpus | German | 700,000 words | Available free of charge for research purposes on completion of license agreement. | German newspaper text (Frankfurter Rundschau). Semi-automatically parsed. They also have a good treebank search tool, TIGERSearch.
Icelandic Parsed Historical Corpus (IcePaHC) | Icelandic | 1,000,000 words | Free download (LGPL) | Texts from 1150 through 2008!
TUT: Turin University Treebank | Italian | 2,400 sentences | Free download. | Morphological analysis and dependency analysis. Penn Treebank translation. Civil law and newspaper texts.
Floresta Sintá(c)tica | Portuguese | 168,000 words hand-corrected; 1,000,000 words automatically parsed | Hand-corrected part is free web download; automatically parsed part available through email contact | Text from CETEMPúblico corpus. Phrase structure and dependency representations. Available in several formats, including Penn Treebank format.
Talbanken05 | Swedish | 300,000 words | Free download | Resurrects and modernizes an early treebank from the 1970s.
Verbmobil Tübingen: under construction treebanked corpus of German, English, and Japanese sentences from Verbmobil (appointment scheduling) data
Syntactic Spanish Database (SDB). University of Santiago de Compostela. 160,000 clauses / 1.5 million words.
CKIP Chinese Treebank (Taiwan). Based on Academia Sinica corpus. (There's also a 100 sentence Chinese treebank at U. Maryland.)
LDC Korean Treebank.
Dublin-Essex Treebank project
Deriving Linguistic Resources from Treebanks.
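
Several of the treebanks listed above are distributed in Penn Treebank-style bracketed format. As a rough illustration of what reading that format involves, here is a minimal Python sketch. It assumes every node has a label and that the input holds exactly one tree; real distributions often wrap each tree in an extra unlabeled bracket and include traces and comments, which this sketch does not handle. The function names `parse_bracketed` and `leaves` and the example sentence are made up for illustration.

```python
def parse_bracketed(text):
    """Parse one bracketed tree into nested (label, children) tuples.

    Assumes every node is labeled, e.g. "(S (NP (DT The) (NN cat)) ...)".
    """
    # Pad brackets with spaces so they split off as separate tokens.
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # word at a leaf
                pos += 1
        pos += 1  # consume the closing ")"
        return (label, children)

    return read_node()


def leaves(tree):
    """Return the words of a parsed tree, left to right."""
    _label, children = tree
    words = []
    for child in children:
        if isinstance(child, tuple):
            words.extend(leaves(child))
        else:
            words.append(child)
    return words


example = "(S (NP (DT The) (NN cat)) (VP (VBD sat)))"
tree = parse_bracketed(example)
print(tree[0], leaves(tree))
```

For real work, an existing treebank reader from an NLP toolkit is likely a better choice than a hand-rolled parser like this one.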

Treebanks

CSTBank: Cross-document Structure Theory: marking sentence functional relationships across related documents.

Resources for Word Sense Disambiguation

The Senseval web site
Has a comprehensive selection of resources for WSD, including a good list of WSD data resources, but not yet the new SEMCOR.
Ted Pedersen's code
Includes various WSD systems.
SenseClusters
Open source package for unsupervised discovery of word senses by clustering together instances of a word (or words) that are used in similar contexts in raw text, supporting a wide range of clustering techniques based on both context vectors and similarity matrices, and including links to SVDPACKC and CLUTO. Ted Pedersen and Amruta Purandare.
Evocation: WordNet synset similarity judgments
Judgments on how similar the meanings of synsets are and how common they are in the BNC from Jordan Boyd-Graber.

Literature

There are now quite large collections of online literature, available in various languages (though the majority are in English, of course). Below are pointers to some of the main collections:

Entirely or mainly English

Alex: A Catalogue of Electronic Texts on the Internet
Seems to have one of the largest collections. Searching and browsing facilities through gopher menus. Many languages.
Wiretap Electronic Text Archive
Extensive and good quality. Still in the gopher age, though.
The On-line Books Page
The index here only covers books in English, but there are lots of links to other collections of material in all languages.
Project Gutenberg
The oldest and largest project to get out-of-copyright literature online, freely available. (Or see the mirror, Sailor's Project Gutenberg site.)
The Electronic Text Center of the University of Virginia
Large collection of SGML text, mainly in English, but also in other major languages.
Center for Electronic Texts in the Humanities
Princeton/Rutgers collaboration. They didn't have it together with their web site when I stopped by, but they may soon.
Oxford Electronic Text Library Editions
Available from Oxford University Press, 200 Madison Ave, NY, NY 10016 212-679-7300. The Complete Works of Jane Austen is $95.00, and is reviewed in Computers and the Humanities, 28:4-5 (Aug/Oct, 1994), 317-321.
Coreference annotated texts
From University of Wolverhampton (R. Mitkov, C. Barbu et al.).

Acquisition data

CHILDES database.
Database of child language transcriptions in English and many other languages. Texts are also available by ftp. Certain usage requirements. Manuals and programs for accessing the data (the CLAN concordancer) are also available online. Now in Unicode XML.

SGML/XML

Robin Cover's SGML/XML Web Page
This is a wonderful compendium of information on SGML and XML, including information on the Text Encoding Initiative (TEI). This document is also a guide to many text collections (ones using SGML).
Information about the Text Encoding Initiative (TEI). (The Pizza Chef acts as a TEI tag set selector.)
Xaira
XML Aware Indexing and Retrieval Application. The successor of SARA.
Microsoft's XML page
W3C XML page.
The Corpus Encoding Standard.
An SGML instance designed for language engineering applications. Also the XML version.

Dictionaries

Dictionaries of subcategorization frames

The following dictionaries all list surface subcategorization frames (each with a different annotation scheme). They are also all available in electronic form from the publishers (not free).

COBUILD
Collins Cobuild English Language Dictionary. London: Collins, 1987. The COBUILD web site lets you search their Bank of English corpus (but you need to pay to get more than a trial).
LDOCE
Longman Dictionary of Contemporary English. Burnt Mill, Essex: Longman, 1978.
OALD
Oxford Advanced Learner's Dictionary of Current English. Oxford: Oxford University Press, Fourth Edition, 1989. The third edition also had information on subcategorization frames, although in a different, incompatible format. However, a partial version of the third edition (with this information) is available free online from the Oxford Text Archive.

Not exactly a dictionary, but other popular sources are:

Levin (1993)
Beth Levin. 1993. English Verb Classes and Alternations: A Preliminary Investigation. Chicago. Discusses linguistic distinctions (like unergative/unaccusative verbs, dative shift, etc.) not made by the above dictionaries. The index of verbs is online.
English subcategorization evaluation resources
Gold standard data, from Cambridge University (Anna Korhonen)

See also COMLEX and CELEX available from the LDC.

Dictionaries of assorted languages on the web

The old version of Robert Beard's Web of Online Dictionaries long ago mutated into YourDictionary.com. I'm told the IPO has been delayed. Nevertheless, it's the most comprehensive index of dictionaries available on the web.

Names

U.S. names with frequency information are available from the Census Bureau.

SGML structured dictionaries

Mac
Cambridge International Dictionary of English and other products in SGML.

Lexical/morphological resources

English SENSEVAL Resources
Dictionary entries and tagged examples for 35 words.
ARIES Natural Language Tools
Lexicons and morphological analysis for Spanish. There is a free Prolog demonstrator, but the real lexicons and C/C++ access tools cost money.

Courses, Syllabi, and other Educational Resources

'Techie'

Foundations of Statistical Natural Language Processing
Some information about, and sample chapters from, Christopher Manning and Hinrich Schütze's new textbook, published in June 1999 by MIT Press. Read about courses using this book.
Corpus-based Linguistics
Christopher Manning's Fall 1994 CMU course syllabus (a postscript file).
Statistical NLP: Theory and Practice
Christopher Manning's Spring 1996 CMU course materials.
John Lafferty and Roni Rosenfeld's Spring 1997 CMU course Language and Statistics.
Boston University (John D. Burger and Lynette Hirschman)
A good course and web site, by the looks!
Draft of
A tutorial on concordances and corpora by Cathy Ball
Tony Berber Sardinha's Corpus Linguistics course
Powerpoint slides in an interesting mixture of English and Portuguese (plus the rest of his homepage!)
Concordancing and corpus linguistics
Notes prepared by Phil Benson, Hong Kong University.
Computational Approaches to Collocations
Discussion of all the measures that have been used, and software for calculating them. By Evert and Krenn.
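
To give a feel for what a collocation measure computes, here is a minimal Python sketch of pointwise mutual information (PMI) over adjacent word pairs. PMI is only one commonly used association measure, picked here for illustration; it is not necessarily formulated this way in the material linked above. The function name `bigram_pmi` and the toy sentence are hypothetical, and the probabilities are plain relative frequencies with no smoothing or frequency cutoff.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """Return {(w1, w2): PMI in bits} for every adjacent pair in tokens.

    PMI(w1, w2) = log2( P(w1, w2) / (P(w1) * P(w2)) ), with all
    probabilities estimated as relative frequencies in the sample.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_unigrams = len(tokens)
    n_bigrams = len(tokens) - 1
    scores = {}
    for (w1, w2), count in bigrams.items():
        p_pair = count / n_bigrams
        p_w1 = unigrams[w1] / n_unigrams
        p_w2 = unigrams[w2] / n_unigrams
        scores[(w1, w2)] = math.log2(p_pair / (p_w1 * p_w2))
    return scores


tokens = "new york is a big city and new york is busy".split()
scores = bigram_pmi(tokens)
# Print the pairs from most to least strongly associated.
for pair, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(pair, round(score, 3))
```

On a sample this small the scores mean little: PMI is known to overrate rare pairs, which is one reason real studies compare several measures and apply frequency thresholds.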

Mailing lists

Mailing lists that have information on these topics include:

Corpora
The main mailing list for info on corpus-based linguistics. Subscribe by sending the message:
subscribe corpora
to listserv@uib.no. Or if you want to subscribe with a different email address, send:
subscribe corpora email-address
(Note that you're now speaking to a Majordomo server, not a listserv, so you don't send your name!). Or you can subscribe on the web.
Empiricist
The empiricist list appears to be defunct now. You used to send a 'subscribe' message to empiricists-request@unagi.cis.upenn.edu.

Other stuff on the Web

General resources

NIST Human Language Technology programs
Including: TREC, TIDES, ACE, ....
Text summarization
Tons of resources (tutorials, bibliographies, and software) for document summarization, maintained by Dragomir Radev.
PropositionBank @ UPenn
Statistical MT
Bookmarks for Corpus-based Linguists
An extensive annotated collection by David Lee, aimed at linguistics more than NLP (includes web-searchable corpora and concordancing options).
HLTCentral
European site aiming to increase transfer of language technologies to the commercial market. News, etc.
Linguistic annotation
A description of formats for linguistic annotation by Steven Bird.
CTI Textual Studies, University of Oxford, Guide to Digital Resources
Lists text analysis tools, corpora, and other stuff.
U. Essex W3-Corpora
Lots of teaching material, links, and online corpora.
Computational Linguistics and NLP (Kenji Kita, Tokushima U.)
A good, well-organized list of CL references, concentrating on corpus-based and statistical NLP methods. See also Software tools for NLP.
HLT Central
European Human Language Technology site
Survey of the State of the Art in Human Language Technology
ACL SIGLEX list of Lexical Resources
Online materials for a course on Learning Dynamical Systems at Brown University.
Lots of neat info.
Expert Advisory Group for Language Engineering Standards (EAGLES) home page
European standards organization.
Materials prepared for Michael Barlow's Corpus Linguistics course
Corpus Linguistics, University of Birmingham
Chris Brew's Teaching Materials for statistical NLP
Not much there last time I looked; you might also try his home page.
Edinburgh LTG HelpDesk's FAQ
Many of the questions in the FAQ concern issues related to corpora and tagging.
Content Analysis Resources
Qualitative Text Analysis, Concordances, etc.
MT paper archive
Lots of papers, etc.

Information Retrieval

The SMART IR system
ACM SIGIR
Managing Gigabytes
TREC conference
Text-based Intelligent Systems (Bruce Croft)

Information Extraction/Wrapper Induction

Introduction to Information Extraction Technology. A tutorial by Douglas E. Appelt and David Israel.
IE data sets
Updated versions (i.e., now well-formed XML) of classic IE data sets: Seminar Announcements and Corporate Acquisitions.
Web->KB. CMU World Wide Knowledge Base project (Tom Mitchell). Has a lot of the best recent probabilistic model IE work, and links to data sets.
RISE: Repository of Online Information Sources Used in Information Extraction Tasks, including links to people, papers, and many widely used data sets, etc. (Ion Muslea). Appears not to have been updated since 1999.
Message Understanding Conference (MUC) information. A US government funded information extraction exercise (from the 1990s).
Web IR and IE (Einat Amitay). Various links on IR and IE on the web.
Web question answering system (University of Michigan)
GATE: General Architecture for Text Engineering (Sheffield)
Genia Project. Biomedical text information extraction corpus (Tsujii lab). And IE tutorial slides.

People's homepages

Home pages with something useful on them.

University of Texas at Austin Machine Learning Research Group
Steven Abney (until 1997)
Adam Berger
Various stuff on statistical MT and maximum entropy models
Alex Chengyu Fang
Provides a lot of info on the kinds of things they get up to at UCL, without actually giving you anything to play with yourself.

Societies/Journals

International Quantitative Linguistics Association/Journal of Quantitative Linguistics
Not very hip.
Association for Computational Linguistics/Computational Linguistics
Hipper

Still under construction...

http://nlp.stanford.edu/links/statnlp.html
Christopher Manning -- <manning@cs.stanford.edu> -- Last modified: Sat Nov 29, 2014