The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name.

In the creation of the TenTen corpora, data crawled from the World Wide Web are processed with natural language processing tools developed by the Natural Language Processing Centre at the Faculty of Informatics at Masaryk University (Brno, Czech Republic) and by the Lexical Computing company (developer of the Sketch Engine).

In corpus linguistics, a text corpus is a large and structured collection of texts that are electronically stored and processed. It is used for testing hypotheses about languages, validating linguistic rules, or studying the frequency distribution of words (n-grams) within languages. Electronically processed corpora provide fast search. Text processing procedures such as tokenization, part-of-speech tagging and word-sense disambiguation enrich corpus texts with detailed linguistic information. This makes it possible to narrow a search to particular parts of speech, word sequences or a specific part of the corpus.

The first text corpora were created in the 1960s, such as the 1-million-word Brown Corpus of American English. Over time, many further corpora were produced (such as the British National Corpus and the LOB Corpus), and work also began on larger corpora and on corpora covering languages other than English. This development was linked with the emergence of corpus creation tools that help achieve larger size, wider coverage, cleaner data etc.

The procedure by which TenTen corpora are produced is based on the creators' earlier research in preparing web corpora and the subsequent processing thereof. At the beginning, a huge amount of text data is downloaded from the World Wide Web by the dedicated SpiderLing web crawler. In a later stage, these texts undergo cleaning with the jusText tool, which removes non-textual material such as navigation links, headers and footers from the HTML source code of web pages, so that only full sentences are preserved. Finally, the ONION tool is applied to remove duplicate text portions from the corpus, which naturally occur on the World Wide Web due to practices such as quoting, citing, copying etc.

TenTen corpora follow a specific metadata structure that is common to all of them. Metadata is contained in structural attributes that relate to individual documents and paragraphs in the corpus. Some TenTen corpora can feature additional specific attributes.

- top-level domain – domain at the highest level of the hierarchical Domain Name System
- website – identification string defining a realm of administrative autonomy within the Internet
- web domain – collection of related web pages
- crawl date – date when the document was downloaded from the Web
- url – the Uniform Resource Locator referring to the document's source
- wordcount – number of words in the document
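
The deduplication step described above can be illustrated with a small Python sketch. This is not the ONION algorithm and does not use its interface; it is a simplified stand-in that drops only paragraphs whose normalized text has already been seen. The function names `normalize` and `deduplicate` are hypothetical and chosen purely for illustration.

```python
import hashlib
import re


def normalize(paragraph: str) -> str:
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return re.sub(r"\s+", " ", paragraph).strip().lower()


def deduplicate(paragraphs):
    """Return paragraphs in original order, keeping only the first copy of each.

    Assumption: exact-match deduplication on normalized text. A real web-corpus
    tool would also need to catch near-duplicates, which this sketch does not.
    """
    seen = set()
    unique = []
    for paragraph in paragraphs:
        key = hashlib.sha1(normalize(paragraph).encode("utf-8")).hexdigest()
        if key in seen:
            continue  # an identical paragraph was already kept
        seen.add(key)
        unique.append(paragraph)
    return unique


if __name__ == "__main__":
    crawled = [
        "A sentence quoted on many pages.",
        "Some original text.",
        "a sentence  quoted on many pages.",
    ]
    for paragraph in deduplicate(crawled):
        print(paragraph)
```

Storing a hash of each normalized paragraph instead of the paragraph itself keeps memory use modest when the input is very large, which is the situation a web-scale corpus would face.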