tm log file (check_x86_64_windows)

* using log directory 'R:/run/building/build_2024-04-20-14-43/RF_PKG_CHECK/PKGS/tm.Rcheck'
* using R version 4.3.3 Patched (2024-04-09 r86444 ucrt)
* using platform: x86_64-w64-mingw32 (64-bit)
* R was compiled by
    gcc.exe (GCC) 12.3.0
    GNU Fortran (GCC) 12.3.0
* running under: Windows 10 x64 (build 19045)
* using session charset: UTF-8
* using option '--as-cran'
* checking for file 'tm/DESCRIPTION' ... OK
* this is package 'tm' version '0.7-13'
* checking CRAN incoming feasibility ... [14s] WARNING
Maintainer: 'Ingo Feinerer '

Insufficient package version (submitted: 0.7.13, existing: 0.7.13)

Days since last update: 0

Suggests or Enhances not in mainstream repositories:
  Rcampdf, tm.lexicon.GeneralInquirer
Availability using Additional_repositories specification:
  Rcampdf                      yes   https://datacube.wu.ac.at
  tm.lexicon.GeneralInquirer   yes   https://datacube.wu.ac.at
* checking package namespace information ... OK
* checking package dependencies ... NOTE
Packages suggested but not available for checking:
  'Rcampdf', 'Rpoppler', 'tm.lexicon.GeneralInquirer'
* checking if this is a source package ... OK
* checking if there is a namespace ... OK
* checking for hidden files and directories ... OK
* checking for portable file names ... OK
* checking serialization versions ... OK
* checking whether package 'tm' can be installed ... [26s] OK
* used C compiler: 'gcc.exe (GCC) 12.2.0'
* used C++ compiler: 'G__~1.EXE (GCC) 12.2.0'
* checking installed package size ... OK
* checking package directory ... OK
* checking for future file timestamps ... OK
* checking 'build' directory ... OK
* checking DESCRIPTION meta-information ... OK
* checking top-level files ... OK
* checking for left-over files ... OK
* checking index information ... OK
* checking package subdirectories ... OK
* checking R files for non-ASCII characters ... OK
* checking R files for syntax errors ... OK
* checking whether the package can be loaded ... OK
* checking whether the package can be loaded with stated dependencies ... OK
* checking whether the package can be unloaded cleanly ... OK
* checking whether the namespace can be loaded with stated dependencies ... OK
* checking whether the namespace can be unloaded cleanly ... OK
* checking loading without being on the library search path ... OK
* checking use of S3 registration ... OK
* checking dependencies in R code ... OK
* checking S3 generic/method consistency ... OK
* checking replacement functions ... OK
* checking foreign function calls ... OK
* checking R code for possible problems ... OK
* checking Rd files ... OK
* checking Rd metadata ... OK
* checking Rd line widths ... OK
* checking Rd cross-references ... NOTE
Packages unavailable to check Rd xrefs: 'tm.plugin.dc', 'readtext', 'Rpoppler', 'tau', 'tokenizers'
* checking for missing documentation entries ... OK
* checking for code/documentation mismatches ... OK
* checking Rd \usage sections ... OK
* checking Rd contents ... OK
* checking for unstated dependencies in examples ... OK
* checking contents of 'data' directory ... OK
* checking data for non-ASCII characters ... OK
* checking data for ASCII and uncompressed saves ... OK
* checking line endings in C/C++/Fortran sources/headers ... OK
* checking pragmas in C/C++ headers and code ... OK
* checking compilation flags used ... OK
* checking compiled code ... OK
* checking sizes of PDF files under 'inst/doc' ... NOTE
Unable to find GhostScript executable to run checks on size reduction
* checking installed files from 'inst/doc' ... OK
* checking files in 'vignettes' ... OK
* checking examples ... ERROR
Running examples in 'tm-Ex.R' failed
The error most likely occurred in:

> base::assign(".ptime", proc.time(), pos = "CheckExEnv")
> ### Name: readPDF
> ### Title: Read In a PDF Document
> ### Aliases: readPDF
> ### Keywords: file
> 
> ### ** Examples
> 
> uri <- paste0("file://",
+               system.file(file.path("doc", "tm.pdf"), package = "tm"))
> engine <- if(nzchar(system.file(package = "pdftools"))) {
+     "pdftools" 
+ } else {
+     "ghostscript"
+ }
> reader <- readPDF(engine)
> pdf <- reader(elem = list(uri = uri), language = "en", id = "id1")
> cat(content(pdf)[1])
                              Introduction to the tm Package
                                     Text Mining in R
                                                Ingo Feinerer
                                                April 20, 2024


Introduction
This vignette gives a short introduction to text mining in R utilizing the text mining framework provided by
the tm package. We present methods for data import, corpus handling, preprocessing, metadata management,
and creation of term-document matrices. Our focus is on the main aspects of getting started with text mining
in R—an in-depth description of the text mining infrastructure offered by tm was published in the Journal of
Statistical Software (Feinerer et al., 2008). An introductory article on text mining in R was published in R
News (Feinerer, 2008).


Data Import
The main structure for managing documents in tm is a so-called Corpus, representing a collection of text
documents. A corpus is an abstract concept, and there can exist several implementations in parallel. The
default implementation is the so-called VCorpus (short for Volatile Corpus) which realizes a semantics as known
from most R objects: corpora are R objects held fully in memory. We denote this as volatile since once the
R object is destroyed, the whole corpus is gone. Such a volatile corpus can be created via the constructor
VCorpus(x, readerControl). Another implementation is the PCorpus which implements a Permanent Corpus
semantics, i.e., the documents are physically stored outside of R (e.g., in a database), corresponding R objects
are basically only pointers to external structures, and changes to the underlying corpus are reflected to all R
objects associated with it. Compared to the volatile corpus the corpus encapsulated by a permanent corpus
object is not destroyed if the corresponding R object is released.
    Within the corpus constructor, x must be a Source object which abstracts the input location. tm provides a
set of predefined sources, e.g., DirSource, VectorSource, or DataframeSource, which handle a directory, a vector
interpreting each component as document, or data frame like structures (like CSV files), respectively. Except
DirSource, which is designed solely for directories on a file system, and VectorSource, which only accepts (char-
acter) vectors, most other implemented sources can take connections as input (a character string is interpreted
as file path). getSources() lists available sources, and users can create their own sources.
    The second argument readerControl of the corpus constructor has to be a list with the named components
reader and language. The first component reader constructs a text document from elements delivered by
a source. The tm package ships with several readers (e.g., readPlain(), readPDF(), readDOC(), . . . ). See
getReaders() for an up-to-date list of available readers. Each source has a default reader which can be
overridden. E.g., for DirSource the default just reads in the input files and interprets their content as text.
Finally, the second component language sets the texts’ language (preferably using ISO 639-2 codes).
    In case of a permanent corpus, a third argument dbControl has to be a list with the named components
dbName giving the filename holding the sourced out objects (i.e., the database), and dbType holding a valid
database type as supported by package filehash. Activated database support reduces the memory demand,
however, access gets slower since each operation is limited by the hard disk’s read and write capabilities.
    So e.g., plain text files in the directory txt containing Latin (lat) texts by the Roman poet Ovid can be
read in with following code:
> txt <- system.file("texts", "txt", package = "tm")
> (ovid <- VCorpus(DirSource(txt, encoding = "UTF-8"),
+                  readerControl = list(language = "lat")))

<<VCorpus>>
Metadata: corpus specific: 0, document level (indexed): 0
Content: documents: 5



                                                       1
> VCorpus(URISource(uri, mode = ""),
+         readerControl = list(reader = readPDF(engine = "ghostscript")))
Error in system2(gs_cmd, c("-dNODISPLAY -q", sprintf("-sFile=%s", shQuote(file)),  : 
  '""' not found
Calls: VCorpus ... mapply -> <Anonymous> -> <Anonymous> -> pdf_info -> system2
Execution halted
* checking for unstated dependencies in 'tests' ... OK
* checking tests ...
  Running 'testthat.R'
 OK
* checking for unstated dependencies in vignettes ... OK
* checking package vignettes in 'inst/doc' ... OK
* checking re-building of vignette outputs ... [14s] OK
* checking PDF version of manual ... [11s] OK
* checking HTML version of manual ... [10s] OK
* checking for non-standard things in the check directory ... OK
* checking for detritus in the temp directory ... OK
* DONE

Status: 1 ERROR, 1 WARNING, 3 NOTEs
See
  'R:/run/building/build_2024-04-20-14-43/RF_PKG_CHECK/PKGS/tm.Rcheck/00check.log'
for details.

Run time: 143.31 seconds.