LexisNexisTools

:newspaper: Working with newspaper data from 'LexisNexis'

Motivation

My PhD supervisor once told me that everyone doing newspaper analysis starts by writing code to read in files from the 'LexisNexis' newspaper archive. However, while I do recommend this exercise, not everyone has the time. This package provides functions to read in TXT, RTF, DOC and PDF files downloaded from the old 'LexisNexis' or DOCX from the new Nexis Uni, Lexis Advance and similar services. The package also comes with a few other features that should be useful while working with data from the popular newspaper archive.

Did you experience any problems, have questions or an idea about a great new feature? Then please don't hesitate to file an issue report.

Installation

Install via:

install.packages("LexisNexisTools")

Or get the development version by installing directly from GitHub (if you do not have remotes yet, install it via install.packages("remotes") first):

remotes::install_github("JBGruber/LexisNexisTools")

Demo

Load Package

library("LexisNexisTools")

If you do not yet have files from 'LexisNexis' but want to test the package, you can use lnt_sample() to copy a sample file with mock data into your current working directory:

lnt_sample()

Rename Files

'LexisNexis' does not give its files proper names. The function lnt_rename() renames files to a standard format. For TXT files, this format is "searchTerm_startDate-endDate_documentRange.txt" (e.g., "Obama_20091201-20100511_1-500.txt"); for other file types, the format is similar but depends on what information is available. Note that this will not work if your files lack a cover page with this information. Currently, it seems that 'LexisNexis' only delivers those cover pages when you first create a link to your search ("link to this search" on the results page), follow this link, and then download the TXT files from there (see here for a visual explanation). If you do not want to rename files, you can skip to the next section. The rest of the package's functionality works the same whether you rename your files or not. However, in a larger database, you will benefit from a consistent naming scheme.
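To make the naming scheme concrete, here is a minimal base R sketch that assembles a name in the TXT format described above. The helper build_lnt_name() is hypothetical and exists only for illustration; it is not part of the package and not how lnt_rename() works internally, since lnt_rename() takes these values from the cover page of each file.

```r
# Hypothetical helper (not part of LexisNexisTools): compose a name of the form
# "searchTerm_startDate-endDate_documentRange.txt"
build_lnt_name <- function(search_term, start_date, end_date, doc_range, ext = "txt") {
  sprintf(
    "%s_%s-%s_%s.%s",
    search_term,
    format(as.Date(start_date), "%Y%m%d"),  # dates are written as YYYYMMDD
    format(as.Date(end_date), "%Y%m%d"),
    doc_range,
    ext
  )
}

example_name <- build_lnt_name("Obama", "2009-12-01", "2010-05-11", "1-500")
print(example_name)
```

With these inputs, the sketch reproduces the example name given above.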

There are three ways in which you can rename the files:

Run lnt_rename() directly in your working directory without the x argument, which will prompt an option to scan for TXT files in your current working directory:

report <- lnt_rename()

Provide a folder path (and set recursive = TRUE if you want to scan for files recursively):

report <- lnt_rename(x = getwd(), report = TRUE)

Provide a character object with file names. Use list.files() to search for files in a certain path.

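# pattern is a regular expression: escape the dot and anchor it at the end
# so that only files with the .txt extension are matched
my_files <- list.files(pattern = "\\.txt$", path = getwd(),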
                       full.names = TRUE, recursive = TRUE, ignore.case = TRUE)
report <- lnt_rename(x = my_files, report = TRUE)

report

|name_orig  |name_new                              |status  |type |
|:----------|:-------------------------------------|:-------|:----|
|sample.TXT |SampleFile_20091201-20100511_1-10.txt |renamed |txt  |

Using list.files() instead of the built-in mechanism allows you to specify a file pattern. This might be a preferred option if you have a folder in which only some of the TXT files contain newspaper articles from 'LexisNexis' but other files have the ending TXT as well. If you are unsure what the TXT files in your chosen folder might contain, use the option simulate = TRUE (which is the default). The argument report = TRUE indicates that the output of the function in R will be a data.frame containing a report on which files have been changed on your drive and how.
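Because the pattern argument of list.files() is a regular expression, small differences in the pattern change which files are selected. The following sketch uses a throwaway folder inside tempdir() with invented file names to show this; nothing here touches your real files or calls lnt_rename().

```r
# Create a temporary folder with a few empty files (all names are made up)
demo_dir <- file.path(tempdir(), "lnt_pattern_demo")
dir.create(demo_dir, showWarnings = FALSE)
invisible(file.create(
  file.path(demo_dir, c("sample.TXT", "notes.txt", "rawtxt_notes.csv"))
))

# An unescaped dot matches any character, so "rawtxt_notes.csv" is caught too
loose <- list.files(demo_dir, pattern = ".txt", ignore.case = TRUE)

# Escaped dot plus end anchor: only files with the .txt extension
all_txt <- list.files(demo_dir, pattern = "\\.txt$", ignore.case = TRUE)

# Narrower still: only files that start with "sample"
sample_only <- list.files(demo_dir, pattern = "^sample.*\\.txt$", ignore.case = TRUE)

print(loose)
print(all_txt)
print(sample_only)
```

A vector selected this way (with full.names = TRUE) can then be passed to lnt_rename() as x.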

Read in 'LexisNexis' Files to Get Meta, Articles and Paragraphs

The main function of this package is lnt_read(). It converts the raw files into three different data.frames nested in a special S4 object of class LNToutput. The three data.frames contain (1.) the metadata of the articles, (2.) the articles themselves, and (3.) the paragraphs.

There are several important keywords that are used to split up the raw articles into article text and metadata. Those need to be provided in some form but can be left at 'auto' to use the 'LexisNexis' defaults in several languages. All keywords can be regular expressions, and in most cases they need to be:

start_keyword: The English default is "\d+ of \d+ DOCUMENTS$", which matches, for example, "1 of 112 DOCUMENTS". It is used to split up the text in the TXT files into individual articles. You will not have to change anything here unless you work with documents in languages other than those currently supported.
end_keyword: This keyword is used to remove unnecessary information at the end of an article. Usually, this is "^LANGUAGE:". Where the keyword isn't found, the additional information ends up in the article text.
length_keyword: This keyword, which is usually just "^LENGTH:" (or its equivalent in other languages), finds the information about the length of an article. However, since this is always the last line of the metadata, it is also used to separate metadata and article text. There seems to be only one type of case where this information is missing: if the article consists only of a graphic (which 'LexisNexis' does not retrieve). The final output from lnt_read() has a column named Graphic, which indicates if this keyword was missing. The article text then contains all metadata as well. In these cases, you should remove the whole article after inspecting it. (Use View(LNToutput@articles$Article[LNToutput@meta$Graphic]) to view these articles in a spreadsheet-like viewer.)
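The role of the three keywords can be sketched in a few lines of base R. This is a simplified illustration on invented mock lines, not the code lnt_read() actually runs; the helper split_article() and the mock input are assumptions made for the example.

```r
# Mock lines imitating a 'LexisNexis' TXT export (invented for illustration)
lines <- c(
  "1 of 2 DOCUMENTS", "The Guardian", "January 11, 2010",
  "LENGTH: 355 words", "First article text.", "LANGUAGE: ENGLISH",
  "2 of 2 DOCUMENTS", "The Sun",
  "LENGTH: 677 words", "Second article text.", "LANGUAGE: ENGLISH"
)

start_keyword  <- "\\d+ of \\d+ DOCUMENTS$"
length_keyword <- "^LENGTH:"
end_keyword    <- "^LANGUAGE:"

# Every match of start_keyword opens a new article
article_id <- cumsum(grepl(start_keyword, lines))
articles <- split(lines, article_id)

# Hypothetical helper: separate metadata and text within one article
split_article <- function(x) {
  idx <- seq_along(x)
  len_pos <- grep(length_keyword, x)[1]  # NA if the keyword is missing
  end_pos <- grep(end_keyword, x)[1]     # NA if the keyword is missing
  graphic <- is.na(len_pos)
  first <- if (graphic) 2 else len_pos + 1
  last <- if (is.na(end_pos)) length(x) else end_pos - 1
  list(
    meta = if (graphic) character(0) else x[idx > 1 & idx <= len_pos],
    text = x[idx >= first & idx <= last],
    graphic = graphic  # TRUE means metadata stays mixed into the text
  )
}

parsed <- lapply(articles, split_article)
print(parsed[["1"]]$meta)
print(parsed[["1"]]$text)
```

When length_keyword is absent, the sketch leaves all lines in text and sets graphic to TRUE, which mirrors the behaviour described for the Graphic column.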

To use the function, you can again provide file name(s), folder name(s), or nothing as the x argument. If you provide nothing, the current working directory is searched for relevant files:

LNToutput <- lnt_read(x = getwd())

## Creating LNToutput from 1 file...
##  ...files loaded [0.0016 secs]
##  ...articles split [0.0089 secs]
##  ...lengths extracted [0.0097 secs]
##  ...newspapers extracted [0.01 secs]
##  ...dates extracted [0.012 secs]
##  ...authors extracted [0.013 secs]
##  ...sections extracted [0.014 secs]
##  ...editions extracted [0.014 secs]
##  ...headlines extracted [0.016 secs]
##  ...dates converted [0.023 secs]
##  ...metadata extracted [0.026 secs]
##  ...article texts extracted [0.029 secs]
##  ...paragraphs extracted [0.041 secs]
##  ...superfluous whitespace removed from articles [0.044 secs]
##  ...superfluous whitespace removed from paragraphs [0.046 secs]
## Elapsed time: 0.047 secs

The returned object of class LNToutput is intended to be an intermediate container. As it stores articles and paragraphs in two separate data.frames, nested in an S4 object, the relevant text data is stored twice in almost the same format. This has the advantage that there is no need to use special characters, such as "\n". However, it makes the files rather big when you save the object directly.

The object can, however, be easily converted to regular data.frames using @ to select the data.frame you want:

meta_df <- LNToutput@meta
articles_df <- LNToutput@articles
paragraphs_df <- LNToutput@paragraphs

# Print meta to get an idea of the data
head(meta_df, n = 3)

| ID|Source_File                           |Newspaper         |Date       |Length    |Section         |Author          |Edition             |Headline                   |Graphic |
|--:|:-------------------------------------|:-----------------|:----------|:---------|:---------------|:---------------|:-------------------|:--------------------------|:-------|
|  1|SampleFile_20091201-20100511_1-10.txt |Guardian.com      |2010-01-11 |355 words |NA              |Andrew Sparrow  |NA                  |Lorem ipsum dolor sit amet |FALSE   |
|  2|SampleFile_20091201-20100511_1-10.txt |Guardian          |2010-01-11 |927 words |NA              |Simon Tisdall   |NA                  |Lorem ipsum dolor sit amet |FALSE   |
|  3|SampleFile_20091201-20100511_1-10.txt |The Sun (England) |2010-01-11 |677 words |FEATURES; Pg. 6 |TREVOR Kavanagh |Edition 1; Scotland |Lorem ipsum dolor sit amet |FALSE   |
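If S4 objects are new to you, the following sketch shows how slot access with @ behaves. It defines a mock class, MockLNT, with the same three slots described above; this is an assumption-laden stand-in for illustration, not the package's own class definition, and all data in it is invented.

```r
library(methods)

# Mock class imitating the three-slot layout (not the real LNToutput class)
setClass("MockLNT", representation(
  meta = "data.frame",
  articles = "data.frame",
  paragraphs = "data.frame"
))

mock <- new("MockLNT",
  meta = data.frame(ID = 1:2, Headline = c("A", "B"),
                    Graphic = c(FALSE, TRUE), stringsAsFactors = FALSE),
  articles = data.frame(ID = 1:2,
                        Article = c("Text of A.", "Metadata and text mixed."),
                        stringsAsFactors = FALSE),
  paragraphs = data.frame(Art_ID = c(1L, 1L, 2L), Par_ID = 1:3,
                          Paragraph = c("Text", "of A.", "Mixed."),
                          stringsAsFactors = FALSE)
)

# @ extracts one slot as an ordinary data.frame
meta_df <- mock@meta

# Slots can be combined: article texts whose metadata row is flagged as graphic
flagged <- mock@articles$Article[mock@meta$Graphic]

# Keep only the articles that are not flagged
clean_articles <- mock@articles[!mock@meta$Graphic, ]

print(flagged)
print(clean_articles)
```

The same indexing idea underlies the View() call mentioned for inspecting articles where the Graphic column is TRUE.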

If you want to keep only one data.frame that includes both metadata and text data, you can easily do so:

meta_articles_df <- lnt_convert(LNToutput, to = "data.frame")

# Or keep the paragraphs
meta_paragraphs_df <- lnt_convert(LNToutput, to = "data.frame", what = "Paragraphs")
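Conceptually, the combined data.frame is a join of the metadata with the text on the article ID. The sketch below shows a rough base R equivalent on invented data. It assumes that the articles data.frame has an ID column and that the paragraphs data.frame refers to its article through a column called Art_ID; treat both names as assumptions, and note that this is not necessarily how lnt_convert() is implemented.

```r
# Invented stand-ins for the three data.frames (column names partly assumed)
meta <- data.frame(ID = 1:2,
                   Newspaper = c("Guardian", "The Sun (England)"),
                   stringsAsFactors = FALSE)
articles <- data.frame(ID = 1:2,
                       Article = c("Lorem ipsum.", "Dolor sit amet."),
                       stringsAsFactors = FALSE)
paragraphs <- data.frame(Art_ID = c(1L, 1L, 2L), Par_ID = 1:3,
                         Paragraph = c("Lorem", "ipsum.", "Dolor sit amet."),
                         stringsAsFactors = FALSE)

# One row per article: metadata plus full text
meta_articles <- merge(meta, articles, by = "ID")

# One row per paragraph: metadata repeated for every paragraph of an article
meta_paragraphs <- merge(meta, paragraphs, by.x = "ID", by.y = "Art_ID")

print(meta_articles)
print(meta_paragraphs)
```

In the paragraph version, the metadata of an article is repeated once for each of its paragraphs.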

Alternatively, you can convert LNToutput objects to formats common in other packages using the function lnt_convert:

rDNA_docs <- lnt_convert(LNToutput, to = "rDNA")

quanteda_corpus <- lnt_convert(LNToutput, to = "quanteda")

tCorpus <- lnt_convert(LNToutput, to = "corpustools")

tidy <- lnt_convert(LNToutput, to = "tidytext")

Corpus <- lnt_convert(LNToutput, to = "tm")

dbloc <- lnt_convert(LNToutput, to = "SQLite")

See ?lnt_convert for details and comment in this issue if you want a format added to the convert function.
