BIBTOSQL 1 "17 Aug 2014" "Version 0.06"

Table of contents


NAME

bibtosql - convert BibTeX database files to SQL (Structured Query Language) database files

SYNOPSIS

bibtosql [--author] [--create] [--database dbname] [--help] [--server [ MySQL | psql | PostgreSQL | SQLite ]] [--version] < infile or bibfile1 bibfile2 bibfile3 ... > outfile

DESCRIPTION

bibtosql converts BibTeX database files to a format suitable for populating an SQL (Structured Query Language) relational database where most of the common bibliography entry keys (author, title, publisher, year, etc.) are SQL table columns. The original BibTeX entries are also recorded in the database tables as the entry column.

The acronym SQL is commonly pronounced either as it is spelled (ess cue ell), or like the name sequel.

SQL is defined in several national, government, international, and industry standards, including ANSI (X3.168-1989, X3.135-1992, 9579-2-1993, 9075-3-1995, and 9075-4-1996), FIPS (127:1990, 127-2:1993, 193:1995), ISO/IEC (9075:1987, 9075:1989, 9075:1992, 9075:2003, and 13249:2007), and X/Open (CAE 1994). Many 2008-vintage SQL systems claim conformance to most of the 1992 ISO Standard.

The data conversion provided by bibtosql gives access to the fast and powerful search and display facilities of modern relational database technology without giving up the convenience of simple bibliographic text files that can be prepared manually or with the help of other bibliographic software.

In addition, since SQL databases are in wide use, and most have powerful data-conversion facilities, having bibliographic data accessible through an SQL interface may facilitate its conversion to other bibliographic formats, without the need to handle BibTeX files directly.

The companion bibsql(1) utility provides a convenient interface to the SQL database, hiding the unimportant details of how the SQL client program is started, and where its data are located in the filesystem or elsewhere on the network.


OPTIONS

Command-line options may be abbreviated to a unique leading prefix, and lettercase is not significant.

The leading hyphen that distinguishes an option from a filename may be doubled, for compatibility with GNU and POSIX conventions. Thus, -a, -author and --author are equivalent.

--author
Display an author credit on stdout, and then terminate with a success return code.
--create
Output leading SQL commands to delete any existing data, and create a new BibTeX table for key/value searching. Without this option, output consists only of commands for adding new records to an existing database.
--database name of BibTeX database
Specify the name to be used for the database in the SQL system. The default name is bibtex.

This option exists primarily for experimentation, and should rarely be needed.

--help
Display a help message on stdout, giving a usage description, similar to this section of the manual pages, and then terminate with a success return code.
--server [ MySQL | psql | PostgreSQL | SQLite ]
Select the database server type for which input is to be prepared (default: SQLite).

The name psql is an alias for PostgreSQL, since the former is the name of the client program on many systems.

The name may be abbreviated to any unique leading prefix, and its lettercase is not significant.

Although the input formats of the supported SQL systems are similar, there are important small differences that make it imperative to identify the target system.

--version
Display the program version number and release date on stdout and then terminate with a success return code.
--
Terminate options. All remaining arguments are treated as filenames.

DETAILED DESCRIPTION

In order to keep the data-conversion task as simple as possible, bibtosql assumes that the BibTeX input files have been standardized, either manually, or more easily, with the help of bibclean(1). In particular, bibtosql assumes the following conventions:

Any BibTeX line may optionally have trailing whitespace, which is ignored.

These conventions are most easily illustrated by two simple examples, one fictional, and one real:

@Book{Jones:2008:GGE,
  author =       "Terry Jones",
  title =        "Gnat and Gnu Entozoology",
  publisher =    "The National Zoo Press",
  address =      "Washington, DC, USA",
  pages =        "xx + 723",
  year =         "2008",
  ISBN =         "0-9999999-9-0",
  ISBN-13 =      "978-0-9999999-9-8",
}

@Article{Boyer:2002:MEW,
  author =       "R. S. Boyer and W. Feijen and D. Gries and
                 C. A. R. Hoare and J. Misra and J. Moore
                 and H. Richards",
  title =        "In memoriam: {Edsger W. Dijkstra}
                 1930--2002",
  journal =      j-CACM,
  volume =       "45",
  number =       "10",
  pages =        "21--22",
  month =        oct,
  year =         "2002",
  CODEN =        "CACMA2",
  ISSN =         "0001-0782",
  bibdate =      "Wed Sep 3 17:06:29 MDT 2003",
  bibsource =    "http://www.acm.org/pubs/contents/journals/cacm/",
  acknowledgement = ack-nhfb,
}

BibTeX can process a compressed entry of the form

@Book{Jones:2008:GGE,author={Terry Jones},title=
  {Gnat and Gnu Entozoology},publisher={The Zoo Press},
  address={Washington,DC,USA},pages={xx + 723},
  year={2008},ISBN={0-9999999-9-0},ISBN-13={978-0-9999999-9-8}}
but bibtosql and most other simple BibTeX tools cannot.

The output of bibtosql consists of optional SQL commands to create a database, followed by a transaction block that contains zero or more SQL commands to insert data into the database in either of two tables, one for document entries, and the other for BibTeX @String{key = value} definitions. The data conversion provided by bibtosql is much faster than the loading of the converted data into the SQL database system, but each task needs to be done only once, as long as the database input format is not altered. Subsequent changes to the BibTeX data can be converted by bibtosql and entered into an SQL database previously created with input from that tool, without having to recreate the entire database.

Once the SQL database is available, starting the SQL program and issuing queries is relatively quick, and search responses can be produced in a fraction of a second on modern personal and office computers, even for a database with a million or so entries.

The database creation is initiated by an input block that might look something like this:

CREATE TABLE bibtab (
	authorcount  INTEGER,
	editorcount  INTEGER,
	pagecount    INTEGER,
	bibtype      TEXT,
	filename     TEXT,
	label        TEXT,
	author       TEXT,
	editor       TEXT,
	booktitle    TEXT,
	title        TEXT,
	crossref     TEXT,
	chapter      TEXT,
	journal      TEXT,
	volume       TEXT,
	type         TEXT,
	number       TEXT,
	institution  TEXT,
	organization TEXT,
	publisher    TEXT,
	school       TEXT,
	address      TEXT,
	edition      TEXT,
	pages        TEXT,
	day          TEXT,
	month        TEXT,
	monthnumber  TEXT,
	year         TEXT,
	CODEN        TEXT,
	DOI          TEXT,
	ISBN         TEXT,
	ISBN13       TEXT,
	ISSN         TEXT,
	ISSNL        TEXT,
	LCCN         TEXT,
	MRclass      TEXT,
	MRnumber     TEXT,
	MRreviewer   TEXT,
	bibdate      TEXT,
	bibsource    TEXT,
	bibtimestamp TEXT,
	note         TEXT,
	series       TEXT,
	URL          TEXT,
	abstract     TEXT,
	fjournal     TEXT,
	keywords     TEXT,
	language     TEXT,
	remark       TEXT,
	subject      TEXT,
	TOC          TEXT,
	ZMnumber     TEXT,
	acknowledgement TEXT,
	advisor      TEXT,
	affiliation  TEXT,
	affiliationaddress TEXT,
	ajournal     TEXT,
	annote       TEXT,
	articleno    TEXT,
	authordates  TEXT,
	bookDOI      TEXT,
	bookURL      TEXT,
	bookpages    TEXT,
	classcodes   TEXT,
	corpsource   TEXT,
	editordates  TEXT,
	howpublished TEXT,
	journalURL   TEXT,
	journalabr   TEXT,
	key          TEXT,
	onlinedate   TEXT,
	reviewer     TEXT,
	subjectdates TEXT,
	thesaurus    TEXT,
	treatment    TEXT,
	ZMclass      TEXT,
	issue        TEXT,
	rawabstract  TEXT,
	rawauthor    TEXT,
	rawbooktitle TEXT,
	raweditor    TEXT,
	rawnote      TEXT,
	rawtitle     TEXT,
	entry        TEXT NOT NULL UNIQUE
);

CREATE TABLE namtab (
        name         TEXT NOT NULL UNIQUE,
        count        INTEGER
);

CREATE TABLE strtab (
        key          TEXT,
        value        TEXT,
        entry        TEXT NOT NULL UNIQUE
);

The key values are mostly character strings, and the key names are available for use in subsequent searches in the database. The lettercase of key names is not significant. The values are of type TEXT, a character string type that holds at least 65,535 characters on all supported SQL databases. As described later, MySQL requires a following parenthesized length field.

The key names author through ZMnumber are given in the CREATE command in the order chosen by biborder(1), which is similar to that of the field order in conventional publication and reference lists. That order has no significance for either searching or database efficiency.

While most of the key names in the bibtab table are standard ones in BibTeX, a few are not. They include

In bibtosql, we follow the common SQL practice that indeterminate, unknown, or unset values are output as NULL, so as to facilitate their exclusion in later searches.

The NOT NULL attribute on the entry key requires that column to be supplied in each row: it is the only mandatory row entry.

The UNIQUE attribute on the entry key ensures that only one distinct copy of each BibTeX entry is stored in the database, even when identical copies are present in the input stream. That situation is common, since BibTeX entries are often copied between files and shared between users.

For PostgreSQL and SQLite, the TEXT type can represent a string of any size. For MySQL, that type must be followed by a parenthesized typical length: we use (32767). If that length is too small, long strings are arbitrarily, and silently, truncated.

The data-insertion block looks like this:

BEGIN TRANSACTION;
... INSERT commands go here ...
COMMIT;

This allows the database system to delay committing the input to permanent database storage until all of the input has been processed successfully. Any error since the start of the transaction causes the entire batch of updates to be discarded.

A single insertion command might look like this:

INSERT INTO bibtab
	(authorcount, editorcount, pagecount, bibtype, filename, label,
	author, editor, booktitle, title, crossref, chapter, journal, volume,
	type, number, institution, organization, publisher, school,
	address, edition, pages, day, month, monthnumber, year, CODEN,
	DOI, ISBN, ISBN13, ISSN, ISSNL, LCCN, MRclass, MRnumber, MRreviewer,
	bibdate, bibsource, bibtimestamp, note, series, URL, abstract,
	fjournal, keywords, language, remark, subject, TOC, ZMnumber,
	acknowledgement, advisor, affiliation, affiliationaddress,
	ajournal, annote, articleno, authordates, bookDOI, bookURL,
	bookpages, classcodes, corpsource, editordates, howpublished,
	journalURL, journalabr, key, onlinedate, reviewer,
	subjectdates, thesaurus, treatment, ZMclass,
	issue,
	rawabstract, rawauthor, rawbooktitle, raweditor, rawnote, rawtitle,
	entry)
	VALUES (
	NULL,
	5,
	NULL,
	'book',
	'foo2.bib',
	'Goossens:2008:LGC',
	NULL,
	'Michel Goossens and Frank Mittelbach and Sebastian Rahtz and Denis Roegel and Herbert Voss ',
	NULL,
	'The LaTeX Graphics Companion',
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	'pub-AW',
	NULL,
	'pub-AW:adr',
	'Second',
	NULL,
	NULL,
	NULL,
	NULL,
	'2008',
	NULL,
	NULL,
	'0-321-50892-0 OR 0321508920',
	'978-0-321-50892-8 OR 9780321508928',
	NULL,
	NULL,
	'Z253.4.L38G663 2008',
	NULL,
	NULL,
	NULL,
	'Thu Sep 20 12:02:17 MDT 2007',
	'http://www.math.utah.edu/pub/tex/bib/master.bib; z3950.bibsys.no:2100/BIBSYS',
	'2007.09.20 12:02:17 MDT',
	NULL,
	'Addison-Wesley series on tools and techniques for computer typesetting',
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	'ack-nhfb',
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	NULL,
	'Michel Goossens and Frank Mittelbach and Sebastian Rahtz and Denis Roegel and Herbert Vo{\ss}',
	NULL,
	'The {\LaTeX} Graphics Companion',
	'
@Book{Goossens:2008:LGC,
  editor =       "Michel Goossens and Frank Mittelbach and Sebastian
                 Rahtz and Denis Roegel and Herbert Vo{\ss}",
  title =        "The {\LaTeX} Graphics Companion",
  publisher =    pub-AW,
  address =      pub-AW:adr,
  edition =      "Second",
  year =         "2008",
  ISBN =         "0-321-50892-0",
  ISBN-13 =      "978-0-321-50892-8",
  LCCN =         "Z253.4.L38G663 2008",
  bibdate =      "Thu Sep 20 12:02:17 MDT 2007",
  bibsource =    "http://www.math.utah.edu/pub/tex/bib/master.bib;
                 z3950.bibsys.no:2100/BIBSYS",
  series =       "Addison-Wesley series on tools and techniques for
                 computer typesetting",
  acknowledgement = ack-nhfb,
}'
);

Long lines have been wrapped with a backslash-newline for display purposes here; such wrapping is absent from the program output.

The ISBN and ISBN13 values are augmented with hyphen-free book numbers. Although the hyphen positions are well defined, they are usually unpredictable to humans, and may be incorrect in hand-produced bibliographic data. The hyphen-free form provides another way to find them in searches.

The INSERT statement is clearly verbose, and somewhat awkward and error-prone because of the separation of keys from their values, but that matters little for a computer program. When all fields are provided in the expected order, the parenthesized list of key names can be omitted. However, bibtosql intentionally includes the key names to ensure compatibility of its output file with databases created by future versions of the program that might define additional columns.

Manual insertions later can provide data for subsets of keys; any that are omitted simply acquire NULL values.

Except in the value of the entry key, consecutive whitespace is reduced to a single space, TeX macros are simplified or removed, and TeX accents and braces are stripped from key values. That convention makes specification of search strings for key values easier. Thus, to search for Paul Erd{\H{o}}s, use the pattern %P%Erdos% in the author or editor columns. To find documents with {\TeX}book or \TeX{}book or \TeX book in their titles, use the pattern %texbook%. Search patterns are described in the SQL SEARCH TUTORIAL section of the manual pages for bibsql(1).


SQLite SERVER SETUP

SQLite is the simplest of the databases supported by bibtosql and bibsql(1). The same program, sqlite3(1), acts as both a client and a server, and thus, the database must reside on the same system as the client. There is no data protection other than that provided by the host filesystem. In contrast to the complex access controls of client/server database systems, with SQLite, no special user privileges are needed to create or use an SQLite database.

SQLite is an excellent system for users new to databases to learn about, and experiment with, SQL.

The SQLite software is public domain and may be used for any purpose, and redistributed, without restriction. The software is fast, portable, and reliable. It can be built easily on almost any modern desktop operating system, including all current Unix, Mac OS X, and Windows systems. Package downloads and documentation are available at the developer site, http://www.sqlite.org/.

An interesting feature of SQLite is that there is an amalgamated distribution where the many ordinarily-separate source code files are collected into just two large files, enabling advanced compilers to do interprocedural optimizations and procedure inlining, further enhancing database performance.

An SQLite database consists of a single file that is independent of operating system, CPU architecture, and storage byte order. Once created, an SQLite database file can be used on any system where SQLite can be compiled, and is ideally suited for distribution on read-only media such as CD-ROMs or DVDs, or via the Web.

To create an SQLite database, name it on the command line, and then input a file of SQL commands to populate the database:

% bibtosql *.bib | sqlite3 bibtex.db
To search the database, just run the program again with the same filename argument:
% sqlite3 bibtex.db
sqlite> ... your input here ...
If a database filename is not provided, then SQLite creates an in-memory database that is lost when the program terminates. This is useful for testing new data and new table formats, as well as for fast searching of small private collections of BibTeX data. Here is a sample in-memory session:
% sqlite3
sqlite> .read testdata.sql
sqlite> ... your input here ...

For use with bibsql(1), the SQLite database file should be created and copied to the installation directory, and then set read-only to prevent unintentional or malicious modification. Assuming the default prefix of /usr/local, the job can be done like this:

% cp bibtex.db /usr/local/share/lib/bibsql/bibsql-x.yz/
% chmod a-wx,a+r /usr/local/share/lib/bibsql/bibsql-x.yz/bibtex.db

If the database file is large, searches may be made faster by creating indexes of important parts of the data:

sqlite> create index bibidx on bibtab (author, title, label);
sqlite> create index namidx on namtab (name);
sqlite> create index stridx on strtab (key, value);


MySQL SERVER SETUP

Setting up the server side of a MySQL connection is complex, and these step-by-step notes may be a helpful guide. It is a good thing to log what you are doing, either with the log facility provided in an xterm(1) window (usually on a menu bound to Ctl-Button-1), or else with the script(1) command.

We assume that the MySQL software has been installed in the directory /usr/local/bin, and its databases are to be created in subdirectories under the directory /var/lib/mysql, named by the database. For clarity, we use three different prompt strings: # for the root user, $ for the MySQL superuser, and % for an unprivileged ordinary user.

If the database file is large, searches may be made faster by creating indexes of important parts of the data:
mysql> alter table bibtab add index (author(50),title(100),label(50));
mysql> alter table strtab add index (`key`(50), value(50));
mysql> alter table namtab add index (name(50));

At the end of the next session, we discuss how to automatically update the database by regularly-scheduled jobs that find and insert new BibTeX data.


PostgreSQL SERVER SETUP

Setting up the server side of a PostgreSQL connection is a daunting process if you have never done it. These step-by-step notes may therefore be helpful. It is a good thing to log what you are doing, either with the log facility provided in an xterm(1) window (usually on a menu bound to Ctl-Button-1), or else with the script(1) command.

We assume that the PostgreSQL software has been installed in the tree /usr/local/pgsql, and its databases are to be created under the subdirectory data. For clarity, we use three different prompt strings: # for the root user, $ for the PostgreSQL superuser, and % for an unprivileged ordinary user.

If the database file is large, searches may be made faster by creating indexes of important parts of the data:
psql> create index bibidx on bibtab (author,title,label);
psql> create index namidx on namtab (name);
psql> create index stridx on strtab (key, value);


AUTHOR

Nelson H. F. Beebe
University of Utah
Department of Mathematics, 110 LCB
155 S 1400 E RM 233
Salt Lake City, UT 84112-0090
USA
Tel: +1 801 581 5254
FAX: +1 801 581 4148
Email: beebe@math.utah.edu, beebe@acm.org, beebe@computer.org (Internet)
URL: http://www.math.utah.edu/~beebe

The master archive for the bibsql(1) and bibtosql software is at these equivalent locations:

ftp://ftp.math.utah.edu/pub/tex/bibsql

http://www.math.utah.edu/pub/tex/bibsql

SEE ALSO

bib2xml(1), bib2xtx(1), bibcheck(1), bibclean(1), bibdestringify(1), bibdup(1), bibextract(1), bibindex(1), bibjoin(1), biblabel(1), biblex(1), biblook(1), biborder(1), bibparse(1), bibsearch(1), bibsort(1), bibsplit(1), bibsql(1), bibtex(1), bibunlex(1), cattobib(1), citefind(1), citesub(1), citetags(1), cron(8), date(1), find(1) latex(1), mysql(1), pg_ctl(1), postmaster(1), psql(1), ref2bib(1), scribe(1), script(1), sqlite3(1), tex(1), xml2bib(1), xml2wordbib(1), xterm(1), xtx2bib(1).