G2Loader - File Encoding – Senzing®

G2Loader.py is a sample utility for ingesting source data into Senzing. It supports CSV and JSON formats for mapping the data from your source(s) to the Senzing attribute terms.

G2Loader is intended to help you get started quickly and for use in proofs of concept (PoC). In a typical production environment you wouldn't use G2Loader; you would call the Senzing APIs directly.

The required file encoding for the data source files to be ingested is UTF-8. Treat this as a requirement even when you believe the files use another encoding, such as US-ASCII: non-ASCII characters often turn up in data sources where they are not expected, particularly in files that come from or were prepared in Windows environments.

The export of data or ETL processes should always write the final file that will be sent to Senzing and G2Loader with UTF-8 encoding.

If you've been supplied data source files and are unsure of their encoding, you can check with the file or enca commands. Depending on how many spurious characters a non-UTF-8 file contains, you can also try the iconv command to convert it to UTF-8, though be aware that this has limitations.

The following examples use native Linux commands. If these are not installed on your Linux system, they are typically available in the standard distribution repositories and can be installed with your distribution's package commands.

Checking a File Encoding

/home/g2/ > file demo/sample/sample_person.json
demo/sample/sample_person.json: ASCII text, with very long lines
/home/g2/ >

/home/g2/ > enca -L none demo/sample/sample_person.json 
7bit ASCII characters
/home/g2/ >

Up until version 5.26, file checks only the first few kilobytes of a file to make a best-effort guess at the encoding, so it can get it wrong. For example, a file can be reported as 7bit ASCII but still have spurious non-ASCII characters later in the file.

Version 5.26+ of file added a new parameter (bytes=x), set with the -P option, to scan a set number of bytes in the file. Scanning a larger portion of the file, if not all of it, gives a more accurate result.

/home/g2 > file_size=$(wc -c < demo/sample/sample_person.json)
/home/g2 > file -P bytes=$file_size -i demo/sample/sample_person.json
demo/sample/sample_person.json: ISO-8859 text, with CRLF line terminators
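
Because file may only sample part of the input, another way to check a whole file is to ask iconv to decode it as UTF-8 and look at the exit status. The following is a minimal sketch rather than part of G2Loader; the function name is_valid_utf8 is hypothetical, and it assumes an iconv that exits non-zero on invalid input (as GNU iconv does).

```shell
# Hypothetical helper: succeeds only if the whole file decodes as UTF-8.
# Assumes iconv exits non-zero when it meets an invalid byte sequence.
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}
```

It could be used in a test such as: is_valid_utf8 demo/sample/sample_person.json && echo "valid UTF-8". A pure ASCII file also passes, since ASCII is a subset of UTF-8.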

Converting Encoding to UTF-8

To convert a file from another encoding to UTF-8, use the iconv command.

/home/g2/ > iconv -f us-ascii -t utf-8 <input_file> -o <output_file>

When attempting a conversion you may sometimes see an error reporting invalid bytes or characters that cannot be converted, at which point the conversion stops. The message is usually similar to:

iconv: illegal input sequence at position 30

iconv includes many useful features. Two of these are the //IGNORE and //TRANSLIT options, which are appended to the encoding given in the -t (--to-code) argument.

//IGNORE tells iconv to ignore invalid characters instead of stopping. //TRANSLIT tells iconv to try to transliterate to the closest possible matching character. If you need to use one of these options, //IGNORE is recommended over //TRANSLIT.

/home/g2/ > iconv -f us-ascii -t utf-8//IGNORE <input_file> -o <output_file>
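
The commands above name us-ascii as the source encoding. When file reports something else, such as the ISO-8859 text shown earlier, naming that encoding with -f lets iconv map the bytes instead of stopping on them or dropping them. A minimal sketch, assuming the data really is ISO-8859-1 (files prepared on Windows may instead be WINDOWS-1252); the function name latin1_to_utf8 is hypothetical.

```shell
# Hypothetical helper: convert an ISO-8859-1 file to UTF-8.
# $1 = input file, $2 = output file (must differ from the input).
# Assumption: the input really is ISO-8859-1. iconv cannot verify this,
# because every byte value is valid in that encoding.
latin1_to_utf8() {
    iconv -f ISO-8859-1 -t UTF-8 "$1" > "$2"
}
```

For example, latin1_to_utf8 <input_file> <output_file> turns the single byte 0xe9 (é in ISO-8859-1) into the two-byte UTF-8 sequence 0xc3 0xa9. Spot-check a few converted records afterwards, since a wrong guess at the source encoding still produces valid UTF-8, just with the wrong characters.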

Locating Invalid Characters / Bytes

If G2Loader reports an invalid byte or character for the file encoding while processing a data source file, you can locate the failing record with grep and the hex value reported. For example, if G2Loader reports it can't decode byte 0xb1, use the following to view any rows containing this value.

grep -n $'\xb1' <source_file>
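
When the offending byte is not known in advance, a byte-oriented search for anything outside plain ASCII can help. A sketch, assuming GNU grep; the function name non_ascii_lines is hypothetical. Note that it also reports valid multi-byte UTF-8 characters and ASCII control characters, so treat the result as a list of lines to inspect rather than a list of errors.

```shell
# Hypothetical helper: print the line numbers of lines containing any byte
# that is not printable ASCII or ASCII whitespace.
# LC_ALL=C makes grep match byte by byte; -a treats the file as text so
# matching lines are reported even if grep would consider the file binary.
non_ascii_lines() {
    LC_ALL=C grep -a -n '[^[:print:][:space:]]' "$1" | cut -d: -f1
}
```

Running non_ascii_lines <source_file> prints one line number per matching record.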

Check the offending records and ideally correct them or the file encoding in your ETL or other processing scripts.

If there are only one or two offending characters/bytes you could also use sed to replace them.

sed -i 's/\xb1//g' <source_file>
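
Editing in place with sed -i leaves no copy of the original. A more cautious sketch writes the cleaned data to a new file with tr, which works on raw bytes. The function name strip_byte is hypothetical, and the byte is given in octal (0xb1 is 261). It assumes the byte never occurs inside a valid multi-byte UTF-8 character; for example, ñ is encoded as 0xc3 0xb1, so removing every 0xb1 from a file that is already partly UTF-8 would damage it.

```shell
# Hypothetical helper: copy a file, dropping every occurrence of one byte.
# $1 = byte as three octal digits (e.g. 261 for 0xb1), $2 = input, $3 = output.
# Assumption: the byte is never part of a valid multi-byte UTF-8 character.
strip_byte() {
    LC_ALL=C tr -d "\\$1" < "$2" > "$3"
}
```

For example, strip_byte 261 <source_file> <cleaned_file> leaves <source_file> untouched so the two can be compared before loading.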
