Typically, Senzing perpetually accumulates new and updated data from data sources. There may be situations where you need to delete an entire data source and its records. For example, a regulatory mandate may stipulate that a data source can no longer be stored permanently. Another example is when a fundamental mapping or loading error has been detected and you wish to start from scratch with that one data source.
Using G2Loader, every record from a previously loaded data source can be deleted with the -D flag. A subset of the records can also be removed. This article assumes the requirement is to purge every source record from a single data source.
The following instructions will completely purge your currently configured entity repository. Only perform these steps on a test system! If you are using SQLite and wish to back up any currently loaded data, review Backup & Restore SQLite Repository.
To demonstrate, consider the sample data shipped in the python/demo/sample directory. First, load the sample data using the sample project file.
./G2Loader.py -P -p demo/sample/project.csv
The project.csv file contains 2 data sources - PEOPLE and COMPANIES.
DATA_SOURCE,FILE_FORMAT,FILE_NAME
PEOPLE ,CSV ,sample_person.csv
COMPANIES ,CSV ,sample_company.csv
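If you want to confirm which data sources a project file will load before running it, a small script can read them out. The following is a minimal sketch, not part of Senzing: the function name list_data_sources is hypothetical, and it assumes the project file is a plain CSV with a DATA_SOURCE column whose values may be padded with spaces, as in the sample above.

```python
import csv
import io


def list_data_sources(project_text):
    """Return the distinct DATA_SOURCE values from project-file text, in file order."""
    reader = csv.DictReader(io.StringIO(project_text))
    sources = []
    for row in reader:
        # Values in the sample file are space-padded, so strip them.
        name = (row.get("DATA_SOURCE") or "").strip()
        if name and name not in sources:
            sources.append(name)
    return sources


# Inline copy of the sample project file shown above.
SAMPLE = """DATA_SOURCE,FILE_FORMAT,FILE_NAME
PEOPLE ,CSV ,sample_person.csv
COMPANIES ,CSV ,sample_company.csv
"""

print(list_data_sources(SAMPLE))  # prints ['PEOPLE', 'COMPANIES']
```

To run it against a real file, read the file's contents with open(path, newline="") and pass the text to the function.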
Utilizing the G2Command utility to call the Senzing APIs, you can demonstrate there are entities for both the PEOPLE and COMPANIES data sources present in the entity repository after the load.
getEntityByRecordIDV2 people 1003 0
Calling the getEntityByRecordIDV2 API with the PEOPLE data source and a record ID of 1003 returns the entity for that source record, an individual known as Martin Jonze.
Performing a similar call against the COMPANIES data source returns an entity for a company called Fabrics Unlimited.
At this point, you decide to remove the COMPANIES data source and all associated data from Senzing. To achieve this, run G2Loader with the -D argument, specifying only the COMPANIES data source file. Note that the -P (purge repository) flag isn't used.
./G2Loader.py -D -f demo/sample/sample_company.csv/?data_source=COMPANIES
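If you need to repeat this for several files, the command line can be assembled in a script. This is a minimal sketch that only builds the same arguments shown above; the helper name build_delete_command is hypothetical, and the file specification format (path followed by /?data_source=NAME) is copied from the command in this article rather than from any other reference.

```python
import shlex


def build_delete_command(file_path, data_source, loader="./G2Loader.py"):
    """Build the G2Loader argument list for deleting one data source's records."""
    # File spec format taken from the article: <path>/?data_source=<NAME>
    file_spec = f"{file_path}/?data_source={data_source}"
    # -D requests deletion; -P (purge repository) is deliberately not included.
    return [loader, "-D", "-f", file_spec]


cmd = build_delete_command("demo/sample/sample_company.csv", "COMPANIES")

# shlex.join quotes the file spec, since '?' is a glob character in most shells.
print(shlex.join(cmd))
```

The argument list can be passed to subprocess.run(cmd, check=True) from the directory that contains G2Loader.py. Keeping it as a list avoids shell interpretation of the ? character.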
Every record from the COMPANIES data source has now been deleted. If you repeat the getEntityByRecordIDV2 API call for the same company record as before, no entity is returned.
The getEntityByRecordIDV2 API call on the record from the PEOPLE data source still returns the entity retrieved earlier. Only the COMPANIES data source records were removed.
Be aware that deleting records from Senzing is not a simple case of deleting records from the repository. When data source records are deleted, the Senzing engine performs entity resolution on the impacted records and entities, just as it would upon loading them. This ensures that Senzing remains up to date with the most relevant analytical information from the data it currently holds.
As such, deleting records and data sources on large repositories could be time-consuming if millions of entities need to be re-evaluated as data is removed. We recommend contacting us if you have any concerns or need to perform high volume deletions on production deployments.
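Before starting a large deletion, it can help to know roughly how many source records are involved. The sketch below counts the data rows in CSV text. It is an illustration only: count_records is a hypothetical helper, it assumes one record per non-empty row after a single header line, and a record count is only a rough proxy for how many entities will need re-evaluation.

```python
import csv
import io


def count_records(csv_text):
    """Count non-empty data rows in CSV text, skipping the header line."""
    reader = csv.reader(io.StringIO(csv_text))
    next(reader, None)  # skip the header row, if any
    # Ignore blank lines and rows whose cells are all empty.
    return sum(1 for row in reader if any(cell.strip() for cell in row))


# Made-up example data, not taken from the Senzing sample files.
example = "RECORD_ID,NAME\n1,Example One\n2,Example Two\n\n"
print(count_records(example))  # prints 2
```

For a file on disk, read it with open(path, newline="") and pass the contents to count_records. For very large files, iterating over the file object directly would avoid loading everything into memory.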