Typically, G2 continually accumulates new data sources, along with additions and updates to existing data sources. There may be situations where you need to delete an entire data source and its records from G2. For example, a regulatory mandate may stipulate that a data source can no longer be retained, or a fundamental mapping and loading error has been detected and you wish to start from scratch with that one data source.
Using G2Loader, every record from a previously loaded data source can be deleted with the -D flag. A subset of the records could also be removed, but this article assumes the requirement is to remove every record from the source. In the near future additional API calls will be available to perform this action.
To demonstrate, consider the sample data shipped with G2 in the python/demo/sample directory. First, load the sample data using the default project (note: this will purge your existing repository):
- python G2Loader.py -P
The sample project.csv that was loaded contains two data sources: PEOPLE and COMPANIES.
Using the G2 Command utility to perform a G2 search, you can confirm that entities exist for both the PEOPLE and COMPANIES data sources. The G2 Command utility is located in the G2 python directory and is launched with:
- python G2Command.py
Calling the getEntityByRecordID API with the PEOPLE data source and a record ID of 1003 returns a G2 entity corresponding to the source data: an individual known as Martin Jonze.
Performing a similar lookup on the COMPANIES data source returns an entity from that data source for a company called Fabrics Unlimited.
At this point you decide to remove the COMPANIES data source and all associated data from G2. To achieve this, run G2Loader specifying only the COMPANIES data source file and the -D argument:
- python G2Loader.py -D -f demo/sample/sample_company.csv/?data_source=COMPANIES,file_format=CSV
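If the delete needs to be scripted, the command above can be assembled programmatically. The following is a minimal sketch, not part of G2: the helper names build_delete_command and run_delete are hypothetical, and it assumes the script is run from the G2 python directory so that G2Loader.py resolves. It only reproduces the -D and -f arguments shown above and defaults to a dry run that prints the command without executing it.

```python
import subprocess
import sys


def build_delete_command(csv_path, data_source, loader="G2Loader.py"):
    """Return the argument list for a G2Loader delete run over one source file."""
    # The file spec mirrors the article: <path>/?data_source=<NAME>,file_format=CSV
    file_spec = f"{csv_path}/?data_source={data_source},file_format=CSV"
    return [sys.executable, loader, "-D", "-f", file_spec]


def run_delete(csv_path, data_source, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    cmd = build_delete_command(csv_path, data_source)
    print(" ".join(cmd))
    if dry_run:
        return None
    # check=True raises CalledProcessError if G2Loader exits with a non-zero status.
    return subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run only: shows the command that would be executed.
    run_delete("demo/sample/sample_company.csv", "COMPANIES")
```

Keeping the dry run as the default makes it harder to trigger a delete by accident; pass dry_run=False only once the printed command has been reviewed.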
Every record from the COMPANIES data source has now been deleted from G2. If you perform the same lookup for the company as before, no response is returned.
The PEOPLE lookup still returns the matched entity; only the COMPANIES data source records were removed.
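The before-and-after check described above can be expressed as a small routine. This is a sketch under stated assumptions: lookup is a hypothetical callable that you would supply yourself (for example, a wrapper around getEntityByRecordID) and is assumed to return an empty value when no entity is found. The company record ID used below is a placeholder, since this article does not state it.

```python
def check_source_removed(lookup, removed, retained):
    """Verify that removed records no longer resolve and retained ones still do.

    lookup(data_source, record_id) is a caller-supplied function returning the
    response text, or None / "" when nothing is found.
    removed and retained are lists of (data_source, record_id) pairs.
    Returns a list of problem descriptions; an empty list means the check passed.
    """
    problems = []
    for data_source, record_id in removed:
        if lookup(data_source, record_id):
            problems.append(f"{data_source}:{record_id} still resolves to an entity")
    for data_source, record_id in retained:
        if not lookup(data_source, record_id):
            problems.append(f"{data_source}:{record_id} no longer resolves")
    return problems


if __name__ == "__main__":
    # Stand-in data for illustration only; replace with a real lookup function.
    remaining = {("PEOPLE", "1003"): "entity for Martin Jonze"}

    def fake_lookup(data_source, record_id):
        return remaining.get((data_source, record_id))

    # "COMPANY_RECORD_ID" is a placeholder, not a real record ID.
    issues = check_source_removed(
        fake_lookup,
        removed=[("COMPANIES", "COMPANY_RECORD_ID")],
        retained=[("PEOPLE", "1003")],
    )
    print(issues)
```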
Be aware that deleting records from G2 is not a simple case of deleting rows from the repository. When source records are deleted, the engine performs entity resolution again on the impacted records and entities. This ensures G2 stays up to date, reflecting the most relevant analytical information for the data it currently holds.
As such, deleting records and data sources from large repositories can be time consuming if millions of entities need to be re-evaluated as data is removed. We recommend contacting us if you have any concerns regarding this or need to perform high-volume deletions on production deployments.
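Before starting a large delete, it can help to know roughly how many records are involved. The sketch below counts the data rows in a source CSV file. It assumes the file has a single header row, and count_records is a hypothetical helper name. The count is only a rough proxy for the work involved, because the number of entities that must be re-evaluated depends on how the records resolved.

```python
import csv


def count_records(csv_path):
    """Count the data rows (excluding the header) in a CSV file."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        next(reader, None)  # skip the header row; no error if the file is empty
        return sum(1 for row in reader if row)  # blank lines are ignored


if __name__ == "__main__":
    import sys

    # Usage: python count_records.py <path-to-source-csv>
    if len(sys.argv) > 1:
        print(count_records(sys.argv[1]))
```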