SharePoint documents re-ingestion in Sidra
Before indexing all documents in Prod, it is mandatory to notify:
- David Fernandez Luna
- Lisa Cummins
- John McCarthy
Prerequisites
To speed up the re-ingestion process, the following steps were performed:
- Temporarily scale up the Azure App Service hosting the API/Hangfire jobs engine that ingests the files (see the az CLI sketch at the end of these prerequisites)
- Temporarily scale up the Azure Search Service that indexes the ingested files
- Also select the Blob Indexer and increase its Batch size setting to 3
- The Sidra Core API config setting AzureSearchIndexing:BatchSize should be set to 20 instead of the default 500.
(Otherwise, an Azure Search Service batch may be terminated because it may not finish within the 30-minute threshold.)
- Make sure that the DSU Stage Azure Storage Account allows Public network access in its Networking blade.
- The Sidra Core DB must accept connections from public networks.
- Make sure that the Azure Search Service > Networking blade allows Public network access.
- Make sure that the Core Key Vault > Networking blade allows Public network access.
- Make sure that the DSU Key Vault > Networking blade allows Public network access.
- The OAuth token should not expire during the run; its new default lifetime is 24 hours:
[IdentityServer].[dbo].[Clients].[AccessTokenLifetime] = 86400, at least for [ClientName] being 'EosPortal' / ClientId=5
- In the Azure Search Service, the Skillset used by the Blob Indexer should have its batchSize set to 1, not 1000, when calling
https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register:
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "name": "#6",
  "description": "Semantic Search skill",
  "context": "/document",
  "uri": "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register",
  "httpMethod": "POST",
  "timeout": "PT3M50S",
  "batchSize": 1,
  "degreeOfParallelism": null
}
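For reference, a minimal az CLI sketch of the scale-ups and configuration changes above; all resource group, plan, web app, Search Service and Key Vault names here are placeholders/assumptions to be replaced with the actual ones (only the storage account name comes from Step 5):
# Scale up the App Service plan hosting the API/Hangfire jobs engine (assumed plan/RG names).
az appservice plan update --resource-group "a-p01-sdr-core-rg" --name "a-p01-sdr-core-asp" --sku P2V2
# The Search SKU cannot be changed in place; only replica/partition counts can be scaled.
az search service update --resource-group "a-p01-sdr-dsu-rg" --name "a-p01-sdr-search" --replica-count 2
# Lower the indexing batch size on the Sidra Core API (on Linux App Services the colon must be replaced with a double underscore).
az webapp config appsettings set --resource-group "a-p01-sdr-core-rg" --name "a-p01-sdr-core-api" --settings "AzureSearchIndexing:BatchSize=20"
# Allow public network access on the DSU Stage storage account and the Key Vaults.
az storage account update --resource-group "a-p01-sdr-dsu-rg" --name "ap01sdrdsustastage" --public-network-access Enabled
az keyvault update --name "a-p01-sdr-core-kv" --public-network-access Enabled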
Step 0
In the Sidra Core DB:
Mark the Pipeline(s) with IsRemoved = 1, to avoid any file ingestion being executed while the steps below are performed.
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT TOP 100 *
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
DECLARE @Id_Pipeline INT = (SELECT TOP 1 [IdPipeline]
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
-- SELECT @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Id] = @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Name] = 'PipelineSharePointAzureSearch'
/*
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 1
WHERE [Id] = @Id_Pipeline
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 1
WHERE [Name] = 'PipelineSharePointAzureSearch'
*/
Step 1
Using Azure Storage Explorer or the az storage blob command:
Download the files (blobs) from the provider's STAGE RAW container.
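A possible az CLI equivalent, assuming the provider's STAGE RAW container is the "sharepoint" container of the storage account used in Step 5 (verify both names; the local destination folder must already exist):
az storage blob download-batch --help
az storage blob download-batch `
--account-name "ap01sdrdsustastage" `
--source "sharepoint" `
--pattern "*" `
--destination "C:\backup\sharepoint" `
--account-key "SECRET"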
Step 2
In a backup copy in the local file system:
Rename the files to drop the _idNNN suffix appended to the file names (e.g., Report_id123.pdf becomes Report.pdf).
Get-ChildItem -Filter "*_id*" -Recurse | Rename-Item -NewName {$_.name -replace '_id\d+',''}
Step 3
In the Azure Resource Group of the Client App doing the Cognitive Search indexing:
Prepare the Cognitive Search index by resetting it.
The API of the Semantic search is located at:
Dev: https://a-z01-sdr-nlpl-wst-flask.azurewebsites.net
UAT: https://a-t01-sdr-nlpl-wst-flask.azurewebsites.net
Prod: https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net
GET https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/reset
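If the browser is not convenient, the same reset can be issued from PowerShell; whether the /reset endpoint requires the x-api-key header (as the skillset call below does) is an assumption - drop the header if the Flask app ignores or rejects it:
Invoke-RestMethod -Method GET `
-Uri "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/reset" `
-Headers @{ "x-api-key" = "<x-api-key value from the skillset definition>" }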
In the DSU Azure Search Service > Skillsets, scrolling through the JSON we can see how a call is made to the above API endpoint:
"uri": "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register",
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1000,
...
"httpHeaders": {
"x-api-key": "7l6w2hjfAiZuTBNo3rRUGID1"
}
Step 4
In the Sidra Core DB:
Delete the Asset metadata from Sidra's Core database:
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT *
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
SELECT COUNT(*) AS [SharePoint-Assets-count-in-Core]
FROM [DataIngestion].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
/*
DELETE FROM [DataIngestion].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
*/
Step 5
Using Azure Storage Explorer or the az storage blob command:
Clean up the DSU STAGE Azure Storage Account, removing the blobs from the following containers:
- SharePoint/documents
- CAUTION: IndexLanding (MAYBE NOT)
- KnowledgeStore
az storage blob delete-batch --help
az storage blob delete-batch `
--account-name "ap01sdrdsustastage" `
--source "sharepoint" `
--pattern "*/*" `
--account-key "SECRET"
az storage blob copy start-batch --help
az storage blob copy start-batch `
--account-name "ap01sdrdsustastage" `
--source-container "knowledgestore" `
--pattern "*" `
--destination-container "knowledgestoredeleteme2" `
--dryrun
Step 6
In the DSU Resource Group:
Delete the tables from Databricks, removing the index structure in the tables for the Provider.Entity:
TRUNCATE TABLE SharePoint.Documents;
TRUNCATE TABLE SharePoint.DocumentsValidationErrors;
-- or DROP TABLE IF EXISTS <table>, if the tables should be recreated from scratch (see Step 10, ReCreateTableOnDeployment)
Note: Execute the commands from a new Workspace Notebook.
Why delete tables
When re-ingesting files, we're cleaning up the existing file content to avoid having duplicate content when the same files are indexed twice. The same thing is done in...
Sidra Core DB: we're removing the existing Provider > Entity > Asset records;
Azure Search Service: we're removing the existing index/indexer, data source, and skillset
Step 7
In the DSU Resource Group:
Clean up the Azure Search Service associated with the DSU.
To see the Index, Data Sources, and Skillsets, it may be necessary to enable access from Azure services to the Search Service, from the Overview blade.
Go to Indexers. Remove them - in the pop-up, make sure to check both "Delete data source..." and "Delete Skillset…" (a REST sketch for scripting the cleanup follows the resource names below):
SemanticSearchPipeline-SharePoint-Documents-Blob
SemanticSearchPipeline-SharePoint-Documents-Sql
Make sure that Index, Data Sources and Skillsets are also deleted:
Index:
SemanticSearchPipeline
Data Sources:
DS-SharePoint-Documents-Blob
DS-SharePoint-Documents-Sql
Skillsets:
SemanticSearchPipeline
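If the portal is not accessible, the same cleanup can be scripted against the Azure Cognitive Search REST API. This is only a sketch: the service endpoint, admin key and api-version are assumptions, the object names must match what the portal shows, and - unlike the portal pop-up - deleting an indexer via REST does not cascade to its data source, skillset or index, so each object is deleted explicitly:
$search = "https://<search-service-name>.search.windows.net"
$headers = @{ "api-key" = "<admin-key-from-the-Keys-blade>" }
$v = "api-version=2020-06-30"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/indexers/semanticsearchpipeline-sharepoint-documents-blob?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/indexers/semanticsearchpipeline-sharepoint-documents-sql?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/datasources/ds-sharepoint-documents-blob?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/datasources/ds-sharepoint-documents-sql?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/skillsets/semanticsearchpipeline?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/indexes/semanticsearchpipeline?$v"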
Step 8
In the DB of the client app - Portal DB:
Remove the Asset records from the Portal DB (not the Sidra Core DB, this time):
I'm guessing this is:
RG a-p01-sdr-portal-rg
SQL Server a-p01-sdr-portal-sql
DB a-p01-sdr-portal-operational-sqldb
SELECT *
FROM [Sidra].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [Sidra].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT *
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
SELECT COUNT(*) AS [Documents-count-in-Portal]
FROM [eos_portal].[Documents] D
WHERE D.[SidraId] IN
(
SELECT [Id] FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
/*
DELETE FROM [eos_portal].[Documents]
WHERE [SidraId] IN
(
SELECT [Id] FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
*/
SELECT COUNT(*) AS [SharePoint-Assets-count-in-Portal]
FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
/*
DELETE FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
*/
Step 9
In the Sidra Core DB:
Change the LastUpdated property of the Pipeline; when a file is uploaded, the Index will be created again.
DECLARE @Id_Pipeline INT = (SELECT TOP 1 [IdPipeline]
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
-- SELECT @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Id] = @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Name] = 'PipelineSharePointAzureSearch'
/*
UPDATE [DataIngestion].[Pipeline]
SET [LastUpdated] = GETUTCDATE(),
[ValidFrom] = '2020-01-01'
WHERE [Id] = @Id_Pipeline
UPDATE [DataIngestion].[Pipeline]
SET [LastUpdated] = GETUTCDATE(),
[ValidFrom] = '2020-01-01'
WHERE [Name] = 'PipelineSharePointAzureSearch'
*/
Also, re-enable the pipeline by setting IsRemoved to 0.
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT TOP 100 *
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
DECLARE @Id_Pipeline INT = (SELECT TOP 1 [IdPipeline]
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
-- SELECT @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Id] = @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Name] = 'PipelineSharePointAzureSearch'
/*
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 0
WHERE [Id] = @Id_Pipeline
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 0
WHERE [Name] = 'PipelineSharePointAzureSearch'
*/
Step 10
In the Sidra Core DB:
The SharePoint.Documents Entity must have its [ReCreateTableOnDeployment] set to true (1), so that Databricks tables are created:
SELECT *
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
UPDATE [DataIngestion].[Entity]
SET [ReCreateTableOnDeployment] = 1
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
Step 11
Using Azure Storage Explorer:
Upload the files previously downloaded from the "SharePoint" container (and maybe "SharePointOriginal" too) into the "IndexLanding" container, folder "sharepoint/documents", to have the files re-ingested (see the az CLI sketch below).
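A possible az CLI equivalent of the upload (account name from Step 5; destination container/path as described above; --source is the local backup folder from Steps 1 and 2):
az storage blob upload-batch --help
az storage blob upload-batch `
--account-name "ap01sdrdsustastage" `
--destination "indexlanding" `
--destination-path "sharepoint/documents" `
--source "C:\backup\sharepoint" `
--account-key "SECRET"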
Note: surdiales@plainconcepts.com already recommended setting a lower batch size in the settings of the Sidra API. This should ensure that the batch processing does not exceed the 30-minute threshold, which would cause the batch to be terminated prematurely.
AzureSearchIndexing:BatchSize = 20
Step 12
If the Provider's Databricks table(s) don't show up:
- SharePoint.Documents
- SharePoint.DocumentsValidationErrors
Try manually executing the notebook:
- Shared > AzureSearchKnowledgeStoreIngestion
Note: The code in the Notebook will stop without creating the tables if the Azure Search Service indexers are still busy.
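To check whether the indexers are still busy before re-running the notebook, the indexer status endpoint of the Search Service can be queried (service name, admin key and api-version are assumptions, as in Step 7):
Invoke-RestMethod -Method GET `
-Headers @{ "api-key" = "<admin-key>" } `
-Uri "https://<search-service-name>.search.windows.net/indexers/semanticsearchpipeline-sharepoint-documents-blob/status?api-version=2020-06-30"
# lastResult.status should be "success" (not "inProgress") before running the notebook.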
Step 13
Rebuild and then save the index of the Azure Search Service used by the Client App:
POST https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/upgrade { "recreate_index": true }
POST https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/save
Note: These 2 steps should be run periodically to keep the Search Service in good shape.
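A minimal PowerShell sketch of the two calls; as in Step 3, the need for the x-api-key header is an assumption:
$headers = @{ "x-api-key" = "<x-api-key value from the skillset definition>" }
Invoke-RestMethod -Method POST -Headers $headers `
-Uri "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/upgrade" `
-ContentType "application/json" `
-Body '{ "recreate_index": true }'
Invoke-RestMethod -Method POST -Headers $headers `
-Uri "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/save"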
In the DSU Azure Search Service > Skillsets, scrolling through the JSON we can see how a call is made to the above API endpoint:
"uri": "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register",
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1000,
...
"httpHeaders": {
"x-api-key": "7l6…GID1"
}
Notes
Post re-ingestion: EOS Portal
After binary files are re-ingested into Sidra, they are published to Hermes' EOS Portal.
The publishing is done by executing the EMSLoad pipeline from the following Data Factory resource:
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-portal2-rg
/providers/Microsoft.DataFactory/factories/a-p01-sdr-portal-dft/overview
The pipeline is executed by a trigger each morning around 4:40, but it can also be run manually (see the sketch below).
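To run it from the command line instead of the Data Factory UI, something like the following should work (requires the datafactory az CLI extension; pipeline, factory and resource group names are taken from above):
az datafactory pipeline create-run `
--resource-group "a-p01-sdr-portal2-rg" `
--factory-name "a-p01-sdr-portal-dft" `
--name "EMSLoad"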
Performance
Ingestion in PROD was visibly slower than ingestion in UAT, due to the Azure Search Service indexing rate.
This happened despite the higher/better Azure Search Service plan in PROD.

In fact, it is likely that the Azure Search Service plan has little influence; another component involved in file indexing has a greater influence.
The application identified below has a much greater impact on the indexing rate than the plan: it is dragging the performance down considerably.
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-dsu-app-nlpdatalab-rg
/providers/Microsoft.Web/sites/a-p01-sdr-nlpl-wst-flask/appServices
The above application has frequent spikes of high latency: response times of 3 to 4 minutes, while its request pipeline never exceeds 10 requests.

This application hooks into the pipeline of the Search Service Skillset semanticsearchpipeline, as a step that performs calls to:
https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register
This skillset is used by the Indexer semanticsearchpipeline-sharepoint-documents-blob.
The application, however, does not contribute to the Index of the Search Service (semanticsearchpipeline).
Hence, the app is dragging down the indexing rate without adding value to the indexing itself.
This app exhibits the same behavior in UAT, though (for unknown reasons) its impact was smaller there.
This app should do its job outside the Skillset pipeline, probably by simply querying the index (semanticsearchpipeline).
Resource scaling
In an attempt to improve performance during the file re-ingestion, the following resources were bumped up:
The indexing app's database, scaled up from 10 DTU and 250 GB:
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-dsu-app-nlpdatalab-rg
/providers/Microsoft.Sql/servers/a-p01-sdr-nlpl-sdb
/databases/Client
The indexing app's hosting plan, scaled up from P1V2:
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-dsu-app-nlpdatalab-rg
/providers/Microsoft.Web/serverFarms/a-p01-sdr-nlpl-lnx/webHostingPlan