SharePoint documents re-ingestion in Sidra
Before indexing all documents in Prod, it is mandatory to notify:
- David Fernandez Luna
- Lisa Cummins
- John McCarthy
Prerequisites
To speed up the re-ingestion process, the following steps were performed:
- Temporarily scale up the Azure App Service hosting the API/Hangfire jobs engine that ingests the files (see the az CLI sketch at the end of these prerequisites)
- Temporarily scale up the Azure Search Service that indexes the ingested files
- Also select the Blob Indexer and increase its Batch size setting to 3
- The Sidra Core API config setting AzureSearchIndexing:BatchSize should be set to 20 instead of the default 500.
(Otherwise, an Azure Search Service batch may be terminated because it may not finish within the 30-minute threshold.)
- Make sure that the DSU Stage Azure Storage Account allows Public network access in its Networking blade.
- The Sidra Core DB must accept connections from public networks.
- Make sure that the Azure Search Service > Networking blade allows Public network access.
- Make sure that the Core Key Vault > Networking blade allows Public network access.
- Make sure that the DSU Key Vault > Networking blade allows Public network access.
- The OAuth token should not expire during the run; its new default lifetime is 24 hours:
[IdentityServer].[dbo].[Clients].[AccessTokenLifetime] = 86400, at least for [ClientName] being 'EosPortal' / ClientId=5
- In the Azure Search Service, the Skillset used by the Blob Indexer should have its batchSize set to 1, not 1000, when calling
https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register:
{
  "@odata.type": "#Microsoft.Skills.Custom.WebApiSkill",
  "name": "#6",
  "description": "Semantic Search skill",
  "context": "/document",
  "uri": "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register",
  "httpMethod": "POST",
  "timeout": "PT3M50S",
  "batchSize": 1,
  "degreeOfParallelism": null
}
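For reference, a minimal az CLI sketch of the scale-ups and configuration changes above; all resource group, plan, web app, Search Service and Key Vault names here are placeholders/assumptions to be replaced with the actual ones (only the storage account name comes from Step 5):
# Scale up the App Service plan hosting the API/Hangfire jobs engine (assumed plan/RG names).
az appservice plan update --resource-group "a-p01-sdr-core-rg" --name "a-p01-sdr-core-asp" --sku P2V2
# The Search SKU cannot be changed in place; only replica/partition counts can be scaled.
az search service update --resource-group "a-p01-sdr-dsu-rg" --name "a-p01-sdr-search" --replica-count 2
# Lower the indexing batch size on the Sidra Core API (on Linux App Services the colon must be replaced with a double underscore).
az webapp config appsettings set --resource-group "a-p01-sdr-core-rg" --name "a-p01-sdr-core-api" --settings "AzureSearchIndexing:BatchSize=20"
# Allow public network access on the DSU Stage storage account and the Key Vaults.
az storage account update --resource-group "a-p01-sdr-dsu-rg" --name "ap01sdrdsustastage" --public-network-access Enabled
az keyvault update --name "a-p01-sdr-core-kv" --public-network-access Enabled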
Step 0
In the Sidra Core DB:
Mark the Pipeline(s) with IsRemoved = 1, to avoid any file ingestion being executed while the steps below are performed.
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT TOP 100 *
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
DECLARE @Id_Pipeline INT = (SELECT TOP 1 [IdPipeline]
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
-- SELECT @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Id] = @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Name] = 'PipelineSharePointAzureSearch'
/*
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 1
WHERE [Id] = @Id_Pipeline
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 1
WHERE [Name] = 'PipelineSharePointAzureSearch'
*/
Step 1
Using Azure Storage Explorer or the az storage blob command:
Download the files (blobs) from the provider's STAGE RAW container.
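A possible az CLI equivalent, assuming the provider's STAGE RAW container is the "sharepoint" container of the storage account used in Step 5 (verify both names; the local destination folder must already exist):
az storage blob download-batch --help
az storage blob download-batch `
--account-name "ap01sdrdsustastage" `
--source "sharepoint" `
--pattern "*" `
--destination "C:\backup\sharepoint" `
--account-key "SECRET"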
Step 2
In a backup copy in the local file system:
Rename the files to drop the _idNNN suffix appended to the file names (e.g., Report_id123.pdf becomes Report.pdf).
Get-ChildItem -Filter "*_id*" -Recurse | Rename-Item -NewName {$_.name -replace '_id\d+',''}
Step 3
In the Azure Resource Group of the Client App doing the Cognitive Search indexing:
Prepare the Cognitive Search index by resetting it.
The API of the Semantic search is located at:
Dev: https://a-z01-sdr-nlpl-wst-flask.azurewebsites.net
UAT: https://a-t01-sdr-nlpl-wst-flask.azurewebsites.net
Prod: https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net
GET https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/reset
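If the browser is not convenient, the same reset can be issued from PowerShell; whether the /reset endpoint requires the x-api-key header (as the skillset call below does) is an assumption - drop the header if the Flask app ignores or rejects it:
Invoke-RestMethod -Method GET `
-Uri "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/reset" `
-Headers @{ "x-api-key" = "<x-api-key value from the skillset definition>" }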
In the DSU Azure Search Service > Skillsets, scrolling through the JSON we can see how a call is made to the above API endpoint:
"uri": "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register",
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1000,
...
"httpHeaders": {
"x-api-key": "7l6w2hjfAiZuTBNo3rRUGID1"
}
Step 4
In the Sidra Core DB:
Delete the Asset metadata from Sidra's Core database:
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT *
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
SELECT COUNT(*) AS [SharePoint-Assets-count-in-Core]
FROM [DataIngestion].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
/*
DELETE FROM [DataIngestion].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
*/
Step 5
Using Azure Storage Explorer or the az storage blob command:
Clean up the DSU STAGE Azure Storage Account, removing the blobs from the following containers:
- SharePoint/documents
- CAUTION: IndexLanding (MAYBE NOT)
- KnowledgeStore
az storage blob delete-batch --help
az storage blob delete-batch `
--account-name "ap01sdrdsustastage" `
--source "sharepoint" `
--pattern "*/*" `
--account-key "SECRET"
az storage blob copy start-batch --help
az storage blob copy start-batch `
--account-name "ap01sdrdsustastage" `
--source-container "knowledgestore" `
--pattern "*" `
--destination-container "knowledgestoredeleteme2" `
--dryrun
Step 6
In the DSU Resource Group:
Delete the tables from Databricks, removing the index structure in the tables for the Provider.Entity:
TRUNCATE TABLE SharePoint.Documents;
TRUNCATE TABLE SharePoint.DocumentsValidationErrors;
-- or DROP TABLE IF EXISTS <table>, if the tables should be recreated from scratch (see Step 10, ReCreateTableOnDeployment)
Note: Execute the commands from a new Workspace Notebook.
Why delete tables
When re-ingesting files, we're cleaning up the existing file content to avoid having duplicate content when the same files are indexed twice. The same thing is done in...
Sidra Core DB: we're removing the existing Provider > Entity > Asset records;
Azure Search Service: we're removing the existing index/indexer, data source, and skillset
Step 7
In the DSU Resource Group:
Clean up the Azure Search Service associated with the DSU.
To see the Index, Data Sources, and Skillsets, it may be necessary to enable access from Azure services to the Search Service, from the Overview blade.
Go to Indexers. Remove them - in the pop-up, make sure to check both "Delete data source..." and "Delete Skillset…" (a REST sketch for scripting the cleanup follows the resource names below):
SemanticSearchPipeline-SharePoint-Documents-Blob
SemanticSearchPipeline-SharePoint-Documents-Sql
Make sure that Index, Data Sources and Skillsets are also deleted:
Index:
SemanticSearchPipeline
Data Sources:
DS-SharePoint-Documents-Blob
DS-SharePoint-Documents-Sql
Skillsets:
SemanticSearchPipeline
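If the portal is not accessible, the same cleanup can be scripted against the Azure Cognitive Search REST API. This is only a sketch: the service endpoint, admin key and api-version are assumptions, the object names must match what the portal shows, and - unlike the portal pop-up - deleting an indexer via REST does not cascade to its data source, skillset or index, so each object is deleted explicitly:
$search = "https://<search-service-name>.search.windows.net"
$headers = @{ "api-key" = "<admin-key-from-the-Keys-blade>" }
$v = "api-version=2020-06-30"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/indexers/semanticsearchpipeline-sharepoint-documents-blob?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/indexers/semanticsearchpipeline-sharepoint-documents-sql?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/datasources/ds-sharepoint-documents-blob?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/datasources/ds-sharepoint-documents-sql?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/skillsets/semanticsearchpipeline?$v"
Invoke-RestMethod -Method DELETE -Headers $headers -Uri "$search/indexes/semanticsearchpipeline?$v"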
Step 8
In the DB of the client app - Portal DB:
Remove the Asset records from the Portal DB (not the Sidra Core DB, this time):
I'm guessing this is:
RG a-p01-sdr-portal-rg
SQL Server a-p01-sdr-portal-sql
DB a-p01-sdr-portal-operational-sqldb
SELECT *
FROM [Sidra].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [Sidra].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT *
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
SELECT COUNT(*) AS [Documents-count-in-Portal]
FROM [eos_portal].[Documents] D
WHERE D.[SidraId] IN
(
SELECT [Id] FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
/*
DELETE FROM [eos_portal].[Documents]
WHERE [SidraId] IN
(
SELECT [Id] FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
*/
SELECT COUNT(*) AS [SharePoint-Assets-count-in-Portal]
FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
/*
DELETE FROM [Sidra].[Asset]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [Sidra].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [Sidra].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
*/
Step 9
In the Sidra Core DB:
Change the LastUpdated property of the Pipeline; when a file is uploaded, the Index will be created again.
DECLARE @Id_Pipeline INT = (SELECT TOP 1 [IdPipeline]
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
-- SELECT @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Id] = @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Name] = 'PipelineSharePointAzureSearch'
/*
UPDATE [DataIngestion].[Pipeline]
SET [LastUpdated] = GETUTCDATE(),
[ValidFrom] = '2020-01-01'
WHERE [Id] = @Id_Pipeline
UPDATE [DataIngestion].[Pipeline]
SET [LastUpdated] = GETUTCDATE(),
[ValidFrom] = '2020-01-01'
WHERE [Name] = 'PipelineSharePointAzureSearch'
*/
Also, re-enable the pipeline by setting IsRemoved to 0.
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] LIKE 'SharePoint%'
SELECT *
FROM [DataIngestion].[Provider]
WHERE [ProviderName] = 'SharePoint'
SELECT TOP 100 *
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
DECLARE @Id_Pipeline INT = (SELECT TOP 1 [IdPipeline]
FROM [DataIngestion].[EntityPipeline]
WHERE [IdEntity] IN
(
SELECT [Id]
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
)
)
-- SELECT @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Id] = @Id_Pipeline
SELECT *
FROM [DataIngestion].[Pipeline]
WHERE [Name] = 'PipelineSharePointAzureSearch'
/*
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 0
WHERE [Id] = @Id_Pipeline
UPDATE [DataIngestion].[Pipeline]
SET [IsRemoved] = 0
WHERE [Name] = 'PipelineSharePointAzureSearch'
*/
Step 10
In the Sidra Core DB:
The SharePoint.Documents Entity must have its [ReCreateTableOnDeployment] set to true (1), so that Databricks tables are created:
SELECT *
FROM [DataIngestion].[Entity]
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
UPDATE [DataIngestion].[Entity]
SET [ReCreateTableOnDeployment] = 1
WHERE [IdProvider] =
(
SELECT TOP 1 [Id] FROM [DataIngestion].[Provider] WHERE [ProviderName] = 'SharePoint'
)
Step 11
Using Azure Storage Explorer:
Upload the files previously downloaded from the "SharePoint" container (and maybe "SharePointOriginal" too) into the "IndexLanding" container, folder "sharepoint/documents", to have the files re-ingested (see the az CLI sketch below).
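A possible az CLI equivalent of the upload (account name from Step 5; destination container/path as described above; --source is the local backup folder from Steps 1 and 2):
az storage blob upload-batch --help
az storage blob upload-batch `
--account-name "ap01sdrdsustastage" `
--destination "indexlanding" `
--destination-path "sharepoint/documents" `
--source "C:\backup\sharepoint" `
--account-key "SECRET"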
Note: surdiales@plainconcepts.com already recommended setting a lower batch size in the settings of the Sidra API. This should ensure that the batch processing does not exceed the 30-minute threshold, which would cause the batch to be terminated prematurely.
AzureSearchIndexing:BatchSize = 20
Step 12
If the Provider's Databricks table(s) don't show up:
- SharePoint.Documents
- SharePoint.DocumentsValidationErrors
Try manually executing the notebook:
- Shared > AzureSearchKnowledgeStoreIngestion
Note: The code in the Notebook will stop without creating the tables if the Azure Search Service indexers are still busy.
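To check whether the indexers are still busy before re-running the notebook, the indexer status endpoint of the Search Service can be queried (service name, admin key and api-version are assumptions, as in Step 7):
Invoke-RestMethod -Method GET `
-Headers @{ "api-key" = "<admin-key>" } `
-Uri "https://<search-service-name>.search.windows.net/indexers/semanticsearchpipeline-sharepoint-documents-blob/status?api-version=2020-06-30"
# lastResult.status should be "success" (not "inProgress") before running the notebook.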
Step 13
Rebuild and then save the index of the Azure Search Service used by the Client App:
POST https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/upgrade { "recreate_index": true }
POST https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/save
Note: These 2 steps should be run periodically to keep the Search Service in good shape.
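A minimal PowerShell sketch of the two calls; as in Step 3, the need for the x-api-key header is an assumption:
$headers = @{ "x-api-key" = "<x-api-key value from the skillset definition>" }
Invoke-RestMethod -Method POST -Headers $headers `
-Uri "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/upgrade" `
-ContentType "application/json" `
-Body '{ "recreate_index": true }'
Invoke-RestMethod -Method POST -Headers $headers `
-Uri "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/save"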
In the DSU Azure Search Service > Skillsets, scrolling through the JSON we can see how a call is made to the above API endpoint:
"uri": "https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register",
"httpMethod": "POST",
"timeout": "PT3M50S",
"batchSize": 1000,
...
"httpHeaders": {
"x-api-key": "7l6…GID1"
}
Notes
Post re-ingestion: EOS Portal
After binary files are re-ingested into Sidra, they are published to Hermes' EOS Portal.
The publishing is done by executing the EMSLoad pipeline from the following Data Factory resource:
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-portal2-rg
/providers/Microsoft.DataFactory/factories/a-p01-sdr-portal-dft/overview
The pipeline is executed by a trigger each morning around 4:40, but it can also be run manually (see the sketch below).
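To run it from the command line instead of the Data Factory UI, something like the following should work (requires the datafactory az CLI extension; pipeline, factory and resource group names are taken from above):
az datafactory pipeline create-run `
--resource-group "a-p01-sdr-portal2-rg" `
--factory-name "a-p01-sdr-portal-dft" `
--name "EMSLoad"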
Performance
Ingestion in PROD was visibly slower than ingestion in UAT, due to the Azure Search Service indexing rate.
This happened despite the higher/better Azure Search Service plan in PROD.

In fact, it is likely that the Azure Search Service plan has little influence; another component involved in file indexing has a greater influence.
The application identified below has a much greater impact on the indexing rate than the plan: it is dragging the performance down considerably.
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-dsu-app-nlpdatalab-rg
/providers/Microsoft.Web/sites/a-p01-sdr-nlpl-wst-flask/appServices
The above application has frequent spikes of high latency: response times of 3 to 4 minutes, while its request pipeline never exceeds 10 requests.

This application hooks into the pipeline of the Search Service Skillset semanticsearchpipeline, as a step that performs calls to:
https://a-p01-sdr-nlpl-wst-flask.azurewebsites.net/register
This skillset is used by the Indexer semanticsearchpipeline-sharepoint-documents-blob.
The application, however, does not contribute to the Index of the Search Service (semanticsearchpipeline).
Hence, the app is dragging down the indexing rate without adding value to the indexing itself.
This app exhibits the same behavior in UAT, though (for unknown reasons) its impact was smaller there.
This app should do its job outside the Skillset pipeline, probably by simply querying the index (semanticsearchpipeline).
Resource scaling
In an attempt to improve performance during the file re-ingestion, the following resources were bumped up:
The indexing app's database, scaled up from 10 DTU and 250 GB:
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-dsu-app-nlpdatalab-rg
/providers/Microsoft.Sql/servers/a-p01-sdr-nlpl-sdb
/databases/Client
The indexing app's hosting plan, scaled up from P1V2:
/resource/subscriptions/bef2db33-37d3-4615-8cd8-e78904d8a545
/resourceGroups/a-p01-sdr-dsu-app-nlpdatalab-rg
/providers/Microsoft.Web/serverFarms/a-p01-sdr-nlpl-lnx/webHostingPlan