Automatisation avancée de scraping web avec IA et intégration API

Documentation Complète

📋 Automatisation avancée de scraping web avec IA et intégration API

📁 Data🎯 Avancé⏱️ 2 heures

💡 Description

Ce workflow n8n utilise les outils avancés de Bright Data et l'intelligence artificielle de Google Gemini pour automatiser le scraping web. Il permet aux entreprises d'extraire des données structurées depuis des pages web en utilisant des agents intelligents. Grâce à l'automatisation, vous pouvez réduire considérablement le temps passé sur la collecte manuelle de données, améliorer la précision et augmenter l'efficacité globale. Le contenu extrait est sauvegardé sous forme de fichier, ce qui facilite l'analyse ultérieure. Ce processus améliore la prise de décision basée sur les données et optimise les opérations commerciales.

📈 Impact & ROI: En automatisant le processus de collecte de données, ce workflow réduit le temps et les coûts associés à la recherche manuelle tout en augmentant la précision, ce qui se traduit par un meilleur retour sur investissement.

🚀 Fonctionnalités Clés

✅ Automatisation du scraping web - Réduction des tâches manuelles
✅ Intégration avec Google Gemini - Intelligence contextuelle accrue
✅ Extraction de données précises - Amélioration de la qualité des informations
✅ Sauvegarde automatique - Facilite l'analyse et la traçabilité

📊 Architecture Technique

19
Nodes

13
Connexions

3
Services

🔌 Services Intégrés

Bright DataGoogle GeminiMCP Client

🔧 Composition du Workflow

Node	Type	Description
AI Agent	`@n8n/n8n-nodes-langchain.agent`	Traitement des données
When clicking ‘Test workflow’	`manualTrigger`	Traitement des données
MCP Client list all tools for Bright Data	`n8n-nodes-mcp.mcpClient`	Traitement des données
Sticky Note1	`stickyNote`	Traitement des données
MCP Client List all tools	`n8n-nodes-mcp.mcpClientTool`	Traitement des données
MCP Client Bright Data Web Scraper	`n8n-nodes-mcp.mcpClient`	Traitement des données
Webhook for web scraper	`httpRequest`	Requête HTTP vers une API externe
Set the URLs	`set`	Traitement des données
MCP Client to Scrape as Markdown	`n8n-nodes-mcp.mcpClientTool`	Traitement des données
MCP Client to Scrape as HTML	`n8n-nodes-mcp.mcpClientTool`	Traitement des données
Google Gemini Chat Model for AI Agent	`@n8n/n8n-nodes-langchain.lmChatGoogleGemini`	Traitement des données
Sticky Note	`stickyNote`	Traitement des données
Simple Memory	`@n8n/n8n-nodes-langchain.memoryBufferWindow`	Traitement des données
Webhook for Web Scraper AI Agent	`httpRequest`	Requête HTTP vers une API externe
Set the URL with the Webhook URL and data format	`set`	Traitement des données
Create a binary data	`function`	Transformation de données avec code personnalisé
Write the scraped content to disk	`readWriteFile`	Traitement des données
Sticky Note2	`stickyNote`	Traitement des données
Sticky Note3	`stickyNote`	Traitement des données

📖 Guide d'Implémentation

Import du workflow: Téléchargez le fichier JSON et importez-le dans votre instance n8n
Configuration des credentials: Configurez les accès pour chaque service utilisé
Personnalisation: Adaptez les paramètres selon vos besoins spécifiques
Test: Exécutez le workflow en mode test pour vérifier le bon fonctionnement
Activation: Activez le workflow pour une exécution automatique

🏷️ Tags

web scrapingautomatisationintelligence artificielle

Structure JSON

Voir le code JSON complet

{
    "id": "U6cY7PPR0vaRl1I0",
    "meta": {
        "instanceId": "885b4fb4a6a9c2cb5621429a7b972df0d05bb724c20ac7dac7171b62f1c7ef40",
        "templateCredsSetupCompleted": true
    },
    "name": "Scrape Web Data with Bright Data, Google Gemini and MCP Automated AI Agent",
    "tags": [
        {
            "id": "ZOwtAMLepQaGW76t",
            "name": "Building Blocks",
            "createdAt": "2025-04-13T15:23:40.462Z",
            "updatedAt": "2025-04-13T15:23:40.462Z"
        },
        {
            "id": "ddPkw7Hg5dZhQu2w",
            "name": "AI",
            "createdAt": "2025-04-13T05:38:08.053Z",
            "updatedAt": "2025-04-13T05:38:08.053Z"
        }
    ],
    "nodes": [
        {
            "id": "0c747f5b-ef72-4e00-a028-4a08461dae28",
            "name": "AI Agent",
            "type": "@n8n\/n8n-nodes-langchain.agent",
            "notes": "Bright Data Web Scraping Agent",
            "position": [
                -140,
                60
            ],
            "parameters": {
                "text": "=Scrape the web data as per the provided URL:  {{ $json.url }} using the format as {{ $json.format }}",
                "options": {
                    "systemMessage": "=You are a helpful assistant."
                },
                "promptType": "define"
            },
            "notesInFlow": true,
            "typeVersion": 1.8
        },
        {
            "id": "16c7cd90-39da-47a4-8020-a0aa8f87275a",
            "name": "When clicking ‘Test workflow’",
            "type": "n8n-nodes-base.manualTrigger",
            "position": [
                -640,
                -300
            ],
            "parameters": [],
            "typeVersion": 1
        },
        {
            "id": "7b544505-6b6b-4500-a2ad-f7cf62f98c13",
            "name": "MCP Client list all tools for Bright Data",
            "type": "n8n-nodes-mcp.mcpClient",
            "position": [
                -380,
                -300
            ],
            "parameters": [],
            "credentials": {
                "mcpClientApi": {
                    "id": "JtatFSfA2kkwctYa",
                    "name": "MCP Client (STDIO) account"
                }
            },
            "typeVersion": 1
        },
        {
            "id": "9f5bf319-9414-4974-bad2-ea24f09ae351",
            "name": "Sticky Note1",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                -220,
                -400
            ],
            "parameters": {
                "color": 3,
                "width": 440,
                "height": 320,
                "content": "## Bright Data Web Scraper"
            },
            "typeVersion": 1
        },
        {
            "id": "222e81cf-878e-42de-a325-6b6659145f98",
            "name": "MCP Client List all tools",
            "type": "n8n-nodes-mcp.mcpClientTool",
            "position": [
                440,
                420
            ],
            "parameters": [],
            "credentials": {
                "mcpClientApi": {
                    "id": "JtatFSfA2kkwctYa",
                    "name": "MCP Client (STDIO) account"
                }
            },
            "typeVersion": 1
        },
        {
            "id": "a10dfd4f-cadf-4952-8449-1865406358d4",
            "name": "MCP Client Bright Data Web Scraper",
            "type": "n8n-nodes-mcp.mcpClient",
            "notes": "Scrape a single webpage URL with advanced options for content extraction and get back the results in MarkDown language.",
            "position": [
                60,
                -300
            ],
            "parameters": {
                "toolName": "=scrape_as_markdown",
                "operation": "executeTool",
                "toolParameters": "={\n   \"url\": \"{{ $json.url }}\"\n} "
            },
            "credentials": {
                "mcpClientApi": {
                    "id": "JtatFSfA2kkwctYa",
                    "name": "MCP Client (STDIO) account"
                }
            },
            "notesInFlow": true,
            "typeVersion": 1
        },
        {
            "id": "0acbd4ff-ce4a-4ff2-b213-2d80dd91e302",
            "name": "Webhook for web scraper",
            "type": "n8n-nodes-base.httpRequest",
            "position": [
                280,
                -300
            ],
            "parameters": {
                "url": "=https:\/\/webhook.site\/daf9d591-a130-4010-b1d3-0c66f8fcf467",
                "options": [],
                "sendBody": true,
                "bodyParameters": {
                    "parameters": [
                        {
                            "name": "response",
                            "value": "={{ $json.result.content[0].text }}"
                        }
                    ]
                }
            },
            "typeVersion": 4.2
        },
        {
            "id": "009ac29f-8cad-4b58-9ca4-e75470a52dcc",
            "name": "Set the URLs",
            "type": "n8n-nodes-base.set",
            "position": [
                -160,
                -300
            ],
            "parameters": {
                "options": [],
                "assignments": {
                    "assignments": [
                        {
                            "id": "214e61a0-3587-453f-baf5-eac013990857",
                            "name": "url",
                            "type": "string",
                            "value": "https:\/\/about.google\/"
                        },
                        {
                            "id": "45014942-0a2e-4f46-b395-f82f97bfa93e",
                            "name": "webhook_url",
                            "type": "string",
                            "value": "https:\/\/webhook.site\/ce41e056-c097-48c8-a096-9b876d3abbf7"
                        }
                    ]
                }
            },
            "typeVersion": 3.4
        },
        {
            "id": "104706dd-ae58-47fd-8fea-cefa986ae40c",
            "name": "MCP Client to Scrape as Markdown",
            "type": "n8n-nodes-mcp.mcpClientTool",
            "notes": "Scrape a single webpage URL with advanced options for content extraction and get back the results in MarkDown language.",
            "position": [
                -60,
                400
            ],
            "parameters": {
                "toolName": "scrape_as_markdown",
                "operation": "executeTool",
                "toolParameters": "={\n  \"url\": \"{{ $json.url }}\"\n} ",
                "descriptionType": "manual",
                "toolDescription": "Scrape a single webpage URL with advanced options for content extraction and get back the results in MarkDown language."
            },
            "credentials": {
                "mcpClientApi": {
                    "id": "JtatFSfA2kkwctYa",
                    "name": "MCP Client (STDIO) account"
                }
            },
            "notesInFlow": false,
            "typeVersion": 1
        },
        {
            "id": "c03c655e-45c8-4278-a01f-ba48282459c5",
            "name": "MCP Client to Scrape as HTML",
            "type": "n8n-nodes-mcp.mcpClientTool",
            "notes": "Scrape a single webpage URL with advanced options for content extraction and get back the results in HTML.",
            "position": [
                200,
                400
            ],
            "parameters": {
                "toolName": "scrape_as_html",
                "operation": "executeTool",
                "toolParameters": "{\n  \"url\": \"{{ $json.url }}\"\n} ",
                "descriptionType": "manual",
                "toolDescription": "Scrape a single webpage URL with advanced options for content extraction and get back the results in HTML."
            },
            "credentials": {
                "mcpClientApi": {
                    "id": "JtatFSfA2kkwctYa",
                    "name": "MCP Client (STDIO) account"
                }
            },
            "notesInFlow": true,
            "typeVersion": 1
        },
        {
            "id": "587300ff-44e5-4ff2-9dee-a4f8720ca26b",
            "name": "Google Gemini Chat Model for AI Agent",
            "type": "@n8n\/n8n-nodes-langchain.lmChatGoogleGemini",
            "position": [
                -520,
                400
            ],
            "parameters": {
                "options": [],
                "modelName": "models\/gemini-2.0-flash-exp"
            },
            "credentials": {
                "googlePalmApi": {
                    "id": "YeO7dHZnuGBVQKVZ",
                    "name": "Google Gemini(PaLM) Api account"
                }
            },
            "typeVersion": 1
        },
        {
            "id": "38ba13a1-f8f7-48ce-a05d-2c2526de606d",
            "name": "Sticky Note",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                -140,
                340
            ],
            "parameters": {
                "color": 4,
                "width": 480,
                "height": 260,
                "content": "## Bright Data Web Scraper Tools"
            },
            "typeVersion": 1
        },
        {
            "id": "e7c8d333-e256-4944-a584-575162072ca4",
            "name": "Simple Memory",
            "type": "@n8n\/n8n-nodes-langchain.memoryBufferWindow",
            "position": [
                -280,
                400
            ],
            "parameters": {
                "sessionKey": "=Perform the web scraping for the below URL\n\n{{ $json.url }}",
                "sessionIdType": "customKey",
                "contextWindowLength": 10
            },
            "typeVersion": 1.3
        },
        {
            "id": "5e89519e-8ee5-4c3b-807d-21cef6e36c32",
            "name": "Webhook for Web Scraper AI Agent",
            "type": "n8n-nodes-base.httpRequest",
            "position": [
                260,
                120
            ],
            "parameters": {
                "url": "={{ $('Set the URL with the Webhook URL and data format').item.json.webhook_url }}",
                "options": [],
                "sendBody": true,
                "bodyParameters": {
                    "parameters": [
                        {
                            "name": "response",
                            "value": "={{ $json.output }}"
                        }
                    ]
                }
            },
            "typeVersion": 4.2
        },
        {
            "id": "2de093f4-15e5-4710-83d9-e6d9ed852873",
            "name": "Set the URL with the Webhook URL and data format",
            "type": "n8n-nodes-base.set",
            "position": [
                -400,
                60
            ],
            "parameters": {
                "options": [],
                "assignments": {
                    "assignments": [
                        {
                            "id": "214e61a0-3587-453f-baf5-eac013990857",
                            "name": "url",
                            "type": "string",
                            "value": "https:\/\/about.google\/"
                        },
                        {
                            "id": "45014942-0a2e-4f46-b395-f82f97bfa93e",
                            "name": "webhook_url",
                            "type": "string",
                            "value": "https:\/\/webhook.site\/daf9d591-a130-4010-b1d3-0c66f8fcf467"
                        },
                        {
                            "id": "7f6c03f6-9fa3-45f9-bf81-243b7106bdac",
                            "name": "format",
                            "type": "string",
                            "value": "scrape_as_markdown"
                        }
                    ]
                }
            },
            "typeVersion": 3.4
        },
        {
            "id": "04a5b11f-1990-440e-be23-2fbcb985dd4a",
            "name": "Create a binary data",
            "type": "n8n-nodes-base.function",
            "position": [
                260,
                -80
            ],
            "parameters": {
                "functionCode": "items[0].binary = {\n  data: {\n    data: new Buffer(JSON.stringify(items[0].json, null, 2)).toString('base64')\n  }\n};\nreturn items;"
            },
            "typeVersion": 1
        },
        {
            "id": "0765ffcd-4746-45f7-add8-f82d66709321",
            "name": "Write the scraped content to disk",
            "type": "n8n-nodes-base.readWriteFile",
            "position": [
                460,
                -80
            ],
            "parameters": {
                "options": [],
                "fileName": "d:\\Scraped-Content.json",
                "operation": "write"
            },
            "typeVersion": 1
        },
        {
            "id": "16cd9b48-765d-4f0d-8e78-6c41cfc50b03",
            "name": "Sticky Note2",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                -220,
                -540
            ],
            "parameters": {
                "width": 440,
                "height": 120,
                "content": "## Disclaimer\nThis template is only available on n8n self-hosted as it's making use of the community node for MCP Client."
            },
            "typeVersion": 1
        },
        {
            "id": "ff4b9e9c-a0f0-4cdb-ad53-55dc412794fc",
            "name": "Sticky Note3",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                -1080,
                -80
            ],
            "parameters": {
                "color": 5,
                "width": 480,
                "height": 380,
                "content": "## Note\nThe AI agent utilizes Bright Data's MCP tools to perform web scraping based on user requests. It intelligently selects the most suitable web scraping tool to fulfill the user's query.\n\nOnce the web scraping is complete, the AI agent's response is:\n\n1. Used to trigger a webhook call.\n\n2. Persisted to disk for future reference.\n\nGoogle Gemini is employed by the AI agent to understand and interpret user queries. Based on this interpretation, the agent initiates a call to the appropriate MCP client to perform the required web scraping task.\n\nSource - https:\/\/github.com\/luminati-io\/brightdata-mcp"
            },
            "typeVersion": 1
        }
    ],
    "active": false,
    "pinData": [],
    "settings": {
        "executionOrder": "v1"
    },
    "versionId": "ccf9f9af-82d5-4751-b06d-497c043c85fc",
    "connections": {
        "AI Agent": {
            "main": [
                [
                    {
                        "node": "Webhook for Web Scraper AI Agent",
                        "type": "main",
                        "index": 0
                    },
                    {
                        "node": "Create a binary data",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Set the URLs": {
            "main": [
                [
                    {
                        "node": "MCP Client Bright Data Web Scraper",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Simple Memory": {
            "ai_memory": [
                [
                    {
                        "node": "AI Agent",
                        "type": "ai_memory",
                        "index": 0
                    }
                ]
            ]
        },
        "Create a binary data": {
            "main": [
                [
                    {
                        "node": "Write the scraped content to disk",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "MCP Client List all tools": {
            "ai_tool": [
                [
                    {
                        "node": "AI Agent",
                        "type": "ai_tool",
                        "index": 0
                    }
                ]
            ]
        },
        "MCP Client to Scrape as HTML": {
            "ai_tool": [
                [
                    {
                        "node": "AI Agent",
                        "type": "ai_tool",
                        "index": 0
                    }
                ]
            ]
        },
        "MCP Client to Scrape as Markdown": {
            "ai_tool": [
                [
                    {
                        "node": "AI Agent",
                        "type": "ai_tool",
                        "index": 0
                    }
                ]
            ]
        },
        "Webhook for Web Scraper AI Agent": {
            "main": [
                []
            ]
        },
        "When clicking ‘Test workflow’": {
            "main": [
                [
                    {
                        "node": "MCP Client list all tools for Bright Data",
                        "type": "main",
                        "index": 0
                    },
                    {
                        "node": "Set the URL with the Webhook URL and data format",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "MCP Client Bright Data Web Scraper": {
            "main": [
                [
                    {
                        "node": "Webhook for web scraper",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Google Gemini Chat Model for AI Agent": {
            "ai_languageModel": [
                [
                    {
                        "node": "AI Agent",
                        "type": "ai_languageModel",
                        "index": 0
                    }
                ]
            ]
        },
        "MCP Client list all tools for Bright Data": {
            "main": [
                [
                    {
                        "node": "Set the URLs",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Set the URL with the Webhook URL and data format": {
            "main": [
                [
                    {
                        "node": "AI Agent",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        }
    }
}

Nodes

Connexions

93855

Vues

21464

Copies

Détails Techniques

Type de workflow: Automatisation
Complexité: Avancée
Taille du fichier: 14.3 KB

Services Intégrés

@n8n/langchain.agent Manualtrigger Mcp.mcpclient Stickynote Mcp.mcpclienttool Httprequest @n8n/langchain.lmchatgooglegemini @n8n/langchain.memorybufferwindow Readwritefile

Distribution des Nodes

@n8n/langchain.agent 1

manualTrigger 1

mcp.mcpClient 2

stickyNote 4

mcp.mcpClientTool 3

httpRequest 2

set 2

@n8n/langchain.lmChatGoogleGemini 1

@n8n/langchain.memoryBufferWindow 1

function 1

readWriteFile 1

Workflows Similaires

Comprehensive n8n Creator Stats Automation Workflow

Automate the reporting of top n8n creators and workflows with this powerful workflow. By aggregating data from GitHub, g...

236 0

Analyse Automatisée des États Américains par l'IA

Ce workflow n8n permet d'analyser automatiquement les plus grands états des USA en termes de superficie, en listant leu...

94,158 28,143

Automatisez l'import de CSV vers Excel en toute simplicité

Ce workflow n8n simplifie la conversion de fichiers CSV en fichiers Excel (.xlsx), un processus essentiel pour les profe...

11,724 3,111