Optimize Your Vector Datasets with AI

This n8n workflow builds vector datasets optimized for language models by combining Bright Data, Google Gemini, and Pinecone. Aimed at teams that want to strengthen their analytics and machine-learning capabilities, the automated process extracts web data through Bright Data, formats it with Gemini models, and stores it in a Pinecone vector database. By cutting the time spent on data preparation, it makes AI applications quicker to build.


Full Documentation

📈 Impact & ROI: Improves operational efficiency by automating data preparation for AI projects, which shortens development time and makes better use of analytics resources.

🚀 Key Features

  • ✅ Targeted web data extraction
  • ✅ Smooth integration with advanced AI models
  • ✅ Efficient storage in a vector database
  • ✅ Automated formatting and insertion
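
The extraction feature comes down to one POST request, which the workflow JSON shows going to `https://api.brightdata.com/request` with three body fields (`zone`, `url`, `format`). The sketch below builds that request in Python without sending it. Two points are assumptions, not taken from the workflow: that the n8n header credential carries a Bearer token, and that the body is sent as JSON. The function name `build_unlocker_request` is hypothetical.

```python
import json
import urllib.request

# Endpoint and body field names are taken from the 'Make a web request' node.
BRIGHT_DATA_ENDPOINT = "https://api.brightdata.com/request"


def build_unlocker_request(target_url: str, api_token: str,
                           zone: str = "web_unlocker1") -> urllib.request.Request:
    """Build (but do not send) a request shaped like the one the node issues."""
    payload = {"zone": zone, "url": target_url, "format": "raw"}
    return urllib.request.Request(
        BRIGHT_DATA_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # Assumption: the stored header credential is a Bearer token.
            "Authorization": f"Bearer {api_token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_unlocker_request("https://news.ycombinator.com", "YOUR_API_TOKEN")
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
```

Passing the returned object to `urllib.request.urlopen` would send it; that step is left out here so the sketch runs without credentials.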

📊 Technical Architecture

Nodes: 21
Connections: 14
Services: 3
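
These figures can be checked against an exported workflow. The sketch below counts the nodes and the top-level entries of `connections` (the exported JSON further down has 14 such entries, which matches the connection figure). The helper name `summarize_workflow` and the tiny sample are hypothetical; the field names `nodes`, `connections`, and `type` follow the export format shown on this page.

```python
from collections import Counter


def summarize_workflow(workflow: dict) -> dict:
    """Count nodes, nodes with outgoing connections, and node types."""
    nodes = workflow.get("nodes", [])
    connections = workflow.get("connections", {})
    # Keep only the short type name, e.g. 'agent' or 'stickyNote'.
    type_counts = Counter(node["type"].split(".")[-1] for node in nodes)
    return {
        "nodes": len(nodes),
        "connection_sources": len(connections),
        "types": dict(type_counts),
    }


if __name__ == "__main__":
    # Hypothetical three-node sample, not the full template.
    sample = {
        "nodes": [
            {"name": "Trigger", "type": "n8n-nodes-base.manualTrigger"},
            {"name": "Note", "type": "n8n-nodes-base.stickyNote"},
            {"name": "Agent", "type": "@n8n/n8n-nodes-langchain.agent"},
        ],
        "connections": {
            "Trigger": {"main": [[{"node": "Agent", "type": "main", "index": 0}]]},
        },
    }
    print(summarize_workflow(sample))
```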

🔌 Integrated Services

Bright Data, Google Gemini, Pinecone

🔧 Workflow Composition

Node | Type | Description
When clicking ‘Test workflow’ | manualTrigger | Starts the workflow manually
AI Agent | @n8n/n8n-nodes-langchain.agent | Formats the extracted search result into structured output
Pinecone Vector Store | @n8n/n8n-nodes-langchain.vectorStorePinecone | Inserts documents into the "hacker-news" Pinecone index
Embeddings Google Gemini | @n8n/n8n-nodes-langchain.embeddingsGoogleGemini | Generates embeddings with models/text-embedding-004
Default Data Loader | @n8n/n8n-nodes-langchain.documentDefaultDataLoader | Loads the extracted search result as documents
Recursive Character Text Splitter | @n8n/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter | Splits document text into chunks
Google Gemini Chat Model1 | @n8n/n8n-nodes-langchain.lmChatGoogleGemini | Language model for the AI Agent
Google Gemini Chat Model2 | @n8n/n8n-nodes-langchain.lmChatGoogleGemini | Language model for the Structured JSON Data Formatter
Google Gemini Chat Model | @n8n/n8n-nodes-langchain.lmChatGoogleGemini | Language model for the Information Extractor
Structured Output Parser | @n8n/n8n-nodes-langchain.outputParserStructured | Parses the AI Agent output against a JSON example
Sticky Note | stickyNote | Canvas note with setup instructions
Set Fields - URL and Webhook URL | set | Sets the target URL and the webhook URL
Make a web request | httpRequest | HTTP POST request to the Bright Data API
Structured JSON Data Formatter | @n8n/n8n-nodes-langchain.chainLlm | Formats the scraped response according to a JSON schema
Webhook for structured data | httpRequest | HTTP request that sends the formatted data to the webhook URL
Webhook for structured AI agent response | httpRequest | HTTP request that sends the AI Agent output to the webhook URL
Sticky Note1 | stickyNote | Canvas label: AI Data Formatter
Sticky Note2 | stickyNote | Canvas label: Vector Database Persistence
Sticky Note3 | stickyNote | Canvas label: Webhook Notification Handler
Information Extractor with Data Formatter | @n8n/n8n-nodes-langchain.informationExtractor | Extracts the search result from the scraped page
Sticky Note4 | stickyNote | Canvas label: Data Extraction/Formatting with the AI Agent
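
To make the Recursive Character Text Splitter entry concrete, here is a simplified sketch of the general technique: try coarse separators first and fall back to finer ones only for pieces that are still too long. This is not the code n8n or LangChain runs. The function name `recursive_split`, the separator list, and the chunk size are assumptions, and chunk overlap is left out. The workflow itself keeps the node's options at their defaults.

```python
def recursive_split(text: str, chunk_size: int = 100,
                    separators: tuple[str, ...] = ("\n\n", "\n", " ", "")) -> list[str]:
    """Split text into chunks of at most chunk_size characters."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut every chunk_size characters.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # the piece still fits in the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk
            current = ""
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too long: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    for chunk in recursive_split("aaaa bbbb cccc dddd", chunk_size=9):
        print(repr(chunk))
```

Each chunk would then be embedded and inserted into the vector store as its own record, so no single record exceeds the chosen size.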

📖 Implementation Guide

  1. Import the workflow: Download the JSON file and import it into your n8n instance
  2. Configure credentials: Set up access for each service used (Bright Data header auth, Google Gemini, Pinecone)
  3. Customize: Adapt the parameters to your needs, starting with the target URL and the webhook URL
  4. Test: Run the workflow in test mode to check that it works
  5. Activate: Activate the workflow for automatic execution. The template starts from a manual trigger, so replace it with a schedule or webhook trigger first
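
The customization step can also be done on the JSON file before importing it. The sketch below returns a copy of the workflow with the `url` and `webhook_url` values of the Set node replaced. The node name and the `assignments` layout come from the JSON on this page; the function name `customize_workflow` is hypothetical.

```python
import copy


def customize_workflow(workflow: dict, url: str, webhook_url: str,
                       set_node_name: str = "Set Fields - URL and Webhook URL") -> dict:
    """Return a copy of the workflow with the Set node's two values replaced."""
    patched = copy.deepcopy(workflow)  # leave the caller's dict untouched
    new_values = {"url": url, "webhook_url": webhook_url}
    for node in patched["nodes"]:
        if node["name"] != set_node_name:
            continue
        for assignment in node["parameters"]["assignments"]["assignments"]:
            if assignment["name"] in new_values:
                assignment["value"] = new_values[assignment["name"]]
        return patched
    raise KeyError(f"Set node {set_node_name!r} not found")
```

In practice the input would come from `json.load` on the downloaded file, and the result would be written back with `json.dump` before import.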

🏷️ Tags

AI, Vectorization, Automation

JSON Structure

{
    "id": "3Lih0LVosR8dZbla",
    "meta": {
        "instanceId": "885b4fb4a6a9c2cb5621429a7b972df0d05bb724c20ac7dac7171b62f1c7ef40",
        "templateCredsSetupCompleted": true
    },
    "name": "Create AI-Ready Vector Datasets for LLMs with Bright Data, Gemini & Pinecone",
    "tags": [
        {
            "id": "Kujft2FOjmOVQAmJ",
            "name": "Engineering",
            "createdAt": "2025-04-09T01:31:00.558Z",
            "updatedAt": "2025-04-09T01:31:00.558Z"
        },
        {
            "id": "ZOwtAMLepQaGW76t",
            "name": "Building Blocks",
            "createdAt": "2025-04-13T15:23:40.462Z",
            "updatedAt": "2025-04-13T15:23:40.462Z"
        },
        {
            "id": "ddPkw7Hg5dZhQu2w",
            "name": "AI",
            "createdAt": "2025-04-13T05:38:08.053Z",
            "updatedAt": "2025-04-13T05:38:08.053Z"
        }
    ],
    "nodes": [
        {
            "id": "0a468953-e348-420e-a6b3-c55fb20d3cbf",
            "name": "When clicking ‘Test workflow’",
            "type": "n8n-nodes-base.manualTrigger",
            "position": [
                200,
                -710
            ],
            "parameters": [],
            "typeVersion": 1
        },
        {
            "id": "3725e480-246f-4f32-b0a7-b946cacbe830",
            "name": "AI Agent",
            "type": "@n8n\/n8n-nodes-langchain.agent",
            "position": [
                1236,
                -60
            ],
            "parameters": {
                "text": "=Format the below search result\n\n{{ $json.output.search_result }}",
                "options": [],
                "promptType": "define",
                "hasOutputParser": true
            },
            "typeVersion": 1.8
        },
        {
            "id": "30a12b8e-02f5-4b2e-bf9f-20fd9658405e",
            "name": "Pinecone Vector Store",
            "type": "@n8n\/n8n-nodes-langchain.vectorStorePinecone",
            "position": [
                1628,
                -10
            ],
            "parameters": {
                "mode": "insert",
                "options": [],
                "pineconeIndex": {
                    "__rl": true,
                    "mode": "list",
                    "value": "hacker-news",
                    "cachedResultName": "hacker-news"
                }
            },
            "credentials": {
                "pineconeApi": {
                    "id": "wdfRQ6NE8yjCDFhY",
                    "name": "PineconeApi account"
                }
            },
            "typeVersion": 1.1
        },
        {
            "id": "1738dea6-fa4f-4a8d-a6fb-2f01feb1a6d5",
            "name": "Embeddings Google Gemini",
            "type": "@n8n\/n8n-nodes-langchain.embeddingsGoogleGemini",
            "position": [
                1612,
                210
            ],
            "parameters": {
                "modelName": "models\/text-embedding-004"
            },
            "credentials": {
                "googlePalmApi": {
                    "id": "YeO7dHZnuGBVQKVZ",
                    "name": "Google Gemini(PaLM) Api account"
                }
            },
            "typeVersion": 1
        },
        {
            "id": "e6443541-de71-4d26-ad58-d7c72868a190",
            "name": "Default Data Loader",
            "type": "@n8n\/n8n-nodes-langchain.documentDefaultDataLoader",
            "position": [
                1760,
                220
            ],
            "parameters": {
                "options": [],
                "jsonData": "={{ $('Information Extractor with Data Formatter').item.json.output.search_result }}",
                "jsonMode": "expressionData"
            },
            "typeVersion": 1
        },
        {
            "id": "09ffc8cd-096f-47fe-937d-f8ab4fb41266",
            "name": "Recursive Character Text Splitter",
            "type": "@n8n\/n8n-nodes-langchain.textSplitterRecursiveCharacterTextSplitter",
            "position": [
                1820,
                410
            ],
            "parameters": {
                "options": []
            },
            "typeVersion": 1
        },
        {
            "id": "90cc9aa4-0931-4c52-8734-e4e0de820205",
            "name": "Google Gemini Chat Model1",
            "type": "@n8n\/n8n-nodes-langchain.lmChatGoogleGemini",
            "position": [
                1240,
                160
            ],
            "parameters": {
                "options": [],
                "modelName": "models\/gemini-2.0-flash-exp"
            },
            "credentials": {
                "googlePalmApi": {
                    "id": "YeO7dHZnuGBVQKVZ",
                    "name": "Google Gemini(PaLM) Api account"
                }
            },
            "typeVersion": 1
        },
        {
            "id": "1090a4af-7e5d-446b-a537-3afe48cd4909",
            "name": "Google Gemini Chat Model2",
            "type": "@n8n\/n8n-nodes-langchain.lmChatGoogleGemini",
            "position": [
                948,
                -340
            ],
            "parameters": {
                "options": [],
                "modelName": "models\/gemini-2.0-flash-exp"
            },
            "credentials": {
                "googlePalmApi": {
                    "id": "YeO7dHZnuGBVQKVZ",
                    "name": "Google Gemini(PaLM) Api account"
                }
            },
            "typeVersion": 1
        },
        {
            "id": "324c530c-0a03-411e-acb0-d82e9dc635cf",
            "name": "Google Gemini Chat Model",
            "type": "@n8n\/n8n-nodes-langchain.lmChatGoogleGemini",
            "position": [
                948,
                160
            ],
            "parameters": {
                "options": [],
                "modelName": "models\/gemini-2.0-flash-exp"
            },
            "credentials": {
                "googlePalmApi": {
                    "id": "YeO7dHZnuGBVQKVZ",
                    "name": "Google Gemini(PaLM) Api account"
                }
            },
            "typeVersion": 1
        },
        {
            "id": "3226a2d6-ade1-4d6a-95c5-0be4d787a947",
            "name": "Structured Output Parser",
            "type": "@n8n\/n8n-nodes-langchain.outputParserStructured",
            "position": [
                1400,
                160
            ],
            "parameters": {
                "jsonSchemaExample": "[{\n\t\"id\": \"<string>\",\n\t\"title\": \"<string>\",\n    \"summary\": \"<string>\",\n    \"keywords\": [\"\"],\n    \"topics\": [\"\"]\n}]"
            },
            "typeVersion": 1.2
        },
        {
            "id": "a739a314-900a-4ef7-9cc2-1b65374e2e05",
            "name": "Sticky Note",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                40,
                -360
            ],
            "parameters": {
                "width": 480,
                "height": 220,
                "content": "## Note\nPlease make sure to set the URL for web crawling. \n\nThe Web Unlocker product is used to perform the web scraping. \n\nThis workflow uses the Basic LLM Chain and the Information Extractor together with an AI Agent to format, extract, and persist the response in the Pinecone vector database."
            },
            "typeVersion": 1
        },
        {
            "id": "3dca6d46-c423-4fb5-a6e4-c2aa2852d51c",
            "name": "Set Fields - URL and Webhook URL",
            "type": "n8n-nodes-base.set",
            "notes": "Set the URL you want to scrape",
            "position": [
                420,
                -710
            ],
            "parameters": {
                "options": [],
                "assignments": {
                    "assignments": [
                        {
                            "id": "1c132dd6-31e4-453b-a8cf-cad9845fe55b",
                            "name": "url",
                            "type": "string",
                            "value": "https:\/\/news.ycombinator.com?product=unlocker&method=api"
                        },
                        {
                            "id": "90f3272b-d13d-44e2-8b4c-0943648cfce9",
                            "name": "webhook_url",
                            "type": "string",
                            "value": "https:\/\/webhook.site\/bc804ce5-4a45-4177-a68a-99c80e5c86e6"
                        }
                    ]
                }
            },
            "notesInFlow": true,
            "typeVersion": 3.4
        },
        {
            "id": "216a3261-a398-484c-9bf4-ca5966b829b6",
            "name": "Make a web request",
            "type": "n8n-nodes-base.httpRequest",
            "position": [
                640,
                -260
            ],
            "parameters": {
                "url": "https:\/\/api.brightdata.com\/request",
                "method": "POST",
                "options": [],
                "sendBody": true,
                "sendHeaders": true,
                "authentication": "genericCredentialType",
                "bodyParameters": {
                    "parameters": [
                        {
                            "name": "zone",
                            "value": "web_unlocker1"
                        },
                        {
                            "name": "url",
                            "value": "={{ $json.url }}"
                        },
                        {
                            "name": "format",
                            "value": "raw"
                        }
                    ]
                },
                "genericAuthType": "httpHeaderAuth",
                "headerParameters": {
                    "parameters": [
                        []
                    ]
                }
            },
            "credentials": {
                "httpHeaderAuth": {
                    "id": "kdbqXuxIR8qIxF7y",
                    "name": "Header Auth account"
                }
            },
            "typeVersion": 4.2
        },
        {
            "id": "0c74e21c-3007-4297-b6ab-8ee17f4c6436",
            "name": "Structured JSON Data Formatter",
            "type": "@n8n\/n8n-nodes-langchain.chainLlm",
            "position": [
                860,
                -560
            ],
            "parameters": {
                "text": "=Format the below response and produce a textual data. Output the response as per the below JSON schema.\n\nHere's the input: {{ $json.data }}\nHere's the JSON schema: \n\n[{\n    \"rank\": { \"type\": \"integer\" },\n    \"title\": { \"type\": \"string\" },\n    \"site\": { \"type\": \"string\" },\n    \"points\": { \"type\": \"integer\" },\n    \"user\": { \"type\": \"string\" },\n    \"age\": { \"type\": \"string\" },\n    \"comments\": { \"type\": \"string\" }\n}]",
                "messages": {
                    "messageValues": [
                        {
                            "message": "You are an expert data formatter"
                        }
                    ]
                },
                "promptType": "define"
            },
            "typeVersion": 1.6
        },
        {
            "id": "012d4bb0-2b58-47cd-9cea-b4e0dced9082",
            "name": "Webhook for structured data",
            "type": "n8n-nodes-base.httpRequest",
            "position": [
                1314,
                -860
            ],
            "parameters": {
                "url": "={{ $json.webhook_url }}",
                "options": [],
                "sendBody": true,
                "bodyParameters": {
                    "parameters": [
                        {
                            "name": "response",
                            "value": "={{ $json.text }}"
                        }
                    ]
                }
            },
            "typeVersion": 4.2
        },
        {
            "id": "93b35e5e-6f52-4aeb-8f1b-39cc495beefe",
            "name": "Webhook for structured AI agent response",
            "type": "n8n-nodes-base.httpRequest",
            "position": [
                1750,
                -660
            ],
            "parameters": {
                "url": "={{ $json.webhook_url }}",
                "options": [],
                "sendBody": true,
                "bodyParameters": {
                    "parameters": [
                        {
                            "name": "response",
                            "value": "={{ $json.output }}"
                        }
                    ]
                }
            },
            "typeVersion": 4.2
        },
        {
            "id": "251b4251-255c-48c6-999b-02227fa2de9b",
            "name": "Sticky Note1",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                800,
                -620
            ],
            "parameters": {
                "width": 360,
                "height": 420,
                "content": "## AI Data Formatter\n"
            },
            "typeVersion": 1
        },
        {
            "id": "f62463cd-6be3-4942-a636-de980a3154b4",
            "name": "Sticky Note2",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                1560,
                -160
            ],
            "parameters": {
                "color": 4,
                "width": 520,
                "height": 720,
                "content": "## Vector Database Persistence\n"
            },
            "typeVersion": 1
        },
        {
            "id": "ad20cc91-766a-4a57-be54-6f0d09a784eb",
            "name": "Sticky Note3",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                1260,
                -920
            ],
            "parameters": {
                "color": 3,
                "width": 680,
                "height": 440,
                "content": "## Webhook Notification Handler\n"
            },
            "typeVersion": 1
        },
        {
            "id": "37ab5c0f-d36e-4131-844d-20a22d3f2861",
            "name": "Information Extractor with Data Formatter",
            "type": "@n8n\/n8n-nodes-langchain.informationExtractor",
            "position": [
                860,
                -60
            ],
            "parameters": {
                "text": "={{ $json.data }}",
                "options": {
                    "systemPromptTemplate": "You are an expert HTML extractor. Your job is to analyze the search result and extract the content as a collection on items"
                },
                "attributes": {
                    "attributes": [
                        {
                            "name": "search_result",
                            "description": "Search Response"
                        }
                    ]
                }
            },
            "typeVersion": 1
        },
        {
            "id": "e04e189a-8ba9-4ef4-9a49-fc13daf00828",
            "name": "Sticky Note4",
            "type": "n8n-nodes-base.stickyNote",
            "position": [
                800,
                -160
            ],
            "parameters": {
                "color": 5,
                "width": 720,
                "height": 720,
                "content": "## Data Extraction\/Formatting with the AI Agent\n"
            },
            "typeVersion": 1
        }
    ],
    "active": false,
    "pinData": [],
    "settings": {
        "executionOrder": "v1"
    },
    "versionId": "799fb406-600d-45a5-b926-24b8844f33a5",
    "connections": {
        "AI Agent": {
            "main": [
                [
                    {
                        "node": "Pinecone Vector Store",
                        "type": "main",
                        "index": 0
                    },
                    {
                        "node": "Webhook for structured AI agent response",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Make a web request": {
            "main": [
                [
                    {
                        "node": "Structured JSON Data Formatter",
                        "type": "main",
                        "index": 0
                    },
                    {
                        "node": "Information Extractor with Data Formatter",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Default Data Loader": {
            "ai_document": [
                [
                    {
                        "node": "Pinecone Vector Store",
                        "type": "ai_document",
                        "index": 0
                    }
                ]
            ]
        },
        "Pinecone Vector Store": {
            "ai_tool": [
                []
            ]
        },
        "Embeddings Google Gemini": {
            "ai_embedding": [
                [
                    {
                        "node": "Pinecone Vector Store",
                        "type": "ai_embedding",
                        "index": 0
                    }
                ]
            ]
        },
        "Google Gemini Chat Model": {
            "ai_languageModel": [
                [
                    {
                        "node": "Information Extractor with Data Formatter",
                        "type": "ai_languageModel",
                        "index": 0
                    }
                ]
            ]
        },
        "Structured Output Parser": {
            "ai_outputParser": [
                [
                    {
                        "node": "AI Agent",
                        "type": "ai_outputParser",
                        "index": 0
                    }
                ]
            ]
        },
        "Google Gemini Chat Model1": {
            "ai_languageModel": [
                [
                    {
                        "node": "AI Agent",
                        "type": "ai_languageModel",
                        "index": 0
                    }
                ]
            ]
        },
        "Google Gemini Chat Model2": {
            "ai_languageModel": [
                [
                    {
                        "node": "Structured JSON Data Formatter",
                        "type": "ai_languageModel",
                        "index": 0
                    }
                ]
            ]
        },
        "Structured JSON Data Formatter": {
            "main": [
                [
                    {
                        "node": "Webhook for structured data",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Set Fields - URL and Webhook URL": {
            "main": [
                [
                    {
                        "node": "Make a web request",
                        "type": "main",
                        "index": 0
                    },
                    {
                        "node": "Webhook for structured data",
                        "type": "main",
                        "index": 0
                    },
                    {
                        "node": "Webhook for structured AI agent response",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Recursive Character Text Splitter": {
            "ai_textSplitter": [
                [
                    {
                        "node": "Default Data Loader",
                        "type": "ai_textSplitter",
                        "index": 0
                    }
                ]
            ]
        },
        "When clicking ‘Test workflow’": {
            "main": [
                [
                    {
                        "node": "Set Fields - URL and Webhook URL",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        },
        "Information Extractor with Data Formatter": {
            "main": [
                [
                    {
                        "node": "AI Agent",
                        "type": "main",
                        "index": 0
                    }
                ]
            ]
        }
    }
}
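
The `connections` object above maps each source node to its targets, grouped by connection type (`main`, `ai_languageModel`, `ai_embedding`, and so on). As a rough illustration, the sketch below walks only the `main` links breadth-first from a starting node. It shows which nodes are reachable and in what breadth-first order, not n8n's exact scheduling. The function name `main_execution_order` is hypothetical.

```python
from collections import deque


def main_execution_order(connections: dict, start: str) -> list[str]:
    """Breadth-first walk over 'main' connections from the start node."""
    order: list[str] = []
    seen = {start}
    queue = deque([start])
    while queue:
        name = queue.popleft()
        order.append(name)
        # Each 'main' entry is a list of outputs; each output lists its links.
        for output in connections.get(name, {}).get("main", []):
            for link in output:
                target = link["node"]
                if target not in seen:
                    seen.add(target)
                    queue.append(target)
    return order


if __name__ == "__main__":
    # Reduced excerpt of the connections above (main links only).
    excerpt = {
        "Set Fields - URL and Webhook URL": {"main": [[{"node": "Make a web request"}]]},
        "Make a web request": {"main": [[
            {"node": "Structured JSON Data Formatter"},
            {"node": "Information Extractor with Data Formatter"},
        ]]},
        "Information Extractor with Data Formatter": {"main": [[{"node": "AI Agent"}]]},
        "AI Agent": {"main": [[{"node": "Pinecone Vector Store"}]]},
    }
    for step in main_execution_order(excerpt, "Set Fields - URL and Webhook URL"):
        print(step)
```

Sub-nodes such as the chat models and the embeddings node never appear in this walk because they attach through the `ai_*` connection types, not through `main`.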
                                
