Updated Contributing for Celery (#629)

This commit is contained in:
Yuhong Sun 2023-10-25 18:26:02 -07:00 committed by GitHub
parent fbb05e630d
commit 9a51745fc9
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23
5 changed files with 74 additions and 74 deletions


@@ -6,7 +6,7 @@ As an open source project in a rapidly changing space, we welcome all contributi
## 💃 Guidelines
### Contribution Opportunities
The [GitHub Issues](https://github.com/danswer-ai/danswer/issues) page is a great place to start for contribution ideas.
Issues that have been explicitly approved by the maintainers (aligned with the direction of the project)
will be marked with the `approved by maintainers` label.
@@ -19,7 +19,9 @@ If you have a new/different contribution in mind, we'd love to hear about it!
Your input is vital to making sure that Danswer moves in the right direction.
Before starting on implementation, please raise a GitHub issue.
And always feel free to message us (Chris Weaver / Yuhong Sun) on
[Slack](https://join.slack.com/t/danswer/shared_invite/zt-1u3h3ke3b-VGh1idW19R8oiNRiKBYv2w) /
[Discord](https://discord.gg/TDJ59cGV2X) directly about anything at all.
### Contributing Code
@@ -44,8 +46,8 @@ We would love to see you there!
## Get Started 🚀
Danswer, being a fully functional app, relies on some external pieces of software, specifically:
- [Postgres](https://www.postgresql.org/) (Relational DB)
- [Vespa](https://vespa.ai/) (Vector DB/Search Engine)
This guide provides instructions to set up the Danswer-specific services outside of Docker because it's easier for
@@ -54,11 +56,9 @@ development purposes but also feel free to just use the containers and update wi
### Local Set Up
It is recommended to use Python versions >= 3.11.
This guide skips setting up User Authentication for simplicity.
#### Installing Requirements
@@ -93,18 +93,11 @@ playwright install
#### Dependent Docker Containers
First navigate to `danswer/deployment/docker_compose`, then start up Vespa and Postgres with:
```bash
docker compose -f docker-compose.dev.yml -p danswer-stack up -d document_index relational_db
```
(document_index refers to Vespa and relational_db refers to Postgres)
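Before moving on, you can confirm that both containers are accepting connections. Below is a minimal sketch; the host ports are assumptions based on the defaults mapped in `docker-compose.dev.yml` (5432 for Postgres, 19071 for the Vespa config server):

```python
# Poll a TCP port until it accepts connections; returns False on timeout.
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 60.0) -> bool:
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=1):
                return True
        except OSError:
            time.sleep(1)
    return False


# Example (ports assumed from the compose file's defaults):
# wait_for_port("localhost", 5432)   # Postgres
# wait_for_port("localhost", 19071)  # Vespa config server
```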
#### Running Danswer
@@ -115,27 +108,33 @@ mkdir dynamic_config_storage
To start the frontend, navigate to `danswer/web` and run:
```bash
npm run dev
```
Package the Vespa schema. This will only need to be done when the Vespa schema is updated locally.
Navigate to `danswer/backend/danswer/datastores/vespa/app_config` and run:
```bash
zip -r ../vespa-app.zip .
```
- Note: If you don't have the `zip` utility, you will need to install it prior to running the above
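If `zip` isn't available, the same archive can be produced with Python's standard library. This is a sketch (`zip_dir` is a helper name made up for this example) that mirrors `zip -r ../vespa-app.zip .`:

```python
# Build vespa-app.zip without the `zip` CLI.
import os
import zipfile


def zip_dir(src_dir: str, out_path: str) -> None:
    with zipfile.ZipFile(out_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for root, _dirs, files in os.walk(src_dir):
            for name in files:
                full = os.path.join(root, name)
                # Store paths relative to the schema directory,
                # just like zipping from inside it
                zf.write(full, os.path.relpath(full, src_dir))


# Run from danswer/backend/danswer/datastores/vespa/app_config:
if os.path.basename(os.getcwd()) == "app_config":
    zip_dir(".", "../vespa-app.zip")
```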
The first time running Danswer, you will also need to run the DB migrations for Postgres.
After the first time, this is no longer required unless the DB models change.
Navigate to `danswer/backend` and with the venv active, run:
```bash
alembic upgrade head
```
Next, start the task queue, which orchestrates the background jobs.
Longer-running jobs are run asynchronously, outside of the API server.
Still in `danswer/backend`, run:
```bash
python ./scripts/dev_run_background_jobs.py
```
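Under the hood, `dev_run_background_jobs.py` starts the Celery worker, Celery beat, and (unless `--no-indexing` is passed) the indexing job, relaying each child process's output from a monitor thread. The core pattern, reduced to a runnable sketch (the label and demo child process are made up for illustration; the real script prints the lines rather than collecting them):

```python
# Minimal version of the monitor pattern: spawn a child process and
# relay its combined stdout/stderr, prefixed with a label.
import subprocess
import sys
import threading

captured: list[str] = []  # collected here for demonstration only


def monitor_process(name: str, process: subprocess.Popen) -> None:
    assert process.stdout is not None
    for line in process.stdout:
        captured.append(f"{name}: {line.rstrip()}")


proc = subprocess.Popen(
    [sys.executable, "-c", "print('hello from child')"],
    stdout=subprocess.PIPE,
    stderr=subprocess.STDOUT,
    text=True,
)
t = threading.Thread(target=monitor_process, args=("DEMO", proc))
t.start()
t.join()
proc.wait()
```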
To run the backend API server, navigate back to `danswer/backend` and run:
```bash
AUTH_TYPE=disabled \
@@ -153,33 +152,6 @@ powershell -Command "
"
```
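To sanity-check that the API server came up, you can poll it over HTTP. A sketch follows; port 8080 comes from the uvicorn command above, and `/docs` relies on FastAPI's default interactive-docs route, which is an assumption here:

```python
# Return True once an HTTP endpoint answers with a 2xx status.
import urllib.error
import urllib.request


def http_ok(url: str, timeout: float = 2.0) -> bool:
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        return False


# http_ok("http://localhost:8080/docs")
```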
Note: if you need finer logging, add the additional environment variable `LOG_LEVEL=DEBUG` to the relevant services.
### Formatting and Linting


@@ -1,4 +1,5 @@
# This file is purely for development use, not included in any builds
import argparse
import os
import subprocess
import threading
@@ -16,18 +17,20 @@ def monitor_process(process_name: str, process: subprocess.Popen) -> None:
            break


def run_jobs(exclude_indexing: bool) -> None:
    cmd_worker = [
        "celery",
        "-A",
        "danswer.background.celery",
        "worker",
        "--pool=threads",
        "--autoscale=3,10",
        "--loglevel=INFO",
    ]

    cmd_beat = ["celery", "-A", "danswer.background.celery", "beat", "--loglevel=INFO"]

    # Redirect stderr to stdout for both processes
    worker_process = subprocess.Popen(
        cmd_worker, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )
@@ -35,7 +38,6 @@ def run_celery() -> None:
        cmd_beat, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
    )

    worker_thread = threading.Thread(
        target=monitor_process, args=("WORKER", worker_process)
    )
@@ -44,10 +46,37 @@ def run_celery() -> None:
    worker_thread.start()
    beat_thread.start()

    if not exclude_indexing:
        update_env = os.environ.copy()
        update_env["PYTHONPATH"] = "."
        update_env["DYNAMIC_CONFIG_DIR_PATH"] = "./dynamic_config_storage"
        update_env["FILE_CONNECTOR_TMP_STORAGE_PATH"] = "./dynamic_config_storage"
        cmd_indexing = ["python", "danswer/background/update.py"]

        indexing_process = subprocess.Popen(
            cmd_indexing,
            env=update_env,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
        )
        indexing_thread = threading.Thread(
            target=monitor_process, args=("INDEXING", indexing_process)
        )
        indexing_thread.start()
        indexing_thread.join()

    worker_thread.join()
    beat_thread.join()


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run background jobs.")
    parser.add_argument(
        "--no-indexing", action="store_true", help="Do not run indexing process"
    )
    args = parser.parse_args()

    run_jobs(args.no_indexing)


@@ -11,7 +11,7 @@ services:
      uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
    depends_on:
      - relational_db
      - document_index
    restart: always
    ports:
      - "8080:8080"
@@ -23,7 +23,7 @@ services:
      - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
      - NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL=${NUM_DOCUMENT_TOKENS_FED_TO_GENERATIVE_MODEL:-}
      - POSTGRES_HOST=relational_db
      - VESPA_HOST=document_index
      - AUTH_TYPE=${AUTH_TYPE:-disabled}
      - QA_TIMEOUT=${QA_TIMEOUT:-}
      - VALID_EMAIL_DOMAINS=${VALID_EMAIL_DOMAINS:-}
@@ -60,7 +60,7 @@ services:
    command: /usr/bin/supervisord
    depends_on:
      - relational_db
      - document_index
    restart: always
    environment:
      - INTERNAL_MODEL_VERSION=${INTERNAL_MODEL_VERSION:-openai-chat-completion}
@@ -69,7 +69,7 @@ services:
      - GEN_AI_ENDPOINT=${GEN_AI_ENDPOINT:-}
      - GEN_AI_HOST_TYPE=${GEN_AI_HOST_TYPE:-}
      - POSTGRES_HOST=relational_db
      - VESPA_HOST=document_index
      - API_BASE_OPENAI=${API_BASE_OPENAI:-}
      - API_TYPE_OPENAI=${API_TYPE_OPENAI:-}
      - API_VERSION_OPENAI=${API_VERSION_OPENAI:-}
@@ -129,7 +129,7 @@ services:
      - "5432:5432"
    volumes:
      - db_volume:/var/lib/postgresql/data
  document_index:
    image: vespaengine/vespa:8
    restart: always
    ports:


@@ -11,14 +11,14 @@ services:
      uvicorn danswer.main:app --host 0.0.0.0 --port 8080"
    depends_on:
      - relational_db
      - document_index
    restart: always
    env_file:
      - .env
    environment:
      - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
      - POSTGRES_HOST=relational_db
      - VESPA_HOST=document_index
    volumes:
      - local_dynamic_storage:/home/storage
      - file_connector_tmp_storage:/home/file_connector_storage
@@ -33,14 +33,14 @@ services:
    command: /usr/bin/supervisord
    depends_on:
      - relational_db
      - document_index
    restart: always
    env_file:
      - .env
    environment:
      - AUTH_TYPE=${AUTH_TYPE:-google_oauth}
      - POSTGRES_HOST=relational_db
      - VESPA_HOST=document_index
    volumes:
      - local_dynamic_storage:/home/storage
      - file_connector_tmp_storage:/home/file_connector_storage
@@ -69,7 +69,7 @@ services:
      - .env
    volumes:
      - db_volume:/var/lib/postgresql/data
  document_index:
    image: vespaengine/vespa:8
    restart: always
    ports:


@@ -38,7 +38,6 @@ SESSION_EXPIRE_TIME_SECONDS=86400
# The following are for configuring User Authentication, supported flows are:
# disabled
# simple (email/password + user account creation in Danswer)
# google_oauth (login with google/gmail account)
# oidc (only in Danswer enterprise edition)
# saml (only in Danswer enterprise edition)