79752059

Date: 2025-08-31 22:56:41
Score: 1.5
Natty:
Report link

I think the speed problem is due to the difference between logical locations and physical locations. When you logically mount your Google Drive in Google Colab, the physical location of the files is very much not on Google Colab.

I tried to find some code I wrote to deal with this, but I couldn't find it.

Caveat: I dealt with the problems described below approximately 12 months ago, so there is a small chance that some things have changed.

My perspective: I'm not a programmer, but I can code in Python. I was a sys/net-admin, teacher, MCSE, "webmaster"--prior to 2005.

My "proof" of a physical issue

Because I cannot cite documentation for my claim, I will describe my problem and solution as proof. If you believe my claim, you can probably skim or skip this section.

My problem: I had up to 80 GB of (WAV) files that were physically stored in up to six different Google Drive accounts. With my sym-fu skills, I could effectively mount all six Google Drives at the same time in one Colab session. Obviously, Colab did not transfer all of that data to the physical server on which my Colab session was running.

Let's say I had a Python command to concatenate 30 files into one new file: newFile.wav = concat(listPathFilenames). Those 30 files were physically spread across six different Google Drives. The Python interpreter would request the files from the OS (Colab), and the OS would use filesystem-level operations to move the physical files to the Colab server. Just waiting for 600 MB of files to transfer could take 30 seconds, even though the concatenation itself would only take 2-5 seconds. (I wasn't really concatenating, you know?)

So, at least for a little while, my solution was to "operate" on the files before I needed to operate on the files. My flow allowed me to easily predict which files I would soon need, so I had logic that would do something like

import pathlib

for pathFilename in listPathFilenames:
    pathlib.Path(pathFilename).stat()

I had to try out a few different functions to find the right one. I didn't want to modify the files, and some functions wouldn't force the physical transfer of the file: I think .exists(), for example, didn't work. The net effect was that the files would already be physically on the Colab server, so when I did the real operations on them, there was no delay while the files were retrieved from Google Drive.
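
A minimal sketch of how I would check whether that pre-warming pays off; the Drive folder path is hypothetical, and the timing calls are only there to make the difference visible.

import pathlib
import time

listPathFilenames = sorted(pathlib.Path('/content/drive/MyDrive/audio').glob('*.wav'))  # hypothetical folder

timeStart = time.perf_counter()
for pathFilename in listPathFilenames:
    pathFilename.stat()  # pre-warm: coax the mount into fetching the physical bytes
print(f"pre-warm took {time.perf_counter() - timeStart:.1f} s")

timeStart = time.perf_counter()
totalBytes = sum(len(pathFilename.read_bytes()) for pathFilename in listPathFilenames)
print(f"reading {totalBytes} bytes took {time.perf_counter() - timeStart:.1f} s")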

My solution is not your solution

First, I don't have enough knowledge of pip to understand the answer from https://stackoverflow.com/users/14208544/hor-golzari, so I would still incorporate his guidance. (Well, I mean, since you seem to understand it, you should use his knowledge.)

From what I can tell, Colab uses multiple excellent tactics to speed up on-the-fly environment creation (I can't recall the specifics off the top of my head).

In contrast, the filesystem-level transfers to and from Google Drive are absolutely not prioritized. One way I know that for sure: if you "write" a (large) file to Google Drive and the Colab environment says, "the file has been written," then even a catastrophic failure in your Colab session will not prevent the file from reaching Google Drive. How? It's buffered. It's not fast--some files take 15 minutes before I can see them on Google Drive--but it is reliable.
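
If you ever need to be sure those buffered writes have reached Google Drive before the runtime disappears, something like the following may help; this is a minimal sketch and assumes drive.flush_and_unmount() is still part of the google.colab API.

from google.colab import drive

drive.mount('/content/drive')
# ... write your large files somewhere under /content/drive/ ...
drive.flush_and_unmount()  # flush buffered writes to Google Drive, then unmount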

Therefore, I suspect Google Drive won't accomplish what you want, simply because Colab has decided not to prioritize the speed of the physical connection to Google Drive, and that connection is too slow to be useful here.

Appendix: deep thoughts

"I'm trying to optimize my Google Colab workflow"

I don't know what needs optimizing, but some things I've done (that I can recall off the top of my head):

  1. Prune the list of installed packages.
  2. Conditionally install packages.
  3. Delay installing some packages until you need them: especially if you might not need them (see the sketch after this list).
  4. Pin packages so pip doesn't need to think.
  5. Don't try to install packages that Google has already pre-installed.
  6. If Google has a pre-installed package that does the same thing as your not-preinstalled package: refactor.
  7. My brain just told me, "I quit," but I know I had a couple of other tricks.
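
As a hedged sketch of items 2 and 3 (the helper and the package name are my own examples, not anything Colab provides):

import importlib.util
import subprocess
import sys

def installIfMissing(packageName: str, pipSpecifier: str | None = None) -> None:
    """Install packageName with pip only if it cannot already be imported."""
    if importlib.util.find_spec(packageName) is None:
        subprocess.run([sys.executable, '-m', 'pip', 'install', '-q', pipSpecifier or packageName], check=True)

# Later, right before the code that actually needs it:
installIfMissing('soundfile')  # hypothetical example package
import soundfile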

Appendix: old code

The following used to be my template for quickly installing stuff. I still used "requirements.txt" files at the time. I've switched to pyproject.toml, and I guess I would probably use something like pip install {repoTarget}@git+https://{accessTokenGithub}@github.com/{repoOwner}/{repoTarget}.git. idk.

import sys
import subprocess
import pathlib

listPackages = ['Z0Z_tools']

def cloneRepo(repoTarget: str, repoOwner: str = 'hunterhogan') -> None:
    """Clone a GitHub repo into the Colab filesystem and queue its requirements for installation."""
    if not pathlib.Path(repoTarget).exists():
        # The GitHub personal access token is stored in Colab's Secrets.
        accessTokenGithub = userdata.get('github_token')
        subprocess.run(["git", "clone", f"https://{accessTokenGithub}@github.com/{repoOwner}/{repoTarget}.git"], check=True)
        pathFilenameRequirements = pathlib.Path(repoTarget) / 'requirements.txt'
        if pathFilenameRequirements.exists():
            listPackages.append(f"-r {pathFilenameRequirements}")
    # Make the cloned repo importable without installing it.
    sys.path.append(repoTarget)

if 'google.colab' in sys.modules:
    from google.colab import drive, userdata
    drive.mount('/content/drive')
    cloneRepo('stubFileNotFound')
    cloneRepo('astToolFactory')
    # One pip invocation installs everything queued above.
    %pip install -q {' '.join(listPackages)}
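
A hedged sketch of that pyproject.toml-era alternative (the repo name, owner, and the 'github_token' secret name are examples carried over from above): install straight from the private GitHub repo instead of cloning it first.

from google.colab import userdata

accessTokenGithub = userdata.get('github_token')
repoOwner = 'hunterhogan'
repoTarget = 'Z0Z_tools'
%pip install -q {repoTarget}@git+https://{accessTokenGithub}@github.com/{repoOwner}/{repoTarget}.git
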
Reasons:
  • Blacklisted phrase (0.5): I need
  • Blacklisted phrase (1): stackoverflow
  • Blacklisted phrase (0.5): I cannot
  • Long answer (-1):
  • Has code block (-0.5):
  • Contains question mark (0.5):
  • Low reputation (0.5):
Posted by: hunterhogan