Which function can be used to read the entire contents of a text file line by line?
File Input/Output
File Input/Output (IO) requires 3 steps: open the file, read or write data, and close the file.
Python provides built-in functions and modules to support these operations.
Opening/Closing a File
Reading/Writing Text Files
Reading Line/Lines from a Text File
Writing Line to a Text File
Examples
    >>> f = open('test.txt', 'w')
    >>> f.write('apple\n')
    >>> f.write('orange\n')
    >>> f.write('pear\n')
    >>> f.close()
    >>> f = open('test.txt', 'r')
    >>> f.readline()       # read one line
    'apple\n'
    >>> f.readlines()      # read all remaining lines into a list
    ['orange\n', 'pear\n']
    >>> f.readline()       # returns an empty string at end of file
    ''
    >>> f.close()
    >>> f = open('test.txt', 'r')
    >>> f.read()           # read the entire file into one string
    'apple\norange\npear\n'
    >>> f.close()
    >>> f = open('test.txt')
    >>> line = f.readline()
    >>> while line:
    ...     line = line.rstrip()
    ...     print(line)
    ...     line = f.readline()
    ...
    apple
    orange
    pear
    >>> f.close()
Processing Text File Line-by-Line
We can use a with-statement to open a file, which is closed automatically upon exit:
    with open('path/to/file.txt', 'r') as f:
        for line in f:
            line = line.strip()
The with-statement is roughly equivalent to the following try-finally statement:
    f = open('path/to/file.txt')
    try:
        for line in f:
            line = line.strip()
    finally:
        f.close()
Example: Line-by-line File Copy
The following script copies a file into another line-by-line, prepending each line with the line number.
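The script itself is missing from this page, so here is a minimal sketch of what such a copy could look like. The function name copy_with_line_numbers, the file names in.txt and out.txt, and the "N: " prefix format are all assumptions made for illustration. It runs in a temporary directory so that no real file is touched.

```python
import os
import tempfile


def copy_with_line_numbers(src_path, dst_path):
    """Copy src_path to dst_path line by line, prefixing each line with its number.

    Hypothetical helper: the name and the 'N: ' prefix format are assumptions.
    """
    with open(src_path, 'r') as src, open(dst_path, 'w') as dst:
        # enumerate() supplies the running line number, starting from 1
        for line_number, line in enumerate(src, start=1):
            # each line keeps its own trailing newline, so none is added here
            dst.write('{}: {}'.format(line_number, line))


# Demonstrate inside a temporary directory (file names are placeholders)
with tempfile.TemporaryDirectory() as tmp_dir:
    in_path = os.path.join(tmp_dir, 'in.txt')
    out_path = os.path.join(tmp_dir, 'out.txt')
    with open(in_path, 'w') as f:
        f.write('apple\norange\npear\n')
    copy_with_line_numbers(in_path, out_path)
    with open(out_path) as f:
        copied = f.read()

print(copied, end='')
```

Opening both files in a single with-statement closes both of them even if the copy fails part-way.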
Binary File Operations
[TODO] Intro
For example [TODO]
Directory and File Management
In Python, directory and file management are supported by modules such as os, os.path, shutil and glob.
Path Operations Using Module os.path
In Python, a path could refer to a file, a directory, or a symlink.
A path could be absolute (beginning with root) or relative to the current working directory (CWD). The path separator is platform-dependent (Windows uses '\', while Unixes/Mac OS use '/').
Checking Path Existence and Type
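For example,
    >>> import os
    >>> os.path.exists('/usr/bin')   # does the path exist?
    True
    >>> os.path.isfile('/usr/bin')   # is it a regular file?
    False
    >>> os.path.isdir('/usr/bin')    # is it a directory?
    True
Forming a New Path
The path separator is platform-dependent (Windows uses '\', while Unixes/Mac OS use '/'). Use os.path.join() to form a path portably instead of hard-coding the separator.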
For example,
    >>> import os
    >>> print(os.path.sep)
    /
    >>> print(os.path.join(os.path.sep, 'etc', 'apache2', 'httpd.conf'))
    /etc/apache2/httpd.conf
    >>> print(os.path.join('..', 'apache2', 'httpd.conf'))
    ../apache2/httpd.conf
Manipulating Directory-name and Filename
For example, to form an absolute path of a file called out.txt in the same directory as in.txt:
    os.path.join(os.path.dirname(os.path.abspath('in.txt')), 'out.txt')   # absolute path
    os.path.join(os.path.dirname('in.txt'), 'out.txt')                    # relative path
For example,
    import os
    print('__file__:', __file__)
    print('dirname():', os.path.dirname(__file__))
    print('abspath():', os.path.abspath(__file__))
    print('dirname(abspath()):', os.path.dirname(os.path.abspath(__file__)))
When a module is loaded in Python, __file__ is set to the path of the module's file. Try running the script with various forms of the path and compare the output:
    $ python3 ./test_ospath.py
    $ python3 test_ospath.py
    $ python3 ../parent_dir/test_ospath.py
    $ python3 /path/to/test_ospath.py
Handling Symlink (Unixes/Mac OS)
For example,
    import os
    print('__file__:', __file__)
    print('abspath():', os.path.abspath(__file__))
    print('realpath():', os.path.realpath(__file__))

    $ python3 test_realpath.py
    # Same output for abspath() and realpath() because there is no symlink
    $ ln -s test_realpath.py test_realpath_link.py
    $ python3 test_realpath_link.py
    #abspath(): /path/to/test_realpath_link.py
    #realpath(): /path/to/test_realpath.py (symlink resolved)
Directory & File Management Using Modules os and shutil
The modules os and shutil support directory and file management.
Directory Management
File Management
For example,
    >>> import os
    >>> dir(os)
    ......
    >>> help(os)
    ......
    >>> help(os.getcwd)
    ......
    >>> os.getcwd()
    ... current working directory ...
    >>> os.listdir()
    ... contents of current directory ...
    >>> os.chdir('test-python')
    >>> exec(open('hello.py').read())
    >>> os.system('ls -l')
    >>> os.name
    'posix'
    >>> os.mkdir('sub_dir')                # create a single directory
    >>> os.makedirs('/path/to/sub_dir')    # create intermediate directories as needed
    >>> os.remove('filename')
    >>> os.rename('oldFile', 'newFile')
List a Directory
For example,
    >>> import os
    >>> help(os.listdir)
    ......
    >>> os.listdir()
    [..., ..., ...]
    >>> for f in sorted(os.listdir('/usr')): print(f)
    ......
    >>> # listdir() returns bare names, so join them with the directory to get full paths
    >>> for f in sorted(os.listdir('/usr')): print(os.path.join('/usr', f))
    ......
List a Directory Recursively via os.walk()
For example,
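The example for this section did not survive on the page. As a minimal sketch, os.walk() yields a (dirpath, dirnames, filenames) tuple for every directory under the starting point. The helper name list_files_recursively and the demo file names are assumptions made for illustration.

```python
import os
import tempfile


def list_files_recursively(top):
    """Return the sorted paths of all files under top (hypothetical helper)."""
    found = []
    # os.walk() visits top and every sub-directory below it
    for dirpath, dirnames, filenames in os.walk(top):
        for name in filenames:
            found.append(os.path.join(dirpath, name))
    return sorted(found)


# Build a small throwaway tree to walk (names are placeholders)
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'sub_dir'))
    for rel in ('a.txt', os.path.join('sub_dir', 'b.txt')):
        with open(os.path.join(top, rel), 'w') as f:
            f.write('demo\n')
    # Report paths relative to the top directory for readability
    relative_paths = [os.path.relpath(p, top) for p in list_files_recursively(top)]

print(relative_paths)
```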
List a Directory Recursively via Module glob (Python 3.5)
[TODO] Intro
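As a hedged sketch of what this section presumably covers: since Python 3.5, glob.glob() accepts recursive=True, in which case the pattern '**' matches zero or more directory levels. The demo tree and file names below are assumptions made for illustration.

```python
import glob
import os
import tempfile

# Build a small throwaway tree (names are placeholders)
with tempfile.TemporaryDirectory() as top:
    os.makedirs(os.path.join(top, 'sub_dir'))
    for rel in ('a.txt', 'b.py', os.path.join('sub_dir', 'c.txt')):
        with open(os.path.join(top, rel), 'w') as f:
            f.write('demo\n')

    # '**' matches zero or more directories only when recursive=True is given
    pattern = os.path.join(top, '**', '*.txt')
    matches = sorted(glob.glob(pattern, recursive=True))
    txt_files = [os.path.relpath(p, top) for p in matches]

print(txt_files)
```

Without recursive=True, '**' behaves like a plain '*' and matches exactly one directory level.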
Copying File
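This heading has no body on the page. One common approach, shown here as a sketch rather than as the original author's text, is shutil.copy(), which copies a file's contents and permission bits and returns the destination path. The file names are placeholders.

```python
import os
import shutil
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    src = os.path.join(tmp_dir, 'in.txt')    # placeholder source name
    dst = os.path.join(tmp_dir, 'out.txt')   # placeholder destination name
    with open(src, 'w') as f:
        f.write('apple\n')

    # Copy the file; the return value is the path of the new copy
    returned_path = shutil.copy(src, dst)

    with open(dst) as f:
        copied_text = f.read()

print(copied_text, end='')
```

shutil.copy2() works the same way but also attempts to preserve file metadata such as timestamps.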
Shell Command [TODO]
Environment Variables [TODO]
fileinput Module
The fileinput module processes lines of input from the files named on the command line, or from standard input if none are given.
    import fileinput

    def main():
        lineNumber = 0
        for line in fileinput.input():
            line = line.rstrip()
            lineNumber += 1
            print('{}: {}'.format(lineNumber, line))

    if __name__ == '__main__':
        main()
Text Processing
For simple text string operations such as string search and replacement, you can use the built-in string functions (e.g., str.replace()).
String Operations
The built-in class str provides member functions for text string manipulation.
Strip whitespaces (blank, tab and newline)
Uppercase/Lowercase
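The examples for stripping and case conversion are missing from the page. A minimal sketch using the str methods strip(), lstrip(), rstrip(), upper() and lower() follows; the sample string is an assumption.

```python
s = '  Hello, World\t\n'       # sample string with leading and trailing whitespace

stripped = s.strip()           # remove whitespace from both ends
left_stripped = s.lstrip()     # remove leading whitespace only
right_stripped = s.rstrip()    # remove trailing whitespace only

upper = stripped.upper()       # convert to uppercase
lower = stripped.lower()       # convert to lowercase

print(repr(stripped), repr(upper), repr(lower))
```

Strings are immutable, so each method returns a new string and leaves s unchanged.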
Find
For example,
    >>> s = '/test/in.txt'
    >>> s.find('in')
    6
    >>> s[0 : s.find('in')] + 'out.txt'
    '/test/out.txt'
Find and Replace
For example,
    >>> s = 'hello hello hello, world'
    >>> help(s.replace)
    >>> s.replace('ll', '**')
    'he**o he**o he**o, world'
    >>> s.replace('ll', '**', 2)   # replace the first 2 occurrences only
    'he**o he**o hello, world'
Split into Tokens and Join
For example,
    >>> 'apple, orange, pear'.split()       # default delimiter is whitespace
    ['apple,', 'orange,', 'pear']
    >>> 'apple, orange, pear'.split(', ')
    ['apple', 'orange', 'pear']
    >>> 'apple, orange, pear'.split(', ', maxsplit=1)
    ['apple', 'orange, pear']
    >>> ', '.join(['apple', 'orange, pear'])
    'apple, orange, pear'
Regular Expression in Module re
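I assume that you are familiar with regex.
The re module provides regex support.
    >>> import re
    >>> dir(re)
    ......
    >>> help(re)
    ......
Backslash (\), Python Raw String r'...' vs Regular String
Regex's syntax uses backslash (\) for metacharacters such as \d (digit) and \w (word character), and to escape special regex characters, e.g., \. for a literal dot.
On the other hand, Python's regular strings also use backslash for escape sequences, e.g., \n for newline and \t for tab. To write the regex pattern \d+ in a regular string, you need to write '\\d+', which is cumbersome and error-prone.
Python's solution is using a raw string with a prefix r, in the form of r'...', which leaves backslashes uninterpreted, so you can write r'\d+'. Furthermore, Python denotes parenthesized back references (or capturing groups) as \1, \2, ..., which can be written as raw strings r'\1', r'\2'.
I suggest that you use raw strings for regex pattern strings and replacement strings.
Compiling (Creating) a Regex Pattern Object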
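Before compiling patterns, it may help to confirm that a raw string and a doubly-escaped regular string denote the same pattern text. This is a small sketch; the variable names and sample inputs are assumptions.

```python
import re

regular = '\\d+'    # regular string: the backslash must be doubled
raw = r'\d+'        # raw string: the backslash is kept as written

# Both spellings produce the same two-plus-one characters: \ d +
same_pattern = (regular == raw)

# A raw '\n' is two characters (backslash, n); a regular '\n' is one newline
raw_length = len(r'\n')
regular_length = len('\n')

# Either spelling compiles to an equivalent pattern object
digits = re.compile(raw).findall('abc123xyz456')

# Raw strings also keep back-references readable in the replacement string
swapped = re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb')

print(same_pattern, raw_length, regular_length, digits, swapped)
```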
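For example,
    >>> import re
    >>> p1 = re.compile(r'[1-9][0-9]*|0')
    >>> type(p1)
Invoking Regex Operations
You can invoke most of the regex functions in two ways: through a method of a compiled pattern object (e.g., p1.findall(str)), or through the module-level function that takes the pattern as its first argument (e.g., re.findall(pattern, str)).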
Find using findall()
For example,
    >>> p1 = re.compile(r'[1-9][0-9]*|0')
    >>> p1.findall('123 456')
    ['123', '456']
    >>> p1.findall('abc')
    []
    >>> p1.findall('abc123xyz456_7_00')
    ['123', '456', '7', '0', '0']
    >>> re.findall(r'[1-9][0-9]*|0', '123 456')
    ['123', '456']
    >>> re.findall(r'[1-9][0-9]*|0', 'abc')
    []
    >>> re.findall(r'[1-9][0-9]*|0', 'abc123xyz456_7_00')
    ['123', '456', '7', '0', '0']
Replace using sub() and subn()
For example,
    >>> p1 = re.compile(r'[1-9][0-9]*|0')
    >>> p1.sub(r'**', 'abc123xyz456_7_00')
    'abc**xyz**_**_****'
    >>> p1.subn(r'**', 'abc123xyz456_7_00')   # also returns the number of substitutions
    ('abc**xyz**_**_****', 5)
    >>> p1.sub(r'**', 'abc123xyz456_7_00', count=3)
    'abc**xyz**_**_00'
    >>> re.sub(r'[1-9][0-9]*|0', r'**', 'abc123xyz456_7_00')
    'abc**xyz**_**_****'
    >>> re.sub(p1, r'**', 'abc123xyz456_7_00')
    'abc**xyz**_**_****'
    >>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=3)
    ('abc**xyz**_**_00', 3)
    >>> re.subn(p1, r'**', 'abc123xyz456_7_00', count=10)
    ('abc**xyz**_**_****', 5)
Notes: For simple string replacement, use str.replace().
Using Parenthesized Back-References \1, \2, ... in Substitution and Pattern
In Python, regex parenthesized back-references (capturing groups) are denoted as \1, \2, and so on.
For example,
    >>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc')
    'bbb aaa ccc'
    >>> re.sub(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd')
    'bbb aaa ddd ccc'
    >>> re.subn(r'(\w+) (\w+)', r'\2 \1', 'aaa bbb ccc ddd eee')
    ('bbb aaa ddd ccc eee', 2)
    >>> re.subn(r'(\w+) \1', r'\1', 'hello hello world again again')
    ('hello world again', 2)
Find using search() and Match Object
The search() function returns a Match object for the first match found anywhere in the string, or None if there is no match.
For example,
    >>> p1 = re.compile(r'[1-9][0-9]*|0')
    >>> inStr = 'abc123xyz456_7_00'
    >>> m = p1.search(inStr)
    >>> m
    <_sre.SRE_Match object; span=(3, 6), match='123'>
    >>> m.group()
    '123'
    >>> m.span()
    (3, 6)
    >>> m.start()
    3
    >>> m.end()
    6
    >>> m = p1.search(inStr, m.end())   # continue searching after the previous match
    >>> m
    <_sre.SRE_Match object; span=(9, 12), match='456'>
    >>> m = p1.search(inStr)
    >>> while m:
    ...     print(m, m.group())
    ...     m = p1.search(inStr, m.end())
    ...
    <_sre.SRE_Match object; span=(3, 6), match='123'> 123
    <_sre.SRE_Match object; span=(9, 12), match='456'> 456
    <_sre.SRE_Match object; span=(13, 14), match='7'> 7
    <_sre.SRE_Match object; span=(15, 16), match='0'> 0
    <_sre.SRE_Match object; span=(16, 17), match='0'> 0
To retrieve the back-references (or capturing groups) inside the Match object:
    >>> p2 = re.compile(r'(A)(\w+)', re.IGNORECASE)
    >>> inStr = 'This is an apple.'
    >>> m = p2.search(inStr)
    >>> while m:
    ...     print(m)
    ...     print(m.group())     # the whole match
    ...     print(m.groups())    # all capturing groups as a tuple
    ...     for idx in range(1, m.lastindex + 1):
    ...         print(m.group(idx), end=',')
    ...     print()
    ...     m = p2.search(inStr, m.end())
    ...
    <_sre.SRE_Match object; span=(8, 10), match='an'>
    an
    ('a', 'n')
    a,n,
    <_sre.SRE_Match object; span=(11, 16), match='apple'>
    apple
    ('a', 'pple')
    a,pple,
Find using match() and fullmatch()
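The match() function matches only at the beginning of the string, and fullmatch() matches only if the entire string matches the pattern. Both return None otherwise.
For example,
    >>> p1 = re.compile(r'[1-9][0-9]*|0')
    >>> m = p1.match('aaa123zzz456')
    >>> m                        # None: the string does not begin with a match
    >>> m = p1.match('123zzz456')
    >>> m
    <_sre.SRE_Match object; span=(0, 3), match='123'>
    >>> m = p1.fullmatch('123456')
    >>> m
    <_sre.SRE_Match object; span=(0, 6), match='123456'>
    >>> m = p1.fullmatch('123456abc')
    >>> m                        # None: the pattern does not cover the whole string
Find using finditer()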
The finditer() function returns an iterator of Match objects, whereas findall() returns a list of matched strings.
    >>> p1 = re.compile(r'[1-9][0-9]*|0')
    >>> inStr = 'abc123xyz456_7_00'
    >>> p1.findall(inStr)
    ['123', '456', '7', '0', '0']
    >>> for s in p1.findall(inStr): print(s, end=' ')
    123 456 7 0 0
    >>> for m in p1.finditer(inStr): print(m)
    <_sre.SRE_Match object; span=(3, 6), match='123'>
    <_sre.SRE_Match object; span=(9, 12), match='456'>
    <_sre.SRE_Match object; span=(13, 14), match='7'>
    <_sre.SRE_Match object; span=(15, 16), match='0'>
    <_sre.SRE_Match object; span=(16, 17), match='0'>
    >>> for m in p1.finditer(inStr): print(m.group(), end=' ')
    123 456 7 0 0
Splitting String into Tokens
The split() function splits the string at each match of the pattern and returns the pieces as a list.
    >>> p1 = re.compile(r'[1-9][0-9]*|0')
    >>> p1.split('aaa123bbb456ccc')
    ['aaa', 'bbb', 'ccc']
    >>> re.split(r'[1-9][0-9]*|0', 'aaa123bbb456ccc')
    ['aaa', 'bbb', 'ccc']
Notes: For a simple delimiter, use str.split().
Web Scraping
Web Scraping (or web harvesting or web data extraction) refers to reading the raw HTML page to retrieve desired data. Needless to say, you need to master HTML, CSS and JavaScript. Python supports web scraping via packages requests and BeautifulSoup (bs4).
Install Packages
You could install the relevant packages using pip:
    $ pip install requests
    $ pip install bs4
Step 0: Inspect the Target Webpage
Step 1: Send an HTTP GET request to the target URL to retrieve the raw HTML page using module requests
    >>> import requests
    >>> url = "http://your_target_webpage"
    >>> response = requests.get(url)
    >>> type(response)
Step 2: Parse the HTML Text into a Tree-Structure using BeautifulSoup and Search the Desired Data
    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup(response.text, "html.parser")
    >>> type(soup)
You could write out the selected data to a file:
    with open(filename, 'w') as fp:
        for row in rows:
            fp.write(row + '\n')
You could also use the csv module:
    >>> import csv
    >>> with open(filename, 'w', newline='') as fp:
    ...     writer = csv.DictWriter(fp, ['colHeader1', 'colHeader2', 'colHeader3'])
    ...     writer.writeheader()
    ...     for row in rows:
    ...         writer.writerow(row)
Step 3: Download Selected Document Using urllib.request
You may want to download documents such as text files or images.
    >>> import urllib.request
    >>> downloadUrl = '.....'
    >>> file = '......'
    >>> urllib.request.urlretrieve(downloadUrl, file)
Step 4: Delay
To avoid flooding a website with download requests (and being flagged as a spammer), pause your code between requests.
    >>> import time
    >>> time.sleep(1)
Which function can be used to read an entire line of text data?
The readline() function reads one line of the file and returns it as a string. It takes an optional parameter n, which specifies the maximum number of characters (bytes, in binary mode) to read.
Which method is used to read the contents of a file line by line?
In Java, the readLine() method of java.io.BufferedReader reads a file line by line into a String. In Python, iterate over the file object or call readline() repeatedly.
Which function will read entire content of file?
The read() function, called with no argument, returns the entire contents of the file as a single string, so one call is enough.
Which function can be used for reading entire content of a file into a string while using a file object for reading from a file?
read(): reads the entire file and returns it as a string. readline(): reads the next line from the file and returns it as a string.
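To tie the answers above together, here is a minimal Python sketch contrasting read(), readline(), readlines() and iteration over the file object. The file name test.txt and its contents are assumptions, and the file is created in a temporary directory.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'test.txt')   # placeholder file name
    with open(path, 'w') as f:
        f.write('apple\norange\npear\n')

    # read(): the whole file as one string
    with open(path) as f:
        whole = f.read()

    # readline(): one line per call; readlines(): all remaining lines as a list
    with open(path) as f:
        first = f.readline()
        rest = f.readlines()

    # Iterating over the file object reads it line by line
    with open(path) as f:
        lines = [line.rstrip('\n') for line in f]

print(repr(whole), repr(first), rest, lines)
```

Iterating over the file object reads one line at a time, so it suits large files better than read() or readlines(), which load everything into memory.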